In [1]:
import pandas as pd
import numpy as np

np.random.seed(1349)

In [2]:
df = pd.read_csv('students.csv', index_col=0)

## Reshaping

We will talk about reshaping operations in more detail when we discuss tidy data, but for now we will focus on a couple of common operations that can be used to summarize our data by different subgroups.

### `pd.crosstab`

For an example of `pd.crosstab`, we will count the number of students passing math in each classroom.

In [3]:
df.head()

     name  math  english  reading classroom
0  Martyn    78       69       84         B
1  Myatta    70       71       88         B
2    Alyx    76       62       71         B
3   Deryk    98       92       72         B
4   Jaryd    94       79       82         B


In [4]:
df['passing_math'] = np.where(df.math >=70, 'passing', 'failing')

In [5]:
df.passing_math

0     passing
1     passing
2     passing
3     passing
4     passing
5     passing
6     passing
7     passing
8     failing
9     passing
10    passing
11    passing
Name: passing_math, dtype: object

In [6]:
df.classroom

0     B
1     B
2     B
3     B
4     B
5     B
6     A
7     B
8     A
9     A
10    B
11    B
Name: classroom, dtype: object

In [7]:
df.classroom.value_counts()

B    9
A    3
Name: classroom, dtype: int64

In [8]:
df.passing_math.value_counts()

passing    11
failing     1
Name: passing_math, dtype: int64

In [9]:
# This will not work: df.crosstab(df.passing_math, df.classroom)
# crosstab is a top-level pandas function (pd.crosstab), not a DataFrame method.

In [10]:
# We will use our student grades DataFrame, df.
pd.crosstab(df.classroom, df.passing_math)


passing_math  failing  passing
classroom
A                   1        2
B                   0        9


The `pd.crosstab` call above counts the number of occurrences of each subgroup (i.e. each unique combination of classroom and whether or not the student is passing math).
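To make the counting concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data is an assumption for illustration, not the `students.csv` file). It suggests that `pd.crosstab` gives the same counts as grouping by both columns, taking the group sizes, and unstacking.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
toy = pd.DataFrame({
    'classroom': ['A', 'A', 'A', 'B', 'B'],
    'passing_math': ['failing', 'passing', 'passing', 'passing', 'passing'],
})

# Count the rows for each (classroom, passing_math) combination.
counts = pd.crosstab(toy.classroom, toy.passing_math)

# Build the same table by hand: group on both columns, count the rows in
# each group, then move the inner index level into the columns.
by_hand = (
    toy.groupby(['classroom', 'passing_math'])
    .size()
    .unstack(fill_value=0)  # combinations that never occur become 0
)

print(counts)
print(by_hand)
```

Combinations that never occur (here, a failing student in classroom B) show up as 0 in both tables.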

We can also view row and column totals by setting `margins` to `True`.

In [11]:
pd.crosstab(df.classroom, df.passing_math, margins=True)


passing_math  failing  passing  All
classroom
A                   1        2    3
B                   0        9    9
All                 1       11   12


The `pd.crosstab` function will also let us view the counts as proportions of the total by setting `normalize` to `True`.

In [12]:
pd.crosstab(df.classroom, df.passing_math, margins=True, normalize=True)

passing_math   failing   passing   All
classroom
A             0.083333  0.166667  0.25
B                  0.0      0.75  0.75
All           0.083333  0.916667   1.0
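Besides `True`, the `normalize` argument also accepts `'index'` and `'columns'`, which divide by row totals or column totals instead of the grand total. The sketch below uses a small made-up DataFrame (`toy` is an assumption for illustration) to compare the three options.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
toy = pd.DataFrame({
    'classroom': ['A', 'A', 'B', 'B'],
    'passing_math': ['failing', 'passing', 'passing', 'passing'],
})

# normalize='index': each row sums to 1 (proportions within a classroom).
by_row = pd.crosstab(toy.classroom, toy.passing_math, normalize='index')

# normalize='columns': each column sums to 1 (proportions within a status).
by_col = pd.crosstab(toy.classroom, toy.passing_math, normalize='columns')

# normalize=True: the whole table sums to 1 (proportions of all rows).
overall = pd.crosstab(toy.classroom, toy.passing_math, normalize=True)

print(by_row)
print(by_col)
print(overall)
```

Which option is appropriate depends on the question: "what share of classroom A is passing?" calls for `'index'`, while "what share of all students is passing in classroom A?" calls for `True`.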


In [13]:
# crosstab counts the rows that fall at each intersection of two
# categorical variables in a DataFrame

### `.pivot_table`

Here we use the `.pivot_table` method to create our summary. This method produces output similar to an Excel pivot table. We supply four things here:

- which values will make up the rows (the `index`)
- which values will make up the columns (the `columns`)
- the values we are aggregating (the `values`)
- an aggregation method (`aggfunc`); we can omit this, and `mean` will be used by default

For an example using the `pivot_table` method, we'll calculate the highest math grade for each combination of `classroom` and `passing_math` status.
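As a quick illustration of the `aggfunc` default, the sketch below builds a small made-up DataFrame (`toy` and its grades are assumptions, not the student data) and compares omitting `aggfunc` with passing `'mean'` and `'max'` explicitly.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
toy = pd.DataFrame({
    'classroom': ['A', 'A', 'B', 'B'],
    'passing_math': ['failing', 'passing', 'passing', 'passing'],
    'math': [60, 90, 80, 100],
})

# No aggfunc given: pivot_table falls back to the mean.
default_agg = toy.pivot_table(index='classroom', columns='passing_math', values='math')

# Spelling out the default gives the same numbers.
explicit_mean = toy.pivot_table(
    index='classroom', columns='passing_math', values='math', aggfunc='mean'
)

# A different aggregation: the highest grade in each cell.
highest = toy.pivot_table(
    index='classroom', columns='passing_math', values='math', aggfunc='max'
)

print(default_agg)
print(highest)
```

Cells with no matching rows (classroom B has no failing students here) come back as `NaN` rather than 0.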

In [14]:
pd.DataFrame.pivot_table?

In [15]:
df.pivot_table(index='classroom', columns='passing_math', values='math', aggfunc='max')

passing_math  failing  passing
classroom
A                62.0     97.0
B                 NaN     98.0


In [16]:
df.groupby(['passing_math', 'classroom']).math.max()

passing_math  classroom
failing       A            62
passing       A            97
              B            98
Name: math, dtype: int64

Here we'll create a dataframe that represents various orders at a restaurant.

In [17]:
n = 40

orders = pd.DataFrame({
    'drink': np.random.choice(['Tea', 'Water', 'Water'], n),
    'meal': np.random.choice(['Curry', 'Yakisoba Noodle', 'Pad Thai'], n),
})
# .sample(10) returns 10 randomly selected rows
orders.sample(10)

    drink             meal
26  Water  Yakisoba Noodle
19  Water            Curry
17  Water            Curry
27  Water            Curry
7   Water         Pad Thai
2   Water         Pad Thai
29  Water         Pad Thai
33  Water  Yakisoba Noodle
37    Tea  Yakisoba Noodle
14    Tea  Yakisoba Noodle


#### `.map`

The `.map` method lets us use a dictionary to look up the price of each drink and each meal. Adding the two mapped Series gives the total price for an order, which we can save to a new column named `bill`. Let's do this step-by-step.

In [18]:
# Create a dictionary of prices for drinks and meals.

prices = {
    'Yakisoba Noodle': 9,
    'Curry': 11,
    'Pad Thai': 10,
    'Tea': 2,
    'Water': 0,
}

In [19]:
"""
Match the values in the 'drink' and 'meal' columns with the values in the 'prices' dictionary 
and perform the specified calculation. Save this calculation to a new column named 'bill'.
"""
# the dataframe we want to reference
# the column we want to reference,
# what we want to do with that Series (map)
# from what reference are we mapping? (prices)
orders['bill'] = orders.drink.map(prices) + orders.meal.map(prices)

In [20]:
orders.head()

   drink      meal  bill
0  Water  Pad Thai    10
1  Water     Curry    11
2  Water  Pad Thai    10
3    Tea  Pad Thai    12
4  Water  Pad Thai    10
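One thing to watch for with `.map` and a dictionary: values that are not keys in the dictionary become `NaN`. The sketch below uses a shortened price dictionary and a made-up `'Coffee'` entry (both assumptions for illustration) to show this behavior.

```python
import pandas as pd

# Shortened price list; 'Coffee' is deliberately missing.
prices = {'Tea': 2, 'Water': 0}

# Hypothetical drink orders, including one item with no listed price.
drinks = pd.Series(['Tea', 'Water', 'Coffee'])

# Each value is looked up in the dictionary; unmatched values become NaN.
mapped = drinks.map(prices)

# Count the unmatched values before doing arithmetic with the result.
n_missing = mapped.isna().sum()

print(mapped)
print(n_missing)
```

Because `NaN` propagates through addition, a single unmatched item would make the whole `bill` for that order `NaN`, so checking `isna()` after mapping is a cheap safeguard.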


Let's take a look at how many orders have each combination of meal and drink:

In [21]:
pd.crosstab(orders.drink, orders.meal, margins=True)

meal   Curry  Pad Thai  Yakisoba Noodle  All
drink
Tea        2         5                4   11
Water      9        11                9   29
All       11        16               13   40


In [22]:
orders.pivot_table(index='drink', columns='meal', values='bill', aggfunc='mean')

meal   Curry  Pad Thai  Yakisoba Noodle
drink
Tea       13        12               11
Water     11        10                9


The pivot table above shows the average bill amount for each combination.

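It's interesting to note that we could find the same information with a multi-level group by. Every order with a given meal and drink has the same bill, so the `max` used here matches the `mean` from the pivot table: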

In [23]:
orders.groupby(['meal', 'drink']).bill.agg('max')

meal             drink
Curry            Tea      13
                 Water    11
Pad Thai         Tea      12
                 Water    10
Yakisoba Noodle  Tea      11
                 Water     9
Name: bill, dtype: int64

The choice between group by and a pivot table here is mostly aesthetic, and you should use whichever makes more sense to you for the problem at hand.
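To illustrate how closely the two approaches are related, the following sketch (using a small made-up `toy_orders` DataFrame, an assumption for illustration) reshapes a multi-level group by into the same wide layout as a pivot table with `.unstack()`.

```python
import pandas as pd

# Hypothetical toy orders, invented for this illustration.
toy_orders = pd.DataFrame({
    'drink': ['Tea', 'Water', 'Water', 'Tea'],
    'meal': ['Curry', 'Curry', 'Pad Thai', 'Pad Thai'],
    'bill': [13, 11, 10, 12],
})

# Wide layout: one row per drink, one column per meal.
pivoted = toy_orders.pivot_table(
    index='drink', columns='meal', values='bill', aggfunc='mean'
)

# Long layout from a multi-level group by, then unstack the inner level
# ('meal') into columns to get the same wide layout.
grouped = toy_orders.groupby(['drink', 'meal'])['bill'].mean().unstack()

print(pivoted)
print(grouped)
```

The group by result stays in long form until `.unstack()` is called, which can be handier for further filtering; the pivot table goes straight to the wide form that is easier to read at a glance.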

### Transposing

In [24]:
orders.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30,31,32,33,34,35,36,37,38,39
drink,Water,Water,Water,Tea,Water,Water,Water,Water,Water,Water,...,Tea,Tea,Tea,Water,Water,Water,Water,Tea,Water,Water
meal,Pad Thai,Curry,Pad Thai,Pad Thai,Pad Thai,Pad Thai,Yakisoba Noodle,Pad Thai,Pad Thai,Pad Thai,...,Pad Thai,Pad Thai,Pad Thai,Yakisoba Noodle,Curry,Yakisoba Noodle,Yakisoba Noodle,Yakisoba Noodle,Pad Thai,Yakisoba Noodle
bill,10,11,10,12,10,10,9,10,10,10,...,12,12,12,9,11,9,9,11,10,9


In [25]:
df.describe().T

         count       mean        std   min    25%   50%    75%   max
math      12.0       82.0  12.210279  62.0  72.25  81.5  94.25  98.0
english   12.0  74.666667   8.855746  61.0   70.5  74.5   79.5  92.0
reading   12.0      83.25  10.190236  65.0  78.75  84.0   89.5  99.0
