# Intro to Pandas
by Ryan Orsinger

## Module 4: Aggregating
- Using `.crosstab` to compute a frequency count for each category pairing
- Using `.pivot_table` to calculate aggregates of numeric values for each category pairing (same as a spreadsheet pivot table)

In [2]:
# Import pandas
import pandas as pd

# Read in some data
df = pd.read_csv("../datasets/tips.csv")
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


### What is `.crosstab`?
- Crosstab computes a simple cross tabulation of two (or more) factors
- By default, it computes a frequency table of the factors
- Example: counting how many tables ate lunch or dinner on each day
- Example: counting the number of smoking tables broken out by gender
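The second example above can be sketched on a small stand-in frame. This is only an illustration: `toy` and its six rows are made up, and only the column names `sex` and `smoker` come from the tips data.

```python
import pandas as pd

# Hypothetical six-row stand-in for the tips data (values are made up)
toy = pd.DataFrame({
    "sex":    ["Female", "Male", "Male", "Female", "Male", "Female"],
    "smoker": ["No", "Yes", "No", "Yes", "Yes", "No"],
})

# One row per sex, one column per smoker value, each cell is a row count
smoker_by_sex = pd.crosstab(index=toy["sex"], columns=toy["smoker"])
print(smoker_by_sex)
```

Each cell answers "how many rows have this sex and this smoker value", which is the frequency table described in the bullets.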

In [3]:
# Say we needed to get all the different days
df.day.unique()

array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)

In [4]:
# And all the different times
df.time.unique()

array(['Dinner', 'Lunch'], dtype=object)

In [8]:
# To count Thursday lunches, we need this compound boolean filter
# .shape[0] is the number of rows that match both conditions
df[(df.day == "Thur") & (df.time == "Lunch")].shape[0]

61

In [9]:
# To count Thursday Dinner, we need this compound indexing operation
# Repeat this for each day/time combination...
df[(df.day == "Thur") & (df.time == "Dinner")].shape[0]

1

In [10]:
# As another approach,
# we could run .time.value_counts() on each individual day
# But this would get tedious, too,
# especially if the possible values are numerous
df[df.day == "Thur"].time.value_counts()

time
Lunch     61
Dinner     1
Name: count, dtype: int64

In [11]:
# Crosstab to the rescue!
# Frequency count of all days by all times
pd.crosstab(index=df.day, columns=df.time)

time,Dinner,Lunch
day,,
Fri,12,7
Sat,87,0
Sun,76,0
Thur,1,61
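For intuition, the crosstab above can be thought of as a group-by count reshaped into a grid. The sketch below compares the two on a hypothetical seven-row frame (`toy` and its values are made up). It is one way to reproduce the counts by hand, not a description of how pandas implements `crosstab` internally.

```python
import pandas as pd

# Hypothetical stand-in for the day/time columns of the tips data
toy = pd.DataFrame({
    "day":  ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sat"],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Dinner"],
})

by_crosstab = pd.crosstab(index=toy["day"], columns=toy["time"])

# Count rows per (day, time) pair, then move the time level into columns;
# pairs that never occur are filled with 0
by_groupby = toy.groupby(["day", "time"]).size().unstack(fill_value=0)

print(by_crosstab)
print(by_groupby)
```

Both tables hold the same counts. The crosstab call simply gets there in one step.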


In [12]:
# margins=True shows the row/column totals
pd.crosstab(index=df.day, columns=df.time, margins=True)

time,Dinner,Lunch,All
day,,,
Fri,12,7,19
Sat,87,0,87
Sun,76,0,76
Thur,1,61,62
All,176,68,244


In [13]:
# normalize=True shows proportions of the grand total instead of raw counts
pd.crosstab(index=df.day, columns=df.time, margins=True, normalize=True).round(2)

time,Dinner,Lunch,All
day,,,
Fri,0.05,0.03,0.08
Sat,0.36,0.0,0.36
Sun,0.31,0.0,0.31
Thur,0.0,0.25,0.25
All,0.72,0.28,1.0
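The `normalize` argument also accepts `"index"` or `"columns"` to compute proportions within each row or each column rather than over the whole table. A minimal sketch, assuming a hypothetical six-row frame named `toy`:

```python
import pandas as pd

# Hypothetical stand-in for the day/time columns of the tips data
toy = pd.DataFrame({
    "day":  ["Fri", "Fri", "Fri", "Fri", "Thur", "Thur"],
    "time": ["Dinner", "Dinner", "Dinner", "Lunch", "Lunch", "Lunch"],
})

# normalize="index": each row sums to 1 (share of each time within a day)
row_share = pd.crosstab(index=toy["day"], columns=toy["time"], normalize="index")

# normalize="columns": each column sums to 1 (share of each day within a time)
col_share = pd.crosstab(index=toy["day"], columns=toy["time"], normalize="columns")

print(row_share)
print(col_share)
```

Multiply the result by 100 if percentages read better than proportions.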


In [14]:
# We can also pass lists of series into either index or columns
pd.crosstab(index=df.day, columns=[df.time, df.smoker])

time,Dinner,Dinner,Lunch,Lunch
smoker,No,Yes,No,Yes
day,,,,
Fri,3,9,1,6
Sat,45,42,0,0
Sun,57,19,0,0
Thur,1,0,44,17


## Using `.pivot_table` to aggregate more than counts
- Use `.pivot_table` to set up category pairings, then specify the column(s) to aggregate and the aggregate function(s)
- The `.pivot_table` method defaults to the mean as its aggregate function
- We can specify multiple categories in the index and columns, but the results can become visually busy
- Example: for each day/time pairing, calculate the average `total_bill`
- Example: for each day/time pairing, get the average `total_bill` and `tip`
- Example: for each day/time pairing, calculate the min, median, max `tip`

In [19]:
# Without a "values" column, pivot_table tries to average every remaining column.
# Recent pandas versions raise a TypeError here because text columns such as sex and smoker cannot be averaged
pd.pivot_table(df, index="day", columns="time")

TypeError: agg function failed [how->mean,dtype->object]
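One way around the TypeError above is to hand `pivot_table` only the numeric columns. The sketch below assumes a hypothetical four-row frame named `toy`, and the helper variable `numeric_cols` is introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the tips data, mixing numeric and text columns
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip":        [1.0, 2.0, 3.0, 5.0],
    "sex":        ["Female", "Male", "Male", "Female"],
    "day":        ["Fri", "Fri", "Sat", "Sat"],
    "time":       ["Lunch", "Dinner", "Dinner", "Dinner"],
})

# Collect the numeric column names so text columns such as sex are left out
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Average every numeric column for each day/time pairing
means = pd.pivot_table(toy, index="day", columns="time", values=numeric_cols)
print(means)
```

The result has two column levels: the value column on top and `time` underneath.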

In [20]:
# Use the values argument to specify numeric column(s)
pd.pivot_table(df, index="day", columns="time", values="total_bill")

time,Dinner,Lunch
day,,
Fri,19.663333,12.845714
Sat,20.441379,
Sun,21.41,
Thur,18.78,17.664754
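The blank cells in the output above are `NaN`: no rows exist for those day/time pairings, so there is nothing to average. The `fill_value` argument can substitute a placeholder. A minimal sketch on a hypothetical four-row frame named `toy`:

```python
import pandas as pd

# Hypothetical stand-in for the tips data; Sat has no Lunch rows
toy = pd.DataFrame({
    "day":        ["Fri", "Fri", "Sat", "Sat"],
    "time":       ["Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [12.0, 20.0, 18.0, 22.0],
})

# Default: the Sat/Lunch cell has no rows behind it, so it comes back as NaN
with_gaps = pd.pivot_table(toy, index="day", columns="time", values="total_bill")

# fill_value replaces those empty cells with a placeholder
filled = pd.pivot_table(
    toy, index="day", columns="time", values="total_bill", fill_value=0
)

print(with_gaps)
print(filled)
```

Whether 0 is a sensible placeholder depends on the question. For an average, leaving the `NaN` in place may be the more honest choice.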


In [None]:
# Pass a list to "values" to aggregate more than one column
pd.pivot_table(df, index="day", columns="time", values=["total_bill", "tip"])

In [None]:
# Use the aggfunc argument to override the default mean function
pd.pivot_table(df, values="tip", aggfunc="median", index="day", columns="time")

In [None]:
# The aggfunc argument can take a list of aggregate functions
pd.pivot_table(df, values="tip", aggfunc=["min", "median", "max"], index="day", columns="time")
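To show the shape of the result when `aggfunc` is a list, here is a sketch on a hypothetical five-row frame named `toy`. The dict form at the end is an extra variation not used in the cells above: it maps each value column to its own function.

```python
import pandas as pd

# Hypothetical stand-in for the tips data (values are made up)
toy = pd.DataFrame({
    "day":        ["Fri", "Fri", "Fri", "Sat", "Sat"],
    "time":       ["Lunch", "Dinner", "Dinner", "Dinner", "Dinner"],
    "total_bill": [12.0, 20.0, 30.0, 18.0, 22.0],
    "tip":        [2.0, 3.0, 5.0, 4.0, 1.0],
})

# A list of functions adds an outer column level, one block per function
tip_stats = pd.pivot_table(
    toy, values="tip", index="day", columns="time", aggfunc=["min", "max"]
)

# A dict maps each value column to its own aggregate function
mixed = pd.pivot_table(
    toy,
    values=["total_bill", "tip"],
    index="day",
    columns="time",
    aggfunc={"total_bill": "mean", "tip": "max"},
)

print(tip_stats)
print(mixed)
```

In both cases the columns become a two-level index, so a single cell is addressed with a tuple such as `mixed.loc["Fri", ("tip", "Dinner")]`.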

## Additional Resources
- https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
- https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

## Exercises
- Use crosstab on the `tips` dataframe to count the number of differently sized tables for each time of day. *Hint*: remember that `.size` is a built-in attribute on pandas objects, so `df.size` does not return the `size` column.
- Use `pd.read_csv` and the `mpg.csv` file to create a dataframe named `mpg`.
- Use `.crosstab` to count the number of vehicles for each combination of class and drivetrain. *Hint*: remember that `class` is a reserved word in Python, so `mpg.class` is a syntax error.
- Use `.crosstab` to count the number of vehicles for each combination of manufacturer and drivetrain.
- Use `.pivot_table` and `mpg` to calculate the average highway mileage for each combination of vehicle class and drivetrain. 
- Use `.pivot_table` and `mpg` to calculate the median city mileage for each combination of manufacturer and drivetrain.

In [None]:
# Use crosstab on the tips dataframe to count the number of differently sized tables for each time of day.
# Hint: remember that .size is a built-in attribute on pandas objects.


In [None]:
# Use pd.read_csv and the mpg.csv file to create a dataframe named mpg.


In [None]:
# Use .crosstab to count the number of vehicles for each combination of class and drivetrain.
# Hint: remember that "class" is a reserved word in Python.


In [None]:
# Use .crosstab to count the number of vehicles for each combination of manufacturer and drivetrain.


In [None]:
# Use .pivot_table and mpg to calculate the average highway mileage for each combination of vehicle class and drivetrain.


In [None]:
# Use .pivot_table and mpg to calculate the median city mileage for each combination of manufacturer and drivetrain.
