# Intro to Pandas
by Ryan Orsinger

## Module 4: Aggregating (continued)
- Using `.groupby` and aggregate methods
- Understanding the `.groupby` object
- Introducing the `.agg` method
- Specifying column output
- Grouping by multiple columns

In [1]:
# Import pandas
import pandas as pd

# Read in our data
df = pd.read_csv("../datasets/tips.csv")
df.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [2]:
# We've already worked with some aggregate functions
df.total_bill.median()

17.795

In [3]:
# Aggregate functions run on entire columns or dataframes
df.mean(numeric_only=True)

total_bill    19.785943
tip            2.998279
size           2.569672
dtype: float64

In [4]:
df.tip.min(), df.tip.max()

(1.0, 10.0)

In [5]:
# .describe also counts as an aggregate function, since it runs several aggregate functions at once
df.tip.describe()

count    244.000000
mean       2.998279
std        1.383638
min        1.000000
25%        2.000000
50%        2.900000
75%        3.562500
max       10.000000
Name: tip, dtype: float64

But what do we do when we need aggregate results for each value in a categorical column?

In [6]:
# It's possible to manually create a dataframe for each category,
# but this becomes tedious with many categories or multiple columns,
# especially if we want to run the same methods on each dataframe
thurs = df[df.day == "Thur"]
fri = df[df.day == "Fri"]
sat = df[df.day == "Sat"]
sun = df[df.day == "Sun"]

# We don't have labels with this method, unfortunately
thurs.total_bill.mean(), fri.total_bill.mean(), sat.total_bill.mean(), sun.total_bill.mean()

(17.682741935483868, 17.15157894736842, 20.441379310344825, 21.41)

In [7]:
# Instead, we group the rows and apply an aggregate method (.mean, .median, etc.) to the groupby object
# Calculate the average total bill for each day
# The "for each" means that we're grouping by the day column
df.groupby("day").total_bill.mean()

day
Fri     17.151579
Sat     20.441379
Sun     21.410000
Thur    17.682742
Name: total_bill, dtype: float64
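
The equivalence between the manual filters and `.groupby` can be checked on a tiny dataframe. This is a minimal sketch: the `sample` rows are made up for illustration and are not taken from `tips.csv`.

```python
import pandas as pd

# Made-up rows for illustration; not the real tips.csv data
sample = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 12.0, 18.0, 30.0, 40.0],
})

# Manual approach: one filter per category, collected into a labeled dict
manual = {
    day: sample[sample.day == day].total_bill.mean()
    for day in sample.day.unique()
}

# groupby approach: one expression, with the labels in the index
grouped = sample.groupby("day").total_bill.mean()

print(manual)
print(grouped)
```

Both approaches give the same averages; the groupby version keeps the day labels and scales to any number of categories.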

In [8]:
# The groupby object holds the rows split into groups; it is built to be used with aggregate functions
df.groupby("day")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000021172E01EB0>

In [9]:
# The groupby object does not print out results
# Under the hood, iterating over it yields one (group label, dataframe) tuple for each categorical value
# We recommend against decomposing groupby objects (this cell only shares context)
# That's what aggregate functions are for!
a, b, c, d = df.groupby("day")
a

('Fri',
      total_bill   tip     sex smoker  day    time  size
 90        28.97  3.00    Male    Yes  Fri  Dinner     2
 91        22.49  3.50    Male     No  Fri  Dinner     2
 92         5.75  1.00  Female    Yes  Fri  Dinner     2
 93        16.32  4.30  Female    Yes  Fri  Dinner     2
 94        22.75  3.25  Female     No  Fri  Dinner     2
 95        40.17  4.73    Male    Yes  Fri  Dinner     4
 96        27.28  4.00    Male    Yes  Fri  Dinner     2
 97        12.03  1.50    Male    Yes  Fri  Dinner     2
 98        21.01  3.00    Male    Yes  Fri  Dinner     2
 99        12.46  1.50    Male     No  Fri  Dinner     2
 100       11.35  2.50  Female    Yes  Fri  Dinner     2
 101       15.38  3.00  Female    Yes  Fri  Dinner     2
 220       12.16  2.20    Male    Yes  Fri   Lunch     2
 221       13.42  3.48  Female    Yes  Fri   Lunch     2
 222        8.58  1.92    Male    Yes  Fri   Lunch     1
 223       15.98  3.00  Female     No  Fri   Lunch     3
 224       13.42  1.58 
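
If you do need to look inside a groupby object, a loop or `.get_group` is usually clearer than tuple unpacking. The sketch below assumes a small made-up `sample` dataframe rather than the real tips data.

```python
import pandas as pd

# Made-up rows for illustration; not the real tips.csv data
sample = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 12.0, 18.0, 30.0, 40.0],
})

groups = sample.groupby("day")

# Iterating yields (group label, sub-dataframe) pairs
labels = []
for label, group in groups:
    labels.append(label)
    print(label, group.shape)

# .get_group returns the rows for a single label
fri = groups.get_group("Fri")

# .ngroups and .size() summarize the groups without unpacking them
n_groups = groups.ngroups
rows_per_day = groups.size()
print(n_groups)
print(rows_per_day)
```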

In [10]:
# Back to the recommended approach: rather than unpacking the groups,
# we apply an aggregate method (.mean, .median, etc.) to the groupby object
# Calculate the average total bill for each day
# The "for each" means that we're grouping by the day column
df.groupby("day").total_bill.mean()

day
Fri     17.151579
Sat     20.441379
Sun     21.410000
Thur    17.682742
Name: total_bill, dtype: float64

In [18]:
# Consider the following
# We get the average for each day, on all numeric columns
# Notice that grouping redefines what a row means: each row is now a day, not a single bill
df.groupby("day")[["total_bill", "size", "tip"]].mean()

day,total_bill,size,tip
Fri,17.151579,2.105263,2.734737
Sat,20.441379,2.517241,2.993103
Sun,21.41,2.842105,3.255132
Thur,17.682742,2.451613,2.771452


In [19]:
# We can also group by more than 1 column. This creates a multi-index
# Here we list all three numeric columns explicitly; string columns such as sex or smoker cannot be averaged
df.groupby(["day", "time"])[["total_bill", "size", "tip"]].mean()

day,time,total_bill,size,tip
Fri,Dinner,19.663333,2.166667,2.94
Fri,Lunch,12.845714,2.0,2.382857
Sat,Dinner,20.441379,2.517241,2.993103
Sun,Dinner,21.41,2.842105,3.255132
Thur,Dinner,18.78,2.0,3.0
Thur,Lunch,17.664754,2.459016,2.767705


In [16]:
# We can narrow the output to fewer columns while still grouping by more than 1 column
# The inner square brackets hold a list of column names, which is why we see double brackets
df.groupby(["day", "time"])[["total_bill", "tip"]].mean()

day,time,total_bill,tip
Fri,Dinner,19.663333,2.94
Fri,Lunch,12.845714,2.382857
Sat,Dinner,20.441379,2.993103
Sun,Dinner,21.41,3.255132
Thur,Dinner,18.78,3.0
Thur,Lunch,17.664754,2.767705
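
Grouping by two columns produces a two-level index, and `.loc` can select from it by outer label or by a full tuple. This is a minimal sketch on made-up rows; `sample` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows for illustration; not the real tips.csv data
sample = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0, 20.0, 12.0, 18.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 2.0, 4.0, 5.0, 7.0],
})

by_day_time = sample.groupby(["day", "time"])[["total_bill", "tip"]].mean()

# The index has two levels, one per grouping column
print(by_day_time.index.names)

# Select every row for one day using the outer level
fri_rows = by_day_time.loc["Fri"]

# Select a single (day, time) combination with a tuple
fri_lunch = by_day_time.loc[("Fri", "Lunch")]

print(fri_rows)
print(fri_lunch)
```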


In [20]:
# If we need the grouping columns as regular columns instead of the index, we can use .reset_index
df.groupby(["day", "time"])[["total_bill", "tip"]].mean().reset_index()

,day,time,total_bill,tip
0,Fri,Dinner,19.663333,2.94
1,Fri,Lunch,12.845714,2.382857
2,Sat,Dinner,20.441379,2.993103
3,Sun,Dinner,21.41,3.255132
4,Thur,Dinner,18.78,3.0
5,Thur,Lunch,17.664754,2.767705
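
An alternative to `.reset_index` is passing `as_index=False` to `.groupby`, which keeps the grouping columns as regular columns from the start. The sketch below compares the two on made-up rows (the `sample` dataframe is an assumption, not the tips data).

```python
import pandas as pd

# Made-up rows for illustration; not the real tips.csv data
sample = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0, 20.0, 12.0, 18.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 2.0, 4.0, 5.0, 7.0],
})

# Option 1: aggregate, then move the index levels back into columns
via_reset = sample.groupby(["day", "time"])[["total_bill", "tip"]].mean().reset_index()

# Option 2: ask groupby not to build an index from the grouping columns
flat = sample.groupby(["day", "time"], as_index=False)[["total_bill", "tip"]].mean()

print(via_reset)
print(flat)
```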


In [21]:
# Grouping and aggregating does not change the original dataframe
df.groupby("day")[["total_bill", "tip"]].mean()

day,total_bill,tip
Fri,17.151579,2.734737
Sat,20.441379,2.993103
Sun,21.41,3.255132
Thur,17.682742,2.771452
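
A quick way to convince yourself that the source dataframe is untouched is to compare it with a copy taken before grouping. This sketch uses made-up rows; `sample` and `snapshot` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration; not the real tips.csv data
sample = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri"],
    "total_bill": [10.0, 20.0, 12.0, 18.0],
    "tip": [1.0, 3.0, 2.0, 4.0],
})
snapshot = sample.copy()

# groupby returns a new object; assign the result if you want to keep it
day_means = sample.groupby("day")[["total_bill", "tip"]].mean()

# The source rows are identical to the copy taken before grouping
print(sample.equals(snapshot))
print(day_means)
```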


In [22]:
# .describe is an aggregate function, too
df.groupby("time").total_bill.describe()

time,count,mean,std,min,25%,50%,75%,max
Dinner,176.0,20.797159,9.142029,3.07,14.4375,18.39,25.2825,50.81
Lunch,68.0,17.168676,7.713882,7.51,12.235,15.965,19.5325,43.11


In [23]:
# Using the .agg method to specify multiple aggregate functions at once
df.groupby("day").total_bill.agg(["mean", "std"])

day,mean,std
Fri,17.151579,8.30266
Sat,20.441379,9.480419
Sun,21.41,8.832122
Thur,17.682742,7.88617


In [24]:
# Using the .agg method to specify multiple aggregate functions at once
# We can call .agg on multiple numeric columns, too
df.groupby("day")[["total_bill", "tip"]].agg(["mean", "std"])

,total_bill,total_bill,tip,tip
day,mean,std,mean,std
Fri,17.151579,8.30266,2.734737,1.019577
Sat,20.441379,9.480419,2.993103,1.631014
Sun,21.41,8.832122,3.255132,1.23488
Thur,17.682742,7.88617,2.771452,1.240223
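
When the two-level column header is more than you need, pandas also supports named aggregation, where each output column gets its own name, source column and function. The sketch below runs on made-up rows; the output names `avg_bill` and `max_tip` are illustrative choices.

```python
import pandas as pd

# Made-up rows for illustration; not the real tips.csv data
sample = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 12.0, 18.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 2.0, 4.0, 5.0, 7.0],
})

# Named aggregation: output_name=(source_column, aggregate_function)
summary = sample.groupby("day").agg(
    avg_bill=("total_bill", "mean"),
    max_tip=("tip", "max"),
)

print(summary)
```

The result has flat, self-describing column names, which is often easier to work with than a multi-level header.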


In [26]:
# Since the output is a dataframe, we can transpose it if that makes it easier to read
# Appending .T to the end of the expression transposes the dataframe:
# it swaps the columns and the index
df.groupby("day")[["total_bill", "tip"]].agg(["mean", "std"]).T

,day,Fri,Sat,Sun,Thur
total_bill,mean,17.151579,20.441379,21.41,17.682742
total_bill,std,8.30266,9.480419,8.832122,7.88617
tip,mean,2.734737,2.993103,3.255132,2.771452
tip,std,1.019577,1.631014,1.23488,1.240223


## The forms of .groupby

| Specific example | General form |
| ---- | ---- |
| `df.groupby("day").mean(numeric_only=True)` | `df.groupby("categorical_column").aggregate_function()` |
| `df.groupby("day").total_bill.mean()` | `df.groupby("categorical_column").numeric_column.aggregate_function()` |
| `df.groupby("day")["tip"].median()` | `df.groupby("categoryA")["numeric_columnA"].aggregate_function()` |
| `df.groupby("day")[["total_bill", "tip"]].min()` | `df.groupby("categoryA")[["numeric_columnA", "numeric_columnB"]].aggregate_function()` |
| `df.groupby(["day", "time"]).mean(numeric_only=True)` | `df.groupby(["categoryA", "categoryB"]).aggregate_function()` |
| `df.groupby("day").agg(["min", "median", "max"])` | `df.groupby("category").agg(["min", "median", "max"])` |
| `df.groupby("day")[["total_bill", "tip"]].agg(["min", "median", "max"])` | `df.groupby("category")[["numericA", "numericB"]].agg(["min", "median", "max"])` |

## Additional Resources
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html
- Further reading on the multi-index that results from grouping by multiple columns: https://pandas.pydata.org/docs/user_guide/advanced.html

## Exercises
- Use the "mpg.csv" dataset to create a dataframe named `mpg`
- Group by manufacturer and obtain the highest `hwy` mileage for each manufacturer
- Group by the manufacturer and obtain the average `hwy` and `cty` mileage
- Group by the number of cylinders and get the average displacement for each cylinder count
- Group by the vehicle class, then calculate the average and standard deviation of `hwy` mileage
- Which vehicle class has the largest standard deviation of hwy mileage?

In [27]:
# Use the "mpg.csv" dataset to create a dataframe named `mpg`
mpg = pd.read_csv("../datasets/mpg.csv")
mpg

,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
0,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
1,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
2,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
3,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
4,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
229,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
230,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
231,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
232,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


In [28]:
# Group by manufacturer and obtain the highest hwy mileage for each manufacturer
mpg.groupby("manufacturer").hwy.max()

manufacturer
audi          31
chevrolet     30
dodge         24
ford          26
honda         36
hyundai       31
jeep          22
land rover    18
lincoln       18
mercury       19
nissan        32
pontiac       28
subaru        27
toyota        37
volkswagen    44
Name: hwy, dtype: int64

In [29]:
# Group by the manufacturer and obtain the average hwy and cty mileage
mpg.groupby("manufacturer")[["hwy", "cty"]].mean()

manufacturer,hwy,cty
audi,26.444444,17.611111
chevrolet,21.894737,15.0
dodge,17.945946,13.135135
ford,19.36,14.0
honda,32.555556,24.444444
hyundai,26.857143,18.642857
jeep,17.625,13.5
land rover,16.5,11.5
lincoln,17.0,11.333333
mercury,18.0,13.25


In [30]:
# Group by the number of cylinders and get the average displacement for each cylinder count
mpg.groupby("cyl").displ.mean()

cyl
4    2.145679
5    2.500000
6    3.408861
8    5.132857
Name: displ, dtype: float64

In [31]:
# Group by the vehicle class, then calculate the average and standard deviation of hwy mileage
mpg.groupby("class").hwy.agg(["std", "mean"])

class,std,mean
2seater,1.30384,24.8
compact,3.78162,28.297872
midsize,2.13593,27.292683
minivan,2.062655,22.363636
pickup,2.27428,16.878788
subcompact,5.375012,28.142857
suv,2.977973,18.129032


In [45]:
# Which vehicle class has the largest standard deviation of hwy mileage?
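# sort_values(by=...) applies to dataframes; passing a list to .agg returns a dataframe, even for one function
# A single aggregate method such as .std() returns a Series, which is sorted with .sort_values() and no `by`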
mpg.groupby("class").hwy.agg(["std", "mean"]).sort_values(by="std", ascending=False)

class,std,mean
subcompact,5.375012,28.142857
compact,3.78162,28.297872
suv,2.977973,18.129032
pickup,2.27428,16.878788
midsize,2.13593,27.292683
minivan,2.062655,22.363636
2seater,1.30384,24.8
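
When only one statistic is needed, the same question can be answered from a Series. This is a minimal sketch on made-up rows standing in for `mpg.csv`; the `cars` dataframe and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for mpg.csv; not the real data
cars = pd.DataFrame({
    "class": ["compact", "compact", "suv", "suv", "pickup", "pickup"],
    "hwy": [28, 32, 18, 19, 15, 21],
})

# A single aggregation returns a Series; sort_values takes no `by` argument here
hwy_std = cars.groupby("class").hwy.std().sort_values(ascending=False)

# idxmax returns the index label of the largest value directly
most_variable = cars.groupby("class").hwy.std().idxmax()

print(hwy_std)
print(most_variable)
```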
