# The three levels of aggregation

In the previous section, creating the groupby object completed only the first step (split) of the split-apply-combine paradigm. The effort needed for the remaining steps varies. Here are three situations, from simple to complex.

1. If we want a common statistic such as mean, max, or size, we can call it directly on the groupby object. These methods are built into the groupby object and have been optimized.
2. If the method exists on each Series/DataFrame chunk, like quantile, but is not implemented on the groupby object itself, we can still call it on the groupby object. Internally, groupby slices up the Series/DataFrame, calls quantile on each piece, and stitches the results together. (Recent pandas versions implement quantile directly on groupby, but the call looks the same.)
3. If we want a custom operation that is part of neither groupby nor DataFrame/Series, we have to write our own aggregation function and pass it to the agg method.

There are technical differences between the first two situations, but to the end user they look the same.
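To make the second situation concrete, here is a minimal sketch of what "slice, call, stitch" could look like when done by hand. The `toy` frame and its values are made up for illustration (it only stands in for tips.csv), and the loop is not how pandas implements it internally; it just reproduces the same result.

```python
import pandas as pd

# Small made-up frame standing in for tips.csv (values are illustrative only).
toy = pd.DataFrame({
    "day": ["Fri", "Fri", "Fri", "Sat", "Sat", "Sat"],
    "tip": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# Split: iterating over a groupby yields (key, chunk) pairs.
# Apply: call the ordinary Series method on each chunk.
pieces = {day: chunk["tip"].quantile(0.5) for day, chunk in toy.groupby("day")}

# Combine: stitch the per-group scalars back into one Series indexed by day.
manual = pd.Series(pieces, name="tip").rename_axis("day")

# The one-liner that groupby offers for the same computation.
direct = toy.groupby("day")["tip"].quantile(0.5)

print(manual)
print(direct)
```

Both printouts should show the same per-day medians, which is why the first two situations feel identical to the end user.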

In [1]:
import pandas as pd 
tips = pd.read_csv('tips.csv')
tips.head()

,total_bill,tip,smoker,day,time,size
0,16.99,1.01,No,Sun,Dinner,2
1,10.34,1.66,No,Sun,Dinner,3
2,21.01,3.5,No,Sun,Dinner,3
3,23.68,3.31,No,Sun,Dinner,2
4,24.59,3.61,No,Sun,Dinner,4


In [2]:
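# Select the numeric columns before aggregating; newer pandas versions refuse to average string columns
example_of_scenario_1 = tips.groupby('day')[['total_bill', 'tip']].mean()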
example_of_scenario_1

day,total_bill,tip
Fri,17.151579,2.734737
Sat,20.441379,2.993103
Sun,21.41,3.255132
Thur,17.682742,2.771452


In [3]:
# 90th percentile of each numeric column within each day
example_of_scenario_2 = tips.groupby('day')[['total_bill', 'tip']].quantile(0.9)
example_of_scenario_2.columns.names = ['']
example_of_scenario_2

day,total_bill,tip
Fri,27.618,4.06
Sat,31.894,4.802
Sun,33.765,5.035
Thur,28.316,4.92


In [11]:
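def find_range(arr):
    """Return the spread (max minus min) of one group's values."""
    return arr.max() - arr.min()

# Select the numeric columns first: subtraction is not defined for the string columns
example_of_scenario_3 = tips.groupby('day')[['total_bill', 'tip']].agg(find_range)
example_of_scenario_3.head()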

day,total_bill,tip
Fri,34.42,3.73
Sat,47.74,9.0
Sun,40.92,5.49
Thur,35.6,5.45


# More operations and more variables

This section answers two questions:
1. What if we want to apply more than one operation to a single variable?
2. What if we want to apply different operations to different variables?

In [6]:
import numpy as np
# Mix a string name, a NumPy function, and our own function in one list
solution_to_question_1 = tips.groupby('day')[['total_bill', 'tip']].agg(['std', np.mean, find_range])
solution_to_question_1

,total_bill,total_bill,total_bill,tip,tip,tip
day,std,mean,find_range,std,mean,find_range
Fri,8.30266,17.151579,34.42,1.019577,2.734737,3.73
Sat,9.480419,20.441379,47.74,1.631014,2.993103,9.0
Sun,8.832122,21.41,40.92,1.23488,3.255132,5.49
Thur,7.88617,17.682742,35.6,1.240223,2.771452,5.45


In [12]:
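# (name, function) pairs; the string 'std' is pandas' sample standard deviation (ddof=1)
named_ops = [('standard dev', 'std'), ('mean', np.mean), ('range', find_range)]
solution_to_question_1_custname = tips.groupby('day')[['total_bill', 'tip']].agg(named_ops)
solution_to_question_1_custname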

,total_bill,total_bill,total_bill,tip,tip,tip
day,standard dev,mean,range,standard dev,mean,range
Fri,8.30266,17.151579,34.42,1.019577,2.734737,3.73
Sat,9.480419,20.441379,47.74,1.631014,2.993103,9.0
Sun,8.832122,21.41,40.92,1.23488,3.255132,5.49
Thur,7.88617,17.682742,35.6,1.240223,2.771452,5.45


To recap, we simply pass a list of operations to the agg method. The list may mix built-in functions referred to by name as strings (for instance 'mean' or 'std'), NumPy aggregation functions, and even custom functions.

Moreover, we can name a function's output by passing a tuple: the first element is the name and the second is the operation (function). This makes the resulting columns more readable. More specifically, when a single operation is passed on its own (not in a list), the column names stay the same; when a list of operations is passed, the output columns become hierarchical (a MultiIndex). The original column name becomes level 0 (the top level), and the columns generated by the operations get level 1 names.

Note that each function used here has to be an aggregation function, that is, one that produces a scalar value from an array.
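The difference between flat and hierarchical output columns can be checked on a small frame. This is only a sketch: `toy` and its values are invented for illustration, while `find_range` is the same helper defined earlier.

```python
import pandas as pd

# Made-up data for illustration; not taken from tips.csv.
toy = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
    "tip": [1.0, 3.0, 2.0, 6.0],
})

def find_range(arr):
    # An aggregation function: reduces one group's values to a single scalar.
    return arr.max() - arr.min()

grouped = toy.groupby("day")[["total_bill", "tip"]]

# One operation passed on its own: the column names stay flat.
single = grouped.agg(find_range)

# A list of (name, function) tuples: the columns become a two-level MultiIndex.
multiple = grouped.agg([("mean", "mean"), ("range", find_range)])

print(single.columns.nlevels)
print(multiple.columns.nlevels)

# A column of the hierarchical result is addressed by (level 0, level 1).
print(multiple[("tip", "range")])
```

The level 0 label is the original column and the level 1 label is the name from the tuple, so `("tip", "range")` selects the range of tips per day.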

In [8]:
case_2_2 = tips.groupby(['day', 'smoker']).agg(
    {'total_bill': [('max_value', np.max)], 'tip': [('range', find_range), ('mean', 'mean')]})

In [9]:
case_2_2

,,total_bill,tip,tip
day,smoker,max_value,range,mean
Fri,No,22.75,2.0,2.8125
Fri,Yes,40.17,3.73,2.714
Sat,No,48.33,8.0,3.102889
Sat,Yes,50.81,9.0,2.875476
Sun,No,48.17,4.99,3.167895
Sun,Yes,45.35,5.0,3.516842
Thur,No,41.19,5.45,2.673778
Thur,Yes,43.11,3.0,3.03


The answer to the second question is to pass a dictionary. The keys are the names of the columns to aggregate, and the values are the operations. If a column needs more than one operation, pass a list; to customize the name of an output column, use a (name, function) tuple as the list item.
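As a rough illustration of the dictionary forms described above, the sketch below runs them on a tiny invented frame (`toy` is hypothetical and only mimics the shape of the tips data). Note that columns not named in the dictionary are left out of the result.

```python
import pandas as pd

# Made-up data for illustration; "size" is deliberately not aggregated.
toy = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat"],
    "smoker": ["No", "No", "Yes", "Yes"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
    "tip": [1.0, 3.0, 2.0, 6.0],
    "size": [2, 3, 2, 4],
})

def find_range(arr):
    # Aggregation function: one scalar per group.
    return arr.max() - arr.min()

grouped = toy.groupby(["day", "smoker"])

# One operation per column: the output keeps flat column names.
flat = grouped.agg({"total_bill": "max", "tip": "mean"})

# Lists of operations, with (name, function) tuples to rename the outputs.
nested = grouped.agg({
    "total_bill": [("max_value", "max")],
    "tip": [("range", find_range), ("mean", "mean")],
})

print(flat)
print(nested)
```

Only `total_bill` and `tip` appear in either result, because the dictionary keys decide which columns are aggregated.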