# How to use Apply 

The most general way to handle the split-apply-combine paradigm is, unsurprisingly, the apply() function. We pass it a function, and that function does not have to return a scalar.

In [1]:
import pandas as pd 
tips = pd.read_csv('tips.csv')
tips.head()

,total_bill,tip,smoker,day,time,size
0,16.99,1.01,No,Sun,Dinner,2
1,10.34,1.66,No,Sun,Dinner,3
2,21.01,3.5,No,Sun,Dinner,3
3,23.68,3.31,No,Sun,Dinner,2
4,24.59,3.61,No,Sun,Dinner,4


We use the same dataset as before, but now we want the top tippers. More specifically, we want the top two tippers in both the smoker group and the non-smoker group. Hence, we first write a function that finds the top two rows by tip, and then apply that function to each group.

In [2]:
def get_first_two(df):  # return a dataframe with the two largest tips
    # sort ascending by tip, then keep the last two rows
    return df.sort_values(['tip'])[-2:]

In [3]:
top_tipper_by_group = tips.groupby(['smoker']).apply(get_first_two)
top_tipper_by_group

smoker,,total_bill,tip,smoker,day,time,size
No,23,39.42,7.58,No,Sat,Dinner,4
No,212,48.33,9.0,No,Sat,Dinner,4
Yes,183,23.17,6.5,Yes,Sun,Dinner,4
Yes,170,50.81,10.0,Yes,Sat,Dinner,3


Note that we never have to combine the groups manually, since apply() handles this under the hood. Also, agg() (discussed in the last section) will not work here, because this is not an aggregation: each group returns two rows rather than a single value.
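To make the "under the hood" part concrete, here is a rough sketch of the same split-apply-combine steps done by hand. The small `toy` frame is made-up data standing in for tips.csv, and the manual loop is only an illustration of the idea, not how pandas actually implements apply().

```python
import pandas as pd

# Hypothetical toy data standing in for tips.csv
toy = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes"],
    "tip": [1.0, 3.0, 2.0, 5.0, 4.0],
})

def get_first_two(df):
    # sort ascending by tip, then keep the last two rows
    return df.sort_values(["tip"])[-2:]

# split: iterate over the groups; apply: call the function on each piece
pieces = {key: get_first_two(group) for key, group in toy.groupby("smoker")}

# combine: stack the pieces, using the group key as the outer index level
manual = pd.concat(pieces, names=["smoker"])
print(manual)
```

The result has a two-level index (group key, original row label), which is the same shape as the apply() output shown earlier.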

The function inside apply() can return almost anything. If each group returns a scalar or some other object that pandas does not know how to combine, pandas collects the results into a Series indexed by the group key, which is Yes and No in our case. If every group returns a dataframe, pandas stacks them with a concatenate operation.
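A small sketch of how the return type shapes the result. It uses a made-up `toy` frame and a single column (so each group is a Series rather than a dataframe) to keep the output short; the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for tips.csv
toy = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes"],
    "tip": [1.0, 3.0, 2.0, 5.0, 4.0],
})

grouped = toy.groupby("smoker")["tip"]

# The function returns a scalar: one value per group,
# collected into a Series indexed by the group key
tip_range = grouped.apply(lambda s: s.max() - s.min())

# The function returns several rows per group: the pieces are
# concatenated and the group key becomes the outer index level
top_two = grouped.apply(lambda s: s.nlargest(2))

print(tip_range)
print(top_two)
```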

Here's one last trick. If the operation you want is not available as a top-level pd function and can only be called from a dataframe instance, you can wrap it in a lambda function. For instance, there is df.describe() to get numeric summaries, but there is no pd.describe() that you can call without an instance. .info() is another example. Here is the solution.

In [4]:
def customized_describe(x): # this is awkward
    return x.describe()
tips.groupby(['smoker']).apply(customized_describe)
tips.groupby(['smoker']).apply(lambda x: x.describe()) #much better!

smoker,,total_bill,tip,size
No,count,151.0,151.0,151.0
No,mean,19.188278,2.991854,2.668874
No,std,8.255582,1.37719,1.017984
No,min,7.25,1.0,1.0
No,25%,13.325,2.0,2.0
No,50%,17.59,2.74,2.0
No,75%,22.755,3.505,3.0
No,max,48.33,9.0,6.0
Yes,count,93.0,93.0,93.0
Yes,mean,20.756344,3.00871,2.408602


There is one last detail. If we use an optimized operation, such as calling max() or mean() directly on the groupby object, the group key no longer appears as a column, since it has become the index of the result.

In [5]:
tips.groupby('smoker').max() #no smoker column

smoker,total_bill,tip,day,time,size
No,48.33,9.0,Thur,Lunch,6
Yes,50.81,10.0,Thur,Lunch,5


You no longer see smoker among the columns. However, if you call the general-purpose apply() function, the behavior is different. Here is a demo:

In [6]:
import numpy as np
tips.groupby('smoker').apply(np.max) #see how we got smoker column here!

smoker,total_bill,tip,smoker,day,time,size
No,48.33,9.0,No,Thur,Lunch,6
Yes,50.81,10.0,Yes,Thur,Lunch,5
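
If you want the optimized method but still need the group key as an ordinary column, one option is the `as_index=False` argument of groupby(). The sketch below uses a made-up `toy` frame rather than tips.csv, so the numbers are only illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for tips.csv
toy = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes"],
    "tip": [1.0, 3.0, 2.0, 5.0, 4.0],
    "size": [2, 3, 3, 2, 4],
})

# Default: the group key becomes the index and is not a column
by_index = toy.groupby("smoker").max()

# as_index=False: the group key stays as a regular column
by_column = toy.groupby("smoker", as_index=False).max()

print(by_index)
print(by_column)
```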
