# Groupby (split -> apply -> combine)
## 1. Introduction
When working with a dataframe, we often want to partition the data into groups. For example, given a long list of students, we may want to know the average GPA of the students in each major. If we look closely at this process, we can separate it into three key steps.

1. SPLIT
2. APPLY
3. COMBINE

**SPLIT** partitions the whole dataframe into smaller groups based on the criteria we want to examine. The criteria depend on the dataframe and the situation: we may want to group people by nationality, or movies by genre. Columns that hold categorical data are good candidates for grouping.

**APPLY** performs an action or calculation on the data points of each group. If we want to know the average GPA, then after splitting the students into groups, we pass each group to a function that performs the calculation.

**COMBINE** puts the results together so that we can compare and contrast them easily. We may use seaborn to visualize the result.

Since this process is so common, pandas provides a built-in mechanism for it: a method called **groupby**.
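
As a minimal sketch of the three steps, assuming a small made-up `students` table (the names, majors, and GPAs below are invented for illustration), a single groupby call can compute the average GPA per major:

```python
import pandas as pd

# Hypothetical student records, invented purely for illustration.
students = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat", "Dan", "Eve", "Fay"],
    "major": ["Math", "Math", "Physics", "Physics", "Physics", "Art"],
    "gpa": [3.0, 4.0, 2.5, 3.5, 3.0, 3.8],
})

# SPLIT by major, APPLY mean() to the gpa column, COMBINE into one Series.
avg_gpa = students.groupby("major")["gpa"].mean()
print(avg_gpa)
```

The result is a Series indexed by major, with one average per group.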

## 2. Aggregate
As we learn in statistics class, when we have a group of data points, we may want one value that represents all of them. Such a value is called **a point estimate**; the mean, median, and mode are common examples. In other situations we may want some other single value computed from the group, such as the min, the max, or a value extracted by our own function. The process of calculating a single value from a set of data points is called **aggregation**.
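
To make the idea concrete, here is a small sketch that computes several point estimates from one set of values. The numbers in `values` are made up for illustration.

```python
import pandas as pd

# Hypothetical data points, invented for illustration.
values = pd.Series([1, 2, 2, 3, 7])

# Each entry reduces the whole Series to a single value (an aggregation).
point_estimates = {
    "mean": values.mean(),            # arithmetic average
    "median": values.median(),        # middle value after sorting
    "mode": values.mode().iloc[0],    # most frequent value (first one if tied)
    "min": values.min(),
    "max": values.max(),
}
print(point_estimates)
```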

In [1]:
import pandas as pd
import seaborn as sns

df = sns.load_dataset("tips")
df.head(3)


   total_bill   tip     sex  smoker  day    time  size
0       16.99  1.01  Female      No  Sun  Dinner     2
1       10.34  1.66    Male      No  Sun  Dinner     3
2       21.01  3.50    Male      No  Sun  Dinner     3


In [32]:
ser = df.groupby("sex").tip.mean()
# ser = df.groupby("sex")["tip"].mean()  # same as above
ser

sex
Male      3.089618
Female    2.833448
Name: tip, dtype: float64

As we can see, groupby is very easy to use. However, we must always pay attention to how we interpret the result.
If we want to know which group tips more generously, averaging the raw tip amounts may not be a good approach, because larger bills tend to come with larger tips. Instead, we can calculate each tip as a proportion of the total bill.

In [5]:
df["tip_pct"] = df["tip"]/df["total_bill"]
df.head(3)

   total_bill   tip     sex  smoker  day    time  size   tip_pct
0       16.99  1.01  Female      No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male      No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male      No  Sun  Dinner     3  0.166587


In [6]:
ser = df.groupby("sex").tip_pct.mean()
ser

sex
Male      0.157651
Female    0.166491
Name: tip_pct, dtype: float64

We can also think of **groupby** as a way to segment the dataset based on the unique values of a particular column.

In [9]:
sexCategory = df.sex.unique()
sexCategory

['Female', 'Male']
Categories (2, object): ['Male', 'Female']

### Manual Split

In [10]:
df_Female = df[ df.sex=="Female"]
df_Male = df[df.sex=="Male"]

### Manual Apply

In [11]:
female_avg_tips = df_Female["tip"].mean()
female_avg_tips

2.833448275862069

In [12]:
male_avg_tips = df_Male["tip"].mean()
male_avg_tips

3.0896178343949052

### Manual Combine

In [36]:
ser = pd.Series( {'Female':female_avg_tips, 'Male': male_avg_tips}, name = 'tip')
ser.index.name = 'Sex'
ser

Sex
Female    2.833448
Male      3.089618
Name: tip, dtype: float64
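
The manual steps above can also be written as a loop over the groupby object itself. This is a minimal sketch on a made-up table (`toy` and its values are hypothetical, not the tips data); it relies on the fact that iterating over a groupby object yields pairs of group label and sub-dataframe.

```python
import pandas as pd

# Small hypothetical table standing in for the tips dataset.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "tip": [1.0, 2.0, 4.0, 3.0, 3.0],
})

# SPLIT: each iteration hands us one group's label and its rows.
manual = {}
for label, group in toy.groupby("sex"):
    manual[label] = group["tip"].mean()      # APPLY

manual_ser = pd.Series(manual, name="tip")   # COMBINE

# The one-line groupby call produces the same numbers.
builtin_ser = toy.groupby("sex")["tip"].mean()
print(manual_ser)
print(builtin_ser)
```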

## Built-in Aggregation Methods

In [20]:
df.groupby('sex').tip.describe()

        count      mean       std  min  25%   50%   75%   max
sex
Male    157.0  3.089618  1.489102  1.0  2.0  3.00  3.76  10.0
Female   87.0  2.833448  1.159495  1.0  2.0  2.75  3.50   6.5


There are various built-in functions from pandas and NumPy that we can use.
From the textbook:
![built-in aggregation functions](./assets/BuildinAggFunc.png)
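
As a small sketch of how several built-in aggregations can be requested at once, the example below passes a list of function names to `agg`. The `toy` table is hypothetical and only stands in for the tips data.

```python
import pandas as pd

# Hypothetical stand-in for the tips data, invented for illustration.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "tip": [1.0, 2.0, 4.0, 3.0, 3.0],
})

# A list of built-in aggregation names gives one output column per function.
summary = toy.groupby("sex")["tip"].agg(["count", "mean", "median", "min", "max"])
print(summary)
```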



## Custom Function (aggregate)
We can use our own custom function as well.

In [37]:
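def my_lucky_num(values):
    # values is the Series of data points for one group
    n = len(values)
    x = n % 10            # last digit of the group size
    y = int(max(values))  # largest value, truncated to an integer
    return x + y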

In [34]:
# Let's see what should be the value if we call the function above
n = len(df[df.sex=='Male'])
n

157

In [35]:
n%10

7

In [36]:
df[df.sex=='Male'].tip.max()

10.0

In [37]:
# Therefore, the aggregate for the Male group is
# 7 + 10, which is 17

In [38]:
# Now, let's call and see the result
df.groupby("sex").tip.agg(my_lucky_num)

sex
Male      17.0
Female    13.0
Name: tip, dtype: float64
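
A custom function can also be mixed with built-in names in the same `agg` call. The sketch below uses a hypothetical helper, `tip_range`, and a made-up `toy` table. Each call to the helper receives the Series for one group.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "tip": [1.0, 2.0, 4.0, 3.5, 3.0],
})

def tip_range(values):
    # values is a Series holding the tips of a single group.
    return values.max() - values.min()

# The custom function sits next to a built-in name in the same list;
# its output column is labelled with the function's name.
result = toy.groupby("sex")["tip"].agg(["mean", tip_range])
print(result)
```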

#### If we need to apply several built-in functions to different columns, we can put them together using a dictionary.

In [38]:
df_dict = df.groupby('sex').agg({
    'tip':'max',
    'tip_pct':'mean',
    'total_bill':'var'
})
df_dict

         tip   tip_pct  total_bill
sex
Male    10.0  0.157651   85.497185
Female   6.5  0.166491   64.147429
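
An alternative to the dictionary form is pandas' named aggregation, which lets us choose the output column names as well. This is a sketch on made-up data; the names `toy`, `max_tip`, and `mean_bill` are illustrative choices, not part of the tips dataset.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "total_bill": [10.0, 20.0, 40.0, 30.0, 30.0],
    "tip": [1.0, 2.0, 4.0, 3.0, 3.0],
})

# Each keyword is an output column name; each value is (input column, function).
summary = toy.groupby("sex").agg(
    max_tip=("tip", "max"),
    mean_bill=("total_bill", "mean"),
)
print(summary)
```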


## 3. Transform
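We may need to transform data from one scale to another. A frequently used method in statistics is to transform from the x scale to the z scale, where z = (x - mean) / std. With groupby, the mean and standard deviation are computed within each group.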

In [39]:
def to_z_scale(x):
    return (x-x.mean()) / x.std()

In [40]:
tip_z = df.groupby("sex").tip_pct.transform(to_z_scale)
tip_z

0     -1.995908
1      0.044630
2      0.137961
3     -0.275868
4     -0.367005
         ...   
239    0.714386
240   -1.732318
241   -1.071789
242   -0.917694
243   -0.125790
Name: tip_pct, Length: 244, dtype: float64
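
Unlike an aggregation, a transform returns one value per original row, aligned with the original index. The sketch below checks this on a small made-up table (`toy` is hypothetical), reusing the same z-scale function.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "tip": [1.0, 2.0, 4.0, 3.0, 3.0],
})

def to_z_scale(x):
    # x is the Series for one group; std() uses the sample formula (ddof=1).
    return (x - x.mean()) / x.std()

# transform keeps the shape of the input: one z-score per row.
z = toy.groupby("sex")["tip"].transform(to_z_scale)

# agg, by contrast, collapses each group to a single value.
means = toy.groupby("sex")["tip"].agg("mean")

print(z)
print(means)
```

Within each group the z-scores are centered on 0, because each value is compared with its own group's mean rather than the overall mean.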

## 4. Filter
Sometimes we only want the groups that meet certain requirements. With **filter**, we can extract just the rows that belong to those groups.

In [45]:
df.shape

(244, 8)

In [41]:
tip_days = df.groupby('day').count()
print(tip_days)

      total_bill  tip  sex  smoker  time  size  tip_pct
day                                                    
Thur          62   62   62      62    62    62       62
Fri           19   19   19      19    19    19       19
Sat           87   87   87      87    87    87       87
Sun           76   76   76      76    76    76       76


In [46]:
df_busy_smokey_day = df.groupby('day').filter(lambda x: x['smoker'].count() >= 30)
df_busy_smokey_day

# Notice that x above is a sub-dataframe: the rows for one day.
# count() counts the non-null values in a column, so any column without
# missing values gives the group's row count. Here we pick x['smoker'].
# Groups with at least 30 rows are kept; smaller groups are dropped.
# Fri has only 19 rows, so it is filtered out.

     total_bill   tip     sex  smoker  day    time  size   tip_pct
0         16.99  1.01  Female      No  Sun  Dinner     2  0.059447
1         10.34  1.66    Male      No  Sun  Dinner     3  0.160542
2         21.01  3.50    Male      No  Sun  Dinner     3  0.166587
3         23.68  3.31    Male      No  Sun  Dinner     2  0.139780
4         24.59  3.61  Female      No  Sun  Dinner     4  0.146808
..          ...   ...     ...     ...  ...     ...   ...       ...
239       29.03  5.92    Male      No  Sat  Dinner     3  0.203927
240       27.18  2.00  Female     Yes  Sat  Dinner     2  0.073584
241       22.67  2.00    Male     Yes  Sat  Dinner     2  0.088222
242       17.82  1.75    Male      No  Sat  Dinner     2  0.098204


In [48]:
df_busy_smokey_day.groupby('day').count()

      total_bill  tip  sex  smoker  time  size  tip_pct
day
Thur          62   62   62      62    62    62       62
Fri            0    0    0       0     0     0        0
Sat           87   87   87      87    87    87       87
Sun           76   76   76      76    76    76       76


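Notice the difference. Friday has only 19 rows, fewer than the threshold of 30, so none of its rows pass the filter. Fri still appears in the result with a count of 0 because `day` is a categorical column, and in the pandas version used here grouping on a categorical column lists every category, including empty ones.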
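
The same behavior can be checked by hand on a tiny example. This is a sketch with made-up data (`toy` is hypothetical) and a threshold of 2 rows instead of 30; the predicate passed to `filter` receives one group's sub-dataframe and returns True to keep all of its rows.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
toy = pd.DataFrame({
    "day": ["Fri", "Sat", "Sat", "Sat", "Sun", "Sun"],
    "tip": [1.0, 2.0, 4.0, 3.0, 3.0, 5.0],
})

# Keep only the rows whose group (day) has at least 2 members.
kept = toy.groupby("day").filter(lambda g: len(g) >= 2)
print(kept)
```

The result keeps the original rows and index labels of the surviving groups; it is not an aggregated table.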

## Another example

In [49]:
df

     total_bill   tip     sex  smoker  day    time  size   tip_pct
0         16.99  1.01  Female      No  Sun  Dinner     2  0.059447
1         10.34  1.66    Male      No  Sun  Dinner     3  0.160542
2         21.01  3.50    Male      No  Sun  Dinner     3  0.166587
3         23.68  3.31    Male      No  Sun  Dinner     2  0.139780
4         24.59  3.61  Female      No  Sun  Dinner     4  0.146808
..          ...   ...     ...     ...  ...     ...   ...       ...
239       29.03  5.92    Male      No  Sat  Dinner     3  0.203927
240       27.18  2.00  Female     Yes  Sat  Dinner     2  0.073584
241       22.67  2.00    Male     Yes  Sat  Dinner     2  0.088222
242       17.82  1.75    Male      No  Sat  Dinner     2  0.098204


In [50]:
df_good_pay_day = df.groupby('day').tip.mean()
print(df_good_pay_day)

day
Thur    2.771452
Fri     2.734737
Sat     2.993103
Sun     3.255132
Name: tip, dtype: float64


In [51]:
df_good_pay_day = df.groupby('day').filter(lambda x: x['tip'].mean() >= 2.8)
df_good_pay_day.shape

(163, 8)

In [52]:
df_good_pay_day.groupby('day').tip.mean()

day
Thur         NaN
Fri          NaN
Sat     2.993103
Sun     3.255132
Name: tip, dtype: float64
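
The NaN rows for Thur and Fri appear because `day` is a categorical column: the filtered dataframe no longer has rows for those days, but the categories themselves remain. The sketch below reproduces this on made-up data (`toy` is hypothetical) and shows two possible ways to hide the empty groups, the `observed=True` option of `groupby` and `cat.remove_unused_categories()`. `observed` is passed explicitly because its default differs between pandas versions.

```python
import pandas as pd

# Hypothetical data with a categorical day column, invented for illustration.
toy = pd.DataFrame({
    "day": pd.Categorical(
        ["Fri", "Sat", "Sat", "Sun", "Sun"],
        categories=["Fri", "Sat", "Sun"],
    ),
    "tip": [1.0, 2.0, 4.0, 3.0, 5.0],
})

# Drop days with fewer than 2 rows; the Fri category survives with no rows.
kept = toy.groupby("day", observed=False).filter(lambda g: len(g) >= 2)

# observed=False lists every category, so the empty Fri group yields NaN.
with_empty = kept.groupby("day", observed=False)["tip"].mean()

# observed=True reports only the categories that actually occur.
only_seen = kept.groupby("day", observed=True)["tip"].mean()

# Alternatively, remove the unused categories from the column itself.
cleaned_days = kept["day"].cat.remove_unused_categories()

print(with_empty)
print(only_seen)
```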