# 7.1 Split, Apply, Combine


![Split_apply_combine](Split_apply_combine.jpg)
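
The three stages in the figure can be spelled out by hand. The following is a minimal sketch on a hypothetical toy DataFrame (the column names `key` and `value` are assumptions for illustration, not part of the tips data); it is meant only to show what a single `groupby` call does conceptually, not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data, not the tips dataset
df = pd.DataFrame({
    "key": ["A", "B", "A", "B", "C"],
    "value": [1, 2, 3, 4, 5],
})

# Split: one sub-DataFrame per distinct key
pieces = {key: df[df["key"] == key] for key in df["key"].unique()}

# Apply: reduce each piece to a single number
sums = {key: piece["value"].sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name="value").sort_index()

# groupby performs the same three steps in one call
grouped = df.groupby("key")["value"].sum()

print(manual)
print(grouped)
```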

# 7.2 GroupBy Operations in Pandas

This section uses the tips dataset bundled with Seaborn to illustrate GroupBy.

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np

In [2]:
df_tips = sns.load_dataset('tips')

In [3]:
df_tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


## 7.2.1 Grouping by a Single Column

To compute statistics separately for each sex, we first create a DataFrameGroupBy object:

In [4]:
temp = df_tips.groupby(by='sex')

In [5]:
temp.size()

sex
Male      157
Female     87
dtype: int64

In [6]:
for group in temp:
    print(group)

('Male',      total_bill   tip   sex smoker  day    time  size
1         10.34  1.66  Male     No  Sun  Dinner     3
2         21.01  3.50  Male     No  Sun  Dinner     3
3         23.68  3.31  Male     No  Sun  Dinner     2
5         25.29  4.71  Male     No  Sun  Dinner     4
6          8.77  2.00  Male     No  Sun  Dinner     2
..          ...   ...   ...    ...  ...     ...   ...
236       12.60  1.00  Male    Yes  Sat  Dinner     2
237       32.83  1.17  Male    Yes  Sat  Dinner     2
239       29.03  5.92  Male     No  Sat  Dinner     3
241       22.67  2.00  Male    Yes  Sat  Dinner     2
242       17.82  1.75  Male     No  Sat  Dinner     2

[157 rows x 7 columns])
('Female',      total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
11        35.26  5.00  Female     No   Sun  Dinner     4
14        14.83  3.02  Female     No   Sun  Dinner     2
16        10.33  1.67 
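
Each item yielded by the loop is a `(group key, sub-DataFrame)` tuple, so it can be unpacked directly. A minimal sketch, assuming a hypothetical three-row DataFrame in place of the tips data and a single string passed as the grouping key:

```python
import pandas as pd

# Hypothetical toy data standing in for df_tips
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "tip": [1.0, 2.0, 3.0],
})

shapes = {}
# Unpack each (key, sub-DataFrame) pair instead of printing the raw tuple
for name, frame in df.groupby("sex"):
    shapes[name] = frame.shape
    print(name, frame.shape)
```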

In [7]:
# Besides iterating over the groups as above, get_group() retrieves a single group
group = temp.get_group('Female').head()

In [8]:
group

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
11,35.26,5.0,Female,No,Sun,Dinner,4
14,14.83,3.02,Female,No,Sun,Dinner,2
16,10.33,1.67,Female,No,Sun,Dinner,3


In [9]:
group = temp.get_group('Male').head()

In [10]:
group

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
5,25.29,4.71,Male,No,Sun,Dinner,4
6,8.77,2.0,Male,No,Sun,Dinner,2


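In fact, the size() call above is one of the aggregation functions that a DataFrameGroupBy object provides. Others include:

* sum(): sum
* mean(): mean
* count(): number of non-null values
* size(): number of rows, null values included
* max(): maximum
* min(): minimum
* std(): standard deviation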
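
The difference between `count()` and `size()` only shows up when a group contains missing values. A minimal sketch, assuming a hypothetical toy DataFrame with one `NaN` tip (the tips dataset itself is not used here):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing tip in the Male group
df = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "tip": [1.0, np.nan, 2.0, 3.0],
})

g = df.groupby("sex")["tip"]
sizes = g.size()    # rows per group, NaN included
counts = g.count()  # non-null values per group

print(sizes)
print(counts)
```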

> Besides calling the aggregation functions directly, we can also use agg() to compute grouped statistics.

In [11]:
# For example, compare the mean, max, and min tip between male and female customers
tips = temp['tip'].agg(['mean','max','min'])

In [12]:
tips = tips.reset_index()
tips

Unnamed: 0,sex,mean,max,min
0,Male,3.089618,10.0,1.0
1,Female,2.833448,6.5,1.0


In [13]:
# The same can be written with (column_name, function) tuples to name the result columns:
tips = temp['tip'].agg([('tips_mean','mean'),('tips_max','max'),('tips_min','min')])

In [14]:
tips = tips.reset_index()
tips

Unnamed: 0,sex,tips_mean,tips_max,tips_min
0,Male,3.089618,10.0,1.0
1,Female,2.833448,6.5,1.0


## 7.2.2 Grouping by Multiple Columns

> The previous section grouped by passing sex as `by`. To group by sex and day at the same time, use the following code:

In [15]:
temp = df_tips.groupby(['sex','day'])

In [16]:
temp.size()

sex     day 
Male    Thur    30
        Fri     10
        Sat     59
        Sun     58
Female  Thur    32
        Fri      9
        Sat     28
        Sun     18
dtype: int64
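
A result indexed by two keys is often easier to read as a table. One possible approach, sketched here on a hypothetical toy DataFrame, is to move the inner index level into columns with `unstack()`; `fill_value=0` is used so that combinations with no rows show 0 rather than NaN.

```python
import pandas as pd

# Hypothetical toy data standing in for df_tips
df = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "day": ["Sat", "Sun", "Sat", "Sat", "Sat"],
})

counts = df.groupby(["sex", "day"]).size()  # Series with a (sex, day) MultiIndex
table = counts.unstack(fill_value=0)        # rows: sex, columns: day

print(table)
```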

In [17]:
temp['total_bill'].agg(['mean','max','min']).reset_index()

Unnamed: 0,sex,day,mean,max,min
0,Male,Thur,18.714667,41.19,7.51
1,Male,Fri,19.857,40.17,8.58
2,Male,Sat,20.802542,50.81,7.74
3,Male,Sun,21.887241,48.17,7.25
4,Female,Thur,16.715312,43.11,8.35
5,Female,Fri,14.145556,22.75,5.75
6,Female,Sat,19.680357,44.3,3.07
7,Female,Sun,19.872222,35.26,9.6


In [18]:
temp['tip'].agg(['mean','max','min']).reset_index()

Unnamed: 0,sex,day,mean,max,min
0,Male,Thur,2.980333,6.7,1.44
1,Male,Fri,2.693,4.73,1.5
2,Male,Sat,3.083898,10.0,1.0
3,Male,Sun,3.220345,6.5,1.32
4,Female,Thur,2.575625,5.17,1.25
5,Female,Fri,2.781111,4.3,1.0
6,Female,Sat,2.801786,6.5,1.0
7,Female,Sun,3.367222,5.2,1.01


In [19]:
temp['size'].agg(['mean','max','min']).reset_index()

Unnamed: 0,sex,day,mean,max,min
0,Male,Thur,2.433333,6,2
1,Male,Fri,2.1,4,1
2,Male,Sat,2.644068,5,2
3,Male,Sun,2.810345,6,2
4,Female,Thur,2.46875,6,1
5,Female,Fri,2.111111,3,2
6,Female,Sat,2.25,4,1
7,Female,Sun,2.944444,5,2


In [20]:
# We can also use agg() to apply different operations to the tip and total_bill columns:
temp = df_tips.groupby(['sex'])

In [21]:
t = temp.agg(
    {'tip':[('avg_tip','mean'),('max_tip','max')],'total_bill':[('avg_total_bill','mean')]}
)

t.head()

,tip,tip,total_bill
sex,avg_tip,max_tip,avg_total_bill
Male,3.089618,10.0,20.744076
Female,2.833448,6.5,18.056897


In [22]:
t.columns

MultiIndex([(       'tip',        'avg_tip'),
            (       'tip',        'max_tip'),
            ('total_bill', 'avg_total_bill')],
           )

In [23]:
t.columns=['avg_tip','max_tip','avg_total_bill']

In [24]:
t.head()

sex,avg_tip,max_tip,avg_total_bill
Male,3.089618,10.0,20.744076
Female,2.833448,6.5,18.056897


In [25]:
t.reset_index()

Unnamed: 0,sex,avg_tip,max_tip,avg_total_bill
0,Male,3.089618,10.0,20.744076
1,Female,2.833448,6.5,18.056897
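
An alternative that avoids the MultiIndex columns and the manual renaming step is pandas named aggregation, where each keyword argument is an output column name mapped to a `(source_column, function)` pair. A minimal sketch on a hypothetical toy DataFrame; the output names mirror the ones used above:

```python
import pandas as pd

# Hypothetical toy data standing in for df_tips
df = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "tip": [1.0, 3.0, 2.0, 4.0],
    "total_bill": [10.0, 30.0, 20.0, 20.0],
})

# Each keyword is the output column; the tuple is (input column, aggregation)
summary = df.groupby("sex").agg(
    avg_tip=("tip", "mean"),
    max_tip=("tip", "max"),
    avg_total_bill=("total_bill", "mean"),
).reset_index()

print(summary)
```

The result already has flat column names, so no assignment to `columns` is needed afterwards.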


## 7.2.3 Grouped Statistics with Custom Functions

> If the built-in Pandas aggregation functions do not meet our needs, we can supply a custom function to perform the aggregation, for example:

In [26]:
data = sns.load_dataset('tips')

In [27]:
data.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [28]:
data.groupby('sex')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000027B60801FD0>

In [29]:
# To see the difference between the largest and smallest bill in each group:
data.groupby('sex').agg(
    {
        'total_bill': lambda bill : bill.max() - bill.min()
    }
)

sex,total_bill
Male,43.56
Female,41.23


> **Besides lambda functions, we can of course define an ordinary named function and pass it in.**
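
A minimal sketch of the named-function variant, assuming a hypothetical helper called `value_range` and a toy DataFrame in place of the tips data:

```python
import pandas as pd


def value_range(series):
    """Return the spread (max minus min) of one group's values."""
    return series.max() - series.min()


# Hypothetical toy data standing in for the tips dataset
df = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "total_bill": [10.0, 30.0, 20.0, 25.0],
})

# Pass the function object itself, exactly where the lambda went
spread = df.groupby("sex").agg({"total_bill": value_range})

print(spread)
```

A named function can carry a docstring and be reused or tested on its own, which is harder with an inline lambda.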

## 7.2.4 Filtering and Transforming Data

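> Sometimes we group data not to compute statistics but to filter or transform it; `filter()` and `transform()` handle these cases. For example, to keep only the bills from days whose average total bill exceeds 20: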

In [30]:
df_tips.groupby('day').filter(lambda bill : bill['total_bill'].mean()>20)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
238,35.83,4.67,Female,No,Sat,Dinner,3
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
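
Note that `filter()` decides per group, not per row: a group is kept or dropped as a whole. A minimal sketch on a hypothetical toy DataFrame, contrasting it with an ordinary row-level boolean mask:

```python
import pandas as pd

# Hypothetical toy data: Sat averages 21.0, Thur averages 17.5
df = pd.DataFrame({
    "day": ["Sat", "Sat", "Thur", "Thur"],
    "total_bill": [30.0, 12.0, 25.0, 10.0],
})

# Group-level test: keeps every Sat row (even the 12.0 bill), drops every Thur row
kept = df.groupby("day").filter(lambda g: g["total_bill"].mean() > 20)

# Row-level test for comparison: keeps only the individual bills above 20
row_level = df[df["total_bill"] > 20]

print(kept)
print(row_level)
```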


In [31]:
# To transform grouped data, use transform(); for example, attach each day's mean total_bill to every row
df_tips['day_avg'] = df_tips.groupby('day')['total_bill'].transform(lambda x : x.mean())

In [32]:
df_tips

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,day_avg
0,16.99,1.01,Female,No,Sun,Dinner,2,21.410000
1,10.34,1.66,Male,No,Sun,Dinner,3,21.410000
2,21.01,3.50,Male,No,Sun,Dinner,3,21.410000
3,23.68,3.31,Male,No,Sun,Dinner,2,21.410000
4,24.59,3.61,Female,No,Sun,Dinner,4,21.410000
...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,20.441379
240,27.18,2.00,Female,Yes,Sat,Dinner,2,20.441379
241,22.67,2.00,Male,Yes,Sat,Dinner,2,20.441379
242,17.82,1.75,Male,No,Sat,Dinner,2,20.441379
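
`transform()` returns a result aligned with the original rows, which makes it convenient for comparing each row against its own group. A minimal sketch on a hypothetical toy DataFrame; it passes the built-in aggregation name `"mean"` instead of a lambda, and the column name `diff_from_day_avg` is an assumption for illustration:

```python
import pandas as pd

# Hypothetical toy data standing in for df_tips
df = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun"],
    "total_bill": [10.0, 30.0, 20.0, 40.0],
})

# One value per original row: the mean of that row's group
day_avg = df.groupby("day")["total_bill"].transform("mean")

# Because the result is row-aligned, it combines directly with the original column
df["diff_from_day_avg"] = df["total_bill"] - day_avg

print(df)
```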


In [33]:
# apply() also works:
# for example, after grouping by sex, compute each tip as a fraction of its total bill:
df_tips.groupby('sex').apply(lambda x : x['tip']/x['total_bill'])

sex        
Male    1      0.160542
        2      0.166587
        3      0.139780
        5      0.186240
        6      0.228050
                 ...   
Female  226    0.198216
        229    0.130199
        238    0.130338
        240    0.073584
        243    0.159744
Length: 244, dtype: float64
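
The row-wise ratio itself does not depend on the group, so one possible follow-up is to store it as a column and then summarize it per group. A minimal sketch on a hypothetical toy DataFrame; the column name `tip_ratio` is an assumption for illustration:

```python
import pandas as pd

# Hypothetical toy data standing in for df_tips
df = pd.DataFrame({
    "sex": ["Male", "Male", "Female"],
    "tip": [1.0, 3.0, 2.0],
    "total_bill": [10.0, 20.0, 10.0],
})

# Vectorized ratio for every row, no grouping required
df["tip_ratio"] = df["tip"] / df["total_bill"]

# Grouping then reduces the ratios to one average per sex
avg_ratio = df.groupby("sex")["tip_ratio"].mean()

print(avg_ratio)
```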