# Grouping

In [1]:
import numpy as np
import pandas as pd

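## Grouping Patterns and GroupBy Objects

### The General Pattern of Grouping

A grouping operation needs three things to be specified: the **grouping key**, the **data source**, and the **operation together with the result it returns**.

Accordingly, the general pattern for grouping code is `df.groupby(grouping_key)[data_source].operation`.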

In [2]:
# Load the student physical fitness test dataset
df = pd.read_csv('data/learn_pandas.csv')
# Preview the first three rows
df.head(3)

| |School|Grade|Name|Gender|Height|Weight|Transfer|Test_Number|Test_Date|Time_Record|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|Shanghai Jiao Tong University|Freshman|Gaopeng Yang|Female|158.9|46.0|N|1|2019/10/5|0:04:34|
|1|Peking University|Freshman|Changqiang You|Male|166.5|70.0|N|1|2019/9/4|0:04:20|
|2|Shanghai Jiao Tong University|Senior|Mei Sun|Male|188.9|89.0|N|2|2019/9/12|0:05:22|


In [3]:
# Median height by gender
df.groupby('Gender')['Height'].median()

Gender
Female    159.6
Male      173.4
Name: Height, dtype: float64

### What a Grouping Key Really Is

Passing a **list** of column names to `groupby` groups the data along several dimensions at once.

In [4]:
# Group by school and gender, then compute the mean height
df.groupby(['School','Gender'])['Height'].mean()

School                         Gender
Fudan University               Female    158.776923
                               Male      174.212500
Peking University              Female    158.666667
                               Male      172.030000
Shanghai Jiao Tong University  Female    159.122500
                               Male      176.760000
Tsinghua University            Female    159.753333
                               Male      171.638889
Name: Height, dtype: float64

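Besides column names, the grouping key can also be a logical expression. In essence, any Series of the same length as the table can serve as a key, and its unique values define the groups.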

In [5]:
# Group by whether weight exceeds the overall mean, then compute the mean height
df.groupby(df.Weight>df.Weight.mean())['Height'].mean()

Weight
False    159.034646
True     172.705357
Name: Height, dtype: float64
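
A minimal sketch on a small made-up table may help here (the names `toy`, `g`, and `h` are illustrative assumptions, not part of the dataset). It suggests that a column name, the column itself, and a boolean Series of the same length are all treated alike: the unique values of the key define the groups.

```python
import pandas as pd

# Hypothetical toy data, not the learn_pandas dataset
toy = pd.DataFrame({'g': ['a', 'b', 'a', 'b'],
                    'h': [1.0, 2.0, 3.0, 6.0]})

# Grouping by a column name and by the column itself give the same result
by_name = toy.groupby('g')['h'].mean()
by_series = toy.groupby(toy['g'])['h'].mean()

# A boolean Series is just another key whose unique values are False and True
by_cond = toy.groupby(toy['h'] > toy['h'].mean())['h'].mean()

print(by_name)
print(by_cond)
```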

#### Practice 1

In [6]:
# Pass two conditions as a two-level key
df.groupby([df.Weight > df.Weight.quantile(0.25),df.Weight > df.Weight.quantile(0.75)])['Height'].mean()

Weight  Weight
False   False     155.891071
True    False     162.255294
        True      174.935714
Name: Height, dtype: float64

`False False` means low (at or below the 25% quantile), `True False` means normal, and `True True` means high (above the 75% quantile).
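
One way to make the three groups self-describing is to build a labeled key first. The sketch below uses `pd.cut` on made-up data (the table `toy` and the name `level` are assumptions for illustration); with the default right-closed bins, the boundaries match the `>` comparisons used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Weight': [40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 75.0],
                    'Height': [150.0, 152.0, 158.0, 160.0, 165.0, 168.0, 175.0, 180.0]})

q25, q75 = toy['Weight'].quantile([0.25, 0.75])

# Right-closed bins: (-inf, q25], (q25, q75], (q75, inf)
level = pd.cut(toy['Weight'],
               bins=[-np.inf, q25, q75, np.inf],
               labels=['low', 'normal', 'high'])

# observed=True keeps only the categories that actually occur
result = toy.groupby(level, observed=True)['Height'].mean()
print(result)
```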

### GroupBy Objects

Calling `df.groupby(condition)` produces a **groupby object**.

In [7]:
gb = df.groupby(['School','Grade'])
gb

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000019D2262C288>

Two attributes of the `groupby` object are shown next: `ngroups` and `groups`.

In [8]:
# The ngroups attribute returns the number of groups
gb.ngroups

16

In [9]:
# The groups attribute returns a dict whose keys are group names and whose values are the index labels within each group
gb.groups.keys()

dict_keys([('Fudan University', 'Freshman'), ('Fudan University', 'Junior'), ('Fudan University', 'Senior'), ('Fudan University', 'Sophomore'), ('Peking University', 'Freshman'), ('Peking University', 'Junior'), ('Peking University', 'Senior'), ('Peking University', 'Sophomore'), ('Shanghai Jiao Tong University', 'Freshman'), ('Shanghai Jiao Tong University', 'Junior'), ('Shanghai Jiao Tong University', 'Senior'), ('Shanghai Jiao Tong University', 'Sophomore'), ('Tsinghua University', 'Freshman'), ('Tsinghua University', 'Junior'), ('Tsinghua University', 'Senior'), ('Tsinghua University', 'Sophomore')])

Two methods of the `groupby` object follow: `size` and `get_group`.

In [10]:
# size returns the number of rows in each group
gb.size()

School                         Grade    
Fudan University               Freshman      9
                               Junior       12
                               Senior       11
                               Sophomore     8
Peking University              Freshman     13
                               Junior        8
                               Senior        8
                               Sophomore     5
Shanghai Jiao Tong University  Freshman     13
                               Junior       17
                               Senior       22
                               Sophomore     5
Tsinghua University            Freshman     17
                               Junior       22
                               Senior       14
                               Sophomore    16
dtype: int64

In [11]:
# get_group directly retrieves the rows of the specified group
gb.get_group(('Peking University','Senior')).iloc[:2,:3]

| |School|Grade|Name|
|:---|:---|:---|:---|
|30|Peking University|Senior|Changli Lv|
|86|Peking University|Senior|Feng Zheng|
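
The two attributes and two methods above can be tried together on a tiny table where the answers are easy to check by eye. This is only a sketch; `toy` and `gb_toy` are made-up names and the data is invented.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'School': ['A', 'A', 'B', 'B', 'B'],
                    'Grade': ['x', 'y', 'x', 'x', 'y'],
                    'Height': [160.0, 170.0, 165.0, 175.0, 180.0]})
gb_toy = toy.groupby(['School', 'Grade'])

n_groups = gb_toy.ngroups                 # number of distinct (School, Grade) pairs
group_names = list(gb_toy.groups.keys())  # one tuple per group
sizes = gb_toy.size()                     # number of rows in each group
bx = gb_toy.get_group(('B', 'x'))         # the rows that belong to one group

print(n_groups, group_names)
print(bx)
```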


#### Practice 2

In [12]:
# List the group categories
list(df.groupby(['School','Grade']).groups.keys())

[('Fudan University', 'Freshman'),
 ('Fudan University', 'Junior'),
 ('Fudan University', 'Senior'),
 ('Fudan University', 'Sophomore'),
 ('Peking University', 'Freshman'),
 ('Peking University', 'Junior'),
 ('Peking University', 'Senior'),
 ('Peking University', 'Sophomore'),
 ('Shanghai Jiao Tong University', 'Freshman'),
 ('Shanghai Jiao Tong University', 'Junior'),
 ('Shanghai Jiao Tong University', 'Senior'),
 ('Shanghai Jiao Tong University', 'Sophomore'),
 ('Tsinghua University', 'Freshman'),
 ('Tsinghua University', 'Junior'),
 ('Tsinghua University', 'Senior'),
 ('Tsinghua University', 'Sophomore')]

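### The Three Kinds of Group Operations

Once the data is grouped, three kinds of operations are available; each is covered in detail below:

1. **Aggregation**: compute a descriptive statistic for each group; each group returns a scalar.
2. **Transformation**: apply an operation to the data within each group; each group returns a Series of the same length as the group.
3. **Filtration**: drop some groups according to a boolean computed on each group, and return the rows of the groups that satisfy the condition as a DataFrame.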
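
The three kinds of operations can be contrasted side by side on invented data. In this sketch, `toy`, `g`, and `v` are assumed names chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'g': ['a', 'a', 'b', 'b', 'b'],
                    'v': [1.0, 3.0, 2.0, 4.0, 6.0]})
gb_toy = toy.groupby('g')['v']

agg_res = gb_toy.mean()                                   # aggregation: one scalar per group
trans_res = gb_toy.transform(lambda s: s - s.mean())      # transformation: same length as the input
filt_res = toy.groupby('g').filter(lambda d: len(d) > 2)  # filtration: rows of the groups that are kept

print(agg_res)
print(trans_res)
print(filt_res)
```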

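## Aggregation Functions

### Built-in Aggregation Functions

The `groupby` object defines a number of aggregation functions (returning scalars) directly; prefer them whenever one of them provides the functionality you need.

When the data source contains several columns, these functions are computed column by column.

#### Practice 3

The table below lists what some of these aggregation functions do:

|Function|Description|
|:---|:---|
|all|Whether all elements are `True`|
|any|Whether any element is `True`|
|mad|Mean absolute deviation of the given row/column (deprecated in pandas 1.5 and removed in 2.0)|
|skew|Skewness of the given row/column|
|sem|Standard error of the mean of the given row/column|
|prod|Product of the elements of the given row/column|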
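### The agg Method

The `agg` method handles cases that the aggregation functions defined directly on the `groupby` object cannot.

In [13]:
# Define the example object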
gb = df.groupby('Gender')[['Height','Weight']]

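**1** Calling several functions with `agg`

Pass the names of the built-in aggregation functions as **strings** in a **list**.

In [14]:
# Column sum, row label of the column maximum, and column skewness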
gb.agg(['sum','idxmax','skew'])

| |Height|Height|Height|Weight|Weight|Weight|
|:---|:---|:---|:---|:---|:---|:---|
|Gender|sum|idxmax|skew|sum|idxmax|skew|
|Female|21014.0|28|-0.219253|6469.0|28|-0.268482|
|Male|8854.9|193|0.437535|3929.0|2|-0.332393|


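As the result shows, the columns now form a **MultiIndex**: the first level is the data source and the second level is the aggregation used, with each aggregation applied to each column in turn.

**2** Applying specific aggregations to specific columns with `agg`

Pass a **dict** whose keys are column names and whose values are a function-name string or a list of such strings.

In [15]:
# Mean and maximum of height, median of weight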
gb.agg({'Height':['mean','max'],'Weight':'median'})

| |Height|Height|Weight|
|:---|:---|:---|:---|
|Gender|mean|max|median|
|Female|159.19697|170.2|48.0|
|Male|173.62549|193.9|73.0|


**3** Calling a custom function with `agg`

<font color=red>Note that the argument passed to the function is a column from the data source, and the computation proceeds column by column.</font>

In [16]:
# Range (maximum minus minimum) of height and weight within each group
gb.agg(lambda x: x.max()-x.min())

|Gender|Height|Weight|
|:---|:---|:---|
|Female|24.8|29.0|
|Male|38.2|38.0|


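**4** Renaming aggregation results with `agg`

Replace each function (or function string) with a **tuple** whose first element is the new name and whose second element is the function or function string.

In [17]:
# Rename the result columns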
gb.agg([('range',lambda x:x.max()-x.min()),('my_sum','sum')])

| |Height|Height|Weight|Weight|
|:---|:---|:---|:---|:---|
|Gender|range|my_sum|range|my_sum|
|Female|24.8|21014.0|29.0|6469.0|
|Male|38.2|8854.9|38.0|3929.0|
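
The list, dict, and tuple forms of `agg` can be compared on a toy table small enough to verify by hand. This is a sketch with invented data; the names `toy`, `h`, and `w` are assumptions.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                    'h': [1.0, 3.0, 2.0, 6.0],
                    'w': [10.0, 20.0, 30.0, 50.0]})
gb_toy = toy.groupby('g')[['h', 'w']]

# List: every function is applied to every column
multi = gb_toy.agg(['min', 'max'])
# Dict: functions are chosen per column
per_col = gb_toy.agg({'h': ['mean'], 'w': 'sum'})
# Tuple: (new name, function)
renamed = gb_toy.agg([('range', lambda x: x.max() - x.min())])

print(multi)
print(per_col)
print(renamed)
```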


#### Practice 4

In [18]:
# Dict form
gb.agg({'Height':['sum', 'idxmax', 'skew'],'Weight':['sum', 'idxmax', 'skew']})

| |Height|Height|Height|Weight|Weight|Weight|
|:---|:---|:---|:---|:---|:---|:---|
|Gender|sum|idxmax|skew|sum|idxmax|skew|
|Female|21014.0|28|-0.219253|6469.0|28|-0.268482|
|Male|8854.9|193|0.437535|3929.0|2|-0.332393|


#### Practice 5

In [19]:
# The original method
gb.describe()

| |Height|Height|Height|Height|Height|Height|Height|Height|Weight|Weight|Weight|Weight|Weight|Weight|Weight|Weight|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|Gender|count|mean|std|min|25%|50%|75%|max|count|mean|std|min|25%|50%|75%|max|
|Female|132.0|159.19697|5.053982|145.4|155.675|159.6|162.825|170.2|135.0|47.918519|5.405983|34.0|44.0|48.0|52.0|63.0|
|Male|51.0|173.62549|7.048485|155.7|168.9|173.4|177.15|193.9|54.0|72.759259|7.772557|51.0|69.0|73.0|78.75|89.0|


In [20]:
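# Near-equivalent: the string 'quantile' only gives the default 0.5 quantile, so the 25% and 75% columns are missing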
gb.agg(['count','mean','std','min','quantile','max'])

| |Height|Height|Height|Height|Height|Height|Weight|Weight|Weight|Weight|Weight|Weight|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|Gender|count|mean|std|min|quantile|max|count|mean|std|min|quantile|max|
|Female|132|159.19697|5.053982|145.4|159.6|170.2|135|47.918519|5.405983|34.0|48.0|63.0|
|Male|51|173.62549|7.048485|155.7|173.4|193.9|54|72.759259|7.772557|51.0|73.0|89.0|


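I could not work out how to pass `quantile(0.25)` to `agg` as a string. A string in the list only names the function and leaves no place for the argument `0.25`, so the other quantiles need a callable instead.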
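
One possible way around this, sketched on invented data, is to pass `(name, function)` tuples so that the 25% and 75% quantiles come from small lambdas. The names `toy`, `funcs`, and `like_describe` are assumptions for illustration, and this is not necessarily the reference answer.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'g': ['a'] * 5 + ['b'] * 5,
                    'v': [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0]})
gb_toy = toy.groupby('g')[['v']]

# (column label, function) pairs in the same order as describe()
funcs = [(name, name) for name in ['count', 'mean', 'std', 'min']]
funcs += [('25%', lambda x: x.quantile(0.25)),
          ('50%', 'median'),
          ('75%', lambda x: x.quantile(0.75)),
          ('max', 'max')]

like_describe = gb_toy.agg(funcs)
print(like_describe)
```

Unlike `describe`, the `count` column here keeps an integer dtype, which matches the difference visible in the two outputs above.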

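## Transformation and Filtration

### Transformation Functions and the transform Method

In [21]:
# The most common built-in transformations are the cumulative functions, which return a result of the same length as the input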
gb.cummax().head()

| |Height|Weight|
|:---|:---|:---|
|0|158.9|46.0|
|1|166.5|70.0|
|2|188.9|89.0|
|3|NaN|46.0|
|4|188.9|89.0|


For a **custom** within-group transformation, use the `transform` method.

In [22]:
# Standardize within each group
gb.transform(lambda x: (x-x.mean())/x.std()).head()

| |Height|Weight|
|:---|:---|:---|
|0|-0.05876|-0.354888|
|1|-1.010925|-0.355|
|2|2.167063|2.089498|
|3|NaN|-1.279789|
|4|0.053133|0.159631|


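In fact, `transform` also accepts a function that returns a scalar; the scalar is then **broadcast** across the whole group, yielding a Series/DataFrame. This technique is called <font color=red>scalar broadcasting</font> and is very common in feature engineering.

In [23]:
# Compute the group mean and broadcast it to every element of the group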
gb.transform('mean').head()

| |Height|Weight|
|:---|:---|:---|
|0|159.19697|47.918519|
|1|173.62549|72.759259|
|2|173.62549|72.759259|
|3|159.19697|47.918519|
|4|173.62549|72.759259|
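
Both behaviours of `transform` can be checked on a toy column: a function returning a Series keeps its values, while a scalar is broadcast to every row of the group. The sketch below uses invented data, and the `share` feature is only an assumed example of how broadcasting is used in feature engineering.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                    'v': [1.0, 3.0, 10.0, 30.0]})
gb_toy = toy.groupby('g')['v']

centered = gb_toy.transform(lambda s: s - s.mean())  # Series in, same-length Series out
group_mean = gb_toy.transform('mean')                # one scalar per group, broadcast to its rows
share = toy['v'] / gb_toy.transform('sum')           # each row as a share of its group total

print(pd.DataFrame({'centered': centered, 'group_mean': group_mean, 'share': share}))
```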


#### Practice 6

The following quotes the pandas documentation's definition of `rank` directly (I did not fully understand it):

> Compute numerical data ranks (1 through n) along axis.

In [24]:
gb.rank().head()

| |Height|Weight|
|:---|:---|:---|
|0|58.0|47.5|
|1|5.0|19.0|
|2|50.0|54.0|
|3|NaN|14.5|
|4|27.0|31.5|
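
A toy example may make the quoted definition more concrete. On a groupby object, `rank` numbers the values within each group from smallest (rank 1) to largest, and with the default settings tied values share the average of their ranks, which is how fractional values such as 47.5 arise in the output above. The data and the names `toy` and `ranks` below are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: group 'a' contains a tie at 30.0
toy = pd.DataFrame({'g': ['a', 'a', 'a', 'b', 'b'],
                    'v': [30.0, 10.0, 30.0, 5.0, 1.0]})

# Default settings: ascending order, ties receive the average of their ranks
ranks = toy.groupby('g')['v'].rank()

print(ranks)
```

In group `a`, the value 10.0 gets rank 1 and the two 30.0 entries would occupy ranks 2 and 3, so each receives 2.5.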


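#### Practice 7

I have no clear approach for this one. One possibility is to define an external function that processes each column separately and returns the updated DataFrame.

### Group Indexing and Filtration

Group filtration generalizes row filtration: a statistic is computed over **all** rows of a group; if it returns `True` the group is kept, and if it returns `False` the group is dropped. The rows of all retained groups are then concatenated and returned as a DataFrame.

The groupby object defines a `filter` method for selecting groups. The custom function receives the **DataFrame itself** built from the data source, so every DataFrame method and attribute can be used inside it.

In [25]:
# Keep every group with more than 100 rows
gb.filter(lambda x: x.shape[0]>100).head()  # uses the DataFrame's shape attribute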

| |Height|Weight|
|:---|:---|:---|
|0|158.9|46.0|
|3|NaN|41.0|
|5|158.0|51.0|
|6|162.5|52.0|
|7|161.9|50.0|
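
Because the function receives a whole DataFrame per group, several conditions can be combined in one predicate. The following sketch on invented data (the names `toy` and `kept` are assumptions) keeps only groups that have at least two rows and a mean of `v` below 4.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'g': ['a', 'a', 'a', 'b', 'c', 'c'],
                    'v': [1, 2, 3, 100, 4, 5]})

# The predicate must return a single boolean for each group
kept = toy.groupby('g').filter(lambda d: d.shape[0] >= 2 and d['v'].mean() < 4)

print(kept)
```

Group `b` fails the size condition and group `c` fails the mean condition, so only the rows of group `a` remain.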


#### Practice 8

My rough idea is to treat each row as its own "group"; after grouping that way, `filter` can be called.
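
One reading of that idea, sketched on invented data, is to group by the index itself: when the index labels are unique, every row becomes a one-row group, and `filter` then behaves like ordinary row selection. The table `toy` and the thresholds are assumptions, and this may differ from the reference answer.

```python
import pandas as pd

# Hypothetical toy data with a unique index
toy = pd.DataFrame({'Height': [150.0, 170.0, 180.0],
                    'Weight': [45.0, 60.0, 80.0]})

# Each index label is unique, so every row forms its own one-row group
rows = toy.groupby(toy.index).filter(
    lambda d: (d['Height'] > 160).all() and (d['Weight'] < 70).all())

# The same selection written as ordinary boolean indexing
direct = toy[(toy['Height'] > 160) & (toy['Weight'] < 70)]

print(rows)
```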

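## Cross-Column Grouping

### Introducing apply

None of the `agg`, `transform`, and `filter` methods covered so far can carry out a joint summary over **several columns**, for example computing BMI from `Height` and `Weight`.

Processing several columns together requires the `apply` method.

### Using apply

The example below computes BMI to show the basic usage of `apply`.

In [26]:
# Define the BMI function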
def BMI(x):
    Height = x['Height']/100
    Weight = x['Weight']
    BMI_value = Weight/Height**2
    return BMI_value.mean()

# Apply it to each group
gb.apply(BMI)

Gender
Female    18.860930
Male      24.318654
dtype: float64

The next examples show `apply` returning a Series, which happens when the function returns a scalar for each group.

In [27]:
gb = df.groupby(['Gender','Test_Number'])[['Height','Weight']]

gb.apply(lambda x: 0)

Gender  Test_Number
Female  1              0
        2              0
        3              0
Male    1              0
        2              0
        3              0
dtype: int64

In [28]:
# Study this example carefully
gb.apply(lambda x: [0,0])
# The returned list is still treated as a scalar

Gender  Test_Number
Female  1              [0, 0]
        2              [0, 0]
        3              [0, 0]
Male    1              [0, 0]
        2              [0, 0]
        3              [0, 0]
dtype: object

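The next example shows `apply` returning a DataFrame, which happens when the function returns a Series for each group.

In [29]:
# Compare with the previous example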
gb.apply(lambda x: pd.Series([0,0],index=['a','b']))

|Gender|Test_Number|a|b|
|:---|:---|:---|:---|
|Female|1|0|0|
|Female|2|0|0|
|Female|3|0|0|
|Male|1|0|0|
|Male|2|0|0|
|Male|3|0|0|
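
The effect of the return type can be seen with a slightly less artificial function than a constant. In this sketch on invented data (the names `toy`, `as_scalar`, and `as_series` are assumptions), a scalar per group produces a Series, and a Series per group produces a DataFrame whose columns come from the index of the returned Series.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'g': ['a', 'a', 'b'],
                    'h': [1.0, 3.0, 5.0],
                    'w': [2.0, 4.0, 6.0]})
gb_toy = toy.groupby('g')[['h', 'w']]

# Scalar per group -> Series indexed by group
as_scalar = gb_toy.apply(lambda d: (d['w'] / d['h']).mean())

# Series per group -> DataFrame, one column per label of the returned Series
as_series = gb_toy.apply(lambda d: pd.Series({'h_max': d['h'].max(),
                                              'w_min': d['w'].min()}))

print(as_scalar)
print(as_series)
```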


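Although `apply` is more flexible than the three basic group operations, its performance is noticeably worse than theirs.

#### Practice 11

I did not understand what this exercise asks for.

#### Practice 11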

In [30]:
gb = df.groupby('Gender')[['Height','Weight']]

In [31]:
gb.cov()

|Gender| |Height|Weight|
|:---|:---|:---|:---|
|Female|Height|25.542739|24.838146|
|Female|Weight|24.838146|29.224655|
|Male|Height|49.681137|47.803901|
|Male|Weight|47.803901|60.412648|


In [32]:
%timeit gb.cov()

2.07 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [33]:
def myf(x):
    s = [i.cov(j) for i in [x.Height,x.Weight] for j in [x.Height,x.Weight]]
    ss = pd.Series(s,index=['Height&Height','Height&Weight','Weight&Height','Weight&Weight'])
    return ss

gb.apply(myf)

|Gender|Height&Height|Height&Weight|Weight&Height|Weight&Weight|
|:---|:---|:---|:---|:---|
|Female|25.542739|24.838146|24.838146|29.224655|
|Male|49.681137|47.803901|47.803901|60.412648|


In [34]:
%timeit gb.apply(myf)

3.6 ms ± 45.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Exercises

### Ex1: Car Dataset

In [35]:
df = pd.read_csv('data/car.csv')
df.head(3)

| |Brand|Price|Country|Reliability|Mileage|Type|Weight|Disp.|HP|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|Eagle Summit 4|8895|USA|4.0|33|Small|2560|97|113|
|1|Ford Escort 4|7402|USA|2.0|33|Small|2345|114|90|
|2|Ford Festiva 4|6319|Korea|4.0|37|Small|1845|81|63|


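Question **1**

I find the wording of this question somewhat unclear, so I answered it according to my own understanding.

In [36]:
gbc = df.groupby('Country')

In [37]:
# Filter
gbcf = gbc.filter(lambda x:x.shape[0]>2)

Note that filtering returns a <font color=red>DataFrame</font>, not a groupby object, so the data must be grouped again before any further operations.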

In [38]:
gbcf.groupby('Country').agg({'Price':['mean',('CoV',lambda x: x.std()/x.mean())],'Brand':'count'})

| |Price|Price|Brand|
|:---|:---|:---|:---|
|Country|mean|CoV|count|
|Japan|13938.052632|0.387429|19|
|Japan/USA|10067.571429|0.24004|7|
|Korea|7857.333333|0.243435|3|
|USA|12543.269231|0.203344|26|


Question **2**

In [39]:
# Assign through .loc: chained indexing such as df['ID'][i] = ... triggers SettingWithCopyWarning
df['ID'] = ''
for i in range(len(df.index)):
    if df.index[i] < len(df.index)/3:
        df.loc[df.index[i], 'ID'] = 'one'
    elif df.index[i] < len(df.index)/3*2:
        df.loc[df.index[i], 'ID'] = 'two'
    else:
        df.loc[df.index[i], 'ID'] = 'three'


In [40]:
df.groupby('ID')['Price'].mean()

ID
one       9069.95
three    15420.65
two      13356.40
Name: Price, dtype: float64

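Question **2**: **reference answer**

The reference answer's approach to question 2 is neat; by comparison my solution looks rather clumsy...

In [41]:
# No loop needed: the key can be built directly by concatenating lists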
condition = ['Head']*20+['Mid']*20+['Tail']*20

In [42]:
df.groupby(condition)['Price'].mean()

Head     9069.95
Mid     13356.40
Tail    15420.65
Name: Price, dtype: float64
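
The list-based key above relies on knowing that the table has exactly 60 rows. A sketch that derives the blocks from the row positions instead is shown below on invented data; the block size of 2 and the names `toy`, `block`, and `labels` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with six rows, split into three blocks of two
toy = pd.DataFrame({'Price': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Integer division of the row position by the block size gives 0, 0, 1, 1, 2, 2
block = np.arange(len(toy)) // 2
labels = np.array(['Head', 'Mid', 'Tail'])[block]

# Any array of the same length as the table can serve as the grouping key
means = toy.groupby(labels)['Price'].mean()
print(means)
```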

Question **3**

In [43]:
gb = df.groupby('Type')[['Price','HP']]
new = gb.agg(['max','min'])
new.columns = new.columns.map(lambda x:x[0]+'_'+x[1])
new

|Type|Price_max|Price_min|HP_max|HP_min|
|:---|:---|:---|:---|:---|
|Compact|18900|9483|142|95|
|Large|17257|14525|170|150|
|Medium|24760|9999|190|110|
|Small|9995|5866|113|63|
|Sporty|13945|9410|225|92|
|Van|15395|12267|150|106|
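
The renaming step works because each label of a two-level column index is a tuple such as `('Price', 'max')`. An equivalent way to flatten it, sketched on invented data (the table `toy` is an assumption), joins the parts of each tuple with `'_'`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Type': ['s', 's', 'v'],
                    'Price': [10, 30, 20],
                    'HP': [60, 90, 150]})
wide = toy.groupby('Type')[['Price', 'HP']].agg(['max', 'min'])

# Each column label is a tuple such as ('Price', 'max'); join its parts with '_'
wide.columns = ['_'.join(col) for col in wide.columns]

print(wide)
```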


Question **4**

In [44]:
gbt = df.groupby('Type')
gbt['HP'].transform(lambda x: (x-x.min())/(x.max()-x.min()))

0     1.000000
1     0.540000
2     0.000000
3     0.580000
4     0.800000
5     0.380000
6     0.540000
7     0.220000
8     0.540000
9     0.200000
10    0.780000
11    0.300000
12    0.740000
13    0.586466
14    0.060150
15    1.000000
16    0.135338
17    0.120301
18    0.360902
19    0.360902
20    0.000000
21    0.037594
22    0.276596
23    0.319149
24    0.000000
25    0.978723
26    0.063830
27    0.638298
28    0.319149
29    0.148936
30    1.000000
31    0.914894
32    0.319149
33    0.531915
34    0.744681
35    0.425532
36    0.404255
37    0.625000
38    0.000000
39    0.500000
40    0.462500
41    0.500000
42    0.375000
43    0.375000
44    0.000000
45    0.600000
46    0.625000
47    0.000000
48    0.312500
49    1.000000
50    0.750000
51    1.000000
52    0.000000
53    0.090909
54    1.000000
55    0.886364
56    1.000000
57    0.022727
58    0.727273
59    0.000000
Name: HP, dtype: float64

Question **5**

In [45]:
gbt[['Disp.','HP']].corr()

|Type| |Disp.|HP|
|:---|:---|:---|:---|
|Compact|Disp.|1.0|0.586087|
|Compact|HP|0.586087|1.0|
|Large|Disp.|1.0|-0.242765|
|Large|HP|-0.242765|1.0|
|Medium|Disp.|1.0|0.370491|
|Medium|HP|0.370491|1.0|
|Small|Disp.|1.0|0.603916|
|Small|HP|0.603916|1.0|
|Sporty|Disp.|1.0|0.871426|
|Sporty|HP|0.871426|1.0|
|Van|Disp.|1.0|0.819881|
|Van|HP|0.819881|1.0|


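Question **5**: **reference answer**

In [46]:
# Note: this returns a 2-D array! The correlation coefficient is the off-diagonal element (upper right or lower left).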
np.corrcoef(df['HP'].values,df['Disp.'].values)

array([[1.       , 0.8181881],
       [0.8181881, 1.       ]])

In [47]:
df.groupby('Type')[['HP','Disp.']].apply(lambda x: np.corrcoef(x['HP'].values,x['Disp.'].values)[0,1])

Type
Compact    0.586087
Large     -0.242765
Medium     0.370491
Small      0.603916
Sporty     0.871426
Van        0.819881
dtype: float64
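
An alternative that stays within pandas is `Series.corr`, which returns the coefficient directly instead of a 2-D array. The sketch below compares both approaches on invented data where the answer is known in advance (one group perfectly increasing, the other perfectly decreasing); the table `toy` is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group 's' is perfectly correlated, group 'v' perfectly anti-correlated
toy = pd.DataFrame({'Type': ['s', 's', 's', 'v', 'v', 'v'],
                    'HP': [60.0, 80.0, 100.0, 120.0, 150.0, 180.0],
                    'Disp.': [90.0, 110.0, 130.0, 200.0, 180.0, 160.0]})
gb_toy = toy.groupby('Type')[['HP', 'Disp.']]

# Series.corr returns the coefficient itself
by_series = gb_toy.apply(lambda d: d['HP'].corr(d['Disp.']))

# np.corrcoef returns a 2x2 matrix; the coefficient is the off-diagonal entry
by_numpy = gb_toy.apply(lambda d: np.corrcoef(d['HP'].values, d['Disp.'].values)[0, 1])

print(by_series)
```

One difference worth keeping in mind is that `Series.corr` excludes missing values, whereas `np.corrcoef` propagates `NaN`.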

### Ex2: Implementing the transform Function

This one is too complex; I need to spend some time working through the reference answer...
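
As a starting point, here is a deliberately simplified sketch of the idea, not the reference answer. The helper `my_transform` and its parameters are hypothetical, and it assumes a single key column and numeric data columns. For each group it applies the function to one column at a time and writes the result back to the group's rows, so a Series result is aligned by index and a scalar result is broadcast.

```python
import pandas as pd

def my_transform(df, by, cols, func):
    """Hypothetical, simplified stand-in for df.groupby(by)[cols].transform(func)."""
    # Result frame with the original index; float dtype is an assumption of this sketch
    out = pd.DataFrame(index=df.index, columns=cols, dtype='float64')
    for key in df[by].unique():
        mask = df[by] == key  # rows that belong to the current group
        for col in cols:
            # A Series result is aligned on the index; a scalar is broadcast to the group
            out.loc[mask, col] = func(df.loc[mask, col])
    return out

# Hypothetical toy data
toy = pd.DataFrame({'g': ['x', 'y', 'x', 'y'],
                    'a': [1.0, 2.0, 3.0, 6.0],
                    'b': [10.0, 20.0, 30.0, 40.0]})

centered = my_transform(toy, 'g', ['a', 'b'], lambda s: s - s.mean())
maxes = my_transform(toy, 'g', ['a', 'b'], lambda s: s.max())

print(centered)
print(maxes)
```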