In [1]:
import numpy as np
import pandas as pd

****** 

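## Table of Contents

**1. Grouping patterns and the groupby object**

* 1.1. The general grouping pattern
* 1.2. The nature of the grouping key
* 1.3. Basic operations on a GroupBy object
* 1.4. The three main group operations


**2. Aggregation with agg**

* 2.1. Built-in functions
* 2.2. The agg method


**3. Transformation and filtering**

* 3.1. Transformation functions and the transform method
* 3.2. Group indexing and filtering


**4. Cross-column grouping**


* 4.1. Introducing apply
* 4.2. Using apply

**5. Exercises**

* Ex1: Car dataset
* Ex2: Implementing the transform function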


## Main Content

### 1. Grouping patterns and the groupby object

In [2]:
# Load the student fitness-test dataset
df = pd.read_csv('../data/learn_pandas.csv')
df.head()

Unnamed: 0,School,Grade,Name,Gender,Height,Weight,Transfer,Test_Number,Test_Date,Time_Record
0,Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,158.9,46.0,N,1,2019/10/5,0:04:34
1,Peking University,Freshman,Changqiang You,Male,166.5,70.0,N,1,2019/9/4,0:04:20
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,188.9,89.0,N,2,2019/9/12,0:05:22
3,Fudan University,Sophomore,Xiaojuan Sun,Female,,41.0,N,2,2020/1/3,0:04:08
4,Fudan University,Sophomore,Gaojuan You,Male,174.0,74.0,N,2,2019/11/6,0:05:22




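#### 1.1. The general grouping pattern

`groupby` in pandas performs grouped operations on data. A grouping is fully specified by three elements:
* the grouping key
* the data source
* the operation and the result it returns

**The usage pattern of groupby** follows directly from these:

> **df.groupby**(<font color = red>grouping key</font>)[<font color = blue>data source</font>].<font color = green>operation</font>

[Example] On the student fitness-test dataset, compute the median height by gender.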

In [6]:
# Median height by gender

df.groupby('Gender')['Height'].median()

Gender
Female    159.6
Male      173.4
Name: Height, dtype: float64
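
To make the three elements concrete, here is a minimal sketch that computes the same kind of statistic twice: once with the one-line groupby pattern and once by hand. The `toy` DataFrame and its values are hypothetical stand-ins for the student dataset, not data from learn_pandas.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for the student dataset
toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female'],
    'Height': [158.0, 172.0, 160.0, 180.0, 162.0],
})

# One-line groupby form: grouping key -> data source -> operation
by_groupby = toy.groupby('Gender')['Height'].median()

# The same three steps written out by hand
by_hand = {}
for key in toy['Gender'].unique():                      # 1. split by the grouping key
    heights = toy.loc[toy['Gender'] == key, 'Height']   # 2. pick the data source
    by_hand[key] = heights.median()                     # 3. apply the operation

print(by_groupby.to_dict())
print(by_hand)
```

Both forms should give one median per gender; groupby simply performs the split, select and compute steps in a single expression.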

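#### 1.2. The groupby grouping key

The key passed to groupby(...) can be *a single column of the data, a list of several columns, or a logical condition*. All examples use the data in "../data/learn_pandas.csv".



* **Single column**: group by gender with df.groupby('Gender')
* **List of columns**: group by gender and grade with df.groupby(['Gender','Grade'])
* **Logical condition**: group by whether weight exceeds the mean with df.groupby(df.Weight > df.Weight.mean())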

In [8]:
# Single column
df.groupby('Gender')['Height'].mean()

Gender
Female    159.19697
Male      173.62549
Name: Height, dtype: float64

In [9]:
# List of columns

df.groupby(['Gender','Grade'])['Height'].mean()

Gender  Grade    
Female  Freshman     159.689189
        Junior       159.782500
        Senior       158.480556
        Sophomore    158.363158
Male    Freshman     175.260000
        Junior       171.207143
        Senior       175.594118
        Sophomore    172.030000
Name: Height, dtype: float64

In [10]:
# Logical condition

condition = df.Weight > df.Weight.mean()
df.groupby(condition)['Height'].mean()

Weight
False    159.034646
True     172.705357
Name: Height, dtype: float64

In [40]:
# List form: a logical condition combined with an array of random labels
item = np.random.choice(list('abc'),df.shape[0])

df.groupby([condition,item])['Height'].mean()

Weight   
False   a    159.044898
        b    159.313158
        c    158.757500
True    a    172.526316
        b    173.282353
        c    172.385000
Name: Height, dtype: float64

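[Exercise] Using the lower and upper quartiles as cut points, split weight into three groups (high, normal, low) and compute the mean height of each group.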

In [36]:
# Compute the quartile thresholds
df_test = df.copy()

c1 = df.Weight.quantile(0.25)
c2 = df.Weight.quantile(0.75)


In [38]:
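def tran_Weight(data):
    # Label one weight value by its position relative to the quartiles.
    # A missing weight (NaN) fails both comparisons and falls through to 'normal'.
    if data > c2:
        return 'high'
    elif data < c1:
        return 'low'
    else:
        return 'normal'

df_test['Weight_new'] = df_test['Weight'].apply(tran_Weight)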
df_test.head()

Unnamed: 0,School,Grade,Name,Gender,Height,Weight,Transfer,Test_Number,Test_Date,Time_Record,Weight_new
0,Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,158.9,46.0,N,1,2019/10/5,0:04:34,normal
1,Peking University,Freshman,Changqiang You,Male,166.5,70.0,N,1,2019/9/4,0:04:20,high
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,188.9,89.0,N,2,2019/9/12,0:05:22,high
3,Fudan University,Sophomore,Xiaojuan Sun,Female,,41.0,N,2,2020/1/3,0:04:08,low
4,Fudan University,Sophomore,Gaojuan You,Male,174.0,74.0,N,2,2019/11/6,0:05:22,high


In [39]:
df_test.groupby('Weight_new')['Height'].mean()

Weight_new
high      174.935714
low       153.753659
normal    162.177000
Name: Height, dtype: float64

Weight  Weight  Weight
False   False   False     165.144444
                True      153.753659
        True    False     161.883516
True    False   False     174.935714
Name: Height, dtype: float64
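
As a vectorized alternative to the row-wise `apply` used in the exercise, the labels can be built with boolean masks. This is only a sketch on hypothetical toy data; the names `toy`, `labels` and `Weight_group` are made up for illustration. Unlike `tran_Weight`, it leaves rows with a missing weight unlabeled, so groupby drops them instead of counting them as 'normal'.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including one missing weight
toy = pd.DataFrame({
    'Weight': [40.0, 50.0, 60.0, 70.0, 80.0, np.nan],
    'Height': [150.0, 160.0, 165.0, 175.0, 185.0, 170.0],
})

q1 = toy['Weight'].quantile(0.25)   # lower quartile (NaN is skipped)
q3 = toy['Weight'].quantile(0.75)   # upper quartile

# Start with 'normal' everywhere, then overwrite the two tails
labels = pd.Series('normal', index=toy.index, name='Weight_group')
labels[toy['Weight'] > q3] = 'high'
labels[toy['Weight'] < q1] = 'low'
labels = labels.where(toy['Weight'].notna())   # missing weight -> missing label

# Rows whose label is missing are excluded from the grouping by default
result = toy.groupby(labels)['Height'].mean()
print(result)
```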

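#### 1.3. The GroupBy object

Calling groupby on pandas data returns a GroupBy object, and several basic operations are available directly on it. For example:

* **ngroups**: the number of groups
* **groups**: a mapping from each group name to the row labels belonging to that group
* **size**: the number of rows in each group
* **get_group**: the data belonging to a given group
* **other basic statistics:** mean(), median(), sum(), std(), and so on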

In [41]:
# Create the GroupBy object
gb = df.groupby(['School','Grade'])
gb

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001361F46B9E8>

In [44]:
# Number of groups
gb.ngroups

16

In [48]:
# Mapping from group name to row labels
gb.groups

{('Fudan University', 'Freshman'): [15, 28, 63, 70, 73, 105, 108, 157, 186], ('Fudan University', 'Junior'): [26, 41, 82, 84, 90, 107, 145, 152, 173, 187, 189, 195], ('Fudan University', 'Senior'): [39, 46, 49, 52, 66, 77, 112, 129, 131, 138, 144], ('Fudan University', 'Sophomore'): [3, 4, 37, 48, 68, 98, 135, 170], ('Peking University', 'Freshman'): [1, 32, 35, 36, 38, 45, 54, 57, 88, 96, 99, 140, 185], ('Peking University', 'Junior'): [9, 20, 59, 72, 75, 102, 159, 183], ('Peking University', 'Senior'): [30, 86, 116, 127, 130, 132, 147, 194], ('Peking University', 'Sophomore'): [29, 61, 83, 101, 120], ('Shanghai Jiao Tong University', 'Freshman'): [0, 6, 10, 60, 114, 117, 119, 121, 141, 148, 149, 153, 184], ('Shanghai Jiao Tong University', 'Junior'): [31, 42, 50, 56, 58, 64, 85, 93, 115, 122, 143, 155, 164, 172, 174, 188, 190], ('Shanghai Jiao Tong University', 'Senior'): [2, 12, 19, 21, 22, 23, 79, 87, 89, 103, 104, 109, 123, 134, 156, 161, 165, 166, 171, 192, 197, 198], ('Shanghai 

In [49]:
# Group names only
gb.groups.keys()

dict_keys([('Fudan University', 'Freshman'), ('Fudan University', 'Junior'), ('Fudan University', 'Senior'), ('Fudan University', 'Sophomore'), ('Peking University', 'Freshman'), ('Peking University', 'Junior'), ('Peking University', 'Senior'), ('Peking University', 'Sophomore'), ('Shanghai Jiao Tong University', 'Freshman'), ('Shanghai Jiao Tong University', 'Junior'), ('Shanghai Jiao Tong University', 'Senior'), ('Shanghai Jiao Tong University', 'Sophomore'), ('Tsinghua University', 'Freshman'), ('Tsinghua University', 'Junior'), ('Tsinghua University', 'Senior'), ('Tsinghua University', 'Sophomore')])

In [50]:
# Number of rows in each group
gb.size()

School                         Grade    
Fudan University               Freshman      9
                               Junior       12
                               Senior       11
                               Sophomore     8
Peking University              Freshman     13
                               Junior        8
                               Senior        8
                               Sophomore     5
Shanghai Jiao Tong University  Freshman     13
                               Junior       17
                               Senior       22
                               Sophomore     5
Tsinghua University            Freshman     17
                               Junior       22
                               Senior       14
                               Sophomore    16
dtype: int64

In [53]:
# Rows belonging to one group
gb.get_group(('Tsinghua University','Sophomore')).head()

Unnamed: 0,School,Grade,Name,Gender,Height,Weight,Transfer,Test_Number,Test_Date,Time_Record
40,Tsinghua University,Sophomore,Li Wang,Male,175.0,79.0,N,1,2019/10/7,0:04:12
53,Tsinghua University,Sophomore,Chengli You,Female,164.1,57.0,N,1,2020/1/8,0:04:39
55,Tsinghua University,Sophomore,Chengquan Zhang,Female,168.9,54.0,N,1,2019/12/7,0:04:29
74,Tsinghua University,Sophomore,Yanli Qin,Male,169.4,74.0,Y,1,2019/9/3,0:03:32
76,Tsinghua University,Sophomore,Yanquan Lv,Male,174.6,,N,3,2019/9/26,0:03:59


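#### 1.4. The three main group operations


Once the data has been grouped, there are three main kinds of operation we can perform:

* <font color = red>Aggregation (aggregate)</font>
* <font color = blue>Transformation (transform)</font>
* <font color = green>Filtering (filter)</font>

The following sections cover these three operations in turn.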
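
Before going through them one by one, the following sketch contrasts the shape of the result that each of the three operations produces. The `toy` DataFrame is hypothetical; only the pandas methods `agg`, `transform` and `filter` are taken from the library.

```python
import pandas as pd

# Hypothetical toy data: three Female rows, two Male rows
toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female'],
    'Height': [158.0, 172.0, 160.0, 180.0, 162.0],
})
gb = toy.groupby('Gender')['Height']

# Aggregation: one value per group
agg_result = gb.agg('mean')

# Transformation: one value per original row, aligned with toy.index
transform_result = gb.transform('mean')

# Filtering: keep only the rows of groups that pass a test (here: at least 3 rows)
filter_result = toy.groupby('Gender').filter(lambda g: len(g) >= 3)

print(len(agg_result), len(transform_result), len(filter_result))
```

Aggregation shrinks the data to one row per group, transformation keeps the original length, and filtering returns a subset of the original rows.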

### 2. Aggregation with agg

#### 2.1. Built-in functions

A GroupBy object comes with built-in aggregation functions, most of them statistical:

* max/min/mean/median/count/idxmax/idxmin/nunique/quantile/sum/std/var/size

* all/any/mad/skew/sem/prod

In [4]:
# Select the Height column within each gender group
gb = df.groupby('Gender')['Height']

# Return the maximum of each group
gb.max()

Gender
Female    170.2
Male      193.9
Name: Height, dtype: float64

In [5]:
# Return the index label of each group's maximum
gb.idxmax()

Gender
Female     28
Male      193
Name: Height, dtype: int64

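**[Exercise]**
Consult the documentation and work out what the all/any/mad/skew/sem/prod functions do.

* all/any: whether all (respectively, any) of the values in each group are truthy
* mad: mean absolute deviation (deprecated in pandas 1.5 and removed in 2.0)
* skew: unbiased skewness, normalized by N-1
* sem: standard error of the mean, with missing values excluded
* prod: product of the values in each group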

In [10]:
gb.prod()

Gender
Female    4.232080e+290
Male      1.594210e+114
Name: Height, dtype: float64
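
The meaning of `sem` can be checked numerically: it should match the group standard deviation divided by the square root of the number of non-missing values. The sketch below uses hypothetical toy data with one missing height to show that missing values are excluded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing height
toy = pd.DataFrame({
    'Gender': ['Female', 'Female', 'Female', 'Male', 'Male', 'Male'],
    'Height': [158.0, 160.0, np.nan, 170.0, 174.0, 178.0],
})
gb = toy.groupby('Gender')['Height']

# sem should equal std / sqrt(count), where count ignores missing values
by_formula = gb.std() / np.sqrt(gb.count())

print(gb.sem())
print(by_formula)
print(gb.prod())   # product of the non-missing values in each group
```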

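#### 2.2. The agg method

The agg method addresses the following needs:

* applying several functions at once
* applying specific aggregation functions to specific columns
* using custom aggregation functions
* giving the result columns custom names directly in the aggregation call

The examples below demonstrate these four capabilities.

##### 2.2.1 Applying several functions at once

Pass the names of the built-in aggregation functions as strings **in a list**.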

In [14]:
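# Select two columns with a list; the tuple form ['Height','Weight'] raises an error in pandas 2.0+
gb = df.groupby('Gender')[['Height','Weight']]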
gb.agg(['mean','std','max'])

Unnamed: 0_level_0,Height,Height,Height,Weight,Weight,Weight
Unnamed: 0_level_1,mean,std,max,mean,std,max
Gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Female,159.19697,5.053982,170.2,47.918519,5.405983,63.0
Male,173.62549,7.048485,193.9,72.759259,7.772557,89.0


##### 2.2.2 Applying specific aggregation functions to specific columns

Pass the treatment for each column **as a dictionary**.

In [15]:
gb.agg({'Height':['mean','max'],'Weight':'min'})

Unnamed: 0_level_0,Height,Height,Weight
Unnamed: 0_level_1,mean,max,min
Gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Female,159.19697,170.2,34.0
Male,173.62549,193.9,51.0


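##### 2.2.3 Using custom functions

A custom function can be passed directly to agg().

The function receives each group's column as a Series, so any Series method or attribute can be used inside it, as long as the **return value is a scalar**.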

In [17]:
# Custom lambda for Weight: group mean minus group minimum
gb.agg({'Height':['mean','max'],
        'Weight':lambda x:x.mean()-x.min()})

Unnamed: 0_level_0,Height,Height,Weight
Unnamed: 0_level_1,mean,max,<lambda>
Gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Female,159.19697,170.2,13.918519
Male,173.62549,193.9,21.759259


In [18]:
def my_func(x):
    return x.mean() - x.min()

gb.agg({'Height':['mean','max'],
        'Weight':my_func})

Unnamed: 0_level_0,Height,Height,Weight
Unnamed: 0_level_1,mean,max,my_func
Gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Female,159.19697,170.2,13.918519
Male,173.62549,193.9,21.759259


**[Exercise]**
A GroupBy object offers the describe method for summary statistics. Reproduce what it does by applying several aggregation functions at once.

In [19]:
df['Weight'].describe()

count    189.000000
mean      55.015873
std       12.824294
min       34.000000
25%       46.000000
50%       51.000000
75%       65.000000
max       89.000000
Name: Weight, dtype: float64

In [81]:
gb = df.groupby('Gender')['Weight']

# Quantiles need a custom function: build one for each quantile level n
def percentile(n):
    def quan_(x):
        return x.quantile(n)
    # Name the function so that the result column is labeled by its level
    quan_.__name__ = "%0.2f quantile" % (n)
    return quan_


# Note: percentile(0.) reproduces the minimum; matching the 75% row of describe would need percentile(0.75)
gb.agg(['count','mean','std','min'
        ,'max',percentile(0.25),percentile(0.50),percentile(0.)])

Unnamed: 0_level_0,count,mean,std,min,max,0.25 quantile,0.50 quantile,0.00 quantile
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Female,135,47.918519,5.405983,34.0,63.0,44.0,48.0,34.0
Male,54,72.759259,7.772557,51.0,89.0,69.0,73.0,51.0


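##### 2.2.4 Renaming aggregation results

To rename the columns of the aggregation result, replace each function in the forms above with a **tuple: [(new name, aggregation)]**.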

In [89]:
# Without renaming, the result keeps the original column names
gb = df.groupby('Gender')[['Height','Weight']]
gb.agg('max')

Unnamed: 0_level_0,Height,Weight
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,170.2,63.0
Male,193.9,89.0


In [90]:
# Rename the result column with a (name, function) tuple
gb.agg([('Maximum','max')])

Unnamed: 0_level_0,Height,Weight
Unnamed: 0_level_1,Maximum,Maximum
Gender,Unnamed: 1_level_2,Unnamed: 2_level_2
Female,170.2,63.0
Male,193.9,89.0

In [91]:
# Rename several results, including one from a custom function
gb.agg([('Maximum','max'),
        ('Range',lambda x:x.max()-x.min())])

Unnamed: 0_level_0,Height,Height,Weight,Weight
Unnamed: 0_level_1,Maximum,Range,Maximum,Range
Gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Female,170.2,24.8,63.0,29.0
Male,193.9,38.2,89.0,38.0
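
pandas also supports a keyword form of renaming, usually called named aggregation, in which each keyword is the new column name and its value is a (column, function) pair. The sketch below assumes hypothetical toy data, and the result names `max_height` and `weight_range` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Height': [158.0, 172.0, 162.0, 180.0],
    'Weight': [46.0, 70.0, 50.0, 74.0],
})

# Each keyword becomes a result column: new_name=(source column, aggregation)
result = toy.groupby('Gender').agg(
    max_height=('Height', 'max'),
    weight_range=('Weight', lambda x: x.max() - x.min()),
)
print(result)
```

Compared with the list-of-tuples form, this produces flat column names rather than a two-level column index, which can be easier to work with downstream.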


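### 3. Transformation and filtering

#### 3.1. Transformation functions and the transform method


The **built-in transformation functions** are the cumulative functions, which operate within each group: cumcount, cumsum, cumprod, cummax and cummin.

They are comparable to window functions with an OVER clause in SQL.

Unlike agg, which returns one row per group, a transformation returns a result with the same length and index as the original data.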

In [10]:
gb = df.groupby('Gender')[['Height','Weight']]
gb.cumcount()

0        0
1        0
2        1
3        1
4        2
      ... 
195    138
196    139
197    140
198     57
199     58
Length: 200, dtype: int64

In [20]:
gb.transform('mean').head()

Unnamed: 0,Height,Weight
0,159.19697,47.918519
1,173.62549,72.759259
2,173.62549,72.759259
3,159.19697,47.918519
4,173.62549,72.759259
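
Besides string names such as 'mean', transform accepts a custom function, provided its result can be aligned back to the rows of the group. A minimal sketch on hypothetical toy data, showing a cumulative function and a custom transformation side by side:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female'],
    'Height': [160.0, 180.0, 158.0, 172.0, 162.0],
})
gb = toy.groupby('Gender')['Height']

# Cumulative maximum within each group, following the original row order
running_max = gb.cummax()

# Custom transformation: center each height on its own group mean
centered = gb.transform(lambda x: x - x.mean())

print(running_max.tolist())
print(centered.tolist())
```

Both results have one entry per original row, so they can be assigned straight back to the DataFrame as new columns.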


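#### 3.2. Group indexing and filtering

Group filtering generalizes row filtering: a statistic is computed over all the rows of a group, and the group is kept if the result is True and dropped as a whole if it is False. The rows of all the groups that were kept are then concatenated and returned as a DataFrame.


This is done with gb.filter, which takes a function that receives each group and returns a single boolean.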
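
The group-filtering rule described above can be sketched as follows. The data and the thresholds (at least two rows, mean height above 165) are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical toy data: school A has 3 rows, B has 2, C has 1
toy = pd.DataFrame({
    'School': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Height': [158.0, 172.0, 160.0, 180.0, 162.0, 170.0],
})

# Keep only the schools with at least two students
kept = toy.groupby('School').filter(lambda g: g.shape[0] >= 2)

# Keep only the schools whose mean height exceeds 165
tall = toy.groupby('School').filter(lambda g: g['Height'].mean() > 165)

print(kept)
print(tall)
```

The function passed to filter sees the whole group as a DataFrame and must return a single True or False; the output keeps the original rows and index of the surviving groups.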

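### 4. Cross-column grouping

apply handles computations that need several columns of a group at once.

apply can also be used on data that has not been grouped.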

#### 4.2. Using apply


In [21]:
# Basic use of apply: mean BMI of each group

def BMI(x):
    Height = x['Height']/100
    Weight = x['Weight']
    BMI_value = Weight/Height**2
    return BMI_value.mean()
gb.apply(BMI)

Gender
Female    18.860930
Male      24.318654
dtype: float64
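
Another statistic that needs two columns at once is the within-group correlation, which cannot be expressed as a per-column aggregation. The sketch below assumes hypothetical toy data constructed so that the correlation is exactly positive in one group and exactly negative in the other.

```python
import pandas as pd

# Hypothetical toy data: weight rises with height for Female, falls for Male
toy = pd.DataFrame({
    'Gender': ['Female', 'Female', 'Female', 'Male', 'Male', 'Male'],
    'Height': [158.0, 160.0, 162.0, 170.0, 174.0, 178.0],
    'Weight': [46.0, 48.0, 50.0, 74.0, 70.0, 66.0],
})
gb = toy.groupby('Gender')[['Height', 'Weight']]

# Each group arrives as a DataFrame, so both columns are available at once
corr = gb.apply(lambda g: g['Height'].corr(g['Weight']))
print(corr)
```

Because the function returns a scalar for each group, the result is a Series indexed by the group key, just like the BMI example.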

In [23]:
# apply returning a scalar

gb = df.groupby(['Gender','Test_Number'])[['Height','Weight']]

gb.apply(lambda x:0)

Gender  Test_Number
Female  1              0
        2              0
        3              0
Male    1              0
        2              0
        3              0
dtype: int64

In [24]:
gb.apply(lambda x:[1,1])

Gender  Test_Number
Female  1              [1, 1]
        2              [1, 1]
        3              [1, 1]
Male    1              [1, 1]
        2              [1, 1]
        3              [1, 1]
dtype: object

In [26]:
# apply returning a Series

gb.apply(lambda x:pd.Series([0],index=['a']))

Unnamed: 0_level_0,Unnamed: 1_level_0,a
Gender,Test_Number,Unnamed: 2_level_1
Female,1,0
Female,2,0
Female,3,0
Male,1,0
Male,2,0
Male,3,0
