# Data Aggregation and Group Operations

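**Grouping** a dataset and applying a **function** to each group (whether an aggregation or a transformation) is a core part of data analysis work. Once a dataset has been prepared, the usual next task is to **compute group statistics** or **build pivot tables**. pandas provides a flexible and efficient **groupby** facility that lets you **slice**, **dice**, and **summarize** datasets in a natural way.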

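One reason relational databases and SQL (Structured Query Language) are so popular is the ease with which data can be joined, filtered, transformed, and aggregated. However, query languages like SQL are rather limited in the kinds of group operations they can perform. As this chapter shows, the expressiveness of Python and pandas makes much more complex group operations possible (using any function that accepts a pandas object or a NumPy array).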

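In this chapter you will learn how to:
* Split a pandas object into pieces using one or more keys (functions, arrays, or DataFrame column names).
* Compute group summary statistics such as count, mean, standard deviation, or a user-defined function.
* Apply a variety of functions to the columns of a DataFrame.
* Apply within-group transformations or other operations, such as normalization, linear regression, ranking, or subset selection.
* Compute pivot tables and cross-tabulations.
* Perform quantile analysis and other group analyses.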

In [176]:
import numpy as np

import pandas as pd
from pandas import Series
from pandas import DataFrame

import statsmodels.api as sm

import matplotlib.pyplot as plt
%matplotlib inline



In [2]:
def print_gb(gb):
    # print every (group name, group data) pair of a GroupBy object
    for n, g in gb:
        print(n)
        print(g)
        print('\n')

## GroupBy技术

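### Group operations (split-apply-combine)

A group operation (split-apply-combine) has three steps:
* Split:
    * split the original data into groups according to one or more keys;
* Apply:
    * apply a function to each group to compute a result;
* Combine:
    * combine the results into a final result object.

The figure below illustrates the process: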
![分组计算](https://github.com/NemoHoHaloAi/machine_learning/blob/master/python%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90/images/%E5%88%86%E7%BB%84%E8%AE%A1%E7%AE%97.png?raw=true)

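### Possible forms of a group key

Group keys can take many forms, and the keys need not all be of the same type:
* A list or array of values with the same length as the axis being grouped.
* A value indicating a column name in a DataFrame.
* A dict or Series giving the correspondence between the values on the axis being grouped and the group names.
* A function to be invoked on the axis index or on the individual labels in the index.

**Note**: the last three are shortcuts for the first. Each is simply a different way of producing the array of values used to split the object: the first passes the array directly, the second looks it up by column name, the third derives it from a mapping, and the fourth from a function's return values.

Note: whatever is passed as the key, it is ultimately converted into an array with one entry per row (or column), and each entry determines which group the item at that position belongs to.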
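
The equivalence of the four key forms can be checked directly. The following is a minimal sketch on a small made-up frame (the `mapping` dict and the variable names are illustrative assumptions, not part of the original notebook): every form produces the same groups and therefore the same sums.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"data1": [10, 20, 30, 40, 50], "key1": ["a", "b", "a", "b", "a"]},
    index=["HL", "LM", "BL", "JK", "MP"],
)

# 1. An explicit array with one group label per row
by_array = df["data1"].groupby(np.array(["a", "b", "a", "b", "a"])).sum()
# 2. A column name, looked up on the DataFrame
by_column = df.groupby("key1")["data1"].sum()
# 3. A dict mapping index labels to group names
mapping = {"HL": "a", "LM": "b", "BL": "a", "JK": "b", "MP": "a"}
by_dict = df["data1"].groupby(mapping).sum()
# 4. A function called once per index label
by_func = df["data1"].groupby(lambda label: mapping[label]).sum()

results = [r.to_dict() for r in (by_array, by_column, by_dict, by_func)]
print(results)
```

All four dictionaries hold the same sums, 90 for group `a` and 60 for group `b`.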

### Grouping examples

#### Using a Series as the group key -- for example df['key1']

In [3]:
df = DataFrame({'data1':[10,20,30,40,50],'data2':[40,50,60,70,80],
                'key1':['a','b','a','b','a'],'key2':['c','d','d','c','c']},
              index=['HL','LM','BL','JK','MP'])
df

Unnamed: 0,data1,data2,key1,key2
HL,10,40,a,c
LM,20,50,b,d
BL,30,60,a,d
JK,40,70,b,c
MP,50,80,a,c


In [4]:
# group the data1 column by key1 and compute the mean of each group
df['data1'].groupby(df['key1']).mean() # returns a Series indexed by the unique values of key1

key1
a    30
b    30
Name: data1, dtype: int64

In [5]:
# group data1 by key1 and key2 and compute the mean of each group
df['data1'].groupby([df['key1'],df['key2']]).mean() # returns a Series indexed by the unique (key1, key2) pairs

key1  key2
a     c       30
      d       30
b     c       40
      d       20
Name: data1, dtype: int64

#### Using an arbitrary array as the group key -- each array value is paired with the row at the same position, which imposes a mapping

In [6]:
arr = np.array(['aa','bb','cc','aa','cc'])
df['data1'].groupby(arr).mean()

aa    25
bb    20
cc    40
Name: data1, dtype: int64

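#### Using a column name (a string, number, or other Python object) as the group key -- non-numeric columns are excluded from the mean

In [7]:
# numeric_only=True excludes the string column key2 (older pandas versions dropped it silently)
df.groupby('key1').mean(numeric_only=True) # a column-name key needs a DataFrame; a single column (Series) has no such column to look up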

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,30,60
b,30,60


#### Group sizes with size()

In [8]:
df.groupby('key2').size()

key2
c    3
d    2
dtype: int64

### Iterating over groups

A GroupBy object supports iteration, yielding 2-tuples of the group name and the chunk of data.

#### Iterating with a single key

In [9]:
for name,group in df.groupby('key1'):
    print(name)
    print(group)
    print('\n')

a
    data1  data2 key1 key2
HL     10     40    a    c
BL     30     60    a    d
MP     50     80    a    c


b
    data1  data2 key1 key2
LM     20     50    b    d
JK     40     70    b    c




#### Iterating with multiple keys

In [10]:
for names,group in df.groupby(['key1','key2']):
    print(names)
    print(group)
    print('\n')

('a', 'c')
    data1  data2 key1 key2
HL     10     40    a    c
MP     50     80    a    c


('a', 'd')
    data1  data2 key1 key2
BL     30     60    a    d


('b', 'c')
    data1  data2 key1 key2
JK     40     70    b    c


('b', 'd')
    data1  data2 key1 key2
LM     20     50    b    d




#### Converting the groups to a dict

In [11]:
group_dict = dict(list(df.groupby(['key1','key2'])))
for key in group_dict:
    print(key)
    print(group_dict[key])
    print('\n')

('b', 'c')
    data1  data2 key1 key2
JK     40     70    b    c


('a', 'd')
    data1  data2 key1 key2
BL     30     60    a    d


('a', 'c')
    data1  data2 key1 key2
HL     10     40    a    c
MP     50     80    a    c


('b', 'd')
    data1  data2 key1 key2
LM     20     50    b    d




#### Grouping the rows by a list of labels -- axis=0, the default

In [12]:
for name,group in df.groupby(['A','A','B','B','B']):
    print(name)
    print(group)
    print('\n')

A
    data1  data2 key1 key2
HL     10     40    a    c
LM     20     50    b    d


B
    data1  data2 key1 key2
BL     30     60    a    d
JK     40     70    b    c
MP     50     80    a    c




### Selecting a column or subset of columns -- either group the selected column directly or index the grouped result

#### Grouping a selected column

In [13]:
df['data2'].groupby([df['key1'],df['key2']]).mean()

key1  key2
a     c       60
      d       60
b     c       70
      d       50
Name: data2, dtype: int64

#### Selecting a column from the grouped result -- syntactic sugar for the previous form

In [14]:
df.groupby(['key1', 'key2'])['data2'].mean()

key1  key2
a     c       60
      d       60
b     c       70
      d       50
Name: data2, dtype: int64

#### Note the difference between the following two forms

In [15]:
df.groupby(['key1'])['data2'].mean()

key1
a    60
b    60
Name: data2, dtype: int64

In [16]:
df.groupby(['key1'])[['data2']].mean()

Unnamed: 0_level_0,data2
key1,Unnamed: 1_level_1
a,60
b,60


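Comparison:
* \['data2'\]:
    * the result is a Series;
    * its name attribute is the selected column name;
    * **DataFrame['column']** returns that column as a **Series**;
* \[\['data2'\]\]:
    * the result is a DataFrame;
    * the index is the group key and the only column is the selected column;
    * **DataFrame[['column']]** returns a **DataFrame** made of that column plus the original index.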

In [17]:
df['key1'] # a Series holding the original index plus this column's data

HL    a
LM    b
BL    a
JK    b
MP    a
Name: key1, dtype: object

In [18]:
df[['key1']] # a DataFrame holding the original index plus this column

Unnamed: 0,key1
HL,a
LM,b
BL,a
JK,b
MP,a


### Grouping with a dict or Series

In [19]:
df = DataFrame(np.random.randn(5,5),
              columns=['a','b','c','d','e'],
              index=['01','02','03','04','05'])
df.iloc[1:4,1:4] = np.nan # set a few NaN values by position (.ix is deprecated; .iloc is the positional indexer)
df


Unnamed: 0,a,b,c,d,e
1,0.235712,1.546067,0.086968,0.087098,-1.587111
2,0.689585,,,,-0.868422
3,0.060613,,,,-1.79137
4,0.235926,,,,1.231969
5,1.14196,-1.285498,1.766622,0.469069,0.416248


#### Grouping with a dict

In [20]:
dict_col = {'a':'A','b':'B','c':'A','d':'A','e':'B'}
df.groupby(dict_col, axis=1).mean() # the dict maps columns to groups; equivalent to df.groupby(['A','B','A','A','B'], axis=1).mean() (note: axis=1 is deprecated in recent pandas versions)

Unnamed: 0,A,B
1,0.136593,-0.020522
2,0.689585,-0.868422
3,0.060613,-1.79137
4,0.235926,1.231969
5,1.125884,-0.434625


#### Grouping with a Series

In [21]:
series_col = Series({'a':'A','b':'B','c':'A','d':'A','e':'B'}) # the mapping does not have to match the columns in length
df.groupby(series_col, axis=1).mean()

Unnamed: 0,A,B
1,0.136593,-0.020522
2,0.689585,-0.868422
3,0.060613,-1.79137
4,0.235926,1.231969
5,1.125884,-0.434625


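### Grouping with functions

Compared with a dict or Series, a Python function is a more generic and flexible way of defining a group mapping. Any function passed as a group key is called once per index value, and its return values are used as the group names.

#### Function only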

In [22]:
df = DataFrame({'grade':[67,54,47,82,66]}, index=['Jack Jr.','Murphy','Mark Jr.','Lily','John Jr.'])
df.groupby(lambda name:'Jr.' in name).mean() # group by whether the name contains 'Jr.' and average the grades

Unnamed: 0,grade
False,68
True,60


#### Mixing a function with other keys -- first split into True/False with the function, then subdivide by the array

In [23]:
df.groupby([lambda name:'Jr.' in name,['1','2','2','1','2']]).mean() # a function mixed with an array

Unnamed: 0,Unnamed: 1,grade
False,1,82.0
False,2,54.0
True,1,67.0
True,2,56.5


### Grouping by index level -- pass the level through the level parameter

In [24]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],[1, 3, 5, 1, 3]], names=['cty', 'tenor'])
df = DataFrame(np.random.randn(4, 5), columns=columns)
df

cty,US,US,US,JP,JP
tenor,1,3,5,1,3
0,-0.41639,-0.353136,0.463632,-0.178678,-0.154986
1,-0.332494,0.930559,-0.128227,0.263087,-0.898967
2,-0.345656,-0.161283,-1.252275,-0.908882,1.490345
3,-0.884441,-0.74076,-0.411203,1.960717,1.511257


In [25]:
df.groupby(level='cty', axis=1).mean() # group by cty, the outermost column level

cty,JP,US
0,-0.166832,-0.101965
1,-0.31794,0.156613
2,0.290732,-0.586405
3,1.735987,-0.678801


In [26]:
df.groupby(level=1, axis=1).mean() # group by the innermost column level

tenor,1,3,5
0,-0.297534,-0.254061,0.463632
1,-0.034703,0.015796,-0.128227
2,-0.627269,0.664531,-1.252275
3,0.538138,0.385248,-0.411203


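## Data aggregation

By **aggregation**, I mean any **data transformation** that produces **scalar values** from **arrays**. The earlier examples already used several, such as mean, count, min, and sum. You may wonder what actually happens when mean() is called on a GroupBy object. Many common aggregations have optimized implementations that compute the statistics on the dataset in place. However, you are not limited to these methods: you can use **aggregations of your own devising**, and you can also call any **method** that is **already defined** on the grouped object.

### General aggregation methods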

In [27]:
gb = DataFrame({'data1':[10,20,30,40,50],'data2':[40,50,60,70,80],
                'key1':['a','b','a','b','a'],'key2':['c','d','d','c','c']},
              index=['HL','LM','BL','JK','MP']).groupby('key1')

for name,group in gb:
    print(name)
    print(group)
    print('\n')

a
    data1  data2 key1 key2
HL     10     40    a    c
BL     30     60    a    d
MP     50     80    a    c


b
    data1  data2 key1 key2
LM     20     50    b    d
JK     40     70    b    c




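#### quantile -- quantiles of the numeric columns

In [28]:
gb[['data1', 'data2']].quantile(0.9) # select the numeric columns explicitly; older pandas dropped key2 silently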

0.9,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,46.0,76.0
b,38.0,68.0


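#### Using a custom aggregation function -- agg(callable, string, dictionary, or list of string/callables)

In [29]:
def func(x):
    # range of the values in one column of one group
    return x.max() - x.min()

gb[['data1', 'data2']].agg(func) # applied to every selected column of every group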

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,40,40
b,20,20


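Note: as you may have noticed, custom aggregation functions are generally much slower than the optimized built-in ones. This is because there is significant overhead (**function calls**, **data rearrangement**, and so on) in constructing the **intermediate group data chunks**.

#### Built-in GroupBy aggregation methods

* count: number of non-NaN values in the group;
* sum: sum of non-NaN values;
* mean: mean of non-NaN values;
* median: arithmetic median of non-NaN values;
* std, var: unbiased (n-1 denominator) standard deviation and variance of non-NaN values;
* min, max: minimum and maximum of non-NaN values;
* prod: product of non-NaN values;
* first, last: first and last non-NaN values.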
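
The NaN handling described in this list can be observed on a small made-up frame. This is a sketch only; the column names `key` and `val` are illustrative assumptions. It contrasts `size`, which counts every row, with the NaN-skipping aggregations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b"],
    "val": [1.0, np.nan, 3.0, np.nan, 5.0],
})
grouped = df.groupby("key")["val"]

summary = pd.DataFrame({
    "size": grouped.size(),    # counts every row, NaN included
    "count": grouped.count(),  # counts non-NaN values only
    "sum": grouped.sum(),      # NaN values are skipped
    "mean": grouped.mean(),    # mean of the non-NaN values
    "first": grouped.first(),  # first non-NaN value in each group
})
print(summary)
```

Group `a` has three rows but only two non-NaN values, so its mean is 4 / 2 = 2 rather than 4 / 3.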

### Comparing agg, aggregate, and apply

In [30]:
def test(x):
    # show what kind of object the groupby method passes in
    print(type(x))
    print(x)
    print('\n')
    return 2

#### agg

In [31]:
gb.agg(test)

<class 'pandas.core.series.Series'>
HL    10
BL    30
MP    50
Name: data1, dtype: int64


<class 'pandas.core.series.Series'>
LM    20
JK    40
Name: data1, dtype: int64


<class 'pandas.core.series.Series'>
HL    40
BL    60
MP    80
Name: data2, dtype: int64


<class 'pandas.core.series.Series'>
LM    50
JK    70
Name: data2, dtype: int64


<class 'pandas.core.series.Series'>
HL    c
BL    d
MP    c
Name: key2, dtype: object


<class 'pandas.core.series.Series'>
LM    d
JK    c
Name: key2, dtype: object




Unnamed: 0_level_0,data1,data2,key2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,2,2,2
b,2,2,2


agg:
* is applied to every column (that is, a Series) of every group;
* returns a DataFrame built from the aggregated values.

#### aggregate

In [32]:
gb.aggregate(test)

<class 'pandas.core.series.Series'>
HL    10
BL    30
MP    50
Name: data1, dtype: int64


<class 'pandas.core.series.Series'>
LM    20
JK    40
Name: data1, dtype: int64


<class 'pandas.core.series.Series'>
HL    40
BL    60
MP    80
Name: data2, dtype: int64


<class 'pandas.core.series.Series'>
LM    50
JK    70
Name: data2, dtype: int64


<class 'pandas.core.series.Series'>
HL    c
BL    d
MP    c
Name: key2, dtype: object


<class 'pandas.core.series.Series'>
LM    d
JK    c
Name: key2, dtype: object




Unnamed: 0_level_0,data1,data2,key2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,2,2,2
b,2,2,2


Same as above: `agg` is an alias of `aggregate`.

#### apply

In [33]:
gb.apply(test)

<class 'pandas.core.frame.DataFrame'>
    data1  data2 key2
HL     10     40    c
BL     30     60    d
MP     50     80    c


<class 'pandas.core.frame.DataFrame'>
    data1  data2 key2
HL     10     40    c
BL     30     60    d
MP     50     80    c


<class 'pandas.core.frame.DataFrame'>
    data1  data2 key2
LM     20     50    d
JK     40     70    c




key1
a    2
b    2
dtype: int64

In [34]:
gb.apply(lambda x:x.min())

Unnamed: 0_level_0,data1,data2,key2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,10,40,c
b,20,50,c


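apply:
* is applied to each group as a whole (a DataFrame);
* when the function returns a scalar, the result is a Series with one value per group; when it returns a Series (as `x.min()` does above), the result is a DataFrame.

Why does the output above show three calls for only two groups? In older pandas versions, apply calls the function twice on the first group in order to decide whether it can take a fast or a slow code path, which is why group `a` is printed twice.

#### Summary

apply works on each group as a whole, while agg and aggregate work on each column of each group.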
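
The difference can also be checked without reading printed output. The sketch below records the type of object each method hands to the function; the helper names `spy_agg` and `spy_apply` are hypothetical, and the value columns are selected explicitly so that the key column is not passed along.

```python
import pandas as pd

df = pd.DataFrame({
    "data1": [10, 20, 30, 40, 50],
    "data2": [40, 50, 60, 70, 80],
    "key1": ["a", "b", "a", "b", "a"],
})
gb = df.groupby("key1")[["data1", "data2"]]

agg_inputs, apply_inputs = set(), set()

def spy_agg(x):
    # remember what agg passes in, then aggregate normally
    agg_inputs.add(type(x).__name__)
    return x.sum()

def spy_apply(x):
    # remember what apply passes in, then aggregate normally
    apply_inputs.add(type(x).__name__)
    return x.sum()

agg_result = gb.agg(spy_agg)
apply_result = gb.apply(spy_apply)
print(agg_inputs, apply_inputs)
```

Both calls return the same per-group sums here; only the granularity at which the function is called differs.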

### Tipping example

In [35]:
# step 1: load the data
tips = pd.read_csv('https://raw.githubusercontent.com/NemoHoHaloAi/pydata-book/2nd-edition/examples/tips.csv')
tips.head(5)

Unnamed: 0,total_bill,tip,smoker,day,time,size
0,16.99,1.01,No,Sun,Dinner,2
1,10.34,1.66,No,Sun,Dinner,3
2,21.01,3.5,No,Sun,Dinner,3
3,23.68,3.31,No,Sun,Dinner,2
4,24.59,3.61,No,Sun,Dinner,4


In [36]:
# step 2: add a column holding the tip as a proportion of the total bill
tips['tip_proportion'] = tips['tip'] / tips['total_bill']
tips.head(5)

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_proportion
0,16.99,1.01,No,Sun,Dinner,2,0.059447
1,10.34,1.66,No,Sun,Dinner,3,0.160542
2,21.01,3.5,No,Sun,Dinner,3,0.166587
3,23.68,3.31,No,Sun,Dinner,2,0.13978
4,24.59,3.61,No,Sun,Dinner,4,0.146808


In [37]:
# step 3: add a synthetic sex column, because this copy of the data does not have one
tips['sex'] = ['male' if x%3==0 else 'female' for x in np.arange(len(tips))]
tips.head(5)

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_proportion,sex
0,16.99,1.01,No,Sun,Dinner,2,0.059447,male
1,10.34,1.66,No,Sun,Dinner,3,0.160542,female
2,21.01,3.5,No,Sun,Dinner,3,0.166587,female
3,23.68,3.31,No,Sun,Dinner,2,0.13978,male
4,24.59,3.61,No,Sun,Dinner,4,0.146808,female


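### Column-wise and multiple function application

Aggregating a Series or the columns of a DataFrame is a matter of using **aggregate** (with a custom function) or calling a method such as **mean** or **std**. However, you may want to use a **different aggregation function** for **different columns**, or **apply several functions at once**.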

In [38]:
gb_sex_smoker = tips.groupby(['sex','smoker'])
print_gb(gb_sex_smoker)

('female', 'No')
     total_bill   tip smoker   day    time  size  tip_proportion     sex
1         10.34  1.66     No   Sun  Dinner     3        0.160542  female
2         21.01  3.50     No   Sun  Dinner     3        0.166587  female
4         24.59  3.61     No   Sun  Dinner     4        0.146808  female
5         25.29  4.71     No   Sun  Dinner     4        0.186240  female
7         26.88  3.12     No   Sun  Dinner     4        0.116071  female
8         15.04  1.96     No   Sun  Dinner     2        0.130319  female
10        10.27  1.71     No   Sun  Dinner     2        0.166504  female
11        35.26  5.00     No   Sun  Dinner     4        0.141804  female
13        18.43  3.00     No   Sun  Dinner     4        0.162778  female
14        14.83  3.02     No   Sun  Dinner     2        0.203641  female
16        10.33  1.67     No   Sun  Dinner     3        0.161665  female
17        16.29  3.71     No   Sun  Dinner     3        0.227747  female
19        20.65  3.35     No   Sat

#### Applying one function

In [39]:
gb_sex_smoker['tip_proportion'].agg('mean') # aggregate tip_proportion with the mean

sex     smoker
female  No        0.162939
        Yes       0.171521
male    No        0.151813
        Yes       0.148060
Name: tip_proportion, dtype: float64

#### Applying several functions

In [40]:
def test(x):
    return x.max() - x.min()

gb_sex_smoker['tip_proportion'].agg(['mean', test, np.std]) # apply the mean, a custom function, and NumPy's standard deviation at once

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,test,std
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,No,0.162939,0.219028,0.037195
female,Yes,0.171521,0.644685,0.095908
male,No,0.151813,0.195876,0.044509
male,Yes,0.14806,0.244897,0.059199


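#### Applying several named functions to one column -- (name, function) tuples map result column names to aggregations

In [41]:
# apply the mean, np.max, and a lambda to tip_proportion, naming each result column
gb_sex_smoker['tip_proportion'].agg([('平均小费','mean'), ('最大小费',np.max), ('平均小费/10', lambda x:x.mean()/10)])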

Unnamed: 0_level_0,Unnamed: 1_level_0,平均小费,最大小费,平均小费/10
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,No,0.162939,0.29199,0.016294
female,Yes,0.171521,0.710345,0.017152
male,No,0.151813,0.252672,0.015181
male,Yes,0.14806,0.280535,0.014806


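#### Applying several functions to several columns -- every function is applied to every column and the results are concatenated into a hierarchically indexed result

In [42]:
gb_sex_smoker[['tip_proportion','total_bill']].agg([('平均','mean'), ('最大',np.max)]) # select several columns with a list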

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_proportion,tip_proportion,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,平均,最大,平均,最大
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
female,No,0.162939,0.29199,19.860196,48.33
female,Yes,0.171521,0.710345,20.183333,50.81
male,No,0.151813,0.252672,17.789592,48.17
male,Yes,0.14806,0.280535,21.798182,44.3


#### Applying different functions to different columns -- pass a dict that maps each column to its aggregations

In [43]:
gb_sex_smoker.agg({'tip':('mean','min'), 'total_bill':('max',np.std,np.var)})

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,total_bill,total_bill,tip,tip
Unnamed: 0_level_1,Unnamed: 1_level_1,max,std,var,mean,min
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
female,No,48.33,8.460528,71.580533,3.180882,1.25
female,Yes,50.81,10.475436,109.734768,3.066167,1.0
male,No,48.17,7.707749,59.409391,2.598367,1.0
male,Yes,44.3,8.594847,73.871397,2.904242,1.17
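
As an alternative to the dict form, pandas 0.25 and later also accept keyword arguments of the form `new_name=(column, function)`, which yields flat, explicitly named result columns. A minimal sketch on a made-up four-row frame (`tips_small` and the result names are illustrative assumptions):

```python
import pandas as pd

tips_small = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "tip": [1.0, 2.0, 4.0, 3.0],
    "total_bill": [10.0, 20.0, 30.0, 40.0],
})

# Each keyword names one output column: (input column, aggregation)
summary = tips_small.groupby("sex").agg(
    tip_mean=("tip", "mean"),
    tip_min=("tip", "min"),
    bill_max=("total_bill", "max"),
)
print(summary)
```

The result has a single level of column names, which is often easier to work with than the hierarchical columns produced by the dict form.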


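### Returning aggregated data without a group-key index -- by default the index is built from the group key values, such as male/female above, or a hierarchical index

In [44]:
print_gb(tips[:10].groupby('sex', as_index=False)) # do not use the group key column as the result index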

female
   total_bill   tip smoker  day    time  size  tip_proportion     sex
1       10.34  1.66     No  Sun  Dinner     3        0.160542  female
2       21.01  3.50     No  Sun  Dinner     3        0.166587  female
4       24.59  3.61     No  Sun  Dinner     4        0.146808  female
5       25.29  4.71     No  Sun  Dinner     4        0.186240  female
7       26.88  3.12     No  Sun  Dinner     4        0.116071  female
8       15.04  1.96     No  Sun  Dinner     2        0.130319  female


male
   total_bill   tip smoker  day    time  size  tip_proportion   sex
0       16.99  1.01     No  Sun  Dinner     2        0.059447  male
3       23.68  3.31     No  Sun  Dinner     2        0.139780  male
6        8.77  2.00     No  Sun  Dinner     2        0.228050  male
9       14.78  3.23     No  Sun  Dinner     2        0.218539  male
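
Iterating over the groups does not show the effect of `as_index=False`, because it only changes the shape of aggregated results. A small sketch with made-up data (`tips_small` is an illustrative assumption) makes the difference visible:

```python
import pandas as pd

tips_small = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "tip": [1.0, 2.0, 4.0, 3.0],
})

indexed = tips_small.groupby("sex")["tip"].mean()               # sex becomes the index
flat = tips_small.groupby("sex", as_index=False)["tip"].mean()  # sex stays an ordinary column
print(indexed)
print(flat)
```

With `as_index=False` the result is a DataFrame with a default integer index, similar to calling `reset_index()` on the indexed result.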




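## Group-wise operations and transformations -- transform/apply

Aggregation is only **one kind** of group operation. It is a **special case** of data transformation: it accepts functions that **reduce a one-dimensional array to a scalar value**. This section introduces the **transform** and **apply** methods, which can perform many other kinds of **group operations**.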

### transform

In [45]:
df = DataFrame({'math':[55,65,70,48,63],'english':[80,95,77,65,45],'sex':['male','female','male','male','female']},
              index=['HoLoong','Mark','John','Lily','Joy'])
df

Unnamed: 0,english,math,sex
HoLoong,80,55,male
Mark,95,65,female
John,77,70,male
Lily,65,48,male
Joy,45,63,female


#### Adding columns that hold the mean English and math scores per sex -- old way: aggregate first, then merge

In [55]:
# step 1: aggregate to get the group means
mean_sex = df.groupby('sex').mean()
mean_sex

Unnamed: 0_level_0,english,math
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
female,70.0,64.0
male,74.0,57.666667


In [59]:
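# step 2: merge back into the original DataFrame; sex is a column on the left and the index on the right, and a suffix is needed for the overlapping column names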
df.merge(mean_sex, left_on='sex', right_index=True, suffixes=('', '_mean'))

Unnamed: 0,english,math,sex,english_mean,math_mean
HoLoong,80,55,male,74.0,57.666667
John,77,70,male,74.0,57.666667
Lily,65,48,male,74.0,57.666667
Mark,95,65,female,70.0,64.0
Joy,45,63,female,70.0,64.0


#### Adding columns that hold the mean English and math scores per sex -- new way: transform

In [64]:
df.merge(df.groupby('sex').transform(np.mean), left_index=True, right_index=True, suffixes=('','_mean'))

Unnamed: 0,english,math,sex,english_mean,math_mean
HoLoong,80,55,male,74,57.666667
Mark,95,65,female,70,64.0
John,77,70,male,74,57.666667
Lily,65,48,male,74,57.666667
Joy,45,63,female,70,64.0


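#### How transform works

As shown above, transform applies the function to **each** group and places the result in a DataFrame that has the **index of the original DataFrame** and the **non-key data columns** (here english and math) as its columns, so the result can be merged with the original DataFrame **directly** on the index.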
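
Because the transformed result shares the original index, it can also be assigned as a new column without any merge. A minimal sketch on the math scores from the example above (the column name `math_mean` is an illustrative assumption):

```python
import pandas as pd

df = pd.DataFrame(
    {"math": [55, 65, 70, 48, 63],
     "sex": ["male", "female", "male", "male", "female"]},
    index=["HoLoong", "Mark", "John", "Lily", "Joy"],
)

# One value per original row: the mean math score of that row's group
df["math_mean"] = df.groupby("sex")["math"].transform("mean")
print(df)
```

The row order and index are unchanged, so no suffixes or join keys are needed.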

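### apply -- general split-apply-combine

Like aggregate, transform is a specialized function with **strict requirements**: the passed function must produce either a **scalar value** that can be broadcast (as np.mean does) or a **result array of the same size**. The most general-purpose GroupBy method is apply, which is the subject of the rest of this section. apply splits the object being manipulated into **pieces**, invokes the passed function on **each piece**, and then attempts to **concatenate** the pieces together.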
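
The two kinds of result that transform accepts can be illustrated side by side. This sketch uses made-up scores; the variable names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female", "male", "male", "female"],
    "math": [55.0, 65.0, 70.0, 48.0, 63.0],
})
grouped = df.groupby("sex")["math"]

# Case 1: the function returns a scalar per group, which is broadcast to every row
group_mean = grouped.transform("mean")
# Case 2: the function returns an array of the same size as the group
demeaned = grouped.transform(lambda x: x - x.mean())

print(demeaned.groupby(df["sex"]).mean())
```

After subtracting the group mean, each group averages to zero (up to floating-point error), which is a quick way to confirm the transformation was applied per group.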

In [66]:
tips.head(5)

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_proportion,sex
0,16.99,1.01,No,Sun,Dinner,2,0.059447,male
1,10.34,1.66,No,Sun,Dinner,3,0.160542,female
2,21.01,3.5,No,Sun,Dinner,3,0.166587,female
3,23.68,3.31,No,Sun,Dinner,2,0.13978,male
4,24.59,3.61,No,Sun,Dinner,4,0.146808,female


#### Selecting the 5 highest tip_proportion values per group

In [72]:
# return the n rows of a DataFrame with the highest values in the given column
def top_5(df, by, n=5):
    return df.sort_values(by=by)[-n:]

top_5(tips, by='tip_proportion')

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_proportion,sex
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535,male
232,11.61,3.39,No,Sat,Dinner,2,0.29199,female
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733,female
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667,female
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345,female


In [75]:
# apply top_5 to each group
tips.groupby('smoker').apply(top_5, by='tip_proportion')

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,smoker,day,time,size,tip_proportion,sex
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
No,88,24.71,5.85,No,Thur,Lunch,2,0.236746,female
No,185,20.69,5.0,No,Sun,Dinner,5,0.241663,female
No,51,10.29,2.6,No,Sun,Dinner,2,0.252672,male
No,149,7.51,2.0,No,Thur,Lunch,2,0.266312,female
No,232,11.61,3.39,No,Sat,Dinner,2,0.29199,female
Yes,109,14.31,4.0,Yes,Sat,Dinner,2,0.279525,female
Yes,183,23.17,6.5,Yes,Sun,Dinner,4,0.280535,male
Yes,67,3.07,1.0,Yes,Sat,Dinner,1,0.325733,female
Yes,178,9.6,4.0,Yes,Sun,Dinner,2,0.416667,female
Yes,172,7.25,5.15,Yes,Sun,Dinner,2,0.710345,female


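As shown, apply called top_5 on each group and then concatenated the results into a DataFrame with a hierarchical index (the outer level holds the group values, the inner level the original DataFrame index).

#### Notes on using apply

**Note**: beyond these basic usages, getting the most out of apply largely depends on your **creativity**. What the passed function does is entirely up to you; it **only needs** to return a **pandas object** or a **scalar value**.

**PS**: if the function takes other arguments, pass them after the function in the call to apply, as follows:

    def t(data, a=1, b=2):
        return ...
    df.groupby(...).apply(t, a=3, b=4)  # arguments given after the function are forwarded to t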

#### Calling describe on a groupby

In [79]:
tips.groupby('sex').describe().stack(1)

Unnamed: 0_level_0,Unnamed: 1_level_0,size,tip,tip_proportion,total_bill
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
female,count,162.0,162.0,162.0,162.0
female,mean,2.598765,3.138395,0.166117,19.979877
female,std,0.948748,1.481206,0.065238,9.227265
female,min,1.0,1.0,0.06566,3.07
female,25%,2.0,2.0,0.136395,13.2125
female,50%,2.0,3.0,0.156907,17.905
female,75%,3.0,3.9375,0.1927,24.075
female,max,6.0,10.0,0.710345,50.81
male,count,82.0,82.0,82.0,82.0
male,mean,2.512195,2.721463,0.150303,19.402805


The call above is equivalent to the following (apart from the column order):
1. define the function to run on each group, here a lambda;
2. pass that lambda to apply.

In [77]:
tips.groupby('sex').apply(lambda x:x.describe())

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,size,tip_proportion
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
female,count,162.0,162.0,162.0,162.0
female,mean,19.979877,3.138395,2.598765,0.166117
female,std,9.227265,1.481206,0.948748,0.065238
female,min,3.07,1.0,1.0,0.06566
female,25%,13.2125,2.0,2.0,0.136395
female,50%,17.905,3.0,2.0,0.156907
female,75%,24.075,3.9375,3.0,0.1927
female,max,50.81,10.0,6.0,0.710345
male,count,82.0,82.0,82.0,82.0
male,mean,19.402805,2.721463,2.512195,0.150303


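### Suppressing the group keys

In the examples above, the group keys together with the index of the original object form a hierarchical index in the result. Passing **group_keys=False** to groupby disables this.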

In [82]:
tips.groupby('sex', group_keys=False).apply(top_5, by='tip_proportion', n=3)

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_proportion,sex
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733,female
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667,female
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345,female
51,10.29,2.6,No,Sun,Dinner,2,0.252672,male
93,16.32,4.3,Yes,Fri,Dinner,2,0.26348,male
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535,male


### Quantile and bucket analysis

In [83]:
df = DataFrame({'data1': np.random.randn(1000), 'data2': np.random.randn(1000)})
df.head(5)

Unnamed: 0,data1,data2
0,0.884944,0.853652
1,-1.07678,-1.299567
2,0.085905,0.154615
3,-0.305735,-1.547157
4,-1.99693,-0.828612


#### cut + groupby -- buckets of equal width, element counts may differ

In [116]:
def desc(g):
    return Series({
        'mean':g.mean(),
        'max':g.max(),
        'min':g.min(),
        'count':g.count()
    })
factor = pd.cut(df.data1, 5)
df.data2.groupby(factor).apply(desc).unstack() # the bucket counts follow a roughly normal shape

Unnamed: 0_level_0,count,max,mean,min
data1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"(-2.903, -1.735]",42.0,2.901241,0.059832,-1.565256
"(-1.735, -0.573]",234.0,3.770032,0.073398,-2.481241
"(-0.573, 0.589]",439.0,3.003025,0.107727,-2.886407
"(0.589, 1.751]",245.0,2.60455,-0.025059,-2.82941
"(1.751, 2.913]",40.0,2.426777,-0.246561,-2.915815


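**Summary**: the factor returned by cut is a **Series of bins**: its index is the **original data index** and its values are the corresponding **intervals** (a Categorical). Passing this factor straight to groupby relies on the ability to use a **Series** as the **group key**, with rows matched by index. In effect, you can **cut one column** and then use the resulting **Series** to run **group computations** on **any other column** of the original DataFrame.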
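
The difference between equal-width buckets (`pd.cut`) and equal-count buckets (the quantile-based `pd.qcut`) is easiest to see on skewed data. A small deterministic sketch (the ten values are made up for illustration):

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

equal_width = pd.cut(values, 2)   # two intervals of equal width over [1, 100]
equal_count = pd.qcut(values, 2)  # two intervals split at the median

width_counts = equal_width.value_counts(sort=False).tolist()
count_counts = equal_count.value_counts(sort=False).tolist()
print(width_counts, count_counts)
```

With cut, the single outlier stretches the range so that nine values share one bucket; with qcut, each bucket receives five values.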

#### qcut + groupby -- equal element counts per bucket, bucket widths may differ

In [119]:
factor_qcut = pd.qcut(df.data1, 5)
df.data2.groupby(factor_qcut).apply(desc).unstack()

Unnamed: 0_level_0,count,max,mean,min
data1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"(-2.898, -0.796]",200.0,2.901241,0.035315,-2.430407
"(-0.796, -0.215]",200.0,3.770032,0.141296,-2.481241
"(-0.215, 0.286]",200.0,2.725658,0.060881,-2.886407
"(0.286, 0.848]",200.0,2.346581,0.129282,-2.82941
"(0.848, 2.913]",200.0,2.60455,-0.111882,-2.915815


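### Example: filling missing values with group-specific values -- fillna

Sometimes you want to fill NA values with a fixed value or with a value derived from the data itself. fillna is the tool for this.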

In [120]:
df = DataFrame({'math':[66,77,88,55,44,np.nan,np.nan],
                'english':[46,76,56,65,54,np.nan,np.nan],
                'sex':['male','female','male','female','female','male','female']},
              index=['HoLoong','Lily','Mark','Murphy','Tina','John','Mary'])
df

Unnamed: 0,english,math,sex
HoLoong,46.0,66.0,male
Lily,76.0,77.0,female
Mark,56.0,88.0,male
Murphy,65.0,55.0,female
Tina,54.0,44.0,female
John,,,male
Mary,,,female


In [144]:
df.groupby(['sex'], group_keys=False).apply(lambda g:g.fillna(g.mean(numeric_only=True))) # fill each group's NaNs with that group's column means

Unnamed: 0,english,math,sex
Lily,76.0,77.0,female
Murphy,65.0,55.0,female
Tina,54.0,44.0,female
Mary,65.0,58.666667,female
HoLoong,46.0,66.0,male
Mark,56.0,88.0,male
John,51.0,77.0,male


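**Summary**: here **g** is one group, so the mean of g holds the average of every numeric column in that group. Passing it to **fillna** is equivalent to passing a **Series** that specifies **the fill value for the NaNs of each column**. Finally, group_keys=False removes the outermost male/female index level.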
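
The same fill can be written with transform, which keeps the original row order. A minimal sketch for the math column only (the variable names `group_mean` and `filled` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"math": [66, 77, 88, 55, 44, np.nan, np.nan],
     "sex": ["male", "female", "male", "female", "female", "male", "female"]},
    index=["HoLoong", "Lily", "Mark", "Murphy", "Tina", "John", "Mary"],
)

# Per-row group mean, aligned with the original index
group_mean = df.groupby("sex")["math"].transform("mean")
# fillna with a Series fills each NaN from the matching index label
filled = df["math"].fillna(group_mean)
print(filled)
```

John receives the male mean (77.0) and Mary the female mean (about 58.67), matching the apply-based result above.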

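### Example: random sampling and permutation

Suppose you want to **draw a random sample** from a **large dataset** for Monte Carlo simulation or some other analysis. There are many ways to perform the draw, and some are much more efficient than others. One way is to take **the first K elements of np.random.permutation(N)**, where **N** is the **size of the complete dataset** and **K** is the **desired sample size**.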

In [147]:
# Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
# list(range(...)) is needed in Python 3, where range no longer returns a list
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)
deck = Series(card_val, index=cards)
deck.head(5)

AH    1
2H    2
3H    3
4H    4
5H    5
dtype: int64

#### Drawing 5 cards at random

In [149]:
deck.take(np.random.permutation(len(deck))[:5])

3D     3
AH     1
8H     8
KC    10
5S     5
dtype: int64

#### Drawing 2 cards at random from each suit

In [158]:
deck.groupby(lambda x:x[-1]).apply(lambda g:g.take(np.random.permutation(len(g))[:2]))

C  6C     6
   9C     9
D  7D     7
   JD    10
H  8H     8
   QH    10
S  QS    10
   7S     7
dtype: int64
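
In pandas 1.1 and later, GroupBy objects also provide a `sample` method that performs the per-group draw directly. The sketch below rebuilds the deck in Python 3 syntax; `random_state=0` is an arbitrary choice that only makes the draw reproducible.

```python
import pandas as pd

suits = ["H", "S", "C", "D"]
base_names = ["A"] + list(range(2, 11)) + ["J", "K", "Q"]
card_val = (list(range(1, 11)) + [10] * 3) * 4
cards = [str(name) + suit for suit in suits for name in base_names]
deck = pd.Series(card_val, index=cards)

# Group by the last character of the card name (the suit) and draw 2 cards per suit
hand = deck.groupby(lambda card: card[-1]).sample(n=2, random_state=0)
print(hand)
```

Unlike the apply-based version, the result keeps the flat card index rather than adding the suit as an outer index level.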

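### Example: group weighted average and correlation -- a mean in which each value carries a weight

Under groupby's split-apply-combine paradigm, operations between columns of a DataFrame or between two Series (such as a group weighted average) become routine. Take the following dataset, which contains group keys, values, and some weights: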

In [159]:
df = DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                'data': np.random.randn(8),
                'weights': np.random.rand(8)})
df

Unnamed: 0,category,data,weights
0,a,0.491793,0.233183
1,a,1.274612,0.129594
2,a,-1.553217,0.753137
3,a,-0.536717,0.655118
4,b,-1.11469,0.749729
5,b,-0.061133,0.676752
6,b,-0.002359,0.302896
7,b,0.217114,0.988975


In [160]:
df.groupby('category').apply(lambda g:np.average(g.data, weights=g.weights))

category
a   -0.701025
b   -0.243928
dtype: float64

#### Example: S&P 500 index -- computing a DataFrame of the yearly correlations between daily returns (computed from percent changes) and SPX

In [161]:
# load the data
close_px = pd.read_csv('https://raw.githubusercontent.com/NemoHoHaloAi/pydata-book/2nd-edition/examples/stock_px.csv', 
                       parse_dates=True, index_col=0)
close_px.head(5)

Unnamed: 0,AA,AAPL,GE,IBM,JNJ,MSFT,PEP,SPX,XOM
1990-02-01,4.98,7.86,2.87,16.79,4.27,0.51,6.04,328.79,6.12
1990-02-02,5.04,8.0,2.87,16.89,4.37,0.51,6.09,330.92,6.24
1990-02-05,5.07,8.18,2.87,17.32,4.34,0.51,6.05,331.85,6.25
1990-02-06,5.01,8.12,2.88,17.56,4.32,0.51,6.15,329.66,6.23
1990-02-07,5.04,7.77,2.91,17.93,4.38,0.51,6.17,333.75,6.33


In [174]:
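# step 1: drop NaN
# step 2: group by the year attribute of the DatetimeIndex
# step 3: apply the group function
# group function: g.corrwith(g.SPX) computes, within each year, the correlation of each column's daily returns with the daily returns of SPX; the SPX column is 1.0 (SPX correlated with itself), which serves as a sanity check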
close_px.pct_change().dropna().groupby(lambda idx:idx.year).apply(lambda g:g.corrwith(g.SPX))

Unnamed: 0,AA,AAPL,GE,IBM,JNJ,MSFT,PEP,SPX,XOM
1990,0.595024,0.545067,0.752187,0.738361,0.801145,0.586691,0.783168,1.0,0.517586
1991,0.453574,0.365315,0.759607,0.557046,0.646401,0.524225,0.641775,1.0,0.569335
1992,0.39818,0.498732,0.632685,0.262232,0.51574,0.492345,0.473871,1.0,0.318408
1993,0.259069,0.238578,0.447257,0.211269,0.451503,0.425377,0.385089,1.0,0.318952
1994,0.428549,0.26842,0.572996,0.385162,0.372962,0.436585,0.450516,1.0,0.395078
1995,0.291532,0.161829,0.519126,0.41639,0.315733,0.45366,0.413144,1.0,0.368752
1996,0.292344,0.191482,0.750724,0.388497,0.569232,0.564015,0.421477,1.0,0.538736
1997,0.564427,0.211435,0.827512,0.646823,0.703538,0.606171,0.509344,1.0,0.695653
1998,0.533802,0.379883,0.815243,0.623982,0.591988,0.698773,0.494213,1.0,0.369264
1999,0.099033,0.425584,0.710928,0.486167,0.517061,0.631315,0.336593,1.0,0.315383


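### Example: group-wise linear regression

groupby can be used for **more complex** **group-wise statistical analysis**, as long as the function returns a **pandas object** or a **scalar value**. For example, the regress function below (which uses the statsmodels library) runs an ordinary least squares (OLS) regression on each chunk of data.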

In [183]:
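# run an ordinary least squares regression of yvar on xvars for one chunk of data
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars].copy()  # copy so that adding a column does not write into a slice of data
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params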

In [184]:
close_px.pct_change().dropna().groupby(lambda idx:idx.year).apply(regress, 'AAPL', ['SPX'])

Unnamed: 0,SPX,intercept
1990,1.512772,0.001395
1991,1.187351,0.000396
1992,1.832427,0.000164
1993,1.39047,-0.002657
1994,1.190277,0.001617
1995,0.858818,-0.001423
1996,0.829389,-0.001791
1997,0.749928,-0.001901
1998,1.164582,0.004075
1999,1.384989,0.003273
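
The same pattern can be checked on synthetic data where the true coefficients are known. This sketch swaps statsmodels for `np.linalg.lstsq` so that it depends only on NumPy and pandas; the helper `fit_line` and the data are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

x = np.arange(6, dtype=float)
df = pd.DataFrame({
    "group": ["a"] * 3 + ["b"] * 3,
    "x": x,
    # group a follows y = 2x + 1, group b follows y = -x
    "y": np.where(x < 3, 2.0 * x + 1.0, -1.0 * x),
})

def fit_line(g):
    # Least-squares fit of y = slope * x + intercept within one group
    design = np.column_stack([g["x"].to_numpy(), np.ones(len(g))])
    slope, intercept = np.linalg.lstsq(design, g["y"].to_numpy(), rcond=None)[0]
    return pd.Series({"slope": slope, "intercept": intercept})

params = df.groupby("group")[["x", "y"]].apply(fit_line)
print(params)
```

Because the function returns a Series, apply assembles one row of coefficients per group, just as in the yearly regression above.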


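## Pivot tables and cross-tabulation

### Pivot tables -- pivot_table

A pivot table is a data summarization tool commonly found in spreadsheet programs and other data analysis software. It **aggregates** data by **one or more keys** and arranges the results in a **rectangle**, with some of the **group keys** along the **rows** and some along the **columns**. In Python and pandas, pivot tables can be built with groupby combined with reshape operations that use hierarchical indexing. DataFrame has a **pivot_table** method, and there is also a top-level **pandas.pivot_table** function. Besides providing a convenient interface to groupby, pivot_table can add **partial totals** (also known as margins).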
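
The relationship between pivot_table and groupby can be made concrete. The sketch below, on a made-up six-row frame (`tips_small` is an illustrative assumption), builds the same table both ways.

```python
import numpy as np
import pandas as pd

tips_small = pd.DataFrame({
    "day": ["Fri", "Fri", "Fri", "Sat", "Sat", "Sat"],
    "smoker": ["No", "Yes", "Yes", "No", "No", "Yes"],
    "tip": [1.0, 2.0, 4.0, 3.0, 5.0, 6.0],
})

# pivot_table: day on the rows, smoker on the columns, mean tip in the cells
pivoted = tips_small.pivot_table(values="tip", index="day", columns="smoker", aggfunc="mean")
# the same table by hand: group by both keys, aggregate, then move smoker to the columns
manual = tips_small.groupby(["day", "smoker"])["tip"].mean().unstack()
print(pivoted)
```

Both tables hold the same group means; pivot_table simply packages the group-aggregate-unstack sequence in one call.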

#### Tips: group means by sex and smoker, with sex and smoker on the rows -- index

In [197]:
tips.pivot_table(index=['sex','smoker'])

Unnamed: 0_level_0,Unnamed: 1_level_0,size,tip,tip_proportion,total_bill
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
female,No,2.764706,3.180882,0.162939,19.860196
female,Yes,2.316667,3.066167,0.171521,20.183333
male,No,2.469388,2.598367,0.151813,17.789592
male,Yes,2.575758,2.904242,0.14806,21.798182


#### Tips: aggregate only tip_proportion and size, grouped by day, with smoker on the columns and day on the rows -- values list, columns

In [199]:
tips.pivot_table(['tip_proportion','size'], index=['day'], columns=['smoker'])

Unnamed: 0_level_0,size,size,tip_proportion,tip_proportion
smoker,No,Yes,No,Yes
day,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Fri,2.25,2.066667,0.15165,0.174783
Sat,2.555556,2.47619,0.158048,0.147906
Sun,2.929825,2.578947,0.160113,0.18725
Thur,2.488889,2.352941,0.160298,0.163863


#### Tips: partial totals -- margins

In [203]:
tips.pivot_table(['tip_proportion','size'], index=['day'], columns=['smoker'], aggfunc=len, margins=True)

Unnamed: 0_level_0,size,size,size,tip_proportion,tip_proportion,tip_proportion
smoker,No,Yes,All,No,Yes,All
day,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Fri,4.0,15.0,19.0,4.0,15.0,19.0
Sat,45.0,42.0,87.0,45.0,42.0,87.0
Sun,57.0,19.0,76.0,57.0,19.0,76.0
Thur,45.0,17.0,62.0,45.0,17.0,62.0
All,151.0,93.0,244.0,151.0,93.0,244.0


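**Summary**: because aggfunc=len counts rows, each value in the All column is the sum of the No and Yes counts in its row (for example 4 + 15 = 19 for Fri), and each value in the All row is the sum of the counts in its column. More generally, the All entries are obtained by applying aggfunc to all the data in that row or column.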
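
The margin is computed from the underlying data, not from the cells of the table, which matters for aggregations other than counts. A sketch on made-up data (`tips_small` is an illustrative assumption):

```python
import pandas as pd

tips_small = pd.DataFrame({
    "day": ["Fri", "Fri", "Fri", "Sat", "Sat"],
    "smoker": ["No", "No", "Yes", "No", "Yes"],
    "tip": [1.0, 3.0, 5.0, 2.0, 7.0],
})

counts = tips_small.pivot_table(values="tip", index="day", columns="smoker",
                                aggfunc=len, margins=True)
means = tips_small.pivot_table(values="tip", index="day", columns="smoker",
                               aggfunc="mean", margins=True)
print(counts)
print(means)
```

For Fri, the count margin is 2 + 1 = 3, while the mean margin is the mean of the three Fri tips (3.0), not the mean of the two cell means (3.5).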

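#### pivot_table parameters

* values: the column name or names to aggregate; by default all numeric columns are aggregated;
* index: column names or other group keys to group on the rows of the resulting pivot table;
* columns: column names or other group keys to group on the columns of the resulting pivot table;
* aggfunc: the aggregation function or list of functions; the default is the mean;
* fill_value: the value that replaces missing values in the result table;
* margins: whether to add the row/column subtotals and the grand total (All).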

### Cross-tabulation -- pd.crosstab

A cross-tabulation (crosstab for short) is a **special case of a pivot table** that computes **group frequencies**.

In [207]:
df = DataFrame({'sex':['male','female','female','male','male','female'],'pet':['dog','cat','dog','cat','cat','cat']})
df

Unnamed: 0,pet,sex
0,dog,male
1,cat,female
2,dog,female
3,cat,male
4,cat,male
5,cat,female


In [208]:
pd.crosstab(df.sex, df.pet, margins=True)

pet,cat,dog,All
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,2,1,3
male,2,1,3
All,4,2,6


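Summary: crosstab tabulates the data by its first two arguments (each can be a Series, an array, or a list of arrays). It is roughly equivalent to calling pivot_table with index and columns and a counting aggregation.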
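
The frequency table can be reproduced by hand with groupby, which shows what crosstab computes. A minimal sketch using the sex/pet data from above (the variable names `table` and `manual` are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "female"],
    "pet": ["dog", "cat", "dog", "cat", "cat", "cat"],
})

table = pd.crosstab(df["sex"], df["pet"])
# by hand: count rows per (sex, pet) pair, then move pet to the columns
manual = df.groupby(["sex", "pet"]).size().unstack(fill_value=0)
print(table)
```

Both tables contain the same counts; `fill_value=0` in the manual version plays the role crosstab performs automatically for combinations that never occur.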

In [211]:
pd.crosstab([tips.sex, tips['size']], tips.smoker, margins=True) # tips.size cannot be used for the column here: size is a DataFrame attribute (the element count), so use tips['size']

Unnamed: 0_level_0,smoker,No,Yes,All
sex,size,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,1.0,1,1,2
female,2.0,56,46,102
female,3.0,18,7,25
female,4.0,22,5,27
female,5.0,3,1,4
female,6.0,2,0,2
male,1.0,1,1,2
male,2.0,34,20,54
male,3.0,8,5,13
male,4.0,4,6,10


## Example: 2012 Federal Election Commission database

## alert

### The difference between df['a'] and df[['a']]

In [189]:
df = DataFrame({'a':[12,23,34],'b':[11,22,33]})

#### df['a']

In [190]:
df['a']

0    12
1    23
2    34
Name: a, dtype: int64

#### df[['a']]

In [191]:
df[['a']]

Unnamed: 0,a
0,12
1,23
2,34


In [193]:
df[['a','b']]

Unnamed: 0,a,b
0,12,11
1,23,22
2,34,33


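#### Summary

* df['a']:
    * returns a Series for the named column;
    * cannot name several columns, because several columns do not fit in one Series;
    * yields one column of the original DataFrame as data of one dimension less;
* df[['a']]:
    * returns a subset of the DataFrame that is still a DataFrame;
    * can name several columns, because a DataFrame can hold many columns;
    * keeps the dimensionality and yields the N selected columns.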