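**Axes are a property defined for arrays with more than one dimension. Two-dimensional data has two axes: axis 0 runs vertically down the rows, and axis 1 runs horizontally across the columns.**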

## GroupBy Mechanics

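Grouping keys can take several forms, and the keys do not all have to be of the same type:
- A list or array of values with the same length as the axis being grouped.
- A value naming a column of the DataFrame.
- A dict or Series that maps the values on the axis being grouped to group names.
- A function that is called on the axis index, or on each individual label in the index.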
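The four kinds of key listed above can be tried side by side on a tiny frame. This is only a sketch: `toy`, `label_to_group`, and the result names are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy data, small enough to check the sums by hand.
toy = pd.DataFrame(
    {"team": ["x", "x", "y", "y"], "score": [1, 2, 3, 4]},
    index=["ann", "bob", "cy", "dee"],
)

# 1. A list (or array) with one entry per row.
by_list = toy["score"].groupby(["p", "q", "p", "q"]).sum()

# 2. A column name.
by_name = toy.groupby("team")["score"].sum()

# 3. A dict mapping index labels to group names.
label_to_group = {"ann": "old", "bob": "new", "cy": "old", "dee": "new"}
by_dict = toy["score"].groupby(label_to_group).sum()

# 4. A function called once per index label (here, the length of the label).
by_func = toy["score"].groupby(len).sum()

print(by_list, by_name, by_dict, by_func, sep="\n\n")
```

In every case pandas ends up with one group label per row; the four forms are just different ways of supplying those labels.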

In [11]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})

In [12]:
df

      data1     data2 key1 key2
0 -0.352756 -0.133201    a  one
1 -0.751804 -1.447209    a  two
2 -0.379892  2.122472    b  one
3  0.367262  0.722350    b  two
4 -0.207917  0.236865    a  one


In [5]:
grouped = df['data1'].groupby(df['key1'])

In [7]:
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x7fc3f6f9e0b8>

In [13]:
grouped.mean()

key1
a   -0.437493
b   -0.006315
Name: data1, dtype: float64

In [17]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

In [18]:
means

key1  key2
a     one    -0.280337
      two    -0.751804
b     one    -0.379892
      two     0.367262
Name: data1, dtype: float64

In [19]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

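### Iterating over Groups

A GroupBy object supports iteration, producing a sequence of 2-tuples that each contain the group name and the corresponding chunk of data.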

In [20]:
df.groupby('key1')

<pandas.core.groupby.DataFrameGroupBy object at 0x7fc3f6f6ee10>

In [22]:
df

      data1     data2 key1 key2
0 -0.352756 -0.133201    a  one
1 -0.751804 -1.447209    a  two
2 -0.379892  2.122472    b  one
3  0.367262  0.722350    b  two
4 -0.207917  0.236865    a  one


In [29]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
      data1     data2 key1 key2
0 -0.352756 -0.133201    a  one
1 -0.751804 -1.447209    a  two
4 -0.207917  0.236865    a  one
b
      data1     data2 key1 key2
2 -0.379892  2.122472    b  one
3  0.367262  0.722350    b  two


With multiple keys, the first element of each tuple is itself a tuple of key values:

In [30]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

('a', 'one')
      data1     data2 key1 key2
0 -0.352756 -0.133201    a  one
4 -0.207917  0.236865    a  one
('a', 'two')
      data1     data2 key1 key2
1 -0.751804 -1.447209    a  two
('b', 'one')
      data1     data2 key1 key2
2 -0.379892  2.122472    b  one
('b', 'two')
      data1    data2 key1 key2
3  0.367262  0.72235    b  two


You can, of course, do whatever you like with these pieces of data. One recipe you may find useful is to collect them into a dict:

In [31]:
pieces = dict(list(df.groupby('key1')))

In [32]:
pieces

{'a':       data1     data2 key1 key2
 0 -0.352756 -0.133201    a  one
 1 -0.751804 -1.447209    a  two
 4 -0.207917  0.236865    a  one, 'b':       data1     data2 key1 key2
 2 -0.379892  2.122472    b  one
 3  0.367262  0.722350    b  two}

By default groupby groups on axis=0, but you can group on any of the other axes. Taking the example df above, we can group its columns by dtype:

In [37]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [38]:
grouped = df.groupby(df.dtypes, axis=1)

In [40]:
for dtype, group in grouped:
    print(dtype)
    print(group)

float64
      data1     data2
0 -0.352756 -0.133201
1 -0.751804 -1.447209
2 -0.379892  2.122472
3  0.367262  0.722350
4 -0.207917  0.236865
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
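Recent pandas releases (2.1 and later, as far as I know) deprecate the `axis=1` argument of `groupby`, so the call above may warn or fail depending on the installed version. One possible workaround, sketched here on a hypothetical frame `toy`, is to group the column labels themselves and then select each set of columns:

```python
import pandas as pd

# Hypothetical frame with one text column and two float columns.
toy = pd.DataFrame({"key": ["a", "b"], "x": [1.0, 2.0], "y": [3.0, 4.0]})

# Use the dtype name (a plain string) of every column as the grouping key.
dtype_names = toy.dtypes.astype(str)

# Group the column labels by dtype name, then pull those columns out of toy.
columns_by_dtype = {
    name: toy[list(cols)]
    for name, cols in toy.columns.to_series().groupby(dtype_names)
}

for name, block in columns_by_dtype.items():
    print(name)
    print(block)
```

Each value in `columns_by_dtype` is an ordinary DataFrame holding only the columns of that dtype.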


### Selecting a Column or Subset of Columns

Compute the mean of just the data2 column and get the result back as a DataFrame:

In [41]:
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2
a    one   0.051832
     two  -1.447209
b    one   2.122472
     two   0.722350


In [51]:
s_grouped = df.groupby(['key1', 'key2'])['data2']

In [54]:
s_grouped.mean()

key1  key2
a     one     0.051832
      two    -1.447209
b     one     2.122472
      two     0.722350
Name: data2, dtype: float64
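Indexing a GroupBy object with a column name is shorthand for grouping that column by the key. The sketch below checks the equivalence on a hypothetical frame `toy` (not the notebook's random data), and shows that a list of names yields a DataFrame while a single name yields a Series:

```python
import pandas as pd

# Hypothetical data with easy-to-verify group means.
toy = pd.DataFrame({
    "key1": ["a", "a", "b"],
    "data1": [1.0, 3.0, 5.0],
    "data2": [2.0, 4.0, 6.0],
})

# Single column name -> SeriesGroupBy -> Series result.
short_form = toy.groupby("key1")["data1"].mean()

# The equivalent long form: group the column by the key Series.
long_form = toy["data1"].groupby(toy["key1"]).mean()

# List of column names -> DataFrameGroupBy -> DataFrame result.
as_frame = toy.groupby("key1")[["data1"]].mean()

print(short_form.equals(long_form))
print(type(as_frame).__name__)
```

Selecting columns before aggregating also avoids computing statistics for columns you do not need, which matters on large datasets.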

### Grouping with Dicts and Series

In [56]:
people = pd.DataFrame(np.random.randn(5, 5),columns=['a', 'b', 'c', 'd', 'e'],index=['Joe', 'Steve', 'Wes', 'Jim','Travis'])

In [58]:
people

               a         b         c         d         e
Joe    -1.804495  0.026609 -0.403145 -0.737225 -0.886117
Steve  -0.373910 -0.793373  0.374623  2.093259 -1.309418
Wes    -0.767725 -0.728805 -0.149424  0.373313 -0.676941
Jim     0.768895 -1.453106  1.635168  0.993487  0.172099
Travis -0.692005  0.169683  0.606582 -1.335459 -0.116094


In [59]:
people.iloc[2:3, [1, 2]] = np.nan

In [60]:
people

               a         b         c         d         e
Joe    -1.804495  0.026609 -0.403145 -0.737225 -0.886117
Steve  -0.373910 -0.793373  0.374623  2.093259 -1.309418
Wes    -0.767725       NaN       NaN  0.373313 -0.676941
Jim     0.768895 -1.453106  1.635168  0.993487  0.172099
Travis -0.692005  0.169683  0.606582 -1.335459 -0.116094


Now suppose we know how the columns map to groups and want to sum the columns by group:

In [63]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue','d': 'blue', 'e': 'red', 'f' : 'orange'}

In [64]:
mapping

{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

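You could build an array from this dict and pass that to groupby, but we can simply pass the dict directly (I included the key "f" to show that unused grouping keys are fine):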

In [65]:
by_column = people.groupby(mapping, axis=1)

In [66]:
by_column.sum()

            blue       red
Joe    -1.140370 -2.664003
Steve   2.467883 -2.476701
Wes     0.373313 -1.444667
Jim     2.628655 -0.512112
Travis -0.728877 -0.638415


A Series works the same way; it can be viewed as a fixed-size mapping:

In [67]:
map_series = pd.Series(mapping)

In [68]:
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [72]:
people.groupby(map_series,axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3


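### Grouping with Functions

Compared with a dict or Series, a Python function is a more general way of defining a group mapping. Any function passed as a group key is called once per index value, and its return values are used as the group names. Concretely, take the example DataFrame from the previous section, whose index values are people's first names. Suppose you want to group by the length of the names: you could compute an array of string lengths, but it is simpler to pass the len function: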

In [74]:
people

               a         b         c         d         e
Joe    -1.804495  0.026609 -0.403145 -0.737225 -0.886117
Steve  -0.373910 -0.793373  0.374623  2.093259 -1.309418
Wes    -0.767725       NaN       NaN  0.373313 -0.676941
Jim     0.768895 -1.453106  1.635168  0.993487  0.172099
Travis -0.692005  0.169683  0.606582 -1.335459 -0.116094


In [78]:
# Group rows by the length of each index label (via len), then sum each group
people.groupby(len).sum()

          a         b         c         d         e
3 -1.803325 -1.426497  1.232024  0.629575 -1.390959
5 -0.373910 -0.793373  0.374623  2.093259 -1.309418
6 -0.692005  0.169683  0.606582 -1.335459 -0.116094


Mixing functions with arrays, lists, dicts, or Series is not a problem, because everything is converted to arrays internally:

In [80]:
key_list = ['one', 'one', 'one', 'two', 'two']

In [81]:
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one -1.804495  0.026609 -0.403145 -0.737225 -0.886117
  two  0.768895 -1.453106  1.635168  0.993487  0.172099
5 one -0.373910 -0.793373  0.374623  2.093259 -1.309418
6 two -0.692005  0.169683  0.606582 -1.335459 -0.116094


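### Grouping by Index Levels

A final convenience of hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index: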

In [83]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US','JP', 'JP'],[1, 3, 5, 1, 3]],names=['cty', 'tenor'])

In [84]:
columns

MultiIndex(levels=[['JP', 'US'], [1, 3, 5]],
           labels=[[1, 1, 1, 0, 0], [0, 1, 2, 0, 1]],
           names=['cty', 'tenor'])

In [87]:
hier_df = pd.DataFrame(np.random.randn(4, 5),columns=columns)

In [88]:
hier_df

cty          US                            JP
tenor         1         3         5         1         3
0      2.485399 -0.920189  0.323174  0.418062  0.038732
1     -0.340086 -0.145115  0.491090  0.088428  0.014071
2     -1.019425  0.504878  0.179464 -0.876748 -0.290613
3     -0.739021  0.953929  1.284472 -0.126326 -0.658911


In [90]:
hier_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
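The `level` argument works the same way when the hierarchy sits on the row index, which needs no `axis` argument at all. A minimal sketch, with a hypothetical `toy` frame and integer values chosen so the results are easy to verify:

```python
import pandas as pd

# Hypothetical two-level row index mirroring the cty/tenor layout above.
rows = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)
toy = pd.DataFrame({"rate": [1, 2, 3, 4, 5]}, index=rows)

# Count the rows belonging to each country.
per_country = toy.groupby(level="cty").count()

# Sum the rate for each tenor, across countries.
per_tenor = toy.groupby(level="tenor")["rate"].sum()

print(per_country)
print(per_tenor)
```

Levels can be referred to by name, as here, or by integer position.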


## Data Aggregation

In [96]:
grouped = df.groupby('key1')

In [97]:
grouped

<pandas.core.groupby.DataFrameGroupBy object at 0x7fc3f6c1b5c0>

In [98]:
grouped.quantile(0.9)

0.9      data1     data2
key1
a    -0.236885  0.162851
b     0.292546  1.982460


To use your own aggregation function, just pass it to the aggregate or agg method:

In [100]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

In [101]:
grouped.agg(peak_to_peak)

         data1     data2
key1
a     0.543887  1.684074
b     0.747154  1.400122
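A custom function can also be combined with built-in aggregations by passing a list to `agg`. The sketch below uses a hypothetical frame `toy` with hand-checkable numbers instead of the random data above:

```python
import pandas as pd

def peak_to_peak(arr):
    # Range of the group: largest value minus smallest value.
    return arr.max() - arr.min()

# Hypothetical data: group "a" spans 1..4, group "b" spans 10..12.
toy = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "data1": [1.0, 4.0, 10.0, 12.0],
})

# Built-in aggregations are named by string; custom ones are passed as functions.
summary = toy.groupby("key1")["data1"].agg(["mean", peak_to_peak])
print(summary)
```

The result has one column per aggregation, and the custom function's column is labeled with the function's name.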


You may notice that some methods, such as describe, also work here, even though strictly speaking they are not aggregations:

In [102]:
grouped.describe()

     data1                                                                        \
     count      mean       std       min       25%       50%       75%       max
key1
a      3.0 -0.437493  0.281671 -0.751804 -0.552280 -0.352756 -0.280337 -0.207917
b      2.0 -0.006315  0.528318 -0.379892 -0.193104 -0.006315  0.180473  0.367262

     data2
     count      mean       std       min       25%       50%       75%       max
key1
a      3.0 -0.447849  0.885030 -1.447209 -0.790205 -0.133201  0.051832  0.236865
b      2.0  1.422411  0.990036  0.722350  1.072381  1.422411  1.772441  2.122472


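## apply: General split-apply-combine

The most general-purpose GroupBy method is apply, which is the focus of the rest of this section. apply splits the object being manipulated into pieces, calls the passed function on each piece, and then attempts to concatenate the pieces back together.

To select the five largest tip_pct values within each group, first write a function that returns the rows with the largest values in a given column: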

In [None]:
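def top(df, n=5, column='tip_pct'):
    # Sort ascending by the chosen column and keep the last n rows (the n largest).
    return df.sort_values(by=column)[-n:]
# `tips` is assumed to be a DataFrame with a 'tip_pct' column, loaded beforehand.
top(tips, n=6)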
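Since `tips` is not loaded in this notebook, here is a sketch of the group-wise call on a small stand-in frame. `toy_tips` and its values are hypothetical, as is the choice of `smoker` as the grouping column:

```python
import pandas as pd

def top(df, n=5, column="tip_pct"):
    # Sort ascending and keep the last n rows, i.e. the n largest values.
    return df.sort_values(by=column)[-n:]

# Hypothetical stand-in for the tips dataset.
toy_tips = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tip_pct": [0.10, 0.20, 0.15, 0.30, 0.05, 0.25],
})

# top is called once per group; extra keyword arguments are forwarded to it.
top2 = toy_tips.groupby("smoker")[["tip_pct"]].apply(top, n=2)
print(top2)
```

The result is indexed by the group key plus the original row labels, because apply glues the per-group pieces together and labels each piece with its group name.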

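### Random Sampling and Permutation

Suppose you want to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation or some other analysis. There are many ways to perform the "draws"; here we use the sample method of Series: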

In [104]:
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []

In [106]:
base_names

['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'K', 'Q']

In [107]:
for suit in ['H', 'S', 'C', 'D']:
    cards.extend(str(num) + suit for num in base_names)

In [112]:
deck = pd.Series(card_val, index=cards)

In [111]:
deck

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
AS      1
2S      2
3S      3
4S      4
5S      5
6S      6
7S      7
8S      8
9S      9
10S    10
JS     10
KS     10
QS     10
AC      1
2C      2
3C      3
4C      4
5C      5
6C      6
7C      7
8C      8
9C      9
10C    10
JC     10
KC     10
QC     10
AD      1
2D      2
3D      3
4D      4
5D      5
6D      6
7D      7
8D      8
9D      9
10D    10
JD     10
KD     10
QD     10
dtype: int64

In [114]:
deck[:13]

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64

In [115]:
def draw(deck, n=5):
    return deck.sample(n)

In [116]:
draw(deck)

7D      7
2S      2
5S      5
10S    10
KC     10
dtype: int64

Suppose you want to draw two random cards from each suit. Because the suit is the last character of each card name, we can group on that and use apply:

In [119]:
get_suit = lambda card: card[-1] # last letter is suit

In [120]:
deck.groupby(get_suit).apply(draw, n=2)

C  7C      7
   10C    10
D  9D      9
   3D      3
H  10H    10
   5H      5
S  AS      1
   QS     10
dtype: int64

In [121]:
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)

2C     2
8C     8
KD    10
5D     5
2H     2
6H     6
9S     9
7S     7
dtype: int64
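The draws above change on every run. For a repeatable experiment, `sample` accepts a `random_state`, and `replace=True` switches to sampling with replacement. In this sketch the `seed` parameter of `draw` is a hypothetical addition, not part of the original function:

```python
import pandas as pd

# Rebuild the same 52-card deck used above.
suits = ["H", "S", "C", "D"]
base_names = ["A"] + list(range(2, 11)) + ["J", "K", "Q"]
card_val = (list(range(1, 11)) + [10] * 3) * 4
deck = pd.Series(card_val, index=[str(n) + s for s in suits for n in base_names])

def draw(cards, n=5, seed=None):
    # seed (hypothetical extra parameter) is forwarded to sample's random_state.
    return cards.sample(n, random_state=seed)

# Two cards per suit, reproducible across runs.
hand = deck.groupby(lambda card: card[-1], group_keys=False).apply(draw, n=2, seed=0)

# Sampling with replacement allows more draws than there are cards.
resampled = deck.sample(60, replace=True, random_state=0)

print(hand)
```

Note that passing the same seed to every group selects the same positions within each suit; that is fine for a demonstration but probably not what a real simulation wants.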

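### Group-Wise Linear Regression

Continuing in the same vein, you can use groupby for more complex group-wise statistical analysis, as long as the function returns a pandas object or a scalar value. For example, I can define the following regress function (using the statsmodels econometrics library), which runs an ordinary least squares (OLS) regression on each chunk of data: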

In [122]:
import statsmodels.api as sm


In [129]:
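def regress(data, yvar, xvars):
    Y = data[yvar]
    # Copy so that adding a column does not touch (or warn about) the original data
    X = data[xvars].copy()
    # Constant column so that OLS also estimates an intercept
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params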

Now, to run a yearly linear regression of AAPL returns on SPX returns, execute:

In [None]:
by_year.apply(regress, 'AAPL', ['SPX'])
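`by_year` and the returns data are not defined in this notebook, so the call above cannot be run as is. The sketch below reproduces the idea on made-up data, and it swaps statsmodels for `numpy.linalg.lstsq` to stay self-contained. `regress_np`, `toy`, and the `year` column are all hypothetical:

```python
import numpy as np
import pandas as pd

def regress_np(data, yvar, xvars):
    # Least-squares fit with an intercept, using numpy instead of statsmodels.
    X = data[xvars].to_numpy(dtype=float)
    X = np.column_stack([X, np.ones(len(X))])
    y = data[yvar].to_numpy(dtype=float)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(coefs, index=list(xvars) + ["intercept"])

# Made-up data: AAPL = 2*SPX + 1 in 2010 and AAPL = -SPX + 3 in 2011.
toy = pd.DataFrame({
    "year": [2010, 2010, 2010, 2011, 2011, 2011],
    "SPX": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
})
toy["AAPL"] = np.where(toy["year"] == 2010, 2 * toy["SPX"] + 1, 3 - toy["SPX"])

# One fit per year; each fit returns a Series, so the result is a DataFrame.
params = toy.groupby("year")[["AAPL", "SPX"]].apply(regress_np, "AAPL", ["SPX"])
print(params)
```

Because every call returns a Series with the same labels, apply stacks the fits into one row per year with the slope and intercept as columns.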

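## Pivot Tables and Cross-Tabulation

A pivot table is a data summarization tool commonly found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys and arranges the results in a rectangle, with some of the group keys along the rows and some along the columns. In Python with pandas, pivot tables can be built with the groupby facility described in this chapter, combined with reshape operations that use hierarchical indexing.

DataFrame has a pivot_table method. Besides providing a convenient interface to groupby, pivot_table can also add partial totals, known as margins.

A cross-tabulation (crosstab for short) is a special case of a pivot table that computes group frequencies.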
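A short sketch of both tools on a hypothetical frame `toy`; the column names `day`, `smoker`, and `tip` are chosen only for illustration:

```python
import pandas as pd

# Hypothetical data: five bills across two days.
toy = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "No", "Yes"],
    "tip": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Mean tip per (day, smoker) cell; margins=True adds the "All" row and column.
table = toy.pivot_table(values="tip", index="day", columns="smoker",
                        aggfunc="mean", margins=True)

# Frequency of each (day, smoker) combination, with totals.
counts = pd.crosstab(toy["day"], toy["smoker"], margins=True)

print(table)
print(counts)
```

The "All" entries of the pivot table are means over the underlying rows, not averages of the cell means, while the "All" entries of the crosstab are simple row and column counts.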