In [1]:
import numpy as np
import pandas as pd

# Basic usage

In [2]:
# Create the base data
index = pd.date_range('2020/01/01', periods=8)
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
2020-01-01,0.174623,1.333081,1.716372
2020-01-02,0.456879,0.035471,0.512009
2020-01-03,1.07838,-0.803751,-1.144742
2020-01-04,-0.671246,0.537099,-0.521711
2020-01-05,0.193148,0.546439,-1.003801
2020-01-06,-0.619001,-1.318154,0.301367
2020-01-07,-0.227666,-0.138881,1.015857
2020-01-08,-0.496451,-1.148134,0.961194


## Head and tail

In [3]:
# head() and tail() give a quick preview of a Series or DataFrame. They show 5 rows by default; pass a number to show more or fewer.
long_series = pd.Series(np.random.randn(1000))
long_series.head()

0    0.882959
1    0.889588
2   -0.489109
3    2.031270
4   -0.329551
dtype: float64

In [4]:
long_series.tail(3)

997   -1.301484
998   -0.864606
999    0.472036
dtype: float64

## Attributes and underlying data

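Pandas objects expose their metadata through several attributes:
- shape
    - gives the axis dimensions of the object, consistent with ndarray
- axis labels
    - Series: index (the only axis)
    - DataFrame: index (rows) and columns

Note: **these attributes can be safely assigned to!**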

In [5]:
df[:2]

Unnamed: 0,A,B,C
2020-01-01,0.174623,1.333081,1.716372
2020-01-02,0.456879,0.035471,0.512009


In [6]:
df.columns = [x.lower() for x in df.columns]
df

Unnamed: 0,a,b,c
2020-01-01,0.174623,1.333081,1.716372
2020-01-02,0.456879,0.035471,0.512009
2020-01-03,1.07838,-0.803751,-1.144742
2020-01-04,-0.671246,0.537099,-0.521711
2020-01-05,0.193148,0.546439,-1.003801
2020-01-06,-0.619001,-1.318154,0.301367
2020-01-07,-0.227666,-0.138881,1.015857
2020-01-08,-0.496451,-1.148134,0.961194


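Pandas objects such as Series and DataFrame can be thought of as containers for arrays, which hold the actual data and do the actual computation. For most types, the underlying array is a numpy.ndarray.
However, Pandas and third-party libraries may extend NumPy's type system to add support for custom arrays.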

To get the actual data inside an Index or a Series, use the .array property.

In [7]:
s.array

<PandasArray>
[  1.1153324712758892,   1.1125438949037862,  -1.0535782141452155,
 -0.05545301904826901,  -0.5423105004783969]
Length: 5, dtype: float64

In [8]:
s.index.array

<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

array generally refers to an ExtensionArray. What an ExtensionArray is, and why Pandas uses them, is beyond the scope of this section. See the data types documentation for more.

To get a NumPy array, use to_numpy() or numpy.asarray().

In [9]:
s.to_numpy()

array([ 1.11533247,  1.11254389, -1.05357821, -0.05545302, -0.5423105 ])

In [10]:
np.asarray(s)

array([ 1.11533247,  1.11254389, -1.05357821, -0.05545302, -0.5423105 ])

When a Series or Index is backed by an ExtensionArray, to_numpy() may copy the data and coerce the values. See the data types documentation for details.
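
As a minimal sketch of the coercion described above, assume a Series backed by the nullable Int64 extension dtype. The data and the names `nullable`, `ext`, and `as_float` are illustrative and not part of the original notebook. Asking to_numpy() for a float array converts the values and represents the missing entry as NaN:

```python
import numpy as np
import pandas as pd

# A Series backed by an ExtensionArray (nullable integers); illustrative data
nullable = pd.Series([1, 2, None], dtype="Int64")

# .array returns the ExtensionArray itself
ext = nullable.array

# to_numpy() coerces the values; na_value says how to represent the missing entry
as_float = nullable.to_numpy(dtype="float64", na_value=np.nan)
print(type(ext).__name__, as_float.dtype)
```

The explicit `dtype` and `na_value` make the conversion deliberate instead of leaving the result dtype to inference.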

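to_numpy() gives some control over the dtype of the resulting numpy.ndarray. Take datetimes with time zones as an example. NumPy has no dtype for timezone-aware datetimes, so Pandas offers two possible representations:

1. An object-dtype numpy.ndarray of Timestamp objects, each carrying the correct tz.

2. A datetime64[ns] numpy.ndarray, where the values have been converted to UTC and the time zone discarded.

The time zone can be preserved with dtype=object.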

In [11]:
ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))
ser

0   2000-01-01 00:00:00+01:00
1   2000-01-02 00:00:00+01:00
dtype: datetime64[ns, CET]

In [12]:
ser.to_numpy(dtype=object)

array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
      dtype=object)

Or use dtype="datetime64[ns]" to drop the time zone and convert the values to UTC:

In [13]:
# The values are converted to UTC and the time zone is dropped
ser.to_numpy(dtype="datetime64[ns]")

array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')

Getting the raw data out of a DataFrame is slightly more involved. When every column of the DataFrame has the same data type, DataFrame.to_numpy() returns the underlying data:



In [14]:
df.to_numpy()

array([[ 0.17462318,  1.33308085,  1.71637172],
       [ 0.45687853,  0.03547119,  0.51200908],
       [ 1.07838044, -0.803751  , -1.144742  ],
       [-0.67124648,  0.53709917, -0.52171084],
       [ 0.19314827,  0.54643863, -1.00380114],
       [-0.61900127, -1.31815396,  0.30136731],
       [-0.22766645, -0.13888088,  1.01585668],
       [-0.4964515 , -1.14813403,  0.9611945 ]])

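> Note:  
When a DataFrame holds homogeneously typed data, the underlying ndarray can be modified in place, and the changes are reflected in the data structure. For heterogeneous data, where the DataFrame's columns do not all share one dtype, this is not the case. Unlike the axis labels, the values attribute itself cannot be assigned to.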

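Pandas used to recommend Series.values or DataFrame.values for extracting the data from a Series or DataFrame. You will still find this in older code bases and online tutorials, but Pandas has since improved on it: prefer .array or to_numpy() and avoid .values. .values has the following drawbacks:
1. When a Series contains an extension type, it is unclear whether Series.values returns a NumPy array or an ExtensionArray. Series.array always returns an ExtensionArray and never copies the data. Series.to_numpy() always returns a NumPy array, possibly at the cost of copying and coercing the values.
2. When a DataFrame contains a mix of data types, DataFrame.values may copy the data and coerce the values to a common dtype, which is a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.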
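
The cost of mixed dtypes can be seen in a small sketch. The frames and names below are hypothetical and only serve as illustration. to_numpy() has to pick one dtype that can hold every column, so numeric columns are upcast and a string column forces object dtype:

```python
import pandas as pd

# Hypothetical frames used only for illustration
numeric = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
mixed = pd.DataFrame({"i": [1, 2], "s": ["a", "b"]})

# int64 + float64 columns are coerced to a common float64 array
numeric_arr = numeric.to_numpy()

# int64 + string columns can only share the generic object dtype
mixed_arr = mixed.to_numpy()

print(numeric_arr.dtype, mixed_arr.dtype)
```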

## Accelerated operations

With the numexpr and bottleneck libraries, Pandas can accelerate certain types of binary numerical and boolean operations.

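These libraries are especially useful when dealing with large data sets, and the speedups are substantial. numexpr uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized Cython routines that are especially fast on arrays containing NaNs.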

Here is a sample (a DataFrame with 100 columns x 100,000 rows):


| Operation | 0.11.0 (ms) | Prior version (ms) | Ratio to prior |
| ---- | ---- | ---- | ---- |
| df1 > df2 | 13.32 | 125.35 | 0.1063 |
| df1 * df2 | 21.71 | 36.63 | 0.5928 |
| df1 + df2 | 22.04 | 36.50 | 0.6039 |


Both libraries are used by default when they are installed. This can be controlled with the following options:

In [15]:
# pd.set_option('compute.use_bottleneck', False)
# pd.set_option('compute.use_numexpr', False)

## Binary operations

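When performing binary operations between Pandas data structures, there are two key points of interest:
- broadcasting behavior between higher-dimensional (DataFrame) and lower-dimensional (Series) objects;
- missing data in computations.

These two issues can be handled at the same time, but the sections below first show how to manage them independently.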

### Broadcasting behavior


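DataFrame has the methods add(), sub(), mul(), and div(), plus the related functions radd(), rsub(), and so on, for carrying out binary operations. For broadcasting behavior, Series input is of primary interest. With these functions you can match on either the index or the columns via the axis keyword.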

In [16]:
df = pd.DataFrame({
    'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

Unnamed: 0,one,two,three
a,-2.118074,-1.039693,
b,-0.915461,0.139273,-1.476982
c,-0.966587,-0.197301,1.15707
d,,-0.236962,1.262711


In [17]:
row = df.iloc[1]
row

one     -0.915461
two      0.139273
three   -1.476982
Name: b, dtype: float64

In [18]:
column = df['two']
column

a   -1.039693
b    0.139273
c   -0.197301
d   -0.236962
Name: two, dtype: float64

In [19]:
df.sub(row, axis='columns')

Unnamed: 0,one,two,three
a,-1.202612,-1.178966,
b,0.0,0.0,0.0
c,-0.051126,-0.336575,2.634053
d,,-0.376235,2.739694


In [20]:
df.sub(row, axis=1)

Unnamed: 0,one,two,three
a,-1.202612,-1.178966,
b,0.0,0.0,0.0
c,-0.051126,-0.336575,2.634053
d,,-0.376235,2.739694


In [21]:
df.sub(column, axis='index')

Unnamed: 0,one,two,three
a,-1.078381,0.0,
b,-1.054735,0.0,-1.616256
c,-0.769286,0.0,1.354371
d,,0.0,1.499673


In [22]:
df.sub(column, axis=0)

Unnamed: 0,one,two,three
a,-1.078381,0.0,
b,-1.054735,0.0,-1.616256
c,-0.769286,0.0,1.354371
d,,0.0,1.499673


A Series can also be aligned with one level of a MultiIndexed DataFrame.

In [23]:
dfmi = df.copy()
dfmi

Unnamed: 0,one,two,three
a,-2.118074,-1.039693,
b,-0.915461,0.139273,-1.476982
c,-0.966587,-0.197301,1.15707
d,,-0.236962,1.262711


In [24]:
dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),
                                        (1, 'c'), (2, 'a')],
                                       names=['first', 'second'])
dfmi

Unnamed: 0_level_0,Unnamed: 1_level_0,one,two,three
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,a,-2.118074,-1.039693,
1,b,-0.915461,0.139273,-1.476982
1,c,-0.966587,-0.197301,1.15707
2,a,,-0.236962,1.262711


In [25]:
dfmi.sub(column, axis=0, level='second')

Unnamed: 0_level_0,Unnamed: 1_level_0,one,two,three
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,a,-1.078381,0.0,
1,b,-1.054735,0.0,-1.616256
1,c,-0.769286,0.0,1.354371
2,a,,0.802731,2.302404


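Series and Index also support the divmod() builtin. This function performs floor division and the modulo operation at the same time, returning a two-tuple whose elements have the same type as the left-hand side. For example: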

In [26]:
s = pd.Series(np.arange(10))
s

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [27]:
div, rem = divmod(s, 3)
div

0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int64

In [28]:
rem

0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int64

In [29]:
idx = pd.Index(np.arange(10))
idx

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
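
The notebook builds `idx` but stops before applying divmod() to it. A short sketch of that missing step follows, together with an element-wise variant against a list of divisors. The divisor list and the names `div_each` and `rem_each` are illustrative choices:

```python
import numpy as np
import pandas as pd

idx = pd.Index(np.arange(10))

# divmod on an Index returns a pair of Index objects
div, rem = divmod(idx, 3)

# divmod also works element-wise against a list of the same length
s = pd.Series(np.arange(10))
div_each, rem_each = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

print(div)
print(rem)
```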

### Missing data and operations with fill values

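The arithmetic functions of Series and DataFrame accept a fill_value option: a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects you may want to treat NaN as 0 unless both DataFrames are missing that value, in which case the result is still NaN. The remaining NaN values can later be replaced with fillna if you wish.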

In [30]:
df2 = pd.DataFrame({'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']), 
                    'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']), 
                    'three': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd'])})
df2

Unnamed: 0,one,two,three
a,1.090599,-0.822257,-0.026648
b,-1.134086,-2.077328,-0.9052
c,-0.697939,-1.324759,-1.703948
d,,-2.036666,-1.202689


In [31]:
df

Unnamed: 0,one,two,three
a,-2.118074,-1.039693,
b,-0.915461,0.139273,-1.476982
c,-0.966587,-0.197301,1.15707
d,,-0.236962,1.262711


In [32]:
df + df2

Unnamed: 0,one,two,three
a,-1.027474,-1.86195,
b,-2.049547,-1.938055,-2.382182
c,-1.664525,-1.52206,-0.546878
d,,-2.273627,0.060022


In [33]:
df.add(df2, fill_value=0)

Unnamed: 0,one,two,three
a,-1.027474,-1.86195,-0.026648
b,-2.049547,-1.938055,-2.382182
c,-1.664525,-1.52206,-0.546878
d,,-2.273627,0.060022
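
Because the frames above hold random numbers, the rule is easier to verify on fixed values. The sketch below uses two small hypothetical frames, `left` and `right`, covering the three cases: both values present, one missing, and both missing:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

plain = left + right                     # NaN wherever either side is missing
filled = left.add(right, fill_value=0)   # NaN only where both sides are missing

print(plain)
print(filled)
```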


### Comparison operations

As with the arithmetic operations in the previous subsection, Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge:

|No.|Method|Meaning|Operator|
|----|----|:----|----|
|1|eq|equal to|==|
|2|ne|not equal to|!=|
|3|lt|less than|<|
|4|gt|greater than|>|
|5|le|less than or equal to|<=|
|6|ge|greater than or equal to|>=|


In [34]:
df.gt(df2)

Unnamed: 0,one,two,three
a,False,False,False
b,True,True,False
c,False,True,True
d,False,True,True


In [35]:
df2.ne(df)

Unnamed: 0,one,two,three
a,True,True,True
b,True,True,True
c,True,True,True
d,True,True,True


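These operations produce a Pandas object of the same type as the left-hand-side input, with dtype bool. Such boolean objects can be used in indexing operations; see the section on boolean indexing.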

### Boolean reductions

The reductions empty, any(), all(), and bool() summarize a boolean result down to a single boolean value.

In [36]:
(df > 0).all()

one      False
two      False
three    False
dtype: bool

In [37]:
(df > 0).any()

one      False
two       True
three     True
dtype: bool

In [38]:
# The result above can be reduced further to a single boolean value.
(df > 0).any().any()


True

The empty property tests whether a Pandas object is empty.

In [39]:
df.empty

False

In [40]:
pd.DataFrame(columns=list('ABC')).empty

True

Use the bool() method to evaluate a single-element Pandas object in a boolean context.

In [41]:
pd.Series([True]).bool()

True

In [42]:
pd.Series([False]).bool()

False

In [43]:
pd.DataFrame([[True]]).bool()

True

In [44]:
pd.DataFrame([[False]]).bool()

False
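
Recent Pandas releases appear to have deprecated bool() (from 2.1 onward, as far as the release notes indicate), so the calls above may warn or fail on a current installation. A sketch of an alternative that should behave the same across versions is to pull out the single element explicitly. The names `single` and `cell` are illustrative:

```python
import pandas as pd

single = pd.Series([True]).item()           # the only element of the Series
cell = pd.DataFrame([[False]]).iloc[0, 0]   # the only cell of the DataFrame

print(bool(single), bool(cell))
```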

### Comparing whether objects are equivalent

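Often there are several ways to compute the same result. Take df + df and df * 2 as an example. Using the tools from the previous subsection, you might test whether the two computations agree with (df + df == df * 2).all(). The result, however, contains False: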

In [45]:
df + df == df * 2

Unnamed: 0,one,two,three
a,True,True,False
b,True,True,True
c,True,True,True
d,False,True,True


In [46]:
(df + df == df * 2).all()

one      False
two       True
three    False
dtype: bool

Notice that the boolean DataFrame df + df == df * 2 contains some False values! This is because NaNs do not compare as equal:

In [47]:
np.nan == np.nan

False

To test for equivalence, N-dimensional structures such as Series and DataFrame provide an equals() method, which treats NaNs in corresponding locations as equal.

In [48]:
(df + df).equals(df * 2)

True

Note: the Series or DataFrame indexes must be in the same order for equals() to return True.

In [49]:
df1 = pd.DataFrame({'col': ['foo', 0, np.nan]})
df1

Unnamed: 0,col
0,foo
1,0
2,


In [50]:
df2 = pd.DataFrame({'col': [np.nan, 0, 'foo']}, index=[2, 1, 0])
df2

Unnamed: 0,col
2,
1,0
0,foo


In [51]:
df1.equals(df2)

False

In [52]:
df1.equals(df2.sort_index())

True

### Comparing array-like objects

Comparing a Pandas data structure element-wise with a scalar value is straightforward:

In [53]:
pd.Series(['foo', 'bar', 'baz']) == 'foo'

0     True
1    False
2    False
dtype: bool

In [54]:
pd.Index(['foo', 'bar', 'baz']) == 'foo'

array([ True, False, False])

Pandas also handles element-wise comparisons between different array-like objects of the same length:

In [55]:
pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])

0     True
1     True
2    False
dtype: bool

In [56]:
pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])

0     True
1     True
2    False
dtype: bool

Comparing Index or Series objects of different lengths raises a ValueError:

In [57]:
# pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])

In [58]:
# pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])

Note: this differs from NumPy's behavior, where a comparison can be broadcast:

In [59]:
np.array([1, 2, 3]) == np.array([2])

array([False,  True, False])

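When NumPy cannot broadcast the comparison, the version used here returns False (newer NumPy releases may raise an error instead):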

In [60]:
np.array([1, 2, 3]) == np.array([1, 3])

False

### Combining overlapping data sets

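A problem that occasionally arises is combining two similar data sets where the values in one are preferred over the other. An example would be two data series representing a particular economic indicator, where one is considered to be of “higher quality”. The lower-quality series, however, might extend further back in history or have more complete data coverage. We would therefore like to combine two DataFrame objects so that missing values in one DataFrame are conditionally filled with like-labeled values from the other. The function implementing this operation is combine_first(), shown below.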

In [61]:
df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
                    'B': [np.nan, 2., 3., np.nan, 6.]})
df1

Unnamed: 0,A,B
0,1.0,
1,,2.0
2,3.0,3.0
3,5.0,
4,,6.0


In [62]:
df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
                    'B': [np.nan, np.nan, 3., 4., 6., 8.]})
df2

Unnamed: 0,A,B
0,5.0,
1,2.0,
2,4.0,3.0
3,,4.0
4,3.0,6.0
5,7.0,8.0


In [63]:
df1.combine_first(df2)

Unnamed: 0,A,B
0,1.0,
1,2.0,2.0
2,3.0,3.0
3,5.0,4.0
4,3.0,6.0
5,7.0,8.0


### General DataFrame combine

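The combine_first() method above calls the more general DataFrame.combine(). This method takes another DataFrame and a combiner function, aligns the input DataFrame, and then passes the combiner function pairs of Series (that is, columns whose names are the same).  
The code below reproduces combine_first() as above: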

In [64]:
def combiner(x, y):
    return np.where(pd.isna(x), y, x)  # take y where x is missing, otherwise keep x

df1.combine(df2, combiner)

## Descriptive statistics

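Series and DataFrame support a large number of methods for computing descriptive statistics and related operations. Most of these are aggregations such as sum(), mean(), and quantile(), which produce a lower-dimensional result, but some, like cumsum() and cumprod(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but the axis can be specified by name or integer: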

- Series: no axis argument needed
- DataFrame:
    - index, i.e. axis=0 (the default)
    - columns, i.e. axis=1
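
The cells that follow pass the axis as an integer. As a small sketch of the name-based spelling, the hypothetical frame below is reduced once per axis, and the named and numeric forms give the same result:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 5.0]}, index=["r1", "r2"])

by_index = frame.mean(axis="index")       # same as axis=0: one value per column
by_columns = frame.mean(axis="columns")   # same as axis=1: one value per row

print(by_index)
print(by_columns)
```

Spelling out the axis name tends to make longer chains easier to read, since `axis=1` alone does not say whether rows or columns are being collapsed.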

In [65]:
df

Unnamed: 0,one,two,three
a,-2.118074,-1.039693,
b,-0.915461,0.139273,-1.476982
c,-0.966587,-0.197301,1.15707
d,,-0.236962,1.262711


In [66]:
df.mean(0)

one     -1.333374
two     -0.333671
three    0.314266
dtype: float64

In [67]:
df.mean(1)

a   -1.578883
b   -0.751057
c   -0.002273
d    0.512875
dtype: float64

All of these methods have a skipna option that signals whether to exclude missing data (True by default):

In [68]:
df.sum(0, skipna=False)

one           NaN
two     -1.334682
three         NaN
dtype: float64

In [69]:
df.sum(axis=1, skipna=True)

a   -3.157767
b   -2.253170
c   -0.006818
d    1.025750
dtype: float64

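Combined with broadcasting and arithmetic, these methods describe various statistical procedures concisely. Standardization, i.e. rendering the data with zero mean and a standard deviation of 1, is one example: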


In [70]:
ts_stand = (df - df.mean()) / df.std()
ts_stand

Unnamed: 0,one,two,three
a,-1.153884,-1.411957,
b,0.614532,0.945829,-1.154032
c,0.539353,0.272722,0.542986
d,,0.193406,0.611046


In [71]:
ts_stand.std()

one      1.0
two      1.0
three    1.0
dtype: float64

In [72]:
xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)
xs_stand.std(1)

a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64

Note: methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat different from expanding() and rolling().

In [73]:
df.cumsum()

Unnamed: 0,one,two,three
a,-2.118074,-1.039693,
b,-3.033535,-0.900419,-1.476982
c,-4.000122,-1.097721,-0.319912
d,,-1.334682,0.942799
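
To make the difference from expanding() concrete, here is a small sketch on a hypothetical three-element Series with a NaN in the middle. cumsum() leaves the NaN where it is, while expanding().sum() reports the running total at that position:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 2.0])

running = values.cumsum()             # NaN stays at position 1
expanded = values.expanding().sum()   # position 1 reports the sum seen so far

print(running.tolist())
print(expanded.tolist())
```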


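The table below summarizes common functions. In the Pandas version used here, each also takes an optional level parameter, which applies only if the object has a hierarchical index; newer releases may have dropped that parameter and some functions such as mad.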

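|Function|Description|
|:----|:----|
|count|Number of non-NA observations|
|sum|Sum of values|
|mean|Mean of values|
|mad|Mean absolute deviation|
|median|Arithmetic median of values|
|min|Minimum|
|max|Maximum|
|mode|Mode|
|abs|Absolute value|
|prod|Product of values|
|std|Bessel-corrected sample standard deviation|
|var|Unbiased variance|
|sem|Standard error of the mean|
|skew|Sample skewness (3rd moment)|
|kurt|Sample kurtosis (4th moment)|
|quantile|Sample quantile (value at %)|
|cumsum|Cumulative sum|
|cumprod|Cumulative product|
|cummax|Cumulative maximum|
|cummin|Cumulative minimum|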

Note: NumPy functions such as mean, std, and sum exclude NA values by default when given a Series, but not when given the underlying ndarray.

In [74]:
np.mean(df['one'])

-1.3333738932280481

In [75]:
np.mean(df['one'].to_numpy())

nan
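
When only the ndarray is at hand, NumPy's NaN-aware functions give the same answer as the Series method. A minimal sketch on hypothetical fixed values:

```python
import numpy as np
import pandas as pd

col = pd.Series([1.0, np.nan, 3.0])
raw = col.to_numpy()

series_mean = col.mean()        # skips NaN: mean of 1.0 and 3.0
raw_mean = np.mean(raw)         # plain ndarray: NaN propagates
raw_nanmean = np.nanmean(raw)   # NaN-aware NumPy variant

print(series_mean, raw_mean, raw_nanmean)
```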

Series.nunique() returns the number of unique non-NA values in a Series.

In [76]:
series = pd.Series(np.random.randn(500))

In [77]:
series[20:500] = np.nan

In [78]:
series[10:20] = 5
series

0      2.470937
1     -1.012261
2      1.376939
3      0.637797
4      0.621899
         ...   
495         NaN
496         NaN
497         NaN
498         NaN
499         NaN
Length: 500, dtype: float64

In [79]:
series.nunique()

11
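
By default the count ignores NaN. The sketch below, on a small hypothetical Series, shows the `dropna` parameter that includes NaN as one more distinct value, next to unique(), which returns the values themselves:

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, 5.0, 1.5, np.nan, np.nan])

n_default = s.nunique()               # counts 5.0 and 1.5
n_with_na = s.nunique(dropna=False)   # also counts NaN once
distinct = s.unique()                 # the distinct values, NaN included

print(n_default, n_with_na, distinct)
```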

### Summarizing data: describe

describe() computes a variety of summary statistics about a Series or the columns of a DataFrame, excluding NAs.

In [80]:
series = pd.Series(np.random.randn(1000))

In [81]:
series[::2] = np.nan

In [82]:
series.describe()

count    500.000000
mean      -0.000327
std        1.075076
min       -2.831556
25%       -0.689540
50%        0.017956
75%        0.749266
max        2.928665
dtype: float64

In [83]:
frame = pd.DataFrame(np.random.randn(1000, 5),
                     columns=['a', 'b', 'c', 'd', 'e'])

In [84]:
frame.iloc[::2] = np.nan

In [85]:
frame.describe()

Unnamed: 0,a,b,c,d,e
count,500.0,500.0,500.0,500.0,500.0
mean,0.021978,0.031614,0.011636,-0.031727,0.026996
std,0.975394,1.007633,0.983217,0.999287,1.025172
min,-3.476481,-2.783547,-2.355005,-3.552103,-2.817457
25%,-0.695098,-0.653857,-0.661236,-0.672135,-0.621586
50%,0.062104,0.050623,0.012704,0.020355,0.055099
75%,0.691449,0.6874,0.691197,0.667311,0.678402
max,2.72173,2.839587,2.703116,2.922027,2.958678


You can also select specific percentiles to include in the output:

In [86]:
series.describe(percentiles=[.05, .25, .75, .95])

count    500.000000
mean      -0.000327
std        1.075076
min       -2.831556
5%        -1.867631
25%       -0.689540
50%        0.017956
75%        0.749266
95%        1.754502
max        2.928665
dtype: float64

The cut() function (bins based on values) and the qcut() function (bins based on sample quantiles) discretize continuous values:

In [87]:
arr = np.random.randn(20)
arr

array([ 0.1861946 ,  0.75226841,  0.13973596, -0.85551312,  0.51874969,
       -1.45770645,  1.4073972 ,  1.45687588, -1.13408348, -0.67618875,
       -1.09865813, -0.54291241,  0.068999  , -0.68171923,  0.18183451,
       -1.15184275, -0.22122322,  0.38934664, -1.44432504, -1.89771814])

In [88]:
factor = pd.cut(arr, 4)
factor

[(-0.22, 0.618], (0.618, 1.457], (-0.22, 0.618], (-1.059, -0.22], (-0.22, 0.618], ..., (-1.901, -1.059], (-1.059, -0.22], (-0.22, 0.618], (-1.901, -1.059], (-1.901, -1.059]]
Length: 20
Categories (4, interval[float64]): [(-1.901, -1.059] < (-1.059, -0.22] < (-0.22, 0.618] < (0.618, 1.457]]

In [89]:
factor = pd.cut(arr, [-5, -1, 0, 1, 5])
factor

[(0, 1], (0, 1], (0, 1], (-1, 0], (0, 1], ..., (-5, -1], (-1, 0], (0, 1], (-5, -1], (-5, -1]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, the code below slices normally distributed data into equal-size quartiles:

In [90]:
arr = np.random.randn(30)
arr

array([-0.04756039, -1.05101235, -0.134385  ,  1.27558029, -0.09326886,
       -0.50951047,  0.28758226,  0.85732271, -0.25231632, -0.76344051,
       -0.22874282, -0.23743971, -0.98783096, -0.47930293, -0.21575116,
       -0.36477093,  0.59172536,  1.36681987,  0.85204202, -0.33029829,
        0.62329995,  0.7553514 ,  2.45614453, -0.63649977, -0.55569655,
       -0.00283777,  1.63126472, -0.07171883, -0.25552776, -0.34991896])

In [91]:
factor = pd.qcut(arr, [0, .25, .5, .75, 1])
factor

[(-0.175, 0.615], (-1.0519999999999998, -0.361], (-0.175, 0.615], (0.615, 2.456], (-0.175, 0.615], ..., (-0.175, 0.615], (0.615, 2.456], (-0.175, 0.615], (-0.361, -0.175], (-0.361, -0.175]]
Length: 30
Categories (4, interval[float64]): [(-1.0519999999999998, -0.361] < (-0.361, -0.175] < (-0.175, 0.615] < (0.615, 2.456]]

In [92]:
pd.value_counts(factor)

(-1.0519999999999998, -0.361]    8
(0.615, 2.456]                   8
(-0.361, -0.175]                 7
(-0.175, 0.615]                  7
dtype: int64

In [93]:
arr = np.random.randn(20)

In [94]:
factor = pd.cut(arr, [-np.inf, 0, np.inf])
factor

[(0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], ..., (0.0, inf], (0.0, inf], (-inf, 0.0], (0.0, inf], (-inf, 0.0]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
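
The interval labels above can be replaced with readable names through the `labels` parameter of cut(). The following sketch uses hypothetical readings and label names, with the same bin edges as one of the earlier cells:

```python
import pandas as pd

readings = [-2.0, -0.5, 0.5, 3.0]
bins = [-5, -1, 0, 1, 5]
names = ["very low", "low", "high", "very high"]  # hypothetical labels

# One label per bin, so len(names) must equal len(bins) - 1
labeled = pd.cut(readings, bins, labels=names)
counts = pd.Series(labeled).value_counts()

print(list(labeled))
```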

## Function application

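To apply your own or another library's functions to Pandas objects, you should be aware of the four methods below. The appropriate method depends on whether your function expects to operate on an entire DataFrame or Series, row- or column-wise, or element-wise.
1. Tablewise function application: pipe()
2. Row- or column-wise function application: apply()
3. Aggregation API: agg() and transform()
4. Element-wise function application: applymap()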

### Tablewise function application

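DataFrames and Series can be passed into functions directly. However, if the functions need to be called in a chain, consider using the pipe() method. Compare the following two styles: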

In [95]:
# f, g, and h are functions that take and return DataFrames.
# They are not defined in this notebook, so the call is left commented out.
# f(g(h(df), arg1=1), arg2=2, arg3=3)

In [96]:
# The same computation written as a chain (also commented out, since f, g, and h are undefined).
# df.pipe(h).pipe(g, arg1=1).pipe(f, arg2=2, arg3=3)

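Pandas encourages the second style, known as method chaining. pipe() makes it easy to use your own or another library's functions in method chains, alongside Pandas' own methods.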
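
Since f, g, and h are not defined in the notebook, here is a runnable sketch with hypothetical stand-ins. Their bodies are invented purely for illustration; the point is that the nested call and the pipe() chain compute the same thing:

```python
import pandas as pd

# Hypothetical stand-ins for h, g, and f: each takes and returns a DataFrame
def h(frame):
    return frame.dropna()

def g(frame, arg1):
    return frame + arg1

def f(frame, arg2, arg3):
    return frame * arg2 - arg3

data = pd.DataFrame({"x": [1.0, None, 3.0]})

# Nested style: read from the inside out
nested = f(g(h(data), arg1=1), arg2=2, arg3=3)

# Chained style: read from left to right, in execution order
chained = data.pipe(h).pipe(g, arg1=1).pipe(f, arg2=2, arg3=3)

print(chained)
```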

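In the example above, the functions f, g, and h each expect the DataFrame as the first positional argument. What if the function you want to apply takes its data as, say, the second argument? In that case, give pipe a tuple of (callable, data_keyword). pipe routes the DataFrame to the argument named in the tuple.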

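The example below fits a regression with statsmodels. Its API expects a formula first and a DataFrame as the second argument, data. The function and keyword pair (sm.ols, 'data') is passed to pipe: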

In [97]:
import statsmodels.formula.api as sm

In [102]:
bb = pd.read_csv('../data/baseball.csv', index_col='id')
bb

Unnamed: 0_level_0,player,year,stint,team,lg,g,ab,r,h,X2b,...,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
88641,womacto01,2006,2,CHN,NL,19,50,6,14,1,...,2.0,1.0,1.0,4,4.0,0.0,0.0,3.0,0.0,0.0
88643,schilcu01,2006,1,BOS,AL,31,2,0,1,0,...,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0
88645,myersmi01,2006,1,NYA,AL,62,0,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
88649,helliri01,2006,1,MIL,NL,20,3,0,0,0,...,0.0,0.0,0.0,0,2.0,0.0,0.0,0.0,0.0,0.0
88650,johnsra05,2006,1,NYA,AL,33,6,0,1,0,...,0.0,0.0,0.0,0,4.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89525,benitar01,2007,2,FLO,NL,34,0,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
89526,benitar01,2007,1,SFN,NL,19,0,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
89530,ausmubr01,2007,1,HOU,NL,117,349,38,82,16,...,25.0,6.0,1.0,37,74.0,3.0,6.0,4.0,1.0,11.0
89533,aloumo01,2007,1,NYN,NL,87,328,51,112,19,...,49.0,3.0,0.0,27,30.0,5.0,2.0,0.0,3.0,13.0


In [103]:
(bb.query('h > 0')
 .assign(ln_h=lambda df: np.log(df.h))
 .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
 .fit()
 .summary()
)

0,1,2,3
Dep. Variable:,hr,R-squared:,0.685
Model:,OLS,Adj. R-squared:,0.665
Method:,Least Squares,F-statistic:,34.28
Date:,"Mon, 12 Apr 2021",Prob (F-statistic):,3.48e-15
Time:,09:54:48,Log-Likelihood:,-205.92
No. Observations:,68,AIC:,421.8
Df Residuals:,63,BIC:,432.9
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-8484.7720,4664.146,-1.819,0.074,-1.78e+04,835.780
C(lg)[T.NL],-2.2736,1.325,-1.716,0.091,-4.922,0.375
ln_h,-1.3542,0.875,-1.547,0.127,-3.103,0.395
year,4.2277,2.324,1.819,0.074,-0.417,8.872
g,0.1841,0.029,6.258,0.000,0.125,0.243

0,1,2,3
Omnibus:,10.875,Durbin-Watson:,1.999
Prob(Omnibus):,0.004,Jarque-Bera (JB):,17.298
Skew:,0.537,Prob(JB):,0.000175
Kurtosis:,5.225,Cond. No.,14900000.0
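
The regression above needs statsmodels and the baseball data set. The tuple form of pipe() can also be seen in isolation with a plain function. In this sketch, `scale` is a hypothetical function whose data argument comes second:

```python
import pandas as pd

# Hypothetical function whose data argument is not the first parameter
def scale(factor, data):
    return data * factor

frame = pd.DataFrame({"x": [1, 2, 3]})

# pipe calls scale(10, data=frame): the frame is passed under the keyword "data"
scaled = frame.pipe((scale, "data"), 10)

print(scaled)
```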


The pipe method is inspired by Unix pipes and, more recently, by dplyr and magrittr, which introduced the popular (%>%) (read pipe) operator for R. The implementation of pipe here is quite clean and feels right at home in Python. Reading the source code of pipe() is encouraged.

### Row- or column-wise function application

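Arbitrary functions can be applied along the axes of a DataFrame with the apply() method, which, like the descriptive statistics methods, takes an optional axis argument: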

In [104]:
df.apply(np.mean)

one     -1.333374
two     -0.333671
three    0.314266
dtype: float64

In [105]:
df.apply(np.mean, axis=1)

a   -1.578883
b   -0.751057
c   -0.002273
d    0.512875
dtype: float64

In [107]:
df.apply(lambda x: x.max() - x.min())

one      1.202612
two      1.178966
three    2.739694
dtype: float64

In [108]:
df.apply(np.cumsum)

Unnamed: 0,one,two,three
a,-2.118074,-1.039693,
b,-3.033535,-0.900419,-1.476982
c,-4.000122,-1.097721,-0.319912
d,,-1.334682,0.942799


In [109]:
df.apply(np.exp)

Unnamed: 0,one,two,three
a,0.120263,0.353563,
b,0.400332,1.149438,0.228326
c,0.380379,0.820943,3.180601
d,,0.789022,3.534993


apply() also accepts the name of a method as a string:

In [110]:
df.apply('mean')

one     -1.333374
two     -0.333671
three    0.314266
dtype: float64

In [111]:
df.apply('mean', axis=1)

a   -1.578883
b   -0.751057
c   -0.002273
d    0.512875
dtype: float64

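Under the default behavior, the return type of the function passed to apply() affects the type of the final output of DataFrame.apply:
- If the applied function returns a Series, the final output is a DataFrame. The columns match the index of the Series returned by the applied function.
- If the applied function returns any other type, the final output is a Series.

This default behavior can be overridden with result_type, which accepts three options: reduce, broadcast, and expand. These determine whether list-like return values are expanded to a DataFrame.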
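
As a sketch of how result_type changes the output, the hypothetical function below returns a two-element list per row. By default the lists are kept as the values of a Series; with result_type="expand" each list item becomes a column:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4], "b": [2, 5], "c": [3, 6]})

def low_high(row):
    # A list-like return value: [smallest, largest] of the row
    return [row.min(), row.max()]

as_series = frame.apply(low_high, axis=1)                       # Series of lists
as_frame = frame.apply(low_high, axis=1, result_type="expand")  # one column per list item

print(as_series)
print(as_frame)
```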

In [115]:
tsdf = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'],
                    index=pd.date_range('1/1/2000', periods=1000))
tsdf.apply(lambda x: x.idxmax())

A   2000-06-25
B   2000-01-30
C   2002-09-21
dtype: datetime64[ns]

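You can also pass additional positional and keyword arguments to apply(). For instance, consider the following function you would like to apply: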

In [116]:
def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

In [169]:
df4 = df.apply(subtract_and_divide, args=(5,), divide=3)
df4

Unnamed: 0,one,two,three
a,-2.372691,-2.013231,
b,-1.97182,-1.620242,-2.158994
c,-1.988862,-1.732434,-1.280977
d,,-1.745654,-1.245763


In [118]:
tsdf

Unnamed: 0,A,B,C
2000-01-01,0.749818,1.349233,-0.993858
2000-01-02,0.861563,0.096119,0.500332
2000-01-03,0.024924,0.424351,0.562338
2000-01-04,0.146277,0.329534,0.713721
2000-01-05,0.844864,-1.038103,-0.249341
...,...,...,...
2002-09-22,1.597153,-1.160783,-0.221050
2002-09-23,-0.240841,0.433677,0.597280
2002-09-24,-0.929135,0.832191,-1.252744
2002-09-25,1.323067,-0.472799,0.612361


In [119]:
tsdf.apply(pd.Series.interpolate)

Unnamed: 0,A,B,C
2000-01-01,0.749818,1.349233,-0.993858
2000-01-02,0.861563,0.096119,0.500332
2000-01-03,0.024924,0.424351,0.562338
2000-01-04,0.146277,0.329534,0.713721
2000-01-05,0.844864,-1.038103,-0.249341
...,...,...,...
2002-09-22,1.597153,-1.160783,-0.221050
2002-09-23,-0.240841,0.433677,0.597280
2002-09-24,-0.929135,0.832191,-1.252744
2002-09-25,1.323067,-0.472799,0.612361


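apply() takes an argument raw, which is False by default; in that case each row or column is converted into a Series before the function is applied. When raw is set to True, the passed function receives an ndarray object instead, which can improve performance noticeably if you do not need the indexing functionality.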
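
A small sketch of what the function actually receives in each mode follows. The frame and the helper names `spread` and `got_ndarray` are hypothetical. A function that only relies on methods shared by Series and ndarray gives the same numbers either way:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 4.0], "b": [2.0, 8.0]})

def spread(col):
    # Uses only max()/min(), which both Series and ndarray provide
    return col.max() - col.min()

def got_ndarray(col):
    # 1.0 if apply handed over a bare ndarray, 0.0 otherwise
    return float(isinstance(col, np.ndarray))

spread_default = frame.apply(spread)
spread_raw = frame.apply(spread, raw=True)
raw_flags = frame.apply(got_ndarray, raw=True)
default_flags = frame.apply(got_ndarray)

print(spread_raw.tolist(), raw_flags.tolist(), default_flags.tolist())
```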

### Aggregation API

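The aggregation API lets you express possibly multiple aggregation operations in a single concise way. The API is similar across Pandas objects; see the groupby API, the window functions API, and the resample API. The entry point for aggregation is DataFrame.aggregate(), or its alias DataFrame.agg().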

We will use a starting frame similar to the one above:

In [120]:
tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
                    index=pd.date_range('1/1/2000', periods=10))

In [122]:
tsdf.iloc[3:7] = np.nan

In [123]:
tsdf

Unnamed: 0,A,B,C
2000-01-01,-0.292016,0.245535,0.258749
2000-01-02,-1.773805,0.646603,-1.478853
2000-01-03,0.999037,0.653631,-1.234107
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.844466,-2.022481,-1.116674
2000-01-09,-0.78116,-0.553704,0.003263
2000-01-10,-1.730138,1.025073,0.474064


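Using a single function is equivalent to apply(). You can also pass the name of an aggregation as a string. These calls return a Series of the aggregated output: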

In [124]:
tsdf.agg(np.sum)

A   -2.733615
B   -0.005344
C   -3.093558
dtype: float64

In [125]:
tsdf.agg('sum')

A   -2.733615
B   -0.005344
C   -3.093558
dtype: float64

In [126]:
tsdf.sum()

A   -2.733615
B   -0.005344
C   -3.093558
dtype: float64

A single aggregation on a Series returns a scalar value:

In [127]:
tsdf.A.agg('sum')

-2.733615130174031

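### Aggregating with multiple functions
You can pass multiple aggregation functions as a list. The result of each function is a row in the resulting DataFrame, named after the aggregation function.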

In [128]:
tsdf.agg(['sum'])

Unnamed: 0,A,B,C
sum,-2.733615,-0.005344,-3.093558


Multiple functions yield multiple rows:

In [129]:
tsdf.agg(['sum', 'mean'])

Unnamed: 0,A,B,C
sum,-2.733615,-0.005344,-3.093558
mean,-0.455603,-0.000891,-0.515593


On a Series, multiple functions return a Series indexed by the function names:

In [131]:
tsdf.A.agg(['sum', 'mean'])

sum    -2.733615
mean   -0.455603
Name: A, dtype: float64

Passing a lambda function yields a row named <lambda>:

In [132]:
tsdf.A.agg(['sum', lambda x: x.mean()])

sum        -2.733615
<lambda>   -0.455603
Name: A, dtype: float64

Passing a named function yields that name for the row:

In [134]:
def mymean(x):
    return x.mean()

In [135]:
tsdf.A.agg(['sum', mymean])

sum      -2.733615
mymean   -0.455603
Name: A, dtype: float64

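### Aggregating with a dict
Passing a dictionary that maps column names to a scalar or a list of scalars to DataFrame.agg lets you choose which functions are applied to which columns.  
Note: on Python versions where dict ordering is not guaranteed (before 3.7), the output order is not fixed; use an OrderedDict there to make the output follow the input order.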

In [136]:
tsdf.agg({'A': 'mean', 'B': 'sum'})

A   -0.455603
B   -0.005344
dtype: float64

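Passing a list-like generates a DataFrame output: a matrix-like result containing all of the aggregators, made up of every unique function. Functions not specified for a particular column yield NaN: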

In [137]:
tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})

Unnamed: 0,A,B
mean,-0.455603,
min,-1.773805,
sum,,-0.005344


### Mixed dtypes

When a DataFrame contains dtypes that cannot be aggregated, .agg only takes the valid aggregations. This is similar to how groupby .agg works:

In [139]:
mdf = pd.DataFrame({'A': [1, 2, 3],
                    'B': [1., 2., 3.],
                    'C': ['foo', 'bar', 'baz'],
                    'D': pd.date_range('20130101', periods=3)})

In [140]:
mdf.dtypes

A             int64
B           float64
C            object
D    datetime64[ns]
dtype: object

In [141]:
mdf.agg(['min', 'sum'])

Unnamed: 0,A,B,C,D
min,1,1.0,bar,2013-01-01
sum,6,6.0,foobarbaz,NaT


### Custom describe

With .agg() it is possible to create a custom describe function, similar to the built-in describe function.

In [143]:
from functools import partial

In [148]:
q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = '25%'

In [149]:
q_75 = partial(pd.Series.quantile, q=0.75)
q_75.__name__ = '75%'

In [150]:
tsdf.agg(['count', 'mean', 'std', 'min', q_25, 'median', q_75, 'max'])

Unnamed: 0,A,B,C
count,6.0,6.0,6.0
mean,-0.455603,-0.000891,-0.515593
std,1.208387,1.127245,0.85483
min,-1.773805,-2.022481,-1.478853
25%,-1.492893,-0.353894,-1.204749
median,-0.536588,0.446069,-0.556705
75%,0.560345,0.651874,0.194878
max,0.999037,1.025073,0.474064


### Transform API

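The transform() method returns an object that has the same index and the same size as the original. Like the .agg API, this API lets you provide multiple operations at once rather than one by one.  
First, create a DataFrame: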

In [152]:
tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
                    index=pd.date_range('1/1/2000', periods=10))

In [153]:
tsdf.iloc[3:7] = np.nan

In [154]:
tsdf

Unnamed: 0,A,B,C
2000-01-01,-0.480529,0.815743,0.184837
2000-01-02,-1.363585,1.457407,-0.366177
2000-01-03,-0.260303,0.086248,-0.181761
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,-0.774477,-0.937633,-0.909907
2000-01-09,-0.069464,-1.031243,-0.843662
2000-01-10,-0.462814,0.628525,-0.15002


Here the entire frame is transformed. .transform() accepts NumPy functions, string function names, and user-defined functions.

In [156]:
tsdf.transform(np.abs)

Unnamed: 0,A,B,C
2000-01-01,0.480529,0.815743,0.184837
2000-01-02,1.363585,1.457407,0.366177
2000-01-03,0.260303,0.086248,0.181761
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.774477,0.937633,0.909907
2000-01-09,0.069464,1.031243,0.843662
2000-01-10,0.462814,0.628525,0.15002


In [157]:
tsdf.transform('abs')

Unnamed: 0,A,B,C
2000-01-01,0.480529,0.815743,0.184837
2000-01-02,1.363585,1.457407,0.366177
2000-01-03,0.260303,0.086248,0.181761
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.774477,0.937633,0.909907
2000-01-09,0.069464,1.031243,0.843662
2000-01-10,0.462814,0.628525,0.15002


In [158]:
tsdf.transform(lambda x: x.abs())

Unnamed: 0,A,B,C
2000-01-01,0.480529,0.815743,0.184837
2000-01-02,1.363585,1.457407,0.366177
2000-01-03,0.260303,0.086248,0.181761
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.774477,0.937633,0.909907
2000-01-09,0.069464,1.031243,0.843662
2000-01-10,0.462814,0.628525,0.15002


Here transform() received a single function; this is equivalent to applying a ufunc.

In [159]:
np.abs(tsdf)

Unnamed: 0,A,B,C
2000-01-01,0.480529,0.815743,0.184837
2000-01-02,1.363585,1.457407,0.366177
2000-01-03,0.260303,0.086248,0.181761
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.774477,0.937633,0.909907
2000-01-09,0.069464,1.031243,0.843662
2000-01-10,0.462814,0.628525,0.15002


Passing a single function to .transform() on a Series yields a single Series in return.

In [160]:
tsdf.A.transform(np.abs)

2000-01-01    0.480529
2000-01-02    1.363585
2000-01-03    0.260303
2000-01-04         NaN
2000-01-05         NaN
2000-01-06         NaN
2000-01-07         NaN
2000-01-08    0.774477
2000-01-09    0.069464
2000-01-10    0.462814
Freq: D, Name: A, dtype: float64
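
Because transform() must return a result shaped like its input, a reducing function is rejected. The sketch below, on a hypothetical one-column frame, contrasts a shape-preserving function with the reduction "sum", which current Pandas releases are expected to refuse with a ValueError:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0, 3.0]})

# Subtracting the column mean keeps the index and the shape
same_shape = frame.transform(lambda x: x - x.mean())

try:
    frame.transform("sum")   # a reduction, so transform refuses it
    reduced_ok = True
except ValueError:
    reduced_ok = False

print(same_shape.shape, reduced_ok)
```

For reductions, agg() is the matching entry point.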

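### Transform with multiple functions
Passing multiple functions yields a DataFrame with MultiIndexed columns. The first level is the original frame's column names; the second level is the names of the transforming functions.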

In [161]:
tsdf.transform([np.abs, lambda x: x + 1])

Unnamed: 0_level_0,A,A,B,B,C,C
Unnamed: 0_level_1,absolute,<lambda>,absolute,<lambda>,absolute,<lambda>
2000-01-01,0.480529,0.519471,0.815743,1.815743,0.184837,1.184837
2000-01-02,1.363585,-0.363585,1.457407,2.457407,0.366177,0.633823
2000-01-03,0.260303,0.739697,0.086248,1.086248,0.181761,0.818239
2000-01-04,,,,,,
2000-01-05,,,,,,
2000-01-06,,,,,,
2000-01-07,,,,,,
2000-01-08,0.774477,0.225523,0.937633,0.062367,0.909907,0.090093
2000-01-09,0.069464,0.930536,1.031243,-0.031243,0.843662,0.156338
2000-01-10,0.462814,0.537186,0.628525,1.628525,0.15002,0.84998


Passing multiple functions to a Series yields a DataFrame whose column names are the names of the transforming functions.

In [162]:
tsdf.A.transform([np.abs, lambda x: x + 1])

Unnamed: 0,absolute,<lambda>
2000-01-01,0.480529,0.519471
2000-01-02,1.363585,-0.363585
2000-01-03,0.260303,0.739697
2000-01-04,,
2000-01-05,,
2000-01-06,,
2000-01-07,,
2000-01-08,0.774477,0.225523
2000-01-09,0.069464,0.930536
2000-01-10,0.462814,0.537186


### Transforming with a dict
Passing a dict of functions lets you choose which function is applied to each column.

In [163]:
tsdf.transform({'A': np.abs, 'B': lambda x: x + 1})

Unnamed: 0,A,B
2000-01-01,0.480529,1.815743
2000-01-02,1.363585,2.457407
2000-01-03,0.260303,1.086248
2000-01-04,,
2000-01-05,,
2000-01-06,,
2000-01-07,,
2000-01-08,0.774477,0.062367
2000-01-09,0.069464,-0.031243
2000-01-10,0.462814,1.628525


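Passing a dict of lists generates a DataFrame with MultiIndex columns named after the transforming functions. In the output below, sqrt yields NaN wherever B is negative.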

In [164]:
tsdf.transform({'A': np.abs, 'B': [lambda x: x + 1, 'sqrt']})


Unnamed: 0_level_0,A,B,B
Unnamed: 0_level_1,A,<lambda>,sqrt
2000-01-01,0.480529,1.815743,0.903185
2000-01-02,1.363585,2.457407,1.207231
2000-01-03,0.260303,1.086248,0.293679
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.774477,0.062367,
2000-01-09,0.069464,-0.031243,
2000-01-10,0.462814,1.628525,0.792796


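### Applying elementwise functions

Not every function can be vectorized, that is, accept a NumPy array and return another array or value. DataFrame's applymap() and Series' map() accept any Python function that takes a single value and returns a single value. (In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map().)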

In [170]:
# df4 = pd.DataFrame(np.random.randn(4, 3), columns=['one', 'two', 'three'], index=['a', 'b', 'c', 'd'])
df4

Unnamed: 0,one,two,three
a,-2.372691,-2.013231,
b,-1.97182,-1.620242,-2.158994
c,-1.988862,-1.732434,-1.280977
d,,-1.745654,-1.245763


In [171]:
def f(x):
    return len(str(x))

In [172]:
df4['one'].map(f)

a    19
b    19
c    19
d     3
Name: one, dtype: int64
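
The same single-value function can be applied to every cell of a DataFrame. The sketch below is illustrative only: `frame` and `str_len` are made-up names, and the version check assumes that DataFrame.map is available from pandas 2.1 onward, with applymap() as the fallback on older releases.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration
frame = pd.DataFrame({'one': [1.5, np.nan], 'two': [-2.25, 3.0]},
                     index=['a', 'b'])


def str_len(x):
    # Takes a single value and returns a single value
    return len(str(x))


# Prefer DataFrame.map when it exists; fall back to applymap otherwise
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
lengths = elementwise(str_len)
print(lengths)
```

A NaN cell is passed to the function like any other value, so `str_len` sees the string 'nan' for it.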

Series.map() has one more feature: it can be used to "link" or "map" values defined by a second Series. This is closely related to the merging / joining functionality:

In [174]:
s = pd.Series(['six', 'seven', 'six', 'seven', 'six'],
              index=['a', 'b', 'c', 'd', 'e'])
s

a      six
b    seven
c      six
d    seven
e      six
dtype: object

In [175]:
t = pd.Series({'six': 6., 'seven': 7.})
t

six      6.0
seven    7.0
dtype: float64

In [176]:
s.map(t)

a    6.0
b    7.0
c    6.0
d    7.0
e    6.0
dtype: float64
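
A value with no entry in the lookup object has nothing to map to and comes back as NaN. The following sketch uses made-up names (`words`, `lookup`) and also shows that a plain dict behaves like the lookup Series in this case.

```python
import pandas as pd

# Hypothetical data, used only for illustration
words = pd.Series(['six', 'seven', 'eight'], index=['a', 'b', 'c'])
lookup = pd.Series({'six': 6.0, 'seven': 7.0})

# 'eight' has no entry in lookup, so position 'c' becomes NaN
mapped = words.map(lookup)

# A plain dict can serve as the lookup table as well
mapped_from_dict = words.map(lookup.to_dict())

print(mapped)
```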

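## Reindexing and altering labels

reindex() is the fundamental data alignment method in Pandas; nearly every feature that relies on label alignment is built on it. To reindex means to conform the data to a given set of labels along a particular axis. This accomplishes several things:
- reorders the existing data to match a new set of labels;
- inserts missing value (NA) markers at label locations where no data exists for that label;
- if specified, fills data for missing labels using some logic (highly relevant for time series data).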
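
The third point can be sketched with the `method` argument of reindex(). The example below is an illustration on made-up data (`ts` and `daily` are hypothetical names): a series observed every two days is conformed to a daily index, first without filling and then with forward filling.

```python
import pandas as pd

# Hypothetical series observed every second day
ts = pd.Series([1.0, 2.0, 3.0],
               index=pd.date_range('2000-01-03', periods=3, freq='2D'))

# Target labels: every day in the same span
daily = pd.date_range('2000-01-03', periods=5, freq='D')

# Without a fill method, the new labels receive NaN
plain = ts.reindex(daily)

# With method='ffill', each gap takes the last preceding observation
filled = ts.reindex(daily, method='ffill')

print(plain)
print(filled)
```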



In [177]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    0.051003
b   -1.204418
c   -2.782148
d    0.741887
e    0.748059
dtype: float64

In [178]:
s.reindex(['e', 'b', 'f', 'd'])

e    0.748059
b   -1.204418
f         NaN
d    0.741887
dtype: float64

Here, the label f was not in the original Series, so it appears as NaN in the result.  
With a DataFrame, you can reindex the index and the columns at the same time:

In [180]:
df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])

Unnamed: 0,three,two,one
c,1.15707,-0.197301,-0.966587
f,,,
b,-1.476982,0.139273,-0.915461


reindex also accepts the axis keyword:

In [181]:
df.reindex(['c', 'f', 'b'], axis='index')

Unnamed: 0,one,two,three
c,-0.966587,-0.197301,1.15707
f,,,
b,-0.915461,0.139273,-1.476982


Note: the Index objects that hold the actual axis labels can be shared between objects. Given a Series and a DataFrame, the following can be done:

In [183]:
rs = s.reindex(df.index)
rs

a    0.051003
b   -1.204418
c   -2.782148
d    0.741887
dtype: float64

In [185]:
rs.index is df.index

True

This means that, after reindexing, the Series' index is the same Python object as the DataFrame's index.

DataFrame.reindex() also supports an "axis-style" calling convention, where you pass a single labels argument and name the axis it applies to.

In [187]:
df.reindex(['c', 'f', 'b'], axis='index')

Unnamed: 0,one,two,three
c,-0.966587,-0.197301,1.15707
f,,,
b,-0.915461,0.139273,-1.476982


In [188]:
df.reindex(['three', 'two', 'one'], axis='columns')

Unnamed: 0,three,two,one
a,,-1.039693,-2.118074
b,-1.476982,0.139273,-0.915461
c,1.15707,-0.197301,-0.966587
d,1.262711,-0.236962,


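> When writing performance-sensitive code, it is worth taking the time to understand reindex in depth: many operations are faster on pre-aligned data. Adding two unaligned DataFrames triggers a reindex internally. In exploratory analysis you will hardly notice the difference, because reindex is heavily optimized, but when CPU cycles matter, a few explicit reindex calls can have an impact.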
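
The implicit alignment mentioned in the note can be made explicit. The sketch below uses made-up frames (`left`, `right`) and only demonstrates that both routes give the same result; it makes no claim about timings, which depend on the data and the pandas version.

```python
import pandas as pd

# Hypothetical frames whose indexes overlap only partially
left = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'x': [10.0, 20.0, 30.0]}, index=['b', 'c', 'd'])

# Implicit: the addition aligns both operands on the union of labels
implicit = left + right

# Explicit: align once up front, then operate on pre-aligned data
labels = left.index.union(right.index)
left_aligned = left.reindex(labels)
right_aligned = right.reindex(labels)
explicit = left_aligned + right_aligned

print(implicit)
```

If the same pair of objects takes part in many operations, aligning them once and reusing `left_aligned` and `right_aligned` avoids repeating the alignment in every operation.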

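### Reindexing to align with another object
You may want to take an object and reindex its axes so that they carry the same labels as another object. The syntax for this is straightforward but verbose, and the operation is common enough that the reindex_like() method is available to make it simpler: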

In [192]:
df2 = pd.DataFrame(np.random.randn(3, 2), index=['a', 'b', 'c'], columns=['one', 'two'])
df2

Unnamed: 0,one,two
a,-0.267913,-0.44957
b,0.664847,-0.015796
c,0.622647,0.939309


In [194]:
df3 = pd.DataFrame(np.random.randn(3, 2), index=['a', 'b', 'c'], columns=['one', 'two'])
df3

Unnamed: 0,one,two
a,-0.72328,-0.015851
b,-0.574376,1.580441
c,-0.260141,0.15337


In [195]:
df.reindex_like(df2)

Unnamed: 0,one,two
a,-2.118074,-1.039693
b,-0.915461,0.139273
c,-0.966587,-0.197301


In [197]:
df.reindex(df2.index)

Unnamed: 0,one,two,three
a,-2.118074,-1.039693,
b,-0.915461,0.139273,-1.476982
c,-0.966587,-0.197301,1.15707
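
The two calls above differ in scope: reindex_like() conforms both axes to the other object, while reindex(df2.index) changes only the rows. A small sketch of the equivalence, using made-up frames (`wide` and `template` are hypothetical names):

```python
import pandas as pd

# Hypothetical frames, used only for illustration
wide = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0],
                     'two': [5.0, 6.0, 7.0, 8.0],
                     'three': [9.0, 10.0, 11.0, 12.0]},
                    index=['a', 'b', 'c', 'd'])
template = pd.DataFrame(0.0, index=['a', 'b', 'c'], columns=['one', 'two'])

# Short form: take both the index and the columns from template
short = wide.reindex_like(template)

# Verbose form: spell out both axes
verbose = wide.reindex(index=template.index, columns=template.columns)

# Rows only: the column 'three' is kept
rows_only = wide.reindex(template.index)

print(short)
```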


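### Aligning objects with each other with align
The align() method is the fastest way to align two objects simultaneously. It supports a join argument (see joining and merging):
- `join='outer'`: take the union of the two indexes (default)
- `join='left'`: use the index of the calling object
- `join='right'`: use the index of the passed object
- `join='inner'`: take the intersection of the two indexes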

In [199]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    1.872222
b    1.160736
c    0.806891
d    1.818457
e    0.436520
dtype: float64

In [201]:
s1 = s[:4]
s1

a    1.872222
b    1.160736
c    0.806891
d    1.818457
dtype: float64

In [202]:
s2 = s[1:]

In [203]:
s1.align(s2)

(a    1.872222
 b    1.160736
 c    0.806891
 d    1.818457
 e         NaN
 dtype: float64,
 a         NaN
 b    1.160736
 c    0.806891
 d    1.818457
 e    0.436520
 dtype: float64)

In [204]:
s1.align(s2, join='inner')

(b    1.160736
 c    0.806891
 d    1.818457
 dtype: float64,
 b    1.160736
 c    0.806891
 d    1.818457
 dtype: float64)

In [205]:
s1.align(s2, join='left')

(a    1.872222
 b    1.160736
 c    0.806891
 d    1.818457
 dtype: float64,
 a         NaN
 b    1.160736
 c    0.806891
 d    1.818457
 dtype: float64)

In [206]:
s1.align(s2, join='right')

(b    1.160736
 c    0.806891
 d    1.818457
 e         NaN
 dtype: float64,
 b    1.160736
 c    0.806891
 d    1.818457
 e    0.436520
 dtype: float64)

For DataFrames, the join method is applied to both the index and the columns by default:

In [208]:
df.align(df2, join='inner')

(        one       two
 a -2.118074 -1.039693
 b -0.915461  0.139273
 c -0.966587 -0.197301,
         one       two
 a -0.267913 -0.449570
 b  0.664847 -0.015796
 c  0.622647  0.939309)

align also accepts an axis option to align on the specified axis only:

In [209]:
df.align(df2, join='inner', axis=0)

(        one       two     three
 a -2.118074 -1.039693       NaN
 b -0.915461  0.139273 -1.476982
 c -0.966587 -0.197301  1.157070,
         one       two
 a -0.267913 -0.449570
 b  0.664847 -0.015796
 c  0.622647  0.939309)