# Indexing

In [1]:
import numpy as np
import pandas as pd

## Indexers

### Column indexing on a DataFrame

Use `df['name']` to extract the column labeled `'name'` from a DataFrame `df`.

In [2]:
# Load the sample data
df = pd.read_csv('/Users/LWKM/Desktop/DW_pandas/data/learn_pandas.csv',
                 usecols = ['School', 'Grade', 'Name', 'Gender','Weight', 'Transfer'])

In [3]:
# Select the 'Name' column
df['Name'].head()

0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object

Selecting several columns works the same way, but the column labels must be wrapped in `[]`, that is, passed as a **list**.

In [4]:
# Select the 'Gender' and 'Name' columns
df[['Gender','Name']].head()

Unnamed: 0,Gender,Name
0,Female,Gaopeng Yang
1,Male,Changqiang You
2,Male,Mei Sun
3,Female,Xiaojuan Sun
4,Male,Gaojuan You


In [5]:
# A single column can also be accessed as an attribute, provided the column name contains no spaces
# Select 'School'
df.School.head(5)

0    Shanghai Jiao Tong University
1                Peking University
2    Shanghai Jiao Tong University
3                 Fudan University
4                 Fudan University
Name: School, dtype: object

### Row indexing on a Series

In [6]:
# Build an example Series
s = pd.Series([1,2,3,4,5,6],
              index=['a', 'b', 'a', 'a', 'a', 'c'])

In [7]:
# Select the rows labeled 'a'
s['a']

a    1
a    3
a    4
a    5
dtype: int64

In [8]:
# Select the rows labeled 'b' or 'c'
s[['b','c']]

b    2
c    6
dtype: int64

In [9]:
# Note that the order of the labels matters
s[['c','b']]

c    6
b    2
dtype: int64

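The examples above use string labels, but a Series can also have an integer index. Slicing works with both kinds of index, with one difference:

- With a string index, a slice **includes** both endpoints, and the start and end labels must each be **unique** in the index (unless the index is sorted).
- With an integer index, a `[]` slice is positional and **excludes** the end point.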
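The two slicing rules above can be sketched on small Series. The data and the names `s_str` and `s_int` are made up for illustration and do not come from the sample dataset.

```python
import pandas as pd

# Hypothetical example data, not taken from the dataset above
s_str = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s_int = pd.Series(['w', 'x', 'y', 'z'], index=[10, 20, 30, 40])

# Label slice on a string index: both 'b' and 'c' are included
label_slice = s_str['b':'c']

# Integer slice with []: positional, so position 3 is excluded
position_slice = s_int[1:3]

print(label_slice.tolist())     # [2, 3]
print(position_slice.tolist())  # ['x', 'y']
```

Note that `s_int[1:3]` counts positions, not the labels 10 to 40, which is why an integer index is easy to misread when slicing.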

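### The loc indexer

There are two indexers for selecting **rows** of a DataFrame:

1. the **label-based** <font color=red>loc</font> indexer
2. the **position-based** <font color=red>iloc</font> indexer

The `set_index` method turns a given column into the index.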

In [10]:
# Set 'Name' as the index
df_demo = df.set_index('Name')
df_demo.head()

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Gaopeng Yang,Shanghai Jiao Tong University,Freshman,Female,46.0,N
Changqiang You,Peking University,Freshman,Male,70.0,N
Mei Sun,Shanghai Jiao Tong University,Senior,Male,89.0,N
Xiaojuan Sun,Fudan University,Sophomore,Female,41.0,N
Gaojuan You,Fudan University,Sophomore,Male,74.0,N


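The general form of the <font color=red>loc</font> indexer is `loc[*, *]`, where the first `*` selects rows and the second `*` selects columns. `loc[*]` selects rows only.

Five kinds of objects are valid in each `*` position: a single label, a list of labels, a label slice, a boolean list, and a function.

Case 1: a single label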

In [11]:
df_demo.loc['Qiang Sun'] # Select the rows labeled 'Qiang Sun'

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Qiang Sun,Tsinghua University,Junior,Female,53.0,N
Qiang Sun,Tsinghua University,Sophomore,Female,40.0,N
Qiang Sun,Shanghai Jiao Tong University,Junior,Female,,N


In [12]:
df_demo.loc['Quan Zhao']

School      Shanghai Jiao Tong University
Grade                              Junior
Gender                             Female
Weight                                 53
Transfer                                N
Name: Quan Zhao, dtype: object

In [13]:
df_demo.loc['Qiang Sun','School'] # Select the 'School' values of the 'Qiang Sun' rows

Name
Qiang Sun              Tsinghua University
Qiang Sun              Tsinghua University
Qiang Sun    Shanghai Jiao Tong University
Name: School, dtype: object

In [14]:
df_demo.loc['Quan Zhao','School']

'Shanghai Jiao Tong University'

Case 2: a list of labels

In [15]:
df_demo.loc[['Qiang Sun','Quan Zhao'],['School','Gender']]

Unnamed: 0_level_0,School,Gender
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Qiang Sun,Tsinghua University,Female
Qiang Sun,Tsinghua University,Female
Qiang Sun,Shanghai Jiao Tong University,Female
Quan Zhao,Shanghai Jiao Tong University,Female


Case 3: a slice

When slicing, make sure the endpoint labels are unique. A `loc` slice **includes** both endpoints.

In [16]:
# Copy the example table
df_loc_slice_demo = df_demo.copy()
# Assign a descending integer index
df_loc_slice_demo.index = range(df_demo.shape[0],0,-1)

In [17]:
# Select rows with a label slice
df_loc_slice_demo.loc[7:4]
# Note that both endpoints are included

Unnamed: 0,School,Grade,Gender,Weight,Transfer
7,Tsinghua University,Senior,Male,79.0,N
6,Peking University,Senior,Female,49.0,
5,Fudan University,Junior,Female,46.0,N
4,Tsinghua University,Senior,Female,50.0,N


Case 4: a boolean list

Passing a boolean list to `loc` filters rows by a condition, which makes this form very common. The boolean list must have the **same length** as the DataFrame.

In [18]:
# Select students who weigh more than 70 kg
df_demo.loc[df_demo.Weight>70].head()

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Mei Sun,Shanghai Jiao Tong University,Senior,Male,89.0,N
Gaojuan You,Fudan University,Sophomore,Male,74.0,N
Xiaopeng Zhou,Shanghai Jiao Tong University,Freshman,Male,74.0,N
Xiaofeng Sun,Tsinghua University,Senior,Male,71.0,N
Qiang Zheng,Shanghai Jiao Tong University,Senior,Male,87.0,N


The `isin` method can be used to build a boolean list.

In [19]:
# Example of isin
df_demo.Grade.isin(['Freshman','Senior']).head()

Name
Gaopeng Yang       True
Changqiang You     True
Mei Sun            True
Xiaojuan Sun      False
Gaojuan You       False
Name: Grade, dtype: bool

In [20]:
# Select all freshmen and seniors
df_demo.loc[df_demo.Grade.isin(['Freshman','Senior'])].head()

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Gaopeng Yang,Shanghai Jiao Tong University,Freshman,Female,46.0,N
Changqiang You,Peking University,Freshman,Male,70.0,N
Mei Sun,Shanghai Jiao Tong University,Senior,Male,89.0,N
Xiaoli Qian,Tsinghua University,Freshman,Female,51.0,N
Qiang Chu,Shanghai Jiao Tong University,Freshman,Female,52.0,N


Compound conditions can be built with `|` (or), `&` (and), and `~` (not).
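As a minimal sketch of compound conditions, the block below combines two boolean Series on a small made-up table. `df_toy` and its values are hypothetical and only mirror the column names of the sample data.

```python
import pandas as pd

# Hypothetical toy table with the same column names as the sample data
df_toy = pd.DataFrame(
    {'Grade': ['Freshman', 'Junior', 'Senior', 'Senior'],
     'Gender': ['Male', 'Female', 'Male', 'Male'],
     'Weight': [72.0, 55.0, 81.0, 68.0]},
    index=['s1', 's2', 's3', 's4'])

# Each comparison needs its own parentheses, because & and | bind
# more tightly than the comparison operators
heavy_male = (df_toy.Weight > 70) & (df_toy.Gender == 'Male')
not_senior = ~(df_toy.Grade == 'Senior')

both = df_toy.loc[heavy_male & not_senior]    # rows meeting both conditions
either = df_toy.loc[heavy_male | not_senior]  # rows meeting at least one

print(both.index.tolist())    # ['s1']
print(either.index.tolist())  # ['s1', 's2', 's3']
```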

Case 5: a function

In my opinion this form is not very practical.
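For completeness, here is a small sketch of the function form. The function receives the DataFrame itself and has to return one of the other four valid forms, here a boolean Series. `df_toy` and `heavier_than_mean` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy table
df_toy = pd.DataFrame({'Weight': [46.0, 70.0, 89.0]}, index=['a', 'b', 'c'])

def heavier_than_mean(frame):
    # Must return something loc accepts: here, a boolean Series
    return frame.Weight > frame.Weight.mean()

result = df_toy.loc[heavier_than_mean]

# In a method chain the intermediate table has no name,
# so a lambda is one way to refer to it
chained = (df_toy
           .assign(Double=lambda d: d.Weight * 2)
           .loc[lambda d: d.Double > 150])

print(result.index.tolist())   # ['b', 'c']
print(chained.index.tolist())  # ['c']
```

The chained case is where this form tends to be useful, since the filter can depend on a column created one step earlier.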

#### Practice 1

`select_dtypes('type')` selects the columns of type `'type'` from a DataFrame.

In [21]:
# Example
df_demo.select_dtypes('number').head()

Unnamed: 0_level_0,Weight
Name,Unnamed: 1_level_1
Gaopeng Yang,46.0
Changqiang You,70.0
Mei Sun,89.0
Xiaojuan Sun,41.0
Gaojuan You,74.0


I have not yet worked out how to do the same with a boolean list. The boolean lists used so far select **rows**, not **columns**.
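One possible approach, sketched here under the assumption that a boolean list is also accepted in the column position of `loc`, is to build one boolean per column from `dtypes`. The table `df_toy` is hypothetical; `pd.api.types.is_numeric_dtype` is the pandas helper used to test each dtype.

```python
import pandas as pd

# Hypothetical toy table with one text column and two numeric columns
df_toy = pd.DataFrame({'School': ['A', 'B'],
                       'Weight': [46.0, 70.0],
                       'Age': [18, 19]})

# One boolean per column: True where the column dtype is numeric
is_numeric = df_toy.dtypes.map(pd.api.types.is_numeric_dtype)

# The boolean list goes in the column position of loc
numeric_cols = df_toy.loc[:, is_numeric]

print(numeric_cols.columns.tolist())  # ['Weight', 'Age']
```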

### The iloc indexer

The `iloc` indexer selects by **position** and is used much like `loc`. Some examples:

In [22]:
# Select row 3, column 1
df_demo.iloc[2,0]

'Shanghai Jiao Tong University'

In [23]:
# First two rows, first two columns
df_demo.iloc[[0,1],[0,1]]

Unnamed: 0_level_0,School,Grade
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Gaopeng Yang,Shanghai Jiao Tong University,Freshman
Changqiang You,Peking University,Freshman


In [24]:
# Slicing; note that the end point is excluded
df_demo.iloc[1:4,2:4]
# Rows 2 to 4, columns 3 to 4

Unnamed: 0_level_0,Gender,Weight
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Changqiang You,Male,70.0
Mei Sun,Male,89.0
Xiaojuan Sun,Female,41.0


In [25]:
# A boolean list passed to iloc must be the Series' values, which is a little inconvenient
# Select students who weigh more than 80 kg
df_demo.iloc[(df_demo.Weight>80).values].head()

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Mei Sun,Shanghai Jiao Tong University,Senior,Male,89.0,N
Qiang Zheng,Shanghai Jiao Tong University,Senior,Male,87.0,N
Qiang Han,Peking University,Freshman,Male,87.0,N
Chengpeng Zhou,Fudan University,Senior,Male,81.0,N
Feng Han,Shanghai Jiao Tong University,Sophomore,Male,82.0,N


In [26]:
# A Series can also use the iloc indexer
df_demo.School.iloc[1:6:3]

Name
Changqiang You    Peking University
Gaojuan You        Fudan University
Name: School, dtype: object

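### The query method

pandas supports passing a *query expression* as a string to the `query` method, which returns the rows that satisfy it. The advantage is that column names can be **referenced directly**, without the `df.` prefix, which keeps the code shorter.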

In [27]:
# Use query to select students who weigh more than the mean
df.query('Weight > Weight.mean()').head()

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
1,Peking University,Freshman,Changqiang You,Male,70.0,N
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
4,Fudan University,Sophomore,Gaojuan You,Male,74.0,N
10,Shanghai Jiao Tong University,Freshman,Xiaopeng Zhou,Male,74.0,N
14,Tsinghua University,Senior,Xiaomei Zhou,Female,57.0,N


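As the example shows, the column name `Weight` is referenced directly instead of `df.Weight`, and Series methods can be called inside the query expression.

Query expressions also support some readable operators, such as `or`, `and`, `in`, and `not in`. When a column is compared against a list, `==` is equivalent to `in` and `!=` is equivalent to `not in`.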

In [28]:
# Select male students who are neither freshmen nor sophomores
df.query('Grade not in ["Freshman","Sophomore"] and (Gender == "Male")').head()

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
16,Tsinghua University,Junior,Xiaoqiang Qin,Male,68.0,N
17,Tsinghua University,Junior,Peng Wang,Male,65.0,N
18,Tsinghua University,Senior,Xiaofeng Sun,Male,71.0,N
21,Shanghai Jiao Tong University,Senior,Xiaopeng Shen,Male,62.0,


To use an external variable inside a query expression, prefix its name with `@`.

In [29]:
# Define external variables
low, high = 70,80
# Select students who weigh between 70 kg and 80 kg
df.query('Weight.between(@low, @high).values').head()

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
1,Peking University,Freshman,Changqiang You,Male,70.0,N
4,Fudan University,Sophomore,Gaojuan You,Male,74.0,N
10,Shanghai Jiao Tong University,Freshman,Xiaopeng Zhou,Male,74.0,N
18,Tsinghua University,Senior,Xiaofeng Sun,Male,71.0,N
35,Peking University,Freshman,Gaoli Zhao,Male,78.0,N


In the pandas version used here, the example above raises an error unless `.values` is appended to the boolean expression `Weight.between(@low, @high)`.
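The `@` prefix and the `in` / `not in` operators can be sketched together on a small made-up table. `df_toy`, `low`, `high`, and `lower_years` are hypothetical names. Plain comparisons are used instead of `between`, which sidesteps the `.values` question.

```python
import pandas as pd

# Hypothetical toy table
df_toy = pd.DataFrame({'Grade': ['Freshman', 'Senior', 'Junior', 'Sophomore'],
                       'Weight': [72.0, 55.0, 81.0, 68.0]})

low, high = 60, 80
lower_years = ['Freshman', 'Sophomore']

# External variables are referenced with @
in_range = df_toy.query('(Weight >= @low) and (Weight <= @high)')

# in / not in test membership in a list
lower = df_toy.query('Grade in @lower_years')
upper = df_toy.query('Grade not in @lower_years')

print(in_range.index.tolist())  # [0, 3]
print(lower.index.tolist())     # [0, 3]
print(upper.index.tolist())     # [1, 2]
```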

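### Random sampling

A DataFrame can be viewed as a population whose rows are the individual observations. With a large dataset, you can draw a subset of rows first and use it to *approximate* some statistics. The `sample` method does this.

The main parameters of `sample` are `n` (number of items), `axis` (rows or columns), `frac` (fraction to sample), `replace` (whether to sample with replacement), and `weights` (sampling weights). An example follows.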

In [30]:
# Build the population
df_sample = pd.DataFrame({'id': list('abcde'), 'value': [1,2,3,4,90]})
df_sample

Unnamed: 0,id,value
0,a,1
1,b,2
2,c,3
3,d,4
4,e,90


In [31]:
# Draw 3 samples with replacement, weighted by the relative size of value
df_sample.sample(3,replace = True,weights = df_sample.value)

Unnamed: 0,id,value
3,d,4
4,e,90
4,e,90
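The `frac` and `axis` parameters listed above are not used in the example, so here is a small sketch of both. The `random_state` argument is an addition not mentioned in the text; it only makes the draw repeatable. `population` is a hypothetical name for the same toy table.

```python
import pandas as pd

# Same toy population as above, under a hypothetical name
population = pd.DataFrame({'id': list('abcde'), 'value': [1, 2, 3, 4, 90]})

# frac draws a proportion of the rows instead of a fixed count n
two_rows = population.sample(frac=0.4, random_state=0)

# axis=1 samples columns rather than rows
one_col = population.sample(n=1, axis=1, random_state=0)

print(len(two_rows))   # 2
print(one_col.shape)   # (5, 1)
```

Sampling is without replacement by default, so `two_rows` never contains the same row twice.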


## Multi-level indexes

### Multi-level indexes and table structure

In [32]:
# Build an example table
np.random.seed(0)

multi_index = pd.MultiIndex.from_product([list('ABCD'),df.Gender.unique()], names=('School', 'Gender'))

multi_column = pd.MultiIndex.from_product([['Height', 'Weight'],df.Grade.unique()], names=('Indicator', 'Grade'))

df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5 + 163).tolist(),
                              (np.random.randn(8,4)*5 + 65).tolist()],
                        index = multi_index,
                        columns = multi_column).round(1)

df_multi

Unnamed: 0_level_0,Indicator,Height,Height,Height,Height,Weight,Weight,Weight,Weight
Unnamed: 0_level_1,Grade,Freshman,Senior,Sophomore,Junior,Freshman,Senior,Sophomore,Junior
School,Gender,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
A,Female,171.8,165.0,167.9,174.2,60.6,55.1,63.3,65.8
A,Male,172.3,158.1,167.8,162.2,71.2,71.0,63.1,63.5
B,Female,162.5,165.1,163.7,170.3,59.8,57.9,56.5,74.8
B,Male,166.8,163.6,165.2,164.7,62.5,62.8,58.7,68.9
C,Female,170.5,162.0,164.6,158.7,56.9,63.9,60.5,66.9
C,Male,150.2,166.3,167.3,159.3,62.4,59.1,64.9,67.1
D,Female,174.3,155.7,163.2,162.1,65.3,66.5,61.8,63.2
D,Male,170.7,170.3,163.8,164.9,61.6,63.2,60.9,56.4


In [33]:
# Inspect the names of the row index levels
df_multi.index.names

FrozenList(['School', 'Gender'])

In [34]:
# Inspect the values of the row index
df_multi.index.values

array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'),
       ('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')],
      dtype=object)

In [35]:
# Inspect the names of the column index levels
df_multi.columns.names

FrozenList(['Indicator', 'Grade'])

In [36]:
# Inspect the values of the column index
df_multi.columns.values

array([('Height', 'Freshman'), ('Height', 'Senior'),
       ('Height', 'Sophomore'), ('Height', 'Junior'),
       ('Weight', 'Freshman'), ('Weight', 'Senior'),
       ('Weight', 'Sophomore'), ('Weight', 'Junior')], dtype=object)

### The loc indexer with multi-level indexes

In [37]:
# Set a two-level index of school and grade on this chapter's sample table
df_multi = df.set_index(['School', 'Grade'])
# Check the result
df_multi.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,46.0,N
Peking University,Freshman,Changqiang You,Male,70.0,N
Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
Fudan University,Sophomore,Xiaojuan Sun,Female,41.0,N
Fudan University,Sophomore,Gaojuan You,Male,74.0,N


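The `loc` indexer also works with a multi-level index, but it is best to **sort** the index first to avoid performance warnings.

**Note**: a single element of a multi-level index is a **tuple**.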

In [38]:
# Sort the index
df_multi = df_multi.sort_index()
# Select the juniors at Fudan University
df_multi.loc[('Fudan University','Junior')].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Fudan University,Junior,Yanli You,Female,48.0,N
Fudan University,Junior,Chunqiang Chu,Male,72.0,N
Fudan University,Junior,Changfeng Lv,Male,76.0,N
Fudan University,Junior,Yanjuan Lv,Female,49.0,
Fudan University,Junior,Gaoqiang Zhou,Female,43.0,N


In [39]:
# Select Fudan seniors and Shanghai Jiao Tong freshmen
df_multi.loc[[('Fudan University','Senior'),('Shanghai Jiao Tong University', 'Freshman')]].tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Shanghai Jiao Tong University,Freshman,Chunmei Shi,Female,52.0,N
Shanghai Jiao Tong University,Freshman,Xiaomei Yang,Female,49.0,N
Shanghai Jiao Tong University,Freshman,Xiaofeng Qian,Female,49.0,N
Shanghai Jiao Tong University,Freshman,Changmei Lv,Male,75.0,N
Shanghai Jiao Tong University,Freshman,Qiang Feng,Male,80.0,N


In [40]:
# A boolean list also works on a DataFrame with a multi-level index
df_multi.loc[df_multi.Weight > 70].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Fudan University,Freshman,Feng Wang,Male,74.0,N
Fudan University,Junior,Chunqiang Chu,Male,72.0,N
Fudan University,Junior,Changfeng Lv,Male,76.0,N
Fudan University,Senior,Chengpeng Zhou,Male,81.0,N
Fudan University,Senior,Chengpeng Qian,Male,73.0,Y


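`loc` can also select a cross product of values from the row index levels, but the **column** selection must then be given as well; use `:` to select all columns.

For example, the sophomores and juniors of Peking University and Fudan University can be selected with the row indexer `(['Peking University', 'Fudan University'],['Sophomore', 'Junior'])`.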

In [41]:
# Select the sophomores and juniors of Peking University and Fudan University
res = df_multi.loc[(['Peking University', 'Fudan University'],['Sophomore', 'Junior']), :]
res.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Peking University,Sophomore,Changmei Xu,Female,43.0,N
Peking University,Sophomore,Xiaopeng Qin,Male,,N
Peking University,Sophomore,Mei Xu,Female,39.0,N
Peking University,Sophomore,Xiaoli Zhou,Female,55.0,N
Peking University,Sophomore,Peng Han,Female,34.0,
Peking University,Junior,Juan Xu,Female,,N
Peking University,Junior,Changjuan You,Female,47.0,N
Peking University,Junior,Gaoli Xu,Female,48.0,N
Peking University,Junior,Gaoquan Zhou,Male,70.0,N
Peking University,Junior,Qiang You,Female,56.0,N


In [42]:
# In contrast to the previous example, this selects only Peking juniors and Fudan sophomores
res = df_multi.loc[[('Peking University', 'Junior'),
                    ('Fudan University', 'Sophomore')]]
res.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Peking University,Junior,Juan Xu,Female,,N
Peking University,Junior,Changjuan You,Female,47.0,N
Peking University,Junior,Gaoli Xu,Female,48.0,N
Peking University,Junior,Gaoquan Zhou,Male,70.0,N
Peking University,Junior,Qiang You,Female,56.0,N
Peking University,Junior,Chengli Zhao,Male,,
Peking University,Junior,Chengpeng Zhao,Female,44.0,N
Peking University,Junior,Xiaofeng Zhao,Female,46.0,N
Fudan University,Sophomore,Xiaojuan Sun,Female,41.0,N
Fudan University,Sophomore,Gaojuan You,Male,74.0,N


### The IndexSlice object

In [43]:
# First, build a DataFrame whose index has no duplicates
np.random.seed(0)

L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))

L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))

df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)),
                     index=mul_index1,
                     columns=mul_index2)
df_ex

Unnamed: 0_level_0,Big,D,D,D,E,E,E,F,F,F
Unnamed: 0_level_1,Small,d,e,f,d,e,f,d,e,f
Upper,Lower,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
A,a,3,6,-9,-6,-6,-2,0,9,-5
A,b,-3,3,-8,-3,-2,5,8,-4,4
A,c,-1,0,7,-4,6,6,-9,9,-6
B,a,8,5,-2,-9,-8,0,-9,1,-6
B,b,2,9,-7,-9,-9,-5,-4,-3,-1
B,c,8,6,-5,0,1,-8,-8,-2,0
C,a,-6,-3,2,5,9,-9,5,-6,3
C,b,1,2,-5,-3,-5,6,-6,3,-5
C,c,-1,5,6,-6,6,4,7,8,-4


In [44]:
# First, define the IndexSlice object
idx = pd.IndexSlice

In [45]:
# First slicing style; it cannot slice each level separately
df_ex.loc[idx['C':,('D','f'):]]

Unnamed: 0_level_0,Big,D,E,E,E,F,F,F
Unnamed: 0_level_1,Small,f,d,e,f,d,e,f
Upper,Lower,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
C,a,2,5,9,-9,5,-6,3
C,b,-5,-3,-5,6,-6,3,-5
C,c,6,-6,6,4,7,8,-4


In [46]:
# Second slicing style; each level can be sliced separately
df_ex.loc[idx[:'A','b':],idx['E':,'e':]]

Unnamed: 0_level_0,Big,E,E,F,F
Unnamed: 0_level_1,Small,e,f,e,f
Upper,Lower,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
A,b,-2,5,-4,4
A,c,6,6,9,-6


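### Building multi-level indexes

`pd.MultiIndex` provides three functions for building a multi-level index:

|Function|How the index is built|
|:---|:-------|
|from_tuples|from a list of **tuples**|
|from_arrays|from a list holding one **list** per level|
|from_product|from the **Cartesian product** of several lists|

In practice, I find `from_product` the most useful.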
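As a small sketch, the three constructors in the table can be made to produce the same index. The level names and values below are made up for illustration.

```python
import pandas as pd

names = ('Upper', 'Lower')

# One tuple per row of the index
from_tuples = pd.MultiIndex.from_tuples(
    [('A', 'a'), ('A', 'b'), ('B', 'a'), ('B', 'b')], names=names)

# One list per level, all of the same length
from_arrays = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], ['a', 'b', 'a', 'b']], names=names)

# Cartesian product of the level values
from_product = pd.MultiIndex.from_product(
    [['A', 'B'], ['a', 'b']], names=names)

print(from_tuples.equals(from_arrays))   # True
print(from_arrays.equals(from_product))  # True
```

`from_product` needs the least typing when every combination is wanted; the other two are needed when only some combinations exist.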

## Common index methods

### Swapping and dropping index levels

In [47]:
# First, build a three-level index
np.random.seed(0)

L1,L2,L3 = ['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3],
                                        names=('Upper', 'Lower','Extra'))

L4,L5,L6 = ['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6],
                                        names=('Big', 'Small', 'Other'))

df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)),
                     index=mul_index1,
                     columns=mul_index2)
df_ex

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9
B,a,beta,-9,-5,-4,-3,-1,8,6,-5
B,b,alpha,0,1,-8,-8,-2,0,-6,-3
B,b,beta,2,5,9,-9,5,-6,3,1


To swap two levels, use `swaplevel`; to rearrange any number of levels, use `reorder_levels`. The `axis` parameter chooses between the row index (`0`) and the column index (`1`).

In [48]:
# Swap the first and third levels of the column index
df_ex.swaplevel(0,2,axis=1).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Other,cat,dog,cat,dog,cat,dog,cat,dog
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Big,C,C,C,C,D,D,D,D
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9


In [49]:
# Reorder the row index levels
df_ex.reorder_levels([2,0,1],axis=0).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
Extra,Upper,Lower,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
alpha,A,a,3,6,-9,-6,-6,-2,0,9
beta,A,a,-5,-3,3,-8,-3,-2,5,8
alpha,A,b,-4,4,-1,0,7,-4,6,6
beta,A,b,-9,9,-6,8,5,-2,-9,-8
alpha,B,a,0,-9,1,-6,2,9,-7,-9


Use `droplevel` to remove a given index level.

In [50]:
# Drop the second level of the column index
df_ex.droplevel(1,axis=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Other,cat,dog,cat,dog,cat,dog,cat,dog
Upper,Lower,Extra,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9
B,a,beta,-9,-5,-4,-3,-1,8,6,-5
B,b,alpha,0,1,-8,-8,-2,0,-6,-3
B,b,beta,2,5,9,-9,5,-6,3,1


### Modifying index attributes

`rename_axis` changes the **names** of index levels; the usual way is to pass a dictionary mapping:

In [51]:
df_ex.rename_axis(index={'Upper':'Changed_row'},columns={'Other':'Changed_Col'}).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Changed_Col,cat,dog,cat,dog,cat,dog,cat,dog
Changed_row,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9


`rename` changes the **values** of an index (a function may be passed as well). For a multi-level index, specify the level to modify with `level`:

In [52]:
# Rename cat to not_cat in the third level of the column index
df_ex.rename(columns={'cat':'not_cat'}, level=2).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,not_cat,dog,not_cat,dog,not_cat,dog,not_cat,dog
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9


The `map` method defined on Index can be used to *flatten* a multi-level index; the function passed in receives each index entry as a **tuple**.

In [53]:
# Make a copy
df_temp = df_ex.copy()
# Flatten the index
new_idx = df_temp.index.map(lambda x: (x[0]+'-'+x[1]+'-'+x[2]))
# Update the index
df_temp.index = new_idx
# Check the result
df_temp.head()

Big,C,C,C,C,D,D,D,D
Small,c,c,d,d,c,c,d,d
Other,cat,dog,cat,dog,cat,dog,cat,dog
A-a-alpha,3,6,-9,-6,-6,-2,0,9
A-a-beta,-5,-3,3,-8,-3,-2,5,8
A-b-alpha,-4,4,-1,0,7,-4,6,6
A-b-beta,-9,9,-6,8,5,-2,-9,-8
B-a-alpha,0,-9,1,-6,2,9,-7,-9


The flattened index can also be expanded back:

In [54]:
# Split
new_idx = df_temp.index.map(lambda x:tuple(x.split('-')))
# Update
df_temp.index = new_idx
df_temp.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9


#### Practice 2

In [55]:
# Convert all row index names to upper case
df_ex.rename_axis(index=lambda x:str.upper(x))

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
UPPER,LOWER,EXTRA,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9
B,a,beta,-9,-5,-4,-3,-1,8,6,-5
B,b,alpha,0,1,-8,-8,-2,0,-6,-3
B,b,beta,2,5,9,-9,5,-6,3,1


### Setting and resetting the index

In [56]:
# Build an example
df_new = pd.DataFrame({'A':list('aacd'),
                       'B':list('PQRT'),
                       'C':[1,2,3,4]})
df_new

Unnamed: 0,A,B,C
0,a,P,1
1,a,Q,2
2,c,R,3
3,d,T,4


An index is set with `set_index`. With `append=True`, the new index is added as an inner level of the existing index.

In [57]:
df_new.set_index('A')

Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,P,1
a,Q,2
c,R,3
d,T,4


In [58]:
df_new.set_index('B', append=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,A,C
Unnamed: 0_level_1,B,Unnamed: 2_level_1,Unnamed: 3_level_1
0,P,a,1
1,Q,a,2
2,R,c,3
3,T,d,4


In [59]:
# Set several columns as the index at once
df_new = df_new.set_index(['A','B'])
df_new

Unnamed: 0_level_0,Unnamed: 1_level_0,C
A,B,Unnamed: 2_level_1
a,P,1
a,Q,2
c,R,3
d,T,4


`reset_index` is the inverse of `set_index`. With `drop=True`, the removed index levels are discarded instead of being turned back into columns.

In [60]:
df_new = df_new.reset_index(['A'])
df_new

Unnamed: 0_level_0,A,C
B,Unnamed: 1_level_1,Unnamed: 2_level_1
P,a,1
Q,a,2
R,c,3
T,d,4


In [61]:
# After all index levels are reset, a new default index is generated
df_new.reset_index()

Unnamed: 0,B,A,C
0,P,a,1
1,Q,a,2
2,R,c,3
3,T,d,4
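The `drop=True` option of `reset_index` is not exercised in the cells above. A minimal sketch, rebuilding the same small table under the hypothetical name `toy`:

```python
import pandas as pd

# Same small table as df_new, rebuilt under a hypothetical name
toy = pd.DataFrame({'A': list('aacd'),
                    'B': list('PQRT'),
                    'C': [1, 2, 3, 4]}).set_index(['A', 'B'])

kept = toy.reset_index('A')                 # 'A' comes back as a column
dropped = toy.reset_index('A', drop=True)   # 'A' is discarded

print(kept.columns.tolist())     # ['A', 'C']
print(dropped.columns.tolist())  # ['C']
```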


### Reshaping the index

The `reindex` method can extend or trim an index.

In [62]:
df_reindex = pd.DataFrame({"Weight":[60,70,80],
                           "Height":[176,180,179]},
                          index=['1001','1003','1002'])
df_reindex

Unnamed: 0,Weight,Height
1001,60,176
1003,70,180
1002,80,179


In [63]:
df_reindex.reindex(index=['1001','1002','1003','1004'],
                   columns=['Weight','Gender'])

Unnamed: 0,Weight,Gender
1001,60.0,
1002,80.0,
1003,70.0,
1004,,


`reindex_like` reshapes the index to match an existing DataFrame.

In [64]:
df_existed = pd.DataFrame(index=['1001','1002','1003','1004'],
                          columns=['Weight','Gender'])
df_reindex.reindex_like(df_existed)

Unnamed: 0,Weight,Gender
1001,60.0,
1002,80.0,
1003,70.0,
1004,,


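## Index operations

Index operations can be viewed as **set** operations, but the indexes should first be **deduplicated** with `unique`. The common operations are summarized below, where `id1` and `id2` are indexes without duplicate elements. The operator forms in the last column apply to older pandas versions; in recent versions `&`, `|`, and `^` on an Index no longer perform set operations, so the method forms are the safer choice.

|Operation|Method|Equivalent operator form|
|:---|:---|:-----------|
|Intersection|id1.intersection(id2)|id1 & id2|
|Union|id1.union(id2)|id1 \| id2|
|Symmetric difference|id1.symmetric_difference(id2)|id1 ^ id2|
|Difference|id1.difference(id2)|(id1 ^ id2) & id1|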
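The four methods in the table can be sketched on two small made-up indexes, deduplicated with `unique` first as described above.

```python
import pandas as pd

# Hypothetical indexes; unique() removes the repeated 'b'
id1 = pd.Index(['a', 'b', 'b', 'c']).unique()
id2 = pd.Index(['b', 'c', 'd']).unique()

common = id1.intersection(id2)
combined = id1.union(id2)
exclusive = id1.symmetric_difference(id2)
only_first = id1.difference(id2)

print(common.tolist())      # ['b', 'c']
print(combined.tolist())    # ['a', 'b', 'c', 'd']
print(exclusive.tolist())   # ['a', 'd']
print(only_first.tolist())  # ['a']
```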

## Exercises

### Ex1: Employee dataset

In [65]:
# Load the dataset
df = pd.read_csv('./data/company.csv')
# Preview
df.head(3)

Unnamed: 0,EmployeeID,birthdate_key,age,city_name,department,job_title,gender
0,1318,1/3/1954,61,Vancouver,Executive,CEO,M
1,1319,1/3/1957,58,Vancouver,Executive,VP Stores,F
2,1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F


**1** Using `query` only and `loc` only, select the males who are at most 40 years old and work in the `Dairy` or `Bakery` department.

In [66]:
# Select with loc (age and department conditions only; no gender condition is applied)
df.loc[(df.age<=40) & ((df.department == 'Dairy')|(df.department == 'Bakery'))]

Unnamed: 0,EmployeeID,birthdate_key,age,city_name,department,job_title,gender
3608,5788,1/12/1975,40,Chilliwack,Dairy,Dairy Person,F
3609,5789,1/13/1975,40,Chilliwack,Dairy,Dairy Person,F
3610,5790,1/13/1975,40,Kelowna,Dairy,Dairy Person,F
3611,5791,1/14/1975,40,Kelowna,Dairy,Dairy Person,M
3613,5793,1/22/1975,40,Richmond,Bakery,Baker,M
...,...,...,...,...,...,...,...
6136,8335,12/28/1994,21,Vancouver,Dairy,Dairy Person,F
6137,8336,12/31/1994,21,Vancouver,Dairy,Dairy Person,M
6270,6312,5/14/1979,36,Grand Forks,Dairy,Dairy Person,M
6271,6540,2/14/1981,34,Victoria,Bakery,Baker,M


In [67]:
# Select with query (age and department conditions only; no gender condition is applied)
df.query('(age<=40) & ((department=="Dairy")|(department=="Bakery"))')

Unnamed: 0,EmployeeID,birthdate_key,age,city_name,department,job_title,gender
3608,5788,1/12/1975,40,Chilliwack,Dairy,Dairy Person,F
3609,5789,1/13/1975,40,Chilliwack,Dairy,Dairy Person,F
3610,5790,1/13/1975,40,Kelowna,Dairy,Dairy Person,F
3611,5791,1/14/1975,40,Kelowna,Dairy,Dairy Person,M
3613,5793,1/22/1975,40,Richmond,Bakery,Baker,M
...,...,...,...,...,...,...,...
6136,8335,12/28/1994,21,Vancouver,Dairy,Dairy Person,F
6137,8336,12/31/1994,21,Vancouver,Dairy,Dairy Person,M
6270,6312,5/14/1979,36,Grand Forks,Dairy,Dairy Person,M
6271,6540,2/14/1981,34,Victoria,Bakery,Baker,M
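The two cells above filter on age and department only, so female employees still appear in the output. A sketch of how the gender condition could be added is shown below on a hypothetical four-row table `emp`, since the real `company.csv` is not reproduced here; the column names follow the dataset.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names
emp = pd.DataFrame({
    'age': [40, 35, 52, 28],
    'department': ['Dairy', 'Bakery', 'Dairy', 'Meats'],
    'gender': ['M', 'F', 'M', 'M']})

# loc version: all three conditions combined with &
by_loc = emp.loc[(emp.age <= 40)
                 & emp.department.isin(['Dairy', 'Bakery'])
                 & (emp.gender == 'M')]

# query version of the same filter
by_query = emp.query(
    '(age <= 40) and (department in ["Dairy", "Bakery"]) and (gender == "M")')

print(by_loc.index.tolist())    # [0]
print(by_loc.equals(by_query))  # True
```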


**2** Select columns 1, 3, and the second-to-last column for the rows whose employee `ID` is odd.

In [68]:
df.loc[df.EmployeeID%2==1].iloc[:,[0,2,-2]]

Unnamed: 0,EmployeeID,age,job_title
1,1319,58,VP Stores
3,1321,56,VP Human Resources
5,1323,53,"Exec Assistant, VP Stores"
6,1325,51,"Exec Assistant, Legal Counsel"
8,1329,48,Store Manager
...,...,...,...
6276,7659,26,Cashier
6277,7741,25,Cashier
6278,7801,25,Dairy Person
6280,8181,22,Cashier


**3** Perform the following index operations step by step:

In [69]:
# step 1: set the last three columns as the index, then swap the inner and outer levels
dfn = df.set_index(list(df.columns[-3:])).swaplevel(0,2)

In [70]:
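# step 2: restore the middle level (note: droplevel(1) discards job_title; reset_index(level=1) would keep it as a column)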
dfn = dfn.droplevel(1)

In [71]:
# step 3: rename the outer index level to Gender
dfn = dfn.rename_axis(index={'gender':'Gender'})

In [72]:
# step 4: merge the two row index levels with an underscore
new_idx = dfn.index.map(lambda x: (x[0]+'_'+x[1]))
dfn.index = new_idx

In [73]:
# step 5: split the row index back into its original state
orin_idx = dfn.index.map(lambda x: tuple(x.split('_')))
dfn.index = orin_idx

In [74]:
# step 6: rename the index levels back to the original column names
dfn.index.names = ['gender','department']

In [75]:
# step 7: restore the default index and keep the columns in their original relative positions
# Restore the default index
dfn = dfn.reset_index()

In [76]:
# Reorder to match the original table
dfn.reindex_like(df).head()

Unnamed: 0,EmployeeID,birthdate_key,age,city_name,department,job_title,gender
0,1318,1/3/1954,61,Vancouver,Executive,,M
1,1319,1/3/1957,58,Vancouver,Executive,,F
2,1320,1/2/1955,60,Vancouver,Executive,,F
3,1321,1/2/1959,56,Vancouver,Executive,,M
4,1322,1/9/1958,57,Vancouver,Executive,,M
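The empty `job_title` column in the result above comes from step 2, where `droplevel(1)` discards the middle level. A sketch of the difference on a hypothetical two-row table `toy`, assuming the intent of step 2 is to turn the middle level back into a column:

```python
import pandas as pd

# Hypothetical rows using the dataset's last three column names
toy = pd.DataFrame({'id': [1, 2],
                    'department': ['Executive', 'Dairy'],
                    'job_title': ['CEO', 'Dairy Person'],
                    'gender': ['M', 'F']})

swapped = toy.set_index(['department', 'job_title', 'gender']).swaplevel(0, 2)

dropped = swapped.droplevel(1)           # job_title is lost
restored = swapped.reset_index(level=1)  # job_title becomes a column again

print('job_title' in dropped.columns)   # False
print('job_title' in restored.columns)  # True
print(list(restored.index.names))       # ['gender', 'department']
```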


### Ex2: Chocolate dataset

In [77]:
df = pd.read_csv('data/chocolate.csv')
df.head(3)

Unnamed: 0,Company,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating
0,A. Morin,2016,63%,France,3.75
1,A. Morin,2015,70%,France,2.75
2,A. Morin,2015,70%,France,3.0


**1** Replace `\n` in the column names with a space.

In [78]:
new_c = df.columns.map(lambda x: x.replace('\n',' '))
df.columns = new_c

In [79]:
df.head()

Unnamed: 0,Company,Review Date,Cocoa Percent,Company Location,Rating
0,A. Morin,2016,63%,France,3.75
1,A. Morin,2015,70%,France,2.75
2,A. Morin,2015,70%,France,3.0
3,A. Morin,2015,70%,France,3.5
4,A. Morin,2015,70%,France,3.5


**2** Chocolate `Rating` scores range from 1 to 5 in steps of 0.25. Select the samples rated 2.75 or lower whose `Cocoa Percent` is above the median.

In [80]:
# Convert the percentage strings to floats first
df['Cocoa Percent'] = df['Cocoa Percent'].str.strip('%').astype(float)/100

In [81]:
df.loc[(df.Rating<=2.75)&(df['Cocoa Percent']>df['Cocoa Percent'].median())]

Unnamed: 0,Company,Review Date,Cocoa Percent,Company Location,Rating
33,Akesson's (Pralus),2010,0.75,Switzerland,2.75
34,Akesson's (Pralus),2010,0.75,Switzerland,2.75
36,Alain Ducasse,2014,0.75,France,2.75
38,Alain Ducasse,2013,0.75,France,2.50
39,Alain Ducasse,2013,0.75,France,2.50
...,...,...,...,...,...
1736,Wilkie's Organic,2013,0.89,Ireland,2.75
1738,Wilkie's Organic,2013,0.75,Ireland,2.75
1741,Willie's Cacao,2013,1.00,U.K.,2.25
1769,Zart Pralinen,2016,0.85,Austria,2.75


**3** After setting `Review Date` and `Company Location` as the index, select the samples whose `Review Date` is 2012 or later and whose `Company Location` is not one of `France`, `Canada`, `Amsterdam`, `Belgium`.

In [82]:
# Set the index
df = df.set_index(['Review Date','Company Location'])

In [83]:
# Sort the index
df = df.sort_index()

In [84]:
df.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Company,Cocoa Percent,Rating
Review Date,Company Location,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2006,Belgium,Cote d' Or (Kraft),0.7,1.0
2006,Belgium,Dolfin (Belcolade),0.7,1.5
2006,Belgium,Neuhaus (Callebaut),0.73,2.0
2006,Belgium,Neuhaus (Callebaut),0.75,2.75
2006,Belgium,Neuhaus (Callebaut),0.71,3.0
2006,Belgium,Pierre Marcolini,0.72,4.0
2006,Belgium,Pierre Marcolini,0.72,3.0
2006,Belgium,Pierre Marcolini,0.72,4.0
2006,Colombia,Santander (Compania Nacional),0.53,2.0
2006,Colombia,Santander (Compania Nacional),0.65,3.0


Due to limited time, this solution follows the reference answer directly.

In [85]:
# Define the IndexSlice object
idx = pd.IndexSlice
# Define the exclusion list
exclude = ['France', 'Canada', 'Amsterdam', 'Belgium']
# Select with the slice
df.loc[idx[2012:,~df.index.get_level_values(1).isin(exclude)],:].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Company,Cocoa Percent,Rating
Review Date,Company Location,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2012,Australia,Bahen & Co.,0.7,3.0
2012,Australia,Bahen & Co.,0.7,2.5
2012,Australia,Bahen & Co.,0.7,2.5
2012,Australia,Cravve,0.75,3.25
2012,Australia,Cravve,0.65,3.25
