# Reshaping

In [1]:
import numpy as np
import pandas as pd

## Reshaping between long and wide tables

The following examples show a **long table** and a **wide table** of heights by gender; the two carry the same information.

In [2]:
# Long table
pd.DataFrame({'Gender':['F','F','M','M'],
              'Height':[163,160,175,180]})

Unnamed: 0,Gender,Height
0,F,163
1,F,160
2,M,175
3,M,180


In [3]:
# Wide table
pd.DataFrame({'Height_F':[163,160],
              'Height_M':[175,180]})

Unnamed: 0,Height_F,Height_M
0,163,175
1,160,180
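
To make the equivalence concrete, here is a minimal sketch that derives the wide table from the long one. It relies on `pivot`, which is introduced later in this section; the helper column `obs` (the position of each observation within its gender) is a hypothetical name added only for illustration.

```python
import pandas as pd

long_df = pd.DataFrame({'Gender': ['F', 'F', 'M', 'M'],
                        'Height': [163, 160, 175, 180]})

# Number the observations within each gender: 0, 1, 0, 1 (hypothetical helper column)
long_df['obs'] = long_df.groupby('Gender').cumcount()

# One row per observation number, one column per gender
wide_df = (long_df.pivot(index='obs', columns='Gender', values='Height')
                  .add_prefix('Height_'))
print(wide_df)
```

The columns of `wide_df` are `Height_F` and `Height_M`, matching the wide table above.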


pandas provides a family of functions for reshaping between long and wide tables: `pivot`, `pivot_table`, `melt`, and `wide_to_long`. They are introduced in turn below.

### pivot

In [4]:
df = pd.DataFrame({'Class':[1,1,2,2],
                   'Name':['San Zhang','San Zhang','Si Li','Si Li'],
                   'Subject':['Chinese','Math','Chinese','Math'],
                   'Grade':[80,75,90,85]})
df

Unnamed: 0,Class,Name,Subject,Grade
0,1,San Zhang,Chinese,80
1,1,San Zhang,Math,75
2,2,Si Li,Chinese,90
3,2,Si Li,Math,85


The `pivot` function turns a long table into a wide one, as shown below:

In [5]:
# Use names as the row index and subjects as the column index, showing exam grades
df.pivot(index='Name',columns='Subject',values='Grade')

Subject,Chinese,Math
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
San Zhang,80,75
Si Li,90,85


Note that for `pivot` to succeed, each combination of `index` and `columns` must correspond to a **unique** entry in `values`.
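
The uniqueness requirement can be checked with a small sketch. The frame `dup` below is made-up data in which the same name/subject pair appears twice; under that assumption `pivot` is expected to refuse to reshape and raise a `ValueError`.

```python
import pandas as pd

# Made-up data: ('San Zhang', 'Chinese') appears twice
dup = pd.DataFrame({'Name': ['San Zhang', 'San Zhang'],
                    'Subject': ['Chinese', 'Chinese'],
                    'Grade': [80, 90]})

try:
    dup.pivot(index='Name', columns='Subject', values='Grade')
    raised = False
except ValueError as err:
    # pivot cannot decide which of the two grades to place in the cell
    raised = True
    print('pivot failed:', err)
```

When duplicates like these are expected, `pivot_table` with an aggregation function is the usual alternative.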

In addition, all three parameters of `pivot` accept *lists*, which produces multi-level row/column indexes.

In [6]:
df = pd.DataFrame({'Class':[1, 1, 2, 2, 1, 1, 2, 2],
                   'Name':['San Zhang', 'San Zhang', 'Si Li', 'Si Li',
                           'San Zhang', 'San Zhang', 'Si Li', 'Si Li'],
                   'Examination': ['Mid', 'Final', 'Mid', 'Final',
                                   'Mid', 'Final', 'Mid', 'Final'],
                   'Subject':['Chinese', 'Chinese', 'Chinese', 'Chinese',
                              'Math', 'Math', 'Math', 'Math'],
                   'Grade':[80, 75, 85, 65, 90, 85, 92, 88],
                   'rank':[10, 15, 21, 15, 20, 7, 6, 2]})
df

Unnamed: 0,Class,Name,Examination,Subject,Grade,rank
0,1,San Zhang,Mid,Chinese,80,10
1,1,San Zhang,Final,Chinese,75,15
2,2,Si Li,Mid,Chinese,85,21
3,2,Si Li,Final,Chinese,65,15
4,1,San Zhang,Mid,Math,90,20
5,1,San Zhang,Final,Math,85,7
6,2,Si Li,Mid,Math,92,6
7,2,Si Li,Final,Math,88,2


In [7]:
# Row index: class & name; column index: subject & exam type; values: grade & rank
pivot_multi = df.pivot(index=['Class','Name'],columns=['Subject','Examination'],values=['Grade','rank'])
pivot_multi

Unnamed: 0_level_0,Unnamed: 1_level_0,Grade,Grade,Grade,Grade,rank,rank,rank,rank
Unnamed: 0_level_1,Subject,Chinese,Chinese,Math,Math,Chinese,Chinese,Math,Math
Unnamed: 0_level_2,Examination,Mid,Final,Mid,Final,Mid,Final,Mid,Final
Class,Name,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3
1,San Zhang,80,75,90,85,10,15,20,7
2,Si Li,85,65,92,88,21,15,6,2


### pivot_table

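`pivot_table` works much like `pivot` but does not require the `values` to be unique. The group of `values` that falls under each `index`/`columns` combination is aggregated by `aggfunc` into a single scalar.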

In [8]:
df = pd.DataFrame({'Name':['San Zhang', 'San Zhang',
                           'San Zhang', 'San Zhang',
                           'Si Li', 'Si Li', 'Si Li', 'Si Li'],
                   'Subject':['Chinese', 'Chinese', 'Math', 'Math',
                              'Chinese', 'Chinese', 'Math', 'Math'],
                   'Grade':[80, 90, 100, 90, 70, 80, 85, 95]})
df                          

Unnamed: 0,Name,Subject,Grade
0,San Zhang,Chinese,80
1,San Zhang,Chinese,90
2,San Zhang,Math,100
3,San Zhang,Math,90
4,Si Li,Chinese,70
5,Si Li,Chinese,80
6,Si Li,Math,85
7,Si Li,Math,95


In [9]:
# Names as the row index, subjects as the column index, values are the mean of the two grades
df.pivot_table(index = 'Name',
               columns = 'Subject',
               values = 'Grade',
               aggfunc = 'mean')

Subject,Chinese,Math
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
San Zhang,85,95
Si Li,75,90


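`aggfunc` can be either the name of a valid aggregation function, given as a string, or a custom function that takes a sequence and returns a scalar.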
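
As a rough illustration of a custom aggregation, the sketch below passes a user-defined function to `aggfunc`. The function `grade_range` and the variable names are hypothetical; the data repeats the example frame used above.

```python
import pandas as pd

scores = pd.DataFrame({'Name': ['San Zhang'] * 4 + ['Si Li'] * 4,
                       'Subject': ['Chinese', 'Chinese', 'Math', 'Math'] * 2,
                       'Grade': [80, 90, 100, 90, 70, 80, 85, 95]})

def grade_range(s):
    """Take a Series of grades and return one scalar: max minus min."""
    return s.max() - s.min()

spread = scores.pivot_table(index='Name',
                            columns='Subject',
                            values='Grade',
                            aggfunc=grade_range)
print(spread)
```

Each cell of `spread` holds the gap between a student's two grades in a subject, which is 10 for every pair in this data.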

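`pivot_table` also has a `margins` parameter. With `margins=True`, the reshaped table additionally applies `aggfunc` to each whole row, each whole column, and the entire table.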

In [10]:
df.pivot_table(index = 'Name',
               columns = 'Subject',
               values = 'Grade',
               aggfunc = 'mean',
               margins = True)

Subject,Chinese,Math,All
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
San Zhang,85,95.0,90.0
Si Li,75,90.0,82.5
All,80,92.5,86.25


#### Practice 1

No counterexample has come to mind yet.

### melt

`melt` does the opposite of `pivot`: it converts a wide table into a long one.

In [11]:
# Build a wide table of grades
df = pd.DataFrame({'Class':[1,2],
                   'Name':['San Zhang', 'Si Li'],
                   'Chinese':[80, 90],
                   'Math':[80, 75]})
df

Unnamed: 0,Class,Name,Chinese,Math
0,1,San Zhang,80,80
1,2,Si Li,90,75


In [12]:
# Convert the table above into a long table of grades; the former column labels become a new column, the subject
df_melted = df.melt(id_vars = ['Class','Name'],
                    value_vars = ['Chinese','Math'],
                    var_name = 'Subject',
                    value_name = 'Grade')
df_melted

Unnamed: 0,Class,Name,Subject,Grade
0,1,San Zhang,Chinese,80
1,2,Si Li,Chinese,90
2,1,San Zhang,Math,80
3,2,Si Li,Math,75


In [13]:
# Check that pivot and melt are inverses of each other
df_unmelted = df_melted.pivot(index = ['Class', 'Name'],
                              columns='Subject',
                              values='Grade')
df_unmelted

Unnamed: 0_level_0,Subject,Chinese,Math
Class,Name,Unnamed: 2_level_1,Unnamed: 3_level_1
1,San Zhang,80,80
2,Si Li,90,75


In [14]:
# Restore the index
df_unmelted.reset_index()

Subject,Class,Name,Chinese,Math
0,1,San Zhang,80,80
1,2,Si Li,90,75


In [15]:
# Rename the column index
df_unmelted.reset_index().rename_axis(columns={'Subject':''})

Unnamed: 0,Class,Name,Chinese,Math
0,1,San Zhang,80,80
1,2,Si Li,90,75


In [16]:
# Verify equality
df_unmelted = df_unmelted.reset_index().rename_axis(columns={'Subject':''})
df_unmelted.equals(df)

True

### wide_to_long

In [17]:
# Introductory example
df = pd.DataFrame({'Class':[1,2],'Name':['San Zhang', 'Si Li'],
                   'Chinese_Mid':[80, 75], 'Math_Mid':[90, 85],
                   'Chinese_Final':[80, 75], 'Math_Final':[90, 85]})
df

Unnamed: 0,Class,Name,Chinese_Mid,Math_Mid,Chinese_Final,Math_Final
0,1,San Zhang,80,90,80,90
1,2,Si Li,75,85,75,85


Our goal is to compress the `Mid` and `Final` information into a single column while keeping `Chinese` and `Math` as separate columns. This can be done with `wide_to_long`.

In [18]:
pd.wide_to_long(df,
                stubnames = ['Chinese','Math'],
                i = ['Class','Name'],
                j = 'Examination',
                sep = '_',
                suffix = '.+')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Chinese,Math
Class,Name,Examination,Unnamed: 3_level_1,Unnamed: 4_level_1
1,San Zhang,Mid,80,90
1,San Zhang,Final,80,90
2,Si Li,Mid,75,85
2,Si Li,Final,75,85


Here the `suffix` parameter is a **regular expression** that the suffix of each column name must match.
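
A small sketch of how the regular expression filters columns, using made-up data: with `suffix=r'\d+'` only `Score_1` and `Score_2` are treated as wide columns, while `Score_avg` does not match and is carried along unchanged. All column names here are hypothetical.

```python
import pandas as pd

scores = pd.DataFrame({'Name': ['San Zhang', 'Si Li'],
                       'Score_1': [80, 75],
                       'Score_2': [90, 85],
                       'Score_avg': [85, 80]})

# Only columns of the form Score_<digits> are melted into the 'Score' column
long_scores = pd.wide_to_long(scores,
                              stubnames=['Score'],
                              i='Name',
                              j='Exam',
                              sep='_',
                              suffix=r'\d+')
print(long_scores)
```

By contrast, `suffix='.+'` as used above accepts any non-empty suffix, so it would also pick up `Score_avg`.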

In [19]:
pivot_multi

Unnamed: 0_level_0,Unnamed: 1_level_0,Grade,Grade,Grade,Grade,rank,rank,rank,rank
Unnamed: 0_level_1,Subject,Chinese,Chinese,Math,Math,Chinese,Chinese,Math,Math
Unnamed: 0_level_2,Examination,Mid,Final,Mid,Final,Mid,Final,Mid,Final
Class,Name,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3
1,San Zhang,80,75,90,85,10,15,20,7
2,Si Li,85,65,92,88,21,15,6,2


Next we show how to use the `wide_to_long` function to restore the table above to the original form of `df`.

In [20]:
# Make a copy
res = pivot_multi.copy()
# Inspect the column index
res.columns

MultiIndex([('Grade', 'Chinese',   'Mid'),
            ('Grade', 'Chinese', 'Final'),
            ('Grade',    'Math',   'Mid'),
            ('Grade',    'Math', 'Final'),
            ( 'rank', 'Chinese',   'Mid'),
            ( 'rank', 'Chinese', 'Final'),
            ( 'rank',    'Math',   'Mid'),
            ( 'rank',    'Math', 'Final')],
           names=[None, 'Subject', 'Examination'])

In [21]:
# Flatten the multi-level column index into a single level, joined with '_'
res.columns = res.columns.map(lambda x: '_'.join(x))
# Move the row index back into columns
res = res.reset_index()
res

Unnamed: 0,Class,Name,Grade_Chinese_Mid,Grade_Chinese_Final,Grade_Math_Mid,Grade_Math_Final,rank_Chinese_Mid,rank_Chinese_Final,rank_Math_Mid,rank_Math_Final
0,1,San Zhang,80,75,90,85,10,15,20,7
1,2,Si Li,85,65,92,88,21,15,6,2


In [22]:
# Now wide_to_long can be applied; since the suffix carries two pieces of information, it takes two steps
# Step 1
res = pd.wide_to_long(res, 
                      stubnames = ['Grade','rank'],
                      i = ['Class','Name'],
                      j = 'Subject_Examination',
                      sep = '_',
                      suffix = '.+')
res = res.reset_index()

In [23]:
res

Unnamed: 0,Class,Name,Subject_Examination,Grade,rank
0,1,San Zhang,Chinese_Mid,80,10
1,1,San Zhang,Chinese_Final,75,15
2,1,San Zhang,Math_Mid,90,20
3,1,San Zhang,Math_Final,85,7
4,2,Si Li,Chinese_Mid,85,21
5,2,Si Li,Chinese_Final,65,15
6,2,Si Li,Math_Mid,92,6
7,2,Si Li,Math_Final,88,2


In [24]:
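# Step 2 (slightly more involved than step 1, because it requires splitting string values)
# Split the column in two directly
res[['Subject','Examination']] = res['Subject_Examination'].str.split('_',expand=True)
# As shown, the split added two new columns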
res

Unnamed: 0,Class,Name,Subject_Examination,Grade,rank,Subject,Examination
0,1,San Zhang,Chinese_Mid,80,10,Chinese,Mid
1,1,San Zhang,Chinese_Final,75,15,Chinese,Final
2,1,San Zhang,Math_Mid,90,20,Math,Mid
3,1,San Zhang,Math_Final,85,7,Math,Final
4,2,Si Li,Chinese_Mid,85,21,Chinese,Mid
5,2,Si Li,Chinese_Final,65,15,Chinese,Final
6,2,Si Li,Math_Mid,92,6,Math,Mid
7,2,Si Li,Math_Final,88,2,Math,Final


In [25]:
# Keep the required columns and sort by subject
res = res[['Class', 'Name', 'Examination','Subject', 'Grade', 'rank']].sort_values('Subject')

In [26]:
res = res.reset_index(drop=True)
res

Unnamed: 0,Class,Name,Examination,Subject,Grade,rank
0,1,San Zhang,Mid,Chinese,80,10
1,1,San Zhang,Final,Chinese,75,15
2,2,Si Li,Mid,Chinese,85,21
3,2,Si Li,Final,Chinese,65,15
4,1,San Zhang,Mid,Math,90,20
5,1,San Zhang,Final,Math,85,7
6,2,Si Li,Mid,Math,92,6
7,2,Si Li,Final,Math,88,2


## Reshaping indexes

### stack and unstack

To move levels between the row index and the column index, use the `stack` and `unstack` functions.

The `unstack` function moves row index levels into the column index.

In [27]:
# Build an example
df = pd.DataFrame(np.ones((4,2)),
                  index = pd.Index([('A', 'cat', 'big'),
                                    ('A', 'dog', 'small'),
                                    ('B', 'cat', 'big'),
                                    ('B', 'dog', 'small')]),
                  columns=['col_1', 'col_2'])
df                

Unnamed: 0,Unnamed: 1,Unnamed: 2,col_1,col_2
A,cat,big,1.0,1.0
A,dog,small,1.0,1.0
B,cat,big,1.0,1.0
B,dog,small,1.0,1.0


In [28]:
# By default, unstack moves the innermost row index level to the columns
df.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,col_1,col_1,col_2,col_2
Unnamed: 0_level_1,Unnamed: 1_level_1,big,small,big,small
A,cat,1.0,,1.0,
A,dog,,1.0,,1.0
B,cat,1.0,,1.0,
B,dog,,1.0,,1.0


In [29]:
# With an argument, unstack can move any level (or several levels) of the row index
# The second level
df.unstack(1)

Unnamed: 0_level_0,Unnamed: 1_level_0,col_1,col_1,col_2,col_2
Unnamed: 0_level_1,Unnamed: 1_level_1,cat,dog,cat,dog
A,big,1.0,,1.0,
A,small,,1.0,,1.0
B,big,1.0,,1.0,
B,small,,1.0,,1.0


In [30]:
# The first two levels
df.unstack([0,1])

Unnamed: 0_level_0,col_1,col_1,col_1,col_1,col_2,col_2,col_2,col_2
Unnamed: 0_level_1,A,A,B,B,A,A,B,B
Unnamed: 0_level_2,cat,dog,cat,dog,cat,dog,cat,dog
big,1.0,,1.0,,1.0,,1.0,
small,,1.0,,1.0,,1.0,,1.0


`stack` does the reverse: it moves column index levels into the row index.

In [31]:
# Build an example
df = pd.DataFrame(np.ones((4,2)),
                  index = pd.Index([('A', 'cat', 'big'),
                                    ('A', 'dog', 'small'),
                                    ('B', 'cat', 'big'),
                                    ('B', 'dog', 'small')]),
                  columns=['index_1', 'index_2']).T
df                

Unnamed: 0_level_0,A,A,B,B
Unnamed: 0_level_1,cat,dog,cat,dog
Unnamed: 0_level_2,big,small,big,small
index_1,1.0,1.0,1.0,1.0
index_2,1.0,1.0,1.0,1.0


In [32]:
df.stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,A,A,B,B
Unnamed: 0_level_1,Unnamed: 1_level_1,cat,dog,cat,dog
index_1,big,1.0,,1.0,
index_1,small,,1.0,,1.0
index_2,big,1.0,,1.0,
index_2,small,,1.0,,1.0


In [33]:
df.stack([1,2])

Unnamed: 0,Unnamed: 1,Unnamed: 2,A,B
index_1,cat,big,1.0,1.0
index_1,dog,small,1.0,1.0
index_2,cat,big,1.0,1.0
index_2,dog,small,1.0,1.0
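
Because `unstack` and `stack` move index levels in opposite directions, applying one after the other should return the original layout when every index combination is present. The sketch below checks this on a small made-up frame; the names `frame`, `wide` and `restored` are illustrative only.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['A', 'B'], ['x', 'y']])
frame = pd.DataFrame(np.arange(8).reshape(4, 2),
                     index=idx, columns=['c1', 'c2'])

# Move the innermost row level ('x'/'y') into the columns ...
wide = frame.unstack()
# ... and move the innermost column level back into the rows
restored = wide.stack()

print(wide.shape)       # 2 rows, 4 columns
print(restored.shape)   # 4 rows, 2 columns again
```

If some index combinations are missing, as in the `cat`/`dog` examples above, the intermediate table contains `NaN` and the round trip may not reproduce the original exactly.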


## Other reshaping functions

### crosstab

By default, `crosstab` counts the number of rows in each group.

In [34]:
df = pd.read_csv('data/learn_pandas.csv')

In [35]:
# Count by school and transfer status
# Note that crosstab expects Series, not column names
pd.crosstab(index = df.School, columns = df.Transfer)

Transfer,N,Y
School,Unnamed: 1_level_1,Unnamed: 2_level_1
Fudan University,38,1
Peking University,28,2
Shanghai Jiao Tong University,53,0
Tsinghua University,62,4


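Everything `crosstab` does can also be done with `pivot_table`, and the latter is usually somewhat faster, as the timings in the practice below suggest.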

In [36]:
df.pivot_table(index='School', columns='Transfer',values = 'Name', aggfunc = 'count')

Transfer,N,Y
School,Unnamed: 1_level_1,Unnamed: 2_level_1
Fudan University,38.0,1.0
Peking University,28.0,2.0
Shanghai Jiao Tong University,53.0,
Tsinghua University,62.0,4.0
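
Note that the two results above differ for empty groups: `crosstab` reports 0, while `pivot_table` with `'count'` leaves `NaN`. The sketch below, on a small made-up frame standing in for `learn_pandas.csv`, shows one way to line them up by passing `fill_value=0`.

```python
import pandas as pd

# Made-up stand-in data; school 'B' has no transfer students
toy = pd.DataFrame({'School': ['A', 'A', 'B', 'B', 'B'],
                    'Transfer': ['N', 'Y', 'N', 'N', 'N'],
                    'Name': ['a', 'b', 'c', 'd', 'e']})

ct = pd.crosstab(index=toy.School, columns=toy.Transfer)

# fill_value=0 replaces the missing (B, Y) cell with 0
pt = toy.pivot_table(index='School', columns='Transfer',
                     values='Name', aggfunc='count', fill_value=0)

print(ct)
print(pt)
```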


#### Practice 2

In [37]:
%timeit pd.crosstab(index = df.School, columns = df.Transfer)

7.79 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [38]:
%timeit df.pivot_table(index='School', columns='Transfer',values = 'Name', aggfunc = 'count')

5.71 ms ± 27 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [39]:
%timeit pd.crosstab(index = df.School, columns = df.Transfer, values = df.Height, aggfunc = 'mean')

6.37 ms ± 87.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [40]:
%timeit df.pivot_table(index='School', columns='Transfer',values = 'Height', aggfunc = 'mean')

5.89 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### explode

If the cells of a column hold data of one of the types <font color=red>list, tuple, Series, np.ndarray</font>, the `explode` function can be used to expand their elements **vertically**.

In [41]:
df_ex = pd.DataFrame({'A': [[1, 2],'my_str',{1, 2},pd.Series([3, 4])],
                      'B': 1})

In [42]:
df_ex

Unnamed: 0,A,B
0,"[1, 2]",1
1,my_str,1
2,"{1, 2}",1
3,0 3 1 4 dtype: int64,1


In [43]:
df_ex.explode('A')
# As shown, the cells holding a list or a Series were expanded vertically

Unnamed: 0,A,B
0,1,1
0,2,1
1,my_str,1
2,"{1, 2}",1
3,3,1
3,4,1


### get_dummies

`get_dummies` generates dummy (indicator) variables.

In [44]:
df.Grade.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: object

In [45]:
# Generate dummy variables for the grade level
pd.get_dummies(df.Grade).head()

Unnamed: 0,Freshman,Junior,Senior,Sophomore
0,1,0,0,0
1,1,0,0,0
2,0,0,1,0
3,0,0,0,1
4,0,0,0,1
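
Depending on the pandas version, `get_dummies` may return booleans rather than the 0/1 integers shown above. A minimal sketch, on a made-up Series, that requests integer output explicitly through the `dtype` parameter:

```python
import pandas as pd

# Made-up stand-in for df.Grade
grade = pd.Series(['Freshman', 'Freshman', 'Senior', 'Sophomore'])

# dtype=int asks for 0/1 integers regardless of the version's default
dummies = pd.get_dummies(grade, dtype=int)
print(dummies)
```

Every row contains exactly one 1, in the column of that row's grade level.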


## Exercises

### Ex1: US illegal drugs dataset

In [46]:
df = pd.read_csv('data/drugs.csv').sort_values(['State','COUNTY','SubstanceName'],ignore_index=True)
df.head(3)

Unnamed: 0,YYYY,State,COUNTY,SubstanceName,DrugReports
0,2011,KY,ADAIR,Buprenorphine,3
1,2012,KY,ADAIR,Buprenorphine,5
2,2013,KY,ADAIR,Buprenorphine,4


Question **1**

In [47]:
dfp = df.pivot(index = ['State','COUNTY','SubstanceName'], columns = 'YYYY', values = 'DrugReports')
dfp.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,YYYY,2010,2011,2012,2013,2014,2015,2016,2017
State,COUNTY,SubstanceName,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
KY,ADAIR,Buprenorphine,,3.0,5.0,4.0,27.0,5.0,7.0,10.0
KY,ADAIR,Codeine,,,1.0,,,,,1.0
KY,ADAIR,Fentanyl,,,1.0,,,,,
KY,ADAIR,Heroin,,,1.0,2.0,,1.0,,2.0
KY,ADAIR,Hydrocodone,6.0,9.0,10.0,10.0,9.0,7.0,11.0,3.0


In [48]:
dfp = dfp.reset_index().rename_axis(columns = {'YYYY':''})

In [49]:
dfp.head()

Unnamed: 0,State,COUNTY,SubstanceName,2010,2011,2012,2013,2014,2015,2016,2017
0,KY,ADAIR,Buprenorphine,,3.0,5.0,4.0,27.0,5.0,7.0,10.0
1,KY,ADAIR,Codeine,,,1.0,,,,,1.0
2,KY,ADAIR,Fentanyl,,,1.0,,,,,
3,KY,ADAIR,Heroin,,,1.0,2.0,,1.0,,2.0
4,KY,ADAIR,Hydrocodone,6.0,9.0,10.0,10.0,9.0,7.0,11.0,3.0


Question **2**

In [50]:
# Target
df.head(3)

Unnamed: 0,YYYY,State,COUNTY,SubstanceName,DrugReports
0,2011,KY,ADAIR,Buprenorphine,3
1,2012,KY,ADAIR,Buprenorphine,5
2,2013,KY,ADAIR,Buprenorphine,4


In [51]:
dfn = dfp.melt(id_vars = ['State','COUNTY','SubstanceName'],
        value_vars = range(2010,2018),
        var_name = 'YYYY',
        value_name = 'DrugReports').sort_values(['State','COUNTY','SubstanceName'],ignore_index=True)
dfn.head(3)

Unnamed: 0,State,COUNTY,SubstanceName,YYYY,DrugReports
0,KY,ADAIR,Buprenorphine,2010,
1,KY,ADAIR,Buprenorphine,2011,3.0
2,KY,ADAIR,Buprenorphine,2012,5.0


In [52]:
dfn = dfn.loc[dfn.DrugReports.notna()].reset_index(drop=True)
dfn.head(3)

Unnamed: 0,State,COUNTY,SubstanceName,YYYY,DrugReports
0,KY,ADAIR,Buprenorphine,2011,3.0
1,KY,ADAIR,Buprenorphine,2012,5.0
2,KY,ADAIR,Buprenorphine,2013,4.0


In [53]:
dfn = dfn.set_index('YYYY').reset_index()
dfn.head(3)

Unnamed: 0,YYYY,State,COUNTY,SubstanceName,DrugReports
0,2011,KY,ADAIR,Buprenorphine,3.0
1,2012,KY,ADAIR,Buprenorphine,5.0
2,2013,KY,ADAIR,Buprenorphine,4.0


The table is mostly restored, but how can `DrugReports` be converted to **integers**?

<font color=red><b>Solution</b></font>: use the `astype` function with a **dictionary** whose keys are column names and whose values are the target data types.

In [54]:
dfn = dfn.astype({'YYYY':'int64', 'DrugReports':'int64'})

In [55]:
dfn.head()

Unnamed: 0,YYYY,State,COUNTY,SubstanceName,DrugReports
0,2011,KY,ADAIR,Buprenorphine,3
1,2012,KY,ADAIR,Buprenorphine,5
2,2013,KY,ADAIR,Buprenorphine,4
3,2014,KY,ADAIR,Buprenorphine,27
4,2015,KY,ADAIR,Buprenorphine,5


In [56]:
dfn.equals(df)

True

Question **3**

In [57]:
# Using pivot_table
df.pivot_table(index = 'YYYY',
               columns = 'State',
               values = 'DrugReports',
               aggfunc = 'sum')

State,KY,OH,PA,VA,WV
YYYY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2010,10453,19707,19814,8685,2890
2011,10289,20330,19987,6749,3271
2012,10722,23145,19959,7831,3376
2013,11148,26846,20409,11675,4046
2014,11081,30860,24904,9037,3280
2015,9865,37127,25651,8810,2571
2016,9093,42470,26164,10195,2548
2017,9394,46104,27894,10448,1614


In [58]:
# Using groupby + unstack
df.groupby(['YYYY','State'])['DrugReports'].agg('sum').unstack()

State,KY,OH,PA,VA,WV
YYYY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2010,10453,19707,19814,8685,2890
2011,10289,20330,19987,6749,3271
2012,10722,23145,19959,7831,3376
2013,11148,26846,20409,11675,4046
2014,11081,30860,24904,9037,3280
2015,9865,37127,25651,8810,2571
2016,9093,42470,26164,10195,2548
2017,9394,46104,27894,10448,1614
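
The agreement between the two approaches can also be checked programmatically. The sketch below does so on a small made-up frame that stands in for `drugs.csv`, since the real file is not assumed to be available.

```python
import pandas as pd

# Made-up stand-in data with the same column names
toy = pd.DataFrame({'YYYY': [2010, 2010, 2011, 2011, 2011],
                    'State': ['KY', 'OH', 'KY', 'OH', 'OH'],
                    'DrugReports': [1, 2, 3, 4, 5]})

by_pivot = toy.pivot_table(index='YYYY', columns='State',
                           values='DrugReports', aggfunc='sum')
by_group = toy.groupby(['YYYY', 'State'])['DrugReports'].sum().unstack()

# Compare cell by cell
same = by_pivot.to_numpy().tolist() == by_group.to_numpy().tolist()
print(same)
```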


### Ex2: A special use of wide_to_long

In [59]:
df = pd.DataFrame({'Class':[1,2],'Name':['San Zhang', 'Si Li'],'Chinese':[80, 90],'Math':[80, 75]})
df

Unnamed: 0,Class,Name,Chinese,Math
0,1,San Zhang,80,80
1,2,Si Li,90,75


In [60]:
df_melted

Unnamed: 0,Class,Name,Subject,Grade
0,1,San Zhang,Chinese,80
1,2,Si Li,Chinese,90
2,1,San Zhang,Math,80
3,2,Si Li,Math,75


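To rename columns, use the `rename` function with the `columns` parameter set to a dictionary whose keys are the old names and whose values are the new names.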

In [61]:
df = df.rename(columns = {'Chinese':'Grade_Chinese','Math':'Grade_Math'})
df

Unnamed: 0,Class,Name,Grade_Chinese,Grade_Math
0,1,San Zhang,80,80
1,2,Si Li,90,75


In [62]:
pd.wide_to_long(df,
                stubnames=['Grade'],
                i = ['Class','Name'],
                j = 'Subject',
                sep = '_',
                suffix = '.+').reset_index()

Unnamed: 0,Class,Name,Subject,Grade
0,1,San Zhang,Chinese,80
1,1,San Zhang,Math,80
2,2,Si Li,Chinese,90
3,2,Si Li,Math,75
