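In medical data processing, records are often stored in long format, with one row per "patient + indicator" pair. Machine learning usually expects one sample (row) per patient, so the data has to be reshaped from rows to columns (long to wide).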

In [1]:
import pandas as pd

In [3]:
# The sex values are kept as in the source data: '男' = male, '女' = female
data = pd.DataFrame([
        ['Lee', '男', 'ALT', 59],
        ['Lee', '男', 'AST', 20],
        ['Mia', '女','ALT', 60],
        ['Mia', '女','HBV', 12],
        ['Mia', '女','GRU', 23],
        ['Lu', '女', 'CREA', 23],
        ['Lu', '女', 'AST', 12],
        ['Doe', '男', 'CREA', 56],
    ], columns=['name', 'sex', 'indicator', 'value'])
data

,name,sex,indicator,value
0,Lee,男,ALT,59
1,Lee,男,AST,20
2,Mia,女,ALT,60
3,Mia,女,HBV,12
4,Mia,女,GRU,23
5,Lu,女,CREA,23
6,Lu,女,AST,12
7,Doe,男,CREA,56


The goal is one row per patient, with name, sex, and each indicator (ALT, AST, and so on) as feature columns.

In [6]:
"""
The operation below has three steps:
1. Group by name and sex.
2. Select [['indicator', 'value']] as a sub-DataFrame and set 'indicator' as its index, so the resulting object has a three-level index: name, sex, indicator (outermost to innermost).
3. Use unstack to move the indicator level into the columns.
"""
data2 = data.groupby(['name', 'sex'])[['indicator', 'value']].apply(lambda df: df.set_index('indicator')).unstack(-1)
data2

,,value,value,value,value,value
,indicator,ALT,AST,CREA,GRU,HBV
name,sex,,,,,
Doe,男,,,56.0,,
Lee,男,59.0,20.0,,,
Lu,女,,12.0,23.0,,
Mia,女,60.0,,,23.0,12.0
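
Steps 2 and 3 above can also be seen without the groupby. The sketch below is an alternative formulation, not the code used in this notebook: it builds the three-level index (name, sex, indicator) directly with set_index and then unstacks the innermost level. It assumes each (name, sex, indicator) combination occurs at most once, because unstack raises an error on duplicate entries. The names long_df, stacked, and wide are illustrative.

```python
import pandas as pd

# Same long-format records as in the notebook
long_df = pd.DataFrame([
    ['Lee', '男', 'ALT', 59],
    ['Lee', '男', 'AST', 20],
    ['Mia', '女', 'ALT', 60],
    ['Mia', '女', 'HBV', 12],
    ['Mia', '女', 'GRU', 23],
    ['Lu', '女', 'CREA', 23],
    ['Lu', '女', 'AST', 12],
    ['Doe', '男', 'CREA', 56],
], columns=['name', 'sex', 'indicator', 'value'])

# Step 2 equivalent: a Series whose index has three levels
stacked = long_df.set_index(['name', 'sex', 'indicator'])['value']
print(stacked.index.names)

# Step 3: move the innermost index level (indicator) into the columns
wide = stacked.unstack('indicator')
print(wide)
```

Because a Series is unstacked here rather than a one-column DataFrame, the columns of wide have a single level (the indicator names) instead of the two-level ('value', indicator) columns of data2.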


In [7]:
# The index has two levels
data2.index

MultiIndex(levels=[['Doe', 'Lee', 'Lu', 'Mia'], ['女', '男']],
           labels=[[0, 1, 2, 3], [1, 1, 0, 0]],
           names=['name', 'sex'])

In [9]:
# The columns have two levels
data2.columns

MultiIndex(levels=[['value'], ['ALT', 'AST', 'CREA', 'GRU', 'HBV']],
           labels=[[0, 0, 0, 0, 0], [0, 1, 2, 3, 4]],
           names=[None, 'indicator'])

In [12]:
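# Reset the index, keeping only the outermost level (name)
"""
Multi-level indexes and multi-level columns are both MultiIndex objects; the outermost level is numbered 0 and the numbers increase inward.
reset_index moves index levels back into the columns. level selects the index levels to remove (all of them by default);
col_level selects the column level into which the removed index labels are inserted (the outermost by default).
"""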
data3 = data2.reset_index(level=-1, col_level=-1)
data3

,,value,value,value,value,value
indicator,sex,ALT,AST,CREA,GRU,HBV
name,,,,,,
Doe,男,,,56.0,,
Lee,男,59.0,20.0,,,
Lu,女,,12.0,23.0,,
Mia,女,60.0,,,23.0,12.0
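
To make the effect of col_level concrete, the following sketch compares inserting the removed index label into the outermost column level (col_level=0) with inserting it into the innermost one (col_level=1). The small frame demo is a hypothetical stand-in for data2, built by hand so the example runs on its own. The other column level is filled with the default col_fill value, an empty string.

```python
import pandas as pd

nan = float('nan')

# Hand-built stand-in for data2: two index levels, two column levels
idx = pd.MultiIndex.from_tuples([('Doe', '男'), ('Lee', '男')],
                                names=['name', 'sex'])
cols = pd.MultiIndex.from_product([['value'], ['ALT', 'CREA']],
                                  names=[None, 'indicator'])
demo = pd.DataFrame([[nan, 56.0], [59.0, nan]], index=idx, columns=cols)

# 'sex' label goes into the outermost column level
outer = demo.reset_index(level='sex', col_level=0)
# 'sex' label goes into the innermost column level, next to ALT, CREA
inner = demo.reset_index(level='sex', col_level=1)

print(outer.columns.tolist())
print(inner.columns.tolist())
```

Placing the label in the innermost level is what allows the outer 'value' level to be dropped afterwards while 'sex' ends up alongside the indicator names.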


In [16]:
# The index now has the desired form
data3.index

Index(['Doe', 'Lee', 'Lu', 'Mia'], dtype='object', name='name')

In [17]:
# The columns do not yet have the desired form
data3.columns

MultiIndex(levels=[['value', ''], ['ALT', 'AST', 'CREA', 'GRU', 'HBV', 'sex']],
           labels=[[1, 0, 0, 0, 0, 0], [5, 0, 1, 2, 3, 4]],
           names=[None, 'indicator'])

In [19]:
# Drop the outermost column level
data4 = data3.copy()
data4.columns = data3.columns.droplevel(0)
data4

indicator,sex,ALT,AST,CREA,GRU,HBV
name,,,,,,
Doe,男,,,56.0,,
Lee,男,59.0,20.0,,,
Lu,女,,12.0,23.0,,
Mia,女,60.0,,,23.0,12.0
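
For comparison, a table with the same layout as data4 can be sketched in fewer steps with pivot_table. This is an alternative to the groupby/unstack route above, not a replacement the notebook itself uses. The choice aggfunc='first' is an assumption that each patient has at most one value per indicator; with repeated measurements a different aggregation would be needed. The names long_df and wide are illustrative.

```python
import pandas as pd

long_df = pd.DataFrame([
    ['Lee', '男', 'ALT', 59],
    ['Lee', '男', 'AST', 20],
    ['Mia', '女', 'ALT', 60],
    ['Mia', '女', 'HBV', 12],
    ['Mia', '女', 'GRU', 23],
    ['Lu', '女', 'CREA', 23],
    ['Lu', '女', 'AST', 12],
    ['Doe', '男', 'CREA', 56],
], columns=['name', 'sex', 'indicator', 'value'])

# One row per (name, sex), one column per indicator
wide = long_df.pivot_table(index=['name', 'sex'], columns='indicator',
                           values='value', aggfunc='first')

# Move sex back into the columns so that only name remains in the index
wide = wide.reset_index(level='sex')
print(wide)
```

Since the columns never become a MultiIndex here, no droplevel step is required.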


Some additional common index operations

In [20]:
# Get the values of one index level
data3.index.get_level_values('name')

Index(['Doe', 'Lee', 'Lu', 'Mia'], dtype='object', name='name')

In [23]:
# Locate a value by its MultiIndex labels
data2.loc[('Doe', '男'), ('value', 'CREA')]

56.0
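
Beyond looking up a single cell, a MultiIndex also supports selecting whole rows or cross-sections. The sketch below shows loc with a full row key and xs on either axis. The frame demo is a hypothetical, hand-built stand-in for data2 so the example is self-contained.

```python
import pandas as pd

nan = float('nan')

idx = pd.MultiIndex.from_tuples([('Doe', '男'), ('Lee', '男'), ('Lu', '女')],
                                names=['name', 'sex'])
cols = pd.MultiIndex.from_product([['value'], ['ALT', 'CREA']],
                                  names=[None, 'indicator'])
demo = pd.DataFrame([[nan, 56.0], [59.0, nan], [nan, 23.0]],
                    index=idx, columns=cols)

# A full row: the result is a Series indexed by the column MultiIndex
row = demo.loc[('Lu', '女'), :]

# All rows with sex == '女'; the sex level is dropped from the result
females = demo.xs('女', level='sex')

# One indicator across all patients; the indicator level is dropped
crea = demo.xs('CREA', axis=1, level='indicator')

print(row)
print(females)
print(crea)
```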

In [27]:
# Drop one level of the index
data2.index.droplevel(-1)

Index(['Doe', 'Lee', 'Lu', 'Mia'], dtype='object', name='name')

In [30]:
# Join the names of multiple column levels into flat names
columns_name = ["_".join(i) for i in data2.columns.ravel()]
columns_name

['value_ALT', 'value_AST', 'value_CREA', 'value_GRU', 'value_HBV']
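
The same joining idea can be applied to columns like those of data3, where one level is an empty string. A plain "_".join would then produce a name such as '_sex'. The sketch below, which assumes the empty parts should simply be skipped, filters them out before joining. The MultiIndex is built by hand to mirror the first few columns of data3, and to_flat_index is used to obtain the column tuples.

```python
import pandas as pd

# Mirrors the first few columns of data3: ('', 'sex') plus ('value', <indicator>)
cols = pd.MultiIndex.from_tuples(
    [('', 'sex'), ('value', 'ALT'), ('value', 'AST')],
    names=[None, 'indicator'],
)

# Join only the non-empty parts of each column tuple
flat = ['_'.join(part for part in col if part) for col in cols.to_flat_index()]
print(flat)  # ['sex', 'value_ALT', 'value_AST']
```

Assigning such a list back to the columns attribute of the DataFrame yields ordinary single-level column names.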