<b>Handling Missing Values in Pandas</b>

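In Pandas, missing values are represented by the floating-point value NaN. None, which is commonly used to mark missing data in other settings (for example in NumPy object arrays), is also treated as a missing value.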

In [1]:
import pandas as pd
import numpy as np

<b>1. Dropping Missing Values</b>

In [2]:
foo = pd.Series([np.nan, -3, None, 'foobar'])
foo

0       NaN
1        -3
2      None
3    foobar
dtype: object

In [3]:
foo.isnull()

0     True
1    False
2     True
3    False
dtype: bool
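As a small supplementary sketch (the variable names here are illustrative and not part of the original notebook), the following shows how None behaves in a purely numeric Series and how the isna/notna pair relates to isnull:

```python
import pandas as pd

# In a purely numeric Series, None is converted to NaN and the dtype becomes float64.
numeric = pd.Series([1, None, 3])
print(numeric.dtype)

# isna() performs the same check as isnull(); notna() is its inverse.
missing_mask = numeric.isna()
present_mask = numeric.notna()

# Summing a boolean mask counts the missing entries.
n_missing = int(missing_mask.sum())
print(n_missing)
```

In the object-dtype Series above, None is kept as None, but isnull() still reports it as missing.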

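The dropna method removes missing values. The axis parameter selects what is dropped: axis=0 (the default) drops rows that contain missing values, and axis=1 drops columns. The how parameter accepts 'any' or 'all': 'any' drops a row (or column) as soon as it contains one NA value, while 'all' drops it only when every element is NA. Another useful parameter is thresh, an integer giving the minimum number of non-NA values required: with thresh=3, for example, a row is kept only if it has at least 3 non-NA values. The inplace parameter controls whether the original data is modified; it defaults to False.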

In [4]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, None, None],
                     'patient':[1, 1, 1, 1, 2, 2, 2, None],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', None]})
data

Unnamed: 0,patient,phylum,value
0,1.0,Firmicutes,632.0
1,1.0,Proteobacteria,1638.0
2,1.0,Actinobacteria,569.0
3,1.0,Bacteroidetes,115.0
4,2.0,Firmicutes,433.0
5,2.0,Proteobacteria,1130.0
6,2.0,Actinobacteria,
7,,,


In [5]:
data.dropna(axis=0,how='any',inplace=False)

Unnamed: 0,patient,phylum,value
0,1.0,Firmicutes,632.0
1,1.0,Proteobacteria,1638.0
2,1.0,Actinobacteria,569.0
3,1.0,Bacteroidetes,115.0
4,2.0,Firmicutes,433.0
5,2.0,Proteobacteria,1130.0


In [6]:
data.dropna(axis=0,how='all',inplace=False)

Unnamed: 0,patient,phylum,value
0,1.0,Firmicutes,632.0
1,1.0,Proteobacteria,1638.0
2,1.0,Actinobacteria,569.0
3,1.0,Bacteroidetes,115.0
4,2.0,Firmicutes,433.0
5,2.0,Proteobacteria,1130.0
6,2.0,Actinobacteria,


In [7]:
data.dropna(axis=1,how='any',inplace=False)

0
1
2
3
4
5
6
7


In [8]:
data.dropna(axis=1,how='all',inplace=False)

Unnamed: 0,patient,phylum,value
0,1.0,Firmicutes,632.0
1,1.0,Proteobacteria,1638.0
2,1.0,Actinobacteria,569.0
3,1.0,Bacteroidetes,115.0
4,2.0,Firmicutes,433.0
5,2.0,Proteobacteria,1130.0
6,2.0,Actinobacteria,
7,,,
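The thresh parameter described earlier is not exercised by the cells above. A minimal sketch, rebuilding the same DataFrame so the block is self-contained (the names kept_2 and kept_3 are illustrative):

```python
import pandas as pd

data = pd.DataFrame({
    'value': [632, 1638, 569, 115, 433, 1130, None, None],
    'patient': [1, 1, 1, 1, 2, 2, 2, None],
    'phylum': ['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes',
               'Firmicutes', 'Proteobacteria', 'Actinobacteria', None],
})

# Keep rows with at least 2 non-missing values.
# Row 6 has 2 (patient, phylum) and is kept; row 7 has none and is dropped.
kept_2 = data.dropna(axis=0, thresh=2)
print(kept_2.index.tolist())

# Requiring 3 non-missing values also drops row 6.
kept_3 = data.dropna(axis=0, thresh=3)
print(kept_3.index.tolist())
```

Only thresh is passed here; recent pandas versions do not accept how and thresh together.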


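<b>2. Filling Missing Values</b>

The fillna method fills in missing values. It accepts either a single value applied everywhere or a dict that maps column names to fill values.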

In [9]:
data.fillna(value=-1)

Unnamed: 0,patient,phylum,value
0,1.0,Firmicutes,632.0
1,1.0,Proteobacteria,1638.0
2,1.0,Actinobacteria,569.0
3,1.0,Bacteroidetes,115.0
4,2.0,Firmicutes,433.0
5,2.0,Proteobacteria,1130.0
6,2.0,Actinobacteria,-1.0
7,-1.0,-1,-1.0


In [10]:
data.fillna(value={'patient':0,'value':130})

Unnamed: 0,patient,phylum,value
0,1.0,Firmicutes,632.0
1,1.0,Proteobacteria,1638.0
2,1.0,Actinobacteria,569.0
3,1.0,Bacteroidetes,115.0
4,2.0,Firmicutes,433.0
5,2.0,Proteobacteria,1130.0
6,2.0,Actinobacteria,130.0
7,0.0,,130.0


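With method='ffill', each NaN is filled with the preceding valid value. Note that with axis=1 the fill runs across the columns within each row, so in the output below the missing value in row 6 is taken from the phylum column to its left, while row 7 stays empty because it contains no valid value to propagate.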

In [11]:
data.fillna(method='ffill',axis=1)

Unnamed: 0,patient,phylum,value
0,1.0,Firmicutes,632
1,1.0,Proteobacteria,1638
2,1.0,Actinobacteria,569
3,1.0,Bacteroidetes,115
4,2.0,Firmicutes,433
5,2.0,Proteobacteria,1130
6,2.0,Actinobacteria,Actinobacteria
7,,,
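If the intent is instead to fill each gap from the previous row of the same column, the fill should run along axis=0. The sketch below uses DataFrame.ffill(), which fills along axis=0 by default; recent pandas releases deprecate the method argument of fillna in favor of it. The DataFrame is rebuilt so the block is self-contained, and the name filled is illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'value': [632, 1638, 569, 115, 433, 1130, None, None],
    'patient': [1, 1, 1, 1, 2, 2, 2, None],
    'phylum': ['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes',
               'Firmicutes', 'Proteobacteria', 'Actinobacteria', None],
})

# Forward fill down each column: every gap takes the last valid value above it.
filled = data.ffill()

# Rows 6 and 7 inherit value=1130.0 from row 5;
# row 7 also inherits patient and phylum from row 6.
print(filled.loc[[6, 7]])
```

Forward filling down a column assumes the row order is meaningful (for example, time-ordered records); otherwise the borrowed value is arbitrary.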


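The obj.combine_first(other) method uses the data in other to fill the NA values in obj, much like applying a patch. The two objects are aligned automatically on their labels.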

In [12]:
data2 = data.copy()
data2

Unnamed: 0,patient,phylum,value
0,1.0,Firmicutes,632.0
1,1.0,Proteobacteria,1638.0
2,1.0,Actinobacteria,569.0
3,1.0,Bacteroidetes,115.0
4,2.0,Firmicutes,433.0
5,2.0,Proteobacteria,1130.0
6,2.0,Actinobacteria,
7,,,


In [16]:
# .ix was removed in pandas 1.0; use label-based .loc instead
data2.loc[[6, 7], 'value'] = [1233, 1344]
data2

Unnamed: 0,patient,phylum,value
0,1.0,Firmicutes,632.0
1,1.0,Proteobacteria,1638.0
2,1.0,Actinobacteria,569.0
3,1.0,Bacteroidetes,115.0
4,2.0,Firmicutes,433.0
5,2.0,Proteobacteria,1130.0
6,2.0,Actinobacteria,1233.0
7,,,1344.0


In [17]:
data.combine_first(data2)

Unnamed: 0,patient,phylum,value
0,1.0,Firmicutes,632.0
1,1.0,Proteobacteria,1638.0
2,1.0,Actinobacteria,569.0
3,1.0,Bacteroidetes,115.0
4,2.0,Firmicutes,433.0
5,2.0,Proteobacteria,1130.0
6,2.0,Actinobacteria,1233.0
7,,,1344.0
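In the example above both frames share the same index, so the automatic alignment is not visible. A minimal sketch with two small Series whose indexes only partly overlap (the data and names are made up for illustration):

```python
import numpy as np
import pandas as pd

obj = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
other = pd.Series([20.0, 40.0], index=['b', 'd'])

# Values from obj win wherever they are present; gaps are patched from other.
# The result covers the union of both indexes, so label 'd' appears as well.
patched = obj.combine_first(other)
print(patched)
```

Values are matched by label rather than by position, which is why the NaN at 'b' is patched with 20.0 even though 'b' is the first entry in other.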
