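### Handling Missing Data
One of pandas' design goals is to make working with missing data as painless as possible; for example, all descriptive statistics on pandas objects exclude missing data by default. pandas uses the floating-point value NaN (not a number) to represent missing data in both floating-point and non-floating-point arrays. It is simply a sentinel that is easy to detect.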

In [4]:
import pandas as pd
import numpy as np
string_data = pd.Series(['zhao','hu',np.nan,'ma'])

In [6]:
string_data

0    zhao
1      hu
2     NaN
3      ma
dtype: object
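
As a small illustration of the claim that descriptive statistics exclude missing data, the sketch below uses a hypothetical Series named `values` (not part of the original notebook). It relies on the `skipna` argument of the reduction methods, which defaults to `True`.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not taken from the notebook above.
values = pd.Series([1.0, np.nan, 3.0])

# Reductions skip NaN by default (skipna=True).
mean_skipping_na = values.mean()   # averages 1.0 and 3.0 only
count_non_na = values.count()      # counts non-NA entries only

# Passing skipna=False lets NaN propagate into the result.
mean_with_na = values.mean(skipna=False)

print(mean_skipping_na, count_non_na, mean_with_na)
```

With `skipna=False` the result is NaN, which can be useful when a missing value should invalidate the whole statistic.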

#### The isnull method
Returns a boolean Series indicating, element by element, which values of a Series are NA.

In [9]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

Python's built-in None is also treated as NA.

In [10]:
string_data[0] = None

In [11]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
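
A boolean mask like the one above is often summed to count missing entries. The following is a minimal sketch with a hypothetical Series `names`; `isna` and `notna` are aliases of `isnull` and `notnull`.

```python
import numpy as np
import pandas as pd

# Hypothetical example data mixing None and NaN.
names = pd.Series([None, "hu", np.nan, "ma"])

# isna() is an alias of isnull(); both flag None and NaN.
mask = names.isna()

# True counts as 1, so summing the mask counts the missing entries.
n_missing = int(mask.sum())

# notna() / notnull() is the element-wise negation.
n_present = int(names.notna().sum())

print(n_missing, n_present)
```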

Methods for handling NA:
![](images/20190705190845.jpg)
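#### Filtering out missing data
There are several ways to filter out missing data: (1) by hand; (2) with the dropna method.

For a Series, dropna returns a new Series containing only the non-null values together with their index labels.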

In [12]:
from numpy import nan as NA
data = pd.Series([1,NA,3.5,NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

The same result can be achieved by hand with boolean indexing.

In [13]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64
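
Both approaches keep the original index labels (0, 2, 4) and leave the source Series untouched. If consecutive labels are wanted afterwards, one option is `reset_index(drop=True)`; the sketch below uses hypothetical variable names `kept` and `renumbered`.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# dropna returns a new Series; the surviving labels are preserved.
kept = data.dropna()

# Optionally renumber the index from 0; drop=True discards the old labels.
renumbered = kept.reset_index(drop=True)

print(list(kept.index), list(renumbered.index), len(data))
```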

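For a DataFrame, you can drop rows or columns that are entirely NA or that contain any NA. By default, dropna drops every row that contains at least one missing value.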

In [15]:
data = pd.DataFrame([[1,6.2,5],[3,NA,NA],[NA,NA,NA],[NA,3.7,5.6]])

In [16]:
cleaned = data.dropna()

In [17]:
cleaned

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.2 | 5.0 |


Passing how='all' drops only the rows that are entirely NA.

In [18]:
data.dropna(how='all')

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.2 | 5.0 |
| 1 | 3.0 | NaN | NaN |
| 3 | NaN | 3.7 | 5.6 |


To drop columns instead, pass axis=1.

In [19]:
data[4] = NA

In [20]:
data

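| | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| 0 | 1.0 | 6.2 | 5.0 | NaN |
| 1 | 3.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | NaN | 3.7 | 5.6 | NaN |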


In [21]:
data.dropna(axis=1,how='all')

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.2 | 5.0 |
| 1 | 3.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 3.7 | 5.6 |
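
The `how` and `axis` options can be combined with `subset`, which restricts the NA check to particular labels. The sketch below rebuilds the same four-row frame under the hypothetical name `frame`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1, 6.2, 5], [3, np.nan, np.nan], [np.nan, np.nan, np.nan], [np.nan, 3.7, 5.6]]
)

# Default: how='any', axis=0 -> drop rows containing at least one NA.
rows_any = frame.dropna()

# subset restricts the NA check to the listed columns (here column 0 only).
rows_col0 = frame.dropna(subset=[0])

# axis=1 with the default how='any' drops every column that has an NA.
cols_any = frame.dropna(axis=1)

print(list(rows_any.index), list(rows_col0.index), cols_any.shape)
```

Because every column of this frame contains at least one NA, the last call leaves no columns at all, which is why `how='all'` is usually the safer choice when dropping columns.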


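Another way to filter DataFrame rows, which often comes up with time series data, is the **thresh** parameter. It keeps only rows with enough observations: with thresh=n, a row is kept if it has at least n non-NA values.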

In [50]:
df = pd.DataFrame(np.random.randn(7,3))

In [51]:
df

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.921776 | -0.412738 | 0.741995 |
| 1 | -1.006949 | -0.925104 | -0.529383 |
| 2 | -1.362496 | 1.215897 | 0.405062 |
| 3 | 0.428054 | -1.401715 | 1.12405 |
| 4 | 2.39146 | -0.691009 | 0.855354 |
| 5 | -0.29414 | -1.028768 | 0.006115 |
| 6 | 1.261771 | 0.213582 | -0.183341 |


In [52]:
df.loc[:4, 1]

0   -0.412738
1   -0.925104
2    1.215897
3   -1.401715
4   -0.691009
Name: 1, dtype: float64

In [53]:
df.loc[:4,1] = NA

In [54]:
df.loc[:2,2] = NA

In [55]:
df

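| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.921776 | NaN | NaN |
| 1 | -1.006949 | NaN | NaN |
| 2 | -1.362496 | NaN | NaN |
| 3 | 0.428054 | NaN | 1.12405 |
| 4 | 2.39146 | NaN | 0.855354 |
| 5 | -0.29414 | -1.028768 | 0.006115 |
| 6 | 1.261771 | 0.213582 | -0.183341 |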


In [56]:
df.dropna(thresh=1)

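| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.921776 | NaN | NaN |
| 1 | -1.006949 | NaN | NaN |
| 2 | -1.362496 | NaN | NaN |
| 3 | 0.428054 | NaN | 1.12405 |
| 4 | 2.39146 | NaN | 0.855354 |
| 5 | -0.29414 | -1.028768 | 0.006115 |
| 6 | 1.261771 | 0.213582 | -0.183341 |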


In [57]:
df.dropna(thresh=2)

| | 0 | 1 | 2 |
|---|---|---|---|
| 3 | 0.428054 | NaN | 1.12405 |
| 4 | 2.39146 | NaN | 0.855354 |
| 5 | -0.29414 | -1.028768 | 0.006115 |
| 6 | 1.261771 | 0.213582 | -0.183341 |
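
One way to read `thresh` is as a shortcut for counting the non-NA values per row and filtering on that count. The sketch below checks this on a small hypothetical frame (fixed values rather than random ones, so the result is reproducible).

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows have 1, 2, and 3 non-NA values respectively.
frame = pd.DataFrame(
    [[0.1, np.nan, np.nan], [0.2, np.nan, 1.5], [0.3, 0.4, 0.5]]
)

# Count the non-NA values in each row.
non_na_per_row = frame.notna().sum(axis=1)

# Keeping rows with at least 2 non-NA values by hand...
manual = frame[non_na_per_row >= 2]

# ...gives the same rows as dropna(thresh=2).
via_thresh = frame.dropna(thresh=2)

print(manual.equals(via_thresh), list(via_thresh.index))
```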


#### Filling in missing data
fillna is the main method for filling in missing data.
1. Replacing missing values with a constant

In [58]:
df.fillna(0)

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.921776 | 0.0 | 0.0 |
| 1 | -1.006949 | 0.0 | 0.0 |
| 2 | -1.362496 | 0.0 | 0.0 |
| 3 | 0.428054 | 0.0 | 1.12405 |
| 4 | 2.39146 | 0.0 | 0.855354 |
| 5 | -0.29414 | -1.028768 | 0.006115 |
| 6 | 1.261771 | 0.213582 | -0.183341 |


2. Calling fillna with a dict to fill each column with a different value

In [60]:
df.fillna({1:0.5,2:-1})

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.921776 | 0.5 | -1.0 |
| 1 | -1.006949 | 0.5 | -1.0 |
| 2 | -1.362496 | 0.5 | -1.0 |
| 3 | 0.428054 | 0.5 | 1.12405 |
| 4 | 2.39146 | 0.5 | 0.855354 |
| 5 | -0.29414 | -1.028768 | 0.006115 |
| 6 | 1.261771 | 0.213582 | -0.183341 |
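
The per-column fill values do not have to be typed out by hand. A Series indexed by column label works like the dict above, so a common pattern is to fill each column with its own mean. This is a sketch on a hypothetical frame with columns `a` and `b`, not something shown in the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column.
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# mean() skips NaN, giving one fill value per column (a: 2.0, b: 6.0).
col_means = frame.mean()

# The Series is matched to columns by label, like a dict of fill values.
filled = frame.fillna(col_means)

print(filled)
```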


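fillna returns a new object by default; passing inplace=True modifies the existing object in place instead. The call below uses inplace=False (the default), so df is left unchanged.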

In [62]:
_ = df.fillna(0, inplace=False)

In [63]:
df

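| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.921776 | NaN | NaN |
| 1 | -1.006949 | NaN | NaN |
| 2 | -1.362496 | NaN | NaN |
| 3 | 0.428054 | NaN | 1.12405 |
| 4 | 2.39146 | NaN | 0.855354 |
| 5 | -0.29414 | -1.028768 | 0.006115 |
| 6 | 1.261771 | 0.213582 | -0.183341 |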
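
For contrast, the sketch below shows the two usual ways to make the fill stick: passing `inplace=True`, or reassigning the returned object. The frames `in_place` and `reassigned` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

in_place = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})
reassigned = in_place.copy()

# Option 1: modify the existing object directly.
in_place.fillna(0, inplace=True)

# Option 2: keep the default (inplace=False) and rebind the name.
reassigned = reassigned.fillna(0)

print(in_place.equals(reassigned))
```

Reassignment is generally the easier style to chain with other method calls.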


The interpolation methods available for reindex can also be used with fillna.

In [65]:
df = pd.DataFrame(np.random.randn(6,3))

In [66]:
df.loc[2:,1] = NA

In [67]:
df.loc[4:,2] = NA

In [68]:
df

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.439551 | 0.279464 | 0.612358 |
| 1 | 1.906098 | -0.221094 | 2.2731 |
| 2 | 0.817039 | NaN | 1.095471 |
| 3 | 1.541326 | NaN | -0.908089 |
| 4 | -1.297587 | NaN | NaN |
| 5 | -0.1229 | NaN | NaN |


In [69]:
df.fillna(method='ffill')  # deprecated since pandas 2.1; df.ffill() is the equivalent

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.439551 | 0.279464 | 0.612358 |
| 1 | 1.906098 | -0.221094 | 2.2731 |
| 2 | 0.817039 | -0.221094 | 1.095471 |
| 3 | 1.541326 | -0.221094 | -0.908089 |
| 4 | -1.297587 | -0.221094 | -0.908089 |
| 5 | -0.1229 | -0.221094 | -0.908089 |


In [70]:
df.fillna(method='ffill', limit=2)  # on pandas >= 2.1, prefer df.ffill(limit=2)

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.439551 | 0.279464 | 0.612358 |
| 1 | 1.906098 | -0.221094 | 2.2731 |
| 2 | 0.817039 | -0.221094 | 1.095471 |
| 3 | 1.541326 | -0.221094 | -0.908089 |
| 4 | -1.297587 | NaN | -0.908089 |
| 5 | -0.1229 | NaN | -0.908089 |
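
The same forward fill, and its backward counterpart, are also available as the dedicated methods `ffill` and `bfill`. The sketch below uses a hypothetical Series `gaps` with fixed values so the effect of `limit` is easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three consecutive gaps between two known values.
gaps = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill: carry the last valid value forward.
forward = gaps.ffill()

# limit caps how many consecutive NaN values get filled.
forward_limited = gaps.ffill(limit=2)

# Backward fill: pull the next valid value backward.
backward = gaps.bfill()

print(forward.tolist(), forward_limited.tolist(), backward.tolist())
```

With `limit=2`, the third gap in the run stays NaN, mirroring column 1 in the output above.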


The parameters of fillna are listed below.
![](images/20190707204435.jpg)