# Handling Missing Values in Pandas

## What is a missing value?

Intuitively, a missing value marks data that is absent.
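As a rough illustration of what pandas itself treats as missing, the sketch below checks a few values with `pd.isna`. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Three common markers that pandas recognizes as missing.
markers = [None, np.nan, pd.NaT]
is_missing = [bool(pd.isna(m)) for m in markers]

# A whitespace-only string is an ordinary value, not a missing one.
blank_is_missing = bool(pd.isna(" "))

print(is_missing, blank_is_missing)
```

This is why a blank string in the data needs separate handling if it should count as missing.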

### Creating the data

In [1]:
import numpy as np
import pandas as pd

index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
data = {
    "age": [18, 30, np.nan, 40, np.nan, 30],
    "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen", np.nan, " "],
    "sex": [None, "male", "female", "male", np.nan, "unknown"],
    "birth": ["2000-02-10", "1988-10-17", None, "1978-08-08", np.nan, "1988-10-17"]
}
user_info = pd.DataFrame(data=data, index=index)
# Convert the birth dates to datetime
user_info["birth"] = pd.to_datetime(user_info.birth)
user_info

| name | age | birth | city | sex |
| --- | --- | --- | --- | --- |
| Tom | 18.0 | 2000-02-10 | BeiJing | None |
| Bob | 30.0 | 1988-10-17 | ShangHai | male |
| Mary | NaN | NaT | GuangZhou | female |
| James | 40.0 | 1978-08-08 | ShenZhen | male |
| Andy | NaN | NaT | NaN | NaN |
| Alice | 30.0 | 1988-10-17 |   | unknown |


### Identifying missing and non-missing values

In [2]:
# Identify missing values
# True: missing
# False: not missing
user_info.isnull()

| name | age | birth | city | sex |
| --- | --- | --- | --- | --- |
| Tom | False | False | False | True |
| Bob | False | False | False | False |
| Mary | True | True | False | False |
| James | False | False | False | False |
| Andy | True | True | True | True |
| Alice | False | False | False | False |


In [3]:
# Identify non-missing values
# True: not missing
# False: missing
user_info.notnull()

| name | age | birth | city | sex |
| --- | --- | --- | --- | --- |
| Tom | True | True | True | False |
| Bob | True | True | True | True |
| Mary | False | False | True | True |
| James | True | True | True | True |
| Andy | False | False | False | False |
| Alice | True | True | True | True |
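Because the masks above are boolean, they can be aggregated to summarize how much data is missing. The following is a minimal sketch on a small stand-in DataFrame; the names `df`, `missing_per_column`, `missing_ratio`, and `rows_with_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [18, 30, np.nan, 40, np.nan, 30],
    "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen", np.nan, " "],
})

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = df.isnull().sum()

# The mean of the mask is the fraction of missing values per column.
missing_ratio = df.isnull().mean()

# any(axis=1) flags rows that contain at least one missing value.
rows_with_missing = df.isnull().any(axis=1)

print(missing_per_column)
print(rows_with_missing)
```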


### Filtering out rows with missing values

In [4]:
# Drop the rows where age is missing
user_info[user_info.age.notnull()]

| name | age | birth | city | sex |
| --- | --- | --- | --- | --- |
| Tom | 18.0 | 2000-02-10 | BeiJing | None |
| Bob | 30.0 | 1988-10-17 | ShangHai | male |
| James | 40.0 | 1978-08-08 | ShenZhen | male |
| Alice | 30.0 | 1988-10-17 |   | unknown |
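The same boolean-mask approach extends to several columns by combining masks with `&` (and) or `|` (or). This is a sketch on a small made-up DataFrame, not the tutorial's data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "age": [18, np.nan, 40, np.nan],
        "city": ["BeiJing", "GuangZhou", np.nan, np.nan],
    },
    index=["Tom", "Mary", "James", "Andy"],
)

# Keep rows where both age and city are present.
both_present = df[df["age"].notnull() & df["city"].notnull()]

# Keep rows where at least one of the two is present.
either_present = df[df["age"].notnull() | df["city"].notnull()]

print(both_present)
print(either_present)
```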


## Dropping missing values

Use .dropna() to drop missing values.

In [5]:
user_info.age.dropna()

name
Tom      18.0
Bob      30.0
James    40.0
Alice    30.0
Name: age, dtype: float64

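Calling dropna on a Series is straightforward. For a DataFrame, more parameters are available.

* axis: whether to operate on rows or columns

  * axis=0 (default): drop rows

  * axis=1: drop columns

* how: the drop condition

  * any (default): drop a row/column if any of its values is missing

  * all: drop a row/column only if all of its values are missing

* subset: the labels to consider when checking for missing values, for example a list of column names when dropping rows.

* thresh: an integer; with thresh=3, a row/column is kept only if it has at least 3 non-missing values.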
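The thresh and axis=1 options are not exercised in the cells that follow, so here is a small sketch of both on a made-up DataFrame. Recent pandas versions appear to reject passing how and thresh together, so only thresh is set.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "age": [18, np.nan, np.nan],
        "city": ["BeiJing", "GuangZhou", np.nan],
        "sex": ["male", "female", np.nan],
    },
    index=["Tom", "Mary", "Andy"],
)

# Rows: Tom has 3 non-missing values, Mary has 2, Andy has 0.
# thresh=2 keeps rows with at least 2 non-missing values.
kept_rows = df.dropna(thresh=2)

# Columns: age has 1 non-missing value, city and sex have 2 each.
# With axis=1 the same rule is applied to columns.
kept_cols = df.dropna(axis=1, thresh=2)

print(kept_rows)
print(kept_cols)
```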

In [6]:
# Drop a row if any of its fields is missing
user_info.dropna(axis=0, how="any")

| name | age | birth | city | sex |
| --- | --- | --- | --- | --- |
| Bob | 30.0 | 1988-10-17 | ShangHai | male |
| James | 40.0 | 1978-08-08 | ShenZhen | male |
| Alice | 30.0 | 1988-10-17 |   | unknown |


In [7]:
# Drop a row only if all of its fields are missing
user_info.dropna(axis=0, how="all")

| name | age | birth | city | sex |
| --- | --- | --- | --- | --- |
| Tom | 18.0 | 2000-02-10 | BeiJing | None |
| Bob | 30.0 | 1988-10-17 | ShangHai | male |
| Mary | NaN | NaT | GuangZhou | female |
| James | 40.0 | 1978-08-08 | ShenZhen | male |
| Alice | 30.0 | 1988-10-17 |   | unknown |


In [8]:
# Drop a row if either city or sex is missing
user_info.dropna(axis=0, how="any", subset=["city", "sex"])

| name | age | birth | city | sex |
| --- | --- | --- | --- | --- |
| Bob | 30.0 | 1988-10-17 | ShangHai | male |
| Mary | NaN | NaT | GuangZhou | female |
| James | 40.0 | 1978-08-08 | ShenZhen | male |
| Alice | 30.0 | 1988-10-17 |   | unknown |


## Filling missing values

Use .fillna() to fill missing values.

### Filling with a scalar

In [9]:
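# Fill missing ages with 0 and assign the result back. In recent pandas versions,
# user_info.age.fillna(0, inplace=True) may act on a temporary copy and leave user_info unchanged.
user_info["age"] = user_info["age"].fillna(0)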
user_info

| name | age | birth | city | sex |
| --- | --- | --- | --- | --- |
| Tom | 18.0 | 2000-02-10 | BeiJing | None |
| Bob | 30.0 | 1988-10-17 | ShangHai | male |
| Mary | 0.0 | NaT | GuangZhou | female |
| James | 40.0 | 1978-08-08 | ShenZhen | male |
| Andy | 0.0 | NaT | NaN | NaN |
| Alice | 30.0 | 1988-10-17 |   | unknown |


In [10]:
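# Fill missing cities with "china". fillna returns a new Series by default,
# so the result must be assigned back for user_info to change.
user_info["city"] = user_info["city"].fillna("china")
# Note: the whitespace-only string " " is not a missing value, so it is not replaced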
user_info

| name | age | birth | city | sex |
| --- | --- | --- | --- | --- |
| Tom | 18.0 | 2000-02-10 | BeiJing | None |
| Bob | 30.0 | 1988-10-17 | ShangHai | male |
| Mary | 0.0 | NaT | GuangZhou | female |
| James | 40.0 | 1978-08-08 | ShenZhen | male |
| Andy | 0.0 | NaT | china | NaN |
| Alice | 30.0 | 1988-10-17 |   | unknown |
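Several columns can also be filled in one call by passing a dict that maps column names to fill values. The sketch below uses a stand-in DataFrame; filling age with the column mean is just one possible choice, picked here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [18, 30, np.nan, 40, np.nan, 30],
    "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen", np.nan, " "],
})

# mean() skips NaN by default: (18 + 30 + 40 + 30) / 4 = 29.5
age_mean = df["age"].mean()

# Each key names a column, each value is the fill value for that column.
# fillna returns a new DataFrame; df itself is not modified.
filled = df.fillna({"age": age_mean, "city": "unknown"})

print(filled)
```

The whitespace-only city in the last row is left as it is, since it is not a missing value.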


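### Filling with the previous or next valid value

Calling .ffill() fills each missing value with the previous valid value. Older code writes this as fillna(method='ffill') or fillna(method='pad'); the method argument is deprecated in recent pandas releases.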

In [22]:
index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
data = {
    "age": [18, 30, np.nan, 40, np.nan, 30],
    "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen", np.nan, " "],
    "sex": [None, "male", "female", "male", np.nan, "unknown"],
    "birth": ["2000-02-10", "1988-10-17", None, "1978-08-08", np.nan, "1988-10-17"]
}
user_info = pd.DataFrame(data=data, index=index)
# Convert the birth dates to datetime
user_info["birth"] = pd.to_datetime(user_info.birth)
user_info

| name | age | birth | city | sex |
| --- | --- | --- | --- | --- |
| Tom | 18.0 | 2000-02-10 | BeiJing | None |
| Bob | 30.0 | 1988-10-17 | ShangHai | male |
| Mary | NaN | NaT | GuangZhou | female |
| James | 40.0 | 1978-08-08 | ShenZhen | male |
| Andy | NaN | NaT | NaN | NaN |
| Alice | 30.0 | 1988-10-17 |   | unknown |


In [23]:
# Forward fill; same result as the older fillna(method="ffill"), which is deprecated in recent pandas
user_info.age.ffill()

name
Tom      18.0
Bob      30.0
Mary     30.0
James    40.0
Andy     40.0
Alice    30.0
Name: age, dtype: float64

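Calling .bfill() fills each missing value with the next valid value. Older code writes this as fillna(method='bfill') or fillna(method='backfill').

In [24]:
# Backward fill; same result as the older fillna(method="backfill"), which is deprecated in recent pandas
user_info.age.bfill()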

name
Tom      18.0
Bob      30.0
Mary     40.0
James    40.0
Andy     30.0
Alice    30.0
Name: age, dtype: float64
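Two details of forward and backward filling are worth seeing on a longer gap: the `limit` parameter caps how many consecutive missing values are filled, and a missing value with no valid neighbor in the fill direction stays missing. A minimal sketch with made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one consecutive NaN after each valid value: 1, 1, NaN, NaN, 5
forward = s.ffill(limit=1)

# Fill at most one consecutive NaN before each valid value: 1, NaN, NaN, 5, 5
backward = s.bfill(limit=1)

# A leading NaN has no previous value, so forward fill leaves it missing.
leading = pd.Series([np.nan, 2.0]).ffill()

print(forward.tolist(), backward.tolist(), leading.tolist())
```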

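### Filling with linear interpolation

Use .interpolate(). By default it fills each gap with evenly spaced values between the nearest valid neighbors: here Mary gets (30 + 40) / 2 = 35 and Andy gets (40 + 30) / 2 = 35.

In [26]:
# Assign the result back rather than relying on inplace=True on a selected column
user_info["age"] = user_info["age"].interpolate()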
user_info

| name | age | birth | city | sex |
| --- | --- | --- | --- | --- |
| Tom | 18.0 | 2000-02-10 | BeiJing | None |
| Bob | 30.0 | 1988-10-17 | ShangHai | male |
| Mary | 35.0 | NaT | GuangZhou | female |
| James | 40.0 | 1978-08-08 | ShenZhen | male |
| Andy | 35.0 | NaT | NaN | NaN |
| Alice | 30.0 | 1988-10-17 |   | unknown |
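With a gap of more than one value, the default linear method spaces the filled values evenly by position. The sketch below works through a gap of two on a made-up Series; a leading NaN is also shown, which the default settings leave unfilled because there is no earlier value to interpolate from.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Step size is (40 - 10) / 3 = 10, so the gap becomes 20 and 30.
filled = s.interpolate()

# The first value has no left neighbor and remains NaN;
# the middle gap becomes (1 + 3) / 2 = 2.
with_leading = pd.Series([np.nan, 1.0, np.nan, 3.0]).interpolate()

print(filled.tolist(), with_leading.tolist())
```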


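## Replacing values that should count as missing

What exactly counts as a missing value? None, np.nan, and NaT are the values that pandas itself treats as missing. In practice, though, we often want to treat certain anomalous values as missing too.

For example, suppose our user records are meant to cover only young adults; an age of 40 can then be considered an outlier. Likewise, sex is recorded as male or female, and "unknown" was entered whenever the value was not known, so "unknown" is effectively a missing value. Blank strings sometimes appear in the data as well, and these can also be treated as missing.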

### The replace method

In [27]:
# Replace age == 40 with NaN
user_info.age.replace(40, np.nan)

name
Tom      18.0
Bob      30.0
Mary     35.0
James     NaN
Andy     35.0
Alice    30.0
Name: age, dtype: float64

For a DataFrame, a dict can specify the value to replace in each column:

In [28]:
user_info.replace({"age": 40, "birth": pd.Timestamp("1978-08-08")}, np.nan)

| name | age | birth | city | sex |
| --- | --- | --- | --- | --- |
| Tom | 18.0 | 2000-02-10 | BeiJing | None |
| Bob | 30.0 | 1988-10-17 | ShangHai | male |
| Mary | 35.0 | NaT | GuangZhou | female |
| James | NaN | NaT | ShenZhen | male |
| Andy | 35.0 | NaT | NaN | NaN |
| Alice | 30.0 | 1988-10-17 |   | unknown |


### Replacing a specific string

In [29]:
user_info.sex.replace("unknown", np.nan)

name
Tom        None
Bob        male
Mary     female
James      male
Andy        NaN
Alice       NaN
Name: sex, dtype: object

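### Replacing with a regular expression

In [30]:
# Replace empty or whitespace-only strings with NaN.
# The anchors ^ and $ keep values that merely contain a space from being matched.
user_info.city.replace(r'^\s*$', np.nan, regex=True)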

name
Tom        BeiJing
Bob       ShangHai
Mary     GuangZhou
James     ShenZhen
Andy           NaN
Alice          NaN
Name: city, dtype: object
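One thing to watch with regex replacement: when the replacement is not a string (such as NaN), pandas appears to replace the whole cell as soon as the pattern matches anywhere in it. The sketch below compares an unanchored pattern with an anchored one on a made-up object-dtype Series; the city names are illustrative.

```python
import numpy as np
import pandas as pd

city = pd.Series(["BeiJing", "New York", " ", ""], dtype=object)

# Unanchored: matches any cell containing whitespace, including "New York".
loose = city.replace(r"\s+", np.nan, regex=True)

# Anchored: matches only cells that are empty or consist solely of whitespace.
strict = city.replace(r"^\s*$", np.nan, regex=True)

print(loose.isna().tolist())
print(strict.isna().tolist())
```

On data where every non-blank value is a single word, as in this tutorial, both patterns give the same result; the anchored form is the safer default.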

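## Filling from another object

combine_first fills the missing values of one Series with the values found at the same index labels in another Series.

In [31]:
age_new = user_info.age.copy()
age_new.fillna(20, inplace=True)
# age was already interpolated above, so nothing is left to fill and the result equals user_info.age
user_info.age.combine_first(age_new)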

name
Tom      18.0
Bob      30.0
Mary     35.0
James    40.0
Andy     35.0
Alice    30.0
Name: age, dtype: float64
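To see combine_first actually fill something, here is a minimal sketch with two made-up Series whose indexes only partly overlap. Values are matched by index label, existing values in the caller win, and labels that appear only in the other Series are added to the result.

```python
import numpy as np
import pandas as pd

primary = pd.Series([18.0, np.nan, np.nan], index=["Tom", "Mary", "Andy"])
backup = pd.Series([99.0, 25.0, 31.0], index=["Tom", "Andy", "Zoe"])

combined = primary.combine_first(backup)

# Tom keeps 18.0 (primary wins), Andy is filled with 25.0 from backup,
# Mary stays NaN (no backup value), and Zoe is added from backup.
print(combined)
```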