# Chapter 7 Missing Data
Reference: DataWhale, https://datawhalechina.github.io/joyful-pandas/build/html/%E7%9B%AE%E5%BD%95/ch7.html#id2

In [1]:
import numpy as np
import pandas as pd

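## 1. Counting and Dropping Missing Values
### 1.1 Counting missing values
isna/isnull check element by element whether a value is missing (True/False); combine the result with aggregation functions such as sum or mean to get statistics.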

In [2]:
df=pd.read_csv('/Users/jie/Documents/Python/joyful-pandas-master/data/learn_pandas.csv',
              usecols=['Grade','Name','Gender','Height','Weight','Transfer'])
df.isna().head(3)

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False


In [3]:
#missing percentage
df.isna().mean()

Grade       0.000
Name        0.000
Gender      0.000
Height      0.085
Weight      0.055
Transfer    0.060
dtype: float64
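
The counting idea above can be reproduced without the CSV file. The following is a minimal sketch on a hypothetical toy DataFrame (the name `toy` and its values are assumptions, not data from learn_pandas.csv): because True counts as 1, `sum()` gives the number of missing values and `mean()` gives the missing fraction per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for learn_pandas.csv
toy = pd.DataFrame({
    "Height": [158.9, np.nan, 188.9, np.nan],
    "Weight": [46.0, 70.0, np.nan, 41.0],
})

mask = toy.isna()             # boolean frame, True where a value is missing
missing_count = mask.sum()    # True counts as 1, so this counts missing values
missing_ratio = mask.mean()   # mean of booleans = fraction of missing values

print(missing_count)
print(missing_ratio)
```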

In [4]:
#show missing details (row)
df[df.Height.isna()].head(3)

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
3,Sophomore,Xiaojuan Sun,Female,,41.0,N
12,Senior,Peng You,Female,,48.0,
26,Junior,Yanli You,Female,,48.0,N


In [5]:
#show missing details (col)
#all
sub_set=df[['Height','Weight','Transfer']]
df[sub_set.isna().all(axis=1)]

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
102,Junior,Chengli Zhao,Male,,,


In [6]:
#any
df[sub_set.isna().any(axis=1)].head(3)

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
3,Sophomore,Xiaojuan Sun,Female,,41.0,N
9,Junior,Juan Xu,Female,164.8,,N
12,Senior,Peng You,Female,,48.0,


In [7]:
#notna()
df[sub_set.notna().all(axis=1)].head(3)

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
0,Freshman,Gaopeng Yang,Female,158.9,46.0,N
1,Freshman,Changqiang You,Male,166.5,70.0,N
2,Senior,Mei Sun,Male,188.9,89.0,N


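### 1.2 Dropping missing values
The main parameters of dropna are: axis (default 0), how (drop mode, any/all), thresh (a threshold on the number of non-missing values: rows or columns along the chosen axis with fewer non-missing values than this are dropped), and subset (the labels to consider when looking for missing values).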

In [8]:
res=df.dropna(how='any', subset=['Height','Weight'])
res.shape

(174, 6)

In [9]:
res=df.dropna(axis=1, thresh=df.shape[0]-15)
res.shape

(200, 5)

In [10]:
#loc
res=df.loc[df[['Height','Weight']].notna().all(axis=1)]
res.shape

(174, 6)

In [11]:
res=df.loc[:,~(df.isna().sum()>15)]
res.shape

(200, 5)
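
To see each dropna parameter in isolation, here is a small sketch on a hypothetical toy DataFrame (column names `A`, `B`, `C` and all values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: A and B each have two missing values, C has none
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [np.nan, np.nan, 6.0, 7.0],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# how='any': drop a row if any of its values is missing
rows_any = toy.dropna(how="any")

# how='all' + subset: drop a row only if A and B are both missing
rows_all = toy.dropna(how="all", subset=["A", "B"])

# axis=1 + thresh: keep only columns with at least 3 non-missing values
cols_thresh = toy.dropna(axis=1, thresh=3)

print(rows_any.index.tolist())
print(rows_all.index.tolist())
print(cols_thresh.columns.tolist())
```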

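## 2. Filling and Interpolating Missing Values
### 2.1 Filling with fillna
The main parameters of fillna are: value (the fill value, a scalar or a dict mapping index labels to values), method (fill method, ffill/bfill), and limit (the maximum number of consecutive missing values to fill).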

In [12]:
s=pd.Series([np.nan,1,np.nan,np.nan,2,np.nan],list('aaabcd'))
s

a    NaN
a    1.0
a    NaN
b    NaN
c    2.0
d    NaN
dtype: float64

In [13]:
#fill each NaN with the previous value, filling at most one consecutive NaN
s.fillna(method='ffill', limit=1)

a    NaN
a    1.0
a    1.0
b    NaN
c    2.0
d    2.0
dtype: float64
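
Newer pandas releases deprecate the `method` argument of fillna in favor of the dedicated `ffill` and `bfill` methods. Assuming such a version is installed, the same forward fill can be sketched as follows; the Series is the one from the cell above.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], index=list("aaabcd"))

# Same effect as fillna(method='ffill', limit=1):
# propagate the last valid value forward, but only one step per gap
filled = s.ffill(limit=1)
print(filled)
```

The leading NaN stays missing because there is no earlier value to propagate, and the second NaN of the two-element gap stays missing because of `limit=1`.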

In [14]:
#fill with the mean
s.fillna(s.mean())

a    1.5
a    1.0
a    1.5
b    1.5
c    2.0
d    1.5
dtype: float64

In [15]:
#dictionary
s.fillna({'a':100, 'd':200})

a    100.0
a      1.0
a    100.0
b      NaN
c      2.0
d    200.0
dtype: float64

In [16]:
#groupby+fillna
df.groupby('Grade')['Height'].transform(lambda x: x.fillna(x.mean())).head()

0    158.900000
1    166.500000
2    188.900000
3    163.075862
4    174.000000
Name: Height, dtype: float64
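
The groupby + fillna pattern is easier to check by hand on a few rows. This is a sketch with hypothetical values (the toy frame is an assumption, not part of learn_pandas.csv): each missing Height is replaced by the mean of its own Grade group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Height per grade
toy = pd.DataFrame({
    "Grade": ["Freshman", "Freshman", "Senior", "Senior", "Senior"],
    "Height": [160.0, np.nan, 180.0, np.nan, 170.0],
})

# transform returns a result aligned with the original rows,
# so it can be assigned straight back to a column
filled = toy.groupby("Grade")["Height"].transform(lambda x: x.fillna(x.mean()))
print(filled.tolist())
```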

In [17]:
#practice exercise
s=pd.Series([1,np.nan,3,np.nan,np.nan])
s

0    1.0
1    NaN
2    3.0
3    NaN
4    NaN
dtype: float64

In [18]:
s1=s.fillna(method='ffill', limit=1)
s1

0    1.0
1    1.0
2    3.0
3    3.0
4    NaN
dtype: float64

In [19]:
s2=s.fillna(method='bfill', limit=1)
s2

0    1.0
1    3.0
2    3.0
3    NaN
4    NaN
dtype: float64

In [20]:
(s1+s2)/2

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
dtype: float64

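### 2.2 Interpolation functions
The main parameters of interpolate are: method (the interpolation method, default linear), limit_direction (fill direction, forward/backward/both, default forward), and limit (the maximum number of consecutive missing values to interpolate).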

In [21]:
s=pd.Series([np.nan,np.nan,1,np.nan, np.nan, np.nan,2,np.nan,np.nan])
s.values

array([nan, nan,  1., nan, nan, nan,  2., nan, nan])

In [22]:
res=s.interpolate(limit_direction='backward', limit=1)
res.values

array([ nan, 1.  , 1.  ,  nan,  nan, 1.75, 2.  ,  nan,  nan])

In [23]:
res=s.interpolate(limit_direction='both', limit=1)
res.values

array([ nan, 1.  , 1.  , 1.25,  nan, 1.75, 2.  , 2.  ,  nan])
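
For comparison with the limited calls above, this sketch shows the same Series interpolated with default settings, and with the `limit_area='inside'` option, which restricts filling to NaNs surrounded by valid values. The `limit_area` parameter is not discussed in the original notes; it is included here only as an illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan, np.nan])

# Default: linear, forward direction, no limit.
# Leading NaNs stay missing; trailing NaNs repeat the last valid value.
default = s.interpolate()

# Only fill NaNs that lie between two valid values
inside = s.interpolate(limit_area="inside")

print(default.values)
print(inside.values)
```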

Nearest-neighbor interpolation: each missing element takes the value of the non-missing element closest to it.

In [24]:
s.interpolate('nearest').values

array([nan, nan,  1.,  1.,  1.,  2.,  2., nan, nan])

Index interpolation: linear interpolation that is weighted by the index values rather than by position.

In [25]:
s=pd.Series([0,np.nan,10], index=[0,1,10])
s

0      0.0
1      NaN
10    10.0
dtype: float64

In [26]:
#compute the value at index label 1 from the index distances
s.interpolate(method='index')

0      0.0
1      1.0
10    10.0
dtype: float64
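
The difference between the default linear method and index interpolation shows up as soon as the index is unevenly spaced. A short sketch on the same Series: `linear` treats the points as equally spaced, while `index` weights by the actual index values.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, 10], index=[0, 1, 10])

by_position = s.interpolate(method="linear")  # ignores the index: midpoint of 0 and 10
by_index = s.interpolate(method="index")      # label 1 is 1/10 of the way from 0 to 10

print(by_position.loc[1])
print(by_index.loc[1])
```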

In [27]:
#index interpolation on a timestamp index
s=pd.Series([0,np.nan,10], index=pd.to_datetime(['20200101','20200102','20200111']))
s

2020-01-01     0.0
2020-01-02     NaN
2020-01-11    10.0
dtype: float64

In [28]:
s.interpolate(method='index')

2020-01-01     0.0
2020-01-02     1.0
2020-01-11    10.0
dtype: float64

## 3. Nullable Types
### 3.1 Missing-value markers and their shortcomings
In Python, a missing value is represented by None, which compares equal only to itself.

In [29]:
print(None==None)
print(None==[])

True
False


In NumPy, a missing value is represented by np.nan, which is not equal to any value, including itself.

In [30]:
print(np.nan==np.nan)
print(np.nan==None)

False
False
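
Because np.nan is never equal to itself, an equality test cannot detect it. A minimal sketch of the usual alternatives: `np.isnan` for floats, and `pd.isna`, which also recognizes None and pd.NaT.

```python
import numpy as np
import pandas as pd

x = np.nan

equal_to_self = (x == x)       # False: NaN never equals anything
differs_from_self = (x != x)   # True: a classic NaN test
is_nan_numpy = np.isnan(x)     # True, works for float input
is_na_pandas = [pd.isna(v) for v in (np.nan, None, pd.NaT)]  # all True

print(equal_to_self, differs_from_self, is_nan_numpy, is_na_pandas)
```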


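In addition, in element-wise comparisons a position holding np.nan returns False. The equals method, by contrast, skips positions where both sides are missing when it tests two objects for sameness, so it can still return True.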

In [31]:
s1=pd.Series([1,np.nan])
s2=pd.Series([1,2])
s3=pd.Series([1,np.nan])
s1==1

0     True
1    False
dtype: bool

In [32]:
s1.equals(s3)

True
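
The contrast between `==` and `equals` can be made explicit. This sketch reuses `s1` and `s3` from the cell above; the combined mask at the end is one possible way (an illustration, not from the original notes) to get an element-wise comparison that treats two NaNs as a match.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan])
s3 = pd.Series([1, np.nan])

elementwise = (s1 == s3)   # NaN == NaN is False at position 1
same = s1.equals(s3)       # True: NaNs in the same position count as equal

# Element-wise comparison that also accepts "both missing"
equal_or_both_nan = (s1 == s3) | (s1.isna() & s3.isna())

print(elementwise.tolist(), same, equal_or_both_nan.tolist())
```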

For time-series objects, pandas uses pd.NaT to denote a missing value; it plays the same role as np.nan.

In [33]:
pd.to_timedelta(['30s',np.nan])

TimedeltaIndex(['0 days 00:00:30', NaT], dtype='timedelta64[ns]', freq=None)

In [34]:
pd.to_datetime(['20200101',np.nan])

DatetimeIndex(['2020-01-01', 'NaT'], dtype='datetime64[ns]', freq=None)

In [35]:
type(np.nan)

float

### 3.2 Properties of Nullable types
There are three Nullable Series types: Int, boolean, and string.

In [36]:
pd.Series([np.nan,1], dtype='Int64')

0    <NA>
1       1
dtype: Int64

In [37]:
pd.Series([np.nan,True], dtype='boolean')

0    <NA>
1    True
dtype: boolean

In [38]:
pd.Series([np.nan, 'my_str'], dtype='string')

0      <NA>
1    my_str
dtype: string

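A boolean Series differs from a bool Series in two ways:
1. A bool list that contains missing values cannot be used for indexer selection, whereas a boolean Series can: its missing values are treated as False.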

In [39]:
s=pd.Series(['a','b'])
s_bool=pd.Series([True,np.nan])
s_boolean=pd.Series([True,np.nan]).astype('boolean')
s[s_boolean]

0    a
dtype: object
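
If missing entries in the mask should select the row instead of dropping it, the mask can be filled explicitly before indexing. A small sketch, using hypothetical variable names `mask`, `kept_default`, and `kept_na_true`:

```python
import pandas as pd

s = pd.Series(["a", "b"])
mask = pd.Series([True, pd.NA], dtype="boolean")

kept_default = s[mask]               # NA in the mask is treated as False
kept_na_true = s[mask.fillna(True)]  # make the choice explicit: NA counts as True

print(kept_default.tolist())
print(kept_na_true.tolist())
```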

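2. In logical operations, the bool type returns False at missing positions, while boolean returns a value that depends on whether the operation has a uniquely determined result.  
True | pd.NA returns True;  
False | pd.NA returns pd.NA;  
False & pd.NA returns False.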

In [40]:
s_boolean & True

0    True
1    <NA>
dtype: boolean

In [41]:
s_boolean | True

0    True
1    True
dtype: boolean

In [42]:
~s_boolean

0    False
1     <NA>
dtype: boolean
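
The same three-valued logic can be checked on scalars. In this sketch the result is determined whenever the known operand alone settles the outcome, and is pd.NA otherwise; the variable names are only for illustration.

```python
import pandas as pd

or_true = True | pd.NA     # True: the result is True whatever the missing value is
or_false = False | pd.NA   # <NA>: depends on the missing value
and_true = True & pd.NA    # <NA>: depends on the missing value
and_false = False & pd.NA  # False: the result is False whatever the missing value is

print(or_true, or_false, and_true, and_false)
```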

In [43]:
#in practice, after reading a dataset, first convert it to Nullable types with convert_dtypes
df=pd.read_csv('/Users/jie/Documents/Python/joyful-pandas-master/data/learn_pandas.csv')
df=df.convert_dtypes()
df.dtypes

School          string
Grade           string
Name            string
Gender          string
Height         float64
Weight           Int64
Transfer        string
Test_Number      Int64
Test_Date       string
Time_Record     string
dtype: object

### 3.3 Computing and grouping with missing data
1. When sum and prod are called, missing values are treated as 0 and 1 respectively.

In [44]:
s=pd.Series([2,3,np.nan,4,5])
s.sum()

14.0

In [45]:
s.prod()

120.0

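2. When cumsum is called, missing positions are skipped: they stay NaN in the output and the running total continues after them.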

In [46]:
s.cumsum()

0     2.0
1     5.0
2     NaN
3     9.0
4    14.0
dtype: float64
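
This skipping behavior is controlled by the `skipna` parameter, which defaults to True. A brief sketch of what changes when it is turned off, on the same Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 4, 5])

total_skip = s.sum()               # 14.0: NaN is ignored
total_strict = s.sum(skipna=False) # nan: one missing value poisons the result
running_strict = s.cumsum(skipna=False)  # NaN propagates to every later position

print(total_skip, total_strict)
print(running_strict.values)
```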

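3. np.nan / pd.NA: in scalar operations every result is missing, except np.nan ** 0 and 1 ** np.nan, which can be evaluated (the same holds for pd.NA);  
np.nan: comparisons such as == and > return False;  
pd.NA: comparisons return pd.NA.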

In [47]:
print(np.nan==0)
print(pd.NA==0)
print(np.nan>0)
print(pd.NA>0)
print(np.nan+1)
print(np.log(np.nan))
print(np.nan**0)
print(1**np.nan)
print(pd.NA**0)
print(1**pd.NA)

False
<NA>
False
<NA>
nan
nan
1.0
1.0
1
1


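4. diff: every result whose calculation involves a missing value is set to missing;  
pct_change: the position of a missing value gets a change rate of 0%, because in the pandas version used here the missing value is forward-filled before the change is computed.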

In [48]:
s.diff()

0    NaN
1    1.0
2    NaN
3    NaN
4    1.0
dtype: float64

In [49]:
s.pct_change()

0         NaN
1    0.500000
2    0.000000
3    0.333333
4    0.250000
dtype: float64
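
The 0% entry comes from the forward fill that pct_change applies by default in the pandas version used in these notes. Passing `fill_method=None` disables that fill; the sketch below shows the result on the same Series, where every change that involves the missing value becomes NaN, as with diff.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 4, 5])

# No forward fill: changes into and out of the NaN are themselves NaN
raw_change = s.pct_change(fill_method=None)
print(raw_change.values)
```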

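5. In groupby and get_dummies, a parameter can be set so that missing values are treated as a category of their own, which adds a "missing" category to the result.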

In [50]:
df_nan=pd.DataFrame({'category':['a','a','b',np.nan,np.nan], 'value':[1,3,5,7,9]})
df_nan

Unnamed: 0,category,value
0,a,1
1,a,3
2,b,5
3,,7
4,,9


In [51]:
df_nan.groupby('category', dropna=False)['value'].mean()

category
a      2
b      5
NaN    8
Name: value, dtype: int64

In [52]:
pd.get_dummies(df_nan.category, dummy_na=True)

Unnamed: 0,a,b,NaN
0,1,0,0
1,1,0,0
2,0,1,0
3,0,0,1
4,0,0,1
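
To make the effect of these parameters visible, this sketch runs the same grouping with and without `dropna=False` on a toy frame equivalent to `df_nan`, and counts the rows flagged in the extra dummy column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"category": ["a", "a", "b", np.nan, np.nan],
                    "value": [1, 3, 5, 7, 9]})

without_nan = toy.groupby("category")["value"].mean()              # NaN rows are dropped
with_nan = toy.groupby("category", dropna=False)["value"].mean()   # NaN becomes its own group

dummies = pd.get_dummies(toy["category"], dummy_na=True)  # adds a column for NaN
n_missing_rows = int(dummies.iloc[:, -1].sum())           # rows flagged in the NaN column

print(len(without_nan), len(with_nan), n_missing_rows)
```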


## 4. Exercises
### Ex1: Testing the association between missingness and a category

In [53]:
df1=pd.read_csv('/Users/jie/Documents/Python/joyful-pandas-master/data/missing_chi.csv')
df1.head()

Unnamed: 0,X_1,X_2,y
0,,,0
1,,,0
2,,,0
3,43.0,,0
4,,,0


In [54]:
df1.isna().mean()

X_1    0.855
X_2    0.894
y      0.000
dtype: float64

In [55]:
df1.y.value_counts(normalize=True)

0    0.918
1    0.082
Name: y, dtype: float64
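
One way to approach the association test is a Pearson chi-square statistic on the contingency table of "X_1 is missing" against y. The following is only a sketch on synthetic data (the generated `toy` frame, the missing probabilities 0.9 and 0.5, and the seed are all assumptions, not missing_chi.csv), with the statistic computed by hand from observed and expected counts.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for missing_chi.csv: X_1 is missing more often when y == 0
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
p_missing = np.where(y == 0, 0.9, 0.5)
x1 = np.where(rng.random(1000) < p_missing, np.nan, 1.0)
toy = pd.DataFrame({"X_1": x1, "y": y})

# Observed counts: rows = missing or not, columns = class label
observed = pd.crosstab(toy["X_1"].isna(), toy["y"])
obs = observed.to_numpy().astype(float)

# Expected counts under independence: row total * column total / grand total
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()

# Pearson chi-square statistic (1 degree of freedom for a 2x2 table)
chi2 = ((obs - expected) ** 2 / expected).sum()
print(observed)
print(chi2)
```

For a 2x2 table, a statistic above roughly 3.84 rejects independence at the 5% level.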

### Ex2: Solving a classification problem with a regression model

In [56]:
df2=pd.read_excel('/Users/jie/Documents/Python/joyful-pandas-master/data/color.xlsx')
df2.head(3)

Unnamed: 0,X1,X2,Color
0,-2.5,2.8,Blue
1,-1.5,1.8,Blue
2,-0.8,2.8,Blue


In [57]:
#1.
#written with reference to the solution
from sklearn.neighbors import KNeighborsRegressor
df2_dummies=pd.get_dummies(df2.Color)
df2_dummies.head(3)

Unnamed: 0,Blue,Green,Yellow
0,1,0,0
1,1,0,0
2,1,0,0


In [58]:
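#fit one regressor per dummy column (used as y); predicting at the query point gives the mean of that dummy over the 6 nearest points, i.e. the share of each color. Each predict call covers a single color, so the three results are appended to one list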
stack_list=[]
for col in df2_dummies.columns:
    reg=KNeighborsRegressor(n_neighbors=6)
    reg.fit(df2.iloc[:,:2], df2_dummies[col])
    reg_result=reg.predict([[0.8,-0.2]]).reshape(-1,1)
    stack_list.append(reg_result)

stack_list

[array([[0.16666667]]), array([[0.33333333]]), array([[0.5]])]

In [59]:
#find the maximum, i.e. the highest probability, and convert it back to a color name
code_res=pd.Series(np.hstack(stack_list).argmax(1))
code_res

0    2
dtype: int64

In [60]:
df2_dummies.columns[code_res[0]]

'Yellow'
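
What each per-color regressor computes can be spelled out without scikit-learn. This is a NumPy-only sketch on hypothetical points (the coordinates, labels, query point, and k=3 are all made up, and it uses plain Euclidean distance with uniform weights): the prediction for a color is the mean of its dummy column over the k nearest neighbors, and the predicted class is the color with the largest mean.

```python
import numpy as np
import pandas as pd

# Hypothetical training points and labels
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
colors = pd.Series(["Blue", "Blue", "Yellow", "Yellow", "Yellow", "Yellow"])
dummies = pd.get_dummies(colors).astype(float)

query = np.array([0.2, 0.1])
k = 3

# Indices of the k training points closest to the query
dist = np.linalg.norm(X - query, axis=1)
nearest = np.argsort(dist)[:k]

# Mean of each dummy over the neighbors = share of each color among them
shares = dummies.iloc[nearest].mean()
predicted = shares.idxmax()

print(shares.to_dict(), predicted)
```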

In [61]:
#2.
#also written with reference to the solution
df2_2=pd.read_csv('/Users/jie/Documents/Python/joyful-pandas-master/data/audit.csv')
df2_2.head(3)

Unnamed: 0,ID,Age,Employment,Marital,Income,Gender,Hours
0,1004641,38,Private,Unmarried,81838.0,Female,72
1,1010229,35,Private,Absent,72099.0,Male,30
2,1024587,32,Private,Divorced,154676.74,Male,40


In [62]:
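#Employment is the DV. Among the IVs, Marital and Gender are converted to dummies and the numeric columns are rescaled to [0, 1] (min-max). Following the solution, it is easiest to assemble a new dataset.
df2_2_concat=pd.concat([pd.get_dummies(df2_2[['Marital', 'Gender']]),
                        df2_2[['Age','Income','Hours']].apply(lambda x:(x-x.min())/(x.max()-x.min())),
                        df2_2.Employment],axis=1)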
df2_2_concat.head(3)

Unnamed: 0,Marital_Absent,Marital_Divorced,Marital_Married,Marital_Married-spouse-absent,Marital_Unmarried,Marital_Widowed,Gender_Female,Gender_Male,Age,Income,Hours,Employment
0,0,0,0,0,1,0,1,0,0.287671,0.168997,0.72449,Private
1,1,0,0,0,0,0,0,1,0.246575,0.148735,0.295918,Private
2,0,1,0,0,0,0,0,1,0.205479,0.320539,0.397959,Private


In [63]:
#then split into train and test sets
X_train=df2_2_concat.loc[df2_2_concat.Employment.notna()]
X_test=df2_2_concat.loc[df2_2_concat.Employment.isna()]
print(X_train.shape)
print(X_test.shape)

(1900, 12)
(100, 12)


In [64]:
#convert Employment to dummy variables
from sklearn.neighbors import KNeighborsRegressor
df2_2_dummies=pd.get_dummies(X_train.Employment)
df2_2_dummies.head(3)

Unnamed: 0,Consultant,PSFederal,PSLocal,PSState,Private,SelfEmp,Unemployed,Volunteer
0,0,0,0,0,1,0,0,0
1,0,0,0,0,1,0,0,0
2,0,0,0,0,1,0,0,0


In [65]:
stack_list1=[]
for col in df2_2_dummies.columns:
    reg1=KNeighborsRegressor(n_neighbors=6)
    reg1.fit(X_train.iloc[:,:-1], df2_2_dummies[col])
    reg1_result=reg1.predict(X_test.iloc[:,:-1]).reshape(-1,1)
    stack_list1.append(reg1_result)

In [66]:
code_res=pd.Series(np.hstack(stack_list1).argmax(1))
code_res

0     2
1     0
2     4
3     4
4     4
     ..
95    4
96    4
97    4
98    4
99    4
Length: 100, dtype: int64

In [67]:
#replace each positional code with the corresponding column name
cat_res=code_res.replace(dict(zip(list(range(df2_2_dummies.shape[1])), df2_2_dummies.columns)))
cat_res

0        PSLocal
1     Consultant
2        Private
3        Private
4        Private
         ...    
95       Private
96       Private
97       Private
98       Private
99       Private
Length: 100, dtype: object
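
Since the codes are column positions, the names can also be looked up by indexing the columns directly, without building a replacement dictionary. A sketch with hypothetical stand-ins for `df2_2_dummies.columns` and `code_res`:

```python
import pandas as pd

# Hypothetical stand-ins for df2_2_dummies.columns and code_res
columns = pd.Index(["Consultant", "PSFederal", "PSLocal", "PSState", "Private"])
code_res = pd.Series([2, 0, 4, 4])

# Positional lookup: code i maps to columns[i]
cat_res = pd.Series(columns[code_res.to_numpy()], index=code_res.index)
print(cat_res.tolist())
```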

In [68]:
#finally, write the predictions back into the original dataset
df2_2.loc[df2_2.Employment.isna(), 'Employment']=cat_res.values
df2_2

Unnamed: 0,ID,Age,Employment,Marital,Income,Gender,Hours
0,1004641,38,Private,Unmarried,81838.00,Female,72
1,1010229,35,Private,Absent,72099.00,Male,30
2,1024587,32,Private,Divorced,154676.74,Male,40
3,1038288,45,Private,Married,27743.82,Male,55
4,1044221,60,Private,Married,7568.23,Male,40
...,...,...,...,...,...,...,...
1995,9957280,62,Private,Married,24080.59,Male,40
1996,9964393,35,Consultant,Married,57497.30,Male,40
1997,9972967,32,Private,Married,30538.18,Male,44
1998,9991103,34,Private,Unmarried,113425.67,Male,45
