Filling missing values with per-group statistics

In [2]:
import pandas as pd
import numpy as np

In [21]:
A = pd.DataFrame({"Class":[1,2,2,3,1,3,1,3], "Math":[80, 60, np.nan, 76, np.nan, np.nan, 90, 70], "Chinese":[70, np.nan, 45, 76,  90, 70, np.nan, np.nan,]})
A

,Class,Math,Chinese
0,1,80.0,70.0
1,2,60.0,
2,2,,45.0
3,3,76.0,76.0
4,1,,90.0
5,3,,70.0
6,1,90.0,
7,3,70.0,


#### By default, fillna fills with a single fixed value

In [22]:
A.fillna(A['Math'].mean())

,Class,Math,Chinese
0,1,80.0,70.0
1,2,60.0,75.2
2,2,75.2,45.0
3,3,76.0,76.0
4,1,75.2,90.0
5,3,75.2,70.0
6,1,90.0,75.2
7,3,70.0,75.2
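
Because a scalar passed to `fillna` is applied to every column, the Math mean (75.2) above also lands in the Chinese column. If each column should receive its own mean instead, one option is to pass a Series of per-column means, which pandas matches by column name. A minimal sketch that rebuilds the same `A` so it runs on its own (the name `col_means` is only illustrative):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({
    "Class": [1, 2, 2, 3, 1, 3, 1, 3],
    "Math": [80, 60, np.nan, 76, np.nan, np.nan, 90, 70],
    "Chinese": [70, np.nan, 45, 76, 90, 70, np.nan, np.nan],
})

# Per-column means; NaN values are skipped by default.
col_means = A[["Math", "Chinese"]].mean()

# A Series passed to DataFrame.fillna is matched by column name,
# so each column gets its own mean rather than one shared scalar.
filled = A.fillna(col_means)
print(filled)
```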


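#### You can also fill with a Series or DataFrame of matching shape

When a Series is filled from another Series, values are matched by index label, so the filler should cover the same index as the target.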

In [23]:
C = pd.Series(np.linspace(80, 90, num=A.shape[0]))
C

0    80.000000
1    81.428571
2    82.857143
3    84.285714
4    85.714286
5    87.142857
6    88.571429
7    90.000000
dtype: float64

In [25]:
# Fill a Series from another Series
A['Math'].fillna(C)

0    80.000000
1    60.000000
2    82.857143
3    76.000000
4    85.714286
5    87.142857
6    90.000000
7    70.000000
Name: Math, dtype: float64
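
A small sketch of what happens when the filler covers only part of the index: matching is done by label, not by position, and labels the filler does not cover stay NaN. The values and the names `target` and `partial` are made up for illustration.

```python
import numpy as np
import pandas as pd

target = pd.Series([80, 60, np.nan, 76, np.nan, np.nan], name="Math")

# The filler only covers labels 2 and 4; label 5 has no replacement.
partial = pd.Series({2: 1.0, 4: 2.0})

result = target.fillna(partial)
print(result)
```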

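When a DataFrame is filled from another DataFrame, the two must line up along the index and the column names must also match; columns missing from the filler are left unfilled.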

In [32]:
D1 = pd.DataFrame({"Math": np.linspace(80, 90, num=A.shape[0])})
D1

,Math
0,80.0
1,81.428571
2,82.857143
3,84.285714
4,85.714286
5,87.142857
6,88.571429
7,90.0


In [33]:
A.fillna(D1)

,Class,Math,Chinese
0,1,80.0,70.0
1,2,60.0,
2,2,82.857143,45.0
3,3,76.0,76.0
4,1,85.714286,90.0
5,3,87.142857,70.0
6,1,90.0,
7,3,70.0,
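
A quick way to confirm which columns a DataFrame filler actually reached is to count the remaining NaN values per column. The sketch below rebuilds `A` and a Math-only filler like `D1`; only the Math column should end up complete, since the filler has no Chinese column.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({
    "Class": [1, 2, 2, 3, 1, 3, 1, 3],
    "Math": [80, 60, np.nan, 76, np.nan, np.nan, 90, 70],
    "Chinese": [70, np.nan, 45, 76, 90, 70, np.nan, np.nan],
})
D1 = pd.DataFrame({"Math": np.linspace(80, 90, num=A.shape[0])})

# Count missing values per column before and after the fill.
before = A.isna().sum()
after = A.fillna(D1).isna().sum()
print(pd.DataFrame({"before": before, "after": after}))
```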


In [34]:
D2 = pd.DataFrame({"Math": np.linspace(80, 90, num=A.shape[0]), "Chinese": np.linspace(40, 50, num=A.shape[0])})
D2

,Math,Chinese
0,80.0,40.0
1,81.428571,41.428571
2,82.857143,42.857143
3,84.285714,44.285714
4,85.714286,45.714286
5,87.142857,47.142857
6,88.571429,48.571429
7,90.0,50.0


In [35]:
A.fillna(D2)

,Class,Math,Chinese
0,1,80.0,70.0
1,2,60.0,41.428571
2,2,82.857143,45.0
3,3,76.0,76.0
4,1,85.714286,90.0
5,3,87.142857,70.0
6,1,90.0,48.571429
7,3,70.0,50.0


#### Aggregate and transform operations after groupby

Applying an aggregation function directly after groupby produces a reduced DataFrame with one row per group.

In [36]:
A.groupby('Class').mean()

Class,Math,Chinese
1,85.0,80.0
2,60.0,45.0
3,73.0,73.0


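transform, in contrast, returns a DataFrame with the same number of rows as the input (the grouping column itself is dropped); the value at each [index, column] position is the result of applying the function within that row's group.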

In [38]:
A.groupby('Class').transform('mean')

,Math,Chinese
0,85.0,80.0
1,60.0,45.0
2,60.0,45.0
3,73.0,73.0
4,85.0,80.0
5,73.0,73.0
6,85.0,80.0
7,73.0,73.0
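
One way to picture what `transform` does is to compute the per-group aggregate and then broadcast each group's value back to its rows. The sketch below does that by hand with `Series.map` for the Math column and compares the result with `transform`. It illustrates the idea only and is not a description of how pandas implements it internally; `class_means` and `broadcast` are illustrative names.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({
    "Class": [1, 2, 2, 3, 1, 3, 1, 3],
    "Math": [80, 60, np.nan, 76, np.nan, np.nan, 90, 70],
    "Chinese": [70, np.nan, 45, 76, 90, 70, np.nan, np.nan],
})

# Step 1: one mean per class (reduced result, indexed by Class).
class_means = A.groupby("Class")["Math"].mean()

# Step 2: look up each row's class mean, giving one value per original row.
broadcast = A["Class"].map(class_means)

# transform performs both steps in a single call.
transformed = A.groupby("Class")["Math"].transform("mean")

print(broadcast.tolist())
print(transformed.tolist())
```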


#### Using the transform result to fill missing values

In [39]:
A.fillna(A.groupby('Class').transform('mean'))

,Class,Math,Chinese
0,1,80.0,70.0
1,2,60.0,45.0
2,2,60.0,45.0
3,3,76.0,76.0
4,1,85.0,90.0
5,3,73.0,70.0
6,1,90.0,80.0
7,3,70.0,73.0
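
One caveat: if every value in a group is missing, the group mean is itself NaN, so that group stays unfilled. A possible fallback, shown here on a small made-up frame, is to fill with the group mean first and the overall mean second. The frame `df`, the column `Score`, and the choice of the overall mean as the fallback are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Class": [1, 1, 2, 2, 3],
    "Score": [80.0, np.nan, np.nan, np.nan, 60.0],
})

group_mean = df.groupby("Class")["Score"].transform("mean")

# First pass: per-class mean. Class 2 has no observed scores, so it stays NaN.
step1 = df["Score"].fillna(group_mean)

# Second pass: fall back to the mean of all observed scores.
step2 = step1.fillna(df["Score"].mean())
print(step2)
```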


In [None]:
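# Alternative: fill each group's NaN with that group's own mean (the result omits the Class column)
A.groupby('Class').transform(lambda x: x.fillna(x.mean()))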
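
The two styles, filling inside each group with a lambda and filling from a `transform` result, should give the same values for the score columns. A sketch of that check on the notebook's `A`, rebuilt here so the block is self-contained (`score_cols`, `per_group`, and `from_means` are illustrative names):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({
    "Class": [1, 2, 2, 3, 1, 3, 1, 3],
    "Math": [80, 60, np.nan, 76, np.nan, np.nan, 90, 70],
    "Chinese": [70, np.nan, 45, 76, 90, 70, np.nan, np.nan],
})
score_cols = ["Math", "Chinese"]

# Style 1: fill inside each group; x.mean() is called to get the group mean.
per_group = A.groupby("Class")[score_cols].transform(lambda x: x.fillna(x.mean()))

# Style 2: build the group means first, then fill from them.
from_means = A[score_cols].fillna(A.groupby("Class")[score_cols].transform("mean"))

# Compare the two results element by element.
print(np.allclose(per_group, from_means))
```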

In [41]:
A['Math'].fillna(A.groupby('Class')['Math'].transform('mean'))

0    80.0
1    60.0
2    60.0
3    76.0
4    85.0
5    73.0
6    90.0
7    70.0
Name: Math, dtype: float64
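
The pattern can be wrapped in a small helper. `fill_by_group` below is a hypothetical convenience function, not part of pandas; it assumes the statistic is given as a string name that `transform` accepts, such as `"mean"` or `"median"`.

```python
import numpy as np
import pandas as pd


def fill_by_group(df, by, columns, stat="mean"):
    """Return a copy of df with NaN in `columns` replaced by the per-group statistic."""
    out = df.copy()
    # One statistic per row, computed within that row's group.
    group_stats = df.groupby(by)[columns].transform(stat)
    out[columns] = df[columns].fillna(group_stats)
    return out


A = pd.DataFrame({
    "Class": [1, 2, 2, 3, 1, 3, 1, 3],
    "Math": [80, 60, np.nan, 76, np.nan, np.nan, 90, 70],
    "Chinese": [70, np.nan, 45, 76, 90, 70, np.nan, np.nan],
})

filled = fill_by_group(A, "Class", ["Math", "Chinese"], stat="mean")
print(filled)
```

Working on a copy leaves the original `A` untouched, which makes it easy to compare the frame before and after the fill.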