## Practice: Handling Missing Values
### Data source: https://www.kaggle.com/competitions/titanic/data
#### Reference for this exercise: https://www.analyticsvidhya.com/blog/2021/10/handling-missing-value/#What_Is_a_Missing_Value?

### 1. Load packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

### 2. Load the data and start practicing

In [2]:
row_TrainData = pd.read_csv(r'E:\DataLearn\4-Titanic\data\train.csv')
row_TestData = pd.read_csv(r'E:\DataLearn\4-Titanic\data\test.csv')

In [3]:
row_TrainData.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [6]:
missing_TrainData = row_TrainData.isnull().sum()
missing_TrainData 

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
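
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch, using a small made-up frame (`toy` and its values are hypothetical, not the Titanic data), of reporting the count and the percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

missing_count = toy.isnull().sum()
# The mean of a boolean mask is the fraction of True values.
missing_pct = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
summary = summary.sort_values("percent", ascending=False)
print(summary)
```

A column that is mostly empty, as Cabin is in the counts above, may be a better candidate for a missingness flag or for dropping than for value imputation.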

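### 3. Fill in missing values
Because the Age column is numeric, the cells below compute several summary statistics that could be used to fill in the missing ages.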

1. Compute the mean of Age with mean(), rounded to the nearest integer

In [11]:
mean_age_value = row_TrainData['Age'].mean().round().astype(int)
print('Mean of Age: ', mean_age_value)

Mean of Age:  30


2. Compute the median with median()

In [13]:
median_age = row_TrainData['Age'].median()
print('Median of Age: ', median_age)

Median of Age:  28.0


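3. Compute the mode with mode()

In [17]:
# mode() returns a Series because several values can tie; [0] takes the first one
mode_age = row_TrainData['Age'].mode()
print('Most frequent Age value (mode): ', mode_age[0])

Most frequent Age value (mode):  24.0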


4. Compute the standard deviation with std()

In [18]:
std_age = row_TrainData['Age'].std()
print('Standard deviation of Age: ', std_age)

Standard deviation of Age:  14.526497332334044


Example: fill the missing values with the median

In [24]:
# Make a copy of the dataset to demonstrate handling the missing Age values
fillna_median_tain = row_TrainData.copy()
fillna_median_tain.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [25]:
# Fill the missing values with the median, then confirm that none remain.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
fillna_median_tain['Age'] = fillna_median_tain['Age'].fillna(median_age)
fillna_median_tain['Age'].isnull().sum()

0
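
Filling every gap with a single value keeps the median in place but typically narrows the spread of the column, which is one reason to look at the standard deviation before and after. The following is a minimal sketch with made-up ages (`ages` is hypothetical, not the Titanic column):

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0])

# Fill the gaps with the median of the observed values.
filled = ages.fillna(ages.median())

print("median before/after:", ages.median(), filled.median())
print("std before:", ages.std())
print("std after: ", filled.std())
```

In this toy series the median is unchanged while the standard deviation drops, because the filled rows all sit at the centre of the distribution.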

### Other ways to fill in missing values

#### method = ffill: fill each missing value with the previous value

In [28]:
data = {'A': [1, 2, None, 4, None, 6, 7],
        'B': [None, 22, 23, None, 25, 26, None]}
df = pd.DataFrame(data)
df

Unnamed: 0,A,B
0,1.0,
1,2.0,22.0
2,,23.0
3,4.0,
4,,25.0
5,6.0,26.0
6,7.0,


In [29]:
# DataFrame.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
df_filled = df.ffill()
print(df_filled)

     A     B
0  1.0   NaN
1  2.0  22.0
2  2.0  23.0
3  4.0  23.0
4  4.0  25.0
5  6.0  26.0
6  7.0  26.0
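
Forward fill has nothing to copy into a gap in the first row, so `B` at index 0 stays `NaN` above. One common workaround, sketched here on the same toy data, is to chain a backward fill after the forward fill; whether borrowing from the following row is acceptable depends on the data, so treat this as an illustration rather than a recommendation:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4, None, 6, 7],
                   "B": [None, 22, 23, None, 25, 26, None]})

# Forward fill first, then backward fill whatever is still missing at the top.
df_filled = df.ffill().bfill()
print(df_filled)
```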


#### Fill missing values with interpolate, which defaults to linear interpolation

In [30]:
df_fillnull_Linear = df.interpolate()
df_fillnull_Linear

Unnamed: 0,A,B
0,1.0,
1,2.0,22.0
2,3.0,23.0
3,4.0,24.0
4,5.0,25.0
5,6.0,26.0
6,7.0,26.0
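
By default `interpolate` only fills forward, which is why the first value of `B` is still empty in the output above. The following is a minimal sketch, on the same toy data, of passing `limit_direction="both"` so that gaps at the start are filled as well (with the nearest valid value, since there is nothing to interpolate between):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4, None, 6, 7],
                   "B": [None, 22, 23, None, 25, 26, None]})

# Interior gaps are interpolated linearly; edge gaps take the nearest valid value.
df_both = df.interpolate(limit_direction="both")
print(df_both)
```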


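#### Handling categorical data

#### Using sklearn's SimpleImputer
Fill a categorical column with the category that occurs most often in the data. In the toy column below every value occurs exactly once (`banna` and `banana` are distinct strings), so there is no single most frequent category; in that case SimpleImputer returns the smallest of the tied values, which here is `Apple`.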

In [31]:
X = pd.DataFrame({'shape':['Apple', 'banana', 'banna', 'orange', np.nan]})
X

Unnamed: 0,shape
0,Apple
1,banana
2,banna
3,orange
4,


In [32]:
from sklearn.impute import SimpleImputer

In [33]:
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X)

array([['Apple'],
       ['banana'],
       ['banna'],
       ['orange'],
       ['Apple']], dtype=object)
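
The strategy is easier to see when one category clearly dominates. The following is a minimal sketch with a made-up column (`ports` and its values are hypothetical, loosely modelled on Embarked) that also wraps the result back into a DataFrame, since `fit_transform` returns a NumPy array:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up column in which 'S' is clearly the most frequent category.
ports = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S"]}, dtype=object)

imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(ports), columns=ports.columns)

print(imputer.statistics_)  # the fill value learned for each column
print(filled)
```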

#### Filling numeric values with sklearn's KNNImputer or IterativeImputer

In [42]:
IterativeImputerData = row_TrainData.copy()
cols = ['SibSp', 'Fare', 'Age']
X1data = IterativeImputerData[cols]
X1data.head()

Unnamed: 0,SibSp,Fare,Age
0,1,7.25,22.0
1,1,71.2833,38.0
2,0,7.925,26.0
3,1,53.1,35.0
4,0,8.05,35.0


#### Using IterativeImputer

In [43]:
# IterativeImputer is still experimental; this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [44]:
impute_it = IterativeImputer()
impute_it.fit_transform(X1data)

array([[ 1.        ,  7.25      , 22.        ],
       [ 1.        , 71.2833    , 38.        ],
       [ 0.        ,  7.925     , 26.        ],
       ...,
       [ 1.        , 23.45      , 26.82938751],
       [ 0.        , 30.        , 26.        ],
       [ 0.        ,  7.75      , 32.        ]])
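
The notebook loads a test set but only imputes the training data. The following is a minimal sketch, on made-up rows (`train`, `test`, and their values are hypothetical), of fitting the imputer on the training frame and reusing it on the test frame; `random_state=0` is an arbitrary choice added here to make repeated runs comparable:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical rows with the same three columns used above.
train = pd.DataFrame({"SibSp": [1, 1, 0, 1, 0, 0],
                      "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 8.46],
                      "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan]})
test = pd.DataFrame({"SibSp": [0, 1],
                     "Fare": [7.83, 7.00],
                     "Age": [34.5, np.nan]})

imputer = IterativeImputer(random_state=0)
# Learn the relationships between columns from the training data only.
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
# Apply the already fitted imputer to the test data without refitting.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(train_filled)
print(test_filled)
```

Observed values pass through unchanged; only the `NaN` cells are replaced by model estimates.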

#### KNNImputer

In [45]:
KnnImputerData = row_TrainData.copy()
cols = ['SibSp', 'Fare', 'Age']
X2data = KnnImputerData[cols]
X2data.head()

Unnamed: 0,SibSp,Fare,Age
0,1,7.25,22.0
1,1,71.2833,38.0
2,0,7.925,26.0
3,1,53.1,35.0
4,0,8.05,35.0


In [46]:
from sklearn.impute import KNNImputer
imputer_knn = KNNImputer(n_neighbors=2)
imputer_knn.fit_transform(X2data)

array([[ 1.    ,  7.25  , 22.    ],
       [ 1.    , 71.2833, 38.    ],
       [ 0.    ,  7.925 , 26.    ],
       ...,
       [ 1.    , 23.45  , 29.    ],
       [ 0.    , 30.    , 26.    ],
       [ 0.    ,  7.75  , 32.    ]])
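
To see what `KNNImputer` does with `n_neighbors=2`, it helps to use an array small enough to check by hand. The following is a minimal sketch on a made-up 4x3 array (not the Titanic data): each missing cell is replaced by the mean of that column over the two nearest rows that have a value there, with distances computed only over the features both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small made-up array with two missing cells.
X = np.array([[1, 2, np.nan],
              [3, 4, 3],
              [np.nan, 6, 5],
              [8, 8, 7]])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Here the gap in the first row becomes 4.0 (the mean of 3 and 5) and the gap in the third row becomes 5.5 (the mean of 3 and 8). Because neighbours are chosen by distance, a column with a much larger range, such as Fare next to SibSp, can dominate the choice, so scaling the features first is often suggested.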

### Using missingness as a feature

In [47]:
age = pd.DataFrame({'Age':[20, 30, 10, np.nan, 10]})
age

Unnamed: 0,Age
0,20.0
1,30.0
2,10.0
3,
4,10.0


In [49]:
from sklearn.impute import SimpleImputer
# impute the mean
imputer = SimpleImputer()
imputer.fit_transform(age)

array([[20. ],
       [30. ],
       [10. ],
       [17.5],
       [10. ]])

In [51]:
# add_indicator=True appends a 0/1 column marking the rows that were originally missing
imputer = SimpleImputer(add_indicator=True)
imputer.fit_transform(age)

array([[20. ,  0. ],
       [30. ,  0. ],
       [10. ,  0. ],
       [17.5,  1. ],
       [10. ,  0. ]])
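
The same result can be produced in pandas alone, which keeps the column names. The following is a minimal sketch on the same toy frame; the column name `Age_missing` is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

age = pd.DataFrame({"Age": [20, 30, 10, np.nan, 10]})

# Record which rows were missing before filling, otherwise the information is lost.
age["Age_missing"] = age["Age"].isna().astype(int)
# Fill the gap with the mean of the observed values.
age["Age"] = age["Age"].fillna(age["Age"].mean())

print(age)
```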