### Data Imputation and Pre-processing example
We'll be implementing:
- Deleting the incomplete features
- Deleting the incomplete instances
- Mean imputation using pandas
- Interpolation-based imputation using pandas
- Simple imputation using sklearn
- KNN-based imputation using sklearn

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [3]:
#load and explore the data
#use the titanic data

titanic_data = pd.read_csv("https://www.openml.org/data/get_csv/16826755/phpMYEkMl.csv", na_values=['?']) 

titanic_data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


### Missing Values
There are many ways to represent missing values, both in the dataset file and in pandas. In the file, missing values might be blank entries, '?', or any other marker that the data collectors agreed on to represent unobserved data. In this case it is '?'.

Pandas uses NaN as its default missing value marker, so we pass na_values=['?'] to read_csv to convert every '?' to NaN. This lets us detect missing values in the same way across columns of different data types.
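As a small illustration of why the marker matters, the sketch below reads the same text twice, once without and once with `na_values`. The three-row CSV is made up for this example and is not taken from the Titanic file.

```python
import io
import pandas as pd

# Hypothetical three-row CSV in which '?' marks an unobserved value.
raw_csv = "age,cabin\n29,B5\n?,C22\n30,?\n"

# Without na_values, '?' is read as an ordinary string, so nothing is flagged as missing.
as_text = pd.read_csv(io.StringIO(raw_csv))

# With na_values, every '?' becomes NaN, which isna()/isnull() can detect.
as_nan = pd.read_csv(io.StringIO(raw_csv), na_values=["?"])

print(as_text.isna().sum().to_dict())
print(as_nan.isna().sum().to_dict())
print(as_nan.dtypes.to_dict())
```

Note that the `age` column can only be parsed as a numeric column once the '?' entries are treated as missing.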

In [4]:
#there are missing values, we can drop some features that we will not consider here.
titanic_data.drop(['name','ticket', 'embarked', 'boat' ,'body' ,'home.dest'], axis=1, inplace=True)

In [5]:
from sklearn.model_selection import train_test_split

y=titanic_data['survived']
X=titanic_data.drop(['survived'], axis=1)
X_titanic_train, X_titanic_test, y_titanic_train, y_titanic_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [6]:
#Now if we perform classification it might not work for most classifiers
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
#classifier=SVC()
classifier.fit(X_titanic_train, y_titanic_train)

ValueError: could not convert string to float: 'male'

Some features contain string values, like "sex" and "cabin". We can encode these features.

In [7]:
# Encode the categorical features. Note that astype(str) below turns the missing cabin values into the category 'nan'
from sklearn.preprocessing import OrdinalEncoder

X_titanic_train_encoded=X_titanic_train.copy()
encoder_gender = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value=np.nan)
X_titanic_train_encoded['sex'] = encoder_gender.fit_transform(X_titanic_train_encoded['sex'].values.reshape(-1, 1))

#Now let's encode the incomplete cabin feature
encoder_cabin = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value=np.nan) #You can use the same encoder for both, but we use two for the sake of clarification
X_titanic_train_encoded['cabin'] = encoder_cabin.fit_transform(X_titanic_train_encoded['cabin'].values.reshape(-1, 1).astype(str))
# #Get the code of the "nan" value for the cabin categorical feature
# cabin_nan_code=encoder_cabin.transform([['nan']])[0][0]
# print(cabin_nan_code)
# #Now, restore the nan values as missing in the encoded data
# X_titanic_train_encoded['cabin'] = X_titanic_train_encoded['cabin'].replace(cabin_nan_code,np.nan)

In [8]:
#now let's see the result for the encoded data.
X_titanic_train_encoded.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin
1214,3,1.0,,0,0,8.6625,146.0
677,3,1.0,26.0,0,0,7.8958,146.0
534,2,0.0,19.0,0,0,26.0,146.0
1174,3,0.0,,8,2,69.55,146.0
864,3,0.0,28.0,0,0,7.775,146.0
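The commented-out lines in the encoding cell hint at how the missing cabin values could be restored after encoding. The following is a minimal sketch of that idea on a made-up cabin column (the five values are hypothetical); it is one possible way to do it, not necessarily what the notebook intended.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical cabin column with two unobserved entries.
cabin = np.array(["B5", np.nan, "C22 C26", np.nan, "B5"], dtype=object).reshape(-1, 1)

encoder = OrdinalEncoder()
# astype(str) turns NaN into the literal string 'nan', which is encoded like any other category.
codes = encoder.fit_transform(cabin.astype(str))

# Look up the code assigned to the 'nan' category...
nan_code = encoder.transform([["nan"]])[0][0]
# ...and turn that code back into a real missing value.
restored = np.where(codes == nan_code, np.nan, codes)

print(codes.ravel())
print(restored.ravel())
```

With the missing values restored, `isnull().sum()` would count them again, and the imputers used later would treat them as values to estimate rather than as a category.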


Next, we need to handle missing values before performing classification. Let's show the number of missing values in each feature of the encoded training data.

In [9]:
print("The number of missing values ")
print(X_titanic_train_encoded.isnull().sum())

The number of missing values 
pclass      0
sex         0
age       187
sibsp       0
parch       0
fare        1
cabin       0
dtype: int64


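Age and Fare are incomplete. Cabin is also incomplete in the raw data, but astype(str) turned its missing values into the string 'nan', which was then encoded as an ordinary category (most likely the code 146.0 that dominates the output above), so it reports 0 missing values. We can try deleting the incomplete features.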

In [10]:
X_titanic_train_complete=X_titanic_train_encoded.copy()
X_titanic_train_complete.dropna(axis=1, inplace=True)
X_titanic_train_complete

Unnamed: 0,pclass,sex,sibsp,parch,cabin
1214,3,1.0,0,0,146.0
677,3,1.0,0,0,146.0
534,2,0.0,0,0,146.0
1174,3,0.0,8,2,146.0
864,3,0.0,0,0,146.0
...,...,...,...,...,...
1095,3,0.0,0,0,146.0
1130,3,0.0,0,0,146.0
1294,3,1.0,0,0,146.0
860,3,0.0,0,0,146.0


In [11]:
#Check the number of missing values
print(X_titanic_train_complete.isnull().sum())

pclass    0
sex       0
sibsp     0
parch     0
cabin     0
dtype: int64


In [12]:
#now delete the incomplete instances
X_titanic_train_complete=X_titanic_train_encoded.copy()
X_titanic_train_complete.dropna(axis=0, inplace=True)
#The difference is axis=0 instead of 1
X_titanic_train_complete

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin
677,3,1.0,26.0,0,0,7.8958,146.0
534,2,0.0,19.0,0,0,26.0000,146.0
864,3,0.0,28.0,0,0,7.7750,146.0
895,3,0.0,1.0,1,1,11.1333,146.0
745,3,0.0,30.0,0,0,6.9500,146.0
...,...,...,...,...,...,...,...
466,2,1.0,34.0,1,0,26.0000,146.0
1130,3,0.0,18.0,0,0,7.7750,146.0
1294,3,1.0,28.5,0,0,16.1000,146.0
860,3,0.0,26.0,0,0,7.9250,146.0


Another important point for the instance deletion approach is that the target values (in y_titanic_train) that correspond to the deleted instances must be removed as well, so that the features and the targets stay aligned.
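One way to keep the targets aligned is to select them by the index of the rows that survived `dropna`. The sketch below uses small made-up stand-ins for `X_titanic_train_encoded` and `y_titanic_train`; the names `X_train` and `y_train` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the encoded training features and their targets.
X_train = pd.DataFrame(
    {"age": [26.0, np.nan, 19.0, 28.0], "fare": [7.9, 8.7, np.nan, 7.8]},
    index=[677, 1214, 534, 864],
)
y_train = pd.Series([0, 1, 1, 0], index=X_train.index, name="survived")

# Delete the incomplete instances...
X_complete = X_train.dropna(axis=0)
# ...and keep only the targets whose index labels are still present.
y_complete = y_train.loc[X_complete.index]

print(X_complete.index.tolist())
print(y_complete.tolist())
```

This relies on the features and targets sharing the same index, which is the case for the output of `train_test_split` on a DataFrame and a Series.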

In [13]:
#Check the number of missing values
print(X_titanic_train_complete.isnull().sum())

pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64


The deletion approach has several drawbacks. It reduces the available data, which limits the learning ability, especially when there are many missing values.

Let's try imputation with pandas.

In [14]:
#Mean for numeric values
X_titanic_data_complete=X_titanic_train_encoded.copy()
X_titanic_data_complete['age']=X_titanic_data_complete['age'].fillna(X_titanic_data_complete['age'].mean())
X_titanic_data_complete['fare']=X_titanic_data_complete['fare'].fillna(X_titanic_data_complete['fare'].mean())
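#cabin has no NaN left after encoding, so the next line changes nothing here (a mean is also not meaningful for category codes)
X_titanic_data_complete['cabin']=X_titanic_data_complete['cabin'].fillna(X_titanic_data_complete['cabin'].mean())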
# Show the number of missing values
print(X_titanic_data_complete.isnull().sum())

pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64


In [15]:
X_titanic_data_complete.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin
1214,3,1.0,29.102309,0,0,8.6625,146.0
677,3,1.0,26.0,0,0,7.8958,146.0
534,2,0.0,19.0,0,0,26.0,146.0
1174,3,0.0,29.102309,8,2,69.55,146.0
864,3,0.0,28.0,0,0,7.775,146.0
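When mean imputation is done by hand with pandas, the means computed on the training data are typically reused for the test data, in the same way that a fitted sklearn imputer is. A minimal sketch on made-up numbers (the `train` and `test` frames are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical training and test features with missing entries.
train = pd.DataFrame({"age": [20.0, np.nan, 40.0], "fare": [10.0, 30.0, np.nan]})
test = pd.DataFrame({"age": [np.nan, 50.0], "fare": [np.nan, 5.0]})

# Column means are learned from the training data only (NaN is skipped by default).
train_means = train.mean()

# fillna with a Series fills each column with the value stored under its name.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_means.to_dict())
print(test_filled)
```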


Pandas can also impute by interpolation, which fills each missing value from the neighbouring rows of the same column.

In [16]:
X_titanic_data_complete = X_titanic_train_encoded.copy()
#interpolate() returns a DataFrame; limit_direction='both' also fills missing values in the first rows
X_titanic_data_complete = X_titanic_data_complete.interpolate(limit_direction='both')
print(X_titanic_data_complete.isna().sum())

pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64
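A detail worth knowing about `interpolate` is that, with its default settings, a missing value at the very start of a column has no earlier value to interpolate from and is left as NaN. The sketch below shows this on a made-up `ages` series (hypothetical values):

```python
import numpy as np
import pandas as pd

# Hypothetical age column with missing values at the start, in the middle and at the end.
ages = pd.Series([np.nan, 26.0, np.nan, 30.0, np.nan])

# Default linear interpolation fills forward only, so the leading NaN remains.
forward_only = ages.interpolate()

# limit_direction='both' also fills the leading NaN from the first observed value.
both_ways = ages.interpolate(limit_direction="both")

print(forward_only.tolist())
print(both_ways.tolist())
```

Linear interpolation also assumes that neighbouring rows are related. After a random train/test split the row order carries no meaning, so the interpolated values here are closer to an arbitrary fill than to a real estimate.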


We can also perform imputation using sklearn

In [17]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()

X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)

#The output is 'numpy.ndarray' so we convert it to dataframe for consistency
X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())

The number of missing values :
 pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64


In [18]:
X_titanic_train_encoded

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin
1214,3,1.0,,0,0,8.6625,146.0
677,3,1.0,26.0,0,0,7.8958,146.0
534,2,0.0,19.0,0,0,26.0000,146.0
1174,3,0.0,,8,2,69.5500,146.0
864,3,0.0,28.0,0,0,7.7750,146.0
...,...,...,...,...,...,...,...
1095,3,0.0,,0,0,7.6292,146.0
1130,3,0.0,18.0,0,0,7.7750,146.0
1294,3,1.0,28.5,0,0,16.1000,146.0
860,3,0.0,26.0,0,0,7.9250,146.0


The default strategy of sklearn's SimpleImputer is "mean"; you can change it using the strategy parameter.

In [19]:
imputer = SimpleImputer(strategy="median")
X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)
#The output is 'numpy.ndarray' so we convert it to dataframe for consistency
X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())

The number of missing values :
 pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64
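To see how the choice of strategy changes the filled-in values, the sketch below runs three SimpleImputer strategies on a small made-up array (the numbers are hypothetical). "most_frequent" is included because it is usually the more sensible choice for encoded categorical columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: one missing value in each column.
X = np.array([
    [1.0, 7.0],
    [np.nan, 7.0],
    [3.0, np.nan],
    [3.0, 2.0],
    [9.0, 8.0],
])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    filled = SimpleImputer(strategy=strategy).fit_transform(X)
    # Record the value that replaced the NaN in column 0 and in column 1.
    results[strategy] = (float(filled[1, 0]), float(filled[2, 1]))

print(results)
```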


We can also impute with the KNNImputer, which estimates each missing value from the most similar instances in the training data.

In [20]:
from sklearn.impute import KNNImputer
imputer = KNNImputer()
X_titanic_train_complete = imputer.fit_transform(X_titanic_train_encoded)
#The output is 'numpy.ndarray' so we convert it to dataframe for consistency
X_titanic_train_complete=pd.DataFrame(X_titanic_train_complete, columns=X_titanic_train_encoded.columns)
print("The number of missing values :\n", X_titanic_train_complete.isnull().sum())

The number of missing values :
 pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64
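To make the behaviour of KNNImputer concrete, here is a small sketch on a made-up 4x3 array, with `n_neighbors=2` chosen for illustration (the notebook above uses the default settings). Each missing entry is replaced by the average of that feature over the two nearest rows that have it observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data with one missing value in the first and in the third row.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Distances are computed on the features that both rows have observed.
knn_imputer = KNNImputer(n_neighbors=2)
X_filled = knn_imputer.fit_transform(X)

print(X_filled)
```

Because the estimate depends on distances between rows, features on very different scales can dominate the neighbour search, so scaling the features first is often worth considering.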


In [21]:
#Let's apply the encoders to the test data
#The learnt encoder_gender should be used to encode the test data, NOTE there is NO fit here, just transform
X_titanic_test_encoded=X_titanic_test.copy()
X_titanic_test_encoded['sex'] = encoder_gender.transform(X_titanic_test_encoded['sex'].values.reshape(-1, 1))

#The learnt encoder_cabin should be used to encode the test data, NOTE there is NO fit here, just transform
X_titanic_test_encoded['cabin'] = encoder_cabin.transform(X_titanic_test_encoded['cabin'].values.reshape(-1, 1).astype(str))

In [22]:
#now, use the learned imputer to estimate the missing values in the test data.
print("The number of missing values in the test data before imputation :\n", X_titanic_test_encoded.isnull().sum())
X_titanic_test_complete = imputer.transform(X_titanic_test_encoded)
X_titanic_test_complete=pd.DataFrame(X_titanic_test_complete, columns=X_titanic_test_encoded.columns)
print("The number of missing values in the test data after imputation :\n", X_titanic_test_complete.isnull().sum())

The number of missing values in the test data before imputation :
 pclass     0
sex        0
age       76
sibsp      0
parch      0
fare       0
cabin     47
dtype: int64
The number of missing values in the test data after imputation :
 pclass    0
sex       0
age       0
sibsp     0
parch     0
fare      0
cabin     0
dtype: int64
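The test data reports missing cabin values before imputation even though the training data reported none. A likely explanation is that these are cabin values that never occurred in the training data: with `handle_unknown='use_encoded_value'` and `unknown_value=np.nan`, the encoder maps unseen categories to NaN. The sketch below reproduces this on made-up cabin values (hypothetical, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical cabin values; 'nan' is the string that astype(str) produces for missing entries.
train_cabins = np.array([["B5"], ["C22 C26"], ["nan"]], dtype=object)
test_cabins = np.array([["B5"], ["E12"], ["nan"]], dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
encoder.fit(train_cabins)

# 'E12' was not seen during fit, so it is encoded as NaN;
# 'nan' was seen during fit, so it keeps its ordinary category code.
encoded_test = encoder.transform(test_cabins)

print(encoded_test.ravel())
```

Under this reading, the imputer then fills in the unseen cabins in the test data, while the cabins that were missing in the first place remain encoded as the 'nan' category.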


In [23]:
#we can perform classification using the imputed complete data
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=0)
#classifier=SVC()
classifier.fit(X_titanic_train_complete, y_titanic_train)
#f1_score expects the true labels first and the predictions second
print("F1 score after imputation = ", f1_score(y_titanic_test, classifier.predict(X_titanic_test_complete)))

F1 score after imputation =  0.7261146496815286
