### Deal with missing values

### Six important ways to impute missing values
- You can impute missing values using simple statistics or machine learning models. This process is known as data imputation and is commonly used in data preprocessing to handle missing or incomplete data. There are several methods and models you can use, depending on the nature of your data and of the missing values.
- 1. `Simple imputation techniques:`
      + **Mean / Median imputation:** Replace missing values with the mean or median of the column. Suitable for numerical data.
      + **Mode imputation:** Replace missing values with the mode (most frequent value) of the column. Useful for categorical data.
- 2. `K-Nearest Neighbors (KNN):`
      + This algorithm can be used to impute missing values based on the similarity of rows.
- 3. `Regression imputation:`
      + Use a regression model to predict the missing values based on other variables in your dataset.
- 4. `Decision Trees and Random Forests:` Some implementations can handle missing values natively. They can also be used to predict missing values based on the patterns learned from the other data.
- 5. `Advanced Techniques:`
      + **Multiple Imputation by Chained Equations (MICE):** This is a more sophisticated technique that models each variable with missing values as a function of other variables in a round-robin fashion.
      + **Deep learning methods:** Neural networks, especially autoencoders, can be effective in imputing missing values in complex datasets.
- 6. `Time Series Specific Methods:` For time-series data, you might use techniques like interpolation, forward-fill, or backward-fill.
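The time-series methods in item 6 are not demonstrated elsewhere in this notebook, so here is a minimal sketch of them. The daily temperature series is made up for illustration; only the pandas methods `ffill`, `bfill`, and `interpolate` are real API.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps (illustrative data, not from the Titanic set)
idx = pd.date_range("2024-01-01", periods=6, freq="D")
temps = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0], index=idx)

forward = temps.ffill()                      # carry the last observed value forward
backward = temps.bfill()                     # pull the next observed value backward
linear = temps.interpolate(method="linear")  # straight line between known neighbours

comparison = pd.DataFrame(
    {"raw": temps, "ffill": forward, "bfill": backward, "linear": linear}
)
print(comparison)
```

Forward-fill and backward-fill repeat an observed value, while linear interpolation spreads the change evenly across the gap. Which one is appropriate depends on how the quantity is expected to behave between observations.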



# 1. Simple imputation Techniques
### 1.1. Mean/Median Imputation

Mean/median imputation replaces missing values with the mean or median of the column. It is simple and often effective, but it has limitations: it reduces the variance of the column, and it can lead to biased estimates if the values are not missing at random.
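The variance reduction mentioned above can be seen on a tiny example. This is a sketch on made-up numbers, not on the Titanic data: every filled-in value sits exactly at the mean, so it adds nothing to the spread while increasing the row count.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan, 50.0])

# Fill the gaps with the mean of the observed values (35.0)
filled = ages.fillna(ages.mean())

var_before = ages.var()    # sample variance over the 4 observed values
var_after = filled.var()   # sample variance over all 6 values after imputation

print("variance before imputation:", var_before)
print("variance after imputation: ", var_after)
```

The mean is unchanged by the fill, but the variance drops, which is why statistics that depend on spread (standard errors, correlations) can look more confident than they should after mean imputation.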

In [95]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# load the titanic dataset
data = sns.load_dataset('titanic')
data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [96]:
#  check the number of missing values in each column
data.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

We can see that the `age` column has 177 missing values and the `deck` column has 688. We first drop `deck`, since most of its values are missing, and then replace the missing `age` values with the mean of the column.

In [97]:
# drop the 'deck' column as it has too many missing values
data = data.drop('deck', axis=1)
# note: drop(..., inplace=True) returns None, so its result must not be assigned back to data


In [98]:
data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


In [99]:
# impute missing values in the 'age' column with the mean
data['age'] = data['age'].fillna(data['age'].mean())

# check the number of missing values in each column
data.isnull().sum().sort_values(ascending=False)

embarked       2
embark_town    2
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
class          0
who            0
adult_male     0
alive          0
alone          0
dtype: int64

We can see that the missing values in the age column have been replaced with the mean of the column.

### 1.1.1. Median Imputation

In [100]:
df = sns.load_dataset('titanic')
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [101]:
# drop the 'deck' column
df = df.drop('deck', axis=1)

In [102]:
#  impute missing values in the 'age' column with the median
df['age'] = df['age'].fillna(df['age'].median())

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

embarked       2
embark_town    2
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
class          0
who            0
adult_male     0
alive          0
alone          0
dtype: int64

### 1.2. Mode Imputation
Mode imputation replaces the missing values with the mode (most frequent value) of the column. This is useful for imputing categorical columns, such as `embarked` and `embark_town` in the Titanic dataset.

In [103]:
# impute missing values with the mode
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

We can see that the missing values in the `embark_town` column and `embarked` column have been replaced with the mode of the column.
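The median and mode fills shown so far can also be written with scikit-learn's `SimpleImputer`, which stores the fill value at `fit` time so the same value can later be applied to new data. The sketch below uses a small made-up frame; the column names merely mirror the Titanic ones.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one numeric and one categorical column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan, 30.0],
    "embarked": pd.Series(["S", "C", "S", np.nan, "Q", "S"], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")         # numeric column
cat_imputer = SimpleImputer(strategy="most_frequent")  # categorical column

# fit_transform returns a 2-D array, so flatten it back into a single column
toy["age"] = num_imputer.fit_transform(toy[["age"]]).ravel()
toy["embarked"] = cat_imputer.fit_transform(toy[["embarked"]]).ravel()

print(toy)
print("learned fill values:", num_imputer.statistics_, cat_imputer.statistics_)
```

Fitting on the training split and calling `transform` on the test split keeps the fill values from leaking information out of the test data.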

### 2. K-Nearest Neighbors (KNN)

KNN is a machine learning algorithm that can be used for imputing missing values. It works by finding the data points most similar to the one with the missing value, based on the other available features. The missing value is then imputed with the mean or median of those neighbors' values.
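To make the neighbor idea concrete, here is a small sketch on made-up data with two clearly separated groups. The frame and its values are assumptions for illustration; `KNNImputer` with its default uniform weights averages the neighbors' values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical passengers: three cheap tickets and two expensive ones
toy = pd.DataFrame({
    "age":  [22.0, 24.0, np.nan, 60.0, 62.0],
    "fare": [7.0, 8.0, 7.5, 80.0, 82.0],
})

# Similarity is measured on the features that are present (here: fare)
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputed)
```

The row with the missing age has a fare of 7.5, so its two nearest neighbors are the passengers with fares 7.0 and 8.0, and the imputed age is the average of their ages. Because distances depend on feature scale, scaling the columns first is usually worth considering on real data.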

In [104]:
# load the titanic dataset
df1 = sns.load_dataset('titanic')
# check the number of missing values in each column
df1.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [105]:
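# impute missing values with KNN imputer
from sklearn.impute import KNNImputer

# create the KNN imputer with number of neighbors=4
imputer = KNNImputer(n_neighbors=4)

# use several numeric columns so that neighbors are found from the other features;
# with 'age' alone there is no other feature to measure similarity on
knn_features = ['age', 'pclass', 'sibsp', 'parch', 'fare']
df1['age'] = imputer.fit_transform(df1[knn_features])[:, 0]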

# check the number of missing values in each column
df1.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

### 3. Regression Imputation

Regression imputation uses a regression model to predict the missing values based on other variables in the dataset. It applies directly to numerical columns; categorical columns need to be encoded first or predicted with a classifier.
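The idea can be sketched by hand with a plain linear regression: fit on the rows where the target is known, then predict it for the rows where it is missing. The data below is made up and deliberately lies on a straight line; the choice of `fare` as the only predictor is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data where age = 20 + 0.5 * fare for the known rows
toy = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "age":  [25.0, 30.0, np.nan, 40.0, np.nan],
})

known = toy[toy["age"].notna()]    # rows used to train the model
unknown = toy[toy["age"].isna()]   # rows whose age will be predicted

model = LinearRegression().fit(known[["fare"]], known["age"])
toy.loc[unknown.index, "age"] = model.predict(unknown[["fare"]])

print(toy)
```

Unlike a mean fill, each imputed value here depends on the row's own `fare`, so the relationship between the two columns is preserved. The imputed points fall exactly on the fitted line, which can overstate how strongly the columns are related.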

In [106]:
# load the dataset

df = sns.load_dataset('titanic')

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [107]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [108]:
# impute missing values with regression imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# create the IterativeImputer with max_iter=10
imputer = IterativeImputer(max_iter=10)

# regress 'age' on other numeric columns; a single column gives the model no predictors
reg_features = ['age', 'pclass', 'sibsp', 'parch', 'fare']
df['age'] = imputer.fit_transform(df[reg_features])[:, 0]

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

### 4. Random Forests for Imputing Missing Values

Some random forest implementations can handle missing values natively. A random forest can also be trained on the complete rows to predict missing values based on the patterns learned from the other columns, which is the approach used below.

In [109]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error   
from sklearn.impute import SimpleImputer

# 1. load the dataset
df2 = sns.load_dataset('titanic')

# 2. check the number of missing values in each column
df2.isnull().sum().sort_values(ascending=False)



deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

We will remove the deck column from the dataset because it has too many missing values.

In [110]:
# remove the deck column
df2.drop('deck', axis=1, inplace=True)

In [111]:
#  check the number of missing values in each column
df2.isnull().sum().sort_values(ascending=False)

age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [112]:
df2.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


In [113]:
# encode the data using label encoding
from sklearn.preprocessing import LabelEncoder

# columns to encode
cols_to_encode = ['sex', 'embarked', 'embark_town','class','who', 'alive']

# Dictionary to store label encoders
label_encoders = {}

# loop to apply label encoding to each column
for column in cols_to_encode:
    # create a new label encoder for the column
    le = LabelEncoder()
    
    # fit the encoder and replace the column with its integer codes
    df2[column] = le.fit_transform(df2[column])
    
    # store the label encoder in the dictionary
    label_encoders[column] = le
    
# check the first few rows of the encoded dataframe
df2.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


We first have to impute the missing values in the `age` column before `age` can be used to predict the missing values in the `embarked` and `embark_town` columns.

In [114]:
# split the dataset into two parts: one with missing values and one without
df_missing = df2[df2['age'].isna()]

# dropna removes all rows with any missing values
df_no_missing = df2.dropna()


In [115]:
print("The shape of the original dataset is:", df2.shape)
print("The shape of the dataset with missing values is:", df_missing.shape)
print("The shape of the dataset without missing values is:", df_no_missing.shape)


The shape of the original dataset is: (891, 14)
The shape of the dataset with missing values is: (177, 14)
The shape of the dataset without missing values is: (714, 14)


In [122]:
df_missing.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,,0,0,13.0,2,1,1,True,2,1,True
19,1,3,0,,0,0,7.225,0,2,2,False,0,1,True
26,0,3,1,,0,0,7.225,0,2,1,True,0,0,True
28,1,3,0,,0,0,7.8792,1,2,2,False,1,1,True


In [123]:
df_no_missing.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


In [125]:
# check the columns names
print("Columns in the dataset:", df2.columns)

Columns in the dataset: Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')


In [128]:
# Regression imputation using Random Forest

# split the data into X and Y, using only the rows with no missing values
X = df_no_missing.drop(['age'], axis=1)
Y = df_no_missing['age']

# split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)


# Random Forest Imputation
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, Y_train)

# evaluate the model
Y_pred = rf_model.predict(X_test)
print("RMSE for Random Forest Imputation:", np.sqrt(mean_squared_error(Y_test, Y_pred)))
print("R2 Score for Random Forest Imputation:", r2_score(Y_test, Y_pred))
print("MAE for Random Forest Imputation:", mean_absolute_error(Y_test, Y_pred))
print("MAPE for Random Forest Imputation:", mean_absolute_percentage_error(Y_test, Y_pred))

RMSE for Random Forest Imputation: 11.081260589808045
R2 Score for Random Forest Imputation: 0.33769388288226154
MAE for Random Forest Imputation: 8.666661815622195
MAPE for Random Forest Imputation: 0.40839466096086574
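
Error metrics like these are easier to judge next to a naive reference. The sketch below compares a random forest with a model that always predicts the training mean. It runs on synthetic stand-in data (an assumption, not the Titanic columns), so its numbers say nothing about the results above; only the comparison pattern carries over.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the target depends linearly on the first feature plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_train, y_train)

baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
forest_rmse = np.sqrt(mean_squared_error(y_test, forest.predict(X_test)))

print("RMSE, mean baseline:", baseline_rmse)
print("RMSE, random forest:", forest_rmse)
```

If the forest's RMSE is only slightly below the baseline's, model-based imputation adds little over a plain mean fill for that column.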


In [132]:
# predict the missing values
Y_pred = rf_model.predict(df_missing.drop(['age'], axis=1))
Y_pred

array([32.97658333, 35.64221825, 18.347     , 35.57148611, 20.65142857,
       26.7619855 , 36.648     , 18.63142857, 21.80633333, 33.55618169,
       31.06587652, 35.90741667, 18.63142857, 24.824     , 31.03      ,
       39.405     , 25.849     , 26.7619855 , 31.06587652, 19.41142857,
       31.06587652, 31.06587652, 26.7619855 , 26.27095821, 29.23514286,
       31.06587652, 48.25650595, 27.94      , 31.87071429, 31.99628481,
       30.015     , 20.85816667, 33.755     , 60.19168831, 26.00185714,
       26.24316667, 28.91733333, 49.31      , 28.55277778, 48.25650595,
       18.63142857, 20.85816667, 33.78929167, 26.7619855 , 26.63      ,
       32.01066667, 28.22883333, 28.55277778, 31.99628481, 29.72904762,
       48.25650595, 27.67733333, 56.26333333, 18.63142857, 34.65645944,
       60.44168831, 39.405     , 35.7725    , 18.63142857, 24.78266667,
       34.305     , 31.06587652, 31.602     , 20.85816667, 25.296     ,
       36.97133333, 26.7619855 , 24.85777778, 55.52      , 35.57

In [130]:
# replace the missing values in the 'age' column with the predicted values;
# work on an explicit copy so that pandas does not warn about assigning to a slice
df_missing = df_missing.copy()
df_missing['age'] = Y_pred

# check the missing values
df_missing.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [131]:
# concatenate the two dataframes and restore the original row order
df_complete = pd.concat([df_no_missing, df_missing], axis=0).sort_index()

# print the shape of the complete dataset
print("The shape of the complete dataset is:", df_complete.shape)

# check the first few rows of the complete dataset
df_complete.head()

The shape of the complete dataset is: (891, 14)


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


In [133]:
for column in cols_to_encode:
    # retrieve the label encoder fitted for this column
    le = label_encoders[column]

    # inverse transform the encoded values to get the original values
    df_complete[column] = le.inverse_transform(df_complete[column])
    
# check the first few rows of the complete dataset with original values
df_complete.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


In [134]:
# print the shape of complete dataframe
print("The shape of the complete dataframe is:", df_complete.shape)

The shape of the complete dataframe is: (891, 14)


In [135]:
# check the number of missing values in each column
df_complete.isnull().sum().sort_values(ascending=False)

embarked       2
embark_town    2
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
class          0
who            0
adult_male     0
alive          0
alone          0
dtype: int64

In [136]:
# save the complete dataframe to a CSV file
df_complete.to_csv('titanic_complete.csv', index=False)
print("Complete dataset saved to 'titanic_complete.csv'")

Complete dataset saved to 'titanic_complete.csv'


## 5. Advanced Techniques
### 5.1 Multiple Imputation by Chained Equations (MICE)

Multiple Imputation by Chained Equations (MICE) is a more sophisticated technique that models each feature with missing values as a function of the other features and uses that estimate for imputation. It does this in a round-robin fashion: each feature is modeled in turn. With a suitable encoding it can be applied to both categorical and numerical data.
To demonstrate MICE in Python, we can use the `IterativeImputer` class from the `sklearn.impute` module, which implements a MICE-inspired procedure and returns a single imputation by default.

In [116]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# load the titanic dataset
df2 = sns.load_dataset('titanic')
df2.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [117]:
# check the number of missing values in each column
df2.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [118]:
from sklearn.preprocessing import LabelEncoder

# create a LabelEncoder for each categorical column inside a for loop
# columns to encode
columns_to_encode = ['sex', 'embarked', 'embark_town','class','who','deck','alive']

# Dictionary to store the label encoders for each column
label_encoders = {}

# loop to apply label encoding to each column
for column in columns_to_encode:
    # create a new LabelEncoder object for the column
    le = LabelEncoder()
    # fit and transform the data
    df2[column] = le.fit_transform(df2[column])
    # store the label encoder in the dictionary
    label_encoders[column] = le
    
df2.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,7,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,2,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,7,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,7,2,0,True


In [119]:
# impute missing values using IterativeImputer
# create the IterativeImputer with max_iter=6
imputer = IterativeImputer(max_iter=6)

# columns to impute
columns_to_impute = ['age', 'embarked', 'deck', 'embark_town']

# impute the columns jointly so that each one is modeled from the others
df2[columns_to_impute] = imputer.fit_transform(df2[columns_to_impute])
    
# check the number of missing values in each column
df2.isnull().sum().sort_values(ascending=False)


survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

In [120]:
df2.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2.0,2,1,True,7.0,2.0,0,False
1,1,1,0,38.0,1,0,71.2833,0.0,0,2,False,2.0,0.0,1,False
2,1,3,0,26.0,0,0,7.925,2.0,2,2,False,7.0,2.0,1,True
3,1,1,0,35.0,1,0,53.1,2.0,0,2,False,2.0,2.0,1,False
4,0,3,1,35.0,0,0,8.05,2.0,2,1,True,7.0,2.0,0,True


In [121]:
# Inverse transform the encoded columns

for column in columns_to_encode:
    # Retrieve the corresponding label encoder for the column
    le = label_encoders[column]
    # Cast the codes to integers, then inverse transform them to the original labels
    df2[column] = le.inverse_transform(df2[column].astype(int))
    
df2.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
