# Imputing Missing Values in Python

## What are missing values?

- Missing values are entries that are absent from a dataset.
- They are typically represented by `NaN`, `NA`, `null`, or `None`.
- They can arise for many reasons, such as data corruption, data entry errors, or information that was never collected.
- Missing values can be handled by:
  - removing the rows or columns that contain them,
  - imputing (filling in) the missing values, or
  - using algorithms that can handle missing values directly.
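As a minimal sketch of the first option, the cell below counts missing values and then drops rows or columns with `pandas`. The `toy` dataframe and its values are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: pandas treats both np.nan and None as missing
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "city": ["Paris", None, "Rome", "Oslo"],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "pclass": [3, 1, 1, 3],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
rows_dropped = toy.dropna(axis=0)

# Drop every column that contains at least one missing value
cols_dropped = toy.dropna(axis=1)

print(rows_dropped.shape, cols_dropped.shape)
```

Dropping is the simplest choice, but here it discards half of the rows or three of the four columns, which is why imputation is often preferred.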

In this notebook, I will show you how to handle missing values using the `pandas` and `scikit-learn` libraries.

But before you start, you need to know how to detect missing values in your dataset and this blog post will help you with that:




## Six Important Ways for Imputing Missing Values

You can impute missing values with simple statistics or with machine learning models. This process is known as **data imputation** and is commonly used in data preprocessing to handle missing or incomplete data. There are several methods you can use, depending on the nature of your data and of the missing values:

---

### 1. Simple Imputation Techniques

- **Mean / Median Imputation**  
  Replace missing values with the **mean** or **median** of the column.  
  Suitable for **numerical** data.

- **Mode Imputation**  
  Replace missing values with the **mode** (most frequent value) of the column.  
  Useful for **categorical** data.

- **K-Nearest Neighbors (KNN) Imputation**  
  Use KNN to impute missing values based on the **similarity of rows**.
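The mean, median, and mode replacements listed above can be sketched on a small example. The `toy` dataframe is hypothetical; `SimpleImputer` is one of several ways to do this, and plain `fillna` works as well.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one numerical and one categorical column
toy = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 70.0],
    "port": ["S", "C", "S", np.nan],
})

# Numerical column: fill with the mean or the median of the observed values
age_mean = SimpleImputer(strategy="mean").fit_transform(toy[["age"]]).ravel()
age_median = SimpleImputer(strategy="median").fit_transform(toy[["age"]]).ravel()

# Categorical column: fill with the most frequent value (the mode)
port_filled = toy["port"].fillna(toy["port"].mode()[0])

print(age_mean)      # the gap becomes 40.0, the mean of 20, 30, and 70
print(age_median)    # the gap becomes 30.0, the median of 20, 30, and 70
print(port_filled.tolist())
```

The single outlier (70) pulls the mean well above the median, which is one reason the median is often preferred for skewed columns.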

---

### 2. Regression Imputation

- Use a **regression model** to predict the missing values based on other variables in your dataset.
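A minimal sketch of regression imputation, assuming a hypothetical two-column dataset in which `y` depends linearly on `x`. A linear model is fitted on the rows where `y` is observed and then used to predict the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: the observed rows follow y = 2 * x + 1 exactly
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, np.nan, 9.0, np.nan],
})

# Fit the model only on rows where the target is observed
observed = toy["y"].notna()
model = LinearRegression().fit(toy.loc[observed, ["x"]], toy.loc[observed, "y"])

# Predict the target for the rows where it is missing and write it back
toy.loc[~observed, "y"] = model.predict(toy.loc[~observed, ["x"]])
print(toy)
```

Because the imputed points lie exactly on the fitted line, this approach tends to overstate how strongly the columns are related.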

---

### 3. Decision Trees and Random Forests

- Some tree-based implementations can **handle missing values natively**, although support depends on the library and its version.  
- They can also be used to **predict missing values** based on the patterns learned from the other data.

---

### 4. Multiple Imputation by Chained Equations (MICE)

- A more sophisticated technique that models each variable with missing values as a function of other variables in a **round-robin fashion**.
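scikit-learn's `IterativeImputer` follows this round-robin idea. The sketch below runs it on a hypothetical array in which the second column is twice the first. Note that by default it returns a single imputed dataset, whereas full MICE draws several.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to import IterativeImputer)
from sklearn.impute import IterativeImputer

# Hypothetical toy data: the second column is twice the first
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])

# Each column with gaps is regressed on the other column, repeatedly
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

To approximate multiple imputation, one option is to set `sample_posterior=True` and repeat the fit with different `random_state` values.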

---

### 5. Deep Learning Methods

- **Neural networks**, especially **autoencoders**, can be effective in imputing missing values in **complex datasets**.

---

### 6. Time Series Specific Methods

For **time-series data**, you might use:

- `Interpolation`  
- `Forward-fill`  
- `Backward-fill`
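The three time-series methods map directly onto `pandas` calls. The daily series below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a two-day gap
ts = pd.Series(
    [10.0, np.nan, np.nan, 16.0, 18.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Linear interpolation draws a straight line across the gap
interpolated = ts.interpolate(method="linear")

# Forward-fill repeats the last observed value
forward = ts.ffill()

# Backward-fill uses the next observed value
backward = ts.bfill()

print(pd.DataFrame({"interpolated": interpolated, "ffill": forward, "bfill": backward}))
```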

---

## ⚠️ Choosing the Right Method

When selecting an imputation method, consider:

- The **type of data** (numerical, categorical, time-series)
- The **pattern of missingness** (e.g., MCAR: missing completely at random, MAR: missing at random, NMAR: not missing at random)
- The **amount of missing data**

> ⚠️ **Note:** Imputation can introduce **bias** or affect the **distribution** of your data.  
> Always apply it with **caution** and proper understanding of its impact.
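To judge the amount of missing data, a quick first step is the share of missing values per column. The sketch below uses a hypothetical `toy` dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missingness per column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() gives a boolean frame; its column means are the missing shares
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty, like `deck` here, is often a candidate for dropping rather than imputing.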


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [19]:
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## **1. Handling missing values by fillna or dropping columns**

### **1.1. Mean/Median Imputation**
 
Mean/median imputation replaces missing values with the mean or median of the column. It is simple and often effective, but it has limitations: it reduces the variance of the column, and it can bias estimates when the values are not missing completely at random.
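The variance reduction can be seen on synthetic data. The sketch below draws a hypothetical normal sample, removes 20% of the values at random, and compares the standard deviation before and after mean imputation.

```python
import numpy as np
import pandas as pd

# Synthetic column with 1000 values
rng = np.random.default_rng(42)
full = pd.Series(rng.normal(loc=30, scale=14, size=1000))

# Remove 200 values completely at random
with_gaps = full.copy()
gaps = rng.choice(1000, size=200, replace=False)
with_gaps.iloc[gaps] = np.nan

# Fill the gaps with the mean of the observed values
imputed = with_gaps.fillna(with_gaps.mean())

# The mean is preserved, but the spread shrinks
print("std of observed values:", with_gaps.std())
print("std after mean imputation:", imputed.std())
```

Every imputed value sits exactly at the mean, so the filled column is less spread out than the observed data.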

Let's see how to implement mean/median imputation in Python using the Titanic dataset.

In [20]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [21]:
df.dtypes

survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

In [22]:
df.drop(columns=["deck"], inplace=True)

In [23]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
embark_town      2
alive            0
alone            0
dtype: int64

In [24]:
df['age'] = df['age'].fillna(df['age'].mean())

In [27]:
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
df['embarked'   ] = df['embarked'].fillna(df['embarked'].mode()[0])

In [28]:
df.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

## **2. Handling missing values by KNN Imputer**

KNN is a machine learning algorithm that can be used for imputing missing values. It works by finding the most similar data points to the one with the missing value based on other available features. The missing value is then imputed with the mean or median of the most similar data points.
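A small sketch of the idea with scikit-learn's `KNNImputer`, using a toy array with three features so that the other features can define similarity. The numbers are illustrative only.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy array: each row has at most one missing feature
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each gap is filled with the mean of that feature over the 2 nearest rows,
# where distance is computed on the features both rows have observed
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Here the gap in the first row becomes 4.0 (the mean of 3 and 5), and the gap in the third row becomes 5.5 (the mean of 3 and 8).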

### **Single Column**

In [29]:
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [30]:
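from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=4)
# Note: with `age` as the only input column there are no other features to measure
# similarity on, so rows with a missing age receive the column mean.
df['age'] = imputer.fit_transform(df[['age']]).ravel()
df.isnull().sum()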


survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
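One caveat of the single-column setup is that a row with a missing value has no other feature to measure distances on. In that case `KNNImputer` falls back to the column mean. The sketch below checks this on a hypothetical array of ages.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical single-column data with two gaps
ages = np.array([[22.0], [38.0], [np.nan], [35.0], [np.nan], [54.0]])

# KNN imputation on the single column
knn_filled = KNNImputer(n_neighbors=4).fit_transform(ages).ravel()

# Plain mean imputation for comparison
flat = ages.ravel()
mean_filled = np.where(np.isnan(flat), np.nanmean(flat), flat)

print(knn_filled)
print(np.allclose(knn_filled, mean_filled))
```

To benefit from neighbor information, the imputer needs to be fitted on several numeric columns at once.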

### **Multiple Columns**

In [31]:
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [32]:
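# Encode the categorical columns with label encoding
from sklearn.preprocessing import LabelEncoder

col_to_encode = ['sex', 'embarked', 'embark_town', 'class', 'who', 'deck', 'alive']
label_enc = {}
for col in col_to_encode:
    # Use a fresh encoder per column so each stored encoder can decode its own column
    le = LabelEncoder()
    # Note: NaN is encoded as its own class here (e.g. 7 for `deck`), so it no longer counts as missing
    df[col] = le.fit_transform(df[col])
    label_enc[col] = le
df.head()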

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,7,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,2,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,7,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,7,2,0,True


In [34]:
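from sklearn.experimental import enable_iterative_imputer  # required to import IterativeImputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10)
cols_to_impute = ['age', 'deck', 'embark_town', 'embarked']
# Impute the columns jointly so each one is modeled from the others;
# fitting on one column at a time would reduce to filling with that column's mean.
df[cols_to_impute] = imputer.fit_transform(df[cols_to_impute])
df.isnull().sum()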


survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

## **3. Random Forests for Imputing Missing Values**
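Some random forest implementations can handle missing values natively, depending on the library and its version. Random forests can also be used to predict missing values from the patterns learned in the other columns, which is the approach taken here.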

Let's see how to implement random forests in Python using the Titanic dataset.

In [41]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error
from sklearn.impute import SimpleImputer

# 1. load the dataset
df = sns.load_dataset('titanic')

# check missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
sex              0
pclass           0
survived         0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64

In [42]:
# remove deck column
df.drop('deck', axis=1, inplace=True)

# check missing values in each column
df.isnull().sum().sort_values(ascending=False)

age            177
embarked         2
embark_town      2
sex              0
pclass           0
survived         0
parch            0
sibsp            0
class            0
fare             0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [43]:
# encode the data using label encoding
from sklearn.preprocessing import LabelEncoder
# Columns to encode
columns_to_encode = ['sex', 'embarked', 'who', 'class', 'embark_town', 'alive']

# Dictionary to store LabelEncoders for each column
label_encoders = {}

# Loop to apply LabelEncoder to each column
for col in columns_to_encode:
    # Create a new LabelEncoder for the column
    le = LabelEncoder()

    # Fit the encoder and transform the column into integer codes
    df[col] = le.fit_transform(df[col])

    # Store the encoder in the dictionary
    label_encoders[col] = le

# Check the first few rows of the DataFrame
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


In [44]:
df_with_missing = df[df['age'].isna()]
# dropna removes all rows with missing values
df_without_missing = df.dropna()

print("The shape of the original dataset is: ", df.shape)
print("The shape of the dataset with missing values removed is: ", df_without_missing.shape)
print("The shape of the dataset with missing values is: ", df_with_missing.shape)

The shape of the original dataset is:  (891, 14)
The shape of the dataset with missing values removed is:  (714, 14)
The shape of the dataset with missing values is:  (177, 14)


In [45]:
X = df_without_missing.drop(['age'], axis=1)
y = df_without_missing['age']

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Random Forest Imputation
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# evaluate the model
y_pred = rf_model.predict(X_test)
print("RMSE for Random Forest Imputation: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score for Random Forest Imputation: ", r2_score(y_test, y_pred))
print("MAE for Random Forest Imputation: ", mean_absolute_error(y_test, y_pred))
print("MAPE for Random Forest Imputation: ", mean_absolute_percentage_error(y_test, y_pred))

RMSE for Random Forest Imputation:  11.081260589808045
R2 Score for Random Forest Imputation:  0.33769388288226154
MAE for Random Forest Imputation:  8.666661815622195
MAPE for Random Forest Imputation:  0.40839466096086574
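One way to put scores like these in context is to compare the model against a baseline that always predicts the training mean, which is what mean imputation amounts to. The sketch below does this on synthetic data, so its numbers are not comparable to the Titanic results above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on the first two features plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = 30 + 8 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=4, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

rmse_baseline = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
rmse_forest = np.sqrt(mean_squared_error(y_test, forest.predict(X_test)))
print("RMSE baseline:", rmse_baseline)
print("RMSE random forest:", rmse_forest)
```

If the model's error is not clearly below the baseline's, the extra complexity of model-based imputation may not be worth it.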


In [46]:
# Predict missing values
y_pred = rf_model.predict(df_with_missing.drop(['age'], axis=1))

In [47]:
# replace the missing values with the predicted values
# (assign returns a new dataframe, which avoids pandas' SettingWithCopyWarning)
df_with_missing = df_with_missing.assign(age=y_pred)

# check the missing values
df_with_missing.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [49]:
df_complete = pd.concat([df_with_missing, df_without_missing], axis=0)
# print the shape of the complete dataframe
print("The shape of the complete dataframe is: ", df_complete.shape)

# display the complete dataframe
df_complete

The shape of the complete dataframe is:  (891, 14)


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,32.976583,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,35.642218,0,0,13.0000,2,1,1,True,2,1,True
19,1,3,0,18.347000,0,0,7.2250,0,2,2,False,0,1,True
26,0,3,1,35.571486,0,0,7.2250,0,2,1,True,0,0,True
28,1,3,0,20.651429,0,0,7.8792,1,2,2,False,1,1,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,0,3,0,39.000000,0,5,29.1250,1,2,2,False,1,0,False
886,0,2,1,27.000000,0,0,13.0000,2,1,1,True,2,0,True
887,1,1,0,19.000000,0,0,30.0000,2,0,2,False,2,1,True
889,1,1,1,26.000000,0,0,30.0000,0,0,1,True,0,1,True
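The concatenated dataframe lists the imputed rows first and the originally complete rows after them. If the original row order matters, sorting by the index restores it. The sketch below shows this on hypothetical small frames.

```python
import pandas as pd

# Hypothetical frame with two missing ages
original = pd.DataFrame({"age": [22.0, None, 26.0, None]}, index=[0, 1, 2, 3])

# Split into rows to impute and rows that are already complete
imputed_part = original[original["age"].isna()].assign(age=[30.0, 31.0])
observed_part = original.dropna()

# Concatenation keeps each part's index labels, so the order is 1, 3, 0, 2
combined = pd.concat([imputed_part, observed_part], axis=0)

# Sorting by index restores the original row order
restored = combined.sort_index()
print(restored)
```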


In [50]:
for col in columns_to_encode:
    # Retrieve the corresponding LabelEncoder for the column
    le = label_encoders[col]

    # Inverse transform the codes in df_complete itself; decoding df[col] instead
    # would assign labels in the original row order and misalign them
    df_complete[col] = le.inverse_transform(df_complete[col])

# check the first 5 rows of the complete dataframe
df_complete.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,male,32.976583,0,0,8.4583,Q,Third,man,True,Queenstown,no,True
17,1,2,male,35.642218,0,0,13.0,S,Second,man,True,Southampton,yes,True
19,1,3,female,18.347,0,0,7.225,C,Third,woman,False,Cherbourg,yes,True
26,0,3,male,35.571486,0,0,7.225,C,Third,man,True,Cherbourg,no,True
28,1,3,female,20.651429,0,0,7.8792,Q,Third,woman,False,Queenstown,yes,True


In [52]:
df_complete.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64

In [58]:
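df_complete = df_complete.fillna(df_complete.mode().iloc[0])  # fill the remaining gaps in `embarked` and `embark_town` with each column's mode
df_complete.isnull().sum().sum()  # total number of missing values left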

np.int64(0)

In [61]:
# save the completed data to a CSV file
df_complete.to_csv('titanic_complete.csv', index=False)