# **Filling in the Blanks: A Step-by-Step Guide to Imputing Null Values in Python**

**Author Details:**


**Dr. Muhammad Aammar Tufail**

PhD Data Science in Agriculture

[<img src="https://raw.githubusercontent.com/FortAwesome/Font-Awesome/6.x/svgs/brands/youtube.svg" width="50" height="50">](https://www.youtube.com/channel/UCmNXJXWONLNF6bdftGY0Otw/)
[<img src="https://raw.githubusercontent.com/FortAwesome/Font-Awesome/6.x/svgs/brands/linkedin.svg" width="50" height="50">](https://www.linkedin.com/in/dr-muhammad-aammar-tufail-02471213b/)
[<img src="https://raw.githubusercontent.com/FortAwesome/Font-Awesome/6.x/svgs/brands/github.svg" width="50" height="50">](https://github.com/AammarTufail)
[<img src="https://raw.githubusercontent.com/FortAwesome/Font-Awesome/6.x/svgs/brands/twitter.svg" width="50" height="50">](https://twitter.com/aammar_tufail)
[<img src="https://raw.githubusercontent.com/FortAwesome/Font-Awesome/6.x/svgs/brands/facebook.svg" width="50" height="50">](https://www.facebook.com/groups/codanics/permalink/1872283496462303/)


## **Introduction:**

Missing data can pose significant challenges in data analysis and modeling. Incomplete datasets can lead to biased results and inaccurate conclusions. 

Fortunately, Python offers a range of techniques for handling missing values. In this blog post, we will explore different methods to impute null values in Python, mainly with the popular pandas library.

Let's dive in!

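1. **Mean/Median/Mode Imputation:**
Missing values in a numeric column can be replaced with the mean or median of the non-null values in that column; the median is the safer choice when the column is skewed or contains outliers. The mode (the most frequent value) also works for categorical columns, where a mean or median is not defined. These methods are simple and fast, but they shrink the variance of the imputed column.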

2. **Forward Fill (ffill) or Backward Fill (bfill):**
When dealing with time series or other ordered data, forward fill and backward fill come in handy. Forward fill replaces each null value with the most recent non-null value before it, while backward fill uses the next non-null value after it. Both methods respect the chronological order of the observations. In pandas they are available as `ffill()` and `bfill()`; `pad` is an older alias for forward fill.

3. **Interpolation:**
Interpolation estimates each missing value from the observed values around it. Several methods are available, such as linear, polynomial, and spline interpolation. These techniques are particularly useful for continuous variables that change smoothly between observations.

4. **Machine Learning-based Imputation:**
For more advanced imputations, machine learning-based approaches can be employed. The K-Nearest Neighbors (KNN) algorithm finds the K nearest neighbors based on other features and uses their values to estimate the missing values. Random Forest imputation utilizes a Random Forest model to predict missing values based on the available features. Multivariate Imputation by Chained Equations (MICE) generates multiple imputations by building regression models on the observed data and iteratively imputing missing values based on other variables.
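
Methods 2 and 3 above can be compared side by side on a small ordered series. The sketch below is only an illustration: the dates and readings are made-up values, not data from this post.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps (illustrative values only)
dates = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0], index=dates)

# Forward fill: carry the last observed value forward
forward_filled = readings.ffill()

# Backward fill: pull the next observed value backward
backward_filled = readings.bfill()

# Linear interpolation: draw a straight line between the known points
interpolated = readings.interpolate(method="linear")

comparison = pd.DataFrame({
    "original": readings,
    "ffill": forward_filled,
    "bfill": backward_filled,
    "linear": interpolated,
})
print(comparison)
```

Note that forward fill cannot fill a gap at the very start of a series, and backward fill cannot fill one at the very end, because there is no neighbor to copy from. For `interpolate`, the `"polynomial"` and `"spline"` methods additionally need SciPy installed and an `order` argument.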

The example below applies mean and mode imputation (method 1) to the Titanic dataset that ships with seaborn:

```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Count the missing values in each column
print(titanic_df.isnull().sum())

# Impute missing values in the numeric 'age' column with the column mean.
# Assigning the result back avoids chained-assignment pitfalls with inplace=True.
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].mean())

# Impute missing values in the categorical 'embark_town' column with the mode
most_common_town = titanic_df['embark_town'].mode()[0]
titanic_df['embark_town'] = titanic_df['embark_town'].fillna(most_common_town)

# Count the missing values again to confirm the imputation
print(titanic_df.isnull().sum())
```

Missing values before imputation:

```
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
```

Missing values after imputation:

```
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      0
alive            0
alone            0
dtype: int64
```
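
The machine learning-based approaches (method 4) are not part of pandas itself. One possible sketch uses scikit-learn, which is an assumption here since the post does not name a library for this step. The small DataFrame is made-up illustrative data, and `IterativeImputer` is a MICE-inspired imputer that returns a single imputation by default rather than multiple ones.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Hypothetical numeric data with gaps in 'age' (illustrative values only)
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "sibsp": [1, 1, 0, 1, 0, 0],
})

# KNN imputation: average the value over the 2 most similar rows
knn_imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(
    knn_imputer.fit_transform(df), columns=df.columns, index=df.index
)

# Iterative (MICE-style) imputation: model each column from the others
iterative_imputer = IterativeImputer(max_iter=10, random_state=0)
iterative_filled = pd.DataFrame(
    iterative_imputer.fit_transform(df), columns=df.columns, index=df.index
)

print(knn_filled)
print(iterative_filled)
```

Both imputers work on numeric input only, so categorical columns need to be encoded first. `KNNImputer` is also distance-based, which means features on very different scales may need to be standardized before imputing.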


## **Conclusion:**
Handling missing data is crucial for accurate data analysis and modeling. In this blog post, we explored various methods to impute null values in Python using pandas. From simple mean imputation to more advanced machine learning-based approaches, each method offers unique advantages depending on the dataset characteristics and context of the problem.

Applied with care, these techniques can make your analysis more robust and let you keep rows that would otherwise be dropped. So, the next time you encounter missing values in your Python projects, you can apply these imputation methods with confidence.

Remember, selecting the appropriate imputation method requires careful consideration of your dataset's nature and the goals of your analysis. Experiment with different approaches and choose the one that best suits your needs.


--- 
***Happy coding and data exploration!***
