# Handling Missing Values

Missing values are common in real-world datasets. Left unhandled, they can raise errors (many scikit-learn estimators reject `NaN` input) or degrade the performance of machine learning models.

In this notebook, we'll cover:
- Identifying missing values
- Removing missing values
- Imputing missing values (mean, median, mode)
- Using forward/backward fill
- Handling categorical missing values

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

data = {
    'Age': [25, 30, np.nan, 35, 40],
    'Salary': [50000, np.nan, 55000, np.nan, 65000],
    'Country': ['India', 'USA', np.nan, 'UK', 'India']
}

df = pd.DataFrame(data)
df

## 1. Identifying Missing Values

In [None]:
df.isnull()

In [None]:
df.isnull().sum()
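
Raw counts are easier to judge next to the size of the dataset. The sketch below is one possible follow-up, not part of the original walkthrough: it rebuilds the same toy data under the assumed name `example` so the cell runs on its own, then reports the share of missing entries per column and the rows that are incomplete.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so this cell is self-contained
# ("example" is a name chosen for illustration)
example = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40],
    'Salary': [50000, np.nan, 55000, np.nan, 65000],
    'Country': ['India', 'USA', np.nan, 'UK', 'India']
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column
missing_share = example.isna().mean()

# Rows that contain at least one missing value
incomplete_rows = example[example.isna().any(axis=1)]

print(missing_share)
print(incomplete_rows)
```

`isna()` is an alias of `isnull()`, so either spelling gives the same mask.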

## 2. Removing Missing Values
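- Drop rows or columns that contain missing values.
- Use with care, as it discards data: in this dataset only 2 of the 5 rows are complete, and every column has at least one missing value, so dropping columns removes all of them.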

In [None]:
# Drop rows with missing values
df_drop_rows = df.dropna()
df_drop_rows

In [None]:
# Drop columns with missing values
df_drop_cols = df.dropna(axis=1)
df_drop_cols
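
Dropping does not have to be all-or-nothing. As a hedged sketch, `dropna` also accepts `subset` (only look at the named columns) and `thresh` (keep anything with at least that many non-missing values). The data is rebuilt under the assumed name `example` so the cell stands alone.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so this cell is self-contained
example = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40],
    'Salary': [50000, np.nan, 55000, np.nan, 65000],
    'Country': ['India', 'USA', np.nan, 'UK', 'India']
})

# Drop a row only when 'Age' itself is missing
rows_with_age = example.dropna(subset=['Age'])

# Keep columns that have at least 4 non-missing values;
# 'Salary' has only 3, so it is the one column dropped
cols_mostly_complete = example.dropna(axis=1, thresh=4)

print(rows_with_age)
print(cols_mostly_complete)
```

Compared with a plain `dropna()`, both calls keep more of the data: four rows instead of two, and two columns instead of none.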

## 3. Imputing Missing Values (Numerical)
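- Replace missing values with the column's mean, median, or mode.
- Keeps every row, unlike dropping, at the cost of reducing the column's variance.
- The median is less sensitive to outliers than the mean.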

In [None]:
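# Impute on a copy so that df keeps its missing values for the later sections
df_mean = df.copy()
imputer_mean = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array; each column is filled with its own mean
df_mean[['Age', 'Salary']] = imputer_mean.fit_transform(df_mean[['Age', 'Salary']])
df_mean

In [None]:
# Start again from df; after mean imputation there would be nothing left to fill
df_median = df.copy()
imputer_median = SimpleImputer(strategy='median')
df_median[['Salary']] = imputer_median.fit_transform(df_median[['Salary']])
df_median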
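
For a quick look without scikit-learn, the same idea can be sketched in pandas alone. This is an alternative to `SimpleImputer`, not the notebook's own method: `fillna` accepts a Series keyed by column name, so each numeric column is filled with its own statistic. The names `example` and `numeric_cols` are assumptions made for this cell.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so this cell is self-contained
example = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40],
    'Salary': [50000, np.nan, 55000, np.nan, 65000],
    'Country': ['India', 'USA', np.nan, 'UK', 'India']
})

numeric_cols = ['Age', 'Salary']

# mean() and median() skip NaN by default and return one value per column;
# fillna matches those values to columns by name
filled_mean = example.fillna(example[numeric_cols].mean())
filled_median = example.fillna(example[numeric_cols].median())

print(filled_mean)
print(filled_median)
```

`Country` is left untouched because it has no entry in the Series of fill values.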

## 4. Forward Fill and Backward Fill
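- Fill missing values from neighboring rows; this is most meaningful when the rows are ordered (for example, by time).
- Forward fill: copies the last valid value down into the gap. A gap at the very start of a column stays missing.
- Backward fill: copies the next valid value up into the gap. A gap at the very end of a column stays missing.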

In [None]:
# fillna(method='ffill') is deprecated in recent pandas; use ffill() directly
df_ffill = df.ffill()
df_ffill

In [None]:
# fillna(method='bfill') is deprecated in recent pandas; use bfill() directly
df_bfill = df.bfill()
df_bfill
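
The edge behavior of the two fills is easier to see on a series with gaps at both ends. The values below are hypothetical, chosen only for illustration; the sketch also shows the `limit` argument, which caps how many consecutive gaps a single value may fill.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with a leading gap, a two-row gap, and a trailing gap
readings = pd.Series([np.nan, 10.0, np.nan, np.nan, 13.0, np.nan])

# Forward fill cannot fill the leading gap: there is no earlier value to copy
forward = readings.ffill()

# Backward fill cannot fill the trailing gap: there is no later value to copy
backward = readings.bfill()

# limit=1 fills at most one consecutive missing value after each valid one
forward_limited = readings.ffill(limit=1)

# Chaining the two fills covers both ends
both = readings.ffill().bfill()

print(pd.DataFrame({
    'raw': readings,
    'ffill': forward,
    'bfill': backward,
    'ffill_limit_1': forward_limited,
    'ffill_then_bfill': both,
}))
```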

## 5. Handling Missing Categorical Values
- Use the most frequent value (mode).
- Or use a placeholder like `'Unknown'`.

In [None]:
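# Work on copies so the two strategies can be compared side by side.
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which recent pandas versions warn about.
df_mode = df.copy()
df_mode['Country'] = df_mode['Country'].fillna(df_mode['Country'].mode()[0])
df_mode

In [None]:
df_unknown = df.copy()
df_unknown['Country'] = df_unknown['Country'].fillna('Unknown')
df_unknown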
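
The two strategies leave different traces in the category counts. A minimal sketch, using the `Country` values from the toy data as a standalone Series (the variable names are assumptions for this cell):

```python
import numpy as np
import pandas as pd

# The Country column from the toy data, as a standalone Series
country = pd.Series(['India', 'USA', np.nan, 'UK', 'India'], name='Country')

# mode() ignores NaN by default; [0] takes the first of any tied modes
mode_filled = country.fillna(country.mode()[0])

# A placeholder keeps "was missing" visible as its own category
unknown_filled = country.fillna('Unknown')

print(country.value_counts(dropna=False))
print(mode_filled.value_counts())
print(unknown_filled.value_counts())
```

Filling with the mode inflates the most frequent category, while the placeholder keeps the original counts and records that the value was absent. Which trade-off is preferable depends on the downstream model.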

## ✅ Summary
- Identified missing values.
- Removed rows/columns with missing data.
- Imputed missing values using mean/median.
- Used forward/backward fill.
- Handled categorical missing values.

👉 Handling missing values is a crucial first step in building reliable ML models.