# 🧹 Data Cleaning: Categorical Missing Value Imputation

In this notebook, we explore techniques for handling **missing categorical values**.
Missing values in categorical columns can be filled using methods like:

- Mode (most frequent category)
- A placeholder like `'Missing'`
- Scikit-learn's `SimpleImputer`
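
The placeholder option from the list above can be sketched as follows. This is a minimal illustration rather than part of the original cells: the name `df_demo` and the choice of `'Missing'` as the label are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (df_demo is a hypothetical name for this sketch)
df_demo = pd.DataFrame({
    'Gender': ['Female', 'Male', np.nan, 'Male', np.nan],
    'City': ['Delhi', np.nan, 'Mumbai', 'Delhi', 'Chennai'],
})

# Fill each column with an explicit placeholder category.
# fillna returns a new DataFrame, so df_demo itself is left unchanged.
df_filled = df_demo.fillna({'Gender': 'Missing', 'City': 'Missing'})
print(df_filled)
```

A placeholder keeps the fact that a value was missing visible as its own category, whereas the mode hides it inside the most common one.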


In [1]:
# 📦 Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# 🧪 Create sample dataset with missing categorical values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Gender': ['Female', 'Male', np.nan, 'Male', np.nan],
    'City': ['Delhi', np.nan, 'Mumbai', 'Delhi', 'Chennai']
}
df = pd.DataFrame(data)
df

      Name  Gender     City
0    Alice  Female    Delhi
1      Bob    Male      NaN
2  Charlie     NaN   Mumbai
3    David    Male    Delhi
4      Eva     NaN  Chennai


In [3]:
df.shape

(5, 3)

In [5]:
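# Note: without parentheses, `df.info` only displays the bound method (shown below);
# call df.info() to print the column dtypes and non-null counts
df.info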

<bound method DataFrame.info of       Name  Gender     City
0    Alice  Female    Delhi
1      Bob    Male      NaN
2  Charlie     NaN   Mumbai
3    David    Male    Delhi
4      Eva     NaN  Chennai>

In [6]:
# Select the object-dtype columns; here they hold the categorical (string) data
cat_vars = df.select_dtypes(include='object')
cat_vars.head()

      Name  Gender     City
0    Alice  Female    Delhi
1      Bob    Male      NaN
2  Charlie     NaN   Mumbai
3    David    Male    Delhi
4      Eva     NaN  Chennai


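In [12]:
cat_vars.shape

(5, 3)

In [13]:
# Count the missing values in each categorical column
cat_vars.isnull().sum()

Name      0
Gender    2
City      1
dtype: int64

In [14]:
# Percentage of missing values in each categorical column
miss_val_per = cat_vars.isnull().mean() * 100
miss_val_per

Name       0.0
Gender    40.0
City      20.0
dtype: float64

In [15]:
# 🔍 Check for missing values in the full DataFrame
df.isnull().sum()

Name      0
Gender    2
City      1
dtype: int64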
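
The count and percentage checks above can also be combined into one table. The sketch below is only one way to do it: `df_demo` and `missing_summary` are hypothetical names, and the data is a copy of the sample dataset.

```python
import numpy as np
import pandas as pd

# Copy of the sample data (df_demo is a hypothetical name for this sketch)
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Gender': ['Female', 'Male', np.nan, 'Male', np.nan],
    'City': ['Delhi', np.nan, 'Mumbai', 'Delhi', 'Chennai'],
})

# One row per column: how many values are missing, and what share that is
missing_summary = pd.DataFrame({
    'missing_count': df_demo.isnull().sum(),
    'missing_percent': df_demo.isnull().mean() * 100,
})

# Put the most affected columns first
missing_summary = missing_summary.sort_values('missing_percent', ascending=False)
print(missing_summary)
```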

## ✅ Method 1: Fill missing values with most frequent category (mode)

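In [16]:
# Replace missing 'Gender' values with the mode. Assign the result back
# rather than calling fillna(..., inplace=True) on a column selection,
# which acts on a copy and triggers a pandas warning.
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

# Replace missing 'City' values with the mode
df['City'] = df['City'].fillna(df['City'].mode()[0])
df

      Name  Gender     City
0    Alice  Female    Delhi
1      Bob    Male    Delhi
2  Charlie    Male   Mumbai
3    David    Male    Delhi
4      Eva    Male  Chennai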
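
When several columns need the same treatment, the per-column calls can be collapsed into a single `fillna` with a dictionary of modes. This is a sketch under stated assumptions: `df_demo`, `cols_to_impute`, and `modes` are hypothetical names, and the column list is chosen by hand.

```python
import numpy as np
import pandas as pd

# Copy of the sample data (df_demo is a hypothetical name for this sketch)
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Gender': ['Female', 'Male', np.nan, 'Male', np.nan],
    'City': ['Delhi', np.nan, 'Mumbai', 'Delhi', 'Chennai'],
})

# Columns to impute, picked by hand for this example
cols_to_impute = ['Gender', 'City']

# mode() can return more than one value when there is a tie;
# .iloc[0] takes the first value it returns
modes = {col: df_demo[col].mode().iloc[0] for col in cols_to_impute}

# A dict passed to DataFrame.fillna maps column name -> fill value
df_mode = df_demo.fillna(modes)
print(modes)
print(df_mode)
```

Keeping the modes in a dictionary also leaves a record of exactly which value was used for each column.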

## ✅ Method 2: Fill missing values using Scikit-Learn

In [17]:
from sklearn.impute import SimpleImputer

# Sample data again with missing values
df2 = pd.DataFrame({
    'Gender': ['Female', 'Male', np.nan, 'Male', np.nan],
    'City': ['Delhi', np.nan, 'Mumbai', 'Delhi', 'Chennai']
})

# Create a SimpleImputer that fills with the most frequent value per column
imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform (returns a NumPy array)
imputed = imputer.fit_transform(df2)

# Convert back to a DataFrame, reusing the original column names
pd.DataFrame(imputed, columns=df2.columns)

   Gender     City
0  Female    Delhi
1    Male    Delhi
2    Male   Mumbai
3    Male    Delhi
4    Male  Chennai
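
`SimpleImputer` can also write a fixed placeholder instead of the mode, through `strategy='constant'` and `fill_value`. The sketch below assumes the label `'Missing'`; `df_demo` and `placeholder_imputer` are hypothetical names. The frame is built with `dtype=object` on the assumption that this keeps the string columns as plain object columns across pandas versions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data; dtype=object is an assumption made to keep the
# columns as plain object columns regardless of pandas version
df_demo = pd.DataFrame({
    'Gender': ['Female', 'Male', np.nan, 'Male', np.nan],
    'City': ['Delhi', np.nan, 'Mumbai', 'Delhi', 'Chennai'],
}, dtype=object)

# strategy='constant' replaces every missing entry with fill_value
placeholder_imputer = SimpleImputer(strategy='constant', fill_value='Missing')

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
filled = pd.DataFrame(
    placeholder_imputer.fit_transform(df_demo),
    columns=df_demo.columns,
    index=df_demo.index,
)
print(filled)
```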

### ✅ Summary:
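- Use `.fillna()` with the column mode, or `SimpleImputer(strategy='most_frequent')`, to fill missing categorical data
- `.fillna()` is the simplest choice for one-off cleaning; `SimpleImputer` is better suited to automated workflows because it learns the fill values in `fit` and reapplies them in `transform`
- Avoid dropping rows unnecessarily; imputing keeps records that would otherwise be lost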
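
To illustrate the automation point, the sketch below fits a `SimpleImputer` once and reuses it on rows it has not seen. The frames `train` and `new_rows` are invented for the example and are not part of the original notebook; `dtype=object` is an assumption made to keep the columns as plain object columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical "training" data, mirroring the sample dataset
train = pd.DataFrame({
    'Gender': ['Female', 'Male', np.nan, 'Male', np.nan],
    'City': ['Delhi', np.nan, 'Mumbai', 'Delhi', 'Chennai'],
}, dtype=object)

# Hypothetical rows that arrive later, with their own gaps
new_rows = pd.DataFrame({
    'Gender': [np.nan, 'Female'],
    'City': ['Pune', np.nan],
}, dtype=object)

imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(train)  # learns one fill value per column

# statistics_ holds the learned fill values, in column order
print(imputer.statistics_)

# transform reuses the values learned from train; nothing is recomputed
new_filled = pd.DataFrame(imputer.transform(new_rows), columns=new_rows.columns)
print(new_filled)
```

Reusing the fitted imputer means later data is filled with the same categories as the original data, rather than with whatever happens to be most frequent in the new batch.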
