<br>

# <center> Categorical Missing Data Handling

<br>

---

<br>

Either of two approaches can be used to handle missing categorical data:

1.   Impute a new category : 'Missing'
2.   Frequent category imputation
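
As a quick orientation, the two approaches listed above can be sketched on a toy column. This is a minimal illustration, not the notebook's own code: the `titles` series and its values are made up, and `Series.mode()` is used here as one possible way to find the most frequent category.

```python
import numpy as np
import pandas as pd

# Toy column standing in for a categorical feature (values are made up)
titles = pd.Series(["Teacher", "Manager", np.nan, "Teacher", np.nan])

# Approach 1: treat missingness as its own category
as_missing = titles.fillna("Missing")

# Approach 2: replace missing entries with the most frequent category
most_frequent = titles.mode()[0]
as_frequent = titles.fillna(most_frequent)

print(as_missing.tolist())
print(as_frequent.tolist())
```

The first approach keeps the information that a value was absent, while the second keeps the set of categories unchanged.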

<br>


<br>

## Import Libraries

In [2]:
# importing all the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# importing modules from the local 'mltoolsh' package
# Documentation : https://github.com/Shohrab-Hossain/mltoolsh
import mltoolsh.missingValues as _mv
import mltoolsh.correlation as _corr

<br>

## Dataset Overview

In [3]:
# dataset overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amnt             396030 non-null  float64
 1   term                  396030 non-null  object 
 2   int_rate              396030 non-null  float64
 3   installment           396030 non-null  float64
 4   grade                 396030 non-null  object 
 5   sub_grade             396030 non-null  object 
 6   emp_title             373103 non-null  object 
 7   emp_length            377729 non-null  object 
 8   home_ownership        396030 non-null  object 
 9   annual_inc            396030 non-null  float64
 10  verification_status   396030 non-null  object 
 11  issue_d               396030 non-null  object 
 12  loan_status           396030 non-null  object 
 13  purpose               396030 non-null  object 
 14  title                 394275 non-null  object 
 15  

<br>

### The `'emp_title'` column is used in this illustration of handling missing categorical data.


In [4]:
# checking the data type of the column
df['emp_title'].dtype

dtype('O')

> comment : The dtype is `object` (`'O'`), so the 'emp_title' column is treated as a categorical column.

<br>

## 1. Impute a new category : 'Missing'

In [5]:
# creating a copy of the original dataset
df = originalDF.copy()

In [6]:
# checking how many values are missing in the column 'emp_title'
_mv.hasMissingValues('emp_title', df)

This column has 22927 missing values : 5.79 %


> comment : The 'emp_title' column has 5.79% missing values.
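
For readers without the `mltoolsh` package, a similar count and percentage can be obtained with plain pandas. The sketch below is an assumption about what such a check might compute, not the package's actual implementation; `missing_summary` and the `toy` frame are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(column, frame):
    """Return (count, percent) of missing entries in frame[column].

    Hypothetical helper for illustration; not part of mltoolsh.
    """
    n_missing = int(frame[column].isna().sum())
    pct_missing = 100 * n_missing / len(frame)
    return n_missing, round(pct_missing, 2)

# Toy frame with two missing values out of four rows (values are made up)
toy = pd.DataFrame({"emp_title": ["Teacher", np.nan, "Manager", np.nan]})

count, pct = missing_summary("emp_title", toy)
print(f"This column has {count} missing values : {pct} %")
```

Applied to the dataset above, the same arithmetic gives 22927 / 396030, which rounds to the reported 5.79 %.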

In [7]:
# counting the occurrences of each category in the column
df['emp_title'].value_counts()

Supervisor    124744
Manager       124338
Teacher       124021
Name: emp_title, dtype: int64

> comment : The 'emp_title' column has 3 categories.

In [8]:
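# filling the missing values with a new 'Missing' category
# (assigning the result back, since fillna(inplace=True) on a selected column may not update df in newer pandas versions)
df['emp_title'] = df['emp_title'].fillna(value='Missing')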

In [9]:
# checking whether the column has any missing values left
_mv.hasMissingValues('emp_title', df)

This column has no missing value.


> comment : The column 'emp_title' has no missing values left. The missing values are filled with a new category, 'Missing'.

In [10]:
# counting the occurrences of each category after imputation
df['emp_title'].value_counts()

Supervisor    124744
Manager       124338
Teacher       124021
Missing        22927
Name: emp_title, dtype: int64

> comment : The 'emp_title' column now has a new category named 'Missing'. Its count is 22927, which equals the number of missing values before imputation.
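
The equality noted in the comment above can also be checked programmatically rather than by eye. The following is a small sketch on a made-up series (`toy` is a hypothetical stand-in for the real column); it assumes the label 'Missing' does not already occur in the data, otherwise the two counts would differ.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (values are made up)
toy = pd.Series(["Teacher", np.nan, "Manager", np.nan, "Teacher"], name="emp_title")

# Record the number of missing values before imputing
n_missing_before = int(toy.isna().sum())

# Impute a new 'Missing' category and recount
filled = toy.fillna("Missing")
counts_after = filled.value_counts()

# The new category should account for exactly the values that were missing
assert counts_after["Missing"] == n_missing_before
assert filled.isna().sum() == 0
```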

<br>

## 2. Frequent category imputation

In [11]:
# creating a copy of the original dataset
df = originalDF.copy()

In [12]:
# checking how many values are missing in the column 'emp_title'
_mv.hasMissingValues('emp_title', df)

This column has 22927 missing values : 5.79 %


> comment : The 'emp_title' column has 5.79% missing values.

In [13]:
# counting the occurrences of each category in the column
df['emp_title'].value_counts()

Supervisor    124744
Manager       124338
Teacher       124021
Name: emp_title, dtype: int64

In [14]:
# finding the most frequent category
# (value_counts() sorts by count in descending order, so the first label is the most frequent)
catagories = df['emp_title'].value_counts()
frequentCatagory = catagories.index[0]

In [15]:
frequentCatagory

'Supervisor'
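
The most frequent category can be looked up in a few equivalent ways in pandas. The sketch below compares them on a made-up series (`toy` is hypothetical); all three skip missing values by default. When two categories tie for the highest count, the approaches are not guaranteed to pick the same one, so it may be worth checking for ties before imputing.

```python
import numpy as np
import pandas as pd

# Toy column where 'Supervisor' is the clear majority (values are made up)
toy = pd.Series(["Supervisor", "Manager", "Supervisor", np.nan, "Teacher"])

# First label of the descending-sorted counts (the approach used above)
via_counts = toy.value_counts().index[0]

# Label with the largest count
via_idxmax = toy.value_counts().idxmax()

# Statistical mode; returns a Series because several values can tie
via_mode = toy.mode()[0]

print(via_counts, via_idxmax, via_mode)
```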

In [16]:
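# filling the missing values with the most frequent category
# (assigning the result back, since fillna(inplace=True) on a selected column may not update df in newer pandas versions)
df['emp_title'] = df['emp_title'].fillna(value=frequentCatagory)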

In [17]:
# checking whether the column has any missing values left
_mv.hasMissingValues('emp_title', df)

This column has no missing value.


> comment : The column 'emp_title' has no missing values left. The missing values are filled with the most frequent category.

In [18]:
# counting the occurrences of each category after imputation
df['emp_title'].value_counts()

Supervisor    147671
Manager       124338
Teacher       124021
Name: emp_title, dtype: int64

> comment : The 'emp_title' column has the same categories as before, but the count of the most frequent category, 'Supervisor', has increased by 22927 (from 124744 to 147671), which equals the number of missing values before imputation.
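
One consequence of this approach is that the most frequent category gains share relative to the others. The sketch below reproduces that shift using only the counts reported above, so it does not need the full dataset; the variable names are introduced here for illustration.

```python
import pandas as pd

# Counts reported above for the observed categories and the missing entries
observed = pd.Series({"Supervisor": 124744, "Manager": 124338, "Teacher": 124021})
n_missing = 22927

# Share of each category among the observed (non-missing) values
share_before = observed / observed.sum()

# Frequent category imputation assigns every missing value to the top category
imputed = observed.copy()
imputed[observed.idxmax()] += n_missing
share_after = imputed / imputed.sum()

print(pd.DataFrame({"before": share_before, "after": share_after}).round(4))
```

With these counts, the share of 'Supervisor' rises from about 33% to about 37%, while the other two categories shrink proportionally. Whether that distortion is acceptable depends on how the column is used downstream.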