<br>

# <center> Categorical Data Encoding

<br>

---

<br>

Any of the following approaches can be used to encode categorical data:

1.   Using Pandas 'get_dummies()' Function
2.   Using Pandas 'replace()' Function
3.   Apply Custom Function

<br>

<br>

## Import Libraries

In [2]:
# importing all the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# importing modules from the local 'mltoolsh' package
# Documentation : https://github.com/Shohrab-Hossain/mltoolsh
import mltoolsh.missingValues as _mv

<br>

## Dataset Overview

In [3]:
# dataset overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amnt             396030 non-null  float64
 1   term                  396030 non-null  object 
 2   int_rate              396030 non-null  float64
 3   installment           396030 non-null  float64
 4   grade                 396030 non-null  object 
 5   sub_grade             396030 non-null  object 
 6   emp_title             373103 non-null  object 
 7   emp_length            377729 non-null  object 
 8   home_ownership        396030 non-null  object 
 9   annual_inc            396030 non-null  float64
 10  verification_status   396030 non-null  object 
 11  issue_d               396030 non-null  object 
 12  loan_status           396030 non-null  object 
 13  purpose               396030 non-null  object 
 14  title                 394275 non-null  object 
 15  

> comment : This dataset has 27 columns.

<br>

## 1. Using Pandas 'get_dummies()' Function

<br>

### The `'loan_status'` and `'home_ownership'` columns will be used in this illustration to perform categorical data encoding using the pandas built-in `get_dummies()` function.


<br>

**a. Encoding the 'loan_status' column**

In [4]:
# viewing the column
df['loan_status'].head(6)

0     Fully Paid
1     Fully Paid
2     Fully Paid
3     Fully Paid
4    Charged Off
5     Fully Paid
Name: loan_status, dtype: object

In [5]:
# counting the categories of the 'loan_status' column
df['loan_status'].value_counts()

Fully Paid     318357
Charged Off     77673
Name: loan_status, dtype: int64

comment : The 'loan_status' column has two categories: Fully Paid and Charged Off. They will be encoded as 1 and 0, respectively.

In [6]:
# encoding using pandas built-in 'get_dummies()' function
pd.get_dummies(df['loan_status'])

| | Charged Off | Fully Paid |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 1 | 0 |
| ... | ... | ... |
| 396025 | 0 | 1 |
| 396026 | 0 | 1 |
| 396027 | 0 | 1 |
| 396028 | 0 | 1 |


comment : The 'get_dummies()' function generates one new column per category, named after the category, with the values encoded as 0 or 1. We do not need both the 'Charged Off' and 'Fully Paid' columns, because 'Fully Paid' = 1 implies 'Charged Off' = 0 and vice versa. One column can therefore be dropped without losing information.
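
The redundancy between the two dummy columns can be checked on a small example. The sketch below uses a toy Series as a stand-in for 'loan_status' (the values are illustrative, not taken from the dataset) and passes `dtype=int` so the output is 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy stand-in for the 'loan_status' column (illustrative values only)
status = pd.Series(['Fully Paid', 'Charged Off', 'Fully Paid'], name='loan_status')

# dtype=int keeps the dummies as 0/1 integers
dummies = pd.get_dummies(status, dtype=int)

# Each row contains exactly one 1, so the two columns are complements
row_sums = dummies.sum(axis=1)
is_complement = (dummies['Charged Off'] == 1 - dummies['Fully Paid']).all()

print(dummies)
print(row_sums.tolist(), is_complement)
```

Because every row sums to 1, either column can be reconstructed from the other, which is why one of them can be dropped.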

In [7]:
# encoding and dropping the first column
pd.get_dummies(df['loan_status'], drop_first=True)

| | Fully Paid |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
| ... | ... |
| 396025 | 1 |
| 396026 | 1 |
| 396027 | 1 |
| 396028 | 1 |


comment : Now the 'get_dummies()' function generates only one column and holds all the required encoding.

In [8]:
# replacing the 'loan_status' column values with the encoded data
df['loan_status'] = pd.get_dummies(df['loan_status'], drop_first=True)

In [9]:
# checking the categories of the 'loan_status' column
df['loan_status'].value_counts()

1    318357
0     77673
Name: loan_status, dtype: int64

comment : The 'Charged Off' and 'Fully Paid' categories are encoded as 0 and 1, respectively.
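
Which category ends up as 1 depends on which column `drop_first=True` removes: it drops the first category in sorted order, here 'Charged Off'. The sketch below, on a toy dataframe with illustrative values, selects the remaining column by name and requests integer output explicitly, since recent pandas releases return boolean dummies by default. It is one possible way to make the assignment explicit, not the only one.

```python
import pandas as pd

# Toy dataframe standing in for the loan data (illustrative values only)
toy = pd.DataFrame({'loan_status': ['Fully Paid', 'Charged Off', 'Fully Paid']})

# drop_first removes the first category in sorted order ('Charged Off')
encoded = pd.get_dummies(toy['loan_status'], drop_first=True, dtype=int)

# Select the remaining column by name so the 0/1 meaning is explicit
toy['loan_status'] = encoded['Fully Paid']

print(toy['loan_status'].tolist())
```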

<br>

**b. Encoding the 'home_ownership' column**

In [10]:
# viewing the column
df['home_ownership'].head()

0        RENT
1    MORTGAGE
2        RENT
3        RENT
4    MORTGAGE
Name: home_ownership, dtype: object

In [11]:
# counting the categories of the 'home_ownership' column
df['home_ownership'].value_counts()

MORTGAGE    198348
RENT        159790
OWN          37746
OTHER          112
NONE            31
ANY              3
Name: home_ownership, dtype: int64

comment : The 'home_ownership' column has six categories. The last two ('NONE' and 'ANY') contain very few rows, so they are merged into the 'OTHER' category.

In [12]:
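# merging the last two categories into the 'OTHER' category
# (assigning the result back avoids relying on an in-place change to a column selection)
df['home_ownership'] = df['home_ownership'].replace(['NONE', 'ANY'], 'OTHER')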

In [13]:
# checking the available categories of 'home_ownership' column
df['home_ownership'].value_counts()

MORTGAGE    198348
RENT        159790
OWN          37746
OTHER          146
Name: home_ownership, dtype: int64

comment : The column now has four categories, which will be encoded.
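
When the rare categories are not known in advance, the merge can be driven by a frequency threshold instead of a hardcoded list. The sketch below is a minimal illustration on a toy Series; the `min_count` threshold and the sample values are assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-in for the 'home_ownership' column (illustrative values only)
home = pd.Series(['RENT', 'MORTGAGE', 'RENT', 'OWN',
                  'NONE', 'ANY', 'MORTGAGE', 'OWN'])

min_count = 2  # hypothetical threshold: categories seen fewer times are merged

counts = home.value_counts()
rare = counts[counts < min_count].index

# Keep values that are not rare; replace the rest with 'OTHER'
merged = home.where(~home.isin(rare), 'OTHER')

print(merged.value_counts())
```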

In [14]:
# encoding using pandas built-in 'get_dummies()' function
pd.get_dummies(df['home_ownership'], drop_first=True)

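| | OTHER | OWN | RENT |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 |
| ... | ... | ... | ... |
| 396025 | 0 | 0 | 1 |
| 396026 | 0 | 0 | 0 |
| 396027 | 0 | 0 | 1 |
| 396028 | 0 | 0 | 0 |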


comment : The 'get_dummies()' function generates three columns. We cannot overwrite the single 'home_ownership' column with them, because there is more than one encoded column. Instead, we add the three encoded columns to the main dataframe and then drop the original 'home_ownership' column.

In [15]:
# adding the three encoded columns to the dataframe
dummies = pd.get_dummies(df['home_ownership'], drop_first=True)
df = pd.concat([df, dummies], axis=1)

In [16]:
# dropping the 'home_ownership' column
df.drop('home_ownership', axis=1, inplace=True)

In [17]:
# checking the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 29 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amnt             396030 non-null  float64
 1   term                  396030 non-null  object 
 2   int_rate              396030 non-null  float64
 3   installment           396030 non-null  float64
 4   grade                 396030 non-null  object 
 5   sub_grade             396030 non-null  object 
 6   emp_title             373103 non-null  object 
 7   emp_length            377729 non-null  object 
 8   annual_inc            396030 non-null  float64
 9   verification_status   396030 non-null  object 
 10  issue_d               396030 non-null  object 
 11  loan_status           396030 non-null  uint8  
 12  purpose               396030 non-null  object 
 13  title                 394275 non-null  object 
 14  dti                   396030 non-null  float64
 15  

In [18]:
# checking which columns exist in the dataframe and which do not
columnList = ['OTHER', 'RENT', 'OWN', 'home_ownership']

for col in columnList:
  if col in df.columns:
    print(f"'{col}' column exists in the dataframe.")
  else:
    print(f"'{col}' column does not exist in the dataframe.")

'OTHER' column exists in the dataframe.
'RENT' column exists in the dataframe.
'OWN' column exists in the dataframe.
'home_ownership' column does not exist in the dataframe.


comment : The 'home_ownership' column is now encoded as three separate columns.
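
The encode, concatenate, and drop steps can also be done in a single call by passing the whole dataframe to `get_dummies()` with the `columns` argument. The sketch below uses a toy dataframe with illustrative values; the `'home'` prefix is an assumption added here so the new column names show where they came from.

```python
import pandas as pd

# Toy dataframe standing in for the loan data (illustrative values only)
toy = pd.DataFrame({
    'loan_amnt': [10000.0, 8000.0, 15600.0, 7200.0],
    'home_ownership': ['RENT', 'MORTGAGE', 'OWN', 'OTHER'],
})

# Encodes 'home_ownership', appends the dummy columns, and removes the original
encoded = pd.get_dummies(toy, columns=['home_ownership'], prefix='home',
                         drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

The 'MORTGAGE' category is the one dropped, so a row with all three dummy columns equal to 0 represents 'MORTGAGE'.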

<br>

## 2. Using Pandas 'replace()' Function

<br>

### The `'application_type'` column will be used in this illustration to perform categorical data encoding using the pandas `replace()` function.


In [19]:
# counting the categories of the 'application_type' column
df['application_type'].value_counts()

INDIVIDUAL    395319
JOINT            425
DIRECT_PAY       286
Name: application_type, dtype: int64

In [20]:
# extracting the categories of the 'application_type' column as an Index (ordered by frequency)
category = df['application_type'].value_counts().keys()
category

Index(['INDIVIDUAL', 'JOINT', 'DIRECT_PAY'], dtype='object')

In [21]:
# creating a dictionary using the category-list elements and their indices
replaceDict = dict( (i, index) for (index, i) in enumerate(category) )
replaceDict

{'DIRECT_PAY': 2, 'INDIVIDUAL': 0, 'JOINT': 1}
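
The dictionary above numbers the categories by descending frequency, because `value_counts()` sorts its result that way. As a hedged alternative, `pd.factorize()` assigns codes in order of first appearance instead. The sketch below shows this on a toy Series with illustrative values.

```python
import pandas as pd

# Toy stand-in for the 'application_type' column (illustrative values only)
app_type = pd.Series(['INDIVIDUAL', 'JOINT', 'INDIVIDUAL', 'DIRECT_PAY'])

# factorize returns integer codes plus the unique categories,
# numbered in order of first appearance
codes, uniques = pd.factorize(app_type)

# Rebuild the equivalent category -> code dictionary
code_dict = {cat: i for i, cat in enumerate(uniques)}

print(codes.tolist())
print(code_dict)
```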

In [22]:
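# encoding the categories using the dictionary
# (assigning the result back avoids relying on an in-place change to a column selection)
df['application_type'] = df['application_type'].replace(replaceDict)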

<br>

If we prefer to hardcode the category dictionary, the equivalent code is:


```
replaceDict = {'INDIVIDUAL': 0, 'JOINT': 1, 'DIRECT_PAY': 2}
df['application_type'] = df['application_type'].replace(replaceDict)
```

<br>




In [23]:
# checking the encoded categories of the column 'application_type'
df['application_type'].value_counts()

0    395319
1       425
2       286
Name: application_type, dtype: int64

comment : All the categories of the 'application_type' column are encoded.
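
`Series.map()` is a commonly used alternative to `replace()` for dictionary-based encoding. The sketch below, on toy Series with illustrative values, shows one practical difference: `map()` turns any category missing from the dictionary into NaN, which makes gaps in the mapping easy to detect.

```python
import pandas as pd

# Toy stand-in for the 'application_type' column (illustrative values only)
app_type = pd.Series(['INDIVIDUAL', 'JOINT', 'INDIVIDUAL', 'DIRECT_PAY'])

mapping = {'INDIVIDUAL': 0, 'JOINT': 1, 'DIRECT_PAY': 2}
encoded = app_type.map(mapping)

# A category that is not in the dictionary ('UNKNOWN' is hypothetical) becomes NaN
unmapped = pd.Series(['INDIVIDUAL', 'UNKNOWN']).map(mapping)

print(encoded.tolist())
print(unmapped.isna().tolist())
```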

<br>

## 3. Apply Custom Function

<br>

### The `'application_type'` column will be used in this illustration to perform categorical data encoding using a custom function.


In [24]:
# creating a copy of the original dataset
df = originalDF.copy()

In [25]:
# counting the categories of the 'application_type' column
df['application_type'].value_counts()

INDIVIDUAL    395319
JOINT            425
DIRECT_PAY       286
Name: application_type, dtype: int64

In [26]:
# defining a custom function to encode the 'application_type' column
def column_encoder(item):
  '''
    Takes a single value of the 'application_type' column and
    encodes it as 1 ('INDIVIDUAL'), 2 ('JOINT'), or 3 (any other value).
  '''

  # normalising the case so that e.g. 'joint' and 'JOINT' are treated alike
  item = item.upper()

  if item == 'INDIVIDUAL':
    return 1
  elif item == 'JOINT':
    return 2
  else:
    return 3

In [27]:
# applying the custom function to 'application_type' column
df['application_type'] = df['application_type'].apply(column_encoder)

In [28]:
# checking the encoded categories of the column 'application_type'
df['application_type'].value_counts()

1    395319
2       425
3       286
Name: application_type, dtype: int64

comment : All the categories of the 'application_type' column are encoded.
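
Note that the `else` branch of `column_encoder` assigns 3 to every value other than 'INDIVIDUAL' and 'JOINT', so an unexpected value would be encoded the same way as 'DIRECT_PAY'. The sketch below is one possible variant that flags unexpected input instead. The function name, the `CODES` dictionary, and the `unknown=-1` sentinel are assumptions introduced for this illustration, and the sample values are not taken from the dataset.

```python
import pandas as pd

# Same codes as the custom function above, written as a lookup table
CODES = {'INDIVIDUAL': 1, 'JOINT': 2, 'DIRECT_PAY': 3}

def encode_application_type(item, unknown=-1):
    """Return the integer code for one value; unexpected values get `unknown`."""
    return CODES.get(str(item).upper(), unknown)

# Toy Series with mixed case and one unexpected value (illustrative only)
sample = pd.Series(['individual', 'JOINT', 'DIRECT_PAY', 'n/a'])
encoded = sample.apply(encode_application_type)

print(encoded.tolist())
```

Using a sentinel keeps unexpected categories visible in a later `value_counts()` check rather than silently merging them into an existing code.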