# Encoding & Decoding in Machine Learning

#### Step-1 Filling Null Values of Dataset.
#### Step-2 One Hot Encoding & Dummy Variables.
#### Step-3 What is Label Encoding & How to use it.
#### Step-4 Ordinal encoding & its methods

# Step-1 Filling Null Values of Dataset

In [173]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [174]:
import warnings
warnings.filterwarnings("ignore")

In [175]:
dataset = pd.read_csv(r"C:\Users\jites\OneDrive\Desktop\archive\loan.csv")
dataset.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y


In [176]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [177]:
dataset.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [178]:
#dataset["Gender"].fillna(dataset["Gender"].mode()[0],inplace=True)  

In [179]:
#dataset["Married"].fillna(dataset["Married"].mode()[0],inplace=True)

In [180]:
for i in dataset.select_dtypes(include="object").columns:        # only object-dtype (categorical) columns are affected
    dataset[i] = dataset[i].fillna(dataset[i].mode()[0])         # fill with the most frequent value; assign back instead of inplace on a column selection

In [181]:
for i in dataset.select_dtypes(include="float64").columns:       # only float64 columns are affected
    dataset[i] = dataset[i].fillna(dataset[i].mode()[0])         # fill with the most frequent value; assign back instead of inplace on a column selection

In [182]:
dataset.isnull().sum()                                          # verify that no null values remain

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
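
The cells above fill every column with its mode. As a minimal sketch of one common variation, the example below fills numeric columns with the median instead and keeps the mode for categorical columns. The `toy` frame and its values are hypothetical stand-ins for the loan data, and the median is an alternative choice, not what the notebook above does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan data.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [100.0, np.nan, 300.0, 120.0],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric column: the median is less sensitive to outliers than the mean.
        fill_value = filled[col].median()
    else:
        # Categorical column: use the most frequent value.
        fill_value = filled[col].mode()[0]
    # Assign the result back rather than using inplace=True on a column selection.
    filled[col] = filled[col].fillna(fill_value)

total_nulls = int(filled.isnull().sum().sum())
print(filled)
print(total_nulls)
```

Whether mode, median, or mean is appropriate depends on the column, so it is worth checking each column's distribution before choosing.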

# Step-2 One Hot Encoding & Dummy Variables

###  Two Ways to Do One Hot Encoding
####  1. get_dummies() Method (Pandas)
####  2. OneHotEncoder() Method (Sklearn Module)

##  1. get_dummies() Method (Pandas)

In [183]:
endata = dataset[["Gender","Married"]]
endata

Unnamed: 0,Gender,Married
0,Male,No
1,Male,Yes
2,Male,Yes
3,Male,Yes
4,Male,No
...,...,...
609,Female,No
610,Male,Yes
611,Male,Yes
612,Male,Yes


In [184]:
pd.get_dummies(endata).head(10)

Unnamed: 0,Gender_Female,Gender_Male,Married_No,Married_Yes
0,False,True,True,False
1,False,True,False,True
2,False,True,False,True
3,False,True,False,True
4,False,True,True,False
5,False,True,False,True
6,False,True,False,True
7,False,True,False,True
8,False,True,False,True
9,False,True,False,True
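
`pd.get_dummies` can also drop the first level of each column and return integers instead of booleans. The sketch below shows both options on a small hypothetical frame (`toy` is not part of the loan data).

```python
import pandas as pd

# Hypothetical toy frame with the same two columns as endata.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["No", "Yes", "Yes"],
})

# drop_first=True removes the first level of each column (Female, No),
# dtype=int returns 0/1 instead of False/True.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies)
```

Dropping one level per column avoids carrying a column that is fully determined by the others.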


##  2. OneHotEncoder() Method (Sklearn Module)

In [185]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
arr = ohe.fit_transform(endata).toarray()
arr

array([[0., 1., 1., 0.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       ...,
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [1., 0., 1., 0.]], shape=(614, 4))

In [186]:
pd.DataFrame(arr,columns=["Gender_Female","Gender_Male","Married_No","Married_Yes"])

Unnamed: 0,Gender_Female,Gender_Male,Married_No,Married_Yes
0,0.0,1.0,1.0,0.0
1,0.0,1.0,0.0,1.0
2,0.0,1.0,0.0,1.0
3,0.0,1.0,0.0,1.0
4,0.0,1.0,1.0,0.0
...,...,...,...,...
609,1.0,0.0,1.0,0.0
610,0.0,1.0,0.0,1.0
611,0.0,1.0,0.0,1.0
612,0.0,1.0,0.0,1.0


#### Shrinking the encoded dataset: dropping the redundant column of each feature with drop="first"

In [187]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop="first")
arr = ohe.fit_transform(endata).toarray()
arr

array([[1., 0.],
       [1., 1.],
       [1., 1.],
       ...,
       [1., 1.],
       [1., 1.],
       [0., 0.]], shape=(614, 2))

In [188]:
pd.DataFrame(arr,columns=["Gender_Male","Married_Yes"])

Unnamed: 0,Gender_Male,Married_Yes
0,1.0,0.0
1,1.0,1.0
2,1.0,1.0
3,1.0,1.0
4,1.0,0.0
...,...,...
609,0.0,0.0
610,1.0,1.0
611,1.0,1.0
612,1.0,1.0


# Step-3 What is Label Encoding & How to use it.

#### Label encoding converts a categorical column into integers so that it can be used by machine learning models that accept only numerical input, which makes it a common pre-processing step. It assigns a unique integer to each category. scikit-learn's LabelEncoder assigns the integers in sorted (alphabetical) order of the category names, so the numbers identify categories but carry no meaningful ranking.

In [189]:
dataset.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,120.0,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y


In [190]:
dataset["Property_Area"].unique()

array(['Urban', 'Rural', 'Semiurban'], dtype=object)

In [191]:
from sklearn.preprocessing import LabelEncoder

In [192]:
label_en = LabelEncoder()
label_en.fit(dataset["Property_Area"])

In [193]:
dataset["Property_Area"] = label_en.transform(dataset["Property_Area"])   # replace the "Property_Area" strings with their integer codes
dataset.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,120.0,360.0,1.0,2,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,0,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,2,Y


In [194]:
dataset["Property_Area"].unique()                                        # the column now holds integer codes

array([2, 0, 1])
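
The codes can be decoded back to the original strings. The sketch below, using a short hypothetical list in place of the `Property_Area` column, shows the mapping stored in `classes_` and the round trip through `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of Property_Area values.
areas = ["Urban", "Rural", "Semiurban", "Urban"]

encoder = LabelEncoder()
codes = encoder.fit_transform(areas)

# classes_ holds the categories in sorted order; the position is the code.
classes = list(encoder.classes_)

# inverse_transform maps the integer codes back to the original strings.
decoded = list(encoder.inverse_transform(codes))

print(classes)
print(codes)
print(decoded)
```

Because the categories are sorted alphabetically, Rural becomes 0, Semiurban 1, and Urban 2, which matches the codes seen in the output above.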

# Step-4 Ordinal encoding & Its methods 

#### Ordinal encoding is a technique to transform categorical features into a numerical format. In ordinal encoding, labels are translated to numbers based on their ordinal relationship to one another. For example, if a feature contains {low, medium, high}, it can be converted into {1, 2, 3}, where 1 represents low, 2 represents medium, and 3 represents high.

### Two methods to perform Ordinal Encoding
#### 1. Sklearn     
#### 2. Map function (Pandas)

# 1. Sklearn 

In [200]:
sk_dataset = pd.read_csv(r"C:\Users\jites\OneDrive\Desktop\archive\loan.csv")
sk_dataset.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y


In [201]:
ord_data = [['Urban', 'Rural', 'Semiurban']]

In [202]:
from sklearn.preprocessing import OrdinalEncoder

In [203]:
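Or_en = OrdinalEncoder(categories=ord_data)                     # codes follow the order given in ord_data
# Fit once and write the codes back into sk_dataset (the original referenced an undefined Ord_dataset).
sk_dataset["Property_Area"] = Or_en.fit_transform(sk_dataset[["Property_Area"]]).ravel()
sk_dataset.head(3)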

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,0.0,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,1.0,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,0.0,Y
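
To connect this with the low/medium/high example from the definition, here is a minimal sketch on a hypothetical `risk` column (the column name and values are assumptions for illustration). Note that `OrdinalEncoder` starts counting at 0, so the codes are 0, 1, 2 rather than 1, 2, 3.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature with a natural order: low < medium < high.
toy = pd.DataFrame({"risk": ["medium", "low", "high", "low"]})

# The position of each category in this list becomes its code.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])

# fit_transform returns a 2-D array; ravel() flattens it to one column.
toy["risk_code"] = encoder.fit_transform(toy[["risk"]]).ravel()
print(toy)
```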


# 2. Map function (Pandas)

In [204]:
map_dataset = pd.read_csv(r"C:\Users\jites\OneDrive\Desktop\archive\loan.csv")
map_dataset.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y


In [210]:
ord_data1 = {'Urban':9, 'Rural':999, 'Semiurban':99}

In [211]:
map_dataset["Property_Area"]=map_dataset["Property_Area"].map(ord_data1)

In [213]:
map_dataset.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,9,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,999,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,9,Y
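
One thing to watch with `map`: any value missing from the dictionary becomes NaN without an error. The sketch below uses a hypothetical series and an assumed ordering (Rural < Semiurban < Urban, which is an illustration and not the mapping used above) to show how to detect unmapped values.

```python
import pandas as pd

# Hypothetical values; "Suburban" is deliberately absent from the mapping.
areas = pd.Series(["Urban", "Rural", "Semiurban", "Suburban"])

# Assumed ordering for illustration: Rural < Semiurban < Urban.
order = {"Rural": 0, "Semiurban": 1, "Urban": 2}

codes = areas.map(order)

# Values not found in the dictionary are mapped to NaN; list them explicitly.
unmapped = areas[codes.isna()].unique().tolist()

print(codes)
print(unmapped)
```

Checking for unmapped values right after the mapping catches typos and unexpected categories before they turn into missing data.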
