# One-hot-encoding

In this exercise you will implement the one-hot encoding algorithm from scratch.
You may use pandas, but not scikit-learn, to implement it.

You can start from the code below or write your own from scratch.

Hands on!

One-hot encoding is a popular method in machine learning for representing categorical variables, also called nominal variables when their values have no inherent order.

Most machine learning algorithms cannot handle categorical variables directly.

They require the information to be encoded as numbers.

Categorical variables can be used with these algorithms by converting them into vectors of zeros and ones using one-hot encoding.
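
To make the idea concrete, here is a minimal sketch in plain Python. It uses the first five `PropertyArea` values shown in the table further down; the variable names are illustrative and are not part of the exercise code.

```python
# First five PropertyArea values from the loans table
values = ["Urban", "Rural", "Urban", "Urban", "Urban"]

# Fix an ordering of the distinct categories so each one gets its own position
categories = sorted(set(values))  # ['Rural', 'Urban']

# Each value becomes a vector with a single 1 at its category's position
one_hot = [[1 if value == category else 0 for category in categories]
           for value in values]

for value, vector in zip(values, one_hot):
    print(value, vector)
```

Each row has exactly one 1, which is where the name "one-hot" comes from.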


In [12]:
import pandas as pd

# Load the loans prediction dataset
df = pd.read_csv("data/data_loans.csv")

Before encoding

In [13]:
df

Unnamed: 0,LoanID,Gender,Married,Dependents,Education,SelfEmployed,ApplicantIncome,CoapplicantIncome,LoanAmount,LoanAmountTerm,CreditHistory,PropertyArea,LoanStatus
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y
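
Before picking a column to encode, it can help to check which columns hold non-numeric values. The sketch below assumes a small hand-built frame (`sample`, a hypothetical name) copied from the first three rows above, so that it runs without the CSV file.

```python
import pandas as pd

# A few columns from the first three rows of the table above
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Male"],
    "Married": ["No", "Yes", "Yes"],
    "ApplicantIncome": [5849, 4583, 3000],
    "PropertyArea": ["Urban", "Rural", "Urban"],
    "LoanStatus": ["Y", "N", "Y"],
})

# Columns that are not numeric are treated here as categorical candidates
categorical_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns)

# Each distinct value of a candidate column would become its own 0/1 column
for name in categorical_columns:
    print(name, sample[name].unique().tolist())
```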


In [14]:

def one_hot_encoding(df, column_name):
    """
    Perform one-hot-encoding on the specified column of the input dataframe
    :param df: Dataframe to be encoded
    :param column_name: Column name to be encoded
    :return: Dataframe with one-hot-encoded columns
    """
    # your code here
    # Encode only the target column with pandas' get_dummies; dtype=int gives 0/1 values
    dummies = pd.get_dummies(df[column_name], prefix=column_name, dtype=int)

    # Concatenate the original dataframe with the one-hot columns
    df_new = pd.concat([df, dummies], axis=1)

    # Drop the original column, which the one-hot columns now replace
    df_new = df_new.drop(column_name, axis=1)

    return df_new

# One-hot-encode the categorical column 'LoanStatus'
df = one_hot_encoding(df, 'LoanStatus')



After encoding

In [15]:
df

Unnamed: 0,LoanID,Gender,Married,Dependents,Education,SelfEmployed,ApplicantIncome,CoapplicantIncome,LoanAmount,LoanAmountTerm,CreditHistory,PropertyArea,LoanStatus_N,LoanStatus_Y
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,0,1
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,1,0
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,0,1
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,0,1
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,0,1
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,0,1
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,0,1
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,0,1
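
The exercise asks for an implementation from scratch, while the function above relies on `pd.get_dummies`. As one possible alternative, the sketch below builds the indicator columns by hand. The name `one_hot_encoding_manual` and the small `sample` frame are hypothetical and only serve the illustration.

```python
import pandas as pd


def one_hot_encoding_manual(df, column_name):
    """Hypothetical variant of one_hot_encoding that does not call pd.get_dummies."""
    # Start from a copy of the dataframe without the column being encoded
    result = df.drop(columns=[column_name])

    # dropna() keeps missing values out of the list of categories
    categories = sorted(df[column_name].dropna().unique())

    for category in categories:
        # Comparing the column with one category gives booleans; cast them to 0/1
        result[f"{column_name}_{category}"] = (df[column_name] == category).astype(int)

    return result


# Three rows taken from the loans table, with only two columns kept
sample = pd.DataFrame({
    "LoanID": ["LP001002", "LP001003", "LP001005"],
    "LoanStatus": ["Y", "N", "Y"],
})
encoded = one_hot_encoding_manual(sample, "LoanStatus")
print(encoded)
```

Sorting the categories keeps the column order stable between runs, so `LoanStatus_N` comes before `LoanStatus_Y` as in the output above.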
