# One Hot Encoding
    

- Converting categorical data to numerical data
    - Categorical values: names, labels, strings
    - Most ML models expect numeric input, so categorical values must be encoded first
    - Types
        - Label encoding
        - One-hot encoding
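
To make the difference between the two encoding types concrete, here is a minimal sketch using only pandas. The `colors` column is a hypothetical toy example and not part of this notebook's data, and `cat.codes` is used here as one simple way to get label codes, not necessarily the method used elsewhere.

```python
import pandas as pd

# Hypothetical toy column, not part of the notebook's data
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Label encoding: each category becomes a single integer code
# (pandas assigns codes in sorted category order: blue=0, green=1, red=2)
label_encoded = colors.astype("category").cat.codes

# One-hot encoding: each category becomes its own 0/1 column
one_hot = pd.get_dummies(colors, dtype=int)

print(label_encoded.tolist())  # [2, 1, 0, 1]
print(one_hot)
```

Label encoding implies an order (blue < green < red) that may not exist in the data, which is why one-hot encoding is usually preferred for nominal features.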

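- One-hot encoding
    - the process of creating dummy variables
    - used when the order of the categories does not matter
    - that is, when the features are nominal (no inherent order)
    - a new binary column is created for every category of the feature
    - each row gets 1 in the column of its category and 0 in the others
    - closely related to dummy encoding, which keeps only n-1 of the n columns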

- Dummy variable trap:
    - multicollinearity: two or more input variables are highly correlated, which can destabilize coefficient estimates and hurt the model, especially linear models
    - Solution:
        - remove one of the columns created by one-hot encoding
        - this is safe because the dummy variables carry redundant information: the dropped column can be recovered from the remaining ones
        - for a feature with n categories, drop one of the new columns and use n-1 dummy variables
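
The redundancy described above can be checked directly. The following is a minimal sketch, assuming pandas' `get_dummies` with its `drop_first` option as one way to obtain n-1 columns; the `team` values mirror the data used later in this notebook.

```python
import pandas as pd

team = pd.Series(["A", "A", "B", "B", "B", "B", "C", "C"], name="team")

# Full one-hot encoding: one column per category
full = pd.get_dummies(team, prefix="team", dtype=int)

# Every row sums to 1, so any one column equals 1 minus the sum of the others
row_sums = full.sum(axis=1)

# Dropping the first column removes that redundancy;
# a row of all zeros now stands for team "A"
reduced = pd.get_dummies(team, prefix="team", drop_first=True, dtype=int)

print(list(full.columns))     # ['team_A', 'team_B', 'team_C']
print(list(reduced.columns))  # ['team_B', 'team_C']
```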

In [1]:
import pandas as pd
import numpy as np

In [2]:
# create DataFrame
df=pd.DataFrame({'team':['A','A','B','B','B','B','C','C'],
                'points':[25,12,15,14,19,23,25,29]})

In [3]:
df

,team,points
0,A,25
1,A,12
2,B,15
3,B,14
4,B,19
5,B,23
6,C,25
7,C,29


# Perform One-Hot Encoding

In [4]:
# The OneHotEncoder class from scikit-learn is used
from sklearn.preprocessing import OneHotEncoder

In [5]:
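# Create an instance of OneHotEncoder; categories unseen during fit are encoded as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')

# Perform one-hot encoding on the 'team' column
# (output columns 0, 1, 2 correspond to teams A, B, C)
encoder_df = pd.DataFrame(encoder.fit_transform(df[['team']]).toarray())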

In [6]:
encoder_df

,0,1,2
0,1.0,0.0,0.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,1.0,0.0
5,0.0,1.0,0.0
6,0.0,0.0,1.0
7,0.0,0.0,1.0
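
The encoder above is created with `handle_unknown='ignore'`. As a minimal sketch of what that setting does, the example below fits on three teams and then transforms a value that was not seen during fitting. The category `"D"` and the `train`/`new_data` frames are hypothetical and only serve as illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"team": ["A", "B", "C"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# "D" is a hypothetical category that was not present during fit
new_data = pd.DataFrame({"team": ["B", "D"]})
encoded = encoder.transform(new_data).toarray()

# The known category "B" gets a 1 in its column;
# the unknown category "D" is encoded as a row of all zeros
print(encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise an error instead.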


In [7]:
# Merge the one-hot encoded columns back into the original DataFrame
# (join aligns rows on the index, which is 0..7 in both DataFrames)

final_df = df.join(encoder_df)

In [8]:
#view final df
print(final_df)

  team  points    0    1    2
0    A      25  1.0  0.0  0.0
1    A      12  1.0  0.0  0.0
2    B      15  0.0  1.0  0.0
3    B      14  0.0  1.0  0.0
4    B      19  0.0  1.0  0.0
5    B      23  0.0  1.0  0.0
6    C      25  0.0  0.0  1.0
7    C      29  0.0  0.0  1.0


In [9]:
# All three new columns are appended to the end of the original DataFrame

# Drop the original Categorical Variable

In [10]:
#drop 'team' column
final_df.drop('team',axis=1,inplace=True)

In [11]:
#View Final df
final_df

,points,0,1,2
0,25,1.0,0.0,0.0
1,12,1.0,0.0,0.0
2,15,0.0,1.0,0.0
3,14,0.0,1.0,0.0
4,19,0.0,1.0,0.0
5,23,0.0,1.0,0.0
6,25,0.0,0.0,1.0
7,29,0.0,0.0,1.0


# Practice Example

- Create a DataFrame with the columns 'Gender', 'Age', 'Degree'
- Add 10 rows of data to this data set
- Apply one-hot encoding to the columns
   - Gender
   - Degree

In [22]:
df1=pd.DataFrame({'Gender':['M','F','F','M','F','M','F','M','F','M'],
                 'Age':[19,20,21,24,18,22,21,20,19,20],
                 'Degree':['BSc','MSc','BTech','BTech','BSc','MSc','BTech','BSc','BTech','MSc']})


In [23]:
df1

,Gender,Age,Degree
0,M,19,BSc
1,F,20,MSc
2,F,21,BTech
3,M,24,BTech
4,F,18,BSc
5,M,22,MSc
6,F,21,BTech
7,M,20,BSc
8,F,19,BTech
9,M,20,MSc


In [24]:
# Create an instance of OneHotEncoder
encoder1 = OneHotEncoder(handle_unknown='ignore')

# Perform one-hot encoding on the 'Gender' and 'Degree' columns
# (output columns 0-1 are Gender F, M; columns 2-4 are Degree BSc, BTech, MSc)
encoder_df1 = pd.DataFrame(encoder1.fit_transform(df1[['Gender', 'Degree']]).toarray())

In [25]:
encoder_df1

,0,1,2,3,4
0,0.0,1.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0,1.0
2,1.0,0.0,0.0,1.0,0.0
3,0.0,1.0,0.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0
5,0.0,1.0,0.0,0.0,1.0
6,1.0,0.0,0.0,1.0,0.0
7,0.0,1.0,1.0,0.0,0.0
8,1.0,0.0,0.0,1.0,0.0
9,0.0,1.0,0.0,0.0,1.0


In [26]:
# Joining the encoder with the original df
final_df1=df1.join(encoder_df1)

In [27]:
# Viewing the joined dataframe
final_df1

,Gender,Age,Degree,0,1,2,3,4
0,M,19,BSc,0.0,1.0,1.0,0.0,0.0
1,F,20,MSc,1.0,0.0,0.0,0.0,1.0
2,F,21,BTech,1.0,0.0,0.0,1.0,0.0
3,M,24,BTech,0.0,1.0,0.0,1.0,0.0
4,F,18,BSc,1.0,0.0,1.0,0.0,0.0
5,M,22,MSc,0.0,1.0,0.0,0.0,1.0
6,F,21,BTech,1.0,0.0,0.0,1.0,0.0
7,M,20,BSc,0.0,1.0,1.0,0.0,0.0
8,F,19,BTech,1.0,0.0,0.0,1.0,0.0
9,M,20,MSc,0.0,1.0,0.0,0.0,1.0


In [28]:
# Dropping Gender and Degree column
final_df1.drop(['Gender','Degree'],axis=1,inplace=True)

In [29]:
# Viewing the final df
final_df1

,Age,0,1,2,3,4
0,19,0.0,1.0,1.0,0.0,0.0
1,20,1.0,0.0,0.0,0.0,1.0
2,21,1.0,0.0,0.0,1.0,0.0
3,24,0.0,1.0,0.0,1.0,0.0
4,18,1.0,0.0,1.0,0.0,0.0
5,22,0.0,1.0,0.0,0.0,1.0
6,21,1.0,0.0,0.0,1.0,0.0
7,20,0.0,1.0,1.0,0.0,0.0
8,19,1.0,0.0,0.0,1.0,0.0
9,20,0.0,1.0,0.0,0.0,1.0
