
In [2]:
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

In [3]:
df=pd.read_csv('/content/sample_data/homeprices_HotEncoding.csv')
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [4]:
dummies = pd.get_dummies(df.town)
dummies

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0


In [8]:
df=pd.concat([df,dummies],axis=1)
df

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0


### **We need to drop one one-hot encoded column to avoid the dummy variable trap.**

- The dummy variable trap, also known as dummy variable multicollinearity, occurs when the dummy variables in a regression model are linearly dependent, so that one of them can be predicted exactly from the others.
- This is a case of multicollinearity, where the independent variables are not independent of each other, which makes it difficult to interpret the coefficients of the regression model accurately.
- For example, if you know the values of monroe township and robinsville, you can infer the value of west windsor: when monroe township=0 and robinsville=0, west windsor must be 1.
- If you include all three dummy variables in the regression model together with an intercept, the model is perfectly multicollinear, because each dummy variable's value can be predicted exactly from the other two.
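
The linear dependence described above can be checked numerically. The following is a minimal sketch that rebuilds only the town column from the ten rows displayed earlier and compares the rank of the design matrix with and without the `west windsor` column. The names `towns`, `X_all`, and `X_dropped` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Town labels from the ten rows displayed in the table above.
towns = pd.Series(
    ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    name="town",
)
dummies = pd.get_dummies(towns, dtype=int)

# Every row has exactly one 1, so the dummy columns always sum to 1.
row_sums = dummies.sum(axis=1)

# Design matrices with an explicit intercept column of ones.
intercept = np.ones((len(dummies), 1))
X_all = np.hstack([intercept, dummies.to_numpy()])
X_dropped = np.hstack(
    [intercept, dummies.drop(columns="west windsor").to_numpy()]
)

# With all three dummies, the intercept equals their sum, so one column is redundant.
rank_all = np.linalg.matrix_rank(X_all)
rank_dropped = np.linalg.matrix_rank(X_dropped)
print("all dummies:", X_all.shape[1], "columns, rank", rank_all)
print("one dropped:", X_dropped.shape[1], "columns, rank", rank_dropped)
```

A rank lower than the number of columns means the least-squares coefficients are not uniquely determined, which is the trap in concrete terms.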

**NOTE: sklearn's LinearRegression still fits and predicts even if you don't drop one of the town columns. However, you should make a habit of handling the dummy variable trap yourself, in case the library you are using does not handle it for you.**
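
As a rough check of the note above, the sketch below fits `LinearRegression` twice, once with all three dummy columns and once with `west windsor` dropped, and compares the predictions. It assumes only the ten rows displayed earlier, so it illustrates the behaviour rather than reproducing the notebook's outputs. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten rows displayed in the table above.
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})
dummies = pd.get_dummies(df["town"], dtype=int)

X_all = pd.concat([df[["area"]], dummies], axis=1)   # keeps the redundant column
X_dropped = X_all.drop(columns="west windsor")       # one dummy removed

pred_all = LinearRegression().fit(X_all, df["price"]).predict(X_all)
pred_dropped = LinearRegression().fit(X_dropped, df["price"]).predict(X_dropped)

# Both feature sets span the same column space, so the fitted values should agree.
same_predictions = np.allclose(pred_all, pred_dropped)
print(same_predictions)
```

The predictions agree, but the individual coefficients of the model with all three dummies are harder to interpret, which is one reason to drop a column anyway.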

In [12]:
df.drop(['west windsor','town'],inplace=True,axis=1)
df

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,0
6,2800,615000,0,0
7,3300,650000,0,0
8,3600,710000,0,0
9,2600,575000,0,1
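
As an alternative to concatenating the dummies and dropping a column by hand, `pd.get_dummies` accepts `drop_first=True`. This is a sketch on the ten rows displayed earlier. Note that it drops the first category in sorted order (`monroe township`), not `west windsor` as the cell above does, so the baseline town differs.

```python
import pandas as pd

# The ten rows displayed in the table above.
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})

# Encode the town column in place and drop the first dummy in one call.
# New columns are prefixed with the original column name ("town_").
encoded = pd.get_dummies(df, columns=["town"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```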


In [13]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [14]:
model.fit(df[['area','monroe township','robinsville']],df.price)

In [15]:
model.predict(df[['area','monroe township','robinsville']])

array([539709.7398409 , 590468.71640508, 615848.20468716, 666607.18125134,
       717366.15781551, 579723.71533005, 605103.20361213, 668551.92431735,
       706621.15674048, 565396.15136531, 603465.38378844, 628844.87207052,
       692293.59277574])

In [16]:
model.score(df[['area','monroe township','robinsville']],df.price)

0.9573929037221872

In [17]:
model.predict([[3400,0,0]]) # 3400 sq ft home in west windsor



array([681241.66845839])

In [18]:
model.predict([[2800,0,1]]) # 2800 sq ft home in robinsville



array([590775.63964739])
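
Because the model was fitted on a DataFrame, recent scikit-learn versions may warn about missing feature names when `predict` receives a plain nested list. A possible way around this, sketched here on the ten rows displayed earlier, is to pass a DataFrame with the same column names. The names `features` and `query` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten rows displayed in the table above.
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})
features = ["area", "monroe township", "robinsville"]  # west windsor is the baseline
X = pd.concat([df[["area"]], pd.get_dummies(df["town"], dtype=int)], axis=1)[features]
model = LinearRegression().fit(X, df["price"])

# Query rows carry the same column names and order used during fitting.
query = pd.DataFrame(
    [[3400, 0, 0],   # 3400 sq ft home in west windsor
     [2800, 0, 1]],  # 2800 sq ft home in robinsville
    columns=features,
)
predictions = model.predict(query)
print(predictions)
```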


# Using sklearn OneHotEncoder

The first step is to use LabelEncoder to convert the town names into integers.


In [19]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [21]:
dfle = pd.read_csv('/content/sample_data/homeprices_HotEncoding.csv')
dfle.town = le.fit_transform(dfle.town)
dfle

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000


In [23]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# remainder='passthrough' leaves the remaining columns (every column other than index 0) untransformed
ct = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder = 'passthrough')

In [25]:
X = ct.fit_transform(dfle[['town','area']])
X

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

In [26]:
X = X[:,1:] # drop the first dummy column (monroe township) to avoid the dummy variable trap

In [28]:
model.fit(X,dfle['price'])

In [29]:
model.predict([[0,1,3400]]) # 3400 sq ft home in west windsor (columns: robinsville, west windsor, area)

array([681241.6684584])

In [30]:
model.predict([[1,0,2800]]) # 2800 sq ft home in robinsville (columns: robinsville, west windsor, area)

array([590775.63964739])
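
The LabelEncoder step and the manual `X[:,1:]` slice can also be folded into the encoder itself. The sketch below is a variant of the approach above, not the notebook's own method: it passes the string town column straight to `OneHotEncoder(drop="first")` and sets `sparse_threshold=0` so the transformer returns a dense array. It uses only the ten rows displayed earlier, so its predictions will not match the outputs shown above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# The ten rows displayed in the table above.
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})

ct = ColumnTransformer(
    # drop="first" removes the first category (monroe township) during encoding.
    [("town", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",  # the area column passes through untouched
    sparse_threshold=0,       # always return a dense array
)
X = ct.fit_transform(df[["town", "area"]])
# Column order: robinsville, west windsor, area

model = LinearRegression().fit(X, df["price"])

# New rows go through the same fitted transformer before prediction.
new_home = pd.DataFrame({"town": ["west windsor"], "area": [3400]})
prediction = model.predict(ct.transform(new_home))
print(X.shape, prediction)
```

Reusing the fitted `ct` for new rows keeps the column order consistent, so there is no need to remember which position belongs to which town.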