
In [49]:
# One-hot encoding is a popular technique for handling categorical variables.
# It works by creating dummy variables.
# Dummy variables are “proxy” variables that stand in for categorical data in regression models.
# Each category gets its own column; these 0/1 columns are the dummy variables.

import pandas

data = pandas.read_csv('LandPrice.csv')

data

Unnamed: 0,Town,Area,Price
0,Dhaka,2600,550000
1,Dhaka,3000,565000
2,Dhaka,3200,610000
3,Dhaka,3600,680000
4,Dhaka,4000,725000
5,Chittagong,2600,585000
6,Chittagong,2800,615000
7,Chittagong,3300,650000
8,Chittagong,3600,710000
9,Khulna,2600,575000


In [50]:
# Create the dummy variable columns with the pandas.get_dummies() function

pandas.get_dummies(data.Town)

Unnamed: 0,Chittagong,Dhaka,Khulna
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
5,1,0,0
6,1,0,0
7,1,0,0
8,1,0,0
9,0,0,1


In [51]:
dummies = pandas.get_dummies(data.Town)

dummies

Unnamed: 0,Chittagong,Dhaka,Khulna
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
5,1,0,0
6,1,0,0
7,1,0,0
8,1,0,0
9,0,0,1


In [52]:
# Concatenate the two DataFrames, data and dummies, column-wise

merged = pandas.concat([data, dummies], axis='columns')

merged

Unnamed: 0,Town,Area,Price,Chittagong,Dhaka,Khulna
0,Dhaka,2600,550000,0,1,0
1,Dhaka,3000,565000,0,1,0
2,Dhaka,3200,610000,0,1,0
3,Dhaka,3600,680000,0,1,0
4,Dhaka,4000,725000,0,1,0
5,Chittagong,2600,585000,1,0,0
6,Chittagong,2800,615000,1,0,0
7,Chittagong,3300,650000,1,0,0
8,Chittagong,3600,710000,1,0,0
9,Khulna,2600,575000,0,0,1


[Multicollinearity](https://www.youtube.com/watch?v=ekuD8JUdL6M)

**Dummy Variable Trap** : <br>
The dummy variable trap is a scenario in which some attributes are highly correlated (multicollinear), so that one variable predicts the value of the others. With one-hot encoding, any one dummy variable can be predicted exactly from the remaining dummy variables, because the dummy columns of a row always sum to 1.
Using all of the dummy variables in a regression model therefore leads to the dummy variable trap, so the model should be built with one dummy variable excluded.

**For Example** : <br>
Consider gender encoded as two dummy variables, male and female, each taking the value 0 or 1. Including both is redundant: in this encoding, a person who is not male is female, so a single variable carries all the information. Using only one of the two in the regression model avoids the dummy variable trap.
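
The redundancy described above can be checked numerically. The sketch below is only an illustration on a small made-up `towns` series (the town names come from this notebook, the rows do not): with an intercept column plus all three dummies the design matrix loses a rank, while `drop_first=True` restores full column rank.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not the LandPrice.csv rows
towns = pd.Series(["Dhaka", "Dhaka", "Chittagong", "Khulna", "Chittagong"], name="Town")

# All dummy columns: every row sums to 1, which duplicates the intercept column
full = pd.get_dummies(towns, dtype=int)
row_sums = full.sum(axis=1)

# Intercept + all dummies: 4 columns, but they are linearly dependent
design_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(design_full)

# drop_first=True removes the first category (alphabetically, Chittagong)
reduced = pd.get_dummies(towns, drop_first=True, dtype=int)
design_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(design_reduced)

print(design_full.shape[1], rank_full)        # number of columns vs. rank
print(design_reduced.shape[1], rank_reduced)  # number of columns vs. rank
```

The dropped category becomes the baseline: a row with all remaining dummies equal to 0 belongs to it.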

In [53]:
# Drop the Town column (now encoded) and the Chittagong dummy column (to avoid the dummy variable trap)

final = merged.drop(['Town', 'Chittagong'], axis='columns')

final

Unnamed: 0,Area,Price,Dhaka,Khulna
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,0
6,2800,615000,0,0
7,3300,650000,0,0
8,3600,710000,0,0
9,2600,575000,0,1


In [54]:
# Create a linear regression model

from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [55]:
# Build the feature matrix x by dropping the target column, Price

x = final.drop('Price', axis='columns')

x

Unnamed: 0,Area,Dhaka,Khulna
0,2600,1,0
1,3000,1,0
2,3200,1,0
3,3600,1,0
4,4000,1,0
5,2600,0,0
6,2800,0,0
7,3300,0,0
8,3600,0,0
9,2600,0,1


In [57]:
y = final.Price

y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: Price, dtype: int64

In [58]:
model.fit(x, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [59]:
# Predict the price for Area=2700, Town=Dhaka (input order: Area, Dhaka, Khulna)

model.predict([[2700, 1, 0]])

array([552399.48398195])

In [60]:
# Predict the price for Area=3000, Town=Khulna

model.predict([[3000, 0, 1]])

array([616155.12792948])

In [61]:
# Predict the price for Area=3500, Town=Chittagong (both dummies are 0)

model.predict([[3500, 0, 0]])

array([693931.41259943])

In [62]:
# score() measures how well the linear regression model fits the data.
# It predicts a value for every row of x and compares the predictions with y,
# returning the coefficient of determination R^2 (1.0 is a perfect fit).

model.score(x, y)

0.9573929037221873
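
For a `LinearRegression`, the value returned by `score()` is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on a deliberately small example: it uses only the five Dhaka rows from the table above and a single feature, so the number it prints is not the score reported above. The names `demo_model`, `ss_res`, and `ss_tot` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The five Dhaka rows from the table above (Area, Price)
area = np.array([[2600], [3000], [3200], [3600], [4000]])
price = np.array([550000, 565000, 610000, 680000, 725000])

demo_model = LinearRegression().fit(area, price)
predicted = demo_model.predict(area)

# Residual sum of squares: error left after the fit
ss_res = np.sum((price - predicted) ** 2)
# Total sum of squares: error of always predicting the mean price
ss_tot = np.sum((price - price.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, demo_model.score(area, price))  # the two values should agree
```

Because the score here is computed on the same rows the model was trained on, it describes goodness of fit rather than performance on unseen data.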

In [63]:
# One-hot encoding with scikit-learn

# ColumnTransformer applies transformers to selected columns of an array or pandas DataFrame.
# With remainder='passthrough', all columns not named in transformers are passed through unchanged.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

column = ColumnTransformer([('Town', OneHotEncoder(), [0])], remainder = 'passthrough')


In [None]:
# Now use the one-hot encoder to create a dummy variable for each town

x = column.fit_transform(data[['Town', 'Area']])

x

array([[0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 4.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 2.8e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.3e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.9e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.1e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03]])

In [None]:
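# Drop the first encoded column (Chittagong) to avoid the dummy variable trap
# Remaining column order: Dhaka, Khulna, Area

x = x[:, 1:]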

x

array([[1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 3.6e+03]])
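
As an alternative to slicing off the first column by hand, `OneHotEncoder` accepts `drop='first'` (available in reasonably recent scikit-learn releases), which omits the first category during encoding. The sketch below compares both routes on a four-row `sample` frame built from rows that appear in the table above; `sparse_threshold=0` is set only to force dense output so the arrays can be compared directly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Small sample for illustration (Town/Area pairs taken from the table above)
sample = pd.DataFrame({
    "Town": ["Dhaka", "Chittagong", "Khulna", "Dhaka"],
    "Area": [2600, 2800, 2600, 3000],
})

# Route 1: encode every category, then slice off the first column manually
full_ct = ColumnTransformer(
    [("town", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
sliced = full_ct.fit_transform(sample)[:, 1:]

# Route 2: let the encoder drop the first category itself
drop_ct = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
dropped = drop_ct.fit_transform(sample)

print(np.array_equal(sliced, dropped))
```

Letting the encoder drop the column keeps the choice of baseline category in one place, so it is applied consistently when new data is transformed later.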

In [None]:
y = data.Price.values

y

array([550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,
       710000, 575000, 600000, 620000, 695000])

In [64]:
# Create another linear regression model object,
# train it on the data produced by one-hot encoding,
# then predict prices and compare them with the dummy-variable results above

model_object = LinearRegression()
model_object.fit(x, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [66]:
# Predict the price for Area=2700, Town=Dhaka
# The encoded column order is [Dhaka, Khulna, Area], so Area goes last

model_object.predict([[1, 0, 2700]])        # same result as the dummy-variable model

array([552399.48398195])

In [67]:
# Predict the price for Area=3000, Town=Khulna
# The encoded column order is [Dhaka, Khulna, Area], so Area goes last

model_object.predict([[0, 1, 3000]])       # same result as the dummy-variable model

array([616155.12792948])

In [68]:
# Predict the price for Area=3500, Town=Chittagong
# The encoded column order is [Dhaka, Khulna, Area], so Area goes last

model_object.predict([[0, 0, 3500]])       # same result as the dummy-variable model

array([693931.41259943])
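
Passing raw lists to `predict()` depends on remembering the encoded column order. One way to avoid that, sketched below, is to wrap the encoder and the regressor in a scikit-learn pipeline and query it with a DataFrame that uses the original column names. The names `frame`, `preprocess`, `pipeline`, and `query` are introduced here for illustration, and `frame` copies only the ten rows displayed at the top of this notebook, so its predictions need not match the outputs above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# The ten rows displayed at the top of this notebook
frame = pd.DataFrame({
    "Town": ["Dhaka"] * 5 + ["Chittagong"] * 4 + ["Khulna"],
    "Area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "Price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})

# Encode Town (dropping the first category) and pass Area through unchanged
preprocess = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["Town"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# The pipeline applies the same encoding at fit time and at predict time
pipeline = make_pipeline(preprocess, LinearRegression())
pipeline.fit(frame[["Town", "Area"]], frame["Price"])

# Queries use the original column names, so column order is handled for us
query = pd.DataFrame({"Town": ["Dhaka"], "Area": [2700]})
prediction = pipeline.predict(query)
print(prediction)
```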