## Dummy Variables & One-Hot Encoding

Given this dataset, we want to build a function that predicts the price of a home, for example:

- A 3400 sq ft home in west windsor
- A 2800 sq ft home in robbinsville

The question is: how do we represent the text data (the town name) numerically?
One way is integer encoding (label encoding), which converts each town name to a number.
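
To make the integer-encoding option concrete, here is a minimal sketch using scikit-learn's `LabelEncoder` on the three town names from the dataset. The variable names (`towns`, `codes`, `mapping`) are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The three town names that appear in the dataset
towns = pd.Series(["monroe township", "west windsor", "robbinsville"])

le = LabelEncoder()
codes = le.fit_transform(towns)

# LabelEncoder sorts the class names, so codes follow alphabetical order
mapping = dict(zip(le.classes_, range(len(le.classes_))))
print(mapping)  # {'monroe township': 0, 'robbinsville': 1, 'west windsor': 2}
print(codes)    # [0 2 1]
```

A linear model would read these codes as quantities (west windsor = 2 being "twice" robbinsville = 1), an ordering the towns do not actually have. Dummy variables avoid this by giving each town its own 0/1 column.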

In [1]:
import pandas as pd

In [2]:
# Read in the dataset
df = pd.read_csv("text_houseprices.csv")
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robbinsville,2600,575000
10,robbinsville,2900,600000
11,robbinsville,3100,620000
12,robbinsville,3600,695000


In [9]:
# Create dummy variable columns with pd.get_dummies by passing in the categorical column.
# drop_first=True drops the first category (monroe township), which becomes the baseline
dummies = pd.get_dummies(df.town, drop_first=True)

In [10]:
dummies

Unnamed: 0,robbinsville,west windsor
0,0,0
1,0,0
2,0,0
3,0,0
4,0,0
5,0,1
6,0,1
7,0,1
8,0,1
9,1,0
10,1,0
11,1,0
12,1,0


In [11]:
# concatenate the two dataframes
merged = pd.concat([df, dummies], axis= 1)

In [12]:
merged

Unnamed: 0,town,area,price,robbinsville,west windsor
0,monroe township,2600,550000,0,0
1,monroe township,3000,565000,0,0
2,monroe township,3200,610000,0,0
3,monroe township,3600,680000,0,0
4,monroe township,4000,725000,0,0
5,west windsor,2600,585000,0,1
6,west windsor,2800,615000,0,1
7,west windsor,3300,650000,0,1
8,west windsor,3600,710000,0,1
9,robbinsville,2600,575000,1,0
10,robbinsville,2900,600000,1,0
11,robbinsville,3100,620000,1,0
12,robbinsville,3600,695000,1,0


### Avoid the dummy variable trap
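The dummy variable trap occurs when the dummy variables created by one-hot encoding are perfectly collinear: with one column per category, the columns always sum to 1, so any one of them can be predicted from the others. This makes the estimated coefficients of a regression model hard to interpret. Dropping one category avoids the trap; here `drop_first=True` in `pd.get_dummies` already dropped monroe township, so the cell below only removes the original text column.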
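
The collinearity can be checked numerically. The following is a small sketch, assuming a toy list of towns and a hypothetical helper `with_intercept` that prepends a column of ones (the intercept a linear regression fits). It compares the rank of the design matrix with and without `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Toy sample with all three towns (illustrative, not the full dataset)
towns = pd.Series([
    "monroe township", "west windsor", "robbinsville",
    "monroe township", "west windsor", "robbinsville",
])

full = pd.get_dummies(towns, dtype=float)                      # one column per town
reduced = pd.get_dummies(towns, drop_first=True, dtype=float)  # first town dropped

# Every row of the full encoding sums to 1, duplicating the intercept column
print(full.sum(axis=1).unique())

def with_intercept(dummies):
    """Hypothetical helper: prepend a column of ones to the dummy columns."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

# 4 columns but rank 3: one column is redundant (the trap)
rank_full = np.linalg.matrix_rank(with_intercept(full))
# 3 columns and rank 3: no redundancy once a category is dropped
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))
print(rank_full, rank_reduced)
```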

In [13]:
final = merged.drop(["town"], axis= 1)
final

Unnamed: 0,area,price,robbinsville,west windsor
0,2600,550000,0,0
1,3000,565000,0,0
2,3200,610000,0,0
3,3600,680000,0,0
4,4000,725000,0,0
5,2600,585000,0,1
6,2800,615000,0,1
7,3300,650000,0,1
8,3600,710000,0,1
9,2600,575000,1,0
10,2900,600000,1,0
11,3100,620000,1,0
12,3600,695000,1,0


In [15]:
# Import the linear regression model
from sklearn.linear_model import LinearRegression

model = LinearRegression()

In [16]:
# storing the independent variables into X
X = final.drop("price", axis= 1)
X

Unnamed: 0,area,robbinsville,west windsor
0,2600,0,0
1,3000,0,0
2,3200,0,0
3,3600,0,0
4,4000,0,0
5,2600,0,1
6,2800,0,1
7,3300,0,1
8,3600,0,1
9,2600,1,0
10,2900,1,0
11,3100,1,0
12,3600,1,0


In [17]:
y = final.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [18]:
# Train the model
model.fit(X, y)

LinearRegression()

In [25]:
# Predict the price of a 2800 sq ft house in monroe township
# (input order: area, robbinsville, west windsor; both dummies are 0)
model.predict([[2800, 0, 0]])

array([565089.22812299])

In [26]:
# Predict price for house in west windsor
model.predict([[3400, 0, 1]])

array([681241.6684584])

In [14]:
# Predict the price of a 3400 sq ft house in monroe township
model.predict([[3400, 0, 0]])



array([641227.69296925])
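
The predictions above are the intercept plus a weighted sum of the inputs. As a sketch of how to read the fitted model, the block below rebuilds the data inline from the values shown in the notebook outputs (an assumption, since the CSV file itself is not reproduced here) and recomputes one prediction by hand from `coef_` and `intercept_`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Dataset rebuilt from the values displayed in the notebook outputs
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robbinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

dummies = pd.get_dummies(df["town"], drop_first=True, dtype=int)
X = pd.concat([df[["area"]], dummies], axis=1)  # area, robbinsville, west windsor
y = df["price"]

model = LinearRegression().fit(X, y)

# One coefficient per column: price per sq ft, plus each town's offset
# relative to the dropped baseline (monroe township)
print(dict(zip(X.columns, model.coef_)))

# Manual prediction for 3400 sq ft in west windsor
row = np.array([3400, 0, 1])
manual = model.intercept_ + model.coef_ @ row
print(manual)
```

The two dummy coefficients are the price differences from monroe township at the same area, which is why the west windsor and monroe township predictions for 3400 sq ft differ by a constant amount.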

In [27]:
# Check the goodness of fit: score() returns R^2, here computed on the training data
model.score(X, y)

0.9573929037221872
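
For a regressor, `score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot. The sketch below recomputes it by hand; the data is rebuilt inline from the values shown in the notebook outputs (an assumption), and `r2_manual` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Dataset rebuilt from the values displayed in the notebook outputs
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robbinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

dummies = pd.get_dummies(df["town"], drop_first=True, dtype=int)
X = pd.concat([df[["area"]], dummies], axis=1)
y = df["price"]
model = LinearRegression().fit(X, y)

# Residual sum of squares: error left after the model's predictions
ss_res = np.sum((y - model.predict(X)) ** 2)
# Total sum of squares: error of always predicting the mean price
ss_tot = np.sum((y - y.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, model.score(X, y))
```

Because the score is measured on the same rows used for training, it describes how well the line fits this data rather than how well it would predict unseen homes.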

In [28]:
# SECOND METHOD
# Doing the same process using sklearn LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [29]:
dfle = df.copy()  # work on a copy so the original df is not modified
dfle.head(5)

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000


In [30]:
# Transform the town names to integer codes
dfle.town = le.fit_transform(dfle.town)

In [26]:
dfle

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000
10,1,2900,600000
11,1,3100,620000
12,1,3600,695000


In [27]:
# Put the predictors into one variable
X = dfle[["town", "area"]].values
X

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]], dtype=int64)

In [28]:
# Put the dependent (target) variable in y
y = dfle.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [29]:
# Creating the dummy variables
# import the OneHotEncoder to get the dummy variables
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()

In [41]:
X[:,:1]

array([[0],
       [0],
       [0],
       [0],
       [0],
       [2],
       [2],
       [2],
       [2],
       [1],
       [1],
       [1],
       [1]], dtype=int64)

In [42]:
# we are making use of the first variable (town) as the categorical feature
encode = ohe.fit_transform(X[:,:1]).toarray()

In [43]:
encode

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.]])

In [44]:
ohe.categories_

[array([0, 1, 2], dtype=int64)]
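
The categories reported by the encoder are the integer codes, not the town names. One way to recover which one-hot column belongs to which town is `LabelEncoder.inverse_transform`, sketched here on a short illustrative list of towns (the variable names are assumptions, not from the notebook).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

towns = np.array(["monroe township", "west windsor", "robbinsville", "west windsor"])

# Step 1: town names -> integer codes (alphabetical order)
le = LabelEncoder()
codes = le.fit_transform(towns).reshape(-1, 1)  # OneHotEncoder expects a 2D input

# Step 2: integer codes -> one-hot columns
ohe = OneHotEncoder()
encoded = ohe.fit_transform(codes).toarray()

# Column j of `encoded` corresponds to code ohe.categories_[0][j];
# inverse_transform maps those codes back to the town names
names = le.inverse_transform(ohe.categories_[0])
print(list(names))
print(encoded)
```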

In [45]:
# creating a dataframe from the encoded variables
encode = pd.DataFrame(encode, columns=ohe.categories_)

In [46]:
encode

Unnamed: 0,0,1,2
0,1.0,0.0,0.0
1,1.0,0.0,0.0
2,1.0,0.0,0.0
3,1.0,0.0,0.0
4,1.0,0.0,0.0
5,0.0,0.0,1.0
6,0.0,0.0,1.0
7,0.0,0.0,1.0
8,0.0,0.0,1.0
9,0.0,1.0,0.0
10,0.0,1.0,0.0
11,0.0,1.0,0.0
12,0.0,1.0,0.0


In [47]:
# Add the area column from the previous X to the dataframe
encode["3"] = X[:, 1]

In [48]:
encode

Unnamed: 0,0,1,2,3
0,1.0,0.0,0.0,2600
1,1.0,0.0,0.0,3000
2,1.0,0.0,0.0,3200
3,1.0,0.0,0.0,3600
4,1.0,0.0,0.0,4000
5,0.0,0.0,1.0,2600
6,0.0,0.0,1.0,2800
7,0.0,0.0,1.0,3300
8,0.0,0.0,1.0,3600
9,0.0,1.0,0.0,2600
10,0.0,1.0,0.0,2900
11,0.0,1.0,0.0,3100
12,0.0,1.0,0.0,3600


In [52]:
# Avoid the dummy variable trap by dropping one dummy column (column 0, monroe township)
X = encode.drop(columns={0})

In [53]:
X

Unnamed: 0,1,2,3
0,0.0,0.0,2600
1,0.0,0.0,3000
2,0.0,0.0,3200
3,0.0,0.0,3600
4,0.0,0.0,4000
5,0.0,1.0,2600
6,0.0,1.0,2800
7,0.0,1.0,3300
8,0.0,1.0,3600
9,1.0,0.0,2600
10,1.0,0.0,2900
11,1.0,0.0,3100
12,1.0,0.0,3600


In [54]:
# Train the model
model.fit(X, y)



LinearRegression()

In [106]:
# Predict the price of a 2800 sq ft house in robbinsville
# (input order: column 1 = robbinsville, column 2 = west windsor, column 3 = area)
model.predict([[1, 0, 2800]])

array([590775.63964739])

In [107]:
# Predict the price of a 3400 sq ft house in west windsor
model.predict([[0, 1, 3400]])

array([681241.66845839])

In [55]:
# predict price for house in monroe township
model.predict([[0, 0, 2800]])

array([565089.22812299])
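
The second method can also be written without the intermediate `LabelEncoder` and manual column handling. The following is a sketch of one alternative, not the approach used in the cells above: `OneHotEncoder(drop="first")` applied to the text column through a `ColumnTransformer`, chained with the regression in a pipeline. The data is rebuilt inline from the values shown in the notebook outputs (an assumption), and `new_homes` holds the two homes from the opening question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Dataset rebuilt from the values displayed in the notebook outputs
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robbinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

# One-hot encode "town" (dropping the first category to avoid the trap)
# and pass "area" through unchanged; sparse_threshold=0 forces dense output
preprocess = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",
    sparse_threshold=0,
)

pipeline = make_pipeline(preprocess, LinearRegression())
pipeline.fit(df[["town", "area"]], df["price"])

# The two homes from the question at the top of the notebook
new_homes = pd.DataFrame({
    "town": ["west windsor", "robbinsville"],
    "area": [3400, 2800],
})
preds = pipeline.predict(new_homes)
print(preds)
```

With this layout, new homes are passed in with their town names, so there is no need to remember which position in the input list corresponds to which dummy column.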