# Machine Learning Tutorial Python - 6: Dummy Variables & One Hot Encoding

### Categorical Variables, Dummy Variables and One Hot Encoding
### 1. Using pandas get_dummies
### 2. Using sklearn OneHotEncoder

#### Task:
#### Build a predictor function to predict the price of a home:
#### 1. 3400 sq ft home in west windsor
#### 2. 2800 sq ft home in robbinsville (spelled "robinsville" in the dataset)
#### Use one hot encoding (dummy variables) to turn the string town names into numeric columns the model can use

In [2]:
import pandas as pd
df = pd.read_csv("homeprices_ML6.csv")
df


Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


# Create dummy variable columns using pandas get_dummies

In [4]:

dummies = pd.get_dummies(df.town)
dummies

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0


In [5]:
# combine the dummy columns with the original df, side by side
merged = pd.concat([df, dummies], axis='columns')
merged

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0


In [6]:
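# drop the original town column (it is now represented by the dummy columns)
# also drop one dummy column to avoid the dummy variable trap: any one can be dropped,
# and its town becomes the baseline (here west windsor = both remaining dummies are 0)
final = merged.drop(['town', 'west windsor'], axis='columns')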
final

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,0
6,2800,615000,0,0
7,3300,650000,0,0
8,3600,710000,0,0
9,2600,575000,0,1
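The encode, concat, and drop steps above can also be collapsed into one call. The sketch below is an alternative, not what this notebook does: it assumes the 13-row dataset can be rebuilt from the outputs printed in this notebook, and it uses `drop_first=True`, which drops the first category alphabetically (monroe township) instead of west windsor. The fitted predictions are the same either way; only the baseline town changes.

```python
import pandas as pd

# Assumption: data rebuilt by hand from the outputs printed in this notebook
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

# Encode the town column in place and drop the first dummy in one step.
# The new columns are prefixed with the original column name ("town_").
encoded = pd.get_dummies(df, columns=["town"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Note that the dummy columns get a `town_` prefix here, so the column names differ from the ones used in the rest of this notebook.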


In [8]:
# creating linear regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [9]:
# define x and y for training
# x = all columns except price 
x = final.drop('price', axis= 'columns')
x


Unnamed: 0,area,monroe township,robinsville
0,2600,1,0
1,3000,1,0
2,3200,1,0
3,3600,1,0
4,4000,1,0
5,2600,0,0
6,2800,0,0
7,3300,0,0
8,3600,0,0
9,2600,0,1


In [10]:
y = final.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [11]:
# training
model.fit(x,y)

LinearRegression()

In [12]:
# prediction
# input column order matches x: [area, monroe township, robinsville]
# 2800 sq ft in robinsville
model.predict([[2800, 0, 1]])

array([590775.63964739])

In [13]:
# predict 3400 sq ft in west windsor
# west windsor is the dropped (baseline) town, so both dummy columns are 0
model.predict([[3400, 0, 0]])

array([681241.66845839])

In [14]:
# model score (R^2), computed here on the training data itself, not on a held-out test set
model.score(x, y)

0.9573929037221873
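The task asks for a predictor function, while the cells above call `model.predict` with hand-built 0/1 rows. One possible way to wrap that is sketched below. `predict_price` and `BASELINE_TOWN` are hypothetical names introduced for illustration, and the data is rebuilt from the outputs printed in this notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumption: data rebuilt by hand from the outputs printed in this notebook
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

BASELINE_TOWN = "west windsor"  # the dummy column dropped above

dummies = pd.get_dummies(df.town, dtype=int)
X = pd.concat([df[["area"]], dummies], axis="columns").drop(BASELINE_TOWN, axis="columns")
y = df.price
model = LinearRegression().fit(X, y)


def predict_price(town, area):
    """Hypothetical helper: build the one-row input for a town name and area."""
    if town != BASELINE_TOWN and town not in dummies.columns:
        raise ValueError(f"unknown town: {town!r}")
    row = {col: 0 for col in X.columns}  # start with every column at 0
    row["area"] = area
    if town != BASELINE_TOWN:
        row[town] = 1  # switch on the dummy for this town
    return float(model.predict(pd.DataFrame([row], columns=X.columns))[0])


print(predict_price("robinsville", 2800))
print(predict_price("west windsor", 3400))
```

Passing a one-row DataFrame with the training column names keeps the column order explicit, which is the easiest thing to get wrong when typing `[[2800, 0, 1]]` by hand.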

# Using sklearn OneHotEncoder

In [48]:
df

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000


In [49]:
# doing label encoding on town column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()


In [50]:
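# dfle is a second name for the same DataFrame as df (no copy is made),
# so the encoding below also changes df
dfle = df
# label-encode the town column and assign the result back to it
dfle.town = le.fit_transform(dfle.town)
dfle
# categories are now converted into integer codes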

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000
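To see which integer stands for which town, the fitted encoder can be inspected. A minimal sketch on a standalone list of the three town names (not the notebook's `df`): `LabelEncoder` sorts the labels and numbers them from 0, and `classes_` holds them in that order.

```python
from sklearn.preprocessing import LabelEncoder

# The three town names used in this notebook, in order of first appearance
towns = ["monroe township", "west windsor", "robinsville"]

le = LabelEncoder()
codes = le.fit_transform(towns)

# classes_[i] is the label that was assigned the integer code i
for code, name in enumerate(le.classes_):
    print(code, name)
```

These integers are only identifiers. Fed directly to a linear model they would imply an ordering between towns, which is why the next step one hot encodes them.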


In [51]:
# create the x variable
x = df[['town', 'area']].values
x
# .values returns a 2D NumPy array

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]], dtype=int64)

In [52]:
y = dfle.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [53]:
# now use OneHotEncoder to create a dummy variable column for each town
# [0] selects column 0 of x (town); remainder='passthrough' keeps area unchanged
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
one = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder='passthrough')

In [54]:
x = one.fit_transform(x)
x

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

In [55]:
# drop the first dummy column (monroe township) to avoid the dummy variable trap
x = x[:, 1:]
x
# [:, 1:] takes all rows, and all columns from index 1 onwards

array([[0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 3.6e+03]])
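The label encoding and manual slicing steps can likely be skipped in recent scikit-learn versions, since `OneHotEncoder` accepts string columns and has a `drop` parameter. The sketch below is an alternative to the cells above, not the notebook's method: it assumes data rebuilt from the printed outputs, and wraps the encoder and model in a pipeline so new rows are encoded the same way at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumption: data rebuilt by hand from the outputs printed in this notebook
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

# drop="first" removes one dummy per feature (here monroe township);
# sparse_threshold=0 forces a dense array, like the output shown above
encoder = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",
    sparse_threshold=0,
)
pipe = make_pipeline(encoder, LinearRegression())
pipe.fit(df[["town", "area"]], df["price"])

# New homes are passed with the raw town name; the pipeline encodes them
new_homes = pd.DataFrame({"town": ["robinsville", "west windsor"],
                          "area": [2800, 3400]})
preds = pipe.predict(new_homes)
print(preds)
```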

In [56]:
# training
model.fit(x, y)

LinearRegression()

In [59]:
# prediction
# input column order after the drop: [robinsville, west windsor, area]
# 2800 sq ft home in robinsville
model.predict([[1,0,2800]])

array([590775.63964739])

In [60]:
# 3400 sq ft home in west windsor
model.predict([[0,1,3400]])

array([681241.6684584])

# Exercise

##### At the same level as this notebook on GitHub, there is an Exercise folder that contains carprices.csv. This file has car sale prices for 3 different models. First plot the data points on a scatter plot to see whether a linear regression model can be applied. If yes, then build a model that can answer the following questions:

##### 1) Predict the price of a Mercedes Benz that is 4 years old with mileage 45000

##### 2) Predict the price of a BMW X5 that is 7 years old with mileage 86000

##### 3) Report the score (R^2) of your model. (Hint: use LinearRegression().score())

##### Use carprices_exercise_ML6.csv