# One Hot Encoding

Build a predictor function to estimate the price of homes in Monroe Township and the other towns in the dataset.

One of the columns in the dataset contains text. The data looks like this:

Town  | Area | Price
----- | -----|-------
Monroe Township  | 2600 | 550000
....  | .... | .....

## Dealing with Categorical Variables

Machine learning models operate on numbers, not text, so the town names must be converted to numeric values. Categorical data are variables that contain label values rather than numeric values.

There are three common approaches for converting ordinal and categorical variables to numerical values. They are:

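    1. Ordinal Encoding - Typically used for variables such as "place" with the values "first", "second" and "third". First ranks ahead of second, which ranks ahead of third, so there is a natural order between the categories that integer codes can preserve.

    2. One-Hot Encoding - Typically used for variables such as "color" with the values "red", "green" and "blue", where there is no natural order between the categories: red is neither greater than nor less than green.

    3. Dummy Variable Encoding - A one-hot encoding with one column dropped, so that C categories are represented by C-1 binary variables.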

Variable type | Nominal | Ordinal
--------------|---------|--------
Example | Town 01, 02, 03, ... | Satisfied, neutral, dissatisfied
Example | Male, female | High, medium, low
Example | Green, blue, red | Graduate, masters, PhD

    * Nominal Variable (Categorical): Variable comprises a finite set of discrete values with no relationship between values.
    * Ordinal Variable: Variable comprises a finite set of discrete values with a ranked ordering between values.
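
To make the ordinal case concrete, here is a minimal sketch of ordinal encoding with pandas. The `places` data and the `rank` mapping are made up for illustration; they are not part of the home price dataset.

```python
import pandas as pd

# Hypothetical data: finishing places in a race (an ordinal variable).
places = pd.Series(["first", "third", "second", "first"], name="place")

# Explicit mapping that preserves the ranking: first < second < third.
rank = {"first": 1, "second": 2, "third": 3}
place_encoded = places.map(rank)

print(place_encoded.tolist())
```

Writing the mapping by hand keeps the order under your control. An automatic encoder that sorts labels alphabetically would not necessarily respect the ranking.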

Some examples include:

    A "pet" variable with the values: "dog" and "cat".
    A "color" variable with the values: "red", "green" and "blue".
    A "place" variable with the values: "first", "second" and "third".

Some categories may have a natural relationship to each other, such as a natural ordering.

The “place” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable because the values can be ordered or ranked.
A numerical variable can be converted to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin. 
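
The binning idea can be sketched with `pandas.cut`. The bin edges and labels below are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical home areas in square feet.
area = pd.Series([2600, 3000, 3200, 3600, 4000], name="area")

# Bin edges and labels are assumptions chosen for illustration.
bins = [0, 3000, 3500, float("inf")]
labels = ["small", "medium", "large"]

# pd.cut returns an ordered categorical; by default each bin is (low, high].
area_binned = pd.cut(area, bins=bins, labels=labels)

# .cat.codes gives the integer rank of each bin: small=0, medium=1, large=2.
area_codes = area_binned.cat.codes

print(area_binned.tolist())
print(area_codes.tolist())
```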

## Property of One Hot Encoding
The one-hot encoding creates one binary variable for each category.

The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents "blue" and [0, 1, 0] represents "green", we don't need another binary variable to represent "red". Instead, we can keep only the "blue" and "green" variables and represent "red" by setting both to 0, e.g. [0, 0].

This is called a dummy variable encoding, and always represents C categories with C-1 binary variables.

To avoid the "dummy variable trap", we drop one of the columns.
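
As a small sketch of the difference, `pandas.get_dummies` can produce both forms. The `colors` series is made-up example data. Note that `drop_first=True` drops the alphabetically first category, so here "blue" (not "red") becomes the all-zeros row.

```python
import pandas as pd

# Hypothetical example data.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One-hot encoding: one binary column per category (C columns).
one_hot = pd.get_dummies(colors, dtype=int)

# Dummy variable encoding: drop the first category (C-1 columns).
# The dropped category ("blue") is represented by zeros in every remaining column.
dummy = pd.get_dummies(colors, drop_first=True, dtype=int)

print(one_hot)
print(dummy)
```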

For example, in the case of a linear regression model (and other regression models that have a bias term), the one-hot columns always sum to 1, which makes them linearly dependent on the bias term. This causes the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated by solving the normal equations directly. For these types of models a dummy variable encoding should be used instead.
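
The linear dependence can be checked numerically. The sketch below builds a small made-up design matrix with an explicit column of ones standing in for the bias term, and compares its rank with and without the first one-hot column.

```python
import numpy as np

# Two observations per category for a 3-category variable, fully one-hot encoded.
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

# Design matrices with an explicit intercept (bias) column of ones.
intercept = np.ones((one_hot.shape[0], 1))
X_full = np.hstack([intercept, one_hot])          # 4 columns: bias + 3 one-hot
X_dummy = np.hstack([intercept, one_hot[:, 1:]])  # 3 columns: bias + 2 dummies

# Rank below the column count means X^T X is singular.
print(X_full.shape[1], np.linalg.matrix_rank(X_full))
print(X_dummy.shape[1], np.linalg.matrix_rank(X_dummy))
```

With the full one-hot encoding the rank is lower than the number of columns, while the dummy encoding has full column rank.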

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

## Importing

In [2]:
df = pd.read_csv("data/homeprices.csv")
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


## Data pre-processing

In [3]:
# checking for null values
df.isna().sum()

town     0
area     0
price    0
dtype: int64

### Dummy Variable encoding

In [4]:
# Convert the categorical data into numbers - Dummy variable Encoding

dummies = pd.get_dummies(df.town)
dummies

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0


In [7]:
merged_df = pd.concat([df, dummies], axis = 'columns')
merged_df

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0


In [8]:
# Drop the original town column and one of the dummy columns (here, monroe township)

dropped_df = merged_df.drop(['town','monroe township'], axis = 'columns')
dropped_df

Unnamed: 0,area,price,robinsville,west windsor
0,2600,550000,0,0
1,3000,565000,0,0
2,3200,610000,0,0
3,3600,680000,0,0
4,4000,725000,0,0
5,2600,585000,0,1
6,2800,615000,0,1
7,3300,650000,0,1
8,3600,710000,0,1
9,2600,575000,1,0
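
The get_dummies, concat and drop steps above can probably be collapsed into a single call. This is a sketch, assuming only the ten rows displayed in the tables above (typed in by hand so the block is self-contained); `drop_first=True` drops the alphabetically first town, which happens to be monroe township.

```python
import pandas as pd

# The rows shown in the tables above, entered by hand for a self-contained example.
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})

# Encode `town` and drop its first category in one step.
# New columns are prefixed with the original column name ("town_").
encoded = pd.get_dummies(df, columns=["town"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

The column names differ from `dropped_df` because of the `town_` prefix, but the 0/1 values are the same.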


In [9]:
# from sklearn.linear_model import LinearRegression
model = LinearRegression()

# x = features
x = dropped_df.drop('price', axis='columns')

In [31]:
x

Unnamed: 0,area,robinsville,west windsor
0,2600,0,0
1,3000,0,0
2,3200,0,0
3,3600,0,0
4,4000,0,0
5,2600,0,1
6,2800,0,1
7,3300,0,1
8,3600,0,1
9,2600,1,0


In [10]:
y = dropped_df.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [11]:
model.fit(x,y);

In [12]:
# price of a home in monroe township (1, 0, 0) or (0, 0)
print(f'Area = 2600,  monroe township is {model.predict([[2600, 0, 0]])}')

# price of a home in robinsville (0, 1, 0) or (1, 0)
print(f'Area = 2600, robinsville is {model.predict([[2600, 1, 0]])}')

# price of a home in west windsor (0, 0, 1) or (0, 1)
print(f'Area = 2600, west windsor is {model.predict([[2600, 0, 1]])}')

Area = 2600,  monroe township is [539709.7398409]
Area = 2600, robinsville is [565396.1513653]
Area = 2600, west windsor is [579723.71533004]
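
One way to read these predictions: with monroe township as the dropped (baseline) category, each dummy coefficient is the predicted price gap between that town and monroe township at the same area. The sketch below checks this on the ten rows displayed earlier. Because it does not use the full CSV file, its fitted numbers will differ from the outputs above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The rows shown in the tables above, entered by hand for a self-contained example.
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})

# Feature order: area, town_robinsville, town_west windsor.
x = pd.get_dummies(df[["area", "town"]], columns=["town"], drop_first=True, dtype=int)
model = LinearRegression().fit(x.to_numpy(), df["price"])

area_coef, robinsville_coef, west_windsor_coef = model.coef_

# Gap between west windsor and the baseline town at the same area.
same_area_gap = model.predict([[2600, 0, 1]])[0] - model.predict([[2600, 0, 0]])[0]
print(np.isclose(same_area_gap, west_windsor_coef))
```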


### One-Hot Encoding

In [13]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# dfle = label-encoded dataframe; copy df so the original is not modified
dfle = df.copy()
dfle

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [13]:
# fit and transform the town column from words to numbers
dfle.town = le.fit_transform(dfle.town)
dfle

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000


In [14]:
# .values - converts the pandas dataframe to numpy array
#x features
x = dfle[['town', 'area']].values
x

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]], dtype=int64)

In [15]:
# y = target (price)
y = dfle.price.values
y

array([550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,
       710000, 575000, 600000, 620000, 695000], dtype=int64)

In [16]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# ct = column transformer: one-hot encodes column 0 (town) and passes the remaining columns through
ct = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder='passthrough')

In [17]:
x = ct.fit_transform(x)
# x = pd.DataFrame(x)
x

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

In [18]:
# drop the monroe township column
x = x[:,1:]
x = pd.DataFrame(x)
x

Unnamed: 0,0,1,2
0,0.0,0.0,2600.0
1,0.0,0.0,3000.0
2,0.0,0.0,3200.0
3,0.0,0.0,3600.0
4,0.0,0.0,4000.0
5,0.0,1.0,2600.0
6,0.0,1.0,2800.0
7,0.0,1.0,3300.0
8,0.0,1.0,3600.0
9,1.0,0.0,2600.0
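
The label-encode, one-hot-encode and slice steps can likely be shortened, since `OneHotEncoder` accepts string columns directly and has a `drop` parameter. The sketch below assumes the ten rows displayed earlier rather than the full CSV file; `sparse_threshold=0` is set so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The rows shown in the tables above, entered by hand for a self-contained example.
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
})

# drop="first" removes the first category (monroe township) during encoding,
# so no manual slicing of the output is needed.
ct = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
x = ct.fit_transform(df[["town", "area"]])

# Column order: robinsville, west windsor, area.
print(x[[0, 5, 9]])
```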


In [19]:
# fit the model (One hot encoding model = ohe_model)
# from sklearn.linear_model import LinearRegression

ohe_model = LinearRegression()
ohe_model.fit(x,y)

# making predictions
# monroe township (1, 0, 0) or (0, 0)
# robinsville (0, 1, 0) or (1, 0)
# west windsor (0, 0, 1) or (0, 1)

print('Dummy variable results \n')
print(f'Area = 2800, Town = monroe township is {model.predict([[2800, 0, 0]])}<-')
print(f'Area = 2800, Town = robinsville is {model.predict([[2800, 1, 0]])}')
print(f'Area = 2800, Town = west windsor is {model.predict([[2800, 0, 1]])}')

print('\n')
print(f'Area = 3400, Town = monroe township is {model.predict([[3400, 0, 0]])}---')
print(f'Area = 3400, Town = robinsville is {model.predict([[3400, 1, 0]])}')
print(f'Area = 3400, Town = west windsor is {model.predict([[3400, 0, 1]])}')

Dummy variable results 

Area = 2800, Town = monroe township is [565089.22812299]<-
Area = 2800, Town = robinsville is [590775.63964739]
Area = 2800, Town = west windsor is [605103.20361213]


Area = 3400, Town = monroe township is [641227.69296925]---
Area = 3400, Town = robinsville is [666914.10449366]
Area = 3400, Town = west windsor is [681241.6684584]


### Comparing Dummy Variable vs One-Hot Encoding

In [20]:
# making predictions - dummy variable
# monroe township (1, 0, 0) or (0, 0)
# robinsville     (0, 1, 0) or (1, 0)
# west windsor    (0, 0, 1) or (0, 1)

print('Dummy variable results \n')
print(f'Area = 2800, Town = monroe township is {model.predict([[2800, 0, 0]])}<-')
print(f'Area = 2800, Town = robinsville is {model.predict([[2800, 1, 0]])}')
print(f'Area = 2800, Town = west windsor is {model.predict([[2800, 0, 1]])}')

print('\n')
print(f'Area = 3400, Town = monroe township is {model.predict([[3400, 0, 0]])}---')
print(f'Area = 3400, Town = robinsville is {model.predict([[3400, 1, 0]])}')
print(f'Area = 3400, Town = west windsor is {model.predict([[3400, 0, 1]])}')

# ---------------------------------------------------------------------------------
# making predictions - Hot Encoding
# monroe township (1, 0, 0) or (0, 0)
# robinsville (0, 1, 0) or (1, 0)
# west windsor (0, 0, 1) or (0, 1)
print('\n')
print('Hot Encoding results \n')
print(f'Area = 2800, Town = monroe township is {ohe_model.predict([[0, 0, 2800]])}<-')
print(f'Area = 2800, Town = robinsville is {ohe_model.predict([[1, 0, 2800]])}')
print(f'Area = 2800, Town = west windsor is {ohe_model.predict([[0, 1, 2800]])}')

print('\n')
print(f'Area = 3400, Town = monroe township is {ohe_model.predict([[0, 0, 3400]])}---')
print(f'Area = 3400, Town = robinsville is {ohe_model.predict([[1, 0, 3400]])}')
print(f'Area = 3400, Town = west windsor is {ohe_model.predict([[0, 1, 3400]])}')


Dummy variable results 

Area = 2800, Town = monroe township is [565089.22812299]<-
Area = 2800, Town = robinsville is [590775.63964739]
Area = 2800, Town = west windsor is [605103.20361213]


Area = 3400, Town = monroe township is [641227.69296925]---
Area = 3400, Town = robinsville is [666914.10449366]
Area = 3400, Town = west windsor is [681241.6684584]


Hot Encoding results 

Area = 2800, Town = monroe township is [565089.22812299]<-
Area = 2800, Town = robinsville is [590775.63964739]
Area = 2800, Town = west windsor is [605103.20361213]


Area = 3400, Town = monroe township is [641227.69296925]---
Area = 3400, Town = robinsville is [666914.10449366]
Area = 3400, Town = west windsor is [681241.6684584]
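
The two routes give matching predictions because they feed the model the same information, only in a different column order. Here is a sketch of a programmatic check, again assuming only the ten rows displayed earlier (so the fitted values will not match the printed outputs, which come from the full CSV file).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# The rows shown in the tables above, entered by hand for a self-contained example.
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})

# Route 1: pandas dummy variables. Feature order: area, robinsville, west windsor.
x_dummy = pd.get_dummies(df[["area", "town"]], columns=["town"],
                         drop_first=True, dtype=int)
model = LinearRegression().fit(x_dummy.to_numpy(), df["price"])

# Route 2: scikit-learn encoder. Feature order: robinsville, west windsor, area.
ct = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
x_ohe = ct.fit_transform(df[["town", "area"]])
ohe_model = LinearRegression().fit(x_ohe, df["price"])

# Same query through both models: a 2800 sq ft home in west windsor.
pred_dummy = model.predict([[2800, 0, 1]])
pred_ohe = ohe_model.predict([[0, 1, 2800]])
print(np.allclose(pred_dummy, pred_ohe))
```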


In [21]:
# Table for reference (shown before the monroe township column is dropped)
merged_df

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0


In [22]:
# clean code for dummy variable encoding:

import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
df = pd.read_csv("data/homeprices.csv")
dummies = pd.get_dummies(df.town)
merged_df = pd.concat([df, dummies], axis='columns')
dropped_df = merged_df.drop(['town', 'monroe township'], axis='columns')

model = LinearRegression()
x = dropped_df.drop('price', axis='columns')
y = dropped_df.price
model.fit(x, y)

print('Dummy variable results \n')
print(f'Area = 2800, Town = monroe township is {model.predict([[2800, 0, 0]])}<-')
print(f'Area = 2800, Town = robinsville is {model.predict([[2800, 1, 0]])}')
print(f'Area = 2800, Town = west windsor is {model.predict([[2800, 0, 1]])}')

print('\n')
print(f'Area = 3400, Town = monroe township is {model.predict([[3400, 0, 0]])}---')
print(f'Area = 3400, Town = robinsville is {model.predict([[3400, 1, 0]])}')
print(f'Area = 3400, Town = west windsor is {model.predict([[3400, 0, 1]])}')

Dummy variable results 

Area = 2800, Town = monroe township is [565089.22812299]<-
Area = 2800, Town = robinsville is [590775.63964739]
Area = 2800, Town = west windsor is [605103.20361213]


Area = 3400, Town = monroe township is [641227.69296925]---
Area = 3400, Town = robinsville is [666914.10449366]
Area = 3400, Town = west windsor is [681241.6684584]


In [24]:
# clean code for one-hot encoding:

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label-encode the town names as integers
le = LabelEncoder()
dfle = pd.read_csv("data/homeprices.csv")
dfle.town = le.fit_transform(dfle.town)

# x = features (town, area), y = target (price)
x = dfle[['town', 'area']].values
y = dfle.price.values

# One-hot encode column 0 (town), then drop the first one-hot column (monroe township)
ct = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder='passthrough')
x = ct.fit_transform(x)
x = x[:, 1:]

ohe_model = LinearRegression()
ohe_model.fit(x, y)


# Feature order for ohe_model: robinsville, west windsor, area
print('One-hot encoding results \n')
print(f'Area = 2800, Town = monroe township is {ohe_model.predict([[0, 0, 2800]])}<-')
print(f'Area = 2800, Town = robinsville is {ohe_model.predict([[1, 0, 2800]])}')
print(f'Area = 2800, Town = west windsor is {ohe_model.predict([[0, 1, 2800]])}')

print('\n')
print(f'Area = 3400, Town = monroe township is {ohe_model.predict([[0, 0, 3400]])}---')
print(f'Area = 3400, Town = robinsville is {ohe_model.predict([[1, 0, 3400]])}')
print(f'Area = 3400, Town = west windsor is {ohe_model.predict([[0, 1, 3400]])}')

One-hot encoding results 

Area = 2800, Town = monroe township is [565089.22812299]<-
Area = 2800, Town = robinsville is [590775.63964739]
Area = 2800, Town = west windsor is [605103.20361213]


Area = 3400, Town = monroe township is [641227.69296925]---
Area = 3400, Town = robinsville is [666914.10449366]
Area = 3400, Town = west windsor is [681241.6684584]
