<h1><center>Categorical Variables and One Hot Encoding</center></h1>

In [1]:
import pandas as pd 

In [5]:
df = pd.read_csv("homeprices.csv")
df
# The town column is a categorical variable: all of its values are strings,
# so each town has to be represented with numbers before fitting a model

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [6]:
dummies = pd.get_dummies(df.town)
dummies

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0
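
Recent pandas releases (2.0 and later) return boolean True/False columns from `get_dummies` by default, rather than the 0/1 integers shown above. The following is a minimal sketch, assuming a small hand-built frame named `towns` as a stand-in for homeprices.csv, that asks for integer columns explicitly:

```python
import pandas as pd

# Hypothetical stand-in for homeprices.csv (three rows copied from the table above)
towns = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2600, 2600],
    "price": [550000, 585000, 575000],
})

# dtype=int requests 0/1 columns regardless of the pandas version's default
dummies = pd.get_dummies(towns["town"], dtype=int)
print(dummies)
```

The dummy columns come out in sorted order of the town names, which is why west windsor appears last.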


Use the **concat** method to join two data frames.

It takes a list as input, in which you supply the data frames that you want to join.

You also need to say how you want to join them. Here the dummy columns should sit next to the original columns, so the axis to use is columns.
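
To see what the axis argument changes, here is a small sketch with two made-up frames (`left` and `right` are illustrative names, not part of the notebook). With `axis='columns'` the frames are placed side by side and rows are matched on their index labels; with the default axis they are stacked on top of each other.

```python
import pandas as pd

left = pd.DataFrame({"area": [2600, 3000]})
right = pd.DataFrame({"monroe township": [1, 0]})

# axis='columns' (same as axis=1): side by side, rows matched on index label
side_by_side = pd.concat([left, right], axis="columns")

# default axis=0: frames are stacked, and columns missing from a frame become NaN
stacked = pd.concat([left, right])

print(side_by_side.shape)  # (2, 2)
print(stacked.shape)       # (4, 2)
```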

In [7]:
merged = pd.concat([df, dummies], axis='columns')
merged

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0


In [19]:
final = merged.drop(['town'], axis='columns')
final

Unnamed: 0,area,price,monroe township,robinsville,west windsor
0,2600,550000,1,0,0
1,3000,565000,1,0,0
2,3200,610000,1,0,0
3,3600,680000,1,0,0
4,4000,725000,1,0,0
5,2600,585000,0,0,1
6,2800,615000,0,0,1
7,3300,650000,0,0,1
8,3600,710000,0,0,1
9,2600,575000,0,1,0


## Dummy Variable Trap

When you can derive one variable from the other variables, they are said to be multicollinear. Here, if you know the values of monroe township and robinsville, you can easily infer the value of west windsor: when monroe township=0 and robinsville=0, west windsor must be 1. These town variables are therefore multicollinear. In this situation linear regression may not work as expected, so you need to drop one column.

**NOTE: sklearn's LinearRegression copes with the dummy variable trap, so the model will still work even if you don't drop one of the town columns. However, we should make a habit of taking care of the dummy variable trap ourselves, in case the library you are using does not handle it for you.**
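
The dependency between the dummy columns can be checked directly. The sketch below uses a made-up `town` series rather than the notebook's data. It shows that the dummy columns sum to 1 on every row, so any one of them can be rebuilt from the rest. It also shows the `drop_first` option of `get_dummies`, which drops the first category in sorted order (monroe township here) instead of the west windsor column dropped by hand in the next cell.

```python
import pandas as pd

town = pd.Series(
    ["monroe township", "west windsor", "robinsville", "west windsor"],
    name="town",
)

full = pd.get_dummies(town, dtype=int)

# Each row has exactly one 1, so the dummy columns always sum to 1
row_sums = full.sum(axis="columns")
print(row_sums.tolist())  # [1, 1, 1, 1]

# One column is therefore redundant: it can be rebuilt from the other two
derived = 1 - full["monroe township"] - full["robinsville"]
print((derived == full["west windsor"]).all())  # True

# drop_first=True removes the first category in sorted order
reduced = pd.get_dummies(town, dtype=int, drop_first=True)
print(list(reduced.columns))  # ['robinsville', 'west windsor']
```

Which column is dropped only changes the baseline town that the remaining coefficients are measured against.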

In [20]:
final = final.drop(['west windsor'], axis='columns')
final

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,0
6,2800,615000,0,0
7,3300,650000,0,0
8,3600,710000,0,0
9,2600,575000,0,1


In [10]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [12]:
X = final.drop('price',axis='columns')
X

Unnamed: 0,area,monroe township,robinsville
0,2600,1,0
1,3000,1,0
2,3200,1,0
3,3600,1,0
4,4000,1,0
5,2600,0,0
6,2800,0,0
7,3300,0,0
8,3600,0,0
9,2600,0,1


In [14]:
y = final.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [15]:
model.fit(X, y)

In [16]:
model.predict([[2800,0,1]]) # 2800 sq ft home in robinsville



array([590775.63964739])

In [17]:
model.predict([[3400,0,0]]) # 3400 sq ft home in west windsor



array([681241.66845839])

The model predicts a price of about 681,241 dollars.
For comparison, a 3300 square foot home in west windsor sold for 650000.

**The prediction is a little higher, which is plausible for a slightly larger home.**
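
When a model has been fitted on a DataFrame, recent scikit-learn versions warn if `predict` receives a plain list, because the feature names are missing. One way around this is to pass a one-row DataFrame with the same columns. This is a sketch under assumptions: `train`, `features`, `demo_model` and `query` are illustrative names, and the five training rows are a subset of the table above, so the number it prints will differ from the notebook's.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical subset of the training data, for illustration only
train = pd.DataFrame({
    "area": [2600, 3000, 2600, 2800, 2600],
    "monroe township": [1, 1, 0, 0, 0],
    "robinsville": [0, 0, 0, 0, 1],
    "price": [550000, 565000, 585000, 615000, 575000],
})
features = ["area", "monroe township", "robinsville"]
demo_model = LinearRegression().fit(train[features], train["price"])

# One-row frame with the same column names and order as the training data
query = pd.DataFrame([[3400, 0, 0]], columns=features)
prediction = demo_model.predict(query)
print(prediction)
```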

If you want to know how accurate your model is, you can use the **model.score()** method and supply your X and y.

It calculates the predicted values for all of the rows in X and compares them with the actual values in y. It then computes the coefficient of determination (R²), which is returned as the score.

Here the score is about 0.957, which means the model explains roughly 96% of the variation in price on the training data. A score of 1 would mean a perfect fit.

In [18]:
model.score(X, y)

0.9573929037221872
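
For a regression model, the value returned by `score` is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on a small stand-in data set (area only, using the first five rows of the table above) and prints it next to `score`. The names `X_demo`, `y_demo` and `demo_model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in data: area and price of the five monroe township rows
X_demo = np.array([[2600], [3000], [3200], [3600], [4000]])
y_demo = np.array([550000, 565000, 610000, 680000, 725000])

demo_model = LinearRegression().fit(X_demo, y_demo)
pred = demo_model.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# The two numbers should agree up to floating point error
print(r2_manual, demo_model.score(X_demo, y_demo))
```

Because the score here is computed on the same rows the model was fitted on, it says how well the model fits the training data, not how well it will do on unseen homes.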

## Using sklearn OneHotEncoder

The first step is to use LabelEncoder to convert the town names into numbers.

In [21]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [24]:
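dfle = df.copy() # work on a copy so the original df keeps its town names
dfle.town = le.fit_transform(df.town) # fit learns the distinct town names, transform replaces each name with its integer code
dfle # fit_transform takes the label column as input; here the label column is town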

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000
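
The integer codes follow the sorted order of the labels, which is why west windsor becomes 2 and robinsville becomes 1 above. The short sketch below uses a separate encoder, `demo_le`, so the notebook's `le` is left untouched. It shows how to inspect the mapping through `classes_` and how to reverse it with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

demo_le = LabelEncoder()
codes = demo_le.fit_transform(["monroe township", "west windsor", "robinsville"])

print(codes.tolist())             # [0, 2, 1]
print(demo_le.classes_.tolist())  # ['monroe township', 'robinsville', 'west windsor']

# inverse_transform maps codes back to the original labels
print(demo_le.inverse_transform([1]).tolist())  # ['robinsville']
```

These codes are only identifiers. Their numeric order carries no meaning, which is the reason for the one hot encoding step that follows.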


In [27]:
# X is your training data set and y is your dependent variable
X = dfle[['town','area']].values
X

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]], dtype=int64)

In [28]:
y = dfle.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

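By default, OneHotEncoder assumes that all of the columns in X are categorical variables. To avoid that, you need to specify which columns are the categorical features when you set up the encoder. Here this is done with ColumnTransformer: [0] selects the zeroth column for encoding, and remainder='passthrough' leaves the other columns unchanged.

Later on, in whatever X I supply, the zeroth column is my categorical feature.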

In [34]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ohe = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder = 'passthrough')

In [37]:
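# Run this cell only once: it overwrites X, and a second run would encode
# column 0 of the already-encoded array and add an extra column
X = ohe.fit_transform(X)
X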

array([[0.0e+00, 1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

In [38]:
X = X[:,1:]
# Take all the rows and, for the columns, take everything from index 1 onwards (this drops the first column)
X

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])
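
As an alternative to slicing off a column by hand, OneHotEncoder accepts a `drop` argument. The following is a minimal sketch, not the notebook's own method. It assumes a six-row stand-in array in the same `[town code, area]` layout as above, and the names `X_demo`, `y_demo`, `ct`, `X_enc` and `demo_model` are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Stand-in rows: [town code, area], using the codes produced by LabelEncoder
X_demo = np.array([[0, 2600], [0, 3000], [2, 2600], [2, 2800], [1, 2600], [1, 2900]])
y_demo = np.array([550000, 565000, 585000, 615000, 575000, 600000])

# drop='first' removes the dummy column of the first category (code 0),
# so no manual X[:, 1:] slice is needed; sparse_threshold=0 forces a dense array
ct = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_enc = ct.fit_transform(X_demo)
print(X_enc.shape)  # (6, 3): two town dummies plus area

demo_model = LinearRegression().fit(X_enc, y_demo)

# New rows go through the same fitted transformer before predicting
print(demo_model.predict(ct.transform(np.array([[1, 2800]]))))
```

Keeping the fitted transformer around and calling `transform` on new rows avoids building the dummy columns for a prediction by hand.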