<h2 align = "center">Categorical Variables and One Hot Encoding</h2>

In [1]:
import pandas as pd

In [5]:
df = pd.read_csv("homeprices.csv")
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [3]:
df.replace(to_replace=['monroe township','west windsor','robinsville'], value=["Faysal", "El Helmya","Haram"], inplace=True)

In [4]:
df

Unnamed: 0,town,area,price
0,Faysal,2600,550000
1,Faysal,3000,565000
2,Faysal,3200,610000
3,Faysal,3600,680000
4,Faysal,4000,725000
5,El Helmya,2600,585000
6,El Helmya,2800,615000
7,El Helmya,3300,650000
8,El Helmya,3600,710000
9,Haram,2600,575000


In [5]:
df.groupby("town").mean()[["price"]]

town,price
El Helmya,640000.0
Faysal,626000.0
Haram,622500.0


In [5]:
df.groupby("town").median()[["price"]]

town,price
El Helmya,632500.0
Faysal,610000.0
Haram,610000.0


In [8]:
df.groupby("town")[["price"]].mean()

town,price
El Helmya,640000.0
Faysal,626000.0
Haram,622500.0


In [5]:
df.groupby("town").mean()

town,area,price
El Helmya,3075.0,640000.0
Faysal,3280.0,626000.0
Haram,3050.0,622500.0


In [6]:
df.town.value_counts()

Faysal       5
El Helmya    4
Haram        4
Name: town, dtype: int64

* Cheapest location: Faysal (lowest price for a given area; by raw mean price Haram is slightly lower)
* Most expensive: El Helmya
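
One way to check the ranking above is to compare price per unit of area rather than raw price. The sketch below rebuilds the 13 rows inline from the values printed in this notebook, so it does not need `homeprices.csv`. The `price_per_sqft` column name is an illustrative choice, and it assumes `area` is in square feet, as the later cell comments suggest.

```python
import pandas as pd

# Data rebuilt from the values printed in this notebook (13 rows).
df = pd.DataFrame({
    "town": ["Faysal"] * 5 + ["El Helmya"] * 4 + ["Haram"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

# Price per square foot for each home, then the average per town,
# sorted from cheapest to most expensive.
df["price_per_sqft"] = df["price"] / df["area"]
per_town = df.groupby("town")["price_per_sqft"].mean().sort_values()
print(per_town)
```

On this measure Faysal comes first (cheapest) and El Helmya last, even though Haram has the lowest raw mean price.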

Data Encoding <br> 
Wrong way: assigning arbitrary integer codes, which imply an order and spacing that the towns do not have <br> 
Faysal: 0 <br>
El Helmya: 1 <br>
Haram: 2 <br> 
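
To see why integer codes are a poor fit here, the sketch below fits ordinary least squares twice with NumPy: once with the town as a single integer column, once with dummy columns. The helper `r_squared` is a hypothetical name introduced for this illustration, and the data is rebuilt from the values printed in this notebook.

```python
import numpy as np
import pandas as pd

# Data rebuilt from the values printed in this notebook (13 rows).
df = pd.DataFrame({
    "town": ["Faysal"] * 5 + ["El Helmya"] * 4 + ["Haram"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})


def r_squared(features, target):
    """Fit least squares with an intercept and return R^2 (hypothetical helper)."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residuals = target - A @ coef
    centered = target - target.mean()
    return 1 - (residuals @ residuals) / (centered @ centered)


y = df["price"].to_numpy(dtype=float)
area = df["area"].to_numpy(dtype=float)

# "Wrong way": one integer column, which forces Haram's effect to be
# exactly twice El Helmya's effect relative to Faysal.
codes = df["town"].map({"Faysal": 0, "El Helmya": 1, "Haram": 2}).to_numpy(dtype=float)
r2_int = r_squared(np.column_stack([area, codes]), y)

# Dummy columns: each town gets its own independent offset (Haram is the baseline).
dummies = pd.get_dummies(df["town"], dtype=float)[["El Helmya", "Faysal"]].to_numpy()
r2_onehot = r_squared(np.column_stack([area, dummies]), y)

print(r2_int, r2_onehot)
```

The dummy-variable fit can never score lower on the training data, because the integer-coded model is a restricted version of it.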

<h2 style='color:purple'>Using pandas to create dummy variables</h2>

In [9]:
dummies = pd.get_dummies(df.town)
dummies

Unnamed: 0,El Helmya,Faysal,Haram
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
5,1,0,0
6,1,0,0
7,1,0,0
8,1,0,0
9,0,0,1


In [10]:
merged = pd.concat([df,dummies],axis='columns') # concatenation of the two dataframes
merged

Unnamed: 0,town,area,price,El Helmya,Faysal,Haram
0,Faysal,2600,550000,0,1,0
1,Faysal,3000,565000,0,1,0
2,Faysal,3200,610000,0,1,0
3,Faysal,3600,680000,0,1,0
4,Faysal,4000,725000,0,1,0
5,El Helmya,2600,585000,1,0,0
6,El Helmya,2800,615000,1,0,0
7,El Helmya,3300,650000,1,0,0
8,El Helmya,3600,710000,1,0,0
9,Haram,2600,575000,0,0,1


In [11]:
merged.describe() 

Unnamed: 0,area,price,El Helmya,Faysal,Haram
count,13.0,13.0,13.0,13.0,13.0
mean,3146.153846,629230.769231,0.307692,0.384615,0.307692
std,453.900475,57621.109914,0.480384,0.50637,0.480384
min,2600.0,550000.0,0.0,0.0,0.0
25%,2800.0,585000.0,0.0,0.0,0.0
50%,3100.0,615000.0,0.0,0.0,0.0
75%,3600.0,680000.0,1.0,1.0,1.0
max,4000.0,725000.0,1.0,1.0,1.0


In [12]:
final = merged.drop(['town'], axis='columns')
final

Unnamed: 0,area,price,El Helmya,Faysal,Haram
0,2600,550000,0,1,0
1,3000,565000,0,1,0
2,3200,610000,0,1,0
3,3600,680000,0,1,0
4,4000,725000,0,1,0
5,2600,585000,1,0,0
6,2800,615000,1,0,0
7,3300,650000,1,0,0
8,3600,710000,1,0,0
9,2600,575000,0,0,1


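In [22]:
# dummy codes with all three columns (El Helmya, Faysal, Haram)
0,1,0  # Faysal
1,0,0  # El Helmya
0,0,1  # Haram

# after dropping the Haram column (El Helmya, Faysal)
0,1  # Faysal
1,0  # El Helmya
0,0  # Haram

(0, 0)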

<h3 style='color:purple'>Dummy Variable Trap</h3>

When one variable can be derived from other variables, the variables are said to be multicollinear. Here,
if you know the values of El Helmya and Faysal, you can easily infer the value of Haram: when
El Helmya=0 and Faysal=0, Haram must be 1. These town variables are therefore multicollinear. In this
situation linear regression won't work as expected, because the coefficients are no longer uniquely determined. Hence you need to drop one column. 

**NOTE: sklearn's LinearRegression still fits and predicts even if you don't drop one of the 
    town columns. However, we should make a habit of handling the dummy variable
    trap ourselves, in case the library we are using does not handle it for us.**
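
The trap can be made concrete by checking the rank of the design matrix. This is a minimal sketch with the town column rebuilt inline; the variable names are illustrative. Note that `drop_first=True` drops the alphabetically first category (El Helmya), whereas the cell below drops Haram; either choice removes the redundancy.

```python
import numpy as np
import pandas as pd

# Town column rebuilt from the values printed in this notebook.
towns = pd.Series(["Faysal"] * 5 + ["El Helmya"] * 4 + ["Haram"] * 4, name="town")

# All three dummy columns plus an intercept column of ones.
all_dummies = pd.get_dummies(towns, dtype=int)
with_intercept = np.column_stack([np.ones(len(towns)), all_dummies.to_numpy()])
# 4 columns but only rank 3: the dummies always sum to the intercept column.
rank_full = np.linalg.matrix_rank(with_intercept)

# Dropping one dummy column restores full column rank.
safe_dummies = pd.get_dummies(towns, dtype=int, drop_first=True)
safe_with_intercept = np.column_stack([np.ones(len(towns)), safe_dummies.to_numpy()])
rank_safe = np.linalg.matrix_rank(safe_with_intercept)

print(with_intercept.shape[1], rank_full)
print(safe_with_intercept.shape[1], rank_safe)
```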

In [13]:
final = final.drop(['Haram'], axis='columns')
final

Unnamed: 0,area,price,El Helmya,Faysal
0,2600,550000,0,1
1,3000,565000,0,1
2,3200,610000,0,1
3,3600,680000,0,1
4,4000,725000,0,1
5,2600,585000,1,0
6,2800,615000,1,0
7,3300,650000,1,0
8,3600,710000,1,0
9,2600,575000,0,0


In [14]:
X = final.drop('price', axis='columns')
X

Unnamed: 0,area,El Helmya,Faysal
0,2600,0,1
1,3000,0,1
2,3200,0,1
3,3600,0,1
4,4000,0,1
5,2600,1,0
6,2800,1,0
7,3300,1,0
8,3600,1,0
9,2600,0,0


In [15]:
y = final.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [16]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [17]:
# m1x1 + m2x2 + m3x3 + c 
model.fit(X,y) #training process

LinearRegression()

In [18]:
model.predict(X) # predicted prices for all 13 training rows

array([539709.73984091, 590468.71640508, 615848.20468716, 666607.18125133,
       717366.1578155 , 579723.71533005, 605103.20361213, 668551.92431735,
       706621.15674048, 565396.15136531, 603465.38378844, 628844.87207052,
       692293.59277574])

In [19]:
model.score(X,y)

0.9573929037221873

In [21]:
model.predict([[2000,0,0]]) # 2000 sqr ft home in Haram



array([489257.68651906])

In [22]:
model.predict([[2000,0,1]]) # 2000 sqr ft home in Faysal



array([463571.27499465])

In [23]:
model.predict([[2000,1,0]]) # 2000 sqr ft home in El Helmya



array([503585.25048379])

Equation: 
- y = m1 * town + m2 * area + c 
- y = m1 * Faysal + m2 * El Helmya + m3 * area + c (Haram is the dropped baseline, so its effect is absorbed into c)
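
The equation can be checked by fitting the coefficients directly and evaluating it by hand. The sketch below uses NumPy least squares as a stand-in for `LinearRegression` (an assumption made to keep the example dependency-light), and the `predict` helper is a hypothetical name. The data is rebuilt from the values printed in this notebook.

```python
import numpy as np
import pandas as pd

# Data rebuilt from the values printed in this notebook (13 rows).
df = pd.DataFrame({
    "town": ["Faysal"] * 5 + ["El Helmya"] * 4 + ["Haram"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})
dummies = pd.get_dummies(df["town"], dtype=float)

# Design matrix columns: intercept, area, El Helmya, Faysal (Haram is the baseline).
A = np.column_stack([
    np.ones(len(df)),
    df["area"].to_numpy(dtype=float),
    dummies["El Helmya"].to_numpy(),
    dummies["Faysal"].to_numpy(),
])
y = df["price"].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
c, m_area, m_helmya, m_faysal = coef


def predict(area, town):
    """Evaluate y = m_faysal*Faysal + m_helmya*ElHelmya + m_area*area + c by hand."""
    return (c + m_area * area
            + m_helmya * (town == "El Helmya")
            + m_faysal * (town == "Faysal"))


for town in ["Haram", "Faysal", "El Helmya"]:
    print(town, predict(2000, town))
```

The three printed values should agree with the `model.predict` outputs for 2000 sq ft homes shown above.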

### Binary Encoder

In [6]:
import category_encoders as ce 
binary_encoder = ce.BinaryEncoder(cols= ["town"], return_df= True)
encoded_towns = binary_encoder.fit_transform(df)
encoded_towns

Unnamed: 0,town_0,town_1,area,price
0,0,1,2600,550000
1,0,1,3000,565000
2,0,1,3200,610000
3,0,1,3600,680000
4,0,1,4000,725000
5,1,0,2600,585000
6,1,0,2800,615000
7,1,0,3300,650000
8,1,0,3600,710000
9,1,1,2600,575000
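
The idea behind the `town_0` and `town_1` columns can be sketched by hand: number the categories in order of first appearance starting at 1, then write each number in binary. This is a hand-rolled illustration that reproduces the output shown above, not the `category_encoders` implementation, and the starting-at-1 numbering is inferred from that output.

```python
import pandas as pd

# Town column rebuilt from the values printed in this notebook.
towns = pd.Series(["Faysal"] * 5 + ["El Helmya"] * 4 + ["Haram"] * 4, name="town")

# Ordinal codes in order of first appearance: Faysal=1, El Helmya=2, Haram=3.
codes, uniques = pd.factorize(towns)
ordinal = codes + 1

# Number of binary digits needed for the largest code (3 -> 2 bits).
n_bits = int(ordinal.max()).bit_length()

# One column per bit, most significant bit first.
bits = pd.DataFrame({
    f"town_{i}": (ordinal >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
})
print(bits.drop_duplicates())
```

Three towns need only two columns here instead of three dummy columns, and the saving grows as the number of categories increases.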


<h2 style='color:purple'>Using sklearn OneHotEncoder</h2>

The first step is to use a label encoder to convert town names into numbers.

In [22]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [23]:
dfle = df.copy() # copy so the original df keeps its town names
dfle.town = le.fit_transform(dfle.town)
dfle

Unnamed: 0,town,area,price
0,1,2600,550000
1,1,3000,565000
2,1,3200,610000
3,1,3600,680000
4,1,4000,725000
5,0,2600,585000
6,0,2800,615000
7,0,3300,650000
8,0,3600,710000
9,2,2600,575000


In [24]:
X = dfle[['town','area']].values

In [27]:
X

array([[   1, 2600],
       [   1, 3000],
       [   1, 3200],
       [   1, 3600],
       [   1, 4000],
       [   0, 2600],
       [   0, 2800],
       [   0, 3300],
       [   0, 3600],
       [   2, 2600],
       [   2, 2900],
       [   2, 3100],
       [   2, 3600]], dtype=int64)

In [26]:
y = dfle.price.values
y

array([550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,
       710000, 575000, 600000, 620000, 695000], dtype=int64)

Now use the one hot encoder to create dummy variables for each of the towns.

In [25]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder = 'passthrough')

In [26]:
X = ct.fit_transform(X)
X

array([[0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 4.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 2.8e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.3e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.9e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.1e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03]])

In [46]:
X = X[:,1:] # drop the first dummy column (El Helmya) to avoid the dummy variable trap

In [47]:
X

array([[1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 3.6e+03]])

In [48]:
model.fit(X,y)

LinearRegression()

In [49]:
model.predict([[0,1,3400]]) # 3400 sqr ft home in Haram

array([666914.10449366])

In [35]:
model.predict([[0,1,2800]]) # 2800 sqr ft home in Haram

array([590775.63964739])
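
As a possible shortcut, `OneHotEncoder` accepts string categories directly and can drop the first dummy column itself via `drop='first'`, so the `LabelEncoder` step and the manual slice are not strictly required. The sketch below assumes a reasonably recent scikit-learn and rebuilds the data inline; `new_home` is an illustrative name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Data rebuilt from the values printed in this notebook (13 rows).
df = pd.DataFrame({
    "town": ["Faysal"] * 5 + ["El Helmya"] * 4 + ["Haram"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

# One-hot encode 'town' and drop the first category (El Helmya);
# 'area' passes through unchanged. sparse_threshold=0 forces a dense array.
ct = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_enc = ct.fit_transform(df[["town", "area"]])  # columns: Faysal, Haram, area

model = LinearRegression().fit(X_enc, df["price"])

# Reuse the fitted transformer so new rows are encoded the same way.
new_home = pd.DataFrame({"town": ["Haram"], "area": [3400]})
pred = model.predict(ct.transform(new_home))
print(pred)
```

Transforming new rows with the fitted `ct` avoids having to remember which position in the input list belongs to which town.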

<h2 style='color:green'>Exercise</h2>

At the same level as this notebook on GitHub, there is an Exercise folder that contains carprices.csv.
This file has car sale prices for 3 different models. First plot the data points on a scatter plot
to see whether a linear regression model can be applied. If yes, then build a model that can answer
the following questions:

**1) Predict price of a mercedez benz that is 4 yr old with mileage 45000**

**2) Predict price of a BMW X5 that is 7 yr old with mileage 86000**

**3) Tell me the score (R², the coefficient of determination) of your model. (Hint: use LinearRegression().score())**