## Categorical Variables and One Hot Encoding

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('homeprices.csv')
df.head()

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000


### Using pandas to create dummy variables

To avoid the dummy variable trap, pass ``drop_first=True``, which drops the first dummy column:
``pd.get_dummies(df.town, drop_first=True)``
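
As a minimal sketch of what ``drop_first=True`` does, the cell below uses a small hand-written `towns` Series (an assumption standing in for the `town` column of `homeprices.csv`) and compares the columns with and without the flag. `dtype=int` is passed only so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the 'town' column of homeprices.csv
towns = pd.Series(
    ["monroe township", "west windsor", "robinsville", "monroe township"],
    name="town",
)

# One dummy column per town, in sorted order
full = pd.get_dummies(towns, dtype=int)

# drop_first=True removes the first column in that order ('monroe township')
reduced = pd.get_dummies(towns, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

A row whose remaining dummies are all 0 then stands for the dropped town.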

In [3]:
pd.get_dummies(df.town)

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0


In [4]:
dummies = pd.get_dummies(df.town)
dummies.head()

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0


In [5]:
merged = pd.concat([df,dummies], axis='columns')
merged.head()

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0


In [6]:
final = merged.drop(['town'], axis='columns')
final.head()

Unnamed: 0,area,price,monroe township,robinsville,west windsor
0,2600,550000,1,0,0
1,3000,565000,1,0,0
2,3200,610000,1,0,0
3,3600,680000,1,0,0
4,4000,725000,1,0,0


### Dummy Variable Trap

When one variable can be derived from the others, the variables are said to be multicollinear. Here, if you know the values of the monroe township and robinsville columns, you can infer the value of west windsor: when monroe township=0 and robinsville=0, the town must be west windsor. These dummy columns are therefore multicollinear, and linear regression may not behave as expected in that situation. Hence you need to drop one column.

**NOTE: sklearn's linear regression copes with the dummy variable trap, so it still works even if you don't drop one of the dummy columns. However, we should make a habit of handling the dummy variable trap ourselves, in case the library we are using does not handle it for us.**
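
One way to see the trap numerically is to check the rank of the design matrix. The sketch below uses a few made-up one-hot rows (not the real file) plus an intercept column. With all three dummies the intercept equals the sum of the dummy columns, so the matrix has four columns but only rank 3.

```python
import numpy as np

# Illustrative one-hot rows; columns: monroe township, robinsville, west windsor
dummies = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 0],
])
intercept = np.ones((dummies.shape[0], 1))

# Intercept + all three dummies: 4 columns, one of them redundant
full_design = np.hstack([intercept, dummies])
rank_full = np.linalg.matrix_rank(full_design)

# Intercept + two dummies (west windsor dropped): 3 independent columns
reduced_design = np.hstack([intercept, dummies[:, :2]])
rank_reduced = np.linalg.matrix_rank(reduced_design)

print(full_design.shape[1], rank_full)
print(reduced_design.shape[1], rank_reduced)
```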

In [7]:
final = final.drop(['west windsor'], axis='columns')

In [8]:
# # Another way of dropping columns: in place, without creating a new dataframe.
# merged.drop(['town', 'west windsor'], axis='columns', inplace=True)
# merged

In [9]:
X = final.drop('price', axis='columns')
X.head()

Unnamed: 0,area,monroe township,robinsville
0,2600,1,0
1,3000,1,0
2,3200,1,0
3,3600,1,0
4,4000,1,0


In [10]:
y = final.price
y.head()

0    550000
1    565000
2    610000
3    680000
4    725000
Name: price, dtype: int64

In [11]:
from sklearn import linear_model
model = linear_model.LinearRegression()

In [12]:
model.fit(X,y)

LinearRegression()

In [13]:
model.coef_

array([   126.89744141, -40013.97548914, -14327.56396474])

In [14]:
model.intercept_

249790.36766292533

In [15]:
model.predict(X) # predictions for every row of the training data

array([539709.7398409 , 590468.71640508, 615848.20468716, 666607.18125134,
       717366.15781551, 579723.71533005, 605103.20361213, 668551.92431735,
       706621.15674048, 565396.15136531, 603465.38378844, 628844.87207052,
       692293.59277574])

In [16]:
model.score(X,y)

0.9573929037221873
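
For a regressor, ``model.score`` returns the coefficient of determination R². As a rough check, the sketch below recomputes it by hand from the prices and the in-sample predictions printed in this notebook. The numbers are copied from the outputs, so the result only matches up to rounding.

```python
import numpy as np

# Prices and in-sample predictions copied from this notebook's outputs
y = np.array([550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000])
y_pred = np.array([
    539709.7398409, 590468.71640508, 615848.20468716, 666607.18125134,
    717366.15781551, 579723.71533005, 605103.20361213, 668551.92431735,
    706621.15674048, 565396.15136531, 603465.38378844, 628844.87207052,
    692293.59277574,
])

ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot
print(r2)
```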

In [17]:
model.predict([[3400, 0, 0]]) # 3400 sqr ft home in west windsor

array([681241.66845839])

In [18]:
model.predict([[2800, 0, 1]]) # 2800 sqr ft home in robbinsville

array([590775.63964739])
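
The two predictions above can be reproduced by hand as intercept plus the dot product of the coefficients with the feature row. This is a small sketch using the ``coef_`` and ``intercept_`` values printed earlier; `predict_price` is a hypothetical helper, not part of sklearn.

```python
import numpy as np

# Values copied from model.coef_ and model.intercept_ above
# Feature order: area, monroe township, robinsville
coef = np.array([126.89744141, -40013.97548914, -14327.56396474])
intercept = 249790.36766292533

def predict_price(area, monroe_township, robinsville):
    """Hypothetical helper: the linear model written out explicitly."""
    features = np.array([area, monroe_township, robinsville])
    return intercept + coef @ features

west_windsor_3400 = predict_price(3400, 0, 0)  # both dummies 0 means west windsor
robinsville_2800 = predict_price(2800, 0, 1)
print(west_windsor_3400, robinsville_2800)
```

West windsor is the dropped column, so its effect is absorbed into the intercept and the other two town coefficients are offsets relative to it.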

### Using sklearn OneHotEncoder

The first step is to use ``LabelEncoder`` to convert the town names into integers.

In [19]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [20]:
dfle = df.copy()
le.fit_transform(dfle.town)

array([0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1])

In [21]:
dfle.town = le.fit_transform(dfle.town)
dfle.head()

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
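
To see which integer stands for which town, the fitted encoder's ``classes_`` attribute can be inspected. The sketch below uses a short illustrative list of towns rather than the full column.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative sample of town names (not the full column)
towns = ["monroe township", "west windsor", "robinsville", "monroe township"]

le = LabelEncoder()
codes = le.fit_transform(towns)

# classes_ is sorted; the position of each label is its integer code
mapping = {str(town): code for code, town in enumerate(le.classes_)}

# inverse_transform maps the integer codes back to the original labels
decoded = [str(t) for t in le.inverse_transform(codes)]

print(mapping)
print(list(codes))
```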


In [22]:
X = dfle[['town', 'area']].values
X

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]], dtype=int64)

In [23]:
y = dfle.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

``OneHotEncoder`` no longer accepts the ``categorical_features`` parameter used in the older code below, so we use ``ColumnTransformer`` to choose which column to encode.

```
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(categorical_features=[0])

X = ohe.fit_transform(X).toarray()
```

In [24]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('town', OneHotEncoder(), [0])], 
                       remainder='passthrough')

In [25]:
X = ct.fit_transform(X)
X

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

In [26]:
X = X[:, 1:]
X

array([[0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 3.6e+03]])
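
A possible alternative to label-encoding first and then slicing off a column by hand: ``OneHotEncoder`` can encode string values directly and drop the first category itself via ``drop='first'``, and ``ColumnTransformer`` can select a DataFrame column by name. The sketch below assumes a small `features` frame with the same two columns as the data above; ``sparse_threshold=0`` is set only to force a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative rows shaped like homeprices.csv without the 'price' column
features = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2800, 2900],
})

ct = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],  # select column by name
    remainder="passthrough",                            # keep 'area' unchanged
    sparse_threshold=0,                                 # always return a dense array
)
X_encoded = ct.fit_transform(features)

# Column order: robinsville, west windsor, area ('monroe township' is dropped)
print(X_encoded)
```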

In [27]:
model.fit(X,y)

LinearRegression()

In [28]:
model.predict(X) # predictions for every row of the training data

array([539709.7398409 , 590468.71640508, 615848.20468716, 666607.18125134,
       717366.15781552, 579723.71533004, 605103.20361213, 668551.92431735,
       706621.15674048, 565396.1513653 , 603465.38378843, 628844.87207052,
       692293.59277575])

In [29]:
model.score(X,y)

0.9573929037221874

In [30]:
model.predict([[0, 1, 3400]]) # 3400 sqr ft home in west windsor

array([681241.6684584])

In [31]:
model.predict([[1, 0, 2800]]) # 2800 sqr ft home in robbinsville

array([590775.63964739])

## Using OneHotEncoding with some modifications

### https://www.statology.org/one-hot-encoding-in-python/

In [32]:
import pandas as pd
import numpy as np

In [33]:
df = pd.read_csv('homeprices.csv')
df.head()

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000


In [34]:
from sklearn.preprocessing import OneHotEncoder

In [35]:
# create an instance of the one-hot encoder
ohe = OneHotEncoder(handle_unknown='ignore')

In [36]:
# perform one-hot encoding on 'town' column
encoder_df = pd.DataFrame(
    ohe.fit_transform(df[['town']]).toarray())

In [37]:
# merge one_hot encoded columns back with original DataFrame
final_df = df.join(encoder_df)

In [38]:
print(final_df.head())

              town  area   price    0    1    2
0  monroe township  2600  550000  1.0  0.0  0.0
1  monroe township  3000  565000  1.0  0.0  0.0
2  monroe township  3200  610000  1.0  0.0  0.0
3  monroe township  3600  680000  1.0  0.0  0.0
4  monroe township  4000  725000  1.0  0.0  0.0


In [39]:
final_df.columns = ['town', 'area', 'price', 'monroe township', 
                    'robinsville', 'west windsor']
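
Rather than typing the new column names by hand, they can be taken from the fitted encoder's ``categories_`` attribute, which keeps them in the same order as the encoded columns. A minimal sketch, using three rows copied from the data above as a stand-in for the full file:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Three rows taken from the outputs above (one per town)
df = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2800, 2900],
    "price": [550000, 615000, 600000],
})

ohe = OneHotEncoder(handle_unknown="ignore")
encoded = ohe.fit_transform(df[["town"]]).toarray()

# categories_ holds one array of labels per encoded input column
encoder_df = pd.DataFrame(encoded, columns=ohe.categories_[0], index=df.index)
final_df = df.join(encoder_df)
print(final_df.head())
```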

In [40]:
final_df.drop(['town', 'west windsor'], 
              axis='columns', inplace=True)
final_df.head()

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,1.0,0.0
1,3000,565000,1.0,0.0
2,3200,610000,1.0,0.0
3,3600,680000,1.0,0.0
4,4000,725000,1.0,0.0


In [41]:
X = final_df.drop('price', axis='columns')
X.head()

Unnamed: 0,area,monroe township,robinsville
0,2600,1.0,0.0
1,3000,1.0,0.0
2,3200,1.0,0.0
3,3600,1.0,0.0
4,4000,1.0,0.0


In [42]:
y = final_df.price
y.head()

0    550000
1    565000
2    610000
3    680000
4    725000
Name: price, dtype: int64

In [43]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [44]:
model.fit(X,y)

LinearRegression()

In [45]:
model.predict(X) # predictions for every row of the training data

array([539709.7398409 , 590468.71640508, 615848.20468716, 666607.18125134,
       717366.15781551, 579723.71533005, 605103.20361213, 668551.92431735,
       706621.15674048, 565396.15136531, 603465.38378844, 628844.87207052,
       692293.59277574])

In [46]:
model.score(X,y)

0.9573929037221873

In [47]:
model.predict([[3400,0,0]]) # 3400 sqr ft home in west windsor

array([681241.66845839])

In [48]:
model.predict([[2800,0,1]]) # 2800 sqr ft home in robbinsville

array([590775.63964739])

### https://www.geeksforgeeks.org/ml-one-hot-encoding/

In [49]:
import pandas as pd
import numpy as np

In [50]:
df = pd.read_csv('homeprices.csv')
df.head()

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000


In [51]:
# extract the categorical columns from the dataframe;
# columns with the object dtype are treated as
# the categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
categorical_columns

['town']

In [52]:
# initialize OneHotEncoder
# (scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer releases need sparse_output=False)
ohe = OneHotEncoder(sparse=False)

In [53]:
# apply one-hot encoding to the categorical columns
one_hot_encoded = ohe.fit_transform(df[categorical_columns])

In [54]:
# create a DataFrame with the one-hot encoded columns
# get_feature_names() supplies the column names for the encoded data
# (scikit-learn 1.0+ provides get_feature_names_out() for this; get_feature_names() was later removed)
one_hot_df = pd.DataFrame(one_hot_encoded, 
                          columns=ohe.get_feature_names(categorical_columns))

In [55]:
# concatenate the one_hot encoded dataframe with the original dataframe
df_encoded = pd.concat([df, one_hot_df], axis='columns')
df_encoded.head()

Unnamed: 0,town,area,price,town_monroe township,town_robinsville,town_west windsor
0,monroe township,2600,550000,1.0,0.0,0.0
1,monroe township,3000,565000,1.0,0.0,0.0
2,monroe township,3200,610000,1.0,0.0,0.0
3,monroe township,3600,680000,1.0,0.0,0.0
4,monroe township,4000,725000,1.0,0.0,0.0


In [56]:
df_encoded.drop(['town', 'town_west windsor'], 
              axis='columns', inplace=True)
df_encoded.head()

Unnamed: 0,area,price,town_monroe township,town_robinsville
0,2600,550000,1.0,0.0
1,3000,565000,1.0,0.0
2,3200,610000,1.0,0.0
3,3600,680000,1.0,0.0
4,4000,725000,1.0,0.0


In [57]:
X = df_encoded.drop('price', axis='columns')
X.head()

Unnamed: 0,area,town_monroe township,town_robinsville
0,2600,1.0,0.0
1,3000,1.0,0.0
2,3200,1.0,0.0
3,3600,1.0,0.0
4,4000,1.0,0.0


In [58]:
y = df_encoded.price
y.head()

0    550000
1    565000
2    610000
3    680000
4    725000
Name: price, dtype: int64

In [59]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [60]:
model.fit(X,y)

LinearRegression()

In [61]:
model.predict(X) # predictions for every row of the training data

array([539709.7398409 , 590468.71640508, 615848.20468716, 666607.18125134,
       717366.15781551, 579723.71533005, 605103.20361213, 668551.92431735,
       706621.15674048, 565396.15136531, 603465.38378844, 628844.87207052,
       692293.59277574])

In [62]:
model.score(X,y)

0.9573929037221873

In [63]:
model.predict([[3400,0,0]]) # 3400 sqr ft home in west windsor

array([681241.66845839])

In [64]:
model.predict([[2800,0,1]]) # 2800 sqr ft home in robbinsville

array([590775.63964739])

## Tried to pass a single column to OneHotEncoder, but it didn't work

In [65]:
import numpy as np
import pandas as pd
from sklearn import linear_model

In [66]:
df = pd.read_csv('homeprices.csv')
df.head()

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000


In [67]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [68]:
dfle = df.copy()  # copy so that encoding 'town' does not modify df
le.fit_transform(dfle.town)

array([0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1])

In [69]:
dfle.town = le.fit_transform(dfle.town)
dfle.head()

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000


In [70]:
inputs = dfle.drop('price', axis = 'columns')
inputs
# inputs.to_numpy()

Unnamed: 0,town,area
0,0,2600
1,0,3000
2,0,3200
3,0,3600
4,0,4000
5,2,2600
6,2,2800
7,2,3300
8,2,3600
9,1,2600


In [71]:
outputs = dfle.price
outputs

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [72]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)

In [73]:
one_hot_encoded = ohe.fit_transform([inputs['town']])

In [74]:
# one_hot_df = pd.DataFrame(one_hot_encoded, 
#                           columns=ohe.get_feature_names(inputs['town']))

### https://builtin.com/articles/one-hot-encoding

In [75]:
import pandas as pd
import numpy as np

In [76]:
df = pd.read_csv('homeprices.csv')
df.head()

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000


In [77]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [78]:
dfle = df.copy()
# dfle.town = le.fit_transform(dfle.town)
# dfle

In [79]:
X = dfle.drop('price', axis='columns')
X.head()

Unnamed: 0,town,area
0,monroe township,2600
1,monroe township,3000
2,monroe township,3200
3,monroe township,3600
4,monroe township,4000


In [80]:
y = dfle.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [81]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(categories=[['monroe township','west windsor','robinsville']])

In [82]:
# ohe.fit_transform(X['town'])
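
The commented-out call above passes a 1-D Series, while ``OneHotEncoder`` expects 2-D input. A minimal sketch of how this cell might be completed, assuming a small `X` frame in place of the real data: selecting the column with double brackets keeps it 2-D, and the explicit ``categories`` list fixes the order of the output columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative rows standing in for the real X
X = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2800, 2900],
})

# Output columns follow the order given here, not alphabetical order
ohe = OneHotEncoder(categories=[["monroe township", "west windsor", "robinsville"]])

# X[['town']] is a one-column DataFrame (2-D); X['town'] would be 1-D
encoded = ohe.fit_transform(X[["town"]]).toarray()
print(encoded)
```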