# **Categorical, Dummy Variables, One Hot Encoding**  
  
### Using  
  
## 1. pandas _get_dummies_  
## 2. sklearn _OneHotEncoder_

## **Dataset:**  
  
| town | area | price |
|:--------------------:|:--------------------:|:--------------------:|
| monroe township | 2600 | 550000 |
| monroe township | 3000 | 565000 |
| monroe township | 3200 | 610000 |
| monroe township | 3600 | 680000 |
| monroe township | 4000 | 725000 |
| west windsor | 2600 | 585000 |
| west windsor | 2800 | 615000 |
| west windsor | 3300 | 650000 |
| west windsor | 3600 | 710000 |
| robbinsville | 2600 | 575000 |
| robbinsville | 2900 | 600000 |
| robbinsville | 3100 | 620000 |
| robbinsville | 3600 | 695000 |

## **Task:**  
  
### Build a predictor function to predict the price of a home:  
  
### 1. with 3400 sq ft area in west windsor  
  
### 2. with 2800 sq ft area in robbinsville

## **Handling the text data:**  
  
We have to handle the first column of the dataset, which is in text format, because ML models only work with numerical data.  
  
### Ways to do that:  
  
#### 1. Use Integer Encoding:  
  
Assign each category a specific integer.  
  
1 -> monroe township  
  
2 -> west windsor  
  
3 -> robbinsville  
  
The problem with this approach is that the model treats these integers as quantities, so it may infer relationships that do not exist, such as:  
  
> monroe township < west windsor < robbinsville  
  
or  
  
> monroe township + west windsor = robbinsville
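
To make the issue concrete, here is a minimal sketch in plain Python. The `town_to_code` dictionary is a hypothetical helper that simply mirrors the mapping listed above; it is not part of the notebook's code.

```python
# Hypothetical integer codes, mirroring the mapping listed above.
town_to_code = {"monroe township": 1, "west windsor": 2, "robbinsville": 3}

towns = ["monroe township", "west windsor", "robbinsville", "west windsor"]
codes = [town_to_code[t] for t in towns]
print(codes)

# A model only sees the numbers, so these relationships hold numerically
# even though they mean nothing for towns.
implied_order = (
    town_to_code["monroe township"]
    < town_to_code["west windsor"]
    < town_to_code["robbinsville"]
)
implied_sum = (
    town_to_code["monroe township"] + town_to_code["west windsor"]
    == town_to_code["robbinsville"]
)
print(implied_order, implied_sum)
```

Both checks come out true, which is exactly the kind of accidental structure a regression model could pick up from integer codes.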

## **Categorical Variables**  
  
The town column is a categorical variable: it takes one of a fixed set of labels (the town names) rather than a number.  
  
### Types of Categorical Variables:  
  
1. **Nominal:**  
  
The categories have no numeric ordering or order relationship between them.  
  
e.g.  
  
- male, female  
  
- green, red, blue  
  
- monroe township, west windsor, robbinsville  
  
2. **Ordinal:**  
  
The categories have a specific order relationship between them.  
  
e.g.  
  
- satisfied > neutral > dissatisfied  
  
- graduate < masters < phd  
  
- high > medium > low  
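
For ordinal data, integer codes are reasonable as long as they follow the real order. The sketch below uses `pd.Categorical` with `ordered=True`; the `levels` list and sample `ratings` are made up for illustration.

```python
import pandas as pd

# The order of this list defines the ranking: low < medium < high.
levels = ["low", "medium", "high"]
ratings = pd.Categorical(
    ["medium", "low", "high", "low"], categories=levels, ordered=True
)

# Codes follow the declared order, so comparisons on them are meaningful.
print(ratings.codes.tolist())
print(ratings.min(), ratings.max())
```

Nominal variables such as town names have no such order, which is why they need a different encoding.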

## **One Hot Encoding**  
  
Our town column is a nominal categorical variable. To handle it, we will use a technique called one hot encoding.  
  
One hot encoding creates a separate column for each category in the dataset and assigns a binary (0/1) value to each row. If the category is present in a row, the value is True (1); otherwise it is False (0).  
  
> The **Dataset** will become:  
  
| town | area | price | monroe township | west windsor | robbinsville |
|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| monroe township | 2600 | 550000 | 1 | 0 | 0 |
| monroe township | 3000 | 565000 | 1 | 0 | 0 |
| monroe township | 3200 | 610000 | 1 | 0 | 0 |
| monroe township | 3600 | 680000 | 1 | 0 | 0 |
| monroe township | 4000 | 725000 | 1 | 0 | 0 |
| west windsor | 2600 | 585000 | 0 | 1 | 0 |
| west windsor | 2800 | 615000 | 0 | 1 | 0 |
| west windsor | 3300 | 650000 | 0 | 1 | 0 |
| west windsor | 3600 | 710000 | 0 | 1 | 0 |
| robbinsville | 2600 | 575000 | 0 | 0 | 1 |
| robbinsville | 2900 | 600000 | 0 | 0 | 1 |
| robbinsville | 3100 | 620000 | 0 | 0 | 1 |
| robbinsville | 3600 | 695000 | 0 | 0 | 1 |

> The extra variables which are created are called **Dummy Variables**
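
Before reaching for a library, the mechanics can be sketched in a few lines of plain Python. This is only an illustration of the idea, with a shortened, made-up list of towns; the variable names are not from the notebook.

```python
towns = ["monroe township", "west windsor", "robbinsville", "west windsor"]

# One column per distinct category, in order of first appearance.
categories = list(dict.fromkeys(towns))

# Each row gets a 1 in the column of its own category and 0 elsewhere.
one_hot = [[1 if town == cat else 0 for cat in categories] for town in towns]

print(categories)
for town, row in zip(towns, one_hot):
    print(f"{town:16} {row}")
```

Every row contains exactly one 1, which is where the name "one hot" comes from.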

## **Pandas _get_dummies_**

In [3]:
import pandas as pd

In [56]:
# Ignoring the warnings

import warnings
warnings.filterwarnings('ignore')

In [57]:
df = pd.read_csv("resources/homeprices.csv")
df

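|  | town | area | price |
|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| 0 | monroe township | 2600 | 550000 |
| 1 | monroe township | 3000 | 565000 |
| 2 | monroe township | 3200 | 610000 |
| 3 | monroe township | 3600 | 680000 |
| 4 | monroe township | 4000 | 725000 |
| 5 | west windsor | 2600 | 585000 |
| 6 | west windsor | 2800 | 615000 |
| 7 | west windsor | 3300 | 650000 |
| 8 | west windsor | 3600 | 710000 |
| 9 | robinsville | 2600 | 575000 |
| 10 | robinsville | 2900 | 600000 |
| 11 | robinsville | 3100 | 620000 |
| 12 | robinsville | 3600 | 695000 |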


In [5]:
dummies = pd.get_dummies(df.town)
dummies

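|  | monroe township | robinsville | west windsor |
|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| 0 | True | False | False |
| 1 | True | False | False |
| 2 | True | False | False |
| 3 | True | False | False |
| 4 | True | False | False |
| 5 | False | False | True |
| 6 | False | False | True |
| 7 | False | False | True |
| 8 | False | False | True |
| 9 | False | True | False |
| 10 | False | True | False |
| 11 | False | True | False |
| 12 | False | True | False |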


In [6]:
merged = pd.concat([df,dummies], axis='columns')
merged

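|  | town | area | price | monroe township | robinsville | west windsor |
|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| 0 | monroe township | 2600 | 550000 | True | False | False |
| 1 | monroe township | 3000 | 565000 | True | False | False |
| 2 | monroe township | 3200 | 610000 | True | False | False |
| 3 | monroe township | 3600 | 680000 | True | False | False |
| 4 | monroe township | 4000 | 725000 | True | False | False |
| 5 | west windsor | 2600 | 585000 | False | False | True |
| 6 | west windsor | 2800 | 615000 | False | False | True |
| 7 | west windsor | 3300 | 650000 | False | False | True |
| 8 | west windsor | 3600 | 710000 | False | False | True |
| 9 | robinsville | 2600 | 575000 | False | True | False |
| 10 | robinsville | 2900 | 600000 | False | True | False |
| 11 | robinsville | 3100 | 620000 | False | True | False |
| 12 | robinsville | 3600 | 695000 | False | True | False |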


Next we drop the "town" column, because the model cannot work with text data, and we also drop one of the dummy columns to keep the model out of the _dummy variable trap_.

## **Dummy Variable Trap:**  
  
When we use **one-hot encoding**, we create multiple columns (called *dummy variables*) for a categorical feature.  
  
The **dummy variable trap** happens when **one of those columns can be perfectly predicted using the others** — meaning the variables are **linearly dependent**.  
  
This causes problems in some models (like **linear regression**): with redundant columns the coefficients are no longer uniquely determined, so they become unstable and hard to interpret.  
  
---  
  
### Example  
  
Suppose we have a categorical feature called **Color** with 3 possible values:  
  
> Color = Red, Blue, Green  
  
After **one-hot encoding**, we get:  
  
| Color_Red | Color_Blue | Color_Green |
|------------|-------------|-------------|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
  
---  
  
### Observation  

If we know the first two columns (`Color_Red` and `Color_Blue`), we can always figure out the third one:  
  
> Color_Green = 1 - (Color_Red + Color_Blue)  
  
That means one column is **redundant** — it doesn’t add any new information.  
  
This redundancy is what we call the **dummy variable trap**.  
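
The redundancy can be checked numerically. The sketch below rebuilds the color table with NumPy and adds an explicit intercept column of ones, on the assumption that the model fits an intercept (as scikit-learn's `LinearRegression` does by default). The names `X_full` and `X_dropped` are illustrative only.

```python
import numpy as np

# Columns: intercept, Color_Red, Color_Blue, Color_Green (rows from the table above).
X_full = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
])

# The three dummies always sum to the intercept column, so one column is redundant.
dummies_sum_to_one = np.array_equal(X_full[:, 1:].sum(axis=1), X_full[:, 0])
rank_full = int(np.linalg.matrix_rank(X_full))
print(dummies_sum_to_one, rank_full, "of", X_full.shape[1], "columns")

# Dropping Color_Green leaves only linearly independent columns.
X_dropped = X_full[:, :3]
rank_dropped = int(np.linalg.matrix_rank(X_dropped))
print(rank_dropped, "of", X_dropped.shape[1], "columns")
```

A rank lower than the number of columns means at least one column carries no new information.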
  
---  
   
### How to Fix It  
  
To avoid it, simply **drop one dummy column** (any one).  
  
Example (dropping `Color_Green`):  
  
| Color_Red | Color_Blue |
|------------|-------------|
| 1 | 0 |
| 0 | 1 |
| 0 | 0 |
| 1 | 0 |
  
Now, if both are `0`, we know the color must be **Green** — no information lost, and no redundancy!  
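
In pandas, one way to do this is the `drop_first` parameter of `pd.get_dummies`. Note that it drops the first category in sorted order (here `Blue`, not `Green` as in the table above); which column is dropped does not matter for avoiding the trap. The `colors` frame below is made up for illustration.

```python
import pandas as pd

colors = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# drop_first=True removes the first category in sorted order ("Blue"),
# which then acts as the baseline; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(colors, columns=["Color"], drop_first=True, dtype=int)
print(encoded)
```

A row with zeros in both remaining columns now stands for the dropped category.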
  
---  

In [7]:
# Dropping one column to avoid the dummy variable trap

final = merged.drop(['town','west windsor'], axis = 'columns')
final

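|  | area | price | monroe township | robinsville |
|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| 0 | 2600 | 550000 | True | False |
| 1 | 3000 | 565000 | True | False |
| 2 | 3200 | 610000 | True | False |
| 3 | 3600 | 680000 | True | False |
| 4 | 4000 | 725000 | True | False |
| 5 | 2600 | 585000 | False | False |
| 6 | 2800 | 615000 | False | False |
| 7 | 3300 | 650000 | False | False |
| 8 | 3600 | 710000 | False | False |
| 9 | 2600 | 575000 | False | True |
| 10 | 2900 | 600000 | False | True |
| 11 | 3100 | 620000 | False | True |
| 12 | 3600 | 695000 | False | True |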


In [8]:
# Create a linear regression model

from sklearn.linear_model import LinearRegression

model = LinearRegression()

In [9]:
X = final.drop('price', axis = 'columns')
X

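|  | area | monroe township | robinsville |
|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| 0 | 2600 | True | False |
| 1 | 3000 | True | False |
| 2 | 3200 | True | False |
| 3 | 3600 | True | False |
| 4 | 4000 | True | False |
| 5 | 2600 | False | False |
| 6 | 2800 | False | False |
| 7 | 3300 | False | False |
| 8 | 3600 | False | False |
| 9 | 2600 | False | True |
| 10 | 2900 | False | True |
| 11 | 3100 | False | True |
| 12 | 3600 | False | True |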


In [11]:
y = final.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [13]:
model.fit(X,y)

LinearRegression()


In [48]:
# predicting the price of a 2800 sq ft home in robbinsville
# column order is [area, monroe township, robinsville]
model.predict([[2800, False, True]])

array([590775.63964739])

In [47]:
# predicting the price of a house with 3400 sq ft area in west windsor
# west windsor is the dropped baseline, so both dummy columns are False

model.predict([[3400, False, False]])

array([681241.66845839])

In [16]:
# Checking the R^2 score of the model on the training data

model.score(X,y)

0.9573929037221872
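
The task asks for a predictor function, so here is one possible sketch of the `get_dummies` approach wrapped in a helper. It is self-contained: the data is retyped from the dataset table at the top (using the spelling "robbinsville"), and `predict_price`, `reg`, and `features` are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Data retyped from the dataset table at the top of this notebook.
data = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robbinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,
              710000, 575000, 600000, 620000, 695000],
})

# One dummy per town, with "west windsor" dropped as the baseline.
town_dummies = pd.get_dummies(data["town"], dtype=int).drop(columns="west windsor")
features = pd.concat([data[["area"]], town_dummies], axis="columns")
reg = LinearRegression().fit(features, data["price"])


def predict_price(town: str, area: float) -> float:
    """Hypothetical helper: build one feature row by column name and predict."""
    if town not in set(data["town"]):
        raise ValueError(f"unknown town: {town!r}")
    row = pd.DataFrame([{col: 0 for col in features.columns}])
    row["area"] = area
    if town in town_dummies.columns:  # the baseline town keeps all dummies at 0
        row[town] = 1
    return float(reg.predict(row)[0])


print(round(predict_price("west windsor", 3400)))
print(round(predict_price("robbinsville", 2800)))
```

Filling the row by column name rather than by position makes it harder to put a 1 in the wrong dummy column, which is an easy mistake when passing a bare list to `predict`.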

## **sklearn _OneHotEncoder_**

In [58]:
df

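|  | town | area | price |
|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| 0 | monroe township | 2600 | 550000 |
| 1 | monroe township | 3000 | 565000 |
| 2 | monroe township | 3200 | 610000 |
| 3 | monroe township | 3600 | 680000 |
| 4 | monroe township | 4000 | 725000 |
| 5 | west windsor | 2600 | 585000 |
| 6 | west windsor | 2800 | 615000 |
| 7 | west windsor | 3300 | 650000 |
| 8 | west windsor | 3600 | 710000 |
| 9 | robinsville | 2600 | 575000 |
| 10 | robinsville | 2900 | 600000 |
| 11 | robinsville | 3100 | 620000 |
| 12 | robinsville | 3600 | 695000 |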


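> Here we label encode the "town" column before passing it to OneHotEncoder. Recent versions of scikit-learn can also one hot encode string columns directly, so this step is optional.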

In [59]:
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
le = LabelEncoder()

In [60]:
df_labeled = df.copy()  # copy, so the original df keeps its town names
df_labeled['town'] = le.fit_transform(df_labeled['town'])
df_labeled

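|  | town | area | price |
|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| 0 | 0 | 2600 | 550000 |
| 1 | 0 | 3000 | 565000 |
| 2 | 0 | 3200 | 610000 |
| 3 | 0 | 3600 | 680000 |
| 4 | 0 | 4000 | 725000 |
| 5 | 2 | 2600 | 585000 |
| 6 | 2 | 2800 | 615000 |
| 7 | 2 | 3300 | 650000 |
| 8 | 2 | 3600 | 710000 |
| 9 | 1 | 2600 | 575000 |
| 10 | 1 | 2900 | 600000 |
| 11 | 1 | 3100 | 620000 |
| 12 | 1 | 3600 | 695000 |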


In [61]:
X = df_labeled[['town', 'area']].values # OneHotEncoder requires the input to be in 2D array format
X

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]])

In [62]:
y = df_labeled.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [63]:
from sklearn.preprocessing import OneHotEncoder

ohe = ColumnTransformer(
    transformers=[
        ("encoder", OneHotEncoder(), [0]) # It tells the encoder to apply encoding only on 0th column
    ],
    remainder='passthrough'
)

In [64]:
X = ohe.fit_transform(X)
X

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

In [65]:
X = X[:,1:] # Dropping the first dummy column (monroe township) to avoid the dummy variable trap
X

array([[0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 3.6e+03]])

In [66]:
model.fit(X,y)

LinearRegression()



In [68]:
# predicting the price of a 2800 sq ft home in robbinsville
# column order is now [robinsville, west windsor, area]

model.predict([[1,0,2800]])

array([590775.63964739])

In [69]:
# predicting the price of a house with 3400 sq ft area in west windsor

model.predict([[0,1,3400]])

array([681241.6684584])
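
As an alternative sketch, the label encoding and manual column slicing can be folded into a single scikit-learn pipeline. This assumes a reasonably recent scikit-learn in which `OneHotEncoder` accepts string columns and supports `drop="first"`. The data is retyped from the dataset table at the top, and the names `preprocess`, `pipeline`, and `query` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Data retyped from the dataset table at the top of this notebook.
data = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robbinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,
              710000, 575000, 600000, 620000, 695000],
})

preprocess = ColumnTransformer(
    transformers=[
        # drop="first" removes one dummy per feature to avoid the dummy variable trap
        ("town", OneHotEncoder(drop="first"), ["town"]),
    ],
    remainder="passthrough",  # keep "area" unchanged
)
pipeline = make_pipeline(preprocess, LinearRegression())
pipeline.fit(data[["town", "area"]], data["price"])

# Queries are passed as raw town names, so there is no column order to remember.
query = pd.DataFrame({"town": ["robbinsville", "west windsor"], "area": [2800, 3400]})
predictions = pipeline.predict(query)
print(predictions.round())
```

Because the encoder lives inside the pipeline, the same encoding is applied at fit time and at prediction time.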