# Dummy Variables

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('Cars.csv', skiprows=1)
df.head()

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs)
0,BMW X5,69000,18000,6
1,BMW X5,35000,34000,3
2,BMW X5,57000,26100,5
3,BMW X5,22500,40000,2
4,BMW X5,46000,31500,4


In [3]:
dummies = pd.get_dummies(df['Car Model'])
dummies

Unnamed: 0,Audi A5,BMW X5,Mercedez Benz C class
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
5,1,0,0
6,1,0,0
7,1,0,0
8,1,0,0
9,0,0,1


## Dummy Variable Trap
The dummy variable trap arises when the dummy columns are perfectly correlated (multicollinear), meaning one column can be computed exactly from the others. Suppose we have a dataset with a Gender column containing male and female:

| Gender |
| :-: |
|Male|
|Female|
|Male|
|Female|
| ... |
| Male |

Our dummy variables would be<br>
Gender_male = [1, 0, 1, 0, ..., 1]<br>
Gender_female = [0, 1, 0, 1, ..., 0]<br>
$x_{male}$ and $x_{female}$ are perfectly (negatively) correlated, because every row is exactly one of the two, so $x_{male} = 1 - x_{female}$<br>
In a regression model like: <br>
$
f_{w, b}(x) = w_1x_{male} + w_2x_{female} + b
$<br>
Using the relation above in the form $x_{female} = 1 - x_{male}$, we can substitute <br>
$
f_{w, b}(x) = (w_1 - w_2)x_{male} + b + w_2
$<br>
Here $(w_1 - w_2)$ becomes our new coefficient and $b + w_2$ our new intercept, so $x_{female}$ adds no information and the original weights are not uniquely determined.
To avoid the dummy variable trap, we drop one of the dummy columns: if a feature has $n$ possible values, we keep only $n-1$ dummy columns, since the $n$th can be derived from the others.
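
The argument above can be checked numerically. The following is a minimal sketch on a hypothetical five-row gender column (toy data, not taken from `Cars.csv`): it builds the design matrix with an intercept column, compares its rank with and without the redundant dummy, and shows two different weight vectors that produce identical predictions. The variable names and the example weights are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from Cars.csv
gender = pd.Series(["Male", "Female", "Male", "Female", "Male"], name="Gender")

# One dummy column per category (dtype=int gives 0/1 instead of booleans)
full = pd.get_dummies(gender, dtype=int)
# Same encoding with the first category (alphabetically "Female") dropped
reduced = pd.get_dummies(gender, dtype=int, drop_first=True)

intercept = np.ones((len(gender), 1))
X_full = np.hstack([intercept, full.to_numpy()])        # columns: 1, Female, Male
X_reduced = np.hstack([intercept, reduced.to_numpy()])  # columns: 1, Male

# With all dummies, Female + Male equals the intercept column,
# so the matrix has fewer independent columns than it has columns.
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print("full:   ", X_full.shape[1], "columns, rank", rank_full)
print("reduced:", X_reduced.shape[1], "columns, rank", rank_reduced)

# Two different parameter vectors (b, w_female, w_male) with identical predictions:
# params_b folds w_female into the intercept (b + w_2) and uses (w_1 - w_2) for male.
params_a = np.array([1.0, 2.0, 5.0])
params_b = np.array([3.0, 0.0, 3.0])
same_predictions = np.allclose(X_full @ params_a, X_full @ params_b)
print("identical predictions:", same_predictions)
```

Because the full matrix is rank deficient, a least-squares fit has no unique solution for the weights; dropping one dummy column restores full column rank.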

In [4]:
# Drop the first dummy column (Audi A5); it is implied when the other two are 0
dummies = dummies.iloc[:, 1:]

In [5]:
dummies

Unnamed: 0,BMW X5,Mercedez Benz C class
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0
5,0,0
6,0,0
7,0,0
8,0,0
9,0,1


We can also use the pandas `get_dummies` function with `drop_first=True`, which drops the first category's column automatically.

In [6]:
df = pd.get_dummies(df, columns=['Car Model'], prefix='CarModel', drop_first=True)
df.head()

Unnamed: 0,Mileage,Sell Price($),Age(yrs),CarModel_BMW X5,CarModel_Mercedez Benz C class
0,69000,18000,6,1,0
1,35000,34000,3,1,0
2,57000,26100,5,1,0
3,22500,40000,2,1,0
4,46000,31500,4,1,0


In [7]:
df.to_csv('Cars_Result.csv')