In [1]:
import numpy as np
import pandas as pd

In [6]:
home_prices_df = pd.read_csv('homeprices.csv')
home_prices_df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


We use one-hot encoding when a column holds categorical values, such as text labels, that a model cannot use directly. It works by creating a new column for each category and assigning 1 to the rows that belong to that category and 0 to the rows that don't.
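
To make the mechanism concrete, here is a minimal sketch that builds the indicator columns by hand. The `towns` series is a hypothetical toy column (its values mirror the towns in the data above), and `manual_dummies` is a name introduced only for this illustration.

```python
import pandas as pd

# Hypothetical toy column; the values mirror the towns in homeprices.csv.
towns = pd.Series(["monroe township", "west windsor", "robinsville", "west windsor"])

# One indicator column per distinct category: 1 where the row matches, else 0.
manual_dummies = pd.DataFrame(
    {category: (towns == category).astype(int) for category in sorted(towns.unique())}
)
print(manual_dummies)
```

Each row ends up with exactly one 1, in the column of its own category. In practice `pd.get_dummies` does this in a single call.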

In [8]:
# dtype=int keeps the indicators as 1/0; recent pandas versions default to True/False
dummies = pd.get_dummies(home_prices_df.town, dtype=int)
dummies

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0


In [9]:
merged = pd.concat([home_prices_df, dummies], axis='columns')
merged

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0


The dummy variable trap is a common issue that arises when categorical variables are one-hot encoded, particularly in regression analysis. It occurs when every dummy column of a categorical variable is included in the model: each row has exactly one 1 among those columns, so any one column can be computed from the others and carries no new information. In a model with an intercept this produces perfect multicollinearity, which means the coefficients cannot be determined uniquely and the results of the regression are difficult to interpret.

Here's a simplified example to illustrate the concept:

Suppose you are conducting a regression analysis to predict the price of a car based on its color and, for simplicity, every car in the data is either red or blue. You decide to represent the car's color using dummy variables, so you create two of them, "Red" and "Blue," where:

* Red = 1 if the car is red, 0 otherwise.
* Blue = 1 if the car is blue, 0 otherwise.

Now you have a dummy variable for both red and blue cars. However, if you include both "Red" and "Blue" in your regression model, you will run into the dummy variable trap. Why? Because every car is one of the two colors, a car that is not red (i.e., Red = 0) must be blue, so Blue = 1 - Red. The information about whether a car is blue is already contained in the "Red" variable. The two dummy variables are perfectly (negatively) correlated, and this causes problems in the regression analysis.

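To avoid the dummy variable trap, you typically need to omit one of the dummy variables. For example, you can drop the "Red" dummy variable and include only the "Blue" dummy variable in your regression model. In general, a categorical variable with k categories needs only k - 1 dummy columns, and the omitted category becomes the baseline against which the others are compared. This approach ensures that there is no perfect multicollinearity between the dummy variables.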
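
The redundancy can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original notebook: `towns` is a hypothetical toy column, and the intercept column of ones stands in for the constant term a regression model would add.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the values mirror the towns in homeprices.csv.
towns = pd.Series(
    ["monroe township", "monroe township", "west windsor", "west windsor", "robinsville"]
)

full = pd.get_dummies(towns, dtype=int)                      # all three dummy columns
reduced = pd.get_dummies(towns, dtype=int, drop_first=True)  # first category dropped

# Every row of the full set sums to 1, which reproduces the intercept column.
row_sums = full.sum(axis=1)

# Design matrices with an explicit intercept column of ones.
intercept = np.ones((len(towns), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# Rank below the column count means the columns are linearly dependent.
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)        # columns vs. rank, full dummy set
print(X_reduced.shape[1], rank_reduced)  # columns vs. rank, one dummy dropped
```

With all three dummies the matrix has four columns but rank three; after dropping one dummy the rank equals the number of columns. `drop_first=True` removes the first category in sorted order, whereas the next cell drops 'west windsor' by hand. Either choice removes the redundancy and only changes which town serves as the baseline.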

In [10]:
final = merged.drop(['town', 'west windsor'], axis='columns')
final

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,0
6,2800,615000,0,0
7,3300,650000,0,0
8,3600,710000,0,0
9,2600,575000,0,1
