# Dummy Variables in Python

**Dummy variables** (also known as *indicator variables* or *binary variables*) are used in statistical modeling, particularly in regression analysis, to represent categorical data as numerical values. Categorical data can be divided into distinct categories or groups, such as "red," "green," and "blue" for colors or "male" and "female" for gender. Dummy variables are essential because many machine learning algorithms, such as linear regression and logistic regression, require numerical input features and cannot use categorical variables directly. In Python, they are typically created with pandas or scikit-learn.

Dummy variables are created by converting each category or level of a categorical variable into a separate binary (0 or 1) variable. Each binary variable represents the presence or absence of a specific category. For example, if you have a "color" variable with three categories: "red," "green," and "blue," you would create three dummy variables:

1. **Red:** 1 if the color is red, 0 otherwise.
2. **Green:** 1 if the color is green, 0 otherwise.
3. **Blue:** 1 if the color is blue, 0 otherwise.
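
As a minimal sketch of the color example above, the following uses `pd.get_dummies` on a small, hypothetical series (the values are made up for illustration and are not part of the home-price data used below). The `dtype=int` argument is there because recent pandas versions return `True`/`False` by default.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the three color dummies.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One 0/1 column per category; columns come out in sorted order.
color_dummies = pd.get_dummies(colors, dtype=int)
print(color_dummies)
```

Each row has exactly one 1, in the column that matches its color.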

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
df = pd.read_csv("homeprices.csv")
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [3]:
# dtype=int keeps the 0/1 output shown below; pandas 2.0+ returns True/False by default
dummies = pd.get_dummies(df.town, dtype=int)
dummies

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0


In [4]:
# Concat or join the two dataframes
df_new = pd.concat([df,dummies], axis='columns')
df_new

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0


In [5]:
# Drop the original 'town' column: its string values cannot be used by the
# linear regression model, and the dummy columns now carry the same information.
# axis=0 drops rows; axis=1 drops columns.
df_new = df_new.drop(['town'], axis=1)
df_new

Unnamed: 0,area,price,monroe township,robinsville,west windsor
0,2600,550000,1,0,0
1,3000,565000,1,0,0
2,3200,610000,1,0,0
3,3600,680000,1,0,0
4,4000,725000,1,0,0
5,2600,585000,0,0,1
6,2800,615000,0,0,1
7,3300,650000,0,0,1
8,3600,710000,0,0,1
9,2600,575000,0,1,0


# Understanding the Dummy Variable Trap

Any regression model that uses **dummy variables** for a categorical feature has to account for the **Dummy Variable Trap**, a form of redundancy that appears when every category gets its own dummy column.

## What is the Dummy Variable Trap?

The **dummy variable trap** occurs when a model includes a dummy variable for every category of a categorical variable together with an intercept. The dummy columns then sum to 1 in every row, so each one is an exact linear combination of the others and the intercept. This is **perfect multicollinearity**, and it prevents ordinary least squares from finding a unique set of coefficients.

## Why Does the Trap Occur?

Suppose a variable with three categories is encoded as three dummy variables D1, D2, and D3. Every observation belongs to exactly one category, so D1 + D2 + D3 = 1 in every row: if D1 = 0 and D2 = 0, then D3 must be 1. When the model also has an intercept, its columns are **linearly dependent**. This perfect multicollinearity poses several issues:

- **Redundancy**: The third dummy variable adds no information, since its value is fully determined by the other two.

- **Model Instability**: With perfect multicollinearity, the least-squares coefficients are not uniquely determined. Depending on the solver, fitting either fails or returns one of infinitely many equivalent solutions, so the individual coefficients cannot be interpreted.
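
The linear dependence can be checked numerically. The sketch below rebuilds the three town dummy columns from the ten rows shown earlier and compares matrix ranks with NumPy; the variable names are illustrative only.

```python
import numpy as np

# Town dummies for the ten rows shown earlier:
# 5 x monroe township, 4 x west windsor, 1 x robinsville.
monroe = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
robinsville = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
west_windsor = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])
intercept = np.ones(10)

# Every row belongs to exactly one town, so the dummies add up to the intercept column.
print(np.array_equal(monroe + robinsville + west_windsor, intercept))

full = np.column_stack([intercept, monroe, robinsville, west_windsor])
reduced = np.column_stack([intercept, monroe, robinsville])

print(full.shape[1], np.linalg.matrix_rank(full))        # 4 columns, rank 3
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 columns, rank 3
```

A rank lower than the number of columns means at least one column is redundant, which is exactly the trap.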

## How to Avoid the Dummy Variable Trap

To sidestep the Dummy Variable Trap, we must omit one of the dummy variables while including the others in the model. Typically, one category level is dropped, and the omitted category serves as the **reference category**.
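
One possible shortcut, sketched here under the assumption that any category is acceptable as the reference, is the `drop_first=True` option of `pd.get_dummies`. It omits the first category in sorted order, so with the town data it drops "monroe township", whereas the notebook cells that follow drop "west windsor" by hand.

```python
import pandas as pd

# The town column from the ten rows shown earlier.
towns = pd.Series(
    ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    name="town",
)

# drop_first=True omits the first category in sorted order ("monroe township"),
# which then acts as the reference category.
town_dummies = pd.get_dummies(towns, drop_first=True, dtype=int)
print(town_dummies.columns.tolist())  # ['robinsville', 'west windsor']
```

Dropping the column manually, as the notebook does, gives control over which category becomes the reference.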



In conclusion, the Dummy Variable Trap is a critical consideration when working with categorical variables in regression analysis. To steer clear of the trap, omit one category level, allowing you to create a robust and interpretable model while capturing the effects of each category.




In [6]:
df_new

Unnamed: 0,area,price,monroe township,robinsville,west windsor
0,2600,550000,1,0,0
1,3000,565000,1,0,0
2,3200,610000,1,0,0
3,3600,680000,1,0,0
4,4000,725000,1,0,0
5,2600,585000,0,0,1
6,2800,615000,0,0,1
7,3300,650000,0,0,1
8,3600,710000,0,0,1
9,2600,575000,0,1,0


In [7]:
df_new = df_new.drop(['west windsor'], axis=1)

In [8]:
df_new

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,0
6,2800,615000,0,0
7,3300,650000,0,0
8,3600,710000,0,0
9,2600,575000,0,1


In [9]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [10]:
# Features: area plus the two remaining dummy columns (monroe township, robinsville).
# price is the target variable, so it is excluded from X.
X = df_new.drop('price', axis='columns')
y = df[['price']]
model.fit(X,y)

In [11]:
# Feature order: [area, monroe township, robinsville]
# 1 = the home is in that town, 0 = it is not
# Here: a 3600 sq ft home in robinsville
model.predict([[3600, 0, 1]])



array([[692293.59277574]])

In [12]:
# How do we get a prediction for west windsor?
# It is the reference category, so it is encoded by setting both dummy columns to 0
model.predict([[3600, 0, 0]])



array([[706621.15674048]])
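
Which category is dropped does not change the fitted values of an ordinary least-squares model, only how the coefficients are read. The sketch below illustrates this with `np.linalg.lstsq` instead of scikit-learn, using the ten rows shown earlier; the helper `fit_and_predict` is a hypothetical name introduced for this example.

```python
import numpy as np

# Area, price, and town dummies for the ten rows shown earlier.
area = np.array([2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600], dtype=float)
price = np.array([550000, 565000, 610000, 680000, 725000,
                  585000, 615000, 650000, 710000, 575000], dtype=float)
monroe = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
robinsville = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=float)
west_windsor = 1.0 - monroe - robinsville
ones = np.ones_like(area)

def fit_and_predict(design):
    """Ordinary least squares via lstsq; returns fitted values for the training rows."""
    coef, *_ = np.linalg.lstsq(design, price, rcond=None)
    return design @ coef

# Reference category = west windsor (as in the notebook).
pred_a = fit_and_predict(np.column_stack([ones, area, monroe, robinsville]))
# Reference category = monroe township.
pred_b = fit_and_predict(np.column_stack([ones, area, robinsville, west_windsor]))

print(np.allclose(pred_a, pred_b))  # True: same fitted values either way
```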

In [13]:
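# model.score returns R^2 (the coefficient of determination), not classification accuracy.
# It is computed here on the training data; multiply by 100 to express it as a percentage.
(model.score(X,y))*100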

95.73929037221872
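
For reference, the score reported by `LinearRegression.score` is the coefficient of determination, R² = 1 - SS_res / SS_tot. A minimal sketch of that formula follows; the function name `r_squared` and the sample values are hypothetical and serve only to exercise the calculation.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical values, chosen only to exercise the function.
y_true = [3.0, 5.0, 7.0, 9.0]
print(r_squared(y_true, y_true))     # perfect predictions -> 1.0
print(r_squared(y_true, [6.0] * 4))  # always predicting the mean -> 0.0
```

Because the notebook scores the model on the same rows it was trained on, the value says how well the model fits this data, not how well it would generalize.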

# Understanding the Differences: Dummy Variables vs. One Hot Encoding

Two common ways to turn a categorical variable into numeric columns are **Dummy Variables** and **One Hot Encoding**. Both produce binary columns, and the two terms are often used interchangeably; the distinction drawn here is whether one category is dropped.

## Dummy Variables: A Closer Look

**Dummy Variables** are binary (0 or 1) variables designed to represent the presence or absence of each category within a categorical variable. They are created by converting each category into a separate binary column. Key aspects of Dummy Variables include:

- **Representation**: Each category is transformed into a binary column.
- **Number of Columns**: k - 1 for a variable with k unique categories; the omitted category is the reference.
- **Multicollinearity**: Avoided by dropping one column. Keeping all k columns alongside an intercept causes the dummy variable trap.
- **Dimensionality**: One column fewer than One Hot Encoding for each categorical variable.
- **Use Cases**: Useful when specifying a reference category or guided by domain knowledge.

## One Hot Encoding: A Deeper Dive

**One Hot Encoding**, like Dummy Variables, transforms categorical variables into binary columns, but it does so differently. In One Hot Encoding, each unique category gets its dedicated binary column with a 1 representing the category's presence and 0s for all other categories. Key aspects of One Hot Encoding include:

- **Representation**: Each category gets its own binary column, and exactly one column is 1 in each row.
- **Number of Columns**: One per unique category (k columns for k categories).
- **Multicollinearity**: The k columns always sum to 1, so a linear model with an intercept falls into the dummy variable trap unless one column is dropped or regularization is used.
- **Dimensionality**: One column more than dummy encoding for each categorical variable.
- **Use Cases**: Commonly used with models that do not need a reference category, such as tree-based or regularized models, or when every category should have its own explicit column.

## Making the Right Choice

The choice between Dummy Variables and One Hot Encoding depends on your specific data, modeling approach, and the requirements of your machine learning analysis. Understanding their differences empowers you to select the most appropriate technique for your categorical data handling needs.
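
In scikit-learn, both variants are available from the same class. The sketch below assumes a scikit-learn version in which `OneHotEncoder` accepts the `drop` parameter, and rebuilds the town column from the ten rows shown earlier.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

towns = np.array(
    ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"]
).reshape(-1, 1)  # the encoder expects a 2D array

# Default: one column per category (3 columns here).
full = OneHotEncoder().fit_transform(towns).toarray()

# drop="first": the first category in sorted order is omitted (2 columns).
reduced = OneHotEncoder(drop="first").fit_transform(towns).toarray()

print(full.shape, reduced.shape)  # (10, 3) (10, 2)
```

The reduced matrix is the full one without its first column, which is the same effect the notebook later achieves by slicing with `X[:, 1:]`.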


In [14]:
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [15]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() # Create class object

In [16]:
le.fit_transform(df.town)
# Returns Labels

# monroe township = 0
# west windsor = 2
# robinsville = 1

array([0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1])

In [17]:
df_new = df.copy() # Copy so that the original df keeps its town names
df_new['town'] = le.fit_transform(df['town'])
df_new

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000


In [18]:
X = df_new[['town', 'area']].values # We need a "2D array" not a "dataframe"
y = df_new[['price']].values

In [19]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=float)

In [20]:
X
# 2.6e+03 = 2600 square foot
# 1.0e+00, 0.0e+00, 0.0e+00 == 3 dummy variable columns

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

In [21]:
# To avoid the dummy variable trap, drop one of the one-hot columns.
# Column 0 is monroe township, which becomes the reference category.
X = X[:, 1:] # Keep all rows and every column from index 1 onward
X

array([[0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 3.6e+03]])

In [22]:
model.fit(X,y)

In [23]:
# Column order after dropping the first one-hot column:
# [robinsville, west windsor, area]
# Here: a 2800 sq ft home in robinsville
model.predict([[1,0,2800]])

array([[590775.63964739]])

In [24]:
model.predict([[0,1,3400]])

array([[681241.6684584]])