# One-Hot Encoding vs Dummy Encoding

This notebook demonstrates the difference between **One-Hot Encoding** and **Dummy Encoding**, and explains when to use each.

# ## 1. Introduction to Categorical Encoding
#
# Machine learning algorithms typically work with numerical data. When we have categorical variables (text labels), we need to convert them into numerical format.
# Two common approaches are:
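# - **One-Hot Encoding**: Creates one binary (0/1) column per category, so a variable with k categories becomes k columns
# - **Dummy Encoding**: Like one-hot encoding, but drops one column per variable (leaving k - 1 columns) to avoid perfect multicollinearity; the dropped category becomes the baseline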
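
To make the multicollinearity point concrete, here is a minimal sketch using a hypothetical four-row `color` column (not the notebook's data). It assumes a linear model with an intercept term: the one-hot columns then sum to the intercept column, so the design matrix has more columns than its rank, while the dummy-encoded matrix does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, used only for this illustration
colors = pd.Series(['red', 'blue', 'green', 'blue'], name='color')

# Full one-hot encoding: one column per category
one_hot = pd.get_dummies(colors, dtype=float)
# Dummy encoding: the first category in sorted order ('blue') is dropped
dummy = pd.get_dummies(colors, drop_first=True, dtype=float)

# Add an intercept column of ones, as a linear model with a bias term would
intercept = np.ones((len(colors), 1))
X_one_hot = np.hstack([intercept, one_hot.to_numpy()])
X_dummy = np.hstack([intercept, dummy.to_numpy()])

# If rank < number of columns, at least one column is a linear
# combination of the others (perfect multicollinearity)
rank_one_hot = np.linalg.matrix_rank(X_one_hot)
rank_dummy = np.linalg.matrix_rank(X_dummy)
print(f"one-hot: {X_one_hot.shape[1]} columns, rank {rank_one_hot}")
print(f"dummy:   {X_dummy.shape[1]} columns, rank {rank_dummy}")
```

Without an intercept, or with models that do not rely on a full-rank design matrix (tree-based models, for example), the redundant column is usually harmless.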

In [20]:
import pandas as pd
import numpy as np

In [14]:
# %%
# Create sample data
data = {
    'color': ['red', 'blue', 'green', 'blue', 'red', 'green', 'green', 'red'],
    'size': ['S', 'M', 'L', 'M', 'S', 'L', 'M', 'S'],
    'price': [10, 15, 20, 18, 12, 22, 16, 11]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print(f"\nShape: {df.shape}")

Original DataFrame:
   color size  price
0    red    S     10
1   blue    M     15
2  green    L     20
3   blue    M     18
4    red    S     12
5  green    L     22
6  green    M     16
7    red    S     11

Shape: (8, 3)
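
Before encoding, the width of the result can be estimated from the number of distinct categories per column. The sketch below rebuilds the sample data so that it runs on its own; the names `categorical_cols`, `one_hot_width` and `dummy_width` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Same sample data as above, rebuilt so this cell is self-contained
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red', 'green', 'green', 'red'],
    'size': ['S', 'M', 'L', 'M', 'S', 'L', 'M', 'S'],
    'price': [10, 15, 20, 18, 12, 22, 16, 11],
})

categorical_cols = ['color', 'size']

# Number of distinct categories in each categorical column
n_categories = df[categorical_cols].nunique()

# Columns that are passed through unchanged (here only 'price')
n_other = df.shape[1] - len(categorical_cols)

# One-hot keeps k columns per variable; dummy encoding keeps k - 1
one_hot_width = n_other + int(n_categories.sum())
dummy_width = n_other + int((n_categories - 1).sum())

print(f"Expected one-hot columns: {one_hot_width}")
print(f"Expected dummy columns:   {dummy_width}")
```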


In [11]:

# ## 2. One-Hot Encoding

# %%
# Method 1: Using pandas get_dummies (one-hot encoding)
one_hot_encoded = pd.get_dummies(df, columns=['color', 'size'], prefix=['color', 'size'])
print("One-Hot Encoded DataFrame:")
print(one_hot_encoded)
print(f"\nShape: {one_hot_encoded.shape}")

# %%
# Method 2: Using sklearn OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

# Initialize the encoder
encoder = OneHotEncoder(sparse_output=False, drop=None)

# Fit and transform the categorical columns
encoded_array = encoder.fit_transform(df[['color', 'size']])

# Get feature names
feature_names = encoder.get_feature_names_out(['color', 'size'])

# Create DataFrame
one_hot_sklearn = pd.DataFrame(encoded_array, columns=feature_names)
one_hot_sklearn = pd.concat([df[['price']], one_hot_sklearn], axis=1)

print("One-Hot Encoded with sklearn:")
print(one_hot_sklearn)

One-Hot Encoded DataFrame:
   price  color_blue  color_green  color_red  size_L  size_M  size_S
0     10       False        False       True   False   False    True
1     15        True        False      False   False    True   False
2     20       False         True      False    True   False   False
3     18        True        False      False   False    True   False
4     12       False        False       True   False   False    True
5     22       False         True      False    True   False   False
6     16       False         True      False   False    True   False
7     11       False        False       True   False   False    True

Shape: (8, 7)
One-Hot Encoded with sklearn:
   price  color_blue  color_green  color_red  size_L  size_M  size_S
0     10         0.0          0.0        1.0     0.0     0.0     1.0
1     15         1.0          0.0        0.0     0.0     1.0     0.0
2     20         0.0          1.0        0.0     1.0     0.0     0.0
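3     18         1.0          0.0        0.0     0.0     1.0     0.0
4     12         0.0          0.0        1.0     0.0     0.0     1.0
5     22         0.0          1.0        0.0     1.0     0.0     0.0
6     16         0.0          1.0        0.0     0.0     1.0     0.0
7     11         0.0          0.0        1.0     0.0     0.0     1.0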

In [16]:
# ## 3. Dummy Encoding

# %%
# Using pandas get_dummies with drop_first=True for dummy encoding
dummy_encoded = pd.get_dummies(df, columns=['color', 'size'], prefix=['color', 'size'], drop_first=True)
print("Dummy Encoded DataFrame:")
print(dummy_encoded)
print(f"\nShape: {dummy_encoded.shape}")

# %%
# Using sklearn OneHotEncoder with drop='first'
encoder_dummy = OneHotEncoder(sparse_output=False, drop='first')
encoded_array_dummy = encoder_dummy.fit_transform(df[['color', 'size']])

# Get feature names
feature_names_dummy = encoder_dummy.get_feature_names_out(['color', 'size'])

# Create DataFrame
dummy_sklearn = pd.DataFrame(encoded_array_dummy, columns=feature_names_dummy)
dummy_sklearn = pd.concat([df[['price']], dummy_sklearn], axis=1)

print("Dummy Encoded with sklearn:")
print(dummy_sklearn)

Dummy Encoded DataFrame:
   price  color_green  color_red  size_M  size_S
0     10        False       True   False    True
1     15        False      False    True   False
2     20         True      False   False   False
3     18        False      False    True   False
4     12        False       True   False    True
5     22         True      False   False   False
6     16         True      False    True   False
7     11        False       True   False    True

Shape: (8, 5)
Dummy Encoded with sklearn:
   price  color_green  color_red  size_M  size_S
0     10          0.0        1.0     0.0     1.0
1     15          0.0        0.0     1.0     0.0
2     20          1.0        0.0     0.0     0.0
3     18          0.0        0.0     1.0     0.0
4     12          0.0        1.0     0.0     1.0
5     22          1.0        0.0     0.0     0.0
6     16          1.0        0.0     1.0     0.0
7     11          0.0        1.0     0.0     1.0
