<a href="https://colab.research.google.com/github/Smarth2005/Machine-Learning/blob/main/Exploratory%20Data%20Analysis/03.%20Why%20to%20avoid%20using%20pd.get_dummies()%20in%20Model%20deployment%3F.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **The Pitfalls of `pd.get_dummies()` in Model Deployment**

#### Understanding `pd.get_dummies()`: Usage, Limitations, and Best Practices

<div align="justify">

Encoding categorical variables is an essential step in the machine learning pipeline. Most machine learning models require numerical input and cannot directly interpret text or category labels. We must therefore convert categorical features into numeric form, a process known as **encoding**.

One of the most commonly used methods for encoding in pandas is `pd.get_dummies()`.

</div>

#### 📘 In this notebook, we will:
- Explore how `pd.get_dummies()` works.
- Discuss its advantages and **limitations**.
- Compare it with better alternatives like `OneHotEncoder` from scikit-learn.


In [1]:
import pandas as pd

In [2]:
flowers = pd.DataFrame({
    'color' : ['red','green','red','green','red','green','red','green','blue','blue'],
    'height': [4,9,4,8,4,7,4,7.5,20,19],
    'petals': [3,9,1,8,1,10,2,8,50,47],
    'days'  : [6,16,7,15,8,17,5,12,40,45]
})

In [3]:
flowers

Unnamed: 0,color,height,petals,days
0,red,4.0,3,6
1,green,9.0,9,16
2,red,4.0,1,7
3,green,8.0,8,15
4,red,4.0,1,8
5,green,7.0,10,17
6,red,4.0,2,5
7,green,7.5,8,12
8,blue,20.0,50,40
9,blue,19.0,47,45


🚨 Always call `train_test_split` first, and only then apply `pd.get_dummies()` (or any other encoding), to prevent data leakage.

In [4]:
# separating independent and dependent features
X = flowers.drop('days', axis=1)
y = flowers['days']

In [5]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

In [6]:
x_train

Unnamed: 0,color,height,petals
8,blue,20.0,50
1,green,9.0,9
2,red,4.0,1
9,blue,19.0,47
0,red,4.0,3
5,green,7.0,10
7,green,7.5,8
6,red,4.0,2


In [7]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, 8 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   color   8 non-null      object 
 1   height  8 non-null      float64
 2   petals  8 non-null      int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 256.0+ bytes


In [8]:
x_train['color'].unique()

array(['blue', 'green', 'red'], dtype=object)

#### **🧩 What Does `pd.get_dummies()` Do?**

<div align="justify">

`pd.get_dummies()` converts categorical variables into **dummy/indicator variables**.

Each category value becomes a new indicator column (this is one-hot encoding):
- `1` (or `True`) if the row has that category
- `0` (or `False`) otherwise

Recent pandas versions (2.0 and later) return these columns as booleans by default; pass `dtype=int` to get 0/1 instead. Each categorical variable is expanded into as many indicator columns as it has unique values.

If the input is a DataFrame, the name of the original variable is prepended to each category value to create the new column names.  
For example, in our DataFrame we have one categorical feature called `color`, and its unique values are `red`, `green`, and `blue`. So, the dummy columns created will be:  `color_red`, `color_green`, and `color_blue`.

</div>
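The naming rule described above can be checked with a minimal sketch. The three-row `colors` Series here is a made-up example, not the notebook's `flowers` data; it contrasts Series input (no prefix) with DataFrame input (column name prepended).

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate column naming.
colors = pd.Series(['red', 'green', 'blue'], name='color')

# Series input: the new columns are named after the category values only.
from_series = pd.get_dummies(colors, dtype=int)
print(list(from_series.columns))   # ['blue', 'green', 'red']

# DataFrame input: the source column name is prepended with an underscore.
from_frame = pd.get_dummies(colors.to_frame(), dtype=int)
print(list(from_frame.columns))    # ['color_blue', 'color_green', 'color_red']
```

Note that the dummy columns come out in sorted category order, not in order of appearance.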


In [9]:
pd.get_dummies(x_train)

Unnamed: 0,height,petals,color_blue,color_green,color_red
8,20.0,50,True,False,False
1,9.0,9,False,True,False
2,4.0,1,False,False,True
9,19.0,47,True,False,False
0,4.0,3,False,False,True
5,7.0,10,False,True,False
7,7.5,8,False,True,False
6,4.0,2,False,False,True


In [10]:
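# dtype=int converts only the dummy columns to 0/1.
# Calling .astype(int) on the whole frame would also truncate height (7.5 -> 7).
pd.get_dummies(x_train, dtype=int)

Unnamed: 0,height,petals,color_blue,color_green,color_red
8,20.0,50,1,0,0
1,9.0,9,0,1,0
2,4.0,1,0,0,1
9,19.0,47,1,0,0
0,4.0,3,0,0,1
5,7.0,10,0,1,0
7,7.5,8,0,1,0
6,4.0,2,0,0,1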


In [12]:
# Join the original 'color' column back on to compare it with its dummy columns.
pd.get_dummies(x_train, dtype=int).join(x_train['color'])

Unnamed: 0,height,petals,color_blue,color_green,color_red,color
8,20.0,50,1,0,0,blue
1,9.0,9,0,1,0,green
2,4.0,1,0,0,1,red
9,19.0,47,1,0,0,blue
0,4.0,3,0,0,1,red
5,7.0,10,0,1,0,green
7,7.5,8,0,1,0,green
6,4.0,2,0,0,1,red


In [13]:
x_test

Unnamed: 0,color,height,petals
4,red,4.0,1
3,green,8.0,8


In [14]:
pd.get_dummies(x_test)

Unnamed: 0,height,petals,color_green,color_red
4,4.0,1,False,True
3,8.0,8,True,False


#### ❌ **Pitfall: Mismatched Columns After Applying `pd.get_dummies()` on Test Set**

<div align="justify">

Our training data contains the categories `red`, `green`, and `blue`, but the test data contains only `red` and `green`.

Our model was trained with a `color_blue` column, but this column is **missing from the encoded test set**:
- The input dimensions won't match.  
- The model may throw an error or produce incorrect predictions.

The opposite can also happen: a test record may contain a new category such as `yellow` that was **never seen during training**. The model expects `color_red`, `color_green`, and `color_blue`, but the encoded test set now contains an extra `color_yellow` column it was never trained on. This can also lead to errors or unpredictable results.
<br>

<span style="color:red;"> Therefore, when we apply `pd.get_dummies()` separately on the training and test sets, the test set may contain categories that the training set lacks, or lack categories that the training set has. As a result, the dummy columns won’t match, and this mismatch can break our model during prediction.</span>
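Both kinds of mismatch can be reproduced in a few lines. This is a minimal sketch on hypothetical toy frames (`train` and `new` are illustrative names and values, not the notebook's split), comparing the column sets produced by encoding each frame separately.

```python
import pandas as pd

# Hypothetical toy data: 'yellow' never appears in training.
train = pd.DataFrame({'color': ['red', 'green', 'blue'], 'height': [4.0, 9.0, 20.0]})
new = pd.DataFrame({'color': ['yellow', 'red'], 'height': [5.0, 4.5]})

# Encode each frame on its own, as in the pitfall described above.
train_cols = set(pd.get_dummies(train).columns)
new_cols = set(pd.get_dummies(new).columns)

# Columns the model expects but the new data lacks.
missing_in_new = sorted(train_cols - new_cols)
# Columns the new data has but the model never saw.
unexpected_in_new = sorted(new_cols - train_cols)

print(missing_in_new)     # ['color_blue', 'color_green']
print(unexpected_in_new)  # ['color_yellow']
```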

<br>

#### ✅ **Safer Approach**:

Use `.align()` to ensure both training and test sets have the same columns:

```python
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1, fill_value=0)
```

<div align="center">

**OR**</div>

Use scikit-learn's `OneHotEncoder` (covered in the next notebook with a detailed implementation).
</div>

In [18]:
# First, apply pd.get_dummies() separately on the train and test data.
# dtype=int makes the dummy columns 0/1 without truncating float features such as height.
train_encoded = pd.get_dummies(x_train, dtype=int)
test_encoded = pd.get_dummies(x_test, dtype=int)

# Align the test set to the train set's columns; columns missing from the test set are filled with 0.
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1, fill_value=0)

In [19]:
train_encoded

Unnamed: 0,height,petals,color_blue,color_green,color_red
8,20.0,50,1,0,0
1,9.0,9,0,1,0
2,4.0,1,0,0,1
9,19.0,47,1,0,0
0,4.0,3,0,0,1
5,7.0,10,0,1,0
7,7.5,8,0,1,0
6,4.0,2,0,0,1


In [20]:
test_encoded

Unnamed: 0,height,petals,color_blue,color_green,color_red
4,4.0,1,0,0,1
3,8.0,8,0,1,0


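After applying `.align()` with `join='left'`, the test set has exactly the training set's feature columns: categories missing from the test set are added as all-zero columns, and any dummy column for a category unseen in training is dropped. This resolves the mismatch problem.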
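In deployment the training frame is usually not available at prediction time, so `.align()` cannot be called against it. One possible workaround, sketched below under that assumption, is to save the list of training columns and `reindex` each incoming record to it. The names `train_columns` and `incoming` are illustrative, and the data is a hypothetical toy example.

```python
import pandas as pd

# Hypothetical toy training data.
train = pd.DataFrame({'color': ['red', 'green', 'blue'], 'height': [4.0, 9.0, 20.0]})
train_encoded = pd.get_dummies(train, dtype=int)

# Persist the training column order alongside the model (for example, in a JSON file).
train_columns = list(train_encoded.columns)

# At prediction time, a single record arrives with an unseen category.
incoming = pd.DataFrame({'color': ['yellow'], 'height': [5.0]})

# reindex adds missing training columns as 0 and drops the unexpected color_yellow column.
incoming_encoded = pd.get_dummies(incoming, dtype=int).reindex(columns=train_columns, fill_value=0)
print(list(incoming_encoded.columns))  # ['height', 'color_blue', 'color_green', 'color_red']
```

The unseen `yellow` record ends up with zeros in every `color_*` column, which keeps the input shape valid but silently discards the new category.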