# 1. What is today’s topic? (Categorical Encoding)

- When we build ML models, they can’t directly understand text values (like "Furniture", "Technology", "West", "Standard Class").
- They only work with numbers.

<strong>Categorical Encoding</strong> is the process of converting text columns (categorical variables) into numbers so that ML algorithms can use them.

## There are many ways, but two main ones:

### Label Encoding

- Each unique category gets a number.
- Example (Segment column):
  - Consumer -> 0
  - Corporate -> 1
  - Home Office -> 2

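#### ⚠️ Problem: Creates a fake order (0 < 1 < 2), which can mislead models that treat the codes as magnitudes (for example linear or distance-based models).
#### Best suited to ordinal data, where the categories have a real order (like Education: High School < Graduate < PhD).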
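As a rough illustration of the ordinal case, the sketch below encodes a made-up Education column (the data is hypothetical, only the category names come from the example above). It assumes scikit-learn's `OrdinalEncoder` for the explicit order, since `LabelEncoder` sorts categories alphabetically and is documented as a tool for target labels rather than input features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal column, reusing the Education example above.
edu = pd.DataFrame({"Education": ["Graduate", "High School", "PhD", "Graduate"]})

# LabelEncoder sorts categories alphabetically, so the codes ignore the real order:
# Graduate -> 0, High School -> 1, PhD -> 2.
alphabetical_codes = LabelEncoder().fit_transform(edu["Education"])

# OrdinalEncoder accepts an explicit order: High School < Graduate < PhD.
order = [["High School", "Graduate", "PhD"]]
ordered_codes = (
    OrdinalEncoder(categories=order).fit_transform(edu[["Education"]]).ravel()
)

print(alphabetical_codes)  # [0 1 2 0]
print(ordered_codes)       # [1. 0. 2. 1.]
```

With the explicit order, a larger code really does mean a higher education level, which is the property ordinal encoding is supposed to preserve.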

### One-Hot Encoding

- Creates new columns (dummy variables).
- Example:
<pre>
Segment_Consumer | Segment_Corporate | Segment_Home Office
1                | 0                 | 0
0                | 1                 | 0
0                | 0                 | 1
</pre>

#### No artificial order problem.
#### Downside: Creates many columns if a column has many unique categories (high cardinality).
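One quick way to see this downside before encoding is to count unique values per column. The sketch below uses a small made-up frame (the rows are hypothetical, only the column names mirror the Sales dataset used later).

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the Sales dataset.
df_demo = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Home Office", "Consumer"],
    "City": ["Henderson", "Los Angeles", "Fort Lauderdale", "Concord"],
})

# One-hot encoding adds one column per unique value, so check nunique() first.
cardinality = df_demo.nunique()
print(cardinality)

# Encoding both columns yields 3 (Segment) + 4 (City) = 7 dummy columns.
encoded_demo = pd.get_dummies(df_demo, columns=["Segment", "City"], dtype=int)
print(encoded_demo.shape[1])  # 7
```

On a real dataset, a column such as City can have hundreds of unique values, so this check is worth running before one-hot encoding everything.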

## 1. One-Hot Encoding with Pandas (pd.get_dummies)

### ✅ Pros

- Super simple, one line of code.
- Great for quick exploration & small datasets.
- Works directly on a Pandas DataFrame and returns a DataFrame.

### ❌ Cons

- Very basic, with fewer customization options.
- Doesn’t remember the encoding for future / new data (no .fit() / .transform() logic).
- If you deploy a model, you need to manually make sure the new data has the same columns as the training data.

👉 Best use case: Exploratory data analysis (EDA), quick experiments.
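The column-mismatch problem can be reproduced in a few lines. This is a minimal sketch with made-up frames (`train` and `new` are hypothetical names), and aligning with `reindex` is one possible manual fix, not the only one.

```python
import pandas as pd

# Hypothetical training data and new data; the new batch has no "Home Office" rows.
train = pd.DataFrame({"Segment": ["Consumer", "Corporate", "Home Office"]})
new = pd.DataFrame({"Segment": ["Corporate", "Consumer"]})

train_ohe = pd.get_dummies(train, columns=["Segment"], dtype=int)
new_ohe = pd.get_dummies(new, columns=["Segment"], dtype=int)

# The new data is missing the Segment_Home Office column entirely.
print(new_ohe.columns.tolist())

# Manual fix: align to the training columns, filling missing dummies with 0.
new_aligned = new_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(new_aligned)
```

Note that `reindex` also silently drops any dummy column that did not exist at training time, so categories first seen in new data are ignored.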

## 2. One-Hot Encoding with Scikit-learn (OneHotEncoder)

### ✅ Pros

- Standard in production ML pipelines.
- Has .fit() and .transform() methods, so you can fit it once on the training data and apply exactly the same encoding to future data (critical for deployment).
- Can handle unseen categories (with handle_unknown='ignore').
- Works well with scikit-learn pipelines (which is how models are usually trained & deployed).

### ❌ Cons

- Slightly more code to set up.
- Less convenient for quick inspection compared to Pandas.

👉 Best use case: Training ML models, production pipelines, large datasets.

## 3. What do real data scientists use?

In real ML projects, sklearn.preprocessing.OneHotEncoder is the go-to choice ✅, because ML is not just about fitting once: you need the same transformation again at prediction/deployment time.

In the exploratory / learning phase, many data scientists (including me 😅) still use pd.get_dummies() because it’s quick, clean, and easy to visualize.
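A sketch of what "the same transformation at prediction time" can look like in practice: the encoder lives inside a scikit-learn `Pipeline`, so fitting and predicting both go through it. The toy rows, the step names, and the choice of `LinearRegression` are all assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows; only the column names come from the Sales dataset.
train = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Home Office", "Consumer"],
    "Sales": [100.0, 250.0, 80.0, 120.0],
})

# One-hot encode the Segment column as the preprocessing step.
preprocess = ColumnTransformer(
    [("segment_ohe", OneHotEncoder(handle_unknown="ignore"), ["Segment"])]
)

# The pipeline fits the encoder and the model together.
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(train[["Segment"]], train["Sales"])

# At prediction time, raw text values go in; the fitted encoder is reused automatically.
new = pd.DataFrame({"Segment": ["Corporate"]})
prediction = model.predict(new)
print(prediction)
```

Because the encoder is part of the fitted pipeline, saving and loading the pipeline carries the encoding along with the model.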

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.read_csv("data/Sales.csv")
# info() prints its summary itself and returns None, so no print() is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9800 non-null   int64  
 1   Order ID       9800 non-null   object 
 2   Order Date     9800 non-null   object 
 3   Ship Date      9800 non-null   object 
 4   Ship Mode      9800 non-null   object 
 5   Customer ID    9800 non-null   object 
 6   Customer Name  9800 non-null   object 
 7   Segment        9800 non-null   object 
 8   Country        9800 non-null   object 
 9   City           9800 non-null   object 
 10  State          9800 non-null   object 
 11  Postal Code    9789 non-null   float64
 12  Region         9800 non-null   object 
 13  Product ID     9800 non-null   object 
 14  Category       9800 non-null   object 
 15  Sub-Category   9800 non-null   object 
 16  Product Name   9800 non-null   object 
 17  Sales          9800 non-null   float64
dtypes: float64(2), int64(1), object(15)

In [2]:
# Step 1: Check categorical columns
print(df.select_dtypes(include='object').columns)

Index(['Order ID', 'Order Date', 'Ship Date', 'Ship Mode', 'Customer ID',
       'Customer Name', 'Segment', 'Country', 'City', 'State', 'Region',
       'Product ID', 'Category', 'Sub-Category', 'Product Name'],
      dtype='object')


In [6]:
# Step 2: Label Encoding (example: Ship Mode)
le = LabelEncoder()
df['Ship Mode_LE'] = le.fit_transform(df['Ship Mode'])
print(df[['Ship Mode', 'Ship Mode_LE']].head())
print(le.classes_)

        Ship Mode  Ship Mode_LE
0    Second Class             2
1    Second Class             2
2    Second Class             2
3  Standard Class             3
4  Standard Class             3
['First Class' 'Same Day' 'Second Class' 'Standard Class']
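The classes_ array printed above doubles as the code-to-label mapping: the position of each label is its code. As a small standalone sketch (using a hand-written list of the four Ship Mode values instead of the full DataFrame), the mapping can be made explicit and reversed with `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

# The four Ship Mode values from the output above, in arbitrary order.
ship_modes = ["Second Class", "Standard Class", "First Class", "Same Day"]

le_demo = LabelEncoder()
le_demo.fit(ship_modes)

# classes_ is sorted alphabetically; the index of each label is its code.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels.
decoded = le_demo.inverse_transform([2, 3])
print(decoded)  # ['Second Class' 'Standard Class']
```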


In [None]:
# Step 3: One-Hot Encoding using pandas
df_ohe = pd.get_dummies(df, columns=['Segment'], dtype=int)
print(df_ohe.head())

   Row ID        Order ID  Order Date   Ship Date       Ship Mode Customer ID  \
0       1  CA-2017-152156  08/11/2017  11/11/2017    Second Class    CG-12520   
1       2  CA-2017-152156  08/11/2017  11/11/2017    Second Class    CG-12520   
2       3  CA-2017-138688  12/06/2017  16/06/2017    Second Class    DV-13045   
3       4  US-2016-108966  11/10/2016  18/10/2016  Standard Class    SO-20335   
4       5  US-2016-108966  11/10/2016  18/10/2016  Standard Class    SO-20335   

     Customer Name        Country             City       State  ...  Region  \
0      Claire Gute  United States        Henderson    Kentucky  ...   South   
1      Claire Gute  United States        Henderson    Kentucky  ...   South   
2  Darrin Van Huff  United States      Los Angeles  California  ...    West   
3   Sean O'Donnell  United States  Fort Lauderdale     Florida  ...   South   
4   Sean O'Donnell  United States  Fort Lauderdale     Florida  ...   South   

        Product ID         Category Su

In [14]:
# Step 4: One-Hot Encoding using sklearn
ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(df[['Segment']])
encoded_df = pd.DataFrame(
    encoded, columns=ohe.get_feature_names_out(['Segment']), dtype=int
)
print(encoded_df.head())

   Segment_Consumer  Segment_Corporate  Segment_Home Office
0                 1                  0                    0
1                 1                  0                    0
2                 0                  1                    0
3                 1                  0                    0
4                 1                  0                    0
