
# Lecture 7 – Categorical Encoding (Google Colab Notebook)

This notebook covers:
1. Nominal vs Ordinal Variables  
2. Label Encoding  
3. One-Hot Encoding  
4. Project Tasks (Titanic Dataset)  
5. Homework with Encoding Spec Table  

---


In [36]:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the Titanic dataset. Adjust the path to wherever titanic.csv lives (in Colab, upload the file first).
df = pd.read_csv("../Data/titanic.csv")

# Label-encode the sex column (LabelEncoder sorts the classes, so female -> 0 and male -> 1)
le = LabelEncoder()
df["Sex_encoded"] = le.fit_transform(df["sex"])
df[["sex", "Sex_encoded"]].head()


|   | sex    | Sex_encoded |
|---|--------|-------------|
| 0 | male   | 1 |
| 1 | female | 0 |
| 2 | female | 0 |
| 3 | female | 0 |
| 4 | male   | 1 |
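
To make the mapping behind `Sex_encoded` explicit, here is a minimal sketch that fits a `LabelEncoder` on a small hand-made Series and inspects `classes_`. The Series is a toy stand-in for `df["sex"]`, and `le_demo` is a name introduced only for this illustration. `LabelEncoder` sorts the classes, so `female` gets 0 and `male` gets 1, which matches the output above. The scikit-learn documentation describes `LabelEncoder` as intended for target labels. For feature columns, `OrdinalEncoder` is the usual alternative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df["sex"] (assumption: the same two categories as the Titanic column)
sex = pd.Series(["male", "female", "female", "female", "male"])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(sex)

# classes_ holds the sorted categories; each category's position is its integer code
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = le_demo.inverse_transform(codes)
print(list(decoded))
```

Checking `classes_` after fitting is a cheap way to confirm which label received which code before the encoded column is used downstream.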


In [25]:

# One-hot encode the embarked column; drop_first=True drops the first level (C), which becomes the baseline
df = pd.get_dummies(df, columns=["embarked"], drop_first=True)
df.head()


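|   | survived | pclass | sex | age | sibsp | parch | fare | class | who | adult_male | deck | embark_town | alive | alone | Sex_encoded | embarked_Q | embarked_S |
|---|----------|--------|-----|-----|-------|-------|------|-------|-----|------------|------|-------------|-------|-------|-------------|------------|------------|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | Third | man | True | NaN | Southampton | no | False | 1 | False | True |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | First | woman | False | C | Cherbourg | yes | False | 0 | False | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | Third | woman | False | NaN | Southampton | yes | True | 0 | False | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | First | woman | False | C | Southampton | yes | False | 0 | False | True |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | Third | man | True | NaN | Southampton | no | True | 1 | False | True |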
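
The effect of `drop_first=True` is easier to see on a tiny example. The sketch below uses a hypothetical four-row frame (`toy`) rather than the real data. With levels C, Q and S, the first level (C) is dropped, so a row with both `embarked_Q` and `embarked_S` equal to 0 means C. A missing value also produces all zeros, so it is indistinguishable from C unless `dummy_na=True` is passed.

```python
import pandas as pd

# Toy stand-in for the embarked column, including one missing value
toy = pd.DataFrame({"embarked": ["S", "C", "Q", None]})

# drop_first=True drops the first level (C); it becomes the implicit baseline
dummies = pd.get_dummies(toy, columns=["embarked"], drop_first=True).astype(int)
print(dummies)

# dummy_na=True adds an explicit indicator so missing values are not confused with C
with_na = pd.get_dummies(
    toy, columns=["embarked"], drop_first=True, dummy_na=True
).astype(int)
print(with_na)
```

Whether to keep a missing-value indicator or to impute `embarked` before encoding is a modelling choice; the notebook itself does neither.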


In [26]:

import joblib

# Save the LabelEncoder for reuse
joblib.dump(le, "sex_encoder.pkl")
print("Encoder saved as sex_encoder.pkl")


Encoder saved as sex_encoder.pkl
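
A saved encoder is only useful if it is loaded again later. The sketch below shows one way that round trip might look. It fits a throwaway encoder on toy labels and writes it to a temporary directory so that it runs on its own. The names `le_demo`, `loaded` and `path` are introduced only for this illustration.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Fit and save an encoder, mirroring the cell above (toy data, temporary directory)
le_demo = LabelEncoder().fit(["male", "female"])
path = os.path.join(tempfile.mkdtemp(), "sex_encoder.pkl")
joblib.dump(le_demo, path)

# Later, for example in a scoring script: load it and apply the same mapping
loaded = joblib.load(path)
new_codes = loaded.transform(["female", "male", "male"])
print(new_codes.tolist())

# Labels not seen during fit raise ValueError, so handle them before calling transform
try:
    loaded.transform(["unknown"])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)
```

Reusing the fitted encoder, instead of refitting on new data, keeps the integer codes consistent between training and prediction.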



## Homework

**Encoding Spec Table for Titanic Dataset**  

| Column   | Type (Nominal/Ordinal)     | Encoding Method   | Notes |
|----------|----------------------------|-------------------|-------|
| Sex      | Nominal                    | Label Encoding    | Binary category, so a 0/1 code implies no order |
| Embarked | Nominal                    | One-Hot Encoding  | Drop the first level to avoid a redundant column |
| Pclass   | Ordinal                    | Label Encoding    | Ordered levels (1, 2, 3) |
| Cabin    | Nominal (sparse)           | Rarely encoded    | Too many missing values |
| Ticket   | Nominal (high-cardinality) | Usually dropped   | Too many distinct values to be useful as-is |
| Name     | Nominal (text)             | Dropped or parsed | Could extract titles |
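
For ordinal columns, the integer codes should follow the intended ranking, not whatever order the encoder happens to assign. A minimal sketch, using a toy stand-in for the `class` column and an assumed ranking from lowest to highest class, is to declare the order with an ordered `pd.Categorical` and read off its codes:

```python
import pandas as pd

# Toy stand-in for the `class` column shown in the outputs above
ticket_class = pd.Series(["Third", "First", "Third", "Second"])

# State the order explicitly instead of relying on alphabetical sorting
order = ["Third", "Second", "First"]  # assumed ranking: lowest to highest class
cat = pd.Categorical(ticket_class, categories=order, ordered=True)

# codes gives each value's position in `order`
class_code = pd.Series(cat.codes, index=ticket_class.index)
print(class_code.tolist())
```

Sorting alphabetically happens to give a sensible order for First/Second/Third, but it would not for levels such as low/medium/high, which is why the order is spelled out here.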


In [29]:

# Drop the sparse deck column
df = df.drop(columns="deck")

In [30]:

df.head()

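|   | survived | pclass | sex | age | sibsp | parch | fare | class | who | adult_male | embark_town | alive | alone | Sex_encoded | embarked_Q | embarked_S |
|---|----------|--------|-----|-----|-------|-------|------|-------|-----|------------|-------------|-------|-------|-------------|------------|------------|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | Third | man | True | Southampton | no | False | 1 | False | True |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | First | woman | False | Cherbourg | yes | False | 0 | False | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | Third | woman | False | Southampton | yes | True | 0 | False | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | First | woman | False | Southampton | yes | False | 0 | False | True |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | Third | man | True | Southampton | no | True | 1 | False | True |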
