**In this notebook I will practice the different types of categorical encoding on the Kaggle competition dataset**

Note: This notebook assumes that missing values in the variables have already been handled.

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load data
train = pd.read_csv('train.csv')  # Assuming you've downloaded it
test = pd.read_csv('test.csv')

# Separate target
y = train['SalePrice']
X = train.drop('SalePrice', axis=1)

# Split for encoding validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

**Nominal (No Order):**
One-hot encoding (OHE) works well for low-cardinality categorical variables.
**Key Criteria for One-Hot Encoding:**
Low cardinality (primary rule)
Ideal for OHE: features with **≤10** unique categories.
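
The low-cardinality rule can be checked programmatically before choosing columns. The snippet below is a minimal sketch on a toy frame: the helper name `low_cardinality_columns` and its `max_unique` parameter are hypothetical, not part of the notebook. It only looks at non-numeric columns, so a numerically coded nominal feature such as `MSSubClass` would have to be added by hand.

```python
import pandas as pd

def low_cardinality_columns(df, max_unique=10):
    """Return non-numeric columns with at most `max_unique` distinct non-null values.

    Hypothetical helper for illustration only.
    """
    return [
        col for col in df.columns
        if not pd.api.types.is_numeric_dtype(df[col])
        and df[col].nunique(dropna=True) <= max_unique
    ]

# Toy data standing in for the real training frame
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],   # 2 categories
    "Neighborhood": ["A", "B", "C", "D"],         # 4 categories
    "LotArea": [8400, 7837, 8777, 7200],          # numeric, skipped
})

# With a threshold of 3, only Street qualifies
selected = low_cardinality_columns(toy, max_unique=3)
print(selected)
```

On the real data the same call would be `low_cardinality_columns(X_train)`, using the default threshold of 10.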


In [7]:
nominal_features = [
    'MSSubClass', 'MSZoning', 'Street', 'Alley', 'LandContour',
    'LotConfig', 'Neighborhood', 'Condition1', 'Condition2',
    'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
    'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating', 'CentralAir',
    'GarageType', 'MiscFeature', 'SaleType', 'SaleCondition'
]

ohe_candidates = [
    'Street', 'Alley', 'CentralAir', 'LandContour',
    'PavedDrive', 'MSZoning', 'LotShape'
]

# Ensuring all the features exist in the data
valid_ohe_features = [col for col in ohe_candidates if col in X_train.columns]
print(valid_ohe_features)


# Applying OHE to both train and validation sets
X_train_ohe = pd.get_dummies(
    X_train,
    columns=valid_ohe_features,
    prefix=valid_ohe_features,
    drop_first=True  # Reduces multicollinearity
)
X_val_ohe = pd.get_dummies(
    X_val,
    columns=valid_ohe_features,
    prefix=valid_ohe_features,
    drop_first=True  # Reduces multicollinearity
)


# Aligning columns (in case validation set is missing some categories)
# Getting missing columns in the validation set
missing_cols = set(X_train_ohe.columns) - set(X_val_ohe.columns)
# Add missing columns with 0 values
for col in missing_cols:
    X_val_ohe[col] = 0
# Ensure same column order
X_val_ohe = X_val_ohe[X_train_ohe.columns]

# Applying the same encoding to the test set
X_test_ohe = pd.get_dummies(
    test,
    columns=valid_ohe_features,
    prefix=valid_ohe_features,
    drop_first=True)

# Align test set columns (same as we did for the validation set)
missing_cols = set(X_train_ohe.columns) - set(X_test_ohe.columns)
for col in missing_cols:
    X_test_ohe[col] = 0
X_test_ohe = X_test_ohe[X_train_ohe.columns]

['Street', 'Alley', 'CentralAir', 'LandContour', 'PavedDrive', 'MSZoning', 'LotShape']


Why Column Alignment is Needed
When you one-hot encode the train and validation/test sets separately:

A category might exist in training but not in validation
(e.g., Street=Grvl appears in training but all validation samples are Pave)

Each dataset then ends up with a different set of columns, so a model fitted on the training columns cannot be applied to the validation or test set.
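
The mismatch is easy to reproduce on a toy example. The sketch below uses made-up frames (`train_toy`, `val_toy`) and leaves out `drop_first` so the missing column is easy to see. It aligns the validation columns with `DataFrame.reindex`, which adds absent columns filled with 0 and drops any extra ones in one step, as an alternative to the explicit loop used above.

```python
import pandas as pd

# Toy frames: 'Grvl' appears in training only
train_toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
val_toy = pd.DataFrame({"Street": ["Pave", "Pave"]})

# Encoding each set separately yields different columns
train_enc = pd.get_dummies(train_toy, columns=["Street"])
val_enc = pd.get_dummies(val_toy, columns=["Street"])
print(list(train_enc.columns))
print(list(val_enc.columns))

# Align validation to the training layout: missing columns are
# created with 0, and the column order follows the training set
val_aligned = val_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(val_aligned.columns))
```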

In [10]:
X_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
254,255,20,RL,70.0,8400,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
1066,1067,60,RL,59.0,7837,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,5,2009,WD,Normal
638,639,30,RL,67.0,8777,Pave,,Reg,Lvl,AllPub,...,0,0,,MnPrv,,0,5,2008,WD,Normal
799,800,50,RL,60.0,7200,Pave,,Reg,Lvl,AllPub,...,0,0,,MnPrv,,0,6,2007,WD,Normal
380,381,50,RL,50.0,5000,Pave,Pave,Reg,Lvl,AllPub,...,0,0,,,,0,5,2010,WD,Normal


In [11]:
# Ordinal Encoding

ordinal_features = {
    # Feature: ordered categories (from worst to best)
    # 'NA' means the amenity is absent, so it ranks below 'Po'
    'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'BsmtQual': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # NA = No Basement
    'BsmtCond': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'HeatingQC': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'KitchenQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'FireplaceQu': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'GarageQual': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'GarageCond': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'PoolQC': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'LotShape': ['IR3', 'IR2', 'IR1', 'Reg'],  # Irregular → Regular
    'LandSlope': ['Sev', 'Mod', 'Gtl'],         # Steep → Gentle
    'Functional': ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ']
}

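# Fill missing values with 'NA' only where 'NA' is a defined level;
# other features would otherwise get a value outside their category list
for feature, categories in ordinal_features.items():
    if 'NA' in categories:
        train[feature] = train[feature].fillna('NA')  # "NA" = No Basement/Pool/etc.
        test[feature] = test[feature].fillna('NA')    # Apply same to test set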