# 1. Encoding Comparison - Adult Census Income

**Dataset:** UCI Adult Income
**Features:** `native-country` (42 categories) and `occupation` (15 categories)
**Goal:** Compare One-Hot, Label, and Target Encoding.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from category_encoders import TargetEncoder, OneHotEncoder

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 
           'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 
           'hours-per-week', 'native-country', 'income']

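# Note: skipinitialspace=True strips the leading space, so missing values are read as '?'
# and na_values=' ?' probably never matches. In that case dropna() removes no such rows
# and '?' stays in the data as its own category, so the category counts include it.
df = pd.read_csv('../data/raw/adult.data', names=columns, na_values=' ?', skipinitialspace=True)
df.dropna(inplace=True)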
df['target'] = (df['income'] == '>50K').astype(int)

print(f"Native Country Cardinality: {df['native-country'].nunique()}")

Native Country Cardinality: 42


## 1. Encoding Experiments

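We test how different encodings of the categorical features affect model performance, measured as test-set ROC AUC of a random forest. Each encoder is applied to every categorical column, not only `native-country`.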
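To make the idea behind target encoding concrete, here is a minimal sketch on a toy frame. It replaces each category with a smoothed mean of the target. The helper name `smoothed_target_mean`, the toy data, and the additive smoothing weight `m` are all assumptions made for illustration; `category_encoders.TargetEncoder` uses its own smoothing scheme, so its values will generally differ.

```python
import pandas as pd

def smoothed_target_mean(categories, target, m=2.0):
    """Map each category to a smoothed mean of the target (hypothetical helper).

    Uses simple additive smoothing: the category mean is pulled toward the
    global mean, with `m` acting as a number of pseudo-observations.
    """
    prior = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    return (stats["sum"] + m * prior) / (stats["count"] + m)

# Toy data, not taken from the Adult dataset
toy = pd.DataFrame({
    "country": ["US", "US", "US", "MX", "MX", "IN"],
    "target":  [1, 0, 1, 0, 0, 1],
})

encoding = smoothed_target_mean(toy["country"], toy["target"], m=2.0)
toy["country_te"] = toy["country"].map(encoding)
print(encoding.round(3).to_dict())  # US -> 0.6, MX -> 0.25, IN -> 0.667
```

The rare category `IN` has a raw mean of 1.0 from a single row, and smoothing pulls it toward the global mean of 0.5. This matters for high-cardinality features such as `native-country`, where many categories have few rows.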

In [2]:
X = df.drop(['income', 'target'], axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify all categorical columns
cat_cols = X.select_dtypes(include=['object']).columns.tolist()

# 1. Target Encoding
# Fit the encoder on the training split only, so test labels never reach it
enc_target = TargetEncoder(cols=cat_cols)
X_train_te = enc_target.fit_transform(X_train, y_train)
X_test_te = enc_target.transform(X_test)

# Score a random forest on the encoded features using test-set ROC AUC
rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(X_train_te, y_train)
print(f"Target Encoding AUC: {roc_auc_score(y_test, rf.predict_proba(X_test_te)[:,1]):.4f}")

Target Encoding AUC: 0.9083
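
The fit, transform, and score steps above are the same for every encoder, so they could be wrapped in a small helper. The sketch below is one possible shape, not the notebook's own code. The name `evaluate_encoder` and the synthetic data are assumptions, and scikit-learn's `OneHotEncoder` and `OrdinalEncoder` stand in for the one-hot and label encoders so that the example runs without the Adult data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

def evaluate_encoder(encoder, X_train, X_test, y_train, y_test, seed=42):
    """Fit the encoder on training rows only, then return the test ROC AUC (hypothetical helper)."""
    X_train_enc = encoder.fit_transform(X_train, y_train)
    X_test_enc = encoder.transform(X_test)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train_enc, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test_enc)[:, 1])

# Synthetic stand-in data: one categorical feature with a known effect on the label
rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame({"cat": rng.choice(list("abcdef"), size=n)})
p = demo["cat"].map({"a": 0.1, "b": 0.2, "c": 0.4, "d": 0.6, "e": 0.8, "f": 0.9})
y_demo = (rng.random(n) < p).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, y_demo, test_size=0.2, random_state=42
)

encoders = {
    "one-hot": OneHotEncoder(handle_unknown="ignore"),
    "ordinal": OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
}
results = {
    name: evaluate_encoder(enc, Xd_train, Xd_test, yd_train, yd_test)
    for name, enc in encoders.items()
}
for name, auc in results.items():
    print(f"{name} AUC: {auc:.4f}")
```

Keeping the model, seed, and split fixed across encoders means any difference in AUC comes from the encoding alone. On the real data, the encoders would need to be applied to the categorical columns only, with numeric columns passed through unchanged.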
