# 5. Full Pipeline Optimization - Adult Income

**Goal:** Tune the **Data Preprocessing Steps** and **Model Hyperparameters** simultaneously.

**Grid Search Dimensions:**
1. **Imputer Strategy:** Mean vs Median
2. **Scaling:** StandardScaler vs MinMaxScaler
3. **Model:** Random Forest (tuning n_estimators)

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 
           'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 
           'hours-per-week', 'native-country', 'income']

df = pd.read_csv('../data/raw/adult.data', names=columns, na_values=' ?', skipinitialspace=True)
df.dropna(inplace=True)
df['target'] = (df['income'] == '>50K').astype(int)

X = df.drop(['income', 'target'], axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 1. Define Pipeline with Placeholders

We set up the pipeline steps with names 'imputer', 'scaler' so we can swap them out in the grid.

In [2]:
numeric_features = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
categorical_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(random_state=42))])

## 2. The Mega-Grid Search

Notice how we access pipeline steps using standard double-underscore notation: `preprocessor__num__scaler`.

In [3]:
param_grid = {
    # Preprocessing Hyperparameters
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'preprocessor__num__scaler': [StandardScaler(), MinMaxScaler()],
    
    # Model Hyperparameters
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [None, 10]
}

grid_search = GridSearchCV(clf, param_grid, cv=3, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Pipeline Configuration:")
print(grid_search.best_params_)
print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")

Fitting 3 folds for each of 16 candidates, totalling 48 fits


Best Pipeline Configuration:
{'classifier__max_depth': 10, 'classifier__n_estimators': 100, 'preprocessor__num__imputer__strategy': 'mean', 'preprocessor__num__scaler': StandardScaler()}
Best CV Accuracy: 0.8559
