# Exercises for Introduction to Python for Data Science

Week 10 - scikit-learn

Matthias Feurer and Andreas Bender  
2025-07-07

# Setup

Check that you have **scikit‑learn ≥ 1.4** and **pandas ≥ 2.0**
installed.

In [5]:
    import sklearn, pandas as pd, numpy as np
    print("scikit‑learn", sklearn.__version__)
    print("pandas", pd.__version__)

scikit‑learn 1.6.1
pandas 2.2.3


# Exercise 1: Download an appropriate dataset

*Download a tabular regression dataset that already contains categorical
columns **and** missing values.* Two popular options:

1.  **Ames Housing** (OpenML name `house_prices`, *OpenML ID = 42165*):
    numeric + categorical + some NaNs.

2.  **Mercedes‑Benz Greener Manufacturing** if you prefer a smaller data
    set.

Store **features** `X` and **target** `y` as Pandas objects—no NumPy
arrays outside the pipeline!

In [6]:
from sklearn.datasets import fetch_openml

X, y = fetch_openml(
    name="house_prices",
    version="active",
    as_frame=True,
    return_X_y=True,
    parser="pandas"          # parser="pandas" is recommended to avoid warnings
)
X.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,2,2008,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,5,2007,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,9,2008,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,2,2006,WD,Abnorml
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,12,2008,WD,Normal
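Whether a downloaded frame meets the requirements of this exercise (categorical columns **and** missing values) can be checked along the following lines. This is a minimal sketch on a tiny hand-made frame; the three rows and the name `toy` are assumptions for illustration, and in the notebook the same calls would be applied to `X`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for X (hypothetical values, only for illustration)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "Alley": pd.Series([np.nan, "Pave", np.nan], dtype=object),
})

# Split the column names by dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column and keep only the affected columns
missing_per_column = toy.isna().sum()
cols_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print("columns with NaNs:", cols_with_missing)
```

A data set is suitable if both `categorical_cols` and `cols_with_missing` are non-empty.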


# Exercise 2: Pipeline

Your end‑to‑end pipeline must:

1.  **Impute missing values**

    -   Numeric → Impute using the mean.
    -   Categorical → Create a new category that indicates missing data.

2.  **Encode categoricals** with
    `OneHotEncoder(handle_unknown="ignore", sparse_output=False)`.

3.  **Scale numerical features** so that they have zero mean and unit
    variance.

4.  **Parallel feature branches**: create three parallel feature
    branches where the data is processed in different ways:

    -   **Branch A** — use only simple preprocessing (steps 1-3).
    -   **Branch B** — apply principal component analysis.
    -   **Branch C** — apply k-means clustering and add new features
        that measure the distance to all cluster centers.

5.  **Regressor**: Use ridge regression and tune the regularization
    hyperparameter.

6.  **Target scaling**: scale the target so it automatically has zero
    mean and unit variance.

Everything must be wired up in a single `Pipeline` so that a call to
**`fit()`** triggers *all* preprocessing.
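The idea behind the parallel branches can be sketched on synthetic numeric data: `FeatureUnion` fits each branch on the same input and concatenates the outputs column-wise, and `KMeans.transform` returns the distance of every sample to every cluster center. The data, the branch names and the choices `n_components=2` and `n_clusters=3` are assumptions for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data: 30 samples, 4 features (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 4))

# Three branches that all see the same input
union = FeatureUnion([
    ("scaled", StandardScaler()),                                # 4 columns
    ("pca", PCA(n_components=2)),                                # 2 columns
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)), # 3 columns
])
Z = union.fit_transform(X_toy)
print(Z.shape)  # 4 + 2 + 3 columns

# The last three columns are Euclidean distances to the cluster centers
kmeans = dict(union.transformer_list)["kmeans"]
manual = np.linalg.norm(
    X_toy[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
print(np.allclose(Z[:, 6:], manual))
```

In the exercise, each branch would additionally start with the imputation, encoding and scaling steps so that PCA and k-means receive complete numeric input.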

HINTS:

-   Call `set_output(transform="pandas")` **once** to make every
    transformer emit data frames → keeps column names alive.
-   You need to use `ColumnTransformer`, `Pipeline` and `FeatureUnion`
    in order to solve this exercise.

In [9]:
X.shape

(1460, 80)

In [None]:
import numpy as np
import pandas as pd

from sklearn import set_config
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV

# ─── 0) Global: all transformers return pandas DataFrames ───
set_config(transform_output="pandas")


# ─── 1) Column selection ───
# names of all numeric columns
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()

# all "object" (string) columns;
# you will often want to include 'category' here as well:
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

# ─── 2) Base transformers: imputation, encoding, scaling ───
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler',   StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot',  OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline,     numeric_cols),
    ('cat', categorical_pipeline, categorical_cols)
], remainder='drop')

# ─── 3) Parallel feature branches ───
# Branch A: base preprocessing only
branch_A = preprocessor

# Branch B: PCA on the preprocessed features
branch_B = Pipeline([
    ('pre', preprocessor),
    ('pca', PCA())         # default keeps all components; pass n_components (e.g. 5) to reduce
])

# Branch C: k-means and distances to the cluster centers
branch_C = Pipeline([
    ('pre', preprocessor),
    # KMeans.transform returns the distances to the cluster centers directly;
    # the default is n_clusters=8, pass n_clusters (e.g. 4) to change it
    ('kmeans', KMeans(random_state=42))
])

# FeatureUnion of all three branches
features = FeatureUnion([
    ('A', branch_A),
    ('B', branch_B),
    ('C', branch_C)
], n_jobs=-1)


# ─── 4) Regressor with target scaling ───
regressor = TransformedTargetRegressor(
    regressor=Ridge(),
    transformer=StandardScaler()
)

# candidate values for the ridge regularization strength
alphas = np.logspace(-6, 6, 25)

param_grid = {
    'reg__regressor__alpha': alphas
}

# ─── 5) Full pipeline ───
pipe = Pipeline([
    ('features', features),
    ('reg',      regressor)
])

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=5
)

grid.fit(X, y)

print("Best alpha:", grid.best_params_['reg__regressor__alpha'])
print("Best CV-MSE:", -grid.best_score_)

Best alpha: 100.0
Best CV-MSE: 1026808441.8150259
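The CV-MSE printed above is in squared target units, which makes it hard to judge. Taking the square root gives an RMSE on the scale of the target itself. The short calculation below only reuses the printed number; reading the result as a typical error in sale-price units is an assumption about the `house_prices` target.

```python
import math

# CV-MSE as printed by the grid search above
cv_mse = 1026808441.8150259

# RMSE is on the same scale as the target
cv_rmse = math.sqrt(cv_mse)
print(f"CV-RMSE: {cv_rmse:,.0f}")
```

The result is roughly 32,000, so the tuned ridge model is off by about that much on a typical held-out house.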


# Exercise 3: Cross-Validation

-   Use **10‑fold cross‑validation** (you can start with 2-fold
    cross-validation for development).
-   Compare against a *baseline* that simply imputes + encodes and uses
    the same linear model (no PCA/Clustering). Did the extra branches
    help?

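In the 2-fold run shown below, using all three pre-processing strategies
in parallel did not help: the three-branch pipeline reaches a CV-MSE of
about 1.036e9, slightly worse than the baseline's 1.000e9. With only two
folds, a difference of this size may well be within the noise.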

In [17]:
import numpy as np
import pandas as pd

from sklearn import set_config
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold

# ─── 0) Global: all transformers return pandas DataFrames ───
set_config(transform_output="pandas")


# ─── 1) Column selection ───
# names of all numeric columns
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()

# all "object" (string) columns;
# you will often want to include 'category' here as well:
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

# ─── 2) Base transformers: imputation, encoding, scaling ───
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler',   StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot',  OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline,     numeric_cols),
    ('cat', categorical_pipeline, categorical_cols)
], remainder='drop')

# ─── 3) Parallel feature branches ───
# Branch A: base preprocessing only
branch_A = preprocessor

# Branch B: PCA on the preprocessed features
branch_B = Pipeline([
    ('pre', preprocessor),
    ('pca', PCA())         # default keeps all components; pass n_components (e.g. 5) to reduce
])

# Branch C: k-means and distances to the cluster centers
branch_C = Pipeline([
    ('pre', preprocessor),
    # KMeans.transform returns the distances to the cluster centers directly;
    # the default is n_clusters=8, pass n_clusters (e.g. 4) to change it
    ('kmeans', KMeans(random_state=42))
])

# FeatureUnion of all three branches
features = FeatureUnion([
    ('A', branch_A),
    ('B', branch_B),
    ('C', branch_C)
], n_jobs=-1)


# ─── 4) Regressor with target scaling ───
regressor = TransformedTargetRegressor(
    regressor=Ridge(),
    transformer=StandardScaler()
)

# candidate values for the ridge regularization strength
alphas = np.logspace(-6, 6, 25)

param_grid = {
    'reg__regressor__alpha': alphas
}

# ─── 5) Full pipeline ───
pipe = Pipeline([
    ('features', features),
    ('reg',      regressor)
])

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=5
)



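# Note: StratifiedKFold is designed for classification and treats every
# distinct price as its own class here; KFold(n_splits=10, shuffle=True)
# is the usual splitter for a regression target. The scores printed below
# come from this 2-fold development setting.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

# Full pipeline (branches A + B + C); the inner GridSearchCV makes this nested CV
mse_full = cross_val_score(grid, X, y, cv=cv, scoring="neg_mean_squared_error")

print("CV-MSE (three branches):", np.mean(-mse_full))


# Baseline: impute + encode + scale only, no PCA / clustering branches
pipe2 = Pipeline([
    ('pre', preprocessor),
    ('reg',      regressor)
])

grid2 = GridSearchCV(
    estimator=pipe2,
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=5
)

# same outer folds as above, so both models are scored on identical splits
mse_baseline = cross_val_score(grid2, X, y, cv=cv, scoring="neg_mean_squared_error")

print("CV-MSE (baseline):", np.mean(-mse_baseline))


CV-MSE (three branches): 1036199874.5081649

CV-MSE (baseline): 999869078.0793294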
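For the 10-fold comparison the exercise asks for, a plain `KFold` splitter is the usual choice with a continuous target. The sketch below shows the pattern on synthetic data from `make_regression`; the data, the fixed `alpha=1.0`, and the names `baseline` and `extended` are assumptions for illustration, so the printed numbers say nothing about the housing data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in for the real data (not the exercise data set)
X_toy, y_toy = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)


def make_regressor():
    """Ridge regression on a standardized target."""
    return TransformedTargetRegressor(
        regressor=Ridge(alpha=1.0), transformer=StandardScaler()
    )


# Baseline: scaling only
baseline = Pipeline([("scale", StandardScaler()), ("reg", make_regressor())])

# Extended: scaled features + PCA components + distances to cluster centers
extended = Pipeline([
    ("scale", StandardScaler()),
    ("features", FeatureUnion([
        ("identity", FunctionTransformer()),  # passes the scaled features through
        ("pca", PCA(n_components=3)),
        ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=0)),
    ])),
    ("reg", make_regressor()),
])

# One splitter object, so both models are scored on identical folds
cv = KFold(n_splits=10, shuffle=True, random_state=42)

scores = {}
for name, model in {"baseline": baseline, "extended": extended}.items():
    neg_mse = cross_val_score(
        model, X_toy, y_toy, cv=cv, scoring="neg_mean_squared_error"
    )
    scores[name] = -neg_mse.mean()
    print(f"{name}: CV-MSE = {scores[name]:.2f}")
```

Reusing the same `cv` object with a fixed `random_state` keeps the comparison paired: any difference in the scores then comes from the models and not from different splits.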


# Exercise 4: Bonus tasks

-   Swap the linear model for `HistGradientBoostingRegressor` and
    compare.
-   Swap the linear model for `DecisionTreeRegressor`, compare, and plot
    the decision tree.
-   Run **nested CV** with `RandomizedSearchCV` tuning `n_clusters` and
    `n_components`.
-   Persist and re-load the model using `joblib.dump` and `joblib.load`.
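The last two bonus tasks could be approached roughly as follows. This is a sketch on synthetic data, not a reference solution: the step names (`features`, `pca`, `kmeans`), the search ranges and the small fold counts are assumptions chosen to keep the example fast. Nested parameters are addressed with double underscores, e.g. `features__pca__n_components`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from scipy.stats import randint
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data
X_toy, y_toy = make_regression(n_samples=100, n_features=6, noise=5.0, random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),
    ("features", FeatureUnion([
        ("pca", PCA()),
        ("kmeans", KMeans(n_init=10, random_state=0)),
    ])),
    ("reg", Ridge()),
])

# Inner loop: random search over the two branch hyperparameters
search = RandomizedSearchCV(
    model,
    param_distributions={
        "features__pca__n_components": randint(1, 7),   # integers 1..6
        "features__kmeans__n_clusters": randint(2, 7),  # integers 2..6
    },
    n_iter=5,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
    random_state=0,
)

# Outer loop: scores the whole tuning procedure on held-out folds
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)
nested_scores = cross_val_score(
    search, X_toy, y_toy, cv=outer_cv, scoring="neg_mean_squared_error"
)
print("nested CV-MSE:", -nested_scores.mean())

# Fit once on all data, then persist and re-load the best pipeline
search.fit(X_toy, y_toy)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    joblib.dump(search.best_estimator_, path)
    reloaded = joblib.load(path)

same = np.allclose(reloaded.predict(X_toy), search.best_estimator_.predict(X_toy))
print("reloaded model gives identical predictions:", same)
```

Files written by `joblib.dump` should only be loaded from trusted sources and, ideally, with the same scikit-learn version that created them.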