## Lab Work 7: Ensemble Methods

This notebook builds on the same lecture of Foundations of Machine Learning. We'll focus on Ensemble Methods.

Important note: the steps shown here are not always the most efficient or the most "industry-approved." Their main purpose is pedagogical. So don't panic if something looks suboptimal—it's meant to be.

If you have questions (theoretical or practical), don't hesitate to bug your lecturer.

First the necessary imports:


In [1]:
# Basic libraries for data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn tools for preprocessing and modeling
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Baseline model
from sklearn.linear_model import LogisticRegression

# Ensemble models
from sklearn.ensemble import (
    RandomForestClassifier,
    BaggingClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
)
from xgboost import XGBClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Metrics
from sklearn.metrics import (
    balanced_accuracy_score,
    accuracy_score,
    classification_report,
    roc_auc_score,
)

### Step 0

Decision tree for regression: visualization. Generate synthetic data from a noisy sine function and fit it with a decision tree regressor.

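The splitting criterion is variance reduction: at each node, the tree picks the split that minimizes the weighted variance of the two child nodes,
$\text{Var}_{\text{split}} = \frac{|S_L|}{|S|}\text{Var}(S_L) + \frac{|S_R|}{|S|}\text{Var}(S_R)$
where $S$ is the set of samples reaching the node, and $S_L$ and $S_R$ are the subsets sent to the left and right child.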
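
To make the criterion concrete, here is a minimal sketch that evaluates the weighted variance for every candidate threshold on a single feature and keeps the best one. The helper names `weighted_split_variance` and `best_split` are made up for this illustration, and the toy data is hand-picked so that the answer is obvious; scikit-learn's own implementation is more elaborate.

```python
import numpy as np


def weighted_split_variance(x, y, threshold):
    """Weighted variance of the two children obtained by splitting at `threshold`."""
    left = y[x <= threshold]
    right = y[x > threshold]
    n = len(y)
    # An empty child has weight zero, so it contributes nothing
    var_left = left.var() if len(left) else 0.0
    var_right = right.var() if len(right) else 0.0
    return len(left) / n * var_left + len(right) / n * var_right


def best_split(x, y):
    """Try the midpoints between consecutive distinct x values and return the best one."""
    xs = np.unique(x)
    candidates = (xs[:-1] + xs[1:]) / 2
    scores = [weighted_split_variance(x, y, t) for t in candidates]
    best = int(np.argmin(scores))
    return float(candidates[best]), float(scores[best])


# Toy data: two flat plateaus, so the ideal split lies between x=3 and x=10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

threshold, score = best_split(x, y)
print(threshold, score)  # 6.5 0.0
```

A regression tree repeats this search recursively inside each child until a stopping rule (such as `max_depth`) is reached.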


In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)  # For reproducibility
X = np.sort(5 * np.random.rand(80, 1), axis=0)
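y = np.sin(X).ravel() + 0.1 * np.random.randn(80)  # noisy sine; the 0.1 noise scale is an assumed value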
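
One possible way to carry out this step is sketched below. It assumes Gaussian noise with standard deviation 0.1 (the notebook does not fix a noise level) and compares a shallow and a deep tree; the names `X_demo`, `y_demo`, `trees`, and `plot_fits` are chosen only for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X_demo = np.sort(5 * rng.rand(80, 1), axis=0)
y_demo = np.sin(X_demo).ravel() + 0.1 * rng.randn(80)  # assumed noise level

# Fit one shallow and one deep tree to compare underfitting and overfitting
trees = {
    depth: DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_demo, y_demo)
    for depth in (2, 8)
}

X_grid = np.linspace(0, 5, 500).reshape(-1, 1)
train_mse = {
    depth: float(np.mean((tree.predict(X_demo) - y_demo) ** 2))
    for depth, tree in trees.items()
}
print(train_mse)


def plot_fits():
    """Plot the samples and the piecewise-constant prediction of each tree."""
    import matplotlib.pyplot as plt

    plt.scatter(X_demo, y_demo, s=15, color="gray", label="noisy samples")
    for depth, tree in trees.items():
        plt.plot(X_grid, tree.predict(X_grid), label=f"max_depth={depth}")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.show()
```

Calling `plot_fits()` draws the staircase-shaped predictions: the depth-2 tree produces at most four plateaus, while the depth-8 tree follows the noise much more closely.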

### MAGIC Gamma Telescope Dataset (UCI)

This dataset comes from the MAGIC (Major Atmospheric Gamma Imaging Cherenkov) Telescope, used to study high-energy gamma rays. The objective is to classify each recorded event as either `g` (a gamma-ray event, the signal) or `h` (a hadronic shower, the background).

Import the dataset from the UCI Machine Learning repository using `fetch_ucirepo` with id=159 and split features from target.
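
A possible way to do this is sketched below, assuming the `ucimlrepo` package is installed and a network connection is available. The wrapper `load_magic` and the helper `encode_target` are names invented for this example; encoding `g` as 1 and `h` as 0 is a convention chosen here, not something the dataset imposes.

```python
import pandas as pd


def load_magic():
    """Download the MAGIC dataset (UCI id=159) and return (features, target)."""
    # Imported lazily so that the rest of this cell also runs offline
    from ucimlrepo import fetch_ucirepo

    magic = fetch_ucirepo(id=159)
    features = magic.data.features
    target = magic.data.targets.squeeze()  # single-column DataFrame -> Series
    return features, target


def encode_target(target):
    """Map 'g' (gamma, signal) to 1 and 'h' (hadron, background) to 0."""
    return target.map({"g": 1, "h": 0}).astype(int)


# Offline check of the encoding on a tiny hand-made Series
demo = pd.Series(["g", "h", "g"])
print(encode_target(demo).tolist())  # [1, 0, 1]
```

In the notebook you would then call `X, y = load_magic()` followed by `y = encode_target(y)` before splitting into training and test sets.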


### Step 1

Build a preprocessing + Logistic Regression pipeline that will be the benchmark.
Since all features are numerical, just pass them through with a ColumnTransformer, then add a LogisticRegression classifier to the pipeline. Fill in the code below to complete the baseline model setup.
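
A minimal sketch of such a baseline is shown below. The function name `make_baseline`, the step names, and `max_iter=1000` are choices made for this illustration, and the small synthetic DataFrame only stands in for the MAGIC features.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def make_baseline(feature_names):
    """Pass all numerical columns through unchanged, then apply logistic regression."""
    preprocessor = ColumnTransformer(
        transformers=[("num", "passthrough", list(feature_names))]
    )
    return Pipeline(
        steps=[
            ("preprocess", preprocessor),
            ("clf", LogisticRegression(max_iter=1000)),
        ]
    )


# Synthetic stand-in data (column names are placeholders)
X_arr, y_arr = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_df = pd.DataFrame(X_arr, columns=["a", "b", "c", "d"])

baseline = make_baseline(X_df.columns)
baseline.fit(X_df, y_arr)
print(list(baseline.named_steps))  # ['preprocess', 'clf']
```

Replacing `"passthrough"` with a scaler such as `StandardScaler()` is a natural variation to try, since logistic regression is sensitive to feature scale.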


### Step 2

The fundamental "trinity": fit your pipeline on the training set, generate predictions on the test set, and report performance metrics such as accuracy and the classification report.
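
The fit / predict / evaluate sequence can be sketched as follows. Synthetic data from `make_classification` is used as a stand-in so the cell runs on its own; in the lab you would use the MAGIC train/test split and your own pipeline instead of the bare `LogisticRegression`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
)
from sklearn.model_selection import train_test_split

# Stand-in data; replace with the MAGIC features and target
X_syn, y_syn = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_clusters_per_class=1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)        # 1. fit on the training set
y_pred = model.predict(X_test)     # 2. predict on the test set

# 3. report metrics computed on the held-out test set
acc = accuracy_score(y_test, y_pred)
bal_acc = balanced_accuracy_score(y_test, y_pred)
print(f"accuracy={acc:.3f}  balanced accuracy={bal_acc:.3f}")
print(classification_report(y_test, y_pred))
```

Balanced accuracy is worth reporting alongside plain accuracy when the classes are not equally frequent, which is the case for MAGIC.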


### Step 3

Using `DecisionTreeClassifier` as introduced during the theoretical lecture, plot the training and test performance as a function of tree depth, as you did for the Iris dataset.
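
A sketch of the depth sweep is given below, again on synthetic stand-in data so it is self-contained. The depth range 1 to 15, the label noise `flip_y=0.1`, and the helper name `plot_depth_curve` are assumptions of this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with some label noise so that overfitting becomes visible
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

depths = list(range(1, 16))
train_acc, test_acc = [], []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc.append(tree.score(X_train, y_train))  # accuracy on seen data
    test_acc.append(tree.score(X_test, y_test))     # accuracy on held-out data


def plot_depth_curve():
    """Plot training and test accuracy against the maximum tree depth."""
    import matplotlib.pyplot as plt

    plt.plot(depths, train_acc, marker="o", label="train")
    plt.plot(depths, test_acc, marker="o", label="test")
    plt.xlabel("max_depth")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
```

Calling `plot_depth_curve()` typically shows training accuracy climbing towards 1 while test accuracy flattens or drops, which is the overfitting pattern that motivates ensembles.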


### Step 4: Bagging

Compare Bagging and Random Forests using decision trees as base learners. Both are ensemble methods, but they differ in how diversity is introduced among the trees:

1. Bagging (Bootstrap Aggregating)
   - Each tree is trained on a different bootstrap sample of the training data.
   - All features are considered when splitting nodes.
   - Final prediction is obtained by averaging (regression) or majority vote (classification).
   - Reduces variance compared to a single decision tree, but may still overfit if trees are deep.

Note: Bagging can be used with other base classifiers!

2. Random Forest. Builds on bagging, but adds feature randomness:
   - At each split, only a random subset of features is considered.
   - This additional randomness further decorrelates trees, usually improving generalization.
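
The comparison can be sketched as follows on synthetic stand-in data. The number of trees (50) and `max_features="sqrt"` are illustrative choices. The base tree is passed to `BaggingClassifier` positionally because the keyword for it changed name across scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

models = {
    # Bagging: bootstrap samples, every feature is available at every split
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    # Random forest: bootstrap samples plus a random feature subset at each split
    "Random Forest": RandomForestClassifier(
        n_estimators=50, max_features="sqrt", random_state=0
    ),
}

scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

To illustrate the note about other base classifiers, the first argument of `BaggingClassifier` can be swapped for any scikit-learn classifier, for example `LogisticRegression(max_iter=1000)`.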


### Step 5: Boosting

Compare boosting methods using shallow decision trees as base learners. Boosting builds an ensemble sequentially, where each new model focuses on correcting the errors of the previous ones.

1. AdaBoost

   - Sequentially fits weak learners, giving more weight to misclassified samples.
   - Reduces bias of weak learners, but can be sensitive to noise.
   - Final prediction: Each weak learner votes, weighted by its accuracy. The class with the highest total vote is predicted.

2. Gradient Boosting

   - Sequentially fits each new model to the residual errors of the current ensemble, which amounts to gradient descent in function space.
   - More flexible than AdaBoost: supports regression, classification, and custom loss functions.
   - At each step $j$, we compute the pseudo-residuals as the negative gradient of the loss with respect to the current model $F_{j-1}(x)$: $r_i^{(j)} = - \left. \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right|_{F=F_{j-1}}$. For the squared-error loss $L = \frac{1}{2}(y_i - F(x_i))^2$, this is simply the ordinary residual $r_i^{(j)} = y_i - F_{j-1}(x_i)$.

3. XGBoost (no details)
   - Optimized, regularized version of gradient boosting.
   - Trains faster thanks to parallelization and handles missing values natively.
   - Often performs very well on large tabular datasets.
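
A possible comparison of the three boosting methods is sketched below on synthetic stand-in data. The hyperparameters (100 estimators, depth-1 stumps for AdaBoost, depth-2 trees elsewhere, learning rate 0.1) are illustrative assumptions, and XGBoost is only included if the `xgboost` package can be imported.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

boosters = {
    # Decision stumps as weak learners; passed positionally for version compatibility
    "AdaBoost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0
    ),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, max_depth=2, learning_rate=0.1, random_state=0
    ),
}

try:
    from xgboost import XGBClassifier

    boosters["XGBoost"] = XGBClassifier(
        n_estimators=100, max_depth=2, learning_rate=0.1, random_state=0
    )
except ImportError:
    print("xgboost is not installed, skipping XGBoost")

results = {}
for name, clf in boosters.items():
    clf.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: test accuracy = {results[name]:.3f}")
```

Varying `n_estimators` and `learning_rate` together is a useful follow-up experiment: a smaller learning rate usually needs more boosting rounds to reach the same accuracy.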
