[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CompOmics/D012554A_2025/blob/main/notebooks/day_3/3.1b_Exercises_Histone_marks_dt.ipynb)

# 3.1 Exercises – Decision Trees, Bias-Variance & Ensemble Learning

In the lecture notebook you applied decision trees, bagging, and random forests to classify gene expression from histone modifications. In these exercises you will apply the same techniques to the Breast Cancer Wisconsin dataset, which you already know from the logistic regression exercises, so that you can directly compare tree-based models with logistic regression.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import log_loss, accuracy_score
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

random_seed = 42
np.random.seed(random_seed)

---
## Exercise 1 – Load the data and create a train/validation split

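1. Load the Breast Cancer dataset with `load_breast_cancer()` and store the result in `data` (later hints refer to `data.feature_names` and `data.target_names`).
2. Create a pandas DataFrame for the features and a pandas Series for the target.
3. Split into 80 % training / 20 % validation (`random_state=42`). Name the results `train_X`, `val_X`, `train_y`, and `val_y`, because the code provided in later exercises uses these names.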
4. Print the shape of both sets and the class distribution of the training set.

In [None]:
# YOUR CODE HERE

---
## Exercise 2 – Fit a decision tree with limited depth

1. Create a `DecisionTreeClassifier` with `max_depth=3` and `random_state=42`.
2. Fit it on the training data.
3. Compute and print the accuracy and log-loss on both the training and validation sets.

Hint: Use `predict` for accuracy and `predict_proba` for log-loss.
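
The difference between the two metrics can be sketched on a handful of made-up numbers (the labels and probabilities below are invented for illustration and are not taken from the dataset). Accuracy only looks at the thresholded 0/1 prediction, while log-loss uses the predicted probability of the true class, which is why it needs `predict_proba`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and predicted P(class 1); values are invented for illustration
y_true = np.array([1, 0, 1, 1, 0])
p_pos = np.array([0.9, 0.2, 0.6, 0.99, 0.4])

# Log-loss by hand: mean negative log-probability assigned to the true class
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

# scikit-learn returns the same number when given the probabilities
sk = log_loss(y_true, p_pos)

# Hard 0/1 labels (what `predict` returns) discard the confidence information
hard = (p_pos >= 0.5).astype(int)
accuracy = np.mean(hard == y_true)

print(f"manual log-loss : {manual:.4f}")
print(f"sklearn log-loss: {sk:.4f}")
print(f"accuracy        : {accuracy:.2f}")
```

Here every thresholded prediction is correct, yet the log-loss is still above zero because some of the probabilities (0.6 and 0.4) are not confident.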

In [None]:
# YOUR CODE HERE

---
## Exercise 3 – Visualize the decision tree

Use `sklearn.tree.plot_tree` to visualize the tree you just trained.

Hint: Use `feature_names=data.feature_names`, `class_names=data.target_names`, `filled=True`, `rounded=True`.
A large figure size (e.g. `figsize=(20, 10)`) will help readability.

In [None]:
# YOUR CODE HERE

---
## Exercise 4 – Effect of `max_depth` (overfitting curve)

1. Use `GridSearchCV` with `max_depth` values from 1 to 15 and `scoring='neg_log_loss'`.
2. Use 5-fold CV and set `return_train_score=True`.
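3. Plot the mean training log-loss and the mean cross-validated log-loss against `max_depth`. The scorer returns *negative* log-loss, so negate `mean_train_score` and `mean_test_score` from `cv_results_` before plotting.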
4. Print the best `max_depth`.

At which depth does the model start overfitting?
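
As a rough sketch of how the scores come out of `GridSearchCV`, the snippet below runs a small search on synthetic data from `make_classification` (an assumption made here so the example is self-contained; it is not the breast cancer data, and the depth grid is shortened). It only illustrates the sign convention of `neg_log_loss` and where the training and cross-validation scores live.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the breast cancer dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

depths = [1, 2, 4, 8]  # shortened grid, for illustration only
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": depths},
    scoring="neg_log_loss",
    cv=5,
    return_train_score=True,
)
search.fit(X_toy, y_toy)

# Scores are NEGATED log-loss, so flip the sign to get a loss (lower is better)
train_loss = -search.cv_results_["mean_train_score"]
cv_loss = -search.cv_results_["mean_test_score"]

for d, tr, cv in zip(depths, train_loss, cv_loss):
    print(f"max_depth={d}: train log-loss={tr:.3f}, CV log-loss={cv:.3f}")
print("best:", search.best_params_)
```

A widening gap between the training loss and the cross-validation loss as the depth grows is the usual sign of overfitting.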

In [None]:
# YOUR CODE HERE

---
# Part 2 – Understanding Bias & Variance

In machine learning, prediction errors come from two main sources:

Bias is the error from wrong assumptions. A model with high bias oversimplifies and misses patterns in the data. *Example*: a decision tree with `max_depth=1` can only make a single split — it cannot capture complex relationships.

Variance is the error from sensitivity to which data we happen to train on. A model with high variance changes drastically when trained on different subsets. *Example*: a deep decision tree memorises the training data, so different training sets produce wildly different models.

| | Low Variance | High Variance |
|---|---|---|
| Low Bias | ✅ Ideal model | ⚠️ Overfitting |
| High Bias | ⚠️ Underfitting | ❌ Both problems |

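Key insight: Ensemble methods (bagging, random forests) reduce variance by averaging many low-bias, high-variance models. Averaging leaves the bias roughly unchanged while shrinking the variance, so the ensemble gets close to the best of both worlds: low bias and low variance.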
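
Why averaging helps can be sketched with a small simulation. It assumes an idealised setting in which every model is unbiased and the models' errors are independent; `true_value` and `noise_sd` are made-up numbers. Trees trained on bootstrap samples of the same data are correlated, so the reduction in practice is smaller than the 1/N shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 0.7    # quantity every model tries to estimate (made up)
noise_sd = 0.3      # spread of a single high-variance model (made up)
n_repeats = 20_000  # number of simulated "training sets"

variances = {}
for n_models in [1, 5, 25]:
    # Each row holds n_models unbiased but noisy estimates of true_value
    estimates = true_value + noise_sd * rng.standard_normal((n_repeats, n_models))
    ensemble = estimates.mean(axis=1)  # average the models within each repeat
    variances[n_models] = ensemble.var()
    print(
        f"n_models={n_models:>2}: "
        f"bias={ensemble.mean() - true_value:+.4f}, "
        f"variance={variances[n_models]:.4f} "
        f"(theory {noise_sd**2 / n_models:.4f})"
    )
```

The bias stays near zero for every ensemble size, while the variance of the averaged prediction falls roughly as `noise_sd**2 / n_models`.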

In the next exercises you will see this in action!

---
## Exercise 5 – Variance in action: 50 trees, 50 stories

To see how much a single decision tree's predictions depend on the training data, we will:

1. Create 50 bootstrap samples (sampling with replacement) from the training set.
2. Train an unrestricted decision tree (`max_depth=None`) on each bootstrap sample.
3. Record each tree's predicted P(benign) for every validation instance.

Then analyse the results:
- Compute the variance of predictions for each validation instance.
- Compute the mean prediction (= a simple ensemble!).
- Compare the log-loss of the mean prediction with the log-loss of a single tree.
- Create two plots:
  - *Left*: histogram of per-instance prediction variance.
  - *Right*: scatter plot of individual tree predictions (faint) vs the ensemble average (red).
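
What a bootstrap sample looks like can be sketched with NumPy alone. The size `n = 455` is a hypothetical training-set size chosen for illustration. Because sampling is done with replacement, some instances appear several times and others not at all, which is what makes each tree see a slightly different dataset.

```python
import numpy as np

n = 455  # hypothetical training-set size, for illustration only
rng = np.random.RandomState(0)

# Draw n row indices WITH replacement
boot_idx = rng.choice(n, size=n, replace=True)

# Fraction of distinct instances that made it into the sample
unique_frac = np.unique(boot_idx).size / n

# Instances never drawn ("out-of-bag") for this bootstrap sample
oob = np.setdiff1d(np.arange(n), boot_idx)

print(f"distinct instances in sample: {unique_frac:.1%}")
print(f"left-out instances          : {oob.size}")
print(f"theoretical expectation     : {1 - (1 - 1 / n) ** n:.1%}")
```

On average a bootstrap sample contains about 63 % of the distinct instances, so each tree is trained on a noticeably different subset.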

The bootstrap loop is provided below — fill in the tree fitting and the analysis.

In [None]:
# Number of trees to train on bootstrap samples
n_models = 50

# Array to store predicted P(benign) for each tree x each validation instance
all_probs = np.zeros((n_models, len(val_X)))

for i in range(n_models):
    # Create a bootstrap sample (sampling WITH replacement)
    rng = np.random.RandomState(i)
    boot_idx = rng.choice(len(train_X), size=len(train_X), replace=True)
    boot_X = train_X.iloc[boot_idx]
    boot_y = train_y.iloc[boot_idx]

    # YOUR CODE HERE:
    # 1. Create and fit a DecisionTreeClassifier (no max_depth limit, random_state=i)
    # 2. Store predicted P(benign) for all validation instances in all_probs[i]
    #    Hint: dt.predict_proba(val_X)[:, 1]

# YOUR CODE HERE:
# 1. Compute the VARIANCE of predictions for each instance (across the 50 trees)
# 2. Compute the MEAN prediction for each instance
# 3. Print: average variance, log-loss of mean predictions, log-loss of a single tree
# 4. Create a figure with two side-by-side subplots:
#    Left:  histogram of per-instance variances
#    Right: scatter of individual tree predictions (faint) vs ensemble average (red)

---
## Exercise 6 – Following individual instances

Pick three specific validation instances — one *easy* (low variance), one *hard* (high variance), and one *medium* — and create a histogram of the 50 predicted probabilities for each.

Add vertical lines for the mean prediction (red) and the decision boundary at 0.5 (grey).

The instance selection is provided below.

In [None]:
# Select instances with different variance levels
instance_variances = all_probs.var(axis=0)
easy_idx = np.argmin(instance_variances)        # lowest variance
hard_idx = np.argmax(instance_variances)        # highest variance
medium_idx = np.argsort(instance_variances)[len(instance_variances) // 2]

print(f"Easy   (idx={easy_idx}): true={val_y.iloc[easy_idx]}, variance={instance_variances[easy_idx]:.4f}")
print(f"Hard   (idx={hard_idx}): true={val_y.iloc[hard_idx]}, variance={instance_variances[hard_idx]:.4f}")
print(f"Medium (idx={medium_idx}): true={val_y.iloc[medium_idx]}, variance={instance_variances[medium_idx]:.4f}")

# YOUR CODE HERE:
# Create a figure with 3 subplots (1 row, 3 columns)
# For each instance plot a HISTOGRAM of the 50 predicted probabilities
# Add vertical lines for:
#   - Mean prediction (red dashed)
#   - Decision boundary at 0.5 (grey dotted)
#
# Hint: all_probs[:, easy_idx] gives the 50 predictions for the easy instance

---
## Exercise 7 – The power of averaging: growing ensembles

Now let's see how ensemble size affects prediction quality. Using the 50 trees you already trained:

1. For each ensemble size in `[1, 2, 3, 5, 10, 25, 50]`, average the predictions of the first *N* trees.
2. Compute the validation log-loss for each ensemble size.
3. Also track how the prediction for each of the 3 tracked instances changes.

Create two figures:
- Figure 1: Log-loss vs ensemble size.
- Figure 2: 3 subplots showing how each tracked instance’s averaged prediction converges as the ensemble grows. Add horizontal lines for the true label and the decision boundary.
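
Averaging the first *N* rows of a prediction array can be sketched as follows. `toy_probs` is a random stand-in for an array of shape `(n_models, n_instances)`, not real model output. The cumulative-sum variant is an optional alternative that computes the running average for every ensemble size at once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for an (n_models, n_instances) array of probabilities
toy_probs = rng.uniform(size=(50, 4))

sizes = [1, 2, 3, 5, 10, 25, 50]

# Direct approach: slice the first n rows and average over the model axis
running = {n: toy_probs[:n].mean(axis=0) for n in sizes}

# Alternative: running average for ALL ensemble sizes via a cumulative sum
counts = np.arange(1, toy_probs.shape[0] + 1)[:, None]
cum_avg = np.cumsum(toy_probs, axis=0) / counts

for n in sizes:
    print(f"N={n:>2}: averaged prediction for instance 0 = {running[n][0]:.3f}")
```

Row `n - 1` of `cum_avg` equals `running[n]`, so either form can feed the log-loss computation and the per-instance convergence plots.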

In [None]:
ensemble_sizes = [1, 2, 3, 5, 10, 25, 50]

# YOUR CODE HERE:
# 1. For each size N, compute: avg_probs = all_probs[:N].mean(axis=0)
# 2. Compute log_loss(val_y, avg_probs) and store it
# 3. Record avg_probs[easy_idx], avg_probs[hard_idx], avg_probs[medium_idx]
#
# Figure 1: line plot of log-loss vs ensemble size
# Figure 2: 3 subplots showing prediction vs ensemble size for each tracked instance
#           Add horizontal lines for true label (green) and 0.5 (grey)

---
# Part 3 – Ensemble Learning with scikit-learn

In the exercises above you built ensembles "by hand" by averaging individual trees. Scikit-learn provides two powerful ensemble classifiers that automate and improve on this idea:

- `BaggingClassifier`: trains each tree on a different bootstrap sample (exactly what you did above, but optimised).
- `RandomForestClassifier`: like bagging, but also randomly selects a *subset of features* at each split, further decorrelating the trees.
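
A minimal sketch of the two classifiers side by side, fitted on synthetic data from `make_classification` (an assumption made for self-containment; it is not the breast cancer data). Both expose the fitted trees through `estimators_`; the practical difference is that the random forest also samples a subset of features at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the breast cancer dataset)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# Bagging: every tree sees a bootstrap sample and ALL features at each split
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Random forest: bootstrap samples plus a random feature subset at each split
forest = RandomForestClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bag), ("random forest", forest)]:
    model.fit(X_tr, y_tr)
    acc = model.score(X_te, y_te)
    print(f"{name:>13}: {len(model.estimators_)} trees, test accuracy = {acc:.3f}")
```

The tree is passed to `BaggingClassifier` as the first positional argument so that the call does not depend on the keyword name used by a particular scikit-learn version.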

---
## Exercise 8 – BaggingClassifier: effect of ensemble size

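1. Fit `BaggingClassifier` with `DecisionTreeClassifier()` as the base estimator for each value in `n_estimators = [1, 5, 10, 25, 50, 100, 200]`. Pass the tree as the first positional argument: the keyword was renamed from `base_estimator` to `estimator` in scikit-learn 1.2.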
2. Record validation accuracy and log-loss for each.
3. Plot both metrics vs `n_estimators` (two subplots).

At what point do diminishing returns set in?

In [None]:
# YOUR CODE HERE

---
## Exercise 9 – Random Forest & hyperparameter tuning

1. Fit a `RandomForestClassifier` with default hyperparameters and print accuracy + log-loss.
2. Use `GridSearchCV` to tune:
   - `n_estimators`: [50, 100, 200]
   - `max_depth`: [5, 10, None]
   - `min_samples_leaf`: [1, 2, 4]
3. Print the best hyperparameters and evaluate the best model on the validation set.
4. Extract and plot the top 10 feature importances from the best model.

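Hint: The refitted best model is the `best_estimator_` attribute of the fitted `GridSearchCV` object. If you store it as `best_model`, the feature importances are in `best_model.feature_importances_`.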
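
Ranking feature importances can be sketched on a synthetic forest. The data and the `feature_0`, `feature_1`, ... names are hypothetical placeholders, and only the top 3 are shown to keep the output short. Wrapping the importances in a pandas Series keeps the names attached while sorting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
names = [f"feature_{j}" for j in range(X_toy.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# Importances are normalised to sum to 1; index them by feature name
importances = pd.Series(forest.feature_importances_, index=names)

# Sort from most to least important and keep the top k
top = importances.sort_values(ascending=False).head(3)
print(top)
```

The resulting Series can be passed to a horizontal bar plot (for example `top.plot.barh()`), which keeps long feature names readable.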

In [None]:
# YOUR CODE HERE

---
## Bonus – Model comparison

Create a summary table comparing all the models you have trained:

| Model | Val Accuracy | Val Log-Loss |
|-------|-------------|-------------|
| Single DT (max_depth=3) | ... | ... |
| Single DT (best depth from GridSearchCV) | ... | ... |
| Manual ensemble (50 bootstrapped trees) | ... | ... |
| BaggingClassifier (best n_estimators) | ... | ... |
| RandomForest (tuned) | ... | ... |

Create a bar chart comparing the log-loss values.

Which approach gives the best result? How does it compare to the logistic regression you trained in the 2.1 exercises?

In [None]:
# YOUR CODE HERE