## Lab Work 5: Fair Evaluation and Monte Carlo

This notebook builds on the third lecture of Foundations of Machine Learning. We'll focus on Fair Evaluation and Monte Carlo.

Important note: the steps shown here are not always the most efficient or the most "industry-approved." Their main purpose is pedagogical. So don't panic if something looks suboptimal—it's meant to be.

If you have questions (theoretical or practical), don't hesitate to bug your lecturer.

First the necessary imports:


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.inspection import permutation_importance
from sklearn.base import clone
```

### 1. Load the dataset

For this lab we will use the Pulsar dataset from the previous lecture.

**Task:** load it, print the head, and split the features from the target.
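A minimal sketch of the split step, assuming the data is already in a pandas DataFrame. The helper `split_features_target`, the column name `target_class`, and the tiny stand-in frame are hypothetical; in the lab you would load the Pulsar file you used last time with `pd.read_csv` and pass its actual target column name.

```python
import pandas as pd


def split_features_target(df, target_col):
    """Return (X, y): every column except target_col, and the target column."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    return X, y


# Stand-in data so the sketch runs on its own; replace with your own
# pd.read_csv(...) call on the Pulsar file (path and column names are assumptions).
df = pd.DataFrame(
    {
        "feat_a": [0.1, 0.4, 0.3, 0.9],
        "feat_b": [1.0, 2.0, 1.5, 3.0],
        "target_class": [0, 0, 0, 1],
    }
)

print(df.head())
X, y = split_features_target(df, "target_class")
print(X.shape, y.value_counts().to_dict())
```

Checking `y.value_counts()` right after loading is a quick way to see how imbalanced the classes are, which motivates the stratified splits in the next section.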


### 2. Stratified K-Fold Cross-Validation

**Goal:** Split the dataset into k folds for cross-validation while preserving the class proportions in each fold (important for imbalanced datasets).

**Procedure:** use scikit-learn to create the stratified splitter and the pipeline, iterate over the splits, then print a summary of the balanced accuracy across folds.
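The procedure above could look roughly like the sketch below. It uses a synthetic imbalanced dataset from `make_classification` as a stand-in for the Pulsar data, and the choices of 5 folds, `max_iter=1000`, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly 10% positives (not the Pulsar data).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores, pos_rates = [], []
for train_idx, test_idx in skf.split(X, y):
    model = clone(pipe)  # fresh, unfitted copy for each fold
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    scores.append(balanced_accuracy_score(y[test_idx], y_pred))
    pos_rates.append(y[test_idx].mean())  # share of positives in this fold

scores = np.array(scores)
print("positive rate per fold:", np.round(pos_rates, 3))
print(f"balanced accuracy: {scores.mean():.3f} +/- {scores.std(ddof=1):.3f}")
```

Because the scaler sits inside the pipeline, it is refitted on each training fold only, so no information from the test fold leaks into preprocessing.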

### Accuracy vs Balanced Accuracy

**Accuracy:**

- Measures the proportion of correctly classified samples:  
   $
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
$
- Works well when classes are balanced.
- Can be **misleading for imbalanced datasets**, because predicting the majority class gives high accuracy even if the model ignores the minority class.

**Balanced Accuracy:**

- Averages the recall over the classes, giving each class equal weight. For a binary problem:  
   $
\text{Balanced Accuracy} = \frac{1}{2} \left( \frac{\text{TP}}{\text{TP + FN}} + \frac{\text{TN}}{\text{TN + FP}} \right)
$
- Ensures the model performs well on **both majority and minority classes**.
- Particularly useful for **imbalanced classification problems**.
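A small worked example may help here. The labels below are made up: 90 negatives, 10 positives, and a "classifier" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels: 90 negatives and 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
# A trivial model that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(f"accuracy:          {acc:.2f}")      # 0.90
print(f"balanced accuracy: {bal_acc:.2f}")  # 0.50
```

The recall is 1 on the negative class and 0 on the positive class, so the balanced accuracy is (1 + 0) / 2 = 0.5 even though 90% of the predictions are correct.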


### 3. Bootstrap

**Goal:** Estimate the variability of model performance and compute confidence intervals.

**Procedure:**

- Resample the training data **with replacement** to create bootstrap datasets.
- Use the **out-of-bag samples** as test sets.
- Fit the pipeline on each bootstrap sample and record accuracy.
- Repeat for `B` iterations and compute a **95% confidence interval** from the distribution of accuracies.
- Visualize with a histogram and mark the mean accuracy.
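One possible sketch of these steps follows. It resamples a synthetic stand-in dataset rather than the Pulsar training set, and `B = 200`, the percentile-based interval, and all variable names are assumptions. The plot is left out to keep the sketch short.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Pulsar data).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

rng = np.random.default_rng(0)
n = len(y)
B = 200  # number of bootstrap iterations (assumed value)

boot_scores = []
for _ in range(B):
    boot_idx = rng.integers(0, n, size=n)  # draw n indices with replacement
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[boot_idx] = False             # rows never drawn are out-of-bag
    # Skip degenerate resamples (single class, or no out-of-bag rows).
    if len(np.unique(y[boot_idx])) < 2 or not oob_mask.any():
        continue
    model = clone(pipe).fit(X[boot_idx], y[boot_idx])
    boot_scores.append(accuracy_score(y[oob_mask], model.predict(X[oob_mask])))

boot_scores = np.array(boot_scores)
ci_low, ci_high = np.percentile(boot_scores, [2.5, 97.5])
print(f"mean accuracy: {boot_scores.mean():.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

For the visualization step, `plt.hist(boot_scores)` together with `plt.axvline(boot_scores.mean())` would draw the histogram and mark the mean.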


### 4. Permutation Test — Single Feature

**Goal:** Assess the importance of a single feature for model performance.

**Procedure:**

- Compute baseline accuracy of the model with all features.
- Randomly **shuffle the values** of the feature of interest.
- Evaluate the model on the permuted data.
- Repeat for multiple permutations to get a distribution of accuracies.
- Compare baseline accuracy to the permuted distribution: a large drop indicates a **relevant feature**.
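The steps above can be sketched as follows. The train/test split, the choice of feature index 0, and `n_perm = 100` are assumptions for illustration, and the data is again a synthetic stand-in. Note that the model is fitted once and is not refitted after shuffling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Pulsar data).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_train, y_train)

baseline = accuracy_score(y_test, model.predict(X_test))

feature = 0   # index of the feature of interest (arbitrary choice)
n_perm = 100  # number of permutations (assumed value)
rng = np.random.default_rng(0)

perm_scores = np.empty(n_perm)
for b in range(n_perm):
    X_perm = X_test.copy()
    # Shuffle only the chosen column; all other columns stay intact.
    X_perm[:, feature] = rng.permutation(X_perm[:, feature])
    perm_scores[b] = accuracy_score(y_test, model.predict(X_perm))

drop = baseline - perm_scores.mean()
print(f"baseline accuracy:  {baseline:.3f}")
print(f"permuted (mean):    {perm_scores.mean():.3f}")
print(f"mean accuracy drop: {drop:.3f}")
```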


### 5. Permutation Importance — All Features

**Goal:** Quantify the relevance of each feature for model performance.

**Procedure:**

- Fit the pipeline on the full dataset.
- Use `sklearn.inspection.permutation_importance` to shuffle each feature multiple times.
- Measure the decrease in model accuracy for each feature.
- Features causing a large drop are **more important**.
- Visualize the results with a bar chart of mean decreases in accuracy.
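A minimal sketch of this procedure with `permutation_importance`, again on synthetic stand-in data. The values `n_repeats=10` and `scoring="accuracy"` are assumptions, and the feature indices printed here would be column names in the Pulsar data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Pulsar data).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# Fit the pipeline on the full dataset, as described above.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

n_repeats = 10
result = permutation_importance(
    model, X, y, scoring="accuracy", n_repeats=n_repeats, random_state=0
)

# Rank features by mean decrease in accuracy, largest first.
order = np.argsort(result.importances_mean)[::-1]
for i in order:
    print(
        f"feature {i}: {result.importances_mean[i]:.4f} "
        f"+/- {result.importances_std[i]:.4f}"
    )
```

For the bar chart, `plt.bar(range(X.shape[1]), result.importances_mean, yerr=result.importances_std)` is one way to plot the mean decreases with their spread.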


### 6. Permutation Importance for prediction vs learning

**Goal:** Notice the difference between predictive dependence (how much the fitted model actually uses a feature) and learning dependence (how much the model needs the feature in order to reach good performance).

**Task:** create two groups of features, permute each group, and **refit** the model. Compare the drop in accuracy: why is it different from the single-feature permutation?
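One way to set up the task is sketched below. The two groups (first four and last four columns), the 20 repetitions, and the helper `refit_accuracy` are hypothetical choices, and the data is a synthetic stand-in. Each group is shuffled row-wise as a block in both the training and test sets, and the pipeline is refitted on the shuffled training data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Pulsar data).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grouping: first four columns vs. last four columns.
groups = {"group_a": [0, 1, 2, 3], "group_b": [4, 5, 6, 7]}

baseline = accuracy_score(y_test, clone(pipe).fit(X_train, y_train).predict(X_test))


def refit_accuracy(columns, seed):
    """Shuffle the given columns (rows moved together), refit, and score."""
    rng = np.random.default_rng(seed)
    X_tr, X_te = X_train.copy(), X_test.copy()
    X_tr[:, columns] = X_train[rng.permutation(len(X_train))][:, columns]
    X_te[:, columns] = X_test[rng.permutation(len(X_test))][:, columns]
    model = clone(pipe).fit(X_tr, y_train)  # refit on the permuted data
    return accuracy_score(y_test, model.predict(X_te))


drops = {}
for name, cols in groups.items():
    scores = [refit_accuracy(cols, seed) for seed in range(20)]
    drops[name] = baseline - np.mean(scores)
    print(f"{name}: mean accuracy drop after refit = {drops[name]:.3f}")
```

Since the model is refitted, it can compensate for a destroyed group by relying on the remaining features, which is one reason the drop can differ from the no-refit permutation in the previous sections.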


### 7. Monte Carlo Permutation Test

**Goal:** Test whether the model's performance is significantly better than chance.

**Procedure:**

- Randomly **permute the target labels** multiple times (B iterations).
- Fit the model on each permuted dataset and record the accuracy.
- Compare the true model accuracy to the distribution of accuracies under the null hypothesis.
- Compute the **p-value**: proportion of permuted accuracies ≥ true accuracy.
- Visualize with a histogram marking the true accuracy.
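A sketch of the test under some assumptions: accuracy is estimated with 3-fold stratified cross-validation (the procedure above does not say how the model is evaluated), `B = 100`, and the data is a synthetic stand-in. The p-value follows the definition given above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Pulsar data).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)


def cv_accuracy(labels):
    """Mean cross-validated accuracy of the pipeline for the given labels."""
    return cross_val_score(pipe, X, labels, cv=cv, scoring="accuracy").mean()


true_score = cv_accuracy(y)

rng = np.random.default_rng(0)
B = 100  # number of label permutations (assumed value)
# Shuffling y breaks any link between features and labels (null hypothesis).
null_scores = np.array([cv_accuracy(rng.permutation(y)) for _ in range(B)])

# Proportion of permuted accuracies >= the true accuracy.
# A common variant adds 1 to numerator and denominator to avoid p = 0.
p_value = np.mean(null_scores >= true_score)

print(f"true accuracy:      {true_score:.3f}")
print(f"null mean accuracy: {null_scores.mean():.3f}")
print(f"p-value:            {p_value:.3f}")
```

With imbalanced classes the null accuracies cluster near the majority-class rate rather than 0.5, so "chance" here means the accuracy achievable without any feature-label relationship. `plt.hist(null_scores)` with `plt.axvline(true_score)` would give the histogram described above.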
