# Notebook 00: Invariance and Baselines

## The Foundation of Model Understanding

Before we dive into the sophisticated techniques of model interpretability, we must establish a solid foundation. This notebook introduces two fundamental concepts that every data scientist should understand:

1. **Invariance**: The principle that well-behaved models should produce consistent predictions regardless of row order
2. **Baselines**: Simple reference models that help us understand what "random" or "naive" performance looks like

Understanding these concepts is crucial because they form the bedrock upon which all interpretability techniques are built. If a model isn't invariant to row shuffling, something is fundamentally wrong. If we don't know our baseline performance, we can't judge whether our model is actually learning anything useful.

---

## What is Invariance?

In machine learning, **invariance** refers to the property that certain transformations of the input data should not change the model's predictions. For tabular data models (like linear regression, random forests, gradient boosting), one of the most basic invariances is **row-order invariance**.

### Why Row Order Shouldn't Matter

Consider a dataset of patient records. Whether we arrange patients alphabetically, by age, or completely randomly, a well-trained model should produce the same predictions for the same patients. The model learns patterns from the **features**, not from the **order** in which examples appear.

Mathematically, if we have:
- Original dataset: $X = [x_1, x_2, ..., x_n]^T$ with predictions $\hat{y} = [\hat{y}_1, \hat{y}_2, ..., \hat{y}_n]$
- Permuted dataset: $X_{\pi} = [x_{\pi(1)}, x_{\pi(2)}, ..., x_{\pi(n)}]^T$ with predictions $\hat{y}_{\pi}$

Then for a row-order invariant model, entry $i$ of the permuted predictions equals the original prediction for row $\pi(i)$: $(\hat{y}_{\pi})_i = \hat{y}_{\pi(i)}$ for all $i$.
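
This property can be checked numerically with a toy predictor. The sketch below is illustrative only: `predict` is a hypothetical stand-in for any model that scores each row independently, and the weights and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 5 rows, 3 features (values are arbitrary).
X = rng.normal(size=(5, 3))
w = np.array([0.5, -1.0, 2.0])  # hypothetical fixed weights


def predict(X):
    """Score each row independently: a stand-in for any row-wise model."""
    return X @ w


pi = rng.permutation(len(X))  # the permutation of row positions
y_hat = predict(X)            # predictions in the original order
y_hat_pi = predict(X[pi])     # predictions on the permuted rows

# Entry i of the permuted predictions is the prediction for original row pi[i].
invariant = np.allclose(y_hat_pi, y_hat[pi])
print(f"Row-order invariant: {invariant}")
```

Indexing the original predictions with `pi` aligns them with the permuted rows, so the comparison is sample by sample rather than only a comparison of the two sets of values.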

### When Invariance Breaks

If shuffling rows changes predictions, it indicates:
- **Data leakage**: The model is using information it shouldn't (e.g., row indices, temporal order)
- **Implementation bugs**: The model or preprocessing pipeline has a bug
- **Non-deterministic behavior**: Random seeds not set, causing different results

Testing invariance is a **smoke test**—a quick check that catches fundamental problems before we invest time in deeper interpretability analysis.
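
To see what a failing smoke test looks like, here is a deliberately broken sketch. `leaky_predict` is a hypothetical predictor, invented for illustration, that uses each row's position as a feature; a fixed reversal stands in for a random shuffle so the outcome is deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))  # arbitrary toy features


def leaky_predict(X):
    """Hypothetical buggy predictor: leaks each row's position into the output."""
    position = np.arange(len(X))
    return X[:, 0] + 0.1 * position


pi = np.arange(len(X))[::-1]  # reverse the rows (a simple fixed permutation)
y_hat = leaky_predict(X)
y_hat_pi = leaky_predict(X[pi])

# For an invariant model these would match once aligned by pi.
passes = np.allclose(y_hat_pi, y_hat[pi])
print(f"Invariance holds: {passes}")
```

Because the prediction depends on where a row sits rather than only on its features, moving the row changes its prediction and the check fails.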

---

## Why Baselines Matter

A **baseline model** is the simplest possible predictor that requires no machine learning. It serves as a reference point to answer: "Is my fancy model actually better than doing nothing?"

### Regression Baseline: The Mean Predictor

For regression tasks, the simplest baseline is to always predict the **mean** of the target variable:

$$\hat{y}_{baseline} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

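This baseline has:
- **RMSE**: The (population) standard deviation of the target, when evaluated on the same data used to compute the mean
- **R²**: 0.0 on that same data (by definition, since it explains no variance); on a held-out test set it is at or slightly below 0, because the training mean rarely equals the test mean exactly
- **Interpretation**: "If we knew nothing about features, we'd predict the average"

A model that doesn't beat this baseline has learned nothing useful from the features. A good model should have R² > 0 and an RMSE below the baseline's RMSE.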
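
The two baseline metrics can be verified on synthetic data. This is a minimal sketch, assuming the baseline is evaluated on the same values used to compute the mean; the target values are randomly generated and carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=150.0, scale=75.0, size=200)  # synthetic target values

y_baseline = np.full(len(y), y.mean())  # always predict the mean

# RMSE of the mean predictor.
rmse = np.sqrt(np.mean((y - y_baseline) ** 2))

# R² = 1 - SS_res / SS_tot, computed by hand.
ss_res = np.sum((y - y_baseline) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# y.std() uses ddof=0 by default, i.e. the population standard deviation.
print(f"RMSE: {rmse:.4f}  std: {y.std():.4f}")
print(f"R²:   {r2:.4f}")
```

For the mean predictor the residual sum of squares equals the total sum of squares, which is why R² is exactly zero here.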

### Classification Baseline: The Majority Class

For classification, the simplest baseline is to always predict the **most common class**:

$$\hat{y}_{baseline} = \text{mode}(y)$$

This baseline has:
- **Accuracy**: Proportion of majority class
- **Interpretation**: "If we guessed the most common outcome every time, we'd be right X% of the time"

Any classifier that doesn't beat this baseline is no better than always guessing the majority class. For balanced binary classes, that is equivalent to chance-level accuracy (50%).
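
scikit-learn provides this baseline as `DummyClassifier`. The sketch below uses made-up labels with a 70/30 class split; the feature matrix is a placeholder because the dummy model ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic imbalanced labels: 70 negatives, 30 positives.
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((len(y), 1))  # placeholder features; the dummy model ignores them

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# score() returns mean accuracy, here the share of the majority class.
accuracy = baseline.score(X, y)
print(f"Majority-class accuracy: {accuracy:.2f}")
```

With these labels, a classifier reporting 70% accuracy has not improved on the baseline at all, which is why accuracy alone can mislead on imbalanced data.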

---

## The Importance of Random Seeds

**Reproducibility** is essential in data science. Setting random seeds ensures that:
- Train/test splits are consistent across runs
- Model initialization is the same
- Random shuffling produces the same order
- Results are comparable and debuggable

We'll use `seed=42` throughout this dojo (a popular choice, though any fixed number works). In production, you might use timestamps or commit hashes, but for learning, fixed seeds make experiments reproducible.
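
A seeding helper might look roughly like the sketch below. The name `set_seed_sketch` is hypothetical and this is not the implementation of the `set_seed` helper that the notebook imports from `src.utils`, which is not shown here; it only assumes that Python's `random` module and NumPy's global generator are the sources of randomness in play.

```python
import random

import numpy as np


def set_seed_sketch(seed: int = 42) -> None:
    """Hypothetical helper: seed Python's and NumPy's global random generators."""
    random.seed(seed)
    np.random.seed(seed)


# Re-seeding before each call reproduces the same "random" permutation.
set_seed_sketch(42)
first = np.random.permutation(5)
set_seed_sketch(42)
second = np.random.permutation(5)

print(np.array_equal(first, second))
```

scikit-learn estimators and splitters take their own `random_state` argument, so seeding the global generators does not replace passing `random_state=42` where it is offered.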

---

## What We'll Do in This Notebook

1. **Load a regression dataset** (diabetes progression) and split it into training and test sets
2. **Fit a simple linear model** and record its test performance
3. **Compare to a naive baseline** (mean predictor) to ensure our model learns something
4. **Test invariance** by shuffling rows and comparing predictions

Let's begin!



## Setup and Imports

Let's start by importing the necessary libraries and setting up our environment for reproducibility.


In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Set style for plots
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Set random seed for reproducibility
from src.utils import set_seed
set_seed(42)

print("✓ Imports successful!")


## Step 1: Load the Dataset

We'll use the diabetes dataset from scikit-learn, which is a classic regression problem. This dataset contains 10 baseline variables (age, sex, BMI, blood pressure, etc.) and a quantitative measure of disease progression one year after baseline.

The dataset is small (442 samples) and fast to work with, making it perfect for learning interpretability concepts.


In [None]:
# === TODO (you code this) ===
# Load a small regression dataset (diabetes). Split into X, y.
# Hints: 
#   - Use load_diabetes(as_frame=True) to get a DataFrame
#   - Extract X (features) and y (target) from the returned object
#   - Print shapes of X and y to verify
#   - Optionally print feature names

# Acceptance:
# - Print shapes of X and y
# - X should be a DataFrame with 442 rows and 10 columns
# - y should be a Series with 442 values
# - Feature names should be displayed
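
One possible way to complete this cell is sketched below. It follows the hints above and is only one of several reasonable solutions; the variable name `diabetes` is an arbitrary choice.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns pandas objects instead of NumPy arrays.
diabetes = load_diabetes(as_frame=True)
X = diabetes.data    # DataFrame of features
y = diabetes.target  # Series of disease-progression scores

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"Features: {list(X.columns)}")
```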


## Step 2: Train-Test Split

Before we fit any model, we need to split our data into training and testing sets. This ensures we can evaluate our model on unseen data, which gives us a more honest estimate of performance.

We'll use an 80/20 split: 80% for training, 20% for testing.


In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42  # Fixed seed for reproducibility
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")


## Step 3: Fit a Simple Linear Model

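Now let's fit a basic linear regression model. This will serve as our reference point for performance. Linear regression is interpretable by design: each coefficient tells us how much the predicted target changes for a one-unit change in that feature, holding the other features constant. Note that scikit-learn returns the diabetes features mean-centered and scaled by default, so a "unit" here is not the original measurement unit.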


In [None]:
# === TODO (you code this) ===
# Fit a simple LinearRegression. Record test RMSE.
# Hints:
#   - Create a LinearRegression() object
#   - Fit it on X_train, y_train
#   - Make predictions on X_test
#   - Compute RMSE: sqrt(mean_squared_error(y_test, y_pred))
#   - Compute R²: r2_score(y_test, y_pred)
#   - Print both metrics

# Acceptance:
# - Store the test RMSE in rmse_base and the R² score in r2_base (Step 4 uses both names)
# - Print both metrics
# - Model should be fitted successfully
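
A possible solution sketch follows. It reloads and re-splits the data so that it runs on its own; inside the notebook, the first two statements after the imports are already covered by Steps 1 and 2. The names `rmse_base` and `r2_base` match those used in Step 4.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Repeat of Steps 1 and 2 so the sketch is self-contained.
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the linear model and predict on the held-out rows.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Metrics on the test set.
rmse_base = np.sqrt(mean_squared_error(y_test, y_pred))
r2_base = r2_score(y_test, y_pred)

print(f"Test RMSE: {rmse_base:.2f}")
print(f"Test R²:   {r2_base:.4f}")
```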


## Step 4: Compare to Naive Baseline

Before we test invariance, let's make sure our model is actually learning something useful. We'll compare our linear model to the simplest possible baseline: always predicting the mean of the training target.

If our model doesn't beat this baseline, it's not learning anything meaningful.


In [None]:
# Compute naive baseline: always predict the training-set mean.
# np.full builds a float array; np.full_like would inherit y_test's dtype
# and silently truncate the mean if the target were stored as integers.
y_baseline = np.full(len(y_test), y_train.mean())
rmse_baseline = np.sqrt(mean_squared_error(y_test, y_baseline))
r2_baseline = r2_score(y_test, y_baseline)

print("Naive Baseline (mean predictor):")
print(f"  RMSE: {rmse_baseline:.2f}")
print(f"  R²: {r2_baseline:.4f}")
print("\nOur Linear Model:")
print(f"  RMSE: {rmse_base:.2f}")
print(f"  R²: {r2_base:.4f}")
print("\nImprovement:")
print(f"  RMSE reduction: {((rmse_baseline - rmse_base) / rmse_baseline * 100):.1f}%")
print(f"  R² improvement: {r2_base - r2_baseline:.4f}")


## Step 5: Test Invariance

Now for the key test: **Does shuffling the rows change our predictions?**

If our model is well-behaved, the predictions for each test sample should remain the same, even if we reorder the test set. The only thing that should change is the order of the predictions themselves.

We'll:
1. Create a random permutation of the test set indices
2. Reorder X_test and y_test according to this permutation
3. Make predictions on the permuted test set
4. Compare predictions: they should match the original predictions (just in a different order)


In [None]:
# === TODO (you code this) ===
# Reorder the test rows randomly and confirm predictions are identical up to permutation.
# Hints:
#   - Create a random permutation of indices: idx = np.random.permutation(len(X_test))
#   - Reorder X_test: X_test_perm = X_test.iloc[idx] (or X_test[idx] if numpy array)
#   - Reorder y_test: y_test_perm = y_test.iloc[idx] (or y_test[idx] if numpy array)
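#   - Make predictions on both the original and the permuted test set
#   - Store them as y_pred_original and y_pred_permuted (the verification cell uses both names)
#   - Compare element-wise: y_pred_permuted should match y_pred_original[idx]
#   - A weaker check: sort both prediction arrays and confirm they're equal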

# Acceptance:
# - Test set rows are randomly permuted
# - Predictions are made on permuted test set
# - Original and permuted predictions are compared
# - Short Markdown note: row order does not change predictions
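
One way to complete this cell is sketched below. It refits the model so that it runs on its own, and it draws the permutation from a local `np.random.default_rng(42)` generator rather than the global `np.random.permutation` suggested in the hints; both choices are assumptions made for the sketch.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Repeat of the earlier steps so the sketch is self-contained.
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Shuffle the test rows with a seeded generator.
rng = np.random.default_rng(42)
idx = rng.permutation(len(X_test))
X_test_perm = X_test.iloc[idx]
y_test_perm = y_test.iloc[idx]

y_pred_original = model.predict(X_test)
y_pred_permuted = model.predict(X_test_perm)

# Row i of the permuted set is original row idx[i], so align before comparing.
aligned = np.allclose(y_pred_permuted, y_pred_original[idx])
print(f"Predictions match sample by sample: {aligned}")
```

Aligning by `idx` is a stricter check than comparing sorted arrays, because it confirms that each prediction stays attached to the same sample.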


### Verification

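If invariance holds, the sorted predictions should be identical (or very close, accounting for floating-point precision). Note that comparing sorted arrays only confirms that the two sets of values match; it does not check that each prediction stays attached to the same sample. Let's verify: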


In [None]:
# Verify predictions are identical when sorted
predictions_original_sorted = np.sort(y_pred_original)
predictions_permuted_sorted = np.sort(y_pred_permuted)

# Check if they're equal (within floating-point precision)
are_identical = np.allclose(predictions_original_sorted, predictions_permuted_sorted)

print(f"Predictions are identical (up to permutation): {are_identical}")
if are_identical:
    print("✓ Invariance test passed! Row order does not affect predictions.")
else:
    print("⚠ Warning: Predictions differ. This suggests a problem with the model or data.")


## Summary

In this notebook, we've established two fundamental principles:

1. **Invariance**: Well-behaved models produce consistent predictions regardless of row order. This is a basic sanity check that catches data leakage and implementation bugs.

2. **Baselines**: Simple reference models (like the mean predictor) help us understand whether our model is actually learning useful patterns. If we can't beat a naive baseline, our model isn't useful.

These concepts form the foundation for all interpretability work. In the next notebook, we'll explore **permutation importance**, which builds on the idea of invariance to measure feature importance.

---

### Key Takeaways

- ✅ Row-order invariance is a fundamental property of tabular ML models
- ✅ Always compare your model to a naive baseline
- ✅ Random seeds ensure reproducibility
- ✅ Testing invariance is a quick smoke test for model correctness

**Next**: Notebook 01 will show you how to use permutation to measure feature importance!
