# Notebook 04: SHAP Values

## Unpacking Predictions

SHAP (SHapley Additive exPlanations) answers the question: "How much did each feature contribute to this specific prediction?" It builds on Shapley values from cooperative game theory, which guarantees that the per-feature attributions for a prediction sum to that prediction minus the baseline.

---

## What are SHAP Values?

SHAP values provide **local explanations** for individual predictions. They satisfy:

$$\text{prediction} = \text{baseline} + \sum_{j=1}^{p} \phi_j$$

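where $\phi_j$ is the SHAP value for feature $j$, $p$ is the number of features, and the baseline is the expected model output over the background data.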
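
To make the additivity property concrete, here is a minimal sketch that computes exact Shapley values by brute force for a tiny hand-written model. Everything in it (`shapley_values`, `toy_model`, the background rows) is hypothetical and exists only for illustration; it is not how the `shap` library computes values internally, and the enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values by enumerating every feature coalition.

    Features outside a coalition take their values from each background row,
    and the model output is averaged over the background.
    Returns (phi, baseline), where baseline is the mean background prediction.
    """
    p = len(x)

    def value(coalition):
        total = 0.0
        for row in background:
            mixed = [x[j] if j in coalition else row[j] for j in range(p)]
            total += predict(mixed)
        return total / len(background)

    phi = [0.0] * p
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi, value(set())


def toy_model(row):
    # Hypothetical model: one additive term plus one interaction term
    return 2.0 * row[0] + row[1] * row[2]


background = [[0.0, 1.0, 1.0], [2.0, 3.0, 1.0], [1.0, 2.0, 4.0]]
x = [3.0, 2.0, 5.0]

phi, baseline = shapley_values(toy_model, x, background)
prediction = toy_model(x)
print("baseline + sum(phi):", baseline + sum(phi))
print("prediction:         ", prediction)
```

The two printed numbers should agree up to floating-point error, which is the additivity equation above in action.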

## Different Explainers

- **LinearExplainer**: For linear models (fast; exact when features are treated as independent)
- **TreeExplainer**: For tree models like XGBoost (fast, exact)
- **KernelExplainer**: For any model (model-agnostic, but slow and approximate)
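
One reason a linear explainer can be both fast and exact: when features are treated as independent, the SHAP value of feature $j$ for a linear model reduces to $\phi_j = w_j (x_j - \mathbb{E}[x_j])$. The sketch below checks this with NumPy only. The coefficients, intercept, and synthetic data are made up for illustration, and `linear_shap` is a hypothetical helper, not a `shap` function.

```python
import numpy as np

rng = np.random.default_rng(0)
X_background = rng.normal(size=(200, 3))  # synthetic background data
weights = np.array([1.5, -2.0, 0.5])      # hypothetical linear coefficients
intercept = 4.0


def linear_predict(X):
    return X @ weights + intercept


def linear_shap(X, background, coef):
    """Closed-form SHAP values for a linear model, assuming independent features."""
    return (X - background.mean(axis=0)) * coef


x_new = np.array([[1.0, 0.5, -1.0]])
phi = linear_shap(x_new, X_background, weights)
baseline = linear_predict(X_background).mean()

# Additivity: baseline plus the attributions recovers the prediction
print(np.allclose(baseline + phi.sum(axis=1), linear_predict(x_new)))
```

No such closed form exists for an arbitrary black-box model, which is why a model-agnostic explainer has to fall back on sampling-based approximation.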

## Performance Tips

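- Subsample to 500-1000 rows before computing SHAP values on large datasets (the diabetes data used here has only 442 rows, so it fits as-is)
- Use the right explainer for your model type
- Fix `random_state` for reproducibility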
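
As a rough sketch of the first and third tips, the helper below caps the number of rows and uses a fixed seed so repeated runs pick the same subset. The name `subsample_rows` and its defaults are assumptions made for this example. It converts its input to a NumPy array, so for a DataFrame you may prefer `DataFrame.sample(n=..., random_state=...)`, which keeps column names.

```python
import numpy as np


def subsample_rows(X, max_rows=500, seed=42):
    """Return at most max_rows rows of X, drawn without replacement."""
    X = np.asarray(X)
    n = X.shape[0]
    if n <= max_rows:
        return X  # already small enough; nothing to do
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=max_rows, replace=False)
    return X[np.sort(idx)]  # keep the original row order


X_big = np.arange(3000).reshape(1000, 3)
X_small = subsample_rows(X_big)
print(X_small.shape)

# Same seed, same subset
print(np.array_equal(X_small, subsample_rows(X_big)))
```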

## Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import shap
import xgboost as xgb

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

import sys
from pathlib import Path
project_root = Path().resolve().parent if Path().resolve().name == 'notebooks' else Path().resolve()
sys.path.insert(0, str(project_root))

from src.utils import set_seed

set_seed(42)
shap.initjs()  # Initialize JS visualization
print("✓ Imports successful!")

## Step 1: Load and Prepare Data

In [None]:
# Load data
data = load_diabetes(as_frame=True)
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape}")

## Step 2: SHAP for Linear Model

Use LinearExplainer for Ridge regression.

In [None]:
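# === TODO: Fit Ridge, compute SHAP with LinearExplainer on a sample of up to 500 rows
# Hints:
#   - Fit a Ridge pipeline (StandardScaler + Ridge) on the training data
#   - Create a background sample of at most 500 rows (X_train has only 353)
#   - Use shap.LinearExplainer(model, background), passing the fitted Ridge step
#     and a background transformed by the pipeline's scaler
#   - Compute SHAP values for the (scaled) test sample
#   - Plot summary plot
# Acceptance: SHAP summary plot; 2-sentence interpretation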

## Step 3: SHAP for Tree Model

Use TreeExplainer for XGBoost.

In [None]:
# === TODO: Fit XGBoost, compute SHAP with TreeExplainer
# Hints:
#   - Fit XGBRegressor(n_estimators=200, max_depth=3)
#   - Use shap.TreeExplainer(model)
#   - Compute SHAP values for small test sample
#   - Plot beeswarm plot
# Acceptance: SHAP beeswarm plot; note top 3 features

## Step 4: Compare SHAP vs Permutation Importance

Compare feature rankings from different interpretability methods.

In [None]:
# === TODO: Compare SHAP ranking vs permutation importance
# Hints:
#   - Get top 5 features from SHAP (mean absolute SHAP values)
#   - Get top 5 features from permutation importance (from notebook 01)
#   - Create comparison table
# Acceptance: Table with top 5 features by each method
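
The comparison table could be assembled along the lines below. `top_k_comparison` is a hypothetical helper, and the toy numbers are made up purely to show the shape of the output. In practice you would pass the SHAP value matrix from the previous steps and a vector of permutation importances, for example the `importances_mean` array returned by `sklearn.inspection.permutation_importance`, assuming that is what notebook 01 produced.

```python
import numpy as np
import pandas as pd


def top_k_comparison(shap_values, perm_importances, feature_names, k=5):
    """Side-by-side table of the top-k features under each method."""
    mean_abs_shap = np.abs(np.asarray(shap_values)).mean(axis=0)
    perm = np.asarray(perm_importances)
    names = np.asarray(feature_names)

    shap_order = np.argsort(mean_abs_shap)[::-1][:k]
    perm_order = np.argsort(perm)[::-1][:k]

    return pd.DataFrame(
        {
            "rank": np.arange(1, len(shap_order) + 1),
            "shap_feature": names[shap_order],
            "mean_abs_shap": mean_abs_shap[shap_order],
            "perm_feature": names[perm_order],
            "perm_importance": perm[perm_order],
        }
    )


# Made-up numbers, only to illustrate the table layout
toy_shap = np.array([[0.5, -2.0, 0.1], [-0.3, 1.0, 0.2]])
toy_perm = np.array([0.05, 0.30, 0.10])
table = top_k_comparison(toy_shap, toy_perm, ["age", "bmi", "bp"], k=2)
print(table)
```

The two methods measure different things (average attribution magnitude versus the drop in score when a feature is shuffled), so partial disagreement in the rankings is expected rather than a sign of a bug.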

## Summary

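SHAP provides local explanations for individual predictions and, when aggregated (for example, as mean absolute SHAP values), global feature rankings. Choose the explainer that matches the model: LinearExplainer for linear models, TreeExplainer for tree ensembles, and KernelExplainer as a model-agnostic fallback.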

**Next**: Notebook 05 will explore cross-validation schemes and data leakage.