# Notebook 06: Summary & Reflection

## Synthesizing Knowledge

Answer these questions in your own words to reinforce learning.

## Question 1: Permutation Importance

In your own words, define permutation importance and describe one case where it can mislead.

**Answer:**

Permutation importance measures a feature's importance by randomly shuffling (permuting) that feature's values, leaving the fitted model unchanged, and observing how much the performance metric degrades (for example, R² falls or RMSE rises). If permuting a feature causes a large drop in performance, the model relies heavily on that feature. If performance barely changes, the model makes little use of it.
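The procedure can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the earlier notebooks: the synthetic data, the least-squares model, and the helper name `permutation_importance_r2` are all assumptions made for the example. In practice the score would normally be computed on held-out data, for instance with scikit-learn's `sklearn.inspection.permutation_importance`.

```python
import numpy as np

def permutation_importance_r2(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in R^2 when each column of X is shuffled in turn (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    def r2(y_true, y_pred):
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    baseline = r2(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link to y but keeps its distribution.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - r2(y, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

# Synthetic data (assumption): y depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Ordinary least squares with an intercept column stands in for the model.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
predict = lambda X_new: np.column_stack([X_new, np.ones(len(X_new))]) @ coef

importances = permutation_importance_r2(predict, X, y)
print(importances)
```

The model is never refit here: only the inputs are perturbed, so the score reflects how much the fitted model relies on each column.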

**Case where it can mislead:**

Permutation importance can be misleading when features are highly correlated. For example, if two features (`feature_A` and `feature_B`) are highly correlated and both contain important information, permuting `feature_A` might show low importance because `feature_B` can still carry all the predictive signal. This doesn't mean `feature_A` is unimportant—it's just redundant with `feature_B`. The model can use either feature, so breaking one doesn't hurt performance much. This is why it's important to check for multicollinearity (using VIF or correlation matrices) before interpreting permutation importance.
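A small experiment makes the effect concrete. The sketch below is an assumption-laden illustration, not the notebook's analysis: it uses synthetic data, a closed-form ridge model (chosen because ridge tends to split weight across near-duplicate features), and a hypothetical helper `mse_increase`. It compares the permutation importance of `feature_a` when it is the only input against its importance when a near-copy `feature_b` is also present.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
feature_a = rng.normal(size=n)
feature_b = feature_a + rng.normal(scale=0.01, size=n)  # near-copy of feature_a
y = 2.0 * feature_a + rng.normal(scale=0.1, size=n)

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge without an intercept (the synthetic data are centered near zero)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def mse_increase(X, y, coef, col, n_repeats=20):
    """Average increase in MSE after shuffling one column (hypothetical helper)."""
    base = np.mean((y - X @ coef) ** 2)
    increases = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        increases.append(np.mean((y - X_perm @ coef) ** 2) - base)
    return float(np.mean(increases))

X_alone = feature_a.reshape(-1, 1)
X_both = np.column_stack([feature_a, feature_b])

coef_alone = fit_ridge(X_alone, y)
coef_both = fit_ridge(X_both, y)

imp_alone = mse_increase(X_alone, y, coef_alone, col=0)
imp_with_twin = mse_increase(X_both, y, coef_both, col=0)

print("coefficients with twin present:", coef_both)
print("importance of feature_a alone:    ", imp_alone)
print("importance of feature_a with twin:", imp_with_twin)
```

The underlying relationship between `feature_a` and `y` is identical in both runs, yet the reported importance shrinks once the twin is present, because the model has spread its weight across the two columns.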

## Question 2: Ridge vs Lasso

When do you prefer Ridge vs Lasso and why?

**Answer:**

**Use Ridge when:**
- You want to keep all features but shrink their coefficients toward zero
- You believe all features might be relevant (even if some have small effects)
- You want smooth coefficient shrinkage without feature elimination
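- You have multicollinearity issues (Ridge spreads weight across correlated features and gives more stable coefficients, whereas Lasso tends to keep one feature from a correlated group somewhat arbitrarily)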

**Use Lasso when:**
- You want automatic feature selection (removes irrelevant features by setting coefficients to zero)
- You prefer a simpler, more interpretable model with fewer features
- You suspect many features are irrelevant and want to identify which ones
- You want to reduce model complexity and potential overfitting

**Key difference:** Ridge shrinks coefficients smoothly but rarely zeros them out. Lasso can set coefficients to exactly zero, performing automatic feature selection. In our diabetes dataset, Lasso set 3 features to zero (`age`, `s2`, `s4`) while Ridge kept all 10 features with small but non-zero coefficients.
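The difference comes from the penalty itself: the L1 penalty leads to a soft-thresholding update that can return exactly zero, while the L2 penalty only rescales. The sketch below illustrates this on synthetic data. It is not the diabetes experiment, and the helper names, penalty strengths, and the simple coordinate-descent solver are assumptions for the example rather than what scikit-learn runs internally.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero; anything inside [-threshold, threshold] becomes 0."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def fit_lasso(X, y, alpha, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            residual = y - X @ w + X[:, j] * w[j]  # residual ignoring feature j
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

def fit_ridge(X, y, alpha):
    """Closed-form ridge: solve (X^T X + alpha * I) w = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# Synthetic data (assumption): only the first two of six features matter.
rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize columns
true_w = np.array([4.0, -3.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=n)
y = y - y.mean()  # center the target so no intercept is needed

w_ridge = fit_ridge(X, y, alpha=10.0)
w_lasso = fit_lasso(X, y, alpha=0.3)

print("ridge:", np.round(w_ridge, 3))
print("lasso:", np.round(w_lasso, 3))
print("exact zeros -> ridge:", int(np.sum(w_ridge == 0)),
      "lasso:", int(np.sum(w_lasso == 0)))
```

Ridge leaves the irrelevant coefficients small but non-zero, while the soft-threshold step in Lasso snaps them to exactly zero once their correlation with the residual falls below `alpha`.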

## Question 3: PCA

What did PCA tell you about redundancy in your features?

**Answer:**

PCA revealed redundancy in our features: 7 principal components were enough to capture 90% of the variance in the 10 original features. In other words, roughly 3 dimensions' worth of variation was redundant, so the data can be represented in fewer dimensions with little loss. Explained variance describes spread in the features, not predictive power, so the effect on the model has to be checked separately by refitting.
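The "how many components for 90%" calculation can be sketched with a singular value decomposition. The example below is illustrative only: it uses synthetic data built from 4 hidden factors rather than the diabetes features, and the helper names are assumptions.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    # Standardize so that no feature dominates purely because of its scale.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    singular_values = np.linalg.svd(X_std, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def components_needed(ratios, threshold=0.90):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic data (assumption): 10 observed features generated from
# only 4 latent factors plus a little noise, so 6 dimensions are redundant.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
mixing = rng.normal(size=(4, 10))
X = latent @ mixing + rng.normal(scale=0.1, size=(300, 10))

ratios = explained_variance_ratio(X)
k = components_needed(ratios, threshold=0.90)

print("cumulative explained variance:", np.round(np.cumsum(ratios), 3))
print("components needed for 90%:", k)
```

When far fewer components than features are needed to reach the threshold, the features largely move together, which is the redundancy signal described above.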

**Key insights:**
- **VIF analysis** showed high multicollinearity in `s1`, `s2`, `s3`, and `s5` (VIF > 10), confirming these features share information
- **PCA components** are linear combinations of ALL original features, not a selection—each component uses information from all 10 features
- **7 components for 90% variance** means we can represent the data with fewer dimensions without losing much information
- **Model performance** was nearly identical (RMSE: 53.64 vs 53.77) when using 7 components vs 10 features, suggesting that the discarded 10% of variance was mostly noise or redundant information
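For reference, the VIF figures mentioned in the list above follow from a simple definition: regress each feature on all the others and compute `1 / (1 - R²)`. The sketch below applies that definition to synthetic columns, not the diabetes features; the function name and data are assumptions for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the other columns."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Remaining columns plus an intercept term.
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r2 = 1.0 - residual.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic data (assumption): a and b are strongly correlated, c is independent.
rng = np.random.default_rng(1)
n = 400
a = rng.normal(size=n)
b = a + rng.normal(scale=0.2, size=n)
c = rng.normal(size=n)
X = np.column_stack([a, b, c])

vifs = variance_inflation_factors(X)
print(np.round(vifs, 2))
```

A VIF near 1 means a feature is not explained by the others, while values above the common rule-of-thumb cutoff of 10 indicate that most of its variance is shared.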

**Takeaway:** PCA doesn't remove specific features—it creates new uncorrelated components that capture the essential information. The fact that 7 components perform as well as 10 features indicates redundancy in the original feature space.

## Question 4: SHAP Pitfalls

Describe one pitfall of SHAP and how you would mitigate it.

**Answer:**

**Pitfall: Computational Cost**

SHAP can be computationally expensive, especially for large datasets or complex models. Computing SHAP values for every prediction in a large dataset (e.g., millions of rows) can take hours or even days, making it impractical for real-time applications or large-scale analysis.

**Mitigation Strategies:**

1. **Subsampling**: Use a representative sample (for example, 500-1000 rows) for SHAP computation instead of the full dataset. In Notebook 04 the data was already small, so we used 353 rows for the background sample and 89 rows for test predictions.

2. **Use the right explainer**: 
   - `LinearExplainer` (fast, exact) for linear models
   - `TreeExplainer` (fast, exact) for tree models
   - Avoid `KernelExplainer` (slow, approximate) unless necessary

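3. **Parallel processing**: Parallelize where the explainer or surrounding tooling supports it (for example, an `n_jobs=-1` argument when one is exposed), or split the rows to explain into batches and process them in separate workers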

4. **Feature subsetting**: Focus SHAP analysis on top features (from permutation importance) rather than all features
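To illustrate why subsampling costs little for a linear model, the sketch below computes SHAP values by hand using the closed form that applies when features are assumed independent: each attribution is the coefficient times the feature's deviation from the background mean. It deliberately does not call the `shap` library; the coefficients, data, and helper name are hypothetical.

```python
import numpy as np

def linear_shap_values(coef, X_explain, background):
    """SHAP values for f(x) = coef @ x + intercept, assuming independent features:
    phi_i(x) = coef_i * (x_i - mean of feature i over the background)."""
    return coef * (X_explain - background.mean(axis=0))

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100_000, 5))       # stand-in for a large dataset
coef = np.array([2.0, -1.0, 0.5, 0.0, 3.0])  # hypothetical fitted coefficients
intercept = 1.0
predict = lambda X: X @ coef + intercept

# Subsample: a small background set and a small batch of rows to explain.
background = X_full[rng.choice(len(X_full), size=500, replace=False)]
X_explain = X_full[rng.choice(len(X_full), size=100, replace=False)]

phi = linear_shap_values(coef, X_explain, background)
base_value = predict(background).mean()

# Local accuracy: base value plus attributions reproduces each prediction.
assert np.allclose(base_value + phi.sum(axis=1), predict(X_explain))

# Global importance summarized as mean |SHAP| per feature.
mean_abs_shap = np.abs(phi).mean(axis=0)

# Compare against using the entire dataset as the background.
phi_full = linear_shap_values(coef, X_explain, X_full)
max_gap = np.abs(phi - phi_full).max()

print("mean |SHAP| per feature:", np.round(mean_abs_shap, 3))
print("largest difference vs full background:", round(float(max_gap), 4))
```

In this setting the background enters only through its column means, so a few hundred rows give nearly the same attributions as the full dataset. For non-linear models the approximation error depends on the explainer and should be checked rather than assumed.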

**Additional Pitfall: Correlated Features**

SHAP values can be misleading when features are highly correlated, because many SHAP explainers treat features as independent when they simulate a feature being "missing". With correlated features, this can evaluate the model on unrealistic feature combinations and split credit among the correlated features in ways that are hard to interpret. **Mitigation**: Check for multicollinearity first, and consider using SHAP in combination with other methods (like permutation importance) for validation.