# 🔍 Model Interpretability
## Student Dropout Prediction Project

**Goal:** Understand *why* the model makes specific predictions using SHAP (SHapley Additive exPlanations). This is crucial for identifying key drivers of student dropout.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import joblib
import sys
import os
import importlib
import warnings
warnings.filterwarnings('ignore')

# Add parent directory to path
sys.path.append('..')
import config
importlib.reload(config)

print("✓ Libraries imported successfully")

## 1. Load Data and Model

In [None]:
try:
    # Load Data
    train_df = pd.read_csv(config.TRAIN_DATA_PATH)
    test_df = pd.read_csv(config.TEST_DATA_PATH)
    
    X_train = train_df.drop(columns=['Target'])
    y_train = train_df['Target']
    X_test = test_df.drop(columns=['Target'])
    y_test = test_df['Target']
    
    # Load Model
    model_path = config.MODEL_DIR / "best_model.pkl"
    model = joblib.load(model_path)
    
    print("✓ Data and Model loaded successfully")
    print(f"Model type: {type(model).__name__}")
    
except FileNotFoundError as e:
    print(f"❌ Error: {e}")
    # Re-raise so later cells do not run against missing data or a missing model
    raise

## 2. Initialize SHAP Explainer
We use `TreeExplainer` because our best model (Random Forest or XGBoost) is tree-based. For tree ensembles it computes exact SHAP values and is much faster than the model-agnostic `KernelExplainer`, which approximates them by sampling.
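
To make "exact" concrete, here is a minimal sketch that computes Shapley values for a toy model by brute-force enumeration of every feature coalition. The function `toy_model`, the feature names, and the baseline values are hypothetical and not part of this project. Replacing absent features with a baseline value is an assumed convention, chosen here for simplicity. Enumeration is only feasible for a handful of features, which is why specialized explainers exist.

```python
from itertools import combinations
from math import factorial

# Hypothetical toy model (not the project's model): a score with one interaction term.
def toy_model(x):
    return 2.0 * x["grades"] + 1.0 * x["fees_paid"] + 0.5 * x["grades"] * x["fees_paid"]

def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating every coalition of features.

    Features outside a coalition take their baseline value (an assumed convention).
    """
    features = list(instance)
    n = len(features)

    def value(coalition):
        x = {f: (instance[f] if f in coalition else baseline[f]) for f in features}
        return model(x)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_f = value(set(subset) | {f})
                without_f = value(set(subset))
                total += weight * (with_f - without_f)
        phi[f] = total
    return phi

instance = {"grades": 3.0, "fees_paid": 1.0}
baseline = {"grades": 1.0, "fees_paid": 0.0}
phi = shapley_values(toy_model, instance, baseline)

# Additivity: baseline output plus all contributions equals the output for the instance.
reconstructed = toy_model(baseline) + sum(phi.values())
print(phi)
print(reconstructed, toy_model(instance))
```

The additivity check at the end is the property the later plots rely on: the contributions of all features sum to the difference between this prediction and the baseline.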

In [None]:
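# Create the explainer for the tree-based model
explainer = shap.TreeExplainer(model)

# Compute SHAP values on a sample of the test set (up to 500 rows) to keep the demo fast;
# the full test set may also be feasible for a dataset of this size.
X_test_sample = X_test.sample(n=min(500, len(X_test)), random_state=config.RANDOM_STATE)
shap_values = explainer.shap_values(X_test_sample)

# Newer SHAP releases may return a single 3-D array (samples, features, classes) instead of
# a list with one array per class; convert it so shap_values[k] is always class k.
if isinstance(shap_values, np.ndarray) and shap_values.ndim == 3:
    shap_values = [shap_values[..., k] for k in range(shap_values.shape[-1])]

print("✓ SHAP values calculated")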

## 3. Global Feature Importance (Summary Plot)
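This plot ranks the features by their overall impact on the model output for one class.
- **Y-axis:** Features ordered by importance, most important at the top.
- **X-axis:** SHAP value (impact on model output). Positive values push the prediction toward the plotted class, negative values push it away.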
- **Color:** Feature value (Red = High, Blue = Low).
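
The ordering on the y-axis can be reproduced by hand. The sketch below ranks features by mean absolute SHAP value on a small made-up matrix. The feature names and all numbers are hypothetical, not results from this project.

```python
import numpy as np

# Hypothetical SHAP values for one class: 4 students (rows) x 3 features (columns).
feature_names = ["grades", "fees_paid", "age"]
shap_matrix = np.array([
    [-0.30,  0.10,  0.02],
    [ 0.25, -0.05,  0.01],
    [-0.20,  0.15, -0.03],
    [ 0.35, -0.10,  0.02],
])

# Global importance per feature: average magnitude of its SHAP values across students.
mean_abs_shap = np.abs(shap_matrix).mean(axis=0)

# Most important feature first, matching the top-to-bottom order of the plot.
order = np.argsort(mean_abs_shap)[::-1]
ranked = [(feature_names[i], float(mean_abs_shap[i])) for i in order]

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value before averaging matters: a feature that pushes some students strongly toward a class and others strongly away would otherwise average out to roughly zero and look unimportant.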

In [None]:
# Summary Plot for Class 0 (Dropout)
# Note: this notebook indexes shap_values as a list of arrays, one per class.
# The names below assume the encoding Dropout=0, Enrolled=1, Graduate=2; verify this
# against the label encoding used in training before reading the plots.

class_names = ['Dropout', 'Enrolled', 'Graduate']

print(f"Plotting SHAP summary for: {class_names[0]} (Target=0)")
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values[0], X_test_sample, plot_type="dot", show=False)
plt.title(f"SHAP Summary Plot - {class_names[0]}")
plt.show()

In [None]:
# Summary Plot for Class 2 (Graduate)
print(f"Plotting SHAP summary for: {class_names[2]} (Target=2)")
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values[2], X_test_sample, plot_type="dot", show=False)
plt.title(f"SHAP Summary Plot - {class_names[2]}")
plt.show()

## 4. Feature Dependence Plot
This plot shows how the value of a single feature relates to its SHAP value, and colors the points by a second feature to reveal interactions between the two.
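
As a rough numeric counterpart to the plot, the sketch below summarizes the same two questions on invented data: how the SHAP value changes as the feature increases, and whether that change differs between groups of a second feature. All arrays are hypothetical, and this simplified comparison is not what `shap.dependence_plot` computes internally; with `interaction_index='auto'`, SHAP chooses the coloring feature with its own heuristic.

```python
import numpy as np

# Hypothetical data for 6 students (all values invented for illustration):
# one feature, its SHAP values for the Dropout class, and a second feature.
grades = np.array([8.0, 9.0, 10.0, 13.0, 14.0, 15.0])
shap_grades = np.array([0.40, 0.20, 0.30, -0.10, -0.30, -0.20])
fees_paid = np.array([0, 1, 0, 1, 0, 1])

# Split students into low and high values of the feature at its median.
low = grades < np.median(grades)

# Main effect: mean SHAP value for high grades minus mean SHAP value for low grades.
main_effect = shap_grades[~low].mean() - shap_grades[low].mean()

def trend(mask):
    """Same high-minus-low difference, restricted to the students selected by mask."""
    return shap_grades[mask & ~low].mean() - shap_grades[mask & low].mean()

# Interaction: the trend differs depending on the second feature.
trend_unpaid = trend(fees_paid == 0)
trend_paid = trend(fees_paid == 1)

print(f"main effect: {main_effect:.2f}")
print(f"trend when fees unpaid: {trend_unpaid:.2f}, when paid: {trend_paid:.2f}")
```

In this made-up data, higher grades lower the Dropout contribution overall, and the drop is steeper for students with unpaid fees. In the plot, an interaction like this shows up as differently colored points following different slopes.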

In [None]:
# Find the top feature automatically
# Calculate mean absolute SHAP value for each feature for Class 0
mean_shap = np.abs(shap_values[0]).mean(axis=0)
top_feature_idx = int(np.argmax(mean_shap))
top_feature_name = X_test_sample.columns[top_feature_idx]

print(f"Top feature for Dropout prediction: {top_feature_name}")

# Dependence plot for the top feature
shap.dependence_plot(top_feature_name, shap_values[0], X_test_sample, interaction_index='auto')

## 5. Local Explanation (Force Plot)
Explain a *single* prediction. Why did the model predict Dropout/Graduate for this specific student?
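
A force plot is a picture of a simple sum: the base value plus one contribution per feature. The sketch below works through that sum with hypothetical numbers for one student and one class. None of the values come from this project's model, and the feature names are illustrative only. Depending on the model, the output being explained may be a probability or a raw margin (as is typical for XGBoost), so the scale of the numbers should be read with that in mind.

```python
# Hypothetical numbers for one student and the Dropout class; not model output.
base_value = 0.32  # plays the role of explainer.expected_value[0]
contributions = {
    "grades": 0.21,
    "fees_paid": 0.12,
    "scholarship": -0.08,
    "age": 0.03,
}

# Additivity: the base value plus all contributions gives this student's model output.
prediction = base_value + sum(contributions.values())

# Sort by magnitude to read the strongest drivers first.
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
pushes_up = [name for name, value in ranked if value > 0]
pushes_down = [name for name, value in ranked if value < 0]

print(f"base value: {base_value:.2f}, model output: {prediction:.2f}")
print("pushing toward Dropout:", pushes_up)
print("pushing away from Dropout:", pushes_down)
```

Features in `pushes_up` correspond to the segments that move the output above the base value in the plot, and those in `pushes_down` to the segments that move it below.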

In [None]:
# Select a specific student (e.g., the first one in our sample)
student_idx = 0
student_data = X_test_sample.iloc[student_idx]

# Load the JS library for interactive force plots (not required when matplotlib=True)
shap.initjs()

print(f"Explaining prediction for student #{student_idx}")
print(f"Feature values:\n{student_data}")

# Force plot for Class 0 (Dropout)
shap.force_plot(explainer.expected_value[0], shap_values[0][student_idx], student_data, matplotlib=True)

## 6. Bar Plot of Feature Importance
A simpler view of global importance: each bar is the mean absolute SHAP value of a feature for the Dropout class.

In [None]:
plt.figure(figsize=(10, 6))
shap.summary_plot(shap_values[0], X_test_sample, plot_type="bar", show=False)
plt.title("Feature Importance (SHAP) - Dropout Class")
plt.show()