# Task 3: Model Explainability with SHAP

This notebook uses SHAP (SHapley Additive exPlanations) to interpret the best-performing fraud detection model.

We generate and interpret SHAP plots (summary, bar, beeswarm, force, dependence, waterfall) to understand global and local feature importance.

## Business Context
Fraud detection models must be explainable to build trust with business stakeholders and to understand the key drivers of fraud. SHAP provides both global and local interpretability.
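
To make the local view concrete, here is a minimal sketch of exact Shapley values for a tiny hand-written scoring function. The `model`, the all-zero `baseline`, and the instance `x` are hypothetical and exist only for illustration; the SHAP library's tree explainer uses a more efficient algorithm and a background distribution rather than a single baseline. The point is the additivity property: the base value plus the per-feature contributions equals the model output.

```python
from itertools import combinations
from math import factorial

def model(x):
    # Hypothetical toy scoring function with one interaction term
    return 0.1 + 0.5 * x[0] + 0.2 * x[1] + 0.3 * x[0] * x[2]

def value(subset, x, baseline):
    # Features in `subset` take the instance's values; the rest stay at the baseline
    z = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return model(z)

def shapley_values(x, baseline):
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the coalition that feature i joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_i = value(set(subset) | {i}, x, baseline)
                without_i = value(set(subset), x, baseline)
                phi[i] += weight * (with_i - without_i)
    return phi

x = [1.0, 1.0, 1.0]          # hypothetical instance to explain
baseline = [0.0, 0.0, 0.0]   # hypothetical reference point
phi = shapley_values(x, baseline)
base_value = model(baseline)

print("contributions:", phi)
print("base + sum(contributions):", base_value + sum(phi))
print("model output:", model(x))
```

The interaction term `0.3 * x[0] * x[2]` is split evenly between features 0 and 2, which is how Shapley values attribute joint effects.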

## 1. Imports and Setup

In [None]:
# If not installed, uncomment the next line:
# !pip install shap

import pandas as pd
import numpy as np
import shap
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
shap.initjs()

## 2. Load Data and Prepare Features

We use the e-commerce fraud data and the same feature engineering as in previous tasks.

In [None]:
fraud_df = pd.read_csv('../data/Fraud_Data.csv')
fraud_df = fraud_df.dropna().drop_duplicates()
fraud_df['signup_time'] = pd.to_datetime(fraud_df['signup_time'])
fraud_df['purchase_time'] = pd.to_datetime(fraud_df['purchase_time'])
fraud_df['age'] = fraud_df['age'].astype(int)
fraud_df['hour_of_day'] = fraud_df['purchase_time'].dt.hour
fraud_df['day_of_week'] = fraud_df['purchase_time'].dt.dayofweek
fraud_df['time_since_signup'] = (fraud_df['purchase_time'] - fraud_df['signup_time']).dt.total_seconds() / 3600
user_freq = fraud_df.groupby('user_id').size().rename('transaction_count')
fraud_df = fraud_df.merge(user_freq, on='user_id')
drop_cols = ['class', 'ip_address', 'signup_time', 'purchase_time', 'user_id', 'device_id']
X = fraud_df.drop(drop_cols, axis=1)
y = fraud_df['class']
categorical = ['source', 'browser', 'sex']
X = pd.get_dummies(X, columns=categorical, drop_first=True)
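X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Fit the scaler on the training split only so test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_test_df = pd.DataFrame(X_test, columns=X.columns)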
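
As a quick illustration of what the engineered features contain, the sketch below applies the same time-based and frequency steps to a three-row toy frame. The rows are made up, and `transform("count")` is used here as a compact stand-in for the `groupby` plus `merge` in the cell above; it is not the notebook's own code path.

```python
import pandas as pd

# Hypothetical toy transactions: user 1 buys twice, user 2 buys one second after signup
toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "signup_time": ["2015-01-01 00:00:00", "2015-01-01 00:00:00", "2015-03-10 12:00:00"],
    "purchase_time": ["2015-01-01 02:30:00", "2015-01-03 00:00:00", "2015-03-10 12:00:01"],
})
toy["signup_time"] = pd.to_datetime(toy["signup_time"])
toy["purchase_time"] = pd.to_datetime(toy["purchase_time"])

# Same derived columns as in the notebook
toy["hour_of_day"] = toy["purchase_time"].dt.hour
toy["day_of_week"] = toy["purchase_time"].dt.dayofweek
toy["time_since_signup"] = (toy["purchase_time"] - toy["signup_time"]).dt.total_seconds() / 3600

# Per-user transaction count, broadcast back to every row of that user
toy["transaction_count"] = toy.groupby("user_id")["user_id"].transform("count")

print(toy[["user_id", "hour_of_day", "time_since_signup", "transaction_count"]])
```

Note that `time_since_signup` is measured in hours, so a purchase made one second after signup shows up as a value very close to zero.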

## 3. Train the Best Model (Random Forest)

We retrain the Random Forest on the full training set for SHAP analysis.

In [None]:
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)

## 4. SHAP Analysis

We use a sample of the test set for visualization speed.

In [None]:
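X_sample = X_test_df.sample(100, random_state=42)
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_sample)
# Newer SHAP releases return one (samples, features, classes) array; convert it to the per-class list used below
if isinstance(shap_values, np.ndarray) and shap_values.ndim == 3:
    shap_values = [shap_values[:, :, k] for k in range(shap_values.shape[2])]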

### SHAP Summary Plot (Bar)

Ranks features by their mean absolute SHAP value, giving a global view of feature importance.

In [None]:
shap.summary_plot(shap_values[1], X_sample, plot_type="bar")
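
The ranking in the bar plot can be reproduced by hand: take the absolute SHAP values and average them per feature. The sketch below does this on a small made-up matrix; the numbers and the three feature names are placeholders, not results from this model.

```python
import numpy as np

# Hypothetical SHAP values: rows are transactions, columns are features
feature_names = ["purchase_value", "age", "time_since_signup"]
shap_matrix = np.array([
    [0.02, -0.01, -0.30],
    [-0.04, 0.00, 0.25],
    [0.03, 0.02, -0.20],
    [-0.01, -0.01, 0.35],
])

# Global importance = mean of |SHAP| per feature (sign is discarded)
mean_abs = np.abs(shap_matrix).mean(axis=0)

# Sort features from most to least important
order = np.argsort(mean_abs)[::-1]
ranking = [(feature_names[i], float(mean_abs[i])) for i in order]

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the sign is discarded, a feature whose contributions are large but point in both directions still ranks high, which is why the bar plot alone says nothing about direction of effect.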

### SHAP Summary Plot (Beeswarm)

Shows both importance and direction of effect.

In [None]:
shap.summary_plot(shap_values[1], X_sample)
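
One rough way to summarize the direction that the beeswarm colors convey is the sign of the correlation between a feature's values and its SHAP values. The sketch below uses made-up numbers and is only a linear summary; it can miss the non-linear patterns that the plot itself shows.

```python
import numpy as np

# Hypothetical scaled feature values and matching SHAP values (rows = transactions)
feature_names = ["time_since_signup", "purchase_value"]
feature_values = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
shap_matrix = np.array([[0.30, -0.02], [0.10, -0.01], [-0.10, 0.01], [-0.30, 0.02]])

direction = {}
for j, name in enumerate(feature_names):
    # Pearson correlation between the feature's values and its SHAP values
    r = np.corrcoef(feature_values[:, j], shap_matrix[:, j])[0, 1]
    if r > 0:
        direction[name] = "higher value -> higher fraud score"
    else:
        direction[name] = "higher value -> lower fraud score"

for name, label in direction.items():
    print(f"{name}: {label}")
```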

### SHAP Force Plot (Local Explanation)

Explains a single prediction.

In [None]:
shap.force_plot(explainer.expected_value[1], shap_values[1][0], X_sample.iloc[0])

### SHAP Dependence Plot

Shows how a single feature affects the prediction.

In [None]:
# Pick a top feature from the summary plot, e.g., 'transaction_count'
shap.dependence_plot('transaction_count', shap_values[1], X_sample)

### SHAP Waterfall Plot

Gives a detailed breakdown for a single prediction.

In [None]:
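# Wrap the first sampled row in an Explanation so the public waterfall API can be used instead of a private module path
shap.plots.waterfall(shap.Explanation(values=shap_values[1][0], base_values=explainer.expected_value[1], data=X_sample.iloc[0].values, feature_names=X_sample.columns.tolist()))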
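
The arithmetic behind a waterfall plot is a running sum: start at the base value and add each feature's contribution, largest magnitude first. The sketch below walks through that with a hypothetical base value and three hypothetical contributions; none of the numbers come from this model.

```python
import numpy as np

# Hypothetical base value (average model output) and per-feature contributions
base_value = 0.10
feature_names = ["purchase_value", "age", "time_since_signup"]
contributions = np.array([0.02, -0.01, 0.35])

# Order features by absolute contribution, largest first
order = np.argsort(np.abs(contributions))[::-1]

running = base_value
steps = []
for i in order:
    running += contributions[i]
    steps.append((feature_names[i], float(contributions[i]), float(running)))

final_output = running

print(f"base value: {base_value:.2f}")
for name, delta, total in steps:
    print(f"{name}: {delta:+.2f} -> {total:.2f}")
print(f"final output: {final_output:.2f}")
```

The last running total is the model's output for that transaction, so each bar in the plot can be read as one step in this sum.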

## 5. Interpretation of SHAP Results

- **Summary Plot:** Features are ranked by mean absolute SHAP value; the top-ranked ones (e.g., `transaction_count`, `purchase_value`, `time_since_signup`) contribute most to the model's fraud predictions. The beeswarm view also shows whether high or low feature values push a prediction toward fraud.
- **Force Plot:** For a single transaction, the force plot shows how each feature pushes the prediction toward fraud or not fraud.
- **Dependence Plot:** Shows the relationship between a feature and its SHAP value, revealing non-linear effects.
- **Waterfall Plot:** Gives a detailed breakdown of how each feature contributed to a specific prediction.

**Key Insights:**  
- Features with the largest mean absolute SHAP values are the main drivers of the model's fraud predictions.
- These insights help business stakeholders understand and trust the model, and can guide further investigation or feature engineering.

**Conclusion:**  
SHAP analysis reveals the main drivers of the model's fraud predictions and provides both global and local interpretability for its decisions.