# xAI Explainability Demo

This notebook demonstrates explainable AI techniques on the insurance dataset using SHAP values.
For large datasets, consider using `df.sample(frac=0.1)` to reduce runtime.
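
The subsampling suggestion above can be sketched as follows. This is a minimal illustration on a toy frame, not part of the original notebook: the helper name `subsample`, the optional stratification, and the `Travel` label are assumptions. Fixing `random_state` keeps the subsample reproducible, and stratifying on the target keeps class proportions intact.

```python
import pandas as pd


def subsample(df, frac=0.1, stratify_col=None, random_state=42):
    """Return a reproducible random subsample of df.

    If stratify_col is given, sample the same fraction from each group
    so that class proportions are preserved.
    """
    if stratify_col is None:
        return df.sample(frac=frac, random_state=random_state)
    return df.groupby(stratify_col, group_keys=False).sample(
        frac=frac, random_state=random_state
    )


# Toy data standing in for the insurance dataset (labels are hypothetical)
demo = pd.DataFrame({
    "policytype": ["Health"] * 800 + ["Travel"] * 200,
    "age": range(1000),
})
small = subsample(demo, frac=0.1, stratify_col="policytype")
print(small["policytype"].value_counts())
```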

## Setup and Dependencies

In [1]:
# Install required packages
!pip install shap joblib scikit-learn matplotlib pandas numpy


In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
import shap
import joblib
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)


## Data Loading and Preprocessing

In [2]:
# Load dataset (using AUSTRALIA.csv as example)
data_path = os.path.join('..', 'data', 'csv', 'AUSTRALIA.csv')
df = pd.read_csv(data_path)

print(f"Dataset shape: {df.shape}")
print("\nSample of the data:")
display(df.head())

Dataset shape: (10000, 25)

Sample of the data:


| | name | age | country | policytype | policytier | sumassured | smokerdrinker | numdiseases | diseases | annualpremium | ... | propertytype | propertysize | destinationcountry | tripdurationdays | existingmedicalcondition | healthcoverage | baggagecoverage | tripcancellationcoverage | accidentcoverage | trippremium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sheri Adams | 48 | Australia | Health | Basic | 1557887.0 | No | 2.0 | Asthma, Hypertension | 22589.0 | ... | | | | | | | | | | |
| 1 | Anthony Rios | 22 | Australia | Health | Basic | 1945439.0 | No | 3.0 | Hypertension, Asthma, Heart Condition | 26263.0 | ... | | | | | | | | | | |
| 2 | Amanda Erickson | 60 | Australia | Health | Basic | 1941245.0 | No | 4.0 | Asthma, Hypertension, Heart Condition, Diabetes | 29118.0 | ... | | | | | | | | | | |
| 3 | Todd Gilbert | 44 | Australia | Health | Basic | 2483468.0 | Yes | 2.0 | Hypertension, Thyroid | 37252.0 | ... | | | | | | | | | | |
| 4 | Alisha Hines | 59 | Australia | Health | Basic | 1748417.0 | No | 2.0 | Thyroid, Diabetes | 26226.0 | ... | | | | | | | | | | |


In [3]:
# Auto-detect target column
possible_target_columns = [col for col in df.columns if any(keyword in col.lower() 
                          for keyword in ['recommend', 'policy', 'target'])]

if possible_target_columns:
    target_column = possible_target_columns[0]
else:
    target_column = df.columns[-1]

print(f"Selected target column: {target_column}")

# Identify features
id_patterns = ['id', 'user_id', 'uid']
feature_columns = [col for col in df.columns 
                  if col != target_column and 
                  not any(pattern in col.lower() for pattern in id_patterns)]

print(f"\nNumber of features: {len(feature_columns)}")
print("\nFeatures:", feature_columns)

Selected target column: policytype

Number of features: 23

Features: ['name', 'age', 'country', 'policytier', 'sumassured', 'smokerdrinker', 'numdiseases', 'diseases', 'annualpremium', 'priceofvehicle', 'ageofvehicle', 'typeofvehicle', 'propertyvalue', 'propertyage', 'propertytype', 'propertysize', 'destinationcountry', 'tripdurationdays', 'existingmedicalcondition', 'healthcoverage', 'baggagecoverage', 'tripcancellationcoverage', 'trippremium']
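
Note that the substring check on `'id'` also matches `accidentcoverage` (acc-**id**-ent), which is why that column is missing from the feature list above. A stricter matcher is sketched below, assuming ID columns are written with separators (for example `user_id`); the helper name `is_id_column` and the token set are hypothetical and would need adjusting to the actual naming scheme.

```python
import re

# Tokens treated as identifiers (an assumption, extend as needed)
ID_TOKENS = {"id", "uid", "userid"}


def is_id_column(col, id_tokens=ID_TOKENS):
    """Return True if any separator-delimited token of the column name is an ID token."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", col.lower()) if t]
    return any(t in id_tokens for t in tokens)


columns = ["user_id", "ID", "accidentcoverage", "age", "policy id"]
feature_like = [c for c in columns if not is_id_column(c)]
print(feature_like)
```

Matching whole tokens rather than substrings keeps columns such as `accidentcoverage` while still dropping `user_id` or `ID`.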


In [5]:
# Preprocess the data
def preprocess_data(df, feature_cols, target_col):
    # Create a copy of the dataframe
    df_clean = df.copy()
    
    # Fill NA values with appropriate defaults
    for col in feature_cols:
        if df_clean[col].dtype == 'object':
            df_clean[col] = df_clean[col].fillna('unknown')
        else:
            df_clean[col] = df_clean[col].fillna(df_clean[col].mean())
    
    # Initialize dictionary to store label encoders
    label_encoders = {}
    
    # Process features
    X = df_clean[feature_cols].copy()
    for col in feature_cols:
        if X[col].dtype == 'object':
            label_encoders[col] = LabelEncoder()
            X[col] = label_encoders[col].fit_transform(X[col])
    
    # Process target
    y = df_clean[target_col]
    if y.dtype == 'object':
        label_encoders[target_col] = LabelEncoder()
        y = label_encoders[target_col].fit_transform(y)
    
    return X, y, label_encoders

# Apply preprocessing
X, y, label_encoders = preprocess_data(df, feature_columns, target_column)

print("Preprocessed data shape:", X.shape)
print(f"Number of classes in target: {len(np.unique(y))}")

Preprocessed data shape: (10000, 23)
Number of classes in target: 5
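
Because the target is label-encoded, later cells refer to classes by integer index (for example "the first class" in the SHAP plots). A small sketch of how the stored encoder can translate indices back to names is shown below; apart from `Health`, the label values are hypothetical stand-ins, and in the notebook the encoder would be `label_encoders[target_column]`.

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in labels; only 'Health' appears in the data sample shown above
labels = ["Health", "Travel", "Health", "Vehicle"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# LabelEncoder sorts the classes, so index i maps to encoder.classes_[i]
code_to_label = {int(code): str(label) for code, label in enumerate(encoder.classes_)}
decoded = encoder.inverse_transform(codes)

print(code_to_label)
print(list(decoded))
```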


## Model Loading/Training

In [6]:
# Check if model exists, otherwise train a new one
model_path = os.path.join('..', 'artifacts', 'model.pkl')

if os.path.exists(model_path):
    print("Loading existing model...")
    model = joblib.load(model_path)
else:
    print("Training new model...")
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    # Quick evaluation
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"Train accuracy: {train_score:.3f}")
    print(f"Test accuracy: {test_score:.3f}")

Training new model...
Train accuracy: 0.840
Test accuracy: 0.639
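
The cell above never writes the trained model to disk, so the "Loading existing model..." branch is only reached if `model.pkl` was created elsewhere. One possible way to persist the model after training is sketched below; the function name `load_or_train` and its parameters are assumptions for illustration, and the synthetic data only stands in for `X` and `y`.

```python
import os

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def load_or_train(model_path, X, y, n_estimators=100, random_state=42):
    """Return (model, was_loaded). Train and save the model if no file exists."""
    if os.path.exists(model_path):
        return joblib.load(model_path), True

    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    model.fit(X_train, y_train)

    # Create the target directory if needed, then persist the fitted model
    os.makedirs(os.path.dirname(model_path) or ".", exist_ok=True)
    joblib.dump(model, model_path)
    return model, False


# Synthetic stand-in for the preprocessed features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
```

Calling `load_or_train(model_path, X, y)` twice with the same path would train on the first call and load from disk on the second.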


## SHAP Analysis

In [7]:
# Calculate SHAP values
# Using a subset of data for SHAP analysis to reduce computation time
n_samples = min(100, len(X))
X_sample = X.sample(n=n_samples, random_state=42)

# Initialize SHAP explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sample)

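# If model output is multi-class, use the first class for visualization.
# SHAP returns a list of per-class arrays or, in newer releases, one 3-D array.
if isinstance(shap_values, list):
    shap_values = shap_values[0]
elif np.ndim(shap_values) == 3:
    shap_values = shap_values[:, :, 0]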

# Generate and save SHAP summary plot
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values, X_sample, feature_names=feature_columns, show=False)
plt.tight_layout()
plt.savefig('shap_summary.png')
plt.close()

print("Generated SHAP summary plot (saved as 'shap_summary.png')")

Generated SHAP summary plot (saved as 'shap_summary.png')
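
The summary plot ranks features by the magnitude of their SHAP values. The same ranking can be computed numerically, as in the sketch below. It uses only NumPy on a stand-in array; the helper name `mean_abs_shap` is hypothetical, and it assumes the multi-class values arrive either as a list of per-class arrays or as one array of shape `(n_samples, n_features, n_classes)`.

```python
import numpy as np


def mean_abs_shap(shap_values, feature_names, class_index=None):
    """Rank features by mean |SHAP|, averaged over samples (and classes)."""
    if isinstance(shap_values, list):
        # List of (n_samples, n_features) arrays, one per class
        values = np.stack(shap_values, axis=-1)
    else:
        values = np.asarray(shap_values)
    if values.ndim == 2:
        # Single-output model: add a trailing class axis of length 1
        values = values[:, :, np.newaxis]
    if class_index is not None:
        values = values[:, :, [class_index]]

    importance = np.abs(values).mean(axis=(0, 2))
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]


# Stand-in SHAP values: 2 samples, 2 features, 2 classes
fake = np.zeros((2, 2, 2))
fake[:, 0, :] = 1.0
fake[:, 1, :] = -3.0
ranking = mean_abs_shap(fake, ["a", "b"])
print(ranking)
```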



In [10]:
# Explain a single instance (static bar-chart alternative to a force plot)
instance_idx = 0
X_instance = X_sample.iloc[instance_idx:instance_idx+1]
single_prediction = explainer.shap_values(X_instance)

# For multi-class output, keep the first class only. SHAP returns a list of
# per-class arrays or, in newer releases, one 3-D array.
if isinstance(single_prediction, list):
    single_prediction = single_prediction[0]
elif np.ndim(single_prediction) == 3:
    single_prediction = single_prediction[:, :, 0]

# Pass the same single row as features so its shape matches the SHAP values
shap.summary_plot(single_prediction,
                  X_instance,
                  feature_names=feature_columns,
                  plot_type="bar",
                  show=False)

plt.title("Feature Importance (SHAP Values)")
plt.tight_layout()
plt.savefig('shap_force.png', bbox_inches='tight')
plt.close()
print("Generated SHAP force plot alternative (saved as 'shap_force.png')")

Generated SHAP force plot alternative (saved as 'shap_force.png')
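
A per-instance explanation is easiest to read as a short list of the features that pushed the prediction most, with their signs. The sketch below extracts that list from one row of SHAP values using NumPy only; the helper name `top_contributions` and the example numbers are illustrative assumptions, not output from the model above.

```python
import numpy as np


def top_contributions(shap_row, feature_names, k=5):
    """Return the k features with the largest |SHAP| for one instance, keeping the sign."""
    shap_row = np.asarray(shap_row).ravel()
    order = np.argsort(np.abs(shap_row))[::-1][:k]
    return [(feature_names[i], float(shap_row[i])) for i in order]


# Made-up SHAP values for a single instance with three features
example_row = [0.1, -0.5, 0.3]
print(top_contributions(example_row, ["a", "b", "c"], k=2))
```

In the notebook, `shap_row` would be `single_prediction[0]` once it has been reduced to a single class.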


## Decision Tree Snippet (Optional)

In [11]:
# Extract a sample decision tree from the random forest
from sklearn.tree import export_text

# Get the first tree from the forest
tree = model.estimators_[0]

# Export the tree as text
tree_text = export_text(tree, feature_names=feature_columns)

# Save the tree text
with open('sample_decision_tree_snippet.txt', 'w') as f:
    f.write(tree_text)

print("Generated decision tree snippet (saved as 'sample_decision_tree_snippet.txt')")
print("\nFirst few lines of the decision tree:")
print('\n'.join(tree_text.split('\n')[:10]))

Generated decision tree snippet (saved as 'sample_decision_tree_snippet.txt')

First few lines of the decision tree:
|--- tripcancellationcoverage <= 1.50
|   |--- class: 3.0
|--- tripcancellationcoverage >  1.50
|   |--- ageofvehicle <= 7.77
|   |   |--- propertytype <= 2.50
|   |   |   |--- class: 1.0
|   |   |--- propertytype >  2.50
|   |   |   |--- typeofvehicle <= 4.50
|   |   |   |   |--- class: 4.0
|   |   |   |--- typeofvehicle >  4.50
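
Slicing the first ten lines cuts the tree at an arbitrary point. As an alternative, `export_text` accepts a `max_depth` argument that truncates the printout at a chosen depth. The sketch below uses the iris dataset as a stand-in for the insurance data, so the feature names and the small forest are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Small stand-in forest trained on iris
iris = load_iris()
forest = RandomForestClassifier(n_estimators=5, random_state=42)
forest.fit(iris.data, iris.target)

first_tree = forest.estimators_[0]
names = list(iris.feature_names)

# Full printout versus a printout limited to the top levels of the tree
full_text = export_text(first_tree, feature_names=names)
short_text = export_text(first_tree, feature_names=names, max_depth=1)
print(short_text)
```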


## Summary

This notebook has demonstrated:
1. Loading and preprocessing insurance data
2. Auto-detecting target variables and features
3. Training/loading a Random Forest model
4. Generating SHAP explanations with visualizations
5. Extracting a readable decision tree snippet

The generated artifacts can be found in the `explainability_demo/` folder:
- `shap_summary.png`: Global feature importance visualization
- `shap_force.png`: Bar-chart explanation for a single prediction (static alternative to a force plot)
- `sample_decision_tree_snippet.txt`: Human-readable decision rules
