# Task 4: Statistical Modeling and Evaluation

In this notebook, we build and evaluate predictive models for a dynamic, risk-based pricing system.

## Objectives
1.  **Claim Severity Prediction (Regression)**: Predict `TotalClaims` for policies with a claim.
2.  **Claim Probability Prediction (Classification)**: Predict the likelihood (`IsClaim`) of a claim occurring.
3.  **Interpretability**: Use SHAP values to explain the key drivers of risk.

In [None]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap

# Ensure src modules are importable
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

from src.data.loader import load_data
from src.features.build_features import DataBuilder
from src.models.train_model import ModelTrainer

# Output Setup
sns.set_style("whitegrid")
%matplotlib inline

## 1. Data Loading & Preparation
We load the raw data and pass it through our `DataBuilder` pipeline.

In [None]:
DATA_PATH = '../data/raw/MachineLearningRating.txt'
if not os.path.exists(DATA_PATH):
    DATA_PATH = 'data/raw/MachineLearningRating.txt'

df_raw = load_data(DATA_PATH)
print(f"Raw Data Shape: {df_raw.shape}")

In [None]:
# Initialize and Run Preprocessing Pipeline
builder = DataBuilder(df_raw)
df_processed = builder.preprocess()

print(f"Processed Data Shape: {df_processed.shape}")
df_processed.head()

## 2. Model 1: Claim Severity (Regression)
**Goal**: Predict `TotalClaims` amount using only positive claim cases.
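
As a rough illustration of what "positive claim cases" means here, the sketch below filters a toy frame down to rows with `TotalClaims > 0` and separates the target. The helper `select_severity_rows` and the `SumInsured` column are hypothetical stand-ins for illustration; the actual logic lives in `builder.get_severity_data()` and may differ.

```python
import pandas as pd

def select_severity_rows(df: pd.DataFrame, target: str = "TotalClaims"):
    """Keep only rows with a positive claim amount and split off the target."""
    claims = df[df[target] > 0]          # severity is only defined when a claim exists
    X = claims.drop(columns=[target])    # remaining columns act as features
    y = claims[target]
    return X, y

# Toy data for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "SumInsured": [100_000, 250_000, 80_000, 120_000],
    "TotalClaims": [0.0, 5_400.0, 0.0, 1_250.0],
})

X_toy, y_toy = select_severity_rows(toy)
print(X_toy.shape, y_toy.tolist())
```

Training on this subset means the regression estimates the claim amount *given that* a claim occurred, which is why it is later paired with a separate probability model.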

In [None]:
# Get specific dataset for Severity
X_severity, y_severity = builder.get_severity_data()
print(f"Severity Dataset: {X_severity.shape}")

# Split
X_train_sev, X_test_sev, y_train_sev, y_test_sev = builder.split_data(X_severity, y_severity)

# Initialize Trainer
trainer = ModelTrainer()

# Train Regression Models
trainer.train_severity_models(X_train_sev, X_test_sev, y_train_sev, y_test_sev)

## 3. Model 2: Claim Probability (Classification)
**Goal**: Predict binary probability of a claim occurring (`IsClaim=1`).
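
For orientation, a binary target like `IsClaim` can be derived from the claim amount as sketched below. This assumes `IsClaim` is 1 whenever `TotalClaims` is positive; the helper `add_claim_flag` is hypothetical, and the real derivation happens inside `DataBuilder`.

```python
import pandas as pd

def add_claim_flag(df: pd.DataFrame, amount_col: str = "TotalClaims",
                   flag_col: str = "IsClaim") -> pd.DataFrame:
    """Return a copy of df with a 0/1 flag marking rows that have a claim."""
    out = df.copy()                                   # leave the input untouched
    out[flag_col] = (out[amount_col] > 0).astype(int)
    return out

# Toy data for illustration only
toy = pd.DataFrame({"TotalClaims": [0.0, 0.0, 320.0, 0.0, 1500.0]})

flagged = add_claim_flag(toy)
claim_rate = flagged["IsClaim"].mean()                # share of policies with a claim
print(flagged["IsClaim"].tolist(), claim_rate)
```

Checking the claim rate before training is worthwhile: if claims are rare, accuracy alone can look good for a model that never predicts a claim.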

In [None]:
# Get specific dataset for Probability
X_prob, y_prob = builder.get_probability_data()
print(f"Probability Dataset: {X_prob.shape}")

# Split
X_train_prob, X_test_prob, y_train_prob, y_test_prob = builder.split_data(X_prob, y_prob)

# Train Classification Models
trainer.train_probability_models(X_train_prob, X_test_prob, y_train_prob, y_test_prob)

## 4. Model Evaluation & Comparison
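
The regression models in this section are compared on RMSE. As a reference for how that metric (and the closely related R²) is computed, here is a minimal NumPy sketch. The function names `rmse` and `r2` are ours, and `ModelTrainer` may compute its metrics differently, for example via scikit-learn.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error: typical prediction error in the target's units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - residual variation / total variation.

    Assumes y_true is not constant (otherwise the denominator is zero).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy values for illustration only
actual = [1000.0, 2000.0, 3000.0]
predicted = [1100.0, 1900.0, 3200.0]
print(f"RMSE: {rmse(actual, predicted):.2f}, R2: {r2(actual, predicted):.2f}")
```

Because RMSE squares the errors, a few very large claims can dominate it, which is worth keeping in mind when reading the comparison chart.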

In [None]:
results = trainer.get_results()
display(results)

# Visualize RMSE for Regression
plt.figure(figsize=(10, 5))
reg_res = results[results.index.str.contains("Severity")]
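# Assign hue explicitly: seaborn >= 0.13 deprecates passing palette without hue
sns.barplot(x=reg_res.index, y=reg_res['RMSE'], hue=reg_res.index, palette='viridis', legend=False)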
plt.title("Claim Severity Model Comparison (RMSE - Lower is Better)")
plt.ylabel("RMSE")
plt.xticks(rotation=45)
plt.show()

## 5. Model Interpretability (SHAP)
We analyse the best-performing model (typically XGBoost) with SHAP to see which features drive its predictions.
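
The bar-style SHAP summary ranks features by their mean absolute SHAP value. The sketch below reproduces that ranking with plain NumPy on a small made-up matrix, so the aggregation is visible without fitting a model. The function `global_importance`, the feature names, and the numbers are all illustrative assumptions.

```python
import numpy as np

def global_importance(shap_values, feature_names):
    """Rank features by mean |SHAP value| across rows, largest first."""
    mean_abs = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    order = np.argsort(mean_abs)[::-1]               # indices sorted descending
    return [(feature_names[i], float(mean_abs[i])) for i in order]

# Made-up SHAP matrix: rows are observations, columns are features
toy_shap = np.array([
    [0.20, -0.10, 0.00],
    [-0.40, 0.10, 0.05],
    [0.30, -0.30, -0.05],
])
toy_features = ["feature_a", "feature_b", "feature_c"]

ranking = global_importance(toy_shap, toy_features)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking absolute values before averaging matters: a feature that pushes some predictions up and others down would otherwise cancel out and look unimportant.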

In [None]:
def interpret_model(model, X_train, model_name):
    """Plot SHAP summaries for a tree model; fall back to built-in importances."""
    print(f"Interpreting {model_name} with SHAP...")
    try:
        # TreeExplainer supports tree ensembles such as XGBoost and Random Forest
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X_train)

        # show=False defers rendering so the title is set on the SHAP figure
        shap.summary_plot(shap_values, X_train, plot_type="bar", show=False)
        plt.title(f"SHAP Summary: {model_name}")
        plt.show()

        print("Detailed Feature Impact:")
        shap.summary_plot(shap_values, X_train)
    except Exception as e:
        print(f"SHAP failed: {e}")
        # Fall back to the model's built-in feature importances
        if hasattr(model, 'feature_importances_'):
            feat_importances = pd.Series(model.feature_importances_, index=X_train.columns)
            feat_importances.nlargest(10).plot(kind='barh')
            plt.title("Feature Importance (Fallback)")
            plt.show()

# Interpret a tree-based probability model: prefer XGBoost, fall back to Random Forest
if 'Probability_XGB' in trainer.models:
    interpret_model(trainer.models['Probability_XGB'], X_train_prob, "XGBoost Classification")
elif 'Probability_RF' in trainer.models:
    interpret_model(trainer.models['Probability_RF'], X_train_prob, "RandomForest Classification")

## 6. Conclusion
The analysis provides a two-part framework for pricing:
1.  **Risk Probability**: Identifies the drivers of *claim frequency*.
2.  **Risk Severity**: Identifies the drivers of *cost* when a claim occurs.

By combining these, we can formulate a risk-based premium:
> **Premium** = (P(Claim) * E[Claim Cost | Claim]) + Margin
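
To make the formula concrete, here is a minimal sketch that combines per-policy outputs of the two models into a premium. The function `risk_based_premium`, the flat additive margin, and the example numbers are assumptions for illustration; in practice the inputs would be the classifier's predicted probability and the regressor's predicted claim amount.

```python
import numpy as np

def risk_based_premium(p_claim, expected_severity, margin):
    """Premium = P(claim) * expected claim cost + margin, element-wise per policy."""
    p = np.asarray(p_claim, dtype=float)
    sev = np.asarray(expected_severity, dtype=float)
    if np.any((p < 0) | (p > 1)):
        raise ValueError("claim probabilities must lie in [0, 1]")
    return p * sev + margin

# Illustrative values only: two policies with different risk profiles
p_hat = [0.02, 0.10]           # predicted claim probabilities
sev_hat = [10_000.0, 5_000.0]  # predicted claim cost if a claim occurs
premiums = risk_based_premium(p_hat, sev_hat, margin=50.0)
print(premiums)
```

The first policy has the larger potential claim but the lower probability, so its expected loss (and premium) ends up smaller than the second policy's.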