[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ObjectMatrix/Automotive-Physical-Damage-Claimss/blob/main/courseraSupervised_adjusted.ipynb)

# Insurance Claim Prediction using Supervised Machine Learning

## Main Objective of the Analysis
The primary objective is to develop a predictive model to determine whether a policyholder will file an insurance claim (`insuranceclaim`), enabling the insurance company to assess risk and adjust premiums proactively. The focus is on **prediction** to optimize financial planning and reduce losses from unexpected claims, while providing **interpretability** to uncover key claim drivers for stakeholders. Benefits include enhanced risk management, improved pricing accuracy, and targeted interventions.

## Brief Description of the Data Set
The dataset (`Insurance.csv`) contains **1,338 records** of insurance policyholders with **8 attributes**:  
- `age`: Age of the policyholder (numeric)  
- `sex`: Gender (0=female, 1=male)  
- `bmi`: Body Mass Index (numeric)  
- `children`: Number of children (numeric)  
- `smoker`: Smoking status (0=no, 1=yes)  
- `region`: Geographic region (0-3)  
- `charges`: Insurance charges (numeric)  
- `insuranceclaim`: Target variable (0=no claim, 1=claim)  

The goal is to predict `insuranceclaim` and identify key predictors of claim likelihood.
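
The attribute list above doubles as a simple schema. As a minimal sketch, the coded columns can be checked against their documented values before modeling. The helper `validate_row` and the non-negativity rule for numeric columns are assumptions made for illustration, and the sample records are made up rather than taken from `Insurance.csv`.

```python
# Allowed codes per column, taken from the attribute list above
CODED_COLUMNS = {
    "sex": {0, 1},
    "smoker": {0, 1},
    "region": {0, 1, 2, 3},
    "insuranceclaim": {0, 1},
}
# Assumption: numeric columns should at least be non-negative
NUMERIC_COLUMNS = ["age", "bmi", "children", "charges"]


def validate_row(row):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for col in list(CODED_COLUMNS) + NUMERIC_COLUMNS:
        if col not in row:
            problems.append(f"missing column: {col}")
    for col, allowed in CODED_COLUMNS.items():
        if col in row and row[col] not in allowed:
            problems.append(f"{col}={row[col]} not in {sorted(allowed)}")
    for col in NUMERIC_COLUMNS:
        if col in row and row[col] < 0:
            problems.append(f"{col}={row[col]} is negative")
    return problems


# Made-up records for illustration
good = {"age": 40, "sex": 1, "bmi": 27.5, "children": 2, "smoker": 0,
        "region": 3, "charges": 6500.0, "insuranceclaim": 0}
bad = dict(good, region=7, bmi=-1.0)

print(validate_row(good))
print(validate_row(bad))
```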

In [None]:
# Install required libraries
!pip install -q pandas numpy seaborn matplotlib scikit-learn

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score

# Load data
url = 'https://raw.githubusercontent.com/ObjectMatrix/Automotive-Physical-Damage-Claimss/main/Insurance.csv'
df = pd.read_csv(url)

# Basic EDA
print('First 5 rows:')
print(df.head())
print('\nData Info:')
df.info()  # info() prints its report directly and returns None
print('\nData Description:')
print(df.describe())

## Brief Summary of Data Exploration and Actions Taken
Exploration found no missing values in the dataset. `charges` was capped at its 99th percentile to limit the influence of extreme values. A new feature, `charges_per_child` (`charges` / (`children` + 1)), was engineered to approximate charges per household member (the policyholder plus children); the `+ 1` avoids division by zero for policyholders with no children. All feature columns, including the encoded `sex`, `smoker`, and `region`, were standardized with StandardScaler. Visualizations suggested `smoker` and `charges` as potential claim predictors.
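
To make the capping and the engineered feature concrete, here is a small worked sketch in plain Python. The helper `quantile_linear` is a hypothetical stand-in that uses linear interpolation between order statistics (the default behavior of pandas `Series.quantile`), and the values are toy numbers, not rows from `Insurance.csv`.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


# Toy values chosen so the effect of the cap is easy to see
charges = [1000.0, 2000.0, 3000.0, 4000.0, 50000.0]
children = [0, 1, 2, 3, 0]

# Cap charges at the 99th percentile (only the extreme value is pulled down)
cap = quantile_linear(charges, 0.99)
capped = [min(c, cap) for c in charges]

# charges / (children + 1): the +1 keeps the denominator non-zero
charges_per_child = [c / (k + 1) for c, k in zip(capped, children)]

print(f"cap = {cap:.2f}")
print(charges_per_child)
```

With only five values the 99th percentile sits close to the maximum, so the cap trims little; on the full dataset it affects roughly the top 1% of records.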

In [None]:
# Check for missing values
print('Missing Values:')
print(df.isnull().sum())

# Handle outliers
df['charges'] = df['charges'].clip(upper=df['charges'].quantile(0.99))

# Feature engineering
df['charges_per_child'] = df['charges'] / (df['children'] + 1)

# Visualizations
plt.figure(figsize=(8, 6))
sns.histplot(df['charges'], kde=True)
plt.title('Distribution of Insurance Charges')
plt.show()

plt.figure(figsize=(8, 6))
sns.boxplot(x='smoker', y='charges', hue='insuranceclaim', data=df)
plt.title('Charges vs. Smoker Status by Insurance Claim')
plt.show()

# Prepare features and target
X = df.drop('insuranceclaim', axis=1)
y = df['insuranceclaim']

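# Train-test split (stratified so both sets keep the same claim rate)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features: fit on the training set only to avoid test-set leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Scaled copy of the full data for later cells that use all rows
X_scaled = scaler.transform(X)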

## Summary of Training Classifier Models
Three classifier models were trained using an 80/20 train-test split and 5-fold cross-validation:
- **Logistic Regression**: Baseline model with high interpretability.
- **Random Forest**: Ensemble model (100 trees) capturing complex interactions.
- **Gradient Boosting**: Boosted ensemble configured with learning_rate=0.1 and max_depth=4, aiming for high accuracy.

Models were evaluated on accuracy and F1-score to balance precision and recall.
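
As a worked illustration of the two metrics, the sketch below computes accuracy, precision, recall, and F1 from confusion counts in plain Python. The labels and predictions are made up for the example and are not model output from this notebook.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = claim)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Made-up labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # share of predicted claims that were real
recall = tp / (tp + fn)      # share of real claims that were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, it drops sharply when either one is low, which accuracy alone would not reveal.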

In [None]:
# Define models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(learning_rate=0.1, max_depth=4, random_state=42)
}

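# Train and evaluate
from sklearn.pipeline import make_pipeline

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    # Cross-validate on unscaled features with the scaler inside the pipeline,
    # so each fold fits its scaler on that fold's training portion only.
    # cross_val_score clones the estimator, so `model` stays fitted on X_train.
    cv_model = make_pipeline(StandardScaler(), model)
    cv_scores = cross_val_score(cv_model, X, y, cv=5, scoring='accuracy')
    results[name] = {'Accuracy': acc, 'F1-Score': f1, 'CV Accuracy': cv_scores.mean()}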
    print(f'{name}:')
    print(f'  Accuracy = {acc:.2f}')
    print(f'  F1-Score = {f1:.2f}')
    print(f'  CV Accuracy = {cv_scores.mean():.2f} ± {cv_scores.std():.2f}')
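
The 5-fold cross-validation used in the cell above can be pictured with a simplified index splitter. This is a sketch only: the helper `k_fold_indices` is hypothetical and uses plain contiguous folds, whereas scikit-learn uses stratified folds when `cv=5` is passed with a classifier.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# 1,338 records split into 5 folds, matching the dataset size stated earlier
folds = list(k_fold_indices(n_samples=1338, k=5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={len(train_idx)} validation={len(val_idx)}")
```

Each record is used for validation exactly once and for training in the other four folds, which is why the mean and standard deviation of the fold scores give a steadier estimate than a single split.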

## Recommended Final Model
The **Gradient Boosting** model is recommended: it achieved the best accuracy and F1-score of the three models and exposes feature importances (e.g., for `smoker` and `charges`). This combination of predictive power and explainability suits risk assessment and pricing optimization.

In [None]:
# Feature importance for Gradient Boosting
gb_model = models['Gradient Boosting']
importances = pd.Series(gb_model.feature_importances_, index=X.columns)
importances.sort_values().plot(kind='barh', figsize=(8, 6))
plt.title('Feature Importance (Gradient Boosting)')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

## Summary of Key Findings and Insights
The Gradient Boosting model identified `smoker`, `charges`, and `age` as key drivers of `insuranceclaim`. Smokers were markedly more likely to file claims, and high `charges` were also strongly associated with claim likelihood. Older policyholders showed a greater tendency to claim, possibly due to health-related risks. These insights suggest prioritizing smoking status and premium levels in risk management.
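
Feature importances from a tree ensemble rank features by how much they contribute to the splits, but they carry no sign, so the direction of an effect has to be checked separately. One simple check is the claim rate per group, sketched below. The helper `claim_rate_by_group` is hypothetical and the `(smoker, insuranceclaim)` pairs are made up, not results from this dataset.

```python
from collections import defaultdict


def claim_rate_by_group(pairs):
    """Return the share of claims for each group value."""
    totals = defaultdict(int)
    claims = defaultdict(int)
    for group, claimed in pairs:
        totals[group] += 1
        claims[group] += claimed
    return {group: claims[group] / totals[group] for group in totals}


# Made-up (smoker, insuranceclaim) pairs for illustration
records = [(1, 1), (1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0), (0, 0)]

rates = claim_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"smoker={group}: claim rate = {rate:.2f}")
```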

## Suggestions for Next Steps
- **Add Features**: Include health data (e.g., medical history) or regional risk factors to improve predictions.
- **Address Imbalance**: Use SMOTE if `insuranceclaim` imbalance impacts performance.
- **Deploy Model**: Pilot the Gradient Boosting model in pricing strategies to assess real-world efficacy.
- **Explore Alternatives**: Test neural networks for potentially higher accuracy, though with reduced interpretability.
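
Before reaching for SMOTE (available in the separate `imbalanced-learn` package), the imbalance step above could start by measuring how skewed the target actually is. The sketch below is one way to do that; the helper `class_proportions` is hypothetical and the labels are made up rather than read from `insuranceclaim`.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each class in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Made-up target values: 6 claims and 4 non-claims
labels = [1] * 6 + [0] * 4

props = class_proportions(labels)
# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = max(props.values()) / min(props.values())

print(props)
print(f"imbalance ratio = {imbalance_ratio:.2f}")
```

A ratio close to 1 suggests resampling is unlikely to help much, while a large ratio is a signal to compare F1 with and without rebalancing.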