# Model Training for Insurance Fraud Detection

This notebook:
- Loads the engineered dataset
- Splits into training and test sets
- Trains multiple classification models
- Evaluates performance using precision, recall, and F1-score
- Identifies the best-performing model

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

## Step 1: Load Engineered Feature Dataset

In [None]:
import os

project_dir = r"C:\Users\Cloud\OneDrive\Desktop\Fraud_Analytics_Project"
feature_file = os.path.join(project_dir, "data", "features", "engineered_insurance_claims.csv")

df = pd.read_csv(feature_file)
print("✅ Feature data loaded. Shape:", df.shape)
df.head()

## Step 2: Split Features and Target Variable

- `X`: Feature columns
- `y`: Target (`fraud_reported`)

In [None]:
X = df.drop(columns=['fraud_reported'])
y = df['fraud_reported']

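# Train-test split first, so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scale features: fit on the training set only to avoid leaking test-set statistics
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)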

print("✅ Split complete:")
print("Train size:", X_train.shape)
print("Test size :", X_test.shape)
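One way to keep scaling tied to the training data is to wrap the scaler and the classifier in a scikit-learn pipeline. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the engineered claims dataset, and the `*_demo` names are hypothetical so they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered claims data (assumption: numeric
# features and a binary target with a minority positive class).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=42
)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# fit() learns the scaling statistics from the training rows only;
# predict() reuses those statistics to transform the test rows.
pipe_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_demo.fit(X_tr_demo, y_tr_demo)

pred_demo = pipe_demo.predict(X_te_demo)
print("Test rows scored:", pred_demo.shape[0])
```

A pipeline like this can also be saved as a single object, so the scaler and the model stay together at prediction time.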

## Step 3: Train Models

We will train:
- Logistic Regression (baseline)
- Random Forest Classifier (often a strong performer on tabular data)

### Train Logistic Regression

In [None]:
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

y_pred_lr = logreg.predict(X_test)

print("📋 Logistic Regression Report:")
print(classification_report(y_test, y_pred_lr))

### Train Random Forest

In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

print("📋 Random Forest Report:")
print(classification_report(y_test, y_pred_rf))

## Step 4: Confusion Matrices

Visualize confusion matrices for both models.

In [None]:
def plot_cm(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(title)
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()

plot_cm(y_test, y_pred_lr, "Logistic Regression Confusion Matrix")
plot_cm(y_test, y_pred_rf, "Random Forest Confusion Matrix")
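To read the heatmaps numerically, the four cells of a binary confusion matrix can be unpacked into named counts. The sketch below uses a small hand-made set of labels (hypothetical, with 1 = fraud and 0 = not fraud) rather than the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fraud, 0 = not fraud.
y_actual_cm = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_predicted_cm = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# With labels=[0, 1], rows are actual classes and columns are predicted
# classes, so ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(
    y_actual_cm, y_predicted_cm, labels=[0, 1]
).ravel()

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # prints "TN=5 FP=1 FN=1 TP=3"
```

In a fraud setting, `fn` (fraud predicted as legitimate) is usually the count to watch most closely.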

## Step 5: Compare and Choose Best Model

F1-score is the main comparison metric: fraud cases are the minority class, so accuracy alone can look strong even when most fraud goes undetected.
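A small worked example shows why accuracy can mislead on imbalanced labels. F1 is the harmonic mean of precision P and recall R, that is, F1 = 2PR / (P + R). The labels below are hypothetical (90 legitimate claims, 10 fraudulent) and are not taken from the project data.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical test set: 90 legitimate claims (0) and 10 fraudulent ones (1).
y_true_imb = [0] * 90 + [1] * 10

# A model that never flags fraud...
y_pred_never = [0] * 100
# ...and one that catches 6 of the 10 frauds at the cost of 4 false alarms.
y_pred_some = [0] * 86 + [1] * 4 + [1] * 6 + [0] * 4

acc_never = accuracy_score(y_true_imb, y_pred_never)
f1_never = f1_score(y_true_imb, y_pred_never, zero_division=0)
acc_some = accuracy_score(y_true_imb, y_pred_some)
f1_some = f1_score(y_true_imb, y_pred_some)

print(f"never flags: accuracy={acc_never:.2f}, F1={f1_never:.2f}")
print(f"flags some : accuracy={acc_some:.2f}, F1={f1_some:.2f}")
```

The first model reaches 0.90 accuracy while catching no fraud at all, which F1 exposes with a score of 0; the second has similar accuracy (0.92) but an F1 of 0.60.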


In [None]:
from sklearn.metrics import f1_score

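# f1_score scores the positive class (label 1 by default); this assumes
# fraud_reported is encoded as 0/1, otherwise pass pos_label explicitly
f1_lr = f1_score(y_test, y_pred_lr)
f1_rf = f1_score(y_test, y_pred_rf)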

print(f"F1 Score - Logistic Regression: {f1_lr:.4f}")
print(f"F1 Score - Random Forest      : {f1_rf:.4f}")
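The comparison can also be made programmatically rather than by reading the printed scores. The sketch below is a minimal illustration: the dictionary values are placeholders, not results from this notebook, and in practice they would be `f1_lr` and `f1_rf`.

```python
# Placeholder scores for illustration only (assumption); in the notebook
# these would be the computed f1_lr and f1_rf values.
f1_by_model = {"Logistic Regression": 0.50, "Random Forest": 0.75}

# Pick the model name whose F1 score is highest.
best_name = max(f1_by_model, key=f1_by_model.get)
print(f"Best model by F1: {best_name} ({f1_by_model[best_name]:.4f})")
```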

## Step 6: Save Best Model

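Save the Random Forest model using `joblib` for future deployment. Because the model was trained on standardized features, the fitted scaler is saved alongside it.


In [None]:
import joblib

model_dir = os.path.join(project_dir, "models")
os.makedirs(model_dir, exist_ok=True)

joblib.dump(rf, os.path.join(model_dir, "fraud_model_rf.joblib"))
# The scaler file name below is a suggested convention, not an existing artifact
joblib.dump(scaler, os.path.join(model_dir, "fraud_scaler.joblib"))
print("✅ Random Forest model and scaler saved.")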

# Model Training Complete

✅ Trained and evaluated:
- Logistic Regression
- Random Forest

Best Model: Random Forest  
Saved to: `models/fraud_model_rf.joblib`

---

Next Steps:
- Visualize insights in `reporting_dashboard.ipynb`
- Use the model for real-time or batch predictions
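For the batch-prediction step, a saved model can be reloaded with `joblib.load` and applied to new rows. The sketch below is a self-contained round trip under stated assumptions: it trains a small stand-in model on synthetic data and writes it to a temporary directory, whereas the notebook would load `fraud_model_rf.joblib`. All `*_batch` and `demo_*` names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data and model (assumption); not the project's dataset or model.
X_batch_src, y_batch_src = make_classification(
    n_samples=200, n_features=6, random_state=42
)
demo_model = RandomForestClassifier(n_estimators=20, random_state=42)
demo_model.fit(X_batch_src, y_batch_src)

# Save and reload the model, as a deployment script would.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(demo_model, demo_path)
    loaded_model = joblib.load(demo_path)

# Score a batch of rows. New data must have the same columns, in the same
# order and with the same preprocessing, as the training data.
new_batch = X_batch_src[:5]
batch_pred = loaded_model.predict(new_batch)
batch_proba = loaded_model.predict_proba(new_batch)[:, 1]

print("Predicted labels:", batch_pred)
print("Fraud probabilities:", batch_proba.round(2))
```

Since this notebook trains on standardized features, real inputs would need the same scaling applied before calling `predict`.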