# Step 6: Encoding & Modeling

In this step, the feature-engineered dataset is converted into a model-ready format using appropriate encoding techniques.
Baseline and tree-based models are trained and evaluated with a focus on handling class imbalance and maximizing fraud detection performance.


### Load Feature‑Engineered Raw Data

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/processed/feature_engineered_raw.csv")
df.shape


(15420, 23)

### Separate Features & Target

In [2]:
X = df.drop('FraudFound_P', axis=1)
y = df['FraudFound_P']

X.shape, y.shape

((15420, 22), (15420,))

### Encode Categorical Features

In [3]:
X_encoded = pd.get_dummies(X, drop_first=True)
X_encoded.shape


(15420, 92)
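
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The rows and the `Deductible` column are illustrative assumptions, not values from the project data. Each categorical column produces one indicator column fewer than it has categories, and numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "BasePolicy": ["Liability", "Collision", "All Perils", "Liability"],
    "Deductible": [400, 500, 400, 700],
})

encoded = pd.get_dummies(toy, drop_first=True)

# Categories are sorted alphabetically, so "All Perils" is the dropped
# reference level and is represented by zeros in both indicator columns.
print(list(encoded.columns))
print(encoded.shape)
```

A row whose indicator columns are all zero therefore belongs to the dropped reference category, which is worth remembering when reading coefficients or importances.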

### Train–Test Split (Stratified)

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

X_train.shape, X_test.shape


((11565, 92), (3855, 92))
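
A quick way to confirm that `stratify` did its job is to compare the positive rate in both partitions. The sketch below uses synthetic labels with a 6% positive rate, close to the fraud rate implied by the test-set support reported later (231 of 3855). The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 60 positives out of 1000 (6%), for illustration only.
y_demo = np.array([1] * 60 + [0] * 940)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratification the positive rate should be near-identical in both parts.
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

In the notebook itself, the same check would be `y_train.mean()` against `y_test.mean()`.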

### Evaluation Helper Function

In [5]:
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    print("ROC-AUC:", roc_auc_score(y_test, y_prob))


### Baseline Model: Logistic Regression

In [6]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    max_iter=1000,
    class_weight='balanced',
    random_state=42
)

log_reg.fit(X_train, y_train)

evaluate_model(log_reg, X_test, y_test)


Confusion Matrix:
[[2240 1384]
 [  27  204]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.62      0.76      3624
           1       0.13      0.88      0.22       231

    accuracy                           0.63      3855
   macro avg       0.56      0.75      0.49      3855
weighted avg       0.94      0.63      0.73      3855

ROC-AUC: 0.7990978851906004


### 6.6 Baseline Model – Logistic Regression

Logistic Regression was trained as a baseline model using class weighting to address class imbalance.
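
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula on synthetic labels with a 6% positive rate (an assumption for illustration; the training-set class counts are not shown in this notebook) and compares it with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 940 negatives and 60 positives, for illustration only.
y_demo = np.array([0] * 940 + [1] * 60)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

# Manual version of the same formula: n_samples / (n_classes * class_count).
manual = len(y_demo) / (2 * np.bincount(y_demo))

print(weights)
print(manual)
```

With these counts the minority class receives a weight roughly 16 times that of the majority class, which is why the weighted model leans so heavily toward predicting fraud.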

**Observations:**
- The model achieved high recall for fraudulent claims (0.88), capturing 204 of the 231 fraud cases in the test set.
- Precision for the fraud class was very low (0.13): 1,384 genuine claims were flagged as fraud.
- Overall accuracy is low (0.63) because class weighting deliberately trades false positives for fraud recall.
- ROC-AUC of about 0.80 indicates reasonable class separation ability.

**Conclusion:**
Logistic Regression serves as a recall-oriented baseline model. While it is effective at identifying potential fraud cases, it lacks precision and is not suitable as a final production model.
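
The figures in the classification report follow directly from the confusion matrix printed above. As a quick arithmetic check, this sketch recomputes the fraud-class metrics from the four Logistic Regression cell counts.

```python
# Logistic Regression confusion matrix from the output above
# (rows are true classes, columns are predicted classes).
tn, fp = 2240, 1384
fn, tp = 27, 204

precision = tp / (tp + fp)   # share of flagged claims that are really fraud
recall = tp / (tp + fn)      # share of fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.13 recall=0.88 f1=0.22 accuracy=0.63
```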


### Primary Model: Random Forest

In [7]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    min_samples_split=10,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train, y_train)

evaluate_model(rf_model, X_test, y_test)


Confusion Matrix:
[[3546   78]
 [ 211   20]]

Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.98      0.96      3624
           1       0.20      0.09      0.12       231

    accuracy                           0.93      3855
   macro avg       0.57      0.53      0.54      3855
weighted avg       0.90      0.93      0.91      3855

ROC-AUC: 0.8038903701155357


### Feature Importance (Random Forest)

In [8]:
import pandas as pd

feature_importance = pd.DataFrame({
    'Feature': X_encoded.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

feature_importance.head(15)


| | Feature | Importance |
|---|---|---|
| 4 | Policyholder_At_Fault | 0.140868 |
| 48 | PolicyType_Sedan - Liability | 0.065568 |
| 85 | BasePolicy_Liability | 0.063422 |
| 55 | VehicleCategory_Sport | 0.04628 |
| 0 | DriverRating | 0.039106 |
| 2 | Repeat_Claimant | 0.019387 |
| 86 | Deductible_Bin_Medium | 0.017359 |
| 57 | VehiclePrice_30000 to 39000 | 0.016211 |
| 83 | NumberOfSuppliments_none | 0.015487 |
| 21 | Make_Toyota | 0.01532 |


In [9]:
feature_importance.to_csv(
    "../reports/feature_importance.csv",
    index=False
)


### 6.7 Primary Model – Random Forest

A Random Forest classifier was trained as the primary model to capture non-linear relationships and interactions between engineered features.

**Observations:**
- The model achieved high overall accuracy (0.93), driven almost entirely by correct classification of genuine claims (3,546 of 3,624).
- Fraud recall is very low (0.09): at the default decision threshold only 20 of the 231 fraud cases are flagged.
- Precision for fraud (0.20) is higher than for Logistic Regression (0.13) but still limited.
- ROC-AUC (0.804) is only marginally above the baseline (0.799), so the two models rank claims about equally well.

**Feature Importance Analysis:**
Top contributing features include:
- Policyholder_At_Fault
- PolicyType and BasePolicy attributes
- VehicleCategory
- DriverRating
- Repeat_Claimant

These features align well with domain expectations and EDA findings.

**Conclusion:**
The Random Forest model ranks claims reasonably well (ROC-AUC of about 0.80), but its default predictions miss most fraud cases. Threshold tuning or cost-sensitive optimization is required to improve fraud recall.
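
One possible shape for the threshold tuning mentioned above is sketched here. It runs on synthetic data from `make_classification` rather than the claims data, and `target_recall = 0.80` is a hypothetical requirement, not a figure from this project. The idea is to pick the highest probability threshold that still meets a recall target, using a held-out validation split rather than the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the claims data (about 6% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.94], random_state=42
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
).fit(X_fit, y_fit)
val_prob = model.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds; the final
# point (recall 0, precision 1) has no threshold attached, so drop it.
precision, recall, thresholds = precision_recall_curve(y_val, val_prob)
target_recall = 0.80  # hypothetical business requirement

# Recall is non-increasing as the threshold rises, so the last index that
# still meets the target corresponds to the highest usable threshold.
meets_target = np.where(recall[:-1] >= target_recall)[0]
best = meets_target[-1]
chosen_threshold = thresholds[best]

# Apply the tuned threshold instead of calling model.predict().
y_pred = (val_prob >= chosen_threshold).astype(int)

print(f"threshold={chosen_threshold:.3f} "
      f"precision={precision[best]:.2f} recall={recall[best]:.2f}")
```

Lowering the threshold raises recall at the cost of precision, so the chosen operating point should reflect the relative cost of a missed fraud case versus an unnecessary investigation.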


## Modeling Summary

Two models were evaluated for vehicle insurance fraud detection:

- Logistic Regression provided a recall-oriented baseline (fraud recall 0.88) but suffered from excessive false positives (fraud precision 0.13).
- Random Forest achieved higher overall accuracy (0.93 versus 0.63) and a marginally higher ROC-AUC, but identified only 9% of fraud cases at its default threshold.

Due to the imbalanced nature of the dataset, accuracy alone is not a reliable metric. ROC-AUC and recall–precision trade-offs are more relevant for evaluating fraud detection performance.
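
The point about accuracy can be checked with simple arithmetic on the test-set counts reported above. A trivial classifier that labels every claim as genuine would score higher on accuracy than the Random Forest while detecting no fraud at all.

```python
# Test-set class supports from the classification reports above.
n_genuine, n_fraud = 3624, 231
total = n_genuine + n_fraud

# A trivial classifier that labels every claim as genuine.
majority_accuracy = n_genuine / total

# Random Forest accuracy from its confusion matrix: (TN + TP) / total.
rf_accuracy = (3546 + 20) / total

print(f"always-genuine accuracy: {majority_accuracy:.3f}")  # 0.940
print(f"random forest accuracy:  {rf_accuracy:.3f}")        # 0.925
```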

The Random Forest model was selected as the primary model, with further optimization planned through threshold tuning and recall-focused evaluation.


### Save the Models

In [10]:
import joblib
import os

# Create the models directory if it does not already exist
os.makedirs("../models", exist_ok=True)

# Save Logistic Regression
joblib.dump(
    log_reg,
    "../models/logistic_regression_model.pkl"
)

# Save Random Forest
joblib.dump(
    rf_model,
    "../models/random_forest_model.pkl"
)

print("Models saved successfully.")


Models saved successfully.
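
A saved model can be restored with `joblib.load`. The sketch below is a self-contained round trip on a tiny stand-in model written to a temporary directory. The file name `demo_model.pkl` and the synthetic data are assumptions for illustration; in this notebook the same check would apply to `log_reg` or `rf_model` and the paths under `../models`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model trained on synthetic data, for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored estimator should reproduce the original probabilities.
same = np.allclose(model.predict_proba(X_demo), restored.predict_proba(X_demo))
print(same)
```

When scoring new claims, the input must be encoded into the same columns, in the same order, as `X_encoded.columns`; saving that column list alongside the model is one way to guarantee this.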
