## Insurance Claim Prediction Project Report
______________________________________________

## Project Description

This project predicts which customers are likely to make an insurance claim using machine learning. The goal is to help the client reduce risk and improve claim handling.

We used a LightGBM classifier because it scales well to large datasets, handles missing values natively, and trains quickly with n_jobs=-1, which uses all CPU cores. Missing values are encoded as -1 in the dataset, so every -1 was replaced with NaN to let LightGBM treat it as missing.

The dataset was split into train, validation, and test sets using stratified sampling. The validation set drives early stopping, which helps prevent overfitting. Multiple evaluation metrics were calculated on the test set: ROC AUC, PR AUC, accuracy, precision, recall, F1 score, and MCC.
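The split proportions follow from applying `test_size=0.2` twice, as the training code in this report does. The sketch below works through that arithmetic in plain Python and cross-checks it against row counts reported elsewhere in this report. The variable names are illustrative and not part of the project code.

```python
# Fractions produced by two successive 80/20 splits.
test_size = 0.2   # first split: hold out the test set
val_size = 0.2    # second split: taken from the remaining rows

test_frac = test_size                           # share of all rows used for testing
val_frac = (1 - test_size) * val_size           # share of all rows used for validation
train_frac = (1 - test_size) * (1 - val_size)   # share of all rows left for training

print(f"train={train_frac:.2f} val={val_frac:.2f} test={test_frac:.2f}")

# Cross-check with counts from this report:
# 380935 training rows (LightGBM log) and the confusion-matrix total for the test set.
train_rows = 380935
test_rows = 74498 + 40206 + 1981 + 2358
ratio = train_rows / test_rows
print(f"train/test row ratio: {ratio:.2f} (expected {train_frac / test_frac:.2f})")
```

The result is roughly a 64/16/20 split, and the reported row counts are consistent with it.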

Note: Exploratory Data Analysis (EDA) was not done because the client requested to skip it. The model was therefore trained directly on the raw features.

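## Model Evaluation

All scores are computed on the held-out test set. Accuracy, precision, recall, F1, and MCC use the class labels returned by `model.predict`, which corresponds to a 0.5 probability threshold.

| Metric | Score | Note |
|---|---|---|
| ROC AUC | 0.6369 | Ranking quality of predicted probabilities (0.5 = random) |
| PR AUC | 0.0659 | Very low because claims are rare |
| Accuracy | 0.6456 | Share of all predictions that are correct |
| Precision | 0.0554 | Most predicted claims are false alarms |
| Recall | 0.5434 | About half of actual claims are caught |
| F1 Score | 0.1005 | Harmonic mean of precision and recall |
| MCC | 0.0754 | Close to 0: predictions correlate only weakly with actual claims |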

Confusion Matrix:

True Negatives: 74498

False Positives: 40206

False Negatives: 1981

True Positives: 2358
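
The threshold-based scores in the evaluation table can be recomputed from these four counts alone. The sketch below uses the standard definitions of each metric in plain Python as a consistency check. It is not part of the project code.

```python
import math

# Confusion-matrix counts reported above (test set).
tn, fp, fn, tp = 74498, 40206, 1981, 2358
total = tn + fp + fn + tp

accuracy = (tp + tn) / total                          # correct predictions / all predictions
precision = tp / (tp + fp)                            # true claims among predicted claims
recall = tp / (tp + fn)                               # predicted claims among true claims
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall
mcc = (tp * tn - fp * fn) / math.sqrt(                # Matthews correlation coefficient
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("F1", f1),
    ("MCC", mcc),
]:
    print(f"{name:<10}{value:.4f}")
```

ROC AUC and PR AUC cannot be recovered this way because they depend on the full ranking of predicted probabilities, not on a single threshold.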

In [3]:
import pandas as pd                # data handling
import numpy as np                 # numerical operations
import lightgbm as lgb             # LightGBM gradient boosting model
from sklearn.model_selection import train_test_split   # splitting data into train and test sets

df = pd.read_csv("InsClaim.csv")  # load dataset 

df = df.replace(-1, np.nan)       # -1 marks a missing value; convert it to NaN

X = df.drop(['target', 'id'], axis=1)  # features, with the target and the id column removed
y = df['target']                       # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratified split: 80% train, 20% test
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train  # validation split 
)

model = lgb.LGBMClassifier(
    class_weight='balanced',   # reweight classes to handle imbalance (SMOTE is not used)
    n_estimators=2000,          # upper limit of 2000 trees; early stopping may use fewer
    learning_rate=0.01,         # small learning rate for slow, stable learning
    num_leaves=50,              # maximum number of leaves per tree
    random_state=42,            # fixed seed for reproducible results
    n_jobs=-1                   # use all CPU cores
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],      # validate during training
    eval_metric='auc',              # metric to check
    callbacks=[lgb.early_stopping(stopping_rounds=100)]  # stop if the validation score does not improve for 100 iterations
)

from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    classification_report,
    matthews_corrcoef,
    average_precision_score
)                                # importing all metrics

y_pred = model.predict(X_test)            # predictions 
y_pred_proba = model.predict_proba(X_test)[:, 1]  # predicted probabilities 

print("------- MODEL EVALUATION REPORT -------")  
print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred_proba):.4f}")  
print(f"PR AUC Score:  {average_precision_score(y_test, y_pred_proba):.4f}")  
print(f"Accuracy:      {accuracy_score(y_test, y_pred):.4f}") 
print(f"Precision:     {precision_score(y_test, y_pred):.4f}")  
print(f"Recall:        {recall_score(y_test, y_pred):.4f}")  
print(f"F1 Score:      {f1_score(y_test, y_pred):.4f}")  
print(f"MCC Score:     {matthews_corrcoef(y_test, y_pred):.4f}")  

print("\n------- CONFUSION MATRIX -------")  
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()  
print(f"True Negatives (Correct No-Claim): {tn}")  
print(f"False Positives (False Alarm):     {fp}") 
print(f"False Negatives (Missed Claim):    {fn}") 
print(f"True Positives (Caught Claim):     {tp}")  

print("\n------- CLASSIFICATION REPORT -------")  
print(classification_report(y_test, y_pred))


[LightGBM] [Info] Number of positive: 13884, number of negative: 367051
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.058371 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1367
[LightGBM] [Info] Number of data points in the train set: 380935, number of used features: 57
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[466]	valid_0's auc: 0.640582	valid_0's binary_logloss: 0.635482
------- MODEL EVALUATION REPORT -------
ROC AUC Score: 0.6369
PR AUC Score:  0.0659
Accuracy:      0.6456
Precision:     0.0

## Project Conclusion

The model can detect some customers who may file a claim, but performance is limited due to highly imbalanced data: only about 3.6% of the training rows are claims (13,884 of 380,935).

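ROC AUC of 0.6369 is only modestly above the 0.5 of a random ranking, so the predicted probabilities separate claimants from non-claimants weakly. PR AUC (0.0659) is very low, showing how difficult the rare claims are to identify.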
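
PR AUC is easier to interpret next to the share of positives in the test set, since a ranking with no skill is expected to score an average precision close to that share. A minimal sketch of the comparison, using only the counts and score reported in this document (the name `lift` is illustrative):

```python
# Counts from the test-set confusion matrix and the PR AUC reported in this document.
tn, fp, fn, tp = 74498, 40206, 1981, 2358
pr_auc = 0.0659

positives = tp + fn                 # actual claims in the test set
total = tn + fp + fn + tp           # all test rows
prevalence = positives / total      # approximate PR AUC of an uninformative ranking
lift = pr_auc / prevalence          # how far the model sits above that baseline

print(f"Claim rate in test set: {prevalence:.4f}")
print(f"PR AUC relative to baseline: {lift:.2f}x")
```

On these numbers the model's PR AUC is somewhat less than twice the no-skill baseline, which agrees with the modest ROC AUC.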

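At the default 0.5 threshold, precision is low (0.0554): roughly 1 in 18 customers flagged by the model actually files a claim. Recall is moderate (0.5434), and F1 is low (0.1005). MCC (0.0754) is close to 0, which means the thresholded predictions are only weakly correlated with actual outcomes and confirms the impact of imbalance.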

Even without EDA, the model ranks likely claimants above non-claimants better than chance, which gives a usable, if weak, signal.

Future improvements can include handling data imbalance better, adding more feature information, or feature engineering to improve precision and recall.
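
As a reference point for any alternative imbalance handling, the weighting currently applied by `class_weight='balanced'` can be reproduced by hand. The sketch below assumes the usual "balanced" heuristic, `n_samples / (n_classes * class_count)`, and uses the class counts from the training log. The variable names are illustrative.

```python
# Class counts from the LightGBM training log in this report.
n_pos, n_neg = 13884, 367051
n_samples = n_pos + n_neg
n_classes = 2

# "balanced" heuristic (assumed here): weight_c = n_samples / (n_classes * count_c)
w_pos = n_samples / (n_classes * n_pos)
w_neg = n_samples / (n_classes * n_neg)

# After weighting, both classes contribute the same total weight.
weighted_pos = w_pos * n_pos
weighted_neg = w_neg * n_neg
weighted_pos_share = weighted_pos / (weighted_pos + weighted_neg)

print(f"weight per claim row:    {w_pos:.4f}")
print(f"weight per no-claim row: {w_neg:.4f}")
print(f"weighted share of claims: {weighted_pos_share:.2f}")
```

The weighted share of 0.50 matches the `pavg=0.500000` line in the training log. One consequence is that probabilities from a reweighted model are generally not calibrated to the true claim rate, so the 0.5 cutoff is a choice that can be revisited rather than a natural default.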

Summary: The project is workable and useful for risk management but is limited in performance due to dataset imbalance and lack of feature exploration.