
# Insurance Claims Fraud Detection

This notebook outlines the steps for building a machine learning model to detect fraudulent insurance claims.

## Steps Involved:

1. **Data Preprocessing**: Cleaning and preparing the data for modeling.
2. **Feature Engineering**: Creating new features that can help with fraud detection.
3. **Model Building**: Using different machine learning algorithms to build predictive models.
4. **Model Evaluation**: Evaluating the models to find the best performer.
5. **Hyperparameter Tuning**: Fine-tuning the models for optimal performance.
6. **Handling Imbalanced Data**: Using techniques like SMOTE or class weight adjustments.


In [1]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the dataset
data = pd.read_csv("C:/Users/THembinkosi.Mkhize/Downloads/Explore Data Science course/Integrated Project/Modified_All_Dates_Advanced_Features_Claims_Data.csv")
# Handle missing values (if any); forward fill is left disabled here
# data = data.ffill()

# Encode categorical variables
label_encoders = {}
categorical_columns = data.select_dtypes(include=['object']).columns
for column in categorical_columns:
    label_encoders[column] = LabelEncoder()
    data[column] = label_encoders[column].fit_transform(data[column])

# Split the data into training and testing sets
X = data.drop('fraud_reported', axis=1)
y = data['fraud_reported']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
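
The split above is purely random, so the fraud rate can drift between the training and test sets. A minimal sketch of a stratified split, assuming synthetic stand-in labels with the class counts reported further down in this notebook (753 non-fraud, 247 fraud); `X_demo` and `y_demo` are illustrative names, not part of the original data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels with the same class counts as the dataset (753 / 247).
y_demo = np.array([0] * 753 + [1] * 247)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class proportions nearly equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("train fraud rate:", round(float(y_tr.mean()), 3))
print("test fraud rate: ", round(float(y_te.mean()), 3))
```

Both rates should land close to the overall 24.7% fraud share, which makes test metrics easier to compare across runs.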




In [2]:

# You can use various feature selection techniques here
# For simplicity, we will use all features for this example
selected_features = X_train.columns



In [3]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
model = RandomForestClassifier(random_state=42)

# Train the model
model.fit(X_train[selected_features], y_train)


RandomForestClassifier(random_state=42)

In [4]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions
y_pred = model.predict(X_test[selected_features])

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


[[206  14]
 [ 61  19]]
              precision    recall  f1-score   support

           0       0.77      0.94      0.85       220
           1       0.58      0.24      0.34        80

    accuracy                           0.75       300
   macro avg       0.67      0.59      0.59       300
weighted avg       0.72      0.75      0.71       300
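
The figures in the report follow directly from the confusion matrix. As a quick check, here is a minimal sketch in plain Python that recomputes accuracy and the fraud-class (label 1) precision, recall and F1 from the four counts printed above; the variable names are illustrative.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class.
cm = [[206, 14],
      [61, 19]]

tn, fp = cm[0]  # actual non-fraud: correctly cleared, wrongly flagged
fn, tp = cm[1]  # actual fraud: missed, correctly flagged

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision_fraud = tp / (tp + fp)  # share of flagged claims that are fraud
recall_fraud = tp / (tp + fn)     # share of fraudulent claims that are flagged
f1_fraud = 2 * precision_fraud * recall_fraud / (precision_fraud + recall_fraud)

print(f"accuracy={accuracy:.2f} precision={precision_fraud:.2f} "
      f"recall={recall_fraud:.2f} f1={f1_fraud:.2f}")
# prints: accuracy=0.75 precision=0.58 recall=0.24 f1=0.34
```

Only 19 of the 80 fraudulent claims in the test set are flagged, so the 75% accuracy mostly reflects the majority class.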



In [6]:

# Checking the distribution of values in the 'fraud_reported' column
fraud_reported_distribution = data['fraud_reported'].value_counts()
fraud_reported_distribution


0    753
1    247
Name: fraud_reported, dtype: int64
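
These counts show roughly a 3:1 imbalance, which is what step 6 is meant to address. A minimal sketch, in plain Python, of two numbers worth having at this point: the accuracy obtained by always predicting the majority class, and the per-class weights given by the `n_samples / (n_classes * class_count)` heuristic that scikit-learn documents for `class_weight='balanced'`. In practice the weights would be derived from the training labels only; the full-dataset counts are used here just for illustration.

```python
# Class counts from the value_counts() output above.
counts = {0: 753, 1: 247}
n_samples = sum(counts.values())
n_classes = len(counts)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / n_samples

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Estimators that accept a `class_weight` argument, such as `RandomForestClassifier` and `LogisticRegression`, apply this weighting during training when given `class_weight='balanced'`.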

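The loop in the next cell iterates over a `models` dictionary that is not defined in any cell shown here. A hypothetical reconstruction follows, using the model names that appear in the results. The estimator classes for Naive Bayes and SVM, the `random_state` on the tree, and the `max_iter` value are assumptions, not the author's confirmed settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical model dictionary: display name -> unfitted estimator.
# Hyperparameters are assumed; the original definitions are not shown.
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),  # assumed iteration cap
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),  # assumed Gaussian variant
    "Support Vector Machine (SVM)": SVC(),
}
```

With these settings the scores will not necessarily match the outputs shown in this notebook.
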
In [9]:
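from sklearn.metrics import accuracy_score

# Map any remaining 'Y'/'N' labels to 1/0. The first cell already label-encoded
# the target (the value counts above show 0/1), so this is effectively a no-op.
y = data['fraud_reported'].replace({'Y': 1, 'N': 0})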

# Re-splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Re-checking the distribution of y_train and y_test
y_train_distribution_corrected = y_train.value_counts()
y_test_distribution_corrected = y_test.value_counts()

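# Re-train and evaluate each model with the corrected target variable.
# `models` (a dict mapping a display name to an estimator) is expected to come
# from an earlier cell that is not shown here.
corrected_model_performance = {}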

for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    # Store performance
    corrected_model_performance[name] = {
        "Accuracy": accuracy,
        "Report": report
    }

y_train_distribution_corrected, y_test_distribution_corrected, corrected_model_performance



(0    533
 1    167
 Name: fraud_reported, dtype: int64,
 0    220
 1     80
 Name: fraud_reported, dtype: int64,
 {'Random Forest': {'Accuracy': 0.75,
   'Report': '              precision    recall  f1-score   support\n\n           0       0.77      0.94      0.85       220\n           1       0.58      0.24      0.34        80\n\n    accuracy                           0.75       300\n   macro avg       0.67      0.59      0.59       300\nweighted avg       0.72      0.75      0.71       300\n'},
  'Logistic Regression': {'Accuracy': 0.73,
   'Report': '              precision    recall  f1-score   support\n\n           0       0.73      0.99      0.84       220\n           1       0.33      0.01      0.02        80\n\n    accuracy                           0.73       300\n   macro avg       0.53      0.50      0.43       300\nweighted avg       0.63      0.73      0.62       300\n'},
  'Decision Tree': {'Accuracy': 0.7766666666666666,
   'Report': '              precision    recall 

In [None]:
| Model | Accuracy | Precision, class 0 (non-fraud) | Recall, class 0 | Precision, class 1 (fraud) | Recall, class 1 |
|---|---|---|---|---|---|
| Random Forest | 75% | 77% | 94% | 58% | 24% |
| Logistic Regression | 73% | 73% | 99% | 33% | 1% |
| Decision Tree | 77.7% | 85% | 84% | 58% | 60% |
| Naive Bayes | 67% | 74% | 85% | 31% | 19% |
| Support Vector Machine (SVM) | 73% | 73% | 100% | 0% | 0% |

In [None]:
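# Based on these results, the Decision Tree performs best: it has the highest
# accuracy (77.7%) and the best balance of precision (58%) and recall (60%) on
# the fraud class. Logistic Regression and the SVM label almost every claim as
# non-fraud, so their 73% accuracy only matches the share of non-fraud claims
# in the test set (220 of 300).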
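
To make the "balance between precision and recall" concrete, the models can be ranked by F1 score on the fraud class. A minimal sketch in plain Python, using the rounded precision and recall values from the summary above, so the results are approximate; the helper `f1` is an illustrative name.

```python
# Fraud-class (label 1) precision and recall, copied from the summary above.
fraud_metrics = {
    "Random Forest": (0.58, 0.24),
    "Logistic Regression": (0.33, 0.01),
    "Decision Tree": (0.58, 0.60),
    "Naive Bayes": (0.31, 0.19),
    "Support Vector Machine (SVM)": (0.00, 0.00),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall; taken as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Sort model names from best to worst fraud-class F1.
ranking = sorted(fraud_metrics, key=lambda name: f1(*fraud_metrics[name]), reverse=True)
for name in ranking:
    print(f"{name:30s} fraud F1 = {f1(*fraud_metrics[name]):.2f}")
```

The Decision Tree comes out first at roughly 0.59, well ahead of the Random Forest at roughly 0.34, while the SVM scores 0 because it never predicts fraud.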