## About Dataset
### Vehicle Insurance Fraud Detection
Vehicle insurance fraud involves conspiring to make false or exaggerated claims involving property damage or personal injuries following an accident. Common examples include staged accidents, where fraudsters deliberately “arrange” for accidents to occur; phantom passengers, where people who were not even at the scene of the accident claim to have suffered grievous injury; and false personal injury claims, where injuries are grossly exaggerated.

### About this dataset

This dataset contains vehicle details (attributes, model, accident details, etc.) along with policy details (policy type, tenure, etc.). The goal is to detect whether a claim application is fraudulent; the target column is FraudFound_P. The dataset was obtained from Kaggle (https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection).

**The objective of this project is to predict whether a vehicle insurance claim is legitimate**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
df = pd.read_csv("Vehicle_Insurance_Fraud_Detection.csv")

In [None]:
df.head()

### Checking null values

In [None]:
df.isna().sum()

In [None]:
pd.set_option('display.max_columns', None)  # display all the columns

In [None]:
df.head()

### Dropping columns not required for analysis

In [None]:
df.drop(['Year','Age','PolicyNumber','RepNumber'],axis=1,inplace=True)

In [None]:
df.head()

# Exploratory Data Analysis

### Checking correlation between numerical columns

In [None]:
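# numeric_only=True skips the text columns, which pandas >= 2.0 would otherwise reject
df.corr(numeric_only=True)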

In [None]:
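fig = plt.figure(figsize = (15,10))
ax = fig.gca()
# numeric_only=True skips the text columns, which pandas >= 2.0 would otherwise reject
sns.heatmap(df.corr(numeric_only=True),ax=ax,cmap='coolwarm',annot=True,annot_kws={'fontsize': 16, 'color':'black', 'alpha': 1,
                        'verticalalignment': 'center'})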

## No significant correlation is observed between the numerical variables

### Checking Fraudulent claims on the basis of gender and marital status (EDA on categorical variables)

In [None]:
df1 = df[df['FraudFound_P']==1]
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.histplot(df1['Sex'],ax=axes[0])
sns.histplot(df1['MaritalStatus'],ax=axes[1])

### Checking Fraudulent claims on the basis of Accident "month", "week of month" and "day of week"

In [None]:
df1 = df[df['FraudFound_P']==1]
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
sns.histplot(df1['Month'],ax=axes[0])
sns.histplot(df1['WeekOfMonth'],ax=axes[1])
sns.histplot(df1['DayOfWeek'],ax=axes[2])

### Checking Fraudulent claims on the basis of Make of the vehicle

In [None]:
fig = plt.figure(figsize = (15,10))
ax = fig.gca()
df1 = df[df['FraudFound_P']==1]
df1['Make'].hist(ax=ax)

### Checking Fraudulent claims on the basis of the vehicle category

In [None]:
df1 = df[df['FraudFound_P']==1]
df1['VehicleCategory'].hist()

### Checking Fraudulent claims on the basis of age of the vehicle

In [None]:

fig = plt.figure(figsize = (15,5))
ax = fig.gca()
df1 = df[df['FraudFound_P']==1]
df1['AgeOfVehicle'].hist(ax=ax)

### Checking Fraudulent claims on the basis of Accident area

In [None]:
df1 = df[df['FraudFound_P']==1]
df1['AccidentArea'].hist()

### Checking Fraudulent claims on the basis of Fault

In [None]:

df1 = df[df['FraudFound_P']==1]
df1['Fault'].hist()

### Checking Fraudulent claims on the basis of Policy type

In [None]:
fig = plt.figure(figsize = (15,5))
ax = fig.gca()
df1 = df[df['FraudFound_P']==1]
df1['PolicyType'].hist(ax=ax)

### Checking Fraudulent claims on the basis of vehicle price

In [None]:
fig = plt.figure(figsize = (15,5))
ax = fig.gca()
df1 = df[df['FraudFound_P']==1]
df1['VehiclePrice'].hist(ax=ax)

### Checking Fraudulent claims on the basis of past number of claims

In [None]:
df1 = df[df['FraudFound_P']==1]
df1['PastNumberOfClaims'].hist()

### Checking Fraudulent claims on the basis of Age of policy holder

In [None]:
fig = plt.figure(figsize = (15,5))
ax = fig.gca()
df1 = df[df['FraudFound_P']==1]
df1['AgeOfPolicyHolder'].hist(ax=ax)

### Checking Fraudulent claims on the basis of "police report filed" and "witness present"

In [None]:
df1 = df[df['FraudFound_P']==1]
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.histplot(df1['PoliceReportFiled'],ax=axes[0])
sns.histplot(df1['WitnessPresent'],ax=axes[1])

### Checking Fraudulent claims on the basis of agent type

In [None]:
df1 = df[df['FraudFound_P']==1]
df1['AgentType'].hist()

### Checking Fraudulent claims on the basis of number of supplements

In [None]:
df1 = df[df['FraudFound_P']==1]
df1['NumberOfSuppliments'].hist()

### Checking Fraudulent claims on the basis of AddressChange_Claim

In [None]:
df1 = df[df['FraudFound_P']==1]
df1['AddressChange_Claim'].hist()

### Checking Fraudulent claims on the basis of Number of cars and base policy

In [None]:
df1 = df[df['FraudFound_P']==1]
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.histplot(df1['NumberOfCars'],ax=axes[0])
sns.histplot(df1['BasePolicy'],ax=axes[1])

## The histograms show that fraudulent claims are spread unevenly across the categories of each column, so we will use all the remaining columns as features in our ML algorithms
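The histograms above count fraudulent claims only, so a tall bar may simply reflect a category that is common in the whole dataset. A minimal sketch of comparing the fraud rate per category instead, assuming a small made-up DataFrame (the values are illustrative and not taken from the Kaggle data; only the column names mirror it):

```python
import pandas as pd

# Toy data, invented for illustration; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "FraudFound_P": [1, 0, 0, 0, 1, 0],
})

# Raw counts of fraudulent claims per category (what the histograms show).
fraud_counts = toy[toy["FraudFound_P"] == 1]["Sex"].value_counts()

# Share of claims within each category that are fraudulent.
fraud_rate = toy.groupby("Sex")["FraudFound_P"].mean()

print(fraud_counts)
print(fraud_rate)
```

In this toy frame both categories have one fraudulent claim, yet the rates differ because the categories differ in size.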

## Data Prep

### The target variable is highly imbalanced 

In [None]:
df['FraudFound_P'].value_counts()

### As the dataset is imbalanced, let's use random oversampling. Random oversampling randomly selects examples from the minority class, with replacement, and adds them to the training dataset.

In [None]:
X = df.drop('FraudFound_P',axis=1)
y = df['FraudFound_P']

In [None]:
X.shape

In [None]:
y.shape

In [None]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)

X, y = ros.fit_resample(X, y)

In [None]:
X.shape

In [None]:
y.shape

In [None]:
df=X
df['FraudFound_P']=y

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.3,random_state=0)
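In the cells above, oversampling is applied to the full dataset before the split, so copies of the same minority row can land in both the training and the test set. A minimal sketch of the alternative order (split first, then oversample only the training part), assuming a hypothetical helper `oversample_minority` and toy data invented for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def oversample_minority(frame, target, random_state=0):
    """Hypothetical helper: resample smaller classes with replacement until balanced."""
    counts = frame[target].value_counts()
    majority_size = counts.max()
    parts = []
    for label, size in counts.items():
        rows = frame[frame[target] == label]
        if size < majority_size:
            extra = rows.sample(n=majority_size - size, replace=True,
                                random_state=random_state)
            rows = pd.concat([rows, extra])
        parts.append(rows)
    # Shuffle so the duplicated rows are not grouped at the end.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy data: 90 legitimate claims and 10 fraudulent ones.
toy = pd.DataFrame({
    "Deductible": list(range(100)),
    "FraudFound_P": [0] * 90 + [1] * 10,
})

# Split first (stratified), then balance only the training part.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=0, stratify=toy["FraudFound_P"])
toy_train_balanced = oversample_minority(toy_train, "FraudFound_P")

print(toy_train_balanced["FraudFound_P"].value_counts())
print(toy_test["FraudFound_P"].value_counts())
```

With this order the test set keeps the original class ratio and shares no rows with the training set, so the scores measured on it are closer to what the model would see on new claims.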

In [None]:
train_set.isna().sum()

In [None]:
test_set.isna().sum()

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

### Separating the target variable

In [None]:
train_y = train_set['FraudFound_P']
test_y = test_set['FraudFound_P']

train_inputs = train_set.drop(['FraudFound_P'], axis=1)
test_inputs = test_set.drop(['FraudFound_P'], axis=1)

In [None]:
train_inputs.shape

In [None]:
test_inputs.shape

In [None]:
train_y.shape

In [None]:
test_y.shape

### Identifying the numerical and categorical features

In [None]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [None]:
numeric_columns

In [None]:
categorical_columns

## Creating Pipeline for handling numerical and categorical columns

In [None]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='mean')),
                ('scaler', StandardScaler())])

In [None]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [None]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)],
        remainder='passthrough')
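To make the effect of these transformers concrete, here is a small sketch of the same kind of preprocessor applied to toy frames. The data is invented for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames, invented for illustration.
toy_train = pd.DataFrame({
    "Deductible": [400.0, 500.0, np.nan, 700.0],
    "AccidentArea": ["Urban", "Rural", "Urban", "Urban"],
})
toy_test = pd.DataFrame({
    "Deductible": [500.0],
    "AccidentArea": ["Suburban"],  # category never seen during fit
})

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), ["Deductible"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["AccidentArea"]),
])

toy_train_x = toy_preprocessor.fit_transform(toy_train)
toy_test_x = toy_preprocessor.transform(toy_test)

# The result may be sparse or dense depending on its density; normalise to dense.
toy_test_dense = toy_test_x.toarray() if hasattr(toy_test_x, "toarray") else np.asarray(toy_test_x)

# One scaled numeric column plus one indicator column per category seen in training.
print(toy_train_x.shape)
print(toy_test_dense)
```

Because of `handle_unknown='ignore'`, the unseen category is encoded as all zeros in the indicator columns instead of raising an error.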

In [None]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

In [None]:
train_x.shape

In [None]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

In [None]:
test_x.shape

### Model 1: Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier 

tree_clf = DecisionTreeClassifier(min_samples_leaf=10)

tree_clf.fit(train_x, train_y)

### Accuracy for Decision Tree Classifier

In [None]:
from sklearn.metrics import accuracy_score
# plot_roc_curve was removed in scikit-learn 1.2 and is not used in this notebook
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score, roc_curve

In [None]:
#Train accuracy:
train_y_pred = tree_clf.predict(train_x)

print(accuracy_score(train_y, train_y_pred))

In [None]:
#Test accuracy:
test_y_pred = tree_clf.predict(test_x)

print(accuracy_score(test_y, test_y_pred))

### Confusion Matrix for Decision Tree

In [None]:
from sklearn.metrics import confusion_matrix,f1_score
confusion_matrix(test_y, test_y_pred)

### F1 Score for Decision Tree

In [None]:
f1_score(test_y, test_y_pred)
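As a worked example of how the F1 score relates to the confusion matrix, the sketch below computes precision, recall and F1 by hand and compares the result with `f1_score`. The labels are invented for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels, invented for illustration (1 = fraudulent, 0 = legitimate).
toy_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
toy_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, scikit-learn orders the flattened matrix as tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision = tp / (tp + fp)   # share of predicted frauds that are real frauds
recall = tp / (tp + fn)      # share of real frauds that were caught
manual_f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(manual_f1, f1_score(toy_true, toy_pred))
```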

### Determining the area under the curve for Decision Tree Classifier

In [None]:
dt_auc = roc_auc_score(test_y, test_y_pred)

In [None]:
dt_fpr, dt_tpr, _ = roc_curve(test_y, test_y_pred)
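The AUC above is computed from hard 0/1 predictions, which yields a ROC curve with a single operating point. A sketch of the more common variant, which passes the predicted probability of the positive class from `predict_proba`, is shown below on synthetic data generated only so the example runs on its own:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the example is self-contained.
toy_X, toy_y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.3, random_state=0)

toy_tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X_tr, y_tr)

hard_labels = toy_tree.predict(X_te)
positive_scores = toy_tree.predict_proba(X_te)[:, 1]  # probability of class 1

# Hard labels give one operating point between (0, 0) and (1, 1);
# scores give one point per distinct threshold.
fpr_hard, tpr_hard, _ = roc_curve(y_te, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(y_te, positive_scores)

auc_hard = roc_auc_score(y_te, hard_labels)
auc_soft = roc_auc_score(y_te, positive_scores)

print(len(fpr_hard), len(fpr_soft))
print(auc_hard, auc_soft)
```

The two AUC values generally differ, so scores computed one way should not be compared with scores computed the other way.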

### Model 2: Random Forest 

In [None]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100,min_samples_leaf=8) 

rnd_clf.fit(train_x, train_y)

### Accuracy for Random Forest Classifier

In [None]:
#Train accuracy
train_y_pred = rnd_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

In [None]:
test_y_pred = rnd_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

### Confusion Matrix for Random Forest 

In [None]:
confusion_matrix(test_y, test_y_pred)

### F1 Score for Random Forest

In [None]:
f1_score(test_y, test_y_pred)

### Determining the area under the curve for Random Forest Classifier

In [None]:
rn_auc = roc_auc_score(test_y, test_y_pred)

In [None]:
rn_fpr, rn_tpr, _ = roc_curve(test_y, test_y_pred)

### Creating a pickle file for Random Forest model which will be used during model deployment 

In [None]:
# import pickle
# pickle.dump(rnd_clf, open('randomforestmodel.pkl', 'wb'))
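The pickled random forest expects features that have already been preprocessed, so the fitted `preprocessor` is also needed at prediction time. One possible approach (an assumption, not necessarily how this project is deployed) is to pickle a single `Pipeline` that holds both steps. The sketch below uses toy data invented for illustration:

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration.
toy_inputs = pd.DataFrame({
    "Deductible": [400, 500, 400, 700, 500, 400],
    "Fault": ["Policy Holder", "Third Party", "Policy Holder",
              "Policy Holder", "Third Party", "Policy Holder"],
})
toy_target = [1, 0, 1, 0, 0, 1]

toy_model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["Deductible"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Fault"]),
    ])),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
toy_model.fit(toy_inputs, toy_target)

# Round-trip through pickle in memory; pickle.dump to a file works the same way.
restored = pickle.loads(pickle.dumps(toy_model))

# The restored pipeline accepts raw rows and reproduces the original predictions.
print(restored.predict(toy_inputs))
```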

### Model 3: XGBoost Classifier

In [None]:
import xgboost

In [None]:
xgb_clf = xgboost.XGBClassifier()

xgb_clf.fit(train_x, train_y)

### Accuracy for XGBoost

In [None]:

train_y_pred = xgb_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

In [None]:
test_y_pred = xgb_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

### Performing randomized search with cross-validation to determine the best hyperparameters

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

In [None]:
# randint's upper bound is exclusive: max_depth is drawn from 5..9 and gamma from 2..4
param_grid = {'max_depth': randint(low=5, high=10),
              'gamma': randint(low=2, high=5)}

In [None]:
tree_gs = RandomizedSearchCV(xgboost.XGBClassifier(), param_grid,n_iter=10,cv=5,
                             scoring='accuracy',
                             return_train_score=True)
tree_gs.fit(train_x, train_y)

In [None]:
cvres = tree_gs.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)

In [None]:
tree_gs.best_params_
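A short sketch of what the randomized search does with these distributions. To keep it runnable without xgboost, it swaps in scikit-learn's `GradientBoostingClassifier` and synthetic data (both are assumptions made for the example), and it omits `gamma`, which is specific to XGBoost.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# randint(low, high) draws integers from low up to high - 1.
depth_draws = randint(low=5, high=10).rvs(size=200, random_state=0)
print(depth_draws.min(), depth_draws.max())

# Synthetic data, used only so the example is self-contained.
toy_X, toy_y = make_classification(n_samples=200, n_features=6, random_state=0)

# Each of the n_iter candidates samples max_depth once and is scored with 3-fold CV.
toy_search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    {"max_depth": randint(low=5, high=10)},
    n_iter=3, cv=3, scoring="accuracy", random_state=0)
toy_search.fit(toy_X, toy_y)

print(toy_search.best_params_)
```

Unlike an exhaustive grid search, only `n_iter` sampled combinations are evaluated, so the best parameters found can change with the random seed.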

### Accuracy for best XGBoost model obtained after randomized search

In [None]:
#Train accuracy:
train_y_pred = tree_gs.best_estimator_.predict(train_x)

print(accuracy_score(train_y, train_y_pred))

In [None]:
test_y_pred = tree_gs.best_estimator_.predict(test_x)

print(accuracy_score(test_y, test_y_pred))

### Confusion Matrix for best XGBoost obtained after randomized search

In [None]:
confusion_matrix(test_y, test_y_pred)

### F1 score for best XGBoost obtained after randomized search

In [None]:
f1_score(test_y, test_y_pred)

### Creating a pickle file for XGBoost model which will be used during model deployment

In [None]:
# import pickle
# pickle.dump(tree_gs.best_estimator_, open('XGBoostModel.pkl', 'wb'))

### Determining the area under the curve for XGBoost model

In [None]:
xg_auc = roc_auc_score(test_y, test_y_pred)

In [None]:
xg_fpr, xg_tpr, _ = roc_curve(test_y, test_y_pred)

### Plotting AUC ROC curve for Decision Tree, Random Forest and XGBoost and comparing the area under the curve

In [None]:
plt.plot(dt_fpr, dt_tpr, linestyle='--', label='Decision Tree (AUROC = %0.3f)' % dt_auc)
plt.plot(rn_fpr, rn_tpr, marker='.', label='Random Forest (AUROC = %0.3f)' % rn_auc)
plt.plot(xg_fpr, xg_tpr, marker='.', label='XGBoost (AUROC = %0.3f)' % xg_auc)

 

# Title
plt.title('ROC Plot')
# Axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# Show legend
plt.legend()
# Show plot
plt.show()

# Interpretation

The objective of this project is to determine whether a car insurance claim is legitimate.
A claim is either legitimate or fraudulent, which makes this a binary classification problem. To make
the predictions, I used three ML algorithms: Decision Tree, Random Forest and XGBoost.
    
The accuracy for Decision Tree Classifier is **90%**

The accuracy for Random Forest Classifier is **91.32%**

The accuracy for XGBoost is **95.4%**

**Based on the above plot of AUC-ROC, the area under the curve for**

Decision Tree is **0.9**

Random Forest Classifier is **0.91**

XGBoost is **0.95**

Based on its accuracy of 95.4%, the **XGBoost** model does the best job of predicting whether a
vehicle insurance claim is fraudulent. Its AUC-ROC of 0.95 suggests that the model discriminates well
between fraudulent and non-fraudulent insurance claims.

We will be using XGBoost model for making our predictions!