# Auto Insurance Claim Fraud Indicators and Classification

# Mary Donovan Martello

## The goal of this project was to identify significant features in fraudulent insurance claim transactions and to build classification models that predict whether fraud was reported on a claim transaction. This notebook tests different subsets of input features to see which subset(s) produce the best results in those models.

# Part 5: Test Different Subsets of Input Features for Optimizing Models

### The dataset includes 1,000 prior claim transaction records.  Each record has a mix of 38 quantitative and categorical data features about the claim filed, including information on the policy, insured, and automobile, aspects of the damage incident, and elements of the claim filed.  The dataset also has a feature that indicates whether fraud was reported on each observation (i.e., either Y or N).

In [2]:
dfClaims = pd.read_csv('FradulentInsuranceClaims.csv')

In [3]:
dfClaims.head(2)

Unnamed: 0,months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,policy_deductable,policy_annual_premium,umbrella_limit,insured_zip,...,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,auto_model,auto_year,fraud_reported,_c39
0,328,48,521585,2014-10-17,OH,250/500,1000,1406.91,0,466132,...,YES,71610,6510,13020,52080,Saab,92x,2004,Y,
1,228,42,342868,2006-06-27,IN,250/500,2000,1197.22,5000000,468176,...,?,5070,780,780,3510,Mercedes,E400,2007,Y,


### The dataset includes the original data that was cleaned, prepared, and transformed into Principal Component features in the 1_EDA_Prep notebook.

In [38]:
# import scaled, transformed and PCA df
pcaDF = pd.read_csv('pcaClaimsLog.csv')

# Model Evaluation and Model Selection

>  ## Phase 2: Create Subsets to See if They Improve the Baseline Models

In [37]:
# import libraries for models

import pandas as pd
import numpy as np

from numpy import mean
from numpy import std

import yellowbrick
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import StratifiedKFold, KFold, cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

from sklearn.metrics import accuracy_score

# models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

#stop unnecessary warnings from printing to the screen
import warnings
warnings.simplefilter('ignore')




### Prepare feature matrix and target vector for creating subsets.

In [66]:
# re-attach 'auto_year' from the original data; only needed to test the 'auto_year' subset
ordf = pd.read_csv('FradulentInsuranceClaims.csv')
pcaDF['auto_year'] = ordf['auto_year']

#get the categorical data
cat_features = ['policy_state', 'auto_year',
       'insured_sex', 'insured_education_level', 'insured_occupation', 'insured_hobbies',
       'insured_relationship', 'incident_type', 'collision_type',
       'incident_severity', 'authorities_contacted', 'incident_state',
       'incident_city', 'property_damage', 'police_report_available',
       'auto_make', 'auto_model']

df_cat2 = pcaDF[cat_features]

# One Hot Encoding 
dfDumm2 = pd.get_dummies(df_cat2)

# create a whole features dataset that can be used for train and validation data splitting
# here we will combine the numerical features and the dummy features together
# note: dfNum2 keeps every non-categorical column of pcaDF, including 'fraud_reported';
# the subsets below select their columns explicitly, so the target is never used as a feature
dfNum2 = pcaDF.drop(cat_features, axis = 1)
Xdumm2 = pd.concat([dfNum2, dfDumm2], axis=1)
# create a whole target dataset that can be used for train and validation data splitting
y2 =  pcaDF['fraud_reported']
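
A note on the encoding step above: by default, `pd.get_dummies` expands only object, string, and categorical columns, and passes numeric columns through unchanged. That would explain why a plain `auto_year` column can be selected from `Xdumm2` later, even though `auto_year` is listed in `cat_features`. The sketch below illustrates this on a made-up toy frame; the column names mirror the notebook, but the values are invented and this is not the claims data.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "incident_severity": ["Major Damage", "Minor Damage", "Total Loss"],
    "auto_year": [2004, 2007, 2011],
})

encoded = pd.get_dummies(toy)

# The text column becomes one indicator column per category;
# the integer column keeps its original name and values.
print(list(encoded.columns))
print(encoded["auto_year"].tolist())
```

If `auto_year` were meant to be treated as a category, one option would be to cast it to string before encoding; the notebook as written appears to keep it numeric.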


### Run Baseline Model again with Subset Feature Matrix

In [67]:
# create a subset of the df that can be used for model evaluation
# RF top 5
subset1 = Xdumm2.loc[:, [
 'incident_severity_Major Damage',
 'insured_hobbies_chess',
 'PC13',
 'PC11',
 'PC9']]

# separate data into training and validation 
S1Train, S1Test, yTrain_S1, yTest_S1 = train_test_split(subset1, y2, test_size =0.3, random_state=11)

# Instantiate the logistic regression model using default parameters
modelLRS1 = LogisticRegression()

# Fit the model with training data
modelLRS1.fit(S1Train, yTrain_S1)

# predict on test set
yhatS1 = modelLRS1.predict(S1Test)

# evaluate the baseline with accuracy score
accuracy = accuracy_score(yTest_S1, yhatS1)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 81.33
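
With `test_size=0.3` on 1,000 records, the hold-out set has 300 rows, so each row is worth about 0.33 percentage points of accuracy and a 1-point gap between two subsets corresponds to only three rows. Repeated stratified cross-validation is one way to get a steadier estimate. The sketch below uses `StratifiedKFold` and `cross_val_score`, which this notebook already imports; the data is a synthetic stand-in from `make_classification`, and the class weights, fold count, and names `X_demo` / `y_demo` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a 1,000-row, 5-feature subset; NOT the claims data.
# The 75/25 class split is an arbitrary assumption.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.75, 0.25], random_state=11)

# 10 stratified folds keep the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=11)

# one accuracy score per fold, using the same default logistic regression
scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                         scoring='accuracy', cv=cv)

print('Accuracy: %.2f (+/- %.2f)' % (scores.mean() * 100, scores.std() * 100))
```

In the notebook, `subset1` and `y2` would take the place of `X_demo` and `y_demo`. The standard deviation across folds gives a rough sense of whether a difference between two subsets is larger than fold-to-fold noise.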


In [68]:
# create a subset of the df that can be used for model evaluation
# RF top 2
subset2 = Xdumm2.loc[:, [
 'incident_severity_Major Damage',
 'insured_hobbies_chess'
 ]]

# separate data into training and validation 
S2Train, S2Test, yTrain_S2, yTest_S2 = train_test_split(subset2, y2, test_size =0.3, random_state=11)

# Instantiate the logistic regression model using default parameters
modelLRS2 = LogisticRegression()

# Fit the model with training data
modelLRS2.fit(S2Train, yTrain_S2)

# predict on test set
yhatS2 = modelLRS2.predict(S2Test)

# evaluate the baseline with accuracy score
accuracy = accuracy_score(yTest_S2, yhatS2)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 82.33


In [69]:
# create a subset of the df that can be used for model evaluation
# incident_severity_Major Damage only
subset3 = Xdumm2.loc[:, [
 'incident_severity_Major Damage'
 ]]

# separate data into training and validation 
S3Train, S3Test, yTrain_S3, yTest_S3 = train_test_split(subset3, y2, test_size =0.3, random_state=11)

# Instantiate the logistic regression model using default parameters
modelLRS3 = LogisticRegression()

# Fit the model with training data
modelLRS3.fit(S3Train, yTrain_S3)

# predict on test set
yhatS3 = modelLRS3.predict(S3Test)

# evaluate the baseline with accuracy score
accuracy = accuracy_score(yTest_S3, yhatS3)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 80.33


In [70]:
# create a subset of the df that can be used for model evaluation
# incident_severity_Major Damage and Witnesses only
subset4 = Xdumm2.loc[:, [
 'incident_severity_Major Damage',
 'PC9'
 ]]

# separate data into training and validation 
S4Train, S4Test, yTrain_S4, yTest_S4 = train_test_split(subset4, y2, test_size =0.3, random_state=11)

# Instantiate the logistic regression model using default parameters
modelLRS4 = LogisticRegression()

# Fit the model with training data
modelLRS4.fit(S4Train, yTrain_S4)

# predict on test set
yhatS4 = modelLRS4.predict(S4Test)

# evaluate the baseline with accuracy score
accuracy = accuracy_score(yTest_S4, yhatS4)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 80.33


In [71]:
# create a subset of the df that can be used for model evaluation
# incident_severity_Major Damage, auto_year, and Witnesses only
subset5 = Xdumm2.loc[:, [
 'incident_severity_Major Damage',
 'PC9', 'auto_year'
 ]]

# separate data into training and validation 
S5Train, S5Test, yTrain_S5, yTest_S5 = train_test_split(subset5, y2, test_size =0.3, random_state=11)

# Instantiate the logistic regression model using default parameters
modelLRS5 = LogisticRegression()

# Fit the model with training data
modelLRS5.fit(S5Train, yTrain_S5)

# predict on test set
yhatS5 = modelLRS5.predict(S5Test)

# evaluate the baseline with accuracy score
accuracy = accuracy_score(yTest_S5, yhatS5)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 80.33


In [72]:
# create a subset of the df that can be used for model evaluation
# RF with .02 or greater
subset6 = Xdumm2.loc[:, [
 'incident_severity_Major Damage',
 'insured_hobbies_chess',
 'PC13',
 'PC11',
 'PC9', 'PC16', 'incident_severity_Minor Damage', 'PC14', 'PC7', 'PC10', 'insured_hobbies_cross-fit',
    'PC15', 'PC2', 'PC6', 'PC5', 'PC8', 'PC12', 'PC1', 'PC4', 'PC3',  'incident_severity_Total Loss'
 ]]

# separate data into training and validation 
S6Train, S6Test, yTrain_S6, yTest_S6 = train_test_split(subset6, y2, test_size =0.3, random_state=11)

# Instantiate the logistic regression model using default parameters
modelLRS6 = LogisticRegression()

# Fit the model with training data
modelLRS6.fit(S6Train, yTrain_S6)

# predict on test set
yhatS6 = modelLRS6.predict(S6Test)

# evaluate the baseline with accuracy score
accuracy = accuracy_score(yTest_S6, yhatS6)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 81.00


In [73]:
# create a subset of the df that can be used for model evaluation
# CART with .01 or greater
subset7 = Xdumm2.loc[:, [
 'incident_severity_Major Damage',
 'insured_hobbies_chess',
 'PC13',
 'PC11',
 'PC9', 'PC16', 'PC14', 'PC7', 'PC10', 'insured_hobbies_cross-fit',
    'PC2', 'PC6', 'PC5', 'PC12', 'PC1', 'PC4', 'PC3',  'insured_occupation_tech-support',
     'insured_relationship_not-in-family'
 ]]

# separate data into training and validation 
S7Train, S7Test, yTrain_S7, yTest_S7 = train_test_split(subset7, y2, test_size =0.3, random_state=11)

# Instantiate the logistic regression model using default parameters
modelLRS7 = LogisticRegression()

# Fit the model with training data
modelLRS7.fit(S7Train, yTrain_S7)

# predict on test set
yhatS7 = modelLRS7.predict(S7Test)

# evaluate the baseline with accuracy score
accuracy = accuracy_score(yTest_S7, yhatS7)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 79.33


In [74]:
# create a subset of the df that can be used for model evaluation
# CART with .02 or greater
subset8 = Xdumm2.loc[:, [
 'incident_severity_Major Damage',
 'insured_hobbies_chess',
 'PC13',
 'PC11',
 'PC9', 'PC10', 'insured_hobbies_cross-fit',
    'PC5', 'PC12', 'PC1', 'PC4', 'PC3'
 ]]

# separate data into training and validation 
S8Train, S8Test, yTrain_S8, yTest_S8 = train_test_split(subset8, y2, test_size =0.3, random_state=11)

# Instantiate the logistic regression model using default parameters
modelLRS8 = LogisticRegression()

# Fit the model with training data
modelLRS8.fit(S8Train, yTrain_S8)

# predict on test set
yhatS8 = modelLRS8.predict(S8Test)

# evaluate the baseline with accuracy score
accuracy = accuracy_score(yTest_S8, yhatS8)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 79.00


In [75]:
# create a subset of the df that can be used for model evaluation
# CART with TOP 6 / .03 or greater
subset9 = Xdumm2.loc[:, [
 'incident_severity_Major Damage',
 'insured_hobbies_chess',
 'PC13',
 'PC5',
 'PC9', 'insured_hobbies_cross-fit'
 ]]

# separate data into training and validation 
S9Train, S9Test, yTrain_S9, yTest_S9 = train_test_split(subset9, y2, test_size =0.3, random_state=11)

# Instantiate the logistic regression model using default parameters
modelLRS9 = LogisticRegression()

# Fit the model with training data
modelLRS9.fit(S9Train, yTrain_S9)

# predict on test set
yhatS9 = modelLRS9.predict(S9Test)

# evaluate the baseline with accuracy score
accuracy = accuracy_score(yTest_S9, yhatS9)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 81.67


In [76]:
# create a subset of the df that can be used for model evaluation
# CART with TOP 4 / .07 or greater
subset10 = Xdumm2.loc[:, [
 'incident_severity_Major Damage',
 'insured_hobbies_chess',
 'PC9', 'insured_hobbies_cross-fit'
 ]]

# separate data into training and validation 
S10Train, S10Test, yTrain_S10, yTest_S10 = train_test_split(subset10, y2, test_size =0.3, random_state=11)

# Instantiate the logistic regression model using default parameters
modelLRS10 = LogisticRegression()

# Fit the model with training data
modelLRS10.fit(S10Train, yTrain_S10)

# predict on test set
yhatS10 = modelLRS10.predict(S10Test)

# evaluate the baseline with accuracy score
accuracy = accuracy_score(yTest_S10, yhatS10)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 82.33


In [77]:
# create a subset of the df that can be used for model evaluation
# CART with TOP 3 / .08 or greater
subset11 = Xdumm2.loc[:, [
 'incident_severity_Major Damage',
 'insured_hobbies_chess',
 'PC9'
 ]]

# separate data into training and validation 
S11Train, S11Test, yTrain_S11, yTest_S11 = train_test_split(subset11, y2, test_size =0.3, random_state=11)

# Instantiate the logistic regression model using default parameters
modelLRS11 = LogisticRegression()

# Fit the model with training data
modelLRS11.fit(S11Train, yTrain_S11)

# predict on test set
yhatS11 = modelLRS11.predict(S11Test)

# evaluate the baseline with accuracy score
accuracy = accuracy_score(yTest_S11, yhatS11)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 82.33


In [78]:
# create a subset of the df that can be used for model evaluation
# Coeff top 4
subset12 = Xdumm2.loc[:, [
 'incident_severity_Major Damage',
 'insured_hobbies_chess',
  'insured_relationship_not-in-family', 'insured_hobbies_cross-fit'
 ]]

# separate data into training and validation 
S12Train, S12Test, yTrain_S12, yTest_S12 = train_test_split(subset12, y2, test_size =0.3, random_state=11)

# Instantiate the logistic regression model using default parameters
modelLRS12 = LogisticRegression()

# Fit the model with training data
modelLRS12.fit(S12Train, yTrain_S12)

# predict on test set
yhatS12 = modelLRS12.predict(S12Test)

# evaluate the baseline with accuracy score
accuracy = accuracy_score(yTest_S12, yhatS12)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 82.33
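
The twelve cells above repeat the same split, fit, predict, and score steps with only the column list changing. A minimal sketch of how that pattern could be folded into one helper and a dictionary of named subsets follows. The helper `evaluate_subset`, the dictionary `subsets`, and the synthetic `X_demo` / `y_demo` are hypothetical names introduced here for illustration; the demo data is randomly generated and is not the claims data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_subset(X, y, columns, test_size=0.3, random_state=11):
    """Hold-out accuracy of a default LogisticRegression on the given columns."""
    XTrain, XTest, yTrain, yTest = train_test_split(
        X.loc[:, columns], y, test_size=test_size, random_state=random_state)
    model = LogisticRegression()
    model.fit(XTrain, yTrain)
    return accuracy_score(yTest, model.predict(XTest))


# Synthetic stand-in data (assumption): two indicator columns and one numeric column
rng = np.random.default_rng(11)
X_demo = pd.DataFrame({
    'incident_severity_Major Damage': rng.integers(0, 2, size=200),
    'insured_hobbies_chess': rng.integers(0, 2, size=200),
    'PC9': rng.normal(size=200),
})
# Label follows the first column, with roughly 20% of rows flipped as noise
flip = rng.random(200) < 0.2
is_fraud = (X_demo['incident_severity_Major Damage'] == 1) ^ flip
y_demo = pd.Series(np.where(is_fraud, 'Y', 'N'))

# Named column lists, mirroring how the cells above define subset1, subset2, ...
subsets = {
    'severity + chess + PC9': ['incident_severity_Major Damage',
                               'insured_hobbies_chess', 'PC9'],
    'severity + chess': ['incident_severity_Major Damage',
                         'insured_hobbies_chess'],
    'severity only': ['incident_severity_Major Damage'],
}

results = {name: evaluate_subset(X_demo, y_demo, cols)
           for name, cols in subsets.items()}

for name, acc in results.items():
    print('%s: %.2f' % (name, acc * 100))
```

Called with `Xdumm2` and `y2`, the same helper would reuse the notebook's `test_size=0.3` and `random_state=11`, and collecting the scores in one dictionary makes the subsets easier to compare side by side.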
