# Auto Insurance Claim Fraud Indicators and Classification

# Mary Donovan Martello

## The goal of this project was to identify significant features in fraudulent insurance claim transactions and to build classification models that predict whether fraud was reported on a claim. This notebook covers feature selection.

# Part 3:  Feature Selection

In [1]:
# import data set and libraries for data preparation phase
import pandas as pd
import numpy as np
# regex module: used to check whether a string contains a given search pattern
import re


### The original dataset includes 1,000 prior claim transaction records. Each record has a mix of 38 quantitative and categorical data features about the claim filed, including information on the policy, insured, and automobile, aspects of the damage incident, and elements of the claim filed. The dataset also has a feature that indicates whether fraud was reported on each observation (i.e., either Y or N).

In [2]:
dfClaims = pd.read_csv('FradulentInsuranceClaims.csv')

In [3]:
dfClaims.head(2)

| Unnamed: 0 | months_as_customer | age | policy_number | policy_bind_date | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_zip | ... | police_report_available | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | auto_model | auto_year | fraud_reported | _c39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 328 | 48 | 521585 | 2014-10-17 | OH | 250/500 | 1000 | 1406.91 | 0 | 466132 | ... | YES | 71610 | 6510 | 13020 | 52080 | Saab | 92x | 2004 | Y | |
| 1 | 228 | 42 | 342868 | 2006-06-27 | IN | 250/500 | 2000 | 1197.22 | 5000000 | 468176 | ... | ? | 5070 | 780 | 780 | 3510 | Mercedes | E400 | 2007 | Y | |


### The dataset used in this notebook is the original data after it was cleaned in the 1_EDA_Prep notebook.

In [23]:
# import cleaned df without re-running the cleaning code
df = pd.read_csv('dfClaims.csv')

> # Feature Selection

> Prepare features to use in models.

In [25]:
# convert categorical data to numbers

#get the categorical data
cat_features = ['policy_state',
       'insured_sex', 'insured_education_level', 'insured_occupation',
       'insured_hobbies', 'insured_relationship', 'incident_type',
       'collision_type', 'incident_severity', 'authorities_contacted',
       'incident_state', 'incident_city', 'property_damage',
       'police_report_available', 'auto_make', 'auto_model']
df_cat = df[cat_features]

# One Hot Encoding 
dfDumm = pd.get_dummies(df_cat)
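
One-hot encoding turns each category value into its own 0/1 column named `<column>_<value>`, which is where names such as `incident_severity_Major Damage` come from. A minimal sketch on a toy frame follows; `toy` and `toy_dummies` are hypothetical names, the values are illustrative rather than taken from the claims data, and `dtype=int` is an assumption made so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_cat (values are illustrative only)
toy = pd.DataFrame({
    "incident_severity": ["Major Damage", "Minor Damage", "Total Loss", "Minor Damage"],
    "insured_sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# Each distinct value becomes one indicator column named "<column>_<value>"
toy_dummies = pd.get_dummies(toy, dtype=int)

print(toy_dummies.columns.tolist())
print(toy_dummies)
```

Every row has exactly one 1 per original column, so the number of columns grows with the number of distinct values; high-cardinality features such as `auto_model` contribute many sparse columns.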

In [30]:
# combine the numerical features and the dummy features together
# drop the categorical columns (already one-hot encoded in dfDumm), plus incident_location and the target
dfNum = df.drop(columns=cat_features + ['incident_location', 'fraud_reported'])

Xdumm = pd.concat([dfNum, dfDumm], axis=1)

# create a whole target dataset that can be used for train and validation data splitting
y =  df['fraud_reported']
# for Random Forest when target is not label encoded
yRF = df['fraud_reported']
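
The comment on `yRF` implies that some models expect a label-encoded target. A minimal sketch of one way to do that, assuming the target only contains `Y` and `N`; `fraud_reported` here is a hypothetical stand-in series and `y_encoded` is an illustrative name, not one used elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for df['fraud_reported']
fraud_reported = pd.Series(["Y", "N", "N", "Y", "N"], name="fraud_reported")

# Map the two classes to integers: fraud reported -> 1, not reported -> 0
y_encoded = fraud_reported.map({"N": 0, "Y": 1})

# Any value outside the mapping would become NaN, so check for that explicitly
assert y_encoded.notna().all()

print(y_encoded.tolist())
# Class balance is worth checking before modelling
print(fraud_reported.value_counts(normalize=True))
```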

### Use CART classification and Random Forest Classifier to determine important features.

In [33]:
# https://machinelearningmastery.com/calculate-feature-importance-with-python/
# use CART classification feature importance
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

# define the model
model = DecisionTreeClassifier()
# fit the model
model.fit(Xdumm, y)
# get importance (one impurity-based score per column of Xdumm)
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
plt.bar(range(len(importance)), importance)
plt.show()

Feature: 0, Score: 0.02258
Feature: 1, Score: 0.02940
Feature: 2, Score: 0.06283
Feature: 3, Score: 0.00439
Feature: 4, Score: 0.03036
Feature: 5, Score: 0.00230
Feature: 6, Score: 0.04263
Feature: 7, Score: 0.00474
Feature: 8, Score: 0.00657
Feature: 9, Score: 0.01123
Feature: 10, Score: 0.00000
Feature: 11, Score: 0.00000
Feature: 12, Score: 0.01869
Feature: 13, Score: 0.01693
Feature: 14, Score: 0.00931
Feature: 15, Score: 0.01955
Feature: 16, Score: 0.00499
Feature: 17, Score: 0.02381
Feature: 18, Score: 0.00000
Feature: 19, Score: 0.00995
Feature: 20, Score: 0.01840
Feature: 21, Score: 0.00000
Feature: 22, Score: 0.00000
Feature: 23, Score: 0.00552
Feature: 24, Score: 0.00448
Feature: 25, Score: 0.00659
Feature: 26, Score: 0.00430
Feature: 27, Score: 0.00000
Feature: 28, Score: 0.00000
Feature: 29, Score: 0.00000
Feature: 30, Score: 0.00122
Feature: 31, Score: 0.00489
Feature: 32, Score: 0.00000
Feature: 33, Score: 0.01014
Feature: 34, Score: 0.00000
Feature: 35, Score: 0.00000
...
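
The scores printed above are keyed by column position, which makes them hard to read against the feature names. A minimal sketch of labelling importances with column names and sorting them, on hypothetical toy data: `X_toy`, `y_toy`, and `named_importance` are illustrative names, the values are made up, and `random_state=0` is an assumption added only to make the toy run repeatable.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for Xdumm and y (values are made up)
X_toy = pd.DataFrame({
    "total_claim_amount": [71610, 5070, 34650, 63400, 6500, 64100, 78650, 51590],
    "witnesses": [2, 0, 3, 2, 1, 2, 0, 2],
    "severity_major": [1, 0, 0, 1, 0, 1, 0, 1],
})
y_toy = pd.Series(["Y", "N", "N", "Y", "N", "Y", "N", "Y"])

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Pair each importance score with its column name, highest first
named_importance = pd.Series(
    tree.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)

print(named_importance)
```

In the notebook's own terms, the same pattern would use `Xdumm.columns` as the index. Impurity-based importances sum to 1 across features, so each score reads as a share of the total.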

In [34]:
# use random forest feature importance
# This approach can also be used with the bagging and extra trees algorithms.
# https://machinelearningmastery.com/calculate-feature-importance-with-python/
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# define the model
model = RandomForestClassifier()
# fit the model
model.fit(Xdumm, y)
# get importance (averaged over the trees in the forest)
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
plt.bar(range(len(importance)), importance)
plt.show()

Feature: 0, Score: 0.02714
Feature: 1, Score: 0.02313
Feature: 2, Score: 0.02862
Feature: 3, Score: 0.00710
Feature: 4, Score: 0.02615
Feature: 5, Score: 0.00973
Feature: 6, Score: 0.03022
Feature: 7, Score: 0.01592
Feature: 8, Score: 0.01637
Feature: 9, Score: 0.02130
Feature: 10, Score: 0.00695
Feature: 11, Score: 0.00770
Feature: 12, Score: 0.01055
Feature: 13, Score: 0.02377
Feature: 14, Score: 0.02801
Feature: 15, Score: 0.03070
Feature: 16, Score: 0.03349
Feature: 17, Score: 0.02147
Feature: 18, Score: 0.00677
Feature: 19, Score: 0.00623
Feature: 20, Score: 0.02415
Feature: 21, Score: 0.00389
Feature: 22, Score: 0.00374
Feature: 23, Score: 0.00404
Feature: 24, Score: 0.00468
Feature: 25, Score: 0.00409
Feature: 26, Score: 0.00367
Feature: 27, Score: 0.00268
Feature: 28, Score: 0.00428
Feature: 29, Score: 0.00333
Feature: 30, Score: 0.00298
Feature: 31, Score: 0.00385
Feature: 32, Score: 0.00308
Feature: 33, Score: 0.00127
Feature: 34, Score: 0.00238
Feature: 35, Score: 0.00219
...

In [36]:
# https://stackoverflow.com/questions/55466081/how-to-calculate-feature-importance-in-each-models-of-cross-validation-in-sklear

from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced")
# return_estimator=True keeps the fitted model from each fold
output = cross_validate(clf, Xdumm, y, cv=2, scoring='accuracy', return_estimator=True)
for idx, estimator in enumerate(output['estimator']):
    print("Features sorted by their score for estimator {}:".format(idx))
    feature_importances = pd.DataFrame(estimator.feature_importances_,
                                       index=Xdumm.columns,
                                       columns=['importance']).sort_values('importance', ascending=False)
    print(feature_importances.head(20))


Features sorted by their score for estimator 0:
                                importance
incident_severity_Major Damage    0.156566
insured_hobbies_chess             0.049119
months_as_customer                0.035151
incident_severity_Total Loss      0.033092
property_claim                    0.032555
incident_severity_Minor Damage    0.030792
injury_claim                      0.029776
insured_zip                       0.029532
vehicle_claim                     0.028932
months_bf_incident                0.027750
total_claim_amount                0.024879
age                               0.023256
insured_hobbies_cross-fit         0.020396
incident_hour_of_the_day          0.019590
policy_number                     0.018939
policy_annual_premium             0.018659
policy_deductable                 0.017052
witnesses                         0.015292
auto_year                         0.014910
capital_loss                      0.014299
Features sorted by their score for estimator 1:
 