# **Introduction**: The goal of this notebook is to create a machine learning model that accurately predicts cases of car insurance fraud and to understand which characteristics of a claim are most indicative of potential fraud.

Data: https://www.kaggle.com/roshansharma/insurance-claim

GitHub: https://github.com/ArielJosephCohen/capstone

Presentation: https://docs.google.com/presentation/d/1IQdYSxrzyGvMpurhM-i097Btp4ksqL70WLEiM6yc5Sw/edit#slide=id.g35f391192_00

# **Notebook**

## Suppress warnings to keep the output readable

In [1]:
import warnings
warnings.filterwarnings(action='ignore')

## Load helper module with custom functions

In [2]:
import helper_module as hm
from helper_module import *

## Load central data for analysis

In [3]:
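# explicit import; otherwise pd is only available through the star import above
import pandas as pd
df = pd.read_csv('Claims.csv')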

## Set a single random seed for the entire project

In [4]:
seed = 14

## Clean data

In [5]:
# address '?' values in data
df=hm.clean_data(df)
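
The body of `hm.clean_data` is not shown in this notebook. As a rough sketch of one way to handle `'?'` placeholders, assuming they mark unknown categorical values and that a literal label is an acceptable substitute, something like the following could work. The function name `replace_question_marks`, the fill value `'UNKNOWN'`, and the demo columns are illustrative assumptions, not the helper's actual code.

```python
import pandas as pd


def replace_question_marks(frame: pd.DataFrame, fill_value: str = "UNKNOWN") -> pd.DataFrame:
    """Return a new frame where cells equal to '?' are replaced by fill_value."""
    # DataFrame.replace returns a copy, so the input frame is left untouched.
    return frame.replace("?", fill_value)


# Small illustrative frame (values made up for the demo).
demo = pd.DataFrame({
    "collision_type": ["Rear Collision", "?", "Side Collision"],
    "police_report_available": ["YES", "?", "NO"],
    "age": [48, 42, 29],
})
cleaned_demo = replace_question_marks(demo)
print(cleaned_demo)
```

Keeping the unknowns as their own category, rather than dropping the rows, preserves the possibility that a missing police report or collision type is itself informative.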

In [6]:
# create separate columns for policy bind year and month
df = hm.reassign_year_and_month(df)
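
The helper's implementation lives in `helper_module` and is not reproduced here. A minimal sketch of splitting a date column into year and month with pandas might look like this; the function name `split_bind_date` and the output column names `policy_bind_year` and `policy_bind_month` are assumptions made for illustration.

```python
import pandas as pd


def split_bind_date(frame: pd.DataFrame, date_col: str = "policy_bind_date") -> pd.DataFrame:
    """Return a copy with year and month columns derived from date_col."""
    out = frame.copy()
    parsed = pd.to_datetime(out[date_col])
    out["policy_bind_year"] = parsed.dt.year    # hypothetical column name
    out["policy_bind_month"] = parsed.dt.month  # hypothetical column name
    return out


demo = pd.DataFrame({"policy_bind_date": ["2014-10-17", "2006-06-27"]})
split_demo = split_bind_date(demo)
print(split_demo)
```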

In [7]:
# assign a car type to auto models
auto_model_dict = hm.create_auto_dict()

In [8]:
# map car type
df.auto_model=df.auto_model.map(lambda x: auto_model_dict[x])

In [9]:
# create a timeline between policy bind date and claim
df=hm.create_timeline(df)
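
How `hm.create_timeline` measures the timeline is not shown. One plausible reading is the number of days between `policy_bind_date` and `incident_date`, the two columns dropped in the next cell. The sketch below assumes exactly that; the function name and the `days_to_incident` column are hypothetical.

```python
import pandas as pd


def days_from_bind_to_incident(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the day count between policy bind and incident."""
    out = frame.copy()
    bind = pd.to_datetime(out["policy_bind_date"])
    incident = pd.to_datetime(out["incident_date"])
    # Subtracting two datetime Series gives timedeltas; .dt.days extracts whole days.
    out["days_to_incident"] = (incident - bind).dt.days  # hypothetical column name
    return out


demo = pd.DataFrame({
    "policy_bind_date": ["2014-10-17", "2015-01-01"],
    "incident_date": ["2015-01-25", "2015-01-21"],
})
timeline_demo = days_from_bind_to_incident(demo)
print(timeline_demo["days_to_incident"].tolist())
```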

In [10]:
# now that I have the timeline and month-year information, I can drop the raw date columns
df.drop(['incident_date','policy_bind_date'],axis=1,inplace=True)

In [11]:
# show capital loss as a positive value
df=hm.quantify_absolute_value(df,'capital-loss')

In [12]:
# assign numerical binary to insured sex
df=hm.map_binary_dict(df,'insured_sex','MALE','FEMALE')

In [13]:
# assign numerical binary to fraud reported (target feature)
df=hm.map_binary_dict(df,'fraud_reported','Y','N')
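
The notebook does not show which label `hm.map_binary_dict` maps to 1. A minimal sketch, assuming the first label passed becomes 1 and the second becomes 0, could look like this (the function name `map_binary` is illustrative):

```python
import pandas as pd


def map_binary(frame: pd.DataFrame, column: str, positive: str, negative: str) -> pd.DataFrame:
    """Return a copy where `positive` is encoded as 1 and `negative` as 0."""
    out = frame.copy()
    # Assumption: the first label maps to 1. Labels outside the pair become NaN.
    out[column] = out[column].map({positive: 1, negative: 0})
    return out


demo = pd.DataFrame({"fraud_reported": ["Y", "N", "N"]})
binary_demo = map_binary(demo, "fraud_reported", "Y", "N")
print(binary_demo["fraud_reported"].tolist())
```

Under this assumption a value of 1 in `fraud_reported` means fraud was reported, which keeps model outputs easy to read as a fraud probability.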

## Address categorical and numerical features

In [14]:
# create separate lists of numerical and categorical features
num_list = hm.create_num_list()
cat_list = hm.create_cat_list()
cat_list_2 = hm.create_cat_list_2()

In [15]:
# remove correlated features and update numerical feature list
df,num_list=hm.remove_correlation(df,num_list)
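
The criterion used by `hm.remove_correlation` is not visible in the notebook. A common approach, sketched here under the assumption of a fixed absolute-correlation threshold, is to scan the upper triangle of the correlation matrix and drop the later column of each highly correlated pair. The function name, the 0.8 threshold, and the demo values are all assumptions.

```python
import numpy as np
import pandas as pd


def drop_correlated(frame: pd.DataFrame, columns: list, threshold: float = 0.8):
    """Drop the later column of each pair whose |correlation| exceeds threshold."""
    corr = frame[columns].corr().abs()
    # Keep only entries strictly above the diagonal so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    kept = [col for col in columns if col not in to_drop]
    return frame.drop(columns=to_drop), kept


demo = pd.DataFrame({
    "total_claim_amount": [10.0, 20.0, 30.0, 40.0],
    "injury_claim": [1.0, 2.0, 3.0, 4.1],  # nearly proportional to the total
    "age": [30, 25, 41, 28],
})
reduced_demo, kept_cols = drop_correlated(demo, list(demo.columns))
print(kept_cols)
```

Like the helper call above, the sketch returns both the reduced frame and the updated column list so the two stay in sync.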

In [16]:
# create categorical and numerical data frames
df_num=df[num_list]
df_cat=df[cat_list]

## Encode categorical data as numerical values

In [17]:
# use correlation with target variable to encode categorical features
for col in cat_list_2:
    df_cat = hm.create_encoding(col, df_cat, df)
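
What `create_encoding` computes is not shown. One common technique that matches the description is target (mean) encoding, where each category is replaced by the average of the target within that category. The sketch below assumes that interpretation and mirrors the `(col, df_cat, df)` argument order; the function name `target_mean_encode` and the demo data are hypothetical.

```python
import pandas as pd


def target_mean_encode(col: str, cat_frame: pd.DataFrame, full_frame: pd.DataFrame,
                       target: str = "fraud_reported") -> pd.DataFrame:
    """Return a copy of cat_frame with `col` replaced by the per-category target mean."""
    out = cat_frame.copy()
    # Mean of the binary target per category, i.e. the observed fraud rate.
    rates = full_frame.groupby(col)[target].mean()
    out[col] = full_frame[col].map(rates)
    return out


full = pd.DataFrame({
    "incident_type": ["Parked Car", "Parked Car",
                      "Single Vehicle Collision", "Single Vehicle Collision"],
    "fraud_reported": [0, 0, 1, 0],
})
cat = full[["incident_type"]]
encoded = target_mean_encode("incident_type", cat, full)
print(encoded["incident_type"].tolist())
```

Note that computing the rates on the full data set before a train/test split lets information from the test rows influence the features, so in practice the rates are usually fitted on the training portion only.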

## Combine data frames and revisit correlation

In [18]:
# merge categorical and numerical data frames into one
df_atg = hm.combine_data_frames(df_cat,df_num)

In [19]:
# remove correlation from encoded categorical features
df_atg = hm.remove_categorical_correlation(df_atg,'incident_type')

## Recursive feature elimination (RFE)

In [20]:
x_and_y=hm.reduce_features(df_atg,seed)
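
The estimator and the number of features kept by `hm.reduce_features` are not shown. As a sketch only, recursive feature elimination with scikit-learn's `RFE` could be wired up as follows, assuming a logistic regression estimator and a frame that still contains the target column. The function name `select_features_rfe`, the estimator choice, and the synthetic demo data are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression


def select_features_rfe(frame: pd.DataFrame, target: str, n_features: int, seed: int) -> pd.DataFrame:
    """Keep the n_features columns ranked best by RFE, plus the target column."""
    X = frame.drop(columns=[target])
    y = frame[target]
    estimator = LogisticRegression(max_iter=1000, random_state=seed)
    # RFE repeatedly fits the estimator and removes the weakest feature.
    selector = RFE(estimator, n_features_to_select=n_features).fit(X, y)
    kept = list(X.columns[selector.support_])
    return frame[kept + [target]]


# Synthetic data stands in for the encoded claims frame.
features, labels = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=14
)
demo = pd.DataFrame(features, columns=[f"f{i}" for i in range(8)])
demo["fraud_reported"] = labels
reduced = select_features_rfe(demo, "fraud_reported", n_features=4, seed=14)
print(list(reduced.columns))
```

Returning the selected features together with the target matches how `x_and_y` is split into `X` and `y` a few cells later.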

## Filter, normalize, and scale

In [21]:
x_and_y = hm.filter_outliers(x_and_y)
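
The rule applied by `hm.filter_outliers` is not shown. A simple z-score filter is one possibility: rows are kept only when every feature lies within a chosen number of standard deviations of its column mean. The function name, the cutoff of 3, and the demo values below are assumptions.

```python
import pandas as pd


def filter_by_zscore(frame: pd.DataFrame, target: str = "fraud_reported",
                     max_z: float = 3.0) -> pd.DataFrame:
    """Keep rows whose feature z-scores all lie within +/- max_z."""
    features = frame.drop(columns=[target])
    z = (features - features.mean()) / features.std(ddof=0)
    # Constant columns give 0/0 = NaN; treat them as having no outliers.
    z = z.fillna(0)
    keep = (z.abs() <= max_z).all(axis=1)
    return frame[keep]


demo = pd.DataFrame({
    "claim": [10.0] * 19 + [1000.0],  # one extreme value
    "fraud_reported": [0] * 19 + [1],
})
filtered = filter_by_zscore(demo)
print(len(demo), "->", len(filtered))
```

The target column is excluded from the check so that the minority fraud class is not removed merely for being rare.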

In [22]:
X = x_and_y.drop('fraud_reported',axis=1)
y = x_and_y.fraud_reported

In [23]:
X = hm.normalize_features(X)

In [24]:
X = hm.scale_data(X)
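
Neither `hm.normalize_features` nor `hm.scale_data` is shown in the notebook. One common pairing, sketched here purely as an assumption, is a log transform to reduce skew followed by standardization to zero mean and unit variance. The function names `log_normalize` and `standardize` and the demo values are illustrative.

```python
import numpy as np
import pandas as pd


def log_normalize(features: pd.DataFrame) -> pd.DataFrame:
    """Apply log(1 + x) to reduce right skew; assumes non-negative inputs."""
    return np.log1p(features)


def standardize(features: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to zero mean and unit (population) standard deviation."""
    return (features - features.mean()) / features.std(ddof=0)


demo = pd.DataFrame({
    "months_as_customer": [1.0, 10.0, 100.0, 1000.0],
    "age": [20.0, 30.0, 40.0, 50.0],
})
scaled_demo = standardize(log_normalize(demo))
print(scaled_demo.round(3))
```

When a model is later evaluated on held-out data, the mean and standard deviation are normally estimated on the training rows only and then reused for the test rows.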