# Assignment 2: Fraud Detection

Author: Josh NM Blackmore <br>
StID: 201776628

## Case Study

<p>An insurance company plans to use its historic insurance fraud dataset to predict the likelihood or level of risk a customer poses. You can find the dataset above. Referring genuine claims causes customer stress and directly leads to customer loss, costing the company money (assume that any referred non-fraud case will lead to losing that customer). Fraudulent claims obviously cost the company as well. The main requirement is an unbiased predictive model capable of flagging and referring potential fraud cases for further investigation with a balanced error rate of 5%.</p>

## 1 Aims, Objectives & Plan *(Revisit Section)*
### 1.1 Aims & Objectives
This project's primary objective is to analyse a medium-sized dataset of insurance claims from an insurance company, with the goal of identifying potentially fraudulent claims. The project requires a minimum of two techniques to provide predictions or valuable insights.

The provided datasets will be pre-processed using various techniques, with the reasoning behind each decision explained.

The project also requires a technical report narrating the analysis conducted. This will be a Jupyter notebook documenting the key procedures at each step: justifying the choices made in pre-processing, describing the model solutions, visualising the analysis, and testing with performance metrics such as F1, recall and the confusion matrix.

### 1.2 Plan
- Gantt Chart

## 2 Understanding the Case Study *(Revisit Section)*
### 2.1 Case Study Analysis
In this section I present my understanding of the case study through the four critical points found in the case description.
#### 2.1.1 Predicting a Customer's Level of Risk
Each row of the datasets provides information about an individual customer's claim, with a label indicating whether the claim was fraudulent or legitimate. The client needs to know, from this historical data, what level of risk future customers pose.

#### 2.1.2 Cost of Fraudulent Cases for Customers and Company
The cases contain various cost factors, such as property claims and injury claims. The case study requires the cost risk to be considered for company and customer alike: fraudulent claims cost the company directly, while referring a genuine claim risks losing that customer.

#### 2.1.3 Unbiased Prediction Model to Flag Fraudulent Cases
The case study specifies the need for unbiased predictions. This can be achieved in pre-processing by ensuring the training and test data are well balanced, with fair class distributions. Techniques such as regularisation can also help, as can post-processing techniques such as equalising false positives and false negatives.
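
One way to keep class distributions comparable between training and test data is a stratified split. The sketch below is only an illustration on a hypothetical toy frame: the `FraudLabel` column name, the 20% test fraction and the random seed are all assumptions, not values taken from the dataset or the case study.

```python
import pandas as pd

# Hypothetical toy data: 5 fraudulent ("Y") and 15 legitimate ("N") claims.
# "FraudLabel" is an assumed name for the Y/N target column.
toy = pd.DataFrame({
    "CustomerID": range(20),
    "FraudLabel": ["Y"] * 5 + ["N"] * 15,
})

# Sample the same fraction from each class so both splits keep
# the original Y/N proportions (a stratified split).
test_set = toy.groupby("FraudLabel").sample(frac=0.2, random_state=0)
train_set = toy.drop(test_set.index)

# Share of fraudulent rows in each split.
train_fraud_share = (train_set["FraudLabel"] == "Y").mean()
test_fraud_share = (test_set["FraudLabel"] == "Y").mean()
```

With these toy numbers both splits keep a 25% share of fraudulent rows, matching the full frame.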

#### 2.1.4 Accuracy and Error Rate of the Models
Accuracy measures are needed to confirm that the model is making correct predictions, for example the proportion of correctly classified instances. The error rate will also be calculated to complement accuracy. The case study specifies that the client expects a balanced error rate of around 5%.
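
The case study does not define "balanced error rate". A common definition, assumed here, is the mean of the false-negative rate and the false-positive rate. The sketch below uses hypothetical toy labels to show how it can differ from plain accuracy when the classes are imbalanced.

```python
def balanced_error_rate(y_true, y_pred, positive="Y"):
    """Return the mean of the false-negative and false-positive rates."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    fnr = fn / (tp + fn)  # share of fraud cases that were missed
    fpr = fp / (tn + fp)  # share of genuine cases wrongly referred
    return 0.5 * (fnr + fpr)


# Hypothetical toy labels: 2 fraudulent claims, 8 legitimate ones.
y_true = ["Y", "Y", "N", "N", "N", "N", "N", "N", "N", "N"]
y_pred = ["Y", "N", "N", "N", "N", "N", "N", "N", "N", "Y"]

ber = balanced_error_rate(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

On this toy example the accuracy is 0.8 while the balanced error rate is 0.3125, because one of the two fraudulent claims is missed. This is why accuracy alone may not be a sufficient check against the 5% target.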


## 3 Pre-Processing

#### Key Observations
- No duplicate customerIDs.
- 13 categorical features and 6 numerical. However, some numerical features use "?" or "MISSINGVALUE" for missing values, which causes them to register as non-numerical.
- The data window is from 2015-01-01 to 2015-03-14.
- Binary class labels: N = No, Y = Yes.
- 'Witnesses' contains 46 values of 'MISSINGVALUE'. Removing these rows makes the most sense, as they are an insignificant portion of the data.
- 'TypeOfCollission' contains 5162 values of '?'.
- 'PropertyDamage' contains 10459 values of '?'.
- 'PoliceReport' contains 9805 values of '?'.
- 'IncedentTime' has some negative values, which do not make sense; these rows will be dropped.
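
The observations above could be acted on roughly as follows. This is a minimal sketch on hypothetical toy rows: only the column names and the placeholder strings come from the observations, and the frame `toy_claims` is invented for illustration. Columns with many '?' values are left as missing rather than dropped here, on the assumption that they are handled separately in Section 3.3.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the values are invented for illustration.
toy_claims = pd.DataFrame({
    "Witnesses": ["2", "MISSINGVALUE", "0", "1"],
    "IncedentTime": [14, 9, -5, 22],
    "PropertyDamage": ["?", "YES", "NO", "?"],
})

# Treat both placeholder strings as proper missing values.
cleaned = toy_claims.replace(["?", "MISSINGVALUE"], np.nan)

# Coerce the column that should be numeric; unparseable entries become NaN.
cleaned["Witnesses"] = pd.to_numeric(cleaned["Witnesses"], errors="coerce")

# Drop the small number of rows with missing Witnesses,
# then the rows with a negative incident time.
cleaned = cleaned.dropna(subset=["Witnesses"])
cleaned = cleaned[cleaned["IncedentTime"] >= 0]

# Remaining missing values per column.
missing_counts = cleaned.isna().sum()
```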

#### Importing Initial Permitted Libraries

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

#### Loading the Datasets into Pandas Dataframes for Analysis

In [5]:
# Directory containing the training CSV files.
DATA_DIR = "./Data/archive/TrainData/TrainData"

claims_csv = f"{DATA_DIR}/Train_Claim.csv"
customers_withoutTarget_csv = f"{DATA_DIR}/Traindata_withoutTarget.csv"
customers_withTarget_csv = f"{DATA_DIR}/Traindata_with_Target.csv"
demographics_csv = f"{DATA_DIR}/Train_Demographics.csv"
policy_csv = f"{DATA_DIR}/Train_Policy.csv"
vehicle_csv = f"{DATA_DIR}/Train_Vehicle.csv"

# Load each CSV file into its own DataFrame.
claims_df = pd.read_csv(claims_csv)
customers_no_target_df = pd.read_csv(customers_withoutTarget_csv)
customers_target_df = pd.read_csv(customers_withTarget_csv)
demographics_df = pd.read_csv(demographics_csv)
policy_df = pd.read_csv(policy_csv)
vehicle_df = pd.read_csv(vehicle_csv)

### 3.1 Preparing the Labels

In [7]:
# Inner join (the pd.merge default) on CustomerID attaches the fraud label to each claim.
claims_with_labels = pd.merge(claims_df, customers_target_df, on='CustomerID')
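
Because an inner join silently drops claims that have no matching label, it may be worth checking the join before relying on it. The sketch below uses hypothetical stand-in frames: `FraudLabel` and `ClaimAmount` are assumed column names, not ones confirmed from the dataset.

```python
import pandas as pd

# Hypothetical stand-ins for claims_df and customers_target_df.
toy_claims_df = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C3"],
    "ClaimAmount": [5000, 1200, 800],
})
toy_target_df = pd.DataFrame({
    "CustomerID": ["C1", "C2"],
    "FraudLabel": ["Y", "N"],
})

# A left join keeps every claim; validate raises if CustomerID is
# duplicated on either side, and indicator records where each row matched.
checked = pd.merge(
    toy_claims_df,
    toy_target_df,
    on="CustomerID",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Claims with no label, which an inner join would have dropped.
unlabelled = checked[checked["_merge"] == "left_only"]

# Equivalent of the inner-join result, without the helper column.
labelled = checked[checked["_merge"] == "both"].drop(columns="_merge")
```

If `unlabelled` is empty on the real data, the inner join above loses nothing.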

### 3.2 Removing Synonymous and Noisy Attributes

### 3.3 Dealing with Missing Values

### 3.4 Dealing with Duplicate Values

### 3.5 Rescaling

### 3.6 Dealing with Class Imbalance

### 3.7 Dealing with Collinearity