# QCTO - Workplace Module

### Project Title: Insurance Fraud Investigation
#### Done By: Ndumiso Biyela

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

Insurance plays a vital role in the economy by providing financial security and stability, which enables economic growth and development. Insurance fraud affects not only insurance companies but also the wider economy and society: it leads to higher costs for consumers, financial instability, and reduced access to insurance.

Insurance fraud occurs when individuals collaborate to submit false or inflated claims for property damage or personal injuries after an accident. Common tactics include orchestrating accidents intentionally, using ‘phantom passengers’ who were not present at the scene but claim severe injuries, and exaggerating the extent of personal injuries to receive higher compensation.

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>


The packages used in this project include:

- NumPy and pandas for data manipulation.
- Matplotlib and Seaborn for visualization.
- Scikit-learn for modeling.


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import requests
import matplotlib.pyplot as plt

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

### About the dataset

This dataset was collected from the https://github.com/NdumisoBiyela/Public-Data GitHub repository.

The dataset contains details about various insurance claims and whether fraud was reported on each claim.

### Key Dataset Features:

policy_bind_date: Starting date of the insurance policy.

policy_csl: Combined Single Limit (CSL), the maximum amount the insurer will pay out per incident.

policy_annual_premium: The total dollar amount for the yearly premium.

umbrella_limit: Extra insurance that provides protection beyond existing limits and coverages of other policies.

auto_make: Vehicle brand.

auto_model: Vehicle model.

insured_education_level: Highest qualification of the insured.

policy_deductable: The excess the insured pays before the insurer pays out on a claim.

insured_occupation: The profession of the insured.

fraud_reported: Y for a fraudulent or false claim, N for a legitimate and valid claim.


---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>


In [2]:
dataset = pd.read_csv('insurance_claims_raw.csv')

In [3]:
dataset.describe()

| | months_as_customer | age | policy_number | policy_deductable | policy_annual_premium | umbrella_limit | insured_zip | capital-gains | capital-loss | incident_hour_of_the_day | number_of_vehicles_involved | bodily_injuries | witnesses | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_year | _c39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 998.0 | 1000.0 | 998.0 | 997.0 | 1000.0 | 1000.0 | 998.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 996.0 | 999.0 | 999.0 | 1000.0 | 1000.0 | 0.0 |
| mean | 203.954 | 38.962926 | 546238.648 | 1134.268537 | 1257.001113 | 1101000.0 | 501214.488 | 25176.452906 | -26793.7 | 11.644 | 1.839 | 0.992 | 1.487 | 52767.46988 | 7434.944945 | 7398.628629 | 37928.95 | 2005.103 | NaN |
| std | 115.113174 | 9.135425 | 257063.005276 | 611.251914 | 244.265051 | 2297407.0 | 71701.610941 | 27877.379027 | 28104.096686 | 6.951373 | 1.01888 | 0.820127 | 1.111335 | 26405.348039 | 4883.158265 | 4827.050887 | 18886.252893 | 6.015861 | NaN |
| min | 0.0 | 19.0 | 100804.0 | 500.0 | 433.33 | -1000000.0 | 430104.0 | 0.0 | -111100.0 | 0.0 | 1.0 | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 | 70.0 | 1995.0 | NaN |
| 25% | 115.75 | 32.0 | 335980.25 | 500.0 | 1090.32 | 0.0 | 448404.5 | 0.0 | -51500.0 | 6.0 | 1.0 | 0.0 | 1.0 | 41812.5 | 4290.0 | 4440.0 | 30292.5 | 2000.0 | NaN |
| 50% | 199.5 | 38.0 | 533135.0 | 1000.0 | 1257.83 | 0.0 | 466445.5 | 0.0 | -23250.0 | 12.0 | 1.0 | 1.0 | 1.0 | 57935.0 | 6780.0 | 6750.0 | 42100.0 | 2005.0 | NaN |
| 75% | 276.25 | 44.0 | 759099.75 | 2000.0 | 1415.74 | 0.0 | 603251.0 | 51075.0 | 0.0 | 17.0 | 3.0 | 2.0 | 2.0 | 70620.0 | 11310.0 | 10890.0 | 50822.5 | 2010.0 | NaN |
| max | 479.0 | 64.0 | 999435.0 | 2000.0 | 2047.59 | 10000000.0 | 620962.0 | 100500.0 | 0.0 | 23.0 | 4.0 | 2.0 | 3.0 | 114920.0 | 21450.0 | 23670.0 | 79560.0 | 2015.0 | NaN |


---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

This section shows how the dataset was cleaned: the empty `_c39` column is dropped, missing values are imputed, duplicate rows are checked, and numeric columns containing outliers are identified.

In [4]:
dataset.isna().sum()

months_as_customer                0
age                               2
policy_number                     0
policy_bind_date                  0
policy_state                      0
policy_csl                        0
policy_deductable                 2
policy_annual_premium             3
umbrella_limit                    0
insured_zip                       0
insured_sex                       0
insured_education_level           1
insured_occupation                0
insured_hobbies                   2
insured_relationship              0
capital-gains                     2
capital-loss                      0
incident_date                     0
incident_type                     0
collision_type                    0
incident_severity                 0
authorities_contacted             2
incident_state                    2
incident_city                     0
incident_location                 0
incident_hour_of_the_day          0
number_of_vehicles_involved       0
property_damage             

In [5]:
dataset.drop('_c39', axis=1, inplace=True)
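
The `_c39` column is dropped because `describe()` reports a count of 0 for it. As a minimal sketch, the same check can be made generic so that any fully empty column is found without naming it. The toy frame below is invented for illustration and only mimics the shape of the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw claims data (values are made up);
# "_c39" mimics the completely empty column seen in describe().
toy = pd.DataFrame({
    "age": [34, 45, 29],
    "fraud_reported": ["Y", "N", "N"],
    "_c39": [np.nan, np.nan, np.nan],
})

# Columns with zero non-null values carry no information.
empty_columns = toy.columns[toy.notna().sum() == 0].tolist()

# dropna(axis=1, how="all") removes exactly those columns and returns a new frame.
toy_clean = toy.dropna(axis=1, how="all")

print(empty_columns)
print(toy_clean.columns.tolist())
```

Listing the empty columns before dropping them leaves a record of what was removed.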

In [6]:
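def fill_missing_values(df):
    """Fill NaN with the column mean (numeric) or the column mode (categorical)."""
    for column in df.columns:
        if df[column].dtype in ['float64', 'int64']:
            # Assign the result back: fillna(inplace=True) on a selected column
            # is deprecated in recent pandas and may not update df at all
            df[column] = df[column].fillna(df[column].mean())
        else:
            # The mode is the most frequent category; [0] picks the first if tied
            df[column] = df[column].fillna(df[column].mode()[0])
    return df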

In [7]:
dataset = fill_missing_values(dataset)
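
To make the effect of mean and mode imputation concrete, here is a small worked example on invented data. The helper name `impute_mean_mode` is hypothetical, and unlike the notebook's function it works on a copy so the input stays untouched.

```python
import numpy as np
import pandas as pd

def impute_mean_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mean for numeric columns, mode for all others."""
    out = df.copy()  # leave the caller's frame unchanged
    for column in out.columns:
        if pd.api.types.is_numeric_dtype(out[column]):
            out[column] = out[column].fillna(out[column].mean())
        else:
            out[column] = out[column].fillna(out[column].mode()[0])
    return out

# Made-up rows: one missing age and one missing hobby.
toy = pd.DataFrame({
    "age": [30.0, np.nan, 50.0],
    "insured_hobbies": ["chess", "chess", None],
})

filled = impute_mean_mode(toy)
print(filled)
```

The missing age becomes 40.0 (the mean of 30 and 50) and the missing hobby becomes "chess" (the most frequent value). The median is a common alternative for skewed numeric columns, since the mean is pulled towards extreme values.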


In [8]:
dataset.isna().sum()

months_as_customer             0
age                            0
policy_number                  0
policy_bind_date               0
policy_state                   0
policy_csl                     0
policy_deductable              0
policy_annual_premium          0
umbrella_limit                 0
insured_zip                    0
insured_sex                    0
insured_education_level        0
insured_occupation             0
insured_hobbies                0
insured_relationship           0
capital-gains                  0
capital-loss                   0
incident_date                  0
incident_type                  0
collision_type                 0
incident_severity              0
authorities_contacted          0
incident_state                 0
incident_city                  0
incident_location              0
incident_hour_of_the_day       0
number_of_vehicles_involved    0
property_damage                0
bodily_injuries                0
witnesses                      0
police_rep

In [9]:
duplicates = dataset[dataset.duplicated()]
print(duplicates)

Empty DataFrame
Columns: [months_as_customer, age, policy_number, policy_bind_date, policy_state, policy_csl, policy_deductable, policy_annual_premium, umbrella_limit, insured_zip, insured_sex, insured_education_level, insured_occupation, insured_hobbies, insured_relationship, capital-gains, capital-loss, incident_date, incident_type, collision_type, incident_severity, authorities_contacted, incident_state, incident_city, incident_location, incident_hour_of_the_day, number_of_vehicles_involved, property_damage, bodily_injuries, witnesses, police_report_available, total_claim_amount, injury_claim, property_claim, vehicle_claim, auto_make, auto_model, auto_year, fraud_reported]
Index: []

[0 rows x 39 columns]


In [10]:
def identify_outliers(df):
    """Return the numeric columns that have at least one value outside the IQR fences."""
    outlier_columns = []
    for column in df.select_dtypes(include=['float64', 'int64']).columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        # Values more than 1.5 * IQR beyond the quartiles are treated as outliers
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        if ((df[column] < lower_bound) | (df[column] > upper_bound)).any():
            outlier_columns.append(column)
    return outlier_columns

In [11]:
outliers = identify_outliers(dataset)

In [12]:
print(outliers)

['age', 'policy_annual_premium', 'umbrella_limit', 'total_claim_amount', 'property_claim']
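
The function above only reports which columns contain outliers. As a sketch of one possible next step, the example below computes the same 1.5 × IQR fences on a made-up series and caps values at the fences. Capping is an assumption for illustration and is not applied in this notebook; `iqr_bounds` is a hypothetical helper.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Hypothetical helper: lower and upper fences at k * IQR from the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up claim amounts with one obviously extreme value.
claims = pd.Series([10, 12, 14, 16, 18, 100], name="total_claim_amount")

lower, upper = iqr_bounds(claims)                    # Q1 = 12.5, Q3 = 17.5, IQR = 5.0
outlier_mask = (claims < lower) | (claims > upper)   # True only for the value 100
capped = claims.clip(lower=lower, upper=upper)       # 100 is pulled down to the upper fence

print(lower, upper)
print(claims[outlier_mask].tolist())
print(capped.tolist())
```

Whether to cap, drop, or keep such values depends on whether they are data errors or genuine extreme claims. In fraud detection, unusually large claims may themselves be informative.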


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---


In [13]:
#Please use code cells to code in and do not forget to comment your code.
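
As a starting point for this section, the sketch below shows two checks that are usually informative for a fraud dataset: the class balance of `fraud_reported` and the fraud rate within each category of a feature. The rows and the category values are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Made-up rows using column names from the dataset.
toy = pd.DataFrame({
    "incident_severity": ["Major Damage", "Minor Damage", "Major Damage",
                          "Total Loss", "Minor Damage", "Major Damage"],
    "total_claim_amount": [60000, 5000, 70000, 55000, 6000, 65000],
    "fraud_reported": ["Y", "N", "Y", "N", "N", "N"],
})

# Share of fraudulent versus legitimate claims.
class_balance = toy["fraud_reported"].value_counts(normalize=True)

# Fraud rate per category: the mean of a boolean column is the share of True values.
is_fraud = toy["fraud_reported"].eq("Y")
fraud_rate_by_severity = (
    is_fraud.groupby(toy["incident_severity"]).mean().sort_values(ascending=False)
)

print(class_balance)
print(fraud_rate_by_severity)
```

On the real data, the same two results could be passed to a Seaborn bar plot to compare categories visually.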

---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---


In [14]:
#Please use code cells to code in and do not forget to comment your code.
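
The sketch below shows one way a baseline model could be set up with Scikit-learn, assuming a logistic regression on one numeric and one categorical feature. The data is synthetic and the model choice is an assumption, not the model used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned dataset (values are randomly generated).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "total_claim_amount": rng.normal(50000, 20000, n),
    "incident_severity": rng.choice(["Major Damage", "Minor Damage", "Total Loss"], n),
})
toy["fraud_reported"] = np.where(
    (toy["incident_severity"] == "Major Damage") & (rng.random(n) < 0.7), "Y", "N"
)

X = toy.drop(columns="fraud_reported")
y = toy["fraud_reported"].map({"N": 0, "Y": 1})  # encode the target as 0/1

# Stratify so both splits keep the same share of fraud cases.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale numeric features and one-hot encode categorical ones inside the pipeline,
# so preprocessing is fitted on the training split only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["total_claim_amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["incident_severity"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions[:10])
```

Keeping preprocessing inside the pipeline avoids leaking information from the test split into the fitted scaler or encoder. `class_weight="balanced"` is one simple way to account for fraud being the minority class.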

---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---

In [15]:
#Please use code cells to code in and do not forget to comment your code.
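
To illustrate the metrics listed above, the worked example below computes them for a small set of made-up labels, where 1 stands for a fraudulent claim. The labels are invented and are not results from this project.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: 3 fraudulent claims (1) out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# For binary labels, confusion_matrix is laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 8 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

# A model that never predicts fraud still reaches 70% accuracy but catches nothing.
y_naive = [0] * len(y_true)
naive_accuracy = accuracy_score(y_true, y_naive)
naive_recall = recall_score(y_true, y_naive)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
print(naive_accuracy, naive_recall)
```

The naive baseline shows why accuracy alone is a weak metric when fraud is the minority class, and why precision, recall, and F1-score are reported alongside it.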

---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---


In [16]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [17]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

In [18]:
#Please use code cells to code in and do not forget to comment your code.

## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Contributors: 
If this is a group project, list the contributors and their roles or contributions to the project.
