# QCTO - Workplace Module

### Investigating the insurance industry in Africa
#### Done By: Mieke Spaans

© ExploreAI 2024

---
 <a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Introduce the project, outline its goals, and explain its significance.
* **Details:** Include information about the problem domain, the specific questions or challenges the project aims to address, and any relevant background information that sets the stage for the work.
---

Add introduction here

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** Listed below are all of the Python packages used throughout the project, such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, and scikit-learn for modeling. I have separated the imports into code blocks by the part of the project they serve, for ease of reading and debugging.
---

In [None]:
# MLflow is used to track experiments during model hyperparameter tuning
!pip install mlflow

In [None]:
# Loading .csv files via Google Drive (only available in Google Colab):
from google.colab import drive

In [4]:
# Libraries used for data loading and pre-processing:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

In [None]:
# Libraries used for visualisation, text matching and statistics during exploratory data analysis:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import re
from scipy import stats

In [None]:
# Model training libraries:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Model evaluation libraries:
from sklearn.metrics import mean_squared_error

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---

The dataset I am using for this project has been provided by ExploreAI. 

Link to the GitHub repository that contains the .csv file: https://github.com/Explore-AI/Public-Data/blob/master/insurance_claims.csv
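
The link above points to GitHub's HTML page for the file, which pandas cannot parse as CSV. The sketch below converts such a "blob" link into a raw-file link, assuming GitHub's usual `raw.githubusercontent.com/<owner>/<repo>/<branch>/<path>` layout. The helper name `github_blob_to_raw` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url: str) -> str:
    """Convert a github.com 'blob' page URL into its raw-file URL (hypothetical helper)."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"Not a GitHub blob URL: {url}")
    owner, repo, _, *rest = segments
    raw_path = "/".join([owner, repo, *rest])
    return urlunsplit(("https", "raw.githubusercontent.com", "/" + raw_path, "", ""))


blob_url = "https://github.com/Explore-AI/Public-Data/blob/master/insurance_claims.csv"
raw_url = github_blob_to_raw(blob_url)
print(raw_url)
# The result could then be passed to pandas, e.g. pd.read_csv(raw_url)
```

This only rewrites the string; it makes no network request, so whether the file is reachable at that address still needs to be checked.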

We will go into greater detail in the "Loading Data" section below, but here are some of the notable columns:
- `fraud_reported`:
    - Y for a fraudulent claim
    - N for a valid claim
- `policy_bind_date`:
    - Starting date of the insurance policy.
    - This is useful because it lets us analyse when each policy started.
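
As a minimal sketch of how these two columns might be prepared for analysis, the block below parses `policy_bind_date` into datetimes and maps `fraud_reported` to booleans. It uses five hand-copied rows rather than the full file, and the `is_fraud` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Five rows copied by hand from the dataset preview; not the full file
sample = pd.DataFrame({
    "policy_bind_date": ["2014-10-17", "2006-06-27", "2000-09-06", "1990-05-25", "2014-06-06"],
    "fraud_reported": ["Y", "Y", "N", "Y", "N"],
})

# Parse the date strings so year/month can be extracted later
sample["policy_bind_date"] = pd.to_datetime(sample["policy_bind_date"], format="%Y-%m-%d")

# 'is_fraud' is a hypothetical helper column: True for Y, False for N
sample["is_fraud"] = sample["fraud_reported"].map({"Y": True, "N": False})

# Number of policies started in each calendar year
policies_per_year = sample["policy_bind_date"].dt.year.value_counts().sort_index()
print(policies_per_year)
```

Keeping the original `fraud_reported` column alongside the boolean makes it easy to verify the mapping before dropping anything.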

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [5]:
# Loading via pandas when using Jupyter Notebooks
insurance = pd.read_csv('insurance_claims.csv')

In [6]:
# Check that the data was imported correctly
insurance.head()

| | months_as_customer | age | policy_number | policy_bind_date | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_zip | ... | police_report_available | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | auto_model | auto_year | fraud_reported | _c39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 328 | 48 | 521585 | 2014-10-17 | OH | 250/500 | 1000 | 1406.91 | 0 | 466132 | ... | YES | 71610 | 6510 | 13020 | 52080 | Saab | 92x | 2004 | Y | NaN |
| 1 | 228 | 42 | 342868 | 2006-06-27 | IN | 250/500 | 2000 | 1197.22 | 5000000 | 468176 | ... | ? | 5070 | 780 | 780 | 3510 | Mercedes | E400 | 2007 | Y | NaN |
| 2 | 134 | 29 | 687698 | 2000-09-06 | OH | 100/300 | 2000 | 1413.14 | 5000000 | 430632 | ... | NO | 34650 | 7700 | 3850 | 23100 | Dodge | RAM | 2007 | N | NaN |
| 3 | 256 | 41 | 227811 | 1990-05-25 | IL | 250/500 | 2000 | 1415.74 | 6000000 | 608117 | ... | NO | 63400 | 6340 | 6340 | 50720 | Chevrolet | Tahoe | 2014 | Y | NaN |
| 4 | 228 | 44 | 367455 | 2014-06-06 | IL | 500/1000 | 1000 | 1583.91 | 6000000 | 610706 | ... | NO | 6500 | 1300 | 650 | 4550 | Accura | RSX | 2009 | N | NaN |


The output above shows that the dataframe was imported correctly.

Some issues noted:
- Column `_c39` has a strange name, and all of its values are null.
- `police_report_available` has YES, NO and ? values.
- `policy_state` has values like 'OH', 'IN' and 'IL'. These seem to correspond to American states, specifically Ohio, Indiana and Illinois. This is unexpected, as the task is to investigate the insurance industry in Africa.
- Confusion regarding `months_as_customer` and `policy_bind_date`:
    - Row 0 has a value of 328 months, although the policy started in 2014.
    - Row 1 has a value of 228 months, although the policy started in 2006.
    - One would expect an earlier `policy_bind_date` to go with a larger `months_as_customer`, but the data seems to show otherwise.
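
The mismatch between the tenure and the bind date can be quantified with a rough check like the one below. It uses the five previewed rows and an arbitrary reference date of 2015-03-01, which is purely an assumption for illustration; the notebook does not state when the data was extracted. The column names `months_since_bind` and `gap_months` are hypothetical.

```python
import pandas as pd

# Values copied by hand from the first five rows of the preview
sample = pd.DataFrame({
    "months_as_customer": [328, 228, 134, 256, 228],
    "policy_bind_date": ["2014-10-17", "2006-06-27", "2000-09-06", "1990-05-25", "2014-06-06"],
})
sample["policy_bind_date"] = pd.to_datetime(sample["policy_bind_date"])

# Assumption: an arbitrary reference date, NOT taken from the dataset
reference_date = pd.Timestamp("2015-03-01")

# Whole calendar months between the bind date and the reference date
sample["months_since_bind"] = (
    (reference_date.year - sample["policy_bind_date"].dt.year) * 12
    + (reference_date.month - sample["policy_bind_date"].dt.month)
)

# Positive gap: the customer tenure is longer than the current policy's age
sample["gap_months"] = sample["months_as_customer"] - sample["months_since_bind"]
print(sample)
```

A large positive gap may simply mean the customer held earlier policies with the insurer, while a negative gap would be harder to explain and is worth flagging during cleaning.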

In [7]:
# Check size of dataframe
insurance.shape

(1000, 40)

The dataframe contains 1000 rows and 40 columns.

In [9]:
# Get each column's data type, and count non-null values
insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 40 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   months_as_customer           1000 non-null   int64  
 1   age                          1000 non-null   int64  
 2   policy_number                1000 non-null   int64  
 3   policy_bind_date             1000 non-null   object 
 4   policy_state                 1000 non-null   object 
 5   policy_csl                   1000 non-null   object 
 6   policy_deductable            1000 non-null   int64  
 7   policy_annual_premium        1000 non-null   float64
 8   umbrella_limit               1000 non-null   int64  
 9   insured_zip                  1000 non-null   int64  
 10  insured_sex                  1000 non-null   object 
 11  insured_education_level      1000 non-null   object 
 12  insured_occupation           1000 non-null   object 
 13  insured_hobbies    

From this we find out what each column's data type is, as well as the total number of non-null values in each.
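
A quicker way to tally data types and null counts, without reading the full `info()` listing, is sketched below. The frame is a small hand-built stand-in with four columns, not the real dataset, so the counts it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; values are copied from the preview where available
sample = pd.DataFrame({
    "months_as_customer": [328, 228, 134],
    "policy_annual_premium": [1406.91, 1197.22, 1413.14],
    "policy_state": ["OH", "IN", "OH"],
    "_c39": [np.nan, np.nan, np.nan],
})

# Number of columns per data type
dtype_counts = sample.dtypes.value_counts()
print(dtype_counts)

# Number of null values per column, keeping only columns that have any
null_counts = sample.isna().sum()
print(null_counts[null_counts > 0])
```

On the real dataframe, the same two calls would be made on `insurance` instead of `sample`.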

Analysis:
- The three data types present are:
    - Integer (17 columns)
    - Float (2 columns)
    - Object (typically string values) (21 columns)
- Only two columns contain null values:
    - `authorities_contacted`: The nulls could indicate uncertainty about whether authorities were contacted. Before replacing these values, we need to find out whether this column is a boolean Y/N or a wider set of categories such as Yes/No/Unsure.
    - `_c39`: Seems to be an extra column. It will be removed, as all of its values are null.
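
Note that `info()` only counts true nulls, so placeholder strings such as the `?` seen in `police_report_available` are reported as non-null. The sketch below shows one possible way to surface them and to drop all-null columns. It runs on a hand-built sample, and treating `?` as missing is an assumption about what the placeholder means.

```python
import numpy as np
import pandas as pd

# Hand-built sample mirroring the previewed rows
sample = pd.DataFrame({
    "police_report_available": ["YES", "?", "NO", "NO", "NO"],
    "fraud_reported": ["Y", "Y", "N", "Y", "N"],
    "_c39": [np.nan] * 5,
})

# '?' is an ordinary string, so it is not counted as null here
print(sample.isna().sum())

# Assumption: '?' means "unknown"; convert it to NaN, then drop all-null columns
cleaned = sample.replace("?", np.nan).dropna(axis="columns", how="all")
print(cleaned.isna().sum())
```

Whether the resulting NaN values should later be imputed or kept as their own "unknown" category depends on what the column is used for in modeling.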

---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Contributors: 
If this is a group project, list the contributors and their roles or contributions to the project.
