Preamble: <br>
Author: Stephen Brownsey  <br>
Python version: 3.10.5 64-bit  <br>


The problem is to predict which cases will lapse. The work is broken down into three sections:
1. Data exploration: What are the most interesting features of the data set? What have you considered, and why have you made the decisions you made?
2. Modelling: What process did you follow when modelling retention? How have you designed your model, and what did you account for?
3. What are your conclusions, and what else would have been useful to know?


In [16]:
#Library loading section
import pandas as pd
import sklearn as skl
import sweetviz as sv
from tqdm import tqdm


In [18]:
#Dropping the two identifier columns: i is the index and Police is the policy number
data = pd.read_csv("data/home_insurance.csv").drop(columns=["i", "Police"], errors="ignore")

In [19]:
#Sweetviz is an EDA library that generates an HTML report summarising the dataset
my_report = sv.analyze(data)
my_report.show_html("raw_data.html")





Report raw_data.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


This sweetviz report quickly tells us a few things about the dataset:
1. There are 67115 rows with a missing policy status. Since this is our dependent variable, these rows should be removed. The same count of 67115 missing values appears in many of the other variables, which suggests the same rows are empty throughout and supports removing them. There are also 16 policies with status Unknown; this number is low enough that we can afford to remove them as well.
2. A number of variables are mostly missing. These will be analysed further, but most are expected to be dropped before modelling.
3. There are a number of date variables, which should go through feature engineering before being added to the model.
4. Some columns are irrelevant because they take only one value, such as PAYMENT_FREQUENCY. Other columns are very heavily skewed and need analysis to decide whether they should be considered for the model.
5. Some numerical columns, such as SUM_INSURED_CONTENTS, are more ordinal than continuous and should be encoded as such.
6. There are very strong associations between many of the columns, so the dataset should go through a rigorous feature selection process before modelling to limit issues such as overfitting.
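Points 1, 2 and 4 above can be sketched as a small cleaning helper. This is only an illustration, not the cleaning actually applied later: the function name `basic_clean`, the 50% missingness threshold, the status values "Live" and "Lapsed", and the `MOSTLY_MISSING` column are all assumptions made for the toy example.

```python
import numpy as np
import pandas as pd


def basic_clean(df, target="POL_STATUS", max_missing=0.5):
    """Hypothetical helper covering points 1, 2 and 4 of the list above."""
    # Point 1: drop rows where the target is missing or "Unknown".
    out = df[df[target].notna() & (df[target] != "Unknown")].copy()
    # Point 2: drop columns whose share of missing values exceeds max_missing
    # (the 0.5 default is an assumed threshold, not taken from the analysis).
    missing_share = out.isna().mean()
    out = out.drop(columns=missing_share[missing_share > max_missing].index)
    # Point 4: drop columns that take a single distinct non-missing value.
    n_unique = out.nunique(dropna=True)
    out = out.drop(columns=n_unique[n_unique <= 1].index)
    return out


# Toy stand-in for the real data; all values here are made up.
toy = pd.DataFrame({
    "POL_STATUS": ["Live", "Lapsed", None, "Unknown", "Live"],
    "PAYMENT_FREQUENCY": [1, 1, np.nan, 1, 1],
    "MOSTLY_MISSING": [np.nan, np.nan, np.nan, 2.0, np.nan],
    "SUM_INSURED_CONTENTS": [50000, 75000, np.nan, 50000, 100000],
})
cleaned = basic_clean(toy)
print(cleaned.columns.tolist())
```

Filtering rows before computing the missing share matters here: once the rows with no policy status are gone, the per-column missingness reflects only the policies we can actually model.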

In [21]:
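#Only the last expression in a cell is displayed, so the first two lines produce no output here
data.columns
data['POL_STATUS'].unique()
#Rows where the policy status is missing
data[pd.isnull(data.POL_STATUS)]
#data[data.POL_STATUS == "Unknown"]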

Unnamed: 0,QUOTE_DATE,COVER_START,CLAIM3YEARS,P1_EMP_STATUS,P1_PT_EMP_STATUS,BUS_USE,CLERICAL,AD_BUILDINGS,RISK_RATED_AREA_B,SUM_INSURED_BUILDINGS,...,HP2_ADDON_PRE_REN,HP2_ADDON_POST_REN,HP3_ADDON_PRE_REN,HP3_ADDON_POST_REN,MTA_FLAG,MTA_FAP,MTA_APRP,MTA_DATE,LAST_ANN_PREM_GROSS,POL_STATUS
18,11/15/2007,,N,,,,,,,,...,,,,,Y,216.95,216.95,02/02/2009,216.95,
28,1/17/2008,,N,,,,,,,,...,,,,,Y,247.26,247.26,18/02/2008,247.26,
81,11/27/2007,,,,,,,,,,...,,,,,,,,,,
82,12/11/2007,,,,,,,,,,...,,,,,,,,,,
83,12/13/2007,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255992,2/22/2011,,,,,,,,,,...,,,,,,,,,,
255993,3/3/2011,,,,,,,,,,...,,,,,,,,,,
255994,3/3/2011,,,,,,,,,,...,,,,,,,,,,
255995,8/17/2011,,,,,,,,,,...,,,,,,,,,,
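The rows above hint at one thing to watch when engineering the date variables: the QUOTE_DATE values shown look like month/day/year, whereas an MTA_DATE value such as 18/02/2008 can only be day/month/year. A minimal sketch of parsing each column with an explicit format and deriving simple features is given below. The formats are inferred from the handful of rows displayed and would need checking against the full data, and the feature names are hypothetical.

```python
import pandas as pd

# Sample values copied from the rows displayed above, plus one missing row.
sample = pd.DataFrame({
    "QUOTE_DATE": ["11/15/2007", "1/17/2008", None],
    "MTA_DATE": ["02/02/2009", "18/02/2008", None],
})

# Assumed formats: month/day/year for QUOTE_DATE, day/month/year for MTA_DATE.
# errors="coerce" turns anything that does not match into NaT instead of raising.
quote = pd.to_datetime(sample["QUOTE_DATE"], format="%m/%d/%Y", errors="coerce")
mta = pd.to_datetime(sample["MTA_DATE"], format="%d/%m/%Y", errors="coerce")

# Hypothetical engineered features: calendar parts and the gap between dates.
features = pd.DataFrame({
    "quote_year": quote.dt.year,
    "quote_month": quote.dt.month,
    "days_quote_to_mta": (mta - quote).dt.days,
})
print(features)
```

Passing the format explicitly avoids pandas guessing a different day/month order for each column, which would silently corrupt any date difference computed from them.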
