Preamble: <br>
Author: Stephen Brownsey  <br>
Python version: 3.10.5 64-bit  <br>


The problem is to predict which cases will lapse and is broken down into three sections:
1. Data exploration: What are the most interesting features of the data set? What have you considered and why have you made the decisions you have done?
2. Modelling: What process did you follow when modelling retention? How have you designed your model and what did you account for?
3. What are your conclusions and what else would’ve been useful to know?


In [120]:
#Library loading section
import pandas as pd
import sklearn as skl
import sweetviz as sv
from tqdm import tqdm
from utils import get_missing_column_values
from datetime import datetime, date
from dateutil.relativedelta import relativedelta


In [101]:
data = pd.read_csv("data/home_insurance.csv").drop(columns = ["i", "Police"], errors = "ignore") #Dropping the two identifier columns: i is the index and Police is the policy number

In [5]:
#Sweetviz is an EDA library that summarises every column in the dataframe. It takes a few minutes to run, so just open raw_data.html from the repo to see this output
#my_report = sv.analyze(data)
#my_report.show_html("raw_data.html")



This sweetviz report quickly tells us a few things about the dataset:
1. There are 67115 rows with a missing policy status. Since this is our dependent variable, these rows should be removed. The same count of 67115 missing values appears in many of the other variables, which backs up this decision. There are also 16 Unknown policies; since this is such a low number, we can afford to remove them as well.
2. Some columns are irrelevant: PAYMENT_FREQUENCY has only one option and CAMPAIGN_DESC is entirely missing.
3. A number of variables are majority missing. More analysis will be undertaken for these, but it is expected that most will be dropped before modelling.
4. There are a number of date variables, which should be put through feature engineering before we add them to our model.
5. Some columns are very heavily skewed, so they need analysis into whether they should be considered for the model or not.
7. Some numerical columns, such as SUM_INSURED_CONTENTS and SUM_INSURED_BUILDINGS, are more ordinal than continuous, so should be encoded as such.
8. There are very strong associations between a lot of the columns, particularly between the pre-renewal and post-renewal columns, which suggests they could be combined, and between the sum insured and premium columns. The dataset should go through a rigorous feature selection process before being used for modelling to iron out as many of these correlations as possible.
9. There are some outliers which will need looking at in more detail.
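
The strong associations noted in the list above can be made concrete by listing the most correlated numeric pairs before deciding what to combine or drop. The following is a minimal sketch on a toy frame; the helper name `correlated_pairs`, the column names, the values and the 0.9 threshold are illustrative assumptions rather than results from the home insurance data.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return numeric column pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep the strict upper triangle so each pair appears once and self-correlations are dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["column_a", "column_b", "abs_corr"]
    # The threshold filter also discards any masked (NaN) entries that survive stacking
    strong = pairs[pairs.abs_corr >= threshold]
    return strong.sort_values("abs_corr", ascending=False).reset_index(drop=True)

# Toy data (hypothetical): premium is an exact multiple of sum insured, age is unrelated
toy = pd.DataFrame({
    "sum_insured": [100.0, 200.0, 300.0, 400.0, 500.0],
    "premium": [10.0, 20.0, 30.0, 40.0, 50.0],
    "age": [30, 70, 20, 60, 40],
})
strong_pairs = correlated_pairs(toy)
```

Pearson correlation only covers numeric columns; the Y/N flag columns would need a categorical association measure instead.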

In [121]:
# Looking into point 1:
# Quick look into the Null Policy status rows to see if it contains anything useful
data[pd.isnull(data.POL_STATUS)].describe()
my_report = sv.analyze(data)
my_report.show_html("full_dataset.html")

Done! Use 'show' commands to display/save.   |██████████| [100%]   00:06 -> (00:00 left)


Report full_dataset.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


In [102]:
#Now update the data view to remove Unknown and Null policies
data = data[ (~pd.isnull(data.POL_STATUS)) & (data.POL_STATUS != "Unknown")]


#### Point 2/Point 3: <br>
From Point 2 it can be seen that PAYMENT_FREQUENCY and CAMPAIGN_DESC should be dropped from the analysis as they are irrelevant.
For Point 3, P1_PT_EMP_STATUS and CLERICAL will be dropped to start with, as a very high percentage of their values are missing.
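
Single-valued columns can also be found programmatically rather than read off the report. A minimal sketch, assuming a column counts as irrelevant when it has at most one distinct non-null value; the helper name `constant_columns` and the toy values are hypothetical.

```python
import pandas as pd

def constant_columns(df):
    """List columns with at most one distinct non-null value (entirely missing columns included)."""
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "PAYMENT_FREQUENCY": [1.0, 1.0, 1.0, None],   # a single observed value
    "CAMPAIGN_DESC": [None, None, None, None],    # entirely missing
    "POL_STATUS": ["Live", "Lapsed", "Live", "Live"],
})
to_drop = constant_columns(toy)
```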


In [106]:
# Looking at point 2:
data.drop(["PAYMENT_FREQUENCY", "CAMPAIGN_DESC"],axis = 1, inplace=True)
missing_info = get_missing_column_values(df = data)
missing_info





| | column | missing_count | missing_percentage |
|---|---|---|---|
| 5 | MTA_DATE | 162578 | 86.02 |
| 3 | MTA_FAP | 133630 | 70.7 |
| 4 | MTA_APRP | 133630 | 70.7 |
| 0 | QUOTE_DATE | 109868 | 58.13 |
| 1 | RISK_RATED_AREA_B | 48140 | 25.47 |
| 2 | RISK_RATED_AREA_C | 8731 | 4.62 |
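
`get_missing_column_values` is imported from the repo's `utils` module and its source is not shown here. Judging only from the output above, it returns one row per column that has missing values. A minimal sketch of a helper with that output shape follows; the name `missing_summary`, the sort order and the rounding are assumptions, not the author's implementation.

```python
import pandas as pd

def missing_summary(df):
    """Hypothetical stand-in: missing count and percentage for each column that has any missing values."""
    counts = df.isna().sum()
    counts = counts[counts > 0]
    summary = pd.DataFrame({
        "column": counts.index,
        "missing_count": counts.values,
        "missing_percentage": (100 * counts.values / len(df)).round(2),
    })
    return summary.sort_values("missing_count", ascending=False).reset_index(drop=True)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "MTA_DATE": [None, None, None, "2010-03-11"],
    "QUOTE_DATE": ["2007-11-22", None, "2007-11-23", None],
    "POL_STATUS": ["Live", "Lapsed", "Live", "Live"],
})
summary = missing_summary(toy)
```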


In [104]:
def index_insight(column, data = data):
    #Compare the POL_STATUS mix of rows where `column` is missing against rows where it is populated
    is_null = pd.isnull(data[column])
    nulled = data[is_null]["POL_STATUS"]
    contained = data[~is_null]["POL_STATUS"]
    total_null = len(nulled)
    total_contained = len(contained)
    #Naming the Series directly works whichever name value_counts() gives its result in the installed pandas
    nulled = nulled.value_counts().rename("nulled")
    contained = contained.value_counts().rename("contained")
    df = pd.concat([nulled, contained], axis = 1).rename_axis("POL_STATUS").reset_index()
    #index > 1 means the status is over-represented among rows where the column is populated
    df["index"] = round( (df.contained/total_contained)/(df.nulled/total_null), 3)
    df["column"] = column
    return df[["column", "POL_STATUS", "contained", "nulled", "index"]]

indexes = index_insight('P1_PT_EMP_STATUS')
iterate_cols = missing_info.column.unique().tolist()
iterate_cols.remove('P1_PT_EMP_STATUS')

for column in iterate_cols:
    temp = index_insight(column)
    indexes = pd.concat([indexes, temp])
indexes
    
data.drop(["P1_PT_EMP_STATUS", "CLERICAL"], axis = 1, inplace=True)
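
The index computed by `index_insight` divides each status's share among rows where the column is populated by its share among rows where it is missing, so a value far from 1 suggests that the missingness itself carries information about lapsing. The same shares can be read from a normalised cross-tabulation. A sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "MTA_DATE": ["2010-01-12", None, None, "2010-03-11", None, None],
    "POL_STATUS": ["Live", "Lapsed", "Lapsed", "Live", "Live", "Lapsed"],
})

# Rows: whether MTA_DATE is missing; columns: share of each policy status within that group
shares = pd.crosstab(toy["MTA_DATE"].isna(), toy["POL_STATUS"], normalize="index")
shares.index = ["contained", "nulled"]  # False (value present) sorts before True (missing)

# Ratio of the populated share to the missing share, per status
index_ratio = (shares.loc["contained"] / shares.loc["nulled"]).round(3)
```

In this toy frame every populated row is Live, so the ratio for Live is above 1 and the ratio for Lapsed is 0.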


#### Point 4: <br>
There are 4 columns which are dates: QUOTE_DATE, COVER_START, P1_DOB and MTA_DATE. These are encoded as object columns in the df, so convert them to datetime and perform feature engineering on them. <br>
Feature engineering of date columns: <br>
For dates revolving around start dates and quote times, feature engineering will involve getting the day of week and similar information. <br>
For date of birth, we'll turn it into age at COVER_START, since QUOTE_DATE has lots of nulls.
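
Age in whole years at cover start can be computed from the calendar fields directly, which avoids approximating the length of a year. A minimal sketch, assuming both columns are already datetime; the helper name `age_at` and the toy dates are hypothetical.

```python
import pandas as pd

def age_at(reference, dob):
    """Whole years between `dob` and `reference`; rows with a missing date give <NA>."""
    years = reference.dt.year - dob.dt.year
    # Subtract one year where the birthday has not yet been reached in the reference year
    before_birthday = (reference.dt.month < dob.dt.month) | (
        (reference.dt.month == dob.dt.month) & (reference.dt.day < dob.dt.day)
    )
    return (years - before_birthday.astype(int)).astype("Int64")

# Toy data (hypothetical values); the last row has no cover start date
toy = pd.DataFrame({
    "COVER_START": pd.to_datetime(["2007-11-22", "2008-01-01", None]),
    "P1_DOB": pd.to_datetime(["1940-11-23", "1950-01-01", "1960-05-05"]),
})
toy["Age"] = age_at(toy["COVER_START"], toy["P1_DOB"])
```

The nullable `Int64` dtype keeps rows with a missing date as `<NA>` instead of failing on the integer cast.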

In [116]:
date_cols = ["QUOTE_DATE", "COVER_START", "P1_DOB", "MTA_DATE"]
data[date_cols] = data[date_cols].apply(lambda col: pd.to_datetime(col, format='%Y-%m-%d %H:%M:%S.%f', errors='coerce')) #errors='coerce' turns values that do not match the format into NaT



In [126]:
#data["age_at_start"] = relativedelta(data.COVER_START, data.P1_DOB)
#data.drop(["P1_DOB"], axis = 1, inplace=True)
#data['year'] = data['COVER_START'].year - data['P1_DOB'].year

data


Unnamed: 0,QUOTE_DATE,COVER_START,CLAIM3YEARS,P1_EMP_STATUS,BUS_USE,AD_BUILDINGS,RISK_RATED_AREA_B,SUM_INSURED_BUILDINGS,NCD_GRANTED_YEARS_B,AD_CONTENTS,...,HP2_ADDON_PRE_REN,HP2_ADDON_POST_REN,HP3_ADDON_PRE_REN,HP3_ADDON_POST_REN,MTA_FLAG,MTA_FAP,MTA_APRP,MTA_DATE,LAST_ANN_PREM_GROSS,POL_STATUS
0,2007-11-22,2007-11-22,N,R,N,Y,19.0,1000000.0,7.0,Y,...,N,N,N,N,N,,,NaT,274.81,Lapsed
1,2007-11-22,2008-01-01,N,E,Y,Y,25.0,1000000.0,6.0,Y,...,N,N,N,N,Y,308.83,-9.27,NaT,308.83,Live
2,2007-11-23,2007-11-23,N,E,N,N,,0.0,0.0,Y,...,N,N,N,N,Y,52.65,52.65,2010-03-11,52.65,Live
3,2007-11-23,2007-12-12,N,R,N,N,,0.0,0.0,Y,...,N,N,N,N,N,,,NaT,54.23,Live
4,2007-11-22,2007-12-15,N,R,N,Y,5.0,1000000.0,7.0,Y,...,N,N,N,N,N,,,NaT,244.58,Live
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
256131,NaT,2005-02-22,Y,R,N,Y,16.0,1000000.0,2.0,Y,...,N,N,N,N,N,,,NaT,235.08,Lapsed
256132,NaT,2000-01-12,N,R,N,Y,0.0,1000000.0,5.0,Y,...,N,N,N,N,Y,194.02,194.02,2010-01-12,194.02,Live
256133,NaT,2006-01-18,N,R,N,Y,1.0,1000000.0,5.0,Y,...,N,N,N,N,N,,,NaT,287.30,Live
256134,NaT,2004-12-31,N,R,N,Y,32.0,1000000.0,5.0,N,...,N,N,N,N,N,,,NaT,457.57,Lapsed


In [136]:
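#Calculating user age in whole years at cover start (approximate: uses 365.25-day years); rows with a missing date stay <NA>
data['Age'] = ((data['COVER_START'] - data['P1_DOB']).dt.days // 365.25).astype('Int64')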
#Switching Quote and Cover Start days into Day of week and month of year
def date_feature_engineerer(data = data, column = "COVER_START", prefix = "cover_start"):
    column1 = prefix + "_day_of_week"
    column2 = prefix + "_month"
    data[column1] = data[column].dt.day_name()
    data[column2] = data[column].dt.month_name()
    data.drop(column, axis = 1, inplace=True)

date_feature_engineerer()
date_feature_engineerer(column = "QUOTE_DATE", prefix = "quote_date")
date_feature_engineerer(column = "MTA_DATE", prefix = "mta")
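
`date_feature_engineerer` modifies the frame in place through a default argument that is bound when the function is defined, so it keeps pointing at that object even if `data` is later reassigned. A variant that takes the frame explicitly and returns a new one avoids relying on that binding. A minimal sketch; the name `add_date_parts` and the toy dates are hypothetical.

```python
import pandas as pd

def add_date_parts(df, column, prefix):
    """Return a copy of `df` with day-of-week and month-name columns in place of `column`."""
    out = df.copy()
    out[prefix + "_day_of_week"] = out[column].dt.day_name()
    out[prefix + "_month"] = out[column].dt.month_name()
    return out.drop(columns=column)

# Toy data (hypothetical values); the second row has a missing date
toy = pd.DataFrame({"COVER_START": pd.to_datetime(["2007-11-22", None])})
features = add_date_parts(toy, column="COVER_START", prefix="cover_start")
```

Missing dates produce missing day and month names, which matters for MTA_DATE, where most values are missing.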

