# Introduction


## Problem statement


Singlife has observed a concerning trend in the customer journey: potential policyholders hesitate and eventually disengage during the insurance acquisition process. To address this, Singlife wants to use its customer dataset. The objective is to <font size="4">**derive actionable insights from this data to enhance the customer experience**</font>. The challenge is to analyse the dataset to <font size="4">**uncover the critical touchpoints that contribute to customer drop-off and identify opportunities to streamline the application process and personalize communication**</font>. The ultimate goal is to <font size="4">**predict customer satisfaction and conversion rates, thereby bolstering Singlife's market position**</font>.


## Selected variables


<strong><h5>1. General Client Information</h5></strong>

1. `clntnum`
2. `ctrycode_desc`
3. `stat_flag`
4. `min_occ_date`
5. `cltdob_fix`
6. `cltsex_fix`
7. `cltage` (Age of client)
8. `clt_ten` (Customer tenure)

<strong><h5>2. Client Risk and Status Indicators</h5></strong>

1. `flg_substandard`
2. `flg_is_borderline_standard`
3. `flg_is_revised_term`
4. `flg_has_health_claim`
5. `flg_gi_claim`
6. `flg_is_proposal`

<strong><h5>3. Demographic and Household Information</h5></strong>

1. `is_dependent_in_at_least_1_policy`
2. `annual_income_est`

<strong><h5>4. Policy and Claim History</h5></strong>

1. `tot_inforce_pols`, `tot_cancel_pols`
2. `f_ever_declined_la`

<strong><h5>5. Target Column</h5></strong>

1. `f_purchase_lh` (Indicates whether the customer will purchase insurance in the next 3 months)
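
The groups above can also be written down as a plain mapping, which makes it easy to check which selected columns are absent from a DataFrame before filtering. This is a minimal sketch: the names `SELECTED_VARIABLES`, `flatten_selected` and `missing_columns` are hypothetical and do not appear in the notebook.

```python
# Hypothetical mapping of the variable groups listed above (names are illustrative)
SELECTED_VARIABLES = {
    "general_client_information": [
        "clntnum", "ctrycode_desc", "stat_flag", "min_occ_date",
        "cltdob_fix", "cltsex_fix",
        # cltage and clt_ten are derived during data cleaning, not raw columns
        "cltage", "clt_ten",
    ],
    "client_risk_and_status": [
        "flg_substandard", "flg_is_borderline_standard", "flg_is_revised_term",
        "flg_has_health_claim", "flg_gi_claim", "flg_is_proposal",
    ],
    "demographic_and_household": [
        "is_dependent_in_at_least_1_policy", "annual_income_est",
    ],
    "policy_and_claim_history": [
        "tot_inforce_pols", "tot_cancel_pols", "f_ever_declined_la",
    ],
    "target": ["f_purchase_lh"],
}


def flatten_selected(groups):
    """Return all selected column names in group order."""
    return [col for cols in groups.values() for col in cols]


def missing_columns(available, groups):
    """List the selected columns that are not present in `available`."""
    available = set(available)
    return [col for col in flatten_selected(groups) if col not in available]


# Example with a made-up list of available columns
print(missing_columns(["clntnum", "stat_flag"], {"demo": ["clntnum", "cltage"]}))
```

Calling `missing_columns(data.columns, SELECTED_VARIABLES)` before the column filter would surface any absent column as a readable list instead of a `KeyError`.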


# Code


## Data cleaning


In [18]:
# %pip install pyarrow
# %pip install scikit-learn
# Standard library
import os
from datetime import datetime

# Data handling and plotting
import numpy as np
import pandas as pd
import pyarrow
import matplotlib.pyplot as plt
import seaborn as sns

# Modelling and preprocessing
import tensorflow as tf
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.feature_selection import VarianceThreshold
from imblearn.over_sampling import SMOTE

currWD = os.getcwd()
print("Current Working Directory:", currWD)
# os.chdir("Set WD here")
os.chdir("/Users/swislar/Desktop")
# Path is relative to the working directory set above
filepath = "./data/catB_train.parquet"
data = pd.read_parquet(filepath, engine='pyarrow')

Current Working Directory: /Users/swislar/Desktop
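
The loading cell depends on `os.chdir` with a machine-specific path. One possible alternative, sketched here under the assumption that the file sits in a `data/` subfolder of some base directory, is to build the path explicitly. The helper `resolve_data_path` and its `base_dir` parameter are hypothetical and not part of the notebook.

```python
from pathlib import Path


def resolve_data_path(base_dir, filename="catB_train.parquet"):
    """Build the parquet path under base_dir without changing the working directory."""
    # Assumption: the file lives in a "data" subfolder, mirroring "./data/catB_train.parquet"
    return Path(base_dir).expanduser() / "data" / filename


# Example: resolve relative to the current working directory
filepath = resolve_data_path(Path.cwd())
print(filepath.name)
```

The resulting `Path` can be passed to `pd.read_parquet` directly, so the working directory never needs to change.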


In [19]:
# Adding cltage and clt_ten cols
# Parse both date columns; "None" strings and missing values become NaT
for col in ["cltdob_fix", "min_occ_date"]:
    data[col] = pd.to_datetime(data[col], format="%Y-%m-%d", errors="coerce")
# Drop rows with incomplete data for `cltdob_fix` or `min_occ_date`
# (.copy() avoids SettingWithCopyWarning on the assignments below)
data = data.dropna(subset=["cltdob_fix", "min_occ_date"]).copy()

# Compute the client's current age in years
currentDate = datetime.now()
data["cltage"] = (currentDate - data["cltdob_fix"]).dt.days / 365.25
# Compute cltage_start: the client's age at min_occ_date, rounded to whole years
data["cltage_start"] = (
    (data["min_occ_date"] - data["cltdob_fix"]).dt.days / 365.25).round().astype(int)

# Computing the customer tenure (current age minus rounded starting age)
data["clt_ten"] = data["cltage"] - data["cltage_start"]
# Filtering columns
columnNames = ["clntnum", "ctrycode_desc", "stat_flag", "min_occ_date", "cltdob_fix", "cltsex_fix", "cltage", "clt_ten",
               "flg_substandard", "flg_is_borderline_standard", "flg_is_revised_term", "flg_has_health_claim", "flg_gi_claim", "flg_is_proposal",
               "is_dependent_in_at_least_1_policy", "annual_income_est", "tot_inforce_pols", "tot_cancel_pols", "f_ever_declined_la",
               "f_purchase_lh"]
data = data.loc[:, columnNames]
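
The date handling in this cell can be checked on a small example. The sketch below uses made-up rows and a fixed `reference_date` (both assumptions, chosen so the result is reproducible). It also measures tenure directly from `min_occ_date`, which is an alternative to the notebook's calculation of current age minus rounded starting age, not a reproduction of it.

```python
import pandas as pd

# Toy rows, made up for illustration (not taken from the Singlife dataset)
toy = pd.DataFrame({
    "cltdob_fix": ["1990-01-01", "None", "1980-06-15"],
    "min_occ_date": ["2015-01-01", "2010-03-01", "2000-06-15"],
})

# Assumption: a fixed reference date instead of datetime.now()
reference_date = pd.Timestamp("2024-01-01")

# Parse dates; the "None" string does not match the format and becomes NaT
for col in ["cltdob_fix", "min_occ_date"]:
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d", errors="coerce")
toy = toy.dropna(subset=["cltdob_fix", "min_occ_date"]).copy()

# Age and tenure in years, both measured from the same reference date
toy["cltage"] = (reference_date - toy["cltdob_fix"]).dt.days / 365.25
toy["clt_ten"] = (reference_date - toy["min_occ_date"]).dt.days / 365.25

print(toy[["cltage", "clt_ten"]].round(2))
```

Measuring tenure this way avoids the rounding of `cltage_start`, so it cannot go negative for clients whose `min_occ_date` is on or before the reference date.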