# Auto Insurance Policy Lapse Risk Prediction

### Author: Henry Udeogu

This machine learning project uses a real-world, insurance-aligned dataset from the paper
`"Dataset of an actual motor vehicle insurance portfolio" by Segura-Gisbert et al.` The authors conducted a research project within a Spanish insurance company and gained access to a sample of its motor vehicle insurance portfolio, which they were also authorized to share. The dataset contains **105,555 records** and has been anonymized to protect the policyholders' identities.

This dataset includes essential date-related information, such as the effective dates of policies, the birthdates of insured individuals, and renewal dates. It is also enriched with valuable economic variables, notably premiums and claim costs.

It is important to mention that open-access data on insured populations is currently limited. This dataset can be used by insurance companies, researchers, and educators, and is relevant for marketing purposes, including customer segmentation, contract renewal processes, price renewal strategies, optimization and price sensitivity models, as well as pricing mechanisms for new business.

The **primary goal** of this project is to build a classification model that predicts whether a customer is likely to let their auto insurance policy lapse (i.e., churn), so retention campaigns can be better targeted.
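
As a minimal sketch of how this goal could translate into a binary target, the snippet below flags a policy as lapsed when its `Lapse` value is greater than zero. This is an assumption about how `Lapse` is encoded (a count of lapsed policies), so it should be checked against `Descriptive of the variables.xlsx`. The function name `make_lapse_target`, the output column `lapsed`, and the toy data are hypothetical.

```python
import pandas as pd


def make_lapse_target(df: pd.DataFrame, lapse_col: str = "Lapse") -> pd.Series:
    """Return 1 where the row shows at least one lapse, else 0.

    Assumption: `lapse_col` holds a non-negative count of lapsed policies.
    Missing or non-numeric values are treated as 0 (no lapse).
    """
    lapse = pd.to_numeric(df[lapse_col], errors="coerce").fillna(0)
    return (lapse > 0).astype(int).rename("lapsed")


# Toy data for illustration only; these are not records from the real dataset.
toy = pd.DataFrame({"ID": [1, 2, 3, 4], "Lapse": [0, 1, 0, 2]})
toy["lapsed"] = make_lapse_target(toy)

# Share of each class, useful for spotting class imbalance early on.
print(toy["lapsed"].value_counts(normalize=True))
```

Checking the class balance at this point helps decide later whether resampling (for example with the `SMOTE` import used in this notebook) is worth considering.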

<u>**Impact:**</u>
- **Improve Policyholder Retention** – Identify customers at risk of lapsing and implement proactive engagement strategies.
- **Optimize Marketing & Outreach** – Personalize communication strategies based on predicted lapse risk scores.
- **Reduce Revenue Losses** – Mitigate potential revenue decline due to policyholder churn.

*<u>Data Source:</u> The dataset was collected from a non-life insurance company operating in Spain, and the data has been meticulously anonymized to comply with prevailing European legislation, safeguarding individual privacy and confidentiality. For ease of use, the authors have cleaned the data and provided a clean data file in spreadsheet format, which can be accessed via the DOI link below.*
https://doi.org/10.17632/5cxyb5fp4f.2

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from imblearn.over_sampling import SMOTE

import streamlit as st
import joblib

import warnings
warnings.filterwarnings("ignore")
```

## Load Dataset & Overview of the Data

The dataset is formatted as a spreadsheet covering the main operations of the company over three (3) full years (November 2015 to December 2018) and contains several motor insurance portfolio variables. It comprises **105,555 rows** and **30 columns**. Each row represents a policy transaction, while each column represents a distinct variable.

There are three (3) files in the data/raw folder:
- `Descriptive of the variables.xlsx`: Description of the variables in the dataset
- `Motor vehicle insurance data.csv`: Full motor insurance dataset (105K+ rows, 30 columns)
- `sample type claim.csv`: Partial claim type data (only 15% of policies, 2 additional columns relating to "claim_type")

All dataset files are included in the GitHub repo for this project. The variable descriptions and the raw variables can be viewed in the spreadsheets.
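
A minimal sketch of loading the main file with pandas is shown below. The `data/raw` folder and the file name come from the list above. The semicolon separator is an assumption and may need to be changed to a comma depending on how the CSV was exported. The helper name `load_portfolio` is hypothetical.

```python
from pathlib import Path

import pandas as pd

RAW_DIR = Path("data/raw")  # folder named in the file list above
MAIN_FILE = RAW_DIR / "Motor vehicle insurance data.csv"


def load_portfolio(path=MAIN_FILE, sep=";"):
    """Read the portfolio CSV into a DataFrame.

    Assumption: the file is semicolon-delimited; pass sep="," if it is not.
    """
    return pd.read_csv(path, sep=sep)


# Only attempt the load when the file is actually present locally.
if MAIN_FILE.exists():
    df = load_portfolio()
    print(df.shape)  # compare against the row and column counts stated above
    df.info()        # column dtypes and non-null counts
```

If the printed shape shows a single column, the separator guess is most likely wrong.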

### Categorizing the Variables

Based on the descriptions of the raw variables, I have grouped them into five feature sets.

1. **Customer Demographics** - Describes the policyholder's personal background, such as age, gender, and income level.
2. **Policy Details** - Describes the structure and lifecycle of the insurance policy, including tenure, renewal dates, and claims history.
3. **Policy Behavior & Engagement** - Describes the policyholder's relationship history with the insurer, e.g., payment method, products held, and lapse records.
4. **Financial Metrics** - Describes variables reflecting the economic value of the policy. This includes premiums paid and claim costs.
5. **Vehicle & Driving History** - Describes the technical and historical data about the insured vehicle and driving characteristics.

| No. | Category | Raw Variable(s) | ML Feature Rationale |
| :- | :- | :- | :- |
| 1 | Customer Demographics | `ID`, `Date_birth` | Age of the policyholder may impact lapse |
| 2 | Policy Details | `Seniority`, `N_claims_year`, `N_claims_history`, `R_Claims_history`, `Date_start_contract`, `Date_last_renewal`, `Date_next_renewal`, `Date_lapse` | Longer tenure may reduce lapse risk; claims frequency may affect satisfaction |
| 3 | Policy Behavior & Engagement | `Distribution_channel`, `Payment`, `Policies_in_force`, `Max_policies`, `Max_products`, `Lapse` | Indicator of loyalty and cross-sell opportunity; payment method (annual vs. semi-annual) may affect lapse; distribution channel (broker vs. agent) can impact retention; more bundled products may indicate higher retention |
| 4 | Financial Metrics | `Premium`, `Cost_claims_year` | High premiums combined with no claims may signal dissatisfaction risk |
| 5 | Vehicle & Driving History | `Date_driving_licence`, `Type_risk`, `Power`, `Cylinder_capacity`, `Value_vehicle`, `Year_matriculation`, `N_doors`, `Type_fuel`, `Length`, `Weight`, `Area`, `Second_driver` | High-value or risky vehicles may correlate with lapse; urban vs. rural area may affect churn behavior; driving experience may impact lapse |
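
The grouping in the table can be captured as a plain dictionary, and the age, driving-experience, and tenure ideas in its last column can be sketched as durations derived from the date columns. The snippet below is illustrative only: the `DD/MM/YYYY` date format and the reference date are assumptions, and the names `FEATURE_GROUPS`, `missing_columns`, `add_duration_features`, `age_years`, `driving_years`, and `tenure_years` are hypothetical.

```python
import pandas as pd

# Raw variables grouped exactly as in the table above.
FEATURE_GROUPS = {
    "customer_demographics": ["ID", "Date_birth"],
    "policy_details": [
        "Seniority", "N_claims_year", "N_claims_history", "R_Claims_history",
        "Date_start_contract", "Date_last_renewal", "Date_next_renewal", "Date_lapse",
    ],
    "policy_behavior_engagement": [
        "Distribution_channel", "Payment", "Policies_in_force",
        "Max_policies", "Max_products", "Lapse",
    ],
    "financial_metrics": ["Premium", "Cost_claims_year"],
    "vehicle_driving_history": [
        "Date_driving_licence", "Type_risk", "Power", "Cylinder_capacity",
        "Value_vehicle", "Year_matriculation", "N_doors", "Type_fuel",
        "Length", "Weight", "Area", "Second_driver",
    ],
}


def missing_columns(df):
    """List the grouped variables that are absent from the DataFrame."""
    expected = [col for cols in FEATURE_GROUPS.values() for col in cols]
    return [col for col in expected if col not in df.columns]


def add_duration_features(df, as_of, date_format="%d/%m/%Y"):
    """Add durations in years measured up to the reference date `as_of`.

    Assumption: dates are stored as DD/MM/YYYY strings; unparseable
    values become NaT and therefore yield NaN durations.
    """
    out = df.copy()
    as_of = pd.Timestamp(as_of)
    pairs = [
        ("Date_birth", "age_years"),
        ("Date_driving_licence", "driving_years"),
        ("Date_start_contract", "tenure_years"),
    ]
    for source, new_name in pairs:
        dates = pd.to_datetime(out[source], format=date_format, errors="coerce")
        out[new_name] = (as_of - dates).dt.days / 365.25
    return out


# Toy row for illustration only; not a record from the real dataset.
toy = pd.DataFrame({
    "Date_birth": ["15/04/1980"],
    "Date_driving_licence": ["01/06/2000"],
    "Date_start_contract": ["05/11/2015"],
})
toy = add_duration_features(toy, as_of="2018-12-31")
print(toy[["age_years", "driving_years", "tenure_years"]].round(1))
```

Using a fixed reference date rather than today's date keeps the derived durations reproducible across runs; the end of the observation window is one reasonable choice.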


### Claim Type Data




### Final Feature Set



## Exploratory Data Analysis

<u>**Categorizing the Variables**</u>

## Data Preprocessing

## Feature Engineering & Feature Selection

## Model Training & Selection

## Hyperparameter Tuning

## Model Evaluation

## SHAP - Explaining the Model

## Results

## Bonus

## Streamlit Insurance Lapse App