# Auto Insurance Policy Lapse Risk Prediction

### Author: Henry Udeogu

This machine learning project uses a real-world insurance dataset from the paper "Dataset of an actual motor vehicle insurance portfolio" by Segura-Gisbert et al. The authors conducted a research project within a Spanish insurance company and obtained a sample of its motor vehicle insurance portfolio, which they were also authorized to share. The dataset contains **105,555 records** and has been anonymized to protect the policyholders' identities.

The dataset includes key date fields: the effective dates of policies, the birthdates of insured individuals, and renewal dates. It is also enriched with economic variables, notably premiums and claim costs.
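Date fields like these are usually turned into numeric features before modeling. The sketch below shows one way to do that with pandas, deriving approximate age and policy tenure at renewal. The column names (`birth_date`, `effective_date`, `renewal_date`), the toy records, and the day-first date format are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy records; column names and values are illustrative only.
policies = pd.DataFrame({
    "birth_date": ["15/04/1980", "02/11/1995"],
    "effective_date": ["01/03/2016", "20/07/2018"],
    "renewal_date": ["01/03/2019", "20/07/2019"],
})

# Parse the day-first date strings into datetime columns (assumed format).
for col in ["birth_date", "effective_date", "renewal_date"]:
    policies[col] = pd.to_datetime(policies[col], format="%d/%m/%Y")

# Convert day differences to approximate years (365.25 days per year).
DAYS_PER_YEAR = 365.25
policies["age_at_renewal"] = (
    (policies["renewal_date"] - policies["birth_date"]).dt.days / DAYS_PER_YEAR
)
policies["tenure_years"] = (
    (policies["renewal_date"] - policies["effective_date"]).dt.days / DAYS_PER_YEAR
)

print(policies[["age_at_renewal", "tenure_years"]].round(2))
```

Dividing by 365.25 gives fractional years, which avoids off-by-one results on exact policy anniversaries that integer division would produce.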

Open-access data on insured populations is currently limited. This dataset can be used by insurance companies, researchers, and educators, and is relevant for marketing purposes, including customer segmentation, contract renewal processes, price renewal strategies, optimization and price sensitivity models, and pricing mechanisms for new business.

The **primary goal** of this project is to build a classification model that predicts whether a customer is likely to let their auto insurance policy lapse (i.e., churn), so retention campaigns can be better targeted.

Impact:
- Help reduce customer churn
- Optimize marketing outreach
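To make the goal concrete, here is a minimal sketch of the kind of workflow it implies: train a classifier, score held-out customers by lapse probability, and select the highest-risk ones for a retention campaign. It runs on synthetic data from `make_classification`, not on the insurance dataset, and it uses `class_weight="balanced"` as a simpler stand-in for SMOTE oversampling. The sample size, the 20% lapse rate, and the campaign size `top_n` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the portfolio: roughly 20% of rows are labeled lapsed (1).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)

# Stratify so train and test keep the same lapse rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" reweights the minority class instead of resampling it.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

# Predicted probability of lapse for each held-out customer.
lapse_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, lapse_proba)
print(f"ROC AUC: {auc:.3f}")

# Indices of the top_n highest-risk customers, highest risk first.
top_n = 50  # hypothetical campaign size
campaign_idx = lapse_proba.argsort()[::-1][:top_n]
```

Ranking by predicted probability rather than by the hard 0/1 prediction lets the campaign size follow the marketing budget instead of a fixed decision threshold.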

*Data Source: The dataset was collected from a non-life insurance company operating in Spain and has been anonymized to comply with prevailing European legislation, safeguarding individual privacy and confidentiality. For ease of use, the authors cleaned the data and provide it as a spreadsheet file, which can be accessed via the DOI link below.*
https://doi.org/10.17632/5cxyb5fp4f.2

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from imblearn.over_sampling import SMOTE

import streamlit as st
import joblib

import warnings
warnings.filterwarnings("ignore")
```


## Load Dataset & Overview of the Data


Key Features:
- Policy dates: the effective date and renewal dates of each policy
- Policyholder birthdate: the date of birth of the insured individual
- Premium: the premium charged for the policy
- Claim costs: the cost of claims on the policy
- Lapse indicator: whether the policy lapsed (the prediction target)

## Exploratory Data Analysis

## Data Preprocessing

## Feature Engineering & Feature Selection

## Model Training & Selection

## Hyperparameter Tuning

## Model Evaluation

## SHAP - Explaining the Model

## Results

## Bonus

## Streamlit Insurance Lapse App