# Vehicle Policy Lapse Prediction

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [2]:
from pathlib import Path
base_dir = Path.cwd().parent
data_path = base_dir / "data" / "raw"/ "eudirectlapse.csv"

eudirectlapse_data= pd.read_csv(data_path)

In [3]:
eudirectlapse_data.head()

Unnamed: 0,lapse,polholder_age,polholder_BMCevol,polholder_diffdriver,polholder_gender,polholder_job,policy_age,policy_caruse,policy_nbcontract,prem_final,prem_freqperyear,prem_last,prem_market,prem_pure,vehicl_age,vehicl_agepurchase,vehicl_garage,vehicl_powerkw,vehicl_region
0,0,38,stable,only partner,Male,normal,1,private or freelance work,1,232.46,4 per year,232.47,221.56,243.59,9,8,private garage,225 kW,Reg7
1,1,35,stable,same,Male,normal,1,private or freelance work,1,208.53,4 per year,208.54,247.56,208.54,15,7,private garage,100 kW,Reg4
2,1,29,stable,same,Male,normal,0,private or freelance work,1,277.34,1 per year,277.35,293.32,277.35,14,6,underground garage,100 kW,Reg7
3,0,33,down,same,Female,medical,2,private or freelance work,1,239.51,4 per year,244.4,310.91,219.95,17,10,street,75 kW,Reg5
4,0,50,stable,same,Male,normal,8,unknown,1,554.54,4 per year,554.55,365.46,519.5,16,8,street,75 kW,Reg14


## Feature Engineering
### Numerical Feature 


In [4]:
data = eudirectlapse_data.copy()

In [5]:
data['policy_age_log'] = np.log1p(data['policy_age'])
data['prem_final_log'] = np.log1p(data['prem_final'])

In [6]:
numerical_cols = ['polholder_age', 'policy_age_log', 'vehicl_age', 'vehicl_agepurchase', 'prem_final_log']

### Ordinal Feature

In [7]:
data['policy_nbcontract'].value_counts()

policy_nbcontract
1     18259
2      3541
3       793
4       270
5        87
6        39
7        31
10       11
8         9
9         6
15        6
11        4
13        2
12        1
14        1
Name: count, dtype: int64

**policy_nbcontract_grp:** Most policies have between 1 and 5 contracts, while higher numbers are rare. To make the feature more stable and avoid overfitting on these rare values, all contracts above 5 are grouped into a single category 6. This keeps the feature meaningful for the model while preserving most of the information.

In [8]:
data['policy_nbcontract_grp'] = data['policy_nbcontract'].apply(lambda x: x if x <= 5 else 6)

In [9]:
data['prem_freqperyear'].value_counts()

prem_freqperyear
1 per year     11680
4 per year      6114
2 per year      3090
12 per year     2176
Name: count, dtype: int64


**prem_freqperyear_ord:** This feature represents how often premiums are paid per year. The values are converted to numbers to reflect their natural order (1 per year = 1, 2 per year = 2, 4 per year = 3, 12 per year = 4). This keeps the information meaningful for the model while simplifying the feature.


In [10]:
freq_map = {
   '1 per year': 1,
   '4 per year' : 2,
   '2 per year' : 3,
   '12 per year': 4
}

data['prem_freqperyear_ord'] = data['prem_freqperyear'].map(freq_map)

In [11]:
data['vehicl_powerkw'].value_counts()

vehicl_powerkw
75 kW         10339
100 kW         5116
25-50 kW       4968
125-300 kW     1720
150 kW          580
175 kW          206
225 kW           77
200 kW           32
250 kW           16
275 kW            4
300 kW            2
Name: count, dtype: int64

In [12]:
def group_powerkw(x):
    if x in ['75 kW','100 kW' ,'25-50 kW']:
        return x
    else:
        return '125+ kW'
data['vehicl_powerkw_ord'] = data['vehicl_powerkw'].apply(group_powerkw)

power_map = {
    '25-50 kW': 1,
    '75 kW': 2,
    '100 kW': 3,
    '125+ kW': 4
}
data['vehicl_powerkw_ord'] = data['vehicl_powerkw_ord'].map(power_map)



**vehicl_powerkw:** represents vehicle engine power and has a natural order. Most vehicles fall into a few common power ranges, while higher power values occur rarely. To reduce sparsity and keep the feature meaningful, all power values above 100 kW are grouped into a single category (125+ kW). The grouped values are then converted into ordered numeric levels so the model can learn patterns related to increasing engine power.

In [13]:
data['polholder_BMCevol'].value_counts() 

polholder_BMCevol
stable    12036
down      10155
up          869
Name: count, dtype: int64

In [14]:
bmc_map = {
    'down': 0,
    'stable': 1,
    'up': 2
}

data['polholder_BMCevol_ord'] = data['polholder_BMCevol'].map(bmc_map)

**polholder_BMCevol**: This feature shows the evolution of the policyholder’s bonus malus class. The values have a natural order (down < stable < up), so I convert them to numeric codes (0, 1, 2) to reflect this order. This helps the model understand the direction of change while keeping the feature simple and interpretable.

In [15]:
ordinal_cols = ['policy_nbcontract_grp', 'prem_freqperyear_ord', 'vehicl_powerkw_ord', 'polholder_BMCevol_ord' ]

cols_to_frop =['prem_last', 'prem_market' , 'prem_pure', 'policy_age', 'prem_final', 'policy_nbcontract', 'prem_freqperyear', 'vehicl_powerkw', 'polholder_BMCevol']
data = data.drop(cols_to_frop, axis=1)

### Categorical Features

In [16]:
categorical_cols = ['polholder_diffdriver', 'polholder_gender', 
                    'polholder_job', 'policy_caruse', 'vehicl_garage', 'vehicl_region']

These features *polholder_diffdriver, polholder_gender, polholder_job, policy_caruse, vehicl_garage, and vehicl_region*  are categorical with no natural order. We turn them into separate binary columns using one-hot encoding so the model can use them effectively. 

In [17]:
# Separate features and target
X = data.drop(columns=['lapse'])
y = data['lapse']

- Numerical features are standardized to control scale differences introduced by log based feature engineering; normalization was considered but not used, as it would compress meaningful relative variation.
- Categorical variables are one-hot encoded with unknown categories ignored to ensure robustness to unseen values; dense output is used to enable conversion to DataFrames, feature inspection, and downstream analysis.



In [18]:
# Preprocessing steps
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore',drop='first', sparse_output=False)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols),
        ('ord', 'passthrough', ordinal_cols)
    ]
)

# Fit and transform features only
processed_array = preprocessor.fit_transform(X)

# Feature names
num_features = numerical_cols
cat_features = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_cols)
ord_features = ordinal_cols

all_features = list(num_features) + list(cat_features) + list(ord_features)

# Final dataframe
processed_data = pd.DataFrame(processed_array, columns=all_features)

processed_data.head()


Unnamed: 0,polholder_age,policy_age_log,vehicl_age,vehicl_agepurchase,prem_final_log,polholder_diffdriver_commercial,polholder_diffdriver_learner 17,polholder_diffdriver_only partner,polholder_diffdriver_same,polholder_diffdriver_unknown,...,vehicl_region_Reg4,vehicl_region_Reg5,vehicl_region_Reg6,vehicl_region_Reg7,vehicl_region_Reg8,vehicl_region_Reg9,policy_nbcontract_grp,prem_freqperyear_ord,vehicl_powerkw_ord,polholder_BMCevol_ord
0,-0.408475,-0.22429,-1.13109,0.064332,-0.722601,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,2.0,4.0,1.0
1,-0.65135,-0.22429,0.540215,-0.137257,-0.946394,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,3.0,1.0
2,-1.1371,-1.06287,0.261664,-0.338846,-0.358732,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,3.0,1.0
3,-0.813267,0.266248,1.097316,0.467509,-0.661034,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,2.0,2.0,0.0
4,0.563026,1.595367,0.818765,0.064332,1.071429,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,2.0,1.0


In [19]:
proc_df = processed_data.copy()

In [20]:
proc_df.shape

(23060, 39)

### Base Model : Logistic Regression

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
    proc_df,
    y,
    test_size=0.2,
    random_state=2025,
    stratify=y
)

- Stratification is used to preserve the rare lapse distribution in both training and test sets, ensuring reliable model evaluation.
- `liblinear` is chosen because it gives stable, interpretable results in imbalanced actual datasets.

In [23]:


model = LogisticRegression(max_iter=1000, 
                           random_state=2025,
                            class_weight='balanced',
                            solver='liblinear' ) 

model.fit(X_train, y_train)

In [24]:

y_pred_proba = model.predict_proba(X_test)[:, 1]

  ret = a @ b
  ret = a @ b
  ret = a @ b


In [25]:

roc_auc_score = roc_auc_score(y_test, y_pred_proba)
print("ROC AUC Score:", roc_auc_score)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

ROC AUC Score: 0.6065436492256601
              precision    recall  f1-score   support

           0       0.90      0.58      0.71      4021
           1       0.17      0.57      0.26       591

    accuracy                           0.58      4612
   macro avg       0.53      0.57      0.48      4612
weighted avg       0.81      0.58      0.65      4612



  ret = a @ b
  ret = a @ b
  ret = a @ b


- In lapse modeling, it is better to flag too many customers than to miss customers who will actually lapse.

In [26]:
coef_df = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': model.coef_[0]
}).sort_values(by='coefficient', ascending=False)
