<div style="background-color: #004B87; padding: 20px; text-align: center; border-radius: 10px; color: white; font-family: Arial, sans-serif; margin: auto; width: 80%;">
    <h1>Module 3 Assignment</h1>
    <h2>RURANGWA IRADUKUNDA Jean-François Régis</h2>
    <h3>September 7, 2025</h3>
    <p>Email : jeanfrancoisregis.rurangwairadukunda@axa.be</p>
</div>

# Introduction


As part of the 'Actuarial Data Scientist' program offered by the Belgian association of actuaries (IABE), we are tasked with an assignment that builds on the foundational concepts learned in the third and last module.
<br/>
This assignment involves analyzing the "eusavingULnoPS" dataset, which is based on unit-linked saving products with no profit sharing, sold in an unknown European country. These insurance policies are observed between 1999 and 2008, with entries and exits possible.
<br/>

In this analysis, we will focus on developing classification and regression models for specific targets, applying feature selection techniques, and implementing explainability and fairness assessments.
<br/>
The objective is to provide actionable insights into the factors influencing policy lapses and risk premium proportions, ensuring the models are transparent, fair, and robust.

## Importing packages <a name="Importing_packages"></a>

Our first step is to install and import the packages needed for our task.


In [None]:
## Perform these pip installs in the command prompt, with your python kernel activated
!pip install shap
!pip install tensorflow
!pip install shapicant
!pip install scikeras
!pip install mlflow
!pip install evidently
!pip install boruta
!pip install pandas scikit-learn deap
!pip install aif360

!git clone https://github.com/Regis0323/Module-3-assignment.git
!pip install scikit-fuzzy

In [None]:
# 1. SETUP: load libraries & data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Data preprocessing and models
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.linear_model import Lasso, LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.metrics import classification_report


# MLFlow
import mlflow
import cloudpickle
from evidently import Dataset
from evidently import DataDefinition
from evidently import Report
from evidently.presets import DataDriftPreset, DataSummaryPreset

# Feature selection and explainability of the model
import shap
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.inspection import permutation_importance
from boruta import BorutaPy
from deap import base, creator, tools, algorithms
import random

# Bias mitigation
from aif360.datasets import StandardDataset, BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric, ClassificationMetric
from aif360.algorithms.preprocessing.optim_preproc_helpers.opt_tools import OptTools
from aif360.algorithms.preprocessing import Reweighing, OptimPreproc
from aif360.algorithms.inprocessing import MetaFairClassifier, AdversarialDebiasing
from aif360.algorithms.postprocessing import RejectOptionClassification, EqOddsPostprocessing



# Data exploration and preprocessing


Then we load the dataset and get its specificities.

In [49]:
# Load
df = pd.read_csv("/content/Module-3-assignment/eusavingULnoPS.csv")
display(df.head())
print(df.dtypes.value_counts())

Unnamed: 0,policy.ID,issue.date,termination.date,lapse.reason,premium.frequency,gender,underwriting.age,face.amount,risk.premium,saving.premium,...,rate2Y.relvar1mth,rate2Y.relvar1qtr,rate10Y.relvar1mth,rate10Y.relvar1qtr,unemploy.relvar1mth,unemploy.relvar1qtr,industry.relvar1mth,industry.relvar1qtr,RTV.relvar1mth,RTV.relvar1qtr
0,N1,1999-01-01,2006-04-01,Claim,unique,Male,77,2979.53,265.86,29795.27,...,0.08631,0.184211,0.081598,0.144626,-0.011494,-0.011494,-0.000979,0.016699,0.005941,0.006931
1,N2,1999-01-01,2006-04-01,Claim,unique,Male,77,3039.71,271.23,30397.16,...,0.08631,0.184211,0.081598,0.144626,-0.011494,-0.011494,-0.000979,0.016699,0.005941,0.006931
2,N3,1999-01-01,2005-04-01,Surrender,unique,Female,72,2994.12,80.43,35929.44,...,-0.025391,-0.018489,-0.023343,-0.018868,-0.010309,-0.040816,0.001,0.003024,0.002002,0.000998
3,N4,1999-01-01,2003-02-01,Surrender,unique,Male,35,3051.1,7.5,30511.01,...,-0.041762,-0.133838,-0.038309,-0.039273,0.0,-0.00885,-0.003067,0.0,0.009524,0.009574
4,N5,1999-01-01,2008-01-01,In force,unique,Male,40,5810.21,81.39,58102.11,...,0.037712,-0.00443,0.025938,-0.002027,0.022727,0.069767,0.001876,0.000941,-0.01155,-0.026692


float64    23
object      6
int64       1
Name: count, dtype: int64


Our dataset is composed of 30 columns, of which 6 are categorized as 'object', 1 as 64-bit 'integer' (int64) and 23 as 64-bit 'floating-point' (float64) numbers.
<br>
We shall now check for missing and/or duplicated values before treating them accordingly if any.

In [50]:
#Potential missing values
Missing_Values=df.isna().sum()
print(Missing_Values)

#Potential duplicated values
Duplicated_Values=df[df.duplicated(keep=False)]
print(Duplicated_Values)
  #Treating duplicated values (if any)
df_cleaned=df.drop_duplicates(keep='first')

#Information about the new cleaned dataset
df_cleaned.info()
print(df_cleaned.columns.tolist())

policy.ID              0
issue.date             0
termination.date       0
lapse.reason           0
premium.frequency      0
gender                 0
underwriting.age       0
face.amount            0
risk.premium           0
saving.premium         0
CPI.relvar1mth         0
CPI.relvar1qtr         0
CPI.relvar1yr          0
CPI.relvar2yr          0
EUidx.relvar1mth       0
EUidx.relvar1qtr       0
EUidx.relvar1yr        0
EUidx.relvar2yr        0
rate1Y.relvar1mth      0
rate1Y.relvar1qtr      0
rate2Y.relvar1mth      0
rate2Y.relvar1qtr      0
rate10Y.relvar1mth     0
rate10Y.relvar1qtr     0
unemploy.relvar1mth    0
unemploy.relvar1qtr    0
industry.relvar1mth    0
industry.relvar1qtr    0
RTV.relvar1mth         0
RTV.relvar1qtr         0
dtype: int64
Empty DataFrame
Columns: [policy.ID, issue.date, termination.date, lapse.reason, premium.frequency, gender, underwriting.age, face.amount, risk.premium, saving.premium, CPI.relvar1mth, CPI.relvar1qtr, CPI.relvar1yr, CPI.relvar2yr, EUidx.

After this check, we can see that there are neither missing values nor duplicates. The two date variables (issue.date and termination.date) are still stored as 'object' and are converted to 'datetime' below.

We shall now create two datasets (df_cat and df_num), one for each of the following two dependent variables:
*   a categorical one: lapse.reason;
*   a numeric one: a numerical variable that represents the proportion of risk premium over total premium: proportion_risk_premium = risk.premium / (risk.premium+saving.premium)

In [55]:
# Conversion of the date columns from 'object' to 'datetime'
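# Work on an explicit copy to avoid pandas' SettingWithCopyWarning
df_cleaned = df_cleaned.copy()
df_cleaned['issue.date'] = pd.to_datetime(df_cleaned['issue.date'], errors='coerce')
df_cleaned['termination.date'] = pd.to_datetime(df_cleaned['termination.date'], errors='coerce')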

# Creation of numerical and categorical datasets
df_cat = df_cleaned.copy()
df_num = df_cleaned.copy()

# Creation of numerical dependent variable
df_num['proportion_risk_premium'] = df_num['risk.premium']/(df_num['risk.premium'] + df_num['saving.premium'])

## Data preprocessing
After creating the numerical and categorical datasets, we shall now define the X (X_cat and X_num) and y datasets before converting categorical variables to a suitable format (using the "one-hot encoding" method), handling date variables and splitting the datasets into training and test sets.

In [61]:
# Define X and y (wisely select variables for the X datasets)
y_num = df_num['proportion_risk_premium']
y_cat = df_cat['lapse.reason']
X_num = df_num.drop(['proportion_risk_premium'], axis=1)
X_cat = df_cat.drop(['lapse.reason'], axis=1)

# One-hot encoding method for categorical variables
categorical_cols = X_cat.select_dtypes(include=['object']).columns.tolist()
preprocessor = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough')

# Example on how to handle date variables
for df_ in [X_num, X_cat]:
    df_['issue_date_year'] = df_['issue.date'].dt.year
    df_['issue_date_month'] = df_['issue.date'].dt.month
    df_['termination_date_year'] = df_['termination.date'].dt.year
    df_['termination_date_month'] = df_['termination.date'].dt.month
    df_.drop(['issue.date', 'termination.date'], axis=1, inplace=True)

# Split test and train datasets
X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(X_cat, y_cat, test_size=0.25, random_state=23, stratify=y_cat)

X_train_num, X_test_num, y_train_num, y_test_num = train_test_split(X_num, y_num, test_size=0.25, random_state=23)

## Development of model for categorical variable: RandomForest

In this part we shall develop a Random Forest algorithm for the categorical dependent variable "lapse.reason".

In [78]:
# Build the pipeline with preprocessing and classifier
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # One-hot encoding of categorical columns, other columns passed through
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=23))
])

# Fit the model on training data
rf_pipeline.fit(X_train_cat, y_train_cat)

# Predict on test data
y_pred = rf_pipeline.predict(X_test_cat)

# Evaluate the model
print("Classification report for 'lapse.reason':")
print(classification_report(y_test_cat, y_pred))

Classification report for 'lapse.reason':
              precision    recall  f1-score   support

       Claim       1.00      0.10      0.19        49
    In force       1.00      1.00      1.00      3470
   Surrender       0.98      1.00      0.99      1962

    accuracy                           0.99      5481
   macro avg       0.99      0.70      0.72      5481
weighted avg       0.99      0.99      0.99      5481



Summary:
The model performs well overall (99% accuracy) but poorly on the minority class: only 10% of the 49 'Claim' policies in the test set are identified (recall 0.10).
We therefore tune the hyperparameters and address the class imbalance to try to improve recall for 'Claim'.
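
The gap between the overall scores and the 'Claim' recall comes from how the averages are built. The short sketch below recomputes the macro and weighted recall from the per-class figures of the classification report above; the dictionary names are ours, the numbers are copied from the report.

```python
# Per-class recall and support copied from the classification report above
recall = {"Claim": 0.10, "In force": 1.00, "Surrender": 1.00}
support = {"Claim": 49, "In force": 3470, "Surrender": 1962}

total = sum(support.values())

# Macro average: every class counts equally, so the weak class pulls it down
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: classes count in proportion to their support
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall    = {macro_recall:.2f}")
print(f"weighted recall = {weighted_recall:.2f}")
print(f"share of 'Claim' policies in the test set = {support['Claim'] / total:.2%}")
```

Because 'Claim' represents fewer than 1% of the test policies, the weighted average and the accuracy barely react to its low recall, whereas the macro average does.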

In [2]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE

# 1. Encode categorical variables
categorical_cols = X_train_cat.select_dtypes(include=['object']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'
)

# Fit and transform training data
X_train_encoded = preprocessor.fit_transform(X_train_cat)

# 2. Apply SMOTE on the encoded data
smote = SMOTE(random_state=23)
X_resampled, y_resampled = smote.fit_resample(X_train_encoded, y_train_cat)

# 3. X_resampled is now fully numerical and class-balanced,
#    so a model can be trained directly on (X_resampled, y_resampled)



In [1]:
# Ensure imbalanced-learn is installed
!pip install imbalanced-learn

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, precision_recall_curve
import matplotlib.pyplot as plt

# --- Step 1: Encode the categorical features ---
# Reuse the existing 'preprocessor'; it is fitted on the training set only
X_train_encoded = preprocessor.fit_transform(X_train_cat)
X_test_encoded = preprocessor.transform(X_test_cat)

# --- Step 2: Apply SMOTE to balance the training data ---
smote = SMOTE(random_state=23)
X_resampled, y_resampled = smote.fit_resample(X_train_encoded, y_train_cat)

# --- Step 3: Build the classifier with the best hyperparameters and class_weight='balanced' ---
# The data are already encoded, so the classifier is used on its own (no
# preprocessing step) and the parameter names carry no 'classifier__' prefix
best_params = {
    'n_estimators': 100,
    'min_samples_split': 2,
    'min_samples_leaf': 1,
    'max_features': 'log2',
    'max_depth': None
}

clf = RandomForestClassifier(**best_params, class_weight='balanced', random_state=23)

# --- Step 4: Train on the SMOTE-balanced data ---
clf.fit(X_resampled, y_resampled)

# --- Step 5: Get probabilities on test set ---
y_proba = clf.predict_proba(X_test_encoded)

# Find index of 'Claim'
claim_idx = list(clf.classes_).index('Claim')
claim_probs = y_proba[:, claim_idx]

# --- Step 6: Plot precision and recall against the decision threshold ---
precision, recall, thresholds = precision_recall_curve(y_test_cat == 'Claim', claim_probs)

plt.figure(figsize=(8,6))
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.xlabel('Decision Threshold')
plt.title('Precision-Recall Trade-off for Claim Class')
plt.legend()
plt.show()

# --- Step 7: Select a threshold (e.g., 0.05) to improve recall ---
optimal_threshold = 0.05
# Predict 'Claim' when its probability reaches the threshold,
# otherwise keep the most likely class
y_pred_default = clf.classes_[np.argmax(y_proba, axis=1)]
y_pred_adjusted = np.where(claim_probs >= optimal_threshold, 'Claim', y_pred_default)

# --- Step 8: Evaluate at this threshold ---
print(f"Classification report at threshold={optimal_threshold}:")
print(classification_report(y_test_cat, y_pred_adjusted))


In [84]:
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__max_features': ['sqrt', 'log2']
}

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    rf_pipeline, param_distributions=param_dist,
    n_iter=20, scoring='f1_weighted', cv=5, n_jobs=-1, verbose=2, random_state=23
)

# Run hyperparameter tuning
random_search.fit(X_train_cat, y_train_cat)

# Best parameters
print("Best hyperparameters:", random_search.best_params_)

# Evaluate the best model
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test_cat)
print("Classification report after hyperparameter tuning:")
print(classification_report(y_test_cat, y_pred))


Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best hyperparameters: {'classifier__n_estimators': 100, 'classifier__min_samples_split': 2, 'classifier__min_samples_leaf': 1, 'classifier__max_features': 'log2', 'classifier__max_depth': None}
Classification report after hyperparameter tuning:
              precision    recall  f1-score   support

       Claim       1.00      0.10      0.19        49
    In force       1.00      1.00      1.00      3470
   Surrender       0.98      1.00      0.99      1962

    accuracy                           0.99      5481
   macro avg       0.99      0.70      0.72      5481
weighted avg       0.99      0.99      0.99      5481



## Development of model for numerical variable (proportion): Sequential Dense NN


For the numerical variable "proportion_risk_premium", develop a Sequential Dense Neural Network.

In [13]:
scaler_num = StandardScaler()
# Normalize numerical features
X_train_num_scaled = scaler_num.fit_transform(X_train_num)
X_test_num_scaled = scaler_num.transform(X_test_num)

# Define neural network
model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X_train_num_scaled.shape[1],)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='linear')
])

model.compile(optimizer='adam', loss='mse', metrics=['mae'])

# Train with early stopping
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

model.fit(
    X_train_num_scaled,
    y_train_num,
    validation_split=0.2,
    epochs=100,
    batch_size=32,
    callbacks=[early_stop],
    verbose=0
)

# Evaluate
loss, mae = model.evaluate(X_test_num_scaled, y_test_num)
print(f"Test MAE for risk proportion: {mae}")

138/138 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 2.7012e-07 - mae: 3.2976e-04
Test MAE for risk proportion: 0.00033756138873286545


# Variable selection

## Categorical problem, feature selection


For the categorical dependent variable "lapse.reason", perform Boruta feature selection using BorutaPy (from boruta).

In [9]:
# Drop the policy identifier: one-hot encoding it would create one column per policy
X_boruta = X_train_cat.drop(columns=['policy.ID'])
categorical_features = list(X_boruta.select_dtypes(include=['object']).columns)

# Create the preprocessor; sparse_threshold=0 forces a dense output,
# which BorutaPy requires
boruta_preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough',
    sparse_threshold=0
)

# Fit and transform (dense NumPy array)
X_train_cat_encoded = boruta_preprocessor.fit_transform(X_boruta)

rf_for_boruta = RandomForestClassifier(n_estimators=100, random_state=23)

# Encode y
le = LabelEncoder()
y_boruta = le.fit_transform(y_train_cat)

# Run Boruta
boruta_selector = BorutaPy(rf_for_boruta, max_iter=50, random_state=23)
boruta_selector.fit(X_train_cat_encoded, y_boruta)

# Get selected feature mask
support_mask = boruta_selector.support_

# Get the names of all encoded and passed-through features
feature_names = boruta_preprocessor.get_feature_names_out()
selected_features_boruta = feature_names[support_mask]
print("Boruta selected features:", list(selected_features_boruta))
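
To make the idea behind Boruta more concrete, here is a minimal sketch of a single "shadow feature" round on synthetic data. It is a simplification under our own assumptions (synthetic data, one iteration, no statistical test), not the full BorutaPy procedure, which repeats this comparison many times before accepting or rejecting a feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(23)

# Synthetic data (assumption): 2 informative columns followed by 3 pure-noise columns
n = 500
X = rng.normal(size=(n, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Shadow features: each column shuffled independently, which destroys any link with y
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_extended = np.hstack([X, X_shadow])

forest = RandomForestClassifier(n_estimators=200, random_state=23)
forest.fit(X_extended, y)

importances = forest.feature_importances_
real_imp, shadow_imp = importances[:5], importances[5:]

# A real feature scores a "hit" when it beats the best shadow feature
hits = real_imp > shadow_imp.max()
print("Hits per real feature:", hits)
```

A feature that cannot outperform shuffled copies of the data carries no more signal than noise, which is the rationale for rejecting it.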

For the categorical dependent variable "lapse.reason", perform Genetic Algorithm feature selection using base, creator, tools and algorithms (from deap).

Optional: Use MLflow to show you know how to use good development and tracking practices while training the model.

In [None]:
# Setup MLFlow

host =
port =
mlflow.set_tracking_uri(uri=f"http://{host}:{port}")


In [None]:
# Model packaging


In [None]:
# Model registry


In [None]:
# Model serving


In [None]:
# Model monitoring


## Numerical problem, feature selection

For the numerical variable "proportion_risk_premium", perform a forward and a backward selection on explanatory variables.




In [10]:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Standardize features (the scaler is created here so that the cell runs on its own)
scaler_num = StandardScaler()
X_train_num_scaled = scaler_num.fit_transform(X_train_num)

# Forward selection
sfs_forward = SequentialFeatureSelector(
    Lasso(alpha=0.01),
    n_features_to_select=10,
    direction='forward'
)
sfs_forward.fit(X_train_num_scaled, y_train_num)
print("Forward selection features:", X_train_num.columns[sfs_forward.get_support()])

# Backward selection
sfs_backward = SequentialFeatureSelector(
    Lasso(alpha=0.01),
    n_features_to_select=10,
    direction='backward'
)
sfs_backward.fit(X_train_num_scaled, y_train_num)
print("Backward selection features:", X_train_num.columns[sfs_backward.get_support()])


For the numerical variable "proportion_risk_premium", apply Lasso using LassoCV.
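
As a hedged illustration of what this step could look like, the sketch below runs LassoCV on synthetic data in which only two of six features drive the target. The data and variable names are assumptions of ours; on the real problem, the standardized training features and `y_train_num` would take their place.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(23)

# Synthetic data (assumption): only the first two of six features drive the target
X = rng.normal(size=(400, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=400)

# Lasso penalises coefficient size, so features should share a common scale
X_scaled = StandardScaler().fit_transform(X)

# LassoCV picks the penalty strength alpha by 5-fold cross-validation
lasso = LassoCV(cv=5, random_state=23).fit(X_scaled, y)

# Features whose coefficient was not shrunk to exactly zero are retained
selected = np.flatnonzero(lasso.coef_ != 0)
print("Chosen alpha:", lasso.alpha_)
print("Features with non-zero coefficients:", selected)
```

Since the target here is a small proportion, the cross-validated alpha matters: a fixed penalty that is large relative to the scale of the target would shrink every coefficient to zero.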

Optional: Use MLflow to show you know how to use good development and tracking practices while training the model.

In [None]:
# Setup MLFlow

host =
port =
mlflow.set_tracking_uri(uri=f"http://{host}:{port}")


In [None]:
# Model packaging


In [None]:
# Model registry


In [None]:
# Model serving


In [None]:
# Model monitoring


# Explainability of the model

## Categorical problem

Explain the RandomForest model to a client who wants to know why their quote is so high, by means of permutation feature importance using permutation_importance.

In [None]:
# rf_pipeline is the fitted Random Forest pipeline, so the raw test columns are permuted
result = permutation_importance(rf_pipeline, X_test_cat, y_test_cat, n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=X_test_cat.columns)
print("Feature importances:\n", importances.sort_values(ascending=False))

Choose another method of your preference to explain the RandomForest model.

## Numerical problem

Explain the Neural Network model to a client who wants to know why their quote is so high, using Shapley values.

In [None]:
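import shap

# Wrap the Keras model so that it returns a 1-D array of predictions
predict_fn = lambda x: model.predict(x, verbose=0).flatten()

X_sample = X_test_num_scaled[:100]
# The first 100 training rows serve as background data for KernelExplainer
explainer = shap.KernelExplainer(predict_fn, X_train_num_scaled[:100])
shap_values = explainer.shap_values(X_sample)

shap.summary_plot(shap_values, X_sample, feature_names=list(X_test_num.columns))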

Choose another method of your preference to explain the NN model.

# BIAS for the numerical problem

After having set a threshold = 0.0002 = 0.02%, define a variable called "risk_category" that takes the value High or Low depending on whether "proportion_risk_premium" is higher or lower than the threshold.

Similarly, define variable "risk_category_pred" that does the same but on the basis of the values predicted by the Neural Network model. For these two variables, count the number of High and Low values for Females and Males.

In [None]:
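# Threshold = 0.0002 = 0.02%
threshold = 0.0002

df_num['risk_category'] = np.where(df_num['proportion_risk_premium'] > threshold, 'High', 'Low')
# Neural-network predictions for every policy, using the scaler fitted on the training set
y_pred_all = model.predict(scaler_num.transform(X_num), verbose=0).flatten()
df_num['risk_category_pred'] = np.where(y_pred_all > threshold, 'High', 'Low')
# Number of High and Low values by gender, observed and predicted
print(pd.crosstab(df_num['gender'], df_num['risk_category']))
print(pd.crosstab(df_num['gender'], df_num['risk_category_pred']))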

Using confusion_matrix, compare real and predicted values by counting false positives, false negatives, true positives and true negatives.
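
The sketch below shows how the four counts can be read from scikit-learn's `confusion_matrix`. The labels are toy values of ours, and treating 'High' as the positive class is an assumption; on the real data, the two lists would be the "risk_category" and "risk_category_pred" columns, possibly filtered by gender.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumption), with 'High' treated as the positive class
y_true = ["High", "High", "Low", "Low", "High", "Low", "Low", "High"]
y_pred = ["High", "Low", "Low", "High", "High", "Low", "Low", "High"]

# With labels=['Low', 'High'] the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["Low", "High"]).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```

Passing `labels` explicitly fixes the row and column order, so the unpacking does not depend on the alphabetical order of the category names.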

## Metrics to detect possible bias on predicted values
Investigate possible bias on the basis of the metrics:


*   Demographic Parity (no bias range: -0.1, 0.1)
*   Equal Opportunity (no bias range: -0.1, 0.1)
*   Average Odds (no bias range: -0.1, 0.1)
*   Disparate Impact (no bias range: 0.8, 1.25)
*   Theil Index (no bias range: 0, 0.15)
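
The first four metrics can be computed directly from group-wise rates, as in the minimal NumPy sketch below. The data are toy values, and the encoding (1 = favourable outcome, group 1 = privileged) is an assumption of ours; the Theil index is left out of this sketch. The differences are taken as unprivileged minus privileged.

```python
import numpy as np

# Toy data (assumption): 1 = favourable outcome, group 1 = privileged group
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0, 0])
group = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

def rates(mask):
    """Selection rate, true-positive rate and false-positive rate within a group."""
    yt, yp = y_true[mask], y_pred[mask]
    selection = yp.mean()
    tpr = yp[yt == 1].mean()
    fpr = yp[yt == 0].mean()
    return selection, tpr, fpr

sel_p, tpr_p, fpr_p = rates(group == 1)  # privileged
sel_u, tpr_u, fpr_u = rates(group == 0)  # unprivileged

demographic_parity = sel_u - sel_p
equal_opportunity = tpr_u - tpr_p
average_odds = 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))
disparate_impact = sel_u / sel_p

print(f"Demographic parity difference: {demographic_parity:.3f}")
print(f"Equal opportunity difference:  {equal_opportunity:.3f}")
print(f"Average odds difference:       {average_odds:.3f}")
print(f"Disparate impact:              {disparate_impact:.3f}")
```

In this toy case both groups receive the favourable prediction equally often, so demographic parity and disparate impact signal no bias, while the equal opportunity difference falls outside the (-0.1, 0.1) range. This is why the metrics are examined together rather than in isolation.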

## Bias mitigation
Proceed with:

*   Preprocessing bias mitigation: choose the method, between reweighing and optimized preprocessing, that leads to a greater bias reduction
*   In-processing bias mitigation:  choose the method, between Meta Fair
*   Postprocessing bias mitigation: Reject option classification and equalized Odds
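
To give an intuition for the first preprocessing option, the sketch below computes reweighing weights by hand with pandas on toy data. Each observation receives the weight P(group) × P(label) / P(group, label), which is the principle behind reweighing; the column names and values are assumptions of ours, and the sketch does not replace aif360's `Reweighing` class.

```python
import pandas as pd

# Toy data (assumption): 'gender' is the protected attribute, 'label' the outcome
df_toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"],
    "label": ["Low", "Low", "Low", "High", "Low", "High", "High", "High"],
})

n = len(df_toy)
p_group = df_toy["gender"].value_counts() / n
p_label = df_toy["label"].value_counts() / n
p_joint = df_toy.groupby(["gender", "label"]).size() / n

# Weight = frequency expected under independence / observed frequency
df_toy["weight"] = [
    p_group[g] * p_label[y] / p_joint[(g, y)]
    for g, y in zip(df_toy["gender"], df_toy["label"])
]
print(df_toy.drop_duplicates())

# After reweighing, the weighted share of 'Low' is the same for both genders
weighted_low = (
    df_toy[df_toy["label"] == "Low"].groupby("gender")["weight"].sum()
    / df_toy.groupby("gender")["weight"].sum()
)
print(weighted_low)
```

Under-represented combinations (here Male/High and Female/Low) receive weights above 1 and over-represented ones weights below 1, so a model trained with these sample weights sees labels that are independent of gender.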




In [None]:
# Preprocessing


In [None]:
# In-processing


In [None]:
# Post-Processing


Compare pros and cons of chosen methods:




Write down in text what you think the pros and cons are of the methods above