# **ML pipeline**

- This notebook contains the pre-processing and model training phases.
- During training, the model evaluation & explanation results are logged to local MLflow runs.
- Why group the 2 steps?
    - This phase contains all the 'fit' steps: we fit the encoder and the model so that they can be reused later during inference.
- Choice of the model: why XGBoost?
    - `Time` : we don't have enough time to benchmark multiple algorithms, so we need to make a choice.
    - `Performance` : XGBoost has been the winning algorithm in numerous competitions (Kaggle, ...).
    - `Calibration` : easy to tune, and its built-in regularization options help prevent overfitting.
    - `Scalability` : fast and consistent execution time, even with large datasets.
    - `Explainability` : the TreeSHAP algorithm can compute very accurate SHAP values for tree-based models.
    - `Convenience` : the tree-based nature of the model lets us skip several pre-processing steps such as normalization, standardization & extreme-value handling.
- "No Free Lunch" :
    - No ML model is indisputably better than the others. For a binary classification problem on structured data, relevant, high-quality features improve model performance considerably.
    - Hyperparameter fine-tuning is the cherry on the cake.

In [1]:
import sys
from pathlib import Path
import logging 
import warnings
warnings.filterwarnings('ignore')

LOGGER = logging.getLogger(__name__)
sys.path.append(str(Path("../src").resolve())) 

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## **Parameters**

In [None]:
columns_to_remove = ["policy_bind_date", "incident_date", "total_claim_amount"]
features_type_mapping = {
    'policy_state': "str", 
    'insured_education_level': "str", 
    'insured_occupation': "str",
    'insured_hobbies': "str", 
    'incident_type': "str", 
    'collision_type': "str",
    'incident_severity': "str", 
    'authorities_contacted': "str", 
    'incident_state': "str",
    'incident_city': "str", 
    'property_damage': "str", 
    'police_report_available': "str",
    'auto_make': "str"
}
binary_columns_list = ["property_damage", "police_report_available"]
unknown_value = -1
test_size = 0.2
shuffle = True
random_state = 1234
hyperparameters = {
    "booster" : "gbtree", # NOTE: we want tree-based boosting
    "objective": "binary:logistic", # NOTE : it is a binary classification
    "n_estimators": 1000, 
    "eta": 0.01, # NOTE : rule of thumb n_estimators ~ 10/eta
    "min_child_weight": 1, # NOTE : if too low it can lead to overfitting
    "seed": 1234, # NOTE : seed for sampling
    "n_jobs": 5, # NOTE : number of cores to use to speed up calculation time
    "base_score": 0.25, # NOTE : % of 1 in the labels -> helps the model to converge faster
    "max_depth": 8, # NOTE : if too high it can lead to overfitting, increase if the nb of features is high
    "subsample": 0.8, # NOTE : % of rows to sample each step -> regularization
    "colsample_bytree": 0.75, # NOTE : % of columns to sample each step -> regularization, to decrease if we have a lot of columns
    "eval_metric": ["auc", "logloss"], # NOTE : we will minimize the loss and observe the auc during training
}
training_params = {"early_stopping_rounds": 15} # NOTE : number of rounds without improvement on the evaluation-set metric after which the training stops
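
Two of the values above can be derived rather than guessed. The sketch below assumes the label counts reported in the run log of this notebook (753 non-fraud, 247 fraud) and applies the `n_estimators ~ 10/eta` rule of thumb from the comments. The variable names are illustrative only.

```python
# Label counts taken from the run log: {0.0: 753, 1.0: 247}
label_counts = {0: 753, 1: 247}

# base_score: share of positive labels, used as the model's starting prediction
positive_rate = label_counts[1] / sum(label_counts.values())
base_score = round(positive_rate, 2)

# Rule of thumb from the hyperparameter comments: n_estimators ~ 10 / eta
eta = 0.01
n_estimators = round(10 / eta)

print(f"positive rate = {positive_rate:.3f} -> base_score = {base_score}")
print(f"eta = {eta} -> n_estimators = {n_estimators}")
```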

## **Input loading**

In [4]:
import polars as pl 
from pathlib import Path

instances = pl.read_parquet(source=(Path().cwd().parent / "data/02_intermediate/instances.parquet").as_posix()).to_pandas()
labels = pl.read_parquet(source=(Path().cwd().parent / "data/02_intermediate/labels.parquet").as_posix()).to_pandas()

## **Execution** 

In [5]:
import functions.ml.preprocessing_functions as preprocessing 
import functions.ml.ml_training_functions as ml_training
import functions.ml.model_explanation_functions as explanation 
import functions.ml.model_prediction_functions as prediction 
import functions.ml.model_evaluation_function as evaluation
import mlflow
import pickle 
import json
from pathlib import Path

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment(experiment_name="technical_test")
with mlflow.start_run(run_name="model_training") as run:
    target_directory = (Path().cwd().parent / "data/03_model_inputs/")
    target_directory.mkdir(parents=True, exist_ok=True)

    LOGGER.info("-------------------------------------------------")
    LOGGER.info("Pre processing : Labels...")
    indexed_labels = preprocessing.set_index(data=labels)
    ml_labels = preprocessing.create_response_variable(labels = indexed_labels)
    LOGGER.info("Saving the final table...")
    ml_labels.to_parquet((target_directory / "ml_labels.parquet").as_posix())

    LOGGER.info("-------------------------------------------------")
    LOGGER.info("Pre processing : Instances...")
    indexed_instances = preprocessing.set_index(data=instances)
    instances_without_useless_columns = preprocessing.remove_unused_columns(
        data=indexed_instances, 
        columns_to_remove = columns_to_remove
    )
    mlflow.log_param(key = "columns_to_remove", value=columns_to_remove)
    typed_instances = preprocessing.set_features_type(
        data = instances_without_useless_columns, 
        features_type_mapping = features_type_mapping
    )
    mlflow.log_param(key = "features_type_mapping", value=features_type_mapping)
    instances_without_missing_values = preprocessing.impute_missing_values(
        data = typed_instances,
        binary_columns_list = binary_columns_list
    )
    mlflow.log_param(key = "binary_columns_list", value=binary_columns_list)
    columns_order = preprocessing.fit_columns_order(data = instances_without_missing_values)
    with open((target_directory / "columns_order.pkl").as_posix(), "wb") as file:
        pickle.dump(columns_order, file)
    mlflow.log_artifact(
        local_path=(target_directory / "columns_order.pkl").as_posix(), 
        artifact_path="model_artifacts"
    )

    instances_with_correct_columns_order = preprocessing.transform_columns_order(
        data = instances_without_missing_values,
        columns_order = columns_order
    )
    numerical_data, categorical_data = preprocessing.split_numeric_and_categorical_data(
        data=instances_with_correct_columns_order
    )

    encoder = preprocessing.fit_categorical_encoder(
        data = categorical_data, 
        unknown_value = unknown_value
    )
    with open((target_directory / "encoder.pkl").as_posix(), "wb") as file:
        pickle.dump(encoder, file)
    mlflow.log_artifact(
        local_path=(target_directory / "encoder.pkl").as_posix(), 
        artifact_path="model_artifacts"
    )

    encoded_categorical_data = preprocessing.transform_categorical_encoder(
        data = categorical_data,
        encoder = encoder 
    )

    ml_features = preprocessing.concatenate_numerical_and_categorical_data(
        numerical_data = numerical_data,
        categorical_data = encoded_categorical_data
    )
    LOGGER.info("Saving the final table...")
    ml_features.to_parquet((target_directory / "ml_features.parquet").as_posix())


    LOGGER.info("-------------------------------------------------")
    LOGGER.info("Training the model...")
    x_train, x_test, y_train, y_test = ml_training.apply_train_test_split(
        x = ml_features, 
        y= ml_labels,
        test_size = test_size,
        shuffle = shuffle, 
        random_state = random_state
    )
    mlflow.log_param(key="test_size", value=test_size)
    mlflow.log_param(key="shuffle", value=shuffle)
    mlflow.log_param(key="random_state", value=random_state)

    xgb_model = ml_training.train_xgb_model(
        x_train = x_train,
        x_test = x_test,
        y_train = y_train,
        y_test = y_test,
        hyperparameters = hyperparameters,
        training_params = training_params,
    )
    mlflow.log_params(hyperparameters)
    mlflow.log_params(training_params)
    with open((target_directory / "xgb_model.pkl").as_posix(), "wb") as file:
        pickle.dump(xgb_model, file)
    mlflow.log_artifact(
        local_path=(target_directory / "xgb_model.pkl").as_posix(), 
        artifact_path="model_artifacts"
    )

    LOGGER.info("-------------------------------------------------")
    LOGGER.info("Explaining the model...")
    target_directory = (Path().cwd().parent / "data/04_model_outputs/")
    target_directory.mkdir(parents=True, exist_ok=True)
    shap_values_train = explanation.create_shap_values(
        data = x_train,
        model = xgb_model
    )
    importance_variables_plot = explanation.plot_importance_variables(
        model=xgb_model,
        data=x_train
    )
    importance_variables_plot.savefig(
        (target_directory / "importance_variables_xgb.png").as_posix(),  
        bbox_inches="tight"
    )
    mlflow.log_artifact(
        local_path=(target_directory / "importance_variables_xgb.png").as_posix(), 
        artifact_path="model_explanation"
    )

    dependence_plots = explanation.plot_dependence_plots(
        data=x_train,
        shap_values = shap_values_train,
        encoder=encoder
    )
    for key, value in dependence_plots.items():
        filepath = (target_directory / "dependence_plots")
        filepath.mkdir(parents=True, exist_ok=True)
        value.savefig(
            (filepath / key).as_posix(),  
            bbox_inches="tight"
        )
        mlflow.log_artifact(
            local_path=(filepath / key).as_posix(), 
            artifact_path="model_explanation"
        )

    LOGGER.info("-------------------------------------------------")
    LOGGER.info("Evaluating the model...")
    y_probs_test = prediction.predict_probs(data=x_test, model=xgb_model)
    classification_reports = evaluation.create_classification_reports(
        y_true=y_test, y_probs= y_probs_test, thresholds = [0.25,0.3,0.4,0.5,0.6]
    )
    with open((target_directory / "classification_reports.json").as_posix(), "w") as file :
        json.dump(classification_reports, file)
    mlflow.log_artifact(
        local_path=(target_directory / "classification_reports.json").as_posix(), 
        artifact_path="model_evaluation"
    )

2025-07-14 10:47:29,307 : INFO : -------------------------------------------------
2025-07-14 10:47:29,307 : INFO : Pre processing : Labels...
2025-07-14 10:47:29,318 : INFO : labels.tp_fraud.value_counts().to_dict() = {0.0: 753, 1.0: 247}
2025-07-14 10:47:29,318 : INFO : Saving the final table...
2025-07-14 10:47:29,323 : INFO : -------------------------------------------------
2025-07-14 10:47:29,323 : INFO : Pre processing : Instances...
2025-07-14 10:47:29,481 : INFO : numerical_data.columns = Index(['insured_zip', 'umbrella_limit', 'vehicle_claim', 'per_accident_limit',
       'policy_annual_premium', 'number_of_vehicles_involved',
       'nb_years_between_incident_and_bind_date', 'incident_hour_of_the_day',
       'policy_deductable', 'months_as_customer', 'per_person_limit', 'age',
       'property_claim', 'capital-gains', 'witnesses', 'auto_year',
       'capital-loss', 'injury_claim', 'bodily_injuries'],
      dtype='object')
2025-07-14 10:47:29,481 : INFO : categorical_data.c

[0]	validation_0-auc:0.56605	validation_0-logloss:0.62229
[1]	validation_0-auc:0.72426	validation_0-logloss:0.61936
[2]	validation_0-auc:0.87245	validation_0-logloss:0.61500
[3]	validation_0-auc:0.85564	validation_0-logloss:0.61511
[4]	validation_0-auc:0.86331	validation_0-logloss:0.61286
[5]	validation_0-auc:0.89055	validation_0-logloss:0.60857
[6]	validation_0-auc:0.91178	validation_0-logloss:0.60337
[7]	validation_0-auc:0.90966	validation_0-logloss:0.60125
[8]	validation_0-auc:0.91579	validation_0-logloss:0.59728
[9]	validation_0-auc:0.91909	validation_0-logloss:0.59314
[10]	validation_0-auc:0.91909	validation_0-logloss:0.59130
[11]	validation_0-auc:0.91473	validation_0-logloss:0.59143
[12]	validation_0-auc:0.91225	validation_0-logloss:0.59110
[13]	validation_0-auc:0.91650	validation_0-logloss:0.58719
[14]	validation_0-auc:0.91544	validation_0-logloss:0.58710
[15]	validation_0-auc:0.91485	validation_0-logloss:0.58619
[16]	validation_0-auc:0.91273	validation_0-logloss:0.58602
[17]	va

2025-07-14 10:47:49,292 : INFO : -------------------------------------------------
2025-07-14 10:47:49,293 : INFO : Explaining the model...
2025-07-14 10:47:56,224 : INFO : -------------------------------------------------
2025-07-14 10:47:56,232 : INFO : Evaluating the model...


🏃 View run model_training at: http://localhost:5000/#/experiments/222123412902236260/runs/5d2bcbbb6268423495a132cffde844ee
🧪 View experiment at: http://localhost:5000/#/experiments/222123412902236260
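
The run above pickles the fitted encoder so that inference can reuse it. Here is a minimal sketch of that fit / save / reload / transform cycle, assuming `preprocessing.fit_categorical_encoder` wraps scikit-learn's `OrdinalEncoder` with `handle_unknown="use_encoded_value"` (the actual implementation lives in `src` and is not shown in this notebook). The `auto_make` values are made up for illustration.

```python
import pickle

from sklearn.preprocessing import OrdinalEncoder

unknown_value = -1  # same value as in the Parameters cell

# Hypothetical training values for a single categorical column (auto_make)
train_categories = [["Audi"], ["BMW"], ["Saab"]]

# 'fit' step: learn the category -> integer mapping once, at training time
encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=unknown_value,
)
encoder.fit(train_categories)

# Persist and reload the fitted encoder (in memory here, a .pkl file in the notebook)
restored_encoder = pickle.loads(pickle.dumps(encoder))

# 'transform' step at inference: "Tesla" was never seen during fit
encoded = restored_encoder.transform([["BMW"], ["Tesla"]])
print(encoded.tolist())
```

With this configuration, a category unseen during fit is mapped to `unknown_value` instead of raising an error, which is why the encoder can be fitted once and reused safely on new data.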


# **Conclusions & more**

- the idea was to predict the probability of fraud and then choose a threshold to decide whether a given case is classified as fraud or not
- the problem is a binary classification with imbalanced classes, so this is better than predicting the class directly (most of the time the threshold is fixed at 50%, so with imbalanced classes the model tends to find very few true positives)
- which metric did I choose during model training ? : minimizing the log loss
    - for an imbalanced classification, it is better to predict probabilities and then choose a threshold (every probability >= threshold is considered a fraud)
    - the log loss is a "distance" between the labels and the predicted probabilities, so minimizing it helps calibrate the model
- which metrics did I choose to evaluate the model ? :
    - precision
    - recall
    - the confusion matrix 
- these are the preferred metrics in this case because : 
    - for fraud detection, what matters most is the number of frauds detected and the number of cases the model flags as high risk of fraud
    - for an imbalanced classification problem, looking at the accuracy is not advised (if only 5% of my labels are 1, predicting 0 for everyone gives a 95% accuracy)
- what are the results ? :

    - when grouping high cardinality categorical variables : 
```json
{
    "classification_report_threshold_25%": {
        "precision": 0.6774193548387096, 
        "recal": 0.6885245901639344, 
        "true_positives": 42.0, 
        "false_positives": 20.0, 
        "true_negatives": 119.0, 
        "false_negatives": 19.0
    }, 
    "classification_report_threshold_30%": {
        "precision": 0.7272727272727273, 
        "recal": 0.6557377049180327, 
        "true_positives": 40.0, 
        "false_positives": 15.0, 
        "true_negatives": 124.0, 
        "false_negatives": 21.0
    }, 
    "classification_report_threshold_40%": {
        "precision": 0.7391304347826086, 
        "recal": 0.5573770491803278, 
        "true_positives": 34.0, 
        "false_positives": 12.0, 
        "true_negatives": 127.0, 
        "false_negatives": 27.0
    }, 
    "classification_report_threshold_50%": {
        "precision": 0.8484848484848485, 
        "recal": 0.45901639344262296, 
        "true_positives": 28.0, 
        "false_positives": 5.0, 
        "true_negatives": 134.0, 
        "false_negatives": 33.0
    }, 
    "classification_report_threshold_60%": {
        "precision": 0.8666666666666667, 
        "recal": 0.21311475409836064, 
        "true_positives": 13.0, 
        "false_positives": 2.0, 
        "true_negatives": 137.0, 
        "false_negatives": 48.0
    }
}
```

    - when not grouping high cardinality categorical variables : 
    
```json
{
  "classification_report_threshold_25%": {
    "precision": 0.7671232876712328,
    "recal": 0.9180327868852459,
    "true_positives": 56,
    "false_positives": 17,
    "true_negatives": 122,
    "false_negatives": 5
  },
  "classification_report_threshold_30%": {
    "precision": 0.7681159420289855,
    "recal": 0.8688524590163934,
    "true_positives": 53,
    "false_positives": 16,
    "true_negatives": 123,
    "false_negatives": 8
  },
  "classification_report_threshold_40%": {
    "precision": 0.7627118644067796,
    "recal": 0.7377049180327869,
    "true_positives": 45,
    "false_positives": 14,
    "true_negatives": 125,
    "false_negatives": 16
  },
  "classification_report_threshold_50%": {
    "precision": 0.7857142857142857,
    "recal": 0.5409836065573771,
    "true_positives": 33,
    "false_positives": 9,
    "true_negatives": 130,
    "false_negatives": 28
  },
  "classification_report_threshold_60%": {
    "precision": 0.8333333333333334,
    "recal": 0.32786885245901637,
    "true_positives": 20,
    "false_positives": 4,
    "true_negatives": 135,
    "false_negatives": 41
  }
}
```
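
The precision and recall figures in both reports follow directly from the confusion counts. As a quick check, the sketch below recomputes them at the 25% threshold for both variants. It also adds the F1 score, which is not part of the original reports, as one possible single-number summary of the precision/recall balance.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Confusion counts copied from the two reports at the 25% threshold
counts_at_25 = {
    "grouped": {"tp": 42, "fp": 20, "fn": 19},
    "not grouped": {"tp": 56, "fp": 17, "fn": 5},
}

scores_at_25 = {
    name: precision_recall_f1(**counts) for name, counts in counts_at_25.items()
}
for name, (precision, recall, f1) in scores_at_25.items():
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```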
- performance changes significantly depending on whether we group the high cardinality variables or not -> we choose not to group them
    - XGBoost can handle high cardinality variables encoded with OrdinalEncoder because it splits each node using a "question-based" threshold on the encoded value
    - `feature_1 >= threshold ?`
- the goal is to find a balance between precision and recall: we want to detect enough frauds without triggering too many checks, since fraud investigations are usually quite costly for insurance companies.
- a good compromise in our case could be a threshold around 25~30%
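
To make the thresholding step concrete, here is a minimal sketch of how a report like the ones above could be built from predicted probabilities. `report_at_threshold` is a hypothetical helper, not the actual `evaluation.create_classification_reports` implementation, and the labels and probabilities are toy values.

```python
def report_at_threshold(y_true, y_probs, threshold):
    """Classify as fraud when prob >= threshold, then count the outcomes."""
    y_pred = [1 if prob >= threshold else 0 for prob in y_probs]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for true, pred in pairs if true == 1 and pred == 1)
    fp = sum(1 for true, pred in pairs if true == 0 and pred == 1)
    tn = sum(1 for true, pred in pairs if true == 0 and pred == 0)
    fn = sum(1 for true, pred in pairs if true == 1 and pred == 0)
    return {
        # Guard against empty denominators when nothing is predicted positive
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "true_positives": tp,
        "false_positives": fp,
        "true_negatives": tn,
        "false_negatives": fn,
    }


# Toy data (made up): 3 frauds out of 8 cases
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_probs = [0.90, 0.10, 0.35, 0.45, 0.20, 0.28, 0.30, 0.05]

reports = {t: report_at_threshold(y_true, y_probs, t) for t in (0.25, 0.5)}
for threshold, report in reports.items():
    print(f"threshold={threshold:.0%}: {report}")
```

On this toy data, lowering the threshold from 50% to 25% catches all three frauds but adds two false positives, which is the same trade-off visible in the reports above.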