----
### 04. Modeling

**Objective:** Train and evaluate fraud-detection models. The workflow covers baseline and advanced model training, hyperparameter tuning, and out-of-fold (OOF) evaluation to compare all models, with a focus on the positive (fraud) class. The main challenge is the severe class imbalance in the dataset.

---
**Notebook Structure:**

- 4.1 Library Imports & Data Loading: Load required packages and datasets.

- 4.2 Train-Test Split: Prepare training and testing datasets.

- 4.3 Baseline Model Training: Train simple baseline models.

- 4.4 Advanced Model Training: Train advanced models, e.g., CatBoost, LightGBM, and Random Forest classifiers.

- 4.5 Hyperparameter Fine-Tuning: Tune parameters.

- 4.6 Out-of-Fold Evaluation: Conduct OOF evaluation and rank all models based on PR AUC.

---

**Usage / Notes:** Many cells include toggle flags (`True`/`False`) to control execution.

- Set the relevant fine-tuning flag (`fine_tune_rf`, `fine_tune_lightgbm`, or `fine_tune_catboost`) to `True` to run a hyperparameter tuning cell; leave it `False` to skip the long-running computation.

- Set `save_trained_model` to `True` to save a trained model, and switch it back to `False` to prevent overwriting.


Run the cells sequentially and set the toggles as required.
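
The toggle pattern used throughout the notebook can be sketched as a small guard function. This is a minimal illustration, not part of the project code: `save_if_enabled` is a hypothetical helper, and it uses `pickle` from the standard library to stay dependency-free, whereas the notebook itself calls `joblib.dump`.

```python
import pickle
import tempfile
from pathlib import Path


def save_if_enabled(obj, path, enabled):
    """Persist `obj` to `path` only when the toggle is on; return True if written."""
    if not enabled:
        return False  # toggle off: any existing file is left untouched
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(obj, fh)
    return True


# Demo in a temporary directory so nothing in the project is overwritten.
demo_path = Path(tempfile.mkdtemp()) / "model.pkl"

save_trained_model = False
wrote_when_off = save_if_enabled({"name": "demo"}, demo_path, save_trained_model)

save_trained_model = True
wrote_when_on = save_if_enabled({"name": "demo"}, demo_path, save_trained_model)

print(wrote_when_off, wrote_when_on, demo_path.exists())
```

Keeping the flag check in one place makes it harder to overwrite a tuned model by accident when re-running the notebook from top to bottom.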


----
### 4.1 Library Imports & Data Loading

In [30]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
from joblib import dump, load

from src.feature_engineering import FeatureEngineer

In [2]:
from src.data_loader import load_data

raw_data = load_data("../data/creditcard.csv")

raw_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


---
### 4.2 Train-Test Split

In [3]:
from sklearn.model_selection import train_test_split

X = raw_data.iloc[:, :-1]
y = raw_data.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=3479)

In [4]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((227845, 30), (227845,), (56962, 30), (56962,))
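
The effect of `stratify=y` can be checked on a small synthetic example. The sketch below uses made-up data (random features and a 0.2% positive rate, both assumptions chosen only for illustration) to show that the fraud rate is preserved in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 20 positives out of 10,000 rows.
rng = np.random.default_rng(0)
n_samples = 10_000
y_demo = np.zeros(n_samples, dtype=int)
y_demo[:20] = 1
X_demo = rng.normal(size=(n_samples, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.4f}, test={test_rate:.4f}")
```

Without stratification, a random 20% sample of such rare positives could easily end up with noticeably more or fewer fraud cases than the training set.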

####  Train-Test Split Retention (to ensure consistency and reproducibility)

In [5]:
# Saving the datasets
dump(X_train, "../data/X_train.joblib")
dump(y_train, "../data/y_train.joblib")
dump(X_test, "../data/X_test.joblib")
dump(y_test, "../data/y_test.joblib")

['../data/y_test.joblib']

-----
### 4.3 Baseline Logistic Regression Model

In [6]:
%%time
from sklearn.linear_model import LogisticRegression

categorical_features = ["Time_segment"]

preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ],
    # keep all other columns (eg numeric features)
    remainder="passthrough"
)

# Final pipeline
pipeline_lr = Pipeline([
    ("feature_engineer", FeatureEngineer()),
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(class_weight="balanced", random_state=3479, max_iter=5000))
])

pipeline_lr.fit(X_train, y_train)

# Convert pipeline steps to a Pandas DataFrame
pipeline_df = pd.DataFrame(
    [(i+1, name, type(step).__name__) for i, (name, step) in enumerate(pipeline_lr.steps)],
    columns=['Step', 'Name', 'Type']
)

pipeline_df.style.hide(axis="index")
# Display without the DataFrame index

CPU times: total: 13.7 s
Wall time: 5.86 s


Step,Name,Type
1,feature_engineer,FeatureEngineer
2,preprocessor,ColumnTransformer
3,classifier,LogisticRegression
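
To make the preprocessing step concrete, here is a minimal sketch of the same `ColumnTransformer` configuration applied to a toy frame. The category labels (`morning`, `night`, `evening`, `afternoon`) are hypothetical; the real values of `Time_segment` come from `FeatureEngineer`, which is not shown in this notebook. `sparse_threshold=0` is added only so the result is a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data with one categorical and two numeric columns (values are illustrative).
toy = pd.DataFrame({
    "Time_segment": ["morning", "night", "morning", "evening"],
    "Amount": [149.62, 2.69, 378.66, 123.50],
    "V1": [-1.36, 1.19, -1.36, -0.97],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Time_segment"])
    ],
    remainder="passthrough",  # numeric columns are appended unchanged
    sparse_threshold=0,       # always return a dense array
)

encoded = toy_preprocessor.fit_transform(toy)
print(encoded.shape)  # 3 one-hot columns + 2 passthrough columns

# A category unseen during fit is encoded as all zeros instead of raising an error.
unseen = pd.DataFrame({"Time_segment": ["afternoon"], "Amount": [10.0], "V1": [0.0]})
encoded_unseen = toy_preprocessor.transform(unseen)
print(encoded_unseen)
```

The one-hot columns come first in the output, followed by the passthrough columns in their original order.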


In [7]:
save_trained_model = False  # toggle on/off saving

if save_trained_model:
    dump(pipeline_lr, "../trained_models/logistic_regression_baseline.joblib")


`Considerations moving forward:`

- I will apply balanced class weights in the advanced models.
- I will use CatBoost and LightGBM, gradient-boosted tree models that are powerful on tabular data and offer built-in options for imbalanced classes.
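
As a worked example of what "balanced" class weights mean, the sketch below applies the heuristic used by scikit-learn's `class_weight="balanced"`, namely `n_samples / (n_classes * class_count)`. The counts are taken from this notebook (227,845 training rows, and 394 fraud cases as reported in the `support` column of the OOF table in section 4.6); the helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Per-class weight: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Training-set class counts: 0 = legitimate, 1 = fraud.
train_counts = {0: 227_845 - 394, 1: 394}
weights = balanced_class_weights(train_counts)

for label, weight in weights.items():
    print(f"class {label}: weight = {weight:.3f}")
```

Each fraud case therefore counts roughly 577 times as much as a legitimate transaction in the loss, which is the ratio of negatives to positives in the training set.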

----

### 4.4 Advanced Model Training

#### 4.4.1 CatBoost Classifier (Gradient Boosted Tree model)

CatBoost Classifier is a gradient-boosting model that builds decision trees (weak learners) sequentially, with each tree correcting the errors of the previous ones. It was selected as the first advanced classifier because of its:

- Native handling of categorical data.
- Support for minority classes through `class_weights` and `auto_class_weights="Balanced"`.
- Need for minimal fine-tuning.
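
The idea that each tree corrects the errors of the previous ones can be illustrated with a toy boosting loop. This is a deliberately simplified sketch for squared-error regression built from scikit-learn trees on synthetic data; it is not CatBoost's actual algorithm, which adds techniques such as ordered boosting and symmetric trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative data only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant model
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    residuals = y_toy - prediction  # errors left by the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)      # the new weak learner targets those errors
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"training MSE: start={mse_history[0]:.4f}, end={mse_history[-1]:.4f}")
```

The `learning_rate` plays the same role as in the pipeline above: smaller values take smaller corrective steps and usually need more trees.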



In [8]:
%%time
# 1. Importing CatBoost model
from catboost import CatBoostClassifier

# 2. Build the pipeline
categorical_features = ['Time_segment']

# Final pipeline
pipeline_catboost = Pipeline([
    ("feature_engineer", FeatureEngineer()),
    ("classifier", CatBoostClassifier(
        iterations=1000,
        # Number of boosted trees CatBoost will build.
        auto_class_weights="Balanced",
        # Automatically increases the importance of the minority class.
        # Handles severe class imbalance without manually computing weights.
        learning_rate=0.01,
        depth=6,
        # The depth of each decision tree.
        cat_features=categorical_features,
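        eval_metric="PRAUC",
        # Metric reported during training and used for best-model selection and
        # overfitting detection; the optimized loss is set separately via `loss_function`.
        verbose=0  # silent training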
    ))
])

# 3. Train/ fit the model
pipeline_catboost.fit(X_train, y_train)

# Convert pipeline steps to a Pandas DataFrame
pipeline_df = pd.DataFrame(
    [(i+1, name, type(step).__name__) for i, (name, step) in enumerate(pipeline_catboost.steps)],
    columns=['Step', 'Name', 'Type']
)

pipeline_df.style.hide(axis="index")
# Display without the DataFrame index

CPU times: total: 7min 47s
Wall time: 1min 32s


Step,Name,Type
1,feature_engineer,FeatureEngineer
2,classifier,CatBoostClassifier


In [9]:
save_trained_model = False  # toggle on/off saving

if save_trained_model:
    dump(pipeline_catboost, "../trained_models/catboost_default.joblib")

-------
#### 4.4.2 LightGBM (Gradient Boosted Tree model)
LightGBM has similar strengths: it can handle severe class imbalance and categorical features, and it trains quickly on large datasets.

In [10]:
# Suppress LightGBM feature name warnings globally
import warnings

warnings.filterwarnings(
    "ignore",
    message="X does not have valid feature names"
)

In [11]:
%%time

from lightgbm import LGBMClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Time_segment']

# encode categorical features
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
], remainder="passthrough")  # keep numeric features as they are

pipeline_lightgbm = Pipeline([
    ("feature_engineer", FeatureEngineer()),
    ("preprocessor", preprocessor),
    ("classifier", LGBMClassifier(
        n_estimators=1000,
        class_weight="balanced", # handles class imbalance
        learning_rate=0.01,
        max_depth=6,
        random_state=3479,
        verbose=0
    ))
])

pipeline_lightgbm.fit(X_train, y_train)

# Convert pipeline steps to a Pandas DataFrame
pipeline_df = pd.DataFrame(
    [(i+1, name, type(step).__name__) for i, (name, step) in enumerate(pipeline_lightgbm.steps)],
    columns=['Step', 'Name', 'Type']
)

pipeline_df.style.hide(axis="index")
# Display without the DataFrame index

CPU times: total: 28.4 s
Wall time: 6.22 s


Step,Name,Type
1,feature_engineer,FeatureEngineer
2,preprocessor,ColumnTransformer
3,classifier,LGBMClassifier


In [12]:
save_trained_model = False  # toggle on/off saving

if save_trained_model:
    dump(pipeline_lightgbm, "../trained_models/lightgbm_default.joblib")

-----
#### 4.4.3 Random Forest Classifier

(An aggregation-based ensemble algorithm that offers better interpretability.)

In [13]:
%%time

from sklearn.ensemble import RandomForestClassifier
categorical_features = ['Time_segment']

preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
], remainder="passthrough")

pipeline_rf = Pipeline([
    ("feature_engineer", FeatureEngineer()),
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(
        max_depth=10,
        min_samples_leaf=10,
        random_state=3479,
        n_estimators=1000,
        class_weight="balanced", # handles class imbalance
        n_jobs=-1, # use all CPU cores
    ))
])

pipeline_rf.fit(X_train, y_train)

# Convert pipeline steps to a Pandas DataFrame
pipeline_df = pd.DataFrame(
    [(i+1, name, type(step).__name__) for i, (name, step) in enumerate(pipeline_rf.steps)],
    columns=['Step', 'Name', 'Type']
)
pipeline_df.style.hide(axis="index")

CPU times: total: 18min 43s
Wall time: 1min 48s


Step,Name,Type
1,feature_engineer,FeatureEngineer
2,preprocessor,ColumnTransformer
3,classifier,RandomForestClassifier


In [14]:
save_trained_model = False  # toggle on/off saving

if save_trained_model:
    dump(pipeline_rf, "../trained_models/random_forest_default.joblib")

-----
### 4.5 Hyperparameters Fine-Tuning


#### 4.5.1 Random Forest Classifier Tuning

In [17]:
%%time

from sklearn.model_selection import RandomizedSearchCV

fine_tune_rf = False # toggle on/off the time-consuming tuning

if fine_tune_rf:

    # Set the step classifier
    rf_classifier = RandomForestClassifier(class_weight="balanced", random_state=1234) # handles class imbalance
    pipeline_rf.set_params(classifier = rf_classifier)

    # Distribution of the parameters to tune
    rf_param_dist  = {
        "classifier__n_estimators": [500, 1000, 1500, 2000],
        "classifier__max_depth": [8, 10, 12],
        "classifier__min_samples_split": [10, 20, 50],
        "classifier__min_samples_leaf": [5, 10, 20],
        "classifier__max_features": ["sqrt", "log2", 0.5],
        # How many features the model looks at when splitting each node in a decision tree
        # sqrt features, log2(features), or 50% of features
    }

    # Randomized Search for tuning
    rf_search = RandomizedSearchCV(
        estimator=pipeline_rf,
        param_distributions=rf_param_dist,
        n_iter=20,  # sample 20 combinations
        scoring="average_precision", # PR-AUC
        cv=3,
        random_state=3479,
        n_jobs=-1,
    )

    # Fit the search
    rf_search.fit(X_train, y_train)

    # Extract best model
    rf_best_model = rf_search.best_estimator_

CPU times: total: 0 ns
Wall time: 7.39 μs
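
To put `n_iter=20` in context, the size of the full search space can be compared with the number of sampled combinations. The sketch below reuses the parameter lists from the cell above with scikit-learn's `ParameterGrid` and `ParameterSampler`; it only enumerates configurations and fits nothing.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

rf_param_dist = {
    "classifier__n_estimators": [500, 1000, 1500, 2000],
    "classifier__max_depth": [8, 10, 12],
    "classifier__min_samples_split": [10, 20, 50],
    "classifier__min_samples_leaf": [5, 10, 20],
    "classifier__max_features": ["sqrt", "log2", 0.5],
}

# Every possible combination of the listed values.
n_total = len(ParameterGrid(rf_param_dist))

# The 20 combinations a randomized search with this seed would draw.
sampled = list(ParameterSampler(rf_param_dist, n_iter=20, random_state=3479))

n_cv_fits = len(sampled) * 3  # cv=3 folds per sampled combination

print(f"{len(sampled)} of {n_total} combinations sampled, {n_cv_fits} cross-validation fits")
```

With the default `refit=True`, `RandomizedSearchCV` adds one final fit of the best configuration on the full training set.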


In [18]:
save_trained_model = False # toggle on/off saving after time-consuming tuning

if save_trained_model:
    dump(rf_best_model,"../trained_models/random_forest_tuned_for_pr_auc.joblib")

-----
#### 4.5.2 LightGBM Classifier Tuning

In [39]:
%%time

from sklearn.model_selection import RandomizedSearchCV

fine_tune_lightgbm = False # toggle on/off the time-consuming tuning

if fine_tune_lightgbm:

    # pos_ratio = #negatives / #positives, the imbalance ratio that scale_pos_weight expects
    pos_ratio = (len(y_train) - y_train.sum()) / max(1, y_train.sum())

    # Set the step classifier
    lightgbm_classifier = LGBMClassifier(class_weight="balanced", random_state=1234, verbose=-1)  # handles class imbalance
    # verbose=-1 silences LightGBM warnings
    pipeline_lightgbm.set_params(classifier=lightgbm_classifier, verbose=False)

    # Distribution of the parameters to tune
    lightgbm_param_dist = {
        "classifier__n_estimators": [500, 1000, 1500], # number of trees
        "classifier__learning_rate": [0.01, 0.03, 0.05],
        "classifier__num_leaves": [31, 63],   # tree complexity: smaller to avoid overfitting rare positives
        "classifier__max_depth": [6, 8, 10], # max tree depth
        "classifier__min_child_samples": [5, 10, 20],   # min samples per leaf; prevents overfitting rare positives
        "classifier__subsample": [0.8, 1.0],       # row sampling fraction; higher to see positives
        "classifier__colsample_bytree": [0.8, 1.0],  # feature sampling fraction
        "classifier__reg_alpha": [0.0, 0.1, 1.0],   # L1 regularization
        "classifier__reg_lambda": [0.0, 0.1, 1.0],  # L2 regularization
        "classifier__scale_pos_weight": [max(1, int(pos_ratio * factor)) for factor in [0.3, 0.5, 1.0,1.5, 2.0, 2.5]],
        # dynamically  scales weight of the (positive/ fraud class) to counter imbalance
        "classifier__boosting_type": ['gbdt', 'dart'], # boosting algorithm
    }

    # Randomized Search for tuning
    lightgbm_search = RandomizedSearchCV(
        estimator=pipeline_lightgbm,
        param_distributions=lightgbm_param_dist,
        n_iter=25,  # sample 25 combinations
        scoring="average_precision", # maximise PR-AUC
        cv=3,
        random_state=3479,
        n_jobs=-1,
    )

    # Fit the search
    lightgbm_search.fit(X_train, y_train)

    # Extract best model
    light_gbm_best_model = lightgbm_search.best_estimator_

CPU times: total: 48.9 s
Wall time: 19min 16s


In [20]:
save_trained_model = False  # toggle on/off saving after time-consuming tuning

if save_trained_model:
    dump(light_gbm_best_model, "../trained_models/lightgbm_tuned_for_pr_auc.joblib")

----
#### 4.5.3 CatBoost Classifier Tuning

In [21]:
%%time

from sklearn.model_selection import RandomizedSearchCV

fine_tune_catboost = False # toggle on/off the time-consuming tuning

if fine_tune_catboost:

    # Class imbalance ratio = #negatives / #positives, the scale that scale_pos_weight expects
    pos_ratio = (len(y_train) - y_train.sum()) / max(1, y_train.sum())

    # Initialize CatBoost classifier (silent to avoid logs)
    categorical_features = ['Time_segment']

    catboost_classifier = CatBoostClassifier(
        random_state=1234,
        verbose=False, # suppress output
    )
    pipeline_catboost.set_params(classifier = catboost_classifier)

    # Define parameter grid for RandomizedSearchCV
    catboost_param_dist = {
        "classifier__iterations": [500, 1000, 1500],   # number of trees
        "classifier__learning_rate": [0.01, 0.03, 0.05],
        "classifier__depth": [4, 6, 8],   # tree depth; smaller helps with rare positives
        "classifier__l2_leaf_reg": [1, 3, 5],      # L2 regularization to smooth weights
        "classifier__border_count": [32, 64, 128],   # number of bins for numerical features
        "classifier__scale_pos_weight": [max(1, int(pos_ratio * f)) for f in [0.5, 1.0, 1.5, 2.0]],  # imbalance
        "classifier__bagging_temperature": [0.0, 0.5, 1.0],   # randomness in bagging to reduce overfitting
    }

    # Randomized Search setup
    catboost_search = RandomizedSearchCV(
        estimator=pipeline_catboost,
        param_distributions=catboost_param_dist,
        n_iter=20,  # try 20 combinations
        scoring="average_precision", # PR-AUC
        cv=3,
        random_state=3479,
        n_jobs=-1,
    )

    # Fit the search with the cat_features
    catboost_search.fit(X_train, y_train, classifier__cat_features=categorical_features)

    # Extract best model
    catboost_best_model = catboost_search.best_estimator_

CPU times: total: 0 ns
Wall time: 5.72 μs
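
As a worked example of how `scale_pos_weight` candidates relate to the class imbalance, the sketch below derives them from the training-set counts used in this notebook (394 fraud cases out of 227,845 rows). The helper `scale_pos_weight_candidates` is hypothetical, and it assumes the common convention of centring the candidates on the negatives-to-positives ratio.

```python
def scale_pos_weight_candidates(n_negative, n_positive, factors):
    """Scale the negatives/positives ratio by each factor, with a floor of 1."""
    base = n_negative / max(1, n_positive)
    return [max(1, int(base * factor)) for factor in factors]


n_positive = 394
n_negative = 227_845 - n_positive

candidates = scale_pos_weight_candidates(n_negative, n_positive, [0.5, 1.0, 1.5, 2.0])
print(candidates)
```

The direction of the ratio matters: positives divided by negatives is about 0.0017 here, so after `int(...)` and the floor of 1 every candidate would collapse to 1 and the search would not vary the class weight at all.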


In [22]:
save_trained_model = False  # toggle on/off saving after time-consuming tuning

if save_trained_model:
    dump(catboost_best_model, "../trained_models/catboost_tuned_for_pr_auc.joblib")

-----
### 4.6  Out-of-Fold Evaluation of All Models (Focused on Positive/Fraud Class)

In [23]:
load_trained_models = True

if load_trained_models:
    logistic_regression_model = load("../trained_models/logistic_regression_baseline.joblib")
    catboost_model = load("../trained_models/catboost_default.joblib")
    lightgbm_model = load("../trained_models/lightgbm_default.joblib")
    rf_model = load("../trained_models/random_forest_default.joblib")


    lightgbm_model_tuned = load("../trained_models/lightgbm_tuned_for_pr_auc.joblib")
    catboost_model_tuned = load("../trained_models/catboost_tuned_for_pr_auc.joblib")
    rf_model_tuned = load("../trained_models/random_forest_tuned_for_pr_auc.joblib")

In [24]:
%%time
run_oof_validation = False

if run_oof_validation:
    from src.model import oof_validation

    oof_metrics = oof_validation({"Logistic Regression (Baseline)": logistic_regression_model,
                                  "CatBoost (Default)": catboost_model,
                                  "LightGBM (Default)": lightgbm_model,
                                  "Random Forest (Default)": rf_model,
                                  "CatBoost (Tuned)": catboost_model_tuned,
                                  "LightGBM (Tuned)": lightgbm_model_tuned,
                                  "Random Forest (Tuned)": rf_model_tuned,
                                  }, X_train, y_train, categorical_features=['Time_segment']
                             )

Default metric period is 5 because PRAUC is/are not implemented for GPU
Metric PRAUC is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time
Default metric period is 5 because PRAUC is/are not implemented for GPU
Metric PRAUC is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time
Default metric period is 5 because PRAUC is/are not implemented for GPU
Metric PRAUC is not implemented on GPU. Will use CPU for metric computation, this could significantly affect learning time


CPU times: total: 4min 45s
Wall time: 12min 22s
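
The implementation of `oof_validation` lives in `src.model` and is not shown here. As a rough idea of what an out-of-fold evaluation involves, the following sketch uses scikit-learn's `cross_val_predict` on synthetic data; the dataset, the fold count, and the model are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced data (about 5% positives) standing in for the fraud dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each row is scored by a model that never saw it during training.
oof_proba = cross_val_predict(
    model, X_demo, y_demo, cv=cv, method="predict_proba"
)[:, 1]

oof_pr_auc = average_precision_score(y_demo, oof_proba)
oof_roc_auc = roc_auc_score(y_demo, oof_proba)
print(f"OOF PR-AUC={oof_pr_auc:.3f}, OOF ROC-AUC={oof_roc_auc:.3f}")
```

Because every prediction comes from a held-out fold, the resulting metrics estimate generalization without touching the test set.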


In [27]:
save_results = False

if save_results:
    oof_metrics.to_csv("../results/tables/oof_validation_metrics_all_models.csv", index=True)

In [29]:
load_oof_val_metrics = True

if load_oof_val_metrics:
    oof_val_metrics = load_data("../results/tables/oof_validation_metrics_all_models.csv")

oof_val_metrics

Unnamed: 0,model,precision,recall,f1-score,support,val pr auc,val roc auc
0,CatBoost (Tuned),0.871,0.787,0.827,394.0,0.818,0.977
1,LightGBM (Tuned),0.865,0.746,0.801,394.0,0.807,0.96
2,Random Forest (Tuned),0.867,0.764,0.812,394.0,0.782,0.947
3,Random Forest (Default),0.846,0.794,0.819,394.0,0.755,0.956
4,CatBoost (Default),0.832,0.805,0.818,394.0,0.749,0.967
5,LightGBM (Default),0.763,0.759,0.761,394.0,0.744,0.955
6,Logistic Regression (Baseline),0.056,0.865,0.105,394.0,0.684,0.963


`Outcomes:`

This out-of-fold (OOF) evaluation of all models `focuses exclusively on positive/fraud class metrics`, since performance on the fraud class is the priority.

**Baseline:** The baseline Logistic Regression achieved a PR-AUC of ~0.684, which is a strong starting point given the severe class imbalance. However, its precision for the positive (fraud) class is very low (0.056) despite high recall (0.865), indicating many false positives. This underscores the need for more powerful models.
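
To see what the baseline's precision implies in absolute terms, approximate confusion-matrix counts can be recovered from the table values (precision 0.056, recall 0.865, support 394). This is a back-of-the-envelope sketch: the table values are rounded, so the counts are estimates, and the helper names are hypothetical.

```python
def counts_from_metrics(precision, recall, support):
    """Estimate (TP, FP, FN) for the positive class from summary metrics."""
    tp = recall * support                 # recall = TP / (TP + FN)
    fn = support - tp
    fp = tp * (1 - precision) / precision  # precision = TP / (TP + FP)
    return tp, fp, fn


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


tp, fp, fn = counts_from_metrics(precision=0.056, recall=0.865, support=394)
f1 = f1_from(0.056, 0.865)

print(f"TP ~ {tp:.0f}, FP ~ {fp:.0f}, FN ~ {fn:.0f}, F1 ~ {f1:.3f}")
```

Under these figures the baseline raises several thousand false alarms to catch a few hundred fraud cases, which is what the low F1-score in the table reflects.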

**Production Candidates:** The CatBoost (Tuned) model achieved the highest PR-AUC (~0.818) with strong precision and recall for the positive class. The tuned LightGBM (~0.807) and Random Forest (~0.782) models follow behind it. Although Random Forest (Tuned) has the lower PR-AUC of the two, it outperforms LightGBM (Tuned) in precision, recall, and F1-score, making it a strong candidate when interpretability is also considered.

Based on this assessment, CatBoost (Tuned) and Random Forest (Tuned) are the most suitable candidates for production, as they perform best for the positive class. For a more in-depth analysis, see the next notebook: 05_Results_analysis.


-----------
Next Step: Results analysis → Analyse the models' performance and deploy the best model for production

------------