----
### 04. Modeling

#### Objective:
This notebook trains and evaluates the models. The workflow covers training a baseline and several advanced models, hyperparameter tuning, and out-of-fold (OOF) evaluation of all models, with a focus on the positive (fraud) class. The main challenge is the severe class imbalance.
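
To make the imbalance concrete, here is a small sketch using the training-set counts reported later in this notebook (227,845 rows, 394 of them fraud). It shows why plain accuracy is a poor yardstick here: a hypothetical model that never predicts fraud still looks almost perfect. The arrays are rebuilt from the counts only, so this is an illustration rather than a run on the real data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Counts taken from the train split shape and the "support" column in section 4.6
n_total, n_fraud = 227_845, 394

# Rebuild a label vector with the same class counts (order is irrelevant here)
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# Hypothetical trivial model: always predict "not fraud"
y_pred = np.zeros(n_total, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
fraud_recall = recall_score(y_true, y_pred, zero_division=0)
prevalence = n_fraud / n_total

print(f"accuracy={accuracy:.4f}  fraud recall={fraud_recall:.1f}  prevalence={prevalence:.5f}")
```

This is the motivation for ranking models by PR AUC and fraud-class precision/recall rather than by accuracy.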

#### Notebook Structure:

- 4.1 Library Imports and Data Loading: Load the required packages, the raw dataset, and the engineered dataset.
- 4.2 Train-Test Split: Prepare training and testing datasets.
- 4.3 Baseline Model Training: Train a simple baseline model (Logistic Regression).
- 4.4 Advanced Model Training: Train advanced models, e.g. CatBoost, LightGBM, and Random Forest classifiers.
- 4.5 Hyperparameter Fine-Tuning: Tune the hyper-parameters.
- 4.6 Out-of-Fold Evaluation: Conduct OOF evaluation and rank all models based on PR AUC.

---

#### Compute Environment/Machine:


| Component | Specification |
|----------|----------------|
| CPU | AMD Ryzen 5 7600X (6 cores) |
| RAM | 32 GB |
| Models | Logistic Regression, Random Forest, LightGBM, CatBoost |

----
#### Approximate Runtimes (on the mentioned machine):

| Task | Runtime |
|------|---------|
| Baseline / default models | < 2 minutes each |
| Random Forest fine-tuning (50 iterations) | **~7 hours 45 minutes** |
| LightGBM tuning (50 iterations) | ~41 minutes |
| CatBoost tuning (50 iterations) | ~48 minutes |
| OOF evaluation | ~12 minutes |

---
#### Usage Note:


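Many cells include toggle flags (`True`/`False`) that control execution. Run the notebook from top to bottom and switch the flags as needed:

- Set `fine_tune_rf`, `fine_tune_lightgbm`, or `fine_tune_catboost` to `False` to skip the long-running tuning (also skip the `best_estimator_` extraction and save cells that follow, since no search object exists in that case).
- Set `save_trained_model` to `False` to avoid overwriting the saved models.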
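
One way to make the flags safer is to fall back to the saved artifact whenever a long-running step is switched off. The sketch below is an assumption about how that could look, not code from this project: `fit_or_load` is a hypothetical helper, and a `DummyClassifier` on toy data stands in for the real pipelines.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

from joblib import dump, load
from sklearn.dummy import DummyClassifier


def fit_or_load(run_fit, fit_fn, model_path, save=True):
    """Fit when the flag is on, otherwise load the previously saved model."""
    model_path = Path(model_path)
    if run_fit:
        model = fit_fn()
        if save:
            dump(model, model_path)
        return model
    if not model_path.exists():
        raise FileNotFoundError(f"{model_path} not found; run once with the flag set to True")
    return load(model_path)


with TemporaryDirectory() as tmp:
    path = Path(tmp) / "dummy.joblib"
    X_demo, y_demo = [[0], [1], [2], [3]], [0, 0, 0, 1]  # toy data

    # Flag on: fit and save
    fitted = fit_or_load(True, lambda: DummyClassifier(strategy="prior").fit(X_demo, y_demo), path)
    # Flag off: nothing is refitted, the saved model is loaded instead
    reloaded = fit_or_load(False, None, path)

    same_prediction = list(fitted.predict(X_demo)) == list(reloaded.predict(X_demo))
    print(same_prediction)
```

With this pattern, a cell that needs the tuned model works whether or not the tuning flag was enabled in the current session.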

----



### 4.1 Library Imports & Data Loading

In [1]:
import sys
from pathlib import Path

PROJECT_ROOT = Path().resolve().parent  # go up from notebooks/ to project root
sys.path.insert(0, str(PROJECT_ROOT))

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
import pandas as pd
from joblib import dump, load

from src.feature_engineering import FeatureEngineer


##### Raw Data

In [3]:
from src.data_loader import load_data
from src.feature_engineering import FeatureEngineer

raw_data = load_data("../data/creditcard.csv")

raw_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
# split into features and target values
X = raw_data.iloc[:, :-1]
y = raw_data.iloc[:, -1]

##### Data After Feature Engineering

In [5]:
# apply feature engineer class operations on X
engineer = FeatureEngineer()
engineer.fit(X)
engineered_X = engineer.transform(X)

engineered_X.head()

Unnamed: 0,V1,V3,V4,V6,V9,V10,V11,V12,V13,V14,...,V7_scaled,V8_scaled,V20_scaled,V21_scaled,V23_scaled,V27_scaled,V28_scaled,Amount_scaled,Hour_of_day,Time_segment
0,-1.359807,2.536347,1.378155,0.462388,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,...,0.266815,0.786444,0.582942,0.561184,0.663793,0.418976,0.312697,0.005824,0.0,early_morning
1,1.191857,0.16648,0.448154,-0.082361,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,...,0.264875,0.786298,0.57953,0.55784,0.666938,0.416345,0.313423,0.000105,0.0,early_morning
2,-1.358354,1.773209,0.37978,1.800499,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,...,0.270177,0.788042,0.585855,0.565477,0.678939,0.415489,0.311911,0.014739,0.0,early_morning
3,-0.966272,1.792993,-0.863291,1.247203,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,...,0.266803,0.789434,0.57805,0.559734,0.662607,0.417669,0.314371,0.004807,0.0,early_morning
4,-1.158233,1.548718,0.403034,0.095921,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,...,0.268968,0.782484,0.584615,0.561327,0.663392,0.420561,0.31749,0.002724,0.0,early_morning


---
### 4.2 Train-Test Split

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3479
)

In [7]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((227845, 30), (227845,), (56962, 30), (56962,))
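
The `stratify=y` argument matters because fraud cases are so rare. As a rough illustration on synthetic labels (not the credit-card data), the sketch below checks that a stratified 80/20 split keeps the positive rate the same in both parts. `X_demo` and `y_demo` are made-up stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3479)

# Synthetic stand-in: 10,000 rows with 20 positives (0.2%)
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1
X_demo = rng.normal(size=(10_000, 3))

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=3479
)

# With stratification the 20 positives are divided in the same 80/20 proportion
print("train positives:", int(y_tr.sum()), "rate:", y_tr.mean())
print("test positives: ", int(y_te.sum()), "rate:", y_te.mean())
```

Without stratification, a random split could leave the test set with noticeably more or fewer fraud cases than expected, which would make the held-out metrics less stable.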

#### Saving the Train-Test Split (for consistency and reproducibility)

In [None]:
# Saving the datasets
dump(X_train, "../data/X_train.joblib")
dump(y_train, "../data/y_train.joblib")
dump(X_test, "../data/X_test.joblib")
dump(y_test, "../data/y_test.joblib")

-----
### 4.3 Baseline Logistic Regression Model

In [38]:
%%time

from sklearn.linear_model import LogisticRegression

categorical_features = ["Time_segment"]

preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ],
    # keep all other columns (eg numeric features)
    remainder="passthrough"
)

# Final pipeline
pipeline_lr = Pipeline([
    ("feature_engineer", FeatureEngineer()),
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(class_weight="balanced", random_state=3479, max_iter=5000))
])

pipeline_lr.fit(X_train, y_train)

# Convert pipeline steps to a Pandas DataFrame
pipeline_df = pd.DataFrame(
    [(i+1, name, type(step).__name__) for i, (name, step) in enumerate(pipeline_lr.steps)],
    columns=['Step', 'Name', 'Type']
)

pipeline_df.style.hide(axis="index")
# Display without the DataFrame index

Step,Name,Type
1,feature_engineer,FeatureEngineer
2,preprocessor,ColumnTransformer
3,classifier,LogisticRegression


In [9]:
save_trained_model = True  # toggle on/off saving

if save_trained_model:
    dump(pipeline_lr, "../trained_models/logistic_regression_baseline.joblib")


**Considerations moving forward:**

- Keep applying balanced class weights.
- Use CatBoost and LightGBM, gradient-boosted tree models that are strong on tabular data and handle imbalanced classes well.
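
For reference, this is what scikit-learn's `"balanced"` heuristic works out to with the class counts of this training split (227,451 legitimate and 394 fraud rows, derived from the shapes and support values in this notebook). It is a sketch of the scikit-learn formula `n_samples / (n_classes * class_count)`; CatBoost's `auto_class_weights="Balanced"` normalizes differently, so the absolute numbers are not directly comparable.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts of the training split (assumed from the notebook outputs)
n_legit, n_fraud = 227_451, 394
y_counts = np.concatenate([np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)])

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_counts)

# Each fraud row counts as much as roughly n_legit / n_fraud legitimate rows
print(dict(zip([0, 1], weights.round(3))))
print("fraud / legit weight ratio:", round(weights[1] / weights[0], 1))
```

The ratio between the two weights equals the negative-to-positive ratio, so both classes contribute the same total weight to the loss.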

----

### 4.4 Advanced Model Training

### 4.4.1 CatBoost Classifier (Gradient Boosted Tree model)

CatBoost Classifier is a gradient-boosting model that builds decision trees (weak learners) sequentially, with each tree correcting the errors of the previous ones. It was selected as the first advanced classifier because of:

- Native handling of categorical data.
- Built-in handling of minority classes through `class_weights` or `auto_class_weights="Balanced"`.
- Minimal need for fine-tuning.



In [39]:
%%time

# 1. Importing CatBoost model
from catboost import CatBoostClassifier

# 2. Build the pipeline
categorical_features = ['Time_segment']

# Final pipeline
pipeline_catboost = Pipeline([
    ("feature_engineer", FeatureEngineer()),
    ("classifier", CatBoostClassifier(
        iterations=1000,
        # Number of boosted trees CatBoost will build.
        auto_class_weights="Balanced",
        # Automatically increases the importance of the minority class.
        # Handles severe class imbalance without manually computing weights.
        learning_rate=0.01,
        depth=6,
        # The depth of each decision tree.
        cat_features=categorical_features,
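        eval_metric="PRAUC",
        # Metric used for overfitting detection and best-model selection when enabled;
        # the loss actually optimized stays the classifier default (Logloss).
        random_state=3479,
        verbose=0  # silent training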
    ))
])

# 3. Train/ fit the model
pipeline_catboost.fit(X_train, y_train)

# Convert pipeline steps to a Pandas DataFrame
pipeline_df = pd.DataFrame(
    [(i+1, name, type(step).__name__) for i, (name, step) in enumerate(pipeline_catboost.steps)],
    columns=['Step', 'Name', 'Type']
)

pipeline_df.style.hide(axis="index")
# Display without the DataFrame index

Step,Name,Type
1,feature_engineer,FeatureEngineer
2,classifier,CatBoostClassifier


In [32]:
save_trained_model = True  # toggle on/off saving

if save_trained_model:
    dump(pipeline_catboost, "../trained_models/catboost_default.joblib")

-------
### 4.4.2 LightGBM (Gradient Boosted Tree model)
LightGBM shares these strengths: it handles severe class imbalance and categorical features, and it trains quickly on large datasets. In this pipeline, `Time_segment` is one-hot encoded before it reaches the model.

In [12]:
# Suppress LightGBM feature name warnings globally
import warnings

warnings.filterwarnings(
    "ignore",
    message="X does not have valid feature names"
)

In [40]:
%%time

from lightgbm import LGBMClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Time_segment']

# encode categorical features
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
], remainder="passthrough")  # keep numeric features as they are

pipeline_lightgbm = Pipeline([
    ("feature_engineer", FeatureEngineer()),
    ("preprocessor", preprocessor),
    ("classifier", LGBMClassifier(
        n_estimators=1000,
        class_weight="balanced", # handles class imbalance
        learning_rate=0.01,
        max_depth=6,
        random_state=3479,
        verbose=0
    ))
])

pipeline_lightgbm.fit(X_train, y_train)

# Convert pipeline steps to a Pandas DataFrame
pipeline_df = pd.DataFrame(
    [(i+1, name, type(step).__name__) for i, (name, step) in enumerate(pipeline_lightgbm.steps)],
    columns=['Step', 'Name', 'Type']
)

pipeline_df.style.hide(axis="index")
# Display without the DataFrame index

Step,Name,Type
1,feature_engineer,FeatureEngineer
2,preprocessor,ColumnTransformer
3,classifier,LGBMClassifier


In [14]:
save_trained_model = True  # toggle on/off saving

if save_trained_model:
    dump(pipeline_lightgbm, "../trained_models/lightgbm_default.joblib")

-----
### 4.4.3 Random Forest Classifier

(A bagging-based ensemble of decision trees, chosen for its better interpretability.)
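
One common way to inspect a Random Forest is through its impurity-based feature importances. The sketch below uses a synthetic dataset and made-up feature names (`f0` to `f5`), so it only illustrates the mechanics; for the real pipeline the fitted forest would be reached through `pipeline_rf.named_steps["classifier"]`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in data (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=3479,
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical names

rf_demo = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=3479
).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features
importances = pd.Series(rf_demo.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances can favor high-cardinality or continuous features, so they are best read as a first look rather than a definitive ranking.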

In [41]:
%%time

from sklearn.ensemble import RandomForestClassifier
categorical_features = ['Time_segment']

preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
], remainder="passthrough")

pipeline_rf = Pipeline([
    ("feature_engineer", FeatureEngineer()),
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(
        max_depth=10,
        min_samples_leaf=10,
        random_state=3479,
        n_estimators=1000,
        class_weight="balanced", # handles class imbalance
        n_jobs=-1, # use all CPU cores
    ))
])

pipeline_rf.fit(X_train, y_train)

# Convert pipeline steps to a Pandas DataFrame
pipeline_df = pd.DataFrame(
    [(i+1, name, type(step).__name__) for i, (name, step) in enumerate(pipeline_rf.steps)],
    columns=['Step', 'Name', 'Type']
)
pipeline_df.style.hide(axis="index")

Step,Name,Type
1,feature_engineer,FeatureEngineer
2,preprocessor,ColumnTransformer
3,classifier,RandomForestClassifier


In [16]:
save_trained_model = True  # toggle on/off saving

if save_trained_model:
    dump(pipeline_rf, "../trained_models/random_forest_default.joblib")

-----
### 4.5 Hyperparameter Fine-Tuning


### 4.5.1 Random Forest Classifier Tuning

In [None]:
%%time

from sklearn.model_selection import RandomizedSearchCV

fine_tune_rf = True # toggle on/off the time-consuming tuning

if fine_tune_rf:

    # Set the step classifier
    rf_classifier = RandomForestClassifier(class_weight="balanced", random_state=1234) # handles class imbalance
    pipeline_rf.set_params(classifier = rf_classifier)

    # Distribution of the parameters to tune
    rf_param_dist  = {
        "classifier__n_estimators": [500, 1000, 1500, 2000],
        "classifier__max_depth": [8, 10, 12],
        "classifier__min_samples_split": [10, 20, 50],
        "classifier__min_samples_leaf": [5, 10, 20],
        "classifier__max_features": ["sqrt", "log2", 0.5],
        # How many features the model looks at when splitting each node in a decision tree
        # sqrt features, log2(features), or 50% of features
        "classifier__criterion": ["gini", "entropy"],
        "classifier__bootstrap": [True, False],
    }

    # Randomized Search for tuning

    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=3479)

    rf_search = RandomizedSearchCV(
        estimator=pipeline_rf,
        param_distributions=rf_param_dist,
        n_iter=50,  # sample 50 combinations
        scoring="average_precision", # PR-AUC
        cv=cv,
        refit=True, # After finding the best hyperparameters, re-train the final model on the full training set as best_estimator_
        random_state=3479,
        n_jobs=-1,
        verbose=2,
    )

    # Fit the search
    rf_search.fit(X_train, y_train)

In [18]:
# Extract best model
rf_best_model = rf_search.best_estimator_

In [19]:
save_trained_model = True # toggle on/off saving after time-consuming tuning

if save_trained_model:
    dump(rf_best_model,"../trained_models/random_forest_tuned_for_pr_auc.joblib")
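
The `classifier__` prefix in the parameter names above routes each value to the pipeline step named `classifier`. The following is a scaled-down sketch of the same search pattern on synthetic data, showing how the sampled candidates can be inspected afterwards through `cv_results_`. The tiny grid, the `StandardScaler` step, and the data are illustrative assumptions chosen to keep the run short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=1500, n_features=8, weights=[0.95, 0.05], random_state=3479
)

pipe = Pipeline([
    ("scaler", StandardScaler()),  # placeholder for the real preprocessing steps
    ("classifier", RandomForestClassifier(class_weight="balanced", random_state=3479)),
])

# "<step name>__<parameter>" addresses a parameter of that pipeline step
param_dist = {
    "classifier__n_estimators": [20, 40],
    "classifier__max_depth": [4, 8],
    "classifier__min_samples_leaf": [5, 10],
}

search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_dist,
    n_iter=4,                       # sample 4 of the 8 combinations
    scoring="average_precision",    # PR AUC, as in the notebook
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=3479),
    refit=True,
    random_state=3479,
)
search.fit(X_demo, y_demo)

# One row per sampled candidate, best first
ranking = (
    pd.DataFrame(search.cv_results_)
    .sort_values("rank_test_score")[["params", "mean_test_score", "std_test_score"]]
)
print(search.best_params_)
print(ranking)
```

Looking at `std_test_score` next to the mean helps judge whether the gap between the top candidates is larger than the fold-to-fold noise.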

-----
### 4.5.2 LightGBM Classifier Tuning

In [None]:
%%time

from sklearn.model_selection import RandomizedSearchCV

fine_tune_lightgbm = True # toggle on/off the time-consuming tuning

if fine_tune_lightgbm:

    # pos_ratio = #positives / #negatives in the dataset
    pos_ratio = y_train.sum() / (len(y_train) - y_train.sum())

    # Set the step classifier
    # class_weight="balanced" handles class imbalance; verbose=-1 silences LightGBM warnings
    lightgbm_classifier = LGBMClassifier(class_weight="balanced", random_state=1234, verbose=-1)
    pipeline_lightgbm.set_params(classifier=lightgbm_classifier, verbose=False)

    # Distribution of the parameters to tune
    lightgbm_param_dist = {
        "classifier__n_estimators": [500, 1000, 1500], # number of trees
        "classifier__learning_rate": [0.01, 0.03, 0.05],
        "classifier__num_leaves": [31, 63],   # tree complexity: smaller to avoid overfitting rare positives
        "classifier__max_depth": [6, 8, 10], # max tree depth
        "classifier__min_child_samples": [5, 10, 20],   # min samples per leaf; prevents overfitting rare positives
        "classifier__subsample": [0.8, 1.0],       # row sampling fraction; only takes effect when subsample_freq > 0
        "classifier__colsample_bytree": [0.8, 1.0],  # feature sampling fraction
        "classifier__reg_alpha": [0.0, 0.1, 1.0],   # L1 regularization
        "classifier__reg_lambda": [0.0, 0.1, 1.0],  # L2 regularization
        "classifier__scale_pos_weight": [max(1, int(pos_ratio * factor)) for factor in [0.3, 0.5, 1.0, 1.5, 2.0, 2.5]],
        # intended to scale the weight of the positive (fraud) class; note that pos_ratio is
        # positives/negatives (< 1 here), so int(...) truncates to 0 and every candidate equals 1
        "classifier__boosting_type": ['gbdt', 'dart'], # boosting algorithm
    }

    # Randomized Search for tuning
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=3479)

    lightgbm_search = RandomizedSearchCV(
        estimator=pipeline_lightgbm,
        param_distributions=lightgbm_param_dist,
        n_iter=50,  # sample 50 combinations
        scoring="average_precision", # maximise PR-AUC
        cv=cv,
        refit=True, # After finding the best hyperparameters, re-train the final model on the full training set as best_estimator_
        random_state=3479,
        n_jobs=-1,
        verbose=2,
    )

    # Fit the search
    lightgbm_search.fit(X_train, y_train)

In [21]:
# Extract best model
light_gbm_best_model = lightgbm_search.best_estimator_

In [22]:
save_trained_model = True  # toggle on/off saving after time-consuming tuning

if save_trained_model:
    dump(light_gbm_best_model, "../trained_models/lightgbm_tuned_for_pr_auc.joblib")
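
A quick arithmetic check of the `scale_pos_weight` candidates, using the training counts reported in this notebook (227,845 rows, 394 fraud). Because `pos_ratio` is positives divided by negatives, the expression used in the tuning cells collapses to 1 for every factor; the same expression appears in the CatBoost search in 4.5.3. The "alternative" list below is only an assumption about what a negatives-to-positives version would produce, not something this notebook ran.

```python
# Training-set counts taken from the notebook outputs
n_total, n_fraud = 227_845, 394
n_legit = n_total - n_fraud

factors = [0.3, 0.5, 1.0, 1.5, 2.0, 2.5]

# As written in the tuning cell: positives / negatives
pos_ratio = n_fraud / n_legit
as_written = [max(1, int(pos_ratio * f)) for f in factors]

# Hypothetical alternative: negatives / positives
neg_pos_ratio = n_legit / n_fraud
alternative = [max(1, int(neg_pos_ratio * f)) for f in factors]

print(f"pos_ratio={pos_ratio:.5f} ->", as_written)
print(f"neg_pos_ratio={neg_pos_ratio:.1f} ->", alternative)
```

In practice this means the reported tuned results were obtained with `scale_pos_weight=1`. If the inverse ratio were adopted for LightGBM, `class_weight="balanced"` would probably need to be dropped to avoid weighting the fraud class twice; that is a judgement call this notebook did not test.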

----
### 4.5.3 CatBoost Tuning

In [None]:
%%time

from sklearn.model_selection import RandomizedSearchCV

fine_tune_catboost = True # toggle on/off the time-consuming tuning

if fine_tune_catboost:

    # Calculate class imbalance ratio (positives / negatives)
    pos_ratio = y_train.sum() / max(1, (len(y_train) - y_train.sum()))

    # Initialize CatBoost classifier (silent to avoid logs)
    categorical_features = ['Time_segment']

    catboost_classifier = CatBoostClassifier(
        random_state=1234,
        verbose=False, # suppress output
    )
    pipeline_catboost.set_params(classifier = catboost_classifier)

    # Define parameter distributions for RandomizedSearchCV
    catboost_param_dist = {
        "classifier__iterations": [500, 1000, 1500],   # number of trees
        "classifier__learning_rate": [0.01, 0.03, 0.05],
        "classifier__depth": [4, 6, 8],   # tree depth; smaller helps with rare positives
        "classifier__l2_leaf_reg": [1, 3, 5],      # L2 regularization to smooth weights
        "classifier__border_count": [32, 64, 128],   # number of bins for numerical features
        # imbalance weight; evaluates to 1 for every f because pos_ratio < 1 and int(...) truncates to 0
        "classifier__scale_pos_weight": [max(1, int(pos_ratio * f)) for f in [0.5, 1.0, 1.5, 2.0]],
        "classifier__bagging_temperature": [0.0, 0.5, 1.0],   # randomness in bagging to reduce overfitting
    }

    # Randomized Search setup
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=3479)

    catboost_search = RandomizedSearchCV(
        estimator=pipeline_catboost,
        param_distributions=catboost_param_dist,
        n_iter=50,  # try 50 combinations
        scoring="average_precision", # PR-AUC
        cv=cv,
        refit=True, # After finding the best hyperparameters, re-train the final model on the full training set as best_estimator_
        random_state=3479,
        n_jobs=-1,
        verbose=2,
    )

    # Fit the search with the cat_features
    catboost_search.fit(X_train, y_train, classifier__cat_features=categorical_features)

In [24]:
# Extract best model
catboost_best_model = catboost_search.best_estimator_

In [25]:
save_trained_model = True  # toggle on/off saving after time-consuming tuning

if save_trained_model:
    dump(catboost_best_model, "../trained_models/catboost_tuned_for_pr_auc.joblib")

-----
### 4.6 Out-of-Fold Evaluation of All Models (Focused on Positive/Fraud Class)

In [8]:
load_trained_models = True

if load_trained_models:
    logistic_regression_model = load("../trained_models/logistic_regression_baseline.joblib")
    catboost_model = load("../trained_models/catboost_default.joblib")
    lightgbm_model = load("../trained_models/lightgbm_default.joblib")
    rf_model = load("../trained_models/random_forest_default.joblib")

    lightgbm_model_tuned = load("../trained_models/lightgbm_tuned_for_pr_auc.joblib")
    catboost_model_tuned = load("../trained_models/catboost_tuned_for_pr_auc.joblib")
    rf_model_tuned = load("../trained_models/random_forest_tuned_for_pr_auc.joblib")

In [None]:
%%time
run_oof_validation = True

if run_oof_validation:
    from src.model import oof_validation

    oof_metrics = oof_validation({"Logistic Regression (Baseline)": logistic_regression_model,
                                  "CatBoost (Default)": catboost_model,
                                  "LightGBM (Default)": lightgbm_model,
                                  "Random Forest (Default)": rf_model,
                                  "CatBoost (Tuned)": catboost_model_tuned,
                                  "LightGBM (Tuned)": lightgbm_model_tuned,
                                  "Random Forest (Tuned)": rf_model_tuned,
                                  }, X_train, y_train, categorical_features=['Time_segment']
                             )

In [15]:
oof_metrics

model,precision,recall,f1-score,support,oof pr auc (fraud)
LightGBM (Tuned),0.94,0.832,0.883,394.0,0.863
CatBoost (Tuned),0.96,0.802,0.874,394.0,0.856
LightGBM (Default),0.827,0.838,0.832,394.0,0.848
Random Forest (Tuned),0.881,0.805,0.841,394.0,0.839
CatBoost (Default),0.798,0.853,0.825,394.0,0.832
Random Forest (Default),0.835,0.81,0.822,394.0,0.803
Logistic Regression (Baseline),0.053,0.893,0.1,394.0,0.756


In [11]:
save_results = True

if save_results:
    oof_metrics.to_csv("../results/tables/oof_validation_metrics_all_models.csv", index=True)

#### **Outcomes:**

This out-of-fold (OOF) evaluation reports metrics for the positive (fraud) class only, and the models are ranked by their OOF PR-AUC.
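
The project's `oof_validation` helper lives in `src.model` and is not shown in this notebook, so the following is only a minimal sketch of how fraud-class OOF metrics of this kind could be computed, assuming scikit-learn's `cross_val_predict` and a synthetic dataset. The 0.5 decision threshold, the 5-fold setup, `X_demo`, and `y_demo` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, imbalanced stand-in data (about 3% positives)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=3479
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=3479)

# Every row is scored by a model that did not see it during fitting
oof_proba = cross_val_predict(model, X_demo, y_demo, cv=cv, method="predict_proba")[:, 1]

# PR AUC over the pooled out-of-fold scores
oof_pr_auc = average_precision_score(y_demo, oof_proba)

# Precision / recall / F1 / support for the positive class at a 0.5 threshold
report = classification_report(y_demo, (oof_proba >= 0.5).astype(int), output_dict=True)
fraud_row = report["1"]

print(f"OOF PR AUC: {oof_pr_auc:.3f}")
print({k: round(v, 3) for k, v in fraud_row.items()})
```

Because each prediction comes from a fold that excluded the row, the pooled metrics give a less optimistic estimate than scoring a model on the data it was trained on.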

**Baseline:** The baseline Logistic Regression achieved a PR-AUC of ~0.756, a reasonable starting point given the severe class imbalance. However, its precision for the positive (fraud) class is very low (0.053), meaning it raises many false positives. This underscores the need for more powerful models.

**Production Candidates:** LightGBM (Tuned) achieved the highest PR-AUC (~0.863), with strong precision (0.94) and recall (0.832) for the positive class. CatBoost (Tuned) achieved a comparable PR-AUC of ~0.856, marginally behind LightGBM (Tuned).

Based on this assessment, LightGBM (Tuned) is the most suitable candidate for production: it has the highest PR-AUC and the highest F1-score (0.883), balancing precision and recall on fraudulent transactions. For a more in-depth analysis, see the following notebook: 05_Results_analysis.


-----------
Next Step: Results analysis → Analyse the models' performance and deploy the best model for production

------------