### Model Building and Training
1. Data Preparation
2. Build Baseline Model
3. Build Ensemble Model
4. Cross-Validation (recommended)
5. Model Comparison and Selection

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
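import sys

# Add the project root to the import path so that `src` can be found
sys.path.append('..')

from src.modeling import ModelTrainer  # project-specific training helper

model_trainer = ModelTrainer()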

## Load X and y data

In [2]:
X = np.load('../Data/processed/x_fraud.npy', allow_pickle=True)
y = np.load('../Data/processed/y_fraud.npy', allow_pickle=True)
type(X), type(y)

(numpy.ndarray, numpy.ndarray)

## Train-test split

In [5]:
# Stratified 80/20 train-test split (keeps the class ratio in both partitions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
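
To illustrate what `stratify=y` does, here is a small self-contained sketch. It uses synthetic data with a 10% positive class as a stand-in for the fraud dataset; the array names and sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data: 1,000 rows, 10% positive class.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratify=y_demo, both partitions keep (approximately) the same
# positive rate as the full dataset.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

Without stratification, a rare class can end up over- or under-represented in the test set by chance, which makes metrics such as F1 and AUC-PR noisier.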


## Baseline Model: Logistic Regression


In [13]:
lr = LogisticRegression(
    solver='saga', max_iter=3000, class_weight="balanced", random_state=42, n_jobs=-1
)
lr = model_trainer.train_model(lr, X_train, y_train)  # Train model
y_pred_lr, y_proba_lr = model_trainer.predict(lr, X_test)  # Predictions
metrics_lr = model_trainer.evaluate_model(y_test, y_pred_lr, y_proba_lr)  # Evaluation

F1-score: 0.28203246064758175
AUC-PR: 0.4076205226162818
Confusion Matrix:
 [[15298  8078]
 [  725  1729]]
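
`ModelTrainer.evaluate_model` is imported from `src.modeling` and its body is not shown in this notebook. The following is a minimal sketch of what it might compute, assuming it relies on scikit-learn's `f1_score`, `average_precision_score` (as the AUC-PR estimate), and `confusion_matrix`. The function name `evaluate_sketch` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import average_precision_score, confusion_matrix, f1_score


def evaluate_sketch(y_true, y_pred, y_proba):
    """Hypothetical stand-in for ModelTrainer.evaluate_model (an assumption)."""
    return {
        "f1": f1_score(y_true, y_pred),
        # Average precision is one common estimate of the area under the PR curve.
        "auc_pr": average_precision_score(y_true, y_proba),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }


# Toy labels and positive-class probabilities, thresholded at 0.5.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_proba = np.array([0.1, 0.2, 0.6, 0.3, 0.8, 0.4])
y_pred = (y_proba >= 0.5).astype(int)

metrics = evaluate_sketch(y_true, y_pred, y_proba)
print("F1-score:", metrics["f1"])
print("AUC-PR:", metrics["auc_pr"])
print("Confusion Matrix:\n", metrics["confusion_matrix"])
```

Note that F1 and the confusion matrix depend on hard predictions, while AUC-PR is computed from the predicted probabilities, so the two can move differently when the decision threshold changes.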


## Build Ensemble Model: Random Forest

In [16]:

rf = RandomForestClassifier(random_state=42,
                            min_samples_split=5,
                            class_weight="balanced", n_jobs=-1)



## Hyperparameter tuning (cross-validated search)

In [17]:
param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

tuning_results = model_trainer.hyperparameter_tuning(
    model=rf,
    param_distribution=param_dist,
    x_train=X_train,
    y_train=y_train
)

best_rf = tuning_results["best_model"]
print("Best Parameters:", tuning_results["best_params"])
print("Best CV F1:", tuning_results["best_score"])


Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Parameters: {'n_estimators': 200, 'min_samples_split': 10, 'max_depth': 10}
Best CV F1: 0.6380676962296854
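
The log line "Fitting 5 folds for each of 20 candidates" is consistent with a randomized search that samples 20 of the 27 possible combinations and scores each with 5-fold CV. The sketch below shows how such a helper could be written, assuming scikit-learn's `RandomizedSearchCV`. The function name `tuning_sketch`, the `"f1"` scorer, and the synthetic data are assumptions; the real `ModelTrainer.hyperparameter_tuning` is not shown in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold


def tuning_sketch(model, param_distribution, x_train, y_train, n_iter=20, n_splits=5):
    """Hypothetical stand-in for ModelTrainer.hyperparameter_tuning (an assumption)."""
    search = RandomizedSearchCV(
        estimator=model,
        param_distributions=param_distribution,
        n_iter=n_iter,          # number of sampled parameter combinations
        scoring="f1",           # assumed scorer, based on the "Best CV F1" label
        cv=StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42),
        refit=True,             # refit the best candidate on all of x_train
        random_state=42,
    )
    search.fit(x_train, y_train)
    return {
        "best_model": search.best_estimator_,
        "best_params": search.best_params_,
        "best_score": search.best_score_,
    }


# Small synthetic, imbalanced dataset so the sketch runs quickly.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)
demo_results = tuning_sketch(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    {"n_estimators": [10, 20], "max_depth": [None, 5], "min_samples_split": [2, 5]},
    X_demo,
    y_demo,
    n_iter=4,
    n_splits=3,
)
print("Best Parameters:", demo_results["best_params"])
print("Best CV score:", demo_results["best_score"])
```

Two caveats apply. With `refit=True`, the returned best model is already fitted on the full training set, so a separate retraining step would be redundant under this assumption. Also, the reported best score (0.638) is closer to the AUC-PR values than to the F1 values reported elsewhere in this notebook, so the scorer actually used by the helper may differ from the `"f1"` assumed here.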


## Train Final Model on Full Training Set

In [18]:
best_rf = model_trainer.train_model(best_rf, X_train, y_train) # Train model

## Evaluate Once on the Test Set

In [19]:

y_pred_rf, y_proba_rf = model_trainer.predict(best_rf, X_test) # Predictions
metrics_rf = model_trainer.evaluate_model(y_test, y_pred_rf, y_proba_rf) # Evaluation

F1-score: 0.70566534914361
AUC-PR: 0.6377535403109238
Confusion Matrix:
 [[23374     2]
 [ 1115  1339]]
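
The reported F1-scores can be reproduced from the confusion matrices, which also exposes the precision/recall trade-off behind each number. This is a small worked check, assuming scikit-learn's layout (rows are true classes, columns are predicted classes, so each matrix reads `[[TN, FP], [FN, TP]]`). The helper name `precision_recall_f1` is illustrative.

```python
def precision_recall_f1(cm):
    """Derive precision, recall, and F1 from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Test-set confusion matrices copied from the outputs in this notebook.
lr_cm = [[15298, 8078], [725, 1729]]
rf_cm = [[23374, 2], [1115, 1339]]

lr_precision, lr_recall, lr_f1 = precision_recall_f1(lr_cm)
rf_precision, rf_recall, rf_f1 = precision_recall_f1(rf_cm)

print(f"Logistic Regression: precision={lr_precision:.4f} recall={lr_recall:.4f} f1={lr_f1:.4f}")
print(f"Random Forest:       precision={rf_precision:.4f} recall={rf_recall:.4f} f1={rf_f1:.4f}")
```

The two models fail in opposite ways: Logistic Regression catches about 70% of frauds but with roughly 18% precision, while Random Forest is almost never wrong when it flags a transaction but catches only about 55% of frauds.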


### Stratified K-Fold Cross-Validation (k=5)

In [20]:
_, lr_cv_results = model_trainer.cross_validation(lr, X_train, y_train)
_, rf_cv_results = model_trainer.cross_validation(best_rf, X_train, y_train)

print("\nLogistic Regression CV:", lr_cv_results)
print("Random Forest CV:", rf_cv_results)



Logistic Regression CV: {'f1_mean': np.float64(0.2797920849514594), 'f1_std': np.float64(0.0008405338231541914), 'auc_pr_mean': np.float64(0.40066866976589066), 'auc_pr_std': np.float64(0.011656365637349037)}
Random Forest CV: {'f1_mean': np.float64(0.7016739751687948), 'f1_std': np.float64(0.003584919463279463), 'auc_pr_mean': np.float64(0.6354803467072089), 'auc_pr_std': np.float64(0.0027828078083432864)}
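
For reference, a summary dictionary with these keys could be produced as sketched below, assuming scikit-learn's `StratifiedKFold` and `cross_validate` with the built-in `"f1"` and `"average_precision"` scorers. The function name `cross_validation_sketch` and the synthetic data are hypothetical; the real `ModelTrainer.cross_validation` is not shown in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate


def cross_validation_sketch(model, x, y, n_splits=5):
    """Hypothetical stand-in for ModelTrainer.cross_validation (an assumption)."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_validate(
        model, x, y, cv=cv, scoring={"f1": "f1", "auc_pr": "average_precision"}
    )
    # Cast to plain floats so the printed dict is not cluttered with np.float64(...).
    summary = {
        "f1_mean": float(np.mean(scores["test_f1"])),
        "f1_std": float(np.std(scores["test_f1"])),
        "auc_pr_mean": float(np.mean(scores["test_auc_pr"])),
        "auc_pr_std": float(np.std(scores["test_auc_pr"])),
    }
    return scores, summary


# Small synthetic, imbalanced dataset so the sketch runs quickly.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=42
)
demo_model = LogisticRegression(max_iter=1000, class_weight="balanced")
fold_scores, demo_summary = cross_validation_sketch(demo_model, X_demo, y_demo)
print("Logistic Regression CV:", demo_summary)
```

`cross_validate` clones the estimator for each fold, so passing an already fitted model does not leak information from a previous fit.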


In [21]:
cv_results = {
    'Model':['Logistic Regression', 'Random Forest'],
    'F1 Score (mean ± std)':[
        f"{lr_cv_results['f1_mean']:.4f} ± {lr_cv_results['f1_std']:.4f}",
        f"{rf_cv_results['f1_mean']:.4f} ± {rf_cv_results['f1_std']:.4f}",
    ],
    'AUC-PR (mean ± std)':[
        f"{lr_cv_results['auc_pr_mean']:.4f} ± {lr_cv_results['auc_pr_std']:.4f}",
        f"{rf_cv_results['auc_pr_mean']:.4f} ± {rf_cv_results['auc_pr_std']:.4f}"
    ]
}
comparison_df = pd.DataFrame(cv_results)
print(comparison_df) 

                 Model F1 Score (mean ± std) AUC-PR (mean ± std)
0  Logistic Regression       0.2798 ± 0.0008     0.4007 ± 0.0117
1        Random Forest       0.7017 ± 0.0036     0.6355 ± 0.0028


In [22]:
recommended_model = "Random Forest" if rf_cv_results['f1_mean'] > lr_cv_results['f1_mean'] else "Logistic Regression"
print(f"Recommended model based on CV F1-score: {recommended_model}")


Recommended model based on CV F1-score: Random Forest


## Save model

In [None]:
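import joblib

# Persist the tuned Random Forest for later inference
joblib.dump(best_rf, "../models/fraud_model.joblib")

# Save the feature splits; note that y_train and y_test are not saved here
np.save('../Data/processed/x_fraud_test.npy', X_test)
np.save('../Data/processed/x_fraud_train.npy', X_train)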
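
A quick round-trip check can confirm that a saved model reloads and predicts identically. The sketch below uses a small stand-in model and a temporary directory rather than the notebook's real paths; all names in it are illustrative.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model; in the notebook this would be best_rf.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "fraud_model.joblib"
    joblib.dump(model, model_path)       # serialize the fitted estimator
    reloaded = joblib.load(model_path)   # load it back before the directory is removed

# The reloaded model should reproduce the original predictions exactly.
same_predictions = bool(np.array_equal(model.predict(X_demo), reloaded.predict(X_demo)))
print("predictions match after reload:", same_predictions)
```

Joblib files are pickle-based, so they should be loaded only from trusted sources and ideally with the same scikit-learn version that produced them.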

# Model Comparison and Recommendation

## Logistic Regression

Baseline, interpretable model.

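F1-score is low (0.28) and AUC-PR is moderate (0.40), against a fraud rate of about 9.5% in the test set (2,454 of 25,830 transactions).

The test-set confusion matrix shows where the model fails: recall is reasonable (1,729 of 2,454 frauds caught, about 70%), but 8,078 false positives pull precision down to about 18%. This suggests that a linear decision boundary struggles to separate fraud from legitimate transactions, possibly because the fraud patterns are non-linear.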

## Random Forest

Ensemble model that captures non-linear relationships.

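F1-score is much higher (0.70) and AUC-PR improves from 0.40 to 0.64.

On the test set, precision is close to 100% (only 2 false positives), but recall is about 55%: 1,115 of 2,454 frauds are missed.

Standard deviation across the five CV folds is very low (0.0036 for F1 and 0.0028 for AUC-PR), indicating stable performance.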

## Recommendation

Random Forest is selected as the final model.

Reason: It achieves higher F1 and AUC-PR than Logistic Regression, both in cross-validation and on the held-out test set, with low variance across folds. This makes it more suitable for detecting fraudulent transactions.

Logistic Regression can still serve as a baseline or interpretable reference, but Random Forest clearly outperforms it on this imbalanced dataset.