# 📖 Notebook 02: Existing Work Replication

Replicates the paper's methodology:
- **Models**: Naive Bayes, Random Forest (n=5), XGBoost (n=5)
- **Strategies**: Original, Oversampling, Undersampling, SMOTE
- **Datasets**: European & Sparkov
- **Split**: 70 / 15 / 15  |  **CV**: 5-fold
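
The 70 / 15 / 15 split listed above can be sketched with NumPy alone. This is a minimal illustration, assuming the split is stratified by class label (the notebook does not say whether it is). The helper `stratified_split_indices` and the toy labels are hypothetical and are not part of `src.data.preprocessing`.

```python
import numpy as np

def stratified_split_indices(y, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index arrays, split separately within each class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    parts = ([], [], [])
    for label in np.unique(y):
        # Shuffle the row indices of this class, then cut them 70 / 15 / 15.
        idx = rng.permutation(np.flatnonzero(y == label))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        parts[0].append(idx[:n_train])
        parts[1].append(idx[n_train:n_train + n_val])
        parts[2].append(idx[n_train + n_val:])
    return tuple(np.sort(np.concatenate(p)) for p in parts)

# Toy labels (assumption): 1000 rows, 2% positive class.
y = np.array([1] * 20 + [0] * 980)
train_idx, val_idx, test_idx = stratified_split_indices(y)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting within each class keeps the fraud rate roughly equal across the three partitions, which matters when positives are rare.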

---

In [None]:
import sys, os, warnings
sys.path.insert(0, os.path.abspath('..'))
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, KFold
import joblib

from src.utils.config import (
    RANDOM_SEED, MODELS_DIR, N_SPLITS,
    DS_EUROPEAN, DS_SPARKOV,
    MODEL_NB, MODEL_RF, MODEL_XGB,
    ALL_STRATEGIES,
)
from src.utils.metrics import evaluate_model, results_to_dataframe, get_roc_curve
from src.data.preprocessing import preprocess_european, preprocess_sparkov, load_processed
from src.data.balancing_strategies import get_balanced_datasets, describe_balance
from src.models.baseline_models import get_naive_bayes
from src.models.ensemble_models import get_random_forest, get_xgboost
from src.visualization.plot_utils import (
    plot_roc_curves, plot_performance_by_dataset,
    plot_performance_per_dataset, plot_f1_density,
    plot_average_performance, plot_confusion_matrix,
    plot_methodology_flowchart,
)

np.random.seed(RANDOM_SEED)
%matplotlib inline
print('Setup complete.')

## 1. Load & Preprocess Data

In [None]:
# Preprocess (run once, then use load_processed)
# eu_data = preprocess_european()
# sp_data = preprocess_sparkov()

eu_data = load_processed(DS_EUROPEAN)
sp_data = load_processed(DS_SPARKOV)

datasets = {
    DS_EUROPEAN: eu_data,
    DS_SPARKOV:  sp_data,
}

## 2. Generate Balanced Variants
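
As a rough guide to what the oversampling and undersampling strategies do, here is a minimal NumPy sketch for a binary target. It is an assumption about the general technique, not the implementation behind `get_balanced_datasets`; the function names and toy data are hypothetical.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate minority rows (with replacement) until both classes have equal size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def random_undersample(X, y, seed=42):
    """Drop majority rows at random until both classes have equal size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    kept_major = rng.choice(np.flatnonzero(y == majority), size=counts.min(), replace=False)
    keep = np.sort(np.concatenate([np.flatnonzero(y != majority), kept_major]))
    return X[keep], y[keep]

# Toy data (assumption): 100 rows, 5 of them positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(X_over.shape, X_under.shape)
```

Both strategies should be applied to the training partition only, as the cell below does, so the test set keeps its original class ratio.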

In [None]:
balanced_sets = {}
for ds_name, data in datasets.items():
    balanced_sets[ds_name] = get_balanced_datasets(data['X_train'], data['y_train'])
    for strat, (X, y) in balanced_sets[ds_name].items():
        describe_balance(y, f'{ds_name}/{strat}')

## 3. Train All Existing Models (3 models × 4 strategies × 2 datasets = 24 runs)
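
The run count in the heading can be checked by enumerating the grid. The sketch below uses placeholder label strings (assumptions; the notebook takes the real ones from `src.utils.config`) and mirrors the nesting used for `all_results`.

```python
from itertools import product

# Placeholder labels (assumptions), standing in for the config constants.
dataset_names = ["european", "sparkov"]
strategy_names = ["original", "oversampled", "undersampled", "smote"]
model_names = ["naive_bayes", "random_forest", "xgboost"]

# One run per (dataset, strategy, model) combination.
runs = list(product(dataset_names, strategy_names, model_names))
print(len(runs))  # 2 * 4 * 3 = 24

# Same nesting as all_results: {dataset: {strategy: {model: metrics}}}
results = {}
for ds, strat, model in runs:
    # Placeholder entry only; no real scores are produced here.
    results.setdefault(ds, {}).setdefault(strat, {})[model] = {"f1": None}
```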

In [None]:
model_factories = {
    MODEL_NB:  get_naive_bayes,
    MODEL_RF:  lambda: get_random_forest(paper_params=True),
    MODEL_XGB: lambda: get_xgboost(paper_params=True),
}

all_results = {}       # {ds: {strat: {model: metrics}}}
trained_models = {}    # {(ds, strat, model): fitted_model}
roc_data_collection = []  # for ROC plots

kf = KFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_SEED)

for ds_name, data in datasets.items():
    all_results[ds_name] = {}
    X_test, y_test = data['X_test'], data['y_test']

    for strat, (X_bal, y_bal) in balanced_sets[ds_name].items():
        all_results[ds_name][strat] = {}
        print(f'\n=== {ds_name} / {strat} ===')

        for model_name, factory in model_factories.items():
            model = factory()

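            # Cross-validation scores. CV runs on the already-resampled training set,
            # so for oversampled/SMOTE data a validation fold can contain copies or
            # interpolations of training-fold rows; read CV-F1 as an optimistic estimate.
            cv_f1 = cross_val_score(model, X_bal, y_bal, cv=kf, scoring='f1')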

            # Full train & evaluate
            model.fit(X_bal, y_bal)
            y_pred = model.predict(X_test)
            y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None
            metrics = evaluate_model(y_test, y_pred, y_prob)
            metrics['cv_f1_mean'] = cv_f1.mean()
            metrics['cv_f1_std'] = cv_f1.std()

            all_results[ds_name][strat][model_name] = metrics
            trained_models[(ds_name, strat, model_name)] = model

            # Collect ROC data
            if y_prob is not None:
                roc_data_collection.append({
                    'label': f'{model_name} / {ds_name} / {strat}',
                    'y_true': y_test, 'y_prob': y_prob,
                })

            print(f'  {model_name:15s}  F1={metrics["f1"]:.4f}  '
                  f'AUC={metrics["roc_auc"]:.4f}  '
                  f'CV-F1={cv_f1.mean():.4f}±{cv_f1.std():.4f}')

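            # Save model (create the output directory if it does not exist yet)
            MODELS_DIR.mkdir(parents=True, exist_ok=True)
            joblib.dump(model, MODELS_DIR / f'{ds_name}_{strat}_{model_name}.joblib')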

print('\n✓ All existing models trained and saved.')

## 4. Results Summary Table
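
The metric columns in this table follow the standard binary-classification definitions. The sketch below is a plain-Python illustration of those definitions, assuming label 1 marks fraud; it is not the code inside `evaluate_model`, and the function names and toy values are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def roc_auc_pairwise(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions (assumption): 10 rows, 4 of them positive.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.2, 0.4, 0.7, 0.9, 0.8, 0.6, 0.3]

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc_pairwise(y_true, y_score)
print(metrics, auc)
```

The pairwise AUC is quadratic in the number of rows, so it is only suitable for small illustrations like this one.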

In [None]:
results_df = results_to_dataframe(all_results)
results_df.to_csv(MODELS_DIR / 'existing_results.csv', index=False)

styled = results_df.style.format({
    'accuracy': '{:.4f}', 'precision': '{:.4f}',
    'recall': '{:.4f}', 'f1': '{:.4f}', 'roc_auc': '{:.4f}'
}).background_gradient(subset=['f1', 'roc_auc'], cmap='YlGn')

styled

## 5. Visualisations (Paper Figures)

In [None]:
# Figure 1: Methodology Flowchart
plot_methodology_flowchart()
plt.show()

In [None]:
# Figure 2: Sample ROC Curves
sample_roc = [
    r for r in roc_data_collection
    if ('naive_bayes' in r['label'] and 'european' in r['label'] and 'oversampled' in r['label'])
    or ('random_forest' in r['label'] and 'european' in r['label'] and 'smote' in r['label'])
    or ('xgboost' in r['label'] and 'sparkov' in r['label'] and 'undersampled' in r['label'])
]
if sample_roc:
    plot_roc_curves(sample_roc, title='ROC Curves – Sample Models')
    plt.show()

In [None]:
# Figure 3: Performance by Dataset
plot_performance_by_dataset(results_df)
plt.show()

In [None]:
# Figure 4 & 5: Performance on each dataset
plot_performance_per_dataset(results_df, DS_EUROPEAN)
plt.show()

plot_performance_per_dataset(results_df, DS_SPARKOV)
plt.show()

In [None]:
# Figure 6: F1 Density Distributions
plot_f1_density(results_df)
plt.show()

In [None]:
# Figure 7: Average Performance & Gap
plot_average_performance(results_df)
plt.show()

## 6. Analysis of Existing Work Results

### Key Findings

1. **Ensemble methods outperform Naive Bayes**: Both RF and XGB achieve higher F1 and AUC scores across all dataset × strategy combinations.
2. **SMOTE yields the best overall performance**: Synthetic minority oversampling provides the most balanced training signal, especially for recall.
3. **Models score higher on real data (European) than on simulated data (Sparkov)**: Deterministic approval scripts in real-world systems create learnable fraud patterns, whereas the simulated data is more stochastic.
4. **Undersampling degrades performance**: Losing majority-class samples reduces the model's ability to distinguish legitimate transactions.
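
Finding 2 refers to SMOTE, which creates new minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea in NumPy under simplifying assumptions (brute-force distances, numeric features only). It is not the implementation used by `get_balanced_datasets`, and `smote_like` is a hypothetical name.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=42):
    """Generate n_new synthetic rows from minority samples X_min (needs at least 2 rows)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Brute-force pairwise squared distances; a row is never its own neighbour.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    neighbours = np.argsort(d2, axis=1)[:, :k]

    # Pick a base row, one of its k neighbours, and a random point on the segment between them.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority class (assumption): 10 rows, 2 features.
rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))
synthetic = smote_like(X_min, n_new=30)
print(synthetic.shape)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies, unlike random oversampling, which only repeats existing rows.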

---
*Proceed to Notebook 03 for proposed deep-learning models.*