### 📘 Model Training & Evaluation


<style>
.outer-div {
  text-align: left; /* Ensures content is left-justified */
  padding-left: 10px; /* Pads the content by 10px from the left */
  padding-bottom: 10px; /* Pads the content by 10px from the bottom */
}

.inner-div {
  width: 75%; /* Sets the width of the inner div to 75% of its parent */
  margin-left: 0; /* Ensures left justification if no other margin is applied */
  padding: 10px; /* Applies 10px padding on all sides */
  border-left: 1px solid #485c83; /* Left border: 1px solid with color #485c83 */
  color: #b8bbbf; /* Sets the font color */
  background-color: #303135; /* Sets the background color */
}
</style>
<div class="outer-div">
<div class="inner-div">
<b>Notebook Summary</b><br>
Goal: Train and evaluate baseline ML models<br>
Author: Dennis Fashimpaur<br>
Date: 2025-11-26<br>
Features: numeric-only, standardized<br>
Models: Logistic Regression, Random Forest<br>
Evaluation: Accuracy, Precision, Recall, F1, ROC-AUC
</div></div>

This notebook performs the following tasks:
* Loads the unified dataset
* Drops the duplicate `Amount` column
* Splits features and labels
* Creates a stratified train/test split
* Standardizes features for Logistic Regression
* Trains Logistic Regression
* Trains Random Forest
* Evaluates models (accuracy, precision, recall, F1, ROC-AUC)
* Displays confusion matrices
* Compares models
* Plots ROC curves and Random Forest feature importance
* Saves trained models and results

### 🧭 Imports & Setup

In [None]:
import os

import joblib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, auc, confusion_matrix, f1_score,
    precision_score, recall_score, roc_auc_score, roc_curve,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from src.data_loader import load_csv

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 150)

### 📊 Load Unified Dataset

In [None]:
df = load_csv('../data/processed/unified_dataset.csv')
# Drop the duplicate 'Amount' column if it is present
if 'Amount' in df.columns:
    df = df.drop(columns=['Amount'])
print("üóÑÔ∏è Loaded unified dataset with shape:", df.shape)
df.head()

### ䷖ Split Features & Labels
We keep only numeric features for modeling; the target column is `label` (1 = anomaly/fraud, 0 = normal).
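
The split below passes `stratify=y`. As a minimal sketch of why that matters, the following uses synthetic data (the 2% positive rate and all `*_demo` names are assumptions for illustration, not values from this dataset) to show that stratification keeps the share of positives essentially identical in the train and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 100 positives out of 5000 rows (2%)
rng = np.random.default_rng(42)
y_demo = np.zeros(5000, dtype=int)
y_demo[:100] = 1
X_demo = rng.normal(size=(5000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both subsets keep roughly the original positive rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.4f}, test={test_rate:.4f}")
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance, which makes recall and precision estimates noisier.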

In [None]:
X = df.drop(columns=['label'])
X = X.select_dtypes(include=['number'])
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

### 🔧 Standardize Features (for Logistic Regression)

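Standardization rescales each numeric feature to have **mean 0 and standard deviation 1**. The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into training.
- Helps Logistic Regression converge faster.
- Makes coefficient values more comparable across features.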
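
A small sketch of what `StandardScaler` does, using toy arrays (the numbers and `*_demo` names are made up for illustration). The scaler learns the mean and standard deviation from the training rows and reuses them for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
train_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test_demo = np.array([[2.0, 700.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean/std from train only
test_scaled = scaler_demo.transform(test_demo)        # reuses the training statistics

print("train mean per feature:", train_scaled.mean(axis=0))
print("train std per feature: ", train_scaled.std(axis=0))
print("scaled test row:       ", test_scaled)
```

The scaled test row is generally not centered at 0, which is expected: it is expressed in units of the training distribution.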

In [None]:
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

### 💡 Evaluation Helper
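
As a rough worked example of the metrics the helper returns, the sketch below derives precision, recall, and F1 by hand from a confusion matrix. The labels are invented for illustration and are not taken from this dataset.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 3 actual positives, 5 actual negatives
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

On imbalanced fraud data, accuracy alone can look high even when recall is poor, which is why the helper reports all of these together.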

In [None]:
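def evaluate_model(y_test, y_pred, y_prob):
    """Return standard binary-classification metrics as a dict.

    y_pred holds hard class predictions; y_prob holds positive-class scores.
    """
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_prob)
    }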

### 📈 Logistic Regression: Overview & Train
Logistic Regression is a **supervised binary classification algorithm** that models the probability of the positive class (fraud/anomaly) using a logistic (sigmoid) function applied to a linear combination of features.
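
To make the "sigmoid of a linear combination" concrete, here is a minimal sketch on synthetic data (the dataset and all `*_demo` names are assumptions for illustration). It recomputes the positive-class probability by hand from the fitted coefficients and compares it with `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical binary dataset, unrelated to the notebook's data
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
model_demo = LogisticRegression(solver='lbfgs', max_iter=2000).fit(X_demo, y_demo)

# Linear combination of features: z = w . x + b
z = X_demo @ model_demo.coef_.ravel() + model_demo.intercept_[0]

# Sigmoid maps z to a probability in (0, 1)
manual_prob = 1.0 / (1.0 + np.exp(-z))
sklearn_prob = model_demo.predict_proba(X_demo)[:, 1]

print("manual and sklearn probabilities match:", np.allclose(manual_prob, sklearn_prob))
```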

In [None]:
logreg = LogisticRegression(solver='lbfgs', max_iter=2000)
logreg.fit(X_train_scaled, y_train)
y_pred_lr = logreg.predict(X_test_scaled)
y_prob_lr = logreg.predict_proba(X_test_scaled)[:, 1]

### 📈 Logistic Regression: Evaluate

In [None]:
lr_metrics = evaluate_model(y_test, y_pred_lr, y_prob_lr)
pd.DataFrame([lr_metrics])

In [None]:
cm_lr = confusion_matrix(y_test, y_pred_lr)
plt.figure(figsize=(5, 4))
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues')
plt.title('Logistic Regression: Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

### 🌲 Random Forest: Overview & Train
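Random Forest is an **ensemble of decision trees**:
- Each tree is trained on a bootstrap sample of the training data (rows drawn at random with replacement).
- Each split considers only a random subset of features.
- The final prediction combines all trees; scikit-learn averages the trees' predicted class probabilities and picks the class with the highest mean, rather than counting hard votes.
- The fitted model exposes feature importances, which indicate which features contribute most.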
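
The way a forest combines its trees can be checked directly. The sketch below, on synthetic data (the dataset, hyperparameters, and `*_demo` names are assumptions for illustration), averages the per-tree class probabilities by hand and compares the result with scikit-learn's `predict_proba`, which combines trees by averaging probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical binary dataset, unrelated to the notebook's data
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
forest_demo = RandomForestClassifier(
    n_estimators=25, max_depth=4, random_state=0
).fit(X_demo, y_demo)

# Stack each tree's class probabilities: shape (n_trees, n_samples, n_classes)
tree_probs = np.stack([tree.predict_proba(X_demo) for tree in forest_demo.estimators_])
avg_prob = tree_probs.mean(axis=0)

forest_prob = forest_demo.predict_proba(X_demo)
print("forest probability equals mean of tree probabilities:",
      np.allclose(avg_prob, forest_prob))
```

This is also why `predict_proba(...)[:, 1]` from a forest is a usable score for ROC analysis: it is a mean over many trees rather than a single hard vote.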

In [None]:
rf = RandomForestClassifier(n_estimators=200, max_depth=12, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
y_prob_rf = rf.predict_proba(X_test)[:, 1]

### 🌲 Random Forest: Evaluate

In [None]:
rf_metrics = evaluate_model(y_test, y_pred_rf, y_prob_rf)
pd.DataFrame([rf_metrics])

In [None]:
cm_rf = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(5, 4))
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Greens')
plt.title('Random Forest: Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

### 📊 ROC Curves: Model Comparison
The ROC curve plots the **True Positive Rate (Recall)** against the **False Positive Rate** as the decision threshold varies.
- Area Under the Curve (AUC) summarizes discrimination ability: 1 = perfect ranking, 0.5 = no better than random.
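
A tiny worked example may help with reading AUC. The four scores below are invented for illustration. Of the four positive/negative pairs, three are ranked correctly (the positive scores higher than the negative), so the AUC is 3/4.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and scores: two negatives, two positives
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# roc_curve sweeps the threshold and returns one (FPR, TPR) point per step
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true_demo, y_score_demo)
area_demo = auc(fpr_demo, tpr_demo)

print("FPR:", fpr_demo)
print("TPR:", tpr_demo)
print("AUC:", area_demo)
```

`auc(fpr, tpr)` and `roc_auc_score(y_true, y_score)` give the same value here; the first is convenient when the curve points are already needed for plotting.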

In [None]:
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)

fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)
roc_auc_rf = auc(fpr_rf, tpr_rf)

plt.figure(figsize=(7, 6))
plt.plot(fpr_lr, tpr_lr, color='blue', label=f'LogReg (AUC = {roc_auc_lr:.3f})')
plt.plot(fpr_rf, tpr_rf, color='green', label=f'Random Forest (AUC = {roc_auc_rf:.3f})')
plt.plot([0, 1], [0, 1], color='grey', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves: Model Comparison')
plt.legend()
plt.show()

### 🧩 Random Forest Feature Importance
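Feature importance shows which features contribute most to the Random Forest's decisions. scikit-learn's `feature_importances_` are impurity-based (mean decrease in impurity) and normalized to sum to 1.
- Higher importance = more influence on the trees' splits.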
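
As a rough sanity check of how these importances behave, the sketch below fits a forest on synthetic data where only the first two columns carry signal (the dataset, its parameters, and the `*_demo` names are assumptions for illustration). The importances sum to 1, and the top-ranked feature should be one of the informative columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 2 informative features in columns 0 and 1;
# the remaining 4 columns are pure noise
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
forest_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

importances_demo = forest_demo.feature_importances_
top_feature = int(np.argmax(importances_demo))

print("importances:", importances_demo.round(3))
print("sum:", importances_demo.sum())
print("top feature index:", top_feature)
```

Impurity-based importances are computed on the training data and can favor features with many distinct values, so they are best read as a ranking hint rather than a causal statement.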

In [None]:
importances = rf.feature_importances_
feat_importance_df = pd.DataFrame(
    {'feature': X_train.columns, 'importance': importances}
).sort_values(by='importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feat_importance_df)
plt.title('Random Forest Feature Importance')
plt.show()

### 🆚 Model Comparison

In [None]:
comparison = pd.DataFrame([
    {'model': 'Logistic Regression', **lr_metrics},
    {'model': 'Random Forest', **rf_metrics}
])
comparison

### 💾 Save Models & Evaluation Results
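
Because the Logistic Regression model is trained on standardized features, reusing it later also requires the fitted scaler. The sketch below shows a `joblib` save/load round trip for both objects; the synthetic data, temporary directory, and file names are assumptions for illustration and differ from the paths used in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data and models, unrelated to the notebook's artifacts
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
scaler_demo = StandardScaler().fit(X_demo)
model_demo = LogisticRegression(max_iter=2000).fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model_demo.pkl')
    scaler_path = os.path.join(tmp_dir, 'scaler_demo.pkl')
    joblib.dump(model_demo, model_path)
    joblib.dump(scaler_demo, scaler_path)

    # Reload both objects, as a later notebook or script would
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# The reloaded pair should reproduce the original predictions
same_predictions = np.array_equal(
    model_demo.predict(scaler_demo.transform(X_demo)),
    loaded_model.predict(loaded_scaler.transform(X_demo)),
)
print("Predictions identical after reload:", same_predictions)
```

Pickled scikit-learn objects are generally only safe to load with a compatible library version and from a trusted source.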

In [None]:
os.makedirs('../models', exist_ok=True)
os.makedirs('../data/results', exist_ok=True)

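# Save trained models
joblib.dump(logreg, '../models/logistic_regression.pkl')
joblib.dump(rf, '../models/random_forest.pkl')
# Also save the fitted scaler: logreg expects standardized inputs (file name is a suggestion)
joblib.dump(scaler, '../models/standard_scaler.pkl')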

# Save comparison results
comparison.to_csv('../data/results/model_results.csv', index=False)

print('üíø Saved models and results successfully.')