# Fraud Detection – Model Building and Training

## Objective
Build, train, and evaluate classification models to detect fraudulent
transactions using techniques appropriate for highly imbalanced data.
Models are compared using AUC-PR, F1-Score, and confusion matrices.

### Load Feature-Engineered Data

In [None]:
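import pandas as pd
from IPython.display import display

df = pd.read_csv("../data/processed/fraud_features.csv")

# Only the last expression of a cell is rendered automatically,
# so the preview and shape are shown explicitly.
display(df.head())
print(df.shape)
df.info()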

### Target Identification 

In [None]:
target_col = "class"  # Fraud_Data.csv target column

The target variable `class` indicates whether a transaction is fraudulent (1)
or legitimate (0).

### Feature / Target Separation

In [None]:
X = df.drop(columns=[target_col])
y = df[target_col]

print(X.shape, y.shape)

### Class Distribution (Before Handling Imbalance)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x=y)
plt.title("Class Distribution Before Imbalance Handling")
plt.show()

The dataset is highly imbalanced, with fraudulent transactions representing
a very small proportion of all samples. This motivates the use of imbalance-aware
metrics and resampling techniques.
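The plot gives a visual impression of the imbalance; the exact proportions can also be printed. The following is a minimal sketch that uses a small made-up label series, `y_demo`, in place of the notebook's `y` (the 97/3 split is an assumption for illustration, not the dataset's real ratio):

```python
import pandas as pd

# Hypothetical labels standing in for `y`: 97 legitimate, 3 fraudulent
y_demo = pd.Series([0] * 97 + [1] * 3, name="class")

counts = y_demo.value_counts().sort_index()                # absolute count per class
ratios = y_demo.value_counts(normalize=True).sort_index()  # share of each class

print(pd.DataFrame({"count": counts, "share": ratios}))
print(f"Fraud rate: {ratios.loc[1]:.2%}")
```

Replacing `y_demo` with `y` would report the actual fraud rate, which is also a useful reference point when reading AUC-PR values later.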

### Stratified Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)


A stratified split preserves the fraud ratio across training and test sets.
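This property is easy to check by comparing the positive rate in each subset. The sketch below uses synthetic data (`X_demo`, `y_demo` are stand-ins, not the notebook's variables) with 5% positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and y: 1,000 rows, 5% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 50 + [0] * 950)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratify=y_demo, both subsets keep the original positive rate
print(f"full:  {y_demo.mean():.3f}")
print(f"train: {y_tr.mean():.3f}")
print(f"test:  {y_te.mean():.3f}")
```

The same comparison with `y`, `y_train`, and `y_test` would confirm the split made above.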

## PREPROCESSING PIPELINE
### Identify Feature Types

In [None]:
num_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_features = X.select_dtypes(include=["object"]).columns.tolist()

# Optional: exclude datetime-like string columns from the categorical features
# cat_features = [col for col in cat_features if "time" not in col]

num_features, cat_features

### Build Preprocessing Pipeline

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features)
    ]
)

### Apply Preprocessing (NO SMOTE YET)

In [None]:
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
print(X_train_processed.shape)
print(X_test_processed.shape)

### HANDLE CLASS IMBALANCE (SMOTE)

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)

X_train_resampled, y_train_resampled = smote.fit_resample(
    X_train_processed,
    y_train
)
print("Resampled training set shape:", X_train_resampled.shape, y_train_resampled.shape)

SMOTE is applied **only to the training data** to prevent information leakage.

## BASELINE MODEL (LOGISTIC REGRESSION)
### Train Baseline

In [None]:
from sklearn.linear_model import LogisticRegression

# After SMOTE the classes are already balanced, so class_weight="balanced"
# yields equal weights here and is effectively a no-op.
lr = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    random_state=42
)

lr.fit(X_train_resampled, y_train_resampled)

### Evaluate Baseline

In [None]:
from sklearn.metrics import f1_score, confusion_matrix, average_precision_score

y_pred_lr = lr.predict(X_test_processed)
y_prob_lr = lr.predict_proba(X_test_processed)[:, 1]

f1_lr = f1_score(y_test, y_pred_lr)
auc_pr_lr = average_precision_score(y_test, y_prob_lr)
cm_lr = confusion_matrix(y_test, y_pred_lr)

f1_lr, auc_pr_lr, cm_lr


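AUC-PR (estimated here with `average_precision_score`) is used instead of
ROC-AUC because it is more informative for highly imbalanced problems such as
fraud detection: precision and recall both focus on the rare positive class,
whereas the false positive rate behind ROC-AUC is dominated by the large
number of legitimate transactions.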
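A practical consequence is that the two metrics have different no-skill baselines. For scores unrelated to the labels, ROC-AUC sits near 0.5 regardless of imbalance, while average precision sits near the positive rate. The sketch below illustrates this on simulated labels with about 1% positives (an assumed rate, not the dataset's):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(42)
n = 100_000
y_true = (rng.random(n) < 0.01).astype(int)  # about 1% positives
y_score = rng.random(n)                      # scores unrelated to the labels

prevalence = y_true.mean()
ap = average_precision_score(y_true, y_score)
roc = roc_auc_score(y_true, y_score)

print(f"prevalence:                 {prevalence:.4f}")
print(f"AUC-PR (average precision): {ap:.4f}")
print(f"ROC-AUC:                    {roc:.4f}")
```

An AUC-PR value is therefore best judged against the fraud rate of the test set rather than against 0.5.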

## ENSEMBLE MODEL (RANDOM FOREST)
### Train Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train_resampled, y_train_resampled)

### Evaluate Ensemble

In [None]:
y_pred_rf = rf.predict(X_test_processed)
y_prob_rf = rf.predict_proba(X_test_processed)[:, 1]

f1_rf = f1_score(y_test, y_pred_rf)
auc_pr_rf = average_precision_score(y_test, y_prob_rf)
cm_rf = confusion_matrix(y_test, y_pred_rf)

f1_rf, auc_pr_rf, cm_rf


## HYPERPARAMETER TUNING

In [None]:
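from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# Resample inside each CV fold so validation folds contain only real samples.
# Tuning on the pre-resampled data would let synthetic points leak across folds
# and inflate the cross-validated F1.
tuning_pipe = ImbPipeline([
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

param_grid = {
    "rf__n_estimators": [100, 200],
    "rf__max_depth": [None, 10, 20]
}

grid = GridSearchCV(
    tuning_pipe,
    param_grid,
    scoring="f1",
    cv=3,
    n_jobs=-1
)

grid.fit(X_train_processed, y_train)
grid.best_params_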


## STRATIFIED K-FOLD CV (k=5)

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_validate
import numpy as np

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

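# cross_validate clones `rf` and refits it on each fold of the un-resampled
# training data, so these scores reflect class weighting alone (no SMOTE).
cv_results = cross_validate(
    rf,
    X_train_processed,
    y_train,
    cv=cv,
    scoring={"f1": "f1", "auc_pr": "average_precision"}
)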

pd.DataFrame({
    "Metric": ["F1", "AUC-PR"],
    "Mean": [
        np.mean(cv_results["test_f1"]),
        np.mean(cv_results["test_auc_pr"])
    ],
    "Std": [
        np.std(cv_results["test_f1"]),
        np.std(cv_results["test_auc_pr"])
    ]
})


## MODEL COMPARISON TABLE

In [None]:
pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest"],
    "F1-Score": [f1_lr, f1_rf],
    "AUC-PR": [auc_pr_lr, auc_pr_rf]
})


## Final Model Selection

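The Random Forest model outperformed Logistic Regression in both F1-Score
and AUC-PR, indicating a better balance between catching fraudulent
transactions (recall) and avoiding false alarms (precision). Neither metric
reports false negatives directly; those are read from the confusion matrices.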
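False negatives (missed fraud) can be read straight from a confusion matrix. The sketch below uses made-up labels and predictions, not the notebook's results, to show how the four cells map to recall and precision:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels {0, 1}, scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)      # share of fraud cases that were caught
precision = tp / (tp + fp)   # share of fraud alerts that were correct

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Applying the same unpacking to `cm_lr` and `cm_rf` would give the per-model false-negative counts.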

Although Logistic Regression offers higher interpretability, its linear
decision boundary limits performance on complex fraud patterns.
Random Forest captures non-linear interactions between behavioral, temporal,
and geolocation features.
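The effect of a linear decision boundary can be seen on a toy problem where the label depends purely on an interaction between two features. The data below is synthetic XOR-style data, chosen as an assumption to make the contrast visible; it says nothing about the size of the gap on the fraud dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic XOR-style data: the label depends on the interaction of two features
rng = np.random.default_rng(42)
X_demo = rng.uniform(-1, 1, size=(2000, 2))
y_demo = ((X_demo[:, 0] > 0) ^ (X_demo[:, 1] > 0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

lr_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

lr_acc = lr_demo.score(X_te, y_te)
rf_acc = rf_demo.score(X_te, y_te)
print(f"Logistic Regression accuracy: {lr_acc:.3f}")
print(f"Random Forest accuracy:       {rf_acc:.3f}")
```

No single straight line separates the XOR classes, so the linear model stays close to chance while the tree ensemble can partition the plane into quadrants.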

Given the business importance of fraud detection accuracy and the observed
performance gains, Random Forest was selected as the final model.

### SAVE FINAL MODEL

In [None]:
import joblib

joblib.dump(rf, "../models/final_fraud_model.pkl")
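
The saved `rf` expects inputs that have already been transformed by `preprocessor`, so inference code needs both objects. One option, sketched here under the assumption that it suits the deployment setup, is to persist preprocessing and model together so that raw rows can be scored directly. The toy frame, its column names (`amount`, `browser`), and the file name are hypothetical:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with hypothetical column names, standing in for the real features
X_demo = pd.DataFrame({
    "amount": [10.0, 250.0, 35.5, 900.0, 12.0, 640.0],
    "browser": ["chrome", "safari", "chrome", "firefox", "safari", "firefox"],
})
y_demo = [0, 1, 0, 1, 0, 1]

bundle = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["amount"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["browser"]),
    ])),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
bundle.fit(X_demo, y_demo)

# Round-trip through joblib; a temporary directory keeps the sketch self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fraud_pipeline_demo.pkl")
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# The restored object accepts raw, untransformed rows
preds_before = bundle.predict(X_demo)
preds_after = restored.predict(X_demo)
print((preds_before == preds_after).all())
```

In this notebook, resampling sits between preprocessing and training, so a simple alternative is to dump the fitted `preprocessor` alongside `rf`, for example as a dictionary holding both objects.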
