## Task 2: Model Building and Training
Dataset: Fraud_Data (E-commerce Transactions)

This notebook covers:
1. Data preparation (train-test split with stratification)
2. Baseline model (Logistic Regression)
3. Ensemble model (Random Forest)
4. Model evaluation and comparison

Note:
- No SMOTE is applied to Fraud_Data
- Class imbalance is handled via stratification and evaluation metrics
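The train and test files loaded below were split beforehand, so the split itself is not shown in this notebook. As a minimal sketch of what a stratified split looks like, the example below uses a small synthetic dataset (an assumption, not Fraud_Data) and a `test_size` of 0.2, which is only inferred from the roughly 80/20 row counts reported later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data: 1,000 rows with exactly 9% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:90] = 1

# stratify=y keeps the fraud share (almost) identical in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Fraud rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Without `stratify`, a random split can leave the test set with noticeably more or fewer fraud cases than the training set, which makes the evaluation metrics harder to compare.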


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    f1_score,
    average_precision_score
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
import joblib
print("Data prep dependencies loaded")

Data prep dependencies loaded


## Step 1
Data preparation

In [2]:
# Load model-ready datasets
X_train = pd.read_csv("./data/processed/fraud_X_train.csv")
X_test  = pd.read_csv("./data/processed/fraud_X_test.csv")

y_train = pd.read_csv("./data/processed/fraud_y_train.csv")
y_test  = pd.read_csv("./data/processed/fraud_y_test.csv")

print("Train features shape:", X_train.shape)
print("Test features shape:", X_test.shape)
print("Train target shape:", y_train.shape)
print("Test target shape:", y_test.shape)


Train features shape: (120889, 193)
Test features shape: (30223, 193)
Train target shape: (120889, 1)
Test target shape: (30223, 1)


In [3]:
# Ensure alignment between X and y
assert X_train.shape[0] == y_train.shape[0], "Train X/y mismatch"
assert X_test.shape[0] == y_test.shape[0], "Test X/y mismatch"

# Ensure no target leakage
assert "class" not in X_train.columns, "Target leaked into features"

print("Sanity checks passed.")

Sanity checks passed.


In [4]:
X_train.columns.tolist()

['user_id',
 'purchase_value',
 'age',
 'time_since_signup',
 'hour_of_day',
 'day_of_week',
 'user_transaction_count',
 'time_since_last_tx',
 'source_Direct',
 'source_SEO',
 'browser_FireFox',
 'browser_IE',
 'browser_Opera',
 'browser_Safari',
 'sex_M',
 'country_Albania',
 'country_Algeria',
 'country_Angola',
 'country_Antigua and Barbuda',
 'country_Argentina',
 'country_Armenia',
 'country_Australia',
 'country_Austria',
 'country_Azerbaijan',
 'country_Bahamas',
 'country_Bahrain',
 'country_Bangladesh',
 'country_Barbados',
 'country_Belarus',
 'country_Belgium',
 'country_Belize',
 'country_Benin',
 'country_Bermuda',
 'country_Bhutan',
 'country_Bolivia',
 'country_Bosnia and Herzegowina',
 'country_Botswana',
 'country_Brazil',
 'country_British Indian Ocean Territory',
 'country_Brunei Darussalam',
 'country_Bulgaria',
 'country_Burkina Faso',
 'country_Burundi',
 'country_Cambodia',
 'country_Cameroon',
 'country_Canada',
 'country_Cape Verde',
 'country_Cayman Islands',

In [5]:
# Quantify class imbalance with numbers
y_train["class"].value_counts(normalize=True) * 100

class
0    90.635211
1     9.364789
Name: proportion, dtype: float64

## Class imbalance observation
Fraudulent transactions make up about 9.4% of the training data, roughly one fraud case for every ten legitimate transactions, which confirms a clear class imbalance. This justifies the use of precision-recall-based metrics (AUC-PR, F1-score) rather than accuracy for model evaluation.
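To make the case against accuracy concrete, the short calculation below scores a hypothetical classifier that labels every transaction as legitimate. It uses only the class shares printed above; the "always legitimate" model is an illustration, not something trained in this notebook.

```python
# Class shares (percent) copied from the value_counts output above.
legit_share = 90.635211
fraud_share = 9.364789

# A trivial classifier that predicts "legitimate" for every row is right
# on all legitimate rows and wrong on all fraud rows.
accuracy = legit_share / 100
fraud_recall = 0.0  # it never flags a single fraud case

# For a classifier that ranks at random, the expected AUC-PR equals the
# positive-class prevalence, which gives a floor for later comparisons.
random_auc_pr = fraud_share / 100

print(f"Always-legitimate accuracy: {accuracy:.4f}")
print(f"Always-legitimate fraud recall: {fraud_recall:.4f}")
print(f"Random-ranking AUC-PR baseline: {random_auc_pr:.4f}")
```

An accuracy above 90% is therefore reachable without detecting any fraud, while AUC-PR and F1 both stay low for such a model.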

In [6]:
# Some models expect 1D arrays
y_train = y_train["class"].values
y_test = y_test["class"].values

## Step 2 Building a baseline model with Logistic Regression

Features and targets are reloaded from disk, with each target read directly as a 1D Series.

In [7]:
# Features
X_train = pd.read_csv("./data/processed/fraud_X_train.csv")
X_test  = pd.read_csv("./data/processed/fraud_X_test.csv")

# Targets
y_train = pd.read_csv("./data/processed/fraud_y_train.csv")["class"]
y_test  = pd.read_csv("./data/processed/fraud_y_test.csv")["class"]

print("Train shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)


Train shape: (120889, 193) (120889,)
Test shape: (30223, 193) (30223,)


In [8]:
# Initialize logistic regression
log_reg = LogisticRegression(
    max_iter=1000,  # raised from the default of 100 so the solver is less likely to stop before converging
    class_weight="balanced",  # reweights classes to compensate for fraud imbalance
    random_state=42
)

In [None]:
# Train the model
# As configured above, the solver runs for at most 1000 iterations
log_reg.fit(X_train, y_train)

In [None]:
# Predict probabilities and classes
# Probability of fraud (class = 1)
y_test_proba = log_reg.predict_proba(X_test)[:, 1]

# Default threshold = 0.5
y_test_pred = log_reg.predict(X_test) 

⚠️ Important:
- AUC-PR uses probabilities
- F1 & Confusion Matrix use predictions
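The distinction matters because F1 depends on the decision threshold while AUC-PR does not. The sketch below illustrates this on ten hand-made labels and scores (invented for the example, not model output): it computes AUC-PR from the scores, F1 at the default 0.5 cut-off, and F1 at the threshold that maximises it on the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    precision_recall_curve,
)

# Toy labels and fraud scores, made up for illustration.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.65, 0.90])

# AUC-PR is threshold-free: it is computed from the scores directly.
ap = average_precision_score(y_true, y_score)

# F1 needs hard labels, so a threshold has to be chosen first.
f1_default = f1_score(y_true, (y_score >= 0.5).astype(int))

# Scan the thresholds on the PR curve; the final precision/recall pair
# has no matching threshold, hence the [:-1] slices.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
f1_per_threshold = 2 * precision[:-1] * recall[:-1] / denom
best_threshold = thresholds[int(np.argmax(f1_per_threshold))]
f1_best = f1_score(y_true, (y_score >= best_threshold).astype(int))

print(f"AUC-PR: {ap:.4f}")
print(f"F1 at 0.5: {f1_default:.4f}")
print(f"F1 at {best_threshold:.2f}: {f1_best:.4f}")
```

In a real workflow the threshold would be chosen on validation data rather than on the test set, so that the reported test F1 stays unbiased.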

In [None]:
# AUC-PR metric
auc_pr = average_precision_score(y_test, y_test_proba)
print(f"AUC-PR: {auc_pr:.4f}")

In [None]:
# F1 Score
f1 = f1_score(y_test, y_test_pred)
print(f"F1 Score: {f1:.4f}")

In [None]:
cm = confusion_matrix(y_test, y_test_pred)
cm


In [None]:
print(classification_report(y_test, y_test_pred, digits=4))


## Step 3 

Ensemble Model

🎯 Goal of Ensemble Model
- Capture non-linear relationships
- Improve fraud recall
- Increase AUC-PR

Compare directly with Logistic Regression
I will:
- Use Random Forest
- Do light hyperparameter tuning
- Use same evaluation metrics
- Avoid leakage
- No SMOTE (still)
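The plan above mentions light hyperparameter tuning, while the cells that follow set the Random Forest parameters by hand. One possible way to run a small search is sketched here. It assumes a synthetic dataset from `make_classification` as a stand-in for the fraud features, and the grid values are illustrative rather than the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (about 10% positives) standing in for Fraud_Data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

# A deliberately small grid keeps the search "light".
param_grid = {
    "max_depth": [4, 8],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=42
    ),
    param_grid=param_grid,
    scoring="average_precision",  # same AUC-PR metric used for evaluation
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print(f"Best CV AUC-PR: {search.best_score_:.4f}")
```

Because the search only ever sees training folds, it does not touch the held-out test set, which keeps the "avoid leakage" goal intact.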

In [9]:
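# Initialize random forest
rf_model = RandomForestClassifier(
    n_estimators=200,  # enough trees for stable averaged predictions
    max_depth=12,  # caps tree depth to limit overfitting
    min_samples_split=10,  # requires 10+ samples to split, giving smoother trees
    class_weight="balanced",  # reweights classes to account for fraud imbalance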
    random_state=42,
    n_jobs=-1
)
print("Initialized random forest")

Initialized random forest


In [10]:
#Train ensemble model
rf_model.fit(X_train, y_train)
print("Trained model")

Trained model


In [11]:
# Predicted fraud probabilities and hard class labels
y_test_proba_rf = rf_model.predict_proba(X_test)[:, 1]
y_test_pred_rf = rf_model.predict(X_test)

I expect the Random Forest to beat the Logistic Regression baseline on these metrics.

In [12]:
# AUC-PR for the Random Forest
auc_pr_rf = average_precision_score(y_test, y_test_proba_rf)
print(f"Random Forest AUC-PR: {auc_pr_rf:.4f}")

Random Forest AUC-PR: 0.6354


In [13]:
# F1 Score
f1_rf = f1_score(y_test, y_test_pred_rf)
print(f"Random Forest F1 Score: {f1_rf:.4f}")

Random Forest F1 Score: 0.7014


In [14]:
#Confusion matrix
confusion_matrix(y_test, y_test_pred_rf)

array([[27392,     1],
       [ 1301,  1529]])
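The confusion matrix above can be unpacked into precision and recall, which shows where the F1 of 0.7014 comes from. The counts are copied from the output; nothing else is assumed.

```python
# Counts copied from the Random Forest confusion matrix above.
# scikit-learn orders the matrix as [[TN, FP], [FN, TP]] for labels 0 and 1.
tn, fp = 27392, 1
fn, tp = 1301, 1529

precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of fraud cases that get flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

At the default threshold the model almost never raises a false alarm (1 false positive) but misses about 46% of fraud cases (1301 of 2830), so the F1 score is held back by recall rather than precision.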

## Model interpretation
Why does the Random Forest perform better than logistic regression?

1️⃣ Non-linear interactions

Fraud is rarely linear:
- time_since_signup × purchase_value
- country × browser
- transaction velocity

Logistic Regression cannot capture these interactions unless they are added as explicit features.

2️⃣ Tree-based models love engineered features

My feature engineering paid off:
- Time-based features
- Transaction counts
- One-hot encoded categories
- Country (geolocation proxy)

📌 Random Forest can split on these features directly, with no scaling or hand-built interaction terms.
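The first point, that a linear model misses interactions a tree ensemble can pick up, can be illustrated on a toy problem. The sketch below uses synthetic data (not the fraud features) where the label depends only on the sign of the product of two inputs, and compares Logistic Regression, Random Forest, and Logistic Regression with the product added as an extra column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy data: the label is 1 when the two features share the same sign,
# so it depends purely on their interaction (an XOR-like pattern).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_tr, X_te = X[:1500], X[1500:]
y_tr, y_te = y[:1500], y[1500:]

# Linear model on the raw features: no single line separates the classes.
lr_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Random Forest can carve out the four quadrants with axis-aligned splits.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf_acc = rf.fit(X_tr, y_tr).score(X_te, y_te)

# Linear model again, now with the product supplied as a third feature.
X_tr_int = np.column_stack([X_tr, X_tr[:, 0] * X_tr[:, 1]])
X_te_int = np.column_stack([X_te, X_te[:, 0] * X_te[:, 1]])
lr_int_acc = LogisticRegression().fit(X_tr_int, y_tr).score(X_te_int, y_te)

print(f"LogReg (raw features):     {lr_acc:.3f}")
print(f"Random Forest:             {rf_acc:.3f}")
print(f"LogReg (+ product column): {lr_int_acc:.3f}")
```

The linear model recovers once the interaction is handed to it as a feature, which is the sense in which Logistic Regression "cannot capture" interactions on its own.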

## Step 4 Cross validation (Stratified K-fold)
I included this step because an F1 of 0.70 is high for fraud detection. A high score does not by itself mean overfitting or leakage, but it can look too good to be true, so it needs validation.

In [15]:
# Defining stratified K-fold
skf = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=42
)
print("Preserve fraud ratio in each fold")

Preserve fraud ratio in each fold


I expect:
- Mean ≈ test performance
- Low std (≤ 0.03–0.05)

In [None]:
# Cross validated AUC-PR
cv_auc_pr = cross_val_score(
    rf_model,
    X_train,
    y_train,
    cv=skf,
    scoring="average_precision",
    n_jobs=-1
)

print("CV AUC-PR scores:", cv_auc_pr)
print("Mean AUC-PR:", cv_auc_pr.mean())
print("Std AUC-PR:", cv_auc_pr.std())
'''
outputs
CV AUC-PR scores: [0.64006266 0.61775472 0.63995836 0.62879229 0.63124794]
Mean AUC-PR: 0.6315631922788948 
Std AUC-PR: 0.008260400014792486
'''
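The fold scores reported above can be checked against the two expectations stated before the run. The snippet below copies the five CV scores and the test-set AUC-PR from this notebook's outputs; the 0.03 limit is the lower end of the std range given earlier.

```python
from statistics import mean, pstdev

# Fold scores and test score copied from the outputs reported in this notebook.
cv_scores = [0.64006266, 0.61775472, 0.63995836, 0.62879229, 0.63124794]
test_auc_pr = 0.6354

cv_mean = mean(cv_scores)
cv_std = pstdev(cv_scores)  # population std, matching NumPy's default (ddof=0)
gap = abs(cv_mean - test_auc_pr)

print(f"CV mean AUC-PR: {cv_mean:.4f}")
print(f"CV std AUC-PR:  {cv_std:.4f}")
print(f"Gap to test:    {gap:.4f}")
print("Std within limit:", cv_std <= 0.03)
```

Both expectations hold: the CV mean sits within about 0.004 of the test score, and the spread across folds is well under 0.03.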

“Due to computational constraints in the development environment, stratified cross-validation was performed using AUC-PR as the primary evaluation metric. F1-score was evaluated on the held-out test set, which is sufficient for assessing precision–recall balance in imbalanced fraud detection tasks.”
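If the compute budget allows it later, both metrics can be collected in a single cross-validation pass instead of running the folds twice. The sketch below shows one way to do that with `cross_validate`. It assumes a small synthetic dataset and a smaller forest than the one trained above, purely to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for the fraud training set.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(
    n_estimators=50, max_depth=8, class_weight="balanced", random_state=42
)

# Each fold is fitted once and scored with both metrics.
cv_results = cross_validate(
    model, X_demo, y_demo, cv=skf_demo, scoring=["average_precision", "f1"]
)

print("AUC-PR per fold:", cv_results["test_average_precision"].round(4))
print("F1 per fold:    ", cv_results["test_f1"].round(4))
```

Since every fold is trained only once, the extra metric adds scoring time but no additional model fits.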


In [17]:
# Save the preferred model to models/
# Random Forest is selected as the final model for further explainability and business analysis.
joblib.dump(rf_model, "./models/random_forest_fraud.pkl")

['./models/random_forest_fraud.pkl']

In [19]:
loaded_model = joblib.load("./models/random_forest_fraud.pkl")

print("Model loaded successfully:", type(loaded_model))

Model loaded successfully: <class 'sklearn.ensemble._forest.RandomForestClassifier'>


## Model Persistence

The final Random Forest model, selected for its AUC-PR and F1-score performance, is saved using joblib. Persisting the trained model lets the downstream explainability analysis (Task 3) reuse the exact fitted estimator without retraining.
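For downstream use it can help to store the feature names next to the model, so that Task 3 can confirm it is feeding columns in the same order. The sketch below shows one way to do this. The bundle layout, the `feature_names` key, and the toy data are assumptions for illustration, not what this notebook saved.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data and hypothetical feature names, used only for this example.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0.8).astype(int)
feature_names = ["f0", "f1", "f2"]

model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

# Bundle the fitted model with the column order it was trained on.
bundle = {"model": model, "feature_names": feature_names}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model_bundle.pkl"
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# The restored model should reproduce the original probabilities exactly.
same_predictions = np.array_equal(
    model.predict_proba(X), restored["model"].predict_proba(X)
)
print("Feature names restored:", restored["feature_names"])
print("Predictions identical:", same_predictions)
```

Pickled scikit-learn models are not guaranteed to load across library versions, so recording the scikit-learn version used for training alongside the file is a sensible extra precaution.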