## Task 2: Model Building and Training
Dataset: Fraud_Data (E-commerce Transactions)

This notebook covers:
1. Data preparation (train-test split with stratification)
2. Baseline model (Logistic Regression)
3. Ensemble model (Random Forest)
4. Model evaluation and comparison

Note:
- No SMOTE is applied to Fraud_Data
- Class imbalance is handled via stratification and evaluation metrics
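
The split itself is not performed in this notebook (the train and test CSVs are loaded ready-made in Step 1), so the following is only a minimal sketch of what a stratified split looks like with scikit-learn's `train_test_split`. The synthetic data, the 20% test size, and the `random_state` are assumptions for illustration, not the settings used to produce the files.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, roughly 9% positives (assumed)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.09).astype(int)

# stratify=y_demo keeps the positive rate nearly identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(f"Fraud rate overall: {y_demo.mean():.3f}")
print(f"Fraud rate train:   {y_tr.mean():.3f}")
print(f"Fraud rate test:    {y_te.mean():.3f}")
```

Without `stratify`, a random split of a rare class can leave the test set with a noticeably different fraud rate from the training set, which makes metrics harder to compare.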


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    f1_score,
    average_precision_score
)
print("Data prep dependencies loaded")

Data prep dependencies loaded


## Step 1: Data preparation

In [9]:
# Load model-ready datasets
X_train = pd.read_csv("./data/processed/fraud_X_train.csv")
X_test  = pd.read_csv("./data/processed/fraud_X_test.csv")

y_train = pd.read_csv("./data/processed/fraud_y_train.csv")
y_test  = pd.read_csv("./data/processed/fraud_y_test.csv")

print("Train features shape:", X_train.shape)
print("Test features shape:", X_test.shape)
print("Train target shape:", y_train.shape)
print("Test target shape:", y_test.shape)


Train features shape: (120889, 193)
Test features shape: (30223, 193)
Train target shape: (120889, 1)
Test target shape: (30223, 1)


In [10]:
# Ensure alignment between X and y
assert X_train.shape[0] == y_train.shape[0], "Train X/y mismatch"
assert X_test.shape[0] == y_test.shape[0], "Test X/y mismatch"

# Ensure no target leakage in either split
assert "class" not in X_train.columns, "Target leaked into train features"
assert "class" not in X_test.columns, "Target leaked into test features"

print("Sanity checks passed.")

Sanity checks passed.


In [11]:
X_train.columns.tolist()

['user_id',
 'purchase_value',
 'age',
 'time_since_signup',
 'hour_of_day',
 'day_of_week',
 'user_transaction_count',
 'time_since_last_tx',
 'source_Direct',
 'source_SEO',
 'browser_FireFox',
 'browser_IE',
 'browser_Opera',
 'browser_Safari',
 'sex_M',
 'country_Albania',
 'country_Algeria',
 'country_Angola',
 'country_Antigua and Barbuda',
 'country_Argentina',
 'country_Armenia',
 'country_Australia',
 'country_Austria',
 'country_Azerbaijan',
 'country_Bahamas',
 'country_Bahrain',
 'country_Bangladesh',
 'country_Barbados',
 'country_Belarus',
 'country_Belgium',
 'country_Belize',
 'country_Benin',
 'country_Bermuda',
 'country_Bhutan',
 'country_Bolivia',
 'country_Bosnia and Herzegowina',
 'country_Botswana',
 'country_Brazil',
 'country_British Indian Ocean Territory',
 'country_Brunei Darussalam',
 'country_Bulgaria',
 'country_Burkina Faso',
 'country_Burundi',
 'country_Cambodia',
 'country_Cameroon',
 'country_Canada',
 'country_Cape Verde',
 'country_Cayman Islands',

In [12]:
# Quantify class imbalance with numbers
y_train["class"].value_counts(normalize=True) * 100

class
0    90.635211
1     9.364789
Name: proportion, dtype: float64

## Class imbalance observation
Fraudulent transactions make up about 9.4% of the training data, so the classes are clearly imbalanced: a model that always predicts "legitimate" would already reach roughly 90.6% accuracy. This justifies the use of precision-recall-based metrics (AUC-PR, F1-score) rather than accuracy for model evaluation.
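
To make the point about accuracy concrete, here is a small sketch on synthetic labels (not the real dataset) with a fraud rate close to the one measured above. The "model" is a hypothetical baseline that always predicts the majority class; the sample size and seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, f1_score

# Synthetic labels with roughly the fraud rate reported above (about 9.4%)
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.094).astype(int)

# A useless "model": always predicts the majority class (legitimate)
y_pred = np.zeros_like(y_true)
# Constant scores carry no ranking information
y_score = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
ap = average_precision_score(y_true, y_score)

print(f"Accuracy: {acc:.3f}")  # high, despite catching no fraud
print(f"F1-score: {f1:.3f}")   # zero: no fraud case is ever predicted
print(f"AUC-PR:   {ap:.3f}")   # equals the fraud rate for constant scores
```

Accuracy looks strong here even though no fraudulent transaction is detected, while F1 drops to zero and average precision falls back to the class prevalence. That prevalence (about 0.094 for this dataset) is a useful reference line when reading AUC-PR values later.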

In [13]:
# scikit-learn estimators expect a 1D target array, not a single-column DataFrame
y_train = y_train["class"].values
y_test = y_test["class"].values
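
The conversion above can be illustrated on a toy frame. A target read with `pd.read_csv` arrives as a single-column DataFrame of shape `(n, 1)`, and many scikit-learn estimators warn when `y` is passed as a column vector. The tiny `y_df` below is hypothetical and only mirrors the layout of the real target file.

```python
import pandas as pd

# Hypothetical single-column target, shaped like the CSV loaded earlier
y_df = pd.DataFrame({"class": [0, 0, 1, 0]})

print(y_df.shape)                # (4, 1): a 2D column
y_1d = y_df["class"].to_numpy()  # select the column, then convert to NumPy
print(y_1d.shape)                # (4,): the 1D form
```

`to_numpy()` is used here as the currently recommended equivalent of `.values`; both return the same array for a plain integer column.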

In [None]:
# Target variable
target = "class"

# Feature columns: taken from X_train, which already excludes the target
features = [col for col in X_train.columns if col != target]

print("Number of features:", len(features))
print("Target column:", target)


## Step 2: Building a baseline model with Logistic Regression