In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

#### 4.1. Load Dataset for Model Training

We load the dataset that was previously cleaned and inspected for outliers.
This dataset is ready for splitting into features and target for supervised learning.

In [2]:
df = pd.read_parquet("./data/3/df.parquet")

#### 4.2. Define Features and Target

- `features` contains every column except the target variable `isFraud`.
- `target` is the `isFraud` column, which indicates whether a transaction is fraudulent.

This separation is necessary for supervised machine learning models.

In [3]:
features = df.drop(columns="isFraud")
target = df["isFraud"]

#### 4.3. Split Data into Train, Validation, and Test Sets

We split the dataset into:
- Training set: used to train the model
- Validation set: used to tune hyperparameters and monitor performance
- Test set: used for final evaluation

Splitting details:
- 3% of the data is first held out as a temporary set (`X_temp`, `y_temp`), which is then split in half, giving validation and test sets of 1.5% each. The remaining 97% forms the training set.
- Stratified splitting preserves the proportion of fraud and non-fraud cases in all sets.
- The random state is fixed for reproducibility.

In [4]:
X_train, X_temp, y_train, y_temp = train_test_split(features, target, test_size=0.03, stratify=target, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)
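As an illustration of the two-stage split above, the following sketch runs the same two `train_test_split` calls on a small synthetic dataset and reports the resulting sizes and fraud rates. The `amount` column, the 100,000-row size, and the 2% fraud rate are made up for the example and are not taken from the real data. The variable names are deliberately different from the notebook's, so running it does not overwrite `X_train` and friends.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100,000 rows, 2% fraud (assumed values).
rng = np.random.default_rng(42)
n_rows, n_fraud = 100_000, 2_000
toy = pd.DataFrame({
    "amount": rng.exponential(scale=100.0, size=n_rows),  # hypothetical feature
    "isFraud": np.r_[np.ones(n_fraud, dtype=int),
                     np.zeros(n_rows - n_fraud, dtype=int)],
})
toy_features = toy.drop(columns="isFraud")
toy_target = toy["isFraud"]

# Stage 1: hold out 3% of the rows, stratified on the target.
Xtr, Xtmp, ytr, ytmp = train_test_split(
    toy_features, toy_target, test_size=0.03, stratify=toy_target, random_state=42
)
# Stage 2: split the held-out rows evenly into validation and test sets.
Xva, Xte, yva, yte = train_test_split(
    Xtmp, ytmp, test_size=0.5, stratify=ytmp, random_state=42
)

split_sizes = {"train": len(Xtr), "val": len(Xva), "test": len(Xte)}
split_rates = {"train": ytr.mean(), "val": yva.mean(), "test": yte.mean()}
print(split_sizes)
print({name: round(rate, 4) for name, rate in split_rates.items()})
```

On this toy data the three fraud rates should all sit close to the 2% of the full set, which is the behavior the `stratify` argument is there to provide.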

#### 4.4. Handle Class Imbalance with Random Under-Sampling

Fraud detection datasets are highly imbalanced: fraud cases are rare (about 0.13% of transactions here).
We apply `RandomUnderSampler` to balance the training set:
- Reduces the number of non-fraud cases to match the number of fraud cases
- Reduces the model's tendency to favor the majority class
- Is applied only to the training set; the validation and test sets keep their original class distribution

Note:
- We choose under-sampling over over-sampling techniques such as SMOTE because the dataset is very large (21 million rows),
  so reducing the majority class still leaves enough data for training.
- After under-sampling, the training set contains 53,292 rows (26,646 per class), which we consider sufficient for training.

In [5]:
undersampler = RandomUnderSampler(sampling_strategy="auto", random_state=42)
X_train, y_train = undersampler.fit_resample(X_train, y_train)
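To make the effect of this step concrete, here is a rough pandas-only sketch of what random under-sampling to a 1:1 ratio does. It is an illustration on made-up data, not imblearn's actual implementation; the helper name `undersample_majority` and the toy `amount` column are hypothetical.

```python
import numpy as np
import pandas as pd


def undersample_majority(X, y, random_state=42):
    """Keep all minority-class rows plus an equally sized random sample of every other class.

    Hypothetical helper for illustration only; not imblearn's implementation.
    """
    counts = y.value_counts()
    n_minority = counts.min()
    # Draw n_minority rows per class; for the minority class this keeps every row.
    parts = [
        y[y == label].sample(n=n_minority, random_state=random_state)
        for label in counts.index
    ]
    kept = pd.concat(parts).index
    return X.loc[kept], y.loc[kept]


# Toy data: 10,000 rows with 50 fraud cases (assumed values).
rng = np.random.default_rng(0)
toy_y = pd.Series(
    np.r_[np.ones(50, dtype=int), np.zeros(9_950, dtype=int)], name="isFraud"
)
toy_X = pd.DataFrame({"amount": rng.exponential(100.0, size=len(toy_y))})

X_bal, y_bal = undersample_majority(toy_X, toy_y)
print(y_bal.value_counts().to_dict())
```

Applied to the real training split, the same logic would keep every fraud row and draw an equal number of non-fraud rows, which is how a 1:1 training set of 53,292 rows comes about.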

In [6]:
df_train = X_train.copy()
df_train["isFraud"] = y_train

df_val = X_val.copy()
df_val["isFraud"] = y_val

df_test = X_test.copy()
df_test["isFraud"] = y_test

#### 4.5. Verify Distribution in Train, Validation, and Test Sets

We print the number of fraud and non-fraud cases in each dataset. This:
- Confirms the class distribution of the training set after under-sampling
- Checks that stratification preserved the fraud rate in the validation and test sets
- Helps assess whether the training set is balanced and ready for modeling

In [7]:
for name, df_set in [("Train", df_train), ("Validation", df_val), ("Test", df_test)]:
    counts = df_set["isFraud"].value_counts()
    n_non_fraud = counts.get(0, 0)
    n_fraud = counts.get(1, 0)
    total_rows = len(df_set)
    print(
        f"{name}:\n"
        f"- {total_rows:,} rows\n"
        f"- {n_fraud:,} fraud cases and {n_non_fraud:,} non-fraud cases\n"
        f"- {n_fraud/total_rows:.2%} fraud, {n_non_fraud/total_rows:.2%} non-fraud\n"
    )

Train:
- 53,292 rows
- 26,646 fraud cases and 26,646 non-fraud cases
- 50.00% fraud, 50.00% non-fraud

Validation:
- 315,000 rows
- 412 fraud cases and 314,588 non-fraud cases
- 0.13% fraud, 99.87% non-fraud

Test:
- 315,000 rows
- 412 fraud cases and 314,588 non-fraud cases
- 0.13% fraud, 99.87% non-fraud
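The printed counts can also be cross-checked with a little arithmetic. The sketch below hard-codes the numbers from the output above and derives the dataset size and the pre-sampling training fraud rate they imply, assuming the 3% hold-out described in section 4.3. The variable names are only illustrative.

```python
# Counts copied from the printed output above.
train_fraud, train_non_fraud = 26_646, 26_646
val_fraud, val_rows = 412, 315_000
test_fraud, test_rows = 412, 315_000

# Validation + test together form the 3% hold-out, so the full size follows.
holdout_rows = val_rows + test_rows
total_rows = round(holdout_rows / 0.03)

# Under-sampling keeps every fraud row, so the fraud count in the training
# split is the same before and after; only the non-fraud rows were reduced.
train_rows_before = total_rows - holdout_rows
train_rate_before = train_fraud / train_rows_before

print(f"Implied dataset size: {total_rows:,} rows")
print(f"Training rows before under-sampling: {train_rows_before:,}")
print(f"Fraud rate, train (before under-sampling): {train_rate_before:.4%}")
print(f"Fraud rate, validation: {val_fraud / val_rows:.4%}")
print(f"Fraud rate, test: {test_fraud / test_rows:.4%}")
```

The implied total agrees with the 21 million rows mentioned in section 4.4, and the training fraud rate before under-sampling matches the validation and test rates, which is what a stratified split should produce.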



#### 4.6. Save Prepared Datasets

We save the train, validation, and test sets as Parquet files. This:
- Makes them easy to reload for model training or evaluation
- Preserves data types and structure
- Supports reproducibility, since later steps can reuse exactly the same splits

In [8]:
os.makedirs("./data/4/", exist_ok=True)

df_train.to_parquet("./data/4/df_train.parquet", index=False)
df_val.to_parquet("./data/4/df_val.parquet", index=False)
df_test.to_parquet("./data/4/df_test.parquet", index=False)