## MLP 2: Mid-Term Progress Report — Random Forest
**Project:** Credit Card Fraud Detection  
**Team Member:** Ramya Ramesh  
**Date:** Feb 2026

### 1. Project Overview & Problem Statement

**Goal:** Build a machine learning model to detect fraudulent credit card transactions. This is a binary classification task in which the positive class (Fraud, Class=1) is extremely rare compared to the negative class (Legitimate, Class=0).

**Current Phase:** This notebook covers data preprocessing, handling class imbalance with SMOTE, training a Random Forest classifier on all features, and evaluating under a **Recall-First** strategy: we require recall ≥ 0.95 (catch almost all frauds), then choose the decision threshold that maximizes precision subject to that constraint. This aligns with the business goal of minimizing missed fraud, at the cost of more false alarms.

### 2. Setup and Data Loading

We load the Credit Card Fraud dataset (Kaggle / ULB ML Group), drop any rows with missing values, and ensure the target `Class` is integer (0/1). We suppress sklearn FutureWarnings so the notebook output stays readable.

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from imblearn.over_sampling import SMOTE

# Load Data
data = pd.read_csv("creditcard.csv")
data = data.dropna().reset_index(drop=True)
data["Class"] = data["Class"].astype(int)

print(f"Dataset Shape: {data.shape}")
print(f"Class Distribution:\n{data['Class'].value_counts(normalize=True)}")
print(f"NaN in Class: {data['Class'].isnull().sum()}")

### 3. Exploratory Data Analysis (EDA) & Preprocessing

#### 3.1 Scaling Time and Amount

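The "V" features (V1–V28) are already PCA-transformed, while **Amount** and **Time** remain on their original, much larger scales. Random Forest splits are insensitive to feature scale, but SMOTE (Section 5) builds synthetic samples from nearest-neighbour distances, where unscaled columns would dominate. We therefore standardize both with **StandardScaler** (zero mean, unit variance): \( X_{scaled} = (X - \mu) / \sigma \). The original Amount and Time columns are then dropped and replaced by `scaled_amount` and `scaled_time`. Note that the scalers are fitted on the full dataset before the split in Section 4; fitting them on the training rows only would keep test-set statistics out of preprocessing.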

In [None]:
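# Separate scalers so each keeps its own fitted mean and variance for later reuse.
amount_scaler = StandardScaler()
time_scaler = StandardScaler()
data["scaled_amount"] = amount_scaler.fit_transform(data[["Amount"]]).ravel()
data["scaled_time"] = time_scaler.fit_transform(data[["Time"]]).ravel()
data = data.drop(["Amount", "Time"], axis=1)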

print("Data scaled. First 3 columns and last 2:", list(data.columns[:3]), list(data.columns[-2:]))
data.head()
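
As a minimal sketch of the leakage-free variant, the snippet below fits a scaler on training rows only and reuses the fitted statistics on held-out rows. The arrays are synthetic stand-ins for the Amount column, and the names `amount_train`, `amount_test`, and `manual_test` are hypothetical, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the Amount column (hypothetical values, not the real data).
rng = np.random.default_rng(0)
amount_train = rng.exponential(scale=90.0, size=(800, 1))
amount_test = rng.exponential(scale=90.0, size=(200, 1))

# Fit on the training rows only, then reuse the fitted statistics on the test rows.
amount_scaler = StandardScaler().fit(amount_train)
scaled_train = amount_scaler.transform(amount_train)
scaled_test = amount_scaler.transform(amount_test)

# The transform is (x - mu) / sigma, with mu and sigma taken from the training rows.
mu, sigma = amount_train.mean(), amount_train.std()  # population std (ddof=0)
manual_test = (amount_test - mu) / sigma

print("matches manual formula:", np.allclose(scaled_test, manual_test))
print("train mean/std after scaling:", scaled_train.mean(), scaled_train.std())
```

The scaled training column has mean 0 and standard deviation 1 by construction; the held-out column generally does not, which is expected when the statistics come from the training rows.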

### 4. Train-Test Split

We define **X** (all 30 features) and **y** (Class). We split using **stratified** sampling (80% train, 20% test) so that the proportion of fraud (~0.17%) is preserved in both sets. This avoids a test set that is unrepresentative of the imbalance.

In [None]:
X = data.drop("Class", axis=1)
y = data["Class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Train size: {len(X_train)} | Test size: {len(X_test)}")
print(f"Train fraud count: {int(y_train.sum())} | Test fraud count: {int(y_test.sum())}")
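
To illustrate what `stratify` does, here is a small sketch on synthetic labels with a 0.2% positive rate. The arrays `X_demo` and `y_demo` are hypothetical and only mimic the imbalance; they are not the fraud data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 20 positives in 10,000 rows (0.2%), hypothetical numbers.
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1
X_demo = np.arange(10_000).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratification, both parts keep (approximately) the same positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.4%}")
print(f"test positive rate:  {test_rate:.4%}")
```

Without `stratify`, a random 20% sample of so few positives could by chance contain noticeably more or fewer of them, which would make recall on the test set noisier.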

### 5. Addressing Imbalance: SMOTE

The training set has very few fraud cases (about 394 out of 227,845). A model trained on this raw distribution tends to under-predict fraud (low recall). **SMOTE** (Synthetic Minority Over-sampling Technique) generates synthetic fraud examples by interpolating between existing fraud samples and their nearest fraud neighbours in feature space, so the classifier sees a balanced training set. We apply SMOTE **only to the training data**, so the test set contains only real transactions at the real class ratio and evaluation is not contaminated by synthetic samples. The parameter `k_neighbors` must be less than the number of minority samples; we set it to `min(5, n_fraud - 1)`, floored at 1.

In [None]:
n_fraud = int(y_train.sum())
k_neighbors = max(1, min(5, n_fraud - 1))
smote = SMOTE(random_state=42, k_neighbors=k_neighbors)
X_train_s, y_train_s = smote.fit_resample(X_train, y_train)

print(f"After SMOTE — Train size: {len(X_train_s)} (was {len(X_train)})")
print(f"Fraud count in resampled train: {(y_train_s == 1).sum()} (was {n_fraud})")
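
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the function `smote_like_sample` and the toy points are hypothetical, and it assumes the minority points are distinct.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority class: 6 points in 2-D (hypothetical values).
minority = rng.normal(loc=0.0, scale=1.0, size=(6, 2))
k = 3  # neighbours considered per point, analogous to k_neighbors


def smote_like_sample(points, k, rng):
    """Return one synthetic point on the segment between a point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    # Euclidean distance from points[i] to every minority point.
    dists = np.linalg.norm(points - points[i], axis=1)
    # Position 0 of the sorted order is the point itself (distance 0), so skip it.
    neighbours = np.argsort(dists)[1 : k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])


synthetic = np.array([smote_like_sample(minority, k, rng) for _ in range(10)])
print(synthetic.shape)
```

Each point needs `k` other minority points to choose from, which is why `k_neighbors` has to stay below the minority count. Because every synthetic point lies on a segment between two real minority points, no sample is created outside the region the minority class already spans.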

### 6. Training the Random Forest

We train a **Random Forest** on the SMOTE-resampled training set using **all 30 features** (no feature selection), so the trees can draw on every available signal. We use:
- **n_estimators=200** for a stable ensemble.
- **class_weight={0: 1.0, 1: 2.0}** to weight fraud samples twice as heavily as legitimate ones, on top of the SMOTE balancing.
- **max_depth=12** and **min_samples_leaf=2** to limit overfitting while capturing structure.

Random Forest is well-suited to this problem because it captures non-linear relationships and feature interactions without requiring feature scaling or distributional assumptions.

In [None]:
model = RandomForestClassifier(
    n_estimators=200,
    class_weight={0: 1.0, 1: 2.0},
    max_depth=12,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1,
)
model.fit(X_train_s, y_train_s)
print(f"Random Forest fitted on {X_train_s.shape[1]} features.")

### 7. Evaluation: Recall-First Strategy and Threshold Tuning

We adopt a **Recall-First** strategy: we require **recall ≥ 0.95** (catch at least 95% of frauds), then among all decision thresholds that achieve this we choose the one that **maximizes precision**. This is implemented in a vectorized way: we form predictions for many thresholds at once (0.01 to 0.54 in steps of 0.01), compute recall and precision for each, filter to those with recall ≥ 0.95, and take the threshold with the highest precision. If no threshold reaches 0.95 recall, we fall back to the threshold with the highest recall.
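
The selection rule can also be written compactly. The notation here is introduced for illustration only: \( p_i \) is the predicted fraud probability for test transaction \( i \), \( y_i \in \{0, 1\} \) is its label, and \( T = \{0.01, 0.02, \dots, 0.54\} \) is the threshold grid.

\[
\hat{y}_i(t) = \mathbb{1}[\,p_i \ge t\,], \qquad
R(t) = \frac{\sum_i y_i \, \hat{y}_i(t)}{\sum_i y_i}, \qquad
P(t) = \frac{\sum_i y_i \, \hat{y}_i(t)}{\sum_i \hat{y}_i(t)}
\]

\[
t^{*} = \arg\max_{t \in T,\; R(t) \ge 0.95} P(t),
\qquad \text{with fallback } t^{*} = \arg\max_{t \in T} R(t) \text{ if no } t \in T \text{ satisfies } R(t) \ge 0.95 .
\]

By convention \( P(t) \) is taken as 0 when no transaction is flagged at threshold \( t \), which keeps such thresholds from being selected.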

In [None]:
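proba = model.predict_proba(X_test)[:, 1]

# Build the grid from integers: np.arange with a fractional step can include or
# drop the endpoint due to floating-point rounding. This yields 0.01, ..., 0.54.
thresholds = np.arange(1, 55) / 100.0

# Broadcast to an (n_samples, n_thresholds) matrix of 0/1 predictions.
preds = (proba[:, None] >= thresholds).astype(int)
y_ = y_test.values[:, None]
tp = (preds * y_).sum(axis=0)      # true positives per threshold
pred_pos = preds.sum(axis=0)       # predicted positives per threshold
actual_pos = int(y_test.sum())     # fraud cases in the test set

# Guard against zero denominators without triggering divide-by-zero warnings.
recalls = tp / actual_pos if actual_pos > 0 else np.zeros(len(thresholds))
precisions = np.divide(tp, pred_pos, out=np.zeros(len(thresholds)), where=pred_pos > 0)

mask = recalls >= 0.95
if mask.any():
    # Exclude thresholds below the recall target, then maximize precision.
    precisions_safe = np.where(mask, precisions, -1)
    best_idx = np.argmax(precisions_safe)
else:
    best_idx = np.argmax(recalls)
    print("No threshold had recall >= 0.95; using threshold with highest recall.")
best_thresh = thresholds[best_idx]

y_pred = (proba >= best_thresh).astype(int)
print(f"Selected threshold = {best_thresh:.2f}")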

#### 7.1 Baseline Performance (Recall-First)

We report the classification report and main metrics at the chosen threshold. With recall ≥ 0.95, precision is typically low (many false positives) because we are deliberately biasing the model toward catching fraud. This is the expected trade-off for the Recall-First strategy.

In [None]:
print("--- Classification Report (Recall-First, threshold = {:.2f}) ---".format(best_thresh))
print(classification_report(y_test, y_pred))

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("\n--- Summary ---")
print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
recall_note = "target ≥ 0.95" if rec >= 0.95 else "target ≥ 0.95 not met"
print(f"Recall:    {rec:.4f}  ({recall_note})")
print(f"F1 Score:  {f1:.4f}")

#### 7.2 Discussion of Scenarios

**Scenario A: Recall-First (Safety Focus)** — *This notebook implements this.*

- **Goal:** The bank wants to miss as few frauds as possible.
- **What we did:** We set a recall target of ≥ 0.95 and chose the threshold that maximizes precision subject to that. Our model therefore prioritizes catching fraud.
- **Trade-off:** Precision is low (many false alarms). This can annoy customers but protects funds.

**Scenario B: Precision-First (User Experience Focus)**

- **Goal:** When a card is declined, the bank wants it to be very likely fraud (few false positives).
- **What we did not do:** We did not optimize for high precision. To do so we would move the threshold higher (e.g. flag only when predicted probability > 0.9) or tune for precision/F1 instead of recall-first. That would reduce false alarms but increase missed frauds.
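
For contrast, a Precision-First selection could mirror the recall-first sweep: require a minimum precision, then keep the threshold with the highest recall. The sketch below uses hypothetical scores and labels and an assumed target of 0.90; none of these values come from the model in this notebook.

```python
import numpy as np

# Hypothetical scores and labels for illustration only (not model output).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.65, 0.80, 0.90, 0.95])

thresholds = np.arange(1, 100) / 100.0  # 0.01, 0.02, ..., 0.99
preds = scores[:, None] >= thresholds   # shape (n_samples, n_thresholds)

tp = (preds & (y_true[:, None] == 1)).sum(axis=0)
pred_pos = preds.sum(axis=0)
precision = np.divide(tp, pred_pos, out=np.zeros(len(thresholds)), where=pred_pos > 0)
recall = tp / y_true.sum()

target_precision = 0.90  # assumed business target
ok = precision >= target_precision
if ok.any():
    # Among thresholds meeting the precision target, keep the one with the highest recall.
    best = np.argmax(np.where(ok, recall, -1.0))
    print(f"threshold={thresholds[best]:.2f} "
          f"precision={precision[best]:.2f} recall={recall[best]:.2f}")
else:
    print("No threshold reaches the precision target.")
```

On this toy data the legitimate transaction scored 0.80 forces the threshold above 0.80, so one of the three frauds (scored 0.65) is missed. That is the Scenario B trade-off in miniature.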

### 8. Current Challenges & Obstacles

- **Precision/Recall Trade-off:** Achieving recall ≥ 0.95 leads to low precision (many false positives). Improving precision while keeping recall ≥ 0.95 is the main open problem.
- **Threshold on Test Set:** We currently select the threshold using the test set, so the metrics reported in Section 7.1 are optimistically biased. A better practice is to pick the threshold on a validation set (or via cross-validation) and then report a single test-set result with that threshold fixed.
- **Imbalance:** Even with SMOTE, the real-world class ratio is extreme. The model may still be biased toward the majority if we do not tune class weights or sampling further.
- **Interpretability:** The V-features are PCA-derived, so we cannot explain predictions in plain business terms without mapping back to original variables or reporting only global feature importance.
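
One way to address the threshold issue is sketched below: the threshold is chosen on validation scores and then applied, frozen, to held-out scores. The helper `pick_threshold`, the generator `fake_split`, and the beta-distributed scores are all hypothetical stand-ins for `predict_proba` output, not results from this project.

```python
import numpy as np


def pick_threshold(y_true, proba, thresholds, min_recall=0.95):
    """Return the highest-precision threshold among those meeting min_recall.

    Falls back to the highest-recall threshold if none meets the target.
    """
    preds = proba[:, None] >= thresholds
    tp = (preds & (y_true[:, None] == 1)).sum(axis=0)
    pred_pos = preds.sum(axis=0)
    recall = tp / max(int(y_true.sum()), 1)
    precision = np.divide(tp, pred_pos, out=np.zeros(len(thresholds)), where=pred_pos > 0)
    ok = recall >= min_recall
    idx = np.argmax(np.where(ok, precision, -1.0)) if ok.any() else np.argmax(recall)
    return thresholds[idx]


rng = np.random.default_rng(0)


def fake_split(n_neg, n_pos):
    """Synthetic labels and scores: negatives score low, positives score high."""
    y = np.r_[np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)]
    p = np.r_[rng.beta(1, 8, n_neg), rng.beta(6, 2, n_pos)]
    return y, p


y_val_demo, p_val_demo = fake_split(5000, 50)
y_hold_demo, p_hold_demo = fake_split(5000, 50)

grid = np.arange(1, 55) / 100.0
t_val = pick_threshold(y_val_demo, p_val_demo, grid)  # chosen on validation data only

# The held-out set is touched once, with the threshold already fixed.
hold_pred = p_hold_demo >= t_val
hold_tp = (hold_pred & (y_hold_demo == 1)).sum()
hold_recall = hold_tp / y_hold_demo.sum()
hold_precision = hold_tp / max(int(hold_pred.sum()), 1)
print(f"threshold={t_val:.2f} recall={hold_recall:.3f} precision={hold_precision:.3f}")
```

Held-out recall may land slightly below 0.95 even when the validation target is met. That gap is the honest estimate that selecting on the test set hides.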

### 9. Plan for Completion

**Next Steps:**

1. **Validation-based threshold:** Use a separate validation split (or cross-validation) to choose the decision threshold; reserve the test set for final reporting only.
2. **SMOTE ratio:** Try different sampling ratios (e.g. less than full balance) to see if we can improve precision while still meeting recall ≥ 0.95.
3. **Hyperparameter tuning:** Tune Random Forest parameters (n_estimators, max_depth, min_samples_leaf, class_weight) via validation or GridSearchCV.
4. **Comparison:** Compare this Random Forest pipeline with the team’s Logistic Regression baseline (same preprocessing and recall target) using metrics and, if desired, Area Under the Precision-Recall Curve (AUPRC).