# Tree Models

### Imports and data loading

In [15]:
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression



# Load the dataset
df = pd.read_csv("../data/transactions.csv")



### Feature selection and target

In [16]:
# Same feature set used in the logistic regression baseline
# Keeping features consistent allows fair comparison between models
features = [
    "amount",          # Transaction amount
    "night",           # Night-time indicator
    "weekend",         # Weekend indicator
    "country_change",  # Cross-border transaction
    "velocity",        # Transaction activity intensity
    "device_risk",     # Device-related risk proxy
]

# Feature matrix and target vector
X = df[features]
y = df["is_fraud"]


### Train–test split

In [17]:
# Split the data into training and test sets
# Stratification preserves the fraud rate in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,     # 25% for evaluation
    stratify=y,         # Essential for imbalanced datasets
    random_state=42,
)


### Random Forest model

In [18]:
# Initialize Random Forest classifier
# This model captures non-linear relationships and feature interactions
rf = RandomForestClassifier(
    n_estimators=300,         # Number of trees
    max_depth=12,             # Maximum tree depth (controls overfitting)
    min_samples_leaf=50,      # Minimum samples per leaf (regularization)
    class_weight="balanced",  # Compensate for class imbalance
    random_state=42,
    n_jobs=-1,                # Use all available CPU cores
)

# Train the model
rf.fit(X_train, y_train)


# Predict fraud probabilities on the test set
rf_prob = rf.predict_proba(X_test)[:, 1]

# Evaluate ranking performance
rf_roc = roc_auc_score(y_test, rf_prob)
rf_pr = average_precision_score(y_test, rf_prob)

print(f"Random Forest ROC-AUC: {rf_roc:.3f}")
print(f"Random Forest PR-AUC:  {rf_pr:.3f}")


Random Forest ROC-AUC: 0.731
Random Forest PR-AUC:  0.046


### Gradient Boosting model

In [19]:
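# Initialize Gradient Boosting classifier
# This model builds trees sequentially, each correcting previous errors
# Note: GradientBoostingClassifier has no class_weight parameter, so unlike
# the other two models it is trained without class reweighting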
gb = GradientBoostingClassifier(
    n_estimators=200,    # Number of boosting stages
    learning_rate=0.05,  # Contribution of each tree
    max_depth=3,         # Shallow trees (weak learners)
    random_state=42,
)

# Train the model
gb.fit(X_train, y_train)


# Predict fraud probabilities on the test set
gb_prob = gb.predict_proba(X_test)[:, 1]

# Evaluate ranking performance
gb_roc = roc_auc_score(y_test, gb_prob)
gb_pr = average_precision_score(y_test, gb_prob)

print(f"Gradient Boosting ROC-AUC: {gb_roc:.3f}")
print(f"Gradient Boosting PR-AUC:  {gb_pr:.3f}")


Gradient Boosting ROC-AUC: 0.745
Gradient Boosting PR-AUC:  0.050


### Recalculate Logistic Regression

In [20]:
# Scale features for logistic regression (important for stable optimization)
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics on the test set

# Train logistic regression baseline (with class balancing for rare fraud cases)
log_model = LogisticRegression(
    max_iter=1000,             # allow enough iterations to converge
    class_weight="balanced",   # counter class imbalance
    random_state=42,
)

# Fit the model on the scaled training data
log_model.fit(X_train_scaled, y_train)

# Predict probabilities for the positive class (fraud = 1)
y_prob = log_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate ranking performance with the same metrics used for the tree models
logistic_roc = roc_auc_score(y_test, y_prob)
logistic_pr = average_precision_score(y_test, y_prob)

print(f"Baseline Logistic ROC-AUC: {logistic_roc:.3f}")
print(f"Baseline Logistic PR-AUC:  {logistic_pr:.3f}")

Baseline Logistic ROC-AUC: 0.748
Baseline Logistic PR-AUC:  0.051


### Models Comparison

In [21]:
# Compare all models using the same evaluation metrics
results = pd.DataFrame(
    {
        "Model": [
            "Logistic Regression",
            "Random Forest",
            "Gradient Boosting",
        ],
        "ROC-AUC": [
            logistic_roc,
            rf_roc,
            gb_roc,
        ],
        "PR-AUC": [
            logistic_pr,
            rf_pr,
            gb_pr,
        ],
    }
)

results


|   | Model | ROC-AUC | PR-AUC |
|---|---|---|---|
| 0 | Logistic Regression | 0.747905 | 0.051441 |
| 1 | Random Forest | 0.730679 | 0.04558 |
| 2 | Gradient Boosting | 0.744959 | 0.049701 |


## Tree-Based Models Summary 

After establishing an interpretable baseline with logistic regression, we evaluate more expressive tree-based models to test whether they improve fraud detection performance.
Fraud detection is a rare-event classification problem (highly imbalanced), so the main goal is to improve the model’s ability to *rank* truly fraudulent transactions above legitimate ones.

To ensure a fair comparison, all models are evaluated using the same train–test split and the same feature set.

---

### Why tree-based models?

Logistic regression is a strong baseline because it is fast and interpretable, but it is fundamentally a linear model.
Tree-based methods can capture:
- **Non-linear relationships** (risk does not always increase linearly with a feature)
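- **Feature interactions** (e.g., a high amount *and* a cross-border transaction *and* a night-time timestamp may be much riskier than each feature alone)

These properties can improve performance when the data contains such non-linear or interaction effects; whether they do so on this dataset is what the model comparison measures.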

---

### Random Forest (simple intuition)

A Random Forest is an ensemble of many decision trees trained independently.
Each tree learns a set of “if–then” rules (splits) from the data, and the forest combines all trees to produce a more stable prediction.

Key idea:
- Instead of trusting one single tree (which can overfit), we train **many trees** and **average their predictions**.
- Trees are diversified by training on different random samples of data and different subsets of features.

As a result, Random Forests are typically robust and perform well out of the box, especially when the relationship between features and fraud risk is not purely linear.
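
The bagging idea above can be made concrete with a deliberately tiny, pure-Python sketch. It is not how `RandomForestClassifier` is implemented: each "tree" here is a one-split stump, and the helper names (`fit_stump`, `fit_forest`, `forest_proba`) and the synthetic two-feature data are assumptions made purely for illustration.

```python
import random

def fit_stump(x, y):
    """Fit a one-split 'tree' on a single feature.

    Returns (split, rate_left, rate_right), where each rate is the share of
    positives on that side of the split.
    """
    overall = sum(y) / len(y)
    # Fallback if no split is possible: everything goes "left" with the overall rate
    best = (float("inf"), float("inf"), overall, overall)
    for split in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < split]
        right = [yi for xi, yi in zip(x, y) if xi >= split]
        lp, rp = sum(left) / len(left), sum(right) / len(right)
        # Size-weighted impurity p(1-p), proportional to Gini (lower is better)
        impurity = len(left) * lp * (1 - lp) + len(right) * rp * (1 - rp)
        if impurity < best[0]:
            best = (impurity, split, lp, rp)
    return best[1:]

def fit_forest(X, y, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each restricted to one random feature."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        j = rng.randrange(n_features)                # random feature subset of size 1
        stump = fit_stump([X[i][j] for i in rows], [y[i] for i in rows])
        forest.append((j, stump))
    return forest

def forest_proba(forest, row):
    """Average the per-stump positive rates to get the ensemble score."""
    votes = [lp if row[j] < split else rp for j, (split, lp, rp) in forest]
    return sum(votes) / len(votes)

# Synthetic data (assumption): the positive class needs BOTH features to be high
data_rng = random.Random(0)
X_toy = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_toy = [1 if a > 0.7 and b > 0.7 else 0 for a, b in X_toy]

forest = fit_forest(X_toy, y_toy)
p_high = forest_proba(forest, [0.9, 0.9])
p_low = forest_proba(forest, [0.1, 0.1])
print(f"risky corner: {p_high:.2f}, safe corner: {p_low:.2f}")
```

A single stump can only output two distinct values; averaging many stumps trained on different samples and features yields a graded score, which is the stabilizing effect described above.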

---

### Gradient Boosting (simple intuition)

Gradient Boosting is also an ensemble of decision trees, but unlike Random Forest, trees are built **sequentially**.
Each new tree is trained to **correct the errors** made by the previous ensemble.

Key idea:
- The model starts simple and improves step-by-step.
- Each additional tree is fit to the errors (loss gradients) of the current ensemble, so it concentrates on examples that are still poorly predicted.

Because it learns iteratively and can capture subtle patterns, Gradient Boosting often provides stronger ranking performance than Random Forest in fraud detection settings (when properly configured).
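
The sequential error-correcting loop can be sketched in a few lines of pure Python. To keep it short, this sketch makes two simplifying assumptions that differ from `GradientBoostingClassifier`: it uses squared error on a numeric target (so the "errors" are plain residuals) rather than log-loss, and each weak learner is a one-split stump on a single feature. The names `fit_residual_stump` and `boost` are hypothetical.

```python
def fit_residual_stump(x, residuals):
    """Best one-split regressor for the residuals: (split, left_mean, right_mean)."""
    best = None
    for split in sorted(set(x))[1:]:
        left = [r for xi, r in zip(x, residuals) if xi < split]
        right = [r for xi, r in zip(x, residuals) if xi >= split]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, split, lv, rv)
    return best[1:]

def boost(x, y, n_rounds=30, learning_rate=0.1):
    """Add stumps one at a time, each fit to what the ensemble still gets wrong."""
    base = sum(y) / len(y)            # start from a constant prediction
    pred = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        split, lv, rv = fit_residual_stump(x, residuals)
        stumps.append((split, lv, rv))
        # Shrink each correction by the learning rate before adding it
        pred = [pi + learning_rate * (lv if xi < split else rv)
                for xi, pi in zip(x, pred)]
        mse_history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return base, stumps, mse_history

# Toy non-linear target (assumption): a parabola that no single stump can fit
x_toy = list(range(20))
y_toy = [(xi - 10) ** 2 / 100 for xi in x_toy]

base, stumps, mse_history = boost(x_toy, y_toy)
print(f"MSE after round 1:  {mse_history[0]:.4f}")
print(f"MSE after round 30: {mse_history[-1]:.4f}")
```

The training error shrinks round by round, which mirrors the role of `n_estimators` and `learning_rate` in the notebook: more rounds and a larger rate fit the training data faster, at the cost of a higher overfitting risk.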

---

### Evaluation metrics (fraud-appropriate)

Because fraud is rare, overall accuracy is misleading and is not used to judge model quality.
Instead, we evaluate models using:

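- **ROC-AUC**: the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one; it summarizes ranking quality across all thresholds, and 0.5 corresponds to a random ranking.
- **PR-AUC (Average Precision)**: summarizes the precision–recall trade-off and is more informative under strong class imbalance; for a random ranking it is approximately equal to the fraud rate, so values should be read relative to that baseline.

PR-AUC is particularly important in fraud detection because it reflects the quality of the alert list produced by the model: how many of the top-scored transactions are actually fraudulent.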
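
To make the two metrics less abstract, here is a minimal pure-Python sketch of what they measure. It is written for readability, not speed, and assumes there are no tied scores in the average-precision helper; the function names and toy data are illustrative assumptions. The notebook itself relies on `roc_auc_score` and `average_precision_score` from scikit-learn.

```python
import random

def roc_auc(y_true, scores):
    """Share of (fraud, non-fraud) pairs where the fraud case scores higher; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of precision@k, evaluated at the rank k of each fraud case (assumes no ties)."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits

# Small hand-checkable example: 2 fraud cases among 8 transactions
y_small = [0, 0, 1, 0, 1, 0, 0, 0]
s_small = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05, 0.30, 0.15]
print(f"ROC-AUC: {roc_auc(y_small, s_small):.3f}")            # → 0.917
print(f"PR-AUC:  {average_precision(y_small, s_small):.3f}")  # → 0.833

# Random scores on a 5%-prevalence sample: PR-AUC should land near the prevalence
rng = random.Random(0)
y_rand = [1] * 200 + [0] * 3800
s_rand = [rng.random() for _ in y_rand]
auc_rand = roc_auc(y_rand, s_rand)
ap_rand = average_precision(y_rand, s_rand)
print(f"random scorer: ROC-AUC {auc_rand:.3f}, PR-AUC {ap_rand:.3f}")
```

The second part illustrates why a PR-AUC value is only meaningful next to the fraud rate of the evaluation set: an uninformative model scores roughly the prevalence, not 0.5.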

---

### Models Comparison

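All three models rank fraud well above chance (ROC-AUC between 0.731 and 0.748), while PR-AUC stays low in absolute terms (0.046 to 0.051), as is common when the positive class is rare.
Despite its simplicity, logistic regression achieves the highest ROC-AUC (0.748) and the highest PR-AUC (0.051) among the evaluated models.
This suggests that the current feature set already captures most of the available signal in a largely additive manner.
Gradient Boosting comes close (0.745 / 0.050) and Random Forest trails slightly (0.731 / 0.046), so neither tree-based model outperforms the linear baseline at this stage. The gaps are small and come from a single train–test split, so they should not be over-interpreted.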


---

### Conclusion and next step

On this dataset and feature set, the tree-based models did not improve fraud ranking performance over the linear baseline.
Among the tree-based methods, **Gradient Boosting provided the stronger ranking quality**, but logistic regression remains marginally ahead on both metrics and is simpler to interpret, so both are reasonable candidates for downstream operational tuning.

**Next step:** optimize the decision threshold (precision–recall trade-off) to select an operational point that balances fraud capture (recall) and false alarms (precision), according to practical constraints.
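
As a rough sketch of what that threshold selection could look like, the snippet below scans every candidate cut-off and keeps the one with the highest recall that still meets a minimum precision. The precision floor, the helper names (`operating_points`, `best_threshold`), and the toy scores are assumptions for illustration; the actual constraint (for example, a daily alert budget) depends on the operational setting.

```python
def operating_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score used as a cut-off.

    A transaction is flagged when its score is >= threshold.
    """
    n_pos = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    points, tp = [], 0
    for k, (s, y) in enumerate(ranked, start=1):
        tp += y
        # Emit a point only after all transactions sharing this score are counted
        if k == len(ranked) or ranked[k][0] != s:
            points.append((s, tp / k, tp / n_pos))
    return points

def best_threshold(y_true, scores, min_precision):
    """Highest-recall operating point with precision >= min_precision, or None."""
    ok = [p for p in operating_points(y_true, scores) if p[1] >= min_precision]
    return max(ok, key=lambda p: p[2]) if ok else None

# Toy example (assumption): 3 fraud cases among 10 scored transactions
y_toy = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(best_threshold(y_toy, s_toy, min_precision=0.5))  # → (0.4, 0.5, 1.0)
print(best_threshold(y_toy, s_toy, min_precision=0.6))  # threshold 0.7, recall 2/3
```

Applied to this notebook, the same helpers could be called with `list(y_test)` and any of the predicted probability arrays (for example `gb_prob`); given the PR-AUC values of about 0.05, a realistic precision floor would likely be far lower than in the toy example.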
