Data Preparation

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the engineered data
df = pd.read_csv('../data/processed/fraud_features_final.csv')

# Drop non-numeric and target columns to separate features (X) from target (y)
# Note: Ensure you drop the correct target column name based on the dataset
X = df.drop(columns=['class', 'user_id', 'device_id', 'ip_address', 'signup_time', 'purchase_time'])
y = df['class']

# Stratified split: 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
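
As a sanity check on the stratified split, the sketch below confirms that `stratify=y` keeps the fraud rate the same in the training and test sets. It runs on synthetic data from `make_classification` rather than the real CSV; the roughly 9% positive rate is an assumption chosen to resemble the 2,830 fraud cases out of 30,223 test rows reported later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data: about 9% positives (an assumption).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=8, weights=[0.91, 0.09], random_state=42
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify=y, the positive rate should be nearly identical in every subset.
rates = {"full": y_demo.mean(), "train": y_tr.mean(), "test": y_te.mean()}
for name, rate in rates.items():
    print(f"{name:>5}: positive rate = {rate:.4f}")
```

Running the same three-line check on the real `y`, `y_train`, and `y_test` would show whether the split behaved as intended.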

Building the Baseline Model

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, auc

# Initialize and train the baseline model
baseline_model = LogisticRegression(max_iter=1000)
baseline_model.fit(X_train, y_train)

# Predictions and Evaluation 
y_pred_baseline = baseline_model.predict(X_test)
print("Baseline Logistic Regression Performance:")
print(classification_report(y_test, y_pred_baseline))

Baseline Logistic Regression Performance:
              precision    recall  f1-score   support

           0       0.95      0.99      0.97     27393
           1       0.91      0.51      0.66      2830

    accuracy                           0.95     30223
   macro avg       0.93      0.75      0.82     30223
weighted avg       0.95      0.95      0.94     30223
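
The cell above imports `confusion_matrix` but never calls it. The sketch below shows how the precision and recall columns of a classification report follow from the four confusion-matrix counts. The labels are a tiny hand-made example, not rows from the fraud dataset.

```python
from sklearn.metrics import confusion_matrix

# Tiny hand-made labels (hypothetical, not taken from the fraud dataset).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=5 FP=1 FN=2 TP=2

precision = tp / (tp + fp)  # share of flagged cases that are truly fraud
recall = tp / (tp + fn)     # share of fraud cases that were flagged
print(f"precision={precision:.3f} recall={recall:.3f}")  # precision=0.667 recall=0.500
```

Calling `confusion_matrix(y_test, y_pred_baseline)` in the notebook would give the corresponding counts for the baseline.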



Build Ensemble Models with Hyperparameter Tuning

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Initialize Random Forest
rf_model = RandomForestClassifier(random_state=42)

# Basic hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
}

# Use GridSearchCV for tuning
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3, scoring='f1')
grid_search.fit(X_train, y_train)

best_rf = grid_search.best_estimator_
print(f"Best Parameters: {grid_search.best_params_}")

Best Parameters: {'max_depth': 20, 'n_estimators': 100}
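
`best_params_` reports only the winning combination. To see how close the other candidates were, the fitted search object's `cv_results_` can be loaded into a DataFrame. The sketch below uses synthetic data and a smaller grid than the notebook so that it runs quickly; the data and grid values are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic, imbalanced dataset (an assumption, not the fraud data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=42
)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 20], "max_depth": [None, 5]},
    cv=3,
    scoring="f1",
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
columns = ["param_n_estimators", "param_max_depth",
           "mean_test_score", "std_test_score", "rank_test_score"]
results = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```

If the top rows differ by less than their `std_test_score`, the choice between them is largely arbitrary and the cheaper configuration is a reasonable pick.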


Cross-Validation

In [4]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# 1. Initialize Stratified K-Fold (k=5 as per project requirement)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 2. Calculate cross-validation scores
# Use 'best_rf' from the GridSearchCV step
cv_scores = cross_val_score(best_rf, X_train, y_train, cv=skf, scoring='f1')

# 3. Print the results clearly
print(f"Mean CV F1-Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# 4. Final Evaluation on Test Set
y_pred = best_rf.predict(X_test)
print("\nFinal Model Classification Report:")
print(classification_report(y_test, y_pred))

Mean CV F1-Score: 0.7008 (+/- 0.0058)

Final Model Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.98     27393
           1       1.00      0.53      0.69      2830

    accuracy                           0.96     30223
   macro avg       0.98      0.76      0.83     30223
weighted avg       0.96      0.96      0.95     30223
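
The cross-validation step above tracks F1 only. Because precision and recall tell quite different stories for this model, it may help to collect all three per fold. A minimal sketch using `cross_validate`, which accepts a list of scorers, is shown below on synthetic data (an assumption; swap in `best_rf`, `X_train`, and `y_train` to apply it to the notebook).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the training data (an assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.85, 0.15], random_state=42
)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["precision", "recall", "f1"]

# cross_validate clones the estimator for each fold and scores every metric.
scores = cross_validate(
    RandomForestClassifier(n_estimators=25, random_state=42),
    X_demo, y_demo, cv=skf_demo, scoring=metrics,
)

for metric in metrics:
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric:>9}: {fold_scores.mean():.4f} (+/- {fold_scores.std():.4f})")
```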



Model Comparison and Selection

In [5]:
from sklearn.metrics import recall_score, precision_score, f1_score, roc_auc_score

def get_metrics(model, X, y, name):
    preds = model.predict(X)
    probs = model.predict_proba(X)[:, 1]
    return {
        'Model': name,
        'Precision': precision_score(y, preds),
        'Recall': recall_score(y, preds),
        'F1-Score': f1_score(y, preds),
        'ROC-AUC': roc_auc_score(y, probs)
    }

# Compare models on the Test Set
comparison_df = pd.DataFrame([
    get_metrics(baseline_model, X_test, y_test, 'Logistic Regression (Baseline)'),
    get_metrics(best_rf, X_test, y_test, 'Random Forest (Best Ensemble)')
])

print(comparison_df)

                            Model  Precision    Recall  F1-Score   ROC-AUC
0  Logistic Regression (Baseline)   0.910569  0.514488  0.657485  0.834447
1   Random Forest (Best Ensemble)   1.000000  0.526855  0.690118  0.841984
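
The notebook never prints a confusion matrix, but the underlying counts can be recovered from the figures above, which puts the gap between the two models in concrete terms. The sketch below is plain arithmetic on the reported precision, recall, and class supports. The inputs are rounded to six decimals, so the counts are a reconstruction, not values read from the notebook.

```python
# Reported test-set figures, copied from the comparison table and the reports.
n_fraud = 2830   # support for class 1
n_legit = 27393  # support for class 0
reported = {
    "Logistic Regression": {"precision": 0.910569, "recall": 0.514488},
    "Random Forest": {"precision": 1.000000, "recall": 0.526855},
}

counts = {}
for name, m in reported.items():
    tp = round(m["recall"] * n_fraud)     # recall = TP / (TP + FN)
    fn = n_fraud - tp
    fp = round(tp / m["precision"] - tp)  # precision = TP / (TP + FP)
    tn = n_legit - fp
    f1 = 2 * tp / (2 * tp + fp + fn)      # cross-check against the table
    counts[name] = {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "F1": round(f1, 6)}
    print(name, counts[name])
```

Under this reconstruction the Random Forest flags 1,491 of the 2,830 fraud cases against 1,456 for the baseline (35 more) and raises no false alarms against the baseline's 143. The recomputed F1 values match the table, which suggests the counts are right.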


Model Comparison Summary
Baseline Model: Logistic Regression

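Strengths: Excellent Interpretability. The prediction is a weighted sum of the features passed through a sigmoid, so each coefficient shows how strongly a feature pushes the fraud score up or down. That makes it easy to explain to legal or compliance teams why a user was flagged.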

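Weaknesses: Lower Precision and Recall (0.91 and 0.51 on the test set, versus 1.00 and 0.53 for the Random Forest). Because it models a linear decision boundary, it can miss complex fraud patterns (like high-velocity bot attacks). In this test the main cost was more false alarms for legitimate customers, along with slightly more missed fraud.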

Ensemble Model: Random Forest (The Recommended Choice)

Strengths: Higher F1-Score (0.690 vs. 0.657). It captures "non-linear" behavior (e.g., if a user signs up from China AND uses a specific browser AND makes 5 purchases in 1 hour). On the test set this gave markedly higher Precision (1.00 vs. 0.91, i.e. fewer false alarms) and slightly higher Recall (0.527 vs. 0.514, i.e. a few more fraud cases caught).

Weaknesses: Lower Interpretability. It is a "black-box" model, here an ensemble of 100 decision trees (the tuned n_estimators), making it harder to explain without specialized tools like SHAP.
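
To give a rough sense of what such explanation tooling provides, the sketch below uses scikit-learn's built-in `permutation_importance`. This is not SHAP: it is a lighter-weight alternative swapped in here because it ships with scikit-learn. It measures how much the F1-Score drops when each feature is shuffled. The data and the `feature_i` names are synthetic placeholders, not the notebook's features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 3 informative and 3 noise features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.85, 0.15], random_state=42,
)
feature_names = [f"feature_{i}" for i in range(6)]  # hypothetical names

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and record the mean drop in F1.
result = permutation_importance(
    forest, X_te, y_te, scoring="f1", n_repeats=5, random_state=42
)
importances = pd.Series(result.importances_mean, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

This gives a global ranking only. Explaining why one specific user was flagged still needs a per-prediction method such as SHAP.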

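I have selected the Random Forest model as the final production model. While Logistic Regression is more interpretable, the Random Forest scored higher on every test-set metric, most clearly in Precision (1.00 vs. 0.91) and F1-Score (0.690 vs. 0.657); the gain in Recall was small (0.527 vs. 0.514). In fraud detection, missing a fraudulent transaction (False Negative) is usually much costlier than a brief verification check for a legitimate user, so the roughly 47% of fraud cases that both models still miss at the default decision threshold remains the main gap to close.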
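
If false negatives are the costlier error, one common follow-up is to lower the decision threshold and trade some precision for recall. The sketch below uses `precision_recall_curve` to pick the threshold with the highest recall that still meets a precision floor. The synthetic data and the `min_precision` value of 0.50 are illustrative assumptions, not project requirements. In practice the threshold should be chosen on a validation split, not on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the fraud data (an assumption).
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
probs = model.predict_proba(X_val)[:, 1]

# precision[i] and recall[i] describe the rule "flag if probs >= thresholds[i]".
precision, recall, thresholds = precision_recall_curve(y_val, probs)

min_precision = 0.50  # hypothetical business floor on precision
ok = precision[:-1] >= min_precision  # last point has no matching threshold
if not ok.any():
    raise ValueError("no threshold meets the precision floor")

# Among thresholds that meet the floor, keep the one with the highest recall.
best_idx = int(np.argmax(np.where(ok, recall[:-1], -1.0)))
chosen = thresholds[best_idx]

preds = (probs >= chosen).astype(int)
print(f"chosen threshold: {chosen:.3f}")
print(f"precision={precision_score(y_val, preds):.3f} "
      f"recall={recall_score(y_val, preds):.3f}")
print(f"recall at default threshold: {recall_score(y_val, model.predict(X_val)):.3f}")
```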