## 2.1 Data Preparation

Before training any model, I first need to isolate the features and target labels for both datasets. I also split each dataset into training and test sets using stratification, so that both splits keep the original class balance.
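
The stratified split described above can be sketched on synthetic data as follows. This is only an illustration: the toy feature names, the roughly 5% positive rate, and the 80/20 split are assumptions, not the settings used to produce the processed files loaded in Step 1.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows with roughly 5% positives (fraud).
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y = pd.Series((rng.random(1000) < 0.05).astype(int), name="is_fraud")

# stratify=y keeps the positive rate (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

full_rate = y.mean()
train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"Positive rate: full={full_rate:.4f}, train={train_rate:.4f}, test={test_rate:.4f}")
```

Without `stratify=y`, a small test set could end up with noticeably more or fewer fraud cases than the training set, which would distort the metrics reported later.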


#### 🔄 Step 1: Load Processed Data

In [1]:
import pandas as pd

# Load FRAUD DATA
X_train_fraud = pd.read_csv('../data/processed/X_train_final.csv')
y_train_fraud = pd.read_csv('../data/processed/y_train_final.csv').squeeze()
X_test_fraud = pd.read_csv('../data/processed/X_test_final.csv')
y_test_fraud = pd.read_csv('../data/processed/y_test_final.csv').squeeze()

# Load CREDIT CARD DATA
X_train_credit = pd.read_csv('../data/processed/X_train_final_credit.csv')
y_train_credit = pd.read_csv('../data/processed/y_train_final_credit.csv').squeeze()
X_test_credit = pd.read_csv('../data/processed/X_test_final_credit.csv')
y_test_credit = pd.read_csv('../data/processed/y_test_final_credit.csv').squeeze()
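
The `.squeeze()` calls above matter because `pd.read_csv` returns a DataFrame even when the file has a single column, while scikit-learn expects a one-dimensional target. A small sketch, using an in-memory CSV with a hypothetical `class` column instead of the real label files:

```python
import io
import pandas as pd

# Hypothetical one-column label file, held in memory for the example.
csv_text = "class\n0\n1\n0\n0\n"

y_frame = pd.read_csv(io.StringIO(csv_text))   # DataFrame with shape (4, 1)
y_series = y_frame.squeeze()                   # Series with shape (4,)

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```

Note that `.squeeze()` on a file with a single row and a single column would return a scalar rather than a Series, so this pattern assumes the label files contain more than one row.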


#### 🤖 Step 2: Train and Evaluate Models

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, auc, f1_score, confusion_matrix

def evaluate_models(X_train, y_train, X_test, y_test, dataset_name=""):
    print(f"\n📊 Results for {dataset_name} Dataset\n" + "-"*40)
    
    # Define models
    lr = LogisticRegression(max_iter=1000, random_state=42)
    rf = RandomForestClassifier(n_estimators=100, random_state=42)

    # Train
    lr.fit(X_train, y_train)
    rf.fit(X_train, y_train)

    # Predict probabilities
    lr_probs = lr.predict_proba(X_test)[:, 1]
    rf_probs = rf.predict_proba(X_test)[:, 1]

    # Predict classes
    lr_preds = lr.predict(X_test)
    rf_preds = rf.predict(X_test)

    # AUC-PR
    def auc_pr(y_true, y_prob):
        precision, recall, _ = precision_recall_curve(y_true, y_prob)
        return auc(recall, precision)

    lr_aucpr = auc_pr(y_test, lr_probs)
    rf_aucpr = auc_pr(y_test, rf_probs)

    # F1-scores
    lr_f1 = f1_score(y_test, lr_preds)
    rf_f1 = f1_score(y_test, rf_preds)

    # Confusion Matrices
    lr_cm = confusion_matrix(y_test, lr_preds)
    rf_cm = confusion_matrix(y_test, rf_preds)

    # Print results
    print("🔹 Logistic Regression:")
    print(f" - AUC-PR: {lr_aucpr:.4f}")
    print(f" - F1 Score: {lr_f1:.4f}")
    print(f" - Confusion Matrix:\n{lr_cm}\n")

    print("🔸 Random Forest:")
    print(f" - AUC-PR: {rf_aucpr:.4f}")
    print(f" - F1 Score: {rf_f1:.4f}")
    print(f" - Confusion Matrix:\n{rf_cm}")

    #✅ Return trained models
    return lr, rf
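
# --- Side note on the AUC-PR helper ---

The `auc_pr` helper above integrates the precision-recall curve with the trapezoidal rule. scikit-learn also offers `average_precision_score`, which summarizes the same curve with a step-wise sum, so the two numbers can differ slightly. The sketch below compares them on made-up labels and scores; the arrays are illustrative assumptions, not data from either dataset.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

# Hypothetical labels and predicted probabilities (3 positives out of 10).
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.7, 0.05, 0.9, 0.4, 0.15])

# Trapezoidal area under the precision-recall curve (same approach as auc_pr).
precision, recall, _ = precision_recall_curve(y_true, y_prob)
trapezoid_aucpr = auc(recall, precision)

# Step-wise summary of the same curve.
step_ap = average_precision_score(y_true, y_prob)

print(f"Trapezoidal AUC-PR: {trapezoid_aucpr:.4f}")
print(f"Average precision:  {step_ap:.4f}")
```

Either summary is reasonable for comparing models, as long as the same one is used for every model being compared.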


#### 🏁 Step 3: Run Evaluation

In [3]:

# Evaluate and capture models
lr_fraud, rf_fraud = evaluate_models(X_train_fraud, y_train_fraud, X_test_fraud, y_test_fraud, "Fraud_Data")
lr_credit, rf_credit = evaluate_models(X_train_credit, y_train_credit, X_test_credit, y_test_credit, "Credit_Card")



📊 Results for Fraud_Data Dataset
----------------------------------------
🔹 Logistic Regression:
 - AUC-PR: 0.2554
 - F1 Score: 0.2753
 - Confusion Matrix:
[[19122  8271]
 [ 1058  1772]]

🔸 Random Forest:
 - AUC-PR: 0.6161
 - F1 Score: 0.5785
 - Confusion Matrix:
[[26481   912]
 [ 1307  1523]]

📊 Results for Credit_Card Dataset
----------------------------------------
🔹 Logistic Regression:
 - AUC-PR: 0.7150
 - F1 Score: 0.1002
 - Confusion Matrix:
[[55172  1479]
 [   12    83]]

🔸 Random Forest:
 - AUC-PR: 0.8142
 - F1 Score: 0.8276
 - Confusion Matrix:
[[56644     7]
 [   23    72]]


In [4]:
# Now save the Models 

import joblib
import os

os.makedirs("models", exist_ok=True)

# Save the trained Random Forest models
joblib.dump(rf_fraud, "models/rf_fraud_model.pkl")
joblib.dump(rf_credit, "models/rf_credit_model.pkl")


['models/rf_credit_model.pkl']
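
To reuse a saved model later, it can be loaded back with `joblib.load`. The sketch below shows the save-and-reload round trip on a small throwaway model in a temporary directory; the toy data, the `rf_demo_model.pkl` file name, and the model settings are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data and a small model, standing in for rf_fraud / rf_credit.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rf_demo_model.pkl")
    joblib.dump(model, model_path)       # save to disk
    reloaded = joblib.load(model_path)   # load it back

# The reloaded model should reproduce the original predictions.
same_predictions = np.array_equal(model.predict(X_demo), reloaded.predict(X_demo))
print("Reloaded model matches original:", same_predictions)
```

Pickled models are generally only safe to load with the same scikit-learn version that saved them, so it helps to record that version alongside the files.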

#### ✅ Step 4: Interpret and Justify Best Model

After evaluating both Logistic Regression and Random Forest on the two datasets, we summarize the results and justify the preferred model based on three key metrics: AUC-PR, F1 Score, and the Confusion Matrix. All three are informative for imbalanced classification tasks such as fraud detection.
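
To make the link between the confusion matrices and the F1 Scores explicit, the sketch below recomputes precision, recall, and F1 from the four matrices printed in Step 3. The helper name `metrics_from_cm` is hypothetical; it assumes scikit-learn's layout of `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

def metrics_from_cm(cm):
    """Return (precision, recall, f1) for the positive class of a 2x2 matrix.

    Assumes scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    tn, fp, fn, tp = np.asarray(cm).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices copied from the printed results in Step 3.
reported = {
    "Fraud_Data / Logistic Regression": [[19122, 8271], [1058, 1772]],
    "Fraud_Data / Random Forest": [[26481, 912], [1307, 1523]],
    "Credit_Card / Logistic Regression": [[55172, 1479], [12, 83]],
    "Credit_Card / Random Forest": [[56644, 7], [23, 72]],
}

for name, cm in reported.items():
    p, r, f1 = metrics_from_cm(cm)
    print(f"{name}: precision={p:.4f}, recall={r:.4f}, F1={f1:.4f}")
```

Splitting F1 into precision and recall shows which kind of error drives each score, which the single F1 number hides.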

##### 📊 Fraud_Data Dataset

| Model | AUC-PR | F1 Score | True Positives (TP) | False Positives (FP) | False Negatives (FN) | True Negatives (TN) |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.2554 | 0.2753 | 1772 | 8271 | 1058 | 19122 |
| Random Forest | 0.6161 | 0.5785 | 1523 | 912 | 1307 | 26481 |

📝 Interpretation:

- Random Forest achieved a much higher AUC-PR (0.6161 vs. 0.2554) and F1 Score (0.5785 vs. 0.2753), indicating a better balance between precision and recall.
- It cut false positives from 8,271 to 912, which raised correct non-fraud classifications (true negatives) from 19,122 to 26,481.
- Its higher false-negative count (1,307 vs. 1,058) lowers recall from about 63% to about 54%, but this is outweighed by the rise in precision from about 18% to about 63%.

✅ Best Model: Random Forest (higher AUC-PR and F1 Score, with far fewer false alarms)


##### 📊 Credit_Card Dataset

| Model | AUC-PR | F1 Score | True Positives (TP) | False Positives (FP) | False Negatives (FN) | True Negatives (TN) |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.7150 | 0.1002 | 83 | 1479 | 12 | 55172 |
| Random Forest | 0.8142 | 0.8276 | 72 | 7 | 23 | 56644 |

📝 Interpretation:

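- Logistic Regression had a decent AUC-PR (0.7150) but a very low F1 Score (0.1002). It caught 83 of the 95 fraud cases (recall of about 87%), but its 1,479 false positives pushed precision down to about 5%.
- Random Forest achieved a higher AUC-PR (0.8142) and an excellent F1 Score (0.8276), with very few misclassifications (only 7 FP and 23 FN), for a precision of about 91% and a recall of about 76%.
- These test-set results suggest good generalization and practical usability, although Logistic Regression does catch more of the fraud cases (83 vs. 72).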

✅ Best Model: Random Forest (clear dominance in classification effectiveness)

##### 🏁 Final Recommendation

Across both datasets:

- ✅ Random Forest consistently delivers the higher AUC-PR and F1 Score.
- ⚖️ Logistic Regression offers interpretability, but it produces too many false positives to meet performance needs for imbalanced fraud detection tasks.
- 🔐 Recommended Model for Deployment: Random Forest Classifier, due to its strong precision-recall balance, low error rates, and consistent results on both test sets.