# Model Building & Training for Fraud and Credit Card Dataset

Importing Libraries

In [1]:
import sys
import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(project_root)

In [2]:
from scripts.data_loader import load_data
from scripts.model import run_modeling_pipeline, select_best_model

Loading Fraud Dataset

In [3]:
train_df = load_data("/Users/elbethelzewdie/Downloads/fraud-detection/fraud-detection/data/processed/fraud_preprocessed_train_smote.csv")
test_df = load_data("/Users/elbethelzewdie/Downloads/fraud-detection/fraud-detection/data/processed/fraud_preprocessed_test.csv")

Separating the target variable from Features

In [None]:
# Define target variable
TARGET = "class"

# Split features and target
X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

X_test = test_df.drop(columns=[TARGET])
y_test = test_df[TARGET]

In [None]:
# check shapes and class distribution
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

print("\nClass distribution (train):")
print(y_train.value_counts(normalize=True))

print("\nClass distribution (test):")
print(y_test.value_counts(normalize=True))


Train shape: (168303, 42)
Test shape: (25830, 42)

Class distribution (train):
class
0.0    0.555558
1.0    0.444442
Name: proportion, dtype: float64

Class distribution (test):
class
0.0    0.904994
1.0    0.095006
Name: proportion, dtype: float64


Run modeling pipeline

In [None]:
# Run modeling pipeline
results = run_modeling_pipeline(
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test
)

In [7]:
comparison_df, best_model, justification = select_best_model(results)

comparison_df

                 Model   F1_mean  AUC_PR_mean
1        Random Forest  0.952732     0.989911
0  Logistic Regression  0.662097     0.776104


- In cross-validation, the Random Forest model clearly outperforms Logistic Regression, with a mean F1-score of 0.953 and a mean AUC-PR of 0.990. These are fold averages, presumably computed on the SMOTE-resampled training set (about 44% fraud), so they describe performance on balanced data rather than on the imbalanced test set (about 9.5% fraud).

- The Logistic Regression model's lower scores (F1 $\approx 0.662$, AUC-PR $\approx 0.776$) suggest that it struggles more with the complexity of the fraud patterns in this dataset.
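The internals of `run_modeling_pipeline` are not shown in this notebook. As a minimal sketch of how mean F1 and AUC-PR over stratified 5-fold cross-validation could be computed, the example below uses scikit-learn's `cross_validate` on synthetic data. The data, the model settings, and the names `X_demo`, `y_demo`, `demo_models`, and `cv_summary` are assumptions made for illustration, and `average_precision` is used as the usual scikit-learn estimate of the area under the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (assumption); the notebook's real features are not used.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Hypothetical model settings, chosen only to keep the example fast.
demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep the fraud/non-fraud ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_summary = {}
for name, model in demo_models.items():
    scores = cross_validate(
        model, X_demo, y_demo, cv=cv, scoring=["f1", "average_precision"]
    )
    # Average each metric over the five validation folds.
    cv_summary[name] = {
        "F1_mean": scores["test_f1"].mean(),
        "AUC_PR_mean": scores["test_average_precision"].mean(),
    }

for name, row in cv_summary.items():
    print(f"{name}: F1_mean={row['F1_mean']:.3f}, AUC_PR_mean={row['AUC_PR_mean']:.3f}")
```

One caveat with this setup: if the data has been oversampled before the folds are created, synthetic points derived from a training-fold sample can land in the validation fold, which tends to inflate the cross-validated scores.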

Choosing the Best Model

In [8]:
print("Selected model:", best_model)
print("Justification:", justification)


Selected model: Random Forest
Justification: Random Forest was selected based on the highest mean AUC-PR across stratified 5-fold cross-validation. Logistic Regression is retained as a strong baseline due to its interpretability and transparency.


Confusion Matrix

In [9]:
for model_name, metrics in results.items():
    print(f"\n{model_name}")
    print("Confusion Matrix:")
    print(metrics["test_metrics"]["Confusion_Matrix"])



Logistic Regression
Confusion Matrix:
[[15294  8082]
 [  725  1729]]

Random Forest
Confusion Matrix:
[[23242   134]
 [ 1098  1356]]


### Analysis:
- Reading each matrix as [[TN, FP], [FN, TP]] (rows are true classes, which matches the 2,454 fraud cases in the test set), the Logistic Regression model detects more of the fraudulent transactions (TP = 1,729, recall ≈ 70%) and misses fewer (FN = 725). However, it produces a very large number of false positives (FP = 8,082), so only about 18% of its fraud alerts are correct. In a real-world fraud detection system, this would lead to many legitimate customers being incorrectly flagged, increasing operational costs and customer dissatisfaction.

- The Random Forest model cuts false positives to 134 (precision ≈ 91%), making it far more conservative when flagging fraud. This is desirable in operational settings where unnecessary fraud alerts are costly. However, the model misses more fraudulent transactions than Logistic Regression (FN = 1,098 vs. 725), so its recall on the test set is only about 55%.
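To make the comparison concrete, precision, recall, and F1 for the fraud class can be derived directly from the confusion matrices printed above. The sketch below assumes the scikit-learn layout [[TN, FP], [FN, TP]]; the helper `summarize_confusion` and the dictionary `fraud_matrices` are hypothetical names introduced for this example and are not part of the notebook's `scripts` package.

```python
def summarize_confusion(matrix):
    """Return precision, recall, and F1 for the positive (fraud) class.

    Assumes the layout [[TN, FP], [FN, TP]], i.e. rows are true labels.
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Test-set confusion matrices copied from the output above.
fraud_matrices = {
    "Logistic Regression": [[15294, 8082], [725, 1729]],
    "Random Forest": [[23242, 134], [1098, 1356]],
}

fraud_summary = {
    name: summarize_confusion(matrix) for name, matrix in fraud_matrices.items()
}

for name, stats in fraud_summary.items():
    print(
        f"{name}: precision={stats['precision']:.3f}, "
        f"recall={stats['recall']:.3f}, f1={stats['f1']:.3f}"
    )
```

Computed this way, the test-set F1 for both models is well below the cross-validated means reported earlier, which is worth keeping in mind when interpreting the model selection.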

Loading Credit Card Dataset

In [10]:
train_df = load_data("/Users/elbethelzewdie/Downloads/fraud-detection/fraud-detection/data/processed/creditcard_preprocessed_train_smote.csv")
test_df = load_data("/Users/elbethelzewdie/Downloads/fraud-detection/fraud-detection/data/processed/creditcard_preprocessed_test.csv")

Separating the target variable from Features

In [11]:
TARGET = 'class'

X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

X_test = test_df.drop(columns=[TARGET])
y_test = test_df[TARGET]

In [12]:
# check shapes and class distribution
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

print("\nClass distribution (train):")
print(y_train.value_counts(normalize=True))

print("\nClass distribution (test):")
print(y_test.value_counts(normalize=True))


Train shape: (407883, 30)
Test shape: (56746, 30)

Class distribution (train):
class
0    0.555556
1    0.444444
Name: proportion, dtype: float64

Class distribution (test):
class
0    0.998326
1    0.001674
Name: proportion, dtype: float64


Run modeling pipeline

In [13]:
results = run_modeling_pipeline(
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test
)

Choosing the Best Model

In [14]:
comparison_df, best_model, justification = select_best_model(results)

comparison_df

                 Model   F1_mean  AUC_PR_mean
1        Random Forest  0.999879     0.999997
0  Logistic Regression  0.944133     0.990671


In [15]:
print("Selected model:", best_model)
print("Justification:", justification)


Selected model: Random Forest
Justification: Random Forest was selected based on the highest mean AUC-PR across stratified 5-fold cross-validation. Logistic Regression is retained as a strong baseline due to its interpretability and transparency.


Confusion Matrix

In [16]:
for model_name, metrics in results.items():
    print(f"\n{model_name}")
    print("Confusion Matrix:")
    print(metrics["test_metrics"]["Confusion_Matrix"])


Logistic Regression
Confusion Matrix:
[[55192  1459]
 [   12    83]]

Random Forest
Confusion Matrix:
[[56645     6]
 [   24    71]]


### Analysis:

- Logistic Regression catches most fraud cases (TP = 83 of 95, recall ≈ 87%) and misses only 12. However, it flags 1,459 legitimate transactions as fraud, roughly 2.6% of all legitimate transactions, so only about 5% of its alerts are real fraud. Low false negatives matter when a missed fraud costs more than a false alarm, but precision this low leaves substantial room for improvement.

- Random Forest drastically reduces false positives (only 6, precision ≈ 92%), meaning very few legitimate transactions are flagged as fraud. However, it misses twice as many fraud cases as Logistic Regression (FN = 24 vs. 12, recall ≈ 75% vs. 87%), which can be critical depending on the business objective. This trade-off indicates that Random Forest is more conservative in flagging fraud, prioritizing precision over recall.
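The precision/recall trade-off described above is not fixed for a given model: it depends on the probability threshold used to flag a transaction. The sketch below illustrates this with a small hand-made set of labels and scores. The data and the helper `precision_recall_at` are invented for illustration and do not come from the notebook's models.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is flagged as fraud."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # With no flagged transactions, precision is conventionally reported as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels (1 = fraud) and predicted fraud probabilities, illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

# Raising the threshold flags fewer transactions: precision rises, recall falls.
sweep = {t: precision_recall_at(y_true, scores, t) for t in (0.3, 0.5, 0.65)}

for threshold, (precision, recall) in sweep.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On real model outputs, `sklearn.metrics.precision_recall_curve` performs the same sweep over every distinct score, which can help pick an operating point that matches the business cost of missed fraud versus false alerts.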