In [1]:
import pandas as pd

# Load the cleaned dataset
df = pd.read_csv("../data/processed_creditcard.csv")

# Display first few rows
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V22,V23,V24,V25,V26,V27,V28,Class,Scaled_Amount,Scaled_Time
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0,0.2442,-1.996823
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,0,-0.342584,-1.996823
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0,1.1589,-1.996802
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0,0.139886,-1.996802
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,0,-0.073813,-1.996781


Before training any model, we need to load the cleaned dataset and verify it to make sure the data is in the right format.

What to Expect?
The dataset should not have missing values.
The dataset should be numerical (no categorical columns).
The target column (Class) should contain:
0 → Normal Transactions
1 → Fraudulent Transactions
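
The three expectations above can be checked programmatically. The following is a minimal sketch, assuming a pandas DataFrame with a `Class` column; the helper name `check_dataset` and the tiny stand-in frame are hypothetical and exist only so the example runs without the CSV. The preview above also appears to include an `Unnamed: 0` column, which usually means the file was saved together with its index; the same check would be a natural place to decide whether to drop it.

```python
import pandas as pd

def check_dataset(df: pd.DataFrame, target: str = "Class") -> dict:
    """Summarize the expectations: no missing values, numeric columns, binary target."""
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    return {
        "missing_values": int(df.isna().sum().sum()),
        "non_numeric_columns": non_numeric,
        "target_values": sorted(df[target].unique().tolist()),
        "class_counts": df[target].value_counts().to_dict(),
    }

# Tiny stand-in frame with made-up values (hypothetical, not the real dataset).
toy = pd.DataFrame({
    "V1": [-1.36, 1.19, -1.36, -0.97],
    "Scaled_Amount": [0.24, -0.34, 1.16, 0.14],
    "Class": [0, 0, 1, 0],
})
report = check_dataset(toy)
print(report)
```

On the real data, calling `check_dataset(df)` right after `pd.read_csv` should report zero missing values, no non-numeric columns, and target values `[0, 1]`.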

In [2]:
from sklearn.model_selection import train_test_split

# Separate features (X) and target (y)
X = df.drop(columns=['Class'])  # Features
y = df['Class']  # Target variable

# Split the dataset into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

We split the dataset into two parts:

Training Set (80%) → Used to train the model.
Test Set (20%) → Used to evaluate how well the model performs on unseen data.
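Evaluating on held-out data measures generalization: how well the model works on transactions it has not seen. Passing stratify=y keeps the fraud rate nearly identical in both sets, which matters when only about 0.17% of the rows are fraud.

What does the split produce?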
X_train: Features for training (80% of data)
X_test: Features for testing (20% of data)
y_train: Target labels for training
y_test: Target labels for testing
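
Whether the stratified split preserved the fraud rate can be verified with simple arithmetic. This sketch uses the class counts this notebook prints further down (the "Before SMOTE" line for the training set, and the `support` column of the classification reports for the test set); the helper `fraud_rate` is a hypothetical name.

```python
# Class counts as printed elsewhere in this notebook.
train_counts = {0: 226602, 1: 378}
test_counts = {0: 56651, 1: 95}

def fraud_rate(counts: dict) -> float:
    """Fraction of rows labelled 1 (fraud)."""
    return counts[1] / (counts[0] + counts[1])

train_rate = fraud_rate(train_counts)
test_rate = fraud_rate(test_counts)

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())
test_share = n_test / (n_train + n_test)

print(f"train fraud rate: {train_rate:.4%}")
print(f"test fraud rate:  {test_rate:.4%}")
print(f"test share of all rows: {test_share:.3f}")
```

Both fraud rates come out at roughly 0.17%, and the test set holds about 20% of the rows, which is what `test_size=0.2` with `stratify=y` is meant to produce.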

In [3]:
from collections import Counter
from imblearn.over_sampling import SMOTE

# Check class distribution before SMOTE
print("Before SMOTE:", Counter(y_train))

# Apply SMOTE
smote = SMOTE(sampling_strategy=0.2, random_state=42)  # Grow fraud cases to 20% of the majority-class count (about 17% of all rows)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Check class distribution after SMOTE
print("After SMOTE:", Counter(y_train_resampled))


Before SMOTE: Counter({0: 226602, 1: 378})
After SMOTE: Counter({0: 226602, 1: 45320})


Handling Imbalanced Data with SMOTE

Why?
Fraud cases (Class = 1) are very rare in the dataset.
If we train a model as-is, it might ignore fraud cases because they are too few.
Solution? We use SMOTE (Synthetic Minority Over-sampling Technique) to add synthetic fraud samples to the training set (X_train and y_train). The test set is left untouched so that evaluation reflects the real class distribution.

Outcome:
Before SMOTE, the training set is heavily imbalanced: 378 fraud cases against 226,602 normal ones.
After SMOTE, there are 45,320 fraud cases, giving the model far more minority-class examples to learn from.
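
The post-SMOTE count follows directly from the `sampling_strategy` value. A small sketch of the arithmetic, assuming (as the printed counts suggest) that a float `sampling_strategy` is the desired minority-to-majority ratio after resampling:

```python
n_normal = 226602   # majority-class count in y_train (from the output above)
n_fraud = 378       # minority-class count in y_train (from the output above)
ratio = 0.2         # the sampling_strategy value passed to SMOTE

# The minority class is grown until minority / majority reaches the ratio.
target_fraud = int(ratio * n_normal)
synthetic_added = target_fraud - n_fraud

# Share of fraud rows in the whole resampled training set.
fraud_share = target_fraud / (target_fraud + n_normal)

print("fraud rows after SMOTE:", target_fraud)
print("synthetic rows added:  ", synthetic_added)
print(f"fraud share of resampled set: {fraud_share:.1%}")
```

So a ratio of 0.2 means fraud ends up at about one sixth of the resampled training rows rather than 20% of the total, and nearly all of those fraud rows are synthetic.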

Now we train 3 models and compare them:

Logistic Regression (Baseline Model)
Random Forest (More complex, captures non-linear patterns)
XGBoost (Gradient boosting, a common choice for fraud detection)

Model 1: Logistic Regression (Baseline)

Why this model?
Simple, fast model.
Good starting point for classification.
Can show whether the dataset is linearly separable.

What to Expect?
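If accuracy is very high (~99%), that alone tells us little: with about 0.17% fraud in the test set, a model that labels every transaction as normal would already score about 99.8%.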
We need to check Precision, Recall, and F1-score (not just Accuracy).
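
To see why accuracy is a weak yardstick here, it helps to compute what a trivial "always normal" classifier would score. The sketch below uses the test-set class counts that appear as `support` in the classification reports in this notebook.

```python
n_normal_test = 56651  # Class 0 support in the test set
n_fraud_test = 95      # Class 1 support in the test set

# A classifier that always predicts Class 0 gets every normal transaction
# right and every fraud wrong.
baseline_accuracy = n_normal_test / (n_normal_test + n_fraud_test)
baseline_fraud_recall = 0 / n_fraud_test

print(f"always-normal accuracy:     {baseline_accuracy:.4f}")
print(f"always-normal fraud recall: {baseline_fraud_recall:.2f}")
```

Any model whose accuracy is near or below this baseline needs to justify itself through its Class 1 precision and recall.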

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Train Logistic Regression
log_reg = LogisticRegression()
log_reg.fit(X_train_resampled, y_train_resampled)

# Predictions
y_pred_log = log_reg.predict(X_test)

# Evaluate
print("Logistic Regression Results:")
print("Accuracy:", accuracy_score(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

Logistic Regression Results:
Accuracy: 0.9948013956930885
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56651
           1       0.22      0.82      0.35        95

    accuracy                           0.99     56746
   macro avg       0.61      0.91      0.67     56746
weighted avg       1.00      0.99      1.00     56746



Key Observations:

Overall Accuracy:
The model achieves an overall accuracy of 99.48%. This looks impressive, but a model that labels every transaction as normal would score about 99.83% on this test set (56,651 of 56,746), so accuracy alone says little here.

Class Distribution:
The support values show the imbalance in the test set: 56,651 instances of Class 0 (normal transactions) against only 95 instances of Class 1 (fraudulent transactions). SMOTE was applied to the training data only, so the test set keeps the original distribution.

Precision, Recall, and F1-Score by Class:

For Class 0:
Precision: 1.00 is a rounded value, not a literal 100%. With 82% recall on Class 1, about 17 of the 95 frauds were predicted as normal, so a few predicted-normal transactions were actually fraud.
Recall: 1.00 is also rounded. A Class 1 precision of 0.22 implies roughly 280 normal transactions were flagged as fraud, which is still under 0.5% of the 56,651 normal transactions.
F1-Score: 1.00 (rounded) reflects near-perfect precision and recall for Class 0, which is expected for a class that makes up about 99.8% of the test set.

For Class 1:
Precision: 0.22 (22%) means fewer than one in four transactions flagged as fraud was actually fraudulent; the rest were false alarms.
Recall: 0.82 (82%) shows that the model catches most actual frauds (about 78 of 95) but still misses some.
F1-Score: 0.35 is low because F1 is the harmonic mean of precision and recall, so the weak precision drags it down despite the good recall.

Macro and Weighted Averages:

Macro Average:
The macro average is the unweighted mean of the two per-class scores: precision 0.61, recall 0.91, and F1-score 0.67. Because both classes count equally, the weak Class 1 precision pulls it well below the Class 0 scores.

Weighted Average:
The weighted average weights each class by its support. Class 0 holds about 99.8% of the test rows, so these values (1.00 precision and 0.99 recall) essentially repeat the Class 0 scores and hide the problems on Class 1.
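
The difference between the two averages can be reproduced by hand. This sketch plugs in the per-class values printed in the Logistic Regression report; because those inputs are already rounded to two decimals, recomputed averages can differ from the report in the last digit. The helper names are hypothetical.

```python
# Per-class scores and supports as printed in the report above.
precision = {0: 1.00, 1: 0.22}
recall = {0: 1.00, 1: 0.82}
support = {0: 56651, 1: 95}

def macro_avg(scores: dict) -> float:
    """Unweighted mean: every class counts equally."""
    return sum(scores.values()) / len(scores)

def weighted_avg(scores: dict, support: dict) -> float:
    """Mean weighted by the number of true instances of each class."""
    total = sum(support.values())
    return sum(scores[c] * support[c] for c in scores) / total

print(f"macro precision:    {macro_avg(precision):.2f}")
print(f"macro recall:       {macro_avg(recall):.2f}")
print(f"weighted precision: {weighted_avg(precision, support):.2f}")
```

The macro precision lands at 0.61 while the weighted precision rounds to 1.00, which is why the macro row is the more informative one when the minority class is the class of interest.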

Potential Issues:
The model's high accuracy is misleading because of the class imbalance. It finds most frauds but raises far too many false alarms. Since the training data was already oversampled with SMOTE, the next things to try are a higher decision threshold, a different resampling ratio, or a more expressive model, all judged on precision, recall, and F1-score rather than accuracy.

Summary
The logistic regression model's high accuracy comes almost entirely from the majority class (Class 0). On the minority class (Class 1) it reaches good recall (0.82) but poor precision (0.22) and F1-score (0.35), so most of its fraud alerts would be false alarms.

Model 2: Random Forest (Better Handling of Imbalance)

Why this model?
Often handles imbalanced data better than Logistic Regression.
Works well with non-linear data.

What to Expect?
Higher Recall than Logistic Regression.
Less bias toward non-fraud transactions.

In [5]:
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_resampled, y_train_resampled)

# Predictions
y_pred_rf = rf.predict(X_test)

# Evaluate
print("Random Forest Results:")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

Random Forest Results:
Accuracy: 0.9994713283755683
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56651
           1       0.91      0.76      0.83        95

    accuracy                           1.00     56746
   macro avg       0.96      0.88      0.91     56746
weighted avg       1.00      1.00      1.00     56746



Overall Accuracy:
The model achieves a very high accuracy of 99.95%, which corresponds to about 30 misclassified transactions out of 56,746.

Class Distribution:
Similar to the logistic regression results, the class distribution is imbalanced, with 56,651 instances of Class 0 (normal transactions) and only 95 instances of Class 1 (fraudulent transactions).

Precision, Recall, and F1-Score by Class:

For Class 0:
Precision: 1.00 is a rounded value. With 76% recall on Class 1, about 23 of the 95 frauds were predicted as normal.
Recall: 1.00 is also rounded. A Class 1 precision of 0.91 implies that only about 7 normal transactions were flagged as fraud.
F1-Score: 1.00 (rounded) reflects near-perfect precision and recall for Class 0.

For Class 1:
Precision: 0.91 (91%) indicates that most transactions predicted as fraudulent were actually fraudulent, a large reduction in false positives compared to the logistic regression model (0.22).
Recall: 0.76 (76%) shows that the model identifies about three quarters of actual fraudulent transactions, slightly fewer than logistic regression (0.82).
F1-Score: 0.83 reflects strong performance for Class 1, with a much better balance of precision and recall.

Macro and Weighted Averages:

Macro Average:
Macro averages of precision (0.96), recall (0.88), and F1-score (0.91) indicate that the model performs well across both classes, though it still has room for improvement in recall for Class 1.

Weighted Average:
The weighted averages (1.00 precision, 1.00 recall, and 1.00 F1-score) are dominated by Class 0, which holds about 99.8% of the test rows. They confirm that the model is excellent on Class 0 but say almost nothing about Class 1; the Class 1 row and the macro average are the ones to read for that.

Improvement Over Logistic Regression:
Compared to the logistic regression results, the Random Forest model shows large improvements in Class 1 precision (0.22 to 0.91) and F1-score (0.35 to 0.83). Recall, however, dropped from 0.82 to 0.76, contrary to the expectation stated earlier: the model raises far fewer false alarms but misses a few more frauds. Whether that trade is acceptable depends on the relative cost of a missed fraud versus a false alarm.

Summary
The Random Forest model demonstrates high overall accuracy and performs exceptionally well in identifying normal transactions (Class 0). Compared to logistic regression, it detects fraudulent transactions (Class 1) with far higher precision (0.91) at slightly lower recall (0.76). Judged by Class 1 F1-score, this makes Random Forest the more effective model for this problem so far.

Model 3: XGBoost (Gradient Boosting)

Why this model?
Often a strong choice for imbalanced tabular data.
Uses boosting to correct errors made in previous iterations.
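
The idea of correcting earlier errors can be illustrated with a toy booster. This is a simplified sketch, not XGBoost itself: it assumes squared loss, one-dimensional inputs, and depth-1 trees (stumps), whereas XGBoost uses a logistic loss for classification, second-order gradient information, regularization, and deeper trees. All names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to what the current ensemble still gets wrong."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(xs)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # current errors
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, losses

# Toy data: the label switches from 0 to 1 around x = 3..5, with one noisy point.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
predict, losses = boost(xs, ys)
print("loss after round 1: ", round(losses[0], 4))
print("loss after round 20:", round(losses[-1], 4))
```

The training loss shrinks round after round because every new stump is fit to the residuals left by the previous ones, which is the sense in which boosting "corrects errors made in previous iterations".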

In [7]:
from xgboost import XGBClassifier

# Train XGBoost
xgb = XGBClassifier(eval_metric="logloss")  # use_label_encoder removed: this XGBoost version ignores it and only emits a warning
xgb.fit(X_train_resampled, y_train_resampled)

# Predictions
y_pred_xgb = xgb.predict(X_test)

# Evaluate
print("XGBoost Results:")
print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))

XGBoost Results:
Accuracy: 0.9993127268882388
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56651
           1       0.81      0.77      0.79        95

    accuracy                           1.00     56746
   macro avg       0.91      0.88      0.89     56746
weighted avg       1.00      1.00      1.00     56746



Overall Accuracy:
The model achieves a very high accuracy of 99.93%, which corresponds to about 39 misclassified transactions out of 56,746.

Class Distribution:
Similar to the previous models, the class distribution is imbalanced, with 56,651 instances of Class 0 (normal transactions) and only 95 instances of Class 1 (fraudulent transactions).

Precision, Recall, and F1-Score by Class:

For Class 0:
Precision: 1.00 is a rounded value. With 77% recall on Class 1, about 22 of the 95 frauds were predicted as normal.
Recall: 1.00 is also rounded. A Class 1 precision of 0.81 implies that about 17 normal transactions were flagged as fraud.
F1-Score: 1.00 (rounded) reflects near-perfect precision and recall for Class 0.

For Class 1:
Precision: 0.81 (81%) indicates that about four in five transactions predicted as fraudulent were indeed fraudulent, so some false positives remain.
Recall: 0.77 (77%) shows that the model correctly identifies about three quarters of actual fraudulent transactions, but it misses the rest.
F1-Score: 0.79 reflects a reasonable balance of precision and recall for detecting fraudulent transactions.

Macro and Weighted Averages:

Macro Average:
Macro averages of precision (0.91), recall (0.88), and F1-score (0.89) indicate that the model performs well across both classes, though it still has some room for improvement in identifying Class 1.

Weighted Average:
The weighted averages (1.00 precision, 1.00 recall, and 1.00 F1-score) are again dominated by Class 0 and mostly restate its scores. They do not show how well the model identifies Class 1; the Class 1 row and the macro average do.

Comparison with Other Models:
Compared to the Random Forest results, the XGBoost model shows lower precision for Class 1 (81% vs. 91%) and nearly identical recall (77% vs. 76%). In other words, XGBoost catches about as many frauds as Random Forest but raises more false alarms.
Overall, both models perform well, but with these near-default settings the Random Forest model provides the better balance between precision and recall for the minority class. With only 95 frauds in the test set, a one-point difference in recall corresponds to a single transaction, so small gaps should not be over-interpreted.

Summary
The XGBoost model demonstrates high overall accuracy and performs exceptionally well in identifying normal transactions (Class 0). It also detects fraudulent transactions (Class 1) well, with recall on par with Random Forest but lower precision. Since it was run with near-default hyperparameters, tuning may narrow that gap.

Comparison of all Models

Logistic Regression:
Accuracy: 99.48%
Class 0 Precision: 100%
Class 1 Precision: 22%
Class 1 Recall: 82%
F1 Score for Class 1: 35%
Insights: Highest Class 1 recall of the three models, but very low Class 1 precision, so most of its fraud alerts are false alarms. Its accuracy is also below the roughly 99.83% that always predicting Class 0 would achieve.

Random Forest:
Accuracy: 99.95%
Class 0 Precision: 100%
Class 1 Precision: 91%
Class 1 Recall: 76%
F1 Score for Class 1: 83%
Insights: Offers a strong performance for Class 1, with high precision and a good recall. It effectively balances performance across both classes.

XGBoost:
Accuracy: 99.93%
Class 0 Precision: 100%
Class 1 Precision: 81%
Class 1 Recall: 77%
F1 Score for Class 1: 79%
Insights: Provides solid performance for Class 1, with recall on par with Random Forest but precision 10 points lower. It captures many fraudulent transactions but raises more false alarms.

Summary of Model Performance
Logistic Regression catches the most fraud (82% recall) but is impractical at the default threshold because only 22% of its fraud alerts are correct.
Random Forest shows the best balance of precision and recall for Class 1 (F1 of 83%), making it the strongest model in this comparison.
XGBoost matches Random Forest on recall but trails it in Class 1 precision (81% vs. 91%).
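
The percentages become more tangible as counts. The sketch below backs out approximate confusion-matrix counts from the accuracy and Class 1 recall printed for each model. It assumes the printed recall, which is rounded to two decimals, is close enough to recover the number of frauds caught, so the counts are estimates; calling `sklearn.metrics.confusion_matrix` on the actual predictions would give exact values. The helper name is hypothetical.

```python
def approx_confusion(accuracy: float, recall_1: float, support_0: int, support_1: int) -> dict:
    """Estimate confusion-matrix counts from reported accuracy and Class 1 recall."""
    n = support_0 + support_1
    errors = round((1 - accuracy) * n)   # false positives + false negatives
    tp = round(recall_1 * support_1)     # frauds caught
    fn = support_1 - tp                  # frauds missed
    fp = errors - fn                     # normal transactions flagged as fraud
    tn = support_0 - fp
    return {"tn": tn, "fp": fp, "fn": fn, "tp": tp, "precision_1": tp / (tp + fp)}

# (accuracy, Class 1 recall) as printed in the three result cells.
reports = {
    "Logistic Regression": (0.9948013956930885, 0.82),
    "Random Forest": (0.9994713283755683, 0.76),
    "XGBoost": (0.9993127268882388, 0.77),
}
results = {
    name: approx_confusion(acc, rec, support_0=56651, support_1=95)
    for name, (acc, rec) in reports.items()
}
for name, r in results.items():
    print(f"{name}: caught {r['tp']}, missed {r['fn']}, false alarms {r['fp']}")
```

The implied Class 1 precision values match the printed reports, which is a useful consistency check on the estimates. Seen this way, the three models catch a similar number of frauds and differ mainly in how many false alarms they raise.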

Recommendations for Improvement

Address Class Imbalance:
Resampling Techniques: SMOTE is already applied here with a minority-to-majority ratio of 0.2. Try other ratios, or undersample the majority class, and compare the resulting Class 1 precision and recall.
Class Weighting: Adjust the class weights in the model to penalize misclassifications of the minority class more heavily.
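
As an illustration of class weighting, the sketch below computes scikit-learn's "balanced" weights for the training label counts reported before SMOTE. The labels are rebuilt from those counts rather than loaded from the data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild the training label distribution reported before SMOTE.
y_counts = {0: 226602, 1: 378}
y = np.concatenate([np.full(n, label) for label, n in y_counts.items()])

# "balanced" weights are n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))
print(class_weight)

# The same imbalance expressed as a negative-to-positive ratio.
neg_pos_ratio = y_counts[0] / y_counts[1]
print(f"negatives per positive: {neg_pos_ratio:.0f}")
```

A dictionary like this, or simply `class_weight="balanced"`, can be passed to `LogisticRegression` or `RandomForestClassifier`; for `XGBClassifier`, the negative-to-positive ratio is a common starting value for `scale_pos_weight`. Class weighting is typically tried on the original training data as an alternative to SMOTE rather than stacked on top of it.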

Feature Engineering:
Create New Features: Explore additional features that may improve the model’s ability to distinguish between classes (e.g., transaction patterns, time of day).

Feature Selection: Use techniques like Recursive Feature Elimination (RFE) or feature importance from Random Forest/XGBoost to select the most predictive features.

Hyperparameter Tuning:
Perform grid search or randomized search for hyperparameter optimization on Random Forest and XGBoost, scoring candidates on a fraud-focused metric such as Class 1 F1 rather than accuracy.

Ensemble Methods:
Consider combining different models (e.g., using stacking or blending) to leverage the strengths of each model and improve overall classification performance.

Threshold Adjustment:
Adjust the decision threshold for Class 1 predictions to balance precision and recall better, especially if the cost of false negatives is high.
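
Threshold adjustment can be sketched as follows. The example uses synthetic imbalanced data from `make_classification` as a stand-in, not the credit card dataset, and the three thresholds are arbitrary illustrations. In practice the threshold should be chosen on a validation split, not on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with about 5% positives (assumption for illustration).
X, y = make_classification(n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
fraud_proba = model.predict_proba(X_te)[:, 1]   # estimated probability of Class 1

rows = []
for threshold in (0.2, 0.5, 0.8):
    # predict() uses 0.5 implicitly; here the cut-off is applied by hand.
    y_pred = (fraud_proba >= threshold).astype(int)
    rows.append((
        threshold,
        int(y_pred.sum()),
        precision_score(y_te, y_pred, zero_division=0),
        recall_score(y_te, y_pred),
    ))

for threshold, flagged, prec, rec in rows:
    print(f"threshold={threshold:.1f}  flagged={flagged:4d}  precision={prec:.2f}  recall={rec:.2f}")
```

Raising the threshold flags fewer transactions, so recall can only stay the same or fall, while precision usually rises. For the logistic regression model in this notebook, a higher threshold is one plausible way to reduce its many false alarms.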

Cross-Validation:
Implement k-fold cross-validation to ensure model robustness and generalization across different subsets of the data.
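
One detail matters when cross-validation is combined with oversampling: resampling should happen inside each fold, on the training part only, so that validation folds keep the real class ratio and contain no synthetic rows. The sketch below shows the pattern on synthetic stand-in data. To stay dependency-light it swaps in plain random duplication of minority rows in place of SMOTE (the helper `oversample_minority` is hypothetical); with imbalanced-learn, placing `SMOTE` in an `imblearn.pipeline.Pipeline` achieves the same fold-wise behaviour.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data with about 5% positives (assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42)
rng = np.random.default_rng(42)

def oversample_minority(X_part, y_part, ratio=0.2):
    """Randomly duplicate Class 1 rows until minority/majority reaches `ratio`."""
    idx_pos = np.flatnonzero(y_part == 1)
    idx_neg = np.flatnonzero(y_part == 0)
    n_extra = max(int(ratio * len(idx_neg)) - len(idx_pos), 0)
    extra = rng.choice(idx_pos, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y_part)), extra])
    return X_part[keep], y_part[keep]

fold_f1 = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in cv.split(X, y):
    # Resample the training fold only; the validation fold is left untouched.
    X_res, y_res = oversample_minority(X[train_idx], y[train_idx])
    clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_res, y_res)
    fold_f1.append(f1_score(y[val_idx], clf.predict(X[val_idx])))

print("F1 per fold:", np.round(fold_f1, 3))
print("mean F1:", round(float(np.mean(fold_f1)), 3))
```

The spread of the per-fold scores gives a sense of how much a single train/test split, like the one used in this notebook, can vary by chance.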

Advanced Techniques:
Explore other algorithms such as LightGBM (another gradient boosting implementation) or neural networks, which may capture complex relationships in the data better.

Conclusion
While Random Forest currently offers the best balance for this classification task, applying the above strategies can enhance the detection of fraudulent transactions, improving model performance and robustness.

In [None]:
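from sklearn.model_selection import RandomizedSearchCV

# Candidate values; RandomizedSearchCV samples combinations from these lists
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [5, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Score on Class 1 F1 instead of the default accuracy, which the imbalance makes uninformative
rf_tuned = RandomizedSearchCV(
    RandomForestClassifier(random_state=42), param_grid,
    scoring='f1', cv=5, random_state=42, n_jobs=-1,
)
# Fit on the original split: SMOTE-resampled rows would leak synthetic points into the validation folds
rf_tuned.fit(X_train, y_train)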