Credit Card Fraud Detection Using Supervised Learning

This notebook demonstrates a complete supervised-learning workflow for detecting fraudulent credit-card transactions. It covers the problem description, exploratory data analysis (EDA), preprocessing, modeling, evaluation, and discussion.

Dataset: [Kaggle Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud)

Target: `Class` → 1 = Fraud, 0 = Legitimate

Challenge: The dataset is highly imbalanced (492 frauds out of 284,807 transactions, about 0.17%), so precision, recall, and ROC-AUC are more informative than accuracy.
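
To see why accuracy misleads on imbalanced data, here is a minimal sketch on made-up labels (not the Kaggle data). A classifier that never flags fraud still scores very high accuracy while catching nothing. The `binary_metrics` helper and the 2-in-1000 fraud rate are assumptions for illustration only.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Report 0.0 when a denominator is empty instead of dividing by zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical toy labels: 2 frauds among 1000 transactions
y_true = [1, 1] + [0] * 998
always_legit = [0] * 1000  # predicts "legitimate" for every transaction

acc, prec, rec = binary_metrics(y_true, always_legit)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
# prints: accuracy=0.998 precision=0.000 recall=0.000
```

The 99.8% accuracy hides the fact that every fraud was missed, which is why the evaluation in this notebook relies on precision, recall, and ROC-AUC.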


1. Imports & Data Loading

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, confusion_matrix, roc_auc_score, 
                             roc_curve, ConfusionMatrixDisplay)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Load the dataset (place creditcard.csv in the same folder or update the path)
df = pd.read_csv("creditcard.csv")

print("Shape:", df.shape)
df.head()


2. Exploratory Data Analysis (EDA)

In [None]:

# Missing values
print("Total missing values:", df.isnull().sum().sum())

# Class balance
sns.countplot(x='Class', data=df, palette='Set2')
plt.title("Class Distribution – Legit (0) vs Fraud (1)")
plt.show()

print("Class balance (proportion):")
print(df['Class'].value_counts(normalize=True))

# Distributions for Amount & Time
fig, ax = plt.subplots(1,2, figsize=(12,5))
sns.histplot(df['Amount'], bins=50, ax=ax[0], color='teal')
sns.histplot(df['Time'],   bins=50, ax=ax[1], color='orange')
ax[0].set_title("Transaction Amount Distribution")
ax[1].set_title("Transaction Time Distribution")
plt.show()

# Correlation heatmap
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(), cmap='coolwarm', center=0)
plt.title("Feature Correlation Matrix")
plt.show()


3. Preprocessing

In [None]:

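X = df.drop(columns=['Class'])
y = df['Class']

# Stratified train-test split (done first so the scaler never sees test data)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Scale 'Amount' and 'Time': fit on the training set only, then apply to both splits
scaler = StandardScaler()
X_train[['Amount','Time']] = scaler.fit_transform(X_train[['Amount','Time']])
X_test[['Amount','Time']] = scaler.transform(X_test[['Amount','Time']])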

print("Train shape:", X_train.shape)
print("Fraud ratio in train:", y_train.mean())
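
A note on scaling: the mean and standard deviation are best estimated from the training rows only, so that no information about the test set leaks into preprocessing. The following is a minimal plain-Python sketch of that idea. The `fit_standardizer` and `standardize` helpers and the sample amounts are hypothetical; the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and population std from the training values only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    if std == 0:
        std = 1.0  # constant column: avoid dividing by zero
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]

# Hypothetical transaction amounts
train_amounts = [10.0, 20.0, 30.0, 40.0]
test_amounts = [25.0, 400.0]

mean, std = fit_standardizer(train_amounts)
train_scaled = standardize(train_amounts, mean, std)
test_scaled = standardize(test_amounts, mean, std)  # reuses the training statistics

print("train mean/std:", mean, round(std, 4))
print("scaled test amounts:", [round(v, 3) for v in test_scaled])
```

The large test amount ends up far from zero after scaling, as it should: it is measured against what the model saw during training, not against statistics it helped to define.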


4. Modeling: Logistic Regression (Baseline)

In [None]:

lr = LogisticRegression(class_weight='balanced', max_iter=500, random_state=42)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Logistic Regression Results")
print(classification_report(y_test, y_pred_lr, digits=4))
print("ROC-AUC:", roc_auc_score(y_test, lr.predict_proba(X_test)[:,1]))
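
For context, scikit-learn documents the `class_weight='balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python on made-up labels. The `balanced_class_weights` helper is hypothetical and not part of scikit-learn.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical toy labels: 5 frauds among 1000 transactions
y_toy = [1] * 5 + [0] * 995
weights = balanced_class_weights(y_toy)
print(weights)
```

With these weights each class contributes the same total weight to the loss (500 apiece in this toy case), so the rare fraud class is not drowned out by the legitimate majority.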


5. Modeling: Random Forest

In [None]:

rf = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Results")
print(classification_report(y_test, y_pred_rf, digits=4))
print("ROC-AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:,1]))


6. Results Visualization: Confusion Matrix & ROC

In [None]:

# Confusion Matrix
ConfusionMatrixDisplay.from_estimator(rf, X_test, y_test, cmap='Blues')
plt.title("Confusion Matrix – Random Forest")
plt.show()

# ROC Curve
rf_proba = rf.predict_proba(X_test)[:,1]  # fraud probabilities, computed once
fpr, tpr, _ = roc_curve(y_test, rf_proba)
roc_auc = roc_auc_score(y_test, rf_proba)

plt.figure(figsize=(6,5))
plt.plot(fpr, tpr, label=f"ROC Curve (AUC={roc_auc:.3f})")
plt.plot([0,1],[0,1],'--',color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve – Random Forest")
plt.legend()
plt.show()
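
The `predict` calls in the modeling cells effectively use a 0.5 probability cut-off, whereas the ROC curve is traced by sweeping that cut-off. A minimal sketch of how moving the threshold trades precision for recall follows. It uses made-up scores rather than model output, and the `precision_recall_at` helper is hypothetical.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are flagged as fraud."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted fraud probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.02, 0.05, 0.10, 0.15, 0.30, 0.45, 0.40, 0.60, 0.70, 0.95]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
# prints:
# threshold=0.3 precision=0.50 recall=1.00
# threshold=0.5 precision=0.67 recall=0.67
# threshold=0.8 precision=1.00 recall=0.33
```

On the real models, `sklearn.metrics.precision_recall_curve` performs the same sweep over `predict_proba` scores, which can help pick a cut-off that matches the cost of a missed fraud versus a false alarm.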


7. SMOTE Oversampling

In [None]:

# Uncomment to use SMOTE (requires the imbalanced-learn package: pip install imbalanced-learn)
# from imblearn.over_sampling import SMOTE
# sm = SMOTE(random_state=42)
# X_res, y_res = sm.fit_resample(X_train, y_train)
# print("Before SMOTE fraud ratio:", y_train.mean())
# print("After SMOTE fraud ratio:", y_res.mean())
# rf_sm = RandomForestClassifier(n_estimators=100, random_state=42)
# rf_sm.fit(X_res, y_res)
# y_pred_sm = rf_sm.predict(X_test)
# print("Random Forest + SMOTE Results")
# print(classification_report(y_test, y_pred_sm, digits=4))
# print("ROC-AUC:", roc_auc_score(y_test, rf_sm.predict_proba(X_test)[:,1]))
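
For intuition about what SMOTE does: each synthetic minority sample is created by interpolating between a minority point and one of its nearest minority-class neighbours. The sketch below is a simplified plain-Python illustration of that idea, not imblearn's implementation. The `smote_like` helper, its parameters, and the toy points are hypothetical.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating towards nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D minority-class (fraud) points
fraud_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(fraud_points, n_new=3)
for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Every synthetic point lies on a segment between two real fraud points, so the minority region is filled in rather than merely duplicated. As in the commented cell above, resampling should be applied to the training split only so that the test set keeps its original class balance.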


8. Discussion & Conclusion

The dataset is extremely imbalanced, so recall and ROC-AUC matter far more than accuracy.
Logistic Regression with class weighting provides a simple baseline.
Random Forest generally improves on the baseline, mainly through a better precision/recall trade-off.
SMOTE can further improve recall, usually at the expense of precision.
Next steps: tune the Random Forest hyperparameters with GridSearchCV and try anomaly-detection methods.