# Credit Card Fraud Detection: Exploratory Modeling

## Scope and Disclaimer
This notebook explores baseline machine learning models to support fraud risk analysis.
The goal is not to build a production-ready system, but to understand trade-offs,
evaluation metrics, and decision thresholds in highly imbalanced data.


## Why Modeling After EDA

After understanding fraud patterns, class imbalance, and data limitations,
modeling is used to:
- Evaluate signal strength in the data
- Understand false positive vs false negative trade-offs
- Support risk-based decision making


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
    precision_recall_curve
)

import matplotlib.pyplot as plt
import seaborn as sns


## Data Preparation


In [None]:
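# `df` is assumed to be the transactions DataFrame already loaded earlier,
# with the binary fraud label stored in the "Class" column.
X = df.drop("Class", axis=1)
y = df["Class"]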

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

Stratification preserves the fraud proportion in both the train and test sets.
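
The effect of `stratify` can be checked directly. The sketch below uses synthetic labels rather than the notebook's data; the 0.5% positive rate and the names `X_demo` and `y_demo` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 50 positives (assumed 0.5% rate).
rng = np.random.default_rng(42)
y_demo = np.zeros(10_000, dtype=int)
y_demo[:50] = 1
X_demo = rng.normal(size=(10_000, 3))

# Same split settings as the notebook, with and without stratification.
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(f"overall positive rate:     {y_demo.mean():.4f}")
print(f"stratified train / test:   {y_tr_s.mean():.4f} / {y_te_s.mean():.4f}")
print(f"unstratified train / test: {y_tr_r.mean():.4f} / {y_te_r.mean():.4f}")
```

With so few positives, an unstratified split can shift the test-set fraud rate noticeably, which makes metrics harder to compare across runs.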

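## Baseline Model: Logistic Regression


In [None]:
# Fit the scaler on the training split only, then reuse it on the test split,
# so that no test-set statistics leak into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# class_weight="balanced" reweights classes inversely to their frequency.
log_reg = LogisticRegression(class_weight="balanced", max_iter=1000)
log_reg.fit(X_train_scaled, y_train)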
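
One possible way to keep scaling and the classifier together is a scikit-learn `Pipeline`, which fits the scaler on the training data only and applies it automatically at prediction time. This is a sketch on synthetic data, not the notebook's own approach; `X_demo`, `y_demo`, and `log_reg_pipe` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 2% positives) standing in for the real transactions.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.98], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# The pipeline scales inside fit/predict, so the test split is never used for fitting.
log_reg_pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
log_reg_pipe.fit(X_tr, y_tr)

# Probability of the positive class for each held-out row.
proba_lr = log_reg_pipe.predict_proba(X_te)[:, 1]
print(proba_lr[:5])
```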


## Tree-Based Model: Random Forest


In [None]:
rf = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",
    random_state=42
)

rf.fit(X_train, y_train)

Tree-based models can capture non-linear fraud patterns and do not require feature scaling, so the forest is fit on the unscaled features.

## Model Evaluation


In [None]:
# Predicted probability of the positive (fraud) class on the held-out test set.
y_pred_prob_rf = rf.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_pred_prob_rf)
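
ROC AUC can look optimistic on highly imbalanced data, so it may help to report average precision (a summary of the precision-recall curve) next to it. The sketch below uses synthetic data and a smaller forest; `rf_demo` and the other names are assumptions, not the notebook's fitted model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 2% positives).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.98], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

rf_demo = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
).fit(X_tr, y_tr)
scores = rf_demo.predict_proba(X_te)[:, 1]

roc_auc = roc_auc_score(y_te, scores)
pr_auc = average_precision_score(y_te, scores)
# A random ranker scores about 0.5 ROC AUC, but its average precision
# is roughly the positive rate, which is a much lower bar here.
baseline = y_te.mean()

print(f"ROC AUC:            {roc_auc:.3f} (random baseline 0.5)")
print(f"Average precision:  {pr_auc:.3f} (random baseline ~{baseline:.3f})")
```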


In [None]:
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob_rf)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()
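
The arrays returned by `precision_recall_curve` can also be used to look up a threshold that meets a recall target. This is a minimal sketch on synthetic scores; the 80% target and all variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic labels (about 2% positives) and scores where positives tend to score higher.
rng = np.random.default_rng(0)
y_true = (rng.random(5000) < 0.02).astype(int)
scores = np.clip(rng.normal(0.2 + 0.5 * y_true, 0.15), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, scores)

target_recall = 0.80  # hypothetical business requirement

# precision and recall have one more entry than thresholds (the final
# precision=1, recall=0 point), so drop it before indexing thresholds.
meets_target = np.where(recall[:-1] >= target_recall)[0]

# Thresholds are sorted ascending and recall falls as the threshold rises,
# so the last qualifying index is the highest threshold meeting the target.
best = meets_target[-1]
chosen_threshold = thresholds[best]

print(f"threshold={chosen_threshold:.3f}, "
      f"precision={precision[best]:.3f}, recall={recall[best]:.3f}")
```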


## Confusion Matrix and Business Interpretation


In [None]:
y_pred = (y_pred_prob_rf >= 0.5).astype(int)
confusion_matrix(y_test, y_pred)
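
For a business reading, the four cells of the matrix can be unpacked by name. The sketch uses a tiny hand-made example rather than the notebook's predictions; for binary labels, scikit-learn orders the flattened matrix as TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made example: 1 = fraud, 0 = legitimate.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 0, 0, 1, 0, 1, 0])

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

precision_demo = tp / (tp + fp)  # share of flagged transactions that are fraud
recall_demo = tp / (tp + fn)     # share of fraud that was caught

print(f"legitimate passed (TN): {tn}")
print(f"false alarms (FP):      {fp}")
print(f"missed fraud (FN):      {fn}")
print(f"fraud caught (TP):      {tp}")
print(f"precision={precision_demo:.2f}, recall={recall_demo:.2f}")
```

False positives translate into review workload and customer friction, while false negatives translate into direct fraud losses.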


## Threshold Sensitivity Analysis


In [None]:
for t in [0.1, 0.3, 0.5]:
    y_pred_t = (y_pred_prob_rf >= t).astype(int)
    print(f"Threshold: {t}")
    print(confusion_matrix(y_test, y_pred_t))

Lower thresholds increase recall but also raise operational cost, because more legitimate transactions are flagged for review.
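
Raw confusion matrices are hard to compare across thresholds, so one option is to collect the key numbers in a small table. This sketch runs on synthetic scores; the column names and the `sweep` DataFrame are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Synthetic labels (about 2% positives) and scores where positives tend to score higher.
rng = np.random.default_rng(1)
y_true = (rng.random(5000) < 0.02).astype(int)
scores = np.clip(rng.normal(0.2 + 0.5 * y_true, 0.15), 0, 1)

rows = []
for t in [0.1, 0.3, 0.5]:
    y_hat = (scores >= t).astype(int)
    rows.append({
        "threshold": t,
        "flagged": int(y_hat.sum()),  # transactions sent for review
        "precision": precision_score(y_true, y_hat, zero_division=0),
        "recall": recall_score(y_true, y_hat, zero_division=0),
    })

sweep = pd.DataFrame(rows).set_index("threshold")
print(sweep.round(3))
```

The `flagged` column is a rough proxy for review workload at each threshold.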


## Feature Importance (Exploratory)


In [None]:
importances = pd.Series(
    rf.feature_importances_,
    index=X.columns
).sort_values(ascending=False)

importances.head(10)

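The PCA-transformed features limit interpretability: importances indicate which components carry signal, not what they mean in business terms.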
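
The `feature_importances_` attribute is impurity-based and computed from the training data. Permutation importance on held-out data is one alternative check, sketched below on synthetic data. It is a swapped-in technique, not something the notebook uses, and it does not make PCA components any easier to interpret. The feature names `f0` to `f5` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=3000, n_features=6, n_informative=3, weights=[0.95], random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

rf_demo = RandomForestClassifier(
    n_estimators=30, class_weight="balanced", random_state=42
).fit(X_tr, y_tr)

# Shuffle one column at a time on the test split and measure the drop
# in average precision; larger drops suggest the model relies on that column.
result = permutation_importance(
    rf_demo, X_te, y_te, scoring="average_precision", n_repeats=5, random_state=42
)
perm = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(
    ascending=False
)
print(perm.round(4))
```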

## Model Limitations

- Anonymized features restrict explainability
- No cost matrix applied
- No resampling techniques used
- Results should not be used in production
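
As an illustration of what applying a cost matrix could look like, the sketch below picks the threshold that minimizes total cost on synthetic scores. The two cost values are invented for the example and would need to come from the business; they are not taken from this analysis.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic labels (about 2% positives) and scores where positives tend to score higher.
rng = np.random.default_rng(2)
y_true = (rng.random(5000) < 0.02).astype(int)
scores = np.clip(rng.normal(0.2 + 0.5 * y_true, 0.15), 0, 1)

COST_FP = 5.0    # hypothetical cost of reviewing a legitimate transaction
COST_FN = 200.0  # hypothetical loss from one missed fraud


def total_cost(threshold):
    """Total cost of false positives and false negatives at a given threshold."""
    y_hat = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return COST_FP * fp + COST_FN * fn


# Candidate thresholds from 0.05 to 0.95 in steps of 0.05.
grid = np.round(np.linspace(0.05, 0.95, 19), 2)
costs = np.array([total_cost(t) for t in grid])
best_threshold = grid[costs.argmin()]

print(f"cost at 0.5:     {total_cost(0.5):.0f}")
print(f"lowest cost:     {costs.min():.0f} at threshold {best_threshold:.2f}")
```

When a missed fraud is assumed to cost far more than a review, the cost-minimizing threshold tends to sit below 0.5.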


## Key Takeaways

- Fraud detection requires metric-aware evaluation
- Recall is often prioritized over accuracy, because accuracy is dominated by the majority (non-fraud) class
- Threshold choice is a business decision
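
A small worked example shows why accuracy alone is a poor guide here. It uses made-up labels with an assumed 0.2% fraud rate and a "model" that never flags anything.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10,000 transactions, 20 of them fraud (assumed 0.2% rate).
y_true = np.zeros(10_000, dtype=int)
y_true[:20] = 1

# A trivial classifier that labels every transaction as legitimate.
y_all_legit = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_all_legit)
rec = recall_score(y_true, y_all_legit, zero_division=0)

print(f"accuracy={acc:.4f}, recall={rec:.4f}")
# prints: accuracy=0.9980, recall=0.0000
```

The classifier is 99.8% accurate while catching no fraud at all.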


## Connection with EDA Insights

Model behavior aligns with EDA findings regarding:
- Class imbalance
- Transaction amount patterns
- Temporal clustering
