# Comprehensive Research: Credit Card Fraud Detection

## 1. Environment & Problem Definition
**Objective**: Identify fraudulent transactions in a highly imbalanced dataset (0.17% positive class in the real data; the simulation below uses 2% for visibility).
**Metric**: Area Under Precision-Recall Curve (AUPRC) & Recall at Precision > 0.8.
**Research Questions**:
1.  Does SMOTE improve the classifier or just add noise?
2.  What is the optimal Decision Threshold for a business maximizing Recall?
3.  Which features drive fraud: Time, Amount, or V-features?
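
The second metric, recall at a precision floor, is not a built-in scikit-learn scorer. Below is a minimal sketch of one way to compute it from the precision-recall curve. The helper name `recall_at_precision` and the toy scores are assumptions made for illustration, and the sketch uses `>=` rather than a strict `>` for the floor.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, y_score, min_precision=0.8):
    """Highest recall among operating points with precision >= min_precision.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    ok = precision >= min_precision
    # The curve always ends at (recall=0, precision=1), so `ok` is never empty.
    return float(recall[ok].max())

# Toy data: 3 frauds among 10 transactions (illustrative values only).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.40, 0.80, 0.90])

r = recall_at_precision(y_true, y_score, min_precision=0.8)
print(f"Recall at precision >= 0.8: {r:.3f}")
```

In this toy case one legitimate transaction (score 0.60) outranks a fraud (score 0.40), so the third fraud can only be captured by accepting a precision of 0.75, which is below the floor.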

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, auc, make_scorer
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# Set plotting style for research paper quality
sns.set_theme(style="whitegrid", context="notebook", palette="muted")  # set_theme is the current name for sns.set
plt.rcParams['figure.figsize'] = (12, 6)

# --- STEP 1: DATA GENERATION (Simulating the Problem) ---
# We generate "Dirty" data to demonstrate cleaning.
def generate_research_data(n=20000, seed=42):
    np.random.seed(seed) # Fix the seed so the generated data is reproducible
    n_fraud = int(n * 0.02) # 2% Fraud for visibility (Real 0.17% is harder to viz)
    n_normal = n - n_fraud
    
    # Feature V1: Normal dist. Fraud is shifted.
    v1_norm = np.random.normal(0, 1, n_normal)
    v1_fraud = np.random.normal(3, 1.5, n_fraud) # Shifted right
    
    # Feature Amount: Exponential. Fraud has some high value outliers.
    amt_norm = np.random.exponential(100, n_normal)
    amt_fraud = np.random.exponential(150, n_fraud) + np.random.choice([0, 1000, 5000], n_fraud, p=[0.8, 0.15, 0.05])
    
    X = pd.DataFrame({
        "V1": np.concatenate([v1_norm, v1_fraud]),
        "Amount": np.concatenate([amt_norm, amt_fraud])
    })
    y = np.array([0]*n_normal + [1]*n_fraud)
    
    # Add Missing Values (Simulation)
    mask = np.random.random(n) < 0.05 # 5% missing
    X.loc[mask, "V1"] = np.nan
    
    return X, y

X, y = generate_research_data()
print(f"Data Generated: {X.shape}. Fraud Ratio: {y.mean():.2%}")
print("Missing Values:\n", X.isnull().sum())

## 2. Exploratory Data Analysis (EDA)
Before modeling, we must understand the distribution. 
*   **Hypothesis**: Fraud transactions involve uncommon Amounts (either very small or very large).

In [None]:
# 2.1 Univariate Analysis: Amount Distribution
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
sns.histplot(X['Amount'], bins=50, kde=True)
plt.title("Amount Distribution (Raw)")

# 2.2 Bivariate: Amount vs Class
plt.subplot(1, 2, 2)
sns.boxplot(x=y, y=X['Amount'])
plt.title("Amount by Class (0=Normal, 1=Fraud)")
plt.show()

**Observation**: 'Amount' is heavily right-skewed, and the boxplot shows extreme outliers in the Fraud class. We address the skew with a log transform in the next step.

## 3. Data Cleaning & Feature Engineering

In [None]:

# 3.1 Imputation
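# Note: the median is computed on the full dataset here for simplicity. To avoid
# leaking test-set statistics, an imputer could instead be fitted inside the pipeline.
print("Imputing V1 with the median (robust to outliers)...")
X['V1'] = X['V1'].fillna(X['V1'].median())  # plain assignment; chained inplace fillna is unreliable in recent pandas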

# 3.2 Log Transformation
# Standard convention for financial data: log(1 + x)
X['Log_Amount'] = np.log1p(X['Amount'])

# Verify Improvement
plt.figure(figsize=(8, 4))
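# common_norm=False normalizes each class separately; otherwise the 2% fraud curve is nearly invisible
sns.kdeplot(data=X, x='Log_Amount', hue=y, fill=True, common_norm=False)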
plt.title("Log-Amount Distribution by Class")
plt.show()

## 4. Modeling Strategy
We will use **SMOTE (Synthetic Minority Over-sampling Technique)**. 

**Critical Research Concept**: 
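SMOTE must *only* be applied to the training folds during Cross-Validation. SMOTE creates synthetic samples by interpolating between a minority point and its Nearest Neighbors, so if we apply it to the whole dataset before splitting, synthetic training points are built from samples that end up in the Test set. That leaks Test information into Train and invalidates our results.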
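
To make the split-then-resample order concrete, here is a minimal sketch using a simplified SMOTE-style interpolation written in plain NumPy. It is an illustration under stated assumptions, not imblearn's implementation: the function `smote_like`, the toy data, and the fixed index split are all hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Simplified SMOTE-style oversampling (illustration only, not imblearn's code).

    Each synthetic point lies on the segment between a random minority sample
    and one of its k nearest minority neighbours.
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority points.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbours
    base = rng.integers(0, n, size=n_new)        # random minority seeds
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy data: 190 normal rows followed by 10 fraud rows (illustrative only).
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(0, 1, (190, 2)), data_rng.normal(3, 1, (10, 2))])
y = np.array([0] * 190 + [1] * 10)

# Split FIRST (38 normal + 2 fraud held out), then resample only the training part.
test_idx = np.r_[0:38, 190:192]
train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

X_min_train = X_train[y_train == 1]
X_syn = smote_like(X_min_train, n_new=20, rng=1)
X_train_res = np.vstack([X_train, X_syn])
y_train_res = np.concatenate([y_train, np.ones(len(X_syn), dtype=int)])

print("Train fraud before/after:", int(y_train.sum()), int(y_train_res.sum()))
print("Test rows (untouched):", len(X_test))
```

Because the synthetic rows are built only from training-set fraud samples, no test row influences them. Inside cross-validation the same ordering has to hold for every fold, which is what placing the sampler in a pipeline achieves.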

In [None]:
# 4.1 Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X[['V1', 'Log_Amount']], y, test_size=0.2, stratify=y, random_state=42)

# 4.2 Define Pipeline
# ImbPipeline allows resampling inside the CV loop
pipeline = ImbPipeline([
    ('scaler', RobustScaler()), # Scale features using Median/IQR
    ('smote', SMOTE(sampling_strategy=0.1, random_state=42)), # Oversample Fraud to a 0.1 minority/majority ratio (about 9% of training rows)
    ('clf', RandomForestClassifier(random_state=42))
])

# 4.3 Hyperparameter Tuning (GridSearch)
# We explore: Do deeper trees help? Does balanced class weighting help?
param_grid = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [5, 10, None],
    'clf__class_weight': [None, 'balanced']
}

# Custom Scorer: AUPRC is better than AUC-ROC for imbalance
def auprc_score(y_true, y_pred_proba):
    precision, recall, _ = precision_recall_curve(y_true, y_pred_proba)
    return auc(recall, precision)

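# scikit-learn >= 1.4: `response_method` replaces the deprecated `needs_proba=True`
# (on older versions, use make_scorer(auprc_score, needs_proba=True) instead).
scorer = make_scorer(auprc_score, response_method="predict_proba")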

print("Starting Grid Search...")
grid = GridSearchCV(pipeline, param_grid, cv=3, scoring=scorer, verbose=1)
grid.fit(X_train, y_train)

print(f"Best Parameters: {grid.best_params_}")
print(f"Best CV AUPRC: {grid.best_score_:.4f}")

## 5. Diagnostics & Evaluation
Accuracy alone is misleading here: with 2% fraud, a model that never flags a transaction is 98% accurate yet catches nothing. We need to visualize the trade-off between Precision and Recall.
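
A short sketch of why accuracy misleads on imbalanced labels. The labels are made up to match the 2% simulated fraud ratio, and the "model" is a deliberately useless baseline that never flags anything.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, average_precision_score

# Illustrative labels: 20 frauds in 1,000 transactions (2%).
y_true = np.array([0] * 980 + [1] * 20)

# Baseline that never flags fraud: every prediction is 0, every score is 0.0.
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
ap = average_precision_score(y_true, y_score)
print(f"Accuracy: {acc:.3f}  Recall: {rec:.3f}  Average precision: {ap:.3f}")
```

The baseline scores high on accuracy while its recall is zero, and its average precision collapses to the fraud prevalence. That prevalence is the floor any precision-recall based metric should be compared against.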

In [None]:
# 5.1 Learning Curve (Bias vs Variance Analysis)
train_sizes, train_scores, val_scores = learning_curve(
    grid.best_estimator_, X_train, y_train, cv=3, scoring=scorer, train_sizes=np.linspace(0.1, 1.0, 5)
)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, np.mean(train_scores, axis=1), 'o-', label="Training Score")
plt.plot(train_sizes, np.mean(val_scores, axis=1), 'o-', label="Validation Score")
plt.title("Learning Curve: Are we Overfitting?")
plt.xlabel("Training Examples")
plt.ylabel("AUPRC Score")
plt.legend()
plt.show()

# 5.2 Precision-Recall Curve on Test Set
best_model = grid.best_estimator_
probas = best_model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probas)

plt.figure(figsize=(10, 6))
plt.plot(recall, precision, label=f'Model (AUPRC={auc(recall, precision):.2f})')  # trapezoidal area, not Average Precision
plt.xlabel('Recall (Fraud Captured)')
plt.ylabel('Precision (True Fraud %)')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)  # a bare plt.grid() toggles the grid, which would switch off the one enabled by "whitegrid"
plt.show()

## 6. Threshold Moving (Optimization)
The default decision threshold is 0.5. For fraud, however, the error costs are asymmetric: missing a fraud (False Negative) costs $10,000, while calling a customer about a false alarm (False Positive) costs $5. We should therefore lower the threshold, trading some Precision for higher Recall.
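
Given the two costs quoted above, one alternative to fixing a recall target is to pick the threshold that minimizes total cost. The sketch below assumes a hypothetical helper `total_cost` and synthetic scores drawn from Beta distributions; it does not use the notebook's model output.

```python
import numpy as np

COST_FN = 10_000  # cost of a missed fraud, as stated in the text
COST_FP = 5       # cost of an unnecessary customer call, as stated in the text

def total_cost(y_true, y_score, threshold):
    """Total cost of flagging every transaction with score >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    flagged = np.asarray(y_score) >= threshold
    fn = int(np.sum(y_true & ~flagged))   # frauds that were not flagged
    fp = int(np.sum(~y_true & flagged))   # normal transactions that were flagged
    return COST_FN * fn + COST_FP * fp

# Synthetic scores for illustration only: frauds tend to score higher.
rng = np.random.default_rng(0)
y_true = np.array([0] * 980 + [1] * 20)
y_score = np.where(y_true == 1, rng.beta(5, 2, 1000), rng.beta(2, 5, 1000))

candidate_thresholds = np.arange(1, 100) / 100   # 0.01 ... 0.99, includes 0.50
costs = [total_cost(y_true, y_score, t) for t in candidate_thresholds]
best_threshold = float(candidate_thresholds[int(np.argmin(costs))])
cost_default = total_cost(y_true, y_score, 0.5)

print(f"Cost at 0.50: ${cost_default:,}")
print(f"Cheapest threshold: {best_threshold:.2f} (cost ${min(costs):,})")
```

With a 2000:1 cost ratio, a single missed fraud outweighs thousands of false alarms, so the cheapest threshold tends to sit wherever nearly all frauds are still caught.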

In [None]:
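# Find the highest threshold that still achieves at least 90% Recall.
# recall[:-1] aligns with thresholds (the final curve point has no threshold),
# and recall is non-increasing as the threshold rises.
target_recall = 0.90
idx = np.where(recall[:-1] >= target_recall)[0][-1]
optimal_threshold = thresholds[idx]

print(f"To achieve at least {target_recall:.0%} Recall, we need Threshold = {optimal_threshold:.4f}")
print(f"At this level, Precision is {precision[idx]:.2f}")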

# Final Confusion Matrix
y_pred = (probas >= optimal_threshold).astype(int)
print("\nConfusion Matrix at Optimal Threshold:")
print(confusion_matrix(y_test, y_pred))

# Feature Importance
importances = best_model.named_steps['clf'].feature_importances_
print("\nFeature Importances:")
for name, imp in zip(X_train.columns, importances):  # column names come from the training frame
    print(f"{name}: {imp:.4f}")
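
The `feature_importances_` values above are impurity-based and computed from the training data. As a complementary check (a different technique, not what the notebook itself does), permutation importance can be measured on held-out data. The sketch below is self-contained and uses made-up features named `signal` and `noise` rather than the notebook's columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy data (illustrative): one informative feature and one pure-noise feature.
rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.1).astype(int)
signal = rng.normal(0, 1, n) + 3 * y      # shifted for the positive class
noise = rng.normal(0, 1, n)               # unrelated to the label
X = np.column_stack([signal, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Drop in average precision when each feature is shuffled on the held-out set.
result = permutation_importance(
    clf, X_te, y_te, scoring="average_precision", n_repeats=5, random_state=0
)
for name, mean in zip(["signal", "noise"], result.importances_mean):
    print(f"{name}: {mean:.4f}")
```

A feature whose shuffling barely changes the held-out score contributes little to generalization, even if its impurity-based importance looks non-trivial.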