# Credit Card Fraud Detection

## Project Overview

Credit card fraud is a significant problem in the financial industry, resulting in billions of dollars in losses annually. This project demonstrates the application of machine learning techniques to detect fraudulent transactions from a highly imbalanced dataset.

### Objectives

Perform exploratory data analysis on credit card transaction data, handle class imbalance using SMOTE and undersampling, build and evaluate multiple classification models, and compare performance using appropriate metrics for imbalanced data.

### Dataset

The dataset contains transactions made by European cardholders in September 2013. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions (0.172% fraud rate).

**Features** - `Time` (seconds elapsed between each transaction and the first transaction in the dataset), `V1-V28` (anonymized, PCA-transformed features), `Amount` (transaction amount), `Class` (target - 0 for legitimate, 1 for fraud)
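
To illustrate why accuracy alone is a poor yardstick at this fraud rate, here is a minimal sketch that uses only the counts quoted above. It scores a hypothetical "model" that labels every transaction as legitimate; the helper name `always_legitimate_baseline` is illustrative and not part of the notebook.

```python
def always_legitimate_baseline(n_total: int, n_fraud: int) -> dict:
    """Metrics for a model that predicts class 0 for every transaction."""
    true_negatives = n_total - n_fraud   # every legitimate transaction is "correct"
    false_negatives = n_fraud            # every fraud is missed
    return {
        "accuracy": true_negatives / n_total,
        "recall": 0.0,                   # no fraud is ever flagged
        "missed_frauds": false_negatives,
    }

baseline = always_legitimate_baseline(n_total=284_807, n_fraud=492)
print(f"Accuracy - {baseline['accuracy']:.4%}")
print(f"Recall - {baseline['recall']:.1%}")
print(f"Missed frauds - {baseline['missed_frauds']}")
```

The accuracy is very high even though no fraud is caught, which is why the later sections lean on precision, recall, F1, ROC-AUC and average precision.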

## 1. Import Libraries

In [None]:
# Data manipulation and analysis
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

# Handling imbalanced data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Evaluation metrics
from sklearn.metrics import (
    classification_report, 
    confusion_matrix, 
    roc_auc_score, 
    roc_curve,
    precision_recall_curve,
    average_precision_score,
    f1_score,
    accuracy_score,
    precision_score,
    recall_score
)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print("Libraries imported successfully!")

## 2. Load and Explore Data

In [None]:
# Load the dataset
df = pd.read_csv('../data/creditcard.csv')

# Display basic information
print(f"Dataset Shape - {df.shape}")
print(f"Total Transactions - {len(df):,}")
print(f"\nFeatures - {df.columns.tolist()}")

In [None]:
# First look at the data
df.head()

In [None]:
# Data types and missing values
print("Data Types")
print(df.dtypes.value_counts())
print(f"\nMissing Values - {df.isnull().sum().sum()}")

In [None]:
# Statistical summary
df.describe()

## 3. Exploratory Data Analysis

### 3.1 Class Distribution Analysis

In [None]:
# Class distribution
class_counts = df['Class'].value_counts()
class_percentages = df['Class'].value_counts(normalize=True) * 100

print("Class Distribution")
print(f"Legitimate (0) - {class_counts[0]:,} transactions ({class_percentages[0]:.3f}%)")
print(f"Fraudulent (1) - {class_counts[1]:,} transactions ({class_percentages[1]:.3f}%)")
print(f"\nImbalance Ratio - 1:{class_counts[0] / class_counts[1]:.0f}")

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
colors = ['#2ecc71', '#e74c3c']
axes[0].bar(['Legitimate', 'Fraud'], class_counts.values, color=colors, edgecolor='black')
axes[0].set_ylabel('Number of Transactions', fontsize=12)
axes[0].set_title('Class Distribution - Count', fontsize=14, fontweight='bold')
for i, v in enumerate(class_counts.values):
    axes[0].text(i, v + 5000, f'{v:,}', ha='center', fontsize=11, fontweight='bold')

# Pie chart
axes[1].pie(class_counts.values, labels=['Legitimate', 'Fraud'], autopct='%1.3f%%',
            colors=colors, explode=(0, 0.1), shadow=True, startangle=90)
axes[1].set_title('Class Distribution - Percentage', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('../figures/class_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

### 3.2 Transaction Amount Analysis

In [None]:
# Transaction amount statistics by class
print("Transaction Amount Statistics")
print("\nLegitimate Transactions")
print(df[df['Class'] == 0]['Amount'].describe())
print("\nFraudulent Transactions")
print(df[df['Class'] == 1]['Amount'].describe())

In [None]:
# Visualize transaction amounts
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution of amounts
axes[0].hist(df[df['Class'] == 0]['Amount'], bins=50, alpha=0.7, label='Legitimate', color='#2ecc71')
axes[0].hist(df[df['Class'] == 1]['Amount'], bins=50, alpha=0.7, label='Fraud', color='#e74c3c')
axes[0].set_xlabel('Transaction Amount ($)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Transaction Amounts', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].set_xlim(0, 500)

# Box plot
df_plot = df[['Amount', 'Class']].copy()
df_plot['Class'] = df_plot['Class'].map({0: 'Legitimate', 1: 'Fraud'})
sns.boxplot(x='Class', y='Amount', data=df_plot, ax=axes[1], palette=colors)
axes[1].set_ylabel('Transaction Amount ($)', fontsize=12)
axes[1].set_title('Transaction Amount by Class', fontsize=14, fontweight='bold')
axes[1].set_ylim(0, 500)

plt.tight_layout()
plt.savefig('../figures/amount_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

### 3.3 Time Analysis

In [None]:
# Convert elapsed seconds to an hour-of-day bucket (relative to the first transaction, which may not be midnight)
df['Hour'] = (df['Time'] / 3600) % 24

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Transactions over time
axes[0].hist(df[df['Class'] == 0]['Hour'], bins=24, alpha=0.7, label='Legitimate', color='#2ecc71', density=True)
axes[0].hist(df[df['Class'] == 1]['Hour'], bins=24, alpha=0.7, label='Fraud', color='#e74c3c', density=True)
axes[0].set_xlabel('Hour of Day', fontsize=12)
axes[0].set_ylabel('Density', fontsize=12)
axes[0].set_title('Transaction Distribution Over Time', fontsize=14, fontweight='bold')
axes[0].legend()

# Fraud rate by hour
hourly_fraud = df.groupby(df['Hour'].astype(int))['Class'].mean() * 100
axes[1].bar(hourly_fraud.index, hourly_fraud.values, color='#3498db', edgecolor='black')
axes[1].set_xlabel('Hour of Day', fontsize=12)
axes[1].set_ylabel('Fraud Rate (%)', fontsize=12)
axes[1].set_title('Fraud Rate by Hour', fontsize=14, fontweight='bold')
axes[1].axhline(y=df['Class'].mean()*100, color='red', linestyle='--', label='Average Fraud Rate')
axes[1].legend()

plt.tight_layout()
plt.savefig('../figures/time_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

### 3.4 Feature Correlation Analysis

In [None]:
# Correlation with target variable
correlations = df.drop('Hour', axis=1).corr()['Class'].sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(10, 8))
# Use a separate name so the two-colour class palette in `colors` is not overwritten
bar_colors = ['#e74c3c' if x > 0 else '#3498db' for x in correlations.values[1:]]
correlations[1:].plot(kind='barh', color=bar_colors, ax=ax)
ax.set_xlabel('Correlation Coefficient', fontsize=12)
ax.set_title('Feature Correlation with Fraud', fontsize=14, fontweight='bold')
ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.tight_layout()
plt.savefig('../figures/feature_correlation.png', dpi=300, bbox_inches='tight')
plt.show()
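
The `.corr()` call above uses Pearson correlation by default, which against a 0/1 target is the point-biserial correlation. The sketch below shows the computation on made-up values; `pearson` is a hypothetical helper written for illustration, not a pandas function.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Toy data: low feature values coincide with the fraud label (1)
feature = [-5.0, -4.0, 0.0, 1.0, 2.0, 3.0]
label = [1, 1, 0, 0, 0, 0]
r = pearson(feature, label)
print(f"Correlation with label - {r:.3f}")
```

A negative coefficient here means the feature tends to be lower for fraudulent rows, which is how the blue bars in the plot should be read.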

In [None]:
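# Top correlated features visualization (taken from the sorted correlations rather than hard-coded)
feature_corr = correlations.drop('Class')
top_positive = feature_corr.head(3).index.tolist()
top_negative = feature_corr.tail(3).index.tolist()[::-1]  # most negative first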

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

for idx, feature in enumerate(top_negative):
    ax = axes[0, idx]
    sns.kdeplot(data=df, x=feature, hue='Class', ax=ax, palette=colors, fill=True, alpha=0.5)
    ax.set_title(f'{feature} Distribution - Negative Correlation', fontsize=12, fontweight='bold')

for idx, feature in enumerate(top_positive):
    ax = axes[1, idx]
    sns.kdeplot(data=df, x=feature, hue='Class', ax=ax, palette=colors, fill=True, alpha=0.5)
    ax.set_title(f'{feature} Distribution - Positive Correlation', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.savefig('../figures/top_features_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

## 4. Data Preprocessing

In [None]:
# Drop the Hour column (it was created only for the time analysis)
df = df.drop('Hour', axis=1)

# Prepare features and target
X = df.drop('Class', axis=1)
y = df['Class']

print(f"Features shape - {X.shape}")
print(f"Target shape - {y.shape}")
print("\nTarget distribution")
print(y.value_counts())

In [None]:
# Split data into train and test sets (stratified to maintain class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Work on copies so the scaling step does not modify X
X_train, X_test = X_train.copy(), X_test.copy()

print(f"Training set - {X_train.shape[0]:,} samples")
print(f"Test set - {X_test.shape[0]:,} samples")
print("\nTraining set class distribution")
print(y_train.value_counts())
print("\nTest set class distribution")
print(y_test.value_counts())

In [None]:
# Scale Amount and Time with RobustScaler (median/IQR based, so less sensitive to outliers).
# The scaler is fitted on the training split only and then reused on the test split,
# so that no test-set statistics leak into training.
scale_cols = ['Amount', 'Time']
scaler = RobustScaler()
X_train[scale_cols] = scaler.fit_transform(X_train[scale_cols])
X_test[scale_cols] = scaler.transform(X_test[scale_cols])

X_train.head()
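
For reference, the RobustScaler used in this section centres each column on its median and divides by the interquartile range. Below is a dependency-free sketch of that formula, assuming the default 25th-75th percentile range; `robust_scale` is a hypothetical helper and the amounts are made up.

```python
import statistics

def robust_scale(values):
    """Scale values as (x - median) / IQR using linearly interpolated quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # assumed non-zero in this sketch
    return [(v - median) / iqr for v in values]

# One extreme amount barely affects how the typical amounts are scaled
amounts = [1.0, 2.0, 3.0, 4.0, 1000.0]
scaled = robust_scale(amounts)
print(scaled)
```

Because the median and IQR ignore the tails, the outlier stays large after scaling while the ordinary values remain on a compact range.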

## 5. Handling Class Imbalance

### 5.1 SMOTE - Synthetic Minority Over-sampling Technique

In [None]:
# Apply SMOTE to training data only
smote = SMOTE(random_state=42, sampling_strategy=0.5)  # Create fraud samples = 50% of legitimate
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(f"Original training set shape - {X_train.shape}")
print(f"SMOTE training set shape - {X_train_smote.shape}")
print(f"\nOriginal class distribution")
print(y_train.value_counts())
print(f"\nSMOTE class distribution")
print(pd.Series(y_train_smote).value_counts())
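
As a rough intuition for what SMOTE does, each synthetic fraud row is placed on the line segment between an existing minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration in plain Python, not the imbalanced-learn implementation; `smote_like_sample` and the toy points are assumptions made for the example.

```python
import random

def smote_like_sample(minority, k, rng):
    """Return one synthetic point on the segment between a minority
    sample and one of its k nearest minority neighbours."""
    base_idx = rng.randrange(len(minority))
    base = minority[base_idx]
    # Rank the remaining minority points by squared Euclidean distance to the base
    others = [p for i, p in enumerate(minority) if i != base_idx]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Toy 2-D minority class (made-up points, not taken from the dataset)
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(c, 3) for c in point])
```

Since every synthetic point is an interpolation of real minority points, resampling must happen after the train-test split, as in the cell above; otherwise test rows could influence the synthetic training data.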

### 5.2 Random Undersampling

In [None]:
# Apply undersampling to training data
undersampler = RandomUnderSampler(random_state=42, sampling_strategy=1.0)  # Equal ratio
X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)

print(f"Undersampled training set shape - {X_train_under.shape}")
print(f"\nUndersampled class distribution")
print(pd.Series(y_train_under).value_counts())
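
The two `sampling_strategy` values used above can be checked with simple arithmetic. The sketch assumes the documented meaning of a float `sampling_strategy` for binary problems (the desired minority-to-majority ratio after resampling); the helper names and the round class counts are illustrative, not the notebook's actual split sizes.

```python
def smote_counts(n_majority, n_minority, sampling_strategy):
    """(majority, minority) counts after over-sampling the minority class."""
    n_new = max(int(n_majority * sampling_strategy - n_minority), 0)
    return n_majority, n_minority + n_new

def undersample_counts(n_majority, n_minority, sampling_strategy):
    """(majority, minority) counts after randomly dropping majority rows."""
    n_kept = min(int(n_minority / sampling_strategy), n_majority)
    return n_kept, n_minority

# Illustrative training split with 200,000 legitimate and 400 fraud rows
print(smote_counts(200_000, 400, sampling_strategy=0.5))
print(undersample_counts(200_000, 400, sampling_strategy=1.0))
```

The contrast is the main design trade-off: SMOTE keeps all legitimate rows and adds many synthetic frauds, while undersampling discards almost all legitimate rows to match the fraud count.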

## 6. Model Building and Evaluation

In [None]:
def evaluate_model(model, X_test, y_test, model_name):
    """
    Evaluate model performance with multiple metrics.
    
    Parameters
    ----------
    model : trained model object
    X_test : test features
    y_test : test labels
    model_name : string name for the model
    
    Returns
    -------
    tuple : (dictionary of evaluation metrics, predicted labels, predicted fraud probabilities)
    """
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    metrics = {
        'Model': model_name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1-Score': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_pred_proba),
        'Avg Precision': average_precision_score(y_test, y_pred_proba)
    }
    
    return metrics, y_pred, y_pred_proba
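
The threshold-based metrics collected by `evaluate_model` all derive from the four confusion-matrix counts. A small worked sketch with made-up counts shows the formulas; `metrics_from_counts` is a hypothetical helper, not a scikit-learn function.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up example: 100 frauds among 10,000 transactions
example = metrics_from_counts(tp=80, fp=20, fn=20, tn=9880)
for name, value in example.items():
    print(f"{name} - {value:.3f}")
```

Accuracy is dominated by the true negatives, whereas precision, recall and F1 depend only on how the fraud class is handled.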

In [None]:
def plot_confusion_matrix(y_test, y_pred, model_name, ax):
    """Plot confusion matrix."""
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=['Legitimate', 'Fraud'],
                yticklabels=['Legitimate', 'Fraud'])
    ax.set_xlabel('Predicted', fontsize=11)
    ax.set_ylabel('Actual', fontsize=11)
    ax.set_title(f'{model_name}', fontsize=12, fontweight='bold')

### 6.1 Train Models

In [None]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')
}

# Store results
results = []
predictions = {}

In [None]:
# Train and evaluate each model on SMOTE data
print("Training models on SMOTE-resampled data...\n")
print("=" * 80)

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train model
    model.fit(X_train_smote, y_train_smote)
    
    # Evaluate
    metrics, y_pred, y_pred_proba = evaluate_model(model, X_test, y_test, name)
    results.append(metrics)
    predictions[name] = {'y_pred': y_pred, 'y_pred_proba': y_pred_proba}
    
    # Print classification report
    print(f"\n{name} Results")
    print("-" * 40)
    print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))

print("\n" + "=" * 80)
print("Training complete!")

### 6.2 Model Comparison

In [None]:
# Create results dataframe
results_df = pd.DataFrame(results)
results_df = results_df.set_index('Model')

# Format scores to four decimal places
results_display = results_df.copy()
for col in results_display.columns:
    results_display[col] = results_display[col].apply(lambda x: f'{x:.4f}')

print("Model Performance Comparison")
print("=" * 80)
results_display

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Metrics comparison
metrics_to_plot = ['Precision', 'Recall', 'F1-Score', 'ROC-AUC']
results_df[metrics_to_plot].plot(kind='bar', ax=axes[0], rot=0, width=0.8)
axes[0].set_ylabel('Score', fontsize=12)
axes[0].set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
axes[0].legend(loc='lower right')
axes[0].set_ylim(0, 1.1)
for container in axes[0].containers:
    axes[0].bar_label(container, fmt='%.3f', fontsize=8)

# ROC-AUC comparison
colors_bar = ['#3498db', '#2ecc71', '#e74c3c']
results_df['ROC-AUC'].plot(kind='bar', ax=axes[1], color=colors_bar, rot=0)
axes[1].set_ylabel('ROC-AUC Score', fontsize=12)
axes[1].set_title('ROC-AUC Comparison', fontsize=14, fontweight='bold')
axes[1].set_ylim(0.9, 1.0)
for i, v in enumerate(results_df['ROC-AUC'].values):
    axes[1].text(i, v + 0.002, f'{v:.4f}', ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig('../figures/model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

### 6.3 Confusion Matrices

In [None]:
# Plot confusion matrices for all models
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, (name, preds) in enumerate(predictions.items()):
    plot_confusion_matrix(y_test, preds['y_pred'], name, axes[idx])

plt.tight_layout()
plt.savefig('../figures/confusion_matrices.png', dpi=300, bbox_inches='tight')
plt.show()

### 6.4 ROC Curves

In [None]:
# Plot ROC curves
fig, ax = plt.subplots(figsize=(10, 8))

colors_roc = ['#3498db', '#2ecc71', '#e74c3c']

for idx, (name, preds) in enumerate(predictions.items()):
    fpr, tpr, _ = roc_curve(y_test, preds['y_pred_proba'])
    auc_score = roc_auc_score(y_test, preds['y_pred_proba'])
    ax.plot(fpr, tpr, color=colors_roc[idx], linewidth=2,
            label=f'{name} (AUC = {auc_score:.4f})')

ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves', fontsize=14, fontweight='bold')
ax.legend(loc='lower right', fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../figures/roc_curves.png', dpi=300, bbox_inches='tight')
plt.show()
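
One way to read the AUC values in the legend: ROC-AUC equals the probability that a randomly chosen fraud receives a higher score than a randomly chosen legitimate transaction, with ties counted as half. The brute-force sketch below illustrates that interpretation on toy scores; `pairwise_auc` is a hypothetical helper and is far slower than `roc_auc_score` on real data.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (fraud, legitimate) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy labels and scores
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_toy, scores_toy))  # 3 of 4 pairs are ordered correctly -> 0.75
```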

### 6.5 Precision-Recall Curves

In [None]:
# Plot Precision-Recall curves
fig, ax = plt.subplots(figsize=(10, 8))

for idx, (name, preds) in enumerate(predictions.items()):
    precision, recall, _ = precision_recall_curve(y_test, preds['y_pred_proba'])
    ap_score = average_precision_score(y_test, preds['y_pred_proba'])
    ax.plot(recall, precision, color=colors_roc[idx], linewidth=2,
            label=f'{name} (AP = {ap_score:.4f})')

# Baseline (proportion of positive class)
baseline = y_test.sum() / len(y_test)
ax.axhline(y=baseline, color='gray', linestyle='--', linewidth=1, label=f'Baseline ({baseline:.4f})')

ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Curves', fontsize=14, fontweight='bold')
ax.legend(loc='upper right', fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../figures/precision_recall_curves.png', dpi=300, bbox_inches='tight')
plt.show()
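
The AP value in the legend summarises the precision-recall curve as the mean of the precision values measured each time a fraud is retrieved, going down the ranked list of scores. A minimal sketch of that computation follows, assuming no tied scores; `average_precision` is a hypothetical helper, not the scikit-learn function.

```python
def average_precision(y_true, scores):
    """Average precision for binary labels, assuming no tied scores."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the moment recall increases
    return total / n_positive

# Toy labels and scores
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
print(f"{average_precision(y_toy, scores_toy):.4f}")  # (1/1 + 2/3) / 2 = 0.8333
```

Unlike ROC-AUC, this score is measured against the fraud prevalence shown by the baseline line, so it tends to be the more informative number on heavily imbalanced data.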

### 6.6 Feature Importance - Random Forest

In [None]:
# Get feature importances from Random Forest
rf_model = models['Random Forest']
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

# Plot top 15 features
fig, ax = plt.subplots(figsize=(10, 8))

top_features = feature_importance.head(15)
colors_feat = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(top_features)))
ax.barh(top_features['Feature'], top_features['Importance'], color=colors_feat)
ax.set_xlabel('Feature Importance', fontsize=12)
ax.set_title('Top 15 Most Important Features - Random Forest', fontsize=14, fontweight='bold')
ax.invert_yaxis()

plt.tight_layout()
plt.savefig('../figures/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Summary and Conclusions

In [None]:
# Final summary
print("=" * 80)
print("CREDIT CARD FRAUD DETECTION - FINAL SUMMARY")
print("=" * 80)

print("\n1. DATASET CHARACTERISTICS")
print(f"   Total transactions - 284,807")
print(f"   Fraudulent transactions - 492 (0.172%)")
print(f"   Imbalance ratio - 1:578")

print("\n2. PREPROCESSING")
print("   Applied RobustScaler to Amount and Time features")
print("   Used SMOTE for handling class imbalance")
print("   80-20 train-test split with stratification")

print("\n3. MODEL PERFORMANCE")
print(results_df.to_string())

best_model = results_df['F1-Score'].idxmax()
print(f"\n4. BEST PERFORMING MODEL - {best_model}")
print(f"   F1-Score - {results_df.loc[best_model, 'F1-Score']:.4f}")
print(f"   ROC-AUC - {results_df.loc[best_model, 'ROC-AUC']:.4f}")
print(f"   Recall - {results_df.loc[best_model, 'Recall']:.4f}")

print("\n5. KEY FINDINGS")
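feature_corr = correlations.drop('Class')
print(f"   Most negatively correlated with fraud - {', '.join(feature_corr.tail(3).index[::-1])}")
print(f"   Most important Random Forest features - {', '.join(feature_importance['Feature'].head(3))}")
print("   SMOTE was applied to the training split only; the test set keeps the original class ratio")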
print("   Ensemble methods outperform Logistic Regression")

print("\n" + "=" * 80)

## 8. Future Improvements

**Hyperparameter Tuning** - Use GridSearchCV or RandomizedSearchCV to optimize model parameters

**Additional Models** - Test LightGBM, CatBoost, or Neural Networks

**Feature Engineering** - Create new features from existing ones such as transaction velocity

**Ensemble Methods** - Combine multiple models using stacking or voting classifiers

**Threshold Optimization** - Adjust classification threshold based on business requirements

**Cost-Sensitive Learning** - Incorporate different costs for false positives vs false negatives

**Real-time Detection** - Implement streaming pipeline for real-time fraud detection
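
The threshold optimization and cost-sensitive ideas above can be combined in a small sketch: sweep candidate thresholds over predicted probabilities and keep the one with the lowest total cost. The cost values, the toy scores and the helper name `best_threshold` are assumptions for illustration only; real costs would come from business requirements.

```python
def best_threshold(y_true, scores, cost_fp=1.0, cost_fn=50.0):
    """Return (threshold, total_cost) minimising cost over the observed scores."""
    best = None
    for threshold in sorted(set(scores)):
        # A transaction is flagged as fraud when its score reaches the threshold
        false_pos = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        false_neg = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
        cost = cost_fp * false_pos + cost_fn * false_neg
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

# Toy labels and predicted fraud probabilities
y_toy = [0, 0, 0, 1, 0, 1, 1]
scores_toy = [0.05, 0.10, 0.30, 0.35, 0.60, 0.70, 0.90]
threshold, cost = best_threshold(y_toy, scores_toy)
print(f"Best threshold - {threshold}, total cost - {cost}")
```

With a missed fraud assumed to cost 50 times a false alarm, the chosen threshold falls below the default 0.5, trading one extra false positive for catching every fraud in the toy data. In practice the sweep should run on a validation split rather than the test set.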

In [None]:
# Save the best model
import joblib

best_model_obj = models[best_model]
joblib.dump(best_model_obj, '../models/best_fraud_detector.pkl')
print(f"Best model ({best_model}) saved to '../models/best_fraud_detector.pkl'")