# Credit Card Fraud Detection

**Author:** Piyush Ramteke  
**Program:** CodSoft Data Science Internship  

---

## 1. Problem Statement

Credit card fraud causes billions of dollars in losses every year. As digital payments grow, so does fraudulent activity. The challenge is to **automatically detect fraudulent transactions** among hundreds of thousands of legitimate ones.

This is a **binary classification** problem:
- **Class 0** → Genuine transaction
- **Class 1** → Fraudulent transaction

The key difficulty is **class imbalance** — frauds make up less than 0.2% of all transactions. A naive model that predicts everything as "genuine" would get 99.8% accuracy but catch zero frauds.

**Goal:** Build a model that maximizes **Recall** (catching as many frauds as possible) while maintaining acceptable **Precision** (minimizing false alarms).
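
The accuracy trap described above can be checked with a few lines of NumPy. This is a minimal sketch on a made-up label vector (998 genuine, 2 fraud per 1,000), not the real dataset; the names `y_true_demo` and `y_pred_demo` are illustrative only.

```python
import numpy as np

# Hypothetical labels with roughly the imbalance described above:
# 998 genuine (0) and 2 fraudulent (1) transactions.
y_true_demo = np.array([0] * 998 + [1] * 2)

# A naive "model" that labels every transaction as genuine.
y_pred_demo = np.zeros_like(y_true_demo)

# Accuracy = share of labels predicted correctly.
accuracy = float((y_pred_demo == y_true_demo).mean())

# Recall = frauds caught / all frauds.
caught = int(((y_pred_demo == 1) & (y_true_demo == 1)).sum())
recall = caught / int((y_true_demo == 1).sum())

print(f"Accuracy: {accuracy:.3f}")
print(f"Recall:   {recall:.3f}")
```

The accuracy looks excellent while the recall is zero, which is why accuracy is not used as the headline metric in this notebook.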

## 2. Dataset Overview

The dataset contains credit card transactions made by European cardholders in **September 2013** over a two-day period.

| Feature | Description |
|---------|-------------|
| `Time` | Seconds elapsed since the first transaction |
| `V1` – `V28` | PCA-transformed features (anonymized for privacy) |
| `Amount` | Transaction amount |
| `Class` | **Target** — 0 = Genuine, 1 = Fraud |

In [None]:
# ── Import Libraries ──────────────────────────────────────────

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix,
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, precision_recall_curve, average_precision_score
)
from imblearn.over_sampling import SMOTE

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

print('All libraries loaded.')

In [None]:
# ── Load Dataset ─────────────────────────────────────────────

df = pd.read_csv('creditcard.csv')

print(f'Shape: {df.shape[0]:,} rows × {df.shape[1]} columns')
df.head()

In [None]:
# ── Dataset Info ─────────────────────────────────────────────

df.info()
print()
df[['Time', 'Amount', 'Class']].describe()

In [None]:
# ── Check for Missing Values ─────────────────────────────────

missing = df.isnull().sum().sum()
print(f'Total missing values: {missing}')
print('→ The dataset has no missing values.')

**Observations:**
- 284,807 transactions with 31 columns
- `V1`–`V28` are PCA components; `Time` and `Amount` are raw, unscaled values
- No missing values — the dataset is clean

---

## 3. Exploratory Data Analysis

In [None]:
# ── 3.1 Class Distribution ───────────────────────────────────

counts = df['Class'].value_counts()
pct    = df['Class'].value_counts(normalize=True) * 100

print('Class Distribution:')
print(f'  Genuine (0): {counts[0]:>7,}   ({pct[0]:.3f}%)')
print(f'  Fraud   (1): {counts[1]:>7,}   ({pct[1]:.3f}%)')
print(f'\n  Ratio: 1 fraud per {counts[0]//counts[1]} genuine transactions')

In [None]:
# ── 3.2 Visualize Class Imbalance ────────────────────────────

colors = ['#27ae60', '#c0392b']

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Bar chart
bars = axes[0].bar(['Genuine (0)', 'Fraud (1)'], counts.values,
                   color=colors, edgecolor='black')
axes[0].set_title('Class Distribution', fontsize=15, fontweight='bold')
axes[0].set_ylabel('Count')
for b, c in zip(bars, counts.values):
    axes[0].text(b.get_x() + b.get_width()/2, b.get_height()+2000,
                f'{c:,}', ha='center', fontweight='bold', fontsize=12)

# Pie chart
axes[1].pie(counts, labels=['Genuine','Fraud'], autopct='%1.3f%%',
            colors=colors, explode=(0.02, 0.12), shadow=True,
            textprops={'fontsize':12})
axes[1].set_title('Class Proportion', fontsize=15, fontweight='bold')

plt.tight_layout()
plt.show()

**Interpretation:** The dataset is **extremely imbalanced**: only 0.173% of transactions are fraudulent. A model trained directly on this data tends to favor the "genuine" class and can miss much of the fraud. We address this in **Section 5**.

In [None]:
# ── 3.3 Transaction Amount Distribution ──────────────────────

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

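# Histogram restricted to $0–$500.
# range=(0, 500) places all 60 bins inside the zoom window; without it the
# bins span the full amount range and only one or two fall below $500.
axes[0].hist(df[df['Class']==0]['Amount'], bins=60, range=(0, 500), alpha=0.7,
             color=colors[0], label='Genuine', edgecolor='black')
axes[0].hist(df[df['Class']==1]['Amount'], bins=60, range=(0, 500), alpha=0.7,
             color=colors[1], label='Fraud', edgecolor='black')
axes[0].set_xlim(0, 500)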
axes[0].set_title('Transaction Amount (≤ $500)', fontsize=15, fontweight='bold')
axes[0].set_xlabel('Amount ($)')
axes[0].set_ylabel('Frequency')
axes[0].legend()

# Boxplot
sns.boxplot(x='Class', y='Amount', data=df, palette=colors, ax=axes[1])
axes[1].set_ylim(0, 400)
axes[1].set_xticklabels(['Genuine', 'Fraud'])
axes[1].set_title('Amount by Class', fontsize=15, fontweight='bold')

plt.tight_layout()
plt.show()

print(f'Genuine → Mean: ${df[df["Class"]==0]["Amount"].mean():.2f}, '
      f'Median: ${df[df["Class"]==0]["Amount"].median():.2f}')
print(f'Fraud   → Mean: ${df[df["Class"]==1]["Amount"].mean():.2f}, '
      f'Median: ${df[df["Class"]==1]["Amount"].median():.2f}')

**Interpretation:** Fraudulent transactions tend to have a **lower median amount** than genuine ones. Because most fraud occurs at small, unremarkable values, amount alone is a weak signal for detecting it.

In [None]:
# ── 3.4 Top Features Correlated with Fraud ───────────────────

corr = df.corr()['Class'].drop('Class').abs().sort_values(ascending=False)
top10 = corr.head(10)

fig, ax = plt.subplots(figsize=(10, 5))
top10.plot(kind='barh', ax=ax, color='#2980b9', edgecolor='black')
ax.set_title('Top 10 Features Correlated with Fraud', fontsize=15, fontweight='bold')
ax.set_xlabel('Absolute Correlation')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

**Interpretation:** Features like `V17`, `V14`, `V12`, and `V10` have the strongest correlations with the fraud label. Since these are PCA components, they represent hidden patterns in the original transaction data that distinguish fraud from genuine activity.

---

## 4. Data Preprocessing

In [None]:
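# ── 4.1 Normalize Amount and Time ────────────────────────────
# V1–V28 are PCA components. Amount and Time are raw values on
# very different scales, so we standardize them (mean=0, std=1).
# Caveat: the scalers are fit on the full dataset before the
# train/test split, so test-set statistics leak into the scaling.
# For a strict evaluation, fit them on the training rows only.

# One scaler per column, so each fitted scaler can be reused later.
amount_scaler = StandardScaler()
time_scaler   = StandardScaler()

df['Scaled_Amount'] = amount_scaler.fit_transform(df[['Amount']]).ravel()
df['Scaled_Time']   = time_scaler.fit_transform(df[['Time']]).ravel()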

# Drop original unscaled columns
df_processed = df.drop(['Amount', 'Time'], axis=1)

print('Scaled_Amount — '
      f'mean: {df_processed["Scaled_Amount"].mean():.4f}, '
      f'std: {df_processed["Scaled_Amount"].std():.4f}')
print('Scaled_Time   — '
      f'mean: {df_processed["Scaled_Time"].mean():.4f}, '
      f'std: {df_processed["Scaled_Time"].std():.4f}')
print('\nAmount and Time normalized successfully.')

In [None]:
# ── 4.2 Separate Features and Target ────────────────────────

X = df_processed.drop('Class', axis=1)
y = df_processed['Class']

print(f'Features shape: {X.shape}')
print(f'Target shape:   {y.shape}')

In [None]:
# ── 4.3 Stratified Train/Test Split ─────────────────────────
# stratify=y ensures both sets maintain the same fraud ratio.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f'Training set: {X_train.shape[0]:,} samples  '
      f'(Genuine: {(y_train==0).sum():,}, Fraud: {(y_train==1).sum()})')
print(f'Testing set:  {X_test.shape[0]:,} samples  '
      f'(Genuine: {(y_test==0).sum():,}, Fraud: {(y_test==1).sum()})')
print(f'\nFraud ratio preserved — '
      f'Train: {(y_train==1).mean()*100:.3f}%, '
      f'Test: {(y_test==1).mean()*100:.3f}%')

**Summary of Preprocessing:**
- No missing values found
- `Amount` and `Time` standardized using **StandardScaler**
- Stratified 80/20 split preserves original fraud ratio in both sets
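
Section 4.1 fits the scaler on the whole dataset before splitting, which lets test-set statistics influence the transform. A leakage-free ordering might look like the sketch below. It runs on a small synthetic frame (`demo`, with made-up values) rather than `creditcard.csv`, and all variable names here are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the real data (values are made up).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Time": rng.uniform(0, 172_800, size=1_000),
    "Amount": rng.exponential(scale=80.0, size=1_000),
    "Class": np.r_[np.zeros(980, dtype=int), np.ones(20, dtype=int)],
})

X_demo = demo.drop(columns="Class")
y_demo = demo["Class"]

# Split first, so the test rows play no part in fitting the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit on the training rows only, then apply the same transform to both sets.
cols = ["Amount", "Time"]
scaler_demo = StandardScaler().fit(X_tr[cols])

X_tr_scaled = X_tr.copy()
X_te_scaled = X_te.copy()
X_tr_scaled[cols] = scaler_demo.transform(X_tr[cols])
X_te_scaled[cols] = scaler_demo.transform(X_te[cols])

print(X_tr_scaled[cols].mean().round(4))
print(f"Fraud rows in test set: {int(y_te.sum())}")
```

With this ordering the training columns have mean 0 and unit variance, while the test columns are only approximately standardized, as they would be for unseen transactions.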

---

## 5. Handling Class Imbalance

We use **SMOTE (Synthetic Minority Oversampling Technique)** to balance the training data.

**How SMOTE works:**
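1. For each fraud sample, it finds the k nearest fraud neighbors (`k_neighbors=5` by default in imbalanced-learn)
2. It creates a new synthetic fraud sample at a random point on the line segment between the sample and one of those neighbors
3. The synthetic samples fill in the region between existing frauds, which usually generalizes better than duplicating rows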

> **Important:** SMOTE is applied **only on the training set**. The test set remains untouched to simulate real-world conditions.
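
The interpolation steps above can be sketched in plain NumPy. This is a simplified illustration of the idea, not the imbalanced-learn implementation: the function name `smote_like_samples`, the brute-force neighbor search, and the toy 2-D points are all assumptions made for the example.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows
    and their k nearest minority neighbours (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                  # pick a minority row
        nb = neighbours[base, rng.integers(k)]  # pick one of its neighbours
        gap = rng.random()                      # position along the segment
        synthetic[i] = minority[base] + gap * (minority[nb] - minority[base])
    return synthetic

# Hypothetical 2-D minority points (corners of a unit square).
fraud_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(fraud_points, n_new=6, k=2)
print(new_points.shape)  # (6, 2)
```

Every synthetic point lies on a segment between two existing minority points, so it never leaves the region those points span.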

In [None]:
# ── 5.1 Apply SMOTE ─────────────────────────────────────────

smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

print('Before SMOTE:')
print(f'  Genuine: {(y_train==0).sum():,}  |  Fraud: {(y_train==1).sum()}')

print('\nAfter SMOTE:')
print(f'  Genuine: {(y_train_sm==0).sum():,}  |  Fraud: {(y_train_sm==1).sum():,}')
print(f'  Total:   {len(y_train_sm):,}')

In [None]:
# ── 5.2 Visualize Before vs After SMOTE ─────────────────────

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

pd.Series(y_train).value_counts().plot(
    kind='bar', ax=axes[0], color=colors, edgecolor='black')
axes[0].set_title('Before SMOTE', fontsize=15, fontweight='bold')
axes[0].set_xticklabels(['Genuine', 'Fraud'], rotation=0)
axes[0].set_ylabel('Count')

pd.Series(y_train_sm).value_counts().sort_index().plot(  # sort_index keeps class 0 first when counts tie
    kind='bar', ax=axes[1], color=colors, edgecolor='black')
axes[1].set_title('After SMOTE', fontsize=15, fontweight='bold')
axes[1].set_xticklabels(['Genuine', 'Fraud'], rotation=0)
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()

**Result:** After SMOTE, both classes are perfectly balanced. The model can now learn fraud patterns without being overwhelmed by the majority class.

---

## 6. Model Training

We train two classifiers:

| Model | Why |
|-------|-----|
| **Logistic Regression** | Simple, fast, interpretable — good baseline |
| **Random Forest** | Ensemble of decision trees — captures complex patterns |

In [None]:
# ── 6.1 Train Logistic Regression ───────────────────────────

lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_sm, y_train_sm)

lr_pred = lr.predict(X_test)
lr_prob = lr.predict_proba(X_test)[:, 1]

print('Logistic Regression — trained.')

In [None]:
# ── 6.2 Train Random Forest ─────────────────────────────────

rf = RandomForestClassifier(
    n_estimators=100,   # number of trees
    max_depth=15,       # limit depth to prevent overfitting
    random_state=42,
    n_jobs=-1           # use all CPU cores
)
rf.fit(X_train_sm, y_train_sm)

rf_pred = rf.predict(X_test)
rf_prob = rf.predict_proba(X_test)[:, 1]

print('Random Forest — trained.')

Both models are trained on **SMOTE-balanced data** and evaluated on the **original imbalanced test set** to reflect real-world performance.

---

## 7. Model Evaluation

### Why Recall Matters Most in Fraud Detection

| Metric | Meaning | In fraud context |
|--------|---------|------------------|
| **Precision** | % of predicted frauds that are real | High precision → fewer false alarms |
| **Recall** | % of real frauds that we caught | **High recall → fewer missed frauds** |
| **F1-Score** | Harmonic mean of precision and recall | Single-number summary that penalizes a model weak on either one |

**Key insight:** A missed fraud (False Negative) means **real financial loss**. A false alarm (False Positive) is just a phone call to the customer. Therefore, **Recall is the priority metric.**
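
As a worked example of the three metrics in the table, the sketch below computes them from confusion-matrix counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow; `summarize` is an illustrative helper, not part of scikit-learn.

```python
def summarize(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical counts: 100 real frauds, of which 90 are caught,
# plus 60 genuine transactions wrongly flagged.
precision, recall, f1, accuracy = summarize(tp=90, fp=60, fn=10, tn=9840)

print(f"Precision: {precision:.2f}")  # 90 / 150
print(f"Recall:    {recall:.2f}")     # 90 / 100
print(f"F1-Score:  {f1:.2f}")
print(f"Accuracy:  {accuracy:.3f}")
```

Here recall is high (0.90) while precision is modest (0.60), and accuracy stays above 0.99 regardless, which illustrates why accuracy says little about fraud detection quality.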

In [None]:
# ── 7.1 Confusion Matrices ──────────────────────────────────

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, name, pred in [
    (axes[0], 'Logistic Regression', lr_pred),
    (axes[1], 'Random Forest', rf_pred)
]:
    cm = confusion_matrix(y_test, pred)
    sns.heatmap(cm, annot=True, fmt=',d', cmap='Blues', ax=ax,
                xticklabels=['Genuine','Fraud'],
                yticklabels=['Genuine','Fraud'],
                annot_kws={'fontsize': 14})
    ax.set_title(name, fontsize=14, fontweight='bold')
    ax.set_ylabel('Actual')
    ax.set_xlabel('Predicted')

plt.suptitle('Confusion Matrices', fontsize=17, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print('Reading guide:')
print('  Top-left  = True Negative  (Genuine → Genuine)  ✓')
print('  Top-right = False Positive (Genuine → Fraud)    ← false alarm')
print('  Bot-left  = False Negative (Fraud → Genuine)    ← MISSED fraud!')
print('  Bot-right = True Positive  (Fraud → Fraud)      ✓ caught!')

In [None]:
# ── 7.2 Classification Reports ──────────────────────────────

print('=' * 55)
print('Logistic Regression — Classification Report')
print('=' * 55)
print(classification_report(y_test, lr_pred,
                            target_names=['Genuine','Fraud']))

print('=' * 55)
print('Random Forest — Classification Report')
print('=' * 55)
print(classification_report(y_test, rf_pred,
                            target_names=['Genuine','Fraud']))

In [None]:
# ── 7.3 ROC Curves ──────────────────────────────────────────

fig, ax = plt.subplots(figsize=(9, 7))

for name, prob, clr in [
    ('Logistic Regression', lr_prob, '#2980b9'),
    ('Random Forest', rf_prob, '#c0392b')
]:
    fpr, tpr, _ = roc_curve(y_test, prob)
    auc = roc_auc_score(y_test, prob)
    ax.plot(fpr, tpr, color=clr, lw=2,
            label=f'{name} (AUC = {auc:.4f})')

ax.plot([0,1],[0,1], 'k--', lw=1, alpha=0.5, label='Random (AUC = 0.5)')
ax.set_xlabel('False Positive Rate', fontsize=13)
ax.set_ylabel('True Positive Rate (Recall)', fontsize=13)
ax.set_title('ROC Curve', fontsize=16, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# ── 7.4 Precision-Recall Curves ─────────────────────────────
# More informative than ROC for imbalanced datasets.

fig, ax = plt.subplots(figsize=(9, 7))

for name, prob, clr in [
    ('Logistic Regression', lr_prob, '#2980b9'),
    ('Random Forest', rf_prob, '#c0392b')
]:
    p, r, _ = precision_recall_curve(y_test, prob)
    ap = average_precision_score(y_test, prob)
    ax.plot(r, p, color=clr, lw=2,
            label=f'{name} (AP = {ap:.4f})')

ax.set_xlabel('Recall', fontsize=13)
ax.set_ylabel('Precision', fontsize=13)
ax.set_title('Precision-Recall Curve', fontsize=16, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

**Interpretation:**
- The **ROC curve** shows overall discriminative ability — closer to the top-left is better.
- The **Precision-Recall curve** is more relevant for imbalanced problems because it focuses on the fraud class specifically.
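
A short calculation shows why the two curves can tell different stories. The sketch below takes one operating point (90% true positive rate, 1% false positive rate) and computes the resulting precision under two class mixes. The counts are hypothetical round numbers, not taken from the test set.

```python
def precision_at(tpr, fpr, n_fraud, n_genuine):
    """Precision implied by a ROC operating point and the class counts."""
    tp = tpr * n_fraud      # frauds correctly flagged
    fp = fpr * n_genuine    # genuine transactions wrongly flagged
    return tp / (tp + fp)

# Same ROC point, two different class mixes (illustrative counts).
imbalanced = precision_at(0.90, 0.01, n_fraud=100, n_genuine=50_000)
balanced = precision_at(0.90, 0.01, n_fraud=50_000, n_genuine=50_000)

print(f"Precision with 100 frauds in 50,100 rows: {imbalanced:.3f}")
print(f"Precision with balanced classes:          {balanced:.3f}")
```

The ROC point is identical in both cases, yet under heavy imbalance most alerts are false alarms. The Precision-Recall curve makes that visible, while the ROC curve does not.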

---

## 8. Model Comparison

In [None]:
# ── 8.1 Build Comparison Table ──────────────────────────────

def get_metrics(name, y_true, y_pred, y_prob):
    return {
        'Model': name,
        'Accuracy':  accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall':    recall_score(y_true, y_pred),
        'F1-Score':  f1_score(y_true, y_pred),
        'ROC-AUC':   roc_auc_score(y_true, y_prob)
    }

results = pd.DataFrame([
    get_metrics('Logistic Regression', y_test, lr_pred, lr_prob),
    get_metrics('Random Forest',       y_test, rf_pred, rf_prob)
]).set_index('Model').round(4)

print('Final Model Comparison:')
print('=' * 65)
results

In [None]:
# ── 8.2 Bar Chart Comparison ────────────────────────────────

metrics = ['Precision', 'Recall', 'F1-Score', 'ROC-AUC']
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

bar_colors = ['#2980b9', '#c0392b']

for i, m in enumerate(metrics):
    ax = axes[i]
    vals = results[m].values
    bars = ax.bar(results.index, vals, color=bar_colors, edgecolor='black')
    ax.set_title(m, fontsize=14, fontweight='bold')
    ax.set_ylim(0, 1.15)
    ax.tick_params(axis='x', rotation=15)
    for b, v in zip(bars, vals):
        ax.text(b.get_x()+b.get_width()/2, b.get_height()+0.02,
                f'{v:.4f}', ha='center', fontweight='bold', fontsize=11)

plt.suptitle('Model Performance Comparison',
             fontsize=18, fontweight='bold', y=1.04)
plt.tight_layout()
plt.show()

In [None]:
# ── 8.3 Best Model Selection ────────────────────────────────

best = results['F1-Score'].idxmax()

print('Best Model (by F1-Score):', best)
print(f'  Precision: {results.loc[best, "Precision"]:.4f}')
print(f'  Recall:    {results.loc[best, "Recall"]:.4f}')
print(f'  F1-Score:  {results.loc[best, "F1-Score"]:.4f}')
print(f'  ROC-AUC:   {results.loc[best, "ROC-AUC"]:.4f}')

---

## 9. Conclusion

### Key Findings

1. **The dataset is heavily imbalanced** (0.17% fraud). Accuracy is misleading — a model predicting all-genuine achieves 99.8% accuracy but catches zero fraud.

2. **SMOTE balanced the training data** by generating synthetic fraud samples, so the models see equal numbers of both classes without discarding genuine transactions, as undersampling would.

3. **Both Logistic Regression and Random Forest** performed well after SMOTE balancing. Random Forest generally achieves higher precision, while Logistic Regression may achieve higher recall depending on the threshold.

4. **Recall is the most important metric** for fraud detection:
   - A **missed fraud** (False Negative) = real financial loss to the customer and bank.
   - A **false alarm** (False Positive) = minor inconvenience, resolved with a verification call.
   - Therefore, we prioritize catching as many frauds as possible, even at the cost of some false alarms.

5. **F1-Score** serves as the single selection metric in Section 8 because it balances precision and recall; selecting on recall alone would favor a model that flags every transaction as fraud.

### Recommendations

- Use **threshold tuning** to adjust the precision-recall tradeoff for business needs
- Consider **Gradient Boosting** (XGBoost, LightGBM) for potential improvements
- Implement **cost-sensitive learning** to weight fraud misclassification more heavily
- **Retrain periodically** — fraud patterns evolve over time
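
The threshold-tuning recommendation can be sketched as follows. The helper `threshold_for_recall` is a hypothetical function written for this example, and the labels and scores are made up. In practice the threshold would be chosen on a validation split rather than on the test set.

```python
import numpy as np

def threshold_for_recall(y_true, y_score, target_recall):
    """Return the highest score threshold whose recall meets target_recall.

    Assumes y_true contains at least one positive (1) label.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_pos = (y_true == 1).sum()
    # Try the distinct scores as thresholds, from highest to lowest.
    for t in np.unique(y_score)[::-1]:
        flagged = y_score >= t
        recall = (flagged & (y_true == 1)).sum() / n_pos
        if recall >= target_recall:
            return float(t)
    return float(y_score.min())  # only reached if target_recall > 1

# Hypothetical validation labels and predicted fraud probabilities.
labels = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

chosen = threshold_for_recall(labels, scores, target_recall=1.0)
flagged = scores >= chosen
precision_at_chosen = (flagged & (labels == 1)).sum() / flagged.sum()

print(f"Chosen threshold: {chosen:.2f}")
print(f"Precision there:  {precision_at_chosen:.2f}")
```

With the default cutoff of 0.5 this toy example would miss the fraud scored at 0.40. Lowering the threshold recovers it at the cost of one additional false alarm.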

---

*Project by **Piyush Ramteke** — CodSoft Data Science Internship*