# Exercise: Card Fraud Detection with SMOTE

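This notebook implements a mini-batch gradient descent training loop for a logistic regression fraud detector. The training data is balanced with SMOTE, loss is tracked per batch, and accuracy is evaluated once per epoch rather than once per batch.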

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

## Helper Functions

Define the necessary functions for training the model.

In [None]:
def sigmoid(z):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-z))

def predict(X, theta):
    """Make predictions using logistic regression"""
    z = np.dot(X, theta)
    return sigmoid(z)

def compute_loss(y_hat, y):
    """Compute binary cross-entropy loss"""
    epsilon = 1e-15  # To avoid log(0)
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return loss

def compute_gradient(X, y, y_hat):
    """Compute gradient for logistic regression"""
    m = X.shape[0]
    gradient = np.dot(X.T, (y_hat - y)) / m
    return gradient

def update_theta(theta, gradient, lr):
    """Update parameters using gradient descent"""
    return theta - lr * gradient

def compute_accuracy(X, y, theta):
    """Compute classification accuracy"""
    y_hat = predict(X, theta)
    predictions = (y_hat >= 0.5).astype(int)
    accuracy = np.mean(predictions == y)
    return accuracy
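
A quick way to gain confidence in a hand-written gradient is to compare it against a finite-difference approximation of the loss. The sketch below re-declares the loss and gradient formulas used by `compute_loss` and `compute_gradient` so that it runs on its own; the toy data and the names `bce_loss`, `analytic_gradient`, and `numerical_gradient` are illustrative assumptions, not part of the exercise.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_loss(X, y, theta):
    """Binary cross-entropy of a logistic model (same formula as compute_loss)."""
    y_hat = np.clip(sigmoid(X @ theta), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_gradient(X, y, theta):
    """Closed-form gradient (same formula as compute_gradient)."""
    return X.T @ (sigmoid(X @ theta) - y) / X.shape[0]

def numerical_gradient(X, y, theta, h=1e-6):
    """Central finite-difference approximation of the loss gradient."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (bce_loss(X, y, theta + step) - bce_loss(X, y, theta - step)) / (2 * h)
    return grad

# Toy data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))
y_toy = rng.integers(0, 2, size=50).astype(float)
theta_toy = rng.normal(size=4) * 0.1

max_diff = np.max(np.abs(analytic_gradient(X_toy, y_toy, theta_toy)
                         - numerical_gradient(X_toy, y_toy, theta_toy)))
print(f"Max difference between gradients: {max_diff:.2e}")
```

If the two gradients agree to many decimal places, the analytic formula is very likely consistent with the loss.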

## Data Preparation

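Generate a synthetic imbalanced dataset, split it into training, validation, and test sets, standardize the features, and apply SMOTE to the training split only. The validation and test sets keep the original class ratio.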

In [None]:
# Generate synthetic imbalanced dataset for fraud detection
np.random.seed(42)
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    weights=[0.95, 0.05],  # 95% normal, 5% fraud
    flip_y=0.01,
    random_state=42
)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Further split training data for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Apply SMOTE to balance the training data
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

print(f"Training samples: {X_train.shape[0]}")
print(f"Validation samples: {X_val.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
print(f"Training class distribution: {np.bincount(y_train.astype(int))}")
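
To build intuition for what SMOTE does to the minority class, the sketch below generates synthetic points by interpolating between a minority sample and one of its nearest minority neighbors. This is a simplified illustration of the idea, not imbalanced-learn's implementation; the function `smote_like_oversample` and the toy data are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Simplified illustration only; not imbalanced-learn's implementation.
    """
    rng = np.random.default_rng(seed)
    # Pairwise squared distances between minority samples
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per sample

    base = rng.integers(0, X_min.shape[0], size=n_new)        # starting sample
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one of its neighbors
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 3))  # hypothetical minority-class samples
X_synth = smote_like_oversample(X_minority, n_new=30)
print(X_synth.shape)  # (30, 3)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.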

## Training Loop - CORRECTED VERSION

This is the **corrected** training loop that addresses the performance issues in the original code.

### Key Improvements:
1. **Batch-level metrics**: Only track loss at the batch level during training
2. **Epoch-level evaluation**: Compute accuracy on full train/val sets once per epoch (not per batch)
3. **Efficiency**: Avoids redundant full-dataset evaluations, reducing the number of accuracy passes from one per batch to one per epoch

### Original Issue:
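The original code computed accuracy on the entire training and validation datasets after every single batch. Each of those computations is a forward pass over every sample, so the evaluation cost far outweighs the cost of the batch update itself.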

In [None]:
# Initialize parameters
n_features = X_train.shape[1]
theta = np.random.randn(n_features) * 0.01

# Hyperparameters
epochs = 50
batch_size = 64
lr = 0.01

# Lists to store metrics
train_accs = []
train_losses = []
val_accs = []
val_losses = []

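for epoch in range(epochs):
    train_batch_losses = []

    # Shuffle each epoch: the SMOTE output is not shuffled (synthetic minority
    # rows follow the original rows), so fixed-order batches would be skewed
    perm = np.random.permutation(X_train.shape[0])
    X_shuffled, y_shuffled = X_train[perm], y_train[perm]

    # Training phase - iterate through batches
    for i in range(0, X_train.shape[0], batch_size):
        X_i = X_shuffled[i:i+batch_size]
        y_i = y_shuffled[i:i+batch_size]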
        
        # Forward pass
        y_hat = predict(X_i, theta)
        
        # Compute batch loss
        train_loss = compute_loss(y_hat, y_i)
        train_batch_losses.append(train_loss)
        
        # Backward pass
        gradient = compute_gradient(X_i, y_i, y_hat)
        
        # Update parameters
        theta = update_theta(theta, gradient, lr)
    
    # Epoch-level evaluation (done ONCE per epoch, not per batch)
    train_epoch_loss = sum(train_batch_losses) / len(train_batch_losses)
    train_acc = compute_accuracy(X_train, y_train, theta)
    
    # Validation evaluation (done ONCE per epoch)
    y_val_hat = predict(X_val, theta)
    val_loss = compute_loss(y_val_hat, y_val)
    val_acc = compute_accuracy(X_val, y_val, theta)
    
    # Store metrics
    train_losses.append(train_epoch_loss)
    train_accs.append(train_acc)
    val_losses.append(val_loss)
    val_accs.append(val_acc)
    
    # Print progress every 5 epochs
    if (epoch + 1) % 5 == 0:
        print(f"EPOCH {epoch + 1}:\t Training loss: {train_epoch_loss:.3f}\t Validation loss: {val_loss:.3f}\t Train Acc: {train_acc:.3f}\t Val Acc: {val_acc:.3f}")

## Comparison: Original vs Corrected Code

### Original Code (INEFFICIENT):
```python
for epoch in range(epochs):
    for i in range(0, X_train.shape[0], batch_size):
        # ... batch training ...
        
        # ❌ PROBLEM: Computing accuracy on ENTIRE dataset for EVERY batch
        train_acc = compute_accuracy(X_train, y_train, theta)
        train_batch_accs.append(train_acc)
        
        val_acc = compute_accuracy(X_val, y_val, theta)
        val_batch_accs.append(val_acc)
```

### Corrected Code (EFFICIENT):
```python
for epoch in range(epochs):
    for i in range(0, X_train.shape[0], batch_size):
        # ... batch training ...
        # ✓ Only track batch loss
        train_batch_losses.append(train_loss)
    
    # ✓ Compute accuracy ONCE per epoch
    train_acc = compute_accuracy(X_train, y_train, theta)
    val_acc = compute_accuracy(X_val, y_val, theta)
```

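### Performance Impact:
- With 100 batches per epoch and 50 epochs, for each of the training and validation sets:
  - Original: 100 × 50 = 5,000 accuracy computations on the full dataset
  - Corrected: 50 accuracy computations on the full dataset
  - **100x fewer full-dataset accuracy evaluations.** The gradient updates themselves cost the same in both versions.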
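
The counts above follow directly from the number of batches per epoch. A minimal sketch of the arithmetic, where the helper `accuracy_passes` and the sample count of 6,400 are assumptions chosen to reproduce the 100-batch example:

```python
import math

def accuracy_passes(n_samples, batch_size, epochs):
    """Count full-dataset accuracy passes for per-batch vs per-epoch evaluation."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    per_batch = batches_per_epoch * epochs  # original: one pass after every batch
    per_epoch = epochs                      # corrected: one pass after every epoch
    return per_batch, per_epoch

# 6,400 samples is a hypothetical size that gives exactly 100 batches of 64
per_batch, per_epoch = accuracy_passes(n_samples=6400, batch_size=64, epochs=50)
print(per_batch, per_epoch, per_batch // per_epoch)  # 5000 50 100
```

The ratio equals the number of batches per epoch, so the saving grows with the dataset size and shrinks with larger batches.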

## Visualization of Training History

In [None]:
# Plot training history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot losses
ax1.plot(train_losses, label='Training Loss', marker='o', markersize=3)
ax1.plot(val_losses, label='Validation Loss', marker='s', markersize=3)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Validation Loss')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot accuracies
ax2.plot(train_accs, label='Training Accuracy', marker='o', markersize=3)
ax2.plot(val_accs, label='Validation Accuracy', marker='s', markersize=3)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Training and Validation Accuracy')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nFinal Training Accuracy: {train_accs[-1]:.4f}")
print(f"Final Validation Accuracy: {val_accs[-1]:.4f}")

## Test Set Evaluation

In [None]:
# Evaluate on test set
test_acc = compute_accuracy(X_test, y_test, theta)
y_test_hat = predict(X_test, theta)
test_loss = compute_loss(y_test_hat, y_test)

print(f"Test Accuracy: {test_acc:.4f}")
print(f"Test Loss: {test_loss:.4f}")
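
Accuracy alone can be misleading on the imbalanced test set: with `weights=[0.95, 0.05]`, a classifier that labels every transaction as normal would already score roughly 95%. Precision and recall on the fraud class give a clearer picture. The sketch below is a minimal NumPy version; the helper `precision_recall` and the toy arrays are hypothetical and not part of the original exercise.

```python
import numpy as np

def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall for the positive (fraud) class."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # fraud correctly flagged
    fp = np.sum((y_pred == 1) & (y_true == 0))  # normal flagged as fraud
    fn = np.sum((y_pred == 0) & (y_true == 1))  # fraud that was missed
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Toy labels and predicted probabilities (illustrative only)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob_toy = np.array([0.1, 0.2, 0.1, 0.7, 0.3, 0.2, 0.1, 0.4, 0.9, 0.3])

precision, recall = precision_recall(y_true_toy, y_prob_toy)
accuracy = np.mean((y_prob_toy >= 0.5).astype(int) == y_true_toy)
print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}")
# Accuracy: 0.80, Precision: 0.50, Recall: 0.50
```

In this notebook the same helper could be called as `precision_recall(y_test, predict(X_test, theta))`.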

## Summary

This notebook demonstrates:
1. **Proper training loop implementation** - efficient batch training with epoch-level evaluation
2. **SMOTE for imbalanced data** - handling fraud detection's class imbalance
3. **Metric tracking** - proper separation of batch-level and epoch-level metrics
4. **Performance optimization** - avoiding redundant computations

The key lesson: **Compute expensive metrics (like accuracy on full dataset) only when necessary, not at every batch!**