# PyTorch Tutorial: Working with Real Data

In the real world, data doesn't come in nice pre-packaged tensors. It comes in CSV files, image folders, text documents, and databases. This notebook teaches you how to handle real-world data in PyTorch.

## Learning Objectives
- Understand `Dataset` and `DataLoader` classes
- Create custom datasets for CSV and image data
- Use DataLoaders for batching and shuffling
- Apply data transformations and augmentation

---

## The PyTorch Data Pipeline

The standard pipeline for handling data in PyTorch involves two main classes:

1. **`torch.utils.data.Dataset`**: Stores the samples and their corresponding labels.
2. **`torch.utils.data.DataLoader`**: Wraps an iterable around the Dataset to enable easy access to the samples.


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from PIL import Image
import os
import torchvision.transforms as transforms
import matplotlib.pyplot as plt

torch.manual_seed(42)
np.random.seed(42)

print("PyTorch version:", torch.__version__)

## 1. Custom Dataset from CSV

Let's simulate a common scenario: you have a CSV file with features and labels.

First, we'll create a dummy CSV file.

In [None]:
# Create a dummy CSV file
data = {
    'feature1': np.random.rand(100),
    'feature2': np.random.rand(100),
    'feature3': np.random.rand(100),
    'label': np.random.randint(0, 2, 100)  # Binary classification
}
df = pd.DataFrame(data)
df.to_csv('dummy_data.csv', index=False)
print("Created dummy_data.csv")
print(df.head())

Now, let's create a custom Dataset class to read this CSV.

A custom map-style `Dataset` typically implements three methods (`__len__` and `__getitem__` are the two that `DataLoader` relies on):
- `__init__`: Initialize data, download, etc.
- `__len__`: Return the size of the dataset
- `__getitem__`: Return one sample (feature, label) at the given index

### Design Decisions in `__getitem__`

**Lazy loading vs eager loading**: In `__init__`, you can either load all data into memory (eager) or store only file paths and load on-demand (lazy).
- **Eager**: `self.data = pd.read_csv(file)`. Fast access, high memory usage. Good for datasets that fit in RAM.
- **Lazy**: `self.file = file` and read one row in `__getitem__`. Slow access, low memory. Good for huge datasets or images.
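
The lazy option can be sketched as follows. This is one possible approach, not the only one: `LazyCSVDataset` and the demo file are hypothetical names, and the sketch assumes a simple CSV (no quoted commas or embedded newlines) with three feature columns followed by an integer label. It records the byte offset of every row once in `__init__`, then seeks to a single row in `__getitem__`.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset


class LazyCSVDataset(Dataset):
    """Reads one CSV row per __getitem__ call instead of loading the whole file."""

    def __init__(self, csv_file):
        self.csv_file = csv_file
        self.offsets = []  # byte offset of the start of each data row
        with open(csv_file, "rb") as f:
            f.readline()  # skip the header row
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip():
                    self.offsets.append(pos)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Jump straight to the requested row and parse only that line
        with open(self.csv_file, "rb") as f:
            f.seek(self.offsets[idx])
            row = f.readline().decode("utf-8").strip().split(",")
        features = torch.tensor([float(v) for v in row[:3]], dtype=torch.float32)
        label = torch.tensor(int(row[3]), dtype=torch.long)
        return features, label


# Tiny demo file so the sketch runs on its own
demo_path = os.path.join(tempfile.mkdtemp(), "lazy_demo.csv")
with open(demo_path, "w") as f:
    f.write("feature1,feature2,feature3,label\n")
    f.write("0.1,0.2,0.3,0\n")
    f.write("0.4,0.5,0.6,1\n")

lazy_dataset = LazyCSVDataset(demo_path)
print(len(lazy_dataset))
print(lazy_dataset[1])
```

Only the list of offsets lives in memory, so the cost per sample is a file open plus a seek. That trade is usually worthwhile only when the file is too large for the eager version.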

**Transform placement**: Apply transforms in `__getitem__`, not `__init__`. This way, data augmentation (random flips, crops) gives a different result each epoch, effectively increasing your dataset size.
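
To see why the placement matters, here is a minimal sketch with a hypothetical `NoisyDataset` and an `add_noise` function standing in for a real random augmentation. Because the transform runs inside `__getitem__`, requesting the same index twice gives two different samples.

```python
import torch
from torch.utils.data import Dataset


class NoisyDataset(Dataset):
    """Hypothetical dataset that applies an optional transform on every access."""

    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transform is not None:
            sample = self.transform(sample)  # runs each time the sample is requested
        return sample


def add_noise(x):
    # Stand-in for a random augmentation such as a flip or crop
    return x + 0.1 * torch.randn_like(x)


base = torch.zeros(5, 3)
augmented = NoisyDataset(base, transform=add_noise)

first_epoch = augmented[0]
second_epoch = augmented[0]
print(torch.equal(first_epoch, second_epoch))  # False: the two random draws differ
```

Had the transform been applied once in `__init__`, both reads would return the same fixed, pre-augmented sample.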

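**Return types**: Return tensors rather than NumPy arrays or Python scalars. The default `collate_fn` can also batch NumPy arrays and numbers, but returning tensors makes the dtypes explicit and keeps batching and GPU transfer predictable.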

In [None]:
class CSVDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        # Get features (columns 0-2)
        features = self.data.iloc[idx, 0:3].values.astype(np.float32)
        # Get label (column 3)
        label = self.data.iloc[idx, 3]
        
        # Convert to tensors
        features = torch.tensor(features)
        label = torch.tensor(label, dtype=torch.long)
        
        return features, label

# Test the dataset
dataset = CSVDataset('dummy_data.csv')
print(f"Dataset length: {len(dataset)}")
features, label = dataset[0]
print(f"First sample features: {features}")
print(f"First sample label: {label}")

## 2. Using DataLoader

The `DataLoader` handles batching, shuffling, and parallel data loading.

In [None]:
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# Iterate through the dataloader
print("Iterating through DataLoader:")
for i, (features, labels) in enumerate(dataloader):
    print(f"Batch {i}:")
    print(f"  Features shape: {features.shape}")
    print(f"  Labels: {labels}")
    if i == 2:  # Stop after 3 batches
        break
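
The batching arithmetic is worth checking once by hand. The sketch below uses a stand-in `TensorDataset` (an assumption, so the example runs without the CSV) to show how `batch_size` and `drop_last` determine the number and size of batches, and where the GPU-oriented options go.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 100 samples with 3 features, like the dummy CSV
toy_dataset = TensorDataset(torch.rand(100, 3), torch.randint(0, 2, (100,)))

full_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True)
trimmed_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True, drop_last=True)

print(len(full_loader))     # 4 batches: 32 + 32 + 32 + 4
print(len(trimmed_loader))  # 3 batches: the final partial batch is dropped
print([xb.shape[0] for xb, _ in full_loader])  # [32, 32, 32, 4]

# For GPU training, parallel loading and pinned memory are common additions
gpu_loader = DataLoader(
    toy_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=0,  # raise above 0 to load batches in worker processes
    pin_memory=torch.cuda.is_available(),
)
```

`drop_last=True` is useful when a tiny final batch would destabilize layers such as batch normalization. Otherwise the default keeps every sample.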

## 3. Image Data and Transforms

For images, we often use `torchvision.transforms` for preprocessing and augmentation.

Common transforms:
- `Resize`: Change image size
- `ToTensor`: Convert image to tensor (scales to 0-1, moves channels first)
- `Normalize`: Normalize with mean and std
- `RandomHorizontalFlip`: Data augmentation

### Why Normalize?

Neural networks train faster when input features are on a similar scale. Raw pixel values (0-255) create large activations and gradients. Normalization centers data around 0 with unit variance.

The standard ImageNet normalization values (`mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]`) are widely used because many pretrained models were trained with them. Feeding such a model inputs normalized with different statistics shifts the input distribution it expects and typically hurts transfer-learning accuracy.
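
The per-channel arithmetic behind normalization can be written out in plain tensor operations. This sketch uses a random fake batch (an assumption; real images would come from your dataset) and computes the statistics from that batch rather than using the ImageNet constants.

```python
import torch

torch.manual_seed(0)
# Fake batch of 8 RGB images already scaled to [0, 1]: shape (N, C, H, W)
images = torch.rand(8, 3, 16, 16)

# Per-channel statistics, computed over batch, height, and width
channel_mean = images.mean(dim=(0, 2, 3))
channel_std = images.std(dim=(0, 2, 3))

# Broadcast the (C,) statistics to (1, C, 1, 1) and standardize each channel
normalized = (images - channel_mean[None, :, None, None]) / channel_std[None, :, None, None]

print(normalized.mean(dim=(0, 2, 3)))  # close to 0 for each channel
print(normalized.std(dim=(0, 2, 3)))   # close to 1 for each channel
```

`transforms.Normalize(mean, std)` applies the same subtract-and-divide step per channel, using the constants you pass in instead of statistics computed on the fly.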

### Data Augmentation: Why It Works

Augmentation artificially increases dataset size by applying random transformations:
- **RandomHorizontalFlip**: A cat flipped horizontally is still a cat
- **RandomCrop**: Forces the model to recognize objects anywhere in the frame
- **ColorJitter**: Makes the model robust to lighting changes
- **RandAugment**: Applies a random subset of augmentations (modern approach)

**The key insight**: Augmentation acts as a regularizer. It tells the model "these variations don't change the label," preventing overfitting to exact pixel patterns.
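
As an illustration of "the variation doesn't change the label," here is a hand-rolled horizontal flip in plain PyTorch. The function `random_horizontal_flip` is a hypothetical stand-in written for this example; in practice you would normally use `transforms.RandomHorizontalFlip`.

```python
import torch


def random_horizontal_flip(image, p=0.5):
    """Flip a (C, H, W) image left-to-right with probability p."""
    if torch.rand(1).item() < p:
        return torch.flip(image, dims=[-1])  # reverse the width axis
    return image


image = torch.arange(12, dtype=torch.float32).reshape(1, 3, 4)
label = 1  # hypothetical class index, e.g. "cat"

flipped = random_horizontal_flip(image, p=1.0)  # p=1.0 forces the flip for the demo
print(image[0, 0])    # tensor([0., 1., 2., 3.])
print(flipped[0, 0])  # tensor([3., 2., 1., 0.])
# The pixels changed, but the training pair is still (flipped, label): the label is untouched
```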

### Train vs Test Transforms

Always use **different transforms** for training and evaluation:
```python
import torchvision.transforms as transforms

# ImageNet statistics, as quoted above
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),    # Random augmentation
    transforms.RandomHorizontalFlip(),    # Random augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])

test_transform = transforms.Compose([
    transforms.Resize(256),               # Deterministic
    transforms.CenterCrop(224),           # Deterministic
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])
```
Using random augmentations during evaluation would make results non-reproducible.

In [None]:
# Define transforms
transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.RandomHorizontalFlip(p=0.5),  # Augmentation
    transforms.ToTensor(),  # Converts [0, 255] to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])  # Normalize to [-1, 1]
])

print("Transforms defined!")

## 4. ImageFolder

If your images are organized in folders by class (e.g., `data/cats/`, `data/dogs/`), you can use `torchvision.datasets.ImageFolder`.

```python
from torchvision.datasets import ImageFolder

# dataset = ImageFolder(root='path/to/data', transform=transform)
```

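This automatically assigns labels based on folder names! Class indices follow the alphabetical order of the folder names, and the mapping is available as `dataset.class_to_idx`.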

## Key Takeaways (Part 1)

1. **Dataset Class**: Implement `__len__` and `__getitem__` to handle your custom data. Choose eager vs lazy loading based on dataset size.
2. **DataLoader**: Use this for efficient batching and shuffling. Set `num_workers > 0` for parallel loading and `pin_memory=True` for GPU training.
3. **Transforms**: Essential for image preprocessing and augmentation. Always use different transforms for training (random augmentations) and evaluation (deterministic).
4. **ImageFolder**: The easiest way to load classification datasets organized by folder.
5. **Normalization**: Standardize inputs so the network trains faster. Use ImageNet stats for transfer learning.

---

# PART 2: HANDLING IMBALANCED DATA (CRITICAL for Real-World ML)

This is one of the most common problems in production ML!

## 5. The Imbalanced Data Problem

### What is Class Imbalance?
When classes have very different numbers of samples:
- **Fraud detection**: 99.9% normal, 0.1% fraud
- **Medical diagnosis**: 95% healthy, 5% disease
- **Click prediction**: 98% no-click, 2% click

### Why It's a Problem
- The model learns to always predict the majority class
- **Example**: 99% accuracy by always predicting "normal" (useless!)
- Minority class (the important one!) gets ignored
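
The accuracy trap is easy to reproduce. This sketch uses hypothetical labels (990 normal, 10 fraud) and a "model" that always predicts the majority class.

```python
import torch

# Hypothetical labels: 990 "normal" (0) and 10 "fraud" (1)
true_labels = torch.cat([torch.zeros(990, dtype=torch.long),
                         torch.ones(10, dtype=torch.long)])
# A "model" that always predicts the majority class
baseline_preds = torch.zeros_like(true_labels)

baseline_accuracy = (baseline_preds == true_labels).float().mean().item()
caught_fraud = ((baseline_preds == 1) & (true_labels == 1)).sum().item()
baseline_recall = caught_fraud / (true_labels == 1).sum().item()

print(f"Accuracy: {baseline_accuracy:.2f}")  # 0.99
print(f"Recall:   {baseline_recall:.2f}")    # 0.00
```

The baseline scores 99% accuracy while catching none of the fraud cases, which is why accuracy alone is not a usable metric here.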

### The Four Strategies to Handle Imbalanced Data

**1. Class Weights (Most Common)**
Penalize mistakes on the minority class more heavily in the loss function:
```python
import torch
import torch.nn as nn
# Inverse frequency weighting (example counts: 999 normal, 1 fraud)
class_counts = torch.tensor([999.0, 1.0])
weights = 1.0 / class_counts
loss_fn = nn.CrossEntropyLoss(weight=weights)
```
This tells the model: "Getting a fraud case wrong costs about 1000x more than getting a normal case wrong."
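
In practice the counts usually come from the training labels. The sketch below, using hypothetical labels (990 majority, 10 minority), derives the weights with `torch.bincount` and uses `reduction="none"` to expose the per-sample losses, so the effect of the weighting is visible.

```python
import torch
import torch.nn as nn

# Hypothetical training labels: 990 majority (0) and 10 minority (1)
train_labels = torch.cat([torch.zeros(990, dtype=torch.long),
                          torch.ones(10, dtype=torch.long)])

counts = torch.bincount(train_labels, minlength=2).float()  # tensor([990., 10.])
class_weights = 1.0 / counts                                # inverse frequency

per_sample_loss = nn.CrossEntropyLoss(weight=class_weights, reduction="none")

# Two equally wrong predictions: one on a majority sample, one on a minority sample
example_logits = torch.tensor([[0.0, 2.0], [2.0, 0.0]])
example_targets = torch.tensor([0, 1])

losses = per_sample_loss(example_logits, example_targets)
ratio = losses[1] / losses[0]
print(ratio)  # about 99, i.e. 990 / 10
```

With the default `reduction="mean"`, PyTorch divides by the sum of the weights of the targets in the batch, so the weighted loss stays on a comparable scale to the unweighted one.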

**2. Oversampling (SMOTE)**
Duplicate or synthetically generate minority class samples. SMOTE creates new samples by interpolating between existing minority samples.
- Pro: No information loss
- Con: Can overfit to minority class patterns
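
SMOTE itself lives in third-party libraries. A simpler variant that works with any PyTorch dataset is random oversampling with replacement via `WeightedRandomSampler`. The sketch below shows that swapped-in technique, not SMOTE, on hypothetical data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)
# Hypothetical imbalanced dataset: 950 majority (0) and 50 minority (1) samples
y = torch.cat([torch.zeros(950, dtype=torch.long), torch.ones(50, dtype=torch.long)])
X = torch.randn(1000, 3)
imbalanced = TensorDataset(X, y)

# Each sample's draw probability is proportional to 1 / (size of its class)
counts_per_class = torch.bincount(y).float()
sample_weights = 1.0 / counts_per_class[y]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
balanced_loader = DataLoader(imbalanced, batch_size=100, sampler=sampler)  # sampler replaces shuffle=True

drawn = torch.cat([yb for _, yb in balanced_loader])
print(f"Minority fraction after resampling: {drawn.float().mean().item():.2f}")  # roughly 0.5
```

Because minority samples are drawn repeatedly, pairing this with augmentation helps limit the overfitting risk noted above.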

**3. Undersampling**
Remove majority class samples to balance the dataset.
- Pro: Faster training
- Con: Throws away potentially useful data
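
Random undersampling can be sketched with index selection and `Subset`. The data here is hypothetical, and keeping exactly as many majority samples as minority samples is just one possible ratio.

```python
import torch
from torch.utils.data import Subset, TensorDataset

torch.manual_seed(0)
# Hypothetical imbalanced dataset: 950 majority (0) and 50 minority (1) samples
targets = torch.cat([torch.zeros(950, dtype=torch.long), torch.ones(50, dtype=torch.long)])
full_dataset = TensorDataset(torch.randn(1000, 3), targets)

minority_idx = torch.nonzero(targets == 1).squeeze(1)
majority_idx = torch.nonzero(targets == 0).squeeze(1)

# Keep a random subset of the majority class, the same size as the minority class
kept_majority = majority_idx[torch.randperm(len(majority_idx))[: len(minority_idx)]]
balanced_idx = torch.cat([kept_majority, minority_idx])

balanced_dataset = Subset(full_dataset, balanced_idx.tolist())
print(len(balanced_dataset))  # 100
```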

**4. Focal Loss**
Automatically focuses training on hard examples by down-weighting easy (confident) predictions:
```
FL(p) = -α(1-p)^γ · log(p)
```
Here `p` is the predicted probability of the true class. When `γ > 0`, well-classified examples (high p) contribute less to the loss. Originally developed for object detection (RetinaNet), now widely used for imbalanced problems.
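
A minimal sketch of the formula for multi-class logits follows. It assumes a single scalar `alpha` as in the formula above (some implementations use a per-class alpha instead). The defaults `alpha=0.25` and `gamma=2.0` are the values commonly quoted from the RetinaNet paper, so treat them as starting points rather than tuned settings.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p) = -alpha * (1 - p)^gamma * log(p), averaged over the batch.

    p is the predicted probability of the true class.
    """
    log_probs = F.log_softmax(logits, dim=1)
    log_p = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of the true class
    p = log_p.exp()
    return (-alpha * (1 - p) ** gamma * log_p).mean()


torch.manual_seed(0)
demo_logits = torch.randn(8, 2)
demo_targets = torch.randint(0, 2, (8,))

# With alpha=1 and gamma=0 the focal loss reduces to ordinary cross-entropy
print(focal_loss(demo_logits, demo_targets, alpha=1.0, gamma=0.0))
print(F.cross_entropy(demo_logits, demo_targets))

# With gamma > 0, every sample's loss is scaled by (1 - p)^gamma <= 1
print(focal_loss(demo_logits, demo_targets, alpha=1.0, gamma=2.0))
```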

### Which Metric to Use?

| Metric | When to Use | Why |
|--------|------------|-----|
| **Accuracy** | Balanced classes only | Misleading for imbalanced data |
| **Precision** | When false positives are costly (spam filter) | "Of everything flagged as fraud, how many were real fraud?" |
| **Recall** | When false negatives are costly (cancer detection) | "Of all actual cancers, how many did we catch?" |
| **F1-Score** | When you need a balance | Harmonic mean of precision and recall |
| **AUC-ROC** | Overall model quality | Threshold-independent; measures ranking ability |
| **PR-AUC** | Severely imbalanced data | More informative than AUC-ROC when positive class is rare |
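
One way to see the difference between the last two rows is to score a deliberately uninformative model. The sketch below uses a hypothetical test set with 1% positives and random scores. It uses scikit-learn's `average_precision_score`, which is one common way to summarize the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical test set with 1% positives
y_true = np.zeros(10_000, dtype=int)
y_true[:100] = 1

# Uninformative scores: random numbers unrelated to the labels
random_scores = rng.random(10_000)

roc_auc = roc_auc_score(y_true, random_scores)
pr_auc = average_precision_score(y_true, random_scores)

print(f"AUC-ROC: {roc_auc:.3f}")  # near 0.5, regardless of the imbalance
print(f"PR-AUC:  {pr_auc:.3f}")   # near the positive rate (0.01)
```

The chance level for AUC-ROC is always about 0.5, while the chance level for PR-AUC is the positive rate. A PR-AUC of 0.3 on this data would therefore be a large improvement over random, even though the number looks small.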

In [None]:
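# Example: Evaluating with proper metrics for imbalanced data
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Load the whole CSV dataset as a single batch so both classes are present
features, labels = next(iter(DataLoader(dataset, batch_size=len(dataset))))

# Simulate predictions with an untrained model (3 input features, 2 classes)
torch.manual_seed(42)
model_simple = nn.Sequential(
    nn.Linear(3, 32),
    nn.ReLU(),
    nn.Linear(32, 2)
)

# Get predictions
with torch.no_grad():
    logits = model_simple(features)
    probs = F.softmax(logits, dim=1)
    predictions = logits.argmax(dim=1)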

# Calculate metrics
print("=" * 60)
print("IMBALANCED DATA METRICS (What FAANG Interviewers Expect)")
print("=" * 60)

# Confusion Matrix (labels=[0, 1] keeps it 2x2 even if one class is never predicted)
cm = confusion_matrix(labels.numpy(), predictions.numpy(), labels=[0, 1])
print("\nConfusion Matrix:")
print(cm)
print(f"  True Negatives:  {cm[0,0]}")
print(f"  False Positives: {cm[0,1]}")
print(f"  False Negatives: {cm[1,0]}")
print(f"  True Positives:  {cm[1,1]}")

# Classification Report
print("\nClassification Report:")
print(classification_report(labels.numpy(), predictions.numpy(),
                            labels=[0, 1],
                            target_names=['Class 0 (Majority)', 'Class 1 (Minority)'],
                            zero_division=0))

# Calculate metrics manually for understanding
tp = cm[1, 1]  # True Positives
fp = cm[0, 1]  # False Positives
fn = cm[1, 0]  # False Negatives

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print("\n📊 Key Metrics for Class 1 (Minority):")
print(f"  Precision: {precision:.3f} (Of predicted frauds, how many were real?)")
print(f"  Recall:    {recall:.3f} (Of actual frauds, how many did we catch?)")
print(f"  F1-Score:  {f1:.3f} (Harmonic mean of precision and recall)")

# ROC-AUC (requires probabilities)
try:
    auc = roc_auc_score(labels.numpy(), probs[:, 1].numpy())
    print(f"  AUC-ROC:   {auc:.3f} (Overall discrimination ability)")
except ValueError:
    # scikit-learn can raise ValueError when the labels contain only one class
    print("  AUC-ROC:   N/A (need both classes in the labels)")

print("\n✅ These are the metrics that matter for imbalanced data!")

In [None]:
import os
if os.path.exists('dummy_data.csv'):
    os.remove('dummy_data.csv')
    print("Removed dummy_data.csv")