# Module 1, Task 3: Data Loading and Augmentation Using PyTorch

**Objective:** Build an efficient data loading pipeline in PyTorch using `Dataset` and `DataLoader`, and apply image augmentations with `torchvision.transforms`.

In [None]:
# Install necessary libraries
!pip install torch torchvision torchaudio matplotlib numpy

### Setup
Import the libraries and load a dataset with `torchvision.datasets`. The task targets EuroSAT; this notebook uses CIFAR-10 as a built-in stand-in.

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split
import matplotlib.pyplot as plt
import numpy as np

BATCH_SIZE = 64
IMG_SIZE = 128

# Note: torchvision 0.13 and later ship a EuroSAT loader; older releases do not,
# and the data would have to be downloaded manually. To keep this notebook
# runnable everywhere, we use a built-in dataset, CIFAR-10, as a stand-in and
# describe how the process would work for EuroSAT.
# With a recent torchvision, the EuroSAT version would look like this:
# dataset = torchvision.datasets.EuroSAT(root='./data', download=True, transform=...)

# Using CIFAR-10 as a stand-in for demonstration purposes.
print("Using CIFAR-10 as a demonstrative dataset.")
# We will define transforms later
full_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True)

# Split into training and validation
train_size = int(0.8 * len(full_dataset))
val_size = len(full_dataset) - train_size
train_dataset, val_dataset = random_split(full_dataset, [train_size, val_size])

CLASS_NAMES = full_dataset.classes
NUM_CLASSES = len(CLASS_NAMES)

print(f"Number of training samples: {len(train_dataset)}")
print(f"Number of validation samples: {len(val_dataset)}")
print(f"Number of classes: {NUM_CLASSES}")
print(f"Class names: {CLASS_NAMES}")
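The 80/20 split above draws a new random permutation on every run, so the validation set changes between runs. A minimal sketch of a reproducible split follows, assuming a fixed seed is acceptable. The helper `seeded_split`, the seed value `42`, and the toy `range(100)` dataset are illustrative assumptions, not part of the notebook.

```python
import torch
from torch.utils.data import random_split

# Toy stand-in for a real dataset: any indexable object with a length works here.
toy_dataset = range(100)
train_len = int(0.8 * len(toy_dataset))
val_len = len(toy_dataset) - train_len


def seeded_split(dataset, lengths, seed=42):
    """Split `dataset` reproducibly; `seed` is an arbitrary illustrative value."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, lengths, generator=generator)


train_a, val_a = seeded_split(toy_dataset, [train_len, val_len])
train_b, val_b = seeded_split(toy_dataset, [train_len, val_len])

# The same seed yields the same partition, and the two parts share no index.
print(train_a.indices == train_b.indices)
print(set(train_a.indices).isdisjoint(val_a.indices))
```

Passing the same kind of seeded generator to the notebook's `random_split` call would make the training and validation subsets identical across runs.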

### Data Augmentation and Transformation

In PyTorch, we define a sequence of transformations using `transforms.Compose`. We'll create separate transform pipelines for training (with augmentation) and validation (without augmentation).

In [None]:
# Transforms for the training set (with augmentation)
train_transform = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomResizedCrop(IMG_SIZE, scale=(0.8, 1.0)),
    transforms.ToTensor(),  # Converts a PIL image to a float [C, H, W] tensor scaled to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # ImageNet channel mean/std
])

# Transforms for the validation set (only resize, convert to tensor, and normalize)
val_transform = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
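The pipelines above rely on each step receiving the output of the previous one, which is why `ToTensor` comes before `Normalize`. The sketch below illustrates that chaining with a simplified, hypothetical stand-in (`SimpleCompose`) and two toy functions on plain lists. It is not torchvision's implementation, only a way to see why the order matters.

```python
class SimpleCompose:
    """Hypothetical, simplified stand-in for transforms.Compose."""

    def __init__(self, steps):
        self.steps = list(steps)

    def __call__(self, sample):
        # Feed the output of each step into the next one, in order.
        for step in self.steps:
            sample = step(sample)
        return sample


def to_unit_range(pixels):
    """Toy analogue of ToTensor's scaling: map 0..255 to 0..1."""
    return [p / 255.0 for p in pixels]


def normalize(pixels, mean=0.5, std=0.5):
    """Toy analogue of Normalize: subtract the mean, divide by the std."""
    return [(p - mean) / std for p in pixels]


pipeline = SimpleCompose([to_unit_range, normalize])
swapped = SimpleCompose([normalize, to_unit_range])

result = pipeline([0, 255])
swapped_result = swapped([0, 255])
print(result)          # scaled first, then normalized
print(swapped_result)  # different values: the order of steps changes the result
```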

### Applying Transforms to Datasets

Subsets returned by `random_split` share a single underlying dataset object, so setting a transform on that object would apply it to both the training and validation subsets. A custom `Dataset` class that wraps each subset with its own transform is often the cleaner solution. For this example, we instead re-create the dataset once per transform pipeline and then split each copy.
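The custom `Dataset` approach mentioned above can be sketched as a thin wrapper that applies its own transform on top of any map-style dataset or subset. The class name `TransformedSubset` and the toy samples are hypothetical; this is one possible shape for such a wrapper, not the notebook's method.

```python
from torch.utils.data import Dataset


class TransformedSubset(Dataset):
    """Hypothetical wrapper that applies its own transform to another dataset's samples."""

    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, index):
        image, label = self.subset[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, label


# Toy (value, label) pairs stand in for (image, label) pairs.
toy_samples = [(1, 0), (2, 1), (3, 0)]
doubled = TransformedSubset(toy_samples, transform=lambda x: x * 2)
untouched = TransformedSubset(toy_samples)

print(doubled[1], untouched[1])
```

In this notebook, the same idea would read `TransformedSubset(train_dataset, train_transform)` and `TransformedSubset(val_dataset, val_transform)`. That works because `full_dataset` was created without a transform and therefore yields PIL images.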

In [None]:
# Re-create the dataset once per transform pipeline (the files are already on disk).
# A better practice for custom datasets is to write a custom Dataset class.
train_dataset_transformed = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=train_transform)
val_dataset_transformed = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=val_transform)

# Reuse the indices from the earlier split. Calling random_split twice would draw
# two different permutations, so training and validation samples would overlap.
train_dataset_final = torch.utils.data.Subset(train_dataset_transformed, train_dataset.indices)
val_dataset_final = torch.utils.data.Subset(val_dataset_transformed, val_dataset.indices)

print("Transforms have been applied.")

### Visualizing Augmented Data
Let's check the effect of our augmentation pipeline.

In [None]:
def imshow(img, title):
    """Helper function to un-normalize and display an image."""
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    img = img.numpy().transpose((1, 2, 0))
    img = std * img + mean # Un-normalize
    img = np.clip(img, 0, 1)
    plt.imshow(img)
    plt.title(title)
    plt.axis('off')

# Get one original image for comparison
original_img, original_label = val_dataset[0]

plt.figure(figsize=(12, 12))
plt.subplot(3, 3, 1)
plt.imshow(original_img)
plt.title(f"Original: {CLASS_NAMES[original_label]}")
plt.axis('off')

# Show 8 augmented versions of the same image
for i in range(8):
    augmented_img = train_transform(original_img)  # a new random augmentation on each call
    plt.subplot(3, 3, i + 2)
    imshow(augmented_img, title=f"Augmented: {CLASS_NAMES[original_label]}")

plt.suptitle("Data Augmentation Examples")
plt.show()
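The `imshow` helper inverts the normalization step before plotting. As a quick check of that arithmetic, the sketch below normalizes a toy image with the same channel statistics and confirms that `std * x + mean` recovers the original values. The random 4x4 image is an illustrative assumption.

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# Toy H x W x C image with values in [0, 1); the channel axis is last,
# so mean and std broadcast across height and width.
rng = np.random.default_rng(0)
image_hwc = rng.random((4, 4, 3))

normalized = (image_hwc - mean) / std  # what Normalize computes per channel
restored = std * normalized + mean     # what imshow does to undo it

print(np.allclose(restored, image_hwc))
```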

### Creating DataLoaders

`DataLoader` is a powerful utility that wraps our `Dataset` and provides:
1.  **Batching:** Automatically groups data into batches.
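2.  **Shuffling:** With `shuffle=True`, reorders the samples every epoch so that batch composition does not depend on the dataset's storage order.
3.  **Parallel Loading:** Uses multiple worker subprocesses (`num_workers`) to load and transform data in parallel, reducing the time the GPU spends waiting for input.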
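Batching and shuffling can be seen on a small scale with a toy dataset. The sketch below assumes ten samples and a batch size of four, both chosen only for illustration, and uses `num_workers=0` so it runs in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten toy samples: one feature each, with the sample index as its label.
features = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # shape [10, 1]
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, num_workers=0)

batch_sizes = []
seen_labels = []
for batch_features, batch_labels in loader:
    batch_sizes.append(batch_features.shape[0])
    seen_labels.extend(batch_labels.tolist())

print(batch_sizes)                             # the last batch holds the remainder
print(sorted(seen_labels) == list(range(10)))  # every sample appears exactly once
```

The last batch is smaller because `drop_last` defaults to `False`; setting `drop_last=True` would discard it instead.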

In [None]:
train_loader = DataLoader(
    dataset=train_dataset_final,
    batch_size=BATCH_SIZE,
    shuffle=True,      # Shuffle training data
    num_workers=2,     # Use 2 parallel processes for data loading
    pin_memory=True    # Speeds up CPU to GPU data transfer
)

val_loader = DataLoader(
    dataset=val_dataset_final,
    batch_size=BATCH_SIZE,
    shuffle=False,     # No need to shuffle validation data
    num_workers=2,
    pin_memory=True
)

print("Finalized PyTorch DataLoaders for training and validation.")
print(train_loader)
print(val_loader)

### Conclusion

We have built an efficient PyTorch data pipeline. We defined separate transform pipelines for the training set (with augmentation) and the validation set (without augmentation), applied them, and wrapped the datasets in `DataLoader` objects. The loaders use shuffling for training, parallel loading, and pinned memory, and are ready to be used in a PyTorch training loop.
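As a rough illustration of how such loaders are consumed, the sketch below iterates a toy loader for two epochs and moves each batch to the available device. The toy tensors, shapes, and epoch count are assumptions, and the model, loss, and optimizer steps are left as a comment.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-in for the image datasets: 8 random "images" of shape [3, 8, 8].
toy_images = torch.randn(8, 3, 8, 8)
toy_labels = torch.randint(0, 10, (8,))
toy_loader = DataLoader(
    TensorDataset(toy_images, toy_labels),
    batch_size=4,
    shuffle=True,
    pin_memory=torch.cuda.is_available(),  # pinning only helps when a GPU is present
)

num_seen = 0
for epoch in range(2):
    for images, labels in toy_loader:
        # non_blocking=True lets the copy overlap with compute when memory is pinned.
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        # Forward pass, loss, backward pass, and optimizer step would go here.
        num_seen += images.shape[0]

print(num_seen)
```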