# 📦 Datasets and DataLoaders in PyTorch

This notebook demonstrates how to use PyTorch's built-in datasets and DataLoaders for efficient data handling.

## What You'll Learn
- Loading built-in datasets (FashionMNIST)
- Exploring dataset structure
- Using DataLoader for batching and shuffling
- Visualizing samples

In [None]:
import torch
from torchvision import datasets
from torchvision.transforms import ToTensor

from matplotlib import pyplot as plt

## Loading FashionMNIST Dataset

FashionMNIST is a dataset of 70,000 grayscale images of clothing items:
- 60,000 training images
- 10,000 test images
- 10 classes (T-shirt, Trouser, Pullover, etc.)
- Each image is 28×28 pixels

**Parameters:**
- `root`: Where to store the data
- `train`: True for training set, False for test set
- `download`: Download if not present
- `transform`: Preprocessing to apply to each image (`ToTensor` converts a PIL image to a float tensor with values scaled to [0, 1])

In [None]:
training_data = datasets.FashionMNIST(
    root='data',
    train=True,
    download=True,
    transform=ToTensor()
)

testing_data = datasets.FashionMNIST(
    root='data',
    train=False,
    download=True,
    transform=ToTensor()
)

## Dataset Size

Let's verify the dataset sizes match what we expect.

In [None]:
print(f"Training samples: {len(training_data)}")
print(f"Testing samples: {len(testing_data)}")

## Accessing Individual Samples

Each sample is a tuple of `(image, label)`:
- `image`: Tensor of shape [1, 28, 28] (1 channel, 28×28 pixels)
- `label`: Integer class index (0-9)
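
Indexing works because a map-style dataset implements `__len__` and `__getitem__`. The sketch below shows that contract with a hypothetical `RandomImageDataset` filled with random values; it is an illustration of the interface, not how FashionMNIST itself is implemented.

```python
import torch
from torch.utils.data import Dataset


class RandomImageDataset(Dataset):
    """Hypothetical dataset of random 1x28x28 'images' with random labels."""

    def __init__(self, num_samples=100, num_classes=10, seed=0):
        generator = torch.Generator().manual_seed(seed)
        self.images = torch.rand(num_samples, 1, 28, 28, generator=generator)
        self.labels = torch.randint(0, num_classes, (num_samples,), generator=generator)

    def __len__(self):
        # Called by len(dataset)
        return len(self.labels)

    def __getitem__(self, idx):
        # Called by dataset[idx]; returns an (image, label) tuple
        return self.images[idx], int(self.labels[idx])


toy_data = RandomImageDataset()
toy_image, toy_label = toy_data[0]
print(len(toy_data))    # 100
print(toy_image.shape)  # torch.Size([1, 28, 28])
```

Any object that follows this pattern can be passed to a `DataLoader` in the same way as the built-in datasets.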

In [None]:
image, label = training_data[0]
print(f"Label: {label}")
print(f"Image shape: {image.shape}")
print(f"Image shape (squeezed): {image.squeeze().shape}")

## Visualizing a Single Image

We use `squeeze()` to remove the channel dimension, because `plt.imshow` expects a 2D array of shape [28, 28] for a grayscale image rather than [1, 28, 28].

In [None]:
plt.figure(figsize=(2,2))
plt.imshow(image.squeeze(), cmap='gray')
plt.show()

## Class Labels Mapping

FashionMNIST has 10 classes. Let's create a mapping from index to name.

In [None]:
labels_map = {
    0: "T-shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot"
}

print(f"Class {label} = {labels_map[label]}")

## Visualizing Multiple Random Samples

Let's display a 3×3 grid of random samples from the training set.

In [None]:
figure = plt.figure(figsize=(8,8))
cols, rows = 3, 3
for i in range(1, cols * rows + 1):
    sample_idx = torch.randint(len(training_data), size=(1,)).item()
    img, label = training_data[sample_idx]
    figure.add_subplot(rows, cols, i)
    plt.title(labels_map[label])
    plt.axis("off")
    plt.imshow(img.squeeze(), cmap="gray")
plt.show()

## Creating DataLoaders

DataLoader wraps a dataset and provides:
- **Batching**: Groups samples into batches
- **Shuffling**: Randomizes order each epoch (important for training!)
- **Parallel loading**: Can use multiple worker processes (`num_workers > 0`) for faster loading; the default, `num_workers=0`, loads data in the main process

**Key parameters:**
- `batch_size=64`: Process 64 samples at a time
- `shuffle=True`: Randomize order (use True for training, False for testing)
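
The arithmetic behind `batch_size` can be checked on a small synthetic dataset. The sketch below assumes 1,000 all-zero samples wrapped in a `TensorDataset` (placeholder data, not FashionMNIST) and also shows the `drop_last` option, which the cells in this notebook do not use.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 1,000 blank "images" with dummy labels.
toy_images = torch.zeros(1000, 1, 28, 28)
toy_labels = torch.zeros(1000, dtype=torch.long)
toy_dataset = TensorDataset(toy_images, toy_labels)

loader = DataLoader(toy_dataset, batch_size=64, shuffle=False)
batch_sizes = [len(batch_labels) for _, batch_labels in loader]

print(len(loader))      # 16 batches, i.e. ceil(1000 / 64)
print(batch_sizes[-1])  # 40, the final partial batch (1000 - 15 * 64)

# drop_last=True discards the final partial batch instead of yielding it.
drop_loader = DataLoader(toy_dataset, batch_size=64, drop_last=True)
print(len(drop_loader))  # 15
```

By the same arithmetic, 60,000 training images with `batch_size=64` give 938 batches, the last of which holds 32 samples.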

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(testing_data, batch_size=64, shuffle=False)
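
To see what `shuffle` changes, the sketch below runs two loaders over a toy dataset holding the integers 0 to 9. The helper `epoch_order` and the seeded `torch.Generator` are additions for illustration; passing a seeded generator is one way to make the shuffled order repeatable across runs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset whose samples are simply the integers 0..9.
toy_dataset = TensorDataset(torch.arange(10))


def epoch_order(loader):
    """Return the order in which samples are visited during one full pass."""
    return [int(value) for (batch,) in loader for value in batch]


# shuffle=False: samples always come back in dataset order.
ordered = DataLoader(toy_dataset, batch_size=4, shuffle=False)
print(epoch_order(ordered))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# shuffle=True: a new permutation is drawn every time the loader is iterated.
generator = torch.Generator().manual_seed(0)
shuffled = DataLoader(toy_dataset, batch_size=4, shuffle=True, generator=generator)
first_epoch = epoch_order(shuffled)
second_epoch = epoch_order(shuffled)
print(first_epoch)
print(second_epoch)
```

Each pass still visits every sample exactly once; only the order changes, and the two shuffled passes will generally differ from each other.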

## Iterating Through DataLoader

DataLoader returns batches of data. Each batch contains:
- `images`: Tensor of shape [batch_size, 1, 28, 28]
- `labels`: Tensor of shape [batch_size]

In [None]:
data_iter = iter(train_dataloader)
images, labels = next(data_iter)

print(f"Batch images shape: {images.shape}")
print(f"Batch labels shape: {labels.shape}")

## Viewing a Sample from the Batch

In [None]:
print(f"First image label: {labels_map[labels[0].item()]}")

plt.figure(figsize=(2,2))
plt.imshow(images[0].squeeze(), cmap='gray')
plt.show()

## Training Loop Pattern

This is how you typically iterate through a DataLoader during training:

In [None]:
for images, labels in train_dataloader:
    print(f"Batch - Images: {images.shape}, Labels: {labels.shape}")
    break  # Just show first batch
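
For context, here is a rough sketch of what usually goes inside that loop. Everything in it is an assumption for illustration: the data is random, and the single linear layer, `CrossEntropyLoss`, SGD optimizer, and learning rate are placeholder choices rather than anything prescribed by this notebook.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data with FashionMNIST-like shapes (256 samples, 10 classes).
toy_dataset = TensorDataset(torch.rand(256, 1, 28, 28), torch.randint(0, 10, (256,)))
toy_loader = DataLoader(toy_dataset, batch_size=64, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

num_batches = 0
for batch_images, batch_labels in toy_loader:
    # Move the batch to the same device as the model
    batch_images = batch_images.to(device)
    batch_labels = batch_labels.to(device)

    logits = model(batch_images)         # forward pass
    loss = loss_fn(logits, batch_labels)

    optimizer.zero_grad()                # clear gradients from the previous step
    loss.backward()                      # backpropagate
    optimizer.step()                     # update parameters
    num_batches += 1

print(f"Processed {num_batches} batches, last loss: {loss.item():.4f}")
```

The DataLoader part is unchanged from the cell above: one `(images, labels)` pair per iteration, with the model update added inside the loop body.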

## Key Takeaways

1. **Datasets** provide access to individual samples via indexing
2. **DataLoaders** handle batching, shuffling, and parallel loading
3. **Always shuffle training data** so the model cannot learn from the order of the samples
4. **Don't shuffle test data** for reproducible evaluation
5. **Batch size** affects memory usage and training dynamics