In [1]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, Sampler, SequentialSampler, RandomSampler, SubsetRandomSampler, BatchSampler
import numpy as np
import matplotlib.pyplot as plt
torch.manual_seed(42)

<torch._C.Generator at 0x10c549210>

<span style="font-size: 15px;">Data rarely arrives in the final processed form required for training machine learning models. We use **transforms** to manipulate the data into a form suitable for training, and we need an efficient way to **load** it in batches during training.

PyTorch provides two fundamental data primitives:

1. **`torch.utils.data.Dataset`**: An abstract class representing a dataset. It stores samples and their corresponding labels.
2. **`torch.utils.data.DataLoader`**: Wraps an iterable around a `Dataset` to enable easy access to samples with support for batching, shuffling, and parallel loading.

In this notebook, we will investigate:

1. **Transforms**: How to preprocess and augment data using callable transforms
2. **Dataset**: How to create custom datasets by subclassing `torch.utils.data.Dataset`
3. **DataLoader**: How to efficiently load data in batches with various configurations
4. **Samplers**: How to control the order in which data is loaded

Throughout this notebook:
- $B$ denotes the batch size
- $N$ denotes the total number of samples in the dataset
- $C$, $H$, $W$ denote channels, height, and width for image data


**Overview**

| Component | Purpose | Key Methods/Parameters | Primary Use Cases |
|-----------|---------|----------------------|-------------------|
| **Dataset** | Store and access individual samples | `__init__`, `__len__`, `__getitem__` | Custom data handling, preprocessing |
| **DataLoader** | Batch, shuffle, and load data efficiently | `batch_size`, `shuffle`, `num_workers`, `collate_fn` | Training loops, batch iteration |
| **Transforms** | Preprocess and augment data | Callable classes/functions | Normalization, augmentation, tensor conversion |
| **Samplers** | Control data loading order | `SequentialSampler`, `RandomSampler`, `SubsetRandomSampler` | Custom sampling strategies |

Detailed explanations of each component, including implementation examples, follow below.

</span>

# Transforms

<span style="font-size: 15px;">

Transforms are **callable objects** (functions or classes with a `__call__` method) that take data as input and return transformed data. They are used to:

1. **Convert data formats**: e.g., PIL Image → Tensor, NumPy array → Tensor
2. **Normalize data**: Scale pixel values, standardize features
3. **Augment data**: Random crops, flips, rotations for training robustness
4. **Transform labels**: One-hot encoding, label smoothing

In PyTorch, transforms are typically applied in two places:
- `transform`: Applied to input features (e.g., images)
- `target_transform`: Applied to labels/targets

**Key Principle**: A transform is simply a callable that takes input and returns output:

$$
\text{output} = \text{Transform}(\text{input})
$$

</span>

## Custom Transforms

<span style="font-size: 15px;">

A transform can be implemented as:
1. **A simple function**: Takes input, returns output
2. **A callable class**: Has `__init__` for parameters and `__call__` for the transformation

Using a class is preferred when the transform has configurable parameters.

</span>

In [2]:
# Method 1: Transform as a simple function
def normalize_0_1(x):
    """Normalize tensor to [0, 1] range"""
    return (x - x.min()) / (x.max() - x.min())

# Test the function transform
x = torch.randn(3, 4) * 10 + 5  # Random values
x_normalized = normalize_0_1(x)
print(f"Original range: [{x.min():.2f}, {x.max():.2f}]")
print(f"Normalized range: [{x_normalized.min():.2f}, {x_normalized.max():.2f}]")

Original range: [-6.23, 27.08]
Normalized range: [0.00, 1.00]


In [7]:
# Method 2: Transform as a callable class (preferred for configurable transforms)
# This transform normalizes each channel of a (C, H, W) tensor with a given mean and std.
class Normalize:
    def __init__(self, mean, std):
        self.mean = torch.tensor(mean).view(-1, 1, 1)  # Shape: (C, 1, 1) for broadcasting
        self.std = torch.tensor(std).view(-1, 1, 1)
    def __call__(self, x):
        return (x - self.mean) / self.std
    def __repr__(self):
        return f"Normalize(mean={self.mean.squeeze().tolist()}, std={self.std.squeeze().tolist()})"

# Test the class transform (ImageNet normalization values)
normalize = Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
print(normalize)


# Simulate a 3-channel image (C, H, W) = (3, 4, 4)
image = torch.rand(3, 4, 4)
normalized_image = normalize(image)
print(f"\nOriginal image stats per channel:")
for c in range(3):
    print(f"  Channel {c}: mean={image[c].mean():.3f}, std={image[c].std():.3f}")
print(f"\nNormalized image stats per channel:")
for c in range(3):
    print(f"  Channel {c}: mean={normalized_image[c].mean():.3f}, std={normalized_image[c].std():.3f}")

Normalize(mean=[0.48500001430511475, 0.4560000002384186, 0.4059999883174896], std=[0.2290000021457672, 0.2240000069141388, 0.22499999403953552])

Original image stats per channel:
  Channel 0: mean=0.454, std=0.257
  Channel 1: mean=0.571, std=0.180
  Channel 2: mean=0.445, std=0.319

Normalized image stats per channel:
  Channel 0: mean=-0.136, std=1.120
  Channel 1: mean=0.512, std=0.804
  Channel 2: mean=0.172, std=1.419
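
The normalized statistics above are not 0 and 1 because the ImageNet mean and std do not describe a uniformly random image. As a minimal sketch (the variable names here are illustrative, not part of the notebook's classes), normalizing with statistics computed from the image itself gives per-channel mean approximately 0 and std approximately 1:

```python
import torch

torch.manual_seed(0)
image = torch.rand(3, 4, 4)  # simulated (C, H, W) image

# Per-channel statistics computed from this image itself
channel_mean = image.mean(dim=(1, 2), keepdim=True)  # shape (3, 1, 1)
channel_std = image.std(dim=(1, 2), keepdim=True)    # shape (3, 1, 1)

# Same formula as Normalize.__call__, with matching statistics
standardized = (image - channel_mean) / channel_std

for c in range(3):
    print(f"Channel {c}: mean={standardized[c].mean():.3f}, std={standardized[c].std():.3f}")
```

In practice the statistics are usually computed once over the training set rather than per image.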


## Composing Transforms

<span style="font-size: 15px;">

Often we need to apply multiple transforms in sequence. We can create a `Compose` class that chains transforms together:

$$
\text{output} = T_n(T_{n-1}(...T_2(T_1(\text{input}))))
$$

This mirrors the behavior of `torchvision.transforms.Compose`.

</span>

In [8]:
class Compose:
    """Compose multiple transforms together.
    
    Args:
        transforms: List of transforms to compose
    """
    def __init__(self, transforms):
        self.transforms = transforms
    
    def __call__(self, x):
        for transform in self.transforms:
            x = transform(x)
        return x
    
    def __repr__(self):
        format_string = self.__class__.__name__ + '('
        for t in self.transforms:
            format_string += f'\n    {t},'
        format_string += '\n)'
        return format_string

In [9]:
# Define additional transforms
class ToTensor:
    """Convert numpy array to PyTorch tensor."""
    def __call__(self, x):
        if isinstance(x, np.ndarray):
            # Handle image data: (H, W, C) -> (C, H, W)
            if x.ndim == 3:
                x = x.transpose(2, 0, 1)
            return torch.from_numpy(x).float()
        return x
    
    def __repr__(self):
        return "ToTensor()"

class RandomHorizontalFlip:
    """Randomly flip tensor horizontally with given probability."""
    def __init__(self, p=0.5):
        self.p = p
    
    def __call__(self, x):
        if torch.rand(1).item() < self.p:
            return x.flip(-1)  # Flip along last dimension (width)
        return x
    
    def __repr__(self):
        return f"RandomHorizontalFlip(p={self.p})"

# Compose multiple transforms
transform = Compose([
    ToTensor(),
    RandomHorizontalFlip(p=0.5),
    Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

print(transform)

Compose(
    ToTensor(),
    RandomHorizontalFlip(p=0.5),
    Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
)


In [10]:
# Test composed transform
# Simulate a numpy image (H, W, C): uint8 values in [0, 255], scaled to floats in [0, 1]
np_image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8).astype(np.float32) / 255.0
print(f"Input: numpy array, shape={np_image.shape}, dtype={np_image.dtype}")

transformed = transform(np_image)
print(f"Output: tensor, shape={transformed.shape}, dtype={transformed.dtype}")
print(f"Output range: [{transformed.min():.3f}, {transformed.max():.3f}]")

Input: numpy array, shape=(32, 32, 3), dtype=float32
Output: tensor, shape=torch.Size([3, 32, 32]), dtype=torch.float32
Output range: [-1.000, 1.000]


## Lambda Transforms

<span style="font-size: 15px;">

For simple, one-off transforms, we can use a `Lambda` wrapper that applies any user-defined function:

</span>

In [11]:
class Lambda:
    """Apply a user-defined lambda function as a transform.
    
    Args:
        lambd: Lambda function to apply
    """
    def __init__(self, lambd):
        self.lambd = lambd
    
    def __call__(self, x):
        return self.lambd(x)
    
    def __repr__(self):
        return f"Lambda({self.lambd})"

In [14]:
# Example: One-hot encoding for labels
num_classes = 10
one_hot_transform = Lambda(
    lambda y: torch.zeros(num_classes, dtype=torch.float).scatter_(0, torch.tensor(y), value=1)
)

# Test one-hot encoding
label = 3
one_hot_label = one_hot_transform(label)
print(f"Original label: {label}")
print(f"One-hot encoded: {one_hot_label}")
print(f"Shape: {one_hot_label.shape}")

Original label: 3
One-hot encoded: tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])
Shape: torch.Size([10])
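
For comparison, PyTorch also ships a built-in helper for this. The sketch below uses `torch.nn.functional.one_hot`, which works on a whole tensor of labels at once and returns integer values, so a cast to float is added here to match the `Lambda` version (the example labels are made up):

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([3, 0, 9])  # a batch of integer class labels

# one_hot returns an int64 tensor of shape (len(labels), num_classes)
one_hot = F.one_hot(labels, num_classes=10).float()

print(one_hot.shape)
print(one_hot[0])
```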


In [13]:
# Another example: Adding noise to data
add_noise = Lambda(lambda x: x + torch.randn_like(x) * 0.1)

x = torch.ones(5)
x_noisy = add_noise(x)
print(f"Original: {x}")
print(f"With noise: {x_noisy}")

Original: tensor([1., 1., 1., 1., 1.])
With noise: tensor([1.1109, 0.9746, 1.0581, 0.9678, 0.9511])


## Common Transform Patterns

<span style="font-size: 15px;">

Here we implement several commonly used transforms from scratch to understand their mechanics:

</span>

In [15]:
class Resize:
    """Resize tensor to given size using interpolation.
    
    Args:
        size: Target size (H, W) or single int for square
    """
    def __init__(self, size):
        if isinstance(size, int):
            self.size = (size, size)
        else:
            self.size = size
    
    def __call__(self, x):
        # x shape: (C, H, W) -> add batch dim -> (1, C, H, W)
        x = x.unsqueeze(0)
        x = torch.nn.functional.interpolate(x, size=self.size, mode='bilinear', align_corners=False)
        return x.squeeze(0)  # Remove batch dim
    
    def __repr__(self):
        return f"Resize(size={self.size})"

class CenterCrop:
    """Crop the center of the tensor.
    
    Args:
        size: Crop size (H, W) or single int for square
    """
    def __init__(self, size):
        if isinstance(size, int):
            self.size = (size, size)
        else:
            self.size = size
    
    def __call__(self, x):
        _, h, w = x.shape
        th, tw = self.size
        
        # Calculate crop coordinates
        top = (h - th) // 2
        left = (w - tw) // 2
        
        return x[:, top:top+th, left:left+tw]
    
    def __repr__(self):
        return f"CenterCrop(size={self.size})"

class RandomCrop:
    """Randomly crop the tensor.
    
    Args:
        size: Crop size (H, W) or single int for square
    """
    def __init__(self, size):
        if isinstance(size, int):
            self.size = (size, size)
        else:
            self.size = size
    
    def __call__(self, x):
        _, h, w = x.shape
        th, tw = self.size
        
        # Random crop coordinates
        top = torch.randint(0, h - th + 1, (1,)).item()
        left = torch.randint(0, w - tw + 1, (1,)).item()
        
        return x[:, top:top+th, left:left+tw]
    
    def __repr__(self):
        return f"RandomCrop(size={self.size})"

class StandardScaler:
    """Standardize features by removing the mean and scaling to unit variance.
    
    For each feature:
        z = (x - mean) / std
    """
    def __init__(self, mean=None, std=None):
        self.mean = mean
        self.std = std
    
    def fit(self, X):
        """Compute mean and std from data."""
        self.mean = X.mean(dim=0)
        self.std = X.std(dim=0)
        return self
    
    def __call__(self, x):
        if self.mean is None or self.std is None:
            raise ValueError("Must call fit() first or provide mean and std")
        return (x - self.mean) / (self.std + 1e-8)
    
    def __repr__(self):
        return f"StandardScaler(mean={self.mean}, std={self.std})"

In [16]:
# Test the transforms
image = torch.rand(3, 64, 64)  # (C, H, W)

resize = Resize(32)
center_crop = CenterCrop(48)
random_crop = RandomCrop(48)

print(f"Original shape: {image.shape}")
print(f"After Resize(32): {resize(image).shape}")
print(f"After CenterCrop(48): {center_crop(image).shape}")
print(f"After RandomCrop(48): {random_crop(image).shape}")

Original shape: torch.Size([3, 64, 64])
After Resize(32): torch.Size([3, 32, 32])
After CenterCrop(48): torch.Size([3, 48, 48])
After RandomCrop(48): torch.Size([3, 48, 48])


**Transforms Quick Reference**

| Transform | Purpose | Input → Output |
|-----------|---------|----------------|
| `ToTensor` | Convert to PyTorch tensor | NumPy array → Tensor |
| `Normalize` | Channel-wise normalization | Tensor → Tensor |
| `Resize` | Resize to target size | (C,H,W) → (C,H',W') |
| `CenterCrop` | Crop center region | (C,H,W) → (C,h,w) |
| `RandomCrop` | Random position crop | (C,H,W) → (C,h,w) |
| `RandomHorizontalFlip` | Flip with probability p | (C,H,W) → (C,H,W) |
| `Compose` | Chain multiple transforms | Input → Output |
| `Lambda` | Apply custom function | Input → Output |
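
The `StandardScaler` pattern (fit, then apply) matters most when data is split: the statistics should come from the training portion only and then be reused on held-out data. A minimal self-contained sketch of that idea, using plain tensor operations and a hypothetical 80/20 split:

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 4) * 3.0 + 2.0  # hypothetical feature matrix (N, D)
X_train, X_val = X[:80], X[80:]

# "Fit": compute per-feature statistics on the training split only
mean = X_train.mean(dim=0)
std = X_train.std(dim=0)

def standardize(x, eps=1e-8):
    """Apply the training statistics; eps guards against zero std."""
    return (x - mean) / (std + eps)

X_train_std = standardize(X_train)
X_val_std = standardize(X_val)  # reuses training statistics, no refit

print(X_train_std.mean(dim=0), X_train_std.std(dim=0))
```

The validation features will generally not have exactly zero mean and unit variance, which is expected.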

# Dataset

<span style="font-size: 15px;">

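The `Dataset` class is an abstract class representing a dataset. A custom map-style dataset inherits from `torch.utils.data.Dataset` and typically implements the following methods (only `__getitem__` is strictly required, but `DataLoader`'s default samplers also rely on `__len__`):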

1. **`__init__(self, ...)`**: Initialize the dataset (load data paths, set transforms, etc.)
2. **`__len__(self)`**: Return the total number of samples in the dataset
3. **`__getitem__(self, idx)`**: Return the sample (and label) at index `idx`

PyTorch supports two types of datasets:

| Type | Base Class | Access Pattern | Use Case |
|------|------------|----------------|----------|
| **Map-style** | `Dataset` | `dataset[idx]` | Random access, known size |
| **Iterable-style** | `IterableDataset` | `iter(dataset)` | Streaming data, unknown size |

We focus on **map-style** datasets, which are the most common.

</span>
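
For contrast with the map-style datasets used in the rest of this notebook, here is a minimal sketch of an iterable-style dataset. `CountingStream` is a hypothetical class that stands in for a real stream (a socket, a log file, a generator); it defines `__iter__` instead of `__getitem__`:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class CountingStream(IterableDataset):
    """Hypothetical stream yielding (feature, label) pairs one at a time."""
    def __init__(self, n_items):
        self.n_items = n_items

    def __iter__(self):
        for i in range(self.n_items):
            x = torch.tensor([float(i)])
            y = i % 2
            yield x, y

# No shuffle or sampler here: order is defined by the stream itself
stream_loader = DataLoader(CountingStream(10), batch_size=4)
batch_sizes = [len(x) for x, _ in stream_loader]
print(batch_sizes)  # [4, 4, 2]
```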

## Basic Custom Dataset

<span style="font-size: 15px;">

Let's create a simple dataset from in-memory data:

</span>

In [20]:
class SimpleDataset(Dataset):
    """A simple dataset holding features and labels in memory.
    
    Args:
        features: Tensor of shape (N, *) where N is number of samples
        labels: Tensor of shape (N,) or (N, *)
        transform: Optional transform to apply to features
        target_transform: Optional transform to apply to labels
    """
    def __init__(self, features, labels, transform=None, target_transform=None):
        assert len(features) == len(labels), "Features and labels must have same length"
        self.features = features
        self.labels = labels
        self.transform = transform
        self.target_transform = target_transform
    
    def __len__(self):
        """Return the total number of samples."""
        return len(self.features)
    
    def __getitem__(self, idx):
        """Return the sample and label at index idx."""
        x = self.features[idx]
        y = self.labels[idx]
        
        # Apply transforms if specified
        if self.transform:
            x = self.transform(x)
        if self.target_transform:
            y = self.target_transform(y)
        
        return x, y

In [21]:
# Create sample data
N = 100  # Number of samples
D = 10   # Feature dimension

X = torch.randn(N, D)  # Features: (N, D)
y = torch.randint(0, 5, (N,))  # Labels: (N,) with 5 classes

# Create dataset without transforms
dataset = SimpleDataset(X, y)

print(f"Dataset size: {len(dataset)}")
print(f"First sample:")
x_0, y_0 = dataset[0]
print(f"  Features shape: {x_0.shape}")
print(f"  Label: {y_0}")

Dataset size: 100
First sample:
  Features shape: torch.Size([10])
  Label: 0


In [27]:
# Create dataset with transforms
scaler = StandardScaler().fit(X)  # Fit scaler on data
# torch.as_tensor avoids the copy warning that torch.tensor emits when y is already a tensor
one_hot = Lambda(lambda y: torch.zeros(5).scatter_(0, torch.as_tensor(y), 1.0))

dataset_with_transforms = SimpleDataset(X, y, transform=scaler, target_transform=one_hot)

x_0, y_0 = dataset_with_transforms[0]
print(f"With transforms:")
print(f"  Features (standardized): mean={x_0.mean():.3f}, std={x_0.std():.3f}")
print(f"  Label (one-hot): {y_0}")

With transforms:
  Features (standardized): mean=0.032, std=0.994
  Label (one-hot): tensor([1., 0., 0., 0., 0.])



## Image Dataset from Files

<span style="font-size: 15px;">

A more realistic example is loading images from disk. This pattern is common when:
- The dataset is too large to fit in memory
- Images are stored in a directory structure
- Labels are stored in a CSV or annotation file

**Typical directory structure:**
```
data/
├── images/
│   ├── img001.jpg
│   ├── img002.jpg
│   └── ...
└── labels.csv
```

**labels.csv format:**
```
filename,label
img001.jpg,0
img002.jpg,1
...
```

</span>

In [29]:
import os

class ImageDataset(Dataset):
    """Dataset for loading images from a directory with labels from a CSV file.
    
    Args:
        annotations_file: Path to CSV file with columns [filename, label]
        img_dir: Directory containing images
        transform: Optional transform to apply to images
        target_transform: Optional transform to apply to labels
    """
    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
        # In practice, use: self.img_labels = pd.read_csv(annotations_file)
        # For this example, we'll simulate it
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform
        
        # Simulated annotations (filename, label)
        self.img_labels = [
            ('img001.jpg', 0),
            ('img002.jpg', 1),
            ('img003.jpg', 0),
        ]
    
    def __len__(self):
        return len(self.img_labels)
    
    def __getitem__(self, idx):
        # Get filename and label
        img_name, label = self.img_labels[idx]
        img_path = os.path.join(self.img_dir, img_name)
        
        # In practice, load image:
        # from torchvision.io import read_image
        # image = read_image(img_path)
        
        # For this example, simulate with random tensor
        image = torch.rand(3, 224, 224)  # Simulated RGB image
        
        # Apply transforms
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        
        return image, label

# Demo
img_dataset = ImageDataset(
    annotations_file='labels.csv',
    img_dir='data/images',
    transform=Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
)

print(f"Dataset size: {len(img_dataset)}")
img, label = img_dataset[0]
print(f"Image shape: {img.shape}, Label: {label}")

Dataset size: 3
Image shape: torch.Size([3, 224, 224]), Label: 0


## CSV/Tabular Dataset

<span style="font-size: 15px;">

For tabular data (e.g., from CSV files), we create a dataset that loads and returns feature vectors:

</span>

In [30]:
class CSVDataset(Dataset):
    """Dataset for loading tabular data from a CSV file.
    
    Args:
        data: NumPy array of shape (N, num_columns) standing in for the CSV contents
        feature_cols: List of column indices for features
        label_col: Column index for labels
        transform: Optional transform for features
        target_transform: Optional transform for labels
    """
    def __init__(self, data, feature_cols, label_col, transform=None, target_transform=None):
        # In practice: self.data = pd.read_csv(csv_file)
        # For this example, data is passed directly as numpy array
        self.features = torch.tensor(data[:, feature_cols], dtype=torch.float32)
        self.labels = torch.tensor(data[:, label_col], dtype=torch.long)
        self.transform = transform
        self.target_transform = target_transform
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        x = self.features[idx]
        y = self.labels[idx]
        
        if self.transform:
            x = self.transform(x)
        if self.target_transform:
            y = self.target_transform(y)
        
        return x, y

# Create sample tabular data
# Columns: [feature1, feature2, feature3, label]
np.random.seed(42)
data = np.random.randn(200, 4)
data[:, 3] = (data[:, 0] + data[:, 1] > 0).astype(int)  # Binary classification

csv_dataset = CSVDataset(
    data=data,
    feature_cols=[0, 1, 2],  # First 3 columns are features
    label_col=3              # Last column is label
)

print(f"Dataset size: {len(csv_dataset)}")
x, y = csv_dataset[0]
print(f"Features: {x}")
print(f"Label: {y}")

Dataset size: 200
Features: tensor([ 0.4967, -0.1383,  0.6477])
Label: 1


## Sequence Dataset (for RNNs)

<span style="font-size: 15px;">

For sequential data (time series, text), we often need to create overlapping windows or handle variable-length sequences:

</span>

In [31]:
class SequenceDataset(Dataset):
    """Dataset for creating sliding windows from sequential data.
    
    Given a sequence and window size, creates overlapping windows.
    Each window is used to predict the next value(s).
    
    Args:
        data: Tensor of shape (T, D) where T is sequence length, D is features
        seq_len: Length of input sequence (window size)
        pred_len: Length of prediction horizon (default: 1)
    """
    def __init__(self, data, seq_len, pred_len=1):
        self.data = data
        self.seq_len = seq_len
        self.pred_len = pred_len
    
    def __len__(self):
        # Number of valid windows
        return len(self.data) - self.seq_len - self.pred_len + 1
    
    def __getitem__(self, idx):
        # Input: seq_len consecutive values starting at idx
        x = self.data[idx:idx + self.seq_len]
        # Target: pred_len values after the input window
        y = self.data[idx + self.seq_len:idx + self.seq_len + self.pred_len]
        return x, y

# Create sample time series data
T = 100  # Total sequence length
D = 3    # Number of features
time_series = torch.randn(T, D)

seq_dataset = SequenceDataset(time_series, seq_len=10, pred_len=1)

print(f"Original sequence length: {T}")
print(f"Sequence length (input): 10")
print(f"Prediction length: 1")
print(f"Number of windows: {len(seq_dataset)}")

x, y = seq_dataset[0]
print(f"\nInput shape: {x.shape} (seq_len, features)")
print(f"Target shape: {y.shape} (pred_len, features)")

Original sequence length: 100
Sequence length (input): 10
Prediction length: 1
Number of windows: 90

Input shape: torch.Size([10, 3]) (seq_len, features)
Target shape: torch.Size([1, 3]) (pred_len, features)
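
Once windows are wrapped in a `Dataset`, batching them yields the `(B, seq_len, D)` layout that `batch_first=True` recurrent layers expect. The sketch below is self-contained, so it redefines a compact sliding-window dataset under the hypothetical name `WindowDataset`; the batch size of 32 is an arbitrary choice:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class WindowDataset(Dataset):
    """Hypothetical sliding-window dataset: x = seq_len steps, y = next pred_len steps."""
    def __init__(self, data, seq_len, pred_len=1):
        self.data, self.seq_len, self.pred_len = data, seq_len, pred_len

    def __len__(self):
        return len(self.data) - self.seq_len - self.pred_len + 1

    def __getitem__(self, idx):
        split = idx + self.seq_len
        return self.data[idx:split], self.data[split:split + self.pred_len]

series = torch.randn(100, 3)  # (T, D)
windows = WindowDataset(series, seq_len=10, pred_len=1)
loader = DataLoader(windows, batch_size=32, shuffle=False)

# 90 windows split into batches of 32, 32, and 26
shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
for x_shape, y_shape in shapes:
    print(f"x: {x_shape}, y: {y_shape}")
```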


# DataLoader

<span style="font-size: 15px;">

The `DataLoader` class wraps a `Dataset` and provides:

1. **Batching**: Group samples into mini-batches
2. **Shuffling**: Randomize the order of samples each epoch
3. **Parallel Loading**: Use multiple worker processes for faster data loading
4. **Collation**: Combine individual samples into batched tensors

**Constructor signature:**
```python
DataLoader(
    dataset,           # Dataset to load from
    batch_size=1,      # Number of samples per batch
    shuffle=False,     # Whether to shuffle data each epoch
    sampler=None,      # Custom sampler for loading order
    batch_sampler=None,# Custom batch sampler
    num_workers=0,     # Number of subprocesses for loading
    collate_fn=None,   # Function to merge samples into batch
    pin_memory=False,  # Copy batches into pinned (page-locked) host memory
    drop_last=False,   # Drop last incomplete batch
)
```

</span>

## Basic DataLoader Usage

In [32]:
# Create a simple dataset
N = 100
X = torch.randn(N, 5)
y = torch.randint(0, 3, (N,))
dataset = SimpleDataset(X, y)

# Create DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True
)

print(f"Dataset size: {len(dataset)}")
print(f"Batch size: 16")
print(f"Number of batches: {len(dataloader)}")
print(f"Expected batches: ceil({N}/16) = {(N + 15) // 16}")

Dataset size: 100
Batch size: 16
Number of batches: 7
Expected batches: ceil(100/16) = 7


In [33]:
# Iterate through DataLoader
for batch_idx, (batch_x, batch_y) in enumerate(dataloader):
    print(f"Batch {batch_idx}: features shape = {batch_x.shape}, labels shape = {batch_y.shape}")
    if batch_idx >= 2:  # Only show first 3 batches
        print("...")
        break

Batch 0: features shape = torch.Size([16, 5]), labels shape = torch.Size([16])
Batch 1: features shape = torch.Size([16, 5]), labels shape = torch.Size([16])
Batch 2: features shape = torch.Size([16, 5]), labels shape = torch.Size([16])
...


In [34]:
# Get a single batch using next(iter(...))
batch_x, batch_y = next(iter(dataloader))
print(f"Single batch:")
print(f"  Features: {batch_x.shape}")
print(f"  Labels: {batch_y.shape}")
print(f"  Label values: {batch_y}")

Single batch:
  Features: torch.Size([16, 5])
  Labels: torch.Size([16])
  Label values: tensor([1, 2, 0, 1, 1, 0, 1, 0, 2, 2, 2, 0, 0, 0, 2, 0])
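
To show where the `DataLoader` sits in practice, here is a minimal training-loop sketch. The linear model, learning rate, and synthetic data are placeholder assumptions chosen only to keep the example short; `TensorDataset` is used in place of `SimpleDataset` so the block is self-contained:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(64, 5)
y = torch.randint(0, 3, (64,))
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Linear(5, 3)  # placeholder model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(3):
    total = 0.0
    for batch_x, batch_y in loader:  # one pass over the loader = one epoch
        optimizer.zero_grad()
        loss = loss_fn(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
        total += loss.item() * len(batch_x)  # weight by actual batch size
    epoch_losses.append(total / len(loader.dataset))
    print(f"Epoch {epoch}: mean loss = {epoch_losses[-1]:.4f}")
```

Weighting each batch loss by `len(batch_x)` keeps the epoch average correct when the last batch is smaller than `batch_size`.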


## Key DataLoader Parameters

<span style="font-size: 15px;">

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `batch_size` | int | 1 | Number of samples per batch |
| `shuffle` | bool | False | Reshuffle data at each epoch |
| `num_workers` | int | 0 | Subprocesses for data loading (0 = main process) |
| `pin_memory` | bool | False | Copy batches into pinned (page-locked) host memory (faster host-to-GPU transfer) |
| `drop_last` | bool | False | Drop last batch if incomplete |
| `collate_fn` | callable | None | Custom function to create batches |

</span>

In [35]:
# drop_last=True: Drops the last batch if it's smaller than batch_size
N = 100
batch_size = 32

dataset = SimpleDataset(torch.randn(N, 5), torch.randint(0, 3, (N,)))

loader_keep = DataLoader(dataset, batch_size=batch_size, drop_last=False)
loader_drop = DataLoader(dataset, batch_size=batch_size, drop_last=True)

print(f"Dataset size: {N}, Batch size: {batch_size}")
print(f"drop_last=False: {len(loader_keep)} batches")
print(f"drop_last=True:  {len(loader_drop)} batches")

# Check last batch size
for i, (x, y) in enumerate(loader_keep):
    if i == len(loader_keep) - 1:
        print(f"\nLast batch size (drop_last=False): {len(x)}")

Dataset size: 100, Batch size: 32
drop_last=False: 4 batches
drop_last=True:  3 batches

Last batch size (drop_last=False): 4


In [36]:
# shuffle=True: Data is reshuffled at each epoch
small_dataset = SimpleDataset(
    torch.arange(10).float().unsqueeze(1),  # Features: 0-9
    torch.arange(10)                         # Labels: 0-9
)

loader_shuffle = DataLoader(small_dataset, batch_size=5, shuffle=True)

print("Epoch 1:")
for x, y in loader_shuffle:
    print(f"  Labels: {y.tolist()}")

print("\nEpoch 2:")
for x, y in loader_shuffle:
    print(f"  Labels: {y.tolist()}")

Epoch 1:
  Labels: [1, 6, 8, 3, 5]
  Labels: [9, 7, 2, 4, 0]

Epoch 2:
  Labels: [3, 2, 8, 7, 5]
  Labels: [6, 9, 4, 1, 0]
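
Shuffling draws from a random number generator, so the order differs between runs unless the generator is controlled. One way to make it reproducible is to pass a seeded `torch.Generator` through the `generator` argument of `DataLoader`. A minimal sketch (the helper `epoch_order` is hypothetical):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))

def epoch_order(seed):
    """Return the batches of one epoch for a loader seeded with `seed`."""
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=5, shuffle=True, generator=g)
    return [batch[0].tolist() for batch in loader]

run_a = epoch_order(seed=0)
run_b = epoch_order(seed=0)
print(run_a == run_b)  # True: same seed, same shuffle order
```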


## Custom Collate Function

<span style="font-size: 15px;">

The `collate_fn` parameter specifies how to merge a list of samples into a batch. The default collate function:
1. Stacks tensors along a new dimension (creating the batch dimension)
2. Handles nested structures (tuples, dicts, lists)

Custom collate functions are useful when:
- Samples have variable lengths (need padding)
- Samples have complex structure
- Special batch processing is needed

</span>

In [37]:
# Default collate behavior
from torch.utils.data import default_collate

# List of samples
samples = [
    (torch.tensor([1, 2, 3]), torch.tensor(0)),
    (torch.tensor([4, 5, 6]), torch.tensor(1)),
    (torch.tensor([7, 8, 9]), torch.tensor(2)),
]

batch = default_collate(samples)
print("Default collate result:")
print(f"  Features: {batch[0]}")
print(f"  Labels: {batch[1]}")

Default collate result:
  Features: tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])
  Labels: tensor([0, 1, 2])


In [38]:
# Custom collate function for variable-length sequences
def collate_with_padding(batch):
    """Collate function that pads sequences to the same length.
    
    Args:
        batch: List of (sequence, label) tuples where sequences may have different lengths
    
    Returns:
        padded_sequences: Tensor of shape (B, max_len, D)
        lengths: Tensor of original lengths (B,)
        labels: Tensor of labels (B,)
    """
    sequences, labels = zip(*batch)
    
    # Get lengths and max length
    lengths = torch.tensor([len(seq) for seq in sequences])
    max_len = lengths.max().item()
    
    # Get feature dimension
    feature_dim = sequences[0].shape[-1] if sequences[0].dim() > 1 else 1
    
    # Create padded tensor
    batch_size = len(sequences)
    padded = torch.zeros(batch_size, max_len, feature_dim)
    
    for i, seq in enumerate(sequences):
        seq = seq.view(-1, feature_dim)  # Ensure 2D
        padded[i, :len(seq)] = seq
    
    labels = torch.tensor(labels)  # labels are plain ints here
    
    return padded, lengths, labels

# Dataset with variable-length sequences
class VariableLengthDataset(Dataset):
    def __init__(self, n_samples=20):
        self.data = []
        for i in range(n_samples):
            length = torch.randint(3, 10, (1,)).item()  # Random length 3-9
            seq = torch.randn(length, 4)  # 4 features
            label = i % 3
            self.data.append((seq, label))
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx]

var_dataset = VariableLengthDataset()
var_loader = DataLoader(var_dataset, batch_size=4, collate_fn=collate_with_padding)

padded_seqs, lengths, labels = next(iter(var_loader))
print(f"Padded sequences shape: {padded_seqs.shape}")
print(f"Original lengths: {lengths}")
print(f"Labels: {labels}")

Padded sequences shape: torch.Size([4, 6, 4])
Original lengths: tensor([6, 6, 6, 3])
Labels: tensor([0, 1, 2, 0])
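
The manual padding loop above can also be expressed with `torch.nn.utils.rnn.pad_sequence`. The sketch below is an alternative collate function under the same assumptions (each sample is a `(sequence, int_label)` pair with sequences of shape `(length, D)`); the function name is made up for illustration:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_pad_sequence(batch):
    """Pad variable-length sequences to the longest one in the batch."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    # batch_first=True gives shape (B, max_len, D); shorter sequences are zero-padded
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0.0)
    return padded, lengths, torch.tensor(labels)

batch = [(torch.randn(3, 4), 0), (torch.randn(6, 4), 1), (torch.randn(5, 4), 2)]
padded, lengths, labels = collate_pad_sequence(batch)
print(padded.shape)  # torch.Size([3, 6, 4])
print(lengths)       # tensor([3, 6, 5])
```

Returning `lengths` alongside the padded tensor lets downstream code mask the padding or build packed sequences.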


## Multi-Process Data Loading

<span style="font-size: 15px;">

Setting `num_workers > 0` enables multi-process data loading. This is useful when:
- Data loading is I/O bound (reading from disk)
- Data preprocessing is CPU-intensive
- You want to overlap data loading with GPU computation

**Important considerations:**
- `num_workers=0`: Data is loaded in the main process
- `num_workers>0`: Data is loaded in separate worker processes
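- By default, worker processes are created each time an iterator over the `DataLoader` is created (that is, at the start of each epoch); `persistent_workers=True` keeps them alive between epochs
- For map-style datasets, each worker assembles complete batches from the indices it is assigned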

**Best practices:**
- Start with `num_workers=0` for debugging
- Typical values: 2-8 workers (depends on CPU cores)
- Use `pin_memory=True` when using GPU

</span>
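
The considerations above can be folded into a small factory function. This is a sketch, not a recommended configuration: `make_loader` is a hypothetical helper, and it enables `persistent_workers` only when workers exist, because that option requires `num_workers > 0`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(dataset, batch_size, num_workers=0):
    """Build a DataLoader whose worker options follow num_workers."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),  # only useful with a GPU
        persistent_workers=num_workers > 0,    # invalid when num_workers == 0
    )

dataset = TensorDataset(torch.randn(100, 5), torch.randint(0, 3, (100,)))

# Debug configuration: everything runs in the main process
loader = make_loader(dataset, batch_size=16, num_workers=0)
print(len(loader), loader.num_workers, loader.persistent_workers)
```

In a standalone script that uses `num_workers > 0`, the code that iterates the loader is normally placed under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.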

In [39]:
# Example: Comparing loading times (conceptual - actual speedup depends on data loading cost)
import time

class SlowDataset(Dataset):
    """Dataset that simulates slow data loading."""
    def __init__(self, n_samples=100):
        self.n_samples = n_samples
    
    def __len__(self):
        return self.n_samples
    
    def __getitem__(self, idx):
        # Simulate I/O delay (in practice, this would be disk read time)
        # time.sleep(0.001)  # Uncomment to see effect
        return torch.randn(3, 32, 32), torch.tensor(idx % 10)

slow_dataset = SlowDataset(n_samples=100)

# Single process (num_workers=0)
loader_single = DataLoader(slow_dataset, batch_size=16, num_workers=0)

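# Note: Multi-process loading (num_workers>0) may fail in notebooks on platforms that start
# workers with the "spawn" method, because the dataset class is defined inside the notebook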
# In a script, you would use:
# loader_multi = DataLoader(slow_dataset, batch_size=16, num_workers=4)

print("Single-process DataLoader created")
print(f"To use multi-process loading, set num_workers > 0")
print(f"Example: DataLoader(dataset, batch_size=16, num_workers=4, pin_memory=True)")

Single-process DataLoader created
To use multi-process loading, set num_workers > 0
Example: DataLoader(dataset, batch_size=16, num_workers=4, pin_memory=True)


# Samplers

<span style="font-size: 15px;">

Samplers define the strategy used to draw samples from the dataset. Each sampler is an iterable that yields sample indices (or, in the case of `BatchSampler`, lists of indices).

**Built-in Samplers:**

| Sampler | Description | Use Case |
|---------|-------------|----------|
| `SequentialSampler` | Yields indices in order (0, 1, 2, ...) | Evaluation/inference |
| `RandomSampler` | Yields indices in random order | Training |
| `SubsetRandomSampler` | Random sampling from a subset of indices | Train/val splits |
| `WeightedRandomSampler` | Weighted random sampling | Class imbalance |
| `BatchSampler` | Wraps another sampler to yield batches of indices | Custom batching |

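**Note:** `sampler` and `shuffle` are mutually exclusive. When passing a custom sampler, leave `shuffle` unset (or set it to `False`); `DataLoader` raises a `ValueError` if it receives both a sampler and `shuffle=True`.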

</span>
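
Because a sampler is just an iterable of indices, writing a custom one takes only a few lines. The sketch below defines a hypothetical `EvenThenOddSampler` (an illustrative name, not a PyTorch class) and also shows the `ValueError` raised when a sampler is combined with `shuffle=True`.

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset


class EvenThenOddSampler(Sampler):
    """Hypothetical sampler: yields all even indices first, then the odd ones."""

    def __init__(self, data_source):
        self.n = len(data_source)

    def __iter__(self):
        yield from range(0, self.n, 2)
        yield from range(1, self.n, 2)

    def __len__(self):
        return self.n


toy_dataset = TensorDataset(torch.arange(7))
custom_sampler = EvenThenOddSampler(toy_dataset)
order = list(custom_sampler)
print("Sampler order:", order)

# The DataLoader groups the sampler's indices into batches of 3.
custom_loader = DataLoader(toy_dataset, batch_size=3, sampler=custom_sampler)
batches = [batch[0].tolist() for batch in custom_loader]
print("Batches:", batches)

# Combining a sampler with shuffle=True is rejected.
try:
    DataLoader(toy_dataset, sampler=custom_sampler, shuffle=True)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```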

In [40]:
# Create a small dataset for demonstration
dataset = SimpleDataset(
    torch.arange(10).float().unsqueeze(1),
    torch.arange(10)
)

# SequentialSampler: indices in order
seq_sampler = SequentialSampler(dataset)
print("SequentialSampler indices:", list(seq_sampler))

# RandomSampler: indices in random order
rand_sampler = RandomSampler(dataset)
print("RandomSampler indices:", list(rand_sampler))

# SubsetRandomSampler: random sampling from specific indices
subset_indices = [0, 2, 4, 6, 8]  # Only even indices
subset_sampler = SubsetRandomSampler(subset_indices)
print("SubsetRandomSampler indices:", list(subset_sampler))

SequentialSampler indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
RandomSampler indices: [5, 2, 9, 4, 7, 6, 3, 0, 1, 8]
SubsetRandomSampler indices: [2, 8, 4, 6, 0]


In [41]:
# Using sampler with DataLoader
loader = DataLoader(dataset, batch_size=3, sampler=subset_sampler)

print("Batches from SubsetRandomSampler:")
for x, y in loader:
    print(f"  Labels: {y.tolist()}")

Batches from SubsetRandomSampler:
  Labels: [0, 8, 4]
  Labels: [2, 6]
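
`BatchSampler` appears in the table of built-in samplers but not in the cells above. As a minimal sketch, it wraps another sampler and yields lists of indices; such a sampler is passed to `DataLoader` through `batch_sampler=` instead of `sampler=` and `batch_size=`. The toy `TensorDataset` below is an illustrative stand-in.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler, TensorDataset

# Group sequential indices 0..9 into lists of 3.
keep_last = list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
drop_last = list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
print("drop_last=False:", keep_last)
print("drop_last=True: ", drop_last)

# Hand the batches of indices to a DataLoader via batch_sampler.
toy_values = TensorDataset(torch.arange(10))
index_batches = BatchSampler(SequentialSampler(toy_values), batch_size=4, drop_last=False)
batch_loader = DataLoader(toy_values, batch_sampler=index_batches)
loaded = [batch[0].tolist() for batch in batch_loader]
print("Loaded batches:", loaded)
```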


## Train/Validation Split with Samplers

In [42]:
# Create dataset
N = 100
dataset = SimpleDataset(
    torch.randn(N, 5),
    torch.randint(0, 3, (N,))
)

# Create train/val split
train_ratio = 0.8
train_size = int(train_ratio * N)

# Shuffle indices
indices = torch.randperm(N).tolist()
train_indices = indices[:train_size]
val_indices = indices[train_size:]

print(f"Total samples: {N}")
print(f"Training samples: {len(train_indices)}")
print(f"Validation samples: {len(val_indices)}")

# Create samplers: each one draws only from its own index list, reshuffled every epoch
train_sampler = SubsetRandomSampler(train_indices)
val_sampler = SubsetRandomSampler(val_indices)

# Create DataLoaders
train_loader = DataLoader(dataset, batch_size=16, sampler=train_sampler)
val_loader = DataLoader(dataset, batch_size=16, sampler=val_sampler)

print(f"\nTrain batches: {len(train_loader)}")
print(f"Val batches: {len(val_loader)}")

Total samples: 100
Training samples: 80
Validation samples: 20

Train batches: 5
Val batches: 2
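
The batch counts follow from $\lceil N_{\text{split}} / B \rceil$: $\lceil 80/16 \rceil = 5$ and $\lceil 20/16 \rceil = 2$. An alternative to index lists plus `SubsetRandomSampler` is `torch.utils.data.random_split`, which returns `Subset` objects that can be given to a `DataLoader` directly. The sketch below assumes a stand-in `TensorDataset` and an arbitrary seed of 0 for a reproducible split.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

split_dataset = TensorDataset(torch.randn(100, 5), torch.randint(0, 3, (100,)))

# A dedicated generator makes the split reproducible (the seed is arbitrary).
split_gen = torch.Generator().manual_seed(0)
train_set, val_set = random_split(split_dataset, [80, 20], generator=split_gen)

# Subsets behave like datasets, so shuffle can be used instead of a sampler.
train_loader_alt = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader_alt = DataLoader(val_set, batch_size=16, shuffle=False)

print(f"Train samples: {len(train_set)}, batches: {len(train_loader_alt)}")
print(f"Val samples: {len(val_set)}, batches: {len(val_loader_alt)}")
```

With this approach the validation loader iterates in a fixed order, which can make evaluation runs easier to compare.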


## Weighted Sampling for Imbalanced Data

In [43]:
from torch.utils.data import WeightedRandomSampler

# Create imbalanced dataset: class 0 has 90 samples, class 1 has 10 samples
labels = torch.cat([torch.zeros(90), torch.ones(10)]).long()
features = torch.randn(100, 5)
imbalanced_dataset = SimpleDataset(features, labels)

# Count class frequencies
class_counts = torch.bincount(labels)
print(f"Class distribution: {class_counts.tolist()}")

# Calculate weights: inverse of class frequency
class_weights = 1.0 / class_counts.float()
print(f"Class weights: {class_weights.tolist()}")

# Assign weight to each sample based on its class
sample_weights = class_weights[labels]

# Create WeightedRandomSampler
weighted_sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),  # Number of samples to draw
    replacement=True  # With replacement to allow oversampling
)

weighted_loader = DataLoader(imbalanced_dataset, batch_size=20, sampler=weighted_sampler)

# Check class balance in a batch
x, y = next(iter(weighted_loader))
batch_counts = torch.bincount(y, minlength=2)
print(f"\nBatch class distribution: {batch_counts.tolist()}")
print("Classes are now roughly balanced in each batch!")

Class distribution: [90, 10]
Class weights: [0.011111111380159855, 0.10000000149011612]

Batch class distribution: [8, 12]
Classes are now roughly balanced in each batch!
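
Why inverse-frequency weights balance the classes: each class's total sampling mass is its count times its per-sample weight, so class 0 has $90 \times \tfrac{1}{90} = 1$ and class 1 has $10 \times \tfrac{1}{10} = 1$, giving each class probability $0.5$ per draw. The sketch below checks this numerically; the variable names, the seed, and the 10,000 draws are illustrative choices.

```python
import torch
from torch.utils.data import WeightedRandomSampler

labels_demo = torch.cat([torch.zeros(90), torch.ones(10)]).long()
counts_demo = torch.bincount(labels_demo)
weights_demo = (1.0 / counts_demo.float())[labels_demo]

# Total sampling mass per class: sum the per-sample weights by label.
class_mass = torch.zeros(2).index_add(0, labels_demo, weights_demo)
class_probs = class_mass / class_mass.sum()
print("Per-draw class probabilities:", class_probs.tolist())

# Empirical check over many draws (seeded for repeatability).
gen = torch.Generator().manual_seed(0)
sampler_demo = WeightedRandomSampler(
    weights_demo, num_samples=10_000, replacement=True, generator=gen
)
drawn = torch.tensor(list(sampler_demo))
minority_fraction = (labels_demo[drawn] == 1).float().mean().item()
print(f"Fraction of class-1 draws: {minority_fraction:.3f}")
```

Individual batches of 20 still fluctuate around an even split, as the `[8, 12]` batch above shows; the balance holds on average.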


# Putting It All Together: Training Loop

In [44]:
import torch.optim as optim

# 1. Create Dataset with transforms
class MyDataset(Dataset):
    def __init__(self, n_samples=1000, n_features=20, n_classes=5, transform=None):
        self.features = torch.randn(n_samples, n_features)
        self.labels = torch.randint(0, n_classes, (n_samples,))
        self.transform = transform
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        x = self.features[idx]
        y = self.labels[idx]
        if self.transform:
            x = self.transform(x)
        return x, y

# 2. Create transforms
class AddNoise:
    def __init__(self, std=0.1):
        self.std = std
    def __call__(self, x):
        return x + torch.randn_like(x) * self.std

train_transform = Compose([
    AddNoise(std=0.1),  # Data augmentation
])

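# 3. Create datasets
# Caveat: both loaders below index into this one dataset, so the noise
# augmentation is also applied to validation samples. In a real project,
# give the validation split its own dataset (or wrapper) without augmentation.
full_dataset = MyDataset(n_samples=1000, transform=train_transform)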

# 4. Create train/val split
train_size = int(0.8 * len(full_dataset))
indices = torch.randperm(len(full_dataset)).tolist()
train_indices = indices[:train_size]
val_indices = indices[train_size:]

train_sampler = SubsetRandomSampler(train_indices)
val_sampler = SubsetRandomSampler(val_indices)

# 5. Create DataLoaders
train_loader = DataLoader(
    full_dataset,
    batch_size=32,
    sampler=train_sampler,
    num_workers=0,  # Use >0 in production
    pin_memory=False  # Set True if using GPU
)

val_loader = DataLoader(
    full_dataset,
    batch_size=32,
    sampler=val_sampler
)

# 6. Create model, loss, optimizer
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 5)
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 7. Training loop
n_epochs = 3

for epoch in range(n_epochs):
    # Training phase
    model.train()
    train_loss = 0.0
    train_correct = 0
    train_total = 0
    
    for batch_x, batch_y in train_loader:
        # Forward pass
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Statistics
        train_loss += loss.item() * batch_x.size(0)
        _, predicted = torch.max(outputs, 1)
        train_total += batch_y.size(0)
        train_correct += (predicted == batch_y).sum().item()
    
    train_loss /= train_total
    train_acc = 100 * train_correct / train_total
    
    # Validation phase
    model.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0
    
    with torch.no_grad():
        for batch_x, batch_y in val_loader:
            outputs = model(batch_x)
            loss = criterion(outputs, batch_y)
            
            val_loss += loss.item() * batch_x.size(0)
            _, predicted = torch.max(outputs, 1)
            val_total += batch_y.size(0)
            val_correct += (predicted == batch_y).sum().item()
    
    val_loss /= val_total
    val_acc = 100 * val_correct / val_total
    
    print(f"Epoch {epoch+1}/{n_epochs}: "
          f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}% | "
          f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%")

Epoch 1/3: Train Loss: 1.6265, Train Acc: 20.00% | Val Loss: 1.6276, Val Acc: 21.00%
Epoch 2/3: Train Loss: 1.6022, Train Acc: 23.88% | Val Loss: 1.6330, Val Acc: 21.50%
Epoch 3/3: Train Loss: 1.5887, Train Acc: 26.12% | Val Loss: 1.6303, Val Acc: 20.50%
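
The accuracies stay near 20% because the labels are random and independent of the features, so there is nothing to learn beyond chance for 5 classes. One detail of the pipeline deserves care: `full_dataset` carries `train_transform` and both loaders index into it, so validation samples are also perturbed by `AddNoise`. A possible remedy, sketched below, is a thin wrapper that attaches a transform per split. `TransformedSubset` and `add_noise` are hypothetical names introduced for illustration, and the `TensorDataset` stands in for the untransformed data.

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset


class TransformedSubset(Dataset):
    """Hypothetical wrapper: a subset of `base` with its own transform."""

    def __init__(self, base, indices, transform=None):
        self.base = base
        self.indices = list(indices)
        self.transform = transform

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        x, y = self.base[self.indices[i]]
        if self.transform is not None:
            x = self.transform(x)
        return x, y


def add_noise(x, std=0.1):
    """Augmentation used for training only."""
    return x + torch.randn_like(x) * std


base = TensorDataset(torch.randn(1000, 20), torch.randint(0, 5, (1000,)))
perm = torch.randperm(len(base)).tolist()

train_part = TransformedSubset(base, perm[:800], transform=add_noise)
val_part = TransformedSubset(base, perm[800:], transform=None)

train_loader_clean = DataLoader(train_part, batch_size=32, shuffle=True)
val_loader_clean = DataLoader(val_part, batch_size=32, shuffle=False)

# Validation samples come back exactly as stored; training samples do not.
x_val, _ = val_part[0]
x_raw, _ = base[perm[800]]
print("Validation sample unchanged:", torch.equal(x_val, x_raw))
```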


# Quick Reference Table

**Dataset Methods**

| Method | Required | Purpose |
|--------|----------|--------|
| `__init__` | No (conventional) | Initialize dataset, load metadata |
| `__len__` | Yes, in practice | Return total number of samples (needed by the default samplers and `len(loader)`) |
| `__getitem__` | Yes | Return sample at given index |
| `__getitems__` | No | Batch loading optimization (optional) |

**DataLoader Parameters**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `dataset` | Dataset | required | Dataset to load from |
| `batch_size` | int | 1 | Samples per batch |
| `shuffle` | bool | False | Randomize order each epoch |
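| `sampler` | Sampler | None | Custom sampling strategy (mutually exclusive with `shuffle=True`) |
| `num_workers` | int | 0 | Number of worker processes (0 loads in the main process) |
| `collate_fn` | callable | None | Custom batch creation (default collation is used when `None`) |
| `pin_memory` | bool | False | Place batches in pinned (page-locked) host memory for faster copies to a CUDA GPU |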
| `drop_last` | bool | False | Drop incomplete final batch |

**Sampler Types**

| Sampler | Use Case |
|---------|----------|
| `SequentialSampler` | Evaluation (in order) |
| `RandomSampler` | Training (shuffle) |
| `SubsetRandomSampler` | Train/val splits |
| `WeightedRandomSampler` | Class imbalance |
| `BatchSampler` | Custom batch strategies |

# Conclusion

<span style="font-size: 15px;">

Key takeaways for working with PyTorch data loading:

1. **Transforms** are callables that preprocess data. Use `Compose` to chain multiple transforms.

2. **Dataset** is an abstract class requiring `__len__` and `__getitem__`. Use it to:
   - Decouple data handling from model code
   - Apply transforms lazily (on access)
   - Handle various data sources (files, databases, etc.)

3. **DataLoader** wraps Dataset for efficient training:
   - `batch_size`: Start with 32 or 64, adjust based on GPU memory
   - `shuffle=True`: Shuffle training data every epoch (leave it unset when passing a `sampler`, which then controls the order)
   - `num_workers`: Use 2-8 for faster loading (0 for debugging)
   - `pin_memory=True`: When training on a CUDA GPU
   - `drop_last=True`: When batch size must be consistent

4. **Samplers** control loading order:
   - `SubsetRandomSampler`: For train/val splits
   - `WeightedRandomSampler`: For class imbalance

5. **Best Practices**:
   - For datasets too large to fit in memory, load data lazily in `__getitem__` rather than in `__init__`
   - Use `collate_fn` for variable-length data
   - Profile data loading to find bottlenecks
   - Consider tuning `prefetch_factor` (only valid with `num_workers > 0`) to overlap data loading with computation

</span>
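
As a rough way to act on the profiling advice, one can time how long the loop waits for each batch separately from how long the model takes. The sketch below is a minimal CPU-only illustration with made-up names (`profile_loader`, `profile_model`); on a GPU, `torch.cuda.synchronize()` would be needed before reading the clock, since CUDA kernels run asynchronously.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

profile_dataset = TensorDataset(torch.randn(512, 20), torch.randint(0, 5, (512,)))
profile_loader = DataLoader(profile_dataset, batch_size=64, num_workers=0)
profile_model = torch.nn.Linear(20, 5)

load_time = 0.0     # time spent waiting for the next batch
compute_time = 0.0  # time spent in the forward pass
n_batches = 0

t_start = time.perf_counter()
for xb, yb in profile_loader:
    t_loaded = time.perf_counter()
    load_time += t_loaded - t_start

    with torch.no_grad():
        profile_model(xb)

    t_start = time.perf_counter()
    compute_time += t_start - t_loaded
    n_batches += 1

print(f"Batches: {n_batches}")
print(f"Waiting for data: {load_time * 1e3:.2f} ms")
print(f"Forward passes:   {compute_time * 1e3:.2f} ms")
```

If the waiting time dominates, increasing `num_workers` or moving expensive preprocessing offline is usually the first thing to try.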