# Recitation 0L : Dataloaders

The purpose of this notebook is to explain the main concepts behind the PyTorch DataLoader and to walk through a simple, pedagogical example of using DataLoaders. It also covers DataLoader customization (the collate function) and efficient data transfer to the GPU.

PyTorch Reference: https://pytorch.org/docs/stable/data.html

# Contents

1. Introduction to PyTorch DataLoader
2. Initializing a DataLoader Object
3. On-the-Fly Data Loading
4. Handling Different Batching Strategies
5. Creating a Custom Batch Sampler
6. Customizing Data Loading with the Collate Function
7. Leveraging Multi-Process Data Loading and Pin Memory

# PyTorch DataLoader

In [None]:
import torch
import torchvision
import numpy as np

## Manual data feed

**1 epoch**: one complete pass of the training dataset through the algorithm

**batch_size**: the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you will need.


**No of iterations = No of batches**: number of passes, each pass using batch_size number of examples.

Example: With 100 training examples and batch size of 20 it will take 5 iterations to complete 1 epoch.
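
The arithmetic above can be checked with a short sketch. The helper name `iterations_per_epoch` is hypothetical and used only for illustration; it rounds up so that a final, smaller batch still counts as one iteration.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed for one complete pass over the data."""
    # Round up: a trailing partial batch is still one iteration.
    return math.ceil(n_samples / batch_size)

print(iterations_per_epoch(100, 20))  # 5
print(iterations_per_epoch(10, 4))    # 3 (the last batch holds only 2 samples)
```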



```
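x = list(range(10000))        # stand-in for a list of 10000 input samples
y = [2 * v for v in x]        # stand-in for the 10000 target labels corresponding to x
batch_size = 100
n_batches = len(x) // batch_size  # iterations per epoch

# Load data manually in batches
for epoch in range(10):
    for i in range(n_batches):
        # Local batch of inputs and labels
        local_X = x[i*batch_size:(i+1)*batch_size]
        local_y = y[i*batch_size:(i+1)*batch_size]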

        # Your model
        [...]
```



# Dataloaders (PyTorch)

Documentation:
[Read Docs](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)

The Dataset retrieves our dataset's features and labels one sample at a time. While training a model, we typically want to

1.   Pass samples in “minibatches”
2.   Reshuffle the data at every epoch to reduce model overfitting
3.   Use Python's multiprocessing to speed up data retrieval
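
As a rough sketch of the first two points, the snippet below wraps a toy `TensorDataset` in a `DataLoader` and iterates over it for two epochs. The names `toy_dataset` and `loader` are illustrative only. With `shuffle=True` the sample order is redrawn at the start of every epoch, yet each epoch still visits every sample exactly once.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.arange(10).float()   # 10 toy inputs
labels = 2 * features                 # matching toy targets
toy_dataset = TensorDataset(features, labels)

# Minibatches of 4, reshuffled each epoch, loaded in the main process.
loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, num_workers=0)

for epoch in range(2):
    seen = []
    for xb, yb in loader:
        seen.extend(xb.tolist())
    print(f"epoch {epoch}: sample order = {seen}")
```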

# 1- Sample DataLoader

Handles data loading logic


In [None]:
from torch.utils.data import Dataset, DataLoader, Sampler
import os
from PIL import Image
import numpy as np
# Dataloader will use dataset to create batches, process data etc.
# Visit Dataset Recitation for more details

class MyDataset(Dataset):
    # constructor, in this case it contains the data
    def __init__(self, xs, ys):
        self.input = xs
        self.target = ys

    # returns the length of the dataset
    def __len__(self):
        return len(self.input)

    # returns the item at index i
    def __getitem__(self, i):
        return self.input[i], self.target[i]

Suppose you want to train a model to learn that target = 2 × input. The following dummy dataset encodes that relationship:

In [None]:
# We are creating a dummy dataset to test Dataloaders
input = list(range(10))
target = list(range(0, 20, 2))
print('input values: ', input)
print('target values: ', target)

# Create an instance of MyDataset class
dataset = MyDataset(input, target)
print("The sample at index 2 is: ", dataset[2]) # returns the tuple (input[2], target[2])
# This is basically the same as
print("The sample at index 2 is: ", dataset.__getitem__(2))
# which is what the DataLoader relies on

input values:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
target values:  [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
The sample at index 2 is:  (2, 4)
The sample at index 2 is:  (2, 4)


# 2- Initializing a DataLoader Object
### Let's look at different ways of creating a DataLoader object using the DataLoader class


In [None]:
# default batch size of 1, so x and y each have size 1; no shuffling
for x, y in DataLoader(dataset):
    print(f"batch of inputs: {x}, batch of labels: {y}")

batch of inputs: tensor([0]), batch of labels: tensor([0])
batch of inputs: tensor([1]), batch of labels: tensor([2])
batch of inputs: tensor([2]), batch of labels: tensor([4])
batch of inputs: tensor([3]), batch of labels: tensor([6])
batch of inputs: tensor([4]), batch of labels: tensor([8])
batch of inputs: tensor([5]), batch of labels: tensor([10])
batch of inputs: tensor([6]), batch of labels: tensor([12])
batch of inputs: tensor([7]), batch of labels: tensor([14])
batch of inputs: tensor([8]), batch of labels: tensor([16])
batch of inputs: tensor([9]), batch of labels: tensor([18])


# 3- On-the-Fly Data Loading

The key to handling large datasets is to load and process data in batches, only when needed. This strategy, known as on-the-fly or lazy loading, can be implemented in PyTorch by customizing the ```Dataset``` class.

In [None]:
class LargeImageDataset(Dataset):
    def __init__(self, image_dir, transform=None):
        self.image_dir = image_dir
        self.image_files = os.listdir(image_dir)
        self.transform = transform

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        img_name = os.path.join(self.image_dir, self.image_files[idx])
        image = Image.open(img_name).convert('RGB')

        if self.transform:
            image = self.transform(image)

        return image

# Usage
image_dir = '/path/to/large/image/dataset'

# on-the-fly preprocessing: the transforms run only when a sample is loaded
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((256, 256)),
    torchvision.transforms.ToTensor()
])
large_dataset = LargeImageDataset(image_dir, transform=transform)
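
To make the lazy behavior visible without real image files, here is a small sketch with a hypothetical `LazySquares` dataset that records which indices were requested. It assumes single-process loading (`num_workers=0`), where samples are fetched only when the next batch is asked for.

```python
from torch.utils.data import Dataset, DataLoader

class LazySquares(Dataset):
    """Hypothetical dataset that records which indices were actually loaded."""
    def __init__(self, n):
        self.n = n
        self.loaded = []          # filled in as samples are requested

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        self.loaded.append(idx)   # stands in for reading a file from disk
        return idx * idx

lazy_dataset = LazySquares(1000)
lazy_loader = DataLoader(lazy_dataset, batch_size=4, num_workers=0)

# Requesting one batch touches only the 4 samples in that batch.
first_batch = next(iter(lazy_loader))
print(first_batch)            # tensor([0, 1, 4, 9])
print(lazy_dataset.loaded)    # [0, 1, 2, 3]
```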


# 4- Handling Different Batching Strategies

In [None]:
# batch size of 4, so x and y each hold up to 4 samples (the last batch has the remaining 2); no shuffling
for x, y in DataLoader(dataset, batch_size=4):
    print(f"batch of inputs: {x}, batch of labels: {y}")

batch of inputs: tensor([0, 1, 2, 3]), batch of labels: tensor([0, 2, 4, 6])
batch of inputs: tensor([4, 5, 6, 7]), batch of labels: tensor([ 8, 10, 12, 14])
batch of inputs: tensor([8, 9]), batch of labels: tensor([16, 18])


In [None]:
# batch size of 4, so x and y both have a size of 4, random shuffle
for x, y in DataLoader(dataset, batch_size=4, shuffle=True):
    print(f"batch of inputs: {x}, batch of labels: {y}")

batch of inputs: tensor([3, 5, 8, 2]), batch of labels: tensor([ 6, 10, 16,  4])
batch of inputs: tensor([1, 9, 7, 6]), batch of labels: tensor([ 2, 18, 14, 12])
batch of inputs: tensor([4, 0]), batch of labels: tensor([8, 0])


In [None]:
# batch size of 4, drop the last batch with less than 4 samples
for x, y in DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True):
    print(f"batch of inputs: {x}, batch of labels: {y}")

batch of inputs: tensor([2, 1, 3, 5]), batch of labels: tensor([ 4,  2,  6, 10])
batch of inputs: tensor([8, 4, 0, 6]), batch of labels: tensor([16,  8,  0, 12])
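
The effect of `drop_last` on the number of batches can also be read off `len()` of the DataLoader. The sketch below uses a plain Python list as a stand-in dataset (a list already provides `__len__` and `__getitem__`); the names `toy_samples`, `keep_last_loader` and `drop_last_loader` are illustrative only.

```python
from torch.utils.data import DataLoader

toy_samples = list(range(10))   # 10 samples, stand-in for a map-style dataset

keep_last_loader = DataLoader(toy_samples, batch_size=4)
drop_last_loader = DataLoader(toy_samples, batch_size=4, drop_last=True)

print(len(keep_last_loader))                  # 3
print(len(drop_last_loader))                  # 2
print([len(b) for b in keep_last_loader])     # [4, 4, 2]
print([len(b) for b in drop_last_loader])     # [4, 4]
```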



# 5- Creating a Custom [Batch Sampler](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler)

In PyTorch, a Batch Sampler is used to define how batches are formed from the dataset. The default behavior is to sequentially or randomly sample elements to form a batch. However, there are scenarios where you may need more control over how batches are created, such as when dealing with sequences of varying lengths or when implementing certain types of sampling strategies like stratified sampling for imbalanced data. In such cases, a custom batch sampler is invaluable.


A custom batch sampler in PyTorch is a class that implements two key methods: ```__iter__``` and ```__len__```. The ```__iter__``` method yields a list of indices representing a single batch, and ```__len__``` returns the number of batches expected in an epoch.
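
As a minimal sketch of those two methods, the hypothetical `LengthBucketBatchSampler` below groups indices of sequences with similar lengths, one of the use cases mentioned above. It is written as a plain Python class, on the assumption that the `batch_sampler` argument accepts any iterable that yields lists of indices; the class name, the `lengths` list and the `seed` parameter are all illustrative.

```python
import random

class LengthBucketBatchSampler:
    """Hypothetical batch sampler: batches indices of similar sequence length."""

    def __init__(self, lengths, batch_size, seed=0):
        self.lengths = lengths
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def __iter__(self):
        # Sort indices by sequence length so neighbours have similar lengths.
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        # Cut the sorted order into consecutive batches.
        batches = [order[i:i + self.batch_size]
                   for i in range(0, len(order), self.batch_size)]
        # Shuffle the order of the batches, not their contents.
        self.rng.shuffle(batches)
        yield from batches

    def __len__(self):
        # Number of batches per epoch, counting a final partial batch.
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size

lengths = [5, 12, 7, 3, 11, 6, 4]          # toy sequence lengths
bucket_sampler = LengthBucketBatchSampler(lengths, batch_size=3)
bucket_batches = list(bucket_sampler)
print(len(bucket_sampler), bucket_batches)
```

Such an object would then be passed as `DataLoader(some_dataset, batch_sampler=bucket_sampler)`, in which case `batch_size`, `shuffle` and `drop_last` must be left at their defaults.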



In [None]:
# Define a custom dataset class: a toy dataset of integers, each labeled as odd (1) or even (0)
class EvenOddNumberDataset(Dataset):
    def __init__(self, start, end):

        self.numbers = list(range(start, end))

    def __len__(self):

        # Return the total number of items in the dataset.

        return len(self.numbers)

    def __getitem__(self, idx):

        # Retrieve an item by its index.

        number = self.numbers[idx]
        label = 1 if number % 2 != 0 else 0  # 1 for odd, 0 for even
        return number, label

# Usage example
start = 0
end = 100
dataset = EvenOddNumberDataset(start, end)

# Example of accessing the dataset
for i in range(5):
    print(dataset[i])


(0, 0)
(1, 1)
(2, 0)
(3, 1)
(4, 0)


In [None]:
class MyBatchSampler(Sampler):
    def __init__(self, dataset, batch_size):

        # Initialize the batch sampler.

        self.dataset = dataset
        self.batch_size = batch_size
        self.labels = [label for _, label in dataset]
        self.class_indices = self._get_class_indices()
        self.num_batches = int(np.ceil(len(self.dataset) / batch_size))  # store as int, np.ceil returns a float

    def _get_class_indices(self):

        # Group dataset indices by class (even/odd).

        class_indices = {}
        for idx, label in enumerate(self.labels):
            if label not in class_indices:
                class_indices[label] = []
            class_indices[label].append(idx)
        return class_indices

    def __iter__(self):

        # Iterator method to yield batches.

        for _ in range(int(self.num_batches)):
            batch = []
            for class_idx in self.class_indices.values():
                samples_per_class = int(self.batch_size / len(self.class_indices))
                selected_indices = np.random.choice(class_idx, samples_per_class, replace=False)
                batch.extend(selected_indices)
            np.random.shuffle(batch)
            yield batch

    def __len__(self):

        # Return the total number of batches.

        return int(self.num_batches)


In [None]:
batch_sampler = MyBatchSampler(dataset, batch_size=16)
data_loader = DataLoader(dataset, batch_sampler=batch_sampler)  # pass the dataset instance, not the class


# 6- Collate function

```collate_fn``` is a DataLoader parameter that controls how a list of individual samples is merged into a batch; customizing it gives you custom automatic batching.

You can apply a transformation in the collate function instead of the dataset class when the transformation needs to operate on whole batches rather than on single samples.
Because the DataLoader supports multi-process loading through worker processes, ```collate_fn()``` also runs in those workers and benefits from the same speed-up.

In [None]:
# Define a custom dataset class that also provides its own collate function
class MyNormalDataset(Dataset):
    # constructor, in this case it contains the data
    def __init__(self, xs, ys):
        self.input = xs
        self.target = ys

    # returns the length of the dataset
    def __len__(self):
        return len(self.input)

    # returns the item at index i
    def __getitem__(self, i):
        return self.input[i], self.target[i]

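    # normalizes the inputs of each batch to zero mean and unit variance
    def collate_fn(self, batch):
        x, y = zip(*batch)
        x = np.array(x)  # zip returns tuples; convert before doing arithmetic
        x_mean = np.mean(x)
        x_std = np.std(x)
        x_normal = (x - x_mean) / (x_std + 1e-9)
        return x_normal, y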


input = np.array(list(range(10)))
target = np.array(list(range(0, 20, 2)))
print('input values: ', input)
print('target values: ', target)

# Create an instance of the MyNormalDataset class
dataset = MyNormalDataset(input, target)
# Pass the custom collate_fn to the DataLoader
train_dataloader_custom = DataLoader(dataset, batch_size=5, shuffle=True, collate_fn=dataset.collate_fn)

# Display collated inputs and labels.
for i, (x, y) in enumerate(train_dataloader_custom):
    print(f"batch of inputs: {x}, batch of labels: {y}")


input values:  [0 1 2 3 4 5 6 7 8 9]
target values:  [ 0  2  4  6  8 10 12 14 16 18]
batch of inputs: [ 1.53281263 -0.90575292  0.48771311 -1.25411943  0.1393466 ], batch of labels: (16, 2, 10, 0, 8)
batch of inputs: [ 0.62092042  0.23284516  1.39707095 -0.93138063 -1.31945589], batch of labels: (14, 12, 18, 6, 4)
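
Another common reason to write a collate function is batching sequences of different lengths, which the default collation cannot stack. The sketch below is one possible approach, not the only one: the toy data and the name `pad_collate` are hypothetical, and padding relies on `torch.nn.utils.rnn.pad_sequence`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical toy data: (sequence, label) pairs with different sequence lengths.
sequences = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6]), 0),
    (torch.tensor([7, 8, 9, 10]), 1),
]

def pad_collate(batch):
    seqs, labels = zip(*batch)
    # Keep the true lengths so padding can be ignored later.
    lengths = torch.tensor([len(s) for s in seqs])
    # Pad every sequence with zeros up to the longest one in this batch.
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)

pad_loader = DataLoader(sequences, batch_size=2, collate_fn=pad_collate)
for padded, lengths, labels in pad_loader:
    print(padded.shape, lengths.tolist(), labels.tolist())
```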


# 7- Single and multi-process loading

The ```num_workers``` argument specifies how many subprocesses to use for data loading. \
0 (the default) means that the data will be loaded in the main process.

In [None]:
# 2 subprocesses
train_dataloader_fast = DataLoader(dataset, batch_size=5, shuffle=True, collate_fn= dataset.collate_fn, num_workers=2)

In [None]:
# The maximum number of subprocesses you can use depends on the machine you are training on;
# you can try increasing it until you see a warning.

train_dataloader_fast = DataLoader(dataset, batch_size=5, shuffle=True, collate_fn= dataset.collate_fn, num_workers=4)



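Set ```pin_memory=True``` to have the DataLoader copy Tensors into pinned (page-locked) host memory before returning them. Transfers from pinned host memory to a CUDA device are faster than transfers from ordinary pageable memory.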

In [None]:
train_dataloader_faster = DataLoader(dataset, batch_size=5, shuffle=True, collate_fn= dataset.collate_fn, num_workers=4, pin_memory= True)
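
Pinning only helps when the batches are actually moved to a GPU. The following sketch shows one way to combine the two, assuming batches made of tensors: it enables pinning only when CUDA is available and uses `non_blocking=True` for the transfer. The names `pin_dataset` and `pin_loader` are illustrative, and on a CPU-only machine the code simply runs without pinning.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

inputs = torch.arange(10).float()
pin_dataset = TensorDataset(inputs, 2 * inputs)

# Pin host memory only when there is a GPU to copy to.
pin_loader = DataLoader(pin_dataset, batch_size=5, pin_memory=use_cuda)

for xb, yb in pin_loader:
    # With a pinned source, non_blocking=True lets the copy overlap with computation.
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    print(xb.device, tuple(xb.shape))
```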