In [1]:
import torch
import numpy as np

**1 epoch**: one complete pass of the training dataset through the algorithm.

**batch_size**: the number of training examples in one forward/backward pass. The larger the batch size, the more memory is needed.

**Number of iterations = number of batches**: the number of passes needed to complete one epoch, where each pass uses batch_size examples.

Example: with 100 training examples and a batch size of 20, it takes 5 iterations to complete 1 epoch.
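
The arithmetic above can be checked with a small helper. This is only a sketch: `iterations_per_epoch` is a hypothetical name, and it assumes the final, smaller batch is kept when the dataset size is not a multiple of the batch size.

```python
import math

def iterations_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches (iterations) needed to see every example once."""
    # Round up so that a final, partially filled batch still counts as one iteration.
    return math.ceil(num_examples / batch_size)

print(iterations_per_epoch(100, 20))  # 5
print(iterations_per_epoch(10, 4))    # 3 (the last batch holds only 2 examples)
```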

# Dataloaders (PyTorch)

A Dataset retrieves our dataset's features and labels one sample at a time. While training a model, we typically want to:

1.   Pass samples in “minibatches”
2.   Reshuffle the data at every epoch to reduce model overfitting
3.   Use Python's multiprocessing to speed up data retrieval
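
To make the first two points concrete, here is a plain-Python sketch of what minibatching and per-epoch reshuffling amount to. It is not how PyTorch implements them; `minibatches` is a hypothetical helper written for illustration, and the multiprocessing point is left out.

```python
import random

def minibatches(num_samples, batch_size, shuffle, rng):
    """Yield lists of sample indices, one list per minibatch."""
    indices = list(range(num_samples))
    if shuffle:
        rng.shuffle(indices)  # a new order every time the generator is created
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]

rng = random.Random(0)
for epoch in range(2):
    # Calling minibatches() again at each epoch reshuffles the indices.
    batches = list(minibatches(10, 4, shuffle=True, rng=rng))
    print(f"epoch {epoch}: {batches}")
```

Every index appears exactly once per epoch, and only the grouping into batches changes between epochs.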

# Sample DataLoader

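A DataLoader wraps a Dataset and handles the data loading logic: batching, shuffling, and parallel retrieval. It needs a Dataset to draw samples from, so we define one first.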


In [1]:
from torch.utils.data import Dataset, DataLoader
# Dataloader will use dataset to create batches, process data etc.

class MyDataset(Dataset):
    # constructor: stores the inputs (xs) and targets (ys)
    def __init__(self, xs, ys):
        self.input = xs
        self.target = ys

    # returns the length of the dataset
    def __len__(self):
        return len(self.input)

    # returns the item at index i
    def __getitem__(self, i):
        return self.input[i], self.target[i]

Suppose we want to train a model to learn that target = 2 × input. The following dummy dataset encodes that relationship:

In [2]:
# We are creating a dummy dataset to test Dataloaders
input = list(range(10))
target = list(range(0, 20, 2))
print('input values: ', input)
print('target values: ', target)

# Create an instance of MyDataset class
dataset = MyDataset(input, target)

dataset[4]

input values:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
target values:  [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]


(4, 8)

### Let's look at different ways of creating a DataLoader object


In [5]:
# default batch size of 1, so x and y each hold one sample; no shuffling
for x, y in DataLoader(dataset):
    print(f"batch of inputs: {x.numpy()}, batch of labels: {y.numpy()}")

batch of inputs: [0], batch of labels: [0]
batch of inputs: [1], batch of labels: [2]
batch of inputs: [2], batch of labels: [4]
batch of inputs: [3], batch of labels: [6]
batch of inputs: [4], batch of labels: [8]
batch of inputs: [5], batch of labels: [10]
batch of inputs: [6], batch of labels: [12]
batch of inputs: [7], batch of labels: [14]
batch of inputs: [8], batch of labels: [16]
batch of inputs: [9], batch of labels: [18]


In [6]:
# batch size of 4, so x and y each hold up to 4 samples (the last batch gets the remaining 2); no shuffling
for x, y in DataLoader(dataset, batch_size=4):
    print(f"batch of inputs: {x.numpy()}, batch of labels: {y.numpy()}")

batch of inputs: [0 1 2 3], batch of labels: [0 2 4 6]
batch of inputs: [4 5 6 7], batch of labels: [ 8 10 12 14]
batch of inputs: [8 9], batch of labels: [16 18]


In [7]:
# batch size of 4 with random shuffling; the last batch still holds the remaining 2 samples
for x, y in DataLoader(dataset, batch_size=4, shuffle=True):
    print(f"batch of inputs: {x}, batch of labels: {y}")

batch of inputs: tensor([7, 5, 3, 9]), batch of labels: tensor([14, 10,  6, 18])
batch of inputs: tensor([6, 4, 0, 8]), batch of labels: tensor([12,  8,  0, 16])
batch of inputs: tensor([2, 1]), batch of labels: tensor([4, 2])
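
Shuffled batches differ from run to run, which can make debugging harder. One way to get a repeatable order is to pass a seeded `torch.Generator` through the DataLoader's `generator` argument. The sketch below assumes a `TensorDataset` as a stand-in for `MyDataset` so that it is self-contained; `shuffled_inputs` is a hypothetical helper.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

xs = torch.arange(10)
demo_dataset = TensorDataset(xs, 2 * xs)  # stand-in for MyDataset(input, target)

def shuffled_inputs(seed):
    """Return the input batches produced by one shuffled pass with a given seed."""
    generator = torch.Generator().manual_seed(seed)
    loader = DataLoader(demo_dataset, batch_size=4, shuffle=True, generator=generator)
    return [x.tolist() for x, _ in loader]

first = shuffled_inputs(0)
second = shuffled_inputs(0)
print(first)
print(first == second)  # True: same seed, same shuffling order
```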


In [None]:
# batch size of 4, drop the last batch if it has fewer than 4 samples
for x, y in DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True):
    print(f"batch of inputs: {x}, batch of labels: {y}")

batch of inputs: tensor([4, 9, 0, 6]), batch of labels: tensor([ 8, 18,  0, 12])
batch of inputs: tensor([3, 8, 5, 7]), batch of labels: tensor([ 6, 16, 10, 14])
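
The effect of `drop_last` on the number of iterations per epoch can be read off with `len()` on the DataLoader. A small sketch, again assuming a `TensorDataset` stand-in for `MyDataset`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

xs = torch.arange(10)
demo_dataset = TensorDataset(xs, 2 * xs)  # stand-in for MyDataset(input, target)

keep_partial = DataLoader(demo_dataset, batch_size=4)
drop_partial = DataLoader(demo_dataset, batch_size=4, drop_last=True)

print(len(keep_partial))  # 3 -> ceil(10 / 4), last batch has 2 samples
print(len(drop_partial))  # 2 -> floor(10 / 4), the 2 leftover samples are skipped

samples_seen = sum(len(x) for x, _ in drop_partial)
print(samples_seen)       # 8
```

With `drop_last=True` every batch has the same size, at the cost of skipping up to `batch_size - 1` samples per epoch. When shuffling is on, the skipped samples differ from epoch to epoch.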


We can use the ```num_workers``` argument to specify how many subprocesses to use for data loading. The default, 0, means that the data will be loaded in the main process.

In [None]:
# 2 subprocesses
for x, y in DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True, num_workers=2):
    print(f"batch of inputs: {x}, batch of labels: {y}")

Set ```pin_memory=True``` to have the DataLoader copy tensors into pinned (page-locked) host memory before returning them, which speeds up the subsequent transfer to a CUDA device.

In [9]:
DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True, num_workers=2, pin_memory=True)

<torch.utils.data.dataloader.DataLoader at 0x115e6371cd0>
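
Pinned memory only pays off when the batches are subsequently moved to a GPU. The sketch below shows where that transfer would sit in a minimal training loop for the target = 2 × input task. Several things here are assumptions for illustration rather than part of the notebook: the `TensorDataset` of floats standing in for `MyDataset`, the single-weight `nn.Linear` model without bias, the SGD settings, and the 20 epochs. It falls back to the CPU when no GPU is available.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Float copies of the toy data, shaped (N, 1) as nn.Linear expects.
xs = torch.arange(10, dtype=torch.float32).unsqueeze(1)
train_dataset = TensorDataset(xs, 2 * xs)

loader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=True,
    pin_memory=torch.cuda.is_available(),  # pinning only helps when a GPU is present
)

model = nn.Linear(1, 1, bias=False).to(device)  # one weight w: prediction = w * input
optimizer = torch.optim.SGD(model.parameters(), lr=0.005)
loss_fn = nn.MSELoss()

for epoch in range(20):          # 20 passes over the dataset
    for x, y in loader:          # 3 iterations per epoch (batches of 4, 4 and 2)
        # non_blocking=True lets the copy from pinned memory overlap with compute.
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

learned_weight = model.weight.item()
print(f"learned weight: {learned_weight:.4f}")  # should end up close to 2
```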
