# PyTorch Tutorial 09 - Dataset and DataLoader - Batch Training

https://www.youtube.com/watch?v=PXOzkkB5eH0&list=PLqnslRFeH2UrcDBWF5mfPGpqQDSta6VK4&index=10

In [1]:
import torch
# import torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math

## Creating a Dataset subclass

https://www.educba.com/dataset-pytorch/

In [2]:
class WineDataset(Dataset):
    def __init__(self):
        # data loading
        xy = np.loadtxt('wine.csv', delimiter=',', dtype=np.float32, skiprows=1)
        self.x = torch.from_numpy(xy[:, 1:])  # features: every column except the first, which holds the target
        self.y = torch.from_numpy(xy[:, [0]])  # targets, shape (n_samples, 1)
        self.n_samples = xy.shape[0]
        
    def __getitem__(self, index):
        return self.x[index], self.y[index]
        
    def __len__(self):
        return self.n_samples

#### Dataset demonstration: a look at the first element of the dataset

In [3]:
# dataset = WineDataset()
# first_data = dataset[0]
# features, labels = first_data
# print(features, labels)
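
If `wine.csv` is not at hand, the same map-style pattern can be tried on in-memory data. The following is a minimal sketch, assuming a made-up `ArrayDataset` class and random values (neither is part of the tutorial); as in `WineDataset`, the first column plays the role of the target.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class ArrayDataset(Dataset):
    """Hypothetical stand-in for WineDataset that wraps an in-memory array."""

    def __init__(self, xy):
        # column 0 is the target, the remaining columns are the features
        self.x = torch.from_numpy(xy[:, 1:])
        self.y = torch.from_numpy(xy[:, [0]])  # shape (n_samples, 1)
        self.n_samples = xy.shape[0]

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.n_samples


# synthetic data: 10 samples, 1 target column + 13 feature columns (assumed sizes)
rng = np.random.default_rng(0)
xy = rng.random((10, 14), dtype=np.float32)

demo_dataset = ArrayDataset(xy)
features, label = demo_dataset[0]
print(len(demo_dataset), features.shape, label.shape)
```

Indexing returns one `(features, label)` pair, and `len()` returns the number of rows; these two methods are what a `DataLoader` relies on for a map-style dataset.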

## Creating a DataLoader

PyTorch's `DataLoader` combines a dataset with a sampler and provides an iterable over the dataset. It collects individual samples into batches automatically, and it can load data either in the main process or in several worker processes, depending on the amount of data and the speed required. Loading in batches means the training loop never has to process the whole dataset at once, and with worker processes the next batches can be prepared while the current one is being used. `DataLoader` works with both map-style and iterable-style datasets, and the loading order can be customized.

https://www.educba.com/pytorch-dataloader/
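
To make the dataset, sampler and dataset-style terms concrete, here is a small sketch on toy data. `CountStream` and the tensor of ten numbers are made up for illustration; the sampler classes are the ones `DataLoader` picks by itself when `shuffle` is set.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, TensorDataset

# map-style dataset: supports indexing and len()
map_ds = TensorDataset(torch.arange(10).float().unsqueeze(1))

# shuffle=False uses a SequentialSampler, shuffle=True a RandomSampler
ordered = DataLoader(map_ds, batch_size=4, shuffle=False)
shuffled = DataLoader(map_ds, batch_size=4, shuffle=True)
print(type(ordered.sampler).__name__, type(shuffled.sampler).__name__)

first_batch = next(iter(ordered))[0]
print(first_batch.squeeze(1))  # first four samples in index order


class CountStream(IterableDataset):
    """Hypothetical iterable-style dataset: yields samples, no indexing."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield torch.tensor([float(i)])


# an iterable-style dataset is batched in the order it yields its samples
stream_batches = list(DataLoader(CountStream(10), batch_size=4))
print([b.shape[0] for b in stream_batches])
```

With a map-style dataset the sampler decides which indices go into each batch; with an iterable-style dataset there are no indices, so the order is whatever `__iter__` produces.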

In [4]:
dataset = WineDataset()
dataloader = DataLoader(dataset=dataset, 
                        batch_size=4, 
                        shuffle=True, 
                        num_workers=0)  # 0 (the default) loads the data in the main process;
                                        # num_workers > 0 starts worker processes, which has nothing to do with CUDA;
                                        # on Windows the workers are started with 'spawn', so a script needs an
                                        # `if __name__ == '__main__':` guard, and a Dataset defined in a notebook may fail
dataloader

<torch.utils.data.dataloader.DataLoader at 0x2ae6caf0340>

#### DataLoader demonstration

In [5]:
# dataiter = iter(dataloader)
# data = next(dataiter)
# features, labels = data

# features, labels  # 4 elements per batch, as specified by 'batch_size=4'

In [6]:
# training loop
num_epochs = 2
total_samples = len(dataset)
n_iterations = math.ceil(total_samples / dataloader.batch_size)  # batches per epoch, same as len(dataloader)

total_samples, n_iterations

(178, 45)
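
The arithmetic behind these numbers: 178 samples in batches of 4 give 44 full batches (176 samples) plus one final batch of 2, so `ceil(178 / 4) = 45`. A quick check on a synthetic stand-in with the same shape as the wine data (zeros instead of real values, an assumption made only so the snippet runs without `wine.csv`):

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# synthetic stand-in with the same size as the wine data: 178 samples, 13 features
x = torch.zeros(178, 13)
y = torch.zeros(178, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=4, shuffle=True)

n_batches = len(loader)  # number of batches per epoch
batch_sizes = [inputs.shape[0] for inputs, _ in loader]
print(n_batches, math.ceil(178 / 4), batch_sizes[-1])

# drop_last=True discards the incomplete final batch
loader_full_only = DataLoader(TensorDataset(x, y), batch_size=4, drop_last=True)
print(len(loader_full_only))
```

`len(dataloader)` already returns the number of batches, so the manual `math.ceil` is only needed when the loader object is not available.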

In [7]:
for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(dataloader):
        # the forward pass, backward pass and weight update would go here
        if (i+1) % 5 == 0:
            print(f'epoch {epoch+1}/{num_epochs}, step {i+1}/{n_iterations}, inputs {inputs.shape}')

epoch 1/2, step 5/45, inputs torch.Size([4, 13])
epoch 1/2, step 10/45, inputs torch.Size([4, 13])
epoch 1/2, step 15/45, inputs torch.Size([4, 13])
epoch 1/2, step 20/45, inputs torch.Size([4, 13])
epoch 1/2, step 25/45, inputs torch.Size([4, 13])
epoch 1/2, step 30/45, inputs torch.Size([4, 13])
epoch 1/2, step 35/45, inputs torch.Size([4, 13])
epoch 1/2, step 40/45, inputs torch.Size([4, 13])
epoch 1/2, step 45/45, inputs torch.Size([2, 13])
epoch 2/2, step 5/45, inputs torch.Size([4, 13])
epoch 2/2, step 10/45, inputs torch.Size([4, 13])
epoch 2/2, step 15/45, inputs torch.Size([4, 13])
epoch 2/2, step 20/45, inputs torch.Size([4, 13])
epoch 2/2, step 25/45, inputs torch.Size([4, 13])
epoch 2/2, step 30/45, inputs torch.Size([4, 13])
epoch 2/2, step 35/45, inputs torch.Size([4, 13])
epoch 2/2, step 40/45, inputs torch.Size([4, 13])
epoch 2/2, step 45/45, inputs torch.Size([2, 13])
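
The loop above only prints the batch shapes. As a rough sketch of what the forward, backward and update steps could look like, the snippet below trains a linear model on synthetic data. The model, loss, learning rate and the made-up regression target are all assumptions for illustration and are not taken from the tutorial; the real wine targets are class labels and would call for a classification loss.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# synthetic stand-in for the wine data: 178 samples, 13 features, 1 target
x = torch.randn(178, 13)
y = x @ torch.randn(13, 1)  # made-up linear target, not the real labels
loader = DataLoader(TensorDataset(x, y), batch_size=4, shuffle=True)

model = nn.Linear(13, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

num_epochs = 2
epoch_losses = []
for epoch in range(num_epochs):
    running = 0.0
    for inputs, targets in loader:
        loss = criterion(model(inputs), targets)  # forward
        optimizer.zero_grad()                     # clear old gradients
        loss.backward()                           # backward
        optimizer.step()                          # update
        running += loss.item() * inputs.shape[0]
    epoch_losses.append(running / len(loader.dataset))
    print(f'epoch {epoch+1}/{num_epochs}, mean loss {epoch_losses[-1]:.4f}')
```

Each pass of the inner loop performs one optimizer step on one batch, so an epoch over 178 samples with `batch_size=4` corresponds to 45 weight updates.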
