# Dataset & DataLoader - Batch Training

In this tutorial, we will look at the `pytorch` `Dataset` and `DataLoader` classes. So far, we have loaded a dataset in one go, for example from a `.csv` file. Our training loop then ran over a number of epochs and optimized the model on the whole dataset each time.

`data = numpy.loadtxt('wine.csv')
#training loop
for epoch in range(1000):
    x, y = data
    # forward + backward + weight updates` 

Computing gradients over the whole training dataset in every step can be very time consuming. A computationally cheaper approach is to divide the samples into smaller batches. Then, our training loop will look something like this:

`#training loop
for epoch in range(1000):
    #loop over all batches
    for i in range(total_batches):
        x_batch, y_batch = ...`

We do the optimization based only on those batches. However, if we use the built-in `Dataset` and `DataLoader` classes from `pytorch`, they will handle the batching and iteration for us, so they are very easy to use.
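For comparison, the manual batch loop described above could be sketched in plain Python as follows. The helper name `iter_batches` and the toy data are assumptions made for illustration and are not part of the tutorial. The same slicing works on NumPy arrays and tensors.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples`, each holding at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Toy data: 10 samples, each a (feature, label) pair (hypothetical values).
toy_samples = [(float(i), i % 3) for i in range(10)]

batch_sizes = []
for epoch in range(2):
    for batch in iter_batches(toy_samples, batch_size=4):
        # A real loop would run forward + backward + weight update here.
        batch_sizes.append(len(batch))

print(batch_sizes)  # each epoch yields batches of 4, 4 and 2 samples
```

Note that this sketch does not shuffle the samples between epochs, which is one of the things `DataLoader` can do for us.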

Before we jump into the code, let's talk about some terms relating to batch training. 

* epoch = 1 complete forward and backward pass of ALL training samples
* batch_size = number of training samples in one forward and backward pass
* number of iterations = number of passes per epoch, each using *batch_size* samples

Example: 100 samples, batch_size = 20 -> 100/20 = 5 iterations for 1 epoch. 
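The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch; the function name `iterations_per_epoch` is hypothetical. It rounds up, on the assumption that a final, smaller batch is still processed when the sample count is not divisible by the batch size.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


print(iterations_per_epoch(100, 20))  # 5, matching the example above
print(iterations_per_epoch(178, 4))   # 45: 44 full batches plus one batch of 2
```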

First, let's import the modules we need.

In [1]:
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math

Now, let's make our dataset. This will be the dataset `wine.csv`, as provided by the tutorial. The target is the type of wine, labeled 1, 2, or 3, and it is stored in the first column. All the other columns are the features. Let's load this and split our columns into `X` and `y`. 

In [3]:
class WineDataset(Dataset):
    
    def __init__(self):
        #data loading 
        #will also convert to torch
        xy = np.loadtxt('C:\\Users\\onef0\\Desktop\\PyTorch Tutorial\\wine.csv', delimiter = ",", dtype = np.float32, skiprows = 1)
        self.x = torch.from_numpy(xy[:, 1:]) #features: all rows, every column except the first
        self.y = torch.from_numpy(xy[:, [0]]) #labels: all rows, first column only - the extra
        #brackets keep the shape as (n_samples, 1) instead of (n_samples,)
        self.n_samples = xy.shape[0] #first dimension is the number of samples
        
    def __getitem__(self, index):
        #dataset[0]
        return self.x[index], self.y[index]
    
    def __len__(self):
        #len(dataset)
        return self.n_samples
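If `wine.csv` is not at hand, the same `Dataset` pattern can be tried with data generated in memory. The following is a minimal sketch, not the tutorial's dataset: the class name `ToyDataset`, its default sizes, and the random values are all assumptions chosen to mimic the shapes used above.

```python
import torch
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for WineDataset that generates its data in memory."""

    def __init__(self, n_samples=12, n_features=13):
        generator = torch.Generator().manual_seed(0)
        self.x = torch.rand(n_samples, n_features, generator=generator)
        # Labels 1, 2, 3 stored as floats with shape (n_samples, 1), like the wine labels.
        self.y = torch.randint(1, 4, (n_samples, 1), generator=generator).float()
        self.n_samples = n_samples

    def __getitem__(self, index):
        # Called for toy[index]; returns one (features, label) pair.
        return self.x[index], self.y[index]

    def __len__(self):
        # Called for len(toy).
        return self.n_samples


toy = ToyDataset()
toy_features, toy_label = toy[0]
print(len(toy), toy_features.shape, toy_label.shape)
```

Only `__getitem__` and `__len__` are needed for `DataLoader` to batch and shuffle a map-style dataset like this one.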

Now, we will look at our dataset. 

In [4]:
dataset = WineDataset()

#look at the first sample
first_data = dataset[0]

#lets unpack this into features and labels
features, labels = first_data
print(features, labels)

tensor([1.4230e+01, 1.7100e+00, 2.4300e+00, 1.5600e+01, 1.2700e+02, 2.8000e+00,
        3.0600e+00, 2.8000e-01, 2.2900e+00, 5.6400e+00, 1.0400e+00, 3.9200e+00,
        1.0650e+03]) tensor([1.])


So, this is how we get the dataset. Now let's create a `DataLoader`. 

In [7]:
#now, we will make a dataloader with a batch size of 4 that also shuffles the dataset
dataloader = DataLoader(dataset = dataset, batch_size = 4, shuffle = True)
#optionally, num_workers = 2 loads data in multiple subprocesses, which can make loading faster

Now, let's see how we can use this `DataLoader` object. 

In [8]:
#first, we can convert it to an iterator
dataiter = iter(dataloader)
data = next(dataiter) #the built-in next() works across versions; dataiter.next() fails on recent pytorch releases
features, labels = data

print(features, labels)

tensor([[1.2370e+01, 1.0700e+00, 2.1000e+00, 1.8500e+01, 8.8000e+01, 3.5200e+00,
         3.7500e+00, 2.4000e-01, 1.9500e+00, 4.5000e+00, 1.0400e+00, 2.7700e+00,
         6.6000e+02],
        [1.3720e+01, 1.4300e+00, 2.5000e+00, 1.6700e+01, 1.0800e+02, 3.4000e+00,
         3.6700e+00, 1.9000e-01, 2.0400e+00, 6.8000e+00, 8.9000e-01, 2.8700e+00,
         1.2850e+03],
        [1.2080e+01, 1.3300e+00, 2.3000e+00, 2.3600e+01, 7.0000e+01, 2.2000e+00,
         1.5900e+00, 4.2000e-01, 1.3800e+00, 1.7400e+00, 1.0700e+00, 3.2100e+00,
         6.2500e+02],
        [1.3050e+01, 5.8000e+00, 2.1300e+00, 2.1500e+01, 8.6000e+01, 2.6200e+00,
         2.6500e+00, 3.0000e-01, 2.0100e+00, 2.6000e+00, 7.3000e-01, 3.1000e+00,
         3.8000e+02]]) tensor([[2.],
        [1.],
        [2.],
        [2.]])


In the above example, the batch size is 4, which is why the batch contains 4 different feature vectors. Each vector has one class label, so there are 4 output labels. 

We can also iterate over the whole data loader instead of only getting the next item. Let's do a dummy training loop. First, let's define some things.

In [9]:
num_epochs = 2 
total_samples = len(dataset)
n_iterations = math.ceil(total_samples/4)
print(total_samples, n_iterations) 

178 45


As we can see, we have 178 samples and 45 iterations per epoch, since 178/4 = 44.5 rounds up to 45.

In [10]:
#now, we will do the actual loop
for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(dataloader):
        #forward pass
        
        #backward pass
        
        #update weights 
        
        #since this is an example, we will only print information
        if (i+1) % 5 == 0:
            print(f'epoch {epoch+1}/{num_epochs}, step {i+1}/{n_iterations}, inputs {inputs.shape}')

epoch 1/2, step 5/45, inputs torch.Size([4, 13])
epoch 1/2, step 10/45, inputs torch.Size([4, 13])
epoch 1/2, step 15/45, inputs torch.Size([4, 13])
epoch 1/2, step 20/45, inputs torch.Size([4, 13])
epoch 1/2, step 25/45, inputs torch.Size([4, 13])
epoch 1/2, step 30/45, inputs torch.Size([4, 13])
epoch 1/2, step 35/45, inputs torch.Size([4, 13])
epoch 1/2, step 40/45, inputs torch.Size([4, 13])
epoch 1/2, step 45/45, inputs torch.Size([2, 13])
epoch 2/2, step 5/45, inputs torch.Size([4, 13])
epoch 2/2, step 10/45, inputs torch.Size([4, 13])
epoch 2/2, step 15/45, inputs torch.Size([4, 13])
epoch 2/2, step 20/45, inputs torch.Size([4, 13])
epoch 2/2, step 25/45, inputs torch.Size([4, 13])
epoch 2/2, step 30/45, inputs torch.Size([4, 13])
epoch 2/2, step 35/45, inputs torch.Size([4, 13])
epoch 2/2, step 40/45, inputs torch.Size([4, 13])
epoch 2/2, step 45/45, inputs torch.Size([2, 13])


So, here we see that we have 2 epochs. In every epoch, we have 45 steps, and every 5th step we print some information. The input tensor is 4x13: the batch size is 4 and each sample has 13 features. The last step of each epoch holds only 2 samples, because 178 is not divisible by 4. 
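The smaller final batch seen in the output above can be avoided with the `drop_last` argument of `DataLoader`, if that is what your training code needs. The sketch below uses zero-filled tensors wrapped in a `TensorDataset` as an assumed stand-in for the wine data, so it runs without the `.csv` file. The variable names are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 178 samples with 13 features, the same shape as the wine data.
x = torch.zeros(178, 13)
y = torch.zeros(178, 1)
toy_dataset = TensorDataset(x, y)

keep_last = DataLoader(toy_dataset, batch_size=4, shuffle=True)
drop_last = DataLoader(toy_dataset, batch_size=4, shuffle=True, drop_last=True)

# len() of a DataLoader is the number of batches per epoch.
print(len(keep_last), len(drop_last))

# Collect the batch sizes of one epoch for each loader.
sizes_keep = [inputs.shape[0] for inputs, _ in keep_last]
sizes_drop = [inputs.shape[0] for inputs, _ in drop_last]
print(sizes_keep[-1], sizes_drop[-1])
```

With `drop_last=True`, the 2 leftover samples are skipped in that epoch. Because shuffling is on, different samples are left out each time.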

`pytorch` also has some built-in datasets, available through `torchvision`:

* `torchvision.datasets.MNIST()`

Additionally, you can load the `fashion-mnist`, `cifar` and `coco` datasets. 

We will see more about this in the next tutorial.