##### ***Epoch:***
An epoch completes once the whole dataset has undergone forward propagation and backpropagation exactly once. Passing the whole dataset through the network only once is usually not enough for the model to learn the underlying pattern, so the fitted curve underfits. Therefore, we pass the entire dataset through the neural network model more than once to move the fit from underfitting toward optimal. Running more epochs than needed, however, can cause overfitting.

##### ***Batch Size:***
The number of data points in a single batch passed through the neural network is called the batch size.

Sometimes the whole dataset cannot be passed through the neural network at once because it is too large to fit in memory. We therefore divide the dataset into a number of smaller parts called batches, and these batches are passed through the model one at a time for training.

#### ***Iteration:***
An iteration is one pass of a single batch through the network, so the ***number of batches*** equals the number of iterations needed to complete one epoch.
For example, with `1000` samples and a `batch_size` of `100`, the number of batches is `1000 / 100 = 10`, so `10` iterations are needed to complete ***one epoch.***
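
The arithmetic above can be sketched in a few lines. This is a minimal illustration, not part of any library: the helper name `iterations_per_epoch` is made up here, and it assumes the last, smaller batch is kept when the dataset size is not a multiple of the batch size.

```python
import math

def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches (iterations) needed to see every sample once."""
    # ceil accounts for a final, smaller batch when the division is not exact
    return math.ceil(num_samples / batch_size)

print(iterations_per_epoch(1000, 100))  # 10
print(iterations_per_epoch(178, 4))     # 45, the last batch holds only 2 samples
```

If incomplete batches were dropped instead, the count would be `num_samples // batch_size`.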

#### ***Dataset:***
A `Dataset` (`torch.utils.data.Dataset`) stores the samples and their corresponding labels.

#### ***DataLoader:***
A `DataLoader` (`torch.utils.data.DataLoader`) wraps an iterable around the `Dataset` to enable easy access to the samples. To use the `DataLoader`, we set the following parameters:

1. `dataset`: the training data used to train the model, or the test data used to evaluate it.
2. `batch_size`: the number of records to be processed in each batch.
3. `shuffle`: whether to reshuffle the sample indices at every epoch.

Syntax: `train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)`

[More Info](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)

#### ***ToTensor:***
`ToTensor` converts a PIL image or NumPy `ndarray` into a `FloatTensor` and scales the image's pixel intensity values to the range [0., 1.].
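
To make the scaling concrete, here is a rough NumPy-only sketch of what happens to a `uint8` image. It is not torchvision's implementation: `to_tensor_like` is a hypothetical helper that only mimics the documented layout change (H x W x C to C x H x W) and the division by 255, and it returns a NumPy array rather than a tensor.

```python
import numpy as np

def to_tensor_like(image: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for ToTensor on a uint8 image (H x W or H x W x C)."""
    if image.ndim == 2:
        # grayscale image without a channel axis: add one at the end
        image = image[:, :, None]
    # move channels first: H x W x C -> C x H x W
    chw = image.transpose(2, 0, 1)
    # scale 0..255 integer intensities to 0.0..1.0 floats
    return chw.astype(np.float32) / 255.0

image = np.array([[0, 128], [255, 64]], dtype=np.uint8)
scaled = to_tensor_like(image)
print(scaled.shape)                # (1, 2, 2)
print(scaled.min(), scaled.max())  # 0.0 1.0
```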


#### ***collate_fn:*** 
The `collate_fn` parameter of `DataLoader` is a function that defines how individual samples from the dataset are combined into a batch.
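
A common reason to supply your own `collate_fn` is that samples differ in length and cannot be stacked directly. The sketch below assumes a toy list of variable-length sequences (made up for illustration) and a hypothetical function `pad_collate` that pads each batch to its longest sequence.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy data (made up for illustration): variable-length sequences with integer labels
samples = [
    (torch.tensor([1.0, 2.0, 3.0]), 0),
    (torch.tensor([4.0]), 1),
    (torch.tensor([5.0, 6.0]), 0),
    (torch.tensor([7.0, 8.0, 9.0, 10.0]), 1),
]

def pad_collate(batch):
    """Hypothetical collate_fn: pad the sequences in a batch to the longest one."""
    sequences, labels = zip(*batch)
    # remember the true lengths so padding can be ignored later
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0.0)
    return padded, torch.tensor(labels), lengths

# a plain list works as a map-style dataset because it has __getitem__ and __len__
loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=pad_collate)

for padded, labels, lengths in loader:
    print(padded.shape, labels, lengths)
```

Without a custom `collate_fn`, the default collation would try to stack tensors of different lengths and fail for this data.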

In [2]:
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.sampler import SubsetRandomSampler
import numpy as np
import math

In [2]:
class WineDataSet(Dataset):
    def __init__(self):
        # Load the CSV with NumPy, skipping the header row
        xy = np.loadtxt("../data/wine/wine.csv", delimiter=',', dtype=np.float32, skiprows=1)
        self.n_samples = xy.shape[0]
        self.x = torch.from_numpy(xy[:, 1:])   # features: every column after the first, shape (n_samples, 13)
        self.y = torch.from_numpy(xy[:, [0]])  # labels: first column, shape (n_samples, 1)

    def __getitem__(self, index):
        return self.x[index], self.y[index]
    
    def __len__(self):
        return self.n_samples


In [4]:
dataset = WineDataSet()
sample = dataset[1]  # index 1 is the second sample in the dataset
feature, label = sample
print("Features: ", feature, "\n Labels: ", label)


Features:  tensor([1.3200e+01, 1.7800e+00, 2.1400e+00, 1.1200e+01, 1.0000e+02, 2.6500e+00,
        2.7600e+00, 2.6000e-01, 1.2800e+00, 4.3800e+00, 1.0500e+00, 3.4000e+00,
        1.0500e+03]) 
 Labels:  tensor([1.])


Each iteration over the `DataLoader` below returns one batch of features and labels (with `batch_size=4`, four feature vectors and four labels).

In [23]:
## DataLoader
train_loader = DataLoader(dataset=dataset,
                          batch_size=4,
                          shuffle=True)

# convert to an iterator and look at one random batch
dataiter = iter(train_loader)
data = next(dataiter)
features, labels = data
print(features, labels)

tensor([[1.1870e+01, 4.3100e+00, 2.3900e+00, 2.1000e+01, 8.2000e+01, 2.8600e+00,
         3.0300e+00, 2.1000e-01, 2.9100e+00, 2.8000e+00, 7.5000e-01, 3.6400e+00,
         3.8000e+02],
        [1.2080e+01, 1.8300e+00, 2.3200e+00, 1.8500e+01, 8.1000e+01, 1.6000e+00,
         1.5000e+00, 5.2000e-01, 1.6400e+00, 2.4000e+00, 1.0800e+00, 2.2700e+00,
         4.8000e+02],
        [1.2850e+01, 1.6000e+00, 2.5200e+00, 1.7800e+01, 9.5000e+01, 2.4800e+00,
         2.3700e+00, 2.6000e-01, 1.4600e+00, 3.9300e+00, 1.0900e+00, 3.6300e+00,
         1.0150e+03],
        [1.3940e+01, 1.7300e+00, 2.2700e+00, 1.7400e+01, 1.0800e+02, 2.8800e+00,
         3.5400e+00, 3.2000e-01, 2.0800e+00, 8.9000e+00, 1.1200e+00, 3.1000e+00,
         1.2600e+03]]) tensor([[2.],
        [2.],
        [1.],
        [1.]])


In [25]:
# Dummy Training loop
num_epochs = 2
total_samples = len(dataset)
n_iterations = math.ceil(total_samples/4)
print(total_samples, n_iterations)

for epoch in range(num_epochs):
    for i,(inputs, labels) in enumerate(train_loader):
        # here: 178 samples, batch_size = 4, n_iters=178/4=44.5 -> 45 iterations
        # Run your training process
        if (i+1) % 5 == 0:
            print(f'Epoch: {epoch+1}/{num_epochs}, Step {i+1}/{n_iterations}| Inputs {inputs.shape} | Labels {labels.shape}')

178 45
Epoch: 1/2, Step 5/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
Epoch: 1/2, Step 10/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
Epoch: 1/2, Step 15/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
Epoch: 1/2, Step 20/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
Epoch: 1/2, Step 25/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
Epoch: 1/2, Step 30/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
Epoch: 1/2, Step 35/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
Epoch: 1/2, Step 40/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
Epoch: 1/2, Step 45/45| Inputs torch.Size([2, 13]) | Labels torch.Size([2, 1])
Epoch: 2/2, Step 5/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
Epoch: 2/2, Step 10/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
Epoch: 2/2, Step 15/45| Inputs torch.Size([4, 13]) | Labels torch.Size([4, 1])
Epoch: 2/2, Step 20/45| Inputs torch.Size([4, 1

In [21]:
train_dataset = torchvision.datasets.MNIST(root='./data', 
                                           train=True, 
                                           transform=torchvision.transforms.ToTensor(),  
                                           download=True)
test_dataset = torchvision.datasets.MNIST(root='./data', 
                                           train=False, 
                                           transform=torchvision.transforms.ToTensor(),  
                                           download=True)

train_loader = DataLoader(dataset=train_dataset, 
                                           batch_size=12000, 
                                           shuffle=True)
test_loader = DataLoader(dataset=test_dataset, 
                                           batch_size=3, 
                                           shuffle=True)

# look at one random sample
# dataiter = iter(test_loader)
# data = next(dataiter)
# inputs, targets = data
# print(inputs.shape, targets.shape)# batch size, color channel, height, width
for images, labels in train_loader:
    print(images.size())  # torch.Size([batch_size, 1, 28, 28]); batch_size is 12000 here
    print(labels.size())  # torch.Size([batch_size])


torch.Size([12000, 1, 28, 28])
torch.Size([12000])
torch.Size([12000, 1, 28, 28])
torch.Size([12000])
torch.Size([12000, 1, 28, 28])
torch.Size([12000])
torch.Size([12000, 1, 28, 28])
torch.Size([12000])
torch.Size([12000, 1, 28, 28])
torch.Size([12000])


In [14]:

valid_size = 0.2
num_train = len(train_dataset)
indices = list(range(num_train))
print(num_train, len(indices))  # printing all the indices themselves would flood the output

np.random.shuffle(indices)
split = int(np.floor(valid_size * num_train))
train_idx, valid_idx = indices[split:], indices[:split]
print(len(train_idx),min(train_idx), max(train_idx))
print(len(valid_idx), max(valid_idx))

48000 0 59999
12000 59991


In [1]:
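import numpy as np
import torch
from torch.utils.data.sampler import SubsetRandomSampler
from torchvision import datasets, transforms

def get_dataloader(batch_size):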

    # fraction of the training set to hold out for validation
    valid_size = 0.2

    # convert data to torch.FloatTensor
    transform = transforms.ToTensor()

    # choose the training and test datasets
    train_data = datasets.MNIST(root='./data', 
                                train=True,
                                download=True, 
                                transform=transform)

    test_data = datasets.MNIST(root='./data',
                               train=False,
                               download=True,
                               transform=transform)

    # obtain training indices that will be used for validation
    num_train = len(train_data)
    indices = list(range(num_train))
    np.random.shuffle(indices)
    split = int(np.floor(valid_size * num_train))
    train_idx, valid_idx = indices[split:], indices[:split]
    
    # define samplers for obtaining training and validation batches
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)
    
    # load training data in batches
    train_loader = torch.utils.data.DataLoader(train_data,
                                               batch_size=batch_size,
                                               sampler=train_sampler,
                                               num_workers=0)
    
    # load validation data in batches
    valid_loader = torch.utils.data.DataLoader(train_data,
                                               batch_size=batch_size,
                                               sampler=valid_sampler,
                                               num_workers=0)
    
    # load test data in batches
    test_loader = torch.utils.data.DataLoader(test_data,
                                              batch_size=batch_size,
                                              num_workers=0)
    
    # note the return order: train, test, then validation
    return train_loader, test_loader, valid_loader