# Batches and dataloaders

One of the handy things about PyTorch is that it handles batching almost automatically.

This is useful because, instead of running each optimization step over the whole dataset, we can run it over one batch at a time.

We will also look at a way to guard against overfitting by splitting the data into train, validation, and test sets.
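As a rough sketch of the train/validation/test split mentioned above, the snippet below uses `torch.utils.data.random_split`. The synthetic tensors, the 124/27/27 split sizes, and the seed are assumptions chosen for illustration; they only mimic the shape of the wine data and are not taken from the notebook.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in data with the same shape as the wine dataset (assumption).
features = torch.randn(178, 13)
labels = torch.randint(0, 3, (178, 1)).float()
full_dataset = TensorDataset(features, labels)

# Roughly 70% / 15% / 15%; the three lengths must sum to len(full_dataset).
n_train, n_val, n_test = 124, 27, 27
generator = torch.Generator().manual_seed(0)  # fixed seed for a reproducible split
train_set, val_set, test_set = random_split(
    full_dataset, [n_train, n_val, n_test], generator=generator
)

print(len(train_set), len(val_set), len(test_set))  # 124 27 27
```

Each returned object is a `Subset`, so it can be passed to a `DataLoader` in the same way as a full dataset.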

In [2]:
import torch 
import torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
from sklearn.datasets import load_wine

In [5]:
X, y = load_wine(return_X_y=True)
print(f"X shape: {X.shape}") #samples, features
print(f"y shape: {y.shape}") 

X shape: (178, 13)
y shape: (178,)


We can use PyTorch to load our data in a structured way:

1. We will create a dataset class that inherits from PyTorch's `Dataset` class.
2. We will define three main methods inside it:
    1. **`__init__`:** the constructor of the class, where the data is loaded
    2. **`__getitem__`:** lets us select entries of the tensors by index
    3. **`__len__`:** returns the number of samples in the dataset

In [11]:
class WineData(Dataset):
    def __init__(self):
        x, y = load_wine(return_X_y=True)
        # float32 is torch's default floating-point dtype
        self.x = torch.from_numpy(x.astype(np.float32))
        # give y an explicit second axis: (n,) -> (n, 1)
        self.y = torch.from_numpy(y[:, np.newaxis].astype(np.float32))
        self.n_samples = x.shape[0]

    def __getitem__(self, index):
        # return one (features, label) pair
        return self.x[index], self.y[index]

    def __len__(self):
        return self.n_samples

In [14]:
dataset = WineData()
dataset[0]

(tensor([1.4230e+01, 1.7100e+00, 2.4300e+00, 1.5600e+01, 1.2700e+02, 2.8000e+00,
         3.0600e+00, 2.8000e-01, 2.2900e+00, 5.6400e+00, 1.0400e+00, 3.9200e+00,
         1.0650e+03]),
 tensor([0.]))
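To see what the three methods provide, here is a minimal sketch using a hypothetical `ToyData` class. It mirrors the structure of `WineData` but fills itself with random tensors (an assumption, so that the snippet runs without scikit-learn).

```python
import torch
from torch.utils.data import Dataset


class ToyData(Dataset):
    """Hypothetical stand-in for WineData, built from random tensors."""

    def __init__(self, n_samples=178, n_features=13):
        self.x = torch.randn(n_samples, n_features)
        self.y = torch.randint(0, 3, (n_samples, 1)).float()
        self.n_samples = n_samples

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.n_samples


toy = ToyData()
print(len(toy))                          # 178, provided by __len__

sample_x, sample_y = toy[0]              # one sample, provided by __getitem__
print(sample_x.shape, sample_y.shape)    # torch.Size([13]) torch.Size([1])

first_x, first_y = toy[:5]               # slices work because tensors accept them
print(first_x.shape, first_y.shape)      # torch.Size([5, 13]) torch.Size([5, 1])
```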

## Why did we do all of this?

Well, now that we have a dataset object, we can use torch's `DataLoader` to work directly with batches and shuffle the data on the fly.

In [15]:
dataloader = DataLoader(dataset=dataset, batch_size=6, shuffle=True)
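A loader built this way can be iterated batch by batch. The sketch below assumes a synthetic `TensorDataset` with the same shape as the wine data (178 samples, 13 features) in place of `WineData`; the names `toy_dataset` and `toy_loader` are illustrative only.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the wine data (assumption): 178 samples, 13 features.
toy_dataset = TensorDataset(
    torch.randn(178, 13),
    torch.randint(0, 3, (178, 1)).float(),
)
toy_loader = DataLoader(dataset=toy_dataset, batch_size=6, shuffle=True)

# With the default drop_last=False, the loader yields ceil(178 / 6) batches.
n_batches = len(toy_loader)
print(n_batches, math.ceil(178 / 6))     # 30 30

batch_sizes = []
for batch_x, batch_y in toy_loader:
    # batch_x has shape (batch, 13) and batch_y has shape (batch, 1)
    batch_sizes.append(batch_x.shape[0])

print(batch_sizes[0], batch_sizes[-1])   # 6 4
```

Since 178 is not a multiple of 6, the last batch holds only the 4 remaining samples. Passing `drop_last=True` to `DataLoader` would discard that partial batch instead.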