## Data Loading Tutorial 

Loading data is a crucial step in the deep learning pipeline. PyTorch makes it easy to write custom data loaders for your particular dataset. In this notebook, we're going to download the CIFAR-10 dataset, load it, and return the images and labels as torch tensors.


## Getting the dataset 

Head over to the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset homepage and download the "python" version mentioned on the page. Extract the `tar.gz` archive in your home directory.
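
If you prefer to extract the archive from Python rather than from a file manager or shell, the step above can be sketched as follows. This is a minimal sketch: `extract_archive` is a hypothetical helper name, and the archive filename in the commented call is an assumption about what the download is called on your machine.

```python
import os
import tarfile

def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its top-level names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, 'r:gz') as tar:
        # Collect the top-level entries so we know which folder was created.
        names = sorted({member.name.split('/')[0] for member in tar.getmembers()})
        tar.extractall(dest_dir)
    return names

# Example call (both paths are assumptions; adjust them to your download):
# extract_archive(os.path.expanduser('~/cifar-10-python.tar.gz'),
#                 os.path.expanduser('~'))
```

The returned names tell you which directory to point the later cells at.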

## Exploring the dataset

The CIFAR-10 dataset is divided into five training batches and one test batch. To train effectively we need to use all the training data. The batches themselves are stored in Python's "pickle" format, so we have to "unpickle" them first:

In [1]:
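import os
import pickle

def unpickle(fname):
    # Load with encoding='bytes', so the dictionary keys come back
    # as bytes objects such as b'data' and b'labels'.
    with open(fname, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')
    return batch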

Let's test whether we can actually load the data after unpickling. To do that, we're going to load one of the data batches and see if we can find the data and labels in it:

In [2]:
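# The archive extracts to 'cifar-10-batches-py'; adjust the path if yours differs.
Dic = unpickle(os.path.join(os.path.expanduser('~'), 'cifar-10-batches-py', 'data_batch_1'))
print(Dic[b'data'].shape)
print(len(Dic[b'labels']))  # the labels are a plain list, which has no .shape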


We notice that the data is a numpy array of shape 10000x3072, which means that the data batch contains 10000 images, each stored as 3072 values (32x32x3). The labels are a list of 10000 entries in which the *i*th element is the correct label for the *i*th image. The other training batches are laid out the same way. Whenever you've got a new dataset, always try to load one or two images (or a data batch, as in this case) to get a feel for how the data is organized.
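
To make the 3072-value layout concrete, here is a minimal sketch of turning one row into a 32x32x3 image array. It assumes the layout described on the CIFAR-10 homepage (1024 red values, then 1024 green, then 1024 blue, each stored row by row). `row_to_image` is a hypothetical helper name, and the synthetic row stands in for a real entry such as `Dic[b'data'][0]`.

```python
import numpy as np

def row_to_image(row):
    """Convert one flat CIFAR-10 row (3072 values) to a 32x32x3 array."""
    row = np.asarray(row, dtype=np.uint8)
    if row.shape != (3072,):
        raise ValueError("expected 3072 values, got shape %s" % (row.shape,))
    # The channels come first in the flat row, so reshape to (3, 32, 32)
    # and then move the channel axis last: height x width x channel.
    return row.reshape(3, 32, 32).transpose(1, 2, 0)

# Synthetic stand-in: every red value is 10, green 20, blue 30.
fake_row = np.repeat([10, 20, 30], 1024)
image = row_to_image(fake_row)
print(image.shape)  # (32, 32, 3)
```

An array in this height x width x channel order is what plotting and image libraries generally expect.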

Loading data like this can be tedious and time-consuming. Let's write a helper function which will return the data and the labels:

In [1]:
def load_data(batch_name):
    dic = unpickle(batch_name)
    return dic[b'data'], dic[b'labels']

This little helper function calls our "unpickling" function and returns only the relevant parts of the batch. When writing a dataset loader, it is generally helpful to write helper functions like this one to return the data and the labels. For datasets that consist only of image files, you can use `PIL.Image` to read them. Armed with this, let's write a dataloader for PyTorch.
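
Since training needs all five training batches, a helper in the same spirit can stack them into one array. The following is a minimal, self-contained sketch rather than part of the loader built in this notebook; `load_training_set` is a hypothetical name, and it assumes the batch files are named `data_batch_*` and sit directly under `root`.

```python
import glob
import os
import pickle

import numpy as np

def load_training_set(root):
    """Load every data_batch_* file under root and stack them together."""
    paths = sorted(glob.glob(os.path.join(root, 'data_batch_*')))
    if not paths:
        raise FileNotFoundError('no data_batch_* files found in %r' % root)
    all_data, all_labels = [], []
    for path in paths:
        with open(path, 'rb') as f:
            batch = pickle.load(f, encoding='bytes')
        all_data.append(np.asarray(batch[b'data']))
        all_labels.extend(batch[b'labels'])
    # Five batches of 10000 rows each give an array of shape (50000, 3072).
    return np.concatenate(all_data, axis=0), all_labels

# Example call (the directory is an assumption; use your own path):
# data_array, label_list = load_training_set(os.path.expanduser('~/cifar-10-batches-py'))
```

Holding everything in memory is practical here because the full training set is small; for larger datasets, loading lazily per item is usually the better choice.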

## The PyTorch way of doing things

Data loading in PyTorch is a two-step process. First, you define a custom class that subclasses `Dataset` from `torch.utils.data`. Its constructor takes arguments that tell PyTorch where the dataset is located and which "transforms", if any, to apply when the data is loaded. Second, you wrap an instance of that class in a `DataLoader`, which handles batching and shuffling.

First, let's import the relevant modules. The dataloading utilities live in `torch.utils.data`, which we import as `data`. We also need the `glob` module to collect all our batch files into a list that we can index into:

In [None]:
import torch 
import torch.utils.data as data  # the module path is torch.utils.data
import glob


In [None]:
class CIFARLoader(data.Dataset):
    """
    CIFAR-10 Loader: Loads the CIFAR-10 data according to an index value 
    and returns the data and the labels. 
    
    args:
    root: Root of the data directory.
    
    Optional args:
    transform: The transform you wish to apply to the data.
    target_transform: The transform you wish to apply to the labels.
    
    """
    def __init__(self, root,transform=None, target_transform=None):
        self.root = root 
        self.transform = transform 
        self.target_transform = target_transform
        patt = os.path.join(self.root, 'data_batch_*') # create the pattern we want to search for.
        self.batches = sorted(glob.glob(patt)) # expand the pattern into a sorted list of batch files.
    
    def __getitem__(self, index):
        batch = self.batches[index]
        data, labels = load_data(batch)
        if self.transform is not None:
            data = self.transform(data)
        if self.target_transform is not None:
            labels = self.target_transform(labels)
        cache = (data, labels)
        return cache
    
    def __len__(self):
        return len(self.batches) # one item per batch file, matching __getitem__
        

Every custom dataset in PyTorch follows this form. The class you hand to a `DataLoader` defines three methods: `__init__`, `__getitem__`, and `__len__`.

The `__init__` method is where you define the arguments that give the location of your dataset and the transforms (if any). It is also a good idea to keep a variable that holds a list of all the images/batches in your dataset.

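The `__getitem__` method should accept only one argument: the index of the image/batch you want to access. Remember the little helper function we made earlier? We can use that to load our data batch and labels.

The `__len__` method should return the number of items the dataset can serve; the `DataLoader` uses it to know which indices are valid.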
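
The loader in this notebook treats each batch file as one item. A common alternative is to index individual images, so that the `DataLoader` assembles the mini-batches itself. The following is a minimal sketch of that variant, not the loader used in this notebook: `CIFARImages` is a hypothetical class name, the data is assumed to already be in memory as an (N, 3072) `uint8` array with a list of N labels, and random arrays stand in for the real batches.

```python
import numpy as np
import torch
import torch.utils.data as data

class CIFARImages(data.Dataset):
    """Hypothetical per-image dataset over in-memory CIFAR-style arrays."""

    def __init__(self, images, labels, transform=None):
        # images: (N, 3072) uint8 array; labels: sequence of N ints.
        if len(images) != len(labels):
            raise ValueError("images and labels must have the same length")
        self.images = images.reshape(-1, 3, 32, 32)
        self.labels = labels
        self.transform = transform

    def __getitem__(self, index):
        # Scale to [0, 1] floats in channel x height x width order.
        image = torch.from_numpy(self.images[index]).float() / 255.0
        if self.transform is not None:
            image = self.transform(image)
        return image, int(self.labels[index])

    def __len__(self):
        return len(self.labels)

# Random stand-in data; real data would come from the batch files.
fake_images = np.random.randint(0, 256, size=(20, 3072), dtype=np.uint8)
fake_labels = [i % 10 for i in range(20)]

dataset = CIFARImages(fake_images, fake_labels)
loader = data.DataLoader(dataset, batch_size=8, shuffle=True)
batch_images, batch_labels = next(iter(loader))
print(batch_images.size())  # torch.Size([8, 3, 32, 32])
print(batch_labels.size())  # torch.Size([8])
```

With this design `__len__` is the number of images, and `batch_size` and `shuffle` operate on single images rather than on whole batch files.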

Having written a dataloader, we can now test it. For the time being, don't worry about the details of the test code; we'll get to it in later talks. The code below can be used to test the dataloader class above:

In [2]:
import torchvision.transforms as transforms

tfs  = transforms.ToTensor() # converts a PIL image or numpy array into a torch tensor

root='/home/akulshr/cifar-10-batches-py/' # directory holding the data_batch_* files; adjust to your own path

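# The keyword names must match __init__: transform and target_transform.
# The labels are a plain list, which ToTensor rejects, so torch.tensor is used for them.
cifar_train = CIFARLoader(root, transform=tfs, target_transform=torch.tensor) # create a "CIFARLoader" instance.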
cifar_loader = data.DataLoader(cifar_train, batch_size=1, shuffle=True)

diter = iter(cifar_loader)
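cache = next(diter) # the built-in next() is the portable way to advance the iterator
images, labels = cache # avoid the name `data`, which would shadow the imported module
print(images.size())
print(labels.size())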

