# Data loading tutorial

This summary is based on [this](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html) tutorial. [This](https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel) blog post is also good.

In [2]:
import os

import torch
from torch.utils.data import Dataset, DataLoader

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from skimage import io, transform

``torch.utils.data.Dataset`` is an abstract class representing a
dataset.
Your custom dataset should inherit from ``Dataset`` and override the following
methods:

-  ``__len__`` so that ``len(dataset)`` returns the size of the dataset.
-  ``__getitem__`` to support indexing, so that ``dataset[i]`` returns
   the $i$-th sample.
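
As a minimal sketch of these two methods, here is a toy dataset that is not part of the original tutorial. The class name ``SquaresDataset`` and its contents are made up purely for illustration.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Toy dataset (hypothetical): sample i is the pair (i, i**2)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Lets len(dataset) report the number of samples.
        return self.n

    def __getitem__(self, idx):
        # Lets dataset[idx] return one sample; out-of-range indices raise
        # IndexError, as they would for a list.
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        x = torch.tensor(float(idx))
        return x, x ** 2


squares = SquaresDataset(5)
x3, y3 = squares[3]
print(len(squares), x3.item(), y3.item())
```

Nothing is computed until a sample is requested, which is the same lazy pattern the face landmarks dataset uses for its images.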

Let's create a dataset class for our face landmarks dataset. We will
read the csv in ``__init__`` but leave the reading of images to
``__getitem__``. This is memory efficient because the images are not all
stored in memory at once but are read as required.

A sample of our dataset will be a dict
``{'image': image, 'landmarks': landmarks}``. Our dataset will take an
optional argument ``transform`` so that any required processing can be
applied on the sample. We will see the usefulness of ``transform`` in the
next section.
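
As a rough illustration of what such a ``transform`` might look like, the sketch below defines a callable that takes a sample dict and returns a new dict holding tensors instead of arrays. The class name ``SampleToTensor`` and the fake sample are assumptions made for this example; only the dict keys come from the text above.

```python
import numpy as np
import torch


class SampleToTensor:
    """Convert the ndarrays in a sample dict to tensors (illustrative sketch)."""

    def __call__(self, sample):
        image, landmarks = sample['image'], sample['landmarks']
        # NumPy images are H x W x C, while PyTorch expects C x H x W.
        image = image.transpose((2, 0, 1))
        return {'image': torch.from_numpy(image),
                'landmarks': torch.from_numpy(landmarks)}


# A fake 4 x 6 RGB image with 5 landmark points, standing in for real data.
fake_sample = {'image': np.zeros((4, 6, 3), dtype=np.uint8),
               'landmarks': np.zeros((5, 2))}
converted = SampleToTensor()(fake_sample)
print(converted['image'].shape, converted['landmarks'].shape)
```

Because the dataset only calls ``self.transform(sample)``, any callable with this dict-in, dict-out shape can be passed as ``transform``.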

In [None]:
class FaceLandmarksDataset(Dataset):
    """Face Landmarks dataset."""

    def __init__(self, csv_file, root_dir, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.landmarks_frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.landmarks_frame)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        img_name = os.path.join(self.root_dir,
                                self.landmarks_frame.iloc[idx, 0])
        image = io.imread(img_name)
        landmarks = self.landmarks_frame.iloc[idx, 1:]
        landmarks = np.array([landmarks])
        landmarks = landmarks.astype('float').reshape(-1, 2)
        sample = {'image': image, 'landmarks': landmarks}

        if self.transform:
            sample = self.transform(sample)

        return sample

We can iterate over the created dataset with a ``for i in range`` loop.

In [None]:
dataset = FaceLandmarksDataset(csv_file='data/faces/face_landmarks.csv',
                               root_dir='data/faces/',
                               transform=None)

for i in range(len(dataset)):
    sample = dataset[i]

    # Without a transform the samples hold NumPy arrays, so use .shape
    print(i, sample['image'].shape, sample['landmarks'].shape)

    if i == 3:
        break

However, we are losing a lot of features by using a simple ``for`` loop to
iterate over the data. In particular, we are missing out on:

-  Batching the data
-  Shuffling the data
-  Loading the data in parallel using ``multiprocessing`` workers
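
To make the first two items concrete, here is a sketch of what shuffling and batching by hand could involve. The helper ``manual_batches`` is hypothetical and only produces index lists; it does no parallel loading.

```python
import torch


def manual_batches(num_samples, batch_size, shuffle=True):
    """Yield lists of sample indices, one list per batch (hypothetical helper)."""
    if shuffle:
        order = torch.randperm(num_samples)
    else:
        order = torch.arange(num_samples)
    for start in range(0, num_samples, batch_size):
        # The last batch may be smaller than batch_size.
        yield order[start:start + batch_size].tolist()


index_batches = list(manual_batches(10, batch_size=4))
print([len(b) for b in index_batches])
```

Each index list would still have to be turned into stacked tensors, which is extra bookkeeping on top of the plain ``for`` loop.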

``torch.utils.data.DataLoader`` is an iterable which provides all these
features. The parameters used below should be clear. One parameter of
interest is ``collate_fn``, which lets you specify exactly how the samples
are combined into a batch. The default collate function should work
fine for most use cases.
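
One case where a custom ``collate_fn`` may help is when samples have different lengths, since they cannot simply be stacked. The sketch below assumes a made-up ``VariableLengthDataset`` and a hypothetical ``pad_collate`` function that pads each batch to its longest sample; it is one possible approach, not the tutorial's own.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class VariableLengthDataset(Dataset):
    """Toy dataset (hypothetical): sample i is a vector of ones of a given length."""

    def __init__(self, lengths):
        self.lengths = lengths

    def __len__(self):
        return len(self.lengths)

    def __getitem__(self, idx):
        return torch.ones(self.lengths[idx])


def pad_collate(batch):
    """Zero-pad the sequences in a batch to a common length and keep the true lengths."""
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True)
    return padded, lengths


var_loader = DataLoader(VariableLengthDataset([2, 5, 3, 1]),
                        batch_size=2, collate_fn=pad_collate)
padded_shapes = [tuple(padded.shape) for padded, _ in var_loader]
print(padded_shapes)
```

With ``shuffle`` left at its default of ``False``, the batches follow dataset order, so each batch is padded only as far as its own longest sample.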


In [None]:
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as T

# Some definitions
NUM_TRAIN = 49000
USE_GPU = True
dtype = torch.float32
print_every = 100

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

### Dataset ###

# Preprocessing function. It is applied to each sample as the
# training data is loaded into RAM.
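# NOTE: T.ToTensor and T.Normalize act on a single image, so this pipeline
# assumes the dataset hands its transform an image, not a sample dict.
transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])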

# Object that represents the dataset and holds all of its
# characteristics (length, where the data is stored, etc.)
dataset = FaceLandmarksDataset(csv_file='data/faces/face_landmarks.csv',
                               root_dir='data/faces/',
                               transform=transform)

# Object that loads the data into RAM efficiently. It also
# builds the SGD mini-batches automatically.
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=4)



### Training ###

# Model to train. TwoLayerFC is assumed to be defined elsewhere.
model = TwoLayerFC(3 * 32 * 32, 4000, 10).to(device=device)

# Optimization parameters
learning_rate = 1e-2
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Start of training. The loop assumes each batch is an (x, y) pair of
# inputs and integer class labels.
epochs = 1
for e in range(epochs):
    for t, (x, y) in enumerate(dataloader):
        model.train()
        x = x.to(device=device, dtype=dtype)
        y = y.to(device=device, dtype=torch.long)

        scores = model(x)
        loss = F.cross_entropy(scores, y)

        # Zero out all of the gradients for the variables which the optimizer
        # will update.
        optimizer.zero_grad()

        # This is the backward pass: compute the gradient of the loss with
        # respect to each parameter of the model.
        loss.backward()

        # Actually update the parameters of the model using the gradients
        # computed by the backwards pass.
        optimizer.step()

        if t % print_every == 0:
            print('Iteration %d, loss = %.4f' % (t, loss.item()))
            # the CheckAccuracy() function is still missing here
            print()
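
The training loop above leaves accuracy checking out. One way such a helper might look is sketched below, assuming a classification setup where the loader yields ``(x, y)`` batches of inputs and integer labels. The name ``check_accuracy``, its signature, and the toy data are all assumptions for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def check_accuracy(loader, model, device=torch.device('cpu')):
    """Return the fraction of samples in `loader` that `model` classifies correctly."""
    num_correct = 0
    num_samples = 0
    model.eval()  # switch layers such as dropout to evaluation mode
    with torch.no_grad():  # no gradients are needed for evaluation
        for x, y in loader:
            x = x.to(device=device, dtype=torch.float32)
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            preds = scores.argmax(dim=1)
            num_correct += (preds == y).sum().item()
            num_samples += y.size(0)
    return num_correct / num_samples


# Toy check: an identity "model" on one-hot rows predicts classes 0, 1, 2,
# while the labels are 0, 1, 0, so two of the three predictions are correct.
toy_loader = DataLoader(TensorDataset(torch.eye(3), torch.tensor([0, 1, 0])),
                        batch_size=2)
toy_accuracy = check_accuracy(toy_loader, nn.Identity())
print('accuracy = %.2f' % toy_accuracy)
```

Since the helper calls ``model.eval()``, the training loop should call ``model.train()`` again before the next update, as the loop above already does at the top of each iteration.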