# Programming PyTorch for Deep Learning
### Ian Pointer (O'Reilly)
### Notes and tests

This book assumes a working CUDA installation. Let's hope for the best...

## Chapter 1. Getting started with PyTorch
### Tensors
A tensor is a container for numbers (like a vector or a matrix) that also comes with a set of rules defining transformations
between tensors that produce new tensors. Each tensor has a rank, which is its number of dimensions. PyTorch tensors
support fetching and changing elements using standard Python indexing.

In [1]:
import torch, torchvision, numpy, pandas
from torchvision import transforms
from torch.utils import data
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

x = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(x)
print(x[0][-1])  # fetch the last element of the first row
x[0][0] = 10  # change the value of the first element of the first row (in place)

# There are multiple functions that can create tensors, like torch.zeros(), torch.ones(), torch.rand()

tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])
tensor(3)
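
As a quick companion to the comment about creation functions, here is a minimal sketch of the three that are mentioned. The variable names are mine, not the book's.

```python
import torch

zeros = torch.zeros(2, 3)    # 2x3 tensor filled with 0.0
ones = torch.ones(2, 3)      # 2x3 tensor filled with 1.0
uniform = torch.rand(2, 3)   # 2x3 tensor of samples drawn uniformly from [0, 1)

print(zeros.shape)           # torch.Size([2, 3])
print(zeros.dim())           # rank (number of dimensions): 2
print(ones.sum().item())     # 6.0
print(uniform.dtype)         # torch.float32, the default floating point type
```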


#### Tensor operations
There are a lot.
For instance, we can find the maximum value in a tensor like this:

In [2]:
x = torch.rand(2, 2)
print(x)
print(x.max(), x.argmax(), x.max().item())

tensor([[0.2723, 0.4860],
        [0.3239, 0.3828]])
tensor(0.4860) tensor(1) 0.4860256314277649


There are multiple types of tensors, for instance ```LongTensors``` or ```FloatTensors```. We can convert back and forth 
using the ```.to()``` method.

In [3]:
long_tensor = torch.tensor([[0, 0, 1], [1, 1, 1], [0, 0, 0]])
print(long_tensor.type())
float_tensor = long_tensor.to(dtype=torch.float32)
print(float_tensor.type())

torch.LongTensor
torch.FloatTensor


Sometimes it is useful to use **in-place** operations, which save memory by modifying the tensor directly instead of
allocating a new one. In-place functions are suffixed with an underscore (```_```).

In [4]:
x = torch.rand(2, 2)
print(x)
x.log2()
print(x)
x.log2_()  # Only this in-place operation will change the original tensor (x)
print(x)

tensor([[0.7694, 0.5596],
        [0.7815, 0.8139]])
tensor([[0.7694, 0.5596],
        [0.7815, 0.8139]])
tensor([[-0.3782, -0.8375],
        [-0.3556, -0.2970]])


#### Reshaping
An important task is to reshape tensors. We can use ```Tensor.view()``` or ```Tensor.reshape()``` (the latter is also
available as ```torch.reshape()```). The result of ```view()``` always shares memory with the original tensor, so if the
underlying data is changed, the view changes too. ```reshape()``` returns such a view when it can and falls back to a copy
when it cannot, so whether its result shares data with the original depends on the memory layout. The other difference is
that ```view()``` requires the tensor to be **contiguous**, that is, its elements must be laid out in memory in the same
order they would have if a tensor of that shape were created from scratch. Operations such as transposing or permuting
typically break this. In that case, call ```.contiguous()``` on the tensor before ```.view()```.
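
A minimal sketch of the difference, using a made-up 6-element tensor: a view follows changes to the original data, ```view()``` refuses a non-contiguous tensor, and ```reshape()``` quietly copies instead.

```python
import torch

base = torch.arange(6)             # tensor([0, 1, 2, 3, 4, 5]), contiguous
as_view = base.view(2, 3)          # shares storage with base
base[0] = 100
print(as_view[0, 0].item())        # 100: the view sees the change

transposed = as_view.t()           # shape (3, 2), no longer contiguous
print(transposed.is_contiguous())  # False
try:
    transposed.view(6)             # view() cannot express this memory layout
except RuntimeError as err:
    print("view failed:", type(err).__name__)

flat_copy = transposed.reshape(6)            # reshape() falls back to a copy here
flat_view = transposed.contiguous().view(6)  # explicit copy first, then a view of that copy
print(torch.equal(flat_copy, flat_view))     # True
```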
#### Rearranging tensor dimensions
Another important operation is re-arranging the dimensions of a tensor. For instance, RGB image data is usually organized
as ```[height, width, channel]```, but PyTorch expects ```[channel, height, width]```. We can use the
```permute()``` method, supplying the new order of the dimensions.

In [5]:
x = torch.rand(640, 480, 3)
x_rearranged = x.permute(2, 0, 1)  # put the last dimension (RGB) as the first one.
print(x_rearranged.size())

torch.Size([3, 640, 480])


#### Tensor broadcasting
Tensor broadcasting allows us to perform operations between a tensor and a smaller tensor. Two tensors can be broadcast
if, comparing their dimensions pairwise starting from the trailing one, every pair satisfies one of these conditions:
* The two dimensions are equal
* One of the dimensions is 1
* One of the tensors has no dimension left to compare

The book doesn't really expand much on this, but my understanding is that, as long as the conditions above are respected,
PyTorch virtually stretches (repeats) the smaller tensor along its size-1 or missing dimensions until it matches the
larger one, and then operates element-wise.
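
To check my understanding of the rules above, here is a small sketch with made-up shapes: a per-channel scale applied to a batch of tiny images, plus a pair of shapes that cannot be broadcast.

```python
import torch

image_batch = torch.ones(4, 3, 2, 2)           # [batch, channel, height, width]
channel_scale = torch.tensor([1.0, 2.0, 3.0])  # one value per channel, shape [3]

# Reshaped to [3, 1, 1], the dimensions line up from the right as
# 1 vs 2, 1 vs 2, 3 vs 3, and the missing batch dimension is skipped.
scaled = image_batch * channel_scale.view(3, 1, 1)
print(scaled.shape)        # torch.Size([4, 3, 2, 2])
print(scaled[0, :, 0, 0])  # tensor([1., 2., 3.])

try:
    image_batch * channel_scale  # trailing dimensions 2 vs 3: neither equal nor 1
except RuntimeError:
    print("shapes [4, 3, 2, 2] and [3] are not broadcastable")
```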
## Chapter 2. Image Classification with PyTorch
Now we are going to incrementally build a simple neural network that classifies images as either
fish or cats. First of all, we need data. The ```download.py``` script, included in the book's GitHub repository, supposedly
downloads a subset of ImageNet data, already separated into 3 datasets (**training**, **test** and **validation**), and
within each of these datasets the images are already divided into fish and cat categories. Unfortunately, the script seems
to have failed for several images, and several others were not downloaded properly, so let's see how it goes. I guess it's
a really real-world example...
### PyTorch and Data Loaders
Formally, a PyTorch ```dataset``` is a Python class that allows us to get at the data we're supplying to the neural network.
A ```data loader``` is what actually feeds data from the dataset into the network, one batch at a time.
A dataset is a class that defines at least a ```.__getitem__(self, index)``` method and a ```.__len__(self)```
method. These two methods provide, respectively, a way of retrieving elements from the data as ```(tensor, label)``` pairs
and a way of obtaining the size of the dataset.
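
Here is a minimal sketch of that contract. ```SquaresDataset``` is a hypothetical toy class of mine (not from the book) that returns each item as an input tensor followed by its target, and a ```DataLoader``` then groups the items into batches.

```python
import torch
from torch.utils import data


class SquaresDataset(data.Dataset):
    """Hypothetical toy dataset: item i is the pair (tensor([i]), i squared)."""

    def __init__(self, size: int):
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, index: int):
        if index >= self.size:
            raise IndexError(index)
        sample = torch.tensor([float(index)])
        target = index ** 2
        return sample, target


toy_data = SquaresDataset(10)
toy_loader = data.DataLoader(toy_data, batch_size=4)
for samples, targets in toy_loader:
    # The loader stacks samples into one tensor and collects the targets into another
    print(samples.shape, targets)
```

With 10 items and a batch size of 4, the loader yields three batches, the last one holding the 2 leftover items.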
### Building a training dataset
The ```torchvision``` module provides several convenience functions to deal with image datasets, such as the
```ImageFolder``` class, which greatly simplifies dealing with image data, as long as the images are contained in a
directory structure where each directory is a label (e.g. ```./train/cat/```, ```./train/fish/``` etc).

For our purposes, this will be enough:

In [6]:
train_data_path = "./book_examples/train/"

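my_transforms = transforms.Compose([
    # Resize(64) only rescales the shorter side and keeps the aspect ratio, so images end up with different shapes.
    # Resize((64, 64)) forces 64x64, which batching and the 12288-input (3 * 64 * 64) Linear layer both rely on.
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])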
train_data = torchvision.datasets.ImageFolder(root=train_data_path, transform=my_transforms)

val_data_path = "./book_examples/val/"
val_data = torchvision.datasets.ImageFolder(root=val_data_path, transform=my_transforms)

test_data_path = "./book_examples/test/"
test_data = torchvision.datasets.ImageFolder(root=test_data_path, transform=my_transforms)

my_batch_size = 64
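# ImageFolder lists one class after the other, so shuffle the training data to mix classes within each batch
train_data_loader = data.DataLoader(train_data, batch_size=my_batch_size, shuffle=True)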
val_data_loader = data.DataLoader(val_data, batch_size=my_batch_size)
test_data_loader = data.DataLoader(test_data, batch_size=my_batch_size)

Now that we have set up the ```datasets``` (with their transforms) and the ```dataloaders```, it's time to create the
actual neural network!
### Creating a network

In [7]:
# Had to readjust several variables/imports that are not mentioned in the book examples. Cross-referencing with the official
# 60-min tutorial helps a lot. (Yeah, that's mentioned in the book errata webpage...)
# class SimpleNet(nn.Module):
#     def __init__(self):
#         super(SimpleNet, self).__init__()
#         self.fc1 = nn.Linear(12288, 84)
#         self.fc2 = nn.Linear(84, 50)
#         self.fc3 = nn.Linear(50, 2)
#     
#     def forward(self,x ):
#         x = x.view(-1, 12288)
#         x = F.relu(self.fc1(x))
#         x = F.relu(self.fc2(x))
#         x = F.softmax(self.fc3(x))
#         return x
# 
# simplenet = SimpleNet()

### Activation functions
Activation functions define how the output from one unit should be propagated to the post-synaptic units. A very commonly
used activation function is the **```ReLU```** function, short for **rectified linear unit**. It simply implements
```max(0, x)``` where ```x``` is the value being propagated. Another useful activation function is **```softmax```**,
which exponentiates each value and divides it by the sum of all the exponentials. This "exaggerates" differences between
values while keeping everything normalized so that the outputs add up to 1. This
is explained pretty poorly but I'm not going to open a new browser tab for every ill-explained paragraph in the book.
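
To make the two functions concrete, here is a small sketch on a made-up vector of scores. The manual softmax at the end is my own check, not something from the book.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([-1.0, 0.0, 1.0, 3.0])

print(F.relu(scores))             # negatives are clamped to 0: tensor([0., 0., 1., 3.])

probs = F.softmax(scores, dim=0)  # exp(score) / sum(exp(scores))
print(probs)                      # the largest score takes most of the probability mass
print(probs.sum().item())         # approximately 1.0

manual = torch.exp(scores) / torch.exp(scores).sum()
print(torch.allclose(probs, manual))  # True
```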

In the case of our ```simplenet``` implementation, we start by defining an initialization function. Its first instruction
calls the initialization function of the superclass (```nn.Module```), which we inherit from. Then, we define **three fully
connected layers**, called **Linear** in PyTorch. We then define the ```forward()``` method, which defines how data is
propagated through the network. We start by flattening each 3D input image tensor into a 1D tensor that is fed into the
first Linear layer. Then, we apply the layers and their respective activation functions in order, and finally we return
the ```softmax``` output to obtain the prediction from our network.

The numbers defining the layers (i.e. inputs to the layer, outputs of the layer) are arbitrary, with two exceptions: the
input size of the first layer must match the flattened image (12288 = 3 * 64 * 64), and the final Linear layer needs to
have 2 outputs, representing our two classes of stimuli. The idea is that we start with a high dimensionality in inputs,
and as we progress through the layers we operate with fewer and fewer units, with the hope that the network, forced to
work with progressively shrunk representations, will extract features available in the higher-level,
lower-dimensionality representations.
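
A quick sanity check of those sizes, as a sketch: I rebuild the same three layers with ```nn.Sequential``` (my shortcut, the notebook uses a class) and push a random batch of 64x64 RGB images through them.

```python
import torch
import torch.nn as nn

batch = torch.rand(8, 3, 64, 64)    # 8 random "images": 3 channels, 64x64 pixels
flat = batch.view(-1, 3 * 64 * 64)  # -1 lets PyTorch infer the batch size
print(flat.shape)                   # torch.Size([8, 12288])

layers = nn.Sequential(
    nn.Linear(12288, 84), nn.ReLU(),
    nn.Linear(84, 50), nn.ReLU(),
    nn.Linear(50, 2),
)
print(layers(flat).shape)           # torch.Size([8, 2]): one score per class for each image

n_params = sum(p.numel() for p in layers.parameters())
print(n_params)                     # 1036628 weights and biases, almost all in the first layer
```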

### Loss functions
Loss functions define ways of computing how far a prediction is from the ground truth, and as such are essential for
deep learning. PyTorch offers several different types of loss functions. For these examples, we are going to use a built-in
loss function called **```CrossEntropyLoss```**, suitable for multi-class classification tasks. Another common loss function
is **```MSELoss```**, implementing the standard mean squared error loss. An important point is that ```CrossEntropyLoss```
automatically applies ```softmax()``` (as a log-softmax) internally, so we need to account for this in our simplenet
implementation: ```forward()``` should return the raw output of the last layer.
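
The following sketch, on made-up logits, checks the point about the built-in softmax: ```CrossEntropyLoss``` on raw outputs matches log-softmax followed by the negative log-likelihood loss, while feeding it values that already went through softmax gives a different (here larger) loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5], [0.1, 1.5]])  # raw outputs for 2 samples, 2 classes
targets = torch.tensor([0, 1])                   # index of the correct class per sample

loss_from_logits = nn.CrossEntropyLoss()(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss_from_logits, manual))  # True

# Passing already-softmaxed values means softmax is applied twice, which distorts the loss
double_softmax = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), targets)
print(loss_from_logits.item(), double_softmax.item())
```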

Just realized that they don't provide an example implementation that could have helped in the part dealing with implementing
the actual learning loop. I'm going to try and just create an instance of a ```CrossEntropyLoss``` loss function.

In [8]:
my_loss_fn = nn.CrossEntropyLoss()
print(type(my_loss_fn))

<class 'torch.nn.modules.loss.CrossEntropyLoss'>


In [9]:
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(12288, 84)
        self.fc2 = nn.Linear(84, 50)
        self.fc3 = nn.Linear(50, 2)
    
    def forward(self,x ):
        x = x.view(-1, 12288)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

simplenet = SimpleNet()

Let's now focus on how the layers are updated during the network's training loop.
### Optimizing
Basically, we are talking about finding minima of the loss function over a very high-dimensional parameter space. There are
several algorithms and PyTorch offers built-in tools to ease this process. Some practical points to keep in mind:
* We should try **not to get trapped in local minima**, which can result from using exceedingly small learning rates.
* Using an exceedingly large learning rate might cause our optimization to **never converge on a suitable solution** for our
given weights, data and task.

To mitigate these problems, several optimization algorithms for neural networks have been developed. PyTorch includes
```SGD``` (stochastic gradient descent), ```Adagrad```, ```RMSprop``` and ```Adam``` (which is what we'll use in these
examples), among others. Adam is particularly popular for deep learning because it uses an independent learning rate per
parameter, adapting it based on running averages of that parameter's recent gradients.
We can create an Adam-based optimizer instance with the following code (assuming ```import torch.optim as optim```).

In [10]:
my_optim = optim.Adam(simplenet.parameters(), lr=0.001)
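
To get a feel for the learning rate points above, here is a toy sketch of my own (not from the book). It uses plain ```SGD``` instead of Adam so that the effect of ```lr``` is easy to read, and it minimises f(w) = w**2, whose minimum is at 0. The ```minimise``` helper is hypothetical.

```python
import torch
import torch.optim as optim


def minimise(lr: float, steps: int = 20) -> float:
    """Run plain SGD on f(w) = w**2 starting from w = 1 and return the final w."""
    w = torch.tensor([1.0], requires_grad=True)
    sgd = optim.SGD([w], lr=lr)
    for _ in range(steps):
        sgd.zero_grad()        # clear the gradient left over from the previous step
        loss = (w ** 2).sum()
        loss.backward()        # d(loss)/dw = 2 * w
        sgd.step()             # w <- w - lr * 2 * w
    return w.item()


print(minimise(lr=0.001))  # barely moves away from 1.0
print(minimise(lr=0.1))    # ends close to the minimum at 0
print(minimise(lr=1.5))    # overshoots on every step and diverges
```

This toy function has a single minimum, so it only shows slow progress and divergence, not the local minima case.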

### Training
We can now write a loop to perform the actual training on our network and data. The final implementation will look like
the following.

In [11]:
# # Doesn't work either.
# epochs = 2
# 
# for epoch in range(epochs):  # separate loop variable, so that epochs is not overwritten
#     # PyTorch operates on a batch-based logic
#     for batch in train_data_loader:
#         my_optim.zero_grad()  # always zero the initial gradients at the beginning of the inner iteration
#         curr_input, curr_target = batch  # Unpack the current batch, comprising an input and its ground truth in terms of target
#         curr_output = simplenet(curr_input)  # present our net's instance with the input and obtain its guess
#         curr_loss = my_loss_fn(curr_output, curr_target)  # using our specific instance of loss function (CrossEntropyLoss), calculate the distance from the correct answer
#         curr_loss.backward()  # Basically, compute how each weight of each unit in each layer has contributed to the distance of the net's current guess from the current correct answer
#         my_optim.step()  # Adjust the weights throughout our net accordingly
        

Ok, that doesn't work, let's try with the provided full implementation, which should work.

In [12]:
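def my_train(model: nn.Module, optimizer: torch.optim.Optimizer, loss_fn: nn.CrossEntropyLoss, train_loader: data.DataLoader,
             val_loader: data.DataLoader, epochs: int = 20, device: str = "cpu"):
    """
    Train a model and print the training loss, validation loss and validation accuracy after every epoch.

    :param model: network to train
    :type model: nn.Module
    :param optimizer: optimizer built on the parameters of ``model``
    :type optimizer: torch.optim.Optimizer
    :param loss_fn: loss function comparing the model's outputs with the targets
    :type loss_fn: nn.CrossEntropyLoss
    :param train_loader: loader providing the training batches
    :type train_loader: data.DataLoader
    :param val_loader: loader providing the validation batches
    :type val_loader: data.DataLoader
    :param epochs: number of passes over the training data
    :type epochs: int
    :param device: device on which the tensors are processed, e.g. "cpu" or "cuda"
    :type device: str
    """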
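    model.to(device)  # make sure the model lives on the same device as the batches
    for epoch in range(epochs):
        curr_epoch_train_loss = 0.0
        curr_epoch_valid_loss = 0.0
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()  # reset the gradients accumulated during the previous step
            curr_in, curr_tgt = batch
            curr_in = curr_in.to(device)
            curr_tgt = curr_tgt.to(device)
            curr_out = model(curr_in)
            curr_loss = loss_fn(curr_out, curr_tgt)
            curr_loss.backward()  # compute the gradient of the loss with respect to every weight
            optimizer.step()  # update the weights
            curr_epoch_train_loss += curr_loss.item()
        curr_epoch_train_loss /= len(train_loader)  # average loss per batch

        model.eval()
        num_correct = 0
        num_tot = 0
        with torch.no_grad():  # no gradients are needed for validation
            for batch in val_loader:
                curr_in, curr_tgt = batch
                curr_in = curr_in.to(device)
                curr_tgt = curr_tgt.to(device)
                curr_out = model(curr_in)
                curr_loss = loss_fn(curr_out, curr_tgt)
                curr_epoch_valid_loss += curr_loss.item()
                # dim=1 is the class dimension; torch.max(...)[1] holds the index of the most likely class
                correct = torch.eq(torch.max(F.softmax(curr_out, dim=1), dim=1)[1], curr_tgt.view(-1))
                num_correct += torch.sum(correct).item()
                num_tot += correct.shape[0]  # accumulate over batches instead of keeping only the last batch size
        curr_epoch_valid_loss /= len(val_loader)

        # Four placeholders, four values; the original used an invalid "{.2f}" and passed only three values
        print("Epoch: {}, Training Loss: {:.2f}, Validation Loss: {:.2f}, accuracy: {:.2f}".format(
            epoch, curr_epoch_train_loss, curr_epoch_valid_loss, num_correct / num_tot))

# my_train(simplenet, my_optim, my_loss_fn, train_data_loader, val_data_loader, device="cpu")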
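
A side note on the accuracy computation in the validation loop, as a sketch on made-up outputs: softmax does not change which class has the highest score, so the predicted class can be read straight from the raw outputs with ```argmax```.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.2, 1.3], [2.0, -1.0], [0.1, 0.4]])  # made-up outputs for 3 samples
targets = torch.tensor([1, 0, 0])

pred_from_logits = logits.argmax(dim=1)
pred_from_probs = F.softmax(logits, dim=1).argmax(dim=1)
print(torch.equal(pred_from_logits, pred_from_probs))  # True: softmax preserves the ranking

num_correct = (pred_from_logits == targets).sum().item()
accuracy = num_correct / targets.shape[0]
print(num_correct, accuracy)  # 2 correct out of 3
```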

Ok, none of the above worked. I'll try to proceed with Chapter 3, hoping that accuracy increases since I couldn't figure
out how to fix the broken examples.
## Chapter 3. Convolutional Neural Networks
Let's have a look at our first CNN (hopefully)

In [13]:
# # The following doesn't work either, hence why it's commented out.
# class CNNNet(nn.Module):
#     def __init__(self, num_classes: int = 2):
#         super(CNNNet, self).__init__()
#         self.features = nn.Sequential(
#             nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
#             nn.ReLU(),
#             nn.MaxPool2d(kernel_size=3, stride=2),
#             nn.Conv2d(64, 192, kernel_size=3, padding=2),
#             nn.ReLU(),
#             nn.MaxPool2d(kernel_size=3, padding=1),
#             nn.Conv2d(192, 384, kernel_size=3, padding=1),
#             nn.ReLU(),
#             nn.Conv2d(384, 256, kernel_size=3, padding=1),
#             nn.ReLU(),
#             nn.Conv2d(256, 256, kernel_size=3, padding=1),
#             nn.ReLU(),
#             nn.MaxPool2d(kernel_size=3, stride=2)
#         )
#         self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
#         self.classifier = nn.Sequential(
#             nn.Dropout(),
#             nn.Linear(256 * 6 * 6, 4096),
#             nn.ReLU(),
#             nn.Dropout(),
#             nn.Linear(4096, 4096),
#             nn.ReLU(),
#             nn.Linear(4096, num_classes)
#         )
#         
#     def forward(self, x):
#         x = self.features(x)
#         x = self.avgpool(x)
#         x = torch.flatten(x, 1)
#         x = self.classifier(x)
#         return x
#     
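# my_cnn = CNNNet()
# cnn_optim = optim.Adam(my_cnn.parameters(), lr=0.001)  # my_optim only tracks simplenet's parameters, so my_cnn needs its own optimizer
# my_train(my_cnn, loss_fn=my_loss_fn, optimizer=cnn_optim, train_loader=train_data_loader, val_loader=val_data_loader, epochs=20)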
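
When debugging a CNN like this one, it helps me to check the spatial size after each layer. Below is a sketch for the first convolution and pooling layers on a 64x64 input. The ```conv_out``` helper is my own (hypothetical) and follows the output-size formula given in the PyTorch documentation for ```Conv2d``` and ```MaxPool2d``` with dilation 1.

```python
import torch
import torch.nn as nn


def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial dimension of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


x = torch.rand(1, 3, 64, 64)  # one random 64x64 RGB "image"
conv = nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2)
pool = nn.MaxPool2d(kernel_size=3, stride=2)

after_conv = conv(x)
after_pool = pool(after_conv)
print(after_conv.shape, conv_out(64, 11, stride=4, padding=2))  # spatial size 15
print(after_pool.shape, conv_out(15, 3, stride=2))              # spatial size 7

# AdaptiveAvgPool2d produces the requested output size whatever the input size is
adaptive = nn.AdaptiveAvgPool2d((6, 6))
print(adaptive(after_pool).shape)  # torch.Size([1, 64, 6, 6])
```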