# Deep Learning with PyTorch

In [1]:
import torch
import numpy as np

In [2]:
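# Allocates a 3x2 tensor without initializing its memory; the contents are arbitrary (here they happen to be zeros).
a = torch.FloatTensor(3,2)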
a

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

In [3]:
a.zero_()

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

In-place operations have an underscore appended to their name. Operations without an underscore are the functional equivalent and create a copy of the tensor with the performed modification, leaving the original tensor untouched.
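To make the distinction concrete, here is a minimal sketch comparing `add` with its in-place counterpart `add_`. The tensor values are arbitrary and chosen only for illustration.

```python
import torch

a = torch.ones(3, 2)

# Functional version: returns a new tensor and leaves `a` untouched.
b = a.add(1)
a_unchanged = torch.equal(a, torch.ones(3, 2))

# In-place version: modifies the memory of `a` directly and returns it.
c = a.add_(1)
shares_memory = c.data_ptr() == a.data_ptr()

print(a_unchanged)    # True
print(shares_memory)  # True
print(a)
```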

You can also create tensors from Python iterables.

In [4]:
torch.FloatTensor([[1,2,3], [3,2,1]])

tensor([[1., 2., 3.],
        [3., 2., 1.]])

In [5]:
n = np.zeros(shape=(3,2))
b = torch.tensor(n)
b

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]], dtype=torch.float64)

Notice that NumPy creates a 64-bit float array by default, and `torch.tensor` preserves that dtype. Deep learning rarely needs double precision, and it costs extra memory and compute. Common practice is to use the 32-bit float type, or even the 16-bit float type, which is usually sufficient.

In [6]:
n = np.zeros(shape=(3,2), dtype=np.float32)
torch.tensor(n)

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

In [7]:
n = np.zeros(shape=(3,2))
torch.tensor(n, dtype=torch.float32)

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

Note that the `dtype` argument of `torch.tensor` expects a PyTorch dtype such as `torch.float32`, not a NumPy dtype.
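The memory cost of each precision can be inspected directly. The sketch below is an illustration using `element_size()`, which reports the number of bytes per element; the variable names are only for this example.

```python
import numpy as np
import torch

n = np.zeros(shape=(3, 2))                  # NumPy default dtype: float64
t64 = torch.tensor(n)                       # dtype is preserved
t32 = torch.tensor(n, dtype=torch.float32)  # explicit PyTorch dtype
t16 = t32.to(torch.float16)                 # half precision

# element_size() reports the number of bytes used per element.
sizes = [t.element_size() for t in (t64, t32, t16)]
print(sizes)  # [8, 4, 2]

# torch.float is an alias for torch.float32.
print(torch.float == torch.float32)  # True
```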

In [8]:
a = torch.tensor([1,2,3])
a

tensor([1, 2, 3])

In [9]:
# 0-dimensional tensor.
s = a.sum()
s

tensor(6)

In [10]:
# Access the actual Python value of a 0 dimensional tensor using .item().
s.item()

6

In [11]:
torch.tensor(1)

tensor(1)

#### GPU Tensors

To move a tensor from CPU to GPU, use the tensor method `to(device)`, which returns a copy of the tensor on the specified device (CPU or GPU).

In [12]:
a = torch.FloatTensor([2,3])
a

tensor([2., 3.])

In [13]:
ca = a.to('cuda')
ca

tensor([2., 3.], device='cuda:0')

In [14]:
a + 1

tensor([3., 4.])

In [15]:
ca + 1

tensor([3., 4.], device='cuda:0')
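The cells above assume that a CUDA device is present. A common pattern, sketched here as an assumption rather than something the original cells do, is to pick the device at run time with `torch.cuda.is_available()` so the same code also runs on a CPU-only machine.

```python
import torch

# Fall back to the CPU when no CUDA device is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.FloatTensor([2, 3])
ca = a.to(device)         # copy on the selected device
result = (ca + 1).cpu()   # bring the result back to the CPU

print(ca.device.type)
print(result)
```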

#### Gradients

Even with transparent GPU support, all of this dancing with tensors isn't worth bothering with without one "killer feature": the automatic computation of gradients. There are two approaches to how gradients are calculated.

* **Static graph**: In this method, you need to define your calculations in advance and it won't be possible to change them later. The graph will be processed and optimized by the DL library before any computation is made.

* **Dynamic graph**: You don't need to define your graph in advance exactly as it will be executed. You just execute the operations that you want to use for data transformation on your actual data. While you do this, the library records the order of the operations performed, and when you ask it to calculate gradients, it unrolls its history of operations, accumulating the gradients of the network parameters.
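PyTorch follows the dynamic approach. As a small illustration, the sketch below uses an ordinary Python loop whose length is decided at call time, so each call records a different graph. The function `f` and its `n_steps` argument are hypothetical names for this example.

```python
import torch

def f(x, n_steps):
    # The number of multiplications is decided at run time,
    # so the recorded graph differs from call to call.
    y = x
    for _ in range(n_steps):
        y = y * x
    return y.sum()

x = torch.tensor([2.0], requires_grad=True)

f(x, 2).backward()            # y = x^3, so dy/dx = 3 * x^2
grad_cubic = x.grad.item()
x.grad.zero_()                # clear the accumulated gradient

f(x, 1).backward()            # y = x^2, so dy/dx = 2 * x
grad_square = x.grad.item()

print(grad_cubic, grad_square)  # 12.0 4.0
```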

#### Tensors and Gradients

PyTorch tensors have a built-in gradient calculation and tracking system, so all you need to do is convert the data into tensors and perform computations using the tensor methods and functions provided by `torch`. There are several attributes related to gradients that every tensor has:

* `grad`: A property that holds a tensor of the same shape containing computed gradients.

* `is_leaf`: True if this tensor was constructed by the user and False if the object is a result of function transformation.

* `requires_grad`: True if this tensor requires gradients to be calculated. 

In [16]:
v1 = torch.tensor([1.0, 1.0], requires_grad=True)
v2 = torch.tensor([2.0, 2.0])

In [17]:
v_sum = v1 + v2
v_res = (v_sum*2).sum()
v_res

tensor(12., grad_fn=<SumBackward0>)

So now we have added both vectors element-wise, doubled every element, and summed the result. The result is a 0-dimensional tensor with the value 12.

In [18]:
v1.is_leaf, v2.is_leaf

(True, True)

In [19]:
v_sum.is_leaf, v_res.is_leaf

(False, False)

In [20]:
v1.requires_grad, v2.requires_grad

(True, False)

In [21]:
v_sum.requires_grad, v_res.requires_grad

(True, True)

Now, let's tell PyTorch to calculate the gradients of our graph.

In [22]:
v_res.backward()
v1.grad

tensor([2., 2.])

By calling the `backward` function, we asked PyTorch to calculate the derivative of `v_res` with respect to every tensor in the graph that requires gradients. In other words, what influence do small changes to those tensors have on `v_res`? In our particular example, the value of two in the gradients of `v1` means that increasing any element of `v1` by one makes the resulting value of `v_res` grow by two.
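The slope of two can be checked independently with a finite difference. The sketch below repeats the computation in 64-bit floats (an assumption made here to keep rounding error small), nudges the first element of `v1` by a small `eps`, and compares the measured slope with the value stored in `v1.grad`. The helper `compute` is a hypothetical name.

```python
import torch

def compute(v1, v2):
    # Same computation as above: add, double, sum.
    return ((v1 + v2) * 2).sum()

v1 = torch.tensor([1.0, 1.0], dtype=torch.float64, requires_grad=True)
v2 = torch.tensor([2.0, 2.0], dtype=torch.float64)

compute(v1, v2).backward()
autograd_value = v1.grad[0].item()

eps = 1e-3
with torch.no_grad():
    bumped = v1 + torch.tensor([eps, 0.0], dtype=torch.float64)
    slope = ((compute(bumped, v2) - compute(v1, v2)) / eps).item()

print(autograd_value)  # 2.0
print(slope)           # approximately 2.0
```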

As mentioned, PyTorch calculates gradients only for leaf tensors with `requires_grad=True`. Indeed, if we check the gradients of `v2`, we get nothing (`v2.grad` is `None`, so the cell shows no output):

In [23]:
v2.grad

#### Neural Network Building Blocks

In [24]:
import torch.nn as nn
l = nn.Linear(2, 5)
v = torch.FloatTensor([1, 2])
l(v)

tensor([-0.1624,  1.1437, -0.9851, -1.0531,  0.4556], grad_fn=<AddBackward0>)

Here, we created a randomly initialized feed-forward layer with two inputs and five outputs, and applied it to our float tensor. All classes in the `torch.nn` package inherit from the `nn.Module` base class, which you can use to implement your own higher-level NN blocks. You will see how you can do this in the next section, but, for now, let's look at useful methods that all `nn.Module` children provide. They are as follows:

* `parameters()`: This function returns an iterator of all variables that require gradient computation (that is, module weights).
* `zero_grad()`: This function resets the gradients of all parameters (to zero, or to `None` in recent PyTorch versions).
* `to(device)`: This function moves all module parameters to a given device (CPU or GPU).
* `state_dict()`: This function returns the dictionary with all module parameters and is useful for model serialization.
* `load_state_dict()`: This function initializes the module with the state dictionary.
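A short sketch of these methods on a single `nn.Linear` layer follows. The second layer `l2` is introduced here only to show that loading a state dictionary reproduces the first layer's behavior.

```python
import torch
import torch.nn as nn

l = nn.Linear(2, 5)

# parameters() yields the weight matrix and the bias vector.
shapes = [tuple(p.shape) for p in l.parameters()]
print(shapes)  # [(5, 2), (5,)]

# state_dict() maps parameter names to tensors.
state = l.state_dict()
keys = list(state.keys())
print(keys)  # ['weight', 'bias']

# A freshly initialized layer gives the same output after loading the state.
l2 = nn.Linear(2, 5)
l2.load_state_dict(state)
v = torch.FloatTensor([1, 2])
same_output = torch.equal(l(v), l2(v))
print(same_output)  # True
```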

`Sequential` is a convenient class that allows you to combine other layers into a pipeline.

In [25]:
s = nn.Sequential(
    nn.Linear(2, 5),
    nn.ReLU(),
    nn.Linear(5, 20),
    nn.ReLU(),
    nn.Linear(20, 10),
    nn.Dropout(p=0.3),
    nn.Softmax(dim=1)
)

s

Sequential(
  (0): Linear(in_features=2, out_features=5, bias=True)
  (1): ReLU()
  (2): Linear(in_features=5, out_features=20, bias=True)
  (3): ReLU()
  (4): Linear(in_features=20, out_features=10, bias=True)
  (5): Dropout(p=0.3, inplace=False)
  (6): Softmax(dim=1)
)

In [26]:
s(torch.FloatTensor([[1, 2]]))

tensor([[0.0939, 0.1859, 0.0614, 0.0738, 0.1017, 0.0939, 0.1137, 0.0899, 0.0963,
         0.0896]], grad_fn=<SoftmaxBackward>)

#### Custom Layers

By subclassing the `nn.Module` class, you can create your own building blocks, which can be stacked together, reused later, and integrated into the PyTorch framework seamlessly. At its core, `nn.Module` provides quite rich functionality to its children:

* It tracks all submodules that the current module includes. For example, your building blocks can have two feed-forward layers used somehow to perform the block's transformation.

* It provides functions to deal with all parameters of the registered submodules. You can obtain a full list of the module's parameters (`parameters()` method), zero its gradients (`zero_grad()` method), serialize and deserialize the module (`state_dict()` and `load_state_dict()`), and even perform generic transformations using your own callable (`apply()` method).

* It establishes the convention of `Module` application to data. Every module needs to perform its data transformation in the `forward()` method by overriding it. 

* There are some more functions, such as the ability to register a hook function to tweak the module's transformation or the flow of gradients, but they are more for advanced use cases.

To make our life simpler, when following the preceding convention, the PyTorch authors simplified the creation of modules through careful design and a good dose of Python magic. So, to create a custom module, we usually have to do only two things: register submodules and implement the `forward()` method. Let's look at how this can be done for our `Sequential` example from the previous section, but in a more generic and reusable way.

In [27]:
class OurModule(nn.Module):
    def __init__(self, num_inputs, num_classes, dropout_prob=0.3):
        super(OurModule, self).__init__()
        self.pipe = nn.Sequential(
            nn.Linear(num_inputs, 5),
            nn.ReLU(),
            nn.Linear(5, 20),
            nn.ReLU(),
            nn.Linear(20, num_classes),
            nn.Dropout(p=dropout_prob),
            nn.Softmax(dim=1)
        )
        
    def forward(self, x):
        return self.pipe(x)

This is our module class that inherits from `nn.Module`. In the constructor, we pass three parameters: the input size, the output size, and the optional dropout probability. The first thing we need to do is call the parent's constructor to let it initialize itself.

In [28]:
def forward(self, x):
    return self.pipe(x)

Here, we must override the `forward` function with our implementation of the data transformation. As our module is a very simple wrapper around other layers, we just need to ask them to transform the data. 

In [29]:
net = OurModule(num_inputs=2, num_classes=3)
v = torch.FloatTensor([[2, 3]])
out = net(v)
print(net)
print(out)

OurModule(
  (pipe): Sequential(
    (0): Linear(in_features=2, out_features=5, bias=True)
    (1): ReLU()
    (2): Linear(in_features=5, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
    (5): Dropout(p=0.3, inplace=False)
    (6): Softmax(dim=1)
  )
)
tensor([[0.2745, 0.2745, 0.4509]], grad_fn=<SoftmaxBackward>)
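Submodules do not have to be wrapped in `Sequential` to be tracked. As a sketch of the two-feed-forward-layer block mentioned earlier, the hypothetical `TwoLayerBlock` below assigns its layers as plain attributes, and `named_parameters()` shows that `nn.Module` registered them automatically. The class name and layer sizes are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

class TwoLayerBlock(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, num_inputs, num_hidden, num_outputs):
        super().__init__()
        # Assigning a module to an attribute registers it as a submodule.
        self.fc1 = nn.Linear(num_inputs, num_hidden)
        self.fc2 = nn.Linear(num_hidden, num_outputs)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

block = TwoLayerBlock(num_inputs=2, num_hidden=5, num_outputs=3)
names = [name for name, _ in block.named_parameters()]
print(names)  # ['fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias']

out = block(torch.FloatTensor([[2, 3]]))
print(out.shape)  # torch.Size([1, 3])
```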


#### The final glue - loss functions and optimizers

#### Loss Functions

Loss functions reside in the `nn` package and are implemented as `nn.Module` subclasses. Usually, they accept two arguments: the output from the network and the desired output. The most commonly used standard loss functions are:

* `nn.MSELoss`: The mean squared error between the arguments, which is the standard loss for regression problems.
* `nn.BCELoss` and `nn.BCEWithLogitsLoss`: Binary cross-entropy loss. The first version expects a single probability value (usually the output of the `Sigmoid` layer), while the second version assumes raw scores as input and applies `Sigmoid` itself. The second is usually more numerically stable and efficient.
* `nn.CrossEntropyLoss` and `nn.NLLLoss`: Famous maximum likelihood criteria that are used in multi-class classification problems. The first version expects raw scores for each class and applies `LogSoftmax` internally, while the second expects log probabilities as the input.
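The relationship between the paired losses can be verified numerically. In the sketch below, the scores, labels, and targets are made-up values used only to show that each pair produces the same result when the activation is applied in the right place.

```python
import torch
import torch.nn as nn

# Multi-class case: raw scores for 2 samples and 3 classes.
logits = torch.tensor([[1.0, 2.0, 0.5], [0.1, 0.2, 3.0]])
target = torch.tensor([1, 2])

ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
print(torch.allclose(ce, nll))  # True

# Binary case: raw scores and 0/1 labels.
scores = torch.tensor([0.3, -1.2, 2.0])
labels = torch.tensor([1.0, 0.0, 1.0])

bce_logits = nn.BCEWithLogitsLoss()(scores, labels)
bce = nn.BCELoss()(torch.sigmoid(scores), labels)
print(torch.allclose(bce_logits, bce, atol=1e-6))  # True
```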

#### Optimizers

The responsibility of the basic optimizer is to take the gradients of model parameters and change these parameters in order to decrease the loss value. In the `torch.optim` package, PyTorch provides lots of popular optimizer implementations, and the most widely known are as follows:

* SGD: A vanilla stochastic gradient descent algorithm with an optional momentum extension.
* RMSprop
* Adagrad
* Adam

On construction, you need to pass an iterable of tensors, which will be modified during the optimization process. The usual practice is to pass the result of the `parameters()` call of the upper-level `nn.Module` instance, which returns an iterable of all leaf tensors with gradients. Now, let's discuss the common blueprint of a training loop.

In [None]:
for batch_x, batch_y in iterate_batches(data, batch_size=32):
    batch_x_t = torch.tensor(batch_x)
    batch_y_t = torch.tensor(batch_y)
    out_t = net(batch_x_t)
    loss_t = loss_function(out_t, batch_y_t)
    loss_t.backward()
    optimizer.step()
    optimizer.zero_grad()
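The blueprint above leaves `iterate_batches`, `net`, `loss_function`, and `optimizer` undefined. The following is a minimal runnable sketch under stated assumptions: a toy regression task (`y = 2x + 1`), a single `nn.Linear` layer, `nn.MSELoss`, plain SGD, and a simple slicing version of `iterate_batches`. None of these choices come from the original text.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy data (an assumption for this sketch): y = 2x + 1.
data_x = torch.linspace(-1, 1, 64).unsqueeze(1)
data_y = 2.0 * data_x + 1.0

def iterate_batches(xs, ys, batch_size=32):
    # Hypothetical helper: yields consecutive slices of the data.
    for i in range(0, len(xs), batch_size):
        yield xs[i:i + batch_size], ys[i:i + batch_size]

net = nn.Linear(1, 1)
loss_function = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.1)

with torch.no_grad():
    first_loss = loss_function(net(data_x), data_y).item()

for epoch in range(200):
    for batch_x_t, batch_y_t in iterate_batches(data_x, data_y):
        optimizer.zero_grad()                     # clear old gradients
        out_t = net(batch_x_t)                    # forward pass
        loss_t = loss_function(out_t, batch_y_t)  # compute the loss
        loss_t.backward()                         # compute gradients
        optimizer.step()                          # update parameters

with torch.no_grad():
    last_loss = loss_function(net(data_x), data_y).item()

print(first_loss, last_loss)
```

Calling `zero_grad()` at the start of each iteration rather than at the end is equivalent here; what matters is that gradients are cleared between consecutive `backward()` calls, because PyTorch accumulates them.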

#### Monitoring with TensorBoard

DL practitioners have developed a list of things that you should observe during your training, which usually includes the following:

* Loss values, which normally consist of several components like base loss and regularization losses. You should monitor both the total loss and the individual components over time.
* Results of validation on training and test datasets.
* Statistics about gradients and weights.
* Values produced by the network. For example, if you are solving a classification problem, you definitely want to measure the entropy of predicted class probabilities. In the case of a regression problem, raw predicted values can give tons of data about the training.
* Learning rates and other hyperparameters, if they are adjusted over time.


#### TensorBoard 101

From the architecture point of view, TensorBoard is a Python web service that you can start on your computer, passing it the directory where your training process will save values to be analyzed. Then, you point your browser to TensorBoard's port (usually `6006`), and it shows you an interactive web interface with values updated in real time. It's nice and convenient, especially when your training is performed on a remote machine somewhere in the cloud.

**Plotting Stuff**

To give you an impression of how simple tensorboardX is, let's consider a small example that is not related to NNs, but is just about writing stuff into TensorBoard.

In [31]:
import math
from tensorboardX import SummaryWriter

writer = SummaryWriter()
funcs = {'sin': math.sin, 'cos': math.cos, 'tan': math.tan}

for angle in range(-360, 360):
    angle_rad = angle * math.pi / 180
    for name, func in funcs.items():
        val = func(angle_rad)
        writer.add_scalar(name, val, angle)
writer.close()

**TO DO**: Tensorboardx is not currently starting on my Windows machine.

#### Example – GAN on Atari Images

In [32]:
import cv2
import gym
import numpy as np

# Side length of the square image fed to the networks (64x64).
IMAGE_SIZE = 64

class InputWrapper(gym.ObservationWrapper):
    def __init__(self, *args):
        super(InputWrapper, self).__init__(*args)
        assert isinstance(self.observation_space, gym.spaces.Box)
        old_space = self.observation_space
        # Apply the same transformation to the bounds of the observation space.
        self.observation_space = gym.spaces.Box(
            self.observation(old_space.low),
            self.observation(old_space.high),
            dtype=np.float32)

    def observation(self, observation):
        # Resize the (210, 160, 3) frame to (IMAGE_SIZE, IMAGE_SIZE, 3).
        new_obs = cv2.resize(
            observation, (IMAGE_SIZE, IMAGE_SIZE))
        # Move the channel axis first: (H, W, 3) -> (3, H, W)
        new_obs = np.moveaxis(new_obs, 2, 0)
        return new_obs.astype(np.float32)

This class is a wrapper around a Gym game, which includes several transformations.

* Resize the input image from 210x160 (standard Atari resolution) to a 64x64 square.
* Move the color plane of the image from the last position to the first, to meet the PyTorch convention of convolution layers, which take an input tensor with the shape (channels, height, width).
* Cast the image from `bytes` to `float`.
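The last two steps can be tried in isolation with NumPy alone. The sketch below uses a random array as a stand-in for a frame that has already been resized to 64x64; the variable names are assumptions for this example.

```python
import numpy as np

# Stand-in for a resized frame: (height, width, color channels), byte values.
frame_hwc = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Move the color axis to the front and cast to 32-bit floats.
frame_chw = np.moveaxis(frame_hwc, 2, 0).astype(np.float32)

print(frame_chw.shape, frame_chw.dtype)  # (3, 64, 64) float32
```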

The code for the generator and discriminator is included in the book repo.

In [None]:
import random

def iterate_batches(envs, batch_size=BATCH_SIZE):
    batch = [e.reset() for e in envs]
    # Endless iterator that picks a random environment on every step.
    env_gen = iter(lambda: random.choice(envs), None)

    while True:
        e = next(env_gen)
        obs, reward, is_done, _ = e.step(e.action_space.sample())
        # Skip (almost) empty frames.
        if np.mean(obs) > 0.01:
            batch.append(obs)
        if len(batch) == batch_size:
            # Normalize pixel values from [0, 255] to [-1, 1].
            batch_np = np.array(batch, dtype=np.float32) * 2.0 / 255.0 - 1.0
            yield torch.tensor(batch_np)
            batch.clear()
        if is_done:
            e.reset()

This generator samples environments from the provided list indefinitely, issues random actions, and stores observations in the `batch` list. When the batch reaches the required size, we normalize the images, convert them to a tensor, and yield it from the generator. The check for a non-zero observation mean works around a bug in one of the games that causes the image to flicker.
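The normalization step maps byte values in the range 0 to 255 onto the range -1 to 1. A small standalone check, using three sample pixel values chosen for illustration, is sketched below.

```python
import numpy as np

pixels = np.array([0.0, 127.5, 255.0], dtype=np.float32)

# Scale to [0, 2] first, then shift to [-1, 1].
normalized = pixels * 2.0 / 255.0 - 1.0
print(normalized.tolist())  # [-1.0, 0.0, 1.0]

# Operator precedence matters: this is a single constant close to -0.992,
# so multiplying pixels by it would not normalize them.
wrong_scale = 2.0 / 255.0 - 1.0
print(wrong_scale)
```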

Now, let's look at our main function, which prepares models and runs the training loop.

In [None]:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--cuda', default=False, action='store_true', help='Enable cuda computation')
args = parser.parse_args()

device = torch.device('cuda' if args.cuda else 'cpu')
envs = [
    InputWrapper(gym.make(name))
    for name in ('Breakout-v0', 'AirRaid-v0', 'Pong-v0')
]
input_shape = envs[0].observation_space.shape

Here, we process the command-line arguments and create our environment pool, with a wrapper applied. This environment array will be passed to the `iterate_batches` function to generate training data.

In [None]:
import torch.optim as optim

net_discr = Discriminator(input_shape=input_shape).to(device)
net_gener = Generator(output_shape=input_shape).to(device)

# Binary cross-entropy: the discriminator outputs a probability.
objective = nn.BCELoss()

gen_optimizer = optim.Adam(
    params=net_gener.parameters(), lr=LEARNING_RATE, betas=(0.5, 0.999)
)

dis_optimizer = optim.Adam(
    params=net_discr.parameters(), lr=LEARNING_RATE, betas=(0.5, 0.999)
)

writer = SummaryWriter()

#### PyTorch Ignite

**Ignite Concepts**

At a high level, Ignite simplifies the writing of the training loop in PyTorch DL. Earlier in the chapter we saw that the minimal training loop consists of:

* Sampling a batch of training data.
* Applying a NN to this batch to calculate the loss function - the single value we want to minimize. 
* Running backpropagation of the loss to get gradients of the loss with respect to the network's parameters.
* Asking the optimizer to apply the gradients to the network.
* Repeating until we are happy or bored of waiting.

The central piece of Ignite is the `Engine` class, which loops over the data source, applying the processing function to each data batch. In addition to that, Ignite lets you register functions to be called at specific points of the training loop. Those points are called `Events` and could be at the:

* Beginning/end of the whole training process.
* Beginning/end of a training epoch.
* Beginning/end of a single batch processing.

In addition to that, custom events exist and allow you to specify your function to be called every `N` events, for example, if you want to do some calculations every 100 batches or every second epoch. 

A very simplistic example of Ignite is shown in the following code block.

In [None]:
from ignite.engine import Engine, Events

def training(engine, batch):
    optimizer.zero_grad()
    x, y = prepare_batch(batch)  # prepare_batch is a placeholder that unpacks the batch
    y_out = model(x)
    loss = loss_fn(y_out, y)
    loss.backward()
    optimizer.step()
    return loss.item()

engine = Engine(training)
engine.run(data)