# Optimizing training and inference

In this notebook, we will discuss different ways to reduce memory and compute usage during training and inference.

## Prepare training script (1 point)

When training large models, it is usually best practice not to use Jupyter notebooks but to run a **separate script** for training, which can have command-line flags for various hyperparameters and training modes. This is especially useful when you need to run multiple experiments simultaneously (e.g. on a cluster with a task scheduler). Another advantage is that after training, the process finishes and frees the resources for other users of a shared GPU.

In this part, put all the code that you wrote for the previous task to train a model on Tiny ImageNet into `train.py`.

You can then run your script from inside of this notebook like this:

In [None]:
!python3 train.py --flag --some_parameter <its value>

**Task** 

Write code for training with the architecture from homework_part2.

**Requirements**
* Optional command-line arguments, such as batch size and number of epochs, parsed with the built-in `argparse`
* Modular structure: separate functions for creating the data generator, building the model, and training
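
One possible skeleton for such a script is sketched below. The flag names, default values, and function names are illustrative assumptions, not part of the assignment; the three placeholder functions are where your own code would go.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line hyperparameters (flag names are illustrative)."""
    parser = argparse.ArgumentParser(description="Train a model on Tiny ImageNet")
    parser.add_argument("--batch_size", type=int, default=64, help="mini-batch size")
    parser.add_argument("--epochs", type=int, default=10, help="number of epochs")
    parser.add_argument("--lr", type=float, default=1e-3, help="learning rate")
    return parser.parse_args(argv)


def create_dataloaders(batch_size):
    """Placeholder: build the train/validation data generators here."""
    raise NotImplementedError


def build_model():
    """Placeholder: build the architecture from homework_part2 here."""
    raise NotImplementedError


def train(model, loaders, epochs, lr):
    """Placeholder: run the training loop here."""
    raise NotImplementedError


def main(argv=None):
    """Entry point: in train.py, call this under `if __name__ == "__main__":`."""
    args = parse_args(argv)
    loaders = create_dataloaders(args.batch_size)
    model = build_model()
    train(model, loaders, args.epochs, args.lr)


# Parsing an explicit argument list, as if the flags came from the shell.
args = parse_args(["--batch_size", "128", "--epochs", "5"])
print(args.batch_size, args.epochs, args.lr)
```

Passing `argv` explicitly keeps the parser easy to test from a notebook; when `argv` is `None`, `argparse` falls back to the real command line.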


## Profiling time (1 point)

For the next tasks, you need to add measurements to your training loop. You can use [`perf_counter`](https://docs.python.org/3/library/time.html#time.perf_counter) for that:

In [None]:
import time
import numpy as np
import torch

In [None]:
x = np.random.randn(1000, 1000)
y = np.random.randn(1000, 1000)

start_counter = time.perf_counter()
z = x @ y
elapsed_time = time.perf_counter() - start_counter
print("Matrix multiplication took {:.3f} seconds".format(elapsed_time))

**Task**. You need to add the following measurements to your training script:
* How much time a forward-backward pass takes for a single batch;
* How much time an epoch takes.
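
A minimal sketch of how both measurements could be collected with one helper is shown below. The `timer` helper and its `sync` argument are hypothetical names introduced here; the `time.sleep` calls stand in for the real forward-backward pass. On a GPU you would pass `torch.cuda.synchronize` as `sync`, because CUDA kernels run asynchronously and the clock would otherwise be read too early.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(timings, name, sync=None):
    """Append the wall-clock duration of the enclosed block to timings[name].

    `sync` is an optional callable run before each clock read
    (e.g. torch.cuda.synchronize when training on a GPU).
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        timings.setdefault(name, []).append(time.perf_counter() - start)


timings = {}
with timer(timings, "epoch"):
    for _ in range(3):  # stand-in for iterating over batches
        with timer(timings, "forward_backward"):
            time.sleep(0.01)  # stand-in for the forward and backward pass

steps = timings["forward_backward"]
mean_step = sum(steps) / len(steps)
print(f"Mean forward-backward time: {mean_step:.4f} s")
print(f"Epoch time: {timings['epoch'][0]:.4f} s")
```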

## Profiling memory usage (1 point)

**Task**. You need to measure the memory consumption of your training script.

This section depends on whether you train on CPU or GPU.

### If you train on CPU
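You can use the `time` utility to measure the peak RAM usage of a script. The `-l` flag below is the BSD/macOS one; GNU time on Linux uses `-v` instead:

In [None]:
!/usr/bin/time -lp python train.py

**Maximum resident set size** shows the peak RAM usage after the script finishes (in bytes with BSD/macOS `time -l`, in kilobytes with GNU `time -v`).

**Note**. 
Imports also consume memory, so correct for it, e.g. by subtracting the peak usage of a script that only performs the imports.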
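
As an alternative, the same number can be read from inside the script. The sketch below uses the standard `resource` module, which is available on Unix-like systems only; the `peak_rss_mib` helper is a hypothetical name, and the large `bytes` object merely stands in for the real workload.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of the current process in MiB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return peak / divisor


baseline = peak_rss_mib()           # e.g. measured right after the imports
payload = b"x" * (50 * 1024 ** 2)   # stand-in for the real workload
peak = peak_rss_mib()
print(f"Peak RSS: {peak:.1f} MiB (baseline after imports: {baseline:.1f} MiB)")
```

Reading the baseline right after the imports is one way to apply the correction mentioned in the note above.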

### If you train on GPU

Use [`torch.cuda.max_memory_allocated()`](https://pytorch.org/docs/stable/cuda.html#torch.cuda.max_memory_allocated) at the end of your script to show the maximum amount of memory in bytes used by all tensors.

In [None]:
x = torch.randn(1000, 1000, 1000, device='cuda:0')
print(f"Peak memory usage by PyTorch tensors: {(torch.cuda.max_memory_allocated('cuda:0') / 1024**2):.2f} MiB")

## Gradient based techniques

Modern architectures can consume a lot of memory even for a minibatch of several objects. To handle such cases, we will discuss two simple techniques here.

### Gradient Checkpointing (3 points)

Checkpointing works by trading compute for memory. Rather than storing all intermediate activations of the entire computation graph for the backward pass, the checkpointed part does not save intermediate activations and instead recomputes them in the backward pass. It can be applied to any part of a model.

See this [blogpost](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9) for a gentle introduction and different strategies, or this [article](https://arxiv.org/pdf/1604.06174.pdf) for a more rigorous one.

**Task**. Use [built-in checkpointing](https://pytorch.org/docs/stable/checkpoint.html), measure the difference in memory/compute.

Run several experiments with different checkpointing strategies (e.g. if your model is a sequential convnet, or there are large sequential parts in your model, you can vary [the number of sequential checkpointed segments](https://pytorch.org/docs/stable/checkpoint.html#torch.utils.checkpoint.checkpoint_sequential)).

Visualize your results (in terms of time/memory consumption). Which checkpointing strategy do you think is the best one for this trade-off?

**Requirements**. 
* Try several arrangements of checkpoints
* Add checkpointing as an optional flag to your script
* Measure the difference in memory/compute between the different arrangements and the baseline
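
To illustrate the mechanics, here is a minimal sketch that wraps each block of a toy model in `torch.utils.checkpoint.checkpoint` and checks that the gradients match the non-checkpointed baseline. `SmallNet` and its `use_checkpointing` switch are hypothetical; the sketch also assumes a PyTorch version in which `checkpoint` accepts `use_reentrant=False`.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class SmallNet(nn.Module):
    """Toy model (hypothetical) whose blocks can optionally be checkpointed."""

    def __init__(self, use_checkpointing=False):
        super().__init__()
        self.use_checkpointing = use_checkpointing
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 32), nn.ReLU())
            for _ in range(3)
        ])
        self.head = nn.Linear(32, 10)

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpointing and self.training:
                # Activations inside `block` are not stored; they are
                # recomputed during the backward pass.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return self.head(x)


torch.manual_seed(0)
baseline = SmallNet(use_checkpointing=False)
checkpointed = SmallNet(use_checkpointing=True)
checkpointed.load_state_dict(baseline.state_dict())  # identical weights

x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))
for model in (baseline, checkpointed):
    nn.functional.cross_entropy(model(x), y).backward()

max_grad_diff = max(
    (p.grad - q.grad).abs().max().item()
    for p, q in zip(baseline.parameters(), checkpointed.parameters())
)
print(f"Largest gradient difference: {max_grad_diff:.2e}")
```

Checkpointing changes only what is kept in memory, not the result, so the two models should produce the same gradients. Changing which blocks are wrapped gives you the different arrangements to compare.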

### Accumulating gradient for large batches (3 points)
We can increase the effective batch size by simply accumulating gradients over multiple forward passes. Note that `loss.backward()` simply adds the computed gradient to `tensor.grad`, so we can call this method multiple times before actually taking an optimizer step. However, this approach might be a little tricky to combine with batch normalization. Do you see why?

In [None]:
from torch.utils.data import DataLoader

effective_batch_size = 1024
loader_batch_size = 32
batches_per_update = effective_batch_size // loader_batch_size # Updating weights after 32 forward passes

dataloader = DataLoader(dataset, batch_size=loader_batch_size)

optimizer.zero_grad()

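for batch_i, (batch_X, batch_y) in enumerate(dataloader):
    # Assuming a mean-reduced loss: scale it so that the accumulated gradient
    # is the average over the effective batch rather than a sum of batch averages
    l = loss(model(batch_X), batch_y) / batches_per_update
    l.backward() # Adds gradients
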
    if (batch_i + 1) % batches_per_update == 0:
        optimizer.step()
        optimizer.zero_grad()

**Task**. Explore the trade-off between computation time and memory usage while maintaining the same effective batch size. By effective batch size we mean the number of objects over which the loss is computed before taking a gradient step.

**Requirements**

* Compare compute between gradient accumulation and gradient checkpointing at similar memory consumption
* Incorporate gradient accumulation into your script as an optional argument
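
As a sanity check for your implementation, the sketch below compares the gradient of one large batch with the gradient accumulated over smaller chunks. It assumes a toy linear model without batch normalization and a mean-reduced loss, in which case each chunk loss has to be divided by the number of accumulation steps for the two gradients to agree.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(20, 5)          # toy model without batch normalization
loss_fn = nn.CrossEntropyLoss()   # mean reduction by default
X = torch.randn(64, 20)
y = torch.randint(0, 5, (64,))

# Reference: a single backward pass over the full batch of 64 objects.
model.zero_grad()
loss_fn(model(X), y).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: 4 chunks of 16 objects, each loss scaled by 1 / 4.
batches_per_update = 4
model.zero_grad()
for chunk_X, chunk_y in zip(X.chunk(batches_per_update), y.chunk(batches_per_update)):
    (loss_fn(model(chunk_X), chunk_y) / batches_per_update).backward()
accumulated_grad = model.weight.grad

grads_match = torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6)
print(f"Gradients match: {grads_match}")
```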

## Accuracy vs compute trade-off

### Knowledge distillation (6 points)
Suppose that we have a large network (the *teacher network*) or an ensemble of networks with good accuracy. We can train a much smaller network (the *student network*) using the outputs of the teacher network. It turns out that the student's performance can be even better than when it is trained on the labels alone! This approach doesn't help with training speed, but can be quite beneficial when we'd like to reduce the model size for low-memory devices.

* https://www.ttic.edu/dl/dark14.pdf
* [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)
* https://medium.com/neural-machines/knowledge-distillation-dc241d7c2322

The student model can even use a completely different architecture ([article](https://arxiv.org/abs/1711.10433)), e.g. you can approximate an autoregressive model (WaveNet) by a non-autoregressive one.

**Task:** 
1. Train a good enough (teacher) network and achieve >=35% accuracy on the validation set.
2. Train a small (student) network, achieve 20-25% accuracy, and draw a plot of "training and testing errors vs train step index".
3. Distill the teacher network into the student network and achieve at least a +1% improvement in accuracy over the student network's accuracy.

_Please don't cheat with early-early-early stopping while training the student network. Make sure it has converged._

**Note**. Logits carry more information than the probabilities after softmax.

**Another note**. The most common way to distill knowledge in classification tasks is to optimize the KL divergence between the outputs of your student and teacher networks, since their outputs (after the softmax operation) can be interpreted as distributions over labels. It is also good practice to use softmax with a high temperature to obtain 'soft' distributions:

![image info](https://miro.medium.com/max/875/1*WxFiH3XDY1-28tbyi4BGDA.png)

**And another note**. Don't forget to put your teacher network in `eval` mode. And don't forget your main objective: the student still has to fit the true labels.
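
The notes above can be combined into a loss function roughly as follows. This is a sketch under stated assumptions, not the required solution: `distillation_loss`, `temperature`, and `alpha` are names and default values introduced here, and the random logits only stand in for real network outputs. The factor `temperature ** 2` follows the suggestion in "Distilling the Knowledge in a Neural Network".

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and the usual cross-entropy.

    `temperature` and `alpha` are illustrative hyperparameters to be tuned.
    """
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    # KL(teacher || student) on the softened distributions; the T^2 factor
    # keeps gradient magnitudes comparable across temperatures.
    kd_term = F.kl_div(student_log_probs, soft_targets,
                       reduction="batchmean") * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)  # loss on the true labels
    return alpha * kd_term + (1.0 - alpha) * ce_term


torch.manual_seed(0)
teacher_logits = torch.randn(8, 200)  # Tiny ImageNet has 200 classes
student_logits = torch.randn(8, 200, requires_grad=True)
labels = torch.randint(0, 200, (8,))

# The teacher is fixed, so no gradient should flow into it.
loss = distillation_loss(student_logits, teacher_logits.detach(), labels)
loss.backward()
print(f"Distillation loss: {loss.item():.4f}")
```

In a real training loop the teacher logits would come from the teacher network called under `torch.no_grad()`.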

### TL;DR. Moar techniques on accuracy vs time trade-off (just for your information)

### Mutual learning

Instead of transferring knowledge from a pre-trained teacher to a single student, we can train an ensemble of students that learn collaboratively, even without a teacher model:

https://arxiv.org/pdf/1706.00384.pdf

### Tensor type size

One of the hyperparameters affecting memory consumption is the numerical precision (e.g. of floating-point numbers). The most popular choice is 32 bits; however, with several hacks*, 16-bit arithmetic can save you approximately half of the memory without a considerable loss of performance. This is called mixed precision training:

https://arxiv.org/pdf/1710.03740.pdf

Mixed precision in PyTorch:

https://pytorch.org/docs/stable/notes/amp_examples.html
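
A minimal sketch of one training step under `torch.autocast` is given below. To stay runnable without a GPU, it assumes the CPU backend with `bfloat16`; on CUDA the usual choice is `float16` together with gradient scaling, as described in the AMP examples linked above. The toy model and the random data are placeholders.

```python
import torch
from torch import nn

device_type = "cpu"             # assumption for this sketch; use "cuda" on a GPU
low_precision = torch.bfloat16  # float16 is the usual choice on CUDA

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(32, 64)
y = torch.randint(0, 10, (32,))

optimizer.zero_grad()
with torch.autocast(device_type=device_type, dtype=low_precision):
    logits = model(x)  # matrix multiplications run in low precision
# The loss is computed in float32 outside the autocast region.
loss = nn.functional.cross_entropy(logits.float(), y)
loss.backward()
optimizer.step()

print(logits.dtype, model[0].weight.dtype)
```

The weights themselves stay in 32-bit precision; only selected operations inside the `autocast` region run in the lower precision, which is what "mixed" refers to.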

### Quantization

We can actually move further and use even lower precision like 8-bit integers:

* https://heartbeat.fritz.ai/8-bit-quantization-and-tensorflow-lite-speeding-up-mobile-inference-with-low-precision-a882dfcafbbd
* https://intellabs.github.io/distiller/design.html#quantization
* https://arxiv.org/abs/1712.05877

Quantization in PyTorch:
* https://pytorch.org/docs/stable/quantization.html
* https://spell.ml/blog/pytorch-quantization-X8e7wBAAACIAHPhT
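
To make the idea of 8-bit integers concrete, here is a minimal NumPy sketch of affine (asymmetric) quantization of a weight array. It is a hand-written illustration, not the scheme used by any particular library; `quantize_int8` and `dequantize` are hypothetical helper names.

```python
import numpy as np


def quantize_int8(x):
    """Affine (asymmetric) quantization of a float array to int8."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 or 1.0  # guard against a constant array
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
weights = rng.normal(size=1000).astype(np.float32)

q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = float(np.abs(weights - restored).max())

print(f"Bytes: {weights.nbytes} -> {q.nbytes}")
print(f"Max reconstruction error: {max_error:.5f} (step size {scale:.5f})")
```

Each value is stored in one byte instead of four, at the cost of a rounding error of roughly half a quantization step.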

### Pruning

The idea of pruning is to remove weights that are unnecessary (in terms of the loss). Importance can be measured in different ways: for example, by the norm of the weights (similar to L1 feature selection), by the magnitude of the activations, or via Taylor expansion*.

One iteration of pruning consists of two steps:

1) Rank the weights by some importance measure and remove the least important ones

2) Fine-tune the model

This approach is a bit computationally heavy but can lead to a drastic (up to 150x) decrease in the memory needed to store the weights. Moreover, if you make use of the structure of the layers, you can also decrease compute. For example, whole convolutional filters can be removed.
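
The first step of one pruning iteration can be sketched with the `torch.nn.utils.prune` module, here using weight magnitude as the importance measure. The toy layer and the 50% pruning amount are arbitrary choices for illustration. Note that zeroed weights in a dense tensor do not save memory by themselves; the savings come from storing the result in a sparse or compressed format, or from removing whole structures.

```python
import torch
from torch import nn
from torch.nn.utils import prune

torch.manual_seed(0)
layer = nn.Linear(16, 8)

# Step 1: rank weights by absolute value and mask the smallest 50%.
prune.l1_unstructured(layer, name="weight", amount=0.5)
sparsity = (layer.weight == 0).float().mean().item()
print(f"Fraction of zeroed weights: {sparsity:.2f}")

# Step 2 would fine-tune the model here; the mask keeps pruned weights at zero.

# Make the pruning permanent by folding the mask into the weight tensor.
prune.remove(layer, "weight")
```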

*https://arxiv.org/pdf/1611.06440.pdf

*https://arxiv.org/pdf/1808.06866.pdf