In [None]:
import os
from pathlib import Path
from fastai.imports import *  # imports the usual suspects

## Lesson 7 Collaborative Filtering

The video is actually titled "What's inside a neural net", which makes sense because it begins by continuing through "Road to the Top" [part 3](https://www.kaggle.com/code/jhoward/scaling-up-road-to-the-top-part-3) and [part 4](https://www.kaggle.com/code/jhoward/multi-target-road-to-the-top-part-4).


### Gradient accumulation

* For large models, you will need a smaller batch size to fit in memory. However, simply shrinking the batch increases the variance of the gradient estimate. Instead, you can accumulate the gradients over multiple batches before updating the weights. This is called gradient accumulation.

* In PyTorch this can be done by simply calling `loss.backward()` for each 'sub-batch' and then calling `optimizer.step()` after a number of sub-batches have been processed. For example, if you want a batch size of 64 but cannot fit it in memory, you could set the sub-batch size to 16 and accumulate gradients over 4 sub-batches. This works because the gradients are simply added together until you call `optimizer.zero_grad()` (or otherwise zero the grads).

* The results will be the same for most architectures, except for things that depend on the batch itself like batch normalization.  

* Fastai supports this directly with the [`GradientAccumulation`](https://docs.fast.ai/callback.training.html#gradientaccumulation) callback.

* Someone asked how you pick a batch size, and he said something like "people pick as large a batch size as will fit in the GPU". This raises the question: why use gradient accumulation at all? Why not just decrease the learning rate? Not clear to me; presumably it has something to do with the variance of the gradient.

* When experimenting with memory usage, and perhaps in general, it is a good idea to clear out memory between runs. You can do this by restarting the kernel, or by using `gc.collect()` and `torch.cuda.empty_cache()`.

* The notebook (part 3) also illustrates how he got to the top by ensembling several different large models trained on different training sets (using a train/validation split).

* Video at 38:00
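
The accumulation loop described above can be sketched in plain PyTorch. This is a minimal illustration, not the fastai callback: the tiny `nn.Linear` model, the random data, and the names `make_model` and `accum_steps` are all made up for the example. One assumption worth flagging: because `nn.MSELoss` averages over the batch, each sub-batch loss is divided by `accum_steps` so that the summed gradients match the full-batch gradient. Without that division the accumulated gradient would be 4 times larger.

```python
import torch
from torch import nn

def make_model():
    torch.manual_seed(0)              # same initial weights on every call
    return nn.Linear(4, 1)            # hypothetical stand-in for a large model

torch.manual_seed(1)
x, y = torch.randn(64, 4), torch.randn(64, 1)
loss_fn = nn.MSELoss()                # mean reduction over the batch

# Reference: a single backward pass over the full batch of 64.
full = make_model()
loss_fn(full(x), y).backward()

# Accumulated: 4 sub-batches of 16, one optimizer step at the end.
accum = make_model()
opt = torch.optim.SGD(accum.parameters(), lr=0.1)
accum_steps = 4
opt.zero_grad()
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    loss = loss_fn(accum(xb), yb) / accum_steps   # rescale (assumption, see above)
    loss.backward()                   # gradients are added into .grad

# The accumulated gradients should agree with the full-batch ones.
grads_match = (
    torch.allclose(full.weight.grad, accum.weight.grad, atol=1e-5)
    and torch.allclose(full.bias.grad, accum.bias.grad, atol=1e-5)
)

opt.step()                            # one weight update for the effective batch of 64
opt.zero_grad()                       # reset before the next accumulation cycle
```

A linear layer has no batch-dependent layers, so the two gradients agree up to floating-point error. With batch normalization in the model they generally would not, since the statistics are computed per sub-batch.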

### Multi-target training


This section uses [part 4](https://www.kaggle.com/code/jhoward/multi-target-road-to-the-top-part-4) of the notebook.
He uses fast.ai's `DataBlock` to encapsulate the multi-target dataset. The targets are the disease *and* the variety of rice.

Since there are 10 different diseases and 10 different varieties, he simply has the learner output 20 values. The first 10 are used to predict the disease, and the second 10 are used to predict the variety. The loss function adds the losses for the two targets together. It applies `F.cross_entropy` separately to each target, which works because the output of the model is a single vector of 20 logits that can be sliced into 10 logits per target.
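
The combined loss described above can be sketched as follows. This is an illustration under assumptions rather than a copy of the notebook: the function names, the `N_DISEASES` constant, and the random tensors standing in for model output and labels are all chosen for the example.

```python
import torch
import torch.nn.functional as F

N_DISEASES = 10  # logits 0..9 are for disease, logits 10..19 for variety

def disease_loss(logits, disease, variety):
    # Cross entropy on the first 10 logits only.
    return F.cross_entropy(logits[:, :N_DISEASES], disease)

def variety_loss(logits, disease, variety):
    # Cross entropy on the remaining 10 logits.
    return F.cross_entropy(logits[:, N_DISEASES:], variety)

def combine_loss(logits, disease, variety):
    # The two per-target losses are simply added together.
    return disease_loss(logits, disease, variety) + variety_loss(logits, disease, variety)

# Fake batch of 8 items: 20 logits each, plus one integer label per target.
torch.manual_seed(0)
logits = torch.randn(8, 20)
disease = torch.randint(0, 10, (8,))
variety = torch.randint(0, 10, (8,))

loss = combine_loss(logits, disease, variety)  # a single scalar tensor
```

Each function takes both targets so that the same signature can be reused for per-target metrics; that is a design choice for this sketch, and a plain sum weights the two targets equally.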



### Collaborative Filtering

This section uses the [notebook](https://www.kaggle.com/code/jhoward/collaborative-filtering-deep-dive/notebook) which is also chapter 8 of the book.