### Advanced task: image captioning with visual attention

![img](https://i.imgur.com/r3r0fS4.jpg)

__This task__ walks you through all the steps required to build an attentive image captioning system. Except this time, there are no `<YOUR CODE HERE>` placeholders. You write all the code.

You are free to approach this task in any way you want. Follow our step-by-step guide or abandon it altogether. Use the notebook or add extra .py files (remember to add them to your anytask submission). The only limitation is that your code should be readable and runnable top-to-bottom.



### Step 1: image preprocessing (5 pts)

First, you need to prepare images for captioning. Just like in the basic notebook, you are going to use a pre-trained image classifier from the model zoo. Open the [`preprocess_data.ipynb`](./preprocess_data) notebook and change a few things there. This stage is mostly running the existing code with minor modifications.

1. Download the data somewhere with enough free space. You will need around 100 GB for the whole thing.
2. Pre-compute and save Inception activations at the layer directly __before the average pooling__.
   - The correct shape is `[batch_size, 2048, 8, 8]`. Your LSTM will attend to that 8x8 grid.
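
Once the activations are saved, the attention layer will usually want them as a sequence of 64 vectors rather than an 8x8 grid. Here is a minimal sketch of that reshaping. The helper name `grid_to_sequence` is hypothetical, and a random tensor stands in for real Inception activations.

```python
import torch


def grid_to_sequence(features: torch.Tensor) -> torch.Tensor:
    """Reshape [batch, channels, height, width] -> [batch, height * width, channels]."""
    # flatten(2) merges the two spatial axes (row-major);
    # transpose(1, 2) puts grid positions before channels.
    return features.flatten(2).transpose(1, 2).contiguous()


# Random stand-in for real pre-pooling activations (assumed shape from the text above)
fake_features = torch.randn(4, 2048, 8, 8)
encoder_states = grid_to_sequence(fake_features)
print(encoder_states.shape)  # torch.Size([4, 64, 2048])
```

With this layout, position `row * 8 + col` in the sequence corresponds to cell `(row, col)` of the grid, which is handy later when you plot attention maps.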


__Note 1:__ Inception is great, but not the best model in the field. If you have enough courage, consider using ResNet or DenseNet from the same model zoo. Just remember that different models may require different image preprocessing.

__Note 2:__ Running this model on CPU may take days. You can speed things up by processing data in parts using colab + google drive. Here's how you do that: https://colab.research.google.com/notebooks/io.ipynb

In [None]:
<...>

### Step 2: sub-word tokenization (5 pts)

While it is not strictly necessary for image captioning, you can generally improve generative text models by using sub-word units. Several open-source sub-word tokenizers are available (BPE, WordPiece, etc.).

* __[recommended]__ A BPE implementation you can use: [github_repo](https://github.com/rsennrich/subword-nmt).
* Theory on how it works: https://arxiv.org/abs/1508.07909
* We recommend starting with __4000 bpe rules__.
* The result@@ ing lines will contain splits for rare and mis@@ typed words like this: ser@@ endi@@ pity
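
To get a feel for what the tokenizer does, here is a toy pure-Python sketch of learning and applying BPE merges. It is not the `subword-nmt` code: the function names (`learn_bpe`, `apply_bpe`, `merge_pair`) and the tiny word list are made up for illustration, and a real run would use your caption corpus with around 4000 merges.

```python
from collections import Counter


def to_symbols(word):
    """Split a non-empty word into characters, marking the last one as end-of-word."""
    return tuple(word[:-1]) + (word[-1] + "</w>",)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with a single merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair (toy version)."""
    vocab = Counter(to_symbols(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word with learned merges, joining pieces with the '@@ ' marker."""
    symbols = to_symbols(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return "@@ ".join(s.replace("</w>", "") for s in symbols)


# Tiny made-up corpus; real captions would go here
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=10)
segmented = apply_bpe("lowest", merges)
restored = segmented.replace("@@ ", "")  # undo the split after generation
print(segmented, "->", restored)
```

The last line also shows how to turn generated sub-word captions back into normal text: just remove the `@@ ` markers.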


In [None]:
<...>

### Step 3: define attentive decoder (5 pts)

Your model works similarly to the normal image captioning decoder, except that it has an additional mechanism for peeking at the image at each step. We recommend implementing this mechanism as a separate Attention layer, inheriting from `nn.Module`. Here's what it should do:

![img](https://camo.githubusercontent.com/1f5d1b5def5ab2933b3746c9ef51f4622ce78b86/68747470733a2f2f692e696d6775722e636f6d2f36664b486c48622e706e67)


__Input:__ 8x8=64 image encoder vectors $h^e_0, h^e_1, h^e_2, ..., h^e_{63}$ and a single decoder LSTM hidden state $h^d$.

* Compute logits with a 2-layer neural network with tanh activation (or anything similar)

$$a_t = linear_{out}(\tanh(linear_{e}(h^e_t) + linear_{d}(h^d)))$$

* Get probabilities from logits with a softmax over all 64 positions:

$$ p_t = {{e ^ {a_t}} \over { \sum_\tau e^{a_\tau} }} $$

* Add up encoder states with probabilities to get __attention response__
$$ attn = \sum_t p_t \cdot h^e_t $$

You can now feed this $attn$ to the decoder LSTM, concatenated with the embedding of the previous token.
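
The three formulas above map almost line by line onto a small PyTorch module. This is one possible sketch, not a reference solution: the layer sizes in the usage part are arbitrary, and the encoder states are assumed to be already reshaped to `[batch, 64, enc_size]`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    """Additive attention over image encoder vectors (a sketch, sizes are up to you)."""

    def __init__(self, enc_size, dec_size, attn_size):
        super().__init__()
        self.linear_e = nn.Linear(enc_size, attn_size)
        self.linear_d = nn.Linear(dec_size, attn_size)
        self.linear_out = nn.Linear(attn_size, 1)

    def forward(self, enc_states, dec_state):
        # enc_states: [batch, n_positions, enc_size], dec_state: [batch, dec_size]
        # unsqueeze(1) broadcasts the decoder projection over all positions
        hidden = torch.tanh(self.linear_e(enc_states) + self.linear_d(dec_state).unsqueeze(1))
        logits = self.linear_out(hidden).squeeze(-1)           # a_t: [batch, n_positions]
        probs = F.softmax(logits, dim=-1)                      # p_t: [batch, n_positions]
        attn = (probs.unsqueeze(-1) * enc_states).sum(dim=1)   # [batch, enc_size]
        return attn, probs


# Usage with random stand-in tensors (sizes here are illustrative assumptions)
attention = Attention(enc_size=2048, dec_size=256, attn_size=128)
enc_states = torch.randn(2, 64, 2048)
dec_state = torch.zeros(2, 256)
attn, probs = attention(enc_states, dec_state)
print(attn.shape, probs.shape)  # torch.Size([2, 2048]) torch.Size([2, 64])
```

Returning `probs` alongside `attn` costs nothing and lets you reshape them to 8x8 and draw attention maps in the final step.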

__Note 1:__ If you need more information on how attention works, here's [a class on attentive seq2seq](https://github.com/yandexdataschool/nlp_course/tree/master/week04_seq2seq) from the NLP course.

__Note 2:__ You can initialize the LSTM state either with some image features or with zeros. We recommend zeros: it is a good way to check that your attention is working, and it usually produces better-looking attention maps.
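
A single decoder step with a zero-initialized state might look like the sketch below. All sizes and variable names are illustrative assumptions, and a random tensor stands in for the attention response that your attention layer would compute from the current hidden state.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not prescribed by the assignment
batch_size, emb_size, enc_size, hid_size, vocab_size = 2, 32, 2048, 64, 100

embedding = nn.Embedding(vocab_size, emb_size)
lstm_cell = nn.LSTMCell(emb_size + enc_size, hid_size)  # input = [token embedding; attn]
to_logits = nn.Linear(hid_size, vocab_size)

# Zero initial state: the decoder can only learn about the image through attention
h = torch.zeros(batch_size, hid_size)
c = torch.zeros(batch_size, hid_size)

prev_tokens = torch.randint(0, vocab_size, (batch_size,))
attn = torch.randn(batch_size, enc_size)  # stand-in for the real attention response

# Concatenate previous token embedding with the attention response and step the LSTM
lstm_input = torch.cat([embedding(prev_tokens), attn], dim=-1)
h, c = lstm_cell(lstm_input, (h, c))
next_token_logits = to_logits(h)
print(next_token_logits.shape)  # torch.Size([2, 100])
```

In the full decoder you would repeat this step for every position in the caption, recomputing the attention response from the updated `h` each time.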

In [None]:
<...>

### Step 4: training

Up to 10 pts based on the model performance. 
The training procedure for your model is no different from the original non-attentive captioning from the base track: iterate minibatches, compute loss, backprop, use the optimizer.
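
As a reminder of what that loop looks like, here is a minimal sketch with teacher forcing and padding excluded from the loss. The tiny `model` is only a stand-in that predicts the next token from the current one; your real model would also take image features. The vocabulary size, `pad_ix`, and the fake batch are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, pad_ix = 50, 0  # illustrative values

# Stand-in for the captioning model (ignores images entirely)
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

# Fake minibatch of token ids, padded at the end with pad_ix
tokens = torch.randint(1, vocab_size, (8, 12))
tokens[:, 9:] = pad_ix

losses = []
for step in range(30):
    # Teacher forcing: feed tokens[:-1], predict tokens[1:]
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # [batch, time, vocab]
    # ignore_index drops padded positions from the average
    loss = F.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1), ignore_index=pad_ix
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Overfitting a single fixed batch like this is also a cheap sanity check for the real model before you start a long training run.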

Feel free to use the [`basic track notebook`](./homework04_basic_part2_image_captioning) for "inspiration" :)


In [None]:
<...>

### Final step: show us what it's capable of! (5 pts)

The task is exactly the same as in the base track _(with the exception that you don't have to deal with salary prediction :) )_


__Task: Find at least 10 images to test it on.__

* Seriously, that's part of the assignment. Go get at least 10 pictures for captioning.
* Make sure it works okay on __simple__ images before going to something more complex
* Your pictures must feature both successful and failed captioning. Get creative :)
* Use photos, not animation/3d/drawings, unless you want to re-train the CNN on anime.
* Mind the aspect ratio.
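
One simple way to mind the aspect ratio is to crop the largest centered square before resizing to the network's input size, instead of squashing the whole image. The helper below is a hypothetical sketch that only computes the crop box; whether center cropping suits your pictures is a judgment call.

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


print(center_crop_box(640, 480))  # (80, 0, 560, 480)
print(center_crop_box(300, 500))  # (0, 100, 300, 400)
```

A box in this `(left, top, right, bottom)` format can be passed to, for example, PIL's `Image.crop`, after which the square image can be resized without distortion.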

In [None]:
# apply your network on images you've found
#
#


### What else to try

If you've made it this far, you're awesome and you should know it already. All the tasks below are completely optional and may take a lot of your time. Proceed at your own risk.

#### Hard attention

* There are more ways to implement attention than simple softmax averaging. Here's [a lecture](https://www.youtube.com/watch?v=_XRBlhzb31U) on that. 
* We recommend starting with [Gumbel-softmax](https://blog.evjang.com/2016/11/tutorial-categorical-variational.html) or [sparsemax](https://arxiv.org/abs/1602.02068) attention.
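
To see how sparsemax differs from softmax, here is a small pure-Python sketch of the projection for a single list of logits. It follows the sort-and-threshold procedure as we understand it from the paper linked above; for actual training you would want a batched, differentiable tensor version.

```python
def sparsemax(logits):
    """Project logits onto the probability simplex; some outputs can be exactly 0."""
    z_sorted = sorted(logits, reverse=True)
    cumsum = 0.0
    k, support_sum = 0, 0.0
    for i, z in enumerate(z_sorted, start=1):
        cumsum += z
        # Position i stays in the support while 1 + i * z_(i) > sum of the top-i logits
        if 1 + i * z > cumsum:
            k, support_sum = i, cumsum
    tau = (support_sum - 1) / k  # threshold subtracted from every logit
    return [max(z - tau, 0.0) for z in logits]


print(sparsemax([1.0, 0.0, -1.0]))  # [1.0, 0.0, 0.0]
print(sparsemax([0.5, 0.5]))        # [0.5, 0.5]
```

Unlike softmax, well-separated logits give exact zeros, so the decoder attends to only a few grid cells at a time.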

#### Reinforcement learning

* After your model has been pre-trained with teacher forcing, you can fine-tune it for captioning-specific metrics like CIDEr.
* Tutorial on RL for sequence models: [practical_rl week7](https://github.com/yandexdataschool/Practical_RL/tree/spring19/week7_seq2seq)
* Theory: https://arxiv.org/abs/1612.00563
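
The core idea of the paper linked above (self-critical sequence training) is to use the reward of the greedy caption as a baseline for a sampled caption. Below is a rough sketch of such a loss. The function name, argument layout, and the sum-over-tokens normalization are assumptions, and computing the rewards (e.g. CIDEr) is left out entirely.

```python
import math

import torch


def self_critical_loss(sample_logprobs, sample_reward, greedy_reward, mask):
    """Policy-gradient loss with the greedy caption's reward as baseline.

    sample_logprobs: [batch, time] log-probabilities of the sampled tokens
    sample_reward, greedy_reward: [batch] metric values for sampled / greedy captions
    mask: [batch, time], 1.0 for real tokens and 0.0 for padding
    """
    # The advantage is treated as a constant: no gradient flows through the rewards
    advantage = (sample_reward - greedy_reward).detach()
    seq_logprob = (sample_logprobs * mask).sum(dim=1)
    return -(advantage * seq_logprob).mean()


# Toy numbers only, to show the shapes involved
sample_logprobs = torch.full((2, 3), math.log(0.5), requires_grad=True)
mask = torch.ones(2, 3)
sample_reward = torch.tensor([1.0, 0.2])
greedy_reward = torch.tensor([0.5, 0.5])

loss = self_critical_loss(sample_logprobs, sample_reward, greedy_reward, mask)
loss.backward()
print(loss.item())
```

Samples that beat the greedy caption get their log-probability pushed up, and samples that do worse get pushed down.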

#### Chilling out

This is the final and the most advanced task in the DL course. And if you're doing this with the on-campus YSDA students, it should be late spring by now. There's got to be a better way to spend a few days than coding another deep learning model. If you have no idea what to do, ask Yandex. Or your significant other.

![img](https://imgs.xkcd.com/comics/computers_vs_humans.png)