# Image Captioning
---

This notebook covers the training of the CNN-RNN model.  

Contents:
- [Step 1](#step1): Training Setup
- [Step 2](#step2): Model training

<a id='step1'></a>
## Step 1: Training Setup

This step customizes the training of the CNN-RNN model by specifying hyperparameters and setting other options that matter for the training procedure.

Most of the parameters are set following papers [1](https://arxiv.org/pdf/1502.03044.pdf) and [2](https://arxiv.org/pdf/1411.4555.pdf).


Variables to set up:
- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to update the model weights in each training step. 
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from a file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model. [This paper](https://arxiv.org/pdf/1502.03044.pdf) trained a captioning model on a single state-of-the-art GPU for 3 days, but reasonable results can be seen in a matter of a few hours!
- `save_every` - determines how often to save the model weights. After the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.
- `log_file` - the name of the text file that records the loss and perplexity at every training step.
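
To make the effect of `vocab_threshold` concrete, here is a minimal sketch of threshold-based vocabulary building. The function `build_vocab` and the toy captions are hypothetical, and the sketch assumes a word is kept when its count is at least the threshold; the project's actual vocabulary builder is not shown in this notebook and may differ (for example, by adding special start, end, and unknown tokens).

```python
from collections import Counter

def build_vocab(captions, vocab_threshold):
    """Keep words that occur at least `vocab_threshold` times across all captions."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, n in counts.items() if n >= vocab_threshold)

# Toy captions, made up for illustration only.
captions = [
    "a dog runs on the grass",
    "a dog sits on a sofa",
    "a cat sleeps on the sofa",
]

# A larger threshold leaves a smaller vocabulary.
for threshold in (1, 2, 3):
    print(threshold, build_vocab(captions, threshold))
```

With these toy captions the vocabulary shrinks from ten words at threshold 1 to two words (`a`, `on`) at threshold 3.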

**Model architecture** 

The model combines a CNN, which serves as the encoder, with an RNN, which serves as the decoder. To keep things simple while still benefiting from a strong CNN, a pre-trained ResNet-50 is used with its last module replaced by a fully connected layer of 256 neurons. The CNN weights are fixed, apart from those of this last linear (fully connected) layer.
The RNN starts with an embedding layer that maps the inputs (a batch of words) to a fixed dimension: each word is represented as a vector of length 256, the same size as the encoder's output. The encoder output is then concatenated with the word embeddings and fed into LSTM cells with a hidden size of 512. The decoder's goal is to predict the next word of the image description, so its last layer is linear, with as many neurons as there are words in the vocabulary.
Different thresholds were tried to get a suitable vocabulary size and to drop very rare words. The vocabulary shrinks as the threshold grows: a threshold of 6 gives 8099 words, 5 gives 8855, and 4 gives 9955. A threshold of 6 was chosen to keep the most useful and frequent words; it also makes the algorithm run faster.
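
The decoder described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the `DecoderRNN` imported from `model.py` (its code is not shown in this notebook): the class name `DecoderSketch`, the single-layer LSTM, the choice to feed the image feature as the first step of the input sequence, and dropping the last caption token before embedding are all assumptions.

```python
import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    """Hypothetical decoder: embedding -> LSTM -> linear layer over the vocabulary."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Drop the last token: there is no next word to predict after it.
        embeddings = self.embed(captions[:, :-1])                # (B, T-1, E)
        # Put the image feature in front of the word embeddings.
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)  # (B, T, E)
        hidden, _ = self.lstm(inputs)                            # (B, T, H)
        return self.fc(hidden)                                   # (B, T, V)

# Shape check with random data (small vocabulary of 100 words for illustration).
decoder_sketch = DecoderSketch(embed_size=256, hidden_size=512, vocab_size=100)
features = torch.randn(4, 256)              # stand-in for the encoder output
captions = torch.randint(0, 100, (4, 12))   # batch of 4 captions, 12 tokens each
outputs = decoder_sketch(features, captions)
print(outputs.shape)
```

Under these assumptions the output has one score per vocabulary word for each of the 12 caption positions, which is the layout the cross-entropy loss in the training loop expects after flattening.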

**Image transformation:**

Image transformation is an important part of a neural network pipeline. Some transforms are necessary because the images have different sizes, which is why `Resize(256)` and `RandomCrop(224)` are used. Because the pre-trained ResNet-50 expects normalized images, normalization is applied as well: `Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))`. To improve generalization, data augmentation is used: `RandomRotation(10)`, `ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.1)`, and `RandomHorizontalFlip()`. Finally, `ToTensor()` is required to convert each image into PyTorch's tensor representation.
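
As a small worked example of what normalization does, the sketch below applies the per-channel formula `(value - mean) / std` to a single RGB pixel in pure Python. The mean and standard deviation are the ones passed to `Normalize` in the training transform; the helper `normalize_pixel` is hypothetical and only mirrors the arithmetic, it is not part of torchvision.

```python
# Per-channel statistics used by the training transform in this notebook.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Normalize one RGB pixel whose channel values lie in [0, 1]."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))

mean_pixel = normalize_pixel(MEAN)               # a pixel equal to the mean maps to zero
white_pixel = normalize_pixel((1.0, 1.0, 1.0))   # bright pixels map to positive values
print(mean_pixel)
print(white_pixel)
```

`ToTensor()` scales pixel values to the range [0, 1] first, which is why it comes before `Normalize` in the pipeline.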

**Trainable parameters**

ResNet-50 is used for transfer learning, to shorten training time without sacrificing performance. For that reason, all encoder layers are frozen except the last fully connected one, which learns the combination of features the decoder needs to capture the patterns. All decoder layers are trained as well. 
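
The freezing scheme can be illustrated on a toy module. The sketch below is not the notebook's `EncoderCNN`: `TinyEncoder`, its layer sizes, and the attribute names `backbone` and `embed` are hypothetical stand-ins for the frozen ResNet-50 body and the trainable final linear layer.

```python
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in: a frozen 'backbone' followed by a trainable linear layer."""

    def __init__(self, in_features=16, hidden=8, embed_size=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        self.embed = nn.Linear(hidden, embed_size)
        # Freeze the backbone so the optimizer never updates it.
        for param in self.backbone.parameters():
            param.requires_grad_(False)

    def forward(self, x):
        return self.embed(self.backbone(x))

tiny_encoder = TinyEncoder()
trainable = [p for p in tiny_encoder.parameters() if p.requires_grad]
num_trainable = sum(p.numel() for p in trainable)
num_total = sum(p.numel() for p in tiny_encoder.parameters())
print(num_trainable, num_total)
```

Only the parameters collected in `trainable` would need to be handed to the optimizer; here that is the weight and bias of the final layer.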

**Optimizer**

The Adam [optimizer](http://pytorch.org/docs/master/optim.html#torch.optim.Optimizer) with standard parameters is used, as recommended in [this paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7505636).
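
For intuition about what "standard parameters" means, here is a scalar sketch of a single Adam update in pure Python, using the usual defaults (`lr=1e-3`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`). The function `adam_step` is hypothetical and written only for illustration; training in this notebook relies on PyTorch's own Adam implementation.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First step from param=1.0 with gradient 0.5 and zero-initialized moments.
new_param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(new_param)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's magnitude.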

In [9]:
import torch
import torch.nn as nn
from torchvision import transforms
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN
import math
import torch.optim as optim

batch_size = 64            # batch size
vocab_threshold = 6        # minimum word count threshold
vocab_from_file = True     # if True, load existing vocab file
embed_size = 256           # dimensionality of image and word embeddings
hidden_size = 512          # number of features in hidden state of the RNN decoder
num_epochs = 3             # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 100          # determines window for printing average loss
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomRotation(10),
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.1),
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Build data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)

# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the encoder and decoder. 
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available. 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Define the loss function. 
criterion = nn.CrossEntropyLoss().to(device)

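# The learnable parameters of the model: all decoder weights plus any encoder
# weights that are not frozen (per the text above, the encoder's last linear layer).
params = list(decoder.parameters()) + [p for p in encoder.parameters() if p.requires_grad]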

# The optimizer.
optimizer = optim.Adam(params=params, lr=0.001)

# Set the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

loading annotations into memory...
Done (t=0.89s)
creating index...
index created!
[0/414113] Tokenizing captions...
[100000/414113] Tokenizing captions...
[200000/414113] Tokenizing captions...
[300000/414113] Tokenizing captions...
[400000/414113] Tokenizing captions...
loading annotations into memory...


  0%|          | 929/414113 [00:00<01:28, 4645.76it/s]

Done (t=0.85s)
creating index...
index created!
Obtaining caption lengths...


100%|██████████| 414113/414113 [01:32<00:00, 4497.12it/s]
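
The number of training steps per epoch follows directly from the dataset size. A quick check, assuming the 414113 captions reported in the tokenizing log above and `batch_size = 64`:

```python
import math

num_captions = 414113   # taken from the tokenizing log above
batch_size = 64

# The last, partially filled batch still counts as a step, hence the ceiling.
total_step = math.ceil(num_captions / batch_size)
print(total_step)  # 6471
```

This matches the `6471` steps per epoch that appear in the training log.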


<a id='step2'></a>
## Step 2: Model training


### A Note on Tuning Hyperparameters

To judge how well the model is doing, watch how the training loss and perplexity evolve during training.

However, this will not tell you if the model is overfitting on the training data, and, unfortunately, overfitting is a problem that is commonly encountered when training image captioning models.  

Section 4.3.1 of [this paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7505636) describes several approaches to minimizing overfitting.
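
The perplexity logged during training is the exponential of the cross-entropy loss (the training loop computes it with `np.exp`). The sketch below reproduces that relationship with the standard library; the helper name `perplexity` is hypothetical. The loss value 3.8385 is the first one printed in the training log, and 8099 is the vocabulary size reported above for a threshold of 6.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is exp(loss) when the loss is the mean cross-entropy in nats."""
    return math.exp(cross_entropy_loss)

# Loss printed at step 100 of epoch 1 in the training log.
logged = perplexity(3.8385)

# A model guessing uniformly over 8099 words would have loss ln(8099),
# which corresponds to a perplexity equal to the vocabulary size.
uniform = perplexity(math.log(8099))
print(logged, uniform)
```

Read this way, a perplexity of about 46 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 46 words, compared with 8099 for a model that has learned nothing.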

In [10]:
import torch.utils.data as data
import numpy as np
import os

# Make sure the folder for saved weights exists, then open the training log file.
os.makedirs('./models', exist_ok=True)
f = open(log_file, 'w')

for epoch in range(1, num_epochs+1):
    
    for i_step in range(1, total_step+1):
        
        # Randomly sample a caption length, and sample indices with that length.
        indices = data_loader.dataset.get_train_indices()
        # Create and assign a batch sampler to retrieve a batch with the sampled indices.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler
        
        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)
        
        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()
        
        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions)
        
        # Calculate the batch loss.
        loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
        
        # Backward pass.
        loss.backward()
        
        # Update the parameters in the optimizer.
        optimizer.step()
            
        # Get training statistics.
        stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))
        
        # Print training statistics (on same line).
        print('\r' + stats, end="")
        sys.stdout.flush()
        
        # Print training statistics to file.
        f.write(stats + '\n')
        f.flush()
        
        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print('\r' + stats)
            
    # Save the weights.
    if epoch % save_every == 0:
        torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
        torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()

Epoch [1/3], Step [100/6471], Loss: 3.8385, Perplexity: 46.4574
Epoch [1/3], Step [200/6471], Loss: 3.4616, Perplexity: 31.8674
Epoch [1/3], Step [300/6471], Loss: 3.3325, Perplexity: 28.0083
Epoch [1/3], Step [400/6471], Loss: 3.1990, Perplexity: 24.5085
Epoch [1/3], Step [500/6471], Loss: 3.1257, Perplexity: 22.7759
Epoch [1/3], Step [600/6471], Loss: 3.0853, Perplexity: 21.8734
Epoch [1/3], Step [700/6471], Loss: 2.9654, Perplexity: 19.4016
Epoch [1/3], Step [800/6471], Loss: 5.1532, Perplexity: 172.9919
Epoch [1/3], Step [900/6471], Loss: 2.9446, Perplexity: 19.0023
Epoch [1/3], Step [1000/6471], Loss: 2.7136, Perplexity: 15.0836
Epoch [1/3], Step [1100/6471], Loss: 2.9283, Perplexity: 18.6965
Epoch [1/3], Step [1200/6471], Loss: 3.0839, Perplexity: 21.8423
Epoch [1/3], Step [1300/6471], Loss: 2.7078, Perplexity: 14.9965
Epoch [1/3], Step [1400/6471], Loss: 2.9434, Perplexity: 18.9802
Epoch [1/3], Step [1500/6471], Loss: 2.9674, Perplexity: 19.4419
Epoch [1/3], Step [1600/6471], Lo