# Project: Image Captioning


---

## Problem Statement: 
Annotate an image with a short description explaining the contents in that image.

**Dataset used:** 
The Microsoft **C**ommon **O**bjects in **CO**ntext (MS COCO) dataset is a large-scale dataset for scene understanding.  The dataset is commonly used to train and benchmark object detection, segmentation, and captioning algorithms.  

![Sample Dog Output](images/coco-examples.jpg)

You can read more about the dataset on the [website](http://cocodataset.org/#home) or in the [research paper](https://arxiv.org/pdf/1405.0312.pdf).


## Project Planning

### The project is divided into the following tasks, and each task is carried out in its respective Jupyter Notebook.
 1. Dataset Exploration: In this notebook, the COCO dataset is explored, in preparation for the project.
 2. Preprocessing: In this notebook, the COCO dataset is loaded and pre-processed, making it ready to pass to the model for training.
 3. **Training: In this notebook, the CNN-RNN deep architecture model is trained.**
 4. Inference: In this notebook, the trained model is used to generate captions for images in the test dataset. Here, the performance of the model is observed on real-world images.

## Task Planning:

- [Step 1](#step1): Training Setup
- [Step 2](#step2): Training the Model

<a id='step1'></a>
## Step 1: Training Setup

In this step of the notebook, we will customize the training of the CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure.




### Step 1.1: Necessary Information

Let's begin by understanding and setting the following variables that are used during the training process:
- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to update the model weights in each training step. 
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model.  
- `save_every` - determines how often to save the model weights.  
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.  
- `log_file` - the name of the text file that records, for every training step, the loss and perplexity.
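
To illustrate how `vocab_threshold` changes the vocabulary size, here is a minimal sketch in plain Python. It is not the project's vocabulary code: the helper `build_vocab`, the whitespace tokenization, and the rule "keep a word if its count is at least the threshold" are assumptions made for illustration, and the three captions are made up.

```python
from collections import Counter

def build_vocab(captions, vocab_threshold):
    """Keep words whose count is at least vocab_threshold (hypothetical helper)."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, n in counts.items() if n >= vocab_threshold)

# Toy captions, made up for this example.
captions = [
    "a dog runs on the beach",
    "a dog plays with a ball",
    "a cat sleeps on the sofa",
]

small_vocab = build_vocab(captions, vocab_threshold=2)  # rarer words are dropped
large_vocab = build_vocab(captions, vocab_threshold=1)  # every word is kept

print(small_vocab)       # ['a', 'dog', 'on', 'the']
print(len(large_vocab))  # 12
```

A larger threshold keeps only the frequent words, which shrinks the output layer of the decoder but maps more caption words to an unknown-word token.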



### Chosen CNN-RNN Architecture (ResNet50 - LSTM)
 
- The Encoder is a CNN. In our model, we use a pre-trained ResNet50 as the encoder, keeping all of its layers except the final output layer, as seen from `modules = list(resnet.children())[:-1]` in the `model.py` file.
- The Decoder consists of an Embedding layer and an RNN. We use an LSTM as the RNN, which trains on the image features combined with the caption embeddings coming from the Embedding layer. The LSTM is followed by a Linear layer.
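
The decoder described above can be sketched as follows. This is a minimal sketch and not the contents of the project's model file: the class name `SketchDecoder`, the toy sizes, and the choice to drop the last caption token before embedding are assumptions. Dropping that token gives one prediction per caption position, which is the shape the loss computation in the training loop expects.

```python
import torch
import torch.nn as nn

class SketchDecoder(nn.Module):
    """Embedding -> LSTM -> Linear, with the image feature as the first input step."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Embed every caption token except the last one (assumption, see lead-in).
        embeddings = self.embed(captions[:, :-1])
        # Prepend the image feature vector as the first element of the sequence.
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hidden, _ = self.lstm(inputs)
        # Map each hidden state to unnormalized scores over the vocabulary.
        return self.fc(hidden)

# Toy sizes chosen for the example: batch of 4, captions of length 12, 50 words.
features = torch.randn(4, 16)             # stand-in for encoder output
captions = torch.randint(0, 50, (4, 12))  # stand-in for tokenized captions
decoder = SketchDecoder(embed_size=16, hidden_size=32, vocab_size=50)
outputs = decoder(features, captions)
print(outputs.shape)  # torch.Size([4, 12, 50])
```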


### Optimizer

Because predicting the next word is a classification task, I use `CrossEntropyLoss` with the Adam optimizer. I chose Adam over plain SGD because it adapts the learning rate per parameter and includes momentum as part of its algorithm. In my previous experience with Project 1, Adam also reached a lower loss than SGD. The optimizer is therefore Adam with a learning rate of 0.001, matching the `lr=0.001` set in the code.
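
To make the momentum and adaptive step size concrete, here is a minimal pure-Python sketch of a single Adam update for one scalar parameter. The function name `adam_step` and the toy objective f(x) = x² are assumptions for illustration; the default values mirror the defaults of `torch.optim.Adam` (`lr=0.001`, `betas=(0.9, 0.999)`, `eps=1e-8`).

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (illustrative helper)."""
    m = beta1 * m + (1 - beta1) * grad         # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)  # step scaled per parameter
    return param, m, v

# Toy objective f(x) = x^2 with gradient 2x (an assumption for illustration).
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 4):
    x, m, v = adam_step(x, 2 * x, m, v, t)
    print(t, round(x, 6))
```

On the first step the update is close to `lr` in magnitude whatever the size of the gradient, because the gradient is divided by the square root of its own squared average. Plain SGD, by contrast, would move by `lr * grad`.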

### Imports and Constants Declaration

In [1]:
import nltk
nltk.download('punkt')
import torch
import torch.nn as nn
from torchvision import transforms
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN
import math
from workspace_utils import active_session
import torch.utils.data as data
import numpy as np
import os
import requests
import time

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [1]:
## Selecting appropriate values for the variables below.
batch_size = 128          # batch size
vocab_threshold = 7        # minimum word count threshold
vocab_from_file = True     # if True, load existing vocab file
embed_size = 256           # dimensionality of image and word embeddings
hidden_size = 512          # number of features in hidden state of the RNN decoder
num_epochs = 3             # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 100          # determines window for printing average loss
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

# Defining the image transform for training.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Building data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)

# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initializing the encoder and decoder. 
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Moving models to GPU if CUDA is available. 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Defining the loss function. 
criterion = nn.CrossEntropyLoss().to(device)

# Specifying the learnable parameters of the model (the encoder's embedding layer and the whole decoder).
params = list(encoder.embed.parameters()) + list(decoder.parameters())

# Defining the optimizer.
optimizer = torch.optim.Adam(params, lr=0.001)

# Setting the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...
Done (t=0.93s)
creating index...


  0%|          | 748/414113 [00:00<01:56, 3547.97it/s]

index created!
Obtaining caption lengths...


100%|██████████| 414113/414113 [01:38<00:00, 4207.73it/s]


In [2]:
vocab_size

7525
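
The number of training steps per epoch computed at the end of the setup cell follows directly from the dataset size and the batch size. A quick check of that arithmetic, using the caption count shown in the progress bar above (414113) and the chosen `batch_size`; the variable names `num_captions` and `steps_per_epoch` are illustrative:

```python
import math

num_captions = 414113  # caption count reported by the progress bar
batch_size = 128       # value chosen in the setup cell
num_epochs = 3         # value chosen in the setup cell

# Round up so that the final, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(num_captions / batch_size)

print(steps_per_epoch)               # 3236
print(steps_per_epoch * num_epochs)  # 9708
```

The first number is the step total that appears in the `Step [.../3236]` counter of the training log.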

<a id='step2'></a>
## Step 2: Training the Model
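
The training loop logs the batch loss together with its perplexity, computed as `np.exp(loss.item())`. As a rough guide to reading those numbers, here is a sketch in plain Python of how cross-entropy and perplexity relate. The helper names `cross_entropy` and `perplexity` are illustrative, and the example scores a single prediction instead of a batch.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]

def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(mean_loss)

# A model that scores all 7525 vocabulary words equally is maximally unsure.
uniform_loss = cross_entropy([0.0] * 7525, target=0)
print(round(perplexity(uniform_loss)))  # 7525

# A loss of 3.6461, as logged at step 100 of epoch 1, gives a perplexity near 38.32.
print(perplexity(3.6461))
```

Read this way, a perplexity of about 38 means the model is, on average, about as uncertain as if it were choosing uniformly among 38 words, compared with 7525 for an untrained uniform guess.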

In [3]:
# Open the training log file.
f = open(log_file, 'w')

with active_session():
    
    for epoch in range(1, num_epochs+1):

        for i_step in range(1, total_step+1):

            # Randomly sample a caption length, and sample indices with that length.
            indices = data_loader.dataset.get_train_indices()
            
            # Create and assign a batch sampler to retrieve a batch with the sampled indices.
            new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
            data_loader.batch_sampler.sampler = new_sampler

            # Obtain the batch.
            images, captions = next(iter(data_loader))

            # Move batch of images and captions to GPU if CUDA is available.
            images = images.to(device)
            captions = captions.to(device)

            # Zero the gradients.
            decoder.zero_grad()
            encoder.zero_grad()

            # Pass the inputs through the CNN-RNN model.
            features = encoder(images)
            outputs = decoder(features, captions)

            # Calculate the batch loss.
            loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))

            # Backward pass.
            loss.backward()

            # Update the model parameters with the optimizer.
            optimizer.step()

            # Get training statistics.
            stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))

            # Print training statistics (on same line).
            print('\r' + stats, end="")
            sys.stdout.flush()

            # Print training statistics to file.
            f.write(stats + '\n')
            f.flush()

            # Print training statistics (on different line).
            if i_step % print_every == 0:
                print('\r' + stats)

        # Save the weights, creating the output directory first if it does not exist.
        if epoch % save_every == 0:
            os.makedirs('./models', exist_ok=True)
            torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
            torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()

Epoch [1/3], Step [100/3236], Loss: 3.6461, Perplexity: 38.3236
Epoch [1/3], Step [200/3236], Loss: 3.3239, Perplexity: 27.7690
Epoch [1/3], Step [300/3236], Loss: 3.2897, Perplexity: 26.8359
Epoch [1/3], Step [400/3236], Loss: 3.3381, Perplexity: 28.1667
Epoch [1/3], Step [500/3236], Loss: 3.1367, Perplexity: 23.0279
Epoch [1/3], Step [600/3236], Loss: 4.3863, Perplexity: 80.3440
Epoch [1/3], Step [700/3236], Loss: 2.8506, Perplexity: 17.2989
Epoch [1/3], Step [800/3236], Loss: 3.2354, Perplexity: 25.4160
Epoch [1/3], Step [900/3236], Loss: 3.0097, Perplexity: 20.2810
Epoch [1/3], Step [1000/3236], Loss: 2.5486, Perplexity: 12.7894
Epoch [1/3], Step [1100/3236], Loss: 2.6488, Perplexity: 14.1376
Epoch [1/3], Step [1200/3236], Loss: 2.6172, Perplexity: 13.6968
Epoch [1/3], Step [1300/3236], Loss: 3.2045, Perplexity: 24.6442
Epoch [1/3], Step [1400/3236], Loss: 2.3751, Perplexity: 10.7524
Epoch [1/3], Step [1500/3236], Loss: 2.3724, Perplexity: 10.7232
Epoch [1/3], Step [1600/3236], Los