# Computer Vision Nanodegree

## Project: Image Captioning

---

In this notebook, I demonstrate the training of my CNN-RNN model.  

<a id='step1'></a>
## Step 1: Training Setup

In this step of the notebook, I customize the training of my CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure.


### A Description of the Training Variables (initialized in the next cell):

- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to amend the model weights in each training step. 
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model.  We recommend that you set `num_epochs=3`, but feel free to increase or decrease this number as you wish.  [This paper](https://arxiv.org/pdf/1502.03044.pdf) trained a captioning model on a single state-of-the-art GPU for 3 days, but you'll soon see that you can get reasonable results in a matter of a few hours!  (_But of course, if you want your model to compete with current research, you will have to train for much longer._)
- `save_every` - determines how often to save the model weights.  We recommend that you set `save_every=1`, to save the model weights after each epoch.  This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.  Note that you **will not** observe a monotonic decrease in the loss function while training - this is perfectly fine and completely expected!  You are encouraged to keep this at its default value of `100` to avoid clogging the notebook, but feel free to change it.
- `log_file` - the name of the text file containing - for every step - how the loss and perplexity evolved during training.
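
To make the effect of `vocab_threshold` concrete, here is a minimal sketch that builds a vocabulary from a few toy captions. The `build_vocab` helper and the captions are hypothetical, written only for illustration. The sketch assumes the threshold is a minimum count (a word is kept if it occurs at least `vocab_threshold` times) and omits any special tokens the real vocabulary may add.

```python
from collections import Counter

def build_vocab(captions, vocab_threshold):
    """Keep every word that occurs at least `vocab_threshold` times."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, n in counts.items() if n >= vocab_threshold)

# Toy captions, invented for illustration only.
captions = [
    "a dog runs on the grass",
    "a dog sits on a couch",
    "a cat sleeps on the couch",
]

small_vocab = build_vocab(captions, vocab_threshold=3)  # only frequent words survive
large_vocab = build_vocab(captions, vocab_threshold=1)  # every word is kept
print(small_vocab)
print(len(large_vocab))
```

Raising the threshold shrinks the vocabulary, which in turn shrinks the decoder's embedding and output layers.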

### Question Prompts from the Udacity Instructor:

### Question 1

**Question:** Describe your CNN-RNN architecture in detail.  With this architecture in mind, how did you select the values of the variables in Task 1?  If you consulted a research paper detailing a successful implementation of an image captioning model, please provide the reference.

**My Answer:** 
For my CNN encoder I used the ResNet-50 architecture.  An overview of the ResNet architecture can be found <a href="https://cv-tricks.com/keras/understand-implement-resnets/">here</a>.

For my RNN decoder I created a model with an embedding layer, a 2-layer LSTM with a hidden size of 512, and a fully connected layer that produces a tensor of output scores over my vocabulary.  
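
As a rough sense of scale for the decoder described above, the sketch below counts its parameters by hand. It assumes the layers are a standard embedding table, LSTM layers with the parameter layout of `torch.nn.LSTM` (input weights, recurrent weights, and two bias vectors per gate), and a linear layer with a bias. The helper names are hypothetical, and `vocab_size=10_000` is a placeholder because the real vocabulary size is not printed in this notebook.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters in one LSTM layer: 4 gates, each with input weights,
    recurrent weights, and two bias vectors."""
    return 4 * hidden_size * (input_size + hidden_size + 2)

def decoder_params(vocab_size, embed_size=256, hidden_size=512, num_layers=2):
    """Parameter counts for embedding -> stacked LSTM -> linear output layer."""
    embedding = vocab_size * embed_size
    lstm = lstm_layer_params(embed_size, hidden_size)
    lstm += (num_layers - 1) * lstm_layer_params(hidden_size, hidden_size)
    linear = hidden_size * vocab_size + vocab_size  # weights + bias
    return {"embedding": embedding, "lstm": lstm, "linear": linear}

# Placeholder vocabulary size; the real value is len(data_loader.dataset.vocab).
sizes = decoder_params(vocab_size=10_000)
print(sizes, sum(sizes.values()))
```

Under these assumptions the two LSTM layers hold about 3.7 million parameters regardless of vocabulary size, while the embedding and output layers grow linearly with the vocabulary.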

For my mini-batch size I chose 256.  The video lecture in the course suggested that a larger batch size could help reduce the error caused by a relatively large learning rate.  During my first training run, with a batch size of 32, I noticed the loss increasing, so I stopped the run and lowered the learning rate.  A much smaller learning rate would have lengthened training, and I did not want to spend too much of my GPU time, so I compromised: I lowered the learning rate only slightly and increased the batch size from 32 to 256.

For `vocab_threshold` I chose 5 because that was the value used by the researchers in the paper "Show and Tell: A Neural Image Caption Generator."  

For `embed_size` I chose 256 because that was the value suggested by the lectures in the course.  The course lecture also suggested that the LSTM's hidden layer should be larger than the embedding layer.  The "Show and Tell" paper suggested a hidden layer size of 512, which I also used for `hidden_size`.

I chose to train my model for 3 epochs because, after trial and error, it became apparent that 3 epochs was all the time I could judiciously allot to the project.  Since the loss kept falling throughout, I saw no need to end the training before the end of its third epoch.

I set `vocab_from_file` to `True` because I had already run the cell that created the vocabulary in a previous notebook.  

Finally, I set `print_every` to 250 because, from trial and error, I did not find it useful to print a message more often than that.


### Question 2

**Question:** How did you select the transform in `transform_train`?  If you left the transform at its provided value, why do you think that it is a good choice for your CNN architecture?

**My Answer:** 
I left the transform Udacity provided unchanged because its settings seemed appropriate for the architecture I chose.  For example, the 224 x 224 crop matches the input size expected by the pre-trained ResNet CNN, and randomly flipping each image horizontally with probability 0.5 made sense to me as data augmentation.  Lastly, I kept the mean and standard deviation values passed to `transforms.Normalize` because, as the code comment notes, they normalize the image for the pre-trained model, and I had no reason to think I could choose anything more appropriate.  
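
To illustrate what the normalization step does numerically, the following sketch applies the same per-channel formula, `(value - mean) / std`, to single RGB pixels. It assumes pixel values have already been scaled to the range [0, 1], as `transforms.ToTensor()` does for standard 8-bit images. The `normalize_pixel` helper is hypothetical and only mirrors the arithmetic.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel means passed to transforms.Normalize
STD = (0.229, 0.224, 0.225)   # per-channel standard deviations

def normalize_pixel(rgb):
    """Apply (value - mean) / std to one RGB pixel already scaled to [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))

black = normalize_pixel((0.0, 0.0, 0.0))
white = normalize_pixel((1.0, 1.0, 1.0))
average = normalize_pixel(MEAN)  # a pixel equal to the channel means maps to zero
print(black, white, average)
```

After normalization each channel is roughly centered on zero with a spread of about one, which is the input distribution the pre-trained weights were fitted to.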

### Question 3

**Question:** How did you select the trainable parameters of your architecture?  Why do you think this is a good choice?

**My Answer:** 
I chose to train all the weights in my decoder and only the final fully connected (`embed`) layer of my CNN encoder.  I left the other weights in the CNN as they were because the model had already been trained on a robust dataset.  By training only the weights of the fully connected layer, I could adapt the ResNet features to the specific images of my dataset.  In my opinion, this choice balances accuracy against the risk of overfitting.  

I chose to train all the weights of my decoder because I was not beginning with a pre-trained model.  

### Question 4

**Question:** How did you select the optimizer used to train your model?

**My Answer:** 
I chose the Adam optimizer because, in the course lecture on hyperparameters, the instructor explained that Adam adapts the learning rate during training.  More precisely, Adam scales each parameter's step using running estimates of the mean and the mean square of its gradient, rather than following a fixed decay schedule.  Because I was not certain about the balance I had struck between my batch size and my learning rate, I felt comfortable relying a bit on Adam's adaptive step sizes to help guide my model toward a reasonable result.
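
The adaptive behaviour can be seen in a minimal, single-parameter sketch of the standard Adam update rule. This is not the PyTorch implementation. The helper names are hypothetical, the learning rate matches the `lr=0.0001` used in this notebook, and the remaining defaults (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the usual Adam defaults.

```python
import math

def adam_step(param, grad, state, lr=0.0001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad         # running mean of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = state["m"] / (1 - beta1 ** state["t"])               # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

def first_step_size(grad):
    """Size of the very first update for a parameter starting at 1.0."""
    state = {"t": 0, "m": 0.0, "v": 0.0}
    return abs(adam_step(1.0, grad, state) - 1.0)

small = first_step_size(0.01)   # tiny gradient
large = first_step_size(100.0)  # huge gradient
print(small, large)
```

Both first steps come out close to `lr` even though the gradients differ by four orders of magnitude, which is why Adam tends to be less sensitive to the scale of the gradients than plain SGD.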

In [3]:
import torch
import torch.nn as nn
from torchvision import transforms
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN
import math


## TODO #1: Select appropriate values for the Python variables below.
batch_size = 256          # batch size
vocab_threshold = 5        # minimum word count threshold
vocab_from_file = True    # if True, load existing vocab file
embed_size = 256           # dimensionality of image and word embeddings
hidden_size = 512          # number of features in hidden state of the RNN decoder
num_epochs = 3             # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 250          # determines window for printing average loss
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

# (Optional) TODO #2: Amend the image transform below.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Build data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)

# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the encoder and decoder. 
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available. 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Define the loss function. 
criterion = nn.CrossEntropyLoss().to(device)

# TODO #3: Specify the learnable parameters of the model.
params = list(decoder.parameters()) + list(encoder.embed.parameters())

# TODO #4: Define the optimizer.
optimizer = torch.optim.Adam(params, lr=0.0001)

# Set the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...
Done (t=0.91s)
creating index...
index created!
Obtaining caption lengths...


100%|██████████| 414113/414113 [01:36<00:00, 4300.88it/s]


## What's Happening in the Cell Above?

I can best explain the actions performed in the cell above by breaking its behavior down into 8 steps.

<u>Step 1</u>: I import the libraries required by the functions called in the cell.  The `data_loader` and `model` modules are ones I created; their files are included in this repository.  

<u>Step 2</u>: I set the values for the variables Udacity defined in the training setup.  I determined values for batch_size, vocab_threshold, embed_size, and hidden_size based on trial-and-error informed by the literature provided by Udacity in its lecture modules.

<u>Step 3</u>: I define a training transform for the training dataset using the `transforms` module from torchvision.  

<u>Step 4</u>: I build the data loader by calling `get_loader`, which I define in the `data_loader.py` file included in the repository.  In addition to the variables defined in Step 2, `get_loader` takes the training transform defined in Step 3 as a parameter.  The training transform is the last preprocessing applied to each training image before the data loader passes the batch of images and captions to the model for training. 

<u>Step 5</u>: I capture the size of the vocabulary built from the dataset's captions and store it in a variable called `vocab_size`.  Its value is the length of the `vocab` field built into the `CoCoDataset`.  The dataset, in turn, calls the Vocabulary constructor, which I have included in the <strong>vocabulary.py</strong> file in the repository for this project. Among other parameters, the Vocabulary constructor uses the value 5 that I assigned to <em>"vocab_threshold"</em> in Step 2 as the minimum word count, creating a vocabulary from all words that appear in the dataset's caption set at least 5 times.

<u>Step 6</u>: I initialize and store the Encoder and Decoder neural networks from the <strong>model.py</strong> file included in the repository.

<u>Step 7</u>: I move the Encoder and Decoder neural networks to the GPU when CUDA is available; otherwise they stay on the CPU.

<u>Step 8</u>: I define the loss function and the optimizer used to train my networks, and define the learnable (updatable) parameters as all the parameters of the RNN Decoder plus only the final `embed` layer of the Encoder CNN.
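
The last line of the cell also sets `total_step`, the number of batches per epoch. As a quick check of that arithmetic, the sketch below assumes the number of captions equals the 414113 shown in the progress bar printed while the data loader obtained the caption lengths.

```python
import math

num_captions = 414113  # total shown in the "Obtaining caption lengths" progress bar
batch_size = 256

# Round up so that a final, partially filled batch still counts as a step.
total_step = math.ceil(num_captions / batch_size)
print(total_step)  # 1618
```

With a batch size of 32 the same calculation would give 12942 steps per epoch, so the larger batch size also means far fewer weight updates per epoch.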

<a id='step2'></a>
## Step 2: Training My Model


### A Note from Udacity on Training Time:

For this project, you need not worry about overfitting. **This project does not have strict requirements regarding the performance of your model**, and you just need to demonstrate that your model has learned **_something_** when you generate captions on the test data.  For now, we strongly encourage you to train your model for the suggested 3 epochs without worrying about performance; then, you should immediately transition to the next notebook in the sequence to see how your model performs on the test data.

In [9]:
import torch.utils.data as data
import numpy as np
import os
import requests
import time

# Open the training log file.
f = open(log_file, 'w')

# --START-- Udacity-specific code to maintain GPU-connection during training

old_time = time.time()
response = requests.request("GET", 
                            "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token", 
                            headers={"Metadata-Flavor":"Google"})

for epoch in range(1, num_epochs+1):
    
    for i_step in range(1, total_step+1):
        

        if time.time() - old_time > 60:
            old_time = time.time()
            requests.request("POST", 
                             "https://nebula.udacity.com/api/v1/remote/keep-alive", 
                             headers={'Authorization': "STAR " + response.text})
            
# --END-- Udacity-specific code to maintain GPU-connection
        
        # Randomly sample a caption length, and sample indices with that length.
        indices = data_loader.dataset.get_train_indices()
        
        # Create and assign a batch sampler to retrieve a batch with the sampled indices.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler
        
        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)
        
        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()
        
        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions)
        
        # Calculate the batch loss.
        loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
        
        # Backward pass.
        loss.backward()
        
        # Update the parameters in the optimizer.
        optimizer.step()
            
        # Get training statistics.
        stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))
        
        # Print training statistics (on same line).
        print('\r' + stats, end="")
        sys.stdout.flush()
        
        # Print training statistics to file.
        f.write(stats + '\n')
        f.flush()
        
        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print('\r' + stats)
            
    # Save the weights.
    if epoch % save_every == 0:
        torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
        torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()

Epoch [1/3], Step [250/1618], Loss: 4.7381, Perplexity: 114.2201
Epoch [1/3], Step [500/1618], Loss: 4.3030, Perplexity: 73.92104
Epoch [1/3], Step [750/1618], Loss: 4.0651, Perplexity: 58.27357
Epoch [1/3], Step [1000/1618], Loss: 3.8360, Perplexity: 46.3381
Epoch [1/3], Step [1250/1618], Loss: 3.7303, Perplexity: 41.69293
Epoch [1/3], Step [1500/1618], Loss: 3.9844, Perplexity: 53.75150
Epoch [2/3], Step [250/1618], Loss: 3.6035, Perplexity: 36.72717
Epoch [2/3], Step [500/1618], Loss: 3.3509, Perplexity: 28.5270
Epoch [2/3], Step [750/1618], Loss: 3.7475, Perplexity: 42.4135
Epoch [2/3], Step [1000/1618], Loss: 3.1696, Perplexity: 23.7969
Epoch [2/3], Step [1250/1618], Loss: 3.0535, Perplexity: 21.1900
Epoch [2/3], Step [1500/1618], Loss: 3.7397, Perplexity: 42.0852
Epoch [3/3], Step [250/1618], Loss: 3.1190, Perplexity: 22.62334
Epoch [3/3], Step [500/1618], Loss: 2.8316, Perplexity: 16.9732
Epoch [3/3], Step [750/1618], Loss: 3.2204, Perplexity: 25.03803
Epoch [3/3], Step [1000/16
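
In the training loop above, each batch is drawn by first sampling a caption length and then sampling indices of captions with that length, so every caption in a batch has the same number of tokens and the batch can be stacked without padding. The real logic lives in `get_train_indices` in `data_loader.py`, which is not shown in this notebook. The function below is only a plain-Python sketch of the idea, and details such as sampling with replacement are assumptions.

```python
import random
from collections import defaultdict

def get_train_indices(caption_lengths, batch_size, rng=random):
    """Pick one caption length at random, then draw `batch_size` indices
    (with replacement) from the captions that have exactly that length."""
    by_length = defaultdict(list)
    for index, length in enumerate(caption_lengths):
        by_length[length].append(index)
    # Choosing a random caption (not a random distinct length) makes common
    # lengths proportionally more likely to be selected.
    chosen_length = rng.choice(caption_lengths)
    return [rng.choice(by_length[chosen_length]) for _ in range(batch_size)]

# Toy caption lengths, invented for illustration.
caption_lengths = [10, 12, 10, 11, 12, 10, 13, 11]
indices = get_train_indices(caption_lengths, batch_size=4, rng=random.Random(0))
print(indices, [caption_lengths[i] for i in indices])
```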

## Results:
After only 3 training epochs, my model's loss has decreased by more than 45%.  While a loss of 2.5 is still too high for an adequate machine learning model, the evidence that my model is in fact learning is enough to fulfill the criteria of the assignment.  Comparing the results of my model with others that are publicly available would require me to train the model for several hours on a GPU, resources I do not currently have.  
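
For readers who want to reproduce this kind of figure from `training_log.txt`, here is a minimal sketch that parses lines in the logged format, checks that perplexity is the exponential of the loss, and computes the relative decrease in loss. The two sample lines are copied from the output printed above, and the `parse` helper is hypothetical.

```python
import math
import re

# Two lines copied from the output printed during training.
log_lines = [
    "Epoch [1/3], Step [250/1618], Loss: 4.7381, Perplexity: 114.2201",
    "Epoch [3/3], Step [500/1618], Loss: 2.8316, Perplexity: 16.9732",
]

pattern = re.compile(r"Loss: ([0-9.]+), Perplexity: ([0-9.]+)")

def parse(line):
    """Return (loss, perplexity) from one line in the training-log format."""
    loss, perplexity = pattern.search(line).groups()
    return float(loss), float(perplexity)

records = [parse(line) for line in log_lines]

# Perplexity is exp(loss), so the two columns carry the same information.
for loss, perplexity in records:
    assert math.isclose(math.exp(loss), perplexity, rel_tol=1e-3)

first_loss, last_loss = records[0][0], records[-1][0]
relative_decrease = (first_loss - last_loss) / first_loss
print(f"{relative_decrease:.1%}")
```

These two printed lines alone show a decrease of roughly 40% by step 500 of the third epoch. The printed output is cut off before the end of training, so the final figure has to be read from the full log file.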

In the <strong>next notebook</strong> I will test my CNN-RNN model by tasking it with creating its own captions for images it has not seen.