To run this notebook, you need **Python 3.6+** and the following packages (**need to update**):
```
1. pytorch 1.2
2. torchvision
3. numpy
4. matplotlib
5. nltk
```
To install PyTorch, follow the instructions on the [official website](https://pytorch.org/). The [official documentation](https://pytorch.org/docs/stable/) is also helpful when you need to look up specific functionality.


# Image Captioning Using Encoder-Decoder Architecture

In short, the encoder takes the image as input and encodes it into a vector of feature values. The decoder receives this encoder output as its hidden state and predicts the next word at each step. The following figure illustrates this:

<img src="figs/image_captioning_overview.jpg" width="600">
Figure 1. An overview of the encoder-decoder architecture
(image credit: <a href="https://link.springer.com/chapter/10.1007/978-3-030-04780-1_23">Deep Neural Network Based Image Captioning</a>)

You will use a pre-trained CNN as the encoder and a vanilla RNN or an LSTM as the decoder to predict the captions.

## How to download the data (Google Colab)
Step 1: Register a Kaggle account.  https://www.kaggle.com/

Step 2: Download your kaggle.json file from  https://www.kaggle.com/Your_Username/account. In the API section, click Create New API Token.

Step 3: As we did before, upload all files to Google Drive and open Google Colab.

Step 4: Install required packages.
    
    ! pip install -q kaggle nltk

Step 5: Insert a cell.
    
    
    from google.colab import files
    files.upload()
    
    
Then upload the `kaggle.json` file you just downloaded.
    
Step 6: Move `kaggle.json` to the right place.
    
    
     ! mkdir ~/.kaggle
     ! cp kaggle.json ~/.kaggle/
    

Step 7: Change the permission.
    
    ! chmod 600 ~/.kaggle/kaggle.json

Step 8: Download.
    
    !kaggle datasets download hsankesara/flickr-image-dataset

Step 9: Move it to your drive and unzip it.
    
    ! unzip flickr-image-dataset.zip -x "flickr30k_images/flickr30k_images/flickr30k_images/*.jpg" -d "/path-to-Assignment_4/Assignment_4/"
    
Step 10: Move "dataset_flickr30k.json" to "flickr30k_images" folder.

### Colab Setup: 
- Below are some basic steps for Colab setup.
- Adjust them to your requirements.
- Comment them out if you run on ARC or on a local machine with a powerful GPU.

**Note: For Google Colab, set the correct paths in this notebook and, if required, in dataloader.py.**

In [None]:
#! pip install -q kaggle nltk

In [None]:
#from google.colab import files
#files.upload()

In [None]:
 #! mkdir ~/.kaggle
# ! cp kaggle.json ~/.kaggle/

In [None]:
#! chmod 600 ~/.kaggle/kaggle.json

In [None]:
#!kaggle datasets download hsankesara/flickr-image-dataset

In [None]:
#!unzip flickr-image-dataset.zip -x "flickr30k_images/flickr30k_images/flickr30k_images/*.jpg" -d "/content/drive/My Drive/Assignment_4/"

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import sys
# modify `path_to_homework` to the Drive folder where you uploaded your homework
path_to_homework = "/content/drive/My Drive/Assignment_4/"
sys.path.append(path_to_homework)

In [None]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

In [None]:
# import necessary packages and modules
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision import transforms
from torch.nn.utils.rnn import pack_padded_sequence
from torch.autograd import Variable
import numpy as np
import matplotlib.pyplot as plt
import os

In [None]:
from dataloader import Flickr30k, get_loader

# Section 1.2 Take a look at the data

In [None]:
# visualize images and captions
flickr = Flickr30k(split='val', root=path_to_homework+'flickr30k_images/')  # load validation set as an example
flickr()

-------flickr30k--------
image root: /content/drive/My Drive/Assignment_4/flickr30k_images/flickr30k_images
dataset split: val
the length of the dataset: 1014


# Section 1.3 Build vocabulary
We need to build a vocabulary for our dataset. The vocabulary stores all the words and their indices. We will use it to embed and recover the words.
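
As a rough illustration of the idea (a toy sketch, not the notebook's `Vocabulary` class), the snippet below builds a word-to-index table from three made-up captions, drops words seen fewer than `threshold` times, and maps unknown words to `<unk>`. The captions, the threshold of 2, and the helper name `encode` are assumptions for illustration; whitespace splitting stands in for `nltk.tokenize.word_tokenize`, and wrapping each caption in `<start>` and `<end>` is assumed as well.

```python
from collections import Counter

# Toy captions (made up for illustration).
captions = ["a dog runs", "a dog jumps", "a cat runs"]
threshold = 2  # keep words that occur at least this many times

# Count tokens; str.split() stands in for a real tokenizer.
counter = Counter(tok for cap in captions for tok in cap.lower().split())

# Reserve the first four indices for the special tokens.
word2idx = {'<pad>': 0, '<unk>': 1, '<start>': 2, '<end>': 3}
for word, cnt in counter.items():
    if cnt >= threshold:
        word2idx[word] = len(word2idx)
idx2word = {i: w for w, i in word2idx.items()}

def encode(caption):
    """Map a caption to indices, wrapping it in <start> ... <end>."""
    toks = caption.lower().split()
    body = [word2idx.get(t, word2idx['<unk>']) for t in toks]
    return [word2idx['<start>']] + body + [word2idx['<end>']]

ids = encode("a cat jumps")
print(ids)
print([idx2word[i] for i in ids])
```

Words below the threshold ("cat" and "jumps" here) come back as `<unk>` when the indices are mapped to words again, which is the price of keeping the vocabulary small.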

In [None]:
import nltk
import pickle
import json
from tqdm import tqdm
from collections import Counter
nltk.download('punkt') # You can comment this line once you've downloaded 'punkt'

class Vocabulary(object):
    """Simple vocabulary wrapper."""
    def __init__(self):
        self.word2idx = {'<pad>': 0, '<unk>': 1, '<start>': 2, '<end>': 3}  # follow Pytorch padding rules: pad sentence with zero.
        self.idx = 4
        self.idx2word = {v: k for k, v in self.word2idx.items()}

    def __call__(self, key):
        if key not in self.word2idx:
            return self.word2idx['<unk>']
        return self.word2idx[key]

    def __len__(self):
        return len(self.word2idx)

    def add_word(self, word):
        """
        Add new words
        :param word: word
        """
        if word not in self.word2idx:
            self.word2idx[word] = self.idx  # add a new word
            self.idx2word[self.idx] = word
            self.idx += 1

    def reverse(self, value):
        """
        From idx to words.
        :param value: index
        :return:
        """
        if value not in self.idx2word:
            return self.idx2word[1]  # return '<unk>' if the word is unseen before.
        return self.idx2word[value]

def build_vocab(json_file=path_to_homework+ '/flickr30k_images/dataset_flickr30k.json', threshold=3):
    with open(json_file) as f:  # the context manager closes the file on exit
        data = json.load(f)
    counter = Counter()
    for img_idx in tqdm(range(len(data['images']))):
        img_annos = data['images'][img_idx]
        for sent_idx in range(len(img_annos['sentids'])):
#             tokens = img_annos['sentences'][sent_idx]['tokens']  # directly load tokens

            caption = img_annos['sentences'][sent_idx]['raw']
            tokens = nltk.tokenize.word_tokenize(caption.lower())
            
            counter.update(tokens)

    # Keep only the words that occur at least `threshold` times.
    words = [word for word, cnt in counter.items() if cnt >= threshold]


    # create a Vocabulary class
    vocab = Vocabulary()

    # add words to Vocab
    for word in words:
        vocab.add_word(word)

    return vocab

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# let's create a vocabulary for future usage
vocab_path = path_to_homework + '/flickr30k_images/vocab.pkl'
if not os.path.isfile(vocab_path):  # if we don't have vocab, create one
    vocab = build_vocab(json_file=path_to_homework + '/flickr30k_images/dataset_flickr30k.json', threshold=3)
    with open(vocab_path, 'wb') as f:
        pickle.dump(vocab, f)
    print("Total vocabulary size: {}".format(len(vocab)))
    print("Saved the vocabulary wrapper to '{}'".format(vocab_path))
else:  # if we have, load the existing vocab
    with open(vocab_path, 'rb') as f:
        vocab = pickle.load(f)
    print('vocab loaded!')
    print('the size of vocab:', len(vocab))

vocab loaded!
the size of vocab: 9991


In [None]:
vocab_path = path_to_homework + '/flickr30k_images/vocab.pkl'
with open(vocab_path, 'rb') as f:
    vocab = pickle.load(f)
print('vocab loaded!')
print('the size of vocab:', len(vocab))
# print(vocab.word2idx.keys())
# print(vocab.idx2word)

# check some random words
for i in range(3):
    random_idx = np.random.randint(len(vocab))
    print('word: {}, index: {}'.format(list(vocab.word2idx.keys())[random_idx], vocab(list(vocab.word2idx.keys())[random_idx])))

vocab loaded!
the size of vocab: 9991
word: game, index: 646
word: gasoline, index: 9105
word: command, index: 9753


# Section 2 Vanilla RNN [45 pts]
# Section 2.1 Design the Network: Encoder [5 pts]
Implement the baseline model using a pre-trained ResNet-50 as the encoder and a vanilla RNN as the decoder. Note that we will remove the last layer (the fc layer) of ResNet-50 and add a trainable linear layer to fine-tune it for our task. During training, we will **freeze** all layers before the fc layer. The encoder should output a feature vector of a fixed size for each image.

In [None]:
class Encoder(nn.Module):
    def __init__(self, emb_dim):
        """
        Use ResNet-50 as encoder.
        :param emb_dim: output size of ResNet-50.
        """
        super(Encoder, self).__init__()
        self.resnet = torchvision.models.resnet50(pretrained=True)
        ###########Your code###############
        # freeze the parameters
        for param in self.resnet.parameters():
            param.requires_grad_(False)
        
        # replace the last layer (fc layer) with a trainable layer for finetuning
        #modules = list(self.resnet.children())[:-1]
        #self.resnet = nn.Sequential(*modules)
        #self.embed = nn.Linear(resnet.fc.in_features, embed_size)
        self.resnet.fc = nn.Linear(self.resnet.fc.in_features, emb_dim)

        

    def forward(self, x):
        
        x = self.resnet(x)  # output shape: [N, emb_dim]

        return x

# Section 2.2 Design the Network: Decoder [10 pts]
During decoding, we will train an RNN (https://pytorch.org/docs/stable/generated/torch.nn.RNN.html#torch.nn.RNN) to learn the structure of the caption text through "**Teacher Forcing**". Teacher forcing uses the teaching signal from the training dataset at the current time step, $target(t)$, as the input at the next time step, $x(t+1) = target(t)$, rather than the output $y(t)$ generated by the network.

As shown in Figure 1 above, at each step the RNN takes the *current feature* and the hidden state ($h_0$) as inputs; an LSTM additionally takes a cell state ($c_0$). For the first step, the *current feature* should be the output of the encoder, from which the network predicts the '\<start\>' word, and the hidden state should be set to None. In the second step, '\<start\>' is passed into the RNN as the input, and so on.
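
The alignment of inputs and targets under teacher forcing can be sketched in plain Python. Here `'<image feature>'` is a placeholder string standing in for the encoder output, and the caption tokens are made up; the point is only that the input at step $t+1$ is the ground-truth token from step $t$, not the model's own prediction.

```python
# Ground-truth caption, including the special tokens (toy example).
target = ['<start>', 'a', 'dog', 'runs', '<end>']

# Teacher forcing: step 0 sees the image feature, and every later step
# sees the ground-truth token of the previous step.
inputs = ['<image feature>'] + target[:-1]

for t, (x, y) in enumerate(zip(inputs, target)):
    print('step {}: input={!r:18} target={!r}'.format(t, x, y))
```

Inputs and targets have the same length, so the loss can compare one prediction with one target word at every step.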

To use '\<start\>' or any subsequent word as the current feature, get its index from the vocabulary you created, convert it to a one-hot vector, and pass it through a linear layer to embed it into a feature (or take advantage of PyTorch's `nn.Embedding`, which performs the one-hot encoding and the linear layer for you).
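
The equivalence between "one-hot vector times a weight matrix" and a table lookup can be checked with a few lines of plain Python. The 4x3 weight matrix below is made up, and the helpers `one_hot` and `matvec` are hypothetical; the sketch only illustrates the idea and does not reproduce the internals of `nn.Embedding`.

```python
# Toy embedding table: 4 vocabulary entries, 3 dimensions each (made-up values).
W = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2]]

def one_hot(index, size):
    """Return a one-hot list with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vec, mat):
    """Multiply a row vector by a matrix: result[j] = sum_i vec[i] * mat[i][j]."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

word_idx = 2
via_one_hot = matvec(one_hot(word_idx, len(W)), W)  # one-hot encoding + linear layer (no bias)
via_lookup = W[word_idx]                            # direct row lookup
print(via_one_hot, via_lookup)
```

Both routes select row `word_idx` of the table, which is why a lookup can replace the explicit one-hot multiplication.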

For convenience, you might want to pad the captions in a mini-batch to a fixed length. You can then use the `pack_padded_sequence` function so that the RNN skips the padded positions.
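
What packing does to a padded batch can be sketched without PyTorch. The function below is a hypothetical pure-Python stand-in, not `pack_padded_sequence` itself: it assumes the batch is sorted by decreasing length and walks through it time step by time step, keeping only the valid (non-padded) entries. Applying the same ordering to inputs and targets is what lets the flattened outputs and the flattened targets line up for the loss.

```python
def pack_time_major(padded, lengths):
    """Flatten a padded batch step by step, skipping padded positions.

    padded:  list of equal-length sequences, sorted by decreasing valid length
    lengths: valid length of each sequence
    """
    assert lengths == sorted(lengths, reverse=True), "batch must be sorted by length"
    packed = []
    for t in range(max(lengths)):
        for seq, n in zip(padded, lengths):
            if t < n:  # position t is valid for this sequence
                packed.append(seq[t])
    return packed

PAD = 0
batch = [[2, 5, 6, 3],      # valid length 4
         [2, 7, 3, PAD],    # valid length 3
         [2, 3, PAD, PAD]]  # valid length 2
packed = pack_time_major(batch, [4, 3, 2])
print(packed)
```

The packed list has `sum(lengths)` entries, so no computation or loss is spent on padding.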

In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_layers=1, dropout=0):
        """
        Use RNN as decoder for captions.
        :param emb_dim: Embedding dimensions.
        :param hidden_dim: Hidden states dimensions.
        :param num_layers: Number of RNN layers.
        :param vocab_size: The size of Vocabulary.
        :param dropout: the probability for dropout.
        """
        super(Decoder, self).__init__()
        self.max_length = 30  # the maximum length of a sentence, in case it's trapped
        
        #############Your code############
        # you need to implement a Vanilla RNN for the decoder. Take a look at the official documentation.
        # https://pytorch.org/docs/stable/generated/torch.nn.RNN.html#torch.nn.RNN
        
        # one-hot encoding + linear layer
        self.embedding_layer = nn.Embedding(vocab_size, emb_dim)
        
        # vanilla rnn network
        self.rnn = nn.RNN(input_size = emb_dim,hidden_size = hidden_dim,
                            num_layers = num_layers, batch_first = True)
        
        
        # output layer
        self.linear = nn.Linear(hidden_dim, vocab_size)
        

    def forward(self, encode_features, captions, lengths):
        """
        Feed forward to generate captions. Note that you need to pad the input so they have the same length
        :param encode_features: output of encoder, size [N, emb_dim]
        :param captions: captions, size [N, max(lengths)]
        :param lengths: a list indicating valid length for each caption. size is (batch_size).
        """
        #############Your Code###################
        # compute the embedding using one-hot technique and linear function
        embed = self.embedding_layer(captions)
        # concatenate the encoded features from encoder and embeddings
        embed = torch.cat((encode_features.unsqueeze(1), embed), dim = 1)
        packed_input = pack_padded_sequence(embed, lengths, batch_first=True)
                
        # feed into RNN; the result is a PackedSequence
        hiddens, _ = self.rnn(packed_input)

        # output layer, applied to the flattened data tensor of the PackedSequence
        outputs = self.linear(hiddens[0])

        return outputs

# Encoder-decoder [10 pts]
Now we need to put our encoder and decoder together. 

In the sample_generate stage, the idea is to “let the network run on its own”, predicting the next word, and then use the network’s prediction to obtain the next input word. There are at least two ways to obtain the next word.

- **Deterministic**: Take the maximum output at each step.
- **Stochastic**: Sample from the probability distribution. To get the distribution, we compute the temperature-scaled softmax of the outputs: $y^j = \exp(o^j/\tau) / \sum_n \exp(o^n/\tau)$, where $o^j$ is the $j$-th output of the last layer, the sum over $n$ runs over the whole vocabulary, and $\tau$ is the so-called "temperature". By doing this, you should get a different caption each time.
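
The two decoding modes can be sketched in plain Python. The logits below are made up, the helper name `softmax_with_temperature` is hypothetical, and `random.choices` stands in for `torch.multinomial`. A small $\tau$ sharpens the distribution toward the arg-max, while a large $\tau$ flattens it toward uniform.

```python
import math
import random

def softmax_with_temperature(logits, tau=1.0):
    """Return exp(o_j / tau) / sum_n exp(o_n / tau) for each logit o_j."""
    scaled = [o / tau for o in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-word vocabulary

# Deterministic: pick the index of the largest logit.
greedy = max(range(len(logits)), key=lambda j: logits[j])

# Stochastic: sample an index from the temperature-scaled distribution.
random.seed(0)
for tau in (0.1, 1.0, 5.0):
    probs = softmax_with_temperature(logits, tau)
    sampled = random.choices(range(len(logits)), weights=probs, k=1)[0]
    print('tau={}: probs={}, sampled={}'.format(tau, [round(p, 3) for p in probs], sampled))
```

Because the arg-max does not depend on $\tau$, the temperature only matters in the stochastic mode.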

In [None]:
class Vanilla_rnn(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_layers=1, dropout=0):
        """
        Encoder-decoder vanilla RNN.
        :param vocab_size: the size of Vocabulary.
        :param emb_dim: the dimensions of word embedding.
        :param hidden_dim: the dimensions of hidden units.
        :param num_layers: the number of RNN layers.
        :param dropout: dropout probability
        """
        super(Vanilla_rnn, self).__init__()
        #########Your Code################
        # Encoder: ResNet-50
        self.Encoder= Encoder(emb_dim)

        # Decoder: RNN
        self.Decoder = Decoder(vocab_size, emb_dim, hidden_dim, num_layers=num_layers, dropout=dropout)
        self.max_length = self.Decoder.max_length

        

    def forward(self, x, captions, lengths):
        """
        Feed forward.
        :param x: Images, [N, 3, H, W]
        :param captions: encoded captions, [N, max(lengths)]
        :param lengths: a list indicating valid length for each caption. length is (batch_size).
        :return: output logits, usually followed by a softmax layer.
        """
        ##########Your code###################
        # forward passing
        features = self.Encoder(x)  # [N, emb_dim]
        x = self.Decoder(features, captions, lengths)

        return x

    def sample_generate(self, x, states=None, mode='Deterministic', temperature=5.0):
        """
        Generate samples during the evaluation.
        
        :param x: input image
        :param states: rnn states
        :param mode: which mode we use.  
         - 'Deterministic': Take the maximum output at each step.
         - 'Stochastic': Sample from the probability distribution from the output layer.
        :param temperature: will be used in the stochastic mode
        :return: sample_idxs. Word indices. We can use vocab to recover the sentence later.
        """
        sample_idxs = []  # record the index of your generated words
        #################Your Code##################
        # compute the encoded features
        features = self.Encoder(x)
        inputs = features.unsqueeze(1)
        # decide which mode we use
        if mode == 'Deterministic':
            for i in range(self.max_length):
                hiddens, states = self.Decoder.rnn(inputs, states)
                outputs = self.Decoder.linear(hiddens.squeeze(1))
                # take the index of the maximum logit (the same arg-max as after the softmax)
                _, predicted = outputs.max(1)                        # predicted: (batch_size)
                sample_idxs.append(predicted)
                inputs = self.Decoder.embedding_layer(predicted)
                inputs = inputs.unsqueeze(1)
            sample_idxs = torch.stack(sample_idxs, dim=1)

        elif mode == 'Stochastic':
            for i in range(self.max_length):
                hiddens, states = self.Decoder.rnn(inputs, states)
                outputs = self.Decoder.linear(hiddens.squeeze(1))

                # sample from the probability distribution after the softmax
                # Hint: use torch.multinomial() to sample from a distribution.
                probabilities = F.softmax(outputs.div(temperature), dim=1)
                predicted = torch.multinomial(probabilities, 1)

                sample_idxs.append(predicted[:, 0])
                inputs = self.Decoder.embedding_layer(predicted[:, 0])  # inputs: (batch_size, emb_dim)
                inputs = inputs.unsqueeze(1)                            # inputs: (batch_size, 1, emb_dim)
            sample_idxs = torch.stack(sample_idxs, 1)                   # sample_idxs: (batch_size, max_length)

        return sample_idxs

# Section 2.3 Training [10 pts]
Train your encoder-decoder. You might also want to check the output sentence every epoch.

In [None]:
# some hyperparameters, you can change them
## training parameters
batch_size = 256
lr = 1e-2
num_epochs = 50
weight_decay = 0.0
log_step = 50

## network architecture
emb_dim = 1024
hidden_dim = 256
num_layers = 1 # number of RNN layers
dropout = 0.0

## image transformation
transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        #     transforms.RandomCrop(224, pad_if_needed=True),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])

## Output directory
output_dir = path_to_homework + '/checkpoints/rnn/'
os.makedirs(output_dir, exist_ok=True)

## device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# Validation code here. We will use this during training.
def val(model, data_loader, vocab):
    """
    Inputs:
    :param model: the encoder-decoder network.
    :param data_loader: validation data loader
    :param vocab: pre-built vocabulary
    Output:
    the mean value of validation losses
    """
    print('Validating...')

    criterion = nn.CrossEntropyLoss()  # CE loss
    
    val_loss = []
    total_step = len(data_loader)
    for itr, (images, captions, lengths) in enumerate(data_loader):
        #######Your Code#########
        # forward inputs and compute the validation loss
        images = images.to(device)
        captions = captions.to(device)
        targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]

        with torch.no_grad():  # gradients are not needed for validation
            outputs = model(images, captions, lengths)
            loss = criterion(outputs, targets)

        # record the validation loss
        val_loss.append(loss.item())
        
        # Print current loss
        if itr % log_step == 0:
            print('Step [{}/{}], Loss: {:.4f}, Perplexity: {:5.4f}'
                  .format(itr, total_step, loss.item(), np.exp(loss.item())))
    
    # (optional) you might also want to print out the sentence to see the qualitative performance of your model. 
    # You can use deterministic mode to generate sentences
    

    return np.mean(val_loss)


In [None]:
print(device)

cuda


In [None]:
# Training code here


train_data_loader = get_loader(root=path_to_homework + 'flickr30k_images/', split='train', vocab=vocab, 
                               transform=transform, batch_size=batch_size, shuffle=True, num_workers=8)
val_data_loader = get_loader(root=path_to_homework + 'flickr30k_images/', split='val', vocab=vocab, 
                               transform=transform, batch_size=8, shuffle=True, num_workers=8)

model = Vanilla_rnn(vocab_size=len(vocab), emb_dim=emb_dim, hidden_dim=hidden_dim, 
                   num_layers=1, dropout=dropout).to(device)  # build a model

# loss and optimizer
criterion = nn.CrossEntropyLoss()  # CE loss
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)  # optimizer
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 
                                      step_size=5,
                                      gamma=0.5)  # decay LR by a factor of 0.5 every 5 epochs (step_size). You can change this

# logs
Train_Losses = []  # record average training loss each epoch
Val_Losses = []   # record average validation loss each epoch
total_step = len(train_data_loader)  # number of iterations each epoch
best_val_loss = np.inf

# start training
print('Start training...')
import time
tic = time.time()
for epoch in range(num_epochs):
    print('Switch to training...')
    model.train()
    Train_loss_iter = []  # record the training loss of each iteration
    for itr, (images, captions, lengths) in enumerate(train_data_loader):
        ########Your Code###########
        # train your model
        images = Variable(images).to(device)
        captions = Variable(captions).to(device)
        targets = Variable(pack_padded_sequence(captions, lengths, batch_first=True)[0]).to(device)

        optimizer.zero_grad()

        outputs = model(images, captions, lengths)
        loss = criterion(outputs, targets)

        loss.backward()  
        optimizer.step()

        # record the training loss
        Train_loss_iter.append(loss.item())

        # print log info
        if itr % log_step == 0:
            # print current loss and perplexity
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Perplexity: {:5.4f}'
                      .format(epoch, num_epochs, itr, total_step, loss.item(), np.exp(loss.item())))
    scheduler.step()
    Train_Losses.append(np.mean(Train_loss_iter))
    np.save(os.path.join(output_dir, 'TrainingLoss_rnn.npy'), Train_Losses)  # save the training loss
    
    model.eval()
    # (optional) generate a sample during the training, you can use deterministic mode
    # Your code
    
    
    # validation
    Val_Losses.append(val(model, val_data_loader, vocab))
    np.save(os.path.join(output_dir, 'ValLoss_rnn.npy'), Val_Losses) # save the val loss
    
    # save model
    if Val_Losses[-1] < best_val_loss:
        best_val_loss = Val_Losses[-1]
        print('updated best val loss:', best_val_loss)
        print('Save model weights to...', output_dir)
        torch.save(model.state_dict(), 
                   os.path.join(output_dir, 'vanilla_rnn-best.pth'))

print('It took: {} s'.format(time.time() - tic))

Start training...
Switch to training...
Epoch [0/50], Step [0/114], Loss: 9.2886, Perplexity: 10814.3532
Epoch [0/50], Step [50/114], Loss: 3.8253, Perplexity: 45.8447
Epoch [0/50], Step [100/114], Loss: 3.6685, Perplexity: 39.1919
Validating...




Step [0/127], Loss: 3.6928, Perplexity: 40.1580
Step [50/127], Loss: 3.3617, Perplexity: 28.8396
Step [100/127], Loss: 3.7779, Perplexity: 43.7222
updated best val loss: 3.6196856
Save model weights to... /content/drive/My Drive/Assignment_4//checkpoints/rnn/
Switch to training...
Epoch [1/50], Step [0/114], Loss: 3.5814, Perplexity: 35.9252
Epoch [1/50], Step [50/114], Loss: 3.5934, Perplexity: 36.3577
Epoch [1/50], Step [100/114], Loss: 3.5670, Perplexity: 35.4116
Validating...
Step [0/127], Loss: 3.7276, Perplexity: 41.5784
Step [50/127], Loss: 3.3852, Perplexity: 29.5244
Step [100/127], Loss: 3.7080, Perplexity: 40.7721
updated best val loss: 3.5572746
Save model weights to... /content/drive/My Drive/Assignment_4//checkpoints/rnn/
Switch to training...
Epoch [2/50], Step [0/114], Loss: 3.4455, Perplexity: 31.3584
Epoch [2/50], Step [50/114], Loss: 3.4773, Perplexity: 32.3723
Epoch [2/50], Step [100/114], Loss: 3.5945, Perplexity: 36.3975
Validating...
Step [0/127], Loss: 4.4924, Pe

# Section 2.4 Evaluation [10 pts]

In [None]:
## evaluation code
from tqdm import tqdm, tqdm_notebook
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()

def caption_generator(model, images, vocab, img_ids, captions, mode='Deterministic', temperature=1.0):
    """
    Generate captions.
    :param mode:
    :return:
    """
    sample_idxs = model.sample_generate(images, mode=mode,
                                        temperature=temperature).data.cpu().numpy()  # [N, max_length]
    for i, sentence in enumerate(sample_idxs):  # every sentence in this batch
        sentence_caption = ''
        for word_idx in sentence:
            word = vocab.idx2word[word_idx]
            if word != '<start>' and word != '<end>':
                if word == '.':
                    sentence_caption += '.'
                else:
                    sentence_caption += word + ' '
            if word == '<end>':
                break
        captions.append({'caption': sentence_caption})
        # captions.append(sentence_caption)

    return captions

def run_test(model, data_loader, vocab, mode='Deterministic', temperature=1.0):
    """
    Run your model on the test set.
    Inputs:
    :param model: the model you use
    :param data_loader: the data_loader
    :param mode: use 'deterministic' or 'stochastic'
    Outputs:
    :param predictions
    """
    predictions = []
    for itr, (images, captions, lengths) in enumerate(tqdm(data_loader)):
        images = images.to(device)  # generation only needs the images
        
        img_ids = list(range(itr * data_loader.batch_size, (itr + 1) * data_loader.batch_size))
        predictions = caption_generator(model, images, vocab, img_ids, 
                                        predictions, mode=mode, temperature=temperature)
        
    return predictions

def evaluation(model, vocab, data_path=path_to_homework + 'flickr30k_images/', mode='Deterministic', temperature=1.0,
               split='test'):
    """
    Evaluate the performance of your model on the test set using BLEU scores.
    Inputs:
    :param model: the model you use
    :param vocab: vocabulary
    :param data_path: the directory to the dataset
    :param mode: use 'deterministic' or 'stochastic'
    Outputs:
    :param predictions
    """
    # data loader
    test_data_loader = get_loader(root=data_path, split=split, vocab=vocab, 
                                  transform=transform, batch_size=8, shuffle=False, num_workers=4)
    
    # run your model on the test set
    print('Run on the test set...')
    preds = run_test(model, test_data_loader, vocab, mode, temperature)
    
    # load the groundtruth
    gt = test_data_loader.dataset.annos
    
    # evaluate the performance using BLEU score
    score1 = 0
    score2 = 0
    score3 = 0
    score4 = 0
    
    print('Computing BLEU')
    for itr in tqdm(range(len(gt))):
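        # NOTE: `candidate` and the references are raw strings here, so sentence_bleu
        # compares character n-grams; pass token lists (e.g. candidate.split()) for word-level BLEU.
        candidate = preds[itr]['caption']
        reference = [sent['raw'] for sent in gt[itr]['sentences']]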
        score1 += sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=smoother.method1)
        score2 += sentence_bleu(reference, candidate, weights=(0, 1, 0, 0), smoothing_function=smoother.method1)
        score3 += sentence_bleu(reference, candidate, weights=(0, 0, 1, 0), smoothing_function=smoother.method1)
        score4 += sentence_bleu(reference, candidate, weights=(0, 0, 0, 1), smoothing_function=smoother.method1)
    
    bleu1 = 100 * score1/len(gt)
    bleu2 = 100 * score2/len(gt)
    bleu3 = 100 * score3/len(gt)
    bleu4 = 100 * score4/len(gt)
    
    return bleu1, bleu2, bleu3, bleu4

- Test your outputs in the **Deterministic** way by using BLEU scores. You should achieve a BLEU-4 score of at least 25.
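
For intuition about what the scores measure, the clipped n-gram precision at the core of BLEU can be sketched as follows. This is a simplified illustration with made-up sentences and a hypothetical helper `clipped_precision`, without brevity penalty or smoothing; it is not a substitute for `nltk.translate.bleu_score.sentence_bleu`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(references, candidate, n):
    """Fraction of candidate n-grams that also occur in a reference, with each
    n-gram's count clipped to its maximum count in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    matched = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
    return matched / sum(cand_counts.values())

# Token lists (not raw strings), so n-grams are built from words.
references = ["a dog runs on the grass".split(), "the dog is running".split()]
candidate = "a dog runs on grass".split()
p1 = clipped_precision(references, candidate, 1)
p2 = clipped_precision(references, candidate, 2)
print(p1, p2)
```

Higher-order precisions drop quickly as soon as word order differs from the references, which is why the 4-gram score is the hardest one to raise.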

In [None]:
## Evaluate your model using BLEU score. Use Deterministic mode.

## Image transformation
transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        # no random augmentation (e.g. RandomHorizontalFlip) at evaluation time
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])

## Evaluate your model using BLEU score. Use Deterministic mode
model = Vanilla_rnn(vocab_size=len(vocab), emb_dim=emb_dim, hidden_dim=hidden_dim, 
                   num_layers=1, dropout=dropout).to(device)  # build a model
model.load_state_dict(torch.load(path_to_homework + '/checkpoints/rnn/vanilla_rnn-best.pth', map_location=torch.device('cpu')))
model.eval()
bleu1, bleu2, bleu3, bleu4 = evaluation(model, vocab, mode='Deterministic')
print("BLEU 1:{}, BLEU 2:{}, BLEU 3:{}, BLEU 4:{}".format(bleu1, bleu2, bleu3, bleu4))



Run on the test set...
Computing BLEU
BLEU 1:89.68767687069148, BLEU 2:63.75724619747129, BLEU 3:41.23849656594904, BLEU 4:27.7544551175118





- Try different temperatures (e.g. 0.1, 0.2, 0.5, 1.0, 1.5, 2, etc.) during the generation. Report BLEU scores for at least 3 different temperatures.

In [None]:
## Use at least 3 different temperatures to generate captions on the test set. Report the BLEU scores.
# Your code here
## evaluation code for temperature 1.5
from tqdm import tqdm, tqdm_notebook
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()

def caption_generator(model, images, vocab, img_ids, captions, mode='Deterministic', temperature=1.5):
    """
    Generate captions.
    :param mode:
    :return:
    """
    sample_idxs = model.sample_generate(images, mode=mode,
                                        temperature=temperature).data.cpu().numpy()  # [N, max_length]
    for i, sentence in enumerate(sample_idxs):  # every sentence in this batch
        sentence_caption = ''
        for word_idx in sentence:
            word = vocab.idx2word[word_idx]
            if word != '<start>' and word != '<end>':
                if word == '.':
                    sentence_caption += '.'
                else:
                    sentence_caption += word + ' '
            if word == '<end>':
                break
        captions.append({'caption': sentence_caption})
        # captions.append(sentence_caption)

    return captions

def run_test(model, data_loader, vocab, mode='Deterministic', temperature=1.5):
    """
    Run your model on the test set.
    Inputs:
    :param model: the model you use
    :param data_loader: the data_loader
    :param mode: use 'deterministic' or 'stochastic'
    Outputs:
    :param predictions
    """
    predictions = []
    for itr, (images, captions, lengths) in enumerate(tqdm(data_loader)):
        images = Variable(images).to(device)
        captions = Variable(captions).to(device)
        outputs = model(images, captions, lengths)
        
        img_ids = list(range(itr * data_loader.batch_size, (itr + 1) * data_loader.batch_size))
        predictions = caption_generator(model, images, vocab, img_ids, 
                                        predictions, mode=mode, temperature=temperature)
        
    return predictions

def evaluation(model, vocab, data_path=path_to_homework + '/flickr30k_images/', mode='Deterministic', temperature=1.5,
               split='test'):
    """
    Evaluate the performance of your model on the test set using BLEU scores.
    Inputs:
    :param model: the model you use
    :param vocab: vocabulary
    :param data_path: the directory to the dataset
    :param mode: use 'Deterministic' or 'Stochastic'
    :param temperature: sampling temperature passed to model.sample_generate
    :param split: the dataset split to evaluate on
    Outputs:
    :return: bleu1, bleu2, bleu3, bleu4, each averaged over the split and scaled by 100
    """
    # data loader
    test_data_loader = get_loader(root=path_to_homework + '/flickr30k_images/', split=split, vocab=vocab, 
                                  transform=transform, batch_size=8, shuffle=False, num_workers=4)
    
    # run your model on the test set
    print('Run on the test set...')
    preds = run_test(model, test_data_loader, vocab, mode, temperature)
    
    # load the groundtruth
    gt = test_data_loader.dataset.annos
    
    # evaluate the performance using BLEU score
    score1 = 0
    score2 = 0
    score3 = 0
    score4 = 0
    
    print('Computing BLEU')
    for itr in tqdm(range(len(gt))):
        candidate = preds[itr]['caption']
        reference = [sent['raw'] for sent in gt[itr]['sentences']]
        score1 += sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=smoother.method1)
        score2 += sentence_bleu(reference, candidate, weights=(0, 1, 0, 0), smoothing_function=smoother.method1)
        score3 += sentence_bleu(reference, candidate, weights=(0, 0, 1, 0), smoothing_function=smoother.method1)
        score4 += sentence_bleu(reference, candidate, weights=(0, 0, 0, 1), smoothing_function=smoother.method1)
    
    bleu1 = 100 * score1/len(gt)
    bleu2 = 100 * score2/len(gt)
    bleu3 = 100 * score3/len(gt)
    bleu4 = 100 * score4/len(gt)
    
    return bleu1, bleu2, bleu3, bleu4

# End of code
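
A note on the BLEU inputs: NLTK's `sentence_bleu` is documented as taking tokenized text, that is, a list of reference token lists and a hypothesis token list. The cell above passes raw strings, and iterating over a Python string yields characters, so the n-grams are most likely counted over characters rather than words. The sketch below uses a hypothetical pure-Python clipped n-gram precision (an illustration only, not NLTK's implementation, and with made-up sentences) to show how much the two views can differ.

```python
from collections import Counter

def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def modified_precision(references, candidate, n):
    """Clipped n-gram precision of `candidate` against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

reference = "a dog runs on the grass"   # made-up example sentences
candidate = "a cat sits on the mat"

# word level: split the strings into tokens first
word_p1 = modified_precision([reference.split()], candidate.split(), 1)
# character level: what happens when raw strings are passed in
char_p1 = modified_precision([reference], candidate, 1)
print(word_p1, char_p1)
```

If word-level scores are intended, calling `.split()` on both the candidate and each reference before scoring would be one way to get them; the numbers would then no longer be comparable with the ones reported in this notebook.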

In [None]:
## Evaluate your model using BLEU score. Use Deterministic mode.

## Image transformation
transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        #     transforms.RandomCrop(224, pad_if_needed=True),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])

## Evaluate your model using BLEU score. Use Deterministic mode
model = Vanilla_rnn(vocab_size=len(vocab), emb_dim=emb_dim, hidden_dim=hidden_dim, 
                   num_layers=1, dropout=dropout).to(device)  # build a model
model.load_state_dict(torch.load(path_to_homework + '/checkpoints/rnn/vanilla_rnn-best.pth', map_location=torch.device('cpu')))
model.eval()
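# Note: greedy decoding takes the argmax, which temperature scaling does not change,
# so with mode='Deterministic' the scores should not depend on the temperature.
bleu1, bleu2, bleu3, bleu4 = evaluation(model, vocab, mode='Deterministic')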
print("BLEU 1:{}, BLEU 2:{}, BLEU 3:{}, BLEU 4:{}".format(bleu1, bleu2, bleu3, bleu4))



Run on the test set...




Computing BLEU




BLEU 1:89.68767687069148, BLEU 2:63.75724619747129, BLEU 3:41.23849656594904, BLEU 4:27.7544551175118





In [None]:
## Use at least 3 different temperatures to generate captions on the test set. Report the BLEU scores.
# Your code here
## evaluation code for temperature 0.5
from tqdm import tqdm, tqdm_notebook
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()

def caption_generator(model, images, vocab, img_ids, captions, mode='Deterministic', temperature=0.5):
    """
    Generate captions.
    :param mode:
    :return:
    """
    sample_idxs = model.sample_generate(images, mode=mode,
                                        temperature=temperature).data.cpu().numpy()  # [N, max_length]
    for i, sentence in enumerate(sample_idxs):  # every sentence in this batch
        sentence_caption = ''
        for word_idx in sentence:
            word = vocab.idx2word[word_idx]
            if word != '<start>' and word != '<end>':
                if word == '.':
                    sentence_caption += '.'
                else:
                    sentence_caption += word + ' '
            if word == '<end>':
                break
        captions.append({'caption': sentence_caption})
        # captions.append(sentence_caption)

    return captions

def run_test(model, data_loader, vocab, mode='Deterministic', temperature=0.5):
    """
    Run your model on the test set.
    Inputs:
    :param model: the model you use
    :param data_loader: the data_loader
    :param mode: use 'deterministic' or 'stochastic'
    Outputs:
    :param predictions
    """
    predictions = []
    for itr, (images, captions, lengths) in enumerate(tqdm(data_loader)):
        images = Variable(images).to(device)
        captions = Variable(captions).to(device)
        outputs = model(images, captions, lengths)
        
        img_ids = list(range(itr * data_loader.batch_size, (itr + 1) * data_loader.batch_size))
        predictions = caption_generator(model, images, vocab, img_ids, 
                                        predictions, mode=mode, temperature=temperature)
        
    return predictions

def evaluation(model, vocab, data_path=path_to_homework + '/flickr30k_images/', mode='Deterministic', temperature=0.5,
               split='test'):
    """
    Evaluate the performance of your model on the test set using BLEU scores.
    Inputs:
    :param model: the model you use
    :param vocab: vocabulary
    :param data_path: the directory to the dataset
    :param mode: use 'Deterministic' or 'Stochastic'
    :param temperature: sampling temperature passed to model.sample_generate
    :param split: the dataset split to evaluate on
    Outputs:
    :return: bleu1, bleu2, bleu3, bleu4, each averaged over the split and scaled by 100
    """
    # data loader
    test_data_loader = get_loader(root=path_to_homework + '/flickr30k_images/', split=split, vocab=vocab, 
                                  transform=transform, batch_size=8, shuffle=False, num_workers=4)
    
    # run your model on the test set
    print('Run on the test set...')
    preds = run_test(model, test_data_loader, vocab, mode, temperature)
    
    # load the groundtruth
    gt = test_data_loader.dataset.annos
    
    # evaluate the performance using BLEU score
    score1 = 0
    score2 = 0
    score3 = 0
    score4 = 0
    
    print('Computing BLEU')
    for itr in tqdm(range(len(gt))):
        candidate = preds[itr]['caption']
        reference = [sent['raw'] for sent in gt[itr]['sentences']]
        score1 += sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=smoother.method1)
        score2 += sentence_bleu(reference, candidate, weights=(0, 1, 0, 0), smoothing_function=smoother.method1)
        score3 += sentence_bleu(reference, candidate, weights=(0, 0, 1, 0), smoothing_function=smoother.method1)
        score4 += sentence_bleu(reference, candidate, weights=(0, 0, 0, 1), smoothing_function=smoother.method1)
    
    bleu1 = 100 * score1/len(gt)
    bleu2 = 100 * score2/len(gt)
    bleu3 = 100 * score3/len(gt)
    bleu4 = 100 * score4/len(gt)
    
    return bleu1, bleu2, bleu3, bleu4

In [None]:
## Evaluate your model using BLEU score. Use Deterministic mode.

## Image transformation
transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        #     transforms.RandomCrop(224, pad_if_needed=True),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])

## Evaluate your model using BLEU score. Use Deterministic mode
model = Vanilla_rnn(vocab_size=len(vocab), emb_dim=emb_dim, hidden_dim=hidden_dim, 
                   num_layers=1, dropout=dropout).to(device)  # build a model
model.load_state_dict(torch.load(path_to_homework + '/checkpoints/rnn/vanilla_rnn-best.pth', map_location=torch.device('cpu')))
model.eval()
bleu1, bleu2, bleu3, bleu4 = evaluation(model, vocab, mode='Deterministic')
print("BLEU 1:{}, BLEU 2:{}, BLEU 3:{}, BLEU 4:{}".format(bleu1, bleu2, bleu3, bleu4))



Run on the test set...




Computing BLEU




BLEU 1:89.68767687069148, BLEU 2:63.75724619747129, BLEU 3:41.23849656594904, BLEU 4:27.7544551175118





In [None]:
## Use at least 3 different temperatures to generate captions on the test set. Report the BLEU scores.
# Your code here
## evaluation code for temperature 0.1
from tqdm import tqdm, tqdm_notebook
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()

def caption_generator(model, images, vocab, img_ids, captions, mode='Deterministic', temperature=0.1):
    """
    Generate captions.
    :param mode:
    :return:
    """
    sample_idxs = model.sample_generate(images, mode=mode,
                                        temperature=temperature).data.cpu().numpy()  # [N, max_length]
    for i, sentence in enumerate(sample_idxs):  # every sentence in this batch
        sentence_caption = ''
        for word_idx in sentence:
            word = vocab.idx2word[word_idx]
            if word != '<start>' and word != '<end>':
                if word == '.':
                    sentence_caption += '.'
                else:
                    sentence_caption += word + ' '
            if word == '<end>':
                break
        captions.append({'caption': sentence_caption})
        # captions.append(sentence_caption)

    return captions

def run_test(model, data_loader, vocab, mode='Deterministic', temperature=0.1):
    """
    Run your model on the test set.
    Inputs:
    :param model: the model you use
    :param data_loader: the data_loader
    :param mode: use 'deterministic' or 'stochastic'
    Outputs:
    :param predictions
    """
    predictions = []
    for itr, (images, captions, lengths) in enumerate(tqdm(data_loader)):
        images = Variable(images).to(device)
        captions = Variable(captions).to(device)
        outputs = model(images, captions, lengths)
        
        img_ids = list(range(itr * data_loader.batch_size, (itr + 1) * data_loader.batch_size))
        predictions = caption_generator(model, images, vocab, img_ids, 
                                        predictions, mode=mode, temperature=temperature)
        
    return predictions

def evaluation(model, vocab, data_path=path_to_homework + '/flickr30k_images/', mode='Deterministic', temperature=0.1,
               split='test'):
    """
    Evaluate the performance of your model on the test set using BLEU scores.
    Inputs:
    :param model: the model you use
    :param vocab: vocabulary
    :param data_path: the directory to the dataset
    :param mode: use 'Deterministic' or 'Stochastic'
    :param temperature: sampling temperature passed to model.sample_generate
    :param split: the dataset split to evaluate on
    Outputs:
    :return: bleu1, bleu2, bleu3, bleu4, each averaged over the split and scaled by 100
    """
    # data loader
    test_data_loader = get_loader(root=path_to_homework + '/flickr30k_images/', split=split, vocab=vocab, 
                                  transform=transform, batch_size=8, shuffle=False, num_workers=4)
    
    # run your model on the test set
    print('Run on the test set...')
    preds = run_test(model, test_data_loader, vocab, mode, temperature)
    
    # load the groundtruth
    gt = test_data_loader.dataset.annos
    
    # evaluate the performance using BLEU score
    score1 = 0
    score2 = 0
    score3 = 0
    score4 = 0
    
    print('Computing BLEU')
    for itr in tqdm(range(len(gt))):
        candidate = preds[itr]['caption']
        reference = [sent['raw'] for sent in gt[itr]['sentences']]
        score1 += sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=smoother.method1)
        score2 += sentence_bleu(reference, candidate, weights=(0, 1, 0, 0), smoothing_function=smoother.method1)
        score3 += sentence_bleu(reference, candidate, weights=(0, 0, 1, 0), smoothing_function=smoother.method1)
        score4 += sentence_bleu(reference, candidate, weights=(0, 0, 0, 1), smoothing_function=smoother.method1)
    
    bleu1 = 100 * score1/len(gt)
    bleu2 = 100 * score2/len(gt)
    bleu3 = 100 * score3/len(gt)
    bleu4 = 100 * score4/len(gt)
    
    return bleu1, bleu2, bleu3, bleu4

In [None]:
## Evaluate your model using BLEU score. Use Deterministic mode.

## Image transformation
transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        #     transforms.RandomCrop(224, pad_if_needed=True),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])

## Evaluate your model using BLEU score. Use Deterministic mode
model = Vanilla_rnn(vocab_size=len(vocab), emb_dim=emb_dim, hidden_dim=hidden_dim, 
                   num_layers=1, dropout=dropout).to(device)  # build a model
model.load_state_dict(torch.load(path_to_homework + '/checkpoints/rnn/vanilla_rnn-best.pth', map_location=torch.device('cpu')))
model.eval()
bleu1, bleu2, bleu3, bleu4 = evaluation(model, vocab, mode='Deterministic')
print("BLEU 1:{}, BLEU 2:{}, BLEU 3:{}, BLEU 4:{}".format(bleu1, bleu2, bleu3, bleu4))



Run on the test set...




Computing BLEU




BLEU 1:89.68767687069148, BLEU 2:63.75724619747129, BLEU 3:41.23849656594904, BLEU 4:27.7544551175118





In [None]:
# It seems that the BLEU score drops as the temperature goes up; for small temperatures it stays almost the same.

# Section 3 Variations [55 pts]
## Section 3.1 LSTM [35 pts]
## Section 3.1.1 Decoder: LSTM [5 pts]
This time, replace the RNN module with an LSTM module.

In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_layers=1, dropout=0):
        """
        Use LSTM as decoder for captions.
        :param emb_dim: Embedding dimensions.
        :param hidden_dim: Hidden states dimensions.
        :param num_layers: Number of LSTM layers.
        :param vocab_size: The size of Vocabulary.
        :param dropout: dropout probability
        """
        super(Decoder, self).__init__()
        self.max_length = 30
        #############Your code############
        # you need to implement a LSTM for the decoder. Take a look at the official documentation.
        # https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM
        # word embedding layer (equivalent to one-hot encoding + linear layer)
        self.embedding_layer = nn.Embedding(vocab_size, emb_dim)

        # lstm network
        self.lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim,
                            num_layers=num_layers, batch_first=True)

        # output layer
        self.linear = nn.Linear(hidden_dim, vocab_size)

    def forward(self, encode_features, captions, lengths):
        """
        Feed forward to generate captions.
        :param encode_features: output of encoder, size [N, emb_dim]
        :param captions: captions, size [N, max(lengths)]
        :param lengths: a list indicating valid length for each caption. length is (batch_size).
        """
        #############Your Code###################
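        embed = self.embedding_layer(captions)  # [N, max(lengths), emb_dim]
        # prepend the encoded image features as the first input step
        embed = torch.cat((encode_features.unsqueeze(1), embed), dim=1)
        # packing with the original lengths drops the final step,
        # so each input step predicts the caption token at the same position
        packed_input = pack_padded_sequence(embed, lengths, batch_first=True)

        # feed into LSTM.
        hiddens, _ = self.lstm(packed_input)

        # output layer; hiddens[0] is the packed data tensor, one row per valid time step
        outputs = self.linear(hiddens[0])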

        return outputs
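
To see why the packed outputs line up with the packed targets used in the loss, here is a small pure-Python sketch. The `pack` helper is hypothetical: it only mimics the time-major ordering of a packed batch and is not PyTorch's implementation. It assumes the captions are sorted by decreasing length and uses a made-up `<img>` placeholder for the image feature.

```python
def pack(sequences):
    """Mimic the time-major ordering of a packed batch (longest sequence first)."""
    lengths = [len(s) for s in sequences]
    assert lengths == sorted(lengths, reverse=True), "sort by decreasing length first"
    data, batch_sizes = [], []
    for t in range(lengths[0]):
        step = [s[t] for s in sequences if len(s) > t]  # sequences still active at step t
        data.extend(step)
        batch_sizes.append(len(step))
    return data, batch_sizes

captions = [["<start>", "a", "dog", "<end>"],
            ["<start>", "cat", "<end>"]]
lengths = [len(c) for c in captions]

# Prepend the image feature, then keep only `length` steps per sequence,
# which is the effect of packing the longer tensor with the original lengths.
inputs = [(["<img>"] + c)[:n] for c, n in zip(captions, lengths)]

packed_inputs, batch_sizes = pack(inputs)
packed_targets, _ = pack(captions)
for inp, tgt in zip(packed_inputs, packed_targets):
    print(f"{inp:>8} -> {tgt}")
```

Each printed pair is one row of the loss: the image feature predicts `<start>`, `<start>` predicts the first word, and the last real word predicts `<end>`.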

## Encoder-Decoder [5 pts]

In [None]:
class LSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_layers=1, dropout=0):
        """
        Encoder-decoder model with an LSTM decoder.
        :param vocab_size: the size of Vocabulary.
        :param emb_dim: the dimensions of word embedding.
        :param hidden_dim: the dimensions of hidden units.
        :param num_layers: the number of LSTM layers.
        :param dropout: dropout probability.
        """
        super(LSTM, self).__init__()
        #########Your Code################
        # Encoder: ResNet-50
        self.Encoder = Encoder(emb_dim)

        # Decoder: LSTM (pass the constructor arguments through instead of hard-coding them)
        self.Decoder = Decoder(vocab_size, emb_dim, hidden_dim, num_layers=num_layers, dropout=dropout)
        self.max_length = self.Decoder.max_length

    def forward(self, x, captions, lengths):
        """
        Feed forward.
        :param x: Images, [N, 3, H, W]
        :param captions: encoded captions, [N, max(lengths)]
        :param lengths: a list indicating valid length for each caption. length is (batch_size).
        :return: output logits, usually followed by a softmax layer.
        """
        ##########Your code###################
        # forward pass: encode the images, then decode the captions
        features = self.Encoder(x)
        outputs = self.Decoder(features, captions, lengths)

        return outputs

    def sample_generate(self, x, states=None, mode='Deterministic', temperature=5.0):
        """
        Generate samples during the evaluation.
        
        :param x: input image
        :param states: rnn states
        :param mode: which mode we use.  
         - 'Deterministic': Take the maximum output at each step.
         - 'Stochastic': Sample from the probability distribution from the output layer.
        :param temperature: will be used in the stochastic mode
        :return: sample_idxs. Word indices. We can use vocab to recover the sentence.
        """
        sample_idxs = []  # record the index of your generated words
        # compute the encoded features
        features = self.Encoder(x)
        inputs = features.unsqueeze(1)                               # inputs: (batch_size, 1, emb_dim)
        if mode == 'Deterministic':
            for i in range(self.max_length):
                hiddens, states = self.Decoder.lstm(inputs, states)
                outputs = self.Decoder.linear(hiddens.squeeze(1))    # outputs: (batch_size, vocab_size)
                # take the index of the largest logit (same as the argmax after the softmax)
                _, predicted = outputs.max(1)                        # predicted: (batch_size)
                sample_idxs.append(predicted)
                # feed the predicted word back in as the next input
                inputs = self.Decoder.embedding_layer(predicted)     # inputs: (batch_size, emb_dim)
                inputs = inputs.unsqueeze(1)                         # inputs: (batch_size, 1, emb_dim)
            sample_idxs = torch.stack(sample_idxs, dim=1)            # sample_idxs: (batch_size, max_length)

        elif mode == 'Stochastic':
            for i in range(self.max_length):
                hiddens, states = self.Decoder.lstm(inputs, states)
                outputs = self.Decoder.linear(hiddens.squeeze(1))
                # sample from the probability distribution after the temperature-scaled softmax
                probabilities = F.softmax(outputs.div(temperature), dim=1)
                predicted = torch.multinomial(probabilities.data, 1)  # predicted: (batch_size, 1)

                sample_idxs.append(predicted[:, 0])
                inputs = self.Decoder.embedding_layer(predicted[:, 0])  # inputs: (batch_size, emb_dim)
                inputs = inputs.unsqueeze(1)                          # inputs: (batch_size, 1, emb_dim)
            sample_idxs = torch.stack(sample_idxs, dim=1)             # sample_idxs: (batch_size, max_length)

        return sample_idxs
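
The two generation modes share the same loop: predict a word, then feed it back in as the next input. The toy sketch below shows that loop without any neural network. `next_logits` is a hypothetical stand-in for the LSTM plus linear layer, the vocabulary is made up, and the loop starts from `<start>` rather than from image features, so this is only an illustration of the control flow.

```python
import math
import random

VOCAB = ["<start>", "a", "dog", "runs", "<end>"]

def next_logits(prev_idx):
    """Hypothetical stand-in for LSTM + linear layer: favours the next word in VOCAB."""
    target = min(prev_idx + 1, len(VOCAB) - 1)
    return [3.0 if i == target else 0.0 for i in range(len(VOCAB))]

def generate(mode="Deterministic", temperature=1.0, max_length=6, rng=None):
    rng = rng or random.Random(0)
    idx, out = 0, []  # start from <start>
    for _ in range(max_length):
        logits = next_logits(idx)
        if mode == "Deterministic":
            # greedy: pick the index of the largest logit
            idx = max(range(len(logits)), key=logits.__getitem__)
        else:
            # sample in proportion to exp(logit / temperature), i.e. an unnormalised softmax
            weights = [math.exp(x / temperature) for x in logits]
            idx = rng.choices(range(len(logits)), weights=weights, k=1)[0]
        out.append(idx)  # the chosen word becomes the next input
    return [VOCAB[i] for i in out]

print(generate("Deterministic"))
print(generate("Stochastic", temperature=1.5))
```

Greedy decoding always returns the same sentence, while the sampled sentence varies with the random seed and becomes more erratic as the temperature grows.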

## Section 3.1.2 Training [10 pts]
Use the same set of hyper-parameters (hidden units, optimizer, learning rate etc.) for both models.

In [None]:
# some hyperparameters, you can change them
## training parameters
batch_size = 256
lr = 1e-2
num_epochs = 50
weight_decay = 0.0
log_step = 50

## network architecture
emb_dim = 1024
hidden_dim = 256
num_layers = 1 # number of RNN layers
dropout = 0.0

## image transformation
transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        #     transforms.RandomCrop(224, pad_if_needed=True),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])

## Output directory
output_dir = path_to_homework + '/checkpoints/lstm/'
os.makedirs(output_dir, exist_ok=True)

In [None]:
# Training code here
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


train_data_loader = get_loader(root=path_to_homework + 'flickr30k_images/', split='train', vocab=vocab,
                               transform=transform, batch_size=batch_size, shuffle=True, num_workers=12)
val_data_loader = get_loader(root=path_to_homework + 'flickr30k_images/', split='val', vocab=vocab,
                             transform=transform, batch_size=8, shuffle=True, num_workers=4)

model = LSTM(vocab_size=len(vocab), emb_dim=emb_dim, hidden_dim=hidden_dim, 
                   num_layers=1, dropout=dropout).to(device)  # build a model

# loss and optimizer
criterion = nn.CrossEntropyLoss().to(device)  # CE loss
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)  # optimizer
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 
                                      step_size=5,
                                      gamma=0.5)  # decay LR by a factor of 0.5 every 5 epochs. You can change this

# logs
Train_Losses = []  # record average training loss each epoch
Val_Losses = []   # record average validation loss each epoch
total_step = len(train_data_loader)  # number of iterations each epoch
best_val_loss = np.inf

# start training
print('Start training...')
import time
tic = time.time()
for epoch in range(num_epochs):
    print('Switch to training...')
    model.train()
    Train_loss_iter = []  # record the training loss each iteration
    for itr, (images, captions, lengths) in enumerate(train_data_loader):
        ########Your Code###########
        
        # train your model
        images = Variable(images).to(device)
        captions = Variable(captions).to(device)
        targets = Variable(pack_padded_sequence(captions, lengths, batch_first=True)[0]).to(device)

        optimizer.zero_grad()

        outputs = model(images, captions, lengths)
        loss = criterion(outputs, targets)

        loss.backward()  
        optimizer.step()

        # record the training loss (append a Python float; list + NumPy array would broadcast to an empty array)
        Train_loss_iter.append(loss.item())
        
        
        # print log info
        if itr % log_step == 0:
            # print current loss and perplexity
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Perplexity: {:5.4f}'
                      .format(epoch, num_epochs, itr, total_step, loss.item(), np.exp(loss.item())))
    scheduler.step()
    Train_Losses.append(np.mean(Train_loss_iter))
    np.save(os.path.join(output_dir, 'TrainingLoss_lstm.npy'), Train_Losses)  # save the training loss
    
    model.eval()
    # (optional) generate a sample during the training, you can use deterministic mode
    # Your code
    
    
    # validation
    Val_Losses.append(val(model, val_data_loader, vocab))
    np.save(os.path.join(output_dir, 'ValLoss_lstm.npy'), Val_Losses) # save the val loss
    
    # save model
    if Val_Losses[-1] < best_val_loss:
        best_val_loss = Val_Losses[-1]
        print('updated best val loss:', best_val_loss)
        print('Save model weights to...', output_dir)
        torch.save(model.state_dict(), 
                   os.path.join(output_dir, 'lstm-best.pth'))

print('It took: {} s'.format(time.time() - tic))

Start training...
Switch to training...
Epoch [0/50], Step [0/114], Loss: 9.2181, Perplexity: 10077.8075
Epoch [0/50], Step [50/114], Loss: 3.6306, Perplexity: 37.7345
Epoch [0/50], Step [100/114], Loss: 3.4019, Perplexity: 30.0200
Validating...




Step [0/127], Loss: 3.7728, Perplexity: 43.5004
Step [50/127], Loss: 3.1068, Perplexity: 22.3496
Step [100/127], Loss: 3.8409, Perplexity: 46.5682
updated best val loss: 3.366395
Save model weights to... /content/drive/My Drive/Assignment_4//checkpoints/lstm/
Switch to training...
Epoch [1/50], Step [0/114], Loss: 3.3167, Perplexity: 27.5680
Epoch [1/50], Step [50/114], Loss: 3.3043, Perplexity: 27.2300
Epoch [1/50], Step [100/114], Loss: 3.2659, Perplexity: 26.2047
Validating...
Step [0/127], Loss: 3.4206, Perplexity: 30.5885
Step [50/127], Loss: 3.3993, Perplexity: 29.9436
Step [100/127], Loss: 3.3913, Perplexity: 29.7054
updated best val loss: 3.2770257
Save model weights to... /content/drive/My Drive/Assignment_4//checkpoints/lstm/
Switch to training...
Epoch [2/50], Step [0/114], Loss: 3.2657, Perplexity: 26.1975
Epoch [2/50], Step [50/114], Loss: 3.1362, Perplexity: 23.0155
Epoch [2/50], Step [100/114], Loss: 3.1070, Perplexity: 22.3543
Validating...
Step [0/127], Loss: 3.3349, P

## Section 3.1.3 Evaluation [10 pts]
Evaluate your model on the test set by perplexity score or BLEU score.
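
BLEU compares n-grams, so what counts as a "gram" matters. NLTK's `sentence_bleu` expects a list of tokenized references and a tokenized hypothesis; if plain strings are passed instead, each string is iterated character by character. The sketch below uses a simplified clipped n-gram precision to show how much character-level scoring can inflate the result compared with word-level scoring. The helper `ngram_precision` and the captions are made up for illustration; this is not NLTK's implementation and it has no brevity penalty or smoothing.

```python
from collections import Counter

def ngram_precision(references, candidate, n):
    """Clipped n-gram precision of `candidate` against several references.

    Works on any sequences: lists of words give word n-grams,
    plain strings give character n-grams.
    """
    def ngrams(seq):
        return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

    cand_counts = Counter(ngrams(candidate))
    if not cand_counts:
        return 0.0
    # for each n-gram, the maximum count seen in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# made-up captions, for illustration only
reference = "a man rides a horse on the beach"
candidate = "a dog runs on the grass"

# character level: what happens when raw strings are scored
char_p1 = ngram_precision([reference], candidate, 1)
# word level: split both sides into tokens first
word_p1 = ngram_precision([reference.split()], candidate.split(), 1)

print("character-level unigram precision:", round(char_p1, 3))
print("word-level unigram precision:", round(word_p1, 3))
```

Two unrelated English sentences share most of their characters, so character-level scores tend to look much higher than word-level ones. When comparing against published BLEU numbers, it is worth checking which granularity was used.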

In [None]:
## Evaluate your model using BLEU score. Use Deterministic mode.
# Your code here
## evaluation code
from tqdm import tqdm, tqdm_notebook
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()

def caption_generator(model, images, vocab, img_ids, captions, mode='Deterministic', temperature=1.0):
    """
    Generate captions.
    :param mode:
    :return:
    """
    sample_idxs = model.sample_generate(images, mode=mode,
                                        temperature=temperature).data.cpu().numpy()  # [N, max_length]
    for i, sentence in enumerate(sample_idxs):  # every sentence in this batch
        sentence_caption = ''
        for word_idx in sentence:
            word = vocab.idx2word[word_idx]
            if word != '<start>' and word != '<end>':
                if word == '.':
                    sentence_caption += '.'
                else:
                    sentence_caption += word + ' '
            if word == '<end>':
                break
        captions.append({'caption': sentence_caption})
        # captions.append(sentence_caption)

    return captions

def run_test(model, data_loader, vocab, mode='Deterministic', temperature=1.0):
    """
    Run your model on the test set.
    Inputs:
    :param model: the model you use
    :param data_loader: the data_loader
    :param mode: use 'deterministic' or 'stochastic'
    Outputs:
    :param predictions
    """
    predictions = []
    with torch.no_grad():  # inference only, so skip gradient tracking
        for itr, (images, captions, lengths) in enumerate(tqdm(data_loader)):
            images = images.to(device)
            # captions are generated from the images alone, so no teacher-forced forward pass is needed
            
            img_ids = list(range(itr * data_loader.batch_size, (itr + 1) * data_loader.batch_size))
            predictions = caption_generator(model, images, vocab, img_ids, 
                                            predictions, mode=mode, temperature=temperature)
        
    return predictions

def evaluation(model, vocab, data_path=path_to_homework + '/flickr30k_images/', mode='Deterministic', temperature=1.0,
               split='test'):
    """
    Evaluate the performance of your model on the test set using BLEU scores.
    Inputs:
    :param model: the model you use
    :param vocab: vocabulary
    :param data_path: the directory to the dataset
    :param mode: use 'deterministic' or 'stochastic'
    :param temperature: temperature passed to model.sample_generate
    :param split: the dataset split to evaluate on
    Outputs:
    :return: the 1-gram to 4-gram BLEU scores, each scaled to 0-100
    """
    # data loader
    test_data_loader = get_loader(root=data_path, split=split, vocab=vocab, 
                                  transform=transform, batch_size=8, shuffle=False, num_workers=4)
    
    # run your model on the test set
    print('Run on the test set...')
    preds = run_test(model, test_data_loader, vocab, mode, temperature)
    
    # load the groundtruth
    gt = test_data_loader.dataset.annos
    
    # evaluate the performance using BLEU score
    score1 = 0
    score2 = 0
    score3 = 0
    score4 = 0
    
    print('Computing BLEU')
    for itr in tqdm(range(len(gt))):
        candidate = preds[itr]['caption']
        reference = [sent['raw'] for sent in gt[itr]['sentences']]
        score1 += sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=smoother.method1)
        score2 += sentence_bleu(reference, candidate, weights=(0, 1, 0, 0), smoothing_function=smoother.method1)
        score3 += sentence_bleu(reference, candidate, weights=(0, 0, 1, 0), smoothing_function=smoother.method1)
        score4 += sentence_bleu(reference, candidate, weights=(0, 0, 0, 1), smoothing_function=smoother.method1)
    
    bleu1 = 100 * score1/len(gt)
    bleu2 = 100 * score2/len(gt)
    bleu3 = 100 * score3/len(gt)
    bleu4 = 100 * score4/len(gt)
    
    return bleu1, bleu2, bleu3, bleu4

# End of code

In [None]:
## Use at least 3 different temperatures to generate captions on the test set. Report the BLEU scores.
# Your code here
## Image transformation
transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        #     transforms.RandomCrop(224, pad_if_needed=True),
        # transforms.RandomHorizontalFlip(),  # disabled: random flips make test-time evaluation non-reproducible
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])

## Evaluate your model using BLEU score. Use Deterministic mode
model = LSTM(vocab_size=len(vocab), emb_dim=emb_dim, hidden_dim=hidden_dim, 
                   num_layers=1, dropout=dropout).to(device)  # build a model
model.load_state_dict(torch.load(path_to_homework + '/checkpoints/lstm/lstm-best.pth', map_location=torch.device('cpu')))
model.eval()
bleu1, bleu2, bleu3, bleu4 = evaluation(model, vocab, mode='Deterministic')
print("BLEU 1:{}, BLEU 2:{}, BLEU 3:{}, BLEU 4:{}".format(bleu1, bleu2, bleu3, bleu4))

# End of code



Run on the test set...
Computing BLEU
BLEU 1:90.94035469084693, BLEU 2:65.86864561698819, BLEU 3:41.163834409771844, BLEU 4:27.814403103582656





In [None]:
## Evaluate your model using BLEU score. Use Deterministic mode.
# Your code here
## evaluation code
from tqdm import tqdm, tqdm_notebook
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()

def caption_generator(model, images, vocab, img_ids, captions, mode='Deterministic', temperature=1.5):
    """
    Generate captions.
    :param mode:
    :return:
    """
    sample_idxs = model.sample_generate(images, mode=mode,
                                        temperature=temperature).data.cpu().numpy()  # [N, max_length]
    for i, sentence in enumerate(sample_idxs):  # every sentence in this batch
        sentence_caption = ''
        for word_idx in sentence:
            word = vocab.idx2word[word_idx]
            if word != '<start>' and word != '<end>':
                if word == '.':
                    sentence_caption += '.'
                else:
                    sentence_caption += word + ' '
            if word == '<end>':
                break
        captions.append({'caption': sentence_caption})
        # captions.append(sentence_caption)

    return captions

def run_test(model, data_loader, vocab, mode='Deterministic', temperature=1.5):
    """
    Run your model on the test set.
    Inputs:
    :param model: the model you use
    :param data_loader: the data_loader
    :param mode: use 'deterministic' or 'stochastic'
    Outputs:
    :param predictions
    """
    predictions = []
    with torch.no_grad():  # inference only, so skip gradient tracking
        for itr, (images, captions, lengths) in enumerate(tqdm(data_loader)):
            images = images.to(device)
            # captions are generated from the images alone, so no teacher-forced forward pass is needed
            
            img_ids = list(range(itr * data_loader.batch_size, (itr + 1) * data_loader.batch_size))
            predictions = caption_generator(model, images, vocab, img_ids, 
                                            predictions, mode=mode, temperature=temperature)
        
    return predictions

def evaluation(model, vocab, data_path=path_to_homework + '/flickr30k_images/', mode='Deterministic', temperature=1.5,
               split='test'):
    """
    Evaluate the performance of your model on the test set using BLEU scores.
    Inputs:
    :param model: the model you use
    :param vocab: vocabulary
    :param data_path: the directory to the dataset
    :param mode: use 'deterministic' or 'stochastic'
    :param temperature: temperature passed to model.sample_generate
    :param split: the dataset split to evaluate on
    Outputs:
    :return: the 1-gram to 4-gram BLEU scores, each scaled to 0-100
    """
    # data loader
    test_data_loader = get_loader(root=data_path, split=split, vocab=vocab, 
                                  transform=transform, batch_size=8, shuffle=False, num_workers=4)
    
    # run your model on the test set
    print('Run on the test set...')
    preds = run_test(model, test_data_loader, vocab, mode, temperature)
    
    # load the groundtruth
    gt = test_data_loader.dataset.annos
    
    # evaluate the performance using BLEU score
    score1 = 0
    score2 = 0
    score3 = 0
    score4 = 0
    
    print('Computing BLEU')
    for itr in tqdm(range(len(gt))):
        candidate = preds[itr]['caption']
        reference = [sent['raw'] for sent in gt[itr]['sentences']]
        score1 += sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=smoother.method1)
        score2 += sentence_bleu(reference, candidate, weights=(0, 1, 0, 0), smoothing_function=smoother.method1)
        score3 += sentence_bleu(reference, candidate, weights=(0, 0, 1, 0), smoothing_function=smoother.method1)
        score4 += sentence_bleu(reference, candidate, weights=(0, 0, 0, 1), smoothing_function=smoother.method1)
    
    bleu1 = 100 * score1/len(gt)
    bleu2 = 100 * score2/len(gt)
    bleu3 = 100 * score3/len(gt)
    bleu4 = 100 * score4/len(gt)
    
    return bleu1, bleu2, bleu3, bleu4

# End of code

In [None]:
## Use at least 3 different temperatures to generate captions on the test set. Report the BLEU scores.
# Your code here
## Image transformation
transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        #     transforms.RandomCrop(224, pad_if_needed=True),
        # transforms.RandomHorizontalFlip(),  # disabled: random flips make test-time evaluation non-reproducible
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])

## Evaluate your model using BLEU score. Use Deterministic mode
model = LSTM(vocab_size=len(vocab), emb_dim=emb_dim, hidden_dim=hidden_dim, 
                   num_layers=1, dropout=dropout).to(device)  # build a model
model.load_state_dict(torch.load(path_to_homework + '/checkpoints/lstm/lstm-best.pth', map_location=torch.device('cpu')))
model.eval()
bleu1, bleu2, bleu3, bleu4 = evaluation(model, vocab, mode='Deterministic')
print("BLEU 1:{}, BLEU 2:{}, BLEU 3:{}, BLEU 4:{}".format(bleu1, bleu2, bleu3, bleu4))

# End of code


Run on the test set...
Computing BLEU

BLEU 1:90.94035469084693, BLEU 2:65.86864561698819, BLEU 3:41.163834409771844, BLEU 4:27.814403103582656





In [None]:
## Evaluate your model using BLEU score. Use Deterministic mode.
# Your code here
## evaluation code
from tqdm import tqdm, tqdm_notebook
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()

def caption_generator(model, images, vocab, img_ids, captions, mode='Deterministic', temperature=0.5):
    """
    Generate captions.
    :param mode:
    :return:
    """
    sample_idxs = model.sample_generate(images, mode=mode,
                                        temperature=temperature).data.cpu().numpy()  # [N, max_length]
    for i, sentence in enumerate(sample_idxs):  # every sentence in this batch
        sentence_caption = ''
        for word_idx in sentence:
            word = vocab.idx2word[word_idx]
            if word != '<start>' and word != '<end>':
                if word == '.':
                    sentence_caption += '.'
                else:
                    sentence_caption += word + ' '
            if word == '<end>':
                break
        captions.append({'caption': sentence_caption})
        # captions.append(sentence_caption)

    return captions

def run_test(model, data_loader, vocab, mode='Deterministic', temperature=0.5):
    """
    Run your model on the test set.
    Inputs:
    :param model: the model you use
    :param data_loader: the data_loader
    :param mode: use 'deterministic' or 'stochastic'
    Outputs:
    :param predictions
    """
    predictions = []
    with torch.no_grad():  # inference only, so skip gradient tracking
        for itr, (images, captions, lengths) in enumerate(tqdm(data_loader)):
            images = images.to(device)
            # captions are generated from the images alone, so no teacher-forced forward pass is needed
            
            img_ids = list(range(itr * data_loader.batch_size, (itr + 1) * data_loader.batch_size))
            predictions = caption_generator(model, images, vocab, img_ids, 
                                            predictions, mode=mode, temperature=temperature)
        
    return predictions

def evaluation(model, vocab, data_path=path_to_homework + '/flickr30k_images/', mode='Deterministic', temperature=0.5,
               split='test'):
    """
    Evaluate the performance of your model on the test set using BLEU scores.
    Inputs:
    :param model: the model you use
    :param vocab: vocabulary
    :param data_path: the directory to the dataset
    :param mode: use 'deterministic' or 'stochastic'
    :param temperature: temperature passed to model.sample_generate
    :param split: the dataset split to evaluate on
    Outputs:
    :return: the 1-gram to 4-gram BLEU scores, each scaled to 0-100
    """
    # data loader
    test_data_loader = get_loader(root=data_path, split=split, vocab=vocab, 
                                  transform=transform, batch_size=8, shuffle=False, num_workers=4)
    
    # run your model on the test set
    print('Run on the test set...')
    preds = run_test(model, test_data_loader, vocab, mode, temperature)
    
    # load the groundtruth
    gt = test_data_loader.dataset.annos
    
    # evaluate the performance using BLEU score
    score1 = 0
    score2 = 0
    score3 = 0
    score4 = 0
    
    print('Computing BLEU')
    for itr in tqdm(range(len(gt))):
        candidate = preds[itr]['caption']
        reference = [sent['raw'] for sent in gt[itr]['sentences']]
        score1 += sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=smoother.method1)
        score2 += sentence_bleu(reference, candidate, weights=(0, 1, 0, 0), smoothing_function=smoother.method1)
        score3 += sentence_bleu(reference, candidate, weights=(0, 0, 1, 0), smoothing_function=smoother.method1)
        score4 += sentence_bleu(reference, candidate, weights=(0, 0, 0, 1), smoothing_function=smoother.method1)
    
    bleu1 = 100 * score1/len(gt)
    bleu2 = 100 * score2/len(gt)
    bleu3 = 100 * score3/len(gt)
    bleu4 = 100 * score4/len(gt)
    
    return bleu1, bleu2, bleu3, bleu4

# End of code

In [None]:
## Use at least 3 different temperatures to generate captions on the test set. Report the BLEU scores.
# Your code here
## Image transformation
transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        #     transforms.RandomCrop(224, pad_if_needed=True),
        # transforms.RandomHorizontalFlip(),  # disabled: random flips make test-time evaluation non-reproducible
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])

## Evaluate your model using BLEU score. Use Deterministic mode
model = LSTM(vocab_size=len(vocab), emb_dim=emb_dim, hidden_dim=hidden_dim, 
                   num_layers=1, dropout=dropout).to(device)  # build a model
model.load_state_dict(torch.load(path_to_homework + '/checkpoints/lstm/lstm-best.pth', map_location=torch.device('cpu')))
model.eval()
bleu1, bleu2, bleu3, bleu4 = evaluation(model, vocab, mode='Deterministic')
print("BLEU 1:{}, BLEU 2:{}, BLEU 3:{}, BLEU 4:{}".format(bleu1, bleu2, bleu3, bleu4))


Run on the test set...
Computing BLEU

BLEU 1:90.94035469084693, BLEU 2:65.86864561698819, BLEU 3:41.163834409771844, BLEU 4:27.814403103582656





In [None]:
## Evaluate your model using BLEU score. Use Deterministic mode.
# Your code here
## evaluation code
from tqdm import tqdm, tqdm_notebook
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()

def caption_generator(model, images, vocab, img_ids, captions, mode='Deterministic', temperature=0.1):
    """
    Generate captions.
    :param mode:
    :return:
    """
    sample_idxs = model.sample_generate(images, mode=mode,
                                        temperature=temperature).data.cpu().numpy()  # [N, max_length]
    for i, sentence in enumerate(sample_idxs):  # every sentence in this batch
        sentence_caption = ''
        for word_idx in sentence:
            word = vocab.idx2word[word_idx]
            if word != '<start>' and word != '<end>':
                if word == '.':
                    sentence_caption += '.'
                else:
                    sentence_caption += word + ' '
            if word == '<end>':
                break
        captions.append({'caption': sentence_caption})
        # captions.append(sentence_caption)

    return captions

def run_test(model, data_loader, vocab, mode='Deterministic', temperature=0.1):
    """
    Run your model on the test set.
    Inputs:
    :param model: the model you use
    :param data_loader: the data_loader
    :param mode: use 'deterministic' or 'stochastic'
    Outputs:
    :param predictions
    """
    predictions = []
    with torch.no_grad():  # inference only, so skip gradient tracking
        for itr, (images, captions, lengths) in enumerate(tqdm(data_loader)):
            images = images.to(device)
            # captions are generated from the images alone, so no teacher-forced forward pass is needed
            
            img_ids = list(range(itr * data_loader.batch_size, (itr + 1) * data_loader.batch_size))
            predictions = caption_generator(model, images, vocab, img_ids, 
                                            predictions, mode=mode, temperature=temperature)
        
    return predictions

def evaluation(model, vocab, data_path=path_to_homework + '/flickr30k_images/', mode='Deterministic', temperature=0.1,
               split='test'):
    """
    Evaluate the performance of your model on the test set using BLEU scores.
    Inputs:
    :param model: the model you use
    :param vocab: vocabulary
    :param data_path: the directory to the dataset
    :param mode: use 'deterministic' or 'stochastic'
    :param temperature: temperature passed to model.sample_generate
    :param split: the dataset split to evaluate on
    Outputs:
    :return: the 1-gram to 4-gram BLEU scores, each scaled to 0-100
    """
    # data loader
    test_data_loader = get_loader(root=data_path, split=split, vocab=vocab, 
                                  transform=transform, batch_size=8, shuffle=False, num_workers=4)
    
    # run your model on the test set
    print('Run on the test set...')
    preds = run_test(model, test_data_loader, vocab, mode, temperature)
    
    # load the groundtruth
    gt = test_data_loader.dataset.annos
    
    # evaluate the performance using BLEU score
    score1 = 0
    score2 = 0
    score3 = 0
    score4 = 0
    
    print('Computing BLEU')
    for itr in tqdm(range(len(gt))):
        candidate = preds[itr]['caption']
        reference = [sent['raw'] for sent in gt[itr]['sentences']]
        score1 += sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=smoother.method1)
        score2 += sentence_bleu(reference, candidate, weights=(0, 1, 0, 0), smoothing_function=smoother.method1)
        score3 += sentence_bleu(reference, candidate, weights=(0, 0, 1, 0), smoothing_function=smoother.method1)
        score4 += sentence_bleu(reference, candidate, weights=(0, 0, 0, 1), smoothing_function=smoother.method1)
    
    bleu1 = 100 * score1/len(gt)
    bleu2 = 100 * score2/len(gt)
    bleu3 = 100 * score3/len(gt)
    bleu4 = 100 * score4/len(gt)
    
    return bleu1, bleu2, bleu3, bleu4

# End of code

In [None]:
## Use at least 3 different temperatures to generate captions on the test set. Report the BLEU scores.
# Your code here
## Image transformation
transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        #     transforms.RandomCrop(224, pad_if_needed=True),
        # transforms.RandomHorizontalFlip(),  # disabled: random flips make test-time evaluation non-reproducible
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])

## Evaluate your model using BLEU score. Use Deterministic mode
model = LSTM(vocab_size=len(vocab), emb_dim=emb_dim, hidden_dim=hidden_dim, 
                   num_layers=1, dropout=dropout).to(device)  # build a model
model.load_state_dict(torch.load(path_to_homework + '/checkpoints/lstm/lstm-best.pth', map_location=torch.device('cpu')))
model.eval()
bleu1, bleu2, bleu3, bleu4 = evaluation(model, vocab, mode='Deterministic')
print("BLEU 1:{}, BLEU 2:{}, BLEU 3:{}, BLEU 4:{}".format(bleu1, bleu2, bleu3, bleu4))


Run on the test set...

UnboundLocalError: ignored

In [None]:
# It seems that the BLEU score drops as the temperature goes up; for small temperature values it stays almost the same.
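
The effect of temperature can be reproduced on a toy distribution. The sketch below applies a temperature-scaled softmax to made-up logits; the values and the helper name `softmax_with_temperature` are illustrative assumptions and are not taken from the model. Dividing the logits by a temperature below 1 sharpens the distribution, while a temperature above 1 flattens it, so sampling picks low-probability words more often. The arg-max index does not change for any positive temperature.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate words

for t in (0.1, 0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    best = probs.index(max(probs))
    print("T={:<3}  probs={}  argmax={}".format(t, [round(p, 3) for p in probs], best))
```

Because the arg-max is unchanged, a decoder that always takes the most likely word should produce the same captions at every temperature; differences are only expected when words are sampled from the distribution.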

## Section 3.1.4 Discussion [5 pts]
What's the difference between Vanilla RNN and LSTM (training loss, evaluation results, etc.)?

At temperature 1, I get the same BLEU-4 score for both models. The loss for the LSTM is 2.7, which is slightly better than the vanilla RNN's 2.99.

**Your comments**:
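
Since the training log reports perplexity as `exp(loss)`, the two loss values quoted above can be converted the same way to make the gap easier to read. A small worked example, assuming both numbers are mean per-word cross-entropy losses measured in nats:

```python
import math

# loss values quoted in the discussion above
losses = {"LSTM": 2.7, "Vanilla RNN": 2.99}

# perplexity = exp(cross-entropy loss), as printed by the training loop
perplexities = {name: math.exp(loss) for name, loss in losses.items()}

for name, ppl in perplexities.items():
    print("{}: loss={:.2f}, perplexity={:.2f}".format(name, losses[name], ppl))
```

Under that assumption, the LSTM's perplexity is roughly 14.9 against roughly 19.9 for the vanilla RNN.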

## Section 3.2 Using pre-trained word embeddings [20 pts]
So far, the decoder takes a word as input by converting it into a fixed-size embedding, and our networks learn these word embeddings during training. In this experiment, you will use pre-trained word embeddings such as Word2Vec or GloVe in the LSTM. If you use PyTorch's nn.Embedding layer, you can initialize its weights with a matrix containing the pre-trained word embeddings for all words in your vocabulary and freeze the weights (i.e., don't train this layer). You can find these embeddings online.

Some resources:
- GloVe: https://nlp.stanford.edu/projects/glove/
- Word2Vec: http://jalammar.github.io/illustrated-word2vec/

In case you don't know how to get one, we've already provided a light GloVe embedding: wm_06.npy, which can produce 300-d word embeddings.
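
If you want to build such a matrix yourself, the idea is to create one row per vocabulary index and copy in the matching pre-trained vector, with a fallback for words that have no pre-trained entry. The sketch below is a plain-Python illustration on a made-up 3-d embedding table. The helper names `parse_glove_lines` and `build_embedding_matrix`, the toy vocabulary, and the zero-vector fallback are assumptions for illustration only; real GloVe files are far larger.

```python
def parse_glove_lines(lines):
    """Parse GloVe-style text lines ('word v1 v2 ...') into a dict."""
    table = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        table[parts[0]] = [float(v) for v in parts[1:]]
    return table

def build_embedding_matrix(word2idx, pretrained, dim):
    """Return (matrix, missing): row i of matrix is the vector for the word with index i."""
    matrix = [[0.0] * dim for _ in range(len(word2idx))]
    missing = []
    for word, idx in word2idx.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
        else:
            missing.append(word)  # this word keeps the zero vector
    return matrix, missing

# toy data, for illustration only
glove_lines = ["dog 0.1 0.2 0.3", "cat 0.4 0.5 0.6", "runs 0.7 0.8 0.9"]
word2idx = {"<pad>": 0, "<start>": 1, "<end>": 2, "dog": 3, "runs": 4}

pretrained = parse_glove_lines(glove_lines)
matrix, missing = build_embedding_matrix(word2idx, pretrained, dim=3)

print(matrix[word2idx["dog"]])
print("words without a pre-trained vector:", missing)
```

The nested list can then be converted to a float tensor and passed to `nn.Embedding.from_pretrained`. Whatever the source of the vectors, the row order has to match the vocabulary indices used to encode the captions, and the vector width has to match the decoder's `emb_dim`.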

## Section 3.2.1 Encoder-decoder [10 pts]

In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, pretrained_emb, num_layers=1, dropout=0):
        """
        Use LSTM as decoder for captions.
        :param emb_dim: Embedding dimensions.
        :param hidden_dim: Hidden states dimensions.
        :param pretrained_emb: the path to the pretrained embedding
        :param num_layers: Number of LSTM layers.
        :param vocab_size: The size of Vocabulary.
        :param dropout: dropout probability
        """
        super(Decoder, self).__init__()
        self.max_length = 30  # in case it's trapped
        ###### Your Code#########
        # load pre-trained embedding weights from the given path and freeze this layer
        weights = torch.FloatTensor(np.load(pretrained_emb))  # e.g. the provided wm_06.npy
        self.embedding = nn.Embedding.from_pretrained(weights, freeze=True)
        
        # lstm network (emb_dim must match the width of the pre-trained vectors)
        self.lstm = nn.LSTM(input_size = emb_dim,hidden_size = hidden_dim,
                            num_layers = num_layers, batch_first = True)
        
        # output layer
        self.linear = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, encode_features, captions, lengths):
        """
        Feed forward to generate captions.
        :param encode_features: output of encoder, size [N, emb_dim]
        :param captions: captions, size [N, max(lengths)]
        :param lengths: a list indicating valid length for each caption. length is (batch_size).
        """
        #############Your Code###################
        # look up the (frozen) pre-trained embedding of every caption token
        embed = self.embedding(captions)
        
        # prepend the encoded image features as the first step of the input sequence
        embed = torch.cat((encode_features.unsqueeze(1), embed), dim=1)
        packed_input = pack_padded_sequence(embed, lengths, batch_first=True)
                
        # feed into the LSTM
        hiddens, _ = self.lstm(packed_input)
        
        # output layer: hiddens[0] is the packed data tensor holding all valid time steps
        outputs = self.linear(hiddens[0])

        return outputs

In [None]:
class Word_embeddings(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, pretrained_emb, num_layers=1, dropout=0):
        """
        Encoder-decoder baseline.
        :param vocab_size: the size of Vocabulary.
        :param emb_dim: the dimensions of word embedding.
        :param hidden_dim: the dimensions of hidden units.
        :param pretrained_emb: the path to the pretrained embedding
        :param num_layers: the number of LSTM layers.
        :param dropout: dropout probability.
        """
        super(Word_embeddings, self).__init__()
        self.max_length = 30
        #########Your Code################
        # Encoder: ResNet-50
        self.Encoder = Encoder(emb_dim)
        # Decoder: LSTM (forward the constructor arguments instead of hard-coding them)
        self.Decoder = Decoder(vocab_size, emb_dim, hidden_dim, pretrained_emb,
                               num_layers=num_layers, dropout=dropout)
 

    def forward(self, x, captions, lengths):
        """
        Feed forward.
        :param x: Images, [N, 3, H, W]
        :param captions: encoded captions, [N, max(lengths)]
        :param lengths: a list indicating valid length for each caption. length is (batch_size).
        :return: output logits, usually followed by a softmax layer.
        """
        ##########Your code###################
        # forward pass: encode the images, then decode with the ground-truth captions as input
        features = self.Encoder(x)
        outputs = self.Decoder(features, captions, lengths)

        return outputs

    def sample_generate(self, x, states=None, mode='Deterministic', temperature=5.0):
        """
        Generate samples.
        :param x: Images, [N, 3, H, W]
        :param states: initial LSTM states; None starts from zeros
        :param mode: 'Deterministic' (argmax) or 'Stochastic' (sample from the softmax)
        :param temperature: softmax temperature, used in 'Stochastic' mode only
        :return: generated word indices, [N, max_length]
        """
        #################Your Code##################
        sample_idxs = []  # record the index of your generated words
        # compute the encoded features
        features = self.Encoder(x)
        inputs = features.unsqueeze(1)  # inputs: (batch_size, 1, emb_dim)
        if mode == 'Deterministic':
            for i in range(self.max_length):
                hiddens, states = self.Decoder.lstm(inputs, states)
                outputs = self.Decoder.linear(hiddens.squeeze(1))
                # take the maximum index (the argmax of the logits equals the argmax after the softmax)
                _, predicted = outputs.max(1)  # predicted: (batch_size)
                sample_idxs.append(predicted)
                # the Decoder names its embedding layer `embedding`
                inputs = self.Decoder.embedding(predicted)  # inputs: (batch_size, emb_dim)
                inputs = inputs.unsqueeze(1)  # inputs: (batch_size, 1, emb_dim)
            sample_idxs = torch.stack(sample_idxs, dim=1)  # (batch_size, max_length)

        elif mode == 'Stochastic':
            for i in range(self.max_length):
                hiddens, states = self.Decoder.lstm(inputs, states)
                outputs = self.Decoder.linear(hiddens.squeeze(1))
                # sample from the probability distribution after the softmax
                # Hint: use torch.multinomial() to sample from a distribution.
                probabilities = F.softmax(outputs.div(temperature), dim=1)
                predicted = torch.multinomial(probabilities.data, 1)  # predicted: (batch_size, 1)

                sample_idxs.append(predicted[:, 0])
                inputs = self.Decoder.embedding(predicted[:, 0])  # inputs: (batch_size, emb_dim)
                inputs = inputs.unsqueeze(1)  # inputs: (batch_size, 1, emb_dim)
            sample_idxs = torch.stack(sample_idxs, dim=1)  # (batch_size, max_length)

        return sample_idxs
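
The 'Stochastic' branch divides the logits by `temperature` before the softmax. The following is a small standard-library sketch of what that division does to the sampling distribution. The helper names and the example logits are assumptions for illustration, and `random.Random.choices` stands in for `torch.multinomial`.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_index(probs, rng):
    """Draw one index according to probs (a stand-in for torch.multinomial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a three-word vocabulary
sharp = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=5.0)

rng = random.Random(0)
draws = [sample_index(flat, rng) for _ in range(10)]

print("T=1.0:", [round(p, 3) for p in sharp])
print("T=5.0:", [round(p, 3) for p in flat])
print("samples at T=5.0:", draws)
```

A temperature above 1 flattens the distribution, so less likely words are drawn more often. Note that `sample_generate` defaults to `temperature=5.0`, while the evaluation helpers pass `temperature=1.0`.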

## Section 3.2.2 Training [5 pts]

In [None]:
# some hyperparameters, you can change them
## training parameters
batch_size = 256
lr = 1e-2
num_epochs = 50
weight_decay = 0.0
log_step = 50

## network architecture
emb_dim = 300
hidden_dim = 256
num_layers = 1 # number of RNN layers
dropout = 0.0

## image transformation
transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        #     transforms.RandomCrop(224, pad_if_needed=True),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])

## Output directory
output_dir = path_to_homework + '/checkpoints/pretrained_emb/'
os.makedirs(output_dir, exist_ok=True)

In [None]:
# Training code here
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


train_data_loader = get_loader(root=path_to_homework + '/flickr30k_images/', split='train', vocab=vocab,
                               transform=transform, batch_size=batch_size, shuffle=True, num_workers=12)
val_data_loader = get_loader(root=path_to_homework + '/flickr30k_images/', split='val', vocab=vocab,
                             transform=transform, batch_size=8, shuffle=True, num_workers=4)

# pretrained embedding weights
pre_emb_path = '/content/drive/My Drive/Assignment_4/wm_06.npy'  # type the path to the pretrained embedding you find

model = Word_embeddings(vocab_size=len(vocab), emb_dim=emb_dim, hidden_dim=hidden_dim, pretrained_emb=pre_emb_path,
                        num_layers=num_layers, dropout=dropout).to(device)  # build a model

# loss and optimizer
criterion = nn.CrossEntropyLoss().to(device)  # CE loss
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)  # optimizer
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 
                                      step_size=5,
                                      gamma=0.5)  # halve the LR every 5 epochs (step_size). You can change this

# logs
Train_Losses = []  # record average training loss each epoch
Val_Losses = []   # record average validation loss each epoch
total_step = len(train_data_loader)  # number of iterations each epoch
best_val_loss = np.inf

# start training
print('Start training...')
import time
tic = time.time()
for epoch in range(num_epochs):
    print('Switch to training...')
    model.train()
    Train_loss_iter = []  # record the training loss each iteration
    for itr, (images, captions, lengths) in enumerate(train_data_loader):
        ########Your Code###########
        # train your model
        images = images.to(device)
        captions = captions.to(device)
        # flatten the padded captions the same way the decoder packs its outputs
        targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]

        optimizer.zero_grad()

        outputs = model(images, captions, lengths)
        loss = criterion(outputs, targets)

        loss.backward()  
        optimizer.step()

        # record the training loss
        Train_loss_iter.append(loss.item())
        
        
        # print log info
        if itr % log_step == 0:
            # print current loss and perplexity
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Perplexity: {:5.4f}'
                      .format(epoch, num_epochs, itr, total_step, loss.item(), np.exp(loss.item())))
    scheduler.step()
    Train_Losses.append(np.mean(Train_loss_iter))
    np.save(os.path.join(output_dir, 'TrainingLoss_lstm.npy'), Train_Losses)  # save the training loss
    
    model.eval()
    # (optional) generate a sample during the training, you can use deterministic mode
    # Your code
    
    
    # validation
    Val_Losses.append(val(model, val_data_loader, vocab))
    np.save(os.path.join(output_dir, 'ValLoss_lstm.npy'), Val_Losses) # save the val loss
    
    # save model
    if Val_Losses[-1] < best_val_loss:
        best_val_loss = Val_Losses[-1]
        print('updated best val loss:', best_val_loss)
        print('Save model weights to...', output_dir)
        torch.save(model.state_dict(),
                   os.path.join(output_dir, 'pretrain-best.pth'))

print('It took: {} s'.format(time.time() - tic))

Start training...
Switch to training...
Epoch [0/50], Step [0/114], Loss: 9.2032, Perplexity: 9928.9960


RuntimeError: ignored
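
Two pieces of arithmetic from the training cell can be checked by hand: the learning rate that `StepLR` produces with `step_size=5` and `gamma=0.5`, and the perplexity printed in the log, which is the exponential of the cross-entropy loss. The sketch below uses a hypothetical helper `step_lr` that reproduces the closed form of the schedule, under the assumption that `scheduler.step()` is called once per epoch as in the loop above.

```python
import math


def step_lr(base_lr, epoch, step_size=5, gamma=0.5):
    """Learning rate after `epoch` scheduler steps under a StepLR-style decay."""
    return base_lr * gamma ** (epoch // step_size)


base_lr = 1e-2
for epoch in range(0, 50, 5):
    print(f"epoch {epoch:2d}: lr = {step_lr(base_lr, epoch):.3e}")

# Perplexity is exp(loss); the logged loss of 9.2032 is rounded to four decimals.
perplexity = math.exp(9.2032)
print(f"perplexity = {perplexity:.1f}")
```

With these settings the learning rate is halved nine times over 50 epochs, ending below 2e-5.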

## Section 3.2.3 Evaluation [3 pts]

In [None]:
## Evaluate your model using BLEU score. Use Deterministic mode
# Your code here
## evaluation code
from tqdm import tqdm, tqdm_notebook
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()

def caption_generator(model, images, vocab, img_ids, captions, mode='Deterministic', temperature=1.0):
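    """
    Generate captions.
    :param model: the captioning model
    :param images: a batch of images, [N, 3, H, W]
    :param vocab: vocabulary used to map indices back to words
    :param img_ids: image indices of this batch (currently unused)
    :param captions: list that the generated captions are appended to
    :param mode: 'Deterministic' or 'Stochastic'
    :param temperature: softmax temperature for 'Stochastic' mode
    :return: the `captions` list, extended with one dict per image
    """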
    sample_idxs = model.sample_generate(images, mode=mode,
                                        temperature=temperature).data.cpu().numpy()  # [N, max_length]
    for i, sentence in enumerate(sample_idxs):  # every sentence in this batch
        sentence_caption = ''
        for word_idx in sentence:
            word = vocab.idx2word[word_idx]
            if word != '<start>' and word != '<end>':
                if word == '.':
                    sentence_caption += '.'
                else:
                    sentence_caption += word + ' '
            if word == '<end>':
                break
        captions.append({'caption': sentence_caption})
        # captions.append(sentence_caption)

    return captions

def run_test(model, data_loader, vocab, mode='Deterministic', temperature=1.0):
    """
    Run your model on the test set.
    Inputs:
    :param model: the model you use
    :param data_loader: the data_loader
    :param vocab: vocabulary
    :param mode: use 'Deterministic' or 'Stochastic'
    :param temperature: softmax temperature for 'Stochastic' mode
    Outputs:
    :return: predictions, a list with one {'caption': ...} dict per image
    """
    predictions = []
    model.eval()  # inference mode for batch norm and dropout
    with torch.no_grad():  # no gradients are needed at test time
        for itr, (images, captions, lengths) in enumerate(tqdm(data_loader)):
            images = images.to(device)

            img_ids = list(range(itr * data_loader.batch_size, (itr + 1) * data_loader.batch_size))
            predictions = caption_generator(model, images, vocab, img_ids,
                                            predictions, mode=mode, temperature=temperature)

    return predictions

def evaluation(model, vocab, data_path=path_to_homework + '/flickr30k_images/', mode='Deterministic', temperature=1.0,
               split='test'):
    """
    Evaluate the performance of your model on the test set using BLEU scores.
    Inputs:
    :param model: the model you use
    :param vocab: vocabulary
    :param data_path: the directory to the dataset
    :param mode: use 'Deterministic' or 'Stochastic'
    :param temperature: softmax temperature for 'Stochastic' mode
    :param split: the dataset split to evaluate on
    Outputs:
    :return: the individual 1-gram to 4-gram BLEU scores, scaled to 0-100
    """
    # data loader
    test_data_loader = get_loader(root=data_path, split=split, vocab=vocab,
                                  transform=transform, batch_size=8, shuffle=False, num_workers=4)
    
    # run your model on the test set
    print('Run on the test set...')
    preds = run_test(model, test_data_loader, vocab, mode, temperature)
    
    # load the groundtruth
    gt = test_data_loader.dataset.annos
    
    # evaluate the performance using BLEU score
    score1 = 0
    score2 = 0
    score3 = 0
    score4 = 0
    
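    print('Computing BLEU')
    # NOTE: sentence_bleu expects tokenized sentences (lists of tokens).
    # The raw strings passed below are compared character by character.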
    for itr in tqdm(range(len(gt))):
        candidate = preds[itr]['caption']
        reference = [sent['raw'] for sent in gt[itr]['sentences']]
        score1 += sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=smoother.method1)
        score2 += sentence_bleu(reference, candidate, weights=(0, 1, 0, 0), smoothing_function=smoother.method1)
        score3 += sentence_bleu(reference, candidate, weights=(0, 0, 1, 0), smoothing_function=smoother.method1)
        score4 += sentence_bleu(reference, candidate, weights=(0, 0, 0, 1), smoothing_function=smoother.method1)
    
    bleu1 = 100 * score1/len(gt)
    bleu2 = 100 * score2/len(gt)
    bleu3 = 100 * score3/len(gt)
    bleu4 = 100 * score4/len(gt)
    
    return bleu1, bleu2, bleu3, bleu4

# End of code
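
The evaluation above hands raw strings to `sentence_bleu`, so n-grams are counted over characters rather than words. To show how much that choice matters, here is a simplified standard-library sketch of clipped n-gram precision, the core quantity inside BLEU. It is not NLTK's implementation (no brevity penalty, no smoothing), and `ngram_precision` and the example sentences are made up for illustration.

```python
from collections import Counter


def ngram_precision(candidate, references, n=1):
    """Clipped n-gram precision of `candidate` against `references`.

    Works on any sequence: a list of tokens gives word n-grams,
    a plain string gives character n-grams.
    """
    def ngrams(seq):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

    cand_counts = ngrams(candidate)
    if not cand_counts:
        return 0.0
    # For every n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the cat sat"
references = ["a dog ran"]

char_score = ngram_precision(candidate, references)
word_score = ngram_precision(candidate.split(), [ref.split() for ref in references])

print(f"character-level unigram precision: {char_score:.3f}")
print(f"word-level unigram precision:      {word_score:.3f}")
```

The two sentences share no words, yet the character-level score is well above zero because they share letters and spaces. Scores computed on raw strings are therefore not comparable to word-level BLEU values reported elsewhere.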