# Deep Learning Mini-Challenge 2: Image Captioning

**Task description:** The aim of this Mini Challenge is to train an image captioning model. After training, the model should be able to receive an image and generate a single sentence describing the captured scene. This work is strongly inspired by the paper by Vinyals et al.: *Show and Tell: A Neural Image Caption Generator* (https://arxiv.org/pdf/1411.4555.pdf).

**Description of the dataset:** The Flickr8k data set is used for training the models. It consists of 8091 different images with varying resolutions. The images were collected from six different Flickr groups and were manually selected to include a range of scenes and situations. In addition, the dataset includes 5 captions for each image, which results in a total of 40455 captions.

In [None]:
import os
import pickle
import wandb
import random
import itertools
import numpy as np
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import Dataset

from sklearn.model_selection import train_test_split
from torchtext.data.utils import get_tokenizer
from torch.utils.data import DataLoader
import torchtext
from tqdm import tqdm

import torchvision
import torchvision.models as models
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

from nltk.translate.bleu_score import sentence_bleu

import warnings
warnings.filterwarnings('ignore')

print("torch:", torch.__version__)
print("torchtext:", torchtext.__version__)
print("torchvision:", torchvision.__version__)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)

## Import data

In [None]:
def read_labels(label_path, skip_header=True):
    '''
    Reads the labels and caption text from the captions.txt file in the specified path
    '''
    with open(label_path + "captions.txt") as f:
        if skip_header:
            next(f)
        lines = f.readlines()
        lines = [line.replace("\n", "") for line in lines]
        lines = [line.split(".jpg,") for line in lines]
        filenames = [line[0] + ".jpg" for line in lines]
        text = [line[1] for line in lines]
        return(pd.DataFrame([filenames, text], index=(["filename", "text"])).T)


image_path = "../../data/Flickr8k/Images/"
label_path = "../../data/Flickr8k/"


df_caption = read_labels(label_path)
df_caption.head()

## Explorative data analysis

### Visualization of Images with their corresponding captions.

In [None]:
def show_sample_images(df, n=3, m=2):
    '''
    Visualises a number of images with the corresponding captions
    '''
    fig, axes = plt.subplots(n, m, figsize=(22,14))
    # use the dataframe that was passed in instead of the global one
    unique_files = df.filename.unique()

    for i in range(n*m):
        filename = unique_files[i]
        caption = "\n".join(list(df.loc[df["filename"]==filename]["text"]))
        img = mpimg.imread(image_path + filename)
        axes[i//m, i%m].imshow(img)
        axes[i//m, i%m].set_title(caption)
    plt.subplots_adjust(hspace = 0.8)
    plt.show()


show_sample_images(df_caption)

**Description:** We see the first six images from the dataset with their corresponding captions. The images have varying resolutions. The scenes contain people or animals performing a simple action. The captions seem relatively clean at first glance. However, the capitalization of individual words is inconsistent. In general, some preprocessing will be necessary for the images and the captions, but the effort will probably not be too high.

### Average caption lengths

In [None]:
plt.figure(figsize=(10,5))
sns.ecdfplot(df_caption.text.apply(str.split).apply(len))
plt.title("ecdf of nr of words per caption")
plt.xlabel("nr of words")
plt.ylabel("proportion")
plt.show()

(df_caption.text.apply(str.split).apply(len)).quantile([.5,.6,.7,.8,.9,.95,1])

**Description:** The longest caption in the data set has 38 words. A maximum length of 19 words would already be sufficient for over 95 percent of the captions.
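
How much of a dataset a given length limit covers can be checked directly from the word counts. The following is a minimal sketch with a few made-up captions; the `captions` list and the `coverage` helper are illustrative assumptions and not part of the notebook.

```python
def coverage(captions, max_words):
    """Return the share of captions with at most `max_words` whitespace-separated words."""
    lengths = [len(caption.split()) for caption in captions]
    return sum(length <= max_words for length in lengths) / len(lengths)


# Hypothetical toy captions, only to illustrate the calculation
captions = [
    "A dog runs through the grass .",
    "Two children play soccer on a field .",
    "A man in a red jacket climbs a steep snowy mountain while another man watches from below .",
    "A girl jumps .",
]

for limit in (5, 10, 20):
    print(limit, coverage(captions, limit))
```

Applied to the real captions, the same calculation corresponds to reading a value off the ECDF plot above.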

### Image resolutions

In [None]:
def plot_image_sizes(filenames):
    '''
    Visualizes the height and width of the images in number of pixels.
    '''
    # read image sizes
    widths, heights = [], []
    for filename in filenames:
        img = mpimg.imread(image_path + filename)
        # image arrays are laid out as (height, width, channels)
        height, width = np.shape(img)[:2]
        widths.append(width)
        heights.append(height)

    # create plot
    plt.figure(figsize=(8,8))
    plt.scatter(widths, heights)
    plt.title("Image sizes")
    plt.xlabel("width of images in pixels")
    plt.ylabel("height of images in pixels")
    plt.show()

unique_filenames = df_caption["filename"].drop_duplicates()
plot_image_sizes(unique_filenames)

**Description:** The visualisation shows the resolution of the images as the number of pixels in height and width. As already seen in the visualised samples, the images vary in their resolution. However, a clear upper bound of 500 pixels is noticeable for both height and width across all images. For the CNN, it is necessary in our case that all images have the same resolution. For this reason, the images are processed in the next step.

## Preprocessing 

### Preprocessing Images

This section handles the preprocessing of the images, which includes the following transformations:
- `ToPILImage` transforms the input image into a PIL image, so that the image operations of the **P**ython **I**maging **L**ibrary can be applied.
- `CenterCrop` crops the image around its center, resulting in a fixed image resolution. Since the maximal resolution in our case is 500 x 500 pixels, images with fewer pixels receive a padding of zeros to fill the gap.
- `Resize` scales the images to a resolution of 224 x 224 pixels, which is the minimal input size for the ResNet18. Because all images were center cropped to the same square size before, a distortion of the images is avoided.
- `ToTensor` transforms the PIL image into a tensor.
- `Normalize` normalizes the RGB channels of each sample with fixed per-channel mean and standard deviation values (the ImageNet statistics commonly used with pre-trained torchvision models).
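
The zero padding performed by the center crop can be illustrated on a tiny single-channel image. This is a plain-Python sketch of the idea, not torchvision's implementation; the helper name `center_crop_with_padding` is an assumption, and the exact rounding for odd size differences may differ from torchvision.

```python
def center_crop_with_padding(image, out_h, out_w, fill=0):
    """Center-crop a 2D list to (out_h, out_w); pad with `fill` where the image is smaller."""
    in_h, in_w = len(image), len(image[0])
    # Offset of the output window relative to the input (negative means padding is needed)
    top = (in_h - out_h) // 2
    left = (in_w - out_w) // 2
    result = []
    for row in range(out_h):
        src_row = row + top
        new_row = []
        for col in range(out_w):
            src_col = col + left
            inside = 0 <= src_row < in_h and 0 <= src_col < in_w
            new_row.append(image[src_row][src_col] if inside else fill)
        result.append(new_row)
    return result


# A 2 x 2 image is smaller than the 4 x 4 target and gets a border of zeros
small = [[1, 2],
         [3, 4]]
for line in center_crop_with_padding(small, 4, 4):
    print(line)

# A 4 x 4 image is larger than the 2 x 2 target and is cropped around its center
big = [[r * 4 + c for c in range(4)] for r in range(4)]
print(center_crop_with_padding(big, 2, 2))
```

Because every image ends up on the same square canvas, the subsequent resize scales both axes by the same factor.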

In [None]:
from torchvision.transforms import Compose, CenterCrop, Resize, ToTensor, ToPILImage, Normalize

image_transform = Compose([
    ToPILImage(),
    CenterCrop((500, 500)),  # pad smaller images with zeros
    Resize((224, 224)), # resnet18 minimal input shape
    ToTensor(),
    Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])

# image_transform = Compose([
#     ToPILImage(),
#     CenterCrop((224, 224)), # resnet18 input shape
#     ToTensor(),
#     Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
#     ])

### Preprocessing Captions

In this section, the captions for the images are preprocessed. The captions are originally provided as strings. In a first step, they are tokenized with the `basic_english` tokenizer included in the torchtext library, which lowercases the text and replaces certain symbols using a pattern dict. We also limit the maximum number of words per caption to 20, since over 95 percent of all captions are within this range. The beginning and end of each caption are then marked with the `<bos>` and `<eos>` tokens. Finally, shorter captions are padded with the `<pad>` token, giving all captions a fixed length of 22 tokens.

In [None]:
# define special tokens
start_token = "<bos>"
stop_token = "<eos>"
unknown_token = "<unk>"
padding_token = "<pad>"

# define caption boundaries
max_length  = 20

# specify tokenizer
tokenizer = get_tokenizer('basic_english')

def preprocess_caption(text):
    '''
    Tokenizes the captions and applies preprocessing steps.
    '''
    # tokenize words with torchtext
    tokens = tokenizer(text)
    # cut list length to max_length
    tokens = tokens[:max_length]
    # add start and end token
    tokens = [start_token] + tokens + [stop_token]
    # pad short sentences to the fixed length of max_length + 2
    tokens = tokens + [padding_token] * (max_length + 2 - len(tokens))
    return tokens

def add_caption_lengths(text):
    '''
    Counts the tokens of a caption that are not padding (including <bos> and <eos>).
    '''
    return sum([x != padding_token for x in text])

df_caption["text_tokens"] = df_caption["text"].apply(preprocess_caption)

df_caption["caption_length"] = df_caption["text_tokens"].apply(add_caption_lengths) 
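
The effect of these steps on a single caption can be traced with a small self-contained sketch. It assumes a simplified stand-in tokenizer (`simple_tokenizer`, lowercasing and whitespace splitting) instead of the torchtext tokenizer, so punctuation handling differs from the notebook.

```python
START, STOP, PAD = "<bos>", "<eos>", "<pad>"
MAX_LENGTH = 20


def simple_tokenizer(text):
    """Stand-in for the torchtext tokenizer: lowercases and splits on whitespace."""
    return text.lower().split()


def preprocess(text, max_length=MAX_LENGTH):
    """Truncate, add <bos>/<eos>, then pad to a fixed length of max_length + 2."""
    tokens = simple_tokenizer(text)[:max_length]
    tokens = [START] + tokens + [STOP]
    return tokens + [PAD] * (max_length + 2 - len(tokens))


tokens = preprocess("A dog runs through the grass")
print(len(tokens))                            # fixed length of 22
print(tokens[:8])                             # <bos>, six words, <eos>
print(sum(token != PAD for token in tokens))  # caption length without padding
```

The last value is what the notebook stores in the `caption_length` column and later passes to `pack_padded_sequence`.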

### Define Embedding 

Our network does not use the actual words from the captions but a vector representation of each word taken from an embedding. The aim of an embedding is to map the high-dimensional representation of the words into a vector space in a meaningful way. This space has a significantly lower dimensionality, which makes the training more efficient.
Two things are required: a vocabulary has to be defined, and an embedding has to be created for the individual words of this vocabulary. There are two ways to obtain the embedding. Firstly, it is possible to train our own embedding. The implementation is not too complicated, but since the embedding must also be trained, the network learns more slowly and may not reach its full potential. The second option is to use pre-trained word embedding vectors, for example [GloVe](https://nlp.stanford.edu/projects/glove/).


Below we use the torchtext `Vocab` class to generate an integer-indexed vocabulary from the captions and read out the corresponding vectors from GloVe. In addition, the vocabulary is supplemented with our four special tokens \<bos\>, \<eos\>, \<unk\> and \<pad\>.

In [None]:
from collections import Counter, OrderedDict
from torchtext.vocab import Vocab, GloVe

# define embedding method
vectors = "glove.6B.100d"

# define minimal required occurrence of words
min_word_count = 3

# count vocabulary
vocab_count = Counter()
for caption in df_caption["text_tokens"]:
    vocab_count.update(caption)

# define vocabulary
vocab = Vocab(
    vocab_count,
    vectors=vectors,  
    min_freq=min_word_count, 
    specials=((start_token, stop_token, unknown_token, padding_token)))

# comparison between vocabs
glove = GloVe(name='6B', dim=100)
print("GloVe vocab:", glove.vectors.size())

print("Reduced vocab:", vocab.vectors.size())

**Description:** We see that the raw GloVe embedding contains a vocabulary of 400,000 tokens. Our reduced vocabulary, on the other hand, has 4094 words, barely 1 percent of that.
For these words we now have the corresponding pre-trained embedding `vocab.vectors`, which can be passed to the embedding layer of our model later on.
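
The principle behind the reduced vocabulary (special tokens first, rare words dropped, unknown words mapped to \<unk\>) can be sketched without torchtext. The helpers `build_vocab` and `encode` are illustrative assumptions, and the ordering of equally frequent words may differ from the torchtext `Vocab` class.

```python
from collections import Counter


def build_vocab(tokenized_captions, min_freq, specials):
    """Map tokens to integer ids; specials come first, rare tokens are dropped."""
    counts = Counter(token for caption in tokenized_captions for token in caption)
    itos = list(specials)
    for token, count in counts.most_common():
        if count >= min_freq and token not in specials:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos


def encode(tokens, stoi, unknown="<unk>"):
    """Replace every token by its id; tokens outside the vocabulary become <unk>."""
    return [stoi.get(token, stoi[unknown]) for token in tokens]


# Hypothetical toy corpus
captions = [["a", "dog", "runs"], ["a", "dog", "sleeps"], ["a", "cat", "runs"]]
stoi, itos = build_vocab(captions, min_freq=2, specials=("<bos>", "<eos>", "<unk>", "<pad>"))

ids = encode(["<bos>", "a", "cat", "runs", "<eos>"], stoi)
print(itos)
print(ids)
print([itos[i] for i in ids])  # "cat" occurs only once and comes back as <unk>
```

With a pre-trained embedding, each id would additionally index one row of the embedding matrix.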

#### Encoding of the dataframe

In [None]:
def encode_tokens(text):
    '''
    Encodes the tokens from string to integer using our vocabulary
    '''
    return [vocab.stoi[word] for word in text]

def inverse_encode_tokens(text):
    '''
    Decodes the tokens from integer back to string using our vocabulary
    '''
    return [vocab.itos[word] for word in text]

text_list = ["<bos>", "Anton", "is", "in", "this", "picture", ":)", "<eos>"]

print("Input:", text_list )
embedded_text = encode_tokens(text_list)
print("Encoding:", embedded_text)
reconstructed_text = inverse_encode_tokens(embedded_text)
print("Inverse Encoding:", reconstructed_text)

**Description:** We see an example of how the encoding works. The special tokens \<bos\> and \<eos\> are correctly encoded and decoded again. Since "Anton" and ":)" do not occur in our generated vocabulary, they are encoded with the unknown token \<unk\>. So the encoding and decoding works. In a next step, we therefore apply the integer encoding to the entire data set.

In [None]:
df_caption["text_encoded"] = df_caption["text_tokens"].apply(encode_tokens)
df_caption[["text", "text_tokens", "text_encoded"]].head()

### Train-test split 
For training, we divide the data set into a training set and a test set with a ratio of 4:1. It is important that all captions of an image end up in the same subset.

In [None]:
train_files, test_files = train_test_split(unique_filenames, test_size=0.2)
df_train = df_caption.loc[ df_caption["filename"].isin( list(train_files) )]
df_test = df_caption.loc[ df_caption["filename"].isin( list(test_files) )]


# save train and test split dataframes as pickles and load if already exists
if os.path.exists("./train.pickle") and os.path.exists("./test.pickle"):
    with open("./train.pickle", 'rb') as f:
        df_train =  pickle.load(f)
    with open("./test.pickle", 'rb') as f:
        df_test =  pickle.load(f)
else:
    with open("./train.pickle", 'wb') as f:
        pickle.dump(df_train, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open("./test.pickle", 'wb') as f:
        pickle.dump(df_test, f, protocol=pickle.HIGHEST_PROTOCOL)


train_img_labels = set(df_train["filename"])
test_img_labels = set(df_test["filename"])
print("Proportion of train set:", len(train_img_labels) / (len(train_img_labels) + len(test_img_labels)))
print("Proportion of test set:", len(test_img_labels) / (len(train_img_labels) + len(test_img_labels)))
print("Overlapping labels of train and test set:", sum([label in train_img_labels for label in test_img_labels]))
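
The requirement that all captions of an image stay in the same subset can be illustrated in isolation. The following sketch uses made-up `(filename, caption)` rows and a hypothetical helper `split_by_image`; it splits on the level of filenames first and only then assigns the rows.

```python
import random


def split_by_image(rows, test_share=0.2, seed=0):
    """Split (filename, caption) rows so that each image lands in exactly one subset."""
    filenames = sorted({filename for filename, _ in rows})
    rng = random.Random(seed)
    rng.shuffle(filenames)
    n_test = round(len(filenames) * test_share)
    test_files = set(filenames[:n_test])
    train = [row for row in rows if row[0] not in test_files]
    test = [row for row in rows if row[0] in test_files]
    return train, test


# Hypothetical data: 10 images with 5 captions each
rows = [(f"img_{i}.jpg", f"caption {j}") for i in range(10) for j in range(5)]
train, test = split_by_image(rows)

train_files = {filename for filename, _ in train}
test_files = {filename for filename, _ in test}
print(len(train), len(test))           # number of caption rows per subset
print(len(train_files & test_files))   # images that appear in both subsets
```

Splitting the caption rows directly would instead scatter the captions of one image across both subsets and leak test images into the training data.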

### Create train and test set

In [None]:
class Flickr8kDataset(Dataset):
    """
    Creates the dataset structure for training with pytorch
    Args:
        df (pandas DataFrame): contains the filenames and the captions of the pictures
        image_path (str): path to the image folder
        transform (callable, optional): Optional transform to apply on the images
        preload (bool): if true preloads the dataset to the memory
        """
    def __init__(self, df, image_path, transform=None, preload=False):
        # work on a copy so that the caller's dataframe is not modified
        self.df = df.copy()
        self.image_path = image_path
        self.transform = transform
        self.preload = preload

        if self.preload:
            self.images = []
            # assumes np.unique and groupby().ngroup() use the same sorted filename order
            for filename in tqdm(np.unique(self.df['filename'])):
                image = mpimg.imread(self.image_path + filename)
                if self.transform:
                    image = self.transform(image)
                self.images.append(image)

            self.df['image_idx'] = self.df.groupby('filename').ngroup()

    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, idx):
        df_row = self.df.iloc[idx]

        if self.preload:
            image = self.images[df_row['image_idx']]
        else:
            # use the path passed to the constructor instead of the global variable
            image = mpimg.imread(self.image_path + df_row['filename'])
            if self.transform:
                image = self.transform(image)

        caption = torch.from_numpy(np.array(df_row['text_encoded']))
        length = torch.from_numpy(np.array(df_row['caption_length']))
        return image, caption, length
        


In [None]:
train_set = Flickr8kDataset(df_train, image_path, transform=image_transform, preload=True)
test_set = Flickr8kDataset(df_test, image_path, transform=image_transform, preload=True)

### Define the dataloader

In [None]:
# Set seed 
torch.manual_seed(42)
batch_size = 64

train_dataloader = DataLoader(
    dataset=train_set, 
    batch_size=batch_size, 
    shuffle=True)

test_dataloader = DataLoader(
    dataset=test_set, 
    batch_size=batch_size, 
    shuffle=False)

In [None]:
example_batch = iter(train_dataloader)
samples, labels, length = next(example_batch)
np.shape(labels)

### Model Structure

The model consists of two main components. The image is fed into a deep Convolutional Neural Network (CNN), the `EncoderCNN`. This generates a vector representation, which is extracted from the last hidden layer. The resulting vector is then used as feature input of the `DecoderRNN`, which contains a Long Short-Term Memory (LSTM) network to generate the sentence.

#### CNN
For the CNN, we use the [ResNet18](https://pytorch.org/hub/pytorch_vision_resnet/) model from the PyTorch library, which has been pre-trained on the ImageNet dataset on an image classification task with 1000 classes. In general, it would also be possible to use any other CNN architecture for this task. To be able to use the network for our captioning task, the final fully connected layer is replaced and trained using transfer learning. The output of this new linear layer has to match the dimension of the embedding vectors, because the feature vector is fed into the subsequent LSTM network together with the word embeddings.


#### LSTM
The LSTM is now used to decode the feature vector. During training, it receives as input a `PackedSequence` consisting of the image feature vector followed by the embedded caption tokens. This enables the network to receive inputs of varying lengths. The output of the LSTM is then transformed back into the `vocab_size` dimension by an additional linear layer. In this way, the linear layer serves as a reverse encoding: its output contains one score per word in our vocabulary.

Because the labels are also "packed" with PyTorch's `pack_padded_sequence` function, the \<pad\> tokens are not included in the cost function when calculating the cross-entropy loss.
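
Why it matters that padded positions stay out of the loss can be seen in a small numeric sketch. It is written in plain Python and only mimics the effect of packing (keeping the first `length` steps of a sequence), not PyTorch's implementation; the probabilities and the `PAD` index are made up.

```python
import math


def cross_entropy(probabilities, target):
    """Negative log-likelihood of the target class for one time step."""
    return -math.log(probabilities[target])


def mean_loss(step_probabilities, targets, length=None):
    """Average the loss over the first `length` steps (all steps if length is None)."""
    if length is None:
        length = len(targets)
    pairs = zip(step_probabilities[:length], targets[:length])
    losses = [cross_entropy(p, t) for p, t in pairs]
    return sum(losses) / len(losses)


PAD = 3  # hypothetical index of the <pad> token
targets = [0, 2, 1, PAD, PAD]  # three real tokens followed by padding
probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
]

print(mean_loss(probs, targets))            # padded steps included
print(mean_loss(probs, targets, length=3))  # padded steps excluded
```

When the padded steps are included, the loss also reflects how well the model predicts \<pad\>, which says nothing about the quality of the caption.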

During training, the LSTM receives the tokens of the original captions as inputs (teacher forcing). Its own output is not fed back as input in this stage. If instead the generated output from the previous step were used as input, the predictions in the later steps of a sequence would depend very strongly on the previously generated output. The network would train on assumptions that are often incorrect, especially at the beginning of the training, which would slow down the training.
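
The alignment between decoder inputs and targets during training can be written out for one caption. This sketch mirrors the training loop of this notebook, where the image features are prepended and the last caption token is dropped from the inputs; the `"<image>"` placeholder and the helper name are illustrative assumptions.

```python
def teacher_forcing_pairs(caption_tokens, image_placeholder="<image>"):
    """Pair each decoder input with the token the decoder should predict at that step."""
    inputs = [image_placeholder] + caption_tokens[:-1]
    targets = caption_tokens
    return list(zip(inputs, targets))


caption = ["<bos>", "a", "dog", "runs", "<eos>"]
for step, (given, expected) in enumerate(teacher_forcing_pairs(caption)):
    print(step, given, "->", expected)
```

At every step the input is the ground-truth token from the caption, regardless of what the network predicted one step earlier.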

When predicting with the trained network, no captions are available, so the token with the highest score from the previous step is used as input for the next step of the LSTM (greedy decoding).
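
The prediction loop described here can be sketched independently of the network. `step_fn` stands for one LSTM step plus the linear layer, and `toy_step` is a made-up replacement that simply prefers the next token id; both names are assumptions for illustration.

```python
def greedy_decode(step_fn, first_input, eos_id, max_length=30):
    """Feed the most probable token back as the next input until <eos> or max_length."""
    generated = []
    current, state = first_input, None
    for _ in range(max_length):
        scores, state = step_fn(current, state)
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
        generated.append(predicted)
        if predicted == eos_id:
            break
        current = predicted
    return generated


def toy_step(current, state):
    """Hypothetical model: scores token 0 after the image, otherwise the next id."""
    vocab_size = 5
    next_id = 0 if current == "image" else (current + 1) % vocab_size
    scores = [0.0] * vocab_size
    scores[next_id] = 1.0
    return scores, state


print(greedy_decode(toy_step, "image", eos_id=3))
print(greedy_decode(toy_step, "image", eos_id=3, max_length=2))
```

Greedy decoding never revises an earlier choice, so one wrong token early in the sequence influences all following ones.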

#### Combination

Both model classes are integrated into a single `CNNtoRNN` class. This is primarily for structural reasons and makes it possible to call the functionality of both models through one combined class. Additionally, the function for captioning a single image is integrated here.


### Define models

In [None]:
class EncoderCNN(nn.Module):
    '''
    The Encoder class
    Args:
        embed_size (int)
        train_cnn (bool) if true trains the complete network
    '''
    def __init__(self, embed_size, train_cnn=False):
        super(EncoderCNN, self).__init__()
        self.train_cnn = train_cnn
        self.cnn_model = models.resnet18(pretrained=True)
        self.cnn_model.fc = nn.Linear(self.cnn_model.fc.in_features, embed_size) # resize output shape
        self.relu = nn.ReLU()
        self.bn = nn.BatchNorm1d(embed_size)

        # specify if the complete network should be trained or only the last layer;
        # this is set once here so that it already applies to the first forward pass
        for name, param in self.cnn_model.named_parameters():
            if "fc.weight" in name or "fc.bias" in name:
                param.requires_grad = True
            else:
                param.requires_grad = self.train_cnn

    def forward(self, images):
        features = self.cnn_model(images)
        return self.bn(self.relu(features))



class DecoderRNN(nn.Module):
    '''
    The decoder class
    Args:
        embed_size (int): size of the embeddings
        hidden_size (int): equal to embed size
        vocab_size (int): equals the number of words in the vocab
        num_layers (int): number of layers in the lstm
        dropout (float): dropout ratio for the embeddings
        pretrained_emb (bool): if true uses the pretrained glove embedding vectors
    '''
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers, dropout=0, pretrained_emb=False):
        super(DecoderRNN, self).__init__()
        if pretrained_emb:
            self.embedding = nn.Embedding.from_pretrained(vocab.vectors, freeze=True)
        else:
            self.embedding = nn.Embedding(vocab_size, embed_size) # embedding trained from scratch
        
        self.lstm = nn.LSTM(input_size=embed_size, 
                            hidden_size=hidden_size,
                            num_layers=num_layers,
                            # batch_first=True
                            )
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.dropout = nn.Dropout(dropout)
        

    def forward(self, features, captions, lengths):
        # embedding of captions
        embeddings = self.dropout(self.embedding(captions))
        # prepend the image features as the first element of every sequence
        embeddings = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        packed = pack_padded_sequence(embeddings, lengths, enforce_sorted=False, batch_first=True)

        output_packed, hidden = self.lstm(packed)
        output_padded, output_lengths = pad_packed_sequence(output_packed, batch_first=True)

        outputs = self.linear(output_padded)
        return outputs



class CNNtoRNN(nn.Module):
    '''
    Combines the encoder and the decoder in a single model class
    '''
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers, dropout=0, pretrained_emb=False):
        super(CNNtoRNN, self).__init__()
        self.encoderCNN = EncoderCNN(embed_size=embed_size)
        # note: dropout and pretrained_emb are accepted but not passed on to the decoder here
        self.decoderRNN = DecoderRNN(embed_size=embed_size, hidden_size=hidden_size, vocab_size=vocab_size, num_layers=num_layers)

    def forward(self, images, captions, lengths):
        features = self.encoderCNN(images)
        outputs = self.decoderRNN(features, captions, lengths)
        return outputs

    def caption_images(self, image, vocabulary, max_length = 30):
        '''Creates a caption for a single image'''
        caption_result = []
        with torch.no_grad():
            x = self.encoderCNN(image).unsqueeze(0)
            states = None
       
            for _ in range(max_length):
                hidden, states = self.decoderRNN.lstm(x, states)
                output = self.decoderRNN.linear(hidden.squeeze(0))
                predicted = output.argmax(1)

                caption_result.append(predicted.item())
                x = self.decoderRNN.embedding(predicted).unsqueeze(0)

                if vocabulary.itos[predicted.item()] == "<eos>":
                    break
            return "".join([vocabulary.itos[idx] + " " for idx in caption_result])



def caption_image(model, path=image_path, filename=None, transform=image_transform):
    '''
    Creates a caption for a single image
    Args:
        model (CNNtoRNN): model class
        path (str): path to image folder
        filename (str): name of image, if None a random image is selected
        transform (torchvision.transforms.Compose): the transformations of the image before feeding into the model
    '''
    model.eval()
    if not filename:
        filename = random.choice(os.listdir(path))
    image = mpimg.imread(path + filename)
    if transform is not None:
        image_tensor = transform(image)
    with torch.no_grad():
        image_tensor = (image_tensor[None, ...]).to(device)
        return filename, image, model.caption_images(image_tensor, vocab)  

def func_save_model(model, path, name):
    '''
    Saves the complete model object (not only its state dict).
    Args:
        model (CNNtoRNN): the model class
        path (str): path where to save the model
        name (str): file name without extension
    '''
    filename = "{}.pt".format(name)
    print(filename)
    if not os.path.exists(path):
        os.makedirs(path)
    torch.save(model, path + filename)

def func_save_log(log_dict, name):
    '''
    Saves the logging dictionary as pickle object
    Args: 
        log_dict (dict): contains the logged metrics from the training
        name (str): the name of the pickle file
    '''
    with open('{}.pickle'.format(name), 'wb') as f:
        pickle.dump(log_dict, f, protocol=pickle.HIGHEST_PROTOCOL)

def func_open_log(name):
    '''
    Opens the logging dictionary from a pickle object
    Args: 
        name (str): the name of the pickle file
    Returns: 
        log_dict (dict): contains the logged metrics from the training
    '''
    with open('{}.pickle'.format(name), 'rb') as f:
        return pickle.load(f)


In [None]:
import torch.optim as optim
from tqdm import tqdm
from torch.nn.utils.rnn import pack_padded_sequence
import pickle
import wandb

run_cell = False
# wandb
use_wandb=False
# model loading & saving
load_model = True
save_model = True
model_path = "./saved_models/"
model_name = "pretrained_embedding"
# training hyperparameters
pretrained_emb = True
embed_size=100
hidden_size = 100
vocab_size = vocab.vectors.size()[0]
num_layers = 1
learning_rate = 0.003
num_epochs=40
# BLEU evaluation
include_bleu_test_score = True
n_samples = 64 # nr of random samples for BLEU score


if run_cell:
    # define training modules
    if load_model:
        log_dict = func_open_log(model_name)
        step = log_dict[max(int(k) for k in log_dict.keys())]["step"]
        model = torch.load(model_path + "{}_e{}.pt".format(model_name, len(log_dict)), map_location=device)
    else: 
        log_dict = dict()
        step = 0
        # pass pretrained_emb by keyword; positionally it would be interpreted as dropout
        model = CNNtoRNN(embed_size, hidden_size, vocab_size, num_layers, pretrained_emb=pretrained_emb).to(device)

    criterion = nn.CrossEntropyLoss() # ignore_index
    optimizer = optim.Adam(model.parameters(), lr = learning_rate)
    if use_wandb:
        wandb.init(reinit=True, project="del_mc2", entity="simonluder")


    for epoch in range(len(log_dict)+1,len(log_dict)+1+num_epochs):
        print("epoch:", epoch)
        cumloss = 0

        model.train()
        for i, (imgs, captions, lengths) in tqdm(enumerate(train_dataloader)):

            step += len(imgs)
            imgs = imgs.to(device)
            captions = captions.to(device)

            # pack the targets so that <pad> positions are dropped
            packed_captions = pack_padded_sequence(captions, lengths, enforce_sorted=False, batch_first=True)[0]
            # the decoder sees the image features followed by all tokens except the last one
            outputs = model(imgs, captions[:,:-1], lengths)
            packed_outputs = pack_padded_sequence(outputs, lengths, enforce_sorted=False, batch_first=True)[0]

            loss = criterion(packed_outputs, packed_captions.type(torch.long))

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            cumloss += loss.item()

            if i % 10 == 0 and use_wandb:
                wandb.log({
                    "model":model_name,
                    "train_loss": loss.item(),  # log the plain number instead of the tensor
                    "epoch":epoch+1, 
                    }, step=step)

        model.eval()
        # calculate BLEU on randomly sampled test images
        mean_bleu = None
        if include_bleu_test_score:
            mean_bleu = 0
            for file in random.sample(list(set(df_test["filename"])), n_samples):
                # pass the sampled filename; otherwise a random image from the whole folder is captioned
                (file, _, caption) = caption_image(model, filename=file)
                reference_captions = df_caption.loc[df_caption["filename"]==file]["text"].apply(str.split).to_list()
                mean_bleu += sentence_bleu(reference_captions, caption.split()[1:-1])
            mean_bleu /= n_samples

            
        log_dict[epoch] = {
            "step":step, 
            "train_loss":cumloss,
            "test_bleu":mean_bleu
            }

        # save the model at the end of the epoch
        func_save_model(model, model_path, "{}_e{}".format(model_name, epoch))
        func_save_log(log_dict, model_name)


## Evaluation

### Visualizing Loss of the train set

In [None]:
log_dict = func_open_log(model_name)
test_bleu = []
epochs = []
train_losses = []

for i in range(1, len(log_dict)+1):
    test_bleu.append(log_dict[i]["test_bleu"])
    epochs.append(i)
    train_losses.append(log_dict[i]["train_loss"])

plt.figure(figsize=(10,5))
plt.plot(train_losses)
plt.title("Cumulative train loss over epochs")
plt.xlabel("epochs")
plt.ylabel("cumulative loss")
plt.show()

In [None]:
model_path = "./saved_models/"
model_name = "untrained_embedding"
epochs_trained = 250

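def eval_caption(model, filename=None):
    '''Captions an image and returns its filename and the BLEU score against the reference captions'''
    (file, image, caption) = caption_image(model, filename=filename)

    reference_captions = df_caption.loc[df_caption["filename"]==file]["text"].apply(str.split).to_list()

    # strip <bos> and <eos> before scoring; the weights use 1-grams and 2-grams equally
    bleu =  sentence_bleu(reference_captions, caption.split()[1:-1], weights=(1/2, 1/2))
    return file, bleu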

model = torch.load(model_path + "{}_e{}.pt".format(model_name, epochs_trained), map_location=device)
df_bleu_scores = pd.DataFrame(columns=["filename", "bleu_score", "set"])

for f in tqdm(set(df_test["filename"])):
    (file, bleu) = (eval_caption(model, filename=f))
    df_bleu_scores.loc[len(df_bleu_scores.index)] = [file, bleu, "test"]

for f in tqdm(set(df_train["filename"])):
    (file, bleu) = (eval_caption(model, filename=f))
    df_bleu_scores.loc[len(df_bleu_scores.index)] = [file, bleu, "train"]
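
To make the scores easier to interpret, the following sketch spells out a BLEU variant with equal weights on 1-grams and 2-grams, which corresponds to `weights=(1/2, 1/2)` above. It is a simplified re-implementation for illustration, not NLTK's code: there is no smoothing, and NLTK's handling of zero counts differs in detail. The function names and example sentences are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(references, hypothesis, n):
    """Share of hypothesis n-grams found in a reference, with counts clipped per reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


def bleu2(references, hypothesis):
    """Geometric mean of clipped 1- and 2-gram precision times the brevity penalty."""
    p1 = clipped_precision(references, hypothesis, 1)
    p2 = clipped_precision(references, hypothesis, 2)
    if p1 == 0 or p2 == 0:
        return 0.0
    hyp_len = len(hypothesis)
    # Reference length closest to the hypothesis length (the shorter one on ties)
    ref_len = min((len(r) for r in references), key=lambda length: (abs(length - hyp_len), length))
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.sqrt(p1 * p2)


references = [["a", "dog", "runs", "on", "grass"]]
print(bleu2(references, ["a", "dog", "runs", "on", "grass"]))  # identical caption
print(bleu2(references, ["a", "dog", "sleeps"]))               # partial overlap, too short
print(bleu2(references, ["two", "cats"]))                      # no overlap
```

Because the score is a geometric mean, a caption without a single matching 2-gram scores 0 in this sketch even if some individual words are correct.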


### Comparison of bleu scores for train and test set

In [None]:
def plot_bleu_distribution(scores, title):
    '''Creates a histogramm of the bleu scores'''
    plt.figure(figsize=(10, 5))
    plt.hist(scores, bins=25)
    plt.title(title)
    plt.xlabel("BLEU score")
    plt.ylabel("nr of samples")
    plt.show()

In [None]:
plot_bleu_distribution(df_bleu_scores.loc[df_bleu_scores["set"]=="train"]["bleu_score"], title = "Distribution of BLEU scores in train set" )
print("BLEU scores quantiles for train set:\n", df_bleu_scores.loc[df_bleu_scores["set"]=="train"]["bleu_score"].quantile([0,.25,.5,.75,1]).round(3))

**Description:** The histogram shows the distribution of the BLEU scores for all samples from the training set. The BLEU scores are spread over the entire value range with a concentration around 0.33. There are several captions with very good BLEU scores greater than 0.8. There are also many samples with a BLEU score of 0. Next, we compare this with the BLEU scores for the test set.

In [None]:
plot_bleu_distribution(df_bleu_scores.loc[df_bleu_scores["set"]=="test"]["bleu_score"], title = "Distribution of BLEU scores in test set" )
print("BLEU scores quantiles for test set:\n", df_bleu_scores.loc[df_bleu_scores["set"]=="test"]["bleu_score"].quantile([0,.25,.5,.75,1]).round(3))

**Description:** The histogram shows the distribution of the BLEU scores for all samples from the test set. Compared to the training set, significantly more images received a BLEU score of 0, and there are almost no images with a score of 0.6 or higher. There are several possible reasons for this. On the one hand, a slight deterioration of the average BLEU score can be expected due to the variance in sentence structures. On the other hand, the test set contains words and objects that did not occur in the training set. Thus, the model is sometimes not able to recognise these objects at all.

### Visualizing single examples

Now we will take a closer look at a handful of examples with very good and very bad BLEU Scores from the test set.

In [None]:
df_bleu_scores_test = df_bleu_scores.loc[df_bleu_scores["set"]=="test"]
df_bleu_scores_test = df_bleu_scores_test.sort_values("bleu_score", ascending=False).reset_index(drop=True)

def sample_presentation(model, filename):
    '''Shows an image with the corresponding caption, BLEU score and ranking compared to the other samples in the test set'''
    (file, image, caption) = caption_image(model, filename = filename)
    plt.imshow(image)
    plt.show()
    print("Captioning image: {}".format(file))  
    print("Caption sentence:", caption)
    reference_captions = df_caption.loc[df_caption["filename"]==file]["text"].apply(str.split).to_list()
    bleu =  sentence_bleu(reference_captions, caption.split()[1:-1], weights=(1/2, 1/2))
    print("BLEU score:",bleu)
    # position in the test set sorted by descending BLEU score (0 is the best sample)
    print("Ranking in test set according to BLEU score:", int(df_bleu_scores_test.loc[df_bleu_scores_test["filename"] == file].index[0]), "of", len(df_bleu_scores_test))

#### Good captions

In [None]:
files = ["3640020134_367941f5ec.jpg", "3148811252_2fa9490a04.jpg", "1019604187_d087bf9a5f.jpg", "2660008870_b672a4c76a.jpg", "3728015645_b43a60258b.jpg", "2313230479_13f87c6bf3.jpg"]
for file in files:
    sample_presentation(model, file)

**Description:**

#### Bad captions

In [None]:
files = ["3640020134_367941f5ec.jpg", "3148811252_2fa9490a04.jpg", "1019604187_d087bf9a5f.jpg", "2660008870_b672a4c76a.jpg", "3728015645_b43a60258b.jpg", "2313230479_13f87c6bf3.jpg"]
for file in files:
    sample_presentation(model, file)

**Description:**


In [None]:
plt.plot(test_bleu)

## Conclusion

In summary, I am satisfied with the result achieved. The quality of the captions ranges from quite decent to completely wrong. I think the model is relatively well able to recognise individual objects,