In [None]:
import os
from pathlib import Path

import json
import math

import pandas as pd

import torch
from torch.utils.data.dataset import Dataset

from tokenizers import ByteLevelBPETokenizer

import transformers

from transformers import RobertaConfig
from transformers import RobertaForMaskedLM # RobertaLM for learning
from transformers import RobertaTokenizerFast # After training tokenizer we will wrap it so it can be used by Roberta model

from transformers import DataCollatorForLanguageModeling
from transformers import pipeline

from transformers import Trainer, TrainingArguments

In [None]:
IMAGES_DIRECTORY = "./data/Images"
TOKENIZER_DIRECTORY = "./models/Byte_tokenizer"
ROBERTA_DIRECTORY = "./models/RobertaDecoder"

TRAIN_BATCH_SIZE = 64   # batch size per device during training
VALID_BATCH_SIZE = 64   # batch size per device during evaluation
LEARNING_RATE = 1e-4    # initial learning rate for the optimizer
MAX_LEN = 128           # maximum caption length in tokens
VOCAB_SIZE = 10000      # size of the tokenizer vocabulary

TRAIN_EPOCHS = 10       # number of epochs to train
WEIGHT_DECAY = 0.01     # weight decay coefficient

# Training Decoder

## Abstract

In this notebook we will train a decoder for our very own image captioning model. We will also cover the most important concepts required to understand how our image captioning model works.

## Concepts

### Image Captioning

Image captioning is an end-to-end sequence-to-sequence task in which the image pixels are the input sequence and a caption describing the image is the desired output.

Because images and text are very different kinds of data, two models tied together are required to solve this task: one that encodes the image and one that decodes the encoding into a text sequence.

This idea is analogous to the encoders and decoders traditionally used to process and transmit electronic signals.

### Encoder-Decoder

Encoder-decoder is a type of neural network architecture that is commonly used for image captioning. The basic idea behind this architecture is to first "encode" an image into a fixed length representation, which captures the main features and characteristics of the image. This encoded representation is then passed through a "decoder" network, which generates a caption for the image.

The encoder network typically uses convolutional neural networks (CNNs) to extract features from the image. The decoder network typically uses recurrent neural networks (RNNs) to generate the caption, as it needs to consider the context of the previous words in the sequence when generating the next word in the caption.

In simple terms, the encoder-decoder network takes an image as input, transforms it into a compact representation using the encoder, and then generates a descriptive caption using the decoder.

#### 1-Encoder

The encoder in image captioning is a deep neural network that takes an image as input and generates a dense fixed-length representation, or encoding, of the image. The purpose of the encoder is to extract and compress meaningful information from the image and represent it in a compact form that can be used by the decoder to generate a caption. The encoder typically consists of several convolutional and pooling layers that are designed to capture the hierarchical structure and spatial relationships of the objects and features in the image. The output of the encoder is usually a feature map or a set of feature vectors that encapsulate the important information about the image.

Here we will use Vision Transformer (ViT) as encoder.

#### 2-Decoder

The decoder for image captioning is responsible for generating text captions based on the encoded information from the image. It typically uses an attention mechanism that allows the decoder to focus on different parts of the encoded image at different times while generating the caption. The decoder processes the encoded information and produces a sequence of words that describe the content of the image. The decoder uses the encoded information as a context vector, and at each step of the caption generation process, it predicts the next word in the caption based on this context vector and the previously generated words. The decoder generates the final caption word by word until it produces a complete sentence that describes the image. In our model the decoder will be Roberta.
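
The word-by-word generation described above can be sketched as a greedy decoding loop. This is only an illustration under stated assumptions: `greedy_decode`, `toy_next_token_scores` and `TOY_TABLE` are hypothetical names, the "model" is a lookup table rather than RoBERTa, and the image context is ignored. The markers `<s>` and `<e>` are the start and end tokens used in our captions.

```python
# Toy stand-in for a decoder: scores for the next word given the last word.
# A real decoder would also use the encoded image (the context).
TOY_TABLE = {
    "<s>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.8, "<e>": 0.2},
    "dog": {"runs": 0.6, "<e>": 0.4},
    "runs": {"<e>": 1.0},
}

def toy_next_token_scores(context, tokens):
    """Return hypothetical scores for the next token (context is unused here)."""
    return TOY_TABLE[tokens[-1]]

def greedy_decode(next_token_scores, context, bos, eos, max_len=10):
    """Generate a caption token by token, always picking the best-scoring token."""
    tokens = [bos]
    for _ in range(max_len):
        scores = next_token_scores(context, tokens)
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == eos:  # stop once the end marker is produced
            break
    return tokens

caption = greedy_decode(toy_next_token_scores, context=None, bos="<s>", eos="<e>")
print(caption)
```

Greedy selection is the simplest strategy. Alternatives such as beam search keep several candidate captions at each step.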

### Attention

Attention is a mechanism in deep learning models that allows the model to focus on specific parts of the input, rather than processing all of it equally. It works by assigning importance weights to different elements of the input, which are then used to compute the final representation of the input.

The attention mechanism can be represented mathematically as a dot product operation between a query vector and a set of key-value pairs, followed by a softmax function. The formula is as follows:

$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \cdot V$$

where:

* $Q$: The query matrix is used to represent the current state or context in which the model is trying to make a prediction. It is often a learned representation of the input.

* $K$: The key matrix represents the elements in the input that the model wants to attend to. The dot product between the query and key matrices determines the attention weights for each key.

* $V$: The value matrix represents the information that is associated with each key. The attention weights are used to compute a weighted sum of the values, which is used as the output of the attention mechanism.

* $d_k$: The dimension of the keys determines the size of the dot product between the query and key matrices. It is used as a scaling factor in the attention mechanism to prevent the dot product from becoming too large.

* $\cdot$: The matrix product between the attention weights and the value matrix. For each query it computes the weighted sum of the value vectors.

* $K^T$: The transpose of the key matrix. Transposing $K$ means that entry $(i, j)$ of $QK^T$ is the dot product between the $i$-th row of $Q$ (a query) and the $j$-th row of $K$ (a key).

* $\frac{QK^T}{\sqrt{d_k}}$: This expression computes the dot product between the query and key matrices, scaled by the square root of the dimension of the keys. The result of this operation is used as the input to the softmax function to compute the attention weights.

The attention mechanism allows the model to focus on the most relevant parts of the input, which helps improve its performance on various tasks.
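
To make the formula concrete, here is a minimal dependency-free sketch of scaled dot-product attention. It uses plain Python lists instead of tensors, and the matrices `Q`, `K` and `V` are made-up toy values. A real model would use learned projections and a tensor library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for lists of query, key and value vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Dot product of the query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is the attention-weighted sum of the value vectors
    output = [[sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
              for row in weights]
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0]]                # 2 queries
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 keys
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # 3 values, one per key

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights[0])  # attention weights of the first query over the 3 keys
print(output[0])   # weighted sum of the values for the first query
```

Each row of `weights` sums to 1, so every output row is a weighted average of the rows of `V`.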

#### Cross-Attention

Cross-attention refers to a type of attention mechanism where the attention scores are computed between the queries and keys from two different inputs.

For example, in a machine translation task, the queries come from the target-language sequence that is being generated, while the keys and values come from the source-language sequence. The attention scores determine how much attention the model should pay to each word of the source sentence when predicting the next word of the translation. In image captioning the roles are the same: the queries come from the caption tokens and the keys and values come from the encoded image.

In this way, cross-attention allows the model to capture complex relationships between the inputs, leading to more accurate predictions and improved performance on various tasks.

#### Self-Attention

Self-attention is a type of attention mechanism where the queries, keys, and values all come from the same input. The model computes attention scores between the elements of the input to determine how much importance to assign to each element when making a prediction.

For example, in a language modeling task, the input might be a sequence of words and the self-attention mechanism might compute attention scores between each pair of words in the sequence. The attention scores can then be used to determine the most relevant words to consider when making a prediction for the next word in the sequence.

By computing attention scores between elements of the same input, self-attention allows the model to capture dependencies and relationships between elements of the input in a direct and intuitive way, leading to improved performance on various tasks. The main difference between Self- and Cross-Attention is the source of the queries, keys, and values.
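
The difference in where the queries, keys and values come from can be shown with a small sketch. The feature vectors below are hypothetical toy values, the function `attention` is a compact plain-Python version of the scaled dot-product formula, and the learned projections of a real model are left out.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors; one output per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Hypothetical toy features: 3 caption tokens and 4 image patches, 2 dimensions each
caption_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_patches = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]

# Self-attention: queries, keys and values all come from the caption
self_out = attention(caption_tokens, caption_tokens, caption_tokens)

# Cross-attention: queries from the caption, keys and values from the image
cross_out = attention(caption_tokens, image_patches, image_patches)

print(len(self_out), len(cross_out))
```

In both cases there is one output vector per caption token. Only the source of the information that gets mixed into it changes.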

### RoBERTa

RoBERTa (Robustly Optimized BERT Pretraining Approach) is a language model developed by Facebook AI Research. It is based on the popular BERT (Bidirectional Encoder Representations from Transformers) architecture and uses Transformer networks to process text data.

In simple terms, RoBERTa is a machine learning model that has been trained on a large corpus of text data to understand the patterns and relationships between words and sentences in a language. It can then be used for various NLP tasks such as text classification, question answering, and text generation.

RoBERTa keeps the architecture of BERT but changes the pre-training recipe: it trains longer on more data with larger batches, uses dynamic masking and drops the next-sentence prediction objective. Its byte-level BPE tokenizer can encode any input text, which avoids out-of-vocabulary words. RoBERTa is widely used in NLP research and industry applications.

## Preparing the Captions

First, we will load our captions from a JSON file.

In [None]:
with open('./data/captions.json', 'r') as openfile:
    caption_dict = json.load(openfile)

Next, we will get the paths of the images:

In [None]:
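for image_path in list(caption_dict.keys()):
    # Remove the entry under its old key; only .jpg entries are re-added below
    captions_for_image = caption_dict.pop(image_path)
    if image_path.endswith('jpg'):
        file_name = image_path.replace("ImagesImages", "").split('/')[-1]
        # Join with a path separator so the key points inside IMAGES_DIRECTORY
        new = os.path.join(IMAGES_DIRECTORY, file_name)
        caption_dict[new] = captions_for_image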

After that we will get our dataframe with images and captions.

In [None]:
df = pd.DataFrame([])

captions = []
images = []
for image in list(caption_dict.keys()):
    captions_for_image = caption_dict[image]
    for caption in captions_for_image:
        captions.append(caption.replace('<s> ','').replace('  <e>','').strip())
        images.append(image)
        
df['images'] = images
df['captions'] = captions
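
To see what the chained `replace` calls do to a single caption, here is a small sketch. The raw caption is a made-up example in the format that the `<s>` and `<e>` markers suggest, and `clean_caption` is a hypothetical helper name.

```python
def clean_caption(raw_caption):
    """Strip the start/end markers and surrounding whitespace from one caption."""
    return raw_caption.replace('<s> ', '').replace('  <e>', '').strip()

# Hypothetical raw caption wrapped in the start and end markers
raw = "<s> a dog runs through the grass  <e>"
cleaned = clean_caption(raw)
print(repr(cleaned))
```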

Here is what our DataFrame with paths and captions looks like.

In [None]:
df.head()

### Training the Decoder Model for Language Understanding and build Vocabulary

In this section we will train our RoBERTa decoder and create a tokenizer.

### Tokenizer

Here we will train a byte-level tokenizer.

Byte-level tokenization is a method of splitting a string of text into individual tokens, where each token is defined as a sequence of contiguous bytes. In this method, the tokenization process does not consider the structure or meaning of the text, but rather focuses solely on the individual bytes and their sequences.

This type of tokenization can be useful in certain applications, such as processing text in non-Latin scripts, where words are not separated by spaces and traditional word-level tokenization methods may not work well. By using byte-level tokenization, the model can still work with the text, even if it doesn't have a complete understanding of the language.

In summary, byte-level tokenization is a simple and flexible method of splitting text into individual tokens, where each token is defined as a sequence of bytes, regardless of the structure or meaning of the text.
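
The tokenizer we train is a byte-level BPE tokenizer: it starts from single bytes and repeatedly merges the most frequent pair of adjacent tokens into a new token. The sketch below shows one such merge step in plain Python. The three texts are made-up examples, the helper names are hypothetical, and a real implementation adds details such as pre-tokenization and a minimum pair frequency.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the most frequent pair of adjacent tokens and its count."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0]

def merge_pair(seq, pair):
    """Replace every occurrence of the pair in seq with the merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

texts = ["a dog runs", "a dog jumps", "the dog"]

# Step 0: every text becomes a sequence of single UTF-8 bytes
sequences = [[bytes([b]) for b in text.encode("utf-8")] for text in texts]

# Step 1: find the most frequent adjacent pair and merge it everywhere
pair, count = most_frequent_pair(sequences)
sequences = [merge_pair(seq, pair) for seq in sequences]

print(pair, count)
print(sequences[0])
```

Repeating the merge step until the vocabulary reaches the desired size yields the sub-word units.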

#### Converting captions into .txt files for training the tokenizer

The `train` method of the Byte-Level BPE tokenizer reads its training data from text files, so we first need to write our captions to disk. Here we write each caption to its own `.txt` file and later pass the list of all file paths to the tokenizer. Keeping the texts in separate files also makes it easier to work with the data in a parallelizable way, which can be important when dealing with large amounts of text data.

Firstly, we will create a directory for the txt files, if it does not exist:

In [None]:
if not os.path.exists("./data/text_split"):
    os.mkdir("./data/text_split")

Then, we will define a function that writes each caption to its own file:

In [None]:
def captions_to_files(column, prefix, txt_files_dir="./data/text_split"):
    # The prefix is a running ID that keeps file names unique so nothing is overwritten
    i = prefix
    # Write every value of the column to its own file
    for row in column.to_list():
        # Create the filename using the running ID
        file_name = os.path.join(txt_files_dir, str(i) + '.txt')
        try:
            # Create the file and write the caption to it as UTF-8 bytes
            with open(file_name, 'wb') as f:
                f.write(row.encode('utf-8'))
        except Exception as e:  # catch exceptions (e.g. empty rows)
            print(row, e)
        i += 1
    # Return the next free ID
    return i

Then, we will write the captions to files:

In [None]:
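captions_column = df["captions"]
# Replace newlines inside each caption (Series.replace would only match whole values)
captions_column = captions_column.str.replace("\n", " ", regex=False)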

prefix=0

# the function returns the last prefix, so we can be sure that there are no overwritten files
prefix = captions_to_files(captions_column, prefix)

In [None]:
assert prefix == len(captions_column)

#### Training tokenizer

Now, we can create and train our tokenizer.

Firstly, we will get all the paths of the files we created above:

In [None]:
text_splits_paths = [str(x) for x in Path(".").glob("./data/text_split/*.txt")]

Then, we will initialize our tokenizer with parameter `lowercase` set to `True`.

In [None]:
tokenizer = ByteLevelBPETokenizer(lowercase=True)

Now, we will train our tokenizer (this may take a few minutes), but first let us look at the parameters we pass to the `train` function.

* 1. vocab_size - This parameter determines the maximum size of the vocabulary, i.e. the number of tokens (sub-word units, including the special tokens) the tokenizer will know.

* 2. min_frequency - This parameter determines the minimum number of times a pair of tokens must occur in the training data to be merged into a new sub-word unit. Sub-words that are seen only once are rarely useful: they tend to capture noise such as typos rather than recurring word structure, and they take up space in the vocabulary. In contrast, sub-word units that are seen multiple times in the training data are more likely to be meaningful and to capture important word structures in the text, making them more useful for encoding text data.

* 3. special_tokens - They are specific string tokens that have a special meaning or function in a language model. They are used to perform tasks such as marking the beginning and end of a sequence or padding it to a fixed length. These special tokens play a crucial role in allowing NLP models to effectively process and generate text data. The special tokens we pass to `train` are:

    * `<s>` (start token): Marks the beginning of a sequence.
    * `<e>` (end token): Marks the end of a sequence. It plays the role of the separation token ([SEP] in BERT), which separates different text sequences within a single input.
    * `<pad>` (padding token, [PAD] in BERT): Used to pad sequences to a fixed length during training and inference.
    * `<mask>` (masking token, [MASK] in BERT): Used to hide a token for language modeling tasks such as masked language modeling.
    * `<unk>` (unknown token, [UNK] in BERT): Used to represent out-of-vocabulary words that are not present in the model's vocabulary.

In [None]:
%%time 

tokenizer.train(files=text_splits_paths, vocab_size=VOCAB_SIZE, min_frequency=2,
                special_tokens=[
                                "<s>",
                                "<pad>",
                                "<e>",
                                "<unk>",
                                "<mask>",
])

After training the tokenizer, we can see what its vocabulary consists of:

In [None]:
list(tokenizer.get_vocab().keys())[0:10]

In the vocabulary sample, sub-word units such as "corridor" and "rummaged" carry the prefix "Ġ". This character is how byte-level BPE displays a leading space, so it marks a token that starts a new word. Frequent character sequences are merged into a single sub-word unit, which helps to reduce the size of the vocabulary and capture common patterns in the text data.

Note that other sub-word units, such as "sing" and "ink", have no "Ġ" prefix. They represent pieces that continue a word (or appear at the very start of a text) rather than pieces that follow a space.

#### Save Tokenizer

Here, we do two simple things:

* 1. Create a directory for the tokenizer and
* 2. Save the tokenizer in the new directory

If you want, go to the tokenizer directory and check what is in it.

In [None]:
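# Create the directory (and any missing parents); do nothing if it already exists
os.makedirs(TOKENIZER_DIRECTORY, exist_ok=True)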
tokenizer.save_model(TOKENIZER_DIRECTORY)

## Decoder

#### Creating the required objects

The model cannot be trained by simply pointing it at text files, as we did with the tokenizer. Instead, we need to define a class that extends the PyTorch `Dataset` class and tokenizes our captions.

In [None]:
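class CustomDataset(Dataset):
    def __init__(self, captions, tokenizer):
        self.examples = []

        for caption in captions.values:
            # Truncate long captions; padding is left to the data collator
            x = tokenizer(caption, max_length=MAX_LEN, truncation=True)
            self.examples.append(x.input_ids)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        # We'll pad at the batch level.
        return torch.tensor(self.examples[i])

# Wrap the trained byte-level BPE tokenizer so that it exposes the Hugging Face
# tokenizer API used by the dataset, the data collator and `save_pretrained`.
# Assumption: `<e>` takes the role of RoBERTa's default end/separator token `</s>`.
tokenizer = RobertaTokenizerFast.from_pretrained(
    TOKENIZER_DIRECTORY, model_max_length=MAX_LEN, eos_token="<e>", sep_token="<e>"
)

# Create the train and evaluation dataset
train_dataset = CustomDataset(captions_column[:38000], tokenizer)
eval_dataset = CustomDataset(captions_column[38000:], tokenizer)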

After that we need to define our data collator. Data collators are objects that form a batch from a list of dataset elements. These elements are of the same type as the elements of `train_dataset` or `eval_dataset`. To be able to build batches, data collators may apply some processing (like padding).

The `mlm_probability` parameter defines the probability with which to randomly mask tokens in input.

In [None]:
# Define the Data Collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
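
The masking controlled by `mlm_probability` can be sketched in plain Python. This is a simplified illustration, not the collator's actual code: `mask_tokens` is a hypothetical helper, the token IDs are made up, and the special-token IDs assume that the tokenizer assigned them in the order they were passed to `train`. The 80/10/10 split and the label value `-100` follow the behaviour of the Hugging Face collator for masked language modeling.

```python
import random

# Assumed IDs: special tokens get the first IDs in the order passed to `train`
SPECIAL_IDS = {"<s>": 0, "<pad>": 1, "<e>": 2, "<unk>": 3, "<mask>": 4}
VOCAB_SIZE = 10000
IGNORE_INDEX = -100  # label value that the loss function ignores

def mask_tokens(token_ids, mlm_probability=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, token_id in enumerate(token_ids):
        # Never mask special tokens; select the rest with probability mlm_probability
        if token_id in SPECIAL_IDS.values() or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = SPECIAL_IDS["<mask>"]      # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

token_ids = [0, 57, 312, 1045, 88, 219, 2]  # hypothetical caption: <s> ... <e>
inputs, labels = mask_tokens(token_ids, mlm_probability=0.5)
print(inputs)
print(labels)
```

Only positions whose label is not `-100` contribute to the loss, so the model learns to reconstruct the selected tokens from their context.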

#### Initialization & Training

Now, we will create our model. First, we need to define its configuration:

* vocab_size: The same parameter we passed to the tokenizer.
* max_position_embeddings: The maximum sequence length that this model might ever be used with.
* num_attention_heads: Number of attention heads in each attention layer. The hidden size of the model (768 by default) must be divisible by this number.
* num_hidden_layers: Number of hidden layers in the Transformer encoder.
* type_vocab_size: The vocabulary size of the `token_type_ids` passed when calling RobertaModel.

In [None]:
config = RobertaConfig(
    vocab_size=VOCAB_SIZE,
    max_position_embeddings=2048,
    num_attention_heads=16,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now, we will define our model and see how many parameters it has. Keep in mind that a model with a large number of parameters may not fit in the memory of every GPU.

In [None]:
model = RobertaForMaskedLM(config=config)

print('Num parameters: ',model.num_parameters())

#### Training the Decoder

Firstly, we need to define the configuration for training.

We have the following parameters:

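* evaluation_strategy: This parameter defines how often the model is evaluated. With `'epoch'`, evaluation runs at the end of every epoch, i.e. after each full pass over the training dataset. Note that recent versions of `transformers` name this argument `eval_strategy`.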

* learning_rate: The learning rate is used to govern the pace at which an algorithm updates or learns the values of a parameter estimate. In other words, the learning rate regulates the weights of our model concerning the loss gradient.

* weight_decay: Weight decay is a regularization technique applied to the weights of a neural network. We minimize a loss function comprising both the primary loss function and a penalty on the norm of the weights.

* per_device_train_batch_size and per_device_eval_batch_size: Here we define the number of samples in a batch.

* save_steps: Number of update steps between two checkpoint saves.

* save_total_limit: If a value is passed, will limit the total amount of checkpoints.
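
To illustrate how `learning_rate` and `weight_decay` act on a weight, here is a sketch of a single update step. It is a deliberate simplification: it uses plain gradient descent with decoupled weight decay, whereas `Trainer` uses an AdamW optimizer by default, which additionally rescales the gradient. The function name and the numbers are made up for the example.

```python
def sgd_step_with_weight_decay(weights, grads, learning_rate, weight_decay):
    """One plain gradient step with decoupled weight decay (simplified AdamW)."""
    return [w - learning_rate * g - learning_rate * weight_decay * w
            for w, g in zip(weights, grads)]

weights = [1.0, -2.0]
grads = [0.5, 0.0]  # hypothetical gradients of the loss

updated = sgd_step_with_weight_decay(
    weights, grads, learning_rate=1e-4, weight_decay=0.01
)
print(updated)
```

The second weight has a zero gradient, yet it still shrinks slightly towards zero. That shrinkage is the regularizing effect of weight decay.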

In [None]:
training_args = TrainingArguments(
    output_dir=ROBERTA_DIRECTORY,
    overwrite_output_dir=True,
    evaluation_strategy = 'epoch',
    num_train_epochs=TRAIN_EPOCHS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=VALID_BATCH_SIZE,
    save_steps=8192,
    save_total_limit=1,
)

Now we can initialize our `Trainer` class.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

And we train it.

In [None]:
# Train the model
trainer.train()

### Evaluation and saving

### Check Perplexity score of the model

Perplexity is a measure of how well a language model predicts a sequence of tokens. It is the exponential of the average negative log-likelihood per token, which is why we compute it as the exponential of the evaluation loss. Because our model is trained with masked language modeling, this loss is computed on the masked tokens only. Lower perplexity values indicate that the model is better at predicting the text, whereas higher perplexity values indicate that the model is less confident in its predictions.

In [None]:
eval_results = trainer.evaluate()

print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
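
As a worked example of the formula, the sketch below computes perplexity from the probabilities a model assigned to the correct tokens. The probabilities are made-up numbers and `perplexity` is a hypothetical helper. In the notebook the average negative log-likelihood is the `eval_loss` reported by the trainer.

```python
import math

def perplexity(token_probabilities):
    """Exponential of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in token_probabilities]
    mean_nll = sum(nll) / len(nll)
    return math.exp(mean_nll)

# A model that spreads its probability evenly over 4 options at every position
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is fairly sure about the correct token at every position
confident = perplexity([0.9, 0.8, 0.95])

print(uniform, confident)
```

A perplexity of 4 means the model is as uncertain as if it had to choose uniformly among four tokens. A perfect model would reach the minimum value of 1.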

### Saving tokenizer & Model to use

In [None]:
tokenizer.save_pretrained(TOKENIZER_DIRECTORY)

In [None]:
trainer.save_model(ROBERTA_DIRECTORY)

### Testing our decoder on a sample

First, we load the pipeline.

In [None]:
fill_mask = pipeline(
    "fill-mask",
    model= ROBERTA_DIRECTORY,
    tokenizer= TOKENIZER_DIRECTORY
)

Then we use it to predict masked token.

In [None]:
fill_mask("a girl going to a <mask> building")

The pipeline returns the most likely tokens for the masked position together with their scores, so you can check whether the predictions fit the sentence.

## Summary

In this notebook we trained a RoBERTa decoder, learned a bunch of new things (probably) and saw how to use HuggingFace. In the next notebook we will use our model as the decoder in our image captioning model, where it will be connected to a ViT encoder model through cross-attention heads.