<a href="https://colab.research.google.com/github/SohamMashetty/Tl_ML_Task/blob/main/ES23BTECH11036_TL_MLDomainTask.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tinkerers Lab ML Domain Task**






# Problem Statement
You are provided with a dataset containing a collection of images and corresponding captions that describe
the images. Your task is to develop a machine learning model that can generate captions for the images
in the dataset. The model should produce human-like descriptions of the images, capturing the key details
and context within each image.

## Importing required libraries
*   The BertTokenizer is used to tokenize the captions. Tokenization converts the text captions into a format that can be processed by the model, specifically into input IDs. BERT's

*   Dataset and DataLoader are used to create and manage the data pipeline. In the model I created a class called FlickrDset that handles the loading and preprocessing of images and captions.

*   DataLoader is used to create iterators for batching and shuffling the data during training and evaluation.

*   torch.nn.functional provides various functions for neural network operations. For example, I have used the cross_entropy function to compute the loss between the predicted captions and the ground truth captions.

*   torch.nn is used to build neural network layers and models. nn.Module is the base class for all neural network modules in PyTorch. I have used layers such as nn.Embedding, nn.Linear, and nn.TransformerDecoder in the model.

*   PyTorch is the core library which is generally used for building and training neural networks. It provides tensor operations, automatic differentiation, and other utilities needed for deep learning tasks.

*   feature_extraction is used to create a feature extractor from a pre-trained vision model (a Vision Transformer, ViT, in this case, rather than a CNN). This allowed me to use the intermediate features from the pre-trained model as inputs to the caption generator.

*   torchvision is used for tasks related to computer vision.

*   matplotlib.pyplot is used for visualizing images and plots. I have used it to display images along with the generated captions.

*   I used tqdm to create progress bars for loops. This helps in tracking the progress of the training and evaluation loops, since training takes approximately 7 hours and evaluation approximately 2 hours.

*   PIL (Python Imaging Library) handles image loading and manipulation. I have used it to open images and convert them to the appropriate format for preprocessing.

*   I used os for handling file paths and directory operations. This was needed to load the data from the drive in this particular case.














In [None]:
from transformers import BertTokenizer

from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
import torch.nn as nn
import torch

from torchvision.models import feature_extraction
import torchvision

import matplotlib.pyplot as plt
from tqdm import tqdm
from PIL import Image
import numpy as np
import os

## Setting Random Seeds for Reproducibility
This code block sets the random seeds for NumPy and PyTorch to ensure reproducibility of the results. By fixing the random seed, the same sequence of random numbers is generated each time, which helps in obtaining consistent and comparable results across different runs of the model.

In [None]:
SEED = 123
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

torch.backends.cudnn.deterministic = True
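
As a small illustration of what fixing the seed buys, the sketch below uses NumPy only and re-seeds the global generator before drawing a permutation twice. The helper name `draw_permutation` is hypothetical and not part of the notebook; it is only meant to show that the same seed yields the same shuffle, which is what makes the later train/val/test split repeatable.

```python
import numpy as np

def draw_permutation(seed, n=10):
    """Re-seed NumPy's global generator and draw a permutation of range(n)."""
    np.random.seed(seed)
    return np.random.permutation(n)

first = draw_permutation(123)
second = draw_permutation(123)

# same seed, same sequence of random numbers, hence the same permutation
print(np.array_equal(first, second))
```

The same idea applies to the PyTorch seeds set above, although full determinism on GPU can depend on further settings that are not covered here.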

## Setting up hyperparameters, Giving ratios to split data and Setting up directory paths


In [None]:
DATA_DIR = "/content/drive/MyDrive/image_captions"
IMAGES_DIR = os.path.join(DATA_DIR,'Images')

SPLIT_RATIO = {
     'train' : 0.6,
     'val' : 0.2,
     'test' : 0.2,
 }
N_EPOCHS = 10
BATCH_SIZE = 128
EMBED_SIZE = 768
LEARNING_RATE = 1e-4
NHEAD=1
NUM_LAYERS=1
BEST_MODEL_FILEPATH = "best_model.pt"
EARLY_STOPPING_STEP_SIZE = 5



WEIGHTS = torchvision.models.ViT_B_16_Weights.DEFAULT
BASE_MODEL = torchvision.models.vit_b_16
FEAT_LAYER = "getitem_5"
FEAT_DIMS = 768

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
DEVICE

'cpu'

## Loading Caption Data
This cell reads and parses the captions dataset, splitting it into image filenames and corresponding captions. The function load_data reads the file captions.txt, skipping its first line. Each remaining line is stripped of surrounding whitespace and double quotes, then split at the first comma into an image filename and a caption. The resulting rows are unzipped into one tuple of filenames and one tuple of captions. The assertion afterwards checks that the number of filenames matches the number of captions.

In [None]:
def load_data(fp):
    with open(fp, mode='r') as f:
        data = [row.strip().replace('"', '').split(',', 1) for idx, row in enumerate(f) if idx > 0]
    return zip(*data)

data_filepath = os.path.join(DATA_DIR, 'captions.txt')
images, captions = load_data(data_filepath)
assert len(images) == len(captions)

n_data = len(images)
print(f"There are {n_data} images and captions")

There are 40455 images and captions
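
To make the per-line parsing concrete, here is a minimal sketch that applies the same strip / replace / split steps to a single row. The sample row is made up for illustration and is not taken from captions.txt, and `parse_row` is a hypothetical helper name.

```python
def parse_row(row):
    """Mirror the per-line logic of load_data: strip, drop quotes, split once."""
    return row.strip().replace('"', '').split(',', 1)

# hypothetical line in the style of captions.txt (not taken from the dataset)
sample = 'example_image.jpg,"A dog runs across the grass, chasing a ball ."\n'
image_name, caption = parse_row(sample)

print(image_name)
print(caption)
```

Because the split uses `maxsplit=1`, commas inside the caption stay part of the caption rather than producing extra fields.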


## Tokenizing Captions and Determining Maximum Sequence Length
Here I initialize a BERT tokenizer to preprocess the captions and convert them into token IDs, a format the model can process. The cell also computes the maximum length of a tokenized caption, which is needed later to size the model's positional embedding table. The tqdm library shows a progress bar while all ~40k captions are scanned.

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

max_seq_len = max([len(tokenizer(caption).input_ids) for caption in tqdm(captions)])
print(f"Maximum sequence length in the dataset = {max_seq_len}")

100%|██████████| 40455/40455 [00:13<00:00, 2976.77it/s]

Maximum sequence length in the dataset = 43





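## Data splitting and assigning indices

The cell below splits the dataset indices into training, validation, and test sets according to the specified split ratios. It also asserts that the three index sets do not overlap, to guard against data leakage. Note that this check operates on caption rows: if an image has several captions, rows for the same image can still fall into different splits.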

In [None]:
# get the size of train, val and test sets
n_train = int(n_data * SPLIT_RATIO['train'])
n_val = int(n_data * SPLIT_RATIO['val'])
n_test = n_data - (n_train + n_val)

# permute all indices
indices = np.random.permutation(n_data)

# get train, val and test set indices
train_indices = indices[:n_train]
val_indices = indices[n_train:(n_train + n_val)]
test_indices = indices[(n_train + n_val):]

assert len(np.intersect1d(train_indices, val_indices)) == 0
assert len(np.intersect1d(train_indices, test_indices)) == 0
assert len(np.intersect1d(val_indices, test_indices)) == 0
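
The size arithmetic and the disjointness property can be checked in isolation. The sketch below reuses the dataset size printed earlier (40455) and the same 0.6 / 0.2 / 0.2 ratios, but draws the permutation from a local `np.random.default_rng` generator, which is an assumption made to keep the example self-contained; the notebook itself relies on the global seed.

```python
import numpy as np

n_data = 40455  # dataset size reported above
ratios = {'train': 0.6, 'val': 0.2, 'test': 0.2}

n_train = int(n_data * ratios['train'])
n_val = int(n_data * ratios['val'])
n_test = n_data - (n_train + n_val)  # the remainder absorbs any rounding

rng = np.random.default_rng(123)  # local generator, used here only for illustration
indices = rng.permutation(n_data)
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

# three slices of one permutation are disjoint and together cover every index once
covered = np.concatenate([train_idx, val_idx, test_idx])
print(n_train, n_val, n_test)
print(np.array_equal(np.sort(covered), np.arange(n_data)))
```

Since the slices come from a single permutation, the pairwise intersection asserts in the cell above are guaranteed to pass; they serve as a safety net against slicing mistakes.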

## Splitting Data based on Obtained Indices
Here, the get_split function takes the images, the captions, and an array of indices as input and returns a tuple of images and a tuple of the corresponding captions for that split.

In [None]:
def get_split(images, captions, indices):
    return zip(*[(images[idx], captions[idx]) for idx in indices])

train_images, train_captions = get_split(images, captions, train_indices)
val_images, val_captions = get_split(images, captions, val_indices)
test_images, test_captions = get_split(images, captions, test_indices)

print(f"There are {n_train} train data, {n_val} validation data and {n_test} test data.")

There are 24273 train data, 8091 validation data and 8091 test data.


## Initializing feature extractor
The following code initializes a feature extractor using a pre-trained Vision Transformer (ViT) model. It creates the feature extractor with specific weights and designates a particular layer for feature extraction. By freezing the model parameters, the code ensures these parameters remain unchanged during training, focusing updates only on subsequent parts of the captioning model.

In [None]:
def initialize_feature_extractor(base_model, weights, feat_layer, device):
    # initialize model from the function arguments (not the module-level constants)
    feature_extractor = torchvision.models.feature_extraction.create_feature_extractor(
        base_model(weights=weights),
        [feat_layer],
    ).to(device)
    # freeze params
    for param in feature_extractor.parameters():
        param.requires_grad = False
    # set model to evaluation mode
    feature_extractor = feature_extractor.eval()

    # initialize image transformations that match the pre-trained weights
    transforms = weights.transforms()
    return feature_extractor, transforms

feature_extractor, transforms = initialize_feature_extractor(BASE_MODEL, WEIGHTS, FEAT_LAYER, device=DEVICE)

Downloading: "https://download.pytorch.org/models/vit_b_16-c867db91.pth" to /root/.cache/torch/hub/checkpoints/vit_b_16-c867db91.pth
100%|██████████| 330M/330M [00:06<00:00, 53.0MB/s]


## Padding Dataset for Image Captioning
The FlickrDset class below takes images, captions, transforms, and a tokenizer as inputs. The __getitem__ method loads and transforms an image and tokenizes the corresponding caption into an input sequence (y0) and a target sequence (y1), where y1 is y0 shifted left by one token with a 0 appended at the end. The my_pad_sequence function stacks the images of a batch into one tensor and pads the caption sequences (y0 and y1) to a common length with PyTorch's pad_sequence function, so that each batch can be processed as a single tensor during training or evaluation.

In [None]:
class FlickrDset(Dataset):
    def __init__(self, images, captions, transforms, tokenizer):
        super().__init__()
        self.images = images
        self.captions = captions
        self.transforms = transforms
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.images)

    def transform_image(self, img):
        return self.transforms(img)

    def load_image(self, fp):
        img = Image.open(os.path.join(IMAGES_DIR, fp)).convert("RGB")
        img_transformed = self.transform_image(img)
        return img_transformed

    def __getitem__(self, idx):
        x = self.load_image(self.images[idx])
        y0 = torch.tensor(self.tokenizer(self.captions[idx]).input_ids, dtype=torch.long)
        y1 =torch.tensor(self.tokenizer(self.captions[idx]).input_ids[1:] + [0], dtype=torch.long)
        return x, y0, y1

def my_pad_sequence(data):
    x, y0, y1 = zip(*data)
    x = torch.stack(x)
    y0 = torch.nn.utils.rnn.pad_sequence(y0, batch_first=True)
    y1 = torch.nn.utils.rnn.pad_sequence(y1, batch_first=True)
    return (x, y0, y1)
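
The shift between y0 and y1 and the effect of padding are easiest to see on toy data. The sketch below uses made-up token IDs, with 101 and 102 standing in for the [CLS] and [SEP] tokens and 0 for padding; `make_pair` is a hypothetical helper that mirrors the two tokenization lines in __getitem__ without calling a tokenizer.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# hypothetical token ids; 101 and 102 stand in for [CLS] and [SEP]
ids_a = [101, 7, 8, 9, 102]
ids_b = [101, 5, 102]

def make_pair(ids):
    """Return (decoder input, next-token target) in the style of FlickrDset.__getitem__."""
    y0 = torch.tensor(ids, dtype=torch.long)
    y1 = torch.tensor(ids[1:] + [0], dtype=torch.long)  # shift left, append pad id 0
    return y0, y1

pairs = [make_pair(ids_a), make_pair(ids_b)]
y0 = pad_sequence([p[0] for p in pairs], batch_first=True)  # padding_value defaults to 0
y1 = pad_sequence([p[1] for p in pairs], batch_first=True)

print(y0)  # inputs, shorter caption padded with 0
print(y1)  # targets: at each position, the token that follows the input token
```

At every position, y1 holds the token the decoder should predict after seeing y0 up to that position, and the trailing zeros are the positions that the loss later ignores.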

## Initializing Datasets and Dataloaders
This code block initializes datasets and data loaders for training, validation, and testing. It creates a FlickrDset instance for each dataset split, then wraps these in DataLoader objects. The train_iterator shuffles the data and drops the last incomplete batch, while val_iterator and test_iterator do not shuffle and include all batches. The collate_fn=my_pad_sequence argument ensures sequences are padded to the same length within each batch. An additional inference_iterator with a batch size of 1 is set up for single-image inference.

In [None]:
# initialize train, val and test datasets
train_dset = FlickrDset(train_images, train_captions, transforms, tokenizer)
val_dset = FlickrDset(val_images, val_captions, transforms, tokenizer)
test_dset = FlickrDset(test_images, test_captions, transforms, tokenizer)

# initialize train, val and test iterators
train_iterator = DataLoader(train_dset, BATCH_SIZE, shuffle=True, drop_last=True, collate_fn=my_pad_sequence, pin_memory=True)
val_iterator = DataLoader(val_dset, BATCH_SIZE, shuffle=False, drop_last=False, collate_fn=my_pad_sequence, pin_memory=True)
test_iterator = DataLoader(test_dset, BATCH_SIZE, shuffle=False, drop_last=False, collate_fn=my_pad_sequence, pin_memory=True)

# initialize test iterator for inference (batch_size=1)
inference_iterator = DataLoader(test_dset, 1, shuffle=False, drop_last=False, pin_memory=True)

## Transformer-based Caption Generation
The Transformer decoder processes the embedded input tokens, attending to image features. If is_causal is True, it employs a causal mask to prevent attending to future tokens during training. Finally, a linear layer projects decoder outputs to vocabulary space.

In [None]:
class CaptionGenerator(nn.Module):
    def __init__(self, vocab_size, embed_size, max_seq_len, nhead, num_layers):
        super().__init__()
        self.word_embed_lookup = nn.Embedding(vocab_size, embed_size)
        self.pos_embed_lookup = nn.Embedding(max_seq_len, embed_size)

        decoder_layer = torch.nn.TransformerDecoderLayer(embed_size, nhead, batch_first=True)
        self.decoder = torch.nn.TransformerDecoder(decoder_layer, num_layers)

        self.fc = nn.Linear(embed_size, vocab_size)

    def forward(self, x, image_embed, is_causal):

        seq_len = x.shape[1]

        word_embed = self.word_embed_lookup(x)
        pos_embed = self.pos_embed_lookup(torch.arange(seq_len).to(x.device))
        x_embed = word_embed + pos_embed

        if is_causal:
            tgt_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
        else:
            tgt_mask = None

        x = self.decoder(
            tgt=x_embed,
            memory=image_embed,
            tgt_mask=tgt_mask
        )
        x = self.fc(x)
        return x
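
To show what the causal mask does, the sketch below builds the same pattern by hand with `torch.triu` (an assumption made for portability across PyTorch versions; the model itself calls `generate_square_subsequent_mask`) and applies it to uniform attention scores.

```python
import torch

seq_len = 4
# -inf strictly above the diagonal (future positions), 0 elsewhere (allowed positions)
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
print(mask)

# adding the mask to attention scores before softmax zeroes out future positions
scores = torch.zeros(seq_len, seq_len)  # toy scores, all equal
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
```

Row i of `weights` spreads its attention only over positions 0..i, so the prediction for a token cannot look at the tokens that come after it.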

## Using loss as Evaluation Metric
This function computes the cross-entropy loss given model logits and ground truth labels. It first reshapes the logits tensor to a 2D shape (batch size * sequence length, number of classes) and flattens the ground truth labels. Then, it computes the cross-entropy loss between the reshaped logits and flattened labels, ignoring padding tokens with index 0. Finally, it returns the computed loss.

In [None]:
def compute_loss(logits, y):
    B, S, C = logits.shape
    loss = F.cross_entropy(logits.view(B*S, C), y.view(-1), ignore_index=0)
    return loss
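
The effect of ignore_index=0 can be verified on toy tensors. The sketch below uses random logits and a made-up target tensor in which 0 marks padding, and compares the built-in result with a manual average over the non-padding positions only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, S, C = 2, 3, 5                      # toy batch size, sequence length, vocab size
logits = torch.randn(B, S, C)
targets = torch.tensor([[1, 2, 0],     # 0 marks padding
                        [3, 0, 0]])

flat_logits = logits.view(B * S, C)
flat_targets = targets.view(-1)
loss_ignore = F.cross_entropy(flat_logits, flat_targets, ignore_index=0)

# same value computed by hand: mean over the non-padding positions only
keep = flat_targets != 0
loss_manual = F.cross_entropy(flat_logits[keep], flat_targets[keep])

print(torch.allclose(loss_ignore, loss_manual))
```

Padding positions therefore contribute neither to the loss value nor to the gradients, so short captions in a batch are not penalized for their padded tail.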

## Using loss for fine tuning of hyperparameters
The batch_loop function takes the model, the feature extractor, a batch of input data (images and captions), and a boolean flag indicating whether to use a causal mask during decoding. It performs the following steps:

*   Passes the input images through the feature extractor to obtain image embeddings.

*   Feeds the image embeddings and input captions to the model to generate logits (raw predictions) for the next tokens in the captions.

*   Calculates the loss using the computed logits and the ground truth next tokens (targets) using the compute_loss function.

This function essentially makes a single forward pass through the model and computes the corresponding loss, which is then used for backpropagation and parameter updates during training.








In [None]:
def batch_loop(model, feature_extractor, x, y0, y1, is_causal):
    x, y0, y1 = x.to(DEVICE), y0.to(DEVICE), y1.to(DEVICE)
    x_embed = feature_extractor(x)[FEAT_LAYER].unsqueeze(1)
    logits = model(y0, x_embed, is_causal=is_causal)
    loss = compute_loss(logits, y1)
    return loss

## Training Epoch Function
This function performs one epoch of training for the caption generation model. It iterates over the training data batches, computing the loss for each batch using the batch_loop function. The model parameters are updated via backpropagation using the optimizer. It accumulates the total loss across batches and calculates the average loss per batch. The model is set to training mode, while the feature extractor is in evaluation mode to prevent its parameters from being updated. Finally, it returns the average loss for the epoch, which is used for monitoring training progress.

In [None]:
def train_epoch(model, feature_extractor, iterator, optimizer):
    model.train()
    feature_extractor.eval()
    loss_sum = 0.0
    for iter_idx, (x, y0, y1) in tqdm(enumerate(iterator), total=len(iterator), desc="Training Progress"):
        # compute batch loss
        loss = batch_loop(model, feature_extractor, x, y0, y1, is_causal=True)
        # update model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # add loss to total sum
        loss_sum += loss.item()

    # compute average epoch loss
    n_batches = len(iterator)
    loss_avg = loss_sum / n_batches
    return loss_avg

## Evaluation Function
This function evaluates the performance of the caption generation model on a validation or test dataset. It iterates over the data batches, computing the loss for each batch using the batch_loop function. This function provides insight into how well the model generalizes to unseen data and is useful for model selection and hyperparameter tuning.

In [None]:
def evaluate(model, feature_extractor, iterator):
    model.eval()
    feature_extractor.eval()
    loss_sum = 0.0
    with torch.inference_mode():
        # iterate over batches
        for iter_idx, (x, y0, y1) in tqdm(enumerate(iterator), total=len(iterator), desc="Evaluation Progress"):
            loss = batch_loop(model, feature_extractor, x, y0, y1, is_causal=True)
            # add loss to total sum
            loss_sum += loss.item()

    # compute average epoch loss
    n_batches = len(iterator)
    loss_avg = loss_sum / n_batches
    return loss_avg

## Text Generation Function
This function generates a caption for an input image. Starting from the [CLS] token, it generates tokens one at a time, conditioning on the image features and the previously generated tokens. At each step, it samples the next token from the temperature-scaled probability distribution over the vocabulary produced by the model. The process continues until either the maximum length is reached or the special [SEP] token is generated, indicating the end of the caption.

In [None]:
def generate(model, feature_extractor, x, tokenizer, temp, max_length, device):
    model.eval()
    feature_extractor.eval()
    with torch.inference_mode():
        x_embed = feature_extractor(x)[FEAT_LAYER].unsqueeze(1)
        input_tokens = torch.tensor([tokenizer.convert_tokens_to_ids(['[CLS]'])], dtype=torch.long).to(device)

        generated_token_ids = []

        for _ in range(max_length):
            logits = model(input_tokens, x_embed, is_causal=False)[:,-1,:]
            next_token = torch.multinomial(logits.view(1, -1).div(temp).exp(), num_samples=1)
            input_tokens = torch.cat((input_tokens, next_token), dim=1)
            if tokenizer.decode([next_token.item()]) == '[SEP]':
                break

            generated_token_ids.append(next_token.item())

        return generated_token_ids
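
The role of the temp argument can be illustrated without the model. The sketch below, written with NumPy and toy logits, computes the probabilities implied by sampling proportionally to exp(logits / temp); the helper name `sampling_probs` is hypothetical, and the explicit normalization is only there to make the probabilities visible, since `torch.multinomial` accepts unnormalized non-negative weights.

```python
import numpy as np

def sampling_probs(logits, temp):
    """Probabilities implied by sampling proportionally to exp(logits / temp)."""
    scaled = np.asarray(logits, dtype=float) / temp
    weights = np.exp(scaled - scaled.max())  # subtracting the max avoids overflow
    return weights / weights.sum()

logits = [2.0, 1.0, 0.0]  # toy scores for a 3-token vocabulary
p_sharp = sampling_probs(logits, temp=0.5)
p_plain = sampling_probs(logits, temp=1.0)
p_flat = sampling_probs(logits, temp=2.0)

print(p_sharp.round(3), p_plain.round(3), p_flat.round(3))
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution and make the generated captions more varied.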

## Training, Evaluation and Overfitting prevention of Model
The caption generation model is initialized with specified hyperparameters such as vocabulary size, embedding size, maximum sequence length, number of attention heads, and number of layers. An AdamW optimizer is also initialized to optimize the model parameters.

The model is trained for at most a specified number of epochs (N_EPOCHS). During each epoch, the model is trained on the training dataset using the train_epoch function. After training, the model's performance is evaluated on the validation and test datasets using the evaluate function to compute the validation loss (val_loss) and the test loss (test_loss).

**Best Model Selection:** If the validation loss decreases, indicating an improvement in performance, the current model parameters are saved as the best model in the file specified by BEST_MODEL_FILEPATH. Additionally, captions are generated for a few test images and displayed next to the ground truth captions.

**Early Stopping:** If the validation loss does not improve for a certain number of epochs (EARLY_STOPPING_STEP_SIZE), the training process is terminated early to prevent overfitting.

In [None]:
# initialize model
model = CaptionGenerator(
    vocab_size=tokenizer.vocab_size,
    embed_size=EMBED_SIZE,
    max_seq_len=max_seq_len,
    nhead=NHEAD,
    num_layers=NUM_LAYERS,
).to(DEVICE)

# initialize optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
# initialize best val loss and best model streak count
best_val_loss = float('inf')
best_streak_count = 0

# iterate over epochs
for epoch_idx in range(1, N_EPOCHS+1):
    # train
    train_loss = train_epoch(model, feature_extractor, train_iterator, optimizer)
    # evaluate
    val_loss = evaluate(model, feature_extractor, val_iterator)
    test_loss = evaluate(model, feature_extractor, test_iterator)

    # print losses
    print(f"Epoch {epoch_idx:02} - Train loss = {train_loss:.4f}\tVal loss = {val_loss:.4f}\tTest loss = {test_loss:.4f}")

    if val_loss < best_val_loss:
        # save the current model's parameters as the best model parameters
        torch.save(model.state_dict(), BEST_MODEL_FILEPATH)
        # replace the best val loss with the current val loss
        best_val_loss = val_loss
        # reset early stopping counter
        best_streak_count = 0
        # display info
        print(f'The best model is found and saved. Current best val loss = {best_val_loss:.3f}\n')

        # ------------------------------------------------------
        ct = 0

        for x, y0, y1 in inference_iterator:
            x = x.to(DEVICE)
            # generate caption
            generated_token_ids = generate(model, feature_extractor, x, tokenizer, temp=1.0, max_length=max_seq_len, device=DEVICE)
            caption_generated = tokenizer.decode(generated_token_ids)
            # get groundtruth caption
            caption_gt = tokenizer.decode(y0[0,1:-1].cpu().numpy().tolist())
            # show image
            img = Image.open(os.path.join(IMAGES_DIR, test_images[ct])).convert("RGB")
            plt.imshow(img)
            plt.show()
            # print generated and groundtruth captions
            print("True:\n", caption_gt)
            print("Generated:\n", caption_generated)
            ct += 1
            if ct > 5: break
        # ------------------------------------------------------
    else:
        # update early stopping counter
        best_streak_count += 1

    # check early stopping condition
    if best_streak_count == EARLY_STOPPING_STEP_SIZE:
        print(f"A better model has not been found in the last {EARLY_STOPPING_STEP_SIZE} epochs. Early stopping...")
        break

    print("--------------\n")

Training Progress: 100%|██████████| 189/189 [6:06:59<00:00, 116.50s/it]
Evaluation Progress:  22%|██▏       | 14/64 [22:10<1:21:59, 98.38s/it]
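
The best-model and early-stopping bookkeeping in the loop above can be replayed on a list of numbers. The sketch below is a simplified stand-in, not the training loop itself: `run_early_stopping` is a hypothetical helper, and the validation losses are made up for illustration only.

```python
def run_early_stopping(val_losses, patience):
    """Replay a list of validation losses through the loop's stopping rule."""
    best = float('inf')
    streak = 0
    best_epoch = None
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # new best model: remember it and reset the counter
            best, best_epoch, streak = loss, epoch, 0
        else:
            streak += 1
        if streak == patience:
            return best_epoch, epoch  # (epoch of best model, epoch where training stops)
    return best_epoch, len(val_losses)

# made-up losses for illustration only
losses = [3.2, 2.9, 2.7, 2.8, 2.75, 2.71, 2.9, 3.0, 3.1]
print(run_early_stopping(losses, patience=5))
```

With a patience of 5, training stops five epochs after the last improvement, and the checkpoint kept on disk is the one from the best epoch rather than the last one.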

# **Summary Report**
## **Model Overview**
The task was to develop a caption generation model. I combined a Vision Transformer (ViT) as the feature extractor with a Transformer-based decoder for generating captions. The model uses a pre-trained ViT to extract image features, which are then passed to a Transformer decoder that generates captions token by token.

## **Model Performance**
The model was trained on a dataset of images and their corresponding captions. The performance metrics are the average training loss, validation loss, and test loss, computed after each epoch. In addition, captions were generated for a small subset of test images and printed next to the ground truth captions (the actual captions) for comparison.

## **Challenges Faced**
1.   **Overfitting:** When I initially ran the model, it produced a very large loss against the ground truth captions of the evaluation data. This led me to use the early stopping technique.

2.   **Data Handling:** Uploading the dataset was itself a major issue and took a lot of time. I initially wanted to provide a repository link, but whenever I pushed the files, the number of images was automatically truncated to 100. This forced me to use Google Drive.

3.   **Variation in Caption Length:** When tokenizing the captions, the variation in length was initially a hassle. This pushed me to learn and use padding techniques in order to give the token sequences in a batch the same length.

4.   **Training Time:** I used ViT instead of a CNN for better results, but this resulted in a very long training time. So every time I wanted to check my model, it took a long time to run.

## **Methods used to improve the model**


1.   **Early Stopping:** Early stopping was implemented to halt training when the validation loss did not improve for a specified number of epochs. This prevented the model from overfitting to the training data.

2.   **Freezing Feature Extractor:** The ViT feature extractor's parameters were frozen during training so that its pre-trained weights are used as they are. This reduces the number of trainable parameters, which makes the model easier to optimize and less prone to overfitting.

3. **Batch Padding:** A custom collate function pads the sequences to the same length within each batch, rather than to the longest caption in the whole dataset. This reduces the time the model takes to train.

4. **AdamW Optimizer:** I used the AdamW optimizer with a fixed learning rate (LEARNING_RATE = 1e-4); the code above does not use a learning rate scheduler. AdamW adapts the step size for each parameter and applies decoupled weight decay, which helps the model converge efficiently.

## **References**

### **Vision Transformers(ViT):**

*   Dosovitskiy, A., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale.

*   Touvron, H., et al. (2021). Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (pp. 10347-10357). PMLR.

*   https://github.com/facebookresearch/detectron2

*   https://github.com/pytorch/vision

### **Image Captioning:**

*    Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3156-3164).

*    Xu, K., et al. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (pp. 2048-2057).

*    Anderson, P., et al. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6077-6086).

*    https://github.com/tensorflow/models/tree/master/research/im2txt

*    https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning

### **Transformers in NLP:**

*    Vaswani, A., et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).

*    Devlin, J., et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.

*    https://github.com/huggingface/transformers

*    https://github.com/allenai/allennlp

### **Combining Vision and Language Models:**

*    Li, L. H., et al. (2020). Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision (pp. 121-137).

*    Kim, J. H., et al. (2018). Bilinear attention networks. In Advances in Neural Information Processing Systems (pp. 1564-1574).

*    Chen, Y. C., et al. (2020). UNITER: UNiversal Image-TExt Representation Learning. In European Conference on Computer Vision (pp. 104-120).

*    https://github.com/jiasenlu/vilbert_beta

### **Early Stopping**

*    Prechelt, L. (1998). Early stopping - but when? In Neural Networks: Tricks of the Trade (pp. 55-69).

*    Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

*    https://github.com/Bjarten/early-stopping-pytorch

*    https://github.com/keras-team/keras-tuner

















# **Reflections**
## **Choices Made**

*   **Use of ViT over CNN:**

 The Vision Transformer is particularly useful when large-scale datasets are available for training, as it can capture long-range dependencies and global context more effectively than traditional CNNs. In this model I used ViT for its capability to extract high-level features from images, which are then used for generating captions.

* **Use of Loss as an Evaluation Parameter over BLEU/METEOR/ROUGE:**

 In this code, I used loss as the evaluation metric rather than BLEU. Loss measures the difference between the predicted and actual tokens, so it directly reflects the model's ability to minimize errors during training and guides the parameter updates. BLEU (Bilingual Evaluation Understudy), by contrast, is a metric commonly used in natural language processing for evaluating machine translation. It scores n-gram overlap with human-written reference texts, which I considered less suitable for image captioning, where a contextually relevant caption can be worded quite differently from the references.

* **Utilizing Pre-Trained embeddings over training embeddings from scratch:**

 Pretrained word embeddings (such as Word2Vec or GloVe) capture semantic relationships between words and can improve a model's performance, especially when training data is limited. In this notebook, the pre-trained components are the BERT tokenizer and the ViT feature extractor; the decoder's word embedding layer (nn.Embedding) is learned during training.


## **Potential Improvements**

* **Fine-tuning the Feature Extractor:** Instead of freezing the
  parameters of the ViT feature extractor, fine-tuning them along with the caption generator could allow the model to adapt better to the specific task and potentially improve performance.

* **Data Augmentation:** Applying data augmentation techniques to
  the image data, such as random cropping, rotation, or flipping, could increase the diversity of training examples and help the model generalize better to unseen data.

* **Ensemble Learning:** Training multiple caption generator models with different architectures or initializations and combining their predictions through ensemble learning techniques could lead to improved generalization and performance.

* **Attention Mechanism Variants:** Exploring different variants
  of the transformer attention mechanism, such as self-attention with relative positional encoding or multi-head attention, could potentially capture more complex relationships in the data and enhance model performance.