# <center><font size = 3><span style="color:#422711"> <p style="background-color:#422711;font-family:newtimeroman;color:#F6923D;font-size:200%;text-align:center;border-radius:100px 10px;">INTRODUCTION</p>   </span></font></center>
 
<font size = 5><span style="color:#A8642A;font-family:'Times New Roman'">Notebook Overview : </span></font>

* <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'"> This notebook contains:  </span></font>
    1. <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'">An encoder-decoder model that takes an image as input and outputs a caption </span></font>
    2. <font size = 3><span style = "color:#3A3E59;font-family:'Times New Roman'">The Encoder used is <a href = "https://huggingface.co/google/vit-base-patch16-224"><b>Vision Transformer </b></a> </span></font>
    3. <font size =3><span style = "color:#3A3E59;font-family:'Times New Roman'">The Decoder used is <a href = "https://huggingface.co/gpt2"><b>GPT2</b></a></span></font>
    4. <font size =3><span style = "color:#3A3E59;font-family:'Times New Roman'"> The model is trained on the <b>MIMIC chest X-ray dataset</b> (<code>hongrui/mimic_chest_xray_v_1</code>)</span></font>
    5. <font size =3><span style = "color:#3A3E59;font-family:'Times New Roman'"> The Hugging Face <a href = "https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainer"><b>Seq2SeqTrainer</b></a> is used for fine-tuning the model</span></font>
   
*  <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'">The Hugging Face <b>transformers</b> library is used to fine-tune the model, and <b>PyTorch</b> is used for data processing </span></font>


<a id='top'></a>
<p style="background-color:#422711;font-family:newtimeroman;color:#F6923D;font-size:200%;text-align:center;border-radius:200px 10px;">TABLE OF CONTENTS</p>  

- [1. Imports](#1)
- [2. Hyperparameters](#2)
- [3. Helper Functions](#3)
- [4. Dataset](#4)
  * [4.1 Feature Extractor and Tokenizer](#4.4)
  * [4.2 Transforms and dataframe](#4.1)
  * [4.3 Dataset Class](#4.2)
  * [4.4 Train and validation dataset](#4.3)
- [5. Model Building](#5)
    * [5.1 Model Initialization](#5.1)
- [6. Training](#6)
    * [6.1 Training Arguments](#6.1)
    * [6.2 Training using Seq2SeqTrainer](#6.2)
- [7. Predictions](#7)

In [1]:
from IPython.display import clear_output
!pip install rouge_score -q
!pip install deep-phonemizer -q
clear_output()

<a id="1"></a>
# <p style="background-color:#422711;font-family:newtimeroman;color:#F6923D;font-size:140%;text-align:center;border-radius:200px 10px;">1. IMPORTS 📂</p>
#### [Top ↑](#top)

In [2]:
import os

import datasets
import numpy as np
import pandas as pd
from PIL import Image
from pathlib import Path
from tqdm.auto import tqdm
import multiprocessing as mp
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import io, transforms
from torch.utils.data import Dataset, DataLoader, random_split

from transformers import Seq2SeqTrainer ,Seq2SeqTrainingArguments
from transformers import VisionEncoderDecoderModel , ViTFeatureExtractor
from transformers import AutoTokenizer ,  GPT2Config , default_data_collator


if torch.cuda.is_available():    

    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")



There are 1 GPU(s) available.
We will use the GPU: Tesla P100-PCIE-16GB


<a id="2"></a>
# <p style="background-color:#422711;font-family:newtimeroman;color:#F6923D;font-size:140%;text-align:center;border-radius:200px 10px;">2. HYPERPARAMETERS</p>
#### [Top ↑](#top)

In [3]:
os.environ["WANDB_DISABLED"] = "true"
class config : 
    ENCODER = "google/vit-base-patch16-224"
    DECODER = "gpt2"
    TRAIN_BATCH_SIZE = 8
    VAL_BATCH_SIZE = 8
    VAL_EPOCHS = 1
    LR = 5e-5
    SEED = 42
    MAX_LEN = 100
    SUMMARY_LEN = 20
    WEIGHT_DECAY = 0.01
    MEAN = (0.485, 0.456, 0.406)
    STD = (0.229, 0.224, 0.225)
    TRAIN_PCT = 0.95
    NUM_WORKERS = mp.cpu_count()
    EPOCHS = 3
    IMG_SIZE = (224,224)
    LABEL_MASK = -100
    TOP_K = 1000
    TOP_P = 0.95

<a id="3"></a>
# <p style="background-color:#422711;font-family:newtimeroman;color:#F6923D;font-size:140%;text-align:center;border-radius:200px 10px;">3. HELPER FUNCTIONS</p>
#### [Top ↑](#top)

<font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'"> There are two helper functions:  </span></font>
1. <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'"> The first function <b>builds the special tokens</b> (BOS and EOS) around each caption during tokenization  </span></font>
2. <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'">The second function computes the <b>ROUGE-2</b> metric, which is used to evaluate the generated captions  </span></font>

In [4]:
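def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    """Wrap a single sequence as BOS + tokens + EOS.

    token_ids_1 is accepted for signature compatibility and is ignored.
    """
    outputs = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
    return outputs
AutoTokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens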
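
The ROUGE-2 helper described above is not defined in this cell. As a rough illustration only, a pure-Python sketch of bigram-overlap ROUGE-2 might look like this. The names `bigrams` and `rouge2` and the whitespace tokenization are assumptions made for this example; it does not reproduce the `rouge_score` package (no stemming, one reference per prediction).

```python
from collections import Counter

def bigrams(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def rouge2(reference, prediction):
    """Return (precision, recall, f1) of bigram overlap between two strings."""
    ref = bigrams(reference.split())
    pred = bigrams(prediction.split())
    overlap = sum((ref & pred).values())  # clipped bigram matches
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(rouge2("no acute cardiopulmonary process", "no acute process"))
```

Here one of the two predicted bigrams appears among the three reference bigrams, so precision is 1/2 and recall is 1/3.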

<a id="4"></a>
# <p style="background-color:#422711;font-family:newtimeroman;color:#F6923D;font-size:140%;text-align:center;border-radius:200px 10px;">4. DATASET</p>


<a id="4.4"></a>
## <font size = 5><span style="color:#A8642A;font-family:'Times New Roman'">4.1 Feature Extractor and Tokenizer : </span></font>
#### [Top ↑](#top)

1. <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'"> The Feature extractor is loaded using <b>ViTFeatureExtractor</b>  </span></font>
2. <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'">The tokenizer for GPT2 is loaded using the <b>AutoTokenizer</b>  </span></font>

In [5]:
feature_extractor = ViTFeatureExtractor.from_pretrained(config.ENCODER)
tokenizer = AutoTokenizer.from_pretrained(config.DECODER)
tokenizer.pad_token = tokenizer.unk_token


<a id="4.1"></a>
## <font size = 5><span style="color:#A8642A;font-family:'Times New Roman'">4.2 Transforms and dataframe : </span></font>
#### [Top ↑](#top)

 <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'">The transformations used are: </span></font>
> 1. <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'"><b>Resizing</b> the image to (224,224) </span></font>
2. <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'"> Converting the image to a <b>Tensor</b>  </span></font>
3. <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'"><b>Normalizing</b> the tensor with mean 0.5 and standard deviation 0.5</span></font>

In [13]:
from torchvision import transforms
# Use a distinct name so the torchvision `transforms` module is not shadowed
image_transforms = transforms.Compose(
    [
        transforms.Resize(config.IMG_SIZE),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=0.5,
            std=0.5
        )
    ]
)
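
`ToTensor` scales pixel values to the range [0, 1], and `Normalize` then computes `(x - mean) / std` for each channel. With `mean = std = 0.5`, values end up in [-1, 1]. A plain-Python sketch of that arithmetic follows; the `normalize` helper is hypothetical and is not part of torchvision.

```python
def normalize(value, mean=0.5, std=0.5):
    """Apply the per-channel formula used by Normalize: (value - mean) / std."""
    return (value - mean) / std

# Darkest, mid-grey, and brightest pixel after ToTensor
print([normalize(v) for v in (0.0, 0.5, 1.0)])  # [-1.0, 0.0, 1.0]
```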

<a id="4.2"></a>
## <font size = 5><span style="color:#A8642A;font-family:'Times New Roman'">4.3 Dataset Class : </span></font>
#### [Top ↑](#top)

<font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'">The dataset is created using the following steps: </span></font>
> 1. <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'">Each image is read as a PIL <b>Image</b> and converted to RGB </span></font>
2. <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'"> The image is <b>transformed</b> (resized and converted to a tensor)</span></font>
3. <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'">The transformed image is passed through the <b>feature extractor</b> to obtain its pixel values </span></font>
4. <font size = 3><span style="color:#3A3E59;font-family:'Times New Roman'"> The captions (reports) are loaded from the dataset</span></font>
5. <font size = 3><span style = "color:#3A3E59;font-family:'Times New Roman'">The captions are <b>tokenized</b></span></font>
6. <font size = 3><span style = "color:#3A3E59;font-family:'Times New Roman'">The tokenized captions are <b>padded</b> to the maximum length, and padding positions are replaced with -100 in the labels</span></font>
7. <font size = 3><span style = "color:#3A3E59;font-family:'Times New Roman'">The pixel values and the tokenized captions are returned</span></font>

In [135]:
from datasets import load_dataset, concatenate_datasets

dataset_mimic = load_dataset("hongrui/mimic_chest_xray_v_1")

In [136]:
dataset_mimic = dataset_mimic["train"]

In [137]:
dataset_mimic = dataset_mimic.select(range(50000))

In [138]:
import re

def clean_text(text):
    """Flatten whitespace, drop list numbering, strip punctuation, and lowercase a report."""
    text = text.replace('\n', ' ')
    text = re.sub(r'_{2,}', '_', text)   # collapse runs of underscores
    text = re.sub(r' {2,}', ' ', text)   # collapse runs of spaces
    text = re.sub(r'\.{2,}', '.', text)  # collapse runs of dots
    # Drop enumeration markers such as "1. ", ". 2. " and " 3. "
    text = text.replace('1. ', '')
    for n in '2345':
        text = text.replace(f'. {n}. ', '. ')
    for n in '2345':
        text = text.replace(f' {n}. ', '. ')
    text = text.strip().lower()
    # Remove quotes and slashes
    for ch in ('"', '/', '\\', "'"):
        text = text.replace(ch, '')
    # Strip punctuation. Inside the class, ':-\[' is a range (':' through '['),
    # so a literal hyphen is not matched and terms such as "port-a-cath" survive.
    return re.sub(r'[.,?;*!%^&_+():-\[\]{}]', '', text.strip())

# Apply the cleaning function to each report in the dataset
dataset_mimic = dataset_mimic.map(lambda example: {'report': clean_text(example['report'])})
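
The punctuation character class used by `clean_text` contains `:-\[`, which the regex engine reads as a range from `:` to `[` rather than as three literal characters. The sketch below illustrates the effect on a made-up sample string (the string and the `PUNCT` name are assumptions for this example): listed punctuation is removed, while hyphens are kept.

```python
import re

# Same character class as in clean_text, written as a raw string
PUNCT = re.compile(r'[.,?;*!%^&_+():-\[\]{}]')

sample = "left port-a-cath; no pneumothorax (stable)."
print(PUNCT.sub('', sample))  # left port-a-cath no pneumothorax stable
```

Because the range also spans `A` to `Z`, the pattern would delete uppercase letters too; this has no effect in `clean_text` because the text is lowercased first.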

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [139]:
dataset_mimic

Dataset({
    features: ['image', 'text', 'report'],
    num_rows: 50000
})

In [140]:
def is_short_report(example):
    # Keep reports with at most 100 words
    return len(example['report'].split()) <= 100

# Apply the filtering function to the dataset
dataset_mimic = dataset_mimic.filter(is_short_report)

Filter:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [141]:
dataset_mimic

Dataset({
    features: ['image', 'text', 'report'],
    num_rows: 48231
})

In [142]:
s_mimic = dataset_mimic.train_test_split(test_size= 0.07, seed = 42)
dataset_mimic, dataset_mimic_test = s_mimic['train'], s_mimic["test"]

In [143]:
dataset_mimic, dataset_mimic_test

(Dataset({
     features: ['image', 'text', 'report'],
     num_rows: 44854
 }),
 Dataset({
     features: ['image', 'text', 'report'],
     num_rows: 3377
 }))

In [144]:
dataset_mimic = dataset_mimic.remove_columns("text")

In [145]:
max_length_mimic = max(len(sentence.split()) for sentence in dataset_mimic["report"])

In [146]:
max_length_mimic

100

In [147]:
dataset_mimic = dataset_mimic.train_test_split(test_size=0.3, seed = 42)

In [148]:
dataset_train = dataset_mimic["train"]
dataset_val = dataset_mimic["test"]

In [149]:
dataset_train, dataset_val

(Dataset({
     features: ['image', 'report'],
     num_rows: 31397
 }),
 Dataset({
     features: ['image', 'report'],
     num_rows: 13457
 }))
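
The row counts above can be checked by hand. The sketch below assumes the test share is rounded up and the remainder goes to training, as scikit-learn does; this rounding rule is an assumption about `train_test_split`, chosen because it reproduces the numbers shown in the outputs. The `split_sizes` helper is hypothetical.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test

print(split_sizes(48231, 0.07))  # (44854, 3377): train+val pool and held-out test
print(split_sizes(44854, 0.3))   # (31397, 13457): train and validation
```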

In [150]:
from torchvision import transforms

class ImgDataset(Dataset):
    """Pairs each image with its tokenized report for VisionEncoderDecoderModel."""

    def __init__(self, df, root_dir, tokenizer, feature_extractor, transform=None):
        self.df = df
        self.transform = transform  # stored for API compatibility; not applied below
        self.root_dir = root_dir
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor
        self.max_length = config.MAX_LEN
        # Resize and convert to a tensor; normalization is left to the feature extractor
        self.resize_to_tensor = transforms.Compose(
            [transforms.Resize(config.IMG_SIZE), transforms.ToTensor()]
        )

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df[idx]
        img = self.resize_to_tensor(row["image"].convert("RGB"))

        pixel_values = self.feature_extractor(img, return_tensors="pt").pixel_values
        token_ids = self.tokenizer(
            row["report"], padding="max_length", truncation=True, max_length=self.max_length
        ).input_ids

        # Mask padding positions so that the loss ignores them
        labels = [tok if tok != self.tokenizer.pad_token_id else config.LABEL_MASK for tok in token_ids]
        return {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}
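
The last step of `__getitem__` replaces padding ids with -100, which is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions do not contribute to training. A minimal sketch of that masking with made-up token ids (`mask_padding` and the ids are hypothetical):

```python
def mask_padding(input_ids, pad_token_id, ignore_index=-100):
    """Replace padding ids with ignore_index so the loss skips them."""
    return [tok if tok != pad_token_id else ignore_index for tok in input_ids]

# Hypothetical caption of three tokens, padded to length 6 with pad id 50256
padded = [1219, 257, 3797, 50256, 50256, 50256]
print(mask_padding(padded, pad_token_id=50256))
# [1219, 257, 3797, -100, -100, -100]
```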

<a id="4.3"></a>
## <font size = 5><span style="color:#A8642A;font-family:'Times New Roman'">4.4 Train and validation dataset: </span></font>
#### [Top ↑](#top)

In [151]:
train_dataset = ImgDataset(dataset_train, root_dir = "",tokenizer=tokenizer,feature_extractor = feature_extractor ,transform = transforms)
val_dataset = ImgDataset(dataset_val , root_dir = "",tokenizer=tokenizer,feature_extractor = feature_extractor , transform  = transforms)

<a id="5"></a>
# <p style="background-color:#422711;font-family:newtimeroman;color:#F6923D;font-size:140%;text-align:center;border-radius:200px 10px;">5. MODEL BUILDING</p>

<p style="background-color:#422711;font-family:newtimeroman;color:#F6923D;font-size:100%;text-align:center;border-radius:200px 10px;">ENCODER</p>
<br>

<img src = "https://production-media.paperswithcode.com/methods/Screen_Shot_2021-01-26_at_9.43.31_PM_uI4jjMq.png">

<br>
<font size = 3><span style = "color:#3A3E59;font-family:'Times New Roman'">The Vision Transformer (ViT) is an image classification model that applies a Transformer-like architecture to patches of the image. An image is split into fixed-size patches, each of which is linearly embedded; position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. To perform classification, an extra learnable “classification token” is added to the sequence.</span></font>
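
For the `google/vit-base-patch16-224` encoder used here, the patch arithmetic can be worked out directly. The helper below is a hypothetical sketch that only counts patches and token positions; it does not run the model.

```python
def vit_sequence_shape(image_size=224, patch_size=16, channels=3):
    """Return (number of patches, values per flattened patch, sequence length with the class token)."""
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim, num_patches + 1

print(vit_sequence_shape())  # (196, 768, 197)
```

A 224x224 image therefore yields 14x14 = 196 patches, and the encoder sees 197 positions once the classification token is prepended.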

<p style="background-color:#422711;font-family:newtimeroman;color:#F6923D;font-size:100%;text-align:center;border-radius:200px 10px;">DECODER</p>
<br> 

<img src = "https://i.stack.imgur.com/7J4O7.png" >

<br>

<font size = 3><span style = "color:#3A3E59;font-family:'Times New Roman'">GPT-2 is a Transformer model pretrained on a very large corpus of English text in a self-supervised fashion. Inputs are sequences of continuous text of a certain length, and the targets are the same sequences shifted by one token (a word or a piece of a word), so the model learns to predict the next token. Internally, the model uses a masking mechanism to ensure that the prediction for token i uses only the inputs from 1 to i and not future tokens.</span></font>
    
<font size = 3><span style = "color:#3A3E59;font-family:'Times New Roman'">This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. However, the model is best at what it was pretrained for: generating text from a prompt.</span></font>
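
The shifted targets and the causal mask can be illustrated without any model. This is a plain-Python sketch with made-up token ids; both helper names are hypothetical.

```python
def shift_for_language_modeling(token_ids):
    """Pair each input position with the next token as its target."""
    return token_ids[:-1], token_ids[1:]

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

inputs, targets = shift_for_language_modeling([10, 11, 12, 13])
print(inputs, targets)  # [10, 11, 12] [11, 12, 13]
for row in causal_mask(3):
    print(row)
```

Each row of the mask allows attention only to the current and earlier positions, which is what keeps future tokens out of a prediction.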
    

<a id="5.1"></a>
## <font size = 5><span style="color:#A8642A;font-family:'Times New Roman'">5.1 Model Initialization : </span></font>
#### [Top ↑](#top)

In [59]:
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(config.ENCODER, config.DECODER)

Some weights of ViTModel were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized: ['vit.pooler.dense.bias', 'vit.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.crossattention.c_attn.bias', 'h.0.crossattention.c_attn.weight', 'h.0.crossattention.c_proj.bias', 'h.0.crossattention.c_proj.weight', 'h.0.crossattention.q_attn.bias', 'h.0.crossattention.q_attn.weight', 'h.0.ln_cross_attn.bias', 'h.0.ln_cross_attn.weight', 'h.1.crossattention.c_attn.bias', 'h.1.crossattention.c_attn.weight', 'h.1.crossattention.c_proj.bias', 'h.1.crossattention.c_proj.weight', 'h.1.crossattention.q_attn.bias', 'h.1.crossattention.q_attn.weight', 'h.1.ln_cross_attn.bias', 'h.1.ln_cross_attn.weight', 'h.10.crossattention.c_attn.bias', 'h.10.crossattention.c_attn.

In [60]:
# The GPT-2 tokenizer defines no CLS or SEP token, so use its BOS and EOS tokens
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
# make sure vocab size is set correctly
model.config.vocab_size = model.config.decoder.vocab_size
# set beam search parameters
model.config.max_length = config.MAX_LEN
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4
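
Setting `no_repeat_ngram_size = 3` means that no 3-gram may occur twice in a generated sequence. The following is a simplified sketch of that idea, not the transformers implementation: given the tokens generated so far, it lists the next tokens that would complete an already-seen n-gram. The function name and token ids are hypothetical.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would repeat an n-gram already present in `tokens`."""
    if len(tokens) < n:
        return set()
    prefix = tuple(tokens[len(tokens) - (n - 1):])  # last n-1 generated tokens
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned

# (1, 2, 3) was already generated, so 3 may not follow the trailing (1, 2)
print(banned_next_tokens([1, 2, 3, 1, 2], n=3))  # {3}
```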

<a id="6"></a>
# <p style="background-color:#422711;font-family:newtimeroman;color:#F6923D;font-size:140%;text-align:center;border-radius:200px 10px;">6. TRAINING</p>

<a id="6.1"></a>
### <font size = 5><span style="color:#A8642A;font-family:'Times New Roman'">6.1 Training Arguments : </span></font>
#### [Top ↑](#top)

In [154]:
training_args = Seq2SeqTrainingArguments(
    output_dir='/kaggle/working/',
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=False,
    evaluation_strategy="epoch",
    do_train=True,
    do_eval=True,
    logging_steps=1024,  
    save_steps=2048, 
    warmup_steps=1024,  
    learning_rate = 5e-5,
    #max_steps=1500, # delete for full training
    num_train_epochs = 4, #TRAIN_EPOCHS
    overwrite_output_dir=True,
    save_total_limit=1,
)
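
To put `warmup_steps`, `logging_steps`, and `save_steps` in context, the total number of optimizer steps can be estimated from the dataset size. The sketch assumes a single device, no gradient accumulation, and a linear warmup followed by linear decay to zero; both helpers are hypothetical and only do the arithmetic.

```python
import math

def total_training_steps(n_examples, batch_size, epochs):
    """Optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

total = total_training_steps(31397, 8, 4)
print(total)  # 15700
for step in (512, 1024, total):
    print(step, linear_schedule_lr(step, 5e-5, 1024, total))
```

With 31,397 training examples and a batch size of 8, each epoch takes 3,925 steps, so the 1,024 warmup steps cover roughly a quarter of the first epoch.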

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


<a id="6.2"></a>
### <font size = 5><span style="color:#A8642A;font-family:'Times New Roman'">6.2 Training using Seq2SeqTrainer : </span></font>
#### [Top ↑](#top)

In [None]:
# instantiate trainer
trainer = Seq2SeqTrainer(
    tokenizer=feature_extractor,
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=default_data_collator,
)
trainer.train()

In [156]:
trainer.save_model('/kaggle/working/')

Non-default generation parameters: {'max_length': 100, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3}


<a id="7"></a>
# <p style="background-color:#422711;font-family:newtimeroman;color:#F6923D;font-size:140%;text-align:center;border-radius:200px 10px;">7. PREDICTIONS</p>
#### [Top ↑](#top)

In [162]:
img = dataset_mimic_test[100]["image"].convert("RGB")

In [163]:
dataset_mimic_test[2500]["report"]

'mild right pleural effusion improved no pneumothorax improved right basilar consolidation left port-a-cath new minimal left basilar opacity likely atelectasis'

In [164]:
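pixel_values = feature_extractor(img, return_tensors="pt").pixel_values.to(device)
# Beam-search settings come from model.config; temperature only applies when do_sample=True, so it is omitted
output_ids = model.generate(pixel_values, max_length=100)
generated_caption = tokenizer.decode(output_ids[0])
print('\033[96m' + generated_caption + '\033[0m')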

<|endoftext|>in comparison with the study of  the monitoring and support devices are essentially unchanged continued enlargement of the cardiac silhouette with pulmonary vascular congestion and bilateral pleural effusions with compressive atelectasis at the bases in the appropriate clinical setting it would be difficult to exclude superimposed pneumonia especially in the absence of a lateral view tracheostomy tube remains in place and there is no evidence of pneumothorax right subclavian picc line again extends to the mid portion of the svc


In [169]:
from tqdm import tqdm

original = []
predicted = []
image_ids = []

# Generate a caption for each of the first 30 test images
for i in tqdm(range(30)):
    test = dataset_mimic_test[i]
    test_img, test_caption = test["image"].convert("RGB"), test["report"]

    pixel_values = feature_extractor(test_img, return_tensors="pt").pixel_values.to(device)
    output_ids = model.generate(pixel_values, max_length=100)
    generated_caption = tokenizer.decode(output_ids[0])

    original.append(test_caption)
    image_ids.append(i)
    predicted.append(generated_caption)

100%|██████████| 30/30 [00:57<00:00,  1.91s/it]


In [170]:
predicted_clean = [text.replace("<|endoftext|>", "") for text in predicted]

In [51]:
!pip install pycocoevalcap


In [52]:
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider


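def compute_scores(gts, res):
    """Score predictions against references with BLEU, METEOR, ROUGE-L, and CIDEr.

    gts and res map each image id to a list of reference / predicted captions.
    """
    # Set up scorers
    scorers = [
        (Bleu(4), ["BLEU_1", "BLEU_2", "BLEU_3", "BLEU_4"]),
        (Meteor(), "METEOR"),
        (Rouge(), "ROUGE_L"),
        (Cider(), "CIDEr")
    ]
    eval_res = {}
    # Compute score for each metric
    for scorer, method in scorers:
        try:
            score, scores = scorer.compute_score(gts, res, verbose=0)
        except TypeError:
            # Not every scorer accepts the verbose argument
            score, scores = scorer.compute_score(gts, res)
        if isinstance(method, list):
            # BLEU returns one score per n-gram order
            for sc, m in zip(score, method):
                eval_res[m] = sc
        else:
            eval_res[method] = score
    return eval_res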
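
To make the BLEU numbers easier to interpret, here is a simplified sentence-level BLEU-1: clipped unigram precision multiplied by the brevity penalty. It is a sketch under stated assumptions (whitespace tokenization, one reference, one sentence pair) and does not reproduce the corpus-level computation in `pycocoevalcap`. The `bleu1` name and the example strings are made up.

```python
import math
from collections import Counter

def bleu1(reference, prediction):
    """Clipped unigram precision times the brevity penalty, for one sentence pair."""
    ref, pred = reference.split(), prediction.split()
    if not pred:
        return 0.0
    overlap = sum((Counter(ref) & Counter(pred)).values())  # clipped matches
    precision = overlap / len(pred)
    # Penalize predictions that are shorter than the reference
    brevity = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return precision * brevity

print(bleu1("no pleural effusion or pneumothorax", "no effusion"))
```

In this example every predicted word appears in the reference, so precision is 1.0, but the prediction has 2 words against 5, so the score is reduced to exp(1 - 5/2).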

In [53]:
# Convert the prediction and reference lists to dictionaries keyed by image index,
# each mapping to a one-element list of captions
pred_dict = {i: [pred] for i, pred in enumerate(predicted_clean)}
original_dict = {i: [ref] for i, ref in enumerate(original)}

# Score the predictions against the references
scores = compute_scores(original_dict, pred_dict)


In [54]:
scores

{'BLEU_1': 0.24967145367892615,
 'BLEU_2': 0.13823499567555453,
 'BLEU_3': 0.07908908451517639,
 'BLEU_4': 0.04942488795271519,
 'METEOR': 0.13851008038579368,
 'ROUGE_L': 0.19439369496231987,
 'CIDEr': 0.03364910339429611}