# Evaluate Image Captioning on Idefics2 Model

In this tutorial, we will show step by step how to use `ICLTestbed` for model inference and evaluation.

Let's take Idefics2 as an example. Idefics2 is a general multimodal model that takes arbitrary sequences of text and images as input and generates text responses. More details about Idefics2 can be found in the following resources:

[paper](https://arxiv.org/abs/2405.02246) [blog](https://huggingface.co/blog/idefics2) [official-code](https://github.com/huggingface/transformers/tree/main/src/transformers/models/idefics2)

## Step 1. Data Loading
Load the dataset with the `datasets` library. You can load official datasets or use a custom loading script. For convenience, I suggest putting all path-related settings in [config.py](../config.py).

If you have more than one GPU and want to use several of them, set the environment variable `CUDA_VISIBLE_DEVICES`, which is already done in [config.py](../config.py).
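
As a minimal sketch of what such a setting could look like (the contents of `config.py` are not reproduced here, and the device indices below are placeholders, not the project's actual values): the variable has to be set before CUDA is initialized, so in practice it is set before `torch` is imported.

```python
import os

# Hypothetical choice of devices: expose only GPUs 0 and 1 to this process.
# This must run before CUDA is initialized (in practice, before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"{len(visible)} device(s) visible: {visible}")
```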

`testbed.data.utils.prepare_dataloader` uses the given datasets and a PyTorch sampler to build a dataloader that yields batches of size `batch_size`; each element of a batch contains `num_shots + 1` image-caption pairs.

This is slightly different from the [VQA tutorial](./tutorial_vqa.ipynb) in that we extract in-context examples from the training set and queries from the validation set.
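
To make the batch layout concrete, here is a small pure-Python sketch. It is not the implementation of `prepare_dataloader`; the helper `build_context` and the toy records are hypothetical and only illustrate the idea of drawing `num_shots` examples from the training split and one query from the validation split.

```python
import random


def build_context(train_set, val_set, num_shots, rng):
    """Hypothetical helper: num_shots in-context examples, then one query."""
    shots = rng.sample(train_set, num_shots)  # in-context examples from train
    query = rng.choice(val_set)               # query from validation
    return shots + [query]


# Toy stand-ins for the real dataset splits.
train_set = [{"split": "train", "id": i} for i in range(100)]
val_set = [{"split": "validation", "id": i} for i in range(20)]

rng = random.Random(0)
batch_size, num_shots = 1, 2
batch = [build_context(train_set, val_set, num_shots, rng) for _ in range(batch_size)]

for context in batch:
    # Each context holds num_shots + 1 items; the last one is the query.
    print([item["split"] for item in context])
```

With this layout, `context[-1]` is always the query, which matches how the later cells read the ground-truth captions from the last item.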

In [1]:
import torch
import os
import sys
from datasets import load_dataset

sys.path.insert(0, "..")
from testbed.data import prepare_dataloader
import config

dataset = load_dataset(
    os.path.join(config.testbed_dir, "data", "coco"),
    data_dir=config.karpathy_coco_caption_dir,
    images_dir=config.coco_dir,
    trust_remote_code=True,
)

hparams = {
    "batch_size": 1,
    "num_shots": 2,
    "dtype": torch.bfloat16,
    "generate_args": {"num_beams": 3, "max_new_tokens": 20},
}

dataloader = prepare_dataloader(
    [dataset["train"], dataset["validation"]],
    batch_size=hparams["batch_size"],
    num_shots=hparams["num_shots"],
    num_per_dataset=[hparams["num_shots"], 1],
    shuffle=True,
)

## Step 2. Model Building
A model in ICLTestbed can be roughly regarded as a simple combination of a processor and a specific model. You can access the underlying processor or model through `model.processor` or `model.model`.

In [2]:
from testbed.models import Idefics2
import torch

device = torch.device("cpu")
model = Idefics2(config.idefics2_8b_base_path, torch_dtype=hparams["dtype"]).to(device)


## Step 3. Inference
You can get batches by iterating over the dataloader, then use the task-specific `prepare_*_input` functions in `testbed.data` to convert them into model inputs. The model input is usually a `list` of `dict`s. For example, for a 1-shot context:
```python
[
    {
        "role": "image",
        "content": [
            {"type": "image"}
        ],
    },
    {
        "role": "caption",
        "content": [
            {
                "type": "text",
                "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty.",
            },
        ],
    },
    {
        "role": "image",
        "content": [
            {"type": "image"}
        ],
    },
    {
        "role": "caption"
    }
]
```

This list is converted into the actual prompt by `model.apply_prompt_template`, which is one step of `model.process_input`. `apply_prompt_template` is an alias for [`apply_chat_template`](https://huggingface.co/docs/transformers/main/chat_templating).
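
The following sketch shows roughly how such a message list could map to a prompt string. The function `render_prompt` is a hand-written approximation inferred from the prompt printed by the cell below; it is hypothetical, and the real template is defined by the model's processor.

```python
def render_prompt(messages):
    """Hypothetical renderer that mimics the 'Image:<image> Caption: ...' layout."""
    parts = []
    for message in messages:
        content = message.get("content", [])
        if message["role"] == "image":
            # One <image> placeholder per image item.
            placeholders = "".join("<image>" for item in content if item["type"] == "image")
            parts.append("Image:" + placeholders)
        elif message["role"] == "caption":
            texts = [item["text"] for item in content if item["type"] == "text"]
            if texts:
                parts.append(" Caption: " + " ".join(texts) + "\n")
            else:
                # Empty caption: leave the prompt open for the model to complete.
                parts.append(" Caption:")
    return "".join(parts)


messages = [
    {"role": "image", "content": [{"type": "image"}]},
    {"role": "caption", "content": [{"type": "text", "text": "A cat sits on a sofa."}]},
    {"role": "image", "content": [{"type": "image"}]},
    {"role": "caption"},
]
print(repr(render_prompt(messages)))
```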

After getting the model output, you need to post-process the generation to clean it and extract the answer. This step is dataset-dependent: different datasets require different post-processing.
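
As an illustration of what a post-processing step might do, the sketch below keeps only the first line of a raw generation. The function `clean_caption` is hypothetical and does not describe the behavior of `testbed.data.coco.postprocess_generation`, whose rules are not shown in this tutorial.

```python
def clean_caption(raw_output: str) -> str:
    """Hypothetical cleanup: keep the first line and trim whitespace."""
    # A few-shot prompt often makes the model start a new "Image:" line,
    # so everything after the first newline is discarded here.
    first_line = raw_output.strip().split("\n", 1)[0]
    return first_line.strip()


print(clean_caption(" A man carries a goat through a flooded market.\nImage"))
```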

In [3]:
from testbed.data.coco import postprocess_generation
from testbed.data import prepare_caption_input

batch = next(iter(dataloader))
single_context = batch[0]
text, images = prepare_caption_input([single_context])
print(model.apply_prompt_template(text))
raw_output = model.generate(text, images, max_new_tokens=15, num_beams=5)
prediction = postprocess_generation(raw_output)
print(prediction)
print(single_context[-1]["sentences_raw"])  # ground-truth captions of the query image

['Image:<image> Caption: A restaurant has modern wooden tables and chairs.\nImage:<image> Caption: A man preparing desserts in a kitchen covered in frosting.\nImage:<image> Caption:']
['A man carries a goat through a flooded market.\nImage']
['A child holding a flowered umbrella and petting a yak.', 'A young man holding an umbrella next to a herd of cattle.', 'a young boy barefoot holding an umbrella touching the horn of a cow', 'A young boy with an umbrella who is touching the horn of a cow.', 'A boy holding an umbrella while standing next to livestock.']


## Step 4. Evaluate
For the image captioning task, we use [CIDEr](../testbed/evaluate/metrics/CIDEr/CIDEr.py) for evaluation, which is implemented with the Hugging Face [`evaluate`](https://huggingface.co/docs/evaluate/index) library. It is thoroughly tested to ensure full consistency with the [official CIDEr implementation](https://github.com/tylin/coco-caption); see the [test script](../tests/CIDEr/test_CIDEr.py).

Thanks to Hugging Face Spaces, you can also try `CIDEr` online [here](https://huggingface.co/spaces/Kamichanw/CIDEr).
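
For intuition only, the sketch below scores a prediction by the cosine similarity of its n-gram counts against each reference, averaged over references and over n-gram lengths 1 to 4. It is a deliberate simplification and is not CIDEr: the real metric weights n-grams by TF-IDF computed over the whole corpus, among other details, so the numbers are not comparable. All function names here are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def simple_ngram_score(prediction, references, max_n=4):
    """Simplified n-gram similarity (no TF-IDF weighting, unlike CIDEr)."""
    pred_tokens = prediction.lower().split()
    score = 0.0
    for n in range(1, max_n + 1):
        pred_vec = ngram_counts(pred_tokens, n)
        sims = [cosine(pred_vec, ngram_counts(ref.lower().split(), n)) for ref in references]
        score += sum(sims) / len(sims)
    return score / max_n


refs = ["A boy holding an umbrella while standing next to livestock."]
print(simple_ngram_score("A boy holding an umbrella next to a cow.", refs))
```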

In [None]:
from testbed.data.coco import postprocess_generation
from testbed.data import prepare_caption_input
from tqdm import tqdm
import evaluate

total_cider = evaluate.load("Kamichanw/CIDEr")
result = []

# for simplicity, just run 10 batches
for i, batch in zip(
    range(10), tqdm(dataloader, desc=f"Evaluating {model.model_name} ...")
):
    text, images = prepare_caption_input(batch)
    predictions = model.generate(text, images, **hparams["generate_args"])
    for pred, context in zip(predictions, batch):
        last_cap = context[-1]
        gt_captions = last_cap["sentences_raw"]
        prediction = postprocess_generation(pred)
        total_cider.add(prediction=prediction, reference=gt_captions)
        result.append(
            {
                "cocoid": last_cap["cocoid"],
                "raw_output": pred,
                "filename": last_cap["filename"],
                "sentences": last_cap["sentences_raw"],
                "prediction": prediction,
            }
        )

eval_result = total_cider.compute()
eval_result

## Step 5. Save Results
With the help of `evaluate.save`, we can save the results and hyperparameters to a JSON file.

In [None]:
hparams["dtype"] = str(hparams["dtype"])
evaluate.save("./", eval_result=eval_result, hparams=hparams, records=result)
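
The `str(...)` conversion of `hparams["dtype"]` above is needed because JSON can only encode basic types. The sketch below reproduces the issue with the standard `json` module; `FakeDType` is a hypothetical stand-in for `torch.bfloat16` so that the example runs without PyTorch.

```python
import json


class FakeDType:
    """Hypothetical stand-in for a non-JSON type such as torch.bfloat16."""

    def __str__(self):
        return "torch.bfloat16"


demo_hparams = {"batch_size": 1, "num_shots": 2, "dtype": FakeDType()}

try:
    json.dumps(demo_hparams)
    serializable = True
except TypeError:
    # Arbitrary objects cannot be encoded as JSON.
    serializable = False

# Converting the value to a string makes the dictionary JSON-friendly.
demo_hparams["dtype"] = str(demo_hparams["dtype"])
encoded = json.dumps(demo_hparams)
print(serializable, encoded)
```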