# Evaluate Image Captioning on Idefics2 Model

In this tutorial, we will show step by step how to use `ICLTestbed` for model inference and evaluation.

Let's take Idefics2 as an example. Idefics2 is a general multimodal model that takes arbitrary sequences of text and images as input and generates text responses. More details about Idefics2 can be found in the following resources:

[paper](https://arxiv.org/abs/2405.02246) [blog](https://huggingface.co/blog/idefics2) [official-code](https://github.com/huggingface/transformers/tree/main/src/transformers/models/idefics2)

## Step 1. Data Loading
Load the dataset with the `datasets` library. You can load official datasets or use a custom loading script. For convenience, we suggest putting all path-related settings in [config.py](../config.py).

If you have more than one GPU and want to use several of them, you need to set the environment variable `CUDA_VISIBLE_DEVICES`, which is already done in [config.py](../config.py).
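A minimal sketch of how this variable can be set from Python, assuming it runs before anything initializes CUDA (typically before `torch` is imported). The device IDs `"0,1"` are placeholders, and the actual contents of `config.py` may differ.

```python
import os

# Assumption: expose only GPUs 0 and 1 to this process. This has to happen
# before any library initializes CUDA, otherwise it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"{len(visible)} GPU(s) requested: {visible}")
```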

`testbed.data.utils.prepare_dataloader` uses the given dataset and a PyTorch sampler to build a dataloader that yields batches of size `batch_size`. Each item in a batch is a context of `num_shots + 1` image-caption pairs: `num_shots` in-context examples followed by one query.

This is slightly different from the [VQA tutorial](./tutorial_vqa.ipynb) in that we extract in-context examples from the training set and queries from the validation set.
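The batch layout described above can be illustrated in plain Python. This is only a sketch of the idea, not the implementation of `prepare_dataloader`; the helper `build_contexts` and the toy records are hypothetical.

```python
import random


def build_contexts(train, validation, num_shots, batch_size, seed=0):
    """Group samples into contexts of num_shots demonstrations plus one query."""
    rng = random.Random(seed)
    queries = rng.sample(validation, k=len(validation))  # shuffled copy
    contexts = []
    for query in queries:
        shots = rng.sample(train, k=num_shots)  # in-context examples from train
        contexts.append(shots + [query])  # the query is always the last item
    # Split the list of contexts into batches of `batch_size` contexts each.
    return [contexts[i:i + batch_size] for i in range(0, len(contexts), batch_size)]


# Toy stand-ins for the real dataset splits.
train = [{"split": "train", "caption": f"train caption {i}"} for i in range(6)]
validation = [{"split": "val", "caption": f"val caption {i}"} for i in range(3)]

batches = build_contexts(train, validation, num_shots=2, batch_size=1)
first_context = batches[0][0]
print(len(first_context))          # num_shots + 1 items
print(first_context[-1]["split"])  # the query comes from the validation split
```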

In [3]:
from datasets import load_dataset
import sys

sys.path.insert(0, "..")
from testbed.data.utils import prepare_dataloader
import config

dataset = load_dataset(
    "../testbed/data/coco",
    data_dir=config.karpathy_coco_caption_dir,
    images_dir=config.coco_dir,
    trust_remote_code=True,
)

hparams = {
    "batch_size": 1,
    "num_shots": 2,
    "precision": "bf16",
    "generate_args": {"num_beams": 3, "max_new_tokens": 20},
}

dataloader = prepare_dataloader(
    [dataset["train"], dataset["validation"]],
    batch_size=hparams["batch_size"],
    num_shots=hparams["num_shots"],
    num_per_dataset=[hparams["num_shots"], 1],
    shuffle=True,
)

## Step 2. Model Building
A model in ICLTestbed can be roughly regarded as a simple combination of a processor and a specific model. You can access the underlying processor or model through `model.processor` or `model.model`.

In [4]:
from testbed.models import Idefics2
import torch

device = torch.device("cuda:0")
model = Idefics2(
    config.idefics2_8b_base_path,
    precision=hparams["precision"],
    device=device,
)

## Step 3. Inference
You can get batches by iterating over the dataloader, then use one of the `prepare_*_input` methods in `testbed.data` (chosen according to the task) to convert the batches into model inputs. The model input is usually a `list` of `dict`s. For example, a 1-shot context looks like this:
```python
[
    {
        "role": "",
        "content": [
            {"type": "image"}
        ],
    },
    {
        "role": "caption",
        "content": [
            {
                "type": "text",
                "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty.",
            },
        ],
    },
    {
        "role": "",
        "content": [
            {"type": "image"}
        ],
    },
    {
        "role": "caption"
    }
]
```

This input is transformed into the actual prompt by `model.apply_prompt_template`, which runs as a step inside `model.generate`. `apply_prompt_template` is an alias for [`apply_chat_template`](https://huggingface.co/docs/transformers/main/chat_templating).
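To make the transformation concrete, here is a toy sketch that flattens such a message list into a single prompt string. The `<image>` placeholder, the `role: text` layout, and the function `render_prompt` are assumptions for illustration; the actual template applied by `apply_prompt_template` may differ.

```python
def render_prompt(messages, image_token="<image>"):
    """Flatten a list of role/content dicts into one prompt string (toy template)."""
    parts = []
    for message in messages:
        role = message.get("role", "")
        pieces = []
        for item in message.get("content", []):
            if item["type"] == "image":
                pieces.append(image_token)
            elif item["type"] == "text":
                pieces.append(item["text"])
        body = " ".join(pieces)
        if role:
            # A message with a role but no content is left open, so the model
            # continues the text right after "role:".
            parts.append(f"{role}: {body}".rstrip())
        else:
            parts.append(body)
    return "\n".join(parts)


messages = [
    {"role": "", "content": [{"type": "image"}]},
    {"role": "caption", "content": [{"type": "text", "text": "A statue in a harbor."}]},
    {"role": "", "content": [{"type": "image"}]},
    {"role": "caption"},
]
print(render_prompt(messages))
```

The final message has a role but no content, which is what leaves the prompt open for the model to complete.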

After getting the model output, you need to post-process the generation to clean it and extract the actual caption. This step is dataset-dependent: different datasets call for different post-processing.
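As an illustration of what such a step might do, the sketch below keeps only the first non-empty line of the generated text and normalizes whitespace. It is a hypothetical stand-in (`clean_caption` is not part of ICLTestbed), and the real `postprocess_generation` for COCO may apply different rules.

```python
def clean_caption(raw_output: str) -> str:
    """Keep the first non-empty line of a generation and normalize whitespace."""
    for line in raw_output.splitlines():
        line = " ".join(line.split())  # collapse runs of whitespace
        if line:
            return line
    return ""


print(clean_caption("  A long table with   wooden chairs.\ncaption: A dog"))
```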

In [16]:
from testbed.data.coco import postprocess_generation
from testbed.data.utils import prepare_caption_input

batch = next(iter(dataloader))
single_context = batch[0]
print(single_context)
text, images = prepare_caption_input([single_context])
print(text)
raw_output = model.generate(text, images, max_new_tokens=15, num_beams=5)
print(raw_output)
prediction = postprocess_generation(raw_output)
print(prediction)
print(single_context[-1]["sentences_raw"]) # gt

[{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480 at 0x7F15500657F0>, 'filepath': 'COCO_train2014_000000057870.jpg', 'sentids': [787980, 789366, 789888, 791316, 794853], 'filename': 'COCO_train2014_000000057870.jpg', 'imgid': 40504, 'split': 'train', 'caption': 'A restaurant has modern wooden tables and chairs.', 'sentences_tokens': [['a', 'restaurant', 'has', 'modern', 'wooden', 'tables', 'and', 'chairs'], ['a', 'long', 'restaurant', 'table', 'with', 'rattan', 'rounded', 'back', 'chairs'], ['a', 'long', 'table', 'with', 'a', 'plant', 'on', 'top', 'of', 'it', 'surrounded', 'with', 'wooden', 'chairs'], ['a', 'long', 'table', 'with', 'a', 'flower', 'arrangement', 'in', 'the', 'middle', 'for', 'meetings'], ['a', 'table', 'is', 'adorned', 'with', 'wooden', 'chairs', 'with', 'blue', 'accents']], 'sentences_raw': ['A restaurant has modern wooden tables and chairs.', 'A long restaurant table with rattan rounded back chairs.', 'a long table with a plant on top of it sur

## Step 4. Evaluate
For the image captioning task, we use [CIDEr](../testbed/evaluate/metrics/CIDEr/CIDEr.py) for evaluation. It is implemented with the Hugging Face [`evaluate`](https://huggingface.co/docs/evaluate/index) library and is thoroughly tested to ensure full consistency with the [official CIDEr implementation](https://github.com/tylin/coco-caption); see the [test script](../tests/CIDEr/test_CIDEr.py).

Thanks to Hugging Face Spaces, you can also try `CIDEr` online [here](https://huggingface.co/spaces/Kamichanw/CIDEr).
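For intuition about what the metric rewards, the sketch below computes a heavily simplified n-gram cosine similarity between a candidate caption and its references, averaged over n = 1 to 4. It is not CIDEr: it omits the TF-IDF weighting over the corpus as well as the tokenization used by the official implementation, so its scores are not comparable to the numbers reported here. The name `ngram_similarity` is hypothetical.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def ngram_similarity(candidate, references, max_n=4):
    """Average cosine similarity over n = 1..max_n and over all references."""
    cand = candidate.lower().split()
    total = 0.0
    for n in range(1, max_n + 1):
        cand_vec = ngrams(cand, n)
        sims = [cosine(cand_vec, ngrams(ref.lower().split(), n)) for ref in references]
        total += sum(sims) / len(sims)
    return total / max_n


refs = ["a long table with wooden chairs"]
print(ngram_similarity("a long table with wooden chairs", refs))  # exact match
print(ngram_similarity("a long table with blue chairs", refs))    # partial overlap
print(ngram_similarity("two dogs playing outside today", refs))   # no shared words
```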

In [14]:
from testbed.data.coco import postprocess_generation
from testbed.data.utils import prepare_caption_input
from tqdm import tqdm
import evaluate

total_cider = evaluate.load("Kamichanw/CIDEr")
result = []

# for simplicity, just run 10 batches
for i, batch in zip(
    range(10), tqdm(dataloader, desc=f"Evaluating {model.model_name} ...")
):
    text, images = prepare_caption_input(batch)
    predictions = model.generate(text, images, **hparams["generate_args"])
    for pred, context in zip(predictions, batch):
        last_cap = context[-1]
        gt_captions = last_cap["sentences_raw"]
        prediction = postprocess_generation(pred)
        total_cider.add(prediction=prediction, reference=gt_captions)
        result.append(
            {
                "cocoid": last_cap["cocoid"],
                "raw_output": pred,
                "filename": last_cap["filename"],
                "sentences": last_cap["sentences_raw"],
                "prediction": prediction,
            }
        )

eval_result = total_cider.compute()
eval_result

Evaluating idefics2-8b-base ...:   1%|          | 9/1666 [00:13<40:25,  1.46s/it]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
PTBTokenizer tokenized 676 tokens at 18866.12 tokens per second.


{'CIDEr': 0.8088168397310328}

## Step 5. Save Results
With the help of `evaluate.save`, we can save the evaluation result, the hyperparameters, and the per-sample records to a JSON file.

In [None]:
evaluate.save("./", eval_result=eval_result, hparams=hparams, records=result)
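If you only need the JSON itself, a rough equivalent can be written with the standard library. This sketch is an assumption about the general shape of the saved file, not the implementation of `evaluate.save`, which may add metadata of its own; the helper `save_results`, the file name `result.json`, and the sample values are placeholders.

```python
import json
import tempfile
from pathlib import Path


def save_results(directory, **data):
    """Write the keyword arguments as one JSON file and return its path."""
    path = Path(directory) / "result.json"
    path.write_text(json.dumps(data, indent=2))
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved = save_results(
        tmp,
        eval_result={"CIDEr": 0.0},  # placeholder score, not a real result
        hparams={"batch_size": 1, "num_shots": 2},
        records=[{"filename": "example.jpg", "prediction": "a placeholder caption"}],
    )
    loaded = json.loads(saved.read_text())
    print(sorted(loaded))
```

Every value passed this way has to be JSON-serializable, which is why the records store file names and strings rather than image objects.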