# Evaluate Image Captioning on Idefics2 Model

In this tutorial, we will show step by step how to use `ICLTestbed` for model inference and evaluation.

Let's take Idefics2 as an example. Idefics2 is a general multimodal model that takes arbitrary sequences of text and images as input and generates text responses. More details about Idefics2 can be found in the following resources:

[paper](https://arxiv.org/abs/2405.02246) [blog](https://huggingface.co/blog/idefics2) [official-code](https://github.com/huggingface/transformers/tree/main/src/transformers/models/idefics2)

## Step 1. Data Loading
Load the dataset with the `datasets` library. You can load an official dataset or use a custom loading script. For convenience, I suggest you put all path-related settings in [config.py](../config.py).

If you have more than one GPU and want to use several of them, you need to set the environment variable `CUDA_VISIBLE_DEVICES`, which is already done in [config.py](../config.py).
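As a minimal sketch of what such a setting might look like (the device list `"0,1"` is a hypothetical choice, and the actual contents of `config.py` are not shown here), the variable can be set from Python. It needs to be set before CUDA is initialized, so it is usually placed before any GPU-related code runs.

```python
import os

# Hypothetical device selection: expose only GPU 0 and GPU 1 to this process.
# Adjust the indices to match the GPUs available on your machine.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Inspect what the current process will see.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"{len(visible)} GPU(s) exposed to this process: {visible}")
```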

`testbed.data.prepare_dataloader` uses the given dataset and a PyTorch sampler to build a dataloader that yields batches of size `batch_size`. Each element of a batch contains `num_shots + 1` image-caption pairs.
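To make the batch layout concrete, here is a small sketch in plain Python. It is not the implementation of `prepare_dataloader`; the helper `make_icl_batches` and the toy samples are hypothetical and only illustrate how shuffled samples could be grouped into chunks of `num_shots + 1` and then into batches.

```python
import random
from typing import Any, Dict, List


def make_icl_batches(
    samples: List[Dict[str, Any]], batch_size: int, num_shots: int, seed: int = 0
) -> List[List[List[Dict[str, Any]]]]:
    """Hypothetical helper: shuffle samples, group them into chunks of
    num_shots + 1, then collect batch_size chunks per batch."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)

    group_size = num_shots + 1
    # Leftover samples that cannot fill a complete group are dropped.
    groups = [
        [samples[i] for i in order[start : start + group_size]]
        for start in range(0, len(order) - group_size + 1, group_size)
    ]
    return [groups[start : start + batch_size] for start in range(0, len(groups), batch_size)]


# Toy data standing in for image-caption pairs.
toy_samples = [{"image": f"img_{i}.jpg", "caption": f"caption {i}"} for i in range(7)]
batches = make_icl_batches(toy_samples, batch_size=1, num_shots=2)

for batch in batches:
    for group in batch:
        print([sample["image"] for sample in group])
```

With 7 toy samples and `num_shots=2`, this yields two batches, each holding one group of three samples; the seventh sample is dropped in this sketch.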

```python
from datasets import load_dataset
import sys

# Make the repository root importable so that `testbed` and `config` resolve.
sys.path.insert(0, "..")
from testbed.data.utils import prepare_dataloader
from torch.utils.data.sampler import RandomSampler
import config

# Load the COCO captioning data through the custom loading script.
dataset = load_dataset(
    "../testbed/data/coco",
    split="validation",
    data_dir=config.karpathy_coco_caption_dir,
    images_dir=config.coco_dir,
    trust_remote_code=True,
)

hparams = {
    "batch_size": 1,
    "num_shots": 2,
    "precision": "bf16",
    "generate_args": {"num_beams": 3, "max_new_tokens": 5},
}

dataloader = prepare_dataloader(
    dataset,
    batch_size=hparams["batch_size"],
    num_shots=hparams["num_shots"],
    sampler=RandomSampler(dataset),
)
```


## Step 2. Model Building
A model in ICLTestbed can be roughly regarded as a simple combination of a processor and a specific model. You can access the underlying processor or model through `model.processor` or `model.model`.

```python
from testbed.models import Idefics2
import torch

device = torch.device("cuda:0")
model = Idefics2(
    config.idefics2_8b_base_path,
    precision=hparams["precision"],
    device=device,
)
```

## Step 3. Inference
You can get batches by iterating over the dataloader, and then use the task-specific `prepare_*_input` methods in `testbed.data` to convert each batch into model inputs. The model input is usually a `list` of `dict`. For example, a 1-shot context looks like this:
```python
[
    {
        "role": "",
        "content": [
            {"type": "image"}
        ],
    },
    {
        "role": "caption",
        "content": [
            {
                "type": "text",
                "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty.",
            },
        ],
    },
    {
        "role": "",
        "content": [
            {"type": "image"}
        ],
    },
    {
        "role": "caption"
    }
]
```
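The structure above can be assembled programmatically. The following is a minimal sketch, not the library's `prepare_*_input` implementation: the helper `build_caption_messages` is hypothetical, and it only builds the message list (the images themselves are handled separately and are not shown).

```python
from typing import Any, Dict, List


def build_caption_messages(context_captions: List[str]) -> List[Dict[str, Any]]:
    """Hypothetical helper: one image turn and one caption turn per in-context
    example, followed by the query image and an open caption turn."""
    messages: List[Dict[str, Any]] = []
    for caption in context_captions:
        messages.append({"role": "", "content": [{"type": "image"}]})
        messages.append(
            {"role": "caption", "content": [{"type": "text", "text": caption}]}
        )
    # Query: an image whose caption the model is expected to generate.
    messages.append({"role": "", "content": [{"type": "image"}]})
    messages.append({"role": "caption"})
    return messages


one_shot = build_caption_messages(
    ["In this image, we can see the city of New York, and more specifically the Statue of Liberty."]
)
print(len(one_shot))
```

For an n-shot context this produces `2 * n + 2` entries, ending with the open `caption` turn.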

This input is converted into the actual prompt by `model.apply_prompt_template`, which is called as a step in `model.generate`. `apply_prompt_template` is an alias for [`apply_chat_template`](https://huggingface.co/docs/transformers/main/chat_templating).

After getting the model output, you need to post-process the generation to clean it and extract the answer. Post-processing is dataset-dependent: different datasets have different post-processing styles.
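As an illustration only, a post-processing step might look like the sketch below. The function `postprocess_caption` and its rules (keep the first line, collapse whitespace) are assumptions chosen for the example, not the rules ICLTestbed applies to any particular dataset.

```python
import re


def postprocess_caption(generation: str) -> str:
    """Hypothetical cleanup: keep only the first line of the generation
    and collapse repeated whitespace."""
    first_line = generation.strip().split("\n", 1)[0]
    return re.sub(r"\s+", " ", first_line).strip()


raw_output = "  A dog  runs on the beach.\nImage: "
cleaned = postprocess_caption(raw_output)
print(cleaned)
```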

For the visual question answering task, VQA accuracy is used for evaluation; it has already been implemented with the [`evaluate`](https://huggingface.co/docs/evaluate/index) library from Hugging Face.
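For intuition about the metric, the sketch below implements the commonly cited simplified form of VQA accuracy, `min(#matching human answers / 3, 1)`. It is not the implementation shipped with ICLTestbed, and it skips the answer normalization and subset averaging used by the official evaluation; the function name `simplified_vqa_accuracy` is hypothetical.

```python
from typing import List


def simplified_vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Simplified VQA accuracy: a prediction counts as fully correct
    when at least three annotators gave the same answer."""
    normalized = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)


# Ten toy annotations for one question.
answers = ["cat"] * 3 + ["dog"] * 7
print(simplified_vqa_accuracy("Cat", answers))
print(simplified_vqa_accuracy("bird", answers))
```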