# Evaluate VQA on Idefics2 Model

In this tutorial, we will show step by step how to use `ICLTestbed` for model inference and evaluation.

Let's take Idefics2 as an example. Idefics2 is a general multimodal model that takes as input arbitrary sequences of texts and images, and generates text responses. The more details about Idefics2 can be found in following ways:

[paper](https://arxiv.org/abs/2405.02246) [blog](https://huggingface.co/blog/idefics2) [official-code](https://github.com/huggingface/transformers/tree/main/src/transformers/models/idefics2)

## Step 1. Data Loading
Load dataset by `datasets` library. You can load official datasets or use a custom loading script. For convenience, I suggest you put all the path-related content in [config.py](../config.py)

If you have more than one GPU and want to use multiple GPUs, you need to set the environment variable `CUDA_VISIBLE_DEVICE`, which is already done in [config.py](../config.py).

`testbed.data.utils.prepare_dataloader` will use the given dataset and the sampler in PyTorch to generate a dataloader that can produce batches of size `batch_size`, each batch has `num_shots + 1` question-answer pairs.

In [1]:
import torch
import os
import sys
from datasets import load_dataset

sys.path.insert(0, "..")
from testbed.data import prepare_dataloader
import config

dataset = load_dataset(
    os.path.join(config.testbed_dir, "data", "vqav2"),
    split="validation",
    data_dir=config.vqav2_dir,
    images_dir=config.coco_dir,
    trust_remote_code=True,
)

hparams = {
    "batch_size": 1,
    "num_shots": 2,
    "dtype": torch.bfloat16,
    "generate_args": {"num_beams": 3, "max_new_tokens": 5},
}

dataloader = prepare_dataloader(
    dataset,
    batch_size=hparams["batch_size"],
    num_shots=hparams["num_shots"],
    shuffle=True,
)

## Step 2. Model Building
The model in ICLTestbed can be roughly regarded as a simple combination of a processor and a specific model. You can access underlying processor or model by `model.processor` or `model.model`.

In [2]:
from testbed.models import Idefics
import torch

device = torch.device("cuda:1")
model = Idefics(
    config.idefics_9b_path,
    torch_dtype=hparams["dtype"],
).to(device)

Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]

## Step 3. Inference
You can get batches by iterating over the dataloader, and then use the `prepare_*_input` methods (depending on a specific task) in `testbed.data` to convert the batches into model inputs according to the specific task. The model input here should usually be a `list` of `dict`. For example, for a 1-shot context, 
```python
[
    {
        "role": "instruction",
        "content": "Provide an answer to the question. Use the image to answer." ,
    },
    {
        "role": "image:
        "content": [
            {"type": "image"}
        ]
    },
    {
        "role": "question",
        "content": [
            {"type": "text", "text": "What do we see in this image?"},
        ],
    },
    {
        "role": "answer",
        "content": [
            {
                "type": "text",
                "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty.",
            },
        ],
    },
    {
        "role": "image:
        "content": [
            {"type": "image"}
        ]
    },
    {
        "role": "question",
        "content": [
            {"type": "text", "text": "And how about this image?"},
        ],
    },
    {
        "role": "answer"
    }
]
```

It will be transformed to the real prompt by `model.apply_prompt_template` which is a step in `model.process_input`. `apply_prompt_template` is an alias for [`apply_chat_template`](https://huggingface.co/docs/transformers/main/chat_templating).

After getting the model output, you need to do post-processing generation to clean and extract what answer should be. This is a dataset-dependent method, that is, different datasets have different post-processing styles.

Let's view the full pipeline with a mini-batch.

In [8]:
from testbed.data.vqav2 import postprocess_generation
from testbed.data import prepare_vqa_input

batch = next(iter(dataloader))
single_context = batch[0]
print(single_context)
text, images = prepare_vqa_input(
    [single_context],
    instruction="Provide an answer to the question. Use the image to answer.",
)
print(text)
inputs = model.process_input(text, images).to(device)
seq_len = inputs.input_ids.shape[-1]
generated_ids = model.generate(**inputs, **hparams["generate_args"])
generated_ids = generated_ids[:, seq_len:]
raw_output = model.processor.batch_decode(generated_ids, skip_special_tokens=True)
prediction = postprocess_generation(raw_output)
print(prediction)
print(single_context[-1]["answers"]) # gt

[{'question_type': 'none of the above', 'multiple_choice_answer': 'down', 'answers': [{'answer': 'down', 'answer_confidence': 'yes', 'answer_id': 1}, {'answer': 'down', 'answer_confidence': 'yes', 'answer_id': 2}, {'answer': 'at table', 'answer_confidence': 'yes', 'answer_id': 3}, {'answer': 'skateboard', 'answer_confidence': 'yes', 'answer_id': 4}, {'answer': 'down', 'answer_confidence': 'yes', 'answer_id': 5}, {'answer': 'table', 'answer_confidence': 'yes', 'answer_id': 6}, {'answer': 'down', 'answer_confidence': 'yes', 'answer_id': 7}, {'answer': 'down', 'answer_confidence': 'yes', 'answer_id': 8}, {'answer': 'down', 'answer_confidence': 'yes', 'answer_id': 9}, {'answer': 'down', 'answer_confidence': 'yes', 'answer_id': 10}], 'answer': 'down', 'image_id': 262148, 'answer_type': 'other', 'question_id': 262148000, 'question': 'Where is he looking?', 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x512 at 0x7F1E1FEC0B60>}, {'question_type': 'what are the', 'multiple_

## Step 4. Evaluate
For visual question answering task, it uses [vqa accuracy](../testbed/evaluate/metrics/vqa_accuracy/vqa_accuracy.py) to evaluate, which has already been implemented with [`evaluate`](https://huggingface.co/docs/evaluate/index) library that comes from hugging face. It is thoroughly tested to ensure full consistency with the official VQA accuracy implementation, see [test script](../tests/vqa_accuracy/test_vqa_accuracy.py).

Thanks to huggingface space, you can also check [here](https://huggingface.co/spaces/Kamichanw/vqa_accuracy) to try `vqa_accuracy` online.

In [10]:
from tqdm import tqdm
import evaluate

total_acc = evaluate.load("Kamichanw/vqa_accuracy")
result = []

# for simplicity, just run 10 batches
for i, batch in zip(
    range(10), tqdm(dataloader, desc=f"Evaluating {model.model_name} ...")
):
    text, images = prepare_vqa_input(
        batch, instruction="Provide an answer to the question. Use the image to answer."
    )
    seq_len = inputs.input_ids.shape[-1]
    generated_ids = model.generate(**inputs, **hparams["generate_args"])
    generated_ids = generated_ids[:, seq_len:]
    predictions = model.processor.batch_decode(generated_ids, skip_special_tokens=True)
    for pred, context in zip(predictions, batch):
        last_qa = context[-1]
        gt_answer = [item["answer"] for item in last_qa["answers"]]
        prediction = postprocess_generation(pred)
        total_acc.add(
            prediction=prediction,
            reference=gt_answer,
            question_types=last_qa["question_type"],
            answer_types=last_qa["answer_type"],
        )
        result.append(
            {
                "question_id": last_qa["question_id"],
                "raw_output": pred,
                "question": last_qa["question"],
                "question_type": last_qa["question_type"],
                "answer_type": last_qa["answer_type"],
                "prediction": prediction,
                "answers": last_qa["answers"],
            }
        )

eval_result = total_acc.compute()
eval_result

Evaluating idefics2-8b-base ...:   1%|          | 9/1111 [00:11<22:41,  1.24s/it]


{'overall': 10.0,
 'perAnswerType': {'other': 14.285714285714286, 'yes/no': 0.0},
 'perQuestionType': {'what kind of': 100.0,
  'what is': 0.0,
  'what color is the': 0.0,
  'are there': 0.0,
  'is there a': 0.0,
  'is this an': 0.0,
  'what is the': 0.0,
  'what animal is': 0.0,
  'is this a': 0.0}}

## Step 5. Save Results
With the help of `evaluate.save`, we are able to save result and other hyper parameters to a json file.

In [None]:
evaluate.save("./", eval_result=eval_result, hparams=hparams, records=result)