# Evaluate VQA on Idefics2 Model

In this tutorial, we will show step by step how to use `ICLTestbed` for model inference and evaluation.

Let's take Idefics2 as an example. Idefics2 is a general multimodal model that takes arbitrary sequences of text and images as input and generates text responses. More details about Idefics2 can be found in the following resources:

[paper](https://arxiv.org/abs/2405.02246) | [blog](https://huggingface.co/blog/idefics2) | [official-code](https://github.com/huggingface/transformers/tree/main/src/transformers/models/idefics2)

## Step 1. Data Loading
Load the dataset with the `datasets` library. You can load official datasets or use a custom loading script. For convenience, I suggest putting all path-related settings in [config.py](../config.py).

If you have more than one GPU and want to use several of them, set the environment variable `CUDA_VISIBLE_DEVICES`; this is already done in [config.py](../config.py).
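
As a minimal sketch of what such a setting might look like: the device IDs below are an assumption for illustration, not the values used in `config.py`. The variable has to be set before any CUDA-using library initializes CUDA in the process.

```python
import os

# Hypothetical device IDs; pick the GPUs you actually want to expose.
visible_gpus = [0, 1]

# Set this before torch (or any other CUDA-using library) initializes CUDA,
# otherwise it has no effect on the current process.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in visible_gpus)

print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Inside the process, the visible devices are renumbered starting from 0, so `cuda:0` refers to the first ID in the list.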

`testbed.data.prepare_dataloader` combines the given dataset with a PyTorch sampler to build a dataloader that yields batches of size `batch_size`. Each element of a batch holds `num_shots + 1` question-answer pairs: `num_shots` in-context examples followed by the question to be answered.

In [1]:
from datasets import load_dataset
import sys
sys.path.insert(0, "..")
from testbed.data import prepare_dataloader
from torch.utils.data.sampler import RandomSampler
import config

dataset = load_dataset(
    "../testbed/data/vqav2", split="validation", data_dir="./dev", images_dir=config.coco_dir,
    trust_remote_code=True
)

dataloader = prepare_dataloader(dataset, batch_size=1, num_shots=2, sampler=RandomSampler(dataset))
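
To make the batch layout concrete, here is a dependency-free sketch of the grouping idea. `make_icl_batches` is a hypothetical helper written for illustration only; it is not how `prepare_dataloader` is implemented, and the shuffling simply stands in for `RandomSampler`.

```python
import random


def make_icl_batches(dataset, batch_size, num_shots, seed=0):
    """Hypothetical helper: group shuffled items into contexts, then into batches."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)  # stands in for RandomSampler

    context_len = num_shots + 1  # num_shots examples + 1 query
    # Non-overlapping chunks of `context_len` items; a trailing partial chunk is dropped.
    contexts = [
        [dataset[i] for i in indices[start:start + context_len]]
        for start in range(0, len(indices) - context_len + 1, context_len)
    ]
    # Chunk the contexts into batches of `batch_size`.
    for start in range(0, len(contexts), batch_size):
        yield contexts[start:start + batch_size]


toy_dataset = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(12)]
batches = list(make_icl_batches(toy_dataset, batch_size=2, num_shots=2))
print(len(batches), len(batches[0]), len(batches[0][0]))
```

With this layout, the last item of each context plays the role of the query and the preceding items serve as demonstrations.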




## Step 2. Model Building
A model in ICLTestbed can be roughly regarded as a simple combination of a processor and an underlying model. You can access them via `model.processor` and `model.model`.

In [2]:
from testbed.models import Idefics2
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
device = torch.device("cuda:7")
model = Idefics2(
    config.idefics2_8b_base_path,
    precision="bf16",
    device=device,
    quantization_config=quantization_config,
)

Loading checkpoint shards: 100%|██████████| 7/7 [00:08<00:00,  1.17s/it]


## Step 3. Inference
You can get batches by iterating over the dataloader, then use the task-specific `prepare_*_input` functions in `testbed.data` to convert each batch into model inputs. The model input is usually a `list` of `dict`s. For example, for a 1-shot context:
```python
[
    {
        "role": "instruction",
        "content": "Provide an answer to the question. Use the image to answer.",
    },
    {
        "role": "example",
        "query": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
        "answer": [
            {
                "type": "text",
                "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty.",
            },
        ],
    },
    {
        "role": "question",
        "query": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ],
    },
]
```

This list is converted into the actual prompt by `model.apply_prompt_template`, which is called as a step of `model.generate`.
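
The conversion from the message list above to a flat prompt can be sketched as follows. `render_prompt`, the `<image>` placeholder, and the `Question: ... Answer: ...` layout are all assumptions made for illustration; the template actually used by `model.apply_prompt_template` may differ.

```python
def render_prompt(messages, image_token="<image>"):
    """Hypothetical renderer; the real template lives in model.apply_prompt_template."""

    def render_parts(parts):
        # Images become a placeholder token, text parts are copied verbatim.
        return "".join(
            image_token if part["type"] == "image" else part["text"]
            for part in parts
        )

    lines = []
    for message in messages:
        role = message["role"]
        if role == "instruction":
            lines.append(message["content"])
        elif role == "example":
            lines.append(
                f"Question: {render_parts(message['query'])} "
                f"Answer: {render_parts(message['answer'])}"
            )
        elif role == "question":
            # Leave the answer slot open for the model to complete.
            lines.append(f"Question: {render_parts(message['query'])} Answer:")
        else:
            raise ValueError(f"unknown role: {role}")
    return "\n".join(lines)


messages = [
    {"role": "instruction", "content": "Provide an answer to the question. Use the image to answer."},
    {
        "role": "example",
        "query": [{"type": "image"}, {"type": "text", "text": "What do we see in this image?"}],
        "answer": [{"type": "text", "text": "The Statue of Liberty."}],
    },
    {
        "role": "question",
        "query": [{"type": "image"}, {"type": "text", "text": "And how about this image?"}],
    },
]
prompt = render_prompt(messages)
print(prompt)
```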

After getting the model output, you need to post-process the generation to clean it up and extract the answer. This step is dataset-dependent: different datasets use different post-processing styles.
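
As a rough idea of what such a step can involve, the sketch below truncates a raw generation at the first stop word. `clean_generation` and its stop words are hypothetical; this is not the implementation of `postprocess_generation` in `testbed.data.vqav2`.

```python
import re


def clean_generation(raw_output, stop_words=("\n", "Question:")):
    """Hypothetical cleaner: keep only the text before the first stop word."""
    # One alternation pattern, split once at the earliest match.
    pattern = "|".join(re.escape(word) for word in stop_words)
    first_segment = re.split(pattern, raw_output, maxsplit=1)[0]
    return first_segment.strip()


print(clean_generation(" tennis\nQuestion: What color is the ball?"))
```

A base model often keeps generating further question-answer pairs after the answer, which is why truncation of this kind is a common first step.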

The visual question answering task is evaluated with VQA accuracy, which is already implemented with the [`evaluate`](https://huggingface.co/docs/evaluate/index) library from Hugging Face.

In [3]:
from testbed.data.vqav2 import postprocess_generation
from testbed.data import prepare_vqa_input
from tqdm import tqdm
import evaluate

total_acc = evaluate.load("../testbed/evaluate/metrics/vqa_accuracy")
result = []

# for simplicity, just run 10 batches
for i, batch in zip(range(10), tqdm(dataloader, desc=f"Evaluating {model.model_name} ...")):
    images, text = prepare_vqa_input(
        batch, instruction="Provide an answer to the question. Use the image to answer."
    )
    predictions = model.generate(text, images, max_new_tokens=10, num_beams=3)
    for pred, context in zip(predictions, batch):
        last_qa = context[-1]
        gt_answer = [item["answer"] for item in last_qa["answers"]]
        prediction = postprocess_generation(pred)
        total_acc.add(
            prediction=prediction,
            reference=gt_answer,
            question_types=last_qa["question_type"],
            answer_types=last_qa["answer_type"],
        )
        result.append(
            {
                "question_id": last_qa["question_id"],
                "raw_output": pred,
                "question": last_qa["question"],
                "question_type": last_qa["question_type"],
                "answer_type": last_qa["answer_type"],
                "prediction": prediction,
                "answers": last_qa["answers"],
            }
        )

print(total_acc.compute())

Evaluating idefics2-8b-base ...:   0%|          | 9/3333 [00:21<2:13:04,  2.40s/it]

{'overall': 60.0, 'perAnswerType': {'other': 40.0, 'yes/no': 80.0}, 'perQuestionType': {'what kind of': 100.0, 'is this a': 100.0, 'what sport is': 100.0, 'who is': 0.0, 'where is the': 0.0, 'are they': 100.0, 'are': 0.0, 'does the': 100.0, 'is this': 100.0, 'what does the': 0.0}}
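
For intuition about these scores, here is a simplified sketch of the standard VQA accuracy for a single question: a prediction counts as fully correct if at least three annotators gave it, averaged over all leave-one-annotator-out subsets. The function below is an illustration under that assumption; it skips answer normalization, and the metric in `testbed/evaluate/metrics/vqa_accuracy` may differ in such details.

```python
def vqa_accuracy(prediction, gt_answers):
    """Simplified per-question VQA accuracy in [0, 1], without answer normalization."""
    if not gt_answers:
        raise ValueError("gt_answers must not be empty")
    scores = []
    for held_out in range(len(gt_answers)):
        # Compare against all annotators except the held-out one.
        others = gt_answers[:held_out] + gt_answers[held_out + 1:]
        matches = sum(answer == prediction for answer in others)
        scores.append(min(1.0, matches / 3))
    return sum(scores) / len(scores)


gt = ["yes"] * 2 + ["no"] * 8
majority_score = vqa_accuracy("no", gt)
minority_score = vqa_accuracy("yes", gt)
print(majority_score, minority_score)
```

An answer given by only two of ten annotators still earns partial credit, which is why per-question scores are not limited to 0 or 1. The results printed above are on a 0 to 100 scale.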





## Step 4. Save Results
With the help of `evaluate.save`, we can save the results and other hyperparameters to a JSON file.

In [7]:
evaluate.save("./", records=result)

PosixPath('result-2024_08_06-21_24_25.json')
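
To inspect a saved file later, it can be read back with the standard `json` module. The sketch below uses a hypothetical `save_records` helper that imitates a timestamped JSON dump with the standard library only; it is not the implementation of `evaluate.save`, and the `num_shots` field is just an example of a hyperparameter you might store.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def save_records(directory, records, **hyperparameters):
    """Hypothetical stand-in: dump records and hyperparameters to a timestamped JSON file."""
    timestamp = datetime.now().strftime("%Y_%m_%d-%H_%M_%S")
    path = Path(directory) / f"result-{timestamp}.json"
    payload = {"records": records, **hyperparameters}
    with path.open("w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_records(
        tmp_dir,
        records=[{"question_id": 1, "prediction": "yes"}],
        num_shots=2,
    )
    # Read the file back to check what was stored.
    with saved_path.open(encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded["num_shots"], loaded["records"][0]["prediction"])
```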