# Evaluate Model on Your Dataset

This tutorial shows, step by step, how to use `ICLTestbed` to run model inference and evaluation on your own dataset.

Let's take Idefics2 as an example. Idefics2 is a general multimodal model that takes arbitrary sequences of text and images as input and generates text responses. More details about Idefics2 can be found here:

[paper](https://arxiv.org/abs/2405.02246) [blog](https://huggingface.co/blog/idefics2) [official-code](https://github.com/huggingface/transformers/tree/main/src/transformers/models/idefics2)

## Step 1. Data Loading
Load the dataset with the `datasets` library. You can load an official dataset or use a custom loading script. We use [hateful memes](https://huggingface.co/datasets/neuralcatcher/hateful_memes) as an example.

`testbed.data.utils.prepare_dataloader` uses the given dataset and a PyTorch sampler to build a dataloader that yields batches of size `batch_size`. Each element of a batch is a context of `num_shots + 1` question-answer pairs: `num_shots` in-context examples followed by the query.

In [3]:
import torch
from datasets import load_dataset
import os
import sys

sys.path.insert(0, "..")
from testbed.data import prepare_dataloader

hateful_memes_dir = "/data1/share/dataset/hateful_memes"
idefics2_8b_base_path = "/data1/pyz/model_weight/idefics2-8b-base"

dataset = load_dataset(
    os.path.join("..", "testbed", "data", "hateful_memes"),
    data_dir=hateful_memes_dir,
    split="train",
    trust_remote_code=True, 
)

hparams = {
    "batch_size": 1,
    "num_shots": 2,
    "dtype": torch.bfloat16,
    "generate_args": {"num_beams": 3, "max_new_tokens": 5},
}

dataloader = prepare_dataloader(
    dataset,
    batch_size=hparams["batch_size"],
    num_shots=hparams["num_shots"],
    shuffle=True,
)
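
To make the batch layout concrete, here is a minimal plain-Python sketch of the grouping described above. It is not the implementation of `prepare_dataloader`: the function `group_into_contexts`, the toy dataset, and the choice to drop an incomplete trailing context are all assumptions made for illustration.

```python
import random


def group_into_contexts(dataset, batch_size, num_shots, shuffle=False, seed=0):
    """Hypothetical stand-in for prepare_dataloader: yield batches of contexts.

    Each context holds num_shots + 1 items: the first num_shots serve as
    in-context examples and the last one is the query.
    """
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)

    context_size = num_shots + 1
    # Assumption: drop the incomplete tail so every context is full.
    usable = len(indices) - len(indices) % context_size
    contexts = [
        [dataset[i] for i in indices[start : start + context_size]]
        for start in range(0, usable, context_size)
    ]
    for start in range(0, len(contexts), batch_size):
        yield contexts[start : start + batch_size]


# Toy dataset with the same field names used later in this tutorial.
toy_dataset = [{"id": i, "text": f"meme {i}", "label": i % 2} for i in range(7)]
batches = list(group_into_contexts(toy_dataset, batch_size=1, num_shots=2))
```

With `batch_size=1` and `num_shots=2`, every batch is a list containing one context of three items, so `batch[0][-1]` is the query whose label the model has to predict.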

## Step 2. Model Building
A model in ICLTestbed can be roughly regarded as a simple combination of a processor and a specific model. You can access the underlying processor or model through `model.processor` or `model.model`.

In [4]:
from testbed.models import Idefics2
import torch

device = torch.device("cpu")
model = Idefics2(idefics2_8b_base_path, torch_dtype=hparams["dtype"]).to(device)

Loading checkpoint shards: 100%|██████████| 7/7 [00:03<00:00,  2.25it/s]


## Step 3. Inference
If you need to use your own prompt template, set it here. Suppose we want to use the following template:
```
<image>is an image with written "<text>" on it. Is it hateful? Answer: <label>
```
The prompt template in ICLTestbed is an alias for the chat template from Hugging Face (not familiar? see [Chat Templating](https://huggingface.co/docs/transformers/main/chat_templating)). The model input should usually be a `list` of `dict`, referred to as `messages` in the prompt template. For example, for a 1-shot context:
```python
[
    {
        "role": "",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": "is an image with written \"its their character not their color that matters\" on it. Is it hateful?",
            },
        ],
    },
    {
        "role": "answer",
        "content": [
            {
                "type": "text",
                "text": "no",
            },
        ],
    },
    {
        "role": "",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": "is an image with written \"don't be afraid to love again everyone is not like your ex\" on it. Is it hateful?",
            },
        ],
    },
    {
        "role": "answer"
    }
]
```

In [5]:
# fmt: off
model.prompt_template =  (
    "{% if messages[0]['role'] == 'instruction' %}"
        "Instruction: {{ messages[0]['content'] }}\n"
        "{% set messages = messages[1:] %}"
    "{% endif %}"
    "{% for message in messages %}"
        "{% if message['role'] != '' %}"
            "{{ message['role'].capitalize() }}"
            "{% if not 'content' in message or message['content'][0]['type'] == 'image' %}"
                "{{':'}}"
            "{% else %}"
                "{{': '}}"
            "{% endif %}" 
        "{% endif %}"
        "{% if 'content' in message %}"
            "{% for line in message['content'] %}"
                "{% if line['type'] == 'text' %}"
                    "{{ line['text'] }}"
                "{% elif line['type'] == 'image' %}"
                    "{{- '<image>' }}"
                "{% endif %}"
                "{% if loop.last %}"
                    "{% if message['role'] == 'answer' %}"
                        "\n\n"
                    "{% else %}"
                        " "
                    "{%+ endif %}"
                "{% endif %}"
            "{% endfor %}"
        "{% endif %}"
    "{% endfor %}"
)
# fmt: on
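
The Jinja template above can be hard to read at a glance. The sketch below restates its logic as a plain-Python function so the control flow is easier to follow. The name `render_prompt` is hypothetical and this is not how ICLTestbed renders prompts; in particular, emitting a single newline after each answer is an assumption chosen to match the rendered prompt printed later in this tutorial.

```python
def render_prompt(messages):
    """Plain-Python reading of the template logic (illustrative only)."""
    parts = []
    # An optional leading instruction is rendered once, on its own line.
    if messages and messages[0]["role"] == "instruction":
        parts.append(f"Instruction: {messages[0]['content']}\n")
        messages = messages[1:]

    for message in messages:
        content = message.get("content")
        if message["role"] != "":
            parts.append(message["role"].capitalize())
            # No space after the colon when nothing follows or an image comes first.
            if not content or content[0]["type"] == "image":
                parts.append(":")
            else:
                parts.append(": ")
        for index, line in enumerate(content or []):
            if line["type"] == "text":
                parts.append(line["text"])
            elif line["type"] == "image":
                parts.append("<image>")
            if index == len(content) - 1:
                # Assumption: one newline after an answer, a space otherwise.
                parts.append("\n" if message["role"] == "answer" else " ")
    return "".join(parts)


messages = [
    {"role": "instruction", "content": "Decide whether each meme is hateful."},
    {"role": "", "content": [{"type": "image"}, {"type": "text", "text": "Is it hateful?"}]},
    {"role": "answer", "content": [{"type": "text", "text": "no"}]},
    {"role": "", "content": [{"type": "image"}, {"type": "text", "text": "Is it hateful?"}]},
    {"role": "answer"},
]
prompt = render_prompt(messages)
print(prompt)
```

The final `answer` message has no `content`, so the prompt ends with `Answer:` and the model is left to generate the label.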

Next, you need to write a custom input-preparation function that extracts data from the dataset and builds the model input (see the example above), just like the `testbed.data.prepare_*_input` functions. Luckily, `testbed.data.prepare_input` does most of the work.

In [6]:
from testbed.data import prepare_input


def prepare_hateful_memes_input(batch):
    def retriever(item, is_last):
        return [
            {
                "role": "",
                "content": [
                    {"type": "image"},
                    {
                        "type": "text",
                        "text": f'is an image with written "{item["text"]}" on it. Is it hateful?',
                    },
                ],
            },
            (
                {"role": "answer"}
                if is_last
                else {
                    "role": "answer",
                    "content": [
                        {"type": "text", "text": "yes" if item["label"] == 1 else "no"}
                    ],
                }
            ),
        ]

    return prepare_input(
        batch,
        instruction="It's a conversation between a human, the user, and an intelligent visual AI, Bot. "
        "The user sends memes with text written on them, and Bot has to say whether the meme is hateful or not.",
        retriever=retriever,
    ), [[item["img"] for item in context] for context in batch]


prepare_hateful_memes_input(next(iter(dataloader)))

([[{'role': 'instruction',
    'content': "It's a conversation between a human, the user, and an intelligent visual AI, Bot. The user sends memes with text written on them, and Bot has to say whether the meme is hateful or not."},
   {'role': '',
    'content': [{'type': 'image'},
     {'type': 'text',
      'text': 'is an image with written "its their character not their color that matters" on it. Is it hateful?'}]},
   {'role': 'answer', 'content': [{'type': 'text', 'text': 'no'}]},
   {'role': '',
    'content': [{'type': 'image'},
     {'type': 'text',
      'text': 'is an image with written "don\'t be afraid to love again everyone is not like your ex" on it. Is it hateful?'}]},
   {'role': 'answer', 'content': [{'type': 'text', 'text': 'no'}]},
   {'role': '',
    'content': [{'type': 'image'},
     {'type': 'text',
      'text': 'is an image with written "putting bows on your pet" on it. Is it hateful?'}]},
   {'role': 'answer'}]],
 [[<PIL.PngImagePlugin.PngImageFile image mode=R
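
Judging from the output above, `prepare_input` appears to prepend the instruction message and then concatenate whatever the `retriever` returns for each item in a context. The sketch below shows that reading in plain Python. The function `prepare_input_sketch` and the toy retriever are hypothetical, and the real `testbed.data.prepare_input` may differ in its details.

```python
def prepare_input_sketch(batch, instruction, retriever):
    """Hypothetical reading of prepare_input, inferred from its output."""
    all_messages = []
    for context in batch:
        # Every context starts with the shared instruction message.
        messages = [{"role": "instruction", "content": instruction}]
        for position, item in enumerate(context):
            is_last = position == len(context) - 1
            # The retriever returns a list of messages for a single item.
            messages.extend(retriever(item, is_last))
        all_messages.append(messages)
    return all_messages


def toy_retriever(item, is_last):
    """Text-only retriever: the query (last item) gets an empty answer slot."""
    question = {"role": "", "content": [{"type": "text", "text": item["text"]}]}
    answer = {"role": "answer"}
    if not is_last:
        answer["content"] = [{"type": "text", "text": item["label"]}]
    return [question, answer]


toy_batch = [[{"text": "2 + 2 = ?", "label": "4"}, {"text": "3 + 3 = ?", "label": "6"}]]
inputs = prepare_input_sketch(toy_batch, "Answer the question.", toy_retriever)
```

The `is_last` flag is what keeps the ground-truth answer of the query out of the prompt.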


This structured input is turned into the actual prompt by `model.apply_prompt_template`, which is one step of `model.process_input`. `apply_prompt_template` is an alias for [`apply_chat_template`](https://huggingface.co/docs/transformers/main/chat_templating).

After getting the model output, you need to post-process the generation to clean it and extract the answer. This step is dataset-dependent: different datasets need different post-processing.

In [7]:
from testbed.data.hateful_memes import postprocess_generation

# Take one batch and run a single context through the full pipeline
batch = next(iter(dataloader))
single_context = batch[0]
text, images = prepare_hateful_memes_input([single_context])
print(model.apply_prompt_template(text))  # rendered prompt
raw_output = model.generate(text, images, **hparams["generate_args"])
print(raw_output)  # raw generated text
prediction = postprocess_generation(raw_output)
print(prediction)  # cleaned prediction
print(single_context[-1]["label"])  # ground-truth label of the query

['Instruction: It\'s a conversation between a human, the user, and an intelligent visual AI, Bot. The user sends memes with text written on them, and Bot has to say whether the meme is hateful or not.\n<image>is an image with written "its their character not their color that matters" on it. Is it hateful? Answer: no\n<image>is an image with written "don\'t be afraid to love again everyone is not like your ex" on it. Is it hateful? Answer: no\n<image>is an image with written "putting bows on your pet" on it. Is it hateful? Answer:']


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


[0]
0
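
For a yes/no task like this one, post-processing can be as simple as finding the first "yes" or "no" in the generated text and mapping it to a label. The sketch below illustrates the idea. The function `postprocess_yes_no` is hypothetical and is not the `postprocess_generation` shipped in `testbed.data.hateful_memes`; treating anything other than an explicit "yes" as 0 is also an assumption.

```python
import re


def postprocess_yes_no(raw_output):
    """Hypothetical cleaner: map generated text to 1 (hateful) or 0 (not hateful)."""
    # Accept a single string or a list of strings.
    if isinstance(raw_output, (list, tuple)):
        return [postprocess_yes_no(text) for text in raw_output]
    # Take the first standalone "yes" or "no", ignoring case and surrounding text.
    match = re.search(r"\b(yes|no)\b", raw_output.strip().lower())
    # Assumption: anything that is not an explicit "yes" counts as 0.
    return 1 if match and match.group(1) == "yes" else 0


single = postprocess_yes_no(" Yes\n\n<image>is an image")
many = postprocess_yes_no(["no", "yes, it is"])
```

Searching for the first match matters because a base model often keeps generating after the answer, for example by starting another few-shot example.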


## Step 4. Evaluate
The hateful memes task is evaluated with ROC AUC, which is already implemented in the Hugging Face [`evaluate`](https://huggingface.co/docs/evaluate/index) library.

In [None]:
from testbed.data.hateful_memes import postprocess_generation
from tqdm import tqdm
import evaluate

total_roc_auc = evaluate.load("roc_auc")
result = []

# for simplicity, just run 10 batches
for i, batch in zip(
    range(10), tqdm(dataloader, desc=f"Evaluating {model.model_name} ...")
):
    text, images = prepare_hateful_memes_input(batch)
    predictions = model.generate(text, images, **hparams["generate_args"])
    for pred, context in zip(predictions, batch):
        last_item = context[-1]
        answer = last_item["label"]
        prediction = postprocess_generation(pred)
        total_roc_auc.add(prediction_scores=prediction, references=answer)
        result.append(
            {
                "id": last_item["id"],
                "answer": last_item["label"],
                "raw_output": pred,
                "prediction": prediction,
            }
        )

eval_result = total_roc_auc.compute()
eval_result
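
To see what the metric measures, ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The function `roc_auc_sketch` is hypothetical, quadratic in the number of examples, and only meant for intuition; it is not the implementation behind `evaluate.load("roc_auc")`.

```python
def roc_auc_sketch(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        # The metric is undefined when only one class is present.
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0]
auc_from_scores = roc_auc_sketch([0.9, 0.8, 0.3, 0.1], labels)  # 0.75
auc_from_hard_labels = roc_auc_sketch([1, 0, 0, 0], labels)  # 0.75
```

Two practical consequences follow from the definition. Hard 0/1 predictions produce many ties, so they carry less ranking information than continuous scores. And when a short run such as the 10 batches above happens to contain only one class among the references, the metric cannot be computed.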

## Step 5. Save Results
With the help of `evaluate.save`, we can save the results and hyperparameters to a JSON file.

In [None]:
# torch.dtype is not JSON-serializable, so store it as a string first
hparams["dtype"] = str(hparams["dtype"])
evaluate.save("./", eval_result=eval_result, hparams=hparams, records=result)
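
If you prefer not to depend on `evaluate` for this step, the same idea can be sketched with the standard library alone. The function `save_results_sketch` and its file-naming scheme are hypothetical and do not reproduce the exact file that `evaluate.save` writes; the values passed in below are placeholders, not results from this tutorial.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def save_results_sketch(directory, **data):
    """Hypothetical stand-in for evaluate.save: dump keyword arguments to a JSON file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Assumption: a timestamped file name keeps runs from overwriting each other.
    path = directory / f"result-{datetime.now():%Y_%m_%d-%H_%M_%S}.json"
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved = save_results_sketch(
        tmp,
        eval_result={"roc_auc": 0.75},  # placeholder value
        hparams={"batch_size": 1, "num_shots": 2, "dtype": "torch.bfloat16"},
        records=[{"id": 0, "answer": 0, "raw_output": "no", "prediction": 0}],
    )
    loaded = json.loads(saved.read_text(encoding="utf-8"))
```

Every value must be JSON-serializable, which is why `hparams["dtype"]` is converted to a string before saving.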