# Evaluate Model on Your Dataset

In this tutorial, we will show step by step how to use `ICLTestbed` for model inference and evaluation on your own dataset.

Let's take Mistral-7B-v0.3 as an example. Mistral 7B is a 7-billion-parameter language model designed for strong performance and efficiency. More details about Mistral can be found here:

[paper](https://arxiv.org/abs/2310.06825) [model card](https://huggingface.co/mistralai/Mistral-7B-v0.3) [official-code](https://github.com/huggingface/transformers/tree/main/src/transformers/models/mistral)

## Step 1. Data Loading
Load the dataset with the `datasets` library. We first create a randomly generated Boolean expression dataset, following [Deeper Insights Without Updates: The Power of In-Context Learning Over Fine-Tuning](https://arxiv.org/abs/2410.04691v1).

In [1]:
import random
import datasets


# Copied from https://github.com/MikaStars39/ICLvsFinetune/blob/main/src/generate_data.py
def generate_boolean_expression(num_terms=3):
    operators = ["and", "or"]
    values = ["True", "False"]
    expression = []

    # Start with a random boolean value
    expression.append(random.choice(values))

    # Add operators and boolean values
    for _ in range(num_terms - 1):
        operator = random.choice(operators)
        value = random.choice(values)
        expression.append(operator)
        expression.append(value)

    # Join all parts to form the final expression
    expression_str = " ".join(expression)
    return expression_str, eval(expression_str)


def generate_bool_expression(
    num_groups: int = 3,
    num_terms: int = 4,
    and_false: bool = False,
    or_true: bool = False,
    randoms: bool = False,
    need_false: bool = False,
):
    if and_false == False and or_true == False and randoms == False:
        choice = random.choice(["False", "True"])
        if choice == "False":
            and_false = True
        else:
            or_true = True

    expression = []

    for _ in range(num_groups):
        # Determine the number of terms in this group. A separate name is used
        # so the `num_terms` upper bound is not overwritten between groups.
        group_terms = random.randint(2, num_terms)
        sub_expr, _ = generate_boolean_expression(group_terms)

        # Add parentheses around the sub-expression
        if len(expression) > 0:
            operator = random.choice(["and", "or"])
            expression.append(operator)
        expression.append(f"({sub_expr})")

    # Join all parts to form the final expression
    expression_str = " ".join(expression)

    if and_false:
        expression_str = "(" + expression_str + ")" + " and False"
    elif or_true:
        expression_str = expression_str + " or True"

    if need_false:
        choice = random.choice(["False", "True"])
        if choice == "False":
            expression_str = "(" + expression_str + ")" + " or False"
        else:
            expression_str = "(" + expression_str + ")" + " and True"

    return expression_str, eval(expression_str)


def generate_dataset(
    example_number: int,
):
    all_data = []
    for _ in range(example_number):
        question, answer = generate_bool_expression(randoms=True)
        all_data.append({"question": question, "answer": answer})

    return all_data

dataset = datasets.Dataset.from_list(generate_dataset(200))
dataset[range(5)]

{'question': ['(False and True or False) and (True or False and False) or (False or False or True)',
  '(True and False and False) or (False and False or False) or (False or True)',
  '(True and True) or (True and True) and (True or True)',
  '(False or False) and (True or False) or (False or False)',
  '(False and True or True) or (False and True and True) and (True or False)'],
 'answer': [True, True, True, False, True]}
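
The generator above labels each expression with Python's built-in `eval`. As a minimal sketch of an alternative that accepts only `True`, `False`, `and`, `or`, and parentheses, the hypothetical helper `safe_eval_bool` below (not part of ICLTestbed or the original repository) walks the parsed syntax tree instead:

```python
import ast


def safe_eval_bool(expression: str) -> bool:
    """Evaluate an expression made only of True/False, and/or, and parentheses."""

    def visit(node: ast.AST) -> bool:
        if isinstance(node, ast.Expression):
            return visit(node.body)
        # Only boolean literals are allowed as leaves
        if isinstance(node, ast.Constant) and isinstance(node.value, bool):
            return node.value
        # `and` / `or` chains are parsed as BoolOp nodes
        if isinstance(node, ast.BoolOp):
            values = [visit(value) for value in node.values]
            if isinstance(node.op, ast.And):
                return all(values)
            return any(values)
        raise ValueError(f"Unsupported syntax: {type(node).__name__}")

    return visit(ast.parse(expression, mode="eval"))


print(safe_eval_bool("(False or False) and (True or False) or (False or False)"))
```

Anything outside this small grammar, such as a function call, raises `ValueError` instead of being executed.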

In [2]:
import torch
import sys

sys.path.insert(0, "..")
from testbed.data import prepare_dataloader

mistral_7b_path = "/data/share/Mistral-7B"

hparams = {
    "batch_size": 2,
    "num_shots": 2,
    "dtype": torch.float16,
    "generate_args": {"num_beams": 3, "max_new_tokens": 5},
}

dataloader = prepare_dataloader(
    dataset,
    batch_size=hparams["batch_size"],
    num_shots=hparams["num_shots"],
    shuffle=True,
)
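
To get a feel for what `batch_size` and `num_shots` mean here, the sketch below groups items into contexts of `num_shots + 1` examples (the shots plus one query) and then into batches. The function `group_into_batches` is a hypothetical illustration, not the implementation of `prepare_dataloader`, and dropping leftover items is an assumption. It is, however, consistent with the 33 batches reported by the progress bar later in this tutorial for 200 examples.

```python
from typing import Any, List


def group_into_batches(
    items: List[Any], num_shots: int, batch_size: int
) -> List[List[List[Any]]]:
    """Group items into contexts of (num_shots + 1), then contexts into batches."""
    context_size = num_shots + 1
    # Assumption: items that do not fill a whole context are dropped
    contexts = [
        items[start : start + context_size]
        for start in range(0, len(items) - context_size + 1, context_size)
    ]
    return [
        contexts[start : start + batch_size]
        for start in range(0, len(contexts), batch_size)
    ]


batches = group_into_batches(list(range(200)), num_shots=2, batch_size=2)
print(len(batches), len(batches[0]), len(batches[0][0]))
```

Each batch is a list of contexts, and the last item of each context plays the role of the query whose answer the model has to predict.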

## Step 2. Model Building
A model in ICLTestbed can be roughly regarded as a simple combination of a processor and a specific model. You can access the underlying processor or model through `model.processor` or `model.model`.

In [3]:
from testbed.models import Mistral
import torch

device = torch.device("cuda")
model = Mistral(mistral_7b_path, torch_dtype=hparams["dtype"]).to(device)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
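
The "processor plus model" idea can be pictured with the toy classes below. All three class names are hypothetical stand-ins written for this sketch; the real wrapper in ICLTestbed does considerably more (tokenization, device placement, prompt templating).

```python
from typing import Any, Dict, List


class ToyProcessor:
    """Stand-in for a tokenizer/processor: passes prompts through unchanged."""

    def __call__(self, prompts: List[str]) -> Dict[str, List[str]]:
        return {"prompts": prompts}

    def batch_decode(self, outputs: List[str]) -> List[str]:
        return list(outputs)


class ToyBackbone:
    """Stand-in for the underlying model: answers "True" to every prompt."""

    def generate(self, prompts: List[str], **generate_args: Any) -> List[str]:
        return ["True" for _ in prompts]


class ToyWrapper:
    """Combines a processor and a model behind a single `generate` call."""

    def __init__(self, processor: ToyProcessor, model: ToyBackbone) -> None:
        self.processor = processor
        self.model = model

    def generate(self, prompts: List[str], **generate_args: Any) -> List[str]:
        inputs = self.processor(prompts)
        outputs = self.model.generate(**inputs, **generate_args)
        return self.processor.batch_decode(outputs)


wrapper = ToyWrapper(ToyProcessor(), ToyBackbone())
print(wrapper.generate(["Question: True or False\nAnswer: "], max_new_tokens=5))
```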

## Step 3. Inference
If you need to use your own prompt template, set it here. Suppose we want to use the following template:
```
Question: <question> Answer: <answer>
```
The prompt template in ICLTestbed is an alias for the chat template from Hugging Face (if you are not familiar with it, see [Chat Templating](https://huggingface.co/docs/transformers/main/chat_templating)). The model input is usually a `list` of `dict`s, referred to as `messages` in the prompt template. For example, a 1-shot context looks like this:
```python
[
    {
        "role": "question",
        "content": [
            {
                "type": "text",
                "text": "(True and False or True) and (False and False) or (False or False)",
            },
        ],
    },
    {
        "role": "answer",
        "content": [
            {
                "type": "text",
                "text": "False",
            },
        ],
    },
    {
        "role": "question",
        "content": [
            {
                "type": "text",
                "text": "(False or True) and (False and False) and (True and True)",
            },
        ],
    },
    {
        "role": "answer"
    }
]
```

In [4]:
# fmt: off
model.prompt_template =  (
    "{% if messages[0]['role'] == 'instruction' %}"
        "Instruction: {{ messages[0]['content'] }}\n"
        "{% set messages = messages[1:] %}"
    "{% endif %}"
    "{% for message in messages %}"
        "{% if message['role'] != '' %}"
            "{{ message['role'].capitalize() }}: "
        "{%+ endif %}"
        "{% if 'content' in message %}"
            "{% for line in message['content'] %}"
                "{% if line['type'] == 'text' %}"
                    "{{ line['text'] }}"
                "{% endif %}"
                "{% if loop.last %}"
                    "\n\n"
                "{% endif %}"
            "{% endfor %}"
        "{% endif %}"
    "{% endfor %}"
)
# fmt: on
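
To make the template easier to follow, here is a plain-Python sketch of the same logic. The function `render_prompt` is a hypothetical helper written for this explanation and is not used by ICLTestbed. The template source contains two newline characters after each message, while the prompt printed later in this tutorial shows one; the sketch follows the printed prompt, and the difference presumably comes from how the Jinja environment trims whitespace after block tags.

```python
from typing import Any, Dict, List


def render_prompt(messages: List[Dict[str, Any]]) -> str:
    """Plain-Python reading of the Jinja prompt template above."""
    parts = []
    # An optional leading instruction is rendered once, then skipped
    if messages and messages[0]["role"] == "instruction":
        parts.append(f"Instruction: {messages[0]['content']}\n")
        messages = messages[1:]
    for message in messages:
        # Messages with an empty role get no "Role: " prefix
        if message["role"] != "":
            parts.append(f"{message['role'].capitalize()}: ")
        lines = message.get("content", [])
        for index, line in enumerate(lines):
            if line["type"] == "text":
                parts.append(str(line["text"]))
            if index == len(lines) - 1:
                parts.append("\n")
    return "".join(parts)


example = [
    {"role": "question", "content": [{"type": "text", "text": "True or False"}]},
    {"role": "answer", "content": [{"type": "text", "text": True}]},
    {"role": "question", "content": [{"type": "text", "text": "False and True"}]},
    {"role": "answer"},
]
print(render_prompt(example))
```

Because the final `answer` message has no `content`, the prompt ends with `Answer: ` and the model is left to complete it.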

Next, you need to define how each dataset item is converted into the model input (see the example above). You can do this by registering a retriever with `register_dataset_retriever`.

In [5]:
from testbed.data import register_dataset_retriever, prepare_input


def retriever(item, is_last):
    return [
        {
            "role": "question",
            "content": [
                {
                    "type": "text",
                    "text": item["question"],
                },
            ],
        },
        (
            {"role": "answer"}
            if is_last
            else {
                "role": "answer",
                "content": [{"type": "text", "text": item["answer"]}],
            }
        ),
    ]


register_dataset_retriever("boolean", retriever=retriever)
prepare_input(
    "boolean",
    next(iter(dataloader)),
    "Here are some boolean expressions, you need to directly tell the answer. If it is true, print True, else print False.",
)

[[{'role': 'instruction',
   'content': 'Here are some boolean expressions, you need to directly tell the answer. If it is true, print True, else print False.'},
  {'role': 'question',
   'content': [{'type': 'text',
     'text': '(False and True or False) and (True or False and False) or (False or False or True)'}]},
  {'role': 'answer', 'content': [{'type': 'text', 'text': True}]},
  {'role': 'question',
   'content': [{'type': 'text',
     'text': '(True and False and False) or (False and False or False) or (False or True)'}]},
  {'role': 'answer', 'content': [{'type': 'text', 'text': True}]},
  {'role': 'question',
   'content': [{'type': 'text',
     'text': '(True and True) or (True and True) and (True or True)'}]},
  {'role': 'answer'}],
 [{'role': 'instruction',
   'content': 'Here are some boolean expressions, you need to directly tell the answer. If it is true, print True, else print False.'},
  {'role': 'question',
   'content': [{'type': 'text',
     'text': '(False or Fals


This message list is transformed into the actual prompt by `model.apply_prompt_template`, which is one step in `model.process_input`. `apply_prompt_template` is an alias for [`apply_chat_template`](https://huggingface.co/docs/transformers/main/chat_templating).
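
Judging from the output above, the messages for one context consist of an optional instruction followed by the retriever output of every item, with `is_last=True` for the final item only. The sketch below reproduces that structure; `build_messages` and `toy_retriever` are hypothetical names, and this is not the actual code of `prepare_input`.

```python
from typing import Any, Callable, Dict, List, Optional

Message = Dict[str, Any]


def toy_retriever(item: Dict[str, Any], is_last: bool) -> List[Message]:
    """Same shape as the retriever registered above, with shorter code."""
    question = {"role": "question", "content": [{"type": "text", "text": item["question"]}]}
    if is_last:
        return [question, {"role": "answer"}]
    answer = {"role": "answer", "content": [{"type": "text", "text": item["answer"]}]}
    return [question, answer]


def build_messages(
    context: List[Dict[str, Any]],
    retriever: Callable[[Dict[str, Any], bool], List[Message]],
    instruction: Optional[str] = None,
) -> List[Message]:
    """Flatten one context (shots plus query) into a single message list."""
    messages: List[Message] = []
    if instruction is not None:
        messages.append({"role": "instruction", "content": instruction})
    for index, item in enumerate(context):
        is_last = index == len(context) - 1
        messages.extend(retriever(item, is_last))
    return messages


context = [
    {"question": "True or False", "answer": True},
    {"question": "False and True", "answer": False},
]
messages = build_messages(context, toy_retriever, instruction="Answer True or False.")
print([message["role"] for message in messages])
```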

After getting the model output, you need to post-process the generation to clean it and extract the answer. This step is dataset-dependent: different datasets need different post-processing. For our boolean expression dataset, we just convert `True` to `1` and `False` to `0`.

In [9]:
from testbed.data import register_postprocess, postprocess_generation

# Compare strings instead of calling eval() on raw model output
register_postprocess("boolean", lambda pred: int(pred.strip() == "True"))
model.processor.pad_token = model.processor.eos_token
batch = next(iter(dataloader))
single_context = batch[0]
text = prepare_input("boolean", [single_context])
print(model.apply_prompt_template(text))
raw_output = model.generate(text, **hparams["generate_args"])
print(raw_output)
prediction = postprocess_generation("boolean", raw_output, stop_words=["\n", "Question", "Answer"])
print(prediction)
print(single_context[-1]["answer"])  # gt

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


['Question: (False and True or False) and (True or False and False) or (False or False or True)\nAnswer: True\nQuestion: (True and False and False) or (False and False or False) or (False or True)\nAnswer: True\nQuestion: (True and True) or (True and True) and (True or True)\nAnswer: ']
['True\nQuestion: (']
[1]
True
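
The raw output above continues past the answer (`'True\nQuestion: ('`), which is why stop words are passed to `postprocess_generation`. As a rough sketch of that idea, the hypothetical helpers below cut the text at the earliest stop word and then map the remainder to a label. They only illustrate the technique and are not ICLTestbed's implementation.

```python
from typing import List


def cut_at_stop_words(text: str, stop_words: List[str]) -> str:
    """Keep only the text before the earliest occurrence of any stop word."""
    end = len(text)
    for stop_word in stop_words:
        position = text.find(stop_word)
        if position != -1:
            end = min(end, position)
    return text[:end]


def to_label(raw_output: str, stop_words: List[str]) -> int:
    """Map a raw generation to 1 for "True" and 0 for anything else."""
    cleaned = cut_at_stop_words(raw_output, stop_words).strip()
    return int(cleaned == "True")


print(to_label("True\nQuestion: (", ["\n", "Question", "Answer"]))
```

Note that in this sketch any output other than `True` is treated as `0`, so malformed generations count as a `False` prediction.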


## Step 4. Evaluate
For our task, we use accuracy as the metric, which is already implemented in the Hugging Face [`evaluate`](https://huggingface.co/docs/evaluate/index) library.

In [11]:
from tqdm import tqdm
import evaluate

total_accuracy = evaluate.load("accuracy")
result = []

# for simplicity, just run 10 batches
for i, batch in zip(
    range(10), tqdm(dataloader, desc=f"Evaluating {model.model_name} ...")
):
    text = prepare_input("boolean", batch)
    predictions = model.generate(text, **hparams["generate_args"])
    for pred, context in zip(predictions, batch):
        last_item = context[-1]
        answer = last_item["answer"]
        prediction = postprocess_generation("boolean", pred, stop_words=["\n", "Question", "Answer"])
        total_accuracy.add(predictions=prediction, references=answer)
        result.append(
            {
                "question": last_item["question"],
                "answer": last_item["answer"],
                "raw_output": pred,
                "prediction": prediction,
            }
        )

eval_result = total_accuracy.compute()
eval_result

Evaluating mistral-7b ...:   0%|          | 0/33 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Evaluating mistral-7b ...:   3%|▎         | 1/33 [00:00<00:13,  2.42it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Evaluating mistral-7b ...:   6%|▌         | 2/33 [00:00<00:11,  2.61it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Evaluating mistral-7b ...:   9%|▉         | 3/33 [00:01<00:11,  2.70it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Evaluating mistral-7b ...:  12%|█▏        | 4/33 [00:01<00:10,  2.78it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Evaluating mistral-7b ...:  15%|█▌        | 5/33 [00:01<00:09,  2.85it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Evaluating mistral-7b ...:  18%|█▊        | 6/33 [00:02<00:09,  2.84it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


{'accuracy': 0.7}
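
The `accuracy` metric loaded above is the fraction of predictions that equal their references. A minimal plain-Python sketch of that computation follows; the function name and the sample values are illustrative only and are not outputs of the run above.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Fraction of positions where prediction and reference agree."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    correct = sum(int(p) == int(r) for p, r in zip(predictions, references))
    return correct / len(predictions)


# Illustrative values: references may be booleans, as in our dataset
sample_predictions = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
sample_references = [True, True, True, False, True, False, True, True, True, True]
print(accuracy(sample_predictions, sample_references))
```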

## Step 5. Save Results
With the help of `evaluate.save`, we can save the results and hyperparameters to a JSON file.

In [None]:
hparams["dtype"] = str(hparams["dtype"])
evaluate.save("./", eval_result=eval_result, hparams=hparams, records=result)
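
The dtype is converted to a string first because Python's `json` module cannot serialize a `torch.dtype` object. The sketch below shows the same idea without depending on PyTorch: `FakeDtype` is a stand-in class and `to_json_safe` is a hypothetical helper that only handles top-level values.

```python
import json
from typing import Any, Dict


class FakeDtype:
    """Stand-in for torch.float16, which json cannot serialize either."""

    def __str__(self) -> str:
        return "torch.float16"


def to_json_safe(params: Dict[str, Any]) -> Dict[str, Any]:
    """Stringify top-level values that are not plain JSON types."""
    json_types = (str, int, float, bool, type(None), list, dict)
    return {
        key: value if isinstance(value, json_types) else str(value)
        for key, value in params.items()
    }


toy_hparams = {"batch_size": 2, "num_shots": 2, "dtype": FakeDtype()}
print(json.dumps(to_json_safe(toy_hparams)))
```

Calling `json.dumps(toy_hparams)` directly raises `TypeError`, which is the failure the string conversion avoids.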