# Evaluating a model

In this example notebook, we'll evaluate a question-answering model on the PoQuAD validation set using two of the most popular metrics for this task: exact match and F1.

## Downloading the PoQuAD dataset

In [38]:
from datasets import load_dataset

poquad = load_dataset("clarin-pl/poquad")

In [109]:
poquad_validation = poquad["validation"]

## Load model

In [43]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_path = './output/roberta-base-squad2-pl/checkpoint-8500'

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForQuestionAnswering.from_pretrained(model_path)

## Evaluate

We use the `squad` metric. There is also a `squad_v2` metric, which can be used to evaluate on a dataset where a question might not have an answer. The `squad` metric calculates the `exact_match` and `f1` scores.
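If you switch to `squad_v2`, each prediction also needs a `no_answer_probability` field in addition to `id` and `prediction_text`. The sketch below shows one way to reshape `squad`-style predictions. The helper name `to_squad_v2_predictions` and the default probability of `0.0` are assumptions made for illustration and are not part of this notebook or of the `evaluate` library.

```python
# Hypothetical helper (not part of this notebook or of `evaluate`):
# reshape `squad`-style predictions into the fields `squad_v2` expects.
def to_squad_v2_predictions(predictions, no_answer_probability=0.0):
    return [
        {
            "id": p["id"],
            "prediction_text": p["prediction_text"],
            # Assumption: a constant placeholder. A real setup would use
            # the model's own estimate that the question is unanswerable.
            "no_answer_probability": no_answer_probability,
        }
        for p in predictions
    ]


# Made-up example prediction, only to show the resulting structure.
example = [{"id": "q1", "prediction_text": "Warszawa"}]
v2_predictions = to_squad_v2_predictions(example)
print(v2_predictions)
```

With `squad_v2`, the references keep the same `id` and `answers` fields, and an unanswerable question has an empty list of answer texts.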

In [87]:
import evaluate

metric = evaluate.load("squad")

In [92]:
from transformers import pipeline

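def get_predictions(eval_dataset):
    # Build a question-answering pipeline from the model and tokenizer loaded above.
    qa_pipeline = pipeline('question-answering', model=model, tokenizer=tokenizer)

    predictions = []
    # Use `example_id` rather than `id` to avoid shadowing the Python built-in.
    for question, context, example_id in zip(eval_dataset["question"], eval_dataset["context"], eval_dataset["id"]):
        answer = qa_pipeline(question=question, context=context)
        # The `squad` metric expects an `id` and a `prediction_text` for every example.
        predictions.append({
            'id': example_id,
            'prediction_text': answer['answer']
        })

    return predictions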

In [110]:
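# Take the first 1,000 validation examples.
part_dataset = poquad_validation.select(range(1000))

predictions = get_predictions(part_dataset)

In [111]:
# References for the metric: the gold answers, keyed by the same `id` as the predictions.
theoretical_answers = [
    {"id": ex["id"], "answers": ex["answers"]} for ex in part_dataset
]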

In [112]:
metric.compute(predictions=predictions, references=theoretical_answers)

{'exact_match': 34.6, 'f1': 51.20063731342978}
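To make these two numbers easier to interpret, here is a minimal sketch of how exact match and token-level F1 can be computed for a single prediction. It is a simplified illustration, not the metric's actual code: it assumes normalization is just lowercasing and whitespace splitting, whereas the official SQuAD scoring also strips punctuation and English articles. The function names are made up for this example.

```python
from collections import Counter


def simple_normalize(text):
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()


def exact_match(prediction, reference):
    # 1.0 if the normalized strings are identical, otherwise 0.0.
    return float(simple_normalize(prediction) == simple_normalize(reference))


def token_f1(prediction, reference):
    pred_tokens = simple_normalize(prediction)
    ref_tokens = simple_normalize(reference)
    # Multiset intersection counts the tokens shared by both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the prediction contains one extra token.
em = exact_match("w Warszawie", "Warszawie")
f1 = token_f1("w Warszawie", "Warszawie")
print(em, round(f1, 3))  # → 0.0 0.667
```

The `squad` metric takes the best score over all reference answers for each question, averages over the examples, and reports the result as a percentage. A partially correct span therefore contributes nothing to `exact_match` but still contributes to `f1`, which is why `f1` is the higher of the two numbers above.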