# Evaluation of Question-Answering Models

Evaluating question-answering (QA) models is a critical step in developing AI systems that understand and process human language. QA systems are used in applications ranging from customer service automation to interactive educational tools, and their reliability depends on understanding questions correctly and returning accurate, relevant answers. Evaluation typically relies on metrics that measure how closely a model's answers match the expected answers, which helps verify that the model performs well across diverse scenarios and datasets. In this notebook, we explore two key metrics for evaluating QA models: Exact Match and F1 Score.

### Exact Match (EM)
The Exact Match metric is the strictest form of evaluation for QA models. It measures whether the predicted answer matches the ground truth answer exactly, on a case-insensitive basis. This metric is binary, providing a score of 1 for a perfect match and 0 otherwise.

### F1 Score
The F1 Score is used to evaluate the model's accuracy on a token level, balancing precision and recall. It considers both the tokens that are present in the predicted answer and the tokens that should have been included, offering a more nuanced assessment than the Exact Match metric.
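
As a rough illustration of the token-level computation, the sketch below scores one prediction against one reference. The helper name `token_prf` is hypothetical, and whitespace tokenization is an assumption made for simplicity; a given benchmark may tokenize answers differently.

```python
from collections import Counter

def token_prf(prediction, truth):
    """Return (precision, recall, f1) over lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    # Multiset intersection: each token counts at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)   # share of predicted tokens that are correct
    recall = overlap / len(truth_tokens)     # share of expected tokens that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = token_prf("pacific ocean", "The Pacific Ocean")
print(p, r, f1)
```

Here both predicted tokens are correct (precision 1.0), but one of the three expected tokens is missing (recall 2/3), so F1 comes out at 0.8 while Exact Match would be 0.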

<img src="./imgs/metrics_example.webp" alt="drawing" width="650"/>

In addition to the Exact Match (EM) and F1 score metrics, some other commonly used metrics for NLP tasks are:

#### BLEU Score
The BLEU (Bilingual Evaluation Understudy) score measures n-gram overlap between the predicted and reference answers, where an n-gram is a contiguous sequence of n items, such as words. It is precision-oriented: it asks what fraction of the predicted n-grams also appear in the reference. A higher BLEU score indicates greater n-gram overlap.
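
The core ingredient of BLEU, clipped n-gram precision, can be sketched in a few lines. This is a simplified illustration only: full BLEU combines several n-gram orders and applies a brevity penalty, both omitted here. The function names are hypothetical, and a single reference with whitespace tokenization is assumed.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_ngram_precision(prediction, reference, n):
    """Fraction of predicted n-grams that also occur in the reference,
    counting each n-gram at most as often as the reference contains it."""
    pred_counts = Counter(ngrams(prediction.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total

pred = "a set of annual awards"
ref = "a set of annual international awards"
print(clipped_ngram_precision(pred, ref, 1))  # all 5 predicted words occur in the reference
print(clipped_ngram_precision(pred, ref, 2))  # 3 of 4 predicted bigrams occur in the reference
```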

#### ROUGE Score
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score was designed for evaluating summaries. It compares n-grams, word sequences, and word pairs between the predicted and reference answers, with an emphasis on recall. A higher ROUGE score suggests that the predicted answers are more similar to the reference answers.
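
Two of the ideas behind ROUGE can be sketched as follows: n-gram recall (the basis of ROUGE-N) and the longest common subsequence of words (the basis of ROUGE-L). This is a minimal sketch with hypothetical function names, assuming one reference and whitespace tokenization; it is not a substitute for an established ROUGE implementation.

```python
from collections import Counter

def rouge_n_recall(prediction, reference, n=1):
    """Share of reference n-grams that also appear in the prediction."""
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    pred_counts, ref_counts = ngram_counts(prediction), ngram_counts(reference)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    return sum((pred_counts & ref_counts).values()) / total

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

pred = "a set of annual awards"
ref = "a set of annual international awards"
print(rouge_n_recall(pred, ref, n=1))                    # 5 of 6 reference words recovered
print(lcs_length(pred.split(), ref.split()) / len(ref.split()))  # LCS-based recall
```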

# Evaluation in Python

In [23]:
import torch
from transformers import BertTokenizer, BertForQuestionAnswering
import pandas as pd

### Data Preparation

Before evaluating the models, we need to prepare a sample dataset. This involves loading the data, preprocessing it, and preparing it for evaluation.

In [24]:
# Example data
data = {
    "context": [
        "Paris is the capital and most populous city of France.",
        "The Pacific Ocean is the largest and deepest of Earth's oceanic divisions.",
        "Python is a high-level, interpreted, and general-purpose programming language.",
        "Tesla, Inc. is an American electric vehicle and clean energy company based in Palo Alto, California.",
        "The Nobel Prize is a set of annual international awards bestowed in several categories by Swedish and Norwegian institutions."
    ],
    "question": [
        "What is the capital of France?",
        "Which ocean is the largest?",
        "What type of language is Python?",
        "Where is Tesla based?",
        "What is the Nobel Prize?"
    ],
    "answer": [
        "Paris",
        "The Pacific Ocean",
        "a high-level, interpreted, and general-purpose programming language",
        "Palo Alto, California",
        "a set of annual international awards"
    ]
}

df = pd.DataFrame(data)

### Evaluation
Here we define the two metric functions, Exact Match (EM) and F1 score, along with an `evaluate` helper that runs the model on each example and decodes the predicted answer span.

In [25]:
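from collections import Counter

def exact_match(prediction, truth):
    # 1 if the answers are identical after trimming whitespace and lowercasing, else 0
    return int(prediction.strip().lower() == truth.strip().lower())

def compute_f1(prediction, truth):
    # Token-level F1 over lowercased, whitespace-split tokens
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    # Multiset intersection, so a repeated token counts as often as it appears in both answers
    num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_common == 0:
        return 0.0
    prec = num_common / len(pred_tokens)
    rec = num_common / len(truth_tokens)
    return 2 * (prec * rec) / (prec + rec)

def evaluate(model, dataset):
    # Runs the model on each pre-encoded row; relies on the global `tokenizer` that matches `model`
    model.eval()
    results = []
    with torch.no_grad():
        for _, row in dataset.iterrows():
            inputs = row['encoded']
            output = model(**inputs)
            # Most likely start and end positions, chosen independently; the end index is made exclusive
            answer_start = torch.argmax(output.start_logits)
            answer_end = torch.argmax(output.end_logits) + 1
            # Decode the selected token span back to text (empty if the end precedes the start)
            span_ids = inputs['input_ids'].squeeze()[answer_start:answer_end]
            answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(span_ids))
            results.append((row['question'], row['answer'], answer))
    return results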
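
The `exact_match` function above only lowercases and trims whitespace, so answers that differ in punctuation spacing or articles score 0. Decoded BERT output often has spaces around punctuation (for example `palo alto , california`). The sketch below shows a looser comparison modeled on the normalization commonly applied in SQuAD-style evaluation. The names `normalize_answer` and `normalized_exact_match` are hypothetical and are not used elsewhere in this notebook.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def normalized_exact_match(prediction, truth):
    """1 if the answers are identical after normalization, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(truth))

print(normalized_exact_match("palo alto , california", "Palo Alto, California"))
print(normalized_exact_match("pacific ocean", "The Pacific Ocean"))
```

Both comparisons count as matches under this looser rule. It does not resolve every tokenization difference: `high - level` normalizes to `high level`, while `high-level` normalizes to `highlevel`.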


### Performance Comparison
Next, we compare two models on these metrics: the base BERT model and a fine-tuned BERT model.

First, load and evaluate the pre-trained base BERT model.

In [26]:
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Preprocessing function to tokenize the data
def preprocess(question, context):
    return tokenizer(question, context, return_tensors='pt', truncation=True, padding='max_length', max_length=512)

df['encoded'] = df.apply(lambda row: preprocess(row['question'], row['context']), axis=1)
evaluation_results = evaluate(model, df)
evaluation_df = pd.DataFrame(evaluation_results, columns=['Question', 'True Answer', 'Predicted Answer'])
evaluation_df['Exact Match'] = [exact_match(pred, true) for pred, true in zip(evaluation_df['Predicted Answer'], evaluation_df['True Answer'])]
evaluation_df['F1 Score'] = [compute_f1(pred, true) for pred, true in zip(evaluation_df['Predicted Answer'], evaluation_df['True Answer'])]
evaluation_df

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_out

| | Question | True Answer | Predicted Answer | Exact Match | F1 Score |
|---|---|---|---|---|---|
| 0 | What is the capital of France? | Paris | | 0 | 0.0 |
| 1 | Which ocean is the largest? | The Pacific Ocean | the pacific ocean is the largest and deepest of | 0 | 0.5 |
| 2 | What type of language is Python? | a high-level, interpreted, and general-purpose... | | 0 | 0.0 |
| 3 | Where is Tesla based? | Palo Alto, California | | 0 | 0.0 |
| 4 | What is the Nobel Prize? | a set of annual international awards | | 0 | 0.0 |
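
A likely explanation for the empty predictions is the way `evaluate` picks the span: the start and end positions are chosen independently, so an untrained head can place the end before the start, and the resulting slice is empty. The toy sketch below reproduces that behavior with plain Python lists instead of tensors. The function `decode_span` and the logit values are made up for illustration.

```python
def decode_span(tokens, start_logits, end_logits):
    """Pick the highest-scoring start and end independently, then slice."""
    start = max(range(len(start_logits)), key=start_logits.__getitem__)
    end = max(range(len(end_logits)), key=end_logits.__getitem__) + 1  # exclusive end
    # If end <= start the slice is empty, so the decoded answer is "".
    return " ".join(tokens[start:end])

tokens = ["paris", "is", "the", "capital", "of", "france"]

# Start and end both point at token 0: a one-token answer.
print(repr(decode_span(tokens, [5, 0, 0, 0, 0, 0], [4, 1, 0, 0, 0, 0])))

# Start at token 3 but end at token 1: the slice is empty.
print(repr(decode_span(tokens, [0, 0, 0, 5, 0, 0], [0, 3, 0, 0, 0, 0])))
```

A stricter decoder would search only over pairs where the end is at or after the start, which is a design choice this notebook's `evaluate` does not make.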


Next, load and evaluate the fine-tuned BERT model, using the same data and evaluation functions.

In [27]:
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

df = pd.DataFrame(data)

# Preprocessing function to tokenize the data
def preprocess(question, context):
    return tokenizer.encode_plus(question, context, return_tensors='pt', max_length=512, truncation=True, padding='max_length', add_special_tokens=True)

df['encoded'] = df.apply(lambda row: preprocess(row['question'], row['context']), axis=1)

# Evaluate the model on the provided data
evaluation_results = evaluate(model, df)
evaluation_df = pd.DataFrame(evaluation_results, columns=['Question', 'True Answer', 'Predicted Answer'])
evaluation_df['Exact Match'] = [exact_match(pred, true) for pred, true in zip(evaluation_df['Predicted Answer'], evaluation_df['True Answer'])]
evaluation_df['F1 Score'] = [compute_f1(pred, true) for pred, true in zip(evaluation_df['Predicted Answer'], evaluation_df['True Answer'])]
evaluation_df


| | Question | True Answer | Predicted Answer | Exact Match | F1 Score |
|---|---|---|---|---|---|
| 0 | What is the capital of France? | Paris | paris | 1 | 1.0 |
| 1 | Which ocean is the largest? | The Pacific Ocean | pacific ocean | 0 | 0.8 |
| 2 | What type of language is Python? | a high-level, interpreted, and general-purpose... | high - level , interpreted , and general - pur... | 0 | 0.222222 |
| 3 | Where is Tesla based? | Palo Alto, California | palo alto , california | 0 | 0.571429 |
| 4 | What is the Nobel Prize? | a set of annual international awards | a set of annual international awards bestowed ... | 0 | 0.571429 |


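The fine-tuned BERT model outperforms the base BERT model on both metrics: it gets 1 of 5 exact matches versus 0 of 5, and its F1 score is higher on every example. This is expected. The base checkpoint's question-answering head is newly initialized, as the loading warning notes, so its span predictions are essentially untrained, whereas fine-tuning adapts the model to the question-answering task. Some fine-tuned predictions miss Exact Match for superficial reasons, such as a dropped article (`pacific ocean`) or spaces around punctuation (`palo alto , california`), which is where F1 gives a more informative picture.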
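
To summarize the comparison in one number per metric, the per-example scores can be averaged. The sketch below copies the scores from the two result tables above rather than re-running the models; the variable names are illustrative.

```python
from statistics import mean

# Per-example scores copied from the two result tables.
base = {"em": [0, 0, 0, 0, 0], "f1": [0.0, 0.5, 0.0, 0.0, 0.0]}
fine_tuned = {"em": [1, 0, 0, 0, 0], "f1": [1.0, 0.8, 0.222222, 0.571429, 0.571429]}

for name, scores in [("base", base), ("fine-tuned", fine_tuned)]:
    print(f"{name}: EM={mean(scores['em']):.2f}, F1={mean(scores['f1']):.3f}")
```

On these five examples, the average EM rises from 0.00 to 0.20 and the average F1 from 0.100 to 0.633.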