# Evaluation of Question-Answering Models

Evaluating question-answering (QA) models is a critical step in developing AI systems that understand and process human language. QA systems are used in applications ranging from customer service automation to interactive educational tools, and their reliability depends on understanding questions correctly and returning accurate, relevant answers. Evaluating these models with robust metrics helps confirm that they perform well across diverse scenarios and datasets. The metrics measure how closely the model's answers match the expected answers. In this notebook, we explore key metrics used in evaluating QA models, such as Exact Match and F1 Score.

### Exact Match (EM)
The Exact Match metric is the strictest form of evaluation for QA models. It measures whether the predicted answer is identical to the ground-truth answer; in this notebook the comparison ignores case and leading or trailing whitespace. The metric is binary: 1 for a match and 0 otherwise.
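
As a minimal sketch of this definition, the binary per-example score can be averaged to give a dataset-level EM. The helper name `exact_match_score` and the two sample pairs are illustrative assumptions, not part of the evaluation pipeline used later.

```python
def exact_match_score(prediction: str, truth: str) -> int:
    """Return 1 if the strings match after trimming and lowercasing, else 0."""
    return int(prediction.strip().lower() == truth.strip().lower())

# Illustrative (prediction, reference) pairs.
pairs = [
    ("paris", "Paris"),                      # differs only in case -> match
    ("pacific ocean", "The Pacific Ocean"),  # missing "the" -> no match
]

scores = [exact_match_score(pred, truth) for pred, truth in pairs]
dataset_em = sum(scores) / len(scores)  # fraction of exact matches

print(scores)      # [1, 0]
print(dataset_em)  # 0.5
```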

### F1 Score
The F1 Score is used to evaluate the model's accuracy on a token level, balancing precision and recall. It considers both the tokens that are present in the predicted answer and the tokens that should have been included, offering a more nuanced assessment than the Exact Match metric.
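
Written out, with the predicted and reference answers treated as bags of whitespace-separated tokens (a common convention, assumed here), precision $P$, recall $R$, and $F_1$ are:

```latex
\[
P = \frac{|\text{pred} \cap \text{truth}|}{|\text{pred}|}, \qquad
R = \frac{|\text{pred} \cap \text{truth}|}{|\text{truth}|}, \qquad
F_1 = \frac{2PR}{P + R}
\]
```

For example, the prediction "pacific ocean" against the reference "the pacific ocean" shares two tokens, so $P = 2/2 = 1$, $R = 2/3$, and $F_1 = 2 \cdot 1 \cdot \tfrac{2}{3} / (1 + \tfrac{2}{3}) = 0.8$.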

Next, we will explore the evaluation of question-answering models using PyTorch and the Hugging Face `transformers` library. We'll use a small example dataset for demonstration purposes.

We begin by importing the libraries used in the rest of the notebook:

In [36]:
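import torch
from transformers import BertTokenizer, BertForQuestionAnswering
import pandas as pd
import numpy as np
from collections import Counter

# Metric libraries used by the BLEU, ROUGE, and METEOR helpers below
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from rouge import Rouge  # provided by the `rouge` package on PyPI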

#### Data Preparation

Before evaluating the models, we need to prepare the dataset. This involves loading the data, preprocessing it, and tokenizing each question-context pair for evaluation.

In [27]:
# Example data
data = {
    "context": [
        "Paris is the capital and most populous city of France.",
        "The Pacific Ocean is the largest and deepest of Earth's oceanic divisions.",
        "Python is a high-level, interpreted, and general-purpose programming language.",
        "Tesla, Inc. is an American electric vehicle and clean energy company based in Palo Alto, California.",
        "The Nobel Prize is a set of annual international awards bestowed in several categories by Swedish and Norwegian institutions."
    ],
    "question": [
        "What is the capital of France?",
        "Which ocean is the largest?",
        "What type of language is Python?",
        "Where is Tesla based?",
        "What is the Nobel Prize?"
    ],
    "answer": [
        "Paris",
        "The Pacific Ocean",
        "a high-level, interpreted, and general-purpose programming language",
        "Palo Alto, California",
        "a set of annual international awards"
    ]
}

df = pd.DataFrame(data)

# Tokenizing the data
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def preprocess(question, context):
    return tokenizer(question, context, return_tensors='pt', truncation=True, padding='max_length', max_length=512)

df['encoded'] = df.apply(lambda row: preprocess(row['question'], row['context']), axis=1)


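Load a pre-trained BERT model. Note that `bert-base-uncased` is not fine-tuned for question answering, so the span-prediction head is newly initialized, as the warning below reports.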

In [28]:
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_out

#### Evaluation
We define the evaluation functions here: Exact Match (EM) and F1 Score, along with BLEU, ROUGE, and METEOR helpers for comparison.

In [35]:
def exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

def compute_f1(prediction, truth):
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    # Multiset intersection: a repeated token counts only as often as it
    # appears in both answers
    common_tokens = Counter(pred_tokens) & Counter(truth_tokens)
    num_common = sum(common_tokens.values())
    if num_common == 0:
        return 0.0
    prec = num_common / len(pred_tokens)
    rec = num_common / len(truth_tokens)
    return 2 * (prec * rec) / (prec + rec)

# Function to calculate BLEU score
def calculate_bleu(reference, candidate):
    reference = [reference.lower().split()]
    candidate = candidate.lower().split()
    score = sentence_bleu(reference, candidate)
    return score

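# Function to calculate ROUGE scores
# Note: the inputs are not lowercased here, so the comparison is case-sensitive
def calculate_rouge(reference, candidate):
    rouge = Rouge()
    # get_scores takes the hypothesis first, then the reference
    scores = rouge.get_scores(candidate, reference)[0]
    return scores

# Function to calculate METEOR score
# Recent NLTK releases expect pre-tokenized input, so both strings are split
def calculate_meteor(reference, candidate):
    score = meteor_score([reference.split()], candidate.split())
    return score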

def evaluate(model, dataset):
    model.eval()
    results = []
    with torch.no_grad():
        for _, row in dataset.iterrows():
            inputs = row['encoded']
            output = model(**inputs)
            # Most likely start and end token positions of the answer span
            answer_start = torch.argmax(output.start_logits)
            answer_end = torch.argmax(output.end_logits) + 1  # slice end is exclusive
            # Map the selected token ids back to a text string
            input_ids = inputs['input_ids'].squeeze()
            answer_tokens = tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
            answer = tokenizer.convert_tokens_to_string(answer_tokens)
            results.append((row['question'], row['answer'], answer))
    return results

evaluation_results = evaluate(model, df)
evaluation_df = pd.DataFrame(evaluation_results, columns=['Question', 'True Answer', 'Predicted Answer'])
evaluation_df['Exact Match'] = [exact_match(pred, true) for pred, true in zip(evaluation_df['Predicted Answer'], evaluation_df['True Answer'])]
evaluation_df['F1 Score'] = [compute_f1(pred, true) for pred, true in zip(evaluation_df['Predicted Answer'], evaluation_df['True Answer'])]
evaluation_df['BLEU Score'] = [calculate_bleu(true, pred) for true, pred in zip(evaluation_df['True Answer'], evaluation_df['Predicted Answer'])]
evaluation_df['ROUGE Score'] = [calculate_rouge(true, pred) for true, pred in zip(evaluation_df['True Answer'], evaluation_df['Predicted Answer'])]
# evaluation_df['METEOR Score'] = [calculate_meteor(true, pred) for true, pred in zip(evaluation_df['True Answer'], evaluation_df['Predicted Answer'])]

evaluation_df


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


| | Question | True Answer | Predicted Answer | Exact Match | F1 Score | BLEU Score | ROUGE Score |
|---|---|---|---|---|---|---|---|
| 0 | What is the capital of France? | Paris | paris | 1 | 1.0 | 1.821832e-231 | {'rouge-1': {'r': 0.0, 'p': 0.0, 'f': 0.0}, 'r... |
| 1 | Which ocean is the largest? | The Pacific Ocean | pacific ocean | 0 | 0.8 | 9.047425e-155 | {'rouge-1': {'r': 0.0, 'p': 0.0, 'f': 0.0}, 'r... |
| 2 | What type of language is Python? | a high-level, interpreted, and general-purpose... | high - level , interpreted , and general - pur... | 0 | 0.222222 | 1.189646e-231 | {'rouge-1': {'r': 0.2857142857142857, 'p': 0.2... |
| 3 | Where is Tesla based? | Palo Alto, California | palo alto , california | 0 | 0.571429 | 1.531972e-231 | {'rouge-1': {'r': 0.0, 'p': 0.0, 'f': 0.0}, 'r... |
| 4 | What is the Nobel Prize? | a set of annual international awards | a set of annual international awards bestowed ... | 0 | 0.571429 | 0.3237723 | {'rouge-1': {'r': 1.0, 'p': 0.4, 'f': 0.571428... |
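
Several predictions in the table differ from the reference only in tokenizer spacing around punctuation or in a leading article (for example `palo alto , california` versus `Palo Alto, California`). A common remedy, used by SQuAD-style evaluation scripts, is to normalize both strings before scoring. The sketch below is an assumption about how that could look here; `normalize_answer` and `normalized_exact_match` are hypothetical helpers, not part of the notebook's code.

```python
import re
import string

def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    # Remove every ASCII punctuation character.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Remove the articles "a", "an", "the" when they stand as whole words.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

def normalized_exact_match(prediction: str, truth: str) -> int:
    """Exact match computed on the normalized forms of both strings."""
    return int(normalize_answer(prediction) == normalize_answer(truth))

print(normalize_answer("palo alto , california"))  # palo alto california
print(normalized_exact_match("palo alto , california", "Palo Alto, California"))  # 1
print(normalized_exact_match("pacific ocean", "The Pacific Ocean"))  # 1
```

Normalization does not resolve every mismatch: removing punctuation turns `high-level` into `highlevel` but `high - level` into `high level`, so hyphenated words that the tokenizer split apart still will not match.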


Next, we repeat the evaluation with a BERT checkpoint that has been fine-tuned on SQuAD:

In [34]:
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

df = pd.DataFrame(data)

# Preprocessing function to tokenize the data
def preprocess(question, context):
    return tokenizer.encode_plus(question, context, return_tensors='pt', max_length=512, truncation=True, padding='max_length', add_special_tokens=True)

df['encoded'] = df.apply(lambda row: preprocess(row['question'], row['context']), axis=1)

# Evaluate the model on the provided data
evaluation_results = evaluate(model, df)
evaluation_df = pd.DataFrame(evaluation_results, columns=['Question', 'True Answer', 'Predicted Answer'])
evaluation_df['Exact Match'] = [exact_match(pred, true) for pred, true in zip(evaluation_df['Predicted Answer'], evaluation_df['True Answer'])]
evaluation_df['F1 Score'] = [compute_f1(pred, true) for pred, true in zip(evaluation_df['Predicted Answer'], evaluation_df['True Answer'])]

evaluation_df




| | Question | True Answer | Predicted Answer | Exact Match | F1 Score | BLEU Score |
|---|---|---|---|---|---|---|
| 0 | What is the capital of France? | Paris | paris | 1 | 1.0 | 1.821832e-231 |
| 1 | Which ocean is the largest? | The Pacific Ocean | pacific ocean | 0 | 0.8 | 9.047425e-155 |
| 2 | What type of language is Python? | a high-level, interpreted, and general-purpose... | high - level , interpreted , and general - pur... | 0 | 0.222222 | 1.189646e-231 |
| 3 | Where is Tesla based? | Palo Alto, California | palo alto , california | 0 | 0.571429 | 1.531972e-231 |
| 4 | What is the Nobel Prize? | a set of annual international awards | a set of annual international awards bestowed ... | 0 | 0.571429 | 0.3237723 |
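
The near-zero BLEU values in the table arise because short answers have few or no 2-, 3-, or 4-gram overlaps with the reference, which is what the NLTK warning reports. Following the warning's own suggestion, a smoothed variant could look like the sketch below. The choice of `method1` and the function name `calculate_bleu_smoothed` are assumptions made for illustration, not the notebook's method.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def calculate_bleu_smoothed(reference: str, candidate: str) -> float:
    """Sentence-level BLEU with NLTK's method1 smoothing for short texts."""
    reference_tokens = [reference.lower().split()]  # list of reference token lists
    candidate_tokens = candidate.lower().split()
    # method1 replaces zero n-gram match counts with a small epsilon so that a
    # missing higher-order overlap no longer drives the whole score to zero.
    smoothing = SmoothingFunction().method1
    return sentence_bleu(reference_tokens, candidate_tokens, smoothing_function=smoothing)

score = calculate_bleu_smoothed("The Pacific Ocean", "pacific ocean")
print(score)
```

Smoothed scores are still low for very short answers, so for span-style QA the token-level F1 usually remains the more informative number.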
