BERTScore is an evaluation metric that uses contextual embeddings from a pre-trained BERT-family model (BERT stands for Bidirectional Encoder Representations from Transformers) to assess the quality of generated text. Traditional metrics such as BLEU and ROUGE count overlapping n-grams, so they reward only exact surface matches. BERTScore instead compares each token in the generated text with the tokens in the reference text using the cosine similarity of their embeddings. This lets it give credit for synonyms and paraphrases that n-gram metrics miss.
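
To make the limitation of exact matching concrete, here is a minimal sketch of unigram overlap precision. It is a simplification for illustration, not BLEU or ROUGE themselves, and the function name `unigram_precision` is made up for this example.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also occur in the reference (clipped counts)."""
    cand_tokens = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    if not cand_tokens:
        return 0.0
    matched = 0
    for token, count in Counter(cand_tokens).items():
        # A token is credited at most as often as it appears in the reference.
        matched += min(count, ref_counts[token])
    return matched / len(cand_tokens)


# A paraphrase shares only one surface token ("i") with the reference.
paraphrase = unigram_precision("i require assistance", "i need help")
identical = unigram_precision("i need help", "i need help")
print(paraphrase, identical)
```

The paraphrase is penalized heavily even though it means the same thing, which is the gap an embedding-based metric is meant to close.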

### Install Required Packages

In [1]:
! pip install bert-score
! pip install transformers

Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert-score
Successfully installed bert-score-0.3.13


### Import Required Libraries

- Purpose: This line imports the BERTScorer class from the bert_score library. This library provides tools to calculate BERTScore, which measures the quality of generated text based on BERT embeddings.
- Why BERTScore: BERTScore is particularly useful in scenarios where semantic similarity is more critical than exact word matches, making it suitable for tasks like paraphrase detection and dialogue systems.

In [2]:
# Import necessary libraries
from bert_score import BERTScorer

Defining Test Cases:

- generated_responses contains sample outputs from a model that we want to evaluate.
- reference_responses holds the expected outputs or human-written responses against which the generated responses will be compared.


Purpose: This setup allows for a direct comparison to assess how well the generated text matches the reference in terms of semantic content and meaning.

In [3]:
# Sample generated and reference responses
generated_responses = [
   "I am looking for information on my account.",
   "Can you help me reset my password?"
]

reference_responses = [
   "I need help with my account.",
   "I want to reset my password."
]


BERTScorer Initialization:

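- Here, we create an instance of the BERTScorer class. The lang="en" argument specifies that we are working with English text, so the library loads its default English model. Despite the metric's name, that default is not necessarily BERT itself: in this run the model-loading message in the cell output shows that roberta-large was used.
- The rescale_with_baseline=True argument rescales the raw scores against a precomputed baseline. Raw BERTScore values tend to fall in a narrow range near the top of the scale, so rescaling spreads them out and makes them easier to read. Rescaled scores are no longer bounded below by 0 and can be negative for poor matches.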


Why Initialization Matters: Proper initialization of the scorer ensures that the evaluation is performed correctly according to the specified language and settings.
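
As a rough illustration of what baseline rescaling does, the sketch below applies the linear transformation (score - b) / (1 - b), where b is the baseline. The function name `rescale_with_baseline` and the value 0.85 are assumptions chosen for this example; the library uses its own precomputed baselines, which depend on the model and language.

```python
def rescale_with_baseline(raw_score: float, baseline: float) -> float:
    """Linearly rescale so that `baseline` maps to 0 while 1 stays at 1."""
    return (raw_score - baseline) / (1.0 - baseline)


# Hypothetical baseline, for illustration only.
example_baseline = 0.85

good_match = rescale_with_baseline(0.92, example_baseline)  # clearly above baseline
at_baseline = rescale_with_baseline(0.85, example_baseline)  # exactly at baseline
below_baseline = rescale_with_baseline(0.80, example_baseline)  # becomes negative
print(good_match, at_baseline, below_baseline)
```

The ranking of responses is unchanged by this transformation; only the scale becomes easier to read.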

In [4]:
# Initialize BERTScorer
scorer = BERTScorer(lang="en", rescale_with_baseline=True)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore Calculation:

- The scorer.score() method computes Precision (P), Recall (R), and F1 for each generated response against its reference. The first argument is the list of generated responses and the second is the list of references, paired by position.
- Precision matches each token in the generated response to its most similar token in the reference and averages those similarities, so it measures how much of the generated text is supported by the reference. Recall does the same in the other direction, from reference tokens to generated tokens, so it measures how much of the reference is covered. F1 is the harmonic mean of Precision and Recall, balancing the two.
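
The matching described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the 2-D vectors are made-up stand-ins for contextual embeddings, no idf weighting or baseline rescaling is applied, and the names `cosine` and `greedy_match_scores` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token vectors."""
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy vectors: the candidate has one token matching the reference and one unrelated token.
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0]]
p, r, f1 = greedy_match_scores(cand, ref)
print(p, r, f1)
```

In this toy case the extra candidate token lowers precision, while recall stays high because the single reference token is fully covered.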


Output: The call returns three tensors (P, R, and F1), each holding one score per response pair.

In [5]:
# Calculate BERTScore
P, R, F1 = scorer.score(generated_responses, reference_responses)

Displaying Results:
- The mean() method averages the per-pair scores, giving one summary number for each metric.
- The item() method converts the resulting single-element PyTorch tensor to a standard Python float for easier printing.


Purpose: This section prints out the calculated Precision, Recall, and F1 Score, offering insights into the quality of the generated text based on semantic similarity.
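
A mean can hide a single weak response, so it may help to look at per-pair scores as well. The sketch below works on plain Python lists; in this notebook such a list could be obtained with F1.tolist(). The helper name `summarize_scores` and the placeholder scores are assumptions for illustration, not results from the model.

```python
from statistics import mean


def summarize_scores(candidates, scores):
    """Return the mean score and the (candidate, score) pair with the lowest score."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    pairs = list(zip(candidates, scores))
    worst = min(pairs, key=lambda pair: pair[1])
    return mean(scores), worst


# Placeholder values, for illustration only.
example_candidates = ["first response", "second response"]
example_scores = [0.25, 0.75]

avg, worst = summarize_scores(example_candidates, example_scores)
print("Mean:", avg)
print("Lowest-scoring response:", worst)
```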

In [6]:
# Print the results
print("Precision:", P.mean().item())
print("Recall:", R.mean().item())
print("F1 Score:", F1.mean().item())

Precision: 0.4680666923522949
Recall: 0.587756872177124
F1 Score: 0.5280133485794067


The bert_score library is imported to leverage BERT embeddings for evaluating the quality of generated text. BERTScore allows for more nuanced semantic comparisons than traditional text evaluation metrics.

In [7]:
from bert_score import BERTScorer

Evaluation of Generated Responses:

- Generated Responses:
    - The generated_responses list contains the text outputs produced by a model. These are the texts that will be evaluated for quality.
- Reference Responses:
    - The reference_responses list holds the expected ideal responses, serving as a benchmark for comparison against the generated outputs.
- Initializing BERTScorer:
    - Purpose: The BERTScorer is initialized for English text and is set to rescale with a baseline, which makes the scores easier to interpret.

In [8]:
scorer = BERTScorer(lang="en", rescale_with_baseline=True)




Calculating BERT Scores:
- BERTScore precision (P), recall (R), and F1 are calculated to assess how well the generated responses match the reference responses semantically.

In [9]:
P, R, F1 = scorer.score(generated_responses, reference_responses)

Printing Evaluation Results:
- The mean values of precision, recall, and F1 score are printed to provide insights into the quality of the model’s responses.

In [10]:
print("Generated Responses BERTScore:")
print("Precision:", P.mean().item())
print("Recall:", R.mean().item())
print("F1 Score:", F1.mean().item())


Generated Responses BERTScore:
Precision: 0.4680666923522949
Recall: 0.587756872177124
F1 Score: 0.5280133485794067


A/B Testing Section:

- Purpose: This section evaluates two different sets of generated responses (A and B) to determine which performs better when compared to the same reference responses.
- Sample Reference Responses:
    - The references list is defined specifically for this A/B testing context, ensuring a common baseline for both sets of responses.
- Defining Response Sets:
    - Two sets of responses (responses_A and responses_B) are created for testing. Each set contains variations of generated text that will be evaluated against the same reference responses.
- Calculating BERTScore for Both Versions:
    - Each response set is scored separately against the same references to allow a direct comparison. Only the precision component is kept and compared here; recall and F1 are discarded.

In [13]:
# A/B Testing Section
# This section will compare two different sets of generated responses (A and B)
# against a common set of reference responses to evaluate which performs better.
# Sample reference responses for evaluation in A/B testing
references = [
    "I need assistance with my account.",
    "Please help me reset my password."
]

# Example A/B testing with sample responses
# Two sets of responses are defined here to evaluate their performance against the same reference responses.
responses_A = ["Hello, how can I assist you?", "Can you help me with my issue?"]
responses_B = ["How may I help you today?", "I need help with my account."]

# Calculate BERTScore for both versions (A and B) using the same references
# This allows for a direct comparison of how well each set of responses aligns with the reference responses
# in terms of precision, making it easy to identify which version is more effective.
P_A, _, _ = scorer.score(responses_A, references)
P_B, _, _ = scorer.score(responses_B, references)

# Print precision scores for both versions
# Comparing the precision scores will reveal which response set (A or B) better matches the reference responses,
# providing insights into the efficacy of each approach.
print(f"\nVersion A BERTScore Precision: {P_A.mean().item()}")
print(f"Version B BERTScore Precision: {P_B.mean().item()}")


Version A BERTScore Precision: 0.0704498216509819
Version B BERTScore Precision: 0.12498872727155685
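
The comparison above relies on mean precision alone. A fuller comparison might look at F1 and at how often each version wins on individual examples. The sketch below takes plain lists of per-example scores (for instance, F1 tensors converted with .tolist()); the function name `compare_ab` and the numbers are hypothetical and are not outputs of the scorer.

```python
from statistics import mean


def compare_ab(scores_a, scores_b):
    """Compare two versions by mean score and by per-example wins."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both versions must be scored on the same references")
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "wins_a": wins_a,
        "wins_b": wins_b,
    }


# Placeholder per-example scores, for illustration only.
result = compare_ab([0.5, 0.25], [0.25, 0.75])
print(result)
```

With only two examples per version, as in this notebook, differences in the means are unlikely to be conclusive, so a larger evaluation set would be needed before picking a winner.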
