# Appendix 1: BERTScore

In [None]:
!pip install transformers bert_score

Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert_score
Successfully installed bert_score-0.3.13


In [None]:
import bert_score

In [None]:
from google.colab import drive
drive.mount('/content/drive')
path = '/content/drive/MyDrive/llama3.1_quantized.csv'

import pandas as pd
llama_results = pd.read_csv(path)  # load the Llama results CSV from Drive

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
print("Llama Results Columns:", llama_results.columns)

Llama Results Columns: Index(['Unnamed: 0', 'ANSWERID', 'Question', 'Answer', 'URL',
       'InferencedAnswer', 'llama3.1_instruct'],
      dtype='object')
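
The scoring cell that follows reports Precision, Recall, and F1 for each answer pair. As a rough illustration of what those three numbers mean, the sketch below computes them by greedy cosine matching over token vectors, which is the core idea behind BERTScore. The 2-dimensional vectors and the helper names (`cosine`, `greedy_bertscore`) are hypothetical stand-ins for this example only; the real library uses contextual embeddings from a pretrained model and offers options such as IDF weighting and baseline rescaling that are not shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_bertscore(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy token matching.

    Precision: each candidate token is matched to its most similar reference token.
    Recall: each reference token is matched to its most similar candidate token.
    """
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy, made-up token vectors (NOT real model embeddings)
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

p, r, f1 = greedy_bertscore(candidate, reference)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

In this toy case every candidate token has an exact match in the reference, so precision is 1.0, while the reference token `[1.0, 1.0]` has no exact counterpart in the candidate, which pulls recall below 1.0.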


In [None]:
# Score every model answer against its reference answer in one batched call.
# Each row already holds both texts, so no lookup by ANSWERID is needed, and
# batching loads the scoring model once instead of once per row.
candidates = llama_results['InferencedAnswer'].tolist()
references = llama_results['Answer'].tolist()
P, R, F1 = bert_score.score(candidates, references, lang="en")

# Collect the inputs and the per-pair BERTScore metrics in a DataFrame
results_df = pd.DataFrame({
    "AnswerID": llama_results['ANSWERID'].tolist(),
    "Question": llama_results['Question'].tolist(),
    "Reference_Answer": references,
    "Model_Answer": candidates,
    "Precision": P.tolist(),
    "Recall": R.tolist(),
    "F1_Score": F1.tolist(),
})
print(results_df.head())

# Save the results to a new CSV file on Drive
results_df.to_csv("/content/drive/MyDrive/Llama3.1q_BERTScore_Benchmark_Results.csv", index=False)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

                AnswerID                                           Question  \
0  ADAM_0003147_Sec1.txt  What is (are) Polycystic ovary syndrome ? (Als...   
1  ADAM_0003147_Sec2.txt  What causes Polycystic ovary syndrome ? (Also ...   
2  ADAM_0002818_Sec7.txt    What are the complications of Noonan syndrome ?   
3  ADAM_0002818_Sec9.txt                   How to prevent Noonan syndrome ?   
4  GARD_0004375_Sec1.txt  What are the symptoms of Neurofibromatosis-Noo...   

                                    Reference_Answer  \
0  Polycystic ovary syndrome is a condition in wh...   
1  PCOS is linked to changes in hormone levels th...   
2  Buildup of fluid in tissues of body (lymphedem...   
3  Couples with a family history of Noonan syndro...   
4  The Human Phenotype Ontology provides the foll...   

                                        Model_Answer  Precision    Recall  \
0  Polycystic ovary syndrome (PCOS) is a conditio...   0.906680  0.928292   
1  Polycystic ovary syndrome (PCOS