# Fine-tuning an LLM for Question Answering

### Please refer to the respective sections in the book for further details.


## Step 1. Setting Up the Development Environment


In [None]:
!pip install "peft==0.2.0"
!pip install "transformers==4.27.1" "datasets==2.9.0" "accelerate==0.17.1" "evaluate==0.4.0" "bitsandbytes==0.37.1" loralib --upgrade --quiet
!pip install rouge-score tensorboard py7zr 

## Step 2. Data pre-processing


### Step 2.1 Load the dataset.

In [1]:
from datasets import load_dataset

dataset = load_dataset('pubmed_qa', 'pqa_labeled')

print(f"Train dataset size: {len(dataset['train'])}")

Reusing dataset pubmed_qa (/Users/shivamsolanki/.cache/huggingface/datasets/pubmed_qa/pqa_labeled/1.0.0/dd4c39f031a958c7e782595fa4dd1b1330484e8bbadd4d9212e5046f27e68924)


Train dataset size: 1000


In [2]:
train_dataset = dataset['train'].shuffle().select(range(800))
train_dataset

Dataset({
    features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
    num_rows: 800
})

In [3]:
train_dataset['context'][0]

{'contexts': ["Quality of Life (QoL) assessment remains integral in the investigation of women with lower urinary tract dysfunction. Previous work suggests that physicians tend to underestimate patients' symptoms and the bother that they cause. The aim of this study was to assess the relationship between physician and patient assessed QoL using the Kings Health Questionnaire (KHQ).",
  'Patients complaining of troublesome lower urinary tract symptoms (LUTS) were recruited from a tertiary referral urodynamic clinic. Prior to their clinic appointment they were sent a KHQ, which was completed before attending. After taking a detailed urogynecological history, a second KHQ was filled in by the physician, blinded to the patient responses, on the basis of their impression of the symptoms elicited during the interview. These data were analyzed by an independent statistician. Concordance between patient and physician assessment for individual questions was assessed using weighted kappa analysi

### Step 2.2 Train-test split.

In [3]:
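# Shuffle once with a fixed seed and slice the same ordering, so the splits are disjoint.
# Shuffling separately for each split would let test rows overlap with training rows.
# The seed value (42) is an arbitrary choice.
shuffled_dataset = dataset['train'].shuffle(seed=42)
train_dataset = shuffled_dataset.select(range(800))

test_dataset = shuffled_dataset.select(range(800, 1000))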

questions = test_dataset['question']
answers = test_dataset['long_answer']

contexts = test_dataset['context']

# Join each example's context passages into a single string.
contexts_list = [' '.join(context['contexts']) for context in contexts]

print("Question:", questions[9])
print("Answer:", answers[9])
print("Context:", contexts_list[9])


Question: Do emergency ultrasound fellowship programs impact emergency medicine residents' ultrasound education?
Answer: Emergency US fellowship programs had a positive impact on residents' US educational experiences. Emergency medicine residents performed more scans overall and also used bedside US for more advanced applications in programs with EUS fellowships.
Context: Recent years have seen a rapid proliferation of emergency ultrasound (EUS) programs in the United States. To date, there is no evidence supporting that EUS fellowships enhance residents' ultrasound (US) educational experiences. The purpose of this study was to determine the impact of EUS fellowships on emergency medicine (EM) residents' US education. We conducted a cross-sectional study at 9 academic medical centers. A questionnaire on US education and bedside US use was pilot tested and given to EM residents. The primary outcomes included the number of US examinations performed, scope of bedside US applications, barr
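
One property worth checking after a split is that no example lands in both sets. The sketch below uses only the standard library, with integer indices standing in for dataset rows; the helper names and seed values are illustrative assumptions, not part of the notebook. It contrasts two independent shuffles with a single seeded shuffle that is sliced twice.

```python
import random

def split_with_two_shuffles(n_rows, n_train, rng):
    """Shuffle separately for each split (the two splits may overlap)."""
    first = rng.sample(range(n_rows), n_rows)
    second = rng.sample(range(n_rows), n_rows)
    return first[:n_train], second[n_train:]

def split_with_one_shuffle(n_rows, n_train, seed):
    """Shuffle once with a fixed seed, then slice (the two splits are disjoint)."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return order[:n_train], order[n_train:]

rng = random.Random(0)
train_a, test_a = split_with_two_shuffles(1000, 800, rng)
train_b, test_b = split_with_one_shuffle(1000, 800, seed=42)

overlap_two_shuffles = len(set(train_a) & set(test_a))
overlap_one_shuffle = len(set(train_b) & set(test_b))
print("Overlap with two shuffles:", overlap_two_shuffles)
print("Overlap with one shuffle:", overlap_one_shuffle)
```

With the real dataset, the same check could be run on the `pubid` column of the two splits.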

### Step 2.3 Transform dataset.

In [4]:
import numpy as np

transformed_train_dataset = train_dataset.from_dict({
    'context': [". ".join(example['context']['contexts']) for example in train_dataset],
    'question': train_dataset['question'],
    'answer': train_dataset['long_answer'],
})

transformed_test_dataset = test_dataset.from_dict({
    'context': [". ".join(example['context']['contexts']) for example in test_dataset],
    'question': test_dataset['question'],
    'answer': test_dataset['long_answer'],
})

print("Transformed Context:", transformed_train_dataset['context'][0])


Transformed Context: It remains controversial whether there is a gender difference in survival of patients with resected non-small cell lung cancer.. We retrospectively analyzed 2770 patients (1689 men and 1081 women) with non-small cell lung cancer who underwent pulmonary resection between 1995 and 2005 at the National Cancer Center Hospital, Tokyo. A gender difference in survival was studied in all patients, in those divided according to histology or pathologic stage, and in propensity-matched gender pairs.. There were no differences in background, such as preoperative pulmonary function, operation procedures, or operative mortality. The proportions of adenocarcinoma and pathologic stage I in women were greater than those in men (93.6% vs 61.7% and 71.4% vs 58.6%, respectively) (P<.001). Overall 5-year survival of women was better than that of men (81% vs 70%, P<.001). In adenocarcinoma, the overall 5-year survival for women was better than that for men in pathologic stage I (95% vs 

### Step 2.4 Tokenize dataset.

In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="google/flan-t5-xxl"

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [6]:
from datasets import concatenate_datasets
import numpy as np

tokenized_inputs = concatenate_datasets([transformed_train_dataset]).map(
    lambda x: tokenizer(x["context"], x["question"], truncation=True),
    batched=True,
    remove_columns=["context", "question", "answer"]
)

input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]

max_source_length = int(np.percentile(input_lengths, 100))
print(f"Max source length: {max_source_length}")


tokenized_targets = concatenate_datasets([transformed_train_dataset]).map(
    lambda x: tokenizer(x["answer"], truncation=True),
    batched=True,
    remove_columns=["context", "question", "answer"]
)

target_lengths = [len(x) for x in tokenized_targets["input_ids"]]

max_target_length = int(np.percentile(target_lengths, 100))
print(f"Max target length: {max_target_length}")


Max source length: 512


Max target length: 171
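
The 100th percentile used above is simply the maximum length. Because the tokenizer is called with `truncation=True` and no explicit `max_length`, the reported source length of 512 is likely the tokenizer's own cap rather than the true longest input. The sketch below, a minimal illustration with made-up lengths, shows how a lower percentile would trade coverage for shorter sequences. `nearest_rank_percentile` is a hypothetical helper that uses the nearest-rank rule, which differs from the linear interpolation `np.percentile` applies by default.

```python
def nearest_rank_percentile(values, q):
    """Return the q-th percentile (integer q in 1..100) using the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # ceiling of q * n / 100
    return ordered[rank - 1]

# Made-up token counts for five inputs.
token_lengths = [120, 180, 240, 300, 512]

full_cover = nearest_rank_percentile(token_lengths, 100)  # same as max(token_lengths)
trimmed = nearest_rank_percentile(token_lengths, 80)      # ignores the longest outlier

print("100th percentile:", full_cover)
print("80th percentile:", trimmed)
```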


### Step 2.5 Pre-process dataset.

In [7]:

def preprocess_function(sample, padding="max_length"):
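    # The T5 tokenizer already defines a dedicated <pad> token, so it is left unchanged.
    # Aliasing pad_token to eos_token would make the -100 masking below also hide the
    # end-of-sequence token in the labels, so the model would not learn when to stop.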

    inputs = ['Answer the question based on the context below. ' + context + ' ' + question
              for context, question in zip(sample["context"], sample["question"])]

    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    labels = tokenizer(text_target=sample["answer"], max_length=max_target_length, padding=padding, truncation=True)

    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train_dataset = transformed_train_dataset.map(preprocess_function, batched=True, remove_columns=['context', 'question', 'answer'])
tokenized_test_dataset = transformed_test_dataset.map(preprocess_function, batched=True, remove_columns=['context', 'question', 'answer'])
print(f"Keys of tokenized dataset: {list(tokenized_train_dataset.features)}")

tokenized_train_dataset.save_to_disk("data/train")
tokenized_test_dataset.save_to_disk("data/eval")


Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']


Saving the dataset (0/1 shards):   0%|          | 0/800 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/200 [00:00<?, ? examples/s]
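
The label masking in `preprocess_function` can be pictured with plain lists. The sketch below is a simplified stand-in with hypothetical token ids (0 for padding, 1 for end of sequence); `pad_and_mask` is not a library function. It replaces padding positions with -100, the value PyTorch's cross-entropy loss ignores by default, and shows what happens when the padding id is the same as the end-of-sequence id.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def pad_and_mask(token_ids, eos_id, pad_id, max_length):
    """Append EOS, pad to max_length, then mask every padding id with IGNORE_INDEX."""
    padded = token_ids + [eos_id] + [pad_id] * (max_length - len(token_ids) - 1)
    return [tok if tok != pad_id else IGNORE_INDEX for tok in padded]

answer_ids = [37, 1712, 19]  # hypothetical ids for a three-token answer

# Separate pad and EOS ids: the EOS label is kept and contributes to the loss.
separate_pad = pad_and_mask(answer_ids, eos_id=1, pad_id=0, max_length=6)

# Pad id equal to EOS id: the EOS label is masked along with the padding.
eos_as_pad = pad_and_mask(answer_ids, eos_id=1, pad_id=1, max_length=6)

print(separate_pad)
print(eos_as_pad)
```

In the second case no end-of-sequence label remains in the target, which is one plausible cause of generations that run on until `max_new_tokens` is reached.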

## Step 3. Model training/fine-tuning

### Step 3.1 Load the model.

In [8]:
from transformers import AutoModelForSeq2SeqLM

model_id = "philschmid/flan-t5-xxl-sharded-fp16"

model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")




Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
CUDA SETUP: CUDA runtime path found: /data/rlhf/miniconda3/envs/finetune/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /data/rlhf/miniconda3/envs/finetune/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so...



Loading checkpoint shards:   0%|          | 0/12 [00:00<?, ?it/s]

### Step 3.2 Prepare the model.

In [9]:
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)
model = prepare_model_for_int8_training(model)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 18874368 || all params: 11154206720 || trainable%: 0.16921300163961817
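
The trainable parameter count printed above can be reproduced by hand. The sketch below assumes the publicly documented FLAN-T5 XXL architecture (hidden size 4096, attention inner dimension 4096, 24 encoder blocks and 24 decoder blocks); these values are assumptions taken from the model configuration, not from this notebook. Each adapted projection receives two low-rank matrices, one of shape `r x d_model` and one of shape `inner_dim x r`.

```python
# Assumed FLAN-T5 XXL architecture values (from the public model config).
d_model = 4096
inner_dim = 4096            # num_heads * d_kv = 64 * 64
encoder_blocks = 24
decoder_blocks = 24
r = 16                      # LoRA rank used in lora_config

# Encoder blocks have self-attention; decoder blocks have self- and cross-attention.
attention_layers = encoder_blocks + 2 * decoder_blocks
adapted_projections = attention_layers * 2          # target_modules = ["q", "v"]
params_per_projection = r * d_model + inner_dim * r  # LoRA matrices A and B

lora_params = adapted_projections * params_per_projection
all_params = 11154206720    # total reported by print_trainable_parameters()
trainable_percent = 100 * lora_params / all_params

print(f"LoRA parameters: {lora_params}")
print(f"Trainable share: {trainable_percent:.4f}%")
```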


### Step 3.3 Create a data collator.

In [10]:
from transformers import DataCollatorForSeq2Seq

label_pad_token_id = -100
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
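
As a rough picture of what `label_pad_token_id` and `pad_to_multiple_of` control, the sketch below pads a batch of label lists to the next multiple of 8 using -100. It is a simplified stand-in, not the library's implementation, and `pad_labels` is a hypothetical helper. Since `preprocess_function` already pads to a fixed maximum length by default, the collator in this notebook probably has little left to do beyond this rounding.

```python
def pad_labels(batch, pad_id=-100, multiple=8):
    """Pad every label list to the batch maximum, rounded up to a multiple."""
    longest = max(len(labels) for labels in batch)
    target = ((longest + multiple - 1) // multiple) * multiple
    return [labels + [pad_id] * (target - len(labels)) for labels in batch]

# Two hypothetical label sequences of length 3 and 9.
batch = [[5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16]]
padded = pad_labels(batch)

for row in padded:
    print(len(row), row)
```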

### Step 3.4 Define Training Hyperparameters

In [14]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir="lora-flan-t5-xxl"

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,  # higher learning rate
    max_steps=1000,  # use a small value such as 10 for a quick smoke test
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="no",
    report_to="tensorboard",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_train_dataset,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

### Step 3.5 Training the Model

In [15]:
trainer.train()

| Step | Training Loss |
| --- | --- |
| 1 | 1.9468 |
| 2 | 1.5378 |
| 3 | 1.9396 |
| 4 | 0.9585 |
| 5 | 0.9483 |
| 6 | 1.4215 |
| 7 | 1.9103 |
| 8 | 1.1182 |
| 9 | 2.289 |
| 10 | 1.4719 |


TrainOutput(global_step=1000, training_loss=1.7347052989304066, metrics={'train_runtime': 2705.6272, 'train_samples_per_second': 5.914, 'train_steps_per_second': 0.37, 'total_flos': 3.3861446664192e+16, 'train_loss': 1.7347052989304066, 'epoch': 1.25})

### Step 3.6 Saving the Model

In [16]:
peft_model_id="flan-t5-pubmed"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)

('flan-t5-pubmed/tokenizer_config.json',
 'flan-t5-pubmed/special_tokens_map.json',
 'flan-t5-pubmed/tokenizer.json')

## Step 4. Model Evaluation 

### Step 4.1 Load the fine-tuned model.

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load peft config for pre-trained checkpoint etc. 
peft_model_id = "flan-t5-pubmed"
config = PeftConfig.from_pretrained(peft_model_id)

# load base LLM model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path,  load_in_8bit=True,  device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
model.eval()

print("Peft model loaded")

### Step 4.2 Test the fine-tuned model.

In [4]:
prompt = f"""
Answer the question based on the context below.
 Context: To study whether nontriploid partial hydatidiform moles truly exist. We conducted a reevaluation of pathology and ploidy in 19 putative nontriploid partial hydatidiform moles using standardized histologic diagnostic criteria and repeat flow cytometric testing by the Hedley technique. On review of the 19 moles, 53% (10/19) were diploid nonpartial moles (initially pathologically misclassified), and 37% (7/19) were triploid partial moles (initial ploidy misclassifications). One additional case (5%) was a diploid early complete mole (initially pathologically misclassified).
 Question: Do nontriploid partial hydatidiform moles exist?
""".strip()

encoding = tokenizer(prompt, return_tensors="pt")
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids.to("cuda"),  # move inputs to the GPU that holds the model
        attention_mask=encoding.attention_mask.to("cuda"),
        max_new_tokens=155,
    )
generated_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated Output: ", generated_output)

Generated Output:  Nontriploid partial hydatidiform moles do not exist. The initial pathologic diagnosis of these moles is often incorrect. Flow cytometric testing by the Hedley technique is the most reliable method for determining ploidy. Flow cytometric testing by the Hedley technique is also the most reliable method for determining ploidy. Flow cytometric testing by the Hedley technique is the most reliable method for determining ploidy. Flow cytometric testing by the Hedley technique is the most reliable method for determining ploidy. Flow cytometric testing by the Hedley technique is the most reliable method for determining p
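
The output above repeats the same sentence until the token limit is reached. One common way to reduce such loops at decoding time is to pass repetition controls to `generate`. The settings below are a sketch; the specific values are assumptions to tune, not values used in this notebook.

```python
# Illustrative decoding settings; every value here is an assumption to tune.
generation_kwargs = {
    "max_new_tokens": 155,
    "num_beams": 4,             # beam search instead of greedy decoding
    "no_repeat_ngram_size": 3,  # forbid repeating any 3-gram
    "repetition_penalty": 1.2,  # down-weight tokens that were already generated
    "early_stopping": True,     # stop beam search once enough beams have finished
}

# Usage with the objects defined above (not executed here):
# outputs = model.generate(
#     input_ids=encoding.input_ids,
#     attention_mask=encoding.attention_mask,
#     **generation_kwargs,
# )
print(generation_kwargs)
```

Decoding constraints only mask the symptom; if the model never saw an end-of-sequence label during training, the labels themselves are the place to look.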


In [14]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /data/rlhf/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Step 4.3 Evaluate the fine-tuned model on the test dataset.

In [26]:
import evaluate
import numpy as np
import string
import collections
import re
from datasets import load_from_disk
from tqdm import tqdm
from nltk.translate.bleu_score import corpus_bleu
from nltk import word_tokenize
from sentence_transformers import SentenceTransformer

rouge_metric = evaluate.load('rouge')
bleu_metric = corpus_bleu

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
        return re.sub(regex, ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))

def get_tokens(s):
    if not s: return []
    return normalize_answer(s).split()

def compute_f1(a_gold, a_pred):
    gold_toks = get_tokens(a_gold)
    pred_toks = get_tokens(a_pred)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if len(gold_toks) == 0 or len(pred_toks) == 0:
        return int(gold_toks == pred_toks)
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

# Load the embedding model once instead of reloading it for every pair of answers.
glove_model = SentenceTransformer('average_word_embeddings_glove.6B.300d')

def calculate_sentencesim_score(generated_answer, actual_answer):
    embeddings1 = glove_model.encode([generated_answer])[0]
    embeddings2 = glove_model.encode([actual_answer])[0]
    # Cosine similarity between the two sentence embeddings.
    similarity = np.dot(embeddings1, embeddings2) / (np.linalg.norm(embeddings1) * np.linalg.norm(embeddings2))
    return similarity

def evaluate_peft_model(sample,max_target_length=50):
    outputs = model.generate(input_ids=sample['input_ids'].unsqueeze(0).cuda(), do_sample=True, top_p=0.9, max_new_tokens=max_target_length)    
    prediction = tokenizer.decode(outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    labels = np.where(sample['labels'] != -100, sample['labels'], tokenizer.pad_token_id)
    labels = tokenizer.decode(labels, skip_special_tokens=True)

    return prediction, labels

test_dataset = load_from_disk('data/eval/').with_format('torch')

predictions, references = [], []
for sample in tqdm(test_dataset):
    p, l = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(l)

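# Note: references and predictions are passed here as raw strings, so NLTK builds
# n-grams over characters and this is character-level BLEU. Tokenize both sides
# (for example with word_tokenize) to obtain word-level BLEU.
bleu = bleu_metric([[ref] for ref in references], predictions, auto_reweigh=True)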
f1_scores = [compute_f1(ref, pred) for ref, pred in zip(references, predictions)]
f1_avg = sum(f1_scores) / len(f1_scores)

sentencesim_scores = [calculate_sentencesim_score(pred, ref) for pred, ref in zip(predictions, references)]
sentencesim_avg = sum(sentencesim_scores) / len(sentencesim_scores)

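# Compute ROUGE with the metric loaded above. The evaluate implementation is
# assumed to return plain F-measure floats keyed by ROUGE variant.
rouge = rouge_metric.compute(predictions=predictions, references=references, use_stemmer=True)
print(f"Rouge1: {rouge['rouge1'] * 100:.2f}%")
print(f"Rouge2: {rouge['rouge2'] * 100:.2f}%")
print(f"RougeL: {rouge['rougeL'] * 100:.2f}%")
print(f"RougeLsum: {rouge['rougeLsum'] * 100:.2f}%")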
print(f'Avg F1: {f1_avg* 100:.2f}%')
print(f'BLEU: {bleu* 100:.2f}%')
print(f'Avg SentenceSim: {sentencesim_avg * 100:.2f}%')


Avg F1: 22.27%
BLEU: 33.98%
Avg SentenceSim: 75.20%
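
To make the F1 figure concrete, here is a small worked example of the same token-overlap calculation on a toy pair of sentences. The sentences are invented for illustration, and `normalize` and `token_f1` are compact restatements of `normalize_answer` and `compute_f1`, written to be self-contained.

```python
import collections
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(gold, pred):
    """Harmonic mean of token precision and recall after normalization."""
    gold_toks, pred_toks = normalize(gold).split(), normalize(pred).split()
    if not gold_toks or not pred_toks:
        return float(gold_toks == pred_toks)
    overlap = sum((collections.Counter(gold_toks) & collections.Counter(pred_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# Gold normalizes to 4 tokens, prediction to 5 tokens, with 4 tokens in common:
# precision = 4/5, recall = 4/4, F1 = 8/9.
score = token_f1("The cat sat on the mat.", "A cat sat on a mat today.")
print(f"{score:.4f}")
```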


### Step 4.4 Create an evaluation DataFrame.

In [29]:
import pandas as pd

test_dataset = load_from_disk('data/eval/').with_format('torch')

questions, predictions, references = [], [], []

for sample in tqdm(test_dataset):
    p, l = evaluate_peft_model(sample)

    question = tokenizer.decode(sample['input_ids'], skip_special_tokens=True)

    predictions.append(p)
    references.append(l)
    questions.append(question)


df_flan_t5_eval = pd.DataFrame(
    {'Question': questions,
     'Prediction': predictions,
     'Answer': references,
    })
df_flan_t5_eval.head()

100%|█████████████████████████████████████████████████████████| 200/200 [19:10<00:00,  5.75s/it]


|   | Question | Prediction | Answer |
| --- | --- | --- | --- |
| 0 | Answer the question based on the context below... | Patients with MM have a favorable long-term ou... | The study results suggested that spinal cord u... |
| 1 | Answer the question based on the context below... | Among adults with no lifetime exposure to fluo... | Among adults aged 20 to 34 years with private ... |
| 2 | Answer the question based on the context below... | In this study, we examined the relationship be... | Using data on cumulative hospital mortality fr... |
| 3 | Answer the question based on the context below... | It is important for the orthopedic surgeon to ... | Patients with severe pain immediately after ve... |
| 4 | Answer the question based on the context below... | In this multi-center study, only the EBP units... | The EBP unit was associated with better patien... |


In [30]:
df_flan_t5_eval.to_csv('df_flan_t5_eval.csv')

In [32]:
df_flan_t5_eval['Question'][0]

"Answer the question based on the context below. Tethering of the spinal cord is thought to increase the chance of neurological injury when scoliosis correction is undertaken. All patients with myelomeningocele (MM) are radiographically tethered, and untethering procedures carry significant morbidity risks including worsening neurological function and wound complications. No guidelines exist as regards untethering in patients with MM prior to scoliosis correction surgery. The authors' aim in this study was to evaluate their experience in patients with MM who were not untethered before scoliosis correction.. Seventeen patients with MM were retrospectively identified and 1) had no evidence of a clinically symptomatic tethered cord, 2) had undergone spinal fusion for scoliosis correction, and 3) had not been untethered for at least 1 year prior to surgery. The minimum follow-up after fusion was 2 years. Charts and radiographs were reviewed for neurological or shunt complications in the pe

In [40]:
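# A fixed seed makes this selection reproducible. It yields a held-out set only if the
# training split was drawn from the same seeded shuffle; an unseeded shuffle here would
# pick rows that can overlap with the training data.
test_dataset_alternate = dataset['train'].shuffle(seed=42).select(range(800, 1000))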

transformed_test_dataset = test_dataset_alternate.from_dict({
    'context': [". ".join(example['context']['contexts']) for example in test_dataset_alternate],
    'question': test_dataset_alternate['question'],
    'answer': test_dataset_alternate['long_answer'],
})

print("Transformed Context:", transformed_test_dataset['context'][0])


Transformed Context: The present study asked whether the processing of affective prosody is modulated by spatial attention. Pseudo-words with a neutral, happy, threatening, and fearful prosody were presented at two spatial positions. Participants attended to one position in order to detect infrequent targets. Emotional prosody was task irrelevant. The electro-encephalogram (EEG) was recorded to assess processing differences as a function of spatial attention and emotional valence.. Event-related potentials (ERPs) differed as a function of emotional prosody both when attended and when unattended. While emotional prosody effects interacted with effects of spatial attention at early processing levels (<200 ms), these effects were additive at later processing stages (>200 ms).


In [43]:
transformed_test_dataset['answer'][0]

'Emotional prosody, therefore, seems to be partially processed outside the focus of spatial attention. Whereas at early sensory processing stages spatial attention modulates the degree of emotional voice processing as a function of emotional valence, emotional prosody is processed outside of the focus of spatial attention at later processing stages.'

In [None]:
import pandas as pd

questions_test = []
generated_outputs_test = []
answers_test = []

for i, example in enumerate(transformed_test_dataset):
    print("Iteration:", i+1)
    context = example['context']
    question = example['question']
    answer = example['answer']
    print("Question:", question)

    prompt = f"""
    Answer the question based on the context below.
    Context: {context}
    Question: {question}
    """.strip()

    encoding = tokenizer(prompt, return_tensors="pt")
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids.to('cuda'),  # Move input to CUDA
            attention_mask=encoding.attention_mask.to('cuda'),  # Move attention mask to CUDA
            max_new_tokens=155,
        )
    generated_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

    if "Answer:" in generated_output:
        generated_output = generated_output.split("Answer:", 1)[1].strip()

    print("Answer:", generated_output)

    questions_test.append(question)
    generated_outputs_test.append(generated_output)
    answers_test.append(answer)

df_eval_alternate = pd.DataFrame({
    'Question': questions_test,
    'Generated Output': generated_outputs_test,
    'Answer': answers_test
})


In [46]:
df_eval_alternate.head()

|   | Question | Generated Output | Answer |
| --- | --- | --- | --- |
| 0 | Is the processing of affective prosody influen... | The results suggest that affective prosody is ... | Emotional prosody, therefore, seems to be part... |
| 1 | Do mitochondria play a role in remodelling lac... | The results of this study suggest that mitocho... | Results depicted mitochondrial dynamics in viv... |
| 2 | Measurement of head and neck paragangliomas: i... | The linear dimension method is the most reprod... | Due to a relatively good reproducibility, fast... |
| 3 | Comparative safety of infliximab and etanercep... | Infliximab increases the risk of serious infec... | An increased risk of serious infections associ... |
| 4 | Does microbial contamination influence the suc... | Microbial contamination of HPC grafts is rare ... | The use of contaminated products with antibiot... |


### Step 4.5 Compute evaluation metrics.

In [None]:
!pip install bert-score -q
!pip install nltk -q
!pip install rouge-score -q
!pip install sentence-transformers -q

In [48]:
from bert_score import score as bert_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util
import numpy as np
import re
import string
import collections
from rouge_score import rouge_scorer

# Sentence-embedding model used for the semantic similarity score.
model1 = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
        return re.sub(regex, ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))

def get_tokens(s):
    if not s: return []
    return normalize_answer(s).split()

def compute_f1(a_gold, a_pred):
    gold_toks = get_tokens(a_gold)
    pred_toks = get_tokens(a_pred)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if len(gold_toks) == 0 or len(pred_toks) == 0:
        return int(gold_toks == pred_toks)
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

def calculate_f1_score(row):
    return compute_f1(row['Answer'], row['Generated Output'])

def calculate_bleu_score(row):
    smoothie = SmoothingFunction().method4
    reference = row['Answer'].split()
    hypothesis = row['Generated Output'].split()
    return sentence_bleu([reference], hypothesis, smoothing_function=smoothie)

def compute_rouge(reference, prediction):
    # RougeScorer.score takes (target, prediction); only the ROUGE-1 F-measure is returned.
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    score = scorer.score(reference, prediction)
    return score['rouge1'].fmeasure

def calculate_rouge_score(row):
    return compute_rouge(row['Answer'], row['Generated Output'])

def sentence_similarity_alternate(ideal_answer, generated_answer):
    embedding_1= model1.encode(ideal_answer, convert_to_tensor=True)
    embedding_2 = model1.encode(generated_answer, convert_to_tensor=True)
    sim_score = util.pytorch_cos_sim(embedding_1, embedding_2)
    sim_score = sim_score.cpu().numpy()[0][0]  # Move tensor to CPU and convert to NumPy array
    return sim_score

def calculate_similarity_score(row):
    return sentence_similarity_alternate(row['Answer'], row['Generated Output'])

df_eval_alternate['F1 Score'] = df_eval_alternate.apply(calculate_f1_score, axis=1)
df_eval_alternate['BLEU Score'] = df_eval_alternate.apply(calculate_bleu_score, axis=1)
df_eval_alternate['ROUGE Score'] = df_eval_alternate.apply(calculate_rouge_score, axis=1)
df_eval_alternate['SentenceSim Score'] = df_eval_alternate.apply(calculate_similarity_score, axis=1)

display(df_eval_alternate.head())
df_eval_alternate.describe()

|   | Question | Generated Output | Answer | F1 Score | BLEU Score | ROUGE Score | SentenceSim Score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Is the processing of affective prosody influen... | The results suggest that affective prosody is ... | Emotional prosody, therefore, seems to be part... | 0.259259 | 0.040772 | 0.3 | 0.78582 |
| 1 | Do mitochondria play a role in remodelling lac... | The results of this study suggest that mitocho... | Results depicted mitochondrial dynamics in viv... | 0.275132 | 0.018996 | 0.359447 | 0.727707 |
| 2 | Measurement of head and neck paragangliomas: i... | The linear dimension method is the most reprod... | Due to a relatively good reproducibility, fast... | 0.198758 | 0.020828 | 0.251366 | 0.707785 |
| 3 | Comparative safety of infliximab and etanercep... | Infliximab increases the risk of serious infec... | An increased risk of serious infections associ... | 0.29932 | 0.062334 | 0.3375 | 0.743439 |
| 4 | Does microbial contamination influence the suc... | Microbial contamination of HPC grafts is rare ... | The use of contaminated products with antibiot... | 0.159292 | 0.012999 | 0.183333 | 0.490405 |


|   | F1 Score | BLEU Score | ROUGE Score | SentenceSim Score |
| --- | --- | --- | --- | --- |
| count | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 0.232534 | 0.037779 | 0.26453 | 0.728594 |
| std | 0.079427 | 0.034505 | 0.087622 | 0.12075 |
| min | 0.061224 | 0.004558 | 0.071429 | 0.325754 |
| 25% | 0.175651 | 0.012999 | 0.204246 | 0.660566 |
| 50% | 0.227458 | 0.02653 | 0.253695 | 0.742767 |
| 75% | 0.28249 | 0.050444 | 0.322581 | 0.815835 |
| max | 0.496894 | 0.192402 | 0.511628 | 0.975518 |
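
The mean sentence-level BLEU here (0.037779) is far below the corpus-level BLEU of 33.98% reported in Step 4.3. The two are not directly comparable: they differ in aggregation and smoothing, and the Step 4.3 call passed raw strings, so its n-grams were built over characters rather than words. The sketch below illustrates that last effect with a simplified modified n-gram precision; it is not NLTK's implementation, and the two sentences are invented.

```python
from collections import Counter

def ngrams(sequence, n):
    """All contiguous n-grams of a sequence (a string yields character n-grams)."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

def modified_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams found in the reference, with clipped counts."""
    ref_counts = Counter(ngrams(reference, n))
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

reference = "the cat sat"
hypothesis = "the dog sat"

# Word bigrams: ("the", "dog") and ("dog", "sat") do not occur in the reference.
word_p2 = modified_precision(reference.split(), hypothesis.split(), 2)

# Character bigrams: 6 of the 10 hypothesis bigrams also occur in the reference.
char_p2 = modified_precision(reference, hypothesis, 2)

print("Word-level bigram precision:", word_p2)
print("Character-level bigram precision:", char_p2)
```

Character n-grams overlap far more easily than word n-grams, so a character-level score is usually much higher for the same pair of texts.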


In [49]:
df_eval_alternate.to_csv('finetuned-flant5-pubmed-v2.0.csv')