# Fine-tuning a pre-trained LLM


In this notebook, we will fine-tune the base [LLaMA2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf) model from Hugging Face on the PubMedQA dataset. The goal is to train the model to generate both a final decision (“yes” or “no”) and a long-form explanation for each medical question, improving its ability to provide accurate and well-structured answers to medical questions.

## Installations and imports

Imports:

In [None]:
!pip uninstall -y transformers accelerate trl peft bitsandbytes
!pip install -U \
  transformers \
  accelerate \
  trl \
  peft \
  bitsandbytes \
  evaluate \
  rouge_score \
  bert_score

In [None]:
import torch
import pandas as pd
import numpy as np
import evaluate
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig, TrainingArguments, EarlyStoppingCallback
from peft import LoraConfig, PeftModel
from peft import prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer, SFTConfig
from datasets import Dataset
from sklearn.model_selection import train_test_split
from accelerate import Accelerator
from google.colab import drive
from sklearn.metrics import classification_report
from huggingface_hub import login

We will use Llama-2-7B from Hugging Face and fine-tune it on our dataset for better performance.

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Some global variables
use_context = True
dataset_path = "/content/drive/MyDrive/NLP/Assignment/PQA-A.parquet"
model_name = "meta-llama/Llama-2-7b-hf"
save_path = "/content/drive/MyDrive/NLP/Assignment/adapter-weights"
merged_path = "/content/drive/MyDrive/NLP/Assignment/final-model"
max_new_tokens = max_seq_length = 512

## Dataset preprocessing


We chose to use the `pqa_artificial` dataset because it includes labeled data and contains more than 1,000 samples, which makes it possible to extract a balanced subset of "yes" and "no" answers for fine-tuning.

The preprocessing steps involve:
- Removing unnecessary fields
- Filtering out uncertain samples (`final_decision` = "maybe")
- Sampling 1,000 “yes” and 1,000 “no” answers to keep the classes balanced and have enough data without making training take too long.
- Flattening the context field for easier input formatting
- Combining `final_decision` and `long_answer` into a full answer to better guide the model's generated responses.
- Formatting each example into an instruction-style prompt

The processed data is then split into training and evaluation sets, and converted into Hugging Face `Dataset` objects for model training.

In [None]:
#load dataset
dataset = pd.read_parquet(dataset_path)
dataset.head(5)

Unnamed: 0,pubid,question,context,long_answer,final_decision
0,25429730,Are group 2 innate lymphoid cells ( ILC2s ) in...,{'contexts': ['Chronic rhinosinusitis (CRS) is...,"As ILC2s are elevated in patients with CRSwNP,...",yes
1,25433161,Does vagus nerve contribute to the development...,{'contexts': ['Phosphatidylethanolamine N-meth...,Neuronal signals via the hepatic vagus nerve c...,yes
2,25445714,Does psammaplin A induce Sirtuin 1-dependent a...,{'contexts': ['Psammaplin A (PsA) is a natural...,PsA significantly inhibited MCF-7/adr cells pr...,yes
3,25431941,Is methylation of the FGFR2 gene associated wi...,{'contexts': ['This study examined links betwe...,We identified a novel biologically plausible c...,yes
4,25432519,Do tumor-infiltrating immune cell profiles and...,{'contexts': ['Tumor microenvironment immunity...,Breast cancer immune cell subpopulation profil...,yes


In [None]:
def preprocess_df(df):
    #drop pubid
    df = df.drop(columns=['pubid'])

    #Keep only relevant field in the 'context' dict (and string)
    df['context'] = df['context'].apply(
        lambda x: ' '.join(x['contexts']) if isinstance(x, dict) and 'contexts' in x else str(x)
    )
    #Extract 1k rows with 'final_decision' == 'yes' and 1k rows with 'no', then concatenate and shuffle them
    df_yes = df[df['final_decision'] == 'yes'].sample(n=1000, random_state=42)
    df_no = df[df['final_decision'] == 'no'].sample(n=1000, random_state=42)
    df = pd.concat([df_yes, df_no])
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)

    df['question'] = df['question'].apply(lambda x: str(x))
    #Combine 'final_decision' and 'long_answer' as 'answer'
    df['answer'] = df['final_decision'] + ', ' + df['long_answer']
    df['answer'] = df['answer'].apply(lambda x: str(x))
    #Drop long_answer ('final_decision' is kept as a separate label column)
    df = df.drop(columns=['long_answer'])

    return df

In [None]:
dataset = preprocess_df(dataset)
print(dataset.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   question        2000 non-null   object
 1   context         2000 non-null   object
 2   final_decision  2000 non-null   object
 3   answer          2000 non-null   object
dtypes: object(4)
memory usage: 62.6+ KB
None


In [None]:
def formatting_prompts_func(row):
    return (
        f"### Instruction:\n"
        f"You are a medical expert. Based on the following context, answer the question with 'Yes' or 'No', followed by a clear and accurate explanation.\n"
        f"### Context:\n{row['context']}\n"
        f"### Question:\n{row['question']}\n"
        f"### Answer:\n{row['answer']}"
    )

We chose this specific prompt format, based on the [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#data-release) style, because it aligns well with how LLaMA2 models are typically fine-tuned. The structured `### Instruction`, `### Context`, `### Question`, and `### Answer` sections provide clear task framing, which should help the model better understand its role and expected output.<br>
As the prompt shows, we ask the model for both the final decision (`final_decision`: “yes” or “no”) and an explanation (`long_answer`). When we asked only for an explanation, the model often gave vague or general responses and did not clearly answer the question. Asking for a direct “yes” or “no” first keeps the model focused and yields clearer, more useful answers.

In [None]:
dataset['text'] = dataset.apply(formatting_prompts_func, axis=1)
dataset.head(2)

Unnamed: 0,question,context,final_decision,answer,text
0,"Does tyrosinemia I , a model for human disease...",Medical treatment of tyrosinemia I relies on t...,no,"no, Normalization of hepatic collagen formatio...",### Instruction:\nYou are a medical expert. Ba...
1,Is primary sclerosing cholangitis associated w...,Cigarette smoking is thought to protect agains...,yes,"yes, The odds of having primary sclerosing cho...",### Instruction:\nYou are a medical expert. Ba...


The 'text' column now holds the training prompt used to fine-tune the model.

In [None]:
train_dataset, eval_dataset = train_test_split(dataset, test_size=0.2, random_state=42)
len(train_dataset), len(eval_dataset)

(1600, 400)

In [None]:
# Convert pandas DataFrames to Hugging Face Dataset objects
train_dataset_hf = Dataset.from_pandas(train_dataset)
eval_dataset_hf = Dataset.from_pandas(eval_dataset)

print(train_dataset_hf)
print(eval_dataset_hf)

Dataset({
    features: ['question', 'context', 'final_decision', 'answer', 'text', '__index_level_0__'],
    num_rows: 1600
})
Dataset({
    features: ['question', 'context', 'final_decision', 'answer', 'text', '__index_level_0__'],
    num_rows: 400
})


All preprocessing steps are done, so we can continue.



---



## Load the base model

To access the official LLaMA models from Meta, you need a personal access token, which can be generated at:<br>
https://huggingface.co/

In [None]:
#A personal token is needed to access the model
my_token = "..."
login(token=my_token)

We load the model from Hugging Face with weight quantization to reduce GPU memory usage. In simple terms, quantization means storing the model's 16- or 32-bit floating-point weights as smaller 4-bit values (here in the NF4 format), so the model needs far less memory. Computations are still carried out in float16 (`bnb_4bit_compute_dtype`).
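
As a rough back-of-the-envelope check on why this matters, the sketch below estimates the memory needed just to hold the weights at different precisions. The parameter count (about 6.74 billion for LLaMA2-7B) is an approximation assumed here, and the estimate ignores quantization constants, activations, and optimizer state.

```python
# Approximate weight-only memory footprint at different precisions.
# ASSUMPTION: ~6.74e9 parameters for LLaMA2-7B (approximate figure).
N_PARAMS = 6.74e9


def weight_memory_gib(n_params, bits_per_weight):
    """Return the memory needed to store the weights, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3


for name, bits in [("float32", 32), ("float16", 16), ("4-bit (NF4)", 4)]:
    print(f"{name:<12} ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights take one eighth of the float32 footprint, which is what makes a 7B model fit on a single consumer or Colab GPU.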

In [None]:
def load_model_and_tokenizer(model_name):

  quantization_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_compute_dtype=torch.float16,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_use_double_quant=False,
  )

  device_index = Accelerator().process_index
  device_map = {"": device_index}

  #Import the model from hf with quantization config
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
      torch_dtype=torch.float16,
      device_map=device_map,
      quantization_config=quantization_config
  )

  #Import also LLaMa tokenizer
  tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
  tokenizer.pad_token = tokenizer.eos_token
  tokenizer.padding_side = "right"
  return model, tokenizer

In [None]:
base_model, base_tokenizer = load_model_and_tokenizer(model_name)

## Fine-tuning the base model on PubMedQA


To better manage GPU usage, we use `LoRA` (Low-Rank Adaptation).
LoRA is a parameter-efficient fine-tuning (`PEFT`) technique that freezes the original model's weights and trains only small low-rank adapter matrices added to selected layers. This greatly reduces the computational and memory requirements.<br>
Here's the LoRA configuration:

In [None]:
lora_configs = LoraConfig(
    lora_alpha=32,
    lora_dropout=0.1,
    r=16,
    bias='none',
    task_type='CAUSAL_LM',
)
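
To get a feel for how small the trainable part is with `r=16`, here is a minimal sketch that counts the LoRA parameters. It assumes the adapters are attached only to `q_proj` and `v_proj` in each of the 32 decoder layers and that both projections are 4096 x 4096; since `target_modules` is not set in the configuration above, which modules are actually adapted depends on the PEFT defaults, so treat the module list as an assumption.

```python
# Rough count of trainable LoRA parameters for the configuration above.
# ASSUMPTIONS: adapters on q_proj and v_proj only, in each of 32 decoder
# layers, and both projections are 4096 x 4096.
R = 16            # LoRA rank (r)
LORA_ALPHA = 32   # scaling numerator (lora_alpha)
D_IN = D_OUT = 4096
N_LAYERS = 32
ADAPTED_MODULES_PER_LAYER = 2  # hypothetical: q_proj and v_proj


def lora_params(d_in, d_out, r):
    """A has shape (r, d_in) and B has shape (d_out, r); only these train."""
    return r * d_in + d_out * r


per_module = lora_params(D_IN, D_OUT, R)
total = per_module * ADAPTED_MODULES_PER_LAYER * N_LAYERS
frozen = D_IN * D_OUT  # parameters in one frozen projection matrix

print(f"LoRA params per adapted matrix: {per_module:,} (vs {frozen:,} frozen)")
print(f"Total trainable LoRA params:    {total:,}")
print(f"Update scaling alpha / r:       {LORA_ALPHA / R}")
```

Under these assumptions, only about 8.4 million parameters are trained, a small fraction of the roughly 7 billion in the base model.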

Now that the model is ready, we still need to define the trainer.<br>
We'll use `SFTTrainer` from the `trl` library, as it integrates well with LoRA through its `peft_config` argument.

In [None]:
#Trainer configuration
training_args = SFTConfig(
    output_dir="/content/drive/MyDrive/NLP/Assignment/llama2-pubmedqa-a",
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    run_name="llama2-big-run",
    max_seq_length=512,
    report_to=[],
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=5e-5,
    num_train_epochs=3,
    bf16=True,
    packing=False,
    save_total_limit=2,
    dataset_text_field='text',
    metric_for_best_model="eval_loss",
)

trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset_hf,
    eval_dataset=eval_dataset_hf,
    args=training_args,
    peft_config=lora_configs,
    formatting_func=None,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)


We can finally start the training phase.<br>
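Training runs for 3 epochs with a per-device batch size of 4 and 4 gradient-accumulation steps (an effective batch size of 16 per device), at a learning rate of 5e-5. The model is evaluated every 50 steps, and early stopping (with a patience of 2 evaluations) is enabled to limit overfitting.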

In [None]:
trainer.train()
trainer.save_model(save_path)

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss,Validation Loss
50,1.3739,1.250758
100,1.2418,1.222652
150,1.2186,1.219042
200,1.2324,1.217216
250,1.2207,1.216514
300,1.2178,1.216211


The validation loss reaches a plateau: it barely changes after step 150 (1.2190 at step 150 vs. 1.2162 at step 300).
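
The number of logged steps follows from the training configuration. The short calculation below is a sketch that assumes a single GPU (the device count is not stated in the notebook); the other values come from the train/eval split and the `SFTConfig` above.

```python
import math

# Values taken from the split and the SFTConfig above.
n_train = 1600
per_device_batch = 4
grad_accum = 4
epochs = 3
n_devices = 1  # ASSUMPTION: a single GPU

effective_batch = per_device_batch * grad_accum * n_devices
steps_per_epoch = math.ceil(n_train / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 16 100 300
```

The result is consistent with the last logged step (300), so training ran for the full 3 epochs and early stopping did not trigger.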

## Store the tuned model

Now that we have the fine-tuned model we can store it, then load it and perform inference on it to see if it performs better!

In [None]:
# Load base model (not quantized)
fresh_model = AutoModelForCausalLM.from_pretrained(model_name)

# Load adapter on top of base
peft_model = PeftModel.from_pretrained(fresh_model, save_path)

In [None]:
#Merge the adapter and the original LLaMa model
merged_model = peft_model.merge_and_unload()
#Save the fine-tuned model (and tokenizer)
merged_model.save_pretrained(merged_path)
base_tokenizer.save_pretrained(merged_path)

('/content/drive/MyDrive/NLP/Assignment/final-model/tokenizer_config.json',
 '/content/drive/MyDrive/NLP/Assignment/final-model/special_tokens_map.json',
 '/content/drive/MyDrive/NLP/Assignment/final-model/tokenizer.model',
 '/content/drive/MyDrive/NLP/Assignment/final-model/added_tokens.json')

In [None]:
#Delete the fresh base model that was reloaded only to store the tuned version.
#The original base model with quantization configuration is still under the name base_model
del fresh_model

## Load tuned model

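Now `merged_path` contains the full fine-tuned model. The cell below loads the tuned model from the Hugging Face Hub repository `NMantegazza/PubMedLLaMa`, using the same loader (and quantization settings) as for the base model.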

In [None]:
tuned_model, tokenizer = load_model_and_tokenizer("NMantegazza/PubMedLLaMa")

## Evaluation of the two models

We'll compare the standard LLaMA2-7B model and the fine-tuned version on the PubMedQA dataset using several evaluation metrics:
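  - Text-similarity metrics:
    - ROUGE (n-gram overlap with the reference)
    - BERTScore (precision, recall, and F1 computed on contextual embeddings)
  - Classification metrics (precision, recall, F1, and accuracy on the extracted "yes"/"no" label)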

These metrics will give us a well-rounded view of how both models perform.<br>
For the ROUGE and BERTScore evaluation, we'll use the Hugging Face `evaluate` library.
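
To make the ROUGE idea concrete, here is a simplified sketch of ROUGE-1 as an F1 score over overlapping unigrams. It is for illustration only: the function name and the example strings are hypothetical, and the metric loaded through `evaluate` applies its own tokenization and options, so its values will differ from this toy version.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over overlapping lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical strings, for illustration only.
score = rouge1_f1("yes, smoking raises the risk", "yes, smoking increases the risk")
print(f"{score:.2f}")
```

Four of the five tokens match in this example, so precision and recall are both 0.8. A paraphrase with the same meaning but different wording would score low here, which is why BERTScore is reported alongside ROUGE.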


###Testing setup

In [None]:
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

To compute the classification report, we first need to extract the predicted labels from the generated text outputs, then filter out any invalid or unlabelled samples.

In [None]:
def extract_label(text):
    """
    Extracts 'yes' or 'no' from the start of a given answer string.
    Converts to lowercase and strips whitespace.
    Returns None if neither is found.
    """
    text = text.strip().lower()
    if text.startswith("yes"):
        return "yes"
    elif text.startswith("no"):
        return "no"
    return None

def compute_class_report(preds, refs):
    pred_labels = [extract_label(p) for p in preds]
    ref_labels = [extract_label(r) for r in refs]

    # Filter out any samples where label extraction failed
    valid_indices = [i for i, (p, r) in enumerate(zip(pred_labels, ref_labels)) if p in {"yes", "no"} and r in {"yes", "no"}]
    discarded_answers = len(preds) - len(valid_indices)
    filtered_preds = [pred_labels[i] for i in valid_indices]
    filtered_refs = [ref_labels[i] for i in valid_indices]

    if not filtered_preds:
        return {"classification_accuracy": None}

    report_dict = classification_report(filtered_refs, filtered_preds, output_dict=True)

    return {
    "classification_report": report_dict,
    "number_of_answers_without_labels": discarded_answers
    }
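
One caveat with `extract_label`: `startswith("no")` also matches answers that begin with words such as "Normal" or "Not". A stricter variant is sketched below as a hypothetical alternative (`extract_label_strict` is not part of this notebook and was not used to produce the reported results); it only accepts "yes" or "no" as a whole word at the start of the answer.

```python
import re

# Match "yes" or "no" as a whole word at the start, ignoring leading whitespace.
_LABEL_RE = re.compile(r"\s*(yes|no)\b", flags=re.IGNORECASE)


def extract_label_strict(text):
    """Return 'yes' or 'no' if the answer starts with that word, else None."""
    match = _LABEL_RE.match(text)
    return match.group(1).lower() if match else None


# Hypothetical answers, for illustration only.
for answer in ["Yes. It is a predictor.", "no, it is not.", "Normal values were found.", "### Explanation:"]:
    print(repr(answer), "->", extract_label_strict(answer))
```

With this variant, an answer starting with "Normal" is treated as unlabelled instead of being counted as "no".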

In [None]:
def evaluate_all_metrics(predictions, references):
    results = {}

    # ROUGE
    rouge_result = rouge.compute(predictions=predictions, references=references)
    results.update(rouge_result)

    # BERTScore
    bert_result = bertscore.compute(predictions=predictions, references=references, lang="en")
    bert_means = {f"bertscore_{k}": np.mean(v) for k, v in bert_result.items() if k != "hashcode"}
    results.update(bert_means)

    #Classification report
    accuracy = compute_class_report(predictions, references)
    results.update(accuracy)

    return results

Now that the evaluation function is set up, we can define our inference function.<br>
This function generates model responses for our PubMedQA dataset. For each question, it builds a prompt with the same instruction-style structure used in the fine-tuning step, leaving the answer empty. It uses a text-generation `pipeline` to produce the answers, collects the predictions, reference answers, and questions, and returns them.

In [None]:
def new_formatting_prompts_func(new_example, examples=()):
    instruction = (
        "### Instruction:\n"
        "You are a medical expert. Based on the following context, answer the question with 'Yes' or 'No', followed by a clear and accurate explanation.\n"
    )
    body = ""

    for i, ex in enumerate(examples, 1):
        body += (
            f"\n### Example {i}:\n"
            f"### Context:\n{ex['context']}\n"
            f"### Question:\n{ex['question']}\n"
            f"### Answer:\n{ex['answer']}\n\n"
        )

    # Append the test instance
    body += (
        f"\n###End of examples\n"
        f"### Now answer the following:\n"
        f"### Context:\n{new_example['context']}\n"
        f"### Question:\n{new_example['question']}\n"
        f"### Answer:\n"
    )
    #print("Prompt: ", instruction+body)

    return instruction + body

In [None]:
from IPython.display import clear_output

def generate_responses(dataset, model, tokenizer, examples=()):
    preds = []
    refs = []
    questions_out = []

    questions = dataset["question"].tolist()
    contexts = dataset["context"].tolist() if use_context else [""] * len(dataset)
    references = dataset["answer"].apply(str.strip).tolist()

    generation_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        device_map="auto"
    )

    for i, (q, c, ref) in enumerate(zip(questions, contexts, references), 1):
        print(f"Processing example {i}/{len(questions)}...")

        if len(examples) != 0:
            prompt = new_formatting_prompts_func({"context": c, "question": q}, examples)
        else:
            prompt = (
                "### Instruction:\n"
                f"You are a medical expert. Based on the following context, answer the question with 'Yes' or 'No', followed by a clear and accurate explanation.\n"
                f"### Context:{c}\n"
                f"### Question:{q}\n"
                "### Answer:"
            )

        max_new_tokens = 128

        response = generation_pipeline(
            prompt,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id
        )[0]['generated_text']

        answer = response[len(prompt):].strip()
        preds.append(answer)
        refs.append(ref)
        questions_out.append(q)
        torch.cuda.empty_cache()

        clear_output(wait=True)  # Clear stdout after each iteration

    return questions_out, preds, refs

In [None]:
#Pretty print the evaluation results
def print_metrics(metrics: dict, title: str = "Evaluation Metrics"):
    def print_line(char='-', width=60):
        print(char * width)

    print(f"\n=== {title} ===")
    print(f"{'Metric':<40} {'Value':>15}")
    print_line()

    for k, v in metrics.items():
        if isinstance(v, dict) and k == "classification_report":
            print(f"{k}:")
            print_classification_report(v)
        elif isinstance(v, (float, int)):
            print(f"{k:<40} {v:>15.4f}")
        else:
            print(f"{k:<40} {str(v):>15}")
    print_line()
    torch.cuda.empty_cache()


def print_classification_report(report: dict):
    labels = [label for label in report if label not in ("accuracy", "macro avg", "weighted avg")]
    specials = [k for k in ("accuracy", "macro avg", "weighted avg") if k in report]

    header = f"{'Label':<10} {'Precision':>10} {'Recall':>10} {'F1-Score':>10} {'Support':>10}"
    print(header)
    print("-" * len(header))

    def print_row(label, scores):
        precision = scores.get("precision", 0.0)
        recall = scores.get("recall", 0.0)
        f1 = scores.get("f1-score", 0.0)
        support = scores.get("support", 0.0)
        print(f"{label:<10} {precision:>10.4f} {recall:>10.4f} {f1:>10.4f} {support:>10.0f}")

    for label in labels + specials:
        if label == "accuracy":
            print(f"\n{'Accuracy':<10} {report[label]:>10.4f}")
        else:
            print_row(label, report[label])

### Test the two models on PubMedQA evaluation set

In [None]:
#Put both models in eval mode (disables dropout) before running inference
base_model.eval()
tuned_model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096

Due to time constraints, we'll downsample the evaluation dataset to 100 samples, which should still give us a good insight into the performance of both models.

In [None]:
#down sample the eval set again
eval_dataset_2 = eval_dataset.sample(n=100, random_state=42).reset_index(drop=True)

In [None]:
b_quests, b_preds, b_refs = generate_responses(eval_dataset_2, base_model, base_tokenizer)
base_metrics = evaluate_all_metrics(b_preds, b_refs)

In [None]:
#print results
print_metrics(base_metrics, "Base Model Evaluation Metrics")


=== Base Model Evaluation Metrics ===
Metric                                             Value
------------------------------------------------------------
rouge1                                            0.0984
rouge2                                            0.0386
rougeL                                            0.0756
rougeLsum                                         0.0770
bertscore_precision                               0.7689
bertscore_recall                                  0.8440
bertscore_f1                                      0.8039
classification_report:
Label       Precision     Recall   F1-Score    Support
------------------------------------------------------
no             0.0000     0.0000     0.0000         29
yes            0.5735     1.0000     0.7290         39

Accuracy       0.5735
macro avg      0.2868     0.5000     0.3645         68
weighted avg     0.3289     0.5735     0.4181         68
number_of_answers_without_labels                 32.0000
---------



---
The base model's outputs are semantically related to the references (BERTScore F1 of 0.80), but they share few exact words with them (ROUGE-1 below 0.10).

For classification, the model predicts "yes" for every answer from which a label could be extracted. It gets all the "yes" answers right but misses every "no", which brings the accuracy on those 68 samples to about 57%.

Also, 32 of the 100 answers were discarded because no "yes"/"no" label could be extracted, an indication that the model struggled to follow the prompt format.


To get a stronger baseline, we can use one-shot/few-shot prompting: we modify the prompt to show the model one or more example question/answer pairs before the question to answer.

In [None]:
#start with one-shot

#sample one from the train_set
example_list = train_dataset.sample(n=1, random_state=42).to_dict(orient='records')

one_shot_quests, one_shot_preds, one_shot_refs = generate_responses(eval_dataset_2, base_model, base_tokenizer,example_list)
one_shot_metrics = evaluate_all_metrics(one_shot_preds, one_shot_refs)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
print_metrics(one_shot_metrics, "One shot Model Evaluation Metrics")


=== One shot Model Evaluation Metrics ===
Metric                                             Value
------------------------------------------------------------
rouge1                                            0.3290
rouge2                                            0.1364
rougeL                                            0.2625
rougeLsum                                         0.2636
bertscore_precision                               0.9073
bertscore_recall                                  0.8867
bertscore_f1                                      0.8966
classification_report:
Label       Precision     Recall   F1-Score    Support
------------------------------------------------------
no             1.0000     0.3137     0.4776         51
yes            0.5833     1.0000     0.7368         49

Accuracy       0.6500
macro avg      0.7917     0.6569     0.6072        100
weighted avg     0.7958     0.6500     0.6046        100
number_of_answers_without_labels                  0.0000
-----



---



In [None]:
#few shot
examples = train_dataset.sample(n=5, random_state=42).to_dict(orient='records')

few_shot_quests, few_shot_preds, few_shot_refs = generate_responses(eval_dataset_2, base_model, base_tokenizer, examples)
few_shot_metrics = evaluate_all_metrics(few_shot_preds, few_shot_refs)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Processing example 100/100...


In [None]:
print_metrics(few_shot_metrics, "Few shot Model Evaluation Metrics")


=== Few shot Model Evaluation Metrics ===
Metric                                             Value
------------------------------------------------------------
rouge1                                            0.3719
rouge2                                            0.1591
rougeL                                            0.3040
rougeLsum                                         0.3063
bertscore_precision                               0.9021
bertscore_recall                                  0.8892
bertscore_f1                                      0.8954
classification_report:
Label       Precision     Recall   F1-Score    Support
------------------------------------------------------
no             0.9149     0.8431     0.8776         51
yes            0.8491     0.9184     0.8824         49

Accuracy       0.8800
macro avg      0.8820     0.8808     0.8800        100
weighted avg     0.8826     0.8800     0.8799        100
number_of_answers_without_labels                  0.0000
-----

In [None]:
#test the tuned model
t_quests, t_preds, t_refs = generate_responses(eval_dataset_2, tuned_model, tokenizer)
tuned_metrics = evaluate_all_metrics(t_preds, t_refs)

In [None]:
#Print results
print_metrics(tuned_metrics, "Tuned Model Evaluation Metrics")


=== Tuned Model Evaluation Metrics ===
Metric                                             Value
------------------------------------------------------------
rouge1                                            0.4060
rouge2                                            0.1917
rougeL                                            0.3269
rougeLsum                                         0.3275
bertscore_precision                               0.9165
bertscore_recall                                  0.8970
bertscore_f1                                      0.9064
classification_report:
Label       Precision     Recall   F1-Score    Support
------------------------------------------------------
no             0.7656     0.9608     0.8522         51
yes            0.9444     0.6939     0.8000         49

Accuracy       0.8300
macro avg      0.8550     0.8273     0.8261        100
weighted avg     0.8532     0.8300     0.8266        100
number_of_answers_without_labels                  0.0000
--------



---
The tuned model shows a big improvement over the zero-shot base model. ROUGE scores are much higher (ROUGE-1 rises from 0.10 to 0.41), meaning its outputs are more similar to the reference texts. BERTScore F1 also increases, from 0.80 to 0.91, indicating that it captures the meaning well.

For classification, the model performs well on both "yes" and "no" labels. It identifies "no" answers with high recall (96%) and its "yes" predictions are precise (94%), although recall on "yes" is lower (69%). Overall accuracy is 83%, above the one-shot prompt (65%) but slightly below the few-shot prompt (88%). The tuned model does, however, obtain the highest ROUGE and BERTScore values of the four setups.

Also, no predictions were discarded this time: all responses started with a valid label. Fine-tuning clearly made the model more reliable at following the expected answer format.
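
The per-class figures in the tuned-model report can be cross-checked by reconstructing the confusion counts they imply. This is a small sketch that uses only the support and recall values printed above and relies on the fact that no answers were discarded, so every missed "yes" was predicted as "no" and vice versa.

```python
# Reconstruct the confusion counts implied by the tuned-model report above.
support = {"no": 51, "yes": 49}
recall = {"no": 0.9608, "yes": 0.6939}

# Correct predictions per class = recall * support (rounded to whole samples).
correct = {label: round(recall[label] * support[label]) for label in support}
missed = {label: support[label] - correct[label] for label in support}

# A missed "yes" was predicted as "no", and a missed "no" as "yes".
precision_no = correct["no"] / (correct["no"] + missed["yes"])
precision_yes = correct["yes"] / (correct["yes"] + missed["no"])
accuracy = sum(correct.values()) / sum(support.values())

print(correct, missed)
print(f"precision(no)={precision_no:.4f} precision(yes)={precision_yes:.4f} accuracy={accuracy:.4f}")
```

The reconstructed counts (49 of 51 "no" and 34 of 49 "yes" answered correctly) reproduce the reported precisions and the 83% accuracy, and show that most errors are "yes" questions answered with "no".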


### Human testing on some PubMedQA questions

To further compare the two models, we can perform human evaluation on their answers for a small set of samples.

In [None]:
example_list = train_dataset.sample(n=1, random_state=42).to_dict(orient='records')
examples = train_dataset.sample(n=5, random_state=42).to_dict(orient='records')


# Sample a small subset for evaluation
sample_set = eval_dataset.sample(n=5, random_state=42).reset_index(drop=True)
# fine-tuned predictions
ft_quests, ft_preds, ft_refs = generate_responses(sample_set, tuned_model, tokenizer)
#one_shot predictions
one_shot_quests, one_shot_preds, one_shot_refs = generate_responses(sample_set, base_model,base_tokenizer,example_list)
#few_shot predictions
few_shot_quests, few_shot_preds, few_shot_refs = generate_responses(sample_set, base_model,base_tokenizer,examples)
# base model predictions
base_quests, base_preds, base_refs = generate_responses(sample_set, base_model,base_tokenizer)
# Build a DataFrame to compare results
compare_df = pd.DataFrame({
    "Question": base_quests,
    "Fine-tuned Prediction": ft_preds,
    "One-shot Prediction": one_shot_preds,
    "Few-shot Prediction": few_shot_preds,
    "Base Model Prediction": base_preds,
    "Reference": base_refs
})

In [None]:
compare_df

Unnamed: 0,Question,Fine-tuned Prediction,One-shot Prediction,Few-shot Prediction,Base Model Prediction,Reference
0,Does perineural invasion on prostate needle bi...,"no, Perineural invasion does not appear to be ...","no, Perineural invasion does not appear to be ...","no, The presence of PNI on prostate needle bio...",Yes. Perineural invasion is an independent pre...,"no, Perineural invasion is not a significant p..."
1,Is genetic heterogeneity of surgically resecte...,"no, Genetic heterogeneity is not related to hi...","yes, Genetic heterogeneity of surgically resec...","no, The results suggest that genetic heterogen...",Yes. Genetic heterogeneity is related to histo...,"yes, Prostate carcinoma is a genetically multi..."
2,Is chromogranin A a potential prognostic marke...,"yes, Our results suggest that CgA is an indepe...","yes, Chromogranin A is a potential prognostic ...","yes, Chromogranin A may be considered as a pot...",Yes. Chromogranin A is a potential prognostic ...,"yes, In CRPC patients treated with enzalutamid..."
3,Is [ Percentage of local recurrence following ...,"no, The percentage of local recurrences follow...","yes, The percentage of local recurrences follo...","no, The percentage of local recurrences follow...",Yes\n### Explanation:\nThe performance indicat...,"no, The percentage of local recurrences follow..."
4,Does [ Pre-operative smoking cessation always ...,"no, Pre-operative smoking cessation does not a...","yes, Pre-operative smoking cessation does not ...","no, Pre-operative smoking cessation does not a...",### Explanation:,"no, Pre-operative smoking cessation does not r..."



The fine-tuned model's answers are clearly closer to the reference answers, while the base model often falls into repetitive text loops and struggles to format its responses properly, frequently failing to follow the prompt's `###` structure.