<a href="https://colab.research.google.com/github/milieureka/Fine-Tuning-LLMs-for-Enterprise-Applications/blob/main/Project-1/Finetune_Falcon_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocess

## 1. Install Dependencies




`accelerate`, `peft`, `transformers`, `datasets` and TRL to leverage the recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer).

We will use `bitsandbytes` to [quantize the base model into 4bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

We will also install `einops` as it is a requirement to load Falcon models.

In [1]:
!pip install --quiet transformers torch bitsandbytes accelerate datasets peft trl huggingface_hub nltk rouge_score meteor bert-score

In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: fineGrained).
The token `finetuneFalcon-7B` has been saved to /root/.cache/huggingface/stored_tokens
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have 

In [None]:
import torch
torch.cuda.is_available()

True

Load Falcon-7B

In [3]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, Trainer, BitsAndBytesConfig, TrainingArguments
from google.colab import drive
import torch
import nltk

from datasets import load_dataset, concatenate_datasets, Dataset
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score
from bert_score import score as bertscore_score
import pandas as pd
import numpy as np
from tqdm import tqdm

In [4]:
model_name = "tiiuae/falcon-7b"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quant_config
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Test
output = pipe("What is a aspirin?", max_new_tokens=50)
print(output[0]["generated_text"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/287 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.73M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/281 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/17.7k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.48G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/117 [00:00<?, ?B/s]

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


What is a aspirin?
Aspirin is a non-steroidal anti-inflammatory drug (NSAID) that is used to treat pain and inflammation. It is also used to treat fever and reduce the risk of heart attack and stroke.
Aspirin


In [7]:
# Ensure pad_token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

In [8]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

# Prompting

In [None]:
# Define prompt templates.
prompt_template = "Medical Question: {question}\nMedical Answer: "

## User interactive promt

In [None]:
print("Interactive prompt mode. Type your prompt and press Enter (type 'quit' to exit):")
print("Using tuned hyperparameters: max_new_tokens=200, temperature=1.0, top_p=1.0, do_sample=False")

# Loop to continuously accept user prompts
while True:
    prompt_text = input("\nEnter your prompt: ")
    if prompt_text.lower() == "quit":
        print("Exiting interactive prompt mode.")
        break

    # Generate the answer using the tuned hyperparameters
    output = pipe(prompt_text, max_new_tokens=200, temperature=1.0, top_p=1.0, do_sample=True)[0]['generated_text']

    # Optionally, remove the prompt from the generated output
    generated_answer = output.replace(prompt_text, "").strip()

    print("\nGenerated Answer:")
    print(generated_answer)

Interactive prompt mode. Type your prompt and press Enter (type 'quit' to exit):
Using tuned hyperparameters: max_new_tokens=200, temperature=1.0, top_p=1.0, do_sample=False

Enter your prompt: can headache use aspirin?


Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.



Generated Answer:
Cure, prevention, causes of — headache — headaches can occur in case of overdose of any medication. Painkillers, antidepressants, sedates all cause headaches. In case of a migraine attack, you can’t use aspirin. These tips for headaches — a useful remedy:
— Apply a cold compress to the head, neck, and spine.
— Use a shower with a low pressure of cold water, hot water. The second option will have a negative effect, as the body needs to use more heat, that is, it takes energy from the body.
— Apply a bandage with mineral oils. For this purpose, it is advisable to use cotton gauze soaked with mineral oil or other mineral oil to relieve the pressure on the head and the neck.
— Eat less meat (meat is difficult to digest and can give more head pain). Also, you can refuse meat products.
— Eat less salty, refined food, which cause stagn

Enter your prompt: what medicine use for stomachaches?


Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.



Generated Answer:
I got a stomachache due to gastritis "which is a medical term meaning that the lining of the stomach is inflamed. This condition is most common when the stomach lining produces too- much acid and...
Read More »
My Stomach Hurts What Can I Use To Fix It!?!??!?!!?!?!?
Take some Alieve, and drink something hot. Your stomach will feel better.
I need some help for my tooth pain, I'm also on medicine, what else can I use?
It's ok, but I've read that it might not be the best idea. The best idea is to see a doctor and get it extracted.
What medicine are best for a stomachaches?
I recommend an adult dose of 500 mg of Tylenol every four hours. Try and get some rest, and don't eat anything until this goes away.
What medicine is best for a stomachache?
Try Tylen

Enter your prompt: quit
Exiting interactive prompt mode.


#Hyper-parameter tuning

## Load the groud truth file

In [9]:
import json
#Load Ground Truth file: a list of {"question": ..., "answer": ...} entries.
with open("/content/drive/MyDrive/grouth_truth.json", "r") as f:
    ground_truth = json.load(f)

In [None]:
#Print few entry of the ground truth file
for entry in ground_truth:
    print("Question:", entry.get("question"))
    print("Answer:", entry.get("answer"))
    print("-" * 10)

Question: What are common causes of headaches?
Answer: Common causes include stress, dehydration, poor posture, lack of sleep, and visual strain.
----------
Question: How can you differentiate between a tension headache and a migraine?
Answer: Tension headaches usually cause mild to moderate diffuse pain around the head. Migraines are often more severe, with pulsating pain on one side of the head, and can include nausea and light sensitivity.
----------
Question: What home remedies are effective for relieving headaches?
Answer: Effective remedies include hydration, rest, applying a cold or warm compress to the head, and practicing stress-reduction techniques such as meditation or gentle stretching.
----------
Question: What are typical causes of lower back pain?
Answer: Common causes include muscle or ligament strain, poor posture, heavy lifting, and underlying medical conditions like arthritis or disk disease.
----------
Question: Describe simple stretches to alleviate upper back pain

## Evaluation with both default model and tuning model

In [28]:
#Define metric calculation
def compute_metrics(reference: str, candidate: str):
    from nltk.translate.bleu_score import SmoothingFunction
    smooth = SmoothingFunction().method1

    # Tokenize the strings using whitespace (or a more advanced tokenizer)
    ref_tokens = reference.split()
    cand_tokens = candidate.split()

    # BLEU Score with smoothing
    bleu = sentence_bleu([ref_tokens], cand_tokens, smoothing_function=smooth)

    # ROUGE Scores (using rouge_score package)
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    rouge_scores = scorer.score(reference, candidate)
    rouge1 = rouge_scores['rouge1'].fmeasure
    rougeL = rouge_scores['rougeL'].fmeasure

    # METEOR Score (expects tokenized inputs)
    meteor = meteor_score([reference.split()], candidate.split())

    # BERTScore (F1)
    P, R, F1 = bertscore_score([candidate], [reference], lang='en')
    bert_f1 = F1.item()

    return {
        'BLEU': bleu,
        'ROUGE-1': rouge1,
        'ROUGE-L': rougeL,
        'METEOR': meteor,
        'BERTScore': bert_f1
    }

In [29]:
# Define the core evaluation function that accepts hyperparameters
def evaluate_model(pipe, ground_truth, prompt_template,
                   max_new_tokens=100, temperature=1.0, top_p=1.0, do_sample=False):
    results = []

    for entry in tqdm(ground_truth, desc="Evaluating"):
        question = entry["question"]
        reference = entry["answer"]

        # Format the prompt with the question
        prompt = prompt_template.format(question=question)

        # Generate answer with the specified hyperparameters
        output = pipe(prompt, max_new_tokens=max_new_tokens,
                      temperature=temperature, top_p=top_p, do_sample=do_sample)[0]['generated_text']

        # Optionally, remove the prompt from the generated text
        generated_answer = output.replace(prompt, "").strip()

        # Compute metrics (assuming compute_metrics is defined elsewhere)
        metrics = compute_metrics(reference, generated_answer)

        results.append({
            "question": question,
            "reference": reference,
            "generated_answer": generated_answer,
            **metrics
        })

    # Aggregate the metrics over all examples
    aggregated = {
        'BLEU': sum(r['BLEU'] for r in results) / len(results),
        'ROUGE-1': sum(r['ROUGE-1'] for r in results) / len(results),
        'ROUGE-L': sum(r['ROUGE-L'] for r in results) / len(results),
        'METEOR': sum(r['METEOR'] for r in results) / len(results),
        'BERTScore': sum(r['BERTScore'] for r in results) / len(results)
    }
    return aggregated, results

In [None]:
# Define Evaluation Function Without Tuning (Default)
def evaluate_model_default(pipe, ground_truth, prompt_template):
    default_config = {
        "max_new_tokens": 100,
        "temperature": 1.0,
        "top_p": 1.0,
        "do_sample": False
    }
    return evaluate_model(pipe, ground_truth, prompt_template, **default_config)

# Evaluate using the default configuration
agg_default, results_default = evaluate_model_default(pipe, ground_truth, prompt_template)
print("Aggregated Metrics for Default Configuration:")
print(agg_default)

[nltk_data] Downloading package wordnet to /root/nltk_data...
Evaluating:   0%|          | 0/166 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating:   1%|          | 1/166 [00:23<1:03:56, 23.25s/it]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating:   1%|          | 2/166 [00:32<41:38, 15.23s/it]  Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRA

Aggregated Metrics for Default Configuration:
{'BLEU': 0.017389635319110943, 'ROUGE-1': 0.18013546320796384, 'ROUGE-L': 0.14128458405745953, 'METEOR': 0.19833343850662977, 'BERTScore': 0.8470873854246485}





In [None]:
# Convert the per-example results (a list of dictionaries) to a DataFrame.
df_results = pd.DataFrame(results_default)
df_results.to_csv("detailed_default_metrics.csv", index=False)


In [None]:
import pandas as pd
import textwrap
from tabulate import tabulate

# Set Pandas display options (adjust width as needed)
pd.set_option('display.width', 120)
pd.set_option('display.max_colwidth', 50)

# Function to wrap text in a cell
def wrap_text(text, width=50):
    return "\n".join(textwrap.wrap(text, width=width)) if isinstance(text, str) else text

# Apply text wrapping to all string columns
for col in df_results.columns:
    if df_results[col].dtype == object:
        df_results[col] = df_results[col].apply(lambda x: wrap_text(x, width=50))

# Print the DataFrame in a tabular format using tabulate
print(tabulate(df_results, headers='keys', tablefmt='psql', showindex=False))


+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+------------+-----------+-----------+-----------+-------------+
| question                                           | reference                                          | generated_answer                                   |       BLEU |   ROUGE-1 |   ROUGE-L |    METEOR |   BERTScore |
|----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+------------+-----------+-----------+-----------+-------------|
| What are common causes of headaches?               | Common causes include stress, dehydration, poor    | "Headaches are a common complaint. They can be     | 0.00336784 | 0.142857  | 0.0952381 | 0.159574  |    0.841061 |
|                                                    | posture, lack of sleep, and visual strain.       

In [None]:
def evaluate_model_tuning(pipe, ground_truth, prompt_template):
    # Define multiple configurations to test
    tuning_configs = [
        {
            "name": "Config1_NearDefault",
            "max_new_tokens": 200,  # Same as default
            "temperature": 0.8,    # Slightly less random than default
            "top_p": 0.95,         # Mild nucleus sampling
            "do_sample": True      # Enable sampling
        },
        {
            "name": "Config2_LowTemp",
            "max_new_tokens": 200,  # Same as default
            "temperature": 1.5,    # More deterministic
            "top_p": 0.9,          # Moderate nucleus sampling
            "do_sample": True      # Enable sampling
        },
        {
            "name": "Config3_LongerOutput",
            "max_new_tokens": 250,  # Slightly longer than default
            "temperature": 0.7,    # Moderate randomness
            "top_p": 1.0,          # Same as original tuning
            "do_sample": True      # Enable sampling
        },
    ]

    # Evaluate each configuration
    all_results = {}
    for config in tuning_configs:
        print(f"Evaluating {config['name']}...")
        aggregated, results = evaluate_model(
            pipe, ground_truth, prompt_template, **{k: v for k, v in config.items() if k != "name"}
        )
        all_results[config['name']] = aggregated
        print(f"Aggregated Metrics for {config['name']}:")
        print(aggregated)

    return all_results

# Run the tuning evaluation
all_tuning_results = evaluate_model_tuning(pipe, ground_truth, prompt_template)

Evaluating Config1_NearDefault...


Evaluating:   0%|          | 0/166 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating:   1%|          | 1/166 [00:18<50:11, 18.25s/it]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating:   1%|          | 2/166 [00:29<38:27, 14.07s/it]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model checkp

Aggregated Metrics for Config1_NearDefault:
{'BLEU': 0.008704440480626487, 'ROUGE-1': 0.14679013008556455, 'ROUGE-L': 0.1095355835352686, 'METEOR': 0.17071727579677745, 'BERTScore': 0.8425537717629628}
Evaluating Config2_LowTemp...


Evaluating:   0%|          | 0/166 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating:   1%|          | 1/166 [00:18<51:48, 18.84s/it]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating:   1%|          | 2/166 [00:30<40:29, 14.82s/it]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model checkp

Aggregated Metrics for Config2_LowTemp:
{'BLEU': 0.0037264044871045815, 'ROUGE-1': 0.11213868123708129, 'ROUGE-L': 0.07990126018736389, 'METEOR': 0.11546376008401427, 'BERTScore': 0.8204474492245409}
Evaluating Config3_LongerOutput...


Evaluating:   0%|          | 0/166 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating:   1%|          | 1/166 [00:23<1:04:11, 23.34s/it]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating:   1%|          | 2/166 [00:38<50:09, 18.35s/it]  Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model ch

Aggregated Metrics for Config3_LongerOutput:
{'BLEU': 0.006913408635780768, 'ROUGE-1': 0.1235083627055363, 'ROUGE-L': 0.09184573378374802, 'METEOR': 0.15587818423006916, 'BERTScore': 0.8338782747825945}
Evaluating Config4_GreedyLonger...


Evaluating:   0%|          | 0/166 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating:   1%|          | 1/166 [00:23<1:05:11, 23.71s/it]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating:   1%|          | 2/166 [00:47<1:04:13, 23.50s/it]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Some weights of RobertaModel were not initialized from the model ch

In [None]:
# Convert the per-example results (a list of dictionaries) to a DataFrame.
tune_results = pd.DataFrame(all_tuning_results)
tune_results.to_csv("all_tuning_results.csv", index=False)

# Apply text wrapping to all string columns
for col in tune_results.columns:
    if tune_results[col].dtype == object:
        tune_results[col] = tune_results[col].apply(lambda x: wrap_text(x, width=50))

# Print the DataFrame in a tabular format using tabulate
print(tabulate(tune_results, headers='keys', tablefmt='psql', showindex=False))


+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+------------+-----------+-----------+-----------+-------------+
| question                                           | reference                                          | generated_answer                                   |       BLEU |   ROUGE-1 |   ROUGE-L |    METEOR |   BERTScore |
|----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+------------+-----------+-----------+-----------+-------------|
| What are common causes of headaches?               | Common causes include stress, dehydration, poor    | "The most common headache is a tension headache.   | 0.00206994 | 0.0875912 | 0.0729927 | 0.155     |    0.842303 |
|                                                    | posture, lack of sleep, and visual strain.       

## Comparison result

The tuning result is slight worst than the default one, might need to seek for more systematic search (grid/ random/ Bayesian search)

In [None]:
import pandas as pd
from tabulate import tabulate
from textwrap import wrap

# Apply text wrapping to string columns for display
for col in tune_results.columns:
    if tune_results[col].dtype == object:
        tune_results[col] = tune_results[col].apply(lambda x: wrap_text(x, width=50))

# Initialize data for the comparison table
metrics = list(agg_default.keys())  # e.g., ['BLEU', 'ROUGE-1', 'ROUGE-L', 'METEOR', 'BERTScore']
data = {
    "Metric": metrics,
    "Default": [agg_default[metric] for metric in metrics]
}

# Add columns for each tuning configuration
for config_name, (agg_metrics, _) in all_tuning_results.items():
    data[config_name] = [agg_metrics[metric] for metric in metrics]

# Create the comparison DataFrame
df_comparison = pd.DataFrame(data)

# Display the comparison table in a neat format
print("Hyperparameter Tuning Results Comparison:")
print(tabulate(df_comparison, headers="keys", tablefmt="psql", showindex=False))

Hyperparameter Tuning Results Comparison

| Configuration        |   BLEU |   ROUGE-1 |   ROUGE-L |   METEOR |   BERTScore | Key Hyperparameters                                                   |
|:---------------------|-------:|----------:|----------:|---------:|------------:|:----------------------------------------------------------------------|
| Default              | 0.0174 |    0.1801 |    0.1413 |   0.1983 |      0.8471 | `do_sample=False` (Greedy), `temp=1.0`, `top_p=1.0`, `max_tokens=200` |
| Config1_NearDefault  | 0.0087 |    0.1468 |    0.1095 |   0.1707 |      0.8426 | `do_sample=True`, `temp=0.8`, `top_p=0.95`                            |
| Config2_LowTemp      | 0.0037 |    0.1121 |    0.0799 |   0.1155 |      0.8204 | `do_sample=True`, `temp=1.5` (High temp!), `top_p=0.9`                |
| Config3_LongerOutput | 0.0069 |    0.1235 |    0.0918 |   0.1559 |      0.8339 | `do_sample=True`, `temp=0.7`, `top_p=1.0`, `max_tokens=250`           |


In [None]:
import math # To initialize scores to negative infinity

# Define the metrics we want to compare (ensure these keys exist in your metrics dicts)
metric_keys = ["BLEU", "ROUGE-1", "ROUGE-L", "METEOR", "BERTScore"]

# --- Find Best Performer for Each Metric ---
best_performers = {} # Dictionary to store: {metric_key: (config_name, score)}
worst_performers = {} # Optional: Track worst for comparison {metric_key: (config_name, score)}

print("--- Metric Comparison (Higher is Better) ---")

for metric in metric_keys:
    best_score = -math.inf # Start with lowest possible score
    best_config_name = None
    worst_score = math.inf  # Start with highest possible score
    worst_config_name = None
    valid_scores_found = False

    for config_data in results_data:
        config_name = config_data["Configuration"]
        score = config_data["metrics"].get(metric) # Use .get() for safety

        # --- IMPORTANT: Check if the score is a valid number ---
        if isinstance(score, (int, float)):
            valid_scores_found = True
            # Check for best score
            if score > best_score:
                best_score = score
                best_config_name = config_name
            # Check for worst score
            if score < worst_score:
                worst_score = score
                worst_config_name = config_name
        else:
            # Handle non-numeric scores (like 'N/A' or None)
            # print(f"Skipping non-numeric score for {metric} in {config_name}: {score}") # Optional debug print
            pass # Silently skip non-numeric scores

    # Store and print the results for the current metric
    if valid_scores_found:
        best_performers[metric] = (best_config_name, best_score)
        worst_performers[metric] = (worst_config_name, worst_score) # Store worst
        print(f"Best for {metric:<10}: {best_config_name:<20} (Score: {best_score:.4f})")
        # Optional: Print worst performer too
        # print(f"Worst for {metric:<9}: {worst_config_name:<20} (Score: {worst_score:.4f})")
    else:
        print(f"Best for {metric:<10}: No valid numeric scores found for comparison.")

# --- Overall Summary (Based on winning the most metrics) ---
win_counts = {}
for config_name, score in best_performers.values():
    # Increment win count for the winning configuration
    win_counts[config_name] = win_counts.get(config_name, 0) + 1

print("\n--- Summary ---")
if not best_performers:
    print("No valid metric results found to compare.")
else:
    # Find the configuration(s) with the most wins
    max_wins = 0
    overall_winners = []
    if win_counts: # Check if win_counts is not empty
         max_wins = max(win_counts.values())
         overall_winners = [name for name, count in win_counts.items() if count == max_wins]

    if len(overall_winners) == 1:
        print(f"Overall Best Performing Configuration (most metric wins):")
        print(f"'{overall_winners[0]}' won {max_wins} out of {len(best_performers)} evaluated metrics.")
    elif len(overall_winners) > 1:
        print(f"Overall Best Performing Configurations (tied for most metric wins):")
        print(f"{', '.join(overall_winners)} each won {max_wins} out of {len(best_performers)} evaluated metrics.")
    else:
         print("Could not determine an overall winner (no wins recorded).")


    # Add the specific observation from your data
    if "Default" in overall_winners and max_wins == len(best_performers):
         print("\nNote: The 'Default' configuration achieved the highest score across all evaluated metrics.")

    # You can also print the detailed win counts:
    if win_counts:
        print("\nWin counts per configuration:")
        # Sort by win count descending
        for name, count in sorted(win_counts.items(), key=lambda item: item[1], reverse=True):
             print(f"- {name}: {count} wins")

--- Metric Comparison (Higher is Better) ---
Best for BLEU      : Default              (Score: 0.0174)
Best for ROUGE-1   : Default              (Score: 0.1801)
Best for ROUGE-L   : Default              (Score: 0.1413)
Best for METEOR    : Default              (Score: 0.1983)
Best for BERTScore : Default              (Score: 0.8471)

--- Summary ---
Overall Best Performing Configuration (most metric wins):
'Default' won 5 out of 5 evaluated metrics.

Note: The 'Default' configuration achieved the highest score across all evaluated metrics.

Win counts per configuration:
- Default: 5 wins


# Finetuning

## Load subset of datasets

• PubMedQA – Biomedical QA dataset from PubMed abstracts.

• MedQA (USMLE) – Medical board exam Q&A dataset.

• MedicineQuAD – Medication-focused QA dataset (from TGA data).

In [7]:
pubmedqa = load_dataset("pubmed_qa", "pqa_labeled", split="train[:300]")  # Evaluation
medqa = load_dataset("openlifescienceai/medqa", split="train[:300]")  # Few-shot examples
medicinequad = load_dataset("keivalya/MedQuad-MedicalQnADataset", split="train[:300]")  # Medication-specific (if available)

In [None]:
print("PubMedQA Features:", pubmedqa.column_names)
print("MedQA Features:", medqa.column_names)
print("MedicineQuAD Features:", medicinequad.column_names)

PubMedQA Features: ['pubid', 'question', 'context', 'long_answer', 'final_decision']
MedQA Features: ['id', 'data', 'subject_name']
MedicineQuAD Features: ['qtype', 'Question', 'Answer']


In [None]:
# Print questions and answers from PubMedQA
for i in range(3):
    print(f"PubMedQA Q{i}:", pubmedqa[i]["question"])
    print(f"PubMedQA A{i}:", pubmedqa[i]["long_answer"])

# Print questions and answers from MedQA
for i in range(3):
    print(f"MedQA Q{i}:", medqa[i]["data"])
    print(f"MedQA A{i}:", medqa[i]["subject_name"])

# Print questions and answers from MedicineQuAD
for i in range(3):
    print(f"MedicineQuAD Q{i}:", medicinequad[i]["Question"])
    print(f"MedicineQuAD A{i}:", medicinequad[i]["Answer"])

PubMedQA Q0: Do mitochondria play a role in remodelling lace plant leaves during programmed cell death?
PubMedQA A0: Results depicted mitochondrial dynamics in vivo as PCD progresses within the lace plant, and highlight the correlation of this organelle with other organelles during developmental PCD. To the best of our knowledge, this is the first report of mitochondria and chloroplasts moving on transvacuolar strands to form a ring structure surrounding the nucleus during developmental PCD. Also, for the first time, we have shown the feasibility for the use of CsA in a whole plant system. Overall, our findings implicate the mitochondria as playing a critical and early role in developmentally regulated PCD in the lace plant.
PubMedQA Q1: Landolt C and snellen e acuity: differences in strabismus amblyopia?
PubMedQA A1: Using the charts described, there was only a slight overestimation of visual acuity by the Snellen E compared to the Landolt C, even in strabismus amblyopia. Small differ

## Format dataset

In [8]:
# Format datasets and keep only 'text'
def format_pubmedqa(example):
    return {"text": f"Q: {example['question']}\nA: {example['long_answer']}"}
pubmedqa_formatted = pubmedqa.map(format_pubmedqa, remove_columns=['pubid', 'question', 'context', 'long_answer', 'final_decision'])

def format_medqa(example):
    return {"text": f"Q: {example['data']['Question']}\nA: {example['data']['Correct Answer']}"}
medqa_formatted = medqa.map(format_medqa, remove_columns=['id', 'data', 'subject_name'])

def format_medicinequad(example):
    return {"text": f"Q: {example['Question']}\nA: {example['Answer']}"}
medicinequad_formatted = medicinequad.map(format_medicinequad, remove_columns=['qtype', 'Question', 'Answer'])

# Concatenate MedQA and MedicineQA for training
train_data = concatenate_datasets([medqa_formatted, medicinequad_formatted])

# Use PubMedQA as evaluation dataset
eval_data = pubmedqa_formatted

# Verify
print("Train Data Features:", train_data.column_names)
print("Eval Data Features:", eval_data.column_names)
print("First Example (Train Data, MedQA):", train_data[0])
print("First Example (Eval Data, PubMedQA):", eval_data[0])

Train Data Features: ['text']
Eval Data Features: ['text']
First Example (Train Data, MedQA): {'text': 'Q: A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. Which of the following is the best treatment for this patient?\nA: Nitrofurantoin'}
First Example (Eval Data, PubMedQA): {'text': 'Q: Do mitochondria play a role in remodelling lace plant leaves during programmed cell death?\nA: Results depicted mitochondrial dynamics in vivo as PCD progresses within the lace plant, and highlight the correlation of this organelle with o

## Pre-evalutation model before tuning

In [36]:
# Function to extract question and reference answer from text
def extract_question_answer(text):
    parts = text.split("\nA: ")
    question = parts[0].replace("Q: ", "").strip()
    reference = parts[1].strip() if len(parts) > 1 else ""
    return question, reference

In [12]:
# Function to generate answer from model
def generate_answer(question, max_length=200):
    prompt = f"Q: {question}\nA:"
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to("cuda")
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        do_sample=False,
        top_p=1.0,
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id
    )
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract the answer part after "A:"
    answer = generated.split("\nA:")[1].strip() if "\nA:" in generated else generated.strip()
    return answer

In [42]:
# Evaluate model on 100 random Q&A pairs
def evaluate_model(dataset, num_samples=100):
    # Shuffle dataset and select first num_samples
    shuffled_dataset = dataset.shuffle(seed=42)  # Fixed seed for reproducibility
    selected_dataset = shuffled_dataset.select(range(min(num_samples, len(shuffled_dataset))))

    metrics_list = []
    for example in tqdm(selected_dataset, desc=f"Evaluating {num_samples} samples"):
        # Extract question and reference answer
        question, reference = extract_question_answer(example["text"])
        if not reference:
            continue  # Skip if no reference answer
        # Generate candidate answer
        candidate = generate_answer(question)
        # Compute metrics
        metrics = compute_metrics(reference, candidate)
        metrics_list.append(metrics)

    # Aggregate metrics
    aggregated = {
        key: np.mean([m[key] for m in metrics_list])
        for key in metrics_list[0].keys()
    }
    return aggregated

In [None]:
from tabulate import tabulate

# Evaluate and get aggregated metrics only
aggregated_results = evaluate_model(eval_data, num_samples=100)  # Single return value

# Prepare table data
table_data = [[metric, f"{value:.4f}"] for metric, value in aggregated_results.items()]
headers = ["Metric", "Score"]

# Print results as a table
print("\nEvaluation Results (Falcon-7B on 100 Random PubMedQA Samples):")
print(tabulate(table_data, headers=headers, tablefmt="grid"))

Evaluating 100 samples:   0%|          | 0/100 [00:00<?, ?it/s]

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating 100 samples:   1%|          | 1/100 [00:29<48:38, 29.48s/it]Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating 100 samples:   2%|▏         | 2/100 [00:47<36:42, 22.48s/it]Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating 100 samples:   3%|▎


Evaluation Results (Falcon-7B on 100 Random PubMedQA Samples):
+-----------+---------+
| Metric    |   Score |
| BLEU      |  0.0152 |
+-----------+---------+
| ROUGE-1   |  0.2042 |
+-----------+---------+
| ROUGE-L   |  0.1588 |
+-----------+---------+
| METEOR    |  0.1474 |
+-----------+---------+
| BERTScore |  0.8413 |
+-----------+---------+





## Configure QLoRA

Configure file in order to create the LoRA model. According to QLoRA paper, it is important to consider all linear layers in the transformer block for maximum performance. Therefore we will add dense, dense_h_to_4_h and dense_4h_to_h layers in the target modules in addition to the mixed query key value layer

In [9]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)

## Configure the trainer

Using [`SFTTrainer` from TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer) that gives a wrapper around transformers `Trainer` to easily fine-tune models on instruction based datasets using PEFT adapters. Let's first load the training arguments below.

In [12]:
output_dir = "/content/drive/MyDrive/Finetune/results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 100
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    gradient_checkpointing=True,
    eval_strategy="steps",  # Evaluate every `eval_steps`
    eval_steps=10,  # Same frequency as save_steps
    load_best_model_at_end=True,  # Load the best model at the end
    metric_for_best_model="eval_loss",  # Use validation loss for early stopping
    greater_is_better=False,  # Lower eval_loss is better
    save_strategy="steps",
)

Then finally pass everthing to the trainer

In [13]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")

tokenized_train_data = train_data.map(preprocess_function, batched=True)
tokenized_eval_data = eval_data.map(preprocess_function, batched=True) if eval_data else None

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

In [14]:
trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_eval_data,
    peft_config=peft_config,
    args=training_arguments,
)

Truncating eval dataset:   0%|          | 0/300 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Pre-process the model by upcasting the layer norms in float 32 for more stable training

In [15]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)

## Train model

To configure Weights & Biases (W&B) to save run data to your Google Drive instead of the local disk. Configure W&B to save run data to a specific folder in your Google Drive by setting the dir parameter in wandb.init()

In [16]:
import wandb
wandb.init(project="my-project", dir="/content/drive/MyDrive/Finetune/wandb_logs")

[34m[1mwandb[0m: Currently logged in as: [33mmili-eureka[0m ([33mmili-eureka-deakin-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [17]:
# Train the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("/content/drive/MyDrive/Finetune/falcon7b_finetuned_qlora")
tokenizer.save_pretrained("/content/drive/MyDrive/Finetune/falcon7b_finetuned_qlora")

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss,Validation Loss
10,2.4585,0.391484
20,0.7273,0.376621
30,0.6523,0.37451
40,0.634,0.379267
50,0.6134,0.386471
60,0.6024,0.383844
70,0.593,0.382821
80,0.5346,0.394905
90,0.6039,0.395592
100,0.5188,0.389962


('/content/drive/MyDrive/Finetune/falcon7b_finetuned_qlora/tokenizer_config.json',
 '/content/drive/MyDrive/Finetune/falcon7b_finetuned_qlora/special_tokens_map.json',
 '/content/drive/MyDrive/Finetune/falcon7b_finetuned_qlora/tokenizer.json')

## Load the Fine-Tuned Model and Tokenizer

In [5]:
# Load fine-tuned model and tokenizer
fine_tuned_path = "/content/drive/MyDrive/Finetune/falcon7b_finetuned_qlora"
# Load tokenizer
finetune_tokenizer = AutoTokenizer.from_pretrained(fine_tuned_path)

Load the model with PeftModel for QLoRA weights and merge the adapter weights with the base model for inference.

In [22]:
# Check the directory
path = "/content/drive/MyDrive/Finetune/falcon7b_finetuned_qlora"
print(os.listdir(path))

['README.md', 'adapter_model.safetensors', 'adapter_config.json', 'tokenizer_config.json', 'special_tokens_map.json', 'tokenizer.json', 'training_args.bin']


Apply the learned parameters from your fine-tuning process

In [6]:
import os
from safetensors.torch import load_file

# Load LoRA config
lora_config = LoraConfig.from_pretrained(fine_tuned_path)
finetune_model = get_peft_model(model, lora_config)

# Load adapter weights using safetensors
adapter_path = os.path.join(fine_tuned_path, "adapter_model.safetensors")
state_dict = load_file(adapter_path)
finetune_model.load_state_dict(state_dict, strict=False)

finetune_model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): FalconForCausalLM(
      (transformer): FalconModel(
        (word_embeddings): Embedding(65024, 4544)
        (h): ModuleList(
          (0-31): 32 x FalconDecoderLayer(
            (self_attention): FalconAttention(
              (query_key_value): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4544, out_features=4672, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4544, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4672, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
       

## Post evaluation

In [13]:
# Function to generate answer from finetune model using the best hyper parameter
def finetune_generate_answer(question, max_length=200):
    prompt = f"Q: {question}\nA:"
    inputs = finetune_tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to("cuda")
    outputs = finetune_model.generate(
        **inputs,
        max_length=max_length,
        do_sample=False,
        top_p=1.0,
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id
    )
    generated = finetune_tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract the answer part after "A:"
    finetune_model_answer = generated.split("\nA:")[1].strip() if "\nA:" in generated else generated.strip()
    return finetune_model_answer

In [43]:
# Evaluate model on 100 random Q&A pairs
def post_evaluate_model(dataset, num_samples=100):
    # Shuffle dataset and select first num_samples
    shuffled_dataset = dataset.shuffle(seed=42)  # Use the same seed with pre-evaluation
    selected_dataset = shuffled_dataset.select(range(min(num_samples, len(shuffled_dataset))))

    metrics_list = []
    for example in tqdm(selected_dataset, desc=f"Evaluating {num_samples} samples"):
        # Extract question and reference answer
        question, reference = extract_question_answer(example["text"])
        if not reference:
            continue  # Skip if no reference answer
        # Generate candidate answer
        candidate = finetune_generate_answer(question)
        # Compute metrics
        metrics = compute_metrics(reference, candidate)
        metrics_list.append(metrics)

    # Aggregate metrics
    aggregated = {
        key: np.mean([m[key] for m in metrics_list])
        for key in metrics_list[0].keys()
    }
    return aggregated

In [44]:
# Run post-evaluation
from tabulate import tabulate

# Evaluate and get aggregated metrics only
post_aggregated_results = post_evaluate_model(eval_data, num_samples=100)

# Prepare table data
table_data = [[metric, f"{value:.4f}"] for metric, value in post_aggregated_results.items()]
headers = ["Metric", "Score"]

# Print results as a table
print("\nFinetune Evaluation Results (Falcon-7B on 100 Random PubMedQA Samples):")
print(tabulate(table_data, headers=headers, tablefmt="grid"))

Evaluating 100 samples:   0%|          | 0/100 [00:00<?, ?it/s]Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating 100 samples:   1%|          | 1/100 [00:21<34:41, 21.03s/it]Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Evaluating 100 samples:   2%|▏         | 2/100 [00:42<34:28, 21.10s/it]Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use 


Finetune Evaluation Results (Falcon-7B on 100 Random PubMedQA Samples):
+-----------+---------+
| Metric    |   Score |
| BLEU      |  0.0143 |
+-----------+---------+
| ROUGE-1   |  0.201  |
+-----------+---------+
| ROUGE-L   |  0.1506 |
+-----------+---------+
| METEOR    |  0.1463 |
+-----------+---------+
| BERTScore |  0.8424 |
+-----------+---------+





In [48]:
# Comparison logic
comparison_data = [
    [metric, f"{aggregated_results.get(metric, 0):.4f}", f"{post_aggregated_results.get(metric, 0):.4f}",
     f"{post_aggregated_results.get(metric, 0) - aggregated_results.get(metric, 0):.4f}"]
    for metric in post_aggregated_results.keys()
]
comparison_headers = ["Metric", "Pre-Score", "Post-Score", "Improvement"]

# Print the table
print("\nPre vs. Post-Evaluation Comparison:")
print(tabulate(comparison_data, headers=comparison_headers, tablefmt="grid"))


Pre vs. Post-Evaluation Comparison:
+-----------+-------------+--------------+---------------+
| Metric    |   Pre-Score |   Post-Score |   Improvement |
| BLEU      |      0.0152 |       0.0143 |       -0.0009 |
+-----------+-------------+--------------+---------------+
| ROUGE-1   |      0.2042 |       0.201  |       -0.0032 |
+-----------+-------------+--------------+---------------+
| ROUGE-L   |      0.1588 |       0.1506 |       -0.0082 |
+-----------+-------------+--------------+---------------+
| METEOR    |      0.1474 |       0.1463 |       -0.0011 |
+-----------+-------------+--------------+---------------+
| BERTScore |      0.8413 |       0.8424 |        0.0011 |
+-----------+-------------+--------------+---------------+


# Benchmark with groundtruth dataset

In [10]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

In [14]:
df = pd.DataFrame(ground_truth)

df['base_answer'] = df['question'].apply(lambda q: generate_answer(question=q, max_length=200))
df['fine_tuned_answer'] = df['question'].apply(lambda q: finetune_generate_answer(question=q, max_length=200))

# Similarity function
def compute_similarity(a, b):
    vectorizer = TfidfVectorizer().fit([a, b])
    vecs = vectorizer.transform([a, b])
    return cosine_similarity(vecs[0], vecs[1])[0,0]

# Compare answers to ground truth
df['sim_base'] = df.apply(lambda row: compute_similarity(row['base_answer'], row['answer']), axis=1)
df['sim_fine_tuned'] = df.apply(lambda row: compute_similarity(row['fine_tuned_answer'], row['answer']), axis=1)

# Print formatted table with questions, ground truth, base model, and fine-tuned model answers
print("\nTable of Questions, Ground Truth, and Model Answers:")
table_columns = ['question', 'answer', 'base_answer', 'fine_tuned_answer']
print(df[table_columns].to_string(index=False))


Table of Questions, Ground Truth, and Model Answers:
                                                                         question                                                                                                                                                                                                                                                answer                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

In [19]:
# @title
from tabulate import tabulate

# Assume df is already defined and contains the necessary columns
table_columns = ['question', 'answer', 'base_answer', 'fine_tuned_answer']

# Slice dataframe to include only the desired columns
formatted_df = df[table_columns]

# Print the formatted table using tabulate
print("\nTable of Questions, Ground Truth, and Model Answers:")
print(tabulate(formatted_df, headers='keys', tablefmt='grid', showindex=False))


Table of Questions, Ground Truth, and Model Answers:
+-----------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [20]:
# Summary statistics
print("Average Similarity - Base Model:", np.mean(df['sim_base']))
print("Average Similarity - Fine-Tuned Model:", np.mean(df['sim_fine_tuned']))

Average Similarity - Base Model: 0.21301961089246166
Average Similarity - Fine-Tuned Model: 0.56382
