# Fine Tuning Google T5 Small Model for Scientific Paper Summarization

Think of an LLM (like T5, GPT, LLaMA, etc.) as a student who has already read millions of books.

This student knows grammar, facts, reasoning, and how to form sentences.

But if you ask them to summarize medical reports, or write legal contracts, or chat like your company's support agent, they won't be perfect, because they never specifically practiced that.

👉 Fine-tuning is like giving that student specialized training:

Instead of teaching them from scratch, you just give them a few thousand examples of exactly what you want (summaries, answers, translations, etc.).

They adjust their knowledge a little bit to become an expert in your task.

Pretraining => teaching general world knowledge.

Fine-tuning => customizing the model for your specific job.

⚡ Example:

A base LLM knows English very well.

Fine-tuning on a dataset of customer support chats makes it talk like a helpful support agent.

Fine-tuning on summaries makes it a better summarizer.

## 📋 Table of Contents

  - Download and extract scientific papers data from ArXiv and PubMed
  - Cleaning data
  - Libraries for fine tuning
  - Fine tuning and evaluating T5 small model
    - Full Fine Tuning
    - Parameter efficient fine tuning (PEFT)
      - Fine tuning with LoRA
      - Fine Tuning with QLoRA


In [3]:
import os
import re
import requests
from zipfile import ZipFile
from tqdm import tqdm
import json
import random
import numpy as np
import pandas as pd
SEED = 50

In [4]:
# Mount Google Drive in Colab to persist our files
# from google.colab import drive
# drive.mount('/content/drive')

## 🗂 Reading data from multiple sources
  
  1. ArXiv
  2. PubMed

In [5]:
DRIVE_DIR = "./drive/MyDrive/machine-learning/fine-tuning/paper-summary"
DATA_DIRS = ["./datasets", "./datasets/raw", "./datasets/processed"]
for data_dir in DATA_DIRS:  # avoid shadowing the built-in `dir`
    os.makedirs(data_dir, exist_ok=True)

In [4]:
def download_and_extract(name, url):
    """
    Utility function to download and extract the dataset

    Args:
        name: name of the folder where the files need to be downloaded and extracted
        url: URL of the file to be downloaded
    """
    CHUNK_SIZE = 16 * 1024 * 1024
    os.makedirs(os.path.join(DATA_DIRS[1], name), exist_ok=True)
    zip_path = os.path.join(DATA_DIRS[1], name, url.split('/')[-1])
    extract_path = os.path.join(DATA_DIRS[1], name, 'extracted')

    if not os.path.exists(zip_path):
        print(f"Downloading {name} dataset from URL: {url} ...")
        with requests.get(url=url, stream=True) as r:
            r.raise_for_status()
            total_size = int(r.headers.get("content-length", 0))
            with open(zip_path, "wb") as f, tqdm(
                desc=name, total=total_size, unit='B', unit_scale=True
            ) as bar:
                for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
                    f.write(chunk)
                    bar.update(len(chunk))
    else:
        print(f"{zip_path} already exists")

    if url.split('/')[-1].split('.')[-1] == 'zip':
        if not os.path.exists(extract_path):
            print(f"Extracting {name} dataset...")
            with ZipFile(zip_path, "r") as zip_f:
                zip_f.extractall(extract_path)
            print(f"Extracted {name} to {zip_path}")
        else:
            print(f"Dataset already extracted")

### 📒 Scientific papers datasets

  - The scientific papers dataset contains two sets of long and structured documents.
  - The datasets are obtained from ArXiv and PubMed OpenAccess repositories.
  - Source: https://s3.amazonaws.com/datasets.huggingface.co/scientific_papers

In [6]:
_CITATION = """
@article{Cohan_2018,
   title={A Discourse-Aware Attention Model for Abstractive Summarization of
            Long Documents},
   url={http://dx.doi.org/10.18653/v1/n18-2097},
   DOI={10.18653/v1/n18-2097},
   journal={Proceedings of the 2018 Conference of the North American Chapter of
          the Association for Computational Linguistics: Human Language
          Technologies, Volume 2 (Short Papers)},
   publisher={Association for Computational Linguistics},
   author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},
   year={2018}
}
"""

_DESCRIPTION = """
The scientific papers dataset contains two sets of long and structured documents.
The datasets are obtained from ArXiv and PubMed OpenAccess repositories.
Both "arxiv" and "pubmed" have three features:
  - article: the body of the document, paragraphs separated by "/n".
  - abstract: the abstract of the document, paragraphs separated by "/n".
  - section_names: titles of sections, separated by "/n".
"""

_DOCUMENT = "article_text"
_SUMMARY = "abstract_text"

_URLS = {
    "arxiv": "https://s3.amazonaws.com/datasets.huggingface.co/scientific_papers/1.1.1/arxiv-dataset.zip",
    "pubmed": "https://s3.amazonaws.com/datasets.huggingface.co/scientific_papers/1.1.1/pubmed-dataset.zip",
}

In [12]:
for name, url in _URLS.items():
    download_and_extract(name, url)

Downloading arxiv dataset from URL: https://s3.amazonaws.com/datasets.huggingface.co/scientific_papers/1.1.1/arxiv-dataset.zip ...


arxiv: 100%|██████████| 3.62G/3.62G [05:46<00:00, 10.5MB/s]


Extracting arxiv dataset...
Extracted arxiv to ./datasets/raw/arxiv/arxiv-dataset.zip
Downloading pubmed dataset from URL: https://s3.amazonaws.com/datasets.huggingface.co/scientific_papers/1.1.1/pubmed-dataset.zip ...


pubmed: 100%|██████████| 880M/880M [01:41<00:00, 8.67MB/s]


Extracting pubmed dataset...
Extracted pubmed to ./datasets/raw/pubmed/pubmed-dataset.zip


In [13]:
def read_json_data(file_path, num_records=1000) -> pd.DataFrame:
    """
    Sample `num_records` random JSON lines from a large JSON (.txt or .jsonl) file
    using memory-efficient reservoir sampling.
    """
    reservoir = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for i,line in enumerate(file):
            json_line = json.loads(line)
            if i < num_records:
                reservoir.append({key: " ".join(json_line[key]) for key in ["article_text", "abstract_text"]})
            else:
                j = random.randint(0,i)
                if j < num_records:
                    reservoir[j] = {key: " ".join(json_line[key]) for key in ["article_text", "abstract_text"]}

    df = pd.DataFrame(reservoir)
    df = df.rename(columns={'article_text': 'text', 'abstract_text': 'summary'})

    return df
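
The sampling logic in `read_json_data` can be tried on its own, without the dataset files. The block below is a minimal sketch of the same reservoir-sampling idea applied to any iterable; the helper name `reservoir_sample` and its `rng` parameter are illustrative assumptions, not part of the notebook.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Pick k items uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1)
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical usage: sample 5 values from a stream of 100,000 integers
sample = reservoir_sample(range(100_000), k=5, rng=random.Random(50))
print(len(sample), sample)
```

Only `k` records are held in memory at any time, which is why this approach suits the multi-gigabyte `train.txt` files. Passing a seeded `random.Random` makes the sample repeatable.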


In [14]:
df_train = pd.DataFrame(columns=['text', 'summary'])
df_test = pd.DataFrame(columns=['text', 'summary'])
df_val = pd.DataFrame(columns=['text', 'summary'])

In [15]:
for name, url in _URLS.items():
    file_path = os.path.join(DATA_DIRS[1], name, 'extracted', name+'-dataset')
    df_train = pd.concat([df_train, read_json_data(os.path.join(file_path, 'train.txt'), num_records=5000)], axis=0)
    df_test = pd.concat([df_test, read_json_data(os.path.join(file_path, 'test.txt'), num_records=1000)], axis=0)
    df_val = pd.concat([df_val, read_json_data(os.path.join(file_path, 'val.txt'), num_records=1000)], axis=0)

In [16]:
print(df_train.shape)
df_train.head(2)

(10000, 2)


|   | text | summary |
|---|------|---------|
| 0 | in this paper we develop the idea that time di... | \<S> we discuss the emergence of time dilation ... |
| 1 | the hubble deep fields ( hdfs ) are rich sourc... | \<S> the original analysis of the star formatio... |


In [17]:
print(df_test.shape)
df_test.head(2)

(2000, 2)


|   | text | summary |
|---|------|---------|
| 0 | it is well known that genetic information is e... | \<S> rna can be used as a high - density medium... |
| 1 | the applicability of the black  scholes framew... | \<S> we discuss utility based pricing and hedgi... |


In [18]:
print(df_val.shape)
df_val.head(2)

(2000, 2)


|   | text | summary |
|---|------|---------|
| 0 | in recent years , movement for the achievement... | \<S> we treat quantum counterparts of testing p... |
| 1 | the era of ` big data ' @xcite has transformed... | \<S> this paper argues that there are three fun... |


## 💾 Saving Data as Parquet for compact storage and fast access

In [19]:
df_train.to_parquet(os.path.join(DATA_DIRS[1], 'train.parquet'), index=False)
df_test.to_parquet(os.path.join(DATA_DIRS[1], 'test.parquet'), index=False)
df_val.to_parquet(os.path.join(DATA_DIRS[1], 'validation.parquet'), index=False)

## 🧹 Cleaning Data

  - Removing special patterns like `<S>`
  - Removing new line characters
  - Removing non-alphanumeric characters
  - Removing extra spaces

In [20]:
df_sample = df_train.sample(10).reset_index(drop=True)
df_sample.head()

|   | text | summary |
|---|------|---------|
| 0 | a 60-year - old man was referred for further e... | \<S> renal inflammatory pseudotumor is a very r... |
| 1 | extreme conditions in plasma such as high temp... | \<S> oblique propagation and head - on collisio... |
| 2 | up until the 1980s , most research in developm... | \<S> systems developmental biology is an approa... |
| 3 | pyoderma gangrenosum ( pg ) is an ulcerative d... | \<S> pyoderma gangrenosum ( pg ) is a rare diso... |
| 4 | urolithiasis ( ul ) in pediatric patients is r... | \<S> purposewe investigated the influence of ov... |


In [21]:
print(df_sample.text[0])
print(df_sample.summary[0])

a 60-year - old man was referred for further evaluation of painless gross hematuria , first noted ten days earlier . his past medical history and the results of physical examination were unremarkable , except for mild right flank discomfort . urinalysis revealed numerous red blood cells ( rbc>100/hpf ) , while other laboratory tests , including urine culture and cytology , were all negative . abdominal ct performed at another hospital revealed the presence of a well - defined right renal mass involving the lower pole ( fig . the lesion showed low attenuation , and its margin with the renal parenchyma was ill defined . for us examination , real - time equipment ( hdi 3000 ; advanced technology laboratories , bothell , wa ) with a 2 - 4 mhz curved array transducer was used . gray - scale us revealed a round , homogeneously hypoechoic lesion , 1.6 cm in diameter , located mainly within the renal sinus . on power doppler us , multiple vascularities meandering around the mass were identifie

In [22]:
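def clean_text(text):
    if not isinstance(text, str):
        return text
    text = text.lower()
    # After lowercasing, the "<S>" marker is "<s>"; closing "</s>" tags are not
    # matched here, so the filter below reduces them to a stray "s" token
    text = re.sub(r"<s>", "", text)
    text = re.sub(r"\n", " ", text)
    text = re.sub(r"[^a-z0-9 ]", " ", text)  # replace non-alphanumeric characters with spaces
    text = re.sub(r"\s+", " ", text)  # collapse repeated whitespace

    return text.strip()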
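
One side effect of `clean_text` is worth knowing: it removes the opening `<S>` marker, but a closing `</S>` marker is reduced to a lone `s` by the non-alphanumeric filter. This is visible in the cleaned reference summary printed later ("... was not possible s"). The block below is a minimal sketch of a variant that strips both markers; the name `clean_text_strip_tags` is hypothetical, and the results reported in this notebook were produced with `clean_text` as defined above, not with this variant.

```python
import re


def clean_text_strip_tags(text):
    """Hypothetical variant of clean_text that removes <S> and </S> markers entirely."""
    if not isinstance(text, str):
        return text
    text = text.lower()
    text = re.sub(r"</?s>", " ", text)       # opening and closing sentence markers
    text = re.sub(r"[^a-z0-9 ]", " ", text)  # also turns newlines into spaces
    text = re.sub(r"\s+", " ", text)         # collapse repeated whitespace
    return text.strip()


raw = "<S> RNA can be used as a high - density medium . </S> <S> It works . </S>"
print(clean_text_strip_tags(raw))
# rna can be used as a high density medium it works
```

Using this variant would change the training data, so scores obtained with it would not be directly comparable with the numbers reported here.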

In [23]:
df_sample_clean = df_sample.map(clean_text)
df_sample_clean.head()

|   | text | summary |
|---|------|---------|
| 0 | a 60 year old man was referred for further eva... | renal inflammatory pseudotumor is a very rare ... |
| 1 | extreme conditions in plasma such as high temp... | oblique propagation and head on collisions of ... |
| 2 | up until the 1980s most research in developmen... | systems developmental biology is an approach t... |
| 3 | pyoderma gangrenosum pg is an ulcerative disor... | pyoderma gangrenosum pg is a rare disorder of ... |
| 4 | urolithiasis ul in pediatric patients is relat... | purposewe investigated the influence of overwe... |


In [24]:
df_train_clean = df_train.map(clean_text)
df_test_clean = df_test.map(clean_text)
df_val_clean = df_val.map(clean_text)

In [None]:
# Save to DATA_DIRS[-1] when running locally; switch to DRIVE_DIR if you are using Google Drive
df_train_clean.to_parquet(os.path.join(DATA_DIRS[-1], 'train_cleaned.parquet'))
df_test_clean.to_parquet(os.path.join(DATA_DIRS[-1], 'test_cleaned.parquet'))
df_val_clean.to_parquet(os.path.join(DATA_DIRS[-1], 'validation_cleaned.parquet'))

In [27]:
df_train_clean.sample()

|   | text | summary |
|---|------|---------|
| 2756 | there can be little doubt that copd is current... | copd is uniquely situated as a chronic disease... |


## 🛠 Importing libraries for model fine tuning

In [None]:
# # Install the libraries if you do not have them already
# !pip install evaluate
# !pip install rouge_score
# !pip install bert_score
# !pip install transformers[torch]

In [8]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from evaluate import load

In [9]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(DEVICE)

cuda


## 🧮 We will use Google's T5 Small model for our summarization task

- Let's start by loading the base model and using it directly for summarization

In [52]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")
base_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

- Let's try to run a sample summary with the base model

In [12]:
def summarize(text, model, tokenizer, max_input_len=1024, max_summary_len=250):
    # Tokenize and move the tensors to the model's device (the raw text is used as-is, without a task prefix)
    inputs = tokenizer([text], max_length=max_input_len, return_tensors='pt', truncation=True).to(model.device)
    with torch.no_grad():
        summary_ids = model.generate(**inputs, num_beams=4, max_length=max_summary_len, early_stopping=True)

    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

In [32]:
sample_text, sample_summary = df_sample_clean.text[0], df_sample_clean.summary[0]
generated_summary = summarize(sample_text, base_model, tokenizer)

### 🏅 Evaluation with Ground Truth (Human Evaluation)

In [33]:
print(f"sample_summary: {sample_summary}")
print(f"generated_summary: {generated_summary}")

sample_summary: renal inflammatory pseudotumor is a very rare benign condition of unknown etiology characterized by proliferative myofibroblasts fibroblasts histiocytes and plasma cells in the case s we report the lesion appeared on contrast enhanced power doppler us images as a well defined hypoechoic mass with intratumoral vascularity and on ct as a low attenuated mass s differentiation from malignant renal neoplasms was not possible s
generated_summary: a curved array transducer was used gray scale us revealed a round homogeneously hypoechoic lesion 1 6 cm in diameter located mainly within the renal sinus on power doppler us multiple vascularities meandering around the mass were identified but none were seen within it fig 1d and the conventional power doppler us images obtained 7 minutes after injection also showed these same signals fig however severe blooming artifacts were observed around and within the kidney.


### 🏅 Let's Evaluate the Base Model on Test Dataset with ROUGE and BERTScore

In [10]:
# Load metrics
rouge = load("rouge")
bertscore = load("bertscore")

In [11]:
# Creating dataset for evaluation and training
# Data is read from DATA_DIRS[-1]; switch to DRIVE_DIR if you are using Google Drive
data_files = {
    "train": os.path.join(DATA_DIRS[-1], 'train_cleaned.parquet'),
    "test": os.path.join(DATA_DIRS[-1], 'test_cleaned.parquet'),
    "validation": os.path.join(DATA_DIRS[-1], 'validation_cleaned.parquet'),
}

dataset = load_dataset("parquet", data_files=data_files)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', '__index_level_0__'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['text', 'summary', '__index_level_0__'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['text', 'summary', '__index_level_0__'],
        num_rows: 2000
    })
})


In [None]:
def encode(examples, tokenizer, max_input_len=1024, max_summary_len=250):
    """
    Encode the model inputs and labels using the tokenizer
    """
    inputs = [f"summarize: {doc}" for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=max_input_len, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=max_summary_len, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
dataset = dataset.map(lambda x: encode(x, tokenizer), batched=True)

In [17]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['text', 'summary', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['text', 'summary', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})


In [18]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=base_model)

In [19]:
test_dataset = dataset["test"].with_format(
    type="torch",
    columns=["input_ids", "labels", "attention_mask"]
)
test_dataloader = DataLoader(test_dataset, batch_size=8, collate_fn=data_collator)
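
`DataCollatorForSeq2Seq` pads the `labels` of each batch with `-100` so that the loss ignores the padded positions. That value is not a valid token id, which is why the evaluation code in this notebook swaps it for the pad token before decoding. The block below is a minimal NumPy sketch of that round trip; the token ids are made up for illustration, and `PAD_TOKEN_ID = 0` is an assumption matching T5's tokenizer.

```python
import numpy as np

PAD_TOKEN_ID = 0      # assumption: pad token id of the T5 tokenizer
IGNORE_INDEX = -100   # value the seq2seq collator uses to pad labels


def pad_labels(label_lists, pad_value=IGNORE_INDEX):
    """Right-pad variable-length label id lists to a rectangular array."""
    max_len = max(len(ids) for ids in label_lists)
    return np.array([ids + [pad_value] * (max_len - len(ids)) for ids in label_lists])


# Two made-up label sequences of different lengths
labels = pad_labels([[71, 1712, 1], [8, 1]])
# Before decoding, map the ignore index back to a real token id
decodable = np.where(labels != IGNORE_INDEX, labels, PAD_TOKEN_ID)

print(labels)
print(decodable)
```

Decoding with `skip_special_tokens=True` then drops the pad tokens, so the reference text is recovered without padding artifacts.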

In [20]:
def evaluate(model, tokenizer, data_loader):
  model.eval()
  all_preds, all_labels = [], []

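  for batch in tqdm(data_loader):
      with torch.no_grad():
          input_ids = batch["input_ids"].to(model.device)
          # Pass the attention mask so padded positions are ignored during generation
          attention_mask = batch["attention_mask"].to(model.device)
          labels = batch["labels"].to(model.device)
          labels = labels.clone()
          # The collator pads labels with -100; restore the pad token before decoding
          labels[labels == -100] = tokenizer.pad_token_id

          outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=200, num_beams=4)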
          preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
          refs = tokenizer.batch_decode(labels, skip_special_tokens=True)

          all_preds.extend(preds)
          all_labels.extend(refs)

  # Compute metrics
  rouge_result = rouge.compute(predictions=all_preds, references=all_labels)
  processed_result = {}
  for key, value in rouge_result.items():
      if hasattr(value, "mid"):
          processed_result[key] = value.mid.fmeasure * 100
      else:
          processed_result[key] = value * 100

  # Compute BERTScore (returns precision, recall, f1)
  bertscore_result = bertscore.compute(predictions=all_preds, references=all_labels, lang="en")
  processed_result['bertscore_precision'] = np.mean(bertscore_result['precision']) * 100
  processed_result['bertscore_recall'] = np.mean(bertscore_result['recall']) * 100
  processed_result['bertscore_f1'] = np.mean(bertscore_result['f1']) * 100

  return processed_result

In [None]:
base_model.to(DEVICE)
print(next(base_model.parameters()).device)

evaluate(base_model, tokenizer, test_dataloader)

### 🏅 Base Model Evaluation

- Rouge
  - rouge1: 26.6099
  - rouge2: 7.2635
  - rougeL: 16.3324
  - rougeLsum: 16.3225

- BERTscore
  - precision: 83.3965
  - recall: 81.2649
  - f1: 82.3042

- As the ROUGE scores show, the base model performs poorly on our test dataset. Let's fine-tune the model on our data and check whether its performance improves.
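
To make the ROUGE numbers above easier to interpret, here is a minimal sketch of what ROUGE-1 measures: the unigram overlap between a prediction and its reference, combined into an F1 score. It is a simplified illustration that splits on whitespace only; the `rouge` metric loaded earlier applies its own tokenization and optional stemming, so its values will not match this function exactly. The name `rouge1_f1` and the example sentences are assumptions for illustration.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1 based on whitespace-separated unigrams."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "renal inflammatory pseudotumor is a very rare benign condition"
prediction = "the renal mass is a rare benign lesion"
print(f"ROUGE-1 F1: {rouge1_f1(prediction, reference) * 100:.2f}")
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.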

## 🎛 Full Fine Tuning

- Fine-tuning the full model, with all parameters trainable

In [53]:
def count_parameters(model):
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    all_param = sum(p.numel() for p in model.parameters())
    print(f"Total Parameters: {all_param}")
    print(f"Trainable Parameters: {trainable_params}")
    print(f"% of Trainable Parameters: {100 * trainable_params / all_param}%")

count_parameters(base_model)

Total Parameters: 60506624
Trainable Parameters: 60506624
% of Trainable Parameters: 100.0%


In [54]:
BATCH_SIZE = 8

In [None]:
# Functions to compute metrics
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]
    return preds, labels

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    if isinstance(preds, tuple):
        preds = preds[0]
    if not isinstance(labels, np.ndarray):
        labels = labels.cpu().numpy()

    preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    preds, labels = postprocess_text(preds, labels)

    # Compute ROUGE
    result = rouge.compute(predictions=preds, references=labels, use_stemmer=True)
    processed_result = {}
    for key, value in result.items():
        if hasattr(value, "mid"):
            processed_result[key] = value.mid.fmeasure * 100
        else:
            processed_result[key] = value * 100

    # Compute BERTScore (returns precision, recall, f1)
    bertscore_result = bertscore.compute(predictions=preds, references=labels, lang="en")
    processed_result['bertscore_precision'] = np.mean(bertscore_result['precision']) * 100
    processed_result['bertscore_recall'] = np.mean(bertscore_result['recall']) * 100
    processed_result['bertscore_f1'] = np.mean(bertscore_result['f1']) * 100

    return processed_result


In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir=f"models/t5-small-full-finetuned",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=5,
    predict_with_generate=True,
    logging_dir="./logs",
    report_to="none",
)

#### Seq2SeqTrainingArguments:
- `output_dir`: The output directory where the model predictions and checkpoints will be written.
- `eval_strategy`: The evaluation strategy to adopt during training.
    - "no": No evaluation is done during training
    - "steps": Evaluation is done (and logged) every eval_steps.
    - "epoch": Evaluation is done at the end of each epoch.
- `learning_rate`: The initial learning rate for AdamW optimizer.
- `per_device_train_batch_size`: The batch size per device (global batch size: per_device_train_batch_size * number_of_devices). 
- `per_device_eval_batch_size`: The batch size per device accelerator core/CPU for evaluation.
- `weight_decay`: The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in AdamW optimizer. 
- `save_total_limit`: If a value is passed, will limit the total amount of checkpoints.
- `num_train_epochs`: Total number of training epochs to perform.
- `predict_with_generate`: Whether to use generate to calculate generative metrics (ROUGE, BLEU).
- `logging_dir`: TensorBoard log directory. Will default to *output_dir/runs/CURRENT_DATETIME_HOSTNAME*.
- `report_to`: The list of integrations to report the results and logs to.

In [57]:
trainer = Seq2SeqTrainer(
    model=base_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)


#### Seq2SeqTrainer Arguments:
- `model`:  Model to train
- `args`:  Seq2SeqTrainingArguments
- `train_dataset`:  Training dataset
- `eval_dataset`:  Validation dataset
- `tokenizer`:  Tokenizer to map data to model required tokens 
- `data_collator`:  A function or object that takes a list of dataset examples and collates them into a batch (tensors) that the model can consume.
- `compute_metrics`:  Custom function to compute evaluation metrics after predictions.

In [58]:
print(torch.cuda.is_available())
if torch.cuda.is_available():
  print(torch.cuda.get_device_name(0))

True
NVIDIA A10G


In [59]:
%%time
trainer.train()

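| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Bertscore Precision | Bertscore Recall | Bertscore F1 |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------------------|------------------|--------------|
| 1 | 3.0637 | 2.891732 | 12.151895 | 4.749489 | 9.882997 | 9.877146 | 86.590275 | 78.541887 | 82.349137 |
| 2 | 3.0196 | 2.840195 | 12.506681 | 4.987702 | 10.138939 | 10.128267 | 86.841563 | 78.614802 | 82.502524 |
| 3 | 2.97 | 2.812773 | 12.515393 | 5.045473 | 10.191 | 10.175036 | 86.882352 | 78.620633 | 82.523735 |
| 4 | 2.9856 | 2.801583 | 12.46332 | 5.01719 | 10.156466 | 10.146626 | 86.872609 | 78.608904 | 82.513081 |
| 5 | 2.9874 | 2.79733 | 12.495336 | 5.042214 | 10.180623 | 10.168297 | 86.89226 | 78.619703 | 82.527632 |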


CPU times: user 32min 7s, sys: 5min 14s, total: 37min 21s
Wall time: 37min 18s


TrainOutput(global_step=6250, training_loss=3.0588654272460936, metrics={'train_runtime': 2238.2118, 'train_samples_per_second': 22.339, 'train_steps_per_second': 2.792, 'total_flos': 1.35341801472e+16, 'train_loss': 3.0588654272460936, 'epoch': 5.0})
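
The `global_step=6250` in the training output follows directly from the dataset size, batch size, and epoch count. The block below is a small sketch of that arithmetic; the helper name `total_optimizer_steps` is hypothetical, and it assumes no gradient accumulation and that the last partial batch of an epoch is kept.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Number of optimizer steps, assuming no gradient accumulation."""
    global_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / global_batch)
    return steps_per_epoch * num_epochs


# 10,000 training rows, batch size 8, 5 epochs, one GPU
print(total_optimizer_steps(10_000, 8, 5))  # 6250
```

Doubling the batch size would halve the number of steps per epoch, at the cost of more GPU memory per step.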

In [60]:
trainer.save_model("models/t5-small-full-finetuned")
tokenizer.save_pretrained("models/t5-small-full-finetuned")

('models/t5-small-full-finetuned/tokenizer_config.json',
 'models/t5-small-full-finetuned/special_tokens_map.json',
 'models/t5-small-full-finetuned/tokenizer.json')

### 🏅 Evaluating Full Fine Tuned Model with ROUGE and BERTScore

In [61]:
# Load model and tokenizer
full_fine_tuned_model = AutoModelForSeq2SeqLM.from_pretrained(f"models/t5-small-full-finetuned").to(DEVICE)
full_fine_tuned_tokenizer = AutoTokenizer.from_pretrained(f"models/t5-small-full-finetuned")
print(next(full_fine_tuned_model.parameters()).device)

evaluate(full_fine_tuned_model, full_fine_tuned_tokenizer, test_dataloader)

cuda:0


100%|██████████| 250/250 [10:51<00:00,  2.60s/it]


{'rouge1': 35.08882752245391,
 'rouge2': 11.924335172968819,
 'rougeL': 21.589632817687352,
 'rougeLsum': 21.570389214031827,
 'bertscore_precision': 83.97513969242573,
 'bertscore_recall': 82.87609891295433,
 'bertscore_f1': 83.40484811365604}

### 🏅 Full Fine Tuned Model Evaluation

- Rouge
  - rouge1: 35.0888
  - rouge2: 11.9243
  - rougeL: 21.5896
  - rougeLsum: 21.5703

- BERTscore
  - precision: 83.9751
  - recall: 82.8760
  - f1: 83.4048

- ROUGE scores improved significantly on our data compared to the base model (for example, rouge1 rose from 26.61 to 35.09)

- Let's check the memory size of the model

In [62]:
param_size = 0
for param in full_fine_tuned_model.parameters():
    param_size += param.nelement() * param.element_size()

buffer_size = 0
for buffer in full_fine_tuned_model.buffers():
    buffer_size += buffer.nelement() * buffer.element_size()

model_size = (param_size + buffer_size) / 1024**2
print(f"Model size in memory: {model_size:.2f} MB")

Model size in memory: 230.81 MB
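
The reported size can be checked by hand: in float32 each parameter takes 4 bytes, and the model has 60,506,624 parameters. The block below is a short sketch of that estimate; the helper `weights_size_mb` and the `BYTES_PER_PARAM` table are illustrative assumptions, and the calculation ignores buffers, which contribute very little here.

```python
# Bytes used per parameter for a few common storage types
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_size_mb(num_params, dtype="float32"):
    """Approximate size of the model weights in MB (1 MB = 1024**2 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**2


t5_small_params = 60_506_624  # total parameters reported by count_parameters
print(f"float32: {weights_size_mb(t5_small_params):.2f} MB")
print(f"float16: {weights_size_mb(t5_small_params, 'float16'):.2f} MB")
```

This covers the weights only. During full fine-tuning with AdamW, each trainable parameter also needs a gradient and two optimizer state values, so the training footprint is several times larger before activations are counted.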


### 🫨 Problems with Full Fine-Tuning

- A big LLM (like the larger T5, GPT, or LLaMA models) has billions of parameters.

- Full fine-tuning means updating all of those parameters, which needs a huge amount of GPU memory and is very slow.

## 🎛 Parameter Efficient Fine Tuning (PEFT)

- Fine tuning with LoRA (Low-Rank Adaptation)

  - Instead of changing the whole model, let's just add a tiny set of trainable layers on top of it.
  - You freeze the original model (no changes to billions of weights).
  - Add small adapter layers (lightweight matrices).
  - Train only these adapters, that means way fewer parameters to update.
  - At the end, you combine the adapter with the original model.
  - It requires way less GPU memory (can fine-tune large models on a single GPU)
  - You can have multiple LoRA adapters for different tasks (e.g., one adapter for summarization, one adapter for legal documents)
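
The steps above can be sketched numerically for a single linear layer. The frozen weight `W` is left untouched, and two small matrices `A` and `B` of rank `r` carry the update, scaled by `alpha / r`. This is a minimal NumPy illustration under stated assumptions (random weights, a 512 x 512 layer, `r=16`, `alpha=32`), not the PEFT library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 16, 32

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init: adapter starts as a no-op
scale = alpha / r


def lora_forward(x, W, A, B, scale):
    """Frozen layer output plus the scaled low-rank update."""
    return x @ W.T + scale * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))
# At initialization the adapter does not change the layer's output
assert np.allclose(lora_forward(x, W, A, B, scale), x @ W.T)

# Pretend training has updated B, then merge the adapter into the base weight
B = rng.normal(scale=0.01, size=(d_out, r))
W_merged = W + scale * (B @ A)
assert np.allclose(lora_forward(x, W, A, B, scale), x @ W_merged.T)

print("frozen params:", W.size, "| adapter params:", A.size + B.size)
```

The merge step is what "combine the adapter with the original model" refers to: after merging, inference costs the same as the original layer.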

In [None]:
# # Install or upgrade the library if you do not have it already
# !pip install peft
# !pip install --upgrade peft

In [24]:
from peft import LoraConfig, get_peft_model, TaskType, PeftModelForSeq2SeqLM

In [25]:
# Define LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

#### LoraConfig Arguments:
- `r`: Rank of matrices for LoRA adapter
- `lora_alpha`: A scaling factor applied to the LoRA weights.
- `target_modules`: List of module names where LoRA layers are injected (e.g., attention projection layers).
- `lora_dropout`: Dropout applied to LoRA activations during training (for regularization).
- `bias`: Whether to train biases as well.
- `task_type`: Tells PEFT which type of task you're fine-tuning for.
    - "SEQ_CLS": sequence classification
    - "SEQ_2_SEQ_LM": encoder-decoder models (like T5, BART)
    - "CAUSAL_LM": decoder-only (like GPT, LLaMA)
    - "TOKEN_CLS": token classification
    - "QUESTION_ANS": QA tasks

In [26]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")
base_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

In [None]:
# Drop `num_items_in_batch` from the keyword arguments the Trainer passes to the model, because the PEFT model's forward does not accept it yet
class PatchedPeftModel(PeftModelForSeq2SeqLM):
    def forward(self, *args, **kwargs):
        kwargs.pop("num_items_in_batch", None)  # drop it if present
        return super().forward(*args, **kwargs)

patch_lora_model = PatchedPeftModel(base_model, lora_config)

In [28]:
patch_lora_model.print_trainable_parameters()

trainable params: 589,824 || all params: 61,096,448 || trainable%: 0.9654
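
The trainable parameter count can be reproduced by hand from the LoRA configuration. The sketch below assumes the T5-small architecture values (hidden size 512, 6 encoder and 6 decoder layers, with each decoder layer holding both a self-attention and a cross-attention block); the variable names are illustrative.

```python
d_model = 512                          # T5-small hidden size (assumed architecture value)
r = 16                                 # LoRA rank from lora_config
targets_per_attention = 2              # "q" and "v" projections
encoder_layers, decoder_layers = 6, 6  # T5-small depth (assumed architecture value)

# Each decoder layer has self-attention and cross-attention
attention_blocks = encoder_layers + 2 * decoder_layers

# Every adapted projection gets A (r x d_model) and B (d_model x r)
params_per_module = r * d_model + d_model * r
lora_params = attention_blocks * targets_per_attention * params_per_module

print(lora_params)                                   # 589824
print(f"{lora_params * 4 / 1024**2:.2f} MB in float32")
```

Added to the 60,506,624 base parameters, this gives the 61,096,448 total printed above, and it shows why the adapter is so much smaller to store than the full model.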


In [29]:
lora_data_collator = DataCollatorForSeq2Seq(tokenizer, model=patch_lora_model)

In [None]:
lora_training_args = Seq2SeqTrainingArguments(
    output_dir=f"models/t5-small-lora-finetuned",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=5,
    predict_with_generate=True,
    logging_dir="./logs",
    report_to="none",
    save_safetensors=False,
    remove_unused_columns=False
)

#### Seq2SeqTrainingArguments:
Extra arguments required for LoRA (may change in future updates)
- `save_safetensors`: Use safetensors saving and loading for state dicts instead of default torch.load and torch.save.
- `remove_unused_columns`: Whether or not to automatically remove the columns unused by the model forward method.

In [31]:
# Keep only the columns the model consumes; remove_columns on a DatasetDict applies to every split
keep_columns = ["input_ids", "attention_mask", "labels"]
lora_dataset = dataset.remove_columns(
    [col for col in dataset["train"].column_names if col not in keep_columns]
)
for split in ["train", "validation", "test"]:
    print(lora_dataset[split].column_names)

['input_ids', 'attention_mask', 'labels']
['input_ids', 'attention_mask', 'labels']
['input_ids', 'attention_mask', 'labels']


In [32]:
lora_trainer = Seq2SeqTrainer(
    model=patch_lora_model,
    args=lora_training_args,
    train_dataset=lora_dataset["train"],
    eval_dataset=lora_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=lora_data_collator,
    compute_metrics=compute_metrics
)


In [132]:
%%time
lora_trainer.train()

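| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Bertscore Precision | Bertscore Recall | Bertscore F1 |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------------------|------------------|--------------|
| 1 | 3.3249 | 3.112329 | 10.210416 | 3.269741 | 8.36099 | 8.364207 | 84.672566 | 77.836604 | 81.087008 |
| 2 | 3.277 | 3.052506 | 11.041121 | 3.822872 | 8.989616 | 8.993462 | 85.349418 | 78.129079 | 81.558418 |
| 3 | 3.2325 | 3.021656 | 11.401158 | 4.077557 | 9.252127 | 9.255995 | 85.744994 | 78.26277 | 81.81286 |
| 4 | 3.2522 | 3.008114 | 11.540443 | 4.196062 | 9.360867 | 9.363418 | 85.871677 | 78.319947 | 81.901515 |
| 5 | 3.2423 | 3.00377 | 11.581388 | 4.236063 | 9.400156 | 9.401994 | 85.906796 | 78.332417 | 81.924238 |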


CPU times: user 32min 26s, sys: 6min 39s, total: 39min 6s
Wall time: 39min 1s


TrainOutput(global_step=6250, training_loss=3.3556850659179687, metrics={'train_runtime': 2341.4516, 'train_samples_per_second': 21.354, 'train_steps_per_second': 2.669, 'total_flos': 1.371537408e+16, 'train_loss': 3.3556850659179687, 'epoch': 5.0})

In [137]:
lora_trainer.model.save_pretrained(f"models/t5-small-lora-finetuned")
tokenizer.save_pretrained(f"models/t5-small-lora-finetuned")

('models/t5-small-lora-finetuned/tokenizer_config.json',
 'models/t5-small-lora-finetuned/special_tokens_map.json',
 'models/t5-small-lora-finetuned/tokenizer.json')

### 🏅 Evaluating the Model Fine Tuned with LoRA

In [150]:
base_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
# Load LoRA weights on top of the base model
lora_fine_tuned_tokenizer = AutoTokenizer.from_pretrained(f"models/t5-small-lora-finetuned")
lora_fine_tuned_model = PeftModelForSeq2SeqLM.from_pretrained(base_model, f"models/t5-small-lora-finetuned").to(DEVICE)
print(next(lora_fine_tuned_model.parameters()).device)

evaluate(lora_fine_tuned_model, lora_fine_tuned_tokenizer, test_dataloader)

cuda:0


100%|██████████| 250/250 [12:44<00:00,  3.06s/it]


{'rouge1': 31.89129946884353,
 'rouge2': 9.928396604034283,
 'rougeL': 19.62841982265169,
 'rougeLsum': 19.609744594583685,
 'bertscore_precision': 83.08628111481666,
 'bertscore_recall': 82.15144068300724,
 'bertscore_f1': 82.59759333431721}

### 🏅 LoRA Fine Tuned Model Evaluation

- Rouge
  - rouge1: 31.8912
  - rouge2: 9.9283
  - rougeL: 19.6284
  - rougeLsum: 19.6097

- BERTscore
  - precision: 83.0862
  - recall: 82.1514
  - f1: 82.5975

- Let's check the memory size for the model

In [48]:
param_size = 0
for param in lora_fine_tuned_model.parameters():
    param_size += param.nelement() * param.element_size()

buffer_size = 0
for buffer in lora_fine_tuned_model.buffers():
    buffer_size += buffer.nelement() * buffer.element_size()

model_size = (param_size + buffer_size) / 1024**2
print(f"Model size in memory: {model_size:.2f} MB")

Model size in memory: 233.06 MB


- The model trained with LoRA performed well, although we can see some loss in performance.
- But now we only need to save the LoRA parameters (adapters).
- We can train multiple adapters and use each one for its respective task.

- With LoRA we only need to save the adapters, but the memory footprint of the model doesn't change, which means we still need a large GPU to fit large models.
- We can reduce the memory footprint by using a quantized base model for LoRA training.
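
To make the adapter idea concrete, here is a minimal pure-Python sketch of a single linear layer with a frozen weight and several named low-rank adapters. The class `ToyLoRALinear` and the adapter names are hypothetical and exist only for illustration; this is not how the PEFT library implements LoRA, only the arithmetic `y = W x + (alpha / r) * B (A x)` that it relies on.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class ToyLoRALinear:
    """Frozen weight W plus named low-rank adapters (A, B).

    Forward pass: y = W x + (alpha / r) * B (A x)
    """

    def __init__(self, W):
        self.W = W           # frozen base weight, shape (d_out, d_in)
        self.adapters = {}   # name -> (A, B, scale)
        self.active = None   # name of the adapter in use, or None

    def add_adapter(self, name, r, alpha, seed=0):
        d_out, d_in = len(self.W), len(self.W[0])
        rng = random.Random(seed)
        # A starts random and B starts at zero, so a new adapter is a no-op.
        A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
        B = [[0.0] * r for _ in range(d_out)]
        self.adapters[name] = (A, B, alpha / r)

    def forward(self, x):
        y = matvec(self.W, x)
        if self.active is not None:
            A, B, scale = self.adapters[self.active]
            delta = matvec(B, matvec(A, x))
            y = [yi + scale * di for yi, di in zip(y, delta)]
        return y

    def adapter_param_count(self, name):
        A, B, _ = self.adapters[name]
        return len(A) * len(A[0]) + len(B) * len(B[0])


d = 8
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # identity
layer = ToyLoRALinear(W)
layer.add_adapter("summarization", r=2, alpha=2, seed=1)
layer.add_adapter("translation", r=2, alpha=2, seed=2)

x = [float(i) for i in range(d)]
base_out = layer.forward(x)

layer.active = "summarization"
untrained_out = layer.forward(x)  # B is zero, so this equals base_out

# Pretend training updated one entry of B for the summarization adapter only.
layer.adapters["summarization"][1][0][0] = 0.5
trained_out = layer.forward(x)

print(layer.adapter_param_count("summarization"), "adapter params vs", d * d, "base params")
```

The base weight `W` is shared and never modified; switching `layer.active` swaps the task-specific behaviour. Only the small `A` and `B` matrices would need to be saved per task, while `W` still has to be held in memory in full.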

### 🎉 Fine Tuning with QLoRA (Quantized Low-Rank Adaptation)

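  - Even with LoRA, training very large models (7B, 13B, 70B parameters) is still GPU-heavy, because the frozen base weights must sit in GPU memory.
  - QLoRA reduces this cost by quantizing the frozen base weights, which brings models in the 7B to 13B range within reach of a single 24GB GPU (like an RTX 3090 or 4090). Larger models still need more memory: the QLoRA paper reports fine-tuning a 65B model on a single 48GB GPU.
  - Instead of storing all base weights in 16/32-bit precision, QLoRA compresses them to a lower precision like 4-bit, while the LoRA adapters are trained in higher precision.
  - That's like shrinking a huge library 📚 into tiny notes.
  - This cuts memory use drastically.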
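
A rough back-of-the-envelope estimate shows why precision matters so much. The helper `weight_memory_gib` below is a hypothetical name, and the figures cover the stored weights only: activations, gradients, optimizer state, and quantization constants are ignored, so real usage is higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Nominal parameter counts for the model sizes mentioned above.
for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    row = ", ".join(
        f"{bits:>2}-bit: {weight_memory_gib(n, bits):6.1f} GiB" for bits in (32, 16, 4)
    )
    print(f"{name:>3} -> {row}")
```

Going from 16-bit to 4-bit divides the weight storage by four, which is the main source of QLoRA's savings.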

In [None]:
# Install the library if you do not have it already
# !pip install bitsandbytes

In [33]:
from transformers import BitsAndBytesConfig

In [34]:
model_name = "t5-small"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16"
)

base_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

qlora_tokenizer = AutoTokenizer.from_pretrained(model_name)

#### BitsAndBytesConfig Arguments:
- `load_in_4bit`: Whether to load the model in 4-bit precision.
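- `bnb_4bit_use_double_quant`: Whether to use double (nested) quantization. The quantization constants (scaling factors) produced by the first quantization are themselves quantized, which saves roughly 0.4 additional bits per parameter. It reduces memory, not quantization error.
- `bnb_4bit_quant_type`: Type of 4-bit quantization.
    - "fp4": 4-bit floating point (the library default).
    - "nf4": NormalFloat4, a data type designed for normally distributed values, which neural network weights typically are; usually the recommended choice for training.
- `bnb_4bit_compute_dtype`: The dtype used for computation. Weights are stored in 4-bit and dequantized to this dtype for matrix multiplications. Accepts a `torch.dtype` or its name as a string, as here.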
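
The sketch below illustrates the general mechanism with a deliberately simplified scheme: blockwise absmax quantization to signed integer codes in [-7, 7], which fit in 4 bits. This is not the NF4 data type and not the bitsandbytes implementation; all function names are hypothetical. It does show where the per-block constants come from, and the last lines reproduce the overhead arithmetic reported in the QLoRA paper (block size 64 with 32-bit constants, then 8-bit constants grouped in blocks of 256), taken here as an assumption about the defaults.

```python
def quantize_block(block, levels=7):
    """Absmax-quantize one block of floats to integers in [-levels, levels].

    Returns the integer codes and the scale (the block's quantization constant).
    """
    scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
    codes = [round(v / scale * levels) for v in block]
    return codes, scale


def dequantize_block(codes, scale, levels=7):
    """Map integer codes back to approximate float values."""
    return [c / levels * scale for c in codes]


def quantize(weights, block_size):
    """Quantize a flat list of weights block by block."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [quantize_block(b) for b in blocks]


def dequantize(quantized):
    """Rebuild the approximate flat weight list from (codes, scale) pairs."""
    out = []
    for codes, scale in quantized:
        out.extend(dequantize_block(codes, scale))
    return out


def constant_overhead_bits(block_size, constant_bits):
    """Extra bits per weight spent on storing one constant per block."""
    return constant_bits / block_size


weights = [0.02, -0.15, 0.07, 0.30, -0.01, 0.004, -0.008, 0.012]
q = quantize(weights, block_size=4)
restored = dequantize(q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("codes and scales:", q)
print(f"max reconstruction error: {max_error:.5f}")

# Overhead of the constants, without and with double quantization.
single = constant_overhead_bits(64, 32)
double = constant_overhead_bits(64, 8) + constant_overhead_bits(64 * 256, 32)
print(f"constants cost {single} bits/param, {double:.3f} with double quantization")
```

Each block keeps its own scale, so small-magnitude blocks are not crushed by a large outlier elsewhere. Double quantization compresses those scales, which is why it saves memory rather than improving accuracy.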

In [35]:
qlora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)


In [36]:
patch_qlora_model = PatchedPeftModel(base_model, qlora_config)

In [37]:
patch_qlora_model.print_trainable_parameters()

trainable params: 1,179,648 || all params: 61,686,272 || trainable%: 1.9123
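
The trainable parameter count can be reproduced by hand. The sketch below assumes the standard `t5-small` architecture (hidden size 512, 8 attention heads of size 64, 6 encoder and 6 decoder layers) and the `r=32`, `target_modules=["q", "v"]` settings from the config above.

```python
d_model = 512              # t5-small hidden size
inner_dim = 512            # num_heads (8) * d_kv (64)
encoder_layers = 6         # each has one self-attention block
decoder_layers = 6         # each has self-attention and cross-attention
r = 32                     # LoRA rank from the config above
targets_per_attention = 2  # "q" and "v"

attention_blocks = encoder_layers + 2 * decoder_layers
adapted_modules = attention_blocks * targets_per_attention

# Each adapted Linear(d_model -> inner_dim) gains A (r x d_model) and B (inner_dim x r).
params_per_module = r * d_model + inner_dim * r
trainable = adapted_modules * params_per_module

all_params = 61_686_272    # total reported in the output above
base_params = all_params - trainable

print(f"adapted modules:  {adapted_modules}")
print(f"trainable params: {trainable:,}")
print(f"base params:      {base_params:,}")
print(f"trainable%:       {100 * trainable / all_params:.4f}")
```

Doubling `r` doubles the adapter size, and adding more target modules (for example the key and output projections) scales it in the same way.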


In [38]:
qlora_training_args = Seq2SeqTrainingArguments(
    output_dir=f"models/t5-small-qlora-finetuned",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=5,
    predict_with_generate=True,
    logging_dir="./logs",
    report_to="none",
    logging_steps=50,
    save_safetensors=False,
    remove_unused_columns=False,
    fp16=True
)

In [39]:
qlora_data_collator = DataCollatorForSeq2Seq(tokenizer, model=patch_qlora_model)

In [40]:
qlora_trainer = Seq2SeqTrainer(
    model=patch_qlora_model,
    args=qlora_training_args,
    train_dataset=lora_dataset["train"],
    eval_dataset=lora_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=qlora_data_collator,
    compute_metrics=compute_metrics,
)


In [43]:
%%time
qlora_trainer.train()

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | BERTScore Precision | BERTScore Recall | BERTScore F1 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 3.3597 | 3.135914 | 10.268999 | 3.329947 | 8.355658 | 8.359486 | 84.7596 | 77.851797 | 81.13422 |
| 2 | 3.2983 | 3.068748 | 10.939821 | 3.737948 | 8.894346 | 8.888834 | 85.272532 | 78.043793 | 81.475874 |
| 3 | 3.2522 | 3.035628 | 11.35952 | 4.031886 | 9.238123 | 9.228312 | 85.715881 | 78.244065 | 81.789087 |
| 4 | 3.2681 | 3.021797 | 11.484618 | 4.147022 | 9.358125 | 9.349953 | 85.842886 | 78.300418 | 81.877937 |
| 5 | 3.2587 | 3.017646 | 11.561844 | 4.174399 | 9.4084 | 9.397774 | 85.919167 | 78.32188 | 81.924524 |


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


CPU times: user 36min 42s, sys: 1min 38s, total: 38min 21s
Wall time: 38min 27s


TrainOutput(global_step=6250, training_loss=3.3803716064453124, metrics={'train_runtime': 2306.7909, 'train_samples_per_second': 21.675, 'train_steps_per_second': 2.709, 'total_flos': 1.38965680128e+16, 'train_loss': 3.3803716064453124, 'epoch': 5.0})

In [44]:
qlora_trainer.model.save_pretrained("models/t5-small-qlora-finetuned")
tokenizer.save_pretrained("models/t5-small-qlora-finetuned")

('models/t5-small-qlora-finetuned/tokenizer_config.json',
 'models/t5-small-qlora-finetuned/special_tokens_map.json',
 'models/t5-small-qlora-finetuned/tokenizer.json')

### 🏅 Evaluating the Model Fine Tuned with QLoRA

In [41]:
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    "t5-small",
    quantization_config=bnb_config,
    device_map="auto"
)
# Load the QLoRA adapter weights on top of the quantized base model
qlora_fine_tuned_tokenizer = AutoTokenizer.from_pretrained("models/t5-small-qlora-finetuned")
qlora_fine_tuned_model = PeftModelForSeq2SeqLM.from_pretrained(base_model, "models/t5-small-qlora-finetuned").to(DEVICE)
print(next(qlora_fine_tuned_model.parameters()).device)

cuda:0


In [None]:
evaluate(qlora_fine_tuned_model, qlora_fine_tuned_tokenizer, test_dataloader)

### 🏅 QLoRA Fine Tuned Model Evaluation

- Rouge
  - rouge1: 32.1096
  - rouge2: 10.0119
  - rougeL: 19.7045
  - rougeLsum: 19.6978

- BERTscore
  - precision: 83.0649
  - recall: 82.1939
  - f1: 82.6086

- Let's check memory size for the model 

In [45]:
param_size = 0
for param in qlora_fine_tuned_model.parameters():
    param_size += param.nelement() * param.element_size()

buffer_size = 0
for buffer in qlora_fine_tuned_model.buffers():
    buffer_size += buffer.nelement() * buffer.element_size()

model_size = (param_size + buffer_size) / 1024**2
print(f"Model size in memory: {model_size:.2f} MB")

Model size in memory: 98.91 MB


- The memory footprint of the model dropped from 233.06 MB to 98.91 MB, a reduction of about 58%.
- Performance stays similar to the LoRA and fully fine-tuned models.
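
The comparison can be put into numbers with a few lines of Python. The values below are copied from the LoRA and QLoRA results reported in this notebook; the dictionary layout is only an illustrative choice.

```python
# Test-set metrics and in-memory sizes copied from the results above.
lora = {"rouge1": 31.8912, "rouge2": 9.9283, "rougeL": 19.6284,
        "bertscore_f1": 82.5975, "size_mb": 233.06}
qlora = {"rouge1": 32.1096, "rouge2": 10.0119, "rougeL": 19.7045,
         "bertscore_f1": 82.6086, "size_mb": 98.91}

# Metric differences (QLoRA minus LoRA), excluding the size entry.
metric_deltas = {k: round(qlora[k] - lora[k], 4) for k in lora if k != "size_mb"}

# Relative reduction in memory footprint.
reduction_pct = (lora["size_mb"] - qlora["size_mb"]) / lora["size_mb"] * 100

for name, delta in metric_deltas.items():
    print(f"{name:>13}: {delta:+.4f} (QLoRA minus LoRA)")
print(f"memory reduction: {reduction_pct:.1f}%")
```

All metric differences are well under one point, so on this test set the two adapters are effectively on par while the quantized model takes less than half the memory.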