# 📝 Fine-Tuning T5-Small on Dolly-15K for Summarization

This notebook demonstrates how to fine-tune the `google-t5/t5-small` model on the [`databricks/databricks-dolly-15k`](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset for **abstractive text summarization** using Hugging Face’s `Transformers`, `Datasets`, and `Accelerate` libraries.

## 🔧 Project Details

- **Model**: `google-t5/t5-small` (60M parameters)
- **Dataset**: `databricks/databricks-dolly-15k`
  - **Input**: `context`
  - **Target**: `response`
- **Task**: Abstractive Summarization
- **Libraries Used**: 🤗 Transformers, Datasets, Accelerate, PyTorch, NLTK
- **Training Method**: Custom training loop using `Accelerate`
- **Evaluation Metric**: ROUGE (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum)

## 📊 Training Results (ROUGE Scores)

| Epoch | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|-------|---------|---------|---------|------------|
| 0     | 31.16   | 18.47   | 28.42   | 28.90      |
| 1     | 31.39   | 18.63   | 28.65   | 29.16      |
| 2     | 31.41   | 18.61   | 28.72   | 29.19      |
| 3     | 31.46   | 18.65   | 28.76   | 29.25      |
| 4     | 31.46   | 18.65   | 28.76   | 29.25      |

> ROUGE improved only marginally across training (ROUGE-1 rose from 31.16 to 31.46) and plateaued from epoch 3 onward; epochs 3 and 4 report identical scores.

## 🚀 How to Use

You can load the fine-tuned model locally for summarization:

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="results-t5-finetuned-squad-accelerate", tokenizer="results-t5-finetuned-squad-accelerate")

text = "Coffee drinks are made by brewing water with ground coffee beans..."
summary = summarizer(text, max_length=50)[0]["summary_text"]
print(summary)
```

In [None]:
!pip install --upgrade datasets

### Dataset loading and split for train, test, validation

In [None]:
from datasets import load_dataset

# Load the dataset
ds = load_dataset("databricks/databricks-dolly-15k")

#  Split into train + temp (for validation and test)
train_test_split = ds["train"].train_test_split(test_size=0.2, seed=42)
train_ds = train_test_split["train"]
temp_ds = train_test_split["test"]

#  Split temp into validation + test
val_test_split = temp_ds.train_test_split(test_size=0.5, seed=42)
val_ds = val_test_split["train"]
test_ds = val_test_split["test"]


ds_split = {
    "train": train_ds,
    "validation": val_ds,
    "test": test_ds
}
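
The two-stage split above holds out 20% of the rows and then halves that portion, which gives roughly an 80/10/10 train/validation/test split. As a rough illustration, the same idea can be sketched on plain row indices with the standard library. The helper name `two_stage_split` is hypothetical, and the exact rounding used by `train_test_split` may differ from this sketch.

```python
import random

def two_stage_split(n_rows, holdout_frac=0.2, seed=42):
    """Shuffle row indices, hold out a fraction, then halve the holdout."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Stage 1: train vs. held-out rows
    n_holdout = int(n_rows * holdout_frac)
    holdout, train = indices[:n_holdout], indices[n_holdout:]

    # Stage 2: held-out rows are halved into validation and test
    n_val = len(holdout) // 2
    return train, holdout[:n_val], holdout[n_val:]

train_idx, val_idx, test_idx = two_stage_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

Fixing the seed in both stages, as the notebook does, keeps the three splits reproducible across runs.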


In [None]:
# Wrap the three splits in a single DatasetDict

from datasets import DatasetDict

ds = DatasetDict(ds_split)
ds


In [None]:
# Remove rows whose context is empty
def has_non_empty_context(example):
    return example["context"].strip() != ""

# Apply the filter to each split; this yields a plain dict that is wrapped in a DatasetDict again below
ds = {
    split: dataset.filter(has_non_empty_context)
    for split, dataset in ds.items()
}



In [None]:
# visualize first 3 rows
for example in ds["train"].select(range(3)):
    print("Context:\n", example["context"])
    print("Response:\n", example["response"])
    print("=" * 80)


In [None]:
ds

In [None]:
from datasets import DatasetDict

# Convert dict back to DatasetDict
ds = DatasetDict(ds)

# Now set format and convert to pandas
ds.set_format("pandas")
ds_df = ds["train"][:]

# Show counts for top 20 categories
ds_df["category"].value_counts()[:20]


In [None]:
# # If the data came in several languages, the per-language datasets could be concatenated before training,
# # e.g. combining English and Spanish reviews into one DatasetDict. 🤗 Datasets provides concatenate_datasets(), which stacks Dataset objects on top of each other.
# from datasets import concatenate_datasets, DatasetDict

# books_dataset = DatasetDict()

# for split in english_books.keys():
#     books_dataset[split] = concatenate_datasets(
#         [english_books[split], spanish_books[split]]
#     )
#     books_dataset[split] = books_dataset[split].shuffle(seed=42)

# # Peek at a few examples
# show_samples(books_dataset)

Many models can be used for text summarization. Because this dataset is monolingual (English), models such as T5 or BART are suitable. A multilingual dataset would instead call for a multilingual model such as mT5 or mBART-50.

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

#### Preprocess the dataset

In [None]:
def preprocess_function(examples):
    contexts = [c if isinstance(c, str) else "" for c in examples["context"]]
    responses = [r if isinstance(r, str) else "" for r in examples["response"]]

    model_inputs = tokenizer(
        contexts,
        max_length=512,
        truncation=True,
    )
    labels = tokenizer(
        responses,
        max_length=30,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


In [None]:
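# Drop the raw text columns so the data collator only receives token IDs
tokenized_datasets = ds.map(preprocess_function, batched=True, remove_columns=ds["train"].column_names)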


In [None]:
tokenized_datasets

### 📏 ROUGE Score for Summarization Evaluation

ROUGE is a metric that evaluates how much a generated summary overlaps with a reference summary using precision, recall, and F1-score.  
- **Recall**: Measures how much of the reference summary is covered by the generated one.  
- **Precision**: Measures how much of the generated summary is relevant to the reference.  
- **F1**: Harmonic mean of precision and recall.

Example:  
Generated → "I absolutely loved reading the Hunger Games"  
Reference → "I loved reading the Hunger Games"  
→ 6 overlapping words → Recall = 6/6 = 1.0, Precision = 6/7 ≈ 0.86, F1 ≈ 0.92

Install with: `!pip install rouge_score evaluate`
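
To make the precision/recall arithmetic concrete, here is a minimal sketch of ROUGE-1 computed by hand. It is a simplification: it lowercases and splits on whitespace, whereas the `rouge_score` package applies its own tokenization, so real scores can differ slightly. The function name `rouge1` is illustrative, and the sketch assumes both strings are non-empty.

```python
from collections import Counter

def rouge1(generated, reference):
    """Unigram-overlap precision, recall, and F1 on whitespace tokens."""
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((gen_counts & ref_counts).values())

    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge1(
    "I absolutely loved reading the Hunger Games",
    "I loved reading the Hunger Games",
)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.857 1.0 0.923
```

The generated sentence has 7 words and the reference has 6, all of which are matched, so precision is 6/7 while recall is 6/6.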


In [None]:
# !pip install rouge_score evaluate

In [None]:
import evaluate

rouge_score = evaluate.load("rouge")

In [None]:
#Check

generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
scores

### 🧪 Lead-3 Baseline for Summarization

A simple baseline for summarization is the **lead-3** method: return the first 3 sentences of the article.  
To handle sentence boundaries accurately (e.g. “U.S.” vs full stop), we use the `nltk` library:

```bash
!pip install nltk
```

In [None]:
# !pip install nltk

In [None]:
# download the punctuation rules:
import nltk
nltk.download("punkt")
nltk.download('punkt_tab')

In [None]:
from nltk.tokenize import sent_tokenize

def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])


In [None]:
# Convert to pandas first if needed
ds.set_format("pandas")
df = ds["train"][:]

# Use .loc or .iloc to access individual string values
print(three_sentence_summary(df.loc[1, "context"]))


In [None]:
def evaluate_baseline(dataset, metric):
    summaries = [three_sentence_summary(text) for text in dataset["context"]]
    return metric.compute(predictions=summaries, references=dataset["response"])


***Calculate ROUGE Scores***

We evaluate the summaries on the validation set and report four ROUGE metrics:

- **rouge1**: unigram (single-word) overlap.
- **rouge2**: bigram (two-word phrase) overlap; usually lower, because exact bigram matches are rarer.
- **rougeL**: longest common subsequence over the whole text, which rewards matching word order.
- **rougeLsum**: longest common subsequence computed per sentence (split on newlines), better suited to multi-sentence summaries.
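
The quantities behind ROUGE-2 and ROUGE-L can be sketched in a few lines of plain Python. This is only an illustration on whitespace tokens: the helper names `bigrams` and `lcs_length` are hypothetical, and the set intersection ignores repeated bigrams, which the real metric counts.

```python
def bigrams(tokens):
    """Consecutive token pairs, the unit that ROUGE-2 compares."""
    return list(zip(tokens, tokens[1:]))

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

gen = "I absolutely loved reading the Hunger Games".split()
ref = "I loved reading the Hunger Games".split()

shared_bigrams = set(bigrams(gen)) & set(bigrams(ref))
print(len(shared_bigrams), "of", len(bigrams(ref)), "reference bigrams matched")  # 4 of 5
print("LCS length:", lcs_length(gen, ref))  # 6
```

The inserted word "absolutely" breaks one bigram match but leaves the longest common subsequence intact, which is why ROUGE-L tends to be more forgiving than ROUGE-2.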

In [None]:
import pandas as pd

score = evaluate_baseline(ds["validation"], rouge_score)
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_dict = {rn: round(score[rn] * 100, 2) for rn in rouge_names}
print(rouge_dict)


# Fine-tuning T5 with Accelerate

##### Preparing for the training

In [None]:
tokenized_datasets.set_format("torch")

In [None]:
# Transformers provides a DataCollatorForSeq2Seq collator that will dynamically pad the inputs and the labels for us.
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
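
Dynamic padding means each batch is padded only to the length of its own longest sequence, with labels padded by `-100` so that padded positions are ignored by the loss. The sketch below illustrates the idea on plain lists; it is not the collator's actual implementation, the helper name `pad_batch` is hypothetical, and the token IDs are made up.

```python
PAD_TOKEN_ID = 0     # T5's pad token id
LABEL_PAD_ID = -100  # ignored by PyTorch's cross-entropy loss by default

def pad_batch(features):
    """Pad input_ids and labels to the longest sequence in this batch only."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_inputs - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(
            f["labels"] + [LABEL_PAD_ID] * (max_labels - len(f["labels"]))
        )
    return batch

batch = pad_batch([
    {"input_ids": [37, 423, 1], "labels": [71, 1]},
    {"input_ids": [37, 1], "labels": [71, 12, 9, 1]},
])
print(batch["input_ids"])  # [[37, 423, 1], [37, 1, 0]]
print(batch["labels"])     # [[71, 1, -100, -100], [71, 12, 9, 1]]
```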

In [None]:
from torch.utils.data import DataLoader

batch_size = 8
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=batch_size
)

In [None]:
# setting optimizer
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [None]:
# Setting Accelerator
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [None]:
# Linear learning-rate schedule: the rate decays from its initial value to 0 over training (no warmup)
from transformers import get_scheduler

num_train_epochs = 5
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
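
With zero warmup steps, a linear schedule simply scales the initial learning rate by the fraction of training steps remaining. A minimal sketch of that rule follows, assuming the usual linear warmup and decay shape. The function name `linear_decay_lr` and the step count of 1000 are hypothetical; the notebook's real step count is `num_train_epochs * len(train_dataloader)`.

```python
def linear_decay_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` updates: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)

total_steps = 1000  # hypothetical value for illustration
for step in (0, 250, 500, 1000):
    print(step, linear_decay_lr(step, 2e-5, total_steps))
```

Halfway through training the rate is half of the initial `2e-5`, and it reaches zero at the final step.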

For post-processing, we need a function that splits the generated summaries into sentences that are separated by newlines. This is the format the ROUGE metric expects, and we can achieve this with the following snippet of code:

In [None]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # ROUGE expects a newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels
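
To see the format this produces, the sketch below applies the same strip-and-join step to one string. ROUGE-Lsum treats newlines as sentence boundaries, which is why the join matters. Note that `naive_sent_split` is a hypothetical, crude regex stand-in for `nltk.sent_tokenize`, used here only so the example runs without downloads.

```python
import re

def naive_sent_split(text):
    """Crude stand-in for nltk.sent_tokenize: split after ., !, or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

pred = "  Coffee is brewed from roasted beans. It contains caffeine.  "
formatted = "\n".join(naive_sent_split(pred))
print(repr(formatted))
# 'Coffee is brewed from roasted beans.\nIt contains caffeine.'
```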

In [None]:
output_dir = "results-t5-finetuned-squad-accelerate"

In [None]:
from tqdm.auto import tqdm
import torch
import numpy as np

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
            )

            generated_tokens = accelerator.pad_across_processes(
                generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
            )
            # Pad the labels across processes too, since batches are padded dynamically
            labels = accelerator.pad_across_processes(
                batch["labels"], dim=1, pad_index=tokenizer.pad_token_id
            )

            generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
            labels = accelerator.gather(labels).cpu().numpy()

            # Replace -100 in the labels as we can't decode them
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            if isinstance(generated_tokens, tuple):
                generated_tokens = generated_tokens[0]
            decoded_preds = tokenizer.batch_decode(
                generated_tokens, skip_special_tokens=True
            )
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

            decoded_preds, decoded_labels = postprocess_text(
                decoded_preds, decoded_labels
            )

            rouge_score.add_batch(predictions=decoded_preds, references=decoded_labels)

    # Compute metrics
    result = rouge_score.compute()
    # Convert the ROUGE scores to percentages
    result = {key: value * 100 for key, value in result.items()}
    result = {k: round(v, 4) for k, v in result.items()}
    print(f"Epoch {epoch}:", result)

    # Save the model and tokenizer locally (the Hub upload is commented out)
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        # repo.push_to_hub(
        #     commit_message=f"Training in progress epoch {epoch}", blocking=False
        # )
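
The evaluation loop above replaces `-100` in the labels before decoding, because `-100` is a loss-masking value rather than a real token ID. The same substitution can be sketched on plain lists. The helper name `unmask_labels` is hypothetical and the token IDs are made up.

```python
PAD_TOKEN_ID = 0     # T5's pad token id
IGNORE_INDEX = -100  # label padding value used by the collator

def unmask_labels(label_rows):
    """Swap the loss-masking value for the pad id so the rows can be decoded."""
    return [
        [PAD_TOKEN_ID if tok == IGNORE_INDEX else tok for tok in row]
        for row in label_rows
    ]

labels = [[71, 1, -100, -100], [71, 12, 9, 1]]
print(unmask_labels(labels))  # [[71, 1, 0, 0], [71, 12, 9, 1]]
```

Once the masked positions hold the pad ID, decoding with `skip_special_tokens=True` drops them from the reference text.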

## Checking the model performance

In [None]:
from transformers import pipeline


model_dir = "results-t5-finetuned-squad-accelerate"

# Load summarization pipeline with your model
summarizer = pipeline("summarization", model=model_dir)



# Example function to summarize a test context
def print_summary(idx):
    review = ds["test"][idx]["context"]  # This should now be a string
    true_summary = ds["test"][idx]["response"]

    summary = summarizer(
        review, max_length=50, clean_up_tokenization_spaces=True
    )[0]["summary_text"]

    print(f">>> Context:\n{review}")
    print(f"\n>>> Ground Truth Summary:\n{true_summary}")
    print(f"\n>>> Model Summary:\n{summary}")


# Try on one test example
ds.reset_format()
print_summary(0)
