# Transformer-Based Title Generation - Part 3

In Part 3 of the project, we explore the use of a pretrained T5 sequence-to-sequence model for the task of book title generation based on book descriptions. This task is framed as a text summarization problem, where the model receives a book description as input and generates an appropriate title as output.
  
We begin by loading the datasets that were prepared in Part 1 of the project. These include training, validation and test sets, each containing book descriptions and their corresponding titles. The data is preprocessed and tokenized to match the input format expected by the **T5 model**.
  
The following components are used in the setup:  
- Model and tokenizer: We use the **t5-small** variant loaded from the **Huggingface Model Hub** via  
tokenizer = T5Tokenizer.from_pretrained('t5-small')  
model = T5ForConditionalGeneration.from_pretrained('t5-small')  
- Data Collation: We utilize **DataCollatorForSeq2Seq** to handle dynamic padding and ensure efficient batching during training and evaluation.
- Evaluation Metric: The **ROUGE metric** is employed to evaluate the quality of the generated titles against the reference titles, with a focus on ROUGE-1.
  
Finally, we test the model on three examples from the test dataset, comparing the generated titles to the original ones to assess the model's performance.

## Load Data

We're importing the preprocessed datasets made in Part 1.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df_train = pd.read_csv('/kaggle/input/train-book-data/train_book_data.csv')
df_valid = pd.read_csv('/kaggle/input/valid-book-data/valid_book_data.csv')
df_test = pd.read_csv('/kaggle/input/test-book-data/test_book_data.csv')

df_train.tail()

Unnamed: 0,Title,Description
1589,howl and other poems,"the prophetic poem, which was born by a genera..."
1590,crown of midnight (throne of glass #2),"""a line that should never be crossed is about ..."
1591,the cuckoo's calling (cormoran strike #1),a brilliant debut secret in a classic vein: a ...
1592,"saga, volume 2 (saga (collected editions) #2)",by the award-winning writer brian k. vaughan (...
1593,legend (legend #1),here there is an alternative title edition for...


## Pretrained Model

We will use the T5 model, a multitask model.  
**T5 (Text-To-Text Transfer Transformer)** is a model developed by Google that was trained to handle many NLP tasks using a unified format. Every task is treated as text-to-text. So instead of having separate models for translation, summarization, classification, etc., T5 uses task-specific prefix tokens to understand what it's supposed to do.  
  
Examples:  
- "translate English to German: That is good" → "Das ist gut"
- "summarize: This book is about..." → "Short summary"
- "cola sentence: The cat sat on the mat" → "acceptable" (grammatical acceptability task)

In our case - generate a book title based on a book description - title is usually a short, high-level summary. So we're treating title generation as a form of summarization.

In [3]:
pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=044ce0d6e1eb9c9142e519c2c20327c76724796565d41f8d1e9ed269bb675672
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2
Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m193.6/193.6 kB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: fsspec, evaluate
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.5.1
    Uninstalling fsspec-2025.5.1:
      Successfully uninstalled fsspec-2025.5.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.8.0 req

In [5]:
import random
import torch
from transformers import (
    T5Tokenizer, T5ForConditionalGeneration,
    Trainer, TrainingArguments, DataCollatorForSeq2Seq
)
from datasets import Dataset
from evaluate import load

2025-07-20 08:07:09.713076: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752998829.901583      36 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752998829.954558      36 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [6]:
# For reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

<torch._C.Generator at 0x7c05aaa6a8b0>

The following code prepares our dataset of book descriptions and titles to fine-tune a pretrained **T5 model** so it can generate book titles based on descriptions.  
  
The original DataFrames contain two columns: "Description" and "Title". These are renamed to "input_text" and "target_text" to match the expected format for sequence-to-sequence training.  
A task-specific prefix ("summarize: ") is added to each input text so the T5 model knows what to do.

In [7]:
# Convert to the appropriate format
df_train = df_train.rename(columns={'Description': 'input_text', 'Title': 'target_text'})
df_train['input_text'] = "summarize: " + df_train['input_text']  # T5 uses task-specific prefix tokens

df_valid = df_valid.rename(columns={'Valid_Description': 'input_text', 'Valid_Title': 'target_text'})
df_valid['input_text'] = "summarize: " + df_valid['input_text'] 

The pandas DataFrame is converted into a **Huggingface Dataset**, which integrates better with the Huggingface Trainer and transformers library.

In [8]:
# Convert pandas DataFrames to Hugging Face Datasets
dataset_train = Dataset.from_pandas(df_train)
dataset_valid = Dataset.from_pandas(df_valid)

The **T5Tokenizer** and **T5ForConditionalGeneration model** are loaded from the **Huggingface Model Hub**. "t5-small" is a smaller, faster variant of the T5 model — good for low-resource environments.

In [9]:
# Load model and tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

The following function tokenize() prepares the text data (both inputs and targets) for training the model T5. It handles the tokenization of both inputs and targets by using the T5 tokenizer to convert text into token IDs that the model can understand.
   
Normally, when training a sequence-to-sequence model like T5, we need to manually pad all inputs and outputs to a fixed max_length and replace padding tokens in labels with -100, so the loss function ignores them. But we will use a DataCollatorForSeq2Seq, that will handle this for us. So this is a simplified tokenization function.

In [10]:
# Tokenization function to prepare input and target text
def tokenize(batch):
    # Tokenize the input text (book descriptions) with truncation
    model_input = tokenizer(batch['input_text'], truncation=True)
    
    # Switch tokenizer to "target mode" for encoding the output text (book titles)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(batch['target_text'], truncation=True)
    
    # Assign tokenized labels to the 'labels' key
    model_input["labels"] = labels["input_ids"]
    
    return model_input

We apply the tokenization function to the full datasets (dataset_train and dataset_valid) using the .map() method. The batched=True argument enables batch-wise processing for better performance.

In [11]:
# Apply tokenization to training and validation datasets
tokenized_train = dataset_train.map(tokenize, batched=True)
tokenized_valid = dataset_valid.map(tokenize, batched=True)

Map:   0%|          | 0/1594 [00:00<?, ? examples/s]



Map:   0%|          | 0/100 [00:00<?, ? examples/s]

A **DataCollatorForSeq2Seq** is created to handle dynamic padding during training and evaluation. Using a collator allows the model to pad only to the length of the longest sequence in each batch during training — improving efficiency.
    
It dynamically pads each batch to the maximum length in that batch — more memory-efficient than static padding.  
It automatically replaces padding tokens in labels with -100.  
It works seamlessly with a Trainer.

In [12]:
# Create a data collator that dynamically pads inputs and labels per batch and ensures padding tokens in labels are masked out during loss computation
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)

The next function compute_metrics() is used during evaluation of the model. It computes the **ROUGE-1** score, that we have seen in Part2 of the project, between the model’s predicted text and the reference labels.

1. Input Handling  
The function expects a tuple (predictions, labels) as input.
If predictions contains extra information (logits plus scores), it selects only the token IDs or logits from the first element.

3. Format Normalization  
If the model output is still in logits (not token IDs), it applies argmax to pick the most probable token at each position.
If predictions are beam search outputs (shape: batch_size × num_beams × seq_len), only the first beam (most likely hypothesis) is selected.


5. Label Preparation  
Labels used during training typically use -100 to mask out padding tokens.
Before decoding them back into text, these are replaced with the tokenizer's pad_token_id so they can be correctly transformed into strings.

6. Decoding  
Both predictions and labels (now just token IDs) are converted back to readable strings using tokenizer.batch_decode(), skipping special tokens like pad and eos.

7. Metric Calculation  
The decoded predictions and labels are passed to the ROUGE metric. WIth use_stemmer=True the process of reducing words to their base or root form is avtive. For example "running" becomes "run" and "cats" becomes "cat"  
ROUGE-1 is commonly used for evaluating short text generation tasks. It measures the overlap of unigrams (individual words) between the predicted and reference texts, making it well-suited for short, concise outputs.  
The function extracts the ROUGE-1 F1 score, multiplies it by 100 (to express it as a percentage) and returns it.


In [13]:
# Load the ROUGE metric using the datasets library
rouge = load("rouge")

# Define a function to compute evaluation metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # If predictions is a tuple, take only the first part
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    # If predictions are raw logits (3D: batch_size x seq_len x vocab_size), apply argmax to get token IDs
    if predictions.ndim == 3 and predictions.shape[-1] == tokenizer.vocab_size:
        predictions = np.argmax(predictions, axis=-1)

    # If predictions have shape (batch_size, num_beams, seq_len), take only the first beam
    if predictions.ndim == 3:
        predictions = predictions[:, 0, :]

    # Replace -100 in labels with the tokenizer’s pad_token_id (so we can decode them correctly)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Ensure prediction IDs are within the valid vocabulary range
    predictions = np.clip(predictions, 0, tokenizer.vocab_size - 1)

    # Decode predicted and reference sequences to strings
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute the ROUGE scores (with stemming)
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    
    # Extract the ROUGE-1 F1 score and scale to percentage
    rouge_1 = result["rouge1"] * 100
    
    return {"rouge1": round(rouge_1, 2)}

Downloading builder script: 0.00B [00:00, ?B/s]

The following code sets up a Huggingface Trainer to fine-tune the model.  
  
It defines **TrainingArguments**:  
- Small batch size (batch_size) allows training even on low-memory machines like laptops or free GPUs.  
- Logging step (logging_steps) gives detailed feedback on what’s happening at each training step.  
- No external logging (report_to="none") ensures clean local runs without integration into tracking tools like TensorBoard or W&B.  
- TQDM progress bar is enabled, which provides a live, visual indicator of training progress in the console.  

In [14]:
# Define TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',                  # Directory where model checkpoints will be saved
    num_train_epochs=10,                     # Number of full passes through the training dataset. 10 epochs are enough because the dataset is small and we want to prevent overfitting.
    per_device_train_batch_size=10,          # Number of training examples per GPU/CPU. per_device_train_batch_size=10, keeps memory use low. 
    do_eval=True,                            # Enable evaluation on the validation set at the end of training
    logging_dir='./logs',                    # Directory to store log files
    logging_steps=25,                        # Log metrics
    save_strategy="epoch",                   # Save a checkpoint
    report_to="none",                        # Disable integration with tools like WandB or TensorBoard
    disable_tqdm=False                       # Show the training progress bar
)

We initialize the **Trainer** object from Hugging Face that manages the full training and evaluation pipeline.

In [15]:
# Initialize Trainer
trainer = Trainer(
    model=model,                             
    args=training_args,                      # The training configuration defined above
    train_dataset=tokenized_train,           # The tokenized training dataset
    eval_dataset=tokenized_valid,            # The tokenized validation dataset
    data_collator=data_collator,             # The DataCollator defined above
    tokenizer=tokenizer,                     # Tokenizer used to process text data
    compute_metrics=compute_metrics          # Function to compute evaluation metrics
)

  trainer = Trainer(


Let's train the model.

In [16]:
# Start training
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Step,Training Loss
25,4.2964
50,3.5918
75,3.1399
100,2.9438
125,2.9291
150,2.9514
175,2.7387
200,2.7215
225,2.7572
250,2.6159


TrainOutput(global_step=1600, training_loss=2.3327521181106565, metrics={'train_runtime': 316.5165, 'train_samples_per_second': 50.361, 'train_steps_per_second': 5.055, 'total_flos': 2072905076244480.0, 'train_loss': 2.3327521181106565, 'epoch': 10.0})

The loss decreases consistently, but the rate of improvement is slowing in later steps.

We evaluate the predictions using **trainer.evaluate()** to run the model on the validation dataset specified earlier in the Trainer.  
It computes the loss and the ROUGE-1 metric by using the compute_metrics function we passed during Trainer initialization to generate evaluation scores.

In [17]:
# Evaluation
final_metrics = trainer.evaluate()
print("Final Evaluation Metrics:", final_metrics)

Final Evaluation Metrics: {'eval_loss': 2.1549863815307617, 'eval_rouge1': 0.0, 'eval_runtime': 143.358, 'eval_samples_per_second': 0.698, 'eval_steps_per_second': 0.091, 'epoch': 10.0}


Some explanation:
  
**eval_loss:**  
This is the average loss value computed on the validation dataset.  
The loss is moderate, but not necessarily a good indicator of real performance in a text generation task. On its own, it doesn’t reveal whether the model generates useful titles.  
**eval_rouge1:**  
A ROUGE-1 score of 0.0 means the model’s generated output has no significant overlap with the true titles. This is a strong signal that the model is not learning to generate meaningful or accurate titles.  
**eval_runtime:**  
Total time spent evaluating the model on the validation dataset.  
**eval_samples_per_second:**  
The speed at which the model processes samples during evaluation.  
**eval_steps_per_second:**  
Similar to above, this refers to how fast the evaluation runs per step (batch).  
**epoch:**  
Indicates that the results are from the last epoch of training.

Now let's see how the pretrained model performs on the test data.
  
The next function, generate_title, takes a book description as input and generates a predicted book title using our pretrained T5 model.
  
It prepends "summarize: " to the description, to specify the summarization task.  
The input text is tokenized and converted into PyTorch tensors (return_tensors="pt"). It is truncated if too long to fit within the model’s maximum token limit (512 tokens for T5-small).  
The model’s generate method produces a sequence of output token IDs representing the predicted title, limited to a maximum length of 32 tokens.  
Finally, the generated token IDs are decoded back into a readable string, omitting any special tokens like padding or start/end markers. 

In [18]:
# Function to generate a title from a given description
def generate_title(description_text):
    # Prepend the T5 task prefix "summarize:" to the input text
    input_text = "summarize: " + description_text
    
    # Tokenize the input and convert it to a PyTorch tensor
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
    
    # Move the input tensor to the same device as the model (here GPU)
    input_ids = input_ids.to(model.device)
    
    # Generate output token IDs using beam search decoding
    output_ids = model.generate(input_ids, max_length=32, num_beams=3, early_stopping=True)
    
    # Decode the token IDs into a string and remove special tokens
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

We loop through the first 3 entries in the test dataset by using the generate_title() defined above:  
For each row, it retrieves the original description and the true title. It then calls the model to generate a title based on the description. Finally, it prints the original description, the original title and the generated title for comparison.

In [19]:
# Loop through the first 3 rows of df_test and generate titles
for i in range(3):
    # Get the original description and corresponding title from the test dataframe
    original_description = df_test.loc[i, 'Test_Description']
    original_title = df_test.loc[i, 'Test_Title']  
    
    # Generate a title using the model based on the description
    generated_title = generate_title(description_text=original_description)
    
    # Print the original and generated content for comparison
    print(f"Original Description #{i+1}:\n{original_description}\n")
    print(f"Original Title #{i+1}: {original_title}\n")
    print(f"Generated Title #{i+1}: {generated_title}\n")
    print("-" * 80)

Original Description #1:
starting over sucks.when we moved to west virginia right before my senior year, i'd pretty much resigned myself to thick accents, dodgy internet access, and a whole lot of boring… until i spotted my hot neighbor, with his looming height and eerie green eyes. things were looking up.and then he opened his mouth.daemon is infuriating. arrogant. stab-worthy. we do not get alon starting over sucks.when we moved to west virginia right before my senior year, i'd pretty much resigned myself to thick accents, dodgy internet access, and a whole lot of boring… until i spotted my hot neighbor, with his looming height and eerie green eyes. things were looking up.and then he opened his mouth.daemon is infuriating. arrogant. stab-worthy. we do not get along. at all. but when a stranger attacks me and daemon literally freezes time with a wave of his hand, well, something… unexpected happens. the hot alien living next door marks me.you heard me. alien. turns out daemon and his 

Example 1  
Generated Title: if i don't kill him first, that's what i'm getting out of this alive  
The generated title captures the tone but not the theme (sci-fi/romance). Not really a suitable title. The model used here (T5) was originally trained with a "summarize" prefix for summarization tasks, not specifically for title generation. Since there is no native "title generation" prefix in the T5 pretraining objectives, we have reused the summarization objective by prepending "summarize:" to the input. While this workaround is reasonable, it may not align perfectly with the expectations of concise title generation and may encourage the model to produce longer outputs rather than short, catchy titles.  
  
Example 2  
Generated Title: The Age of Genius  
The generated title is a shortened but accurate and acceptable version of the original title.  
  
Example 3  
Generated Title: The Fault in Our Stars  
The generated title i a perfect match. Likely memorized, but still ideal. T5 was pretrained on a massive public corpus (books, Wikipedia, web data,...). Since The Fault in Our Stars is a very famous book, it’s almost certain that both its description and title were present in the pretraining data. The model might have memorized this mapping during pretraining, even before we fine-tuned it. So during testing, when it saw the description, it simply retrieved the known title from its pretraining “memory.”  
  
Overall, the model seems to perform well for nonfiction and known titles but struggles with fiction when the title is abstract or creative.  