# Project Introduction  
This project focuses on building and evaluating models for automatic book title generation based on book descriptions. It is divided into three main parts — from raw text extraction and preparation to model training, fine-tuning and evaluation.
  
Generating meaningful and creative titles from descriptions is a particularly challenging task in natural language processing. Unlike summaries or classifications, titles are often abstract, metaphorical or intentionally ambiguous and they don’t always offer a direct reflection of the book’s content. While descriptions provide clear insight into the storyline or subject matter, titles may be poetic, symbolic or chosen for marketing appeal, making them hard to predict from a purely semantic or structural standpoint.
  
In addition to the inherent difficulty of the task, this project is further constrained by the limited size of the dataset. A small corpus increases the risk of overfitting and reduces the model’s ability to generalize. To counter this, careful preprocessing and data augmentation techniques are applied to enhance the dataset and improve training outcomes.
  
By combining practical data handling with advanced transformer-based models, the project explores how far we can push automated title generation under real-world constraints.
  
### Part 1: Data Collection and Preparation  
In the first stage, book data is collected through web scraping from an online bookstore (books.toscrape.com, a demo website built for web scraping practice). This includes titles, descriptions, star ratings and prices. Since the raw dataset is relatively small, two data augmentation techniques are applied to improve model generalization:  
- Backtranslation: English descriptions are translated into German and then back into English to generate semantically equivalent but syntactically varied text.
- Paraphrasing: A T5-based paraphraser generates alternative versions of descriptions while preserving the original meaning.

This step makes the training data more diverse and better suited for fine-tuning language models.  
  
### Part 2: Transformer Model for Title Generation  
The second part involves building a custom transformer-based model that generates book titles from descriptions. This includes:  
- Byte-Pair Encoding (BPE) for effective tokenization of input/output text.
- Hyperparameter optimization using Optuna, guided by the ROUGE score as the evaluation metric.

The goal is to learn an end-to-end mapping from description to title.
  
### Part 3: Fine-Tuning a Pretrained T5 Model  
In the final part, a pretrained T5 model is fine-tuned on the same description-to-title task. By leveraging the model's pretrained language understanding, we aim to improve performance on title generation, especially with limited training data. This step serves as a benchmark against the custom transformer developed in Part 2.

# Transformer-Based Title Generation - Part 1

The first part of the project focuses on extracting and preparing data for the title generation task. Book information (titles, descriptions, ratings and prices) is collected via web scraping from an online bookstore. The primary objective in this stage is to ensure that the input data is clean, consistent and suitable for training language models.
  
Although the preprocessing applied is minimal, it is intentionally designed to be efficient and effective. The code removes only the most disruptive forms of noise while preserving the natural linguistic structure of the text. This design choice ensures that the Byte-Pair Encoding (BPE) tokenizer (used in part 2 of the project) can learn consistent subword patterns, which is critical for generating fluent and accurate book titles. Subword-level regularity is particularly important in this task, as small variations in the input can significantly affect the model's output.
  
To further improve data quality, non-English book descriptions are filtered out using language detection. This step ensures that the model is trained exclusively on English-language text, avoiding confusion caused by multilingual input and improving model performance on the downstream task.
  
Because the scraped dataset is relatively small, two complementary data augmentation techniques are applied to expand and diversify the training data:  
- Backtranslation: Book descriptions are translated from English to German and then back to English. This produces alternate phrasings that preserve meaning but introduce syntactic variety.
- Paraphrasing: A pretrained T5 paraphraser is used to generate semantically equivalent but lexically diverse versions of each description.
  
By combining careful filtering, lightweight preprocessing and powerful augmentation strategies, Part 1 sets a foundation for training models in the following stages of the project.

## Web Scraping Book Data from books.toscrape.com

The following script scrapes information about books listed on the website books.toscrape.com. It collects the title, star rating, price and description of each book across multiple pages, then stores the data in a pandas DataFrame.  
Note: This website is a publicly available demo created specifically for web scraping practice. It does not list real-world books — all titles, prices and descriptions are fictional and intended for educational or testing purposes only.

In [1]:
pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993222 sha256=ad9d03aff94a9773d1dde618d56770cd340dbd4c29fb7c28bbd7a9c5eed1294e
  Stored in directory: /root/.cache/pip/wheels/0a/f2/b2/e5ca405801e05eb7c8ed5b3b4bcf1fcabcd6272c167640072e
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9
Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from langdetect import detect, DetectorFactory
import re
import html
from sklearn.model_selection import train_test_split

In [3]:
books = []

# Base URL to resolve relative links
base_url = "https://books.toscrape.com/catalogue/"

# Load the first page
response = requests.get(f"{base_url}page-1.html")
soup = BeautifulSoup(response.content, 'html.parser')

# Read pagination info to determine the number of pages
pager = soup.find('ul', class_='pager')
if pager:
    current_page_text = pager.find('li', class_='current').text.strip()
    last_page = int(current_page_text.split()[-1])  # Extract the "50"
else:
    last_page = 1  # If no pagination, assume only one page

# Iterate through all pages
for page in range(1, last_page + 1):
    url = f"{base_url}page-{page}.html"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # The book listings are inside an <ol> (ordered list) tag
    ol = soup.find('ol')  # First <ol> on the page; it holds one <article> per book
    articles = ol.find_all('article', class_='product_pod')  # All book entries on this page

    for article in articles:
        # Extract title from the image alt attribute
        image = article.find('img')
        title = image.attrs['alt']   # The full title is here

        # Extract star rating ("One", "Two", "Three", etc.)
        star_tag = article.find('p', class_='star-rating')
        star = star_tag['class'][1]  # Classes are ['star-rating', '<word>']; keep the rating word

        # Extract price and convert to float
        price = article.find('p', class_='price_color').text
        price = float(price[1:])  # Drop the leading "£" and convert the rest to float

        # Construct full link to the book's detail page
        link_tag = article.find('h3').find('a')
        book_url = base_url + link_tag['href']

        # Visit the book's detail page to extract its description
        book_response = requests.get(book_url)
        book_soup = BeautifulSoup(book_response.content, 'html.parser')

        # Find the book description in a <meta name="description"> tag
        description = book_soup.find('meta', attrs={"name": "description"})
        if description:
            description_text = description['content'].strip()
        else:
            # Alternatively, try finding the description inside an <article><p> block
            description_tags = book_soup.find('article').find_all('p')
            if len(description_tags) > 3:
                description_text = description_tags[3].text.strip()
            else:
                description_text = "No description found"

        # Save extracted data
        books.append([title, star, price, description_text])

# Convert the list of books into a pandas DataFrame
df = pd.DataFrame(books, columns=['Title', 'Star Rating', 'Price', 'Description'])
df.tail()

| | Title | Star Rating | Price | Description |
|---|---|---|---|---|
| 995 | Alice in Wonderland (Alice's Adventures in Won... | One | 55.53 | |
| 996 | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) | Four | 57.06 | High school student Kei Nagai is struck dead i... |
| 997 | A Spy's Devotion (The Regency Spies of London #1) | Five | 16.97 | In England’s Regency era, manners and elegance... |
| 998 | 1st to Die (Women's Murder Club #1) | One | 53.98 | James Patterson, bestselling author of the Ale... |
| 999 | 1,000 Places to See Before You Die | Five | 26.08 | Around the World, continent by continent, here... |
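The scraping cell relies on three small parsing steps: reading the page count from the pager text, stripping the currency symbol from the price and building the detail-page URL. The following is a minimal sketch of those steps as standalone helpers, using only the standard library. The helper names and the sample strings are assumptions made for illustration, not part of the original script.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://books.toscrape.com/catalogue/"


def parse_last_page(pager_text):
    """Return the total page count from text such as 'Page 1 of 50'."""
    match = re.search(r"of\s+(\d+)", pager_text)
    return int(match.group(1)) if match else 1


def parse_price(price_text):
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"No number found in {price_text!r}")
    return float(match.group())


def resolve_link(href, base=BASE_URL):
    """Resolve a relative href against the catalogue base URL."""
    return urljoin(base, href)


# Sample inputs (illustrative values)
print(parse_last_page("Page 1 of 50"))   # 50
print(parse_price("£51.77"))             # 51.77
print(resolve_link("a-light-in-the-attic_1000/index.html"))
```

Searching for the number instead of slicing off the first character keeps the price parser working even if the currency symbol is decoded as more than one character, and `urljoin` also handles relative links that start with `../`.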


The next cell filters out all rows of the DataFrame df whose "Description" column contains non-English text. It uses the **langdetect library** to identify the language of each description. Only English entries are kept, because the later stages rely on models pretrained on English text. Descriptions for which detection fails (for example, empty strings) are treated as non-English and removed as well.
  
The script also:  
- Ensures deterministic language detection using a fixed seed.
- Calculates and prints the number of removed rows.
- Displays the first non-English description that was removed for inspection (if any).

In [4]:
# Ensure consistent results from langdetect
DetectorFactory.seed = 0

# Function to check if a text is English
def is_english(text):
    try:
        return detect(text) == 'en'
    except Exception:
        return False  # Detection fails on short/empty strings; treat those as non-English

# Store original number of rows
original_len = len(df)

# Identify rows where description is NOT English
non_english_mask = ~df['Description'].apply(is_english)

# Save non-English rows before dropping (optional preview)
non_english_rows = df[non_english_mask]

# Drop rows with non-English descriptions
df = df[~non_english_mask].reset_index(drop=True)

# Print number of removed rows
removed_rows = original_len - len(df)
print(f"{removed_rows} rows removed (non-English descriptions).")

# Show the first removed row (if there was any)
if not non_english_rows.empty:
    print("\nFirst removed row:")
    print(non_english_rows.iloc[0])

3 rows removed (non-English descriptions).

First removed row:
Title                                                 Soumission
Star Rating                                                  One
Price                                                       50.1
Description    Dans une France assez proche de la nôtre, un h...
Name: 2, dtype: object
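Because is_english returns False whenever detection raises an error, the count of removed rows mixes two cases: descriptions in another language and descriptions that could not be analyzed at all (such as empty strings). The sketch below separates the two. It is an illustration only: the detector is passed in as an argument so the example runs without langdetect, and `toy_detect` is a hypothetical stand-in for `langdetect.detect`.

```python
def classify_descriptions(texts, detect_fn):
    """Label each text as 'english', 'other' or 'undetectable'."""
    labels = []
    for text in texts:
        try:
            labels.append("english" if detect_fn(text) == "en" else "other")
        except Exception:
            # The detector could not analyze the text (e.g. empty input)
            labels.append("undetectable")
    return labels


def toy_detect(text):
    """Hypothetical stand-in for langdetect.detect, for demonstration only."""
    if not text.strip():
        raise ValueError("no features in text")
    return "fr" if " une " in f" {text.lower()} " else "en"


samples = ["A light in the attic.", "Dans une France assez proche", ""]
print(classify_descriptions(samples, toy_detect))
# ['english', 'other', 'undetectable']
```

In the notebook, `detect` from langdetect could be passed as `detect_fn` to see how many of the removed rows were genuinely non-English.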


The following code defines a preprocessing function that prepares raw text (book titles and descriptions) for use with transformer models that rely on subword tokenization, specifically Byte Pair Encoding (BPE).

Since BPE learns to represent frequently occurring subword units, the goal is to preserve as much meaningful structure in the raw text as possible while removing noise. This means:  
- Lowercasing (text.lower()) helps reduce vocabulary size. For example, "Book" and "book" will be treated as the same token.  
- Punctuation and special characters are preserved: The function does not remove punctuation, stopwords or diacritics (such as German umlauts). This is intentional, because BPE benefits from learning frequent patterns that may include punctuation and character combinations (".", "-ing", "schön").  
- HTML tags, URLs and control characters are removed. These are noisy and unlikely to form useful subword patterns. Removing them reduces unnecessary tokens and improves learning efficiency.
- Invisible Unicode characters are removed: Zero-width spaces and RTL/LTR markers can break tokenization and result in inconsistent splits. Their removal ensures that the input text is clean and consistent.  
- Whitespace normalization: Collapsing runs of whitespace into a single space helps BPE find more predictable patterns. Excess whitespace doesn't carry semantic meaning.  
- HTML entity decoding ("&amp;" → "&"): BPE would treat entities like &amp; as entirely different from &, fragmenting the vocabulary unnecessarily.

For BPE-based models, removing stopwords or applying stemming/lemmatization would eliminate important patterns (common suffixes like -ing, -ed). It would reduce the richness of the training data and break the co-occurrence statistics that BPE needs for efficient subword learning. Instead, the goal is to retain raw structure while removing noise that would interfere with token splitting.
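To make the point about suffixes concrete, here is a toy version of BPE merge learning in plain Python. It is a simplified sketch of the general algorithm (count adjacent symbol pairs, merge the most frequent pair, repeat), not the tokenizer used in Part 2. The function name and the tiny word list are assumptions for illustration.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, stored with its frequency
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe_merges(["reading", "writing", "telling"], 2))
# [('n', 'g'), ('i', 'ng')]
print(learn_bpe_merges(["read", "writ", "tell"], 2))
```

With the suffixes intact, two merges are enough to turn "ing" into a single symbol. After stemming (second call), every pair occurs only once, so the merges carry no shared pattern.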

In [5]:
def preprocess_text(text):
    if not isinstance(text, str):
        raise ValueError("Input must be a string")

    # Convert to lowercase
    text = text.lower()

    # Remove "...more" (and variants like "... more" or "…more")
    text = re.sub(r'\.\.\.\s*more', '', text)
    text = re.sub(r'…\s*more', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)

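    # Remove control characters, but keep tab, newline and carriage return
    # so the whitespace step below turns them into a single space
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]', '', text)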

    # Remove specific Unicode characters like zero-width space
    text = re.sub(r'\u200b', '', text)

    # Remove other invisible formatting characters (RTL/LTR marks)
    text = re.sub(r'[\u200e\u200f\u202a-\u202e]', '', text)

    # Collapse multiple spaces into one
    text = re.sub(r'\s+', ' ', text).strip()

    # Normalize punctuation marks
    text = text.replace("’", "'").replace("“", '"').replace("”", '"')

    # Decode HTML entities (&amp; → &)
    text = html.unescape(text)

    return text

In [6]:
# Apply preprocessing to the "Title" and "Description" columns
df['title_prepro'] = df['Title'].apply(preprocess_text)
df['description_prepro'] = df['Description'].apply(preprocess_text)

# Drop unused columns to keep only the preprocessed text
df = df.drop(columns=['Title', 'Description', 'Star Rating', 'Price'])

# Display the cleaned DataFrame
df.head()

| | title_prepro | description_prepro |
|---|---|---|
| 0 | a light in the attic | it's hard to imagine a world without a light i... |
| 1 | tipping the velvet | "erotic and absorbing...written with starling ... |
| 2 | sharp objects | wicked above her hipbone, girl across her hear... |
| 3 | sapiens: a brief history of humankind | from a renowned historian comes a groundbreaki... |
| 4 | the requiem red | patient twenty-nine.a monster roams the halls ... |


Next, the data is split into training, validation and test sets in an 80/10/10 ratio.  
This ensures that the model is trained on the largest portion of the data (train_descriptions, train_titles), validated on a separate subset (valid_descriptions, valid_titles) and finally tested on an unseen subset (test_descriptions, test_titles). This separation makes it possible to evaluate the model's performance and generalization ability.  

In [7]:
# Split data
titles = df["title_prepro"].astype(str).tolist()
descriptions = df["description_prepro"].astype(str).tolist()

train_titles, temp_titles, train_descriptions, temp_descriptions = train_test_split(
    titles, descriptions, test_size=0.2, random_state=42)
valid_titles, test_titles, valid_descriptions, test_descriptions = train_test_split(
    temp_titles, temp_descriptions, test_size=0.5, random_state=42)
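The resulting set sizes can be worked out by hand. The sketch below assumes that a fractional test_size is rounded up to the next whole sample, which is how scikit-learn documents the behavior, and starts from 997 rows (the 1000 scraped rows minus the 3 removed by the language filter). The helper name is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size, rounding the second part up."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


n_rows = 997  # 1000 scraped rows minus 3 rows removed by the language filter
n_train, n_temp = split_sizes(n_rows, 0.2)    # first split: train vs. temp
n_valid, n_test = split_sizes(n_temp, 0.5)    # second split: validation vs. test
print(n_train, n_valid, n_test)  # 797 100 100
```

A training set of 797 rows is consistent with the 398 samples processed in the backtranslation loop later in this notebook, since int(0.5 * 797) equals 398.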

## Data Augmentation: Backtranslation

**Backtranslation** is a powerful data augmentation technique used to generate synthetic training data by translating existing text to another language and then translating it back to the original language. This method creates paraphrased versions of our input text while preserving its meaning.

Transformer models like T5 or GPT require large and diverse datasets to generalize well.  
Because we are working with a very small dataset, our model may memorize examples instead of learning general patterns, overfit quickly and perform poorly on unseen data. Backtranslation increases the diversity of our training examples without requiring newly labeled data.

The purpose of backtranslation in our case is to simulate new ways of describing the same book, not to generate new books or new titles. If we changed the target (title), we would no longer be augmenting the same example but creating a new training pair, which defeats the point of using backtranslation for data augmentation. In very low-resource scenarios, one might consider augmenting not just the inputs (descriptions) but also the targets (titles). However, this crosses the line from data augmentation into synthetic data generation. Our targets are book titles, which are highly specific and creative and must therefore stay fixed. Backtranslating them could alter their meaning, create inaccurate labels or confuse the model; in other words, backtranslating the targets would not preserve label integrity.

In [8]:
pip install nlpaug transformers

Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl.metadata (14 kB)
Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.11
Note: you may need to restart the kernel to use updated packages.


In [9]:
import nlpaug.augmenter.sentence as nas
import random
from transformers import MarianMTModel, MarianTokenizer
import torch
from nltk.tokenize import sent_tokenize
from tqdm import tqdm

# For reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)



<torch._C.Generator at 0x7d1f1e582af0>

The following code sets up the environment and loads pre-trained translation models from the **Helsinki-NLP collection on Hugging Face**. These models will be used for our backtranslation.
  
This setup includes:  
- Checking for GPU availability.
- Loading English-to-German and German-to-English translation models.
- Loading the corresponding tokenizers.

In [10]:
# Enable GPU support if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define model names for backtranslation:
# These are pre-trained translation models from Helsinki-NLP
en_to_de_model_name = "Helsinki-NLP/opus-mt-en-de"  # English to German
de_to_en_model_name = "Helsinki-NLP/opus-mt-de-en"  # German to English

# Load the English-to-German tokenizer and model, then move model to the selected device (GPU/CPU)
en_to_de_tokenizer = MarianTokenizer.from_pretrained(en_to_de_model_name)
en_to_de_model = MarianMTModel.from_pretrained(en_to_de_model_name).to(device)

# Load the German-to-English tokenizer and model, also moving to the device
de_to_en_tokenizer = MarianTokenizer.from_pretrained(de_to_en_model_name)
de_to_en_model = MarianMTModel.from_pretrained(de_to_en_model_name).to(device)


The following code implements a robust backtranslation pipeline, where English text is first translated into German and then back into English. 
  
To handle longer texts effectively, the code first divides the input into manageable **chunks** using the chunk_text function, which splits the input by sentences and ensures that each chunk stays within a specified word limit. 
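The greedy packing idea behind the chunking step can be sketched without any external dependency. The version below is an illustration only: `simple_sentences` is a naive regex splitter used here as a stand-in for nltk's sent_tokenize, and both function names are assumptions.

```python
import re


def simple_sentences(text):
    """Naive sentence splitter for this sketch (stand-in for nltk's sent_tokenize)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(sentences, max_words):
    """Greedily group whole sentences into chunks of at most max_words words."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would push it over the limit
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five. Six seven eight nine. Ten."
print(pack_sentences(simple_sentences(text), max_words=5))
# ['One two three. Four five.', 'Six seven eight nine. Ten.']
```

Sentences are never split, so a single sentence longer than the limit still ends up as one oversized chunk. The tokenizer's truncation setting is what ultimately bounds the model input in that case.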
  
The core translation logic uses two machine translation models (English to German and German to English) and applies sampling-based decoding to introduce variability in the outputs. In both translation steps, the code calls the .generate() method with the following parameters:  
- **do_sample=True** enables sampling instead of deterministic decoding, allowing the model to produce more diverse outputs.
- **top_k=10** restricts sampling to the 10 most likely tokens at each step. **top_p=0.2** adds nucleus sampling, which keeps only the smallest set of top tokens whose cumulative probability reaches 20%. This is a tight filter: whenever the most likely token alone has a probability of 0.2 or more, it is the only candidate left.
- **temperature=1.9** flattens the probability distribution before filtering, which increases randomness among the candidates that survive the top_k and top_p filters.
- **max_length=512** caps the length of generated outputs to avoid runaway generations, while **early_stopping=True** is a beam-search setting that ends the search once enough complete candidates have been found.
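The interaction of these sampling parameters is easier to see on a toy distribution. The sketch below applies temperature scaling, then top-k, then top-p filtering to a handful of made-up logits. It is a simplified illustration of the general technique, not the library's internal implementation, and the function name and logit values are assumptions.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, then top-k, then top-p filtering to a dict of token logits."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    ranked = sorted(scaled, key=scaled.get, reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k highest-scoring tokens

    def softmax(tokens):
        peak = max(scaled[t] for t in tokens)
        weights = {t: math.exp(scaled[t] - peak) for t in tokens}
        total = sum(weights.values())
        return {t: w / total for t, w in weights.items()}

    probs = softmax(ranked)
    kept, cumulative = [], 0.0
    for tok in ranked:
        # Nucleus: add tokens until their cumulative probability reaches top_p
        kept.append(tok)
        cumulative += probs[tok]
        if cumulative >= top_p:
            break
    return softmax(kept)  # renormalize over the surviving tokens


toy_logits = {"the": 2.0, "a": 1.5, "this": 1.0, "one": 0.0}  # made-up values

tight = filtered_distribution(toy_logits, temperature=1.9, top_k=10, top_p=0.2)
loose = filtered_distribution(toy_logits, temperature=0.99, top_k=80, top_p=0.97)
print(list(tight))  # ['the']
print(list(loose))  # ['the', 'a', 'this', 'one']
```

With a nucleus of 0.2, the top token alone already covers the required probability mass in this example, so sampling collapses to a single candidate even at a high temperature. A nucleus of 0.97 keeps all four tokens.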

In [11]:
# Chunking function
def chunk_text(text, max_words_per_chunk=100):
    if not text.strip():
        return []

    # Split text into individual sentences
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_word_count = 0

    for sentence in sentences:
        sentence_word_count = len(sentence.split())

        # If adding this sentence would exceed the max word count per chunk, finalize the current chunk
        if current_word_count + sentence_word_count > max_words_per_chunk:
            if current_chunk:
                chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_word_count = sentence_word_count
        else:
            current_chunk.append(sentence)
            current_word_count += sentence_word_count

    # Append the final chunk if it contains any sentences
    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks


# Backtranslation EN to DE to EN with error handling
def backtranslate_en_to_de_to_en(text, max_words_per_chunk=100):
    try:
        # Break the input into manageable chunks
        chunks = chunk_text(text, max_words_per_chunk=max_words_per_chunk)
        backtranslated_chunks = []

        for chunk in chunks:
            # English to German Translation
            encoded_en = en_to_de_tokenizer(
                chunk,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=512  # Prevent the model from processing overly long inputs
            ).to(device)

            translated_de = en_to_de_model.generate(
                **encoded_en,
                do_sample=True,      # Sample from the distribution instead of always taking the most likely token
                top_k=10,            # Sample only from the 10 most likely next tokens (larger k = more variety)
                top_p=0.2,           # Nucleus sampling: keep the smallest token set with cumulative probability >= 0.2 (higher p = more variety)
                temperature=1.9,     # Values above 1.0 flatten the distribution (more randomness); values below 1.0 sharpen it
                max_length=512,      # Limit output length to prevent excessively long generations
                early_stopping=True  # Stop when the model thinks the output is complete
            )

            # Decode the generated German text from token IDs
            text_de = en_to_de_tokenizer.decode(
                translated_de[0],
                skip_special_tokens=True
            )
            
            # German to English Translation
            encoded_de = de_to_en_tokenizer(
                text_de,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=512
            ).to(device)

            translated_en = de_to_en_model.generate(
                **encoded_de,
                do_sample=True,
                top_k=10,
                top_p=0.2,
                temperature=1.9,
                max_length=512,
                early_stopping=True
            )

            backtranslated_text = de_to_en_tokenizer.decode(
                translated_en[0],
                skip_special_tokens=True
            ).strip()

            # Preprocessing new text
            backtranslated_chunks.append(preprocess_text(backtranslated_text))

        return ' '.join(backtranslated_chunks)

    except Exception as e:
        print(f"[Backtranslation Error]: {e}")
        return text

In the following code, 50% of the training data is augmented using backtranslation, while the corresponding titles are kept unchanged to maintain label consistency. The selected indices for backtranslation are chosen randomly.
  
The remaining 50% of the data, corresponding to the indices not selected here, is reserved for a separate paraphrasing augmentation step. This ensures that the two augmentation methods — backtranslation and paraphrasing — are applied to distinct, non-overlapping portions of the dataset, helping to increase diversity and reduce redundancy in the augmented training samples.
  
For each selected sample, the description is passed through the backtranslation pipeline to generate a new paraphrased description, which is appended to the list of augmented descriptions. The original title is kept unchanged and appended to the list of titles for the augmented data. This way, the model trains on varied inputs with consistent target outputs.
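One way to obtain two non-overlapping index sets, one per augmentation method, is sketched below. How the complementary indices are actually derived is not shown in this part of the notebook, so the helper and the names `bt_idx` and `para_idx` are hypothetical.

```python
import random


def split_indices(num_samples, fraction=0.5, seed=42):
    """Randomly split range(num_samples) into two disjoint index lists."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    first = rng.sample(range(num_samples), int(fraction * num_samples))
    chosen = set(first)
    # Everything that was not sampled goes into the second group
    second = [i for i in range(num_samples) if i not in chosen]
    return first, second


bt_idx, para_idx = split_indices(797)
print(len(bt_idx), len(para_idx))  # 398 399
```

Using a set for the membership test keeps the complement computation linear in the number of samples.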

In [12]:
# Backtranslation loop — only for a random 50% of samples
num_samples = len(train_descriptions)
bt_indices = random.sample(range(num_samples), int(0.5 * num_samples))

# Select subset of data to augment
sampled_descriptions = [train_descriptions[i] for i in bt_indices]
sampled_titles = [train_titles[i] for i in bt_indices]

# Lists for augmented samples
bt_descriptions = []
bt_titles = []

# Backtranslate descriptions only. Keep titles unchanged.
for orig_desc, orig_title in tqdm(zip(sampled_descriptions, sampled_titles), total=len(bt_indices), desc="Backtranslating"):
    new_desc = backtranslate_en_to_de_to_en(orig_desc)
    bt_descriptions.append(new_desc)
    bt_titles.append(orig_title)  # unchanged title

Backtranslating: 100%|██████████| 398/398 [39:07<00:00,  5.90s/it]


Let's have a look at some backtranslated descriptions.

In [13]:
n = 3

for i in range(n):
    idx = bt_indices[i] 

    print(f"\n--- Example {i+1} ---")
    print(f"Original Title:      {train_titles[idx]}")
    print(f"Backtranslated Title:     {bt_titles[i]}")
    print(f"\nOriginal Description:\n{train_descriptions[idx]}")
    print(f"\nBacktranslated Description:\n{bt_descriptions[i]}")


--- Example 1 ---
Original Title:      the white queen (the cousins' war #1)
Backtranslated Title:     the white queen (the cousins' war #1)

Original Description:
philippa gregory presents the first of a new series set amid the deadly feuds of england known as the wars of the roses.brother turns on brother to win the ultimate prize, the throne of england, in this dazzling account of the wars of the plantagenets. they are the claimants and kings who ruled england before the tudors, and now philippa gregory brings them to life through philippa gregory presents the first of a new series set amid the deadly feuds of england known as the wars of the roses.brother turns on brother to win the ultimate prize, the throne of england, in this dazzling account of the wars of the plantagenets. they are the claimants and kings who ruled england before the tudors, and now philippa gregory brings them to life through the dramatic and intimate stories of the secret players: the indomitable women, sta

The backtranslated description retains the general structure and key thematic content of the original, effectively preserving the intent and meaning. It demonstrates moderate paraphrastic variation — some phrases are reformulated or slightly altered in wording, which indicates that the backtranslation process is introducing diversity.  
However, the level of transformation remains relatively mild, with sections closely mirroring the original phrasing or being repeated with only small grammatical shifts.
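The degree of variation can also be estimated numerically instead of by eye. The following sketch compares an original and an augmented text with two simple standard-library measures. The helper names are assumptions and the augmented sentence is a made-up example, not actual model output.

```python
from difflib import SequenceMatcher


def token_overlap(original, augmented):
    """Jaccard overlap of the word sets of two texts (1.0 = identical vocabulary)."""
    a, b = set(original.split()), set(augmented.split())
    return len(a & b) / len(a | b) if a | b else 1.0


def similarity_ratio(original, augmented):
    """Character-level similarity in [0, 1] computed by difflib."""
    return SequenceMatcher(None, original, augmented).ratio()


original = "brother turns on brother to win the throne of england"
augmented = "brother turns against brother to gain the throne of england"  # made-up example

print(round(token_overlap(original, augmented), 2))  # 0.64
print(round(similarity_ratio(original, augmented), 2))
```

Values close to 1.0 across many samples would indicate that the augmentation changes very little, which would match the mild transformation described above.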

## Data Augmentation: Paraphrasing

Paraphrasing is another data augmentation technique where input texts are rewritten in different words while preserving their original meaning.
  
When the training dataset is small, the model can easily overfit or fail to generalize to unseen descriptions. Paraphrasing helps address this by exposing the model to a wider variety of language patterns and sentence structures and by teaching the model to generate consistent outputs (titles) from semantically similar inputs (paraphrased descriptions).
  
We are training a custom encoder-decoder transformer model with only a limited number of (description, title) pairs, so paraphrasing is applied only to the input descriptions while the titles remain unchanged. This creates new synthetic training pairs where the input is varied, but the target stays the same.  
This teaches the model that multiple phrasings can map to the same correct title, reinforcing its understanding of meaning.

In [14]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

In the following code, we load a pretrained T5 model fine-tuned for paraphrasing, specifically the **"ramsrigouthamg/t5_paraphraser" checkpoint**.
  
The code paraphrases text by splitting the input into smaller chunks and generating a paraphrased version of each chunk with the pretrained model. First, the same chunk_text function as before breaks the input into sentence-based chunks, each limited to a maximum word count (default 100 words), so that the model input stays within acceptable limits. Each chunk is then processed in the paraphrase_text function, where it is prefixed with "paraphrase: " and tokenized with the paraphrasing tokenizer.  
The model generates paraphrases with sampling-based decoding: the .generate() method is configured with parameters that promote output diversity, similar to those used for backtranslation. The **num_return_sequences** parameter can additionally request several alternative paraphrases per chunk, but only the first one is used (outputs[0]). The generated tokens are decoded, postprocessed and combined into the final paraphrased text.

In [15]:
# Load the pretrained T5 model and tokenizer for paraphrasing
# This model is trained specifically to generate paraphrases of input text.
paraphrase_model = T5ForConditionalGeneration.from_pretrained("ramsrigouthamg/t5_paraphraser")
paraphrase_tokenizer = T5Tokenizer.from_pretrained("ramsrigouthamg/t5_paraphraser")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [16]:
# Function to paraphrase text
def paraphrase_text(text, max_length=512, num_return_sequences=1):
    try:
        # Split the input text into smaller chunks to avoid exceeding model limits
        chunks = chunk_text(text)  # Uses the chunk_text function defined earlier

        paraphrased_chunks = []

        for chunk in chunks:
            # Prepare the input prompt for the paraphrasing model
            prompt = "paraphrase: " + chunk.strip()

            # Tokenize the input prompt
            encoding = paraphrase_tokenizer.encode_plus(
                prompt,
                return_tensors="pt",        # Return PyTorch tensors
                truncation=True,            # Truncate input if it exceeds max_length
                padding="max_length",       # Pad input to the max_length
                max_length=512              # Max number of tokens allowed in input
            )

            input_ids = encoding["input_ids"]
            attention_mask = encoding["attention_mask"]

            # Generate paraphrased output using a sampling-based strategy
            outputs = paraphrase_model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_length=max_length,      # Maximum length of generated text
                do_sample=True,             # Enables sampling for more diverse outputs
                top_k=10,                   # Top-k: sample only from the 10 most probable tokens
                top_p=0.1,                  # Nucleus sampling: keep the smallest token set whose cumulative probability reaches 0.1
                temperature=1.9,            # Values above 1.0 flatten the distribution (more randomness); values below 1.0 sharpen it
                num_return_sequences=num_return_sequences  # Generate multiple paraphrase options
            )

            # Select the first generated paraphrase
            best_output = outputs[0]

            # Decode token IDs to string and clean it
            decoded = paraphrase_tokenizer.decode(best_output, skip_special_tokens=True).strip()

            # Add the paraphrased chunk to the list
            paraphrased_chunks.append(decoded)

        # Combine all paraphrased chunks into final output (once, after the loop)
        paraphrased_text = ' '.join(paraphrased_chunks)

        # Text post-processing
        return preprocess_text(paraphrased_text)

    except Exception as e:
        print(f"Error during paraphrasing: {e}")
        return text
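
The three sampling parameters interact, so a toy illustration may help. The sketch below applies temperature, then top-k, then top-p to a made-up list of logits. It is a simplified stand-in written for this explanation, not the library's implementation, and `filtered_distribution` is a hypothetical helper name.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling: dividing by a value > 1 flattens the distribution
    scaled = [x / temperature for x in logits]

    # Softmax (shifted by the maximum for numerical stability)
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least probable and apply top-k
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:top_k] if top_k > 0 else ranked
    mass = sum(probs[i] for i in kept)
    renorm = {i: probs[i] / mass for i in kept}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += renorm[i]
        if cumulative >= top_p:
            break

    mass = sum(renorm[i] for i in nucleus)
    return {i: renorm[i] / mass for i in nucleus}


toy_logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # made-up scores for five candidate tokens
strict = filtered_distribution(toy_logits, temperature=1.9, top_k=10, top_p=0.1)
loose = filtered_distribution(toy_logits, temperature=1.9, top_k=10, top_p=0.9)
print(len(strict), "candidate token(s) with top_p=0.1")
print(len(loose), "candidate token(s) with top_p=0.9")
```

In this toy case the settings used above (temperature 1.9, top_k 10, top_p 0.1) leave a single candidate token, because the most probable token alone already exceeds a cumulative probability of 0.1. A small top_p can therefore cancel much of the randomness that a high temperature adds.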

The next code performs data augmentation via paraphrasing on the remaining 50% of the training samples that were not used for backtranslation.

In [17]:
# Select the remaining 50% of the training samples that were NOT used for backtranslation
# (sorted so that the order is reproducible across runs)
pp_indices = sorted(set(range(num_samples)) - set(bt_indices))

# Get the corresponding book descriptions and titles for paraphrasing
sampled_descriptions = [train_descriptions[i] for i in pp_indices]
sampled_titles = [train_titles[i] for i in pp_indices]

# Lists to store augmented (paraphrased) data
augmented_descriptions = []
augmented_titles = []

# Loop through each sample in the selected subset
for i in tqdm(range(len(pp_indices)), desc="Paraphrasing"):
    desc = sampled_descriptions[i]   # Original description
    title = sampled_titles[i]        # Corresponding title (kept unchanged)

    # Generate a paraphrased version of the description
    new_desc = paraphrase_text(desc)

    # Save the augmented description and the original title
    augmented_descriptions.append(new_desc)
    augmented_titles.append(title)

Paraphrasing: 100%|██████████| 399/399 [1:36:04<00:00, 14.45s/it]
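
The two augmentation subsets are meant to be disjoint and to cover the training set together. The sketch below shows one way to produce and check such a split. How bt_indices was actually drawn is outside this section, so the use of `random.sample`, the seed and the helper name `split_indices` are assumptions.

```python
import random


def split_indices(num_samples, fraction=0.5, seed=42):
    """Split range(num_samples) into two disjoint, sorted index lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    first = sorted(rng.sample(range(num_samples), int(num_samples * fraction)))
    second = sorted(set(range(num_samples)) - set(first))
    return first, second


bt_demo, pp_demo = split_indices(10)

# Sanity checks: no index is augmented twice, and none is left out
assert not set(bt_demo) & set(pp_demo)
assert sorted(bt_demo + pp_demo) == list(range(10))
print(bt_demo, pp_demo)
```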


Let's have a look at some paraphrased descriptions.

In [18]:
n = 3

for i in range(n):
    idx = pp_indices[i]  

    print(f"\n--- Example {i+1} ---")
    print(f"Original Title:      {train_titles[idx]}")
    print(f"Augmented Title:     {augmented_titles[i]}")
    print(f"\nOriginal Description:\n{train_descriptions[idx]}")
    print(f"\nAugmented Description:\n{augmented_descriptions[i]}")


--- Example 1 ---
Original Title:      lowriders to the center of the earth (lowriders in space #2)
Augmented Title:     lowriders to the center of the earth (lowriders in space #2)

Original Description:
the lovable trio from the acclaimed lowriders in space are back! lupe impala, elirio malaria, and el chavo octopus are living their dream at last. they're the proud owners of their very own garage. but when their beloved cat genie goes missing, they need to do everything they can to find him. little do they know the trail will lead them to the realm of mictlantecuhtli, the the lovable trio from the acclaimed lowriders in space are back! lupe impala, elirio malaria, and el chavo octopus are living their dream at last. they're the proud owners of their very own garage. but when their beloved cat genie goes missing, they need to do everything they can to find him. little do they know the trail will lead them to the realm of mictlantecuhtli, the aztec god of the underworld, who is keepin

The paraphrased description keeps the main themes and core information of the original while varying the wording and sentence structure. This is what makes it useful for data augmentation when training a transformer on a small dataset: the model sees more linguistic variety without the meaning changing.  
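
A rough way to put a number on how much a paraphrase differs from its source is the overlap between their word sets. The following sketch uses Jaccard similarity; the helper `word_overlap` is a hypothetical name, and this measure only looks at vocabulary, not at word order or meaning.

```python
def word_overlap(original, paraphrase):
    """Jaccard similarity of the lowercase word sets (1.0 = identical vocabulary)."""
    a = set(original.lower().split())
    b = set(paraphrase.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "their beloved cat goes missing"
paraphrase = "their beloved cat disappears"
print(round(word_overlap(original, paraphrase), 2))  # 3 shared words out of 6 distinct -> 0.5
```

A value close to 1.0 would suggest the paraphraser mostly copied its input, while a very low value could be a hint that the meaning has drifted; either case is worth inspecting by hand.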

Let's put it all together.

In [19]:
# New training data
train_descriptions_all = train_descriptions + augmented_descriptions + bt_descriptions
train_titles_all = train_titles + augmented_titles + bt_titles

# DataFrame with new training data
df_prep = pd.DataFrame({
    'Title': train_titles_all,
    'Description': train_descriptions_all
})

We will now check the training, validation and test texts for any missing entries and remove them if necessary. Afterwards, the prepared DataFrames will be saved as CSV files.

In [20]:
# Create DataFrames
# df_prep with the new training data was already created in the previous cell

# df_valid with validation Data
df_valid = pd.DataFrame({
    'Valid_Title': valid_titles,
    'Valid_Description': valid_descriptions
})

# df_test with test data
df_test = pd.DataFrame({
    'Test_Title': test_titles,
    'Test_Description': test_descriptions
})

In [21]:
# Check training data for missing entries
df_prep[df_prep.isnull().any(axis=1)]

Title,Description


In [22]:
# Function that removes rows with missing values from a DataFrame and prints the number of rows that were dropped
def drop_missing_and_report(df):
    initial_rows = len(df)
    df_cleaned = df.dropna()
    dropped_rows = initial_rows - len(df_cleaned)
    print(f"Dropped {dropped_rows} rows with missing values.")
    return df_cleaned

In [23]:
# Remove rows with missing entries
df_prep = drop_missing_and_report(df_prep)

Dropped 0 rows with missing values.


In [24]:
# Remove rows with missing entries
df_valid = drop_missing_and_report(df_valid)

Dropped 0 rows with missing values.


In [25]:
# Remove rows with missing entries
df_test = drop_missing_and_report(df_test)

Dropped 0 rows with missing values.
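
Note that dropna() only removes NaN/None values; a description that is an empty or whitespace-only string counts as present. If such rows could occur after preprocessing, a stricter check along the following lines may be useful. This is a sketch under that assumption; `drop_blank_rows` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def drop_blank_rows(df):
    """Drop rows where any column is missing or contains only whitespace."""
    # Flag rows that contain an empty or whitespace-only string in any column
    blank = df.apply(lambda col: col.astype(str).str.strip() == "").any(axis=1)
    # Remove flagged rows, then let dropna() handle NaN/None as before
    return df[~blank].dropna().reset_index(drop=True)


demo = pd.DataFrame({
    "Title": ["a", "b", None, "d"],
    "Description": ["text", "   ", "text", ""],
})

cleaned = drop_blank_rows(demo)
print(f"Kept {len(cleaned)} of {len(demo)} rows.")
```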


In [26]:
# Save prepared DataFrames as CSV files
df_prep.to_csv('train_book_data.csv', index=False)
df_valid.to_csv('valid_book_data.csv', index=False)
df_test.to_csv('test_book_data.csv', index=False)