### Notebook Overview

This notebook builds a text summarization system with Flan-T5, a pretrained language model that has been instruction-tuned and therefore performs well on tasks such as summarization. The dataset is BillSum, a collection of U.S. Congressional and California state bills with human-written summaries, which makes the project a good fit for summarizing complex, domain-specific documents. The implementation uses the Hugging Face Transformers library, which simplifies working with modern NLP models and requires less code than working directly in PyTorch or TensorFlow.

Instead of training a model from scratch, the pretrained Flan-T5 model is fine-tuned on the BillSum dataset. This is far cheaper computationally, and the model adapts to the dataset while retaining its general language understanding. The fine-tuned model is evaluated with ROUGE metrics, which measure the word-level overlap between generated summaries and reference summaries.
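To make the ROUGE idea concrete, here is a minimal sketch of ROUGE-1 computed by hand as a unigram-overlap F1 score. It is a simplification and not the `rouge_score` implementation used later in the notebook: it assumes lowercased whitespace tokenization and no stemming, and the function name `rouge1_f1` and the two example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts only as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences, not taken from BillSum.
score = rouge1_f1(
    "the bill funds rural schools",
    "the bill funds schools in rural areas",
)
print(round(score, 4))
```

All five predicted words appear in the reference (precision 1.0), but the reference has seven words (recall 5/7), so the F1 score falls between the two.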

The model is evaluated on a held-out portion of the data and is then used to generate summaries for new input text, which shows how well it handles complex content it has not seen during training.

### Install required libraries

In [1]:
!pip install -q "transformers[torch]" datasets tensorboard evaluate rouge_score nltk


### Code implementation

Import the necessary libraries and load the BillSum dataset. BillSum ships with three splits: train, test, and ca_test, of which ca_test is the smallest. Because I am running this on Google Colab with a GPU, I work with the ca_test split only and divide it into training and test sets, using the first 80% of the rows for training and the remaining 20% for testing. To use a different split, change the split argument of load_dataset.

In [2]:
from datasets import load_dataset

# Load the smallest BillSum split (California bills)
dataset = load_dataset("billsum", split="ca_test")
print(dataset)

# Hold out the last 20% of the rows as a test set (sequential split, no shuffling)
train_size = int(0.8 * len(dataset))
train_dataset = dataset.select(range(train_size))
test_dataset = dataset.select(range(train_size, len(dataset)))


Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})
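The 80/20 split above takes the first rows for training in file order. The sketch below reproduces the same index arithmetic in plain Python and adds an optional seeded shuffle, in case the rows are ordered in a way that would bias a sequential split. The helper name `split_indices` and its `seed` parameter are assumptions for illustration and do not appear in the notebook.

```python
import random


def split_indices(num_rows, train_fraction=0.8, seed=None):
    """Return (train_indices, test_indices); shuffle first only if a seed is given."""
    indices = list(range(num_rows))
    if seed is not None:
        # A dedicated Random instance keeps the shuffle reproducible.
        random.Random(seed).shuffle(indices)
    train_size = int(train_fraction * num_rows)
    return indices[:train_size], indices[train_size:]


# Same arithmetic as the notebook: 1237 rows in ca_test.
train_idx, test_idx = split_indices(1237)
print(len(train_idx), len(test_idx))

# Shuffled variant: same sizes, different membership.
shuffled_train, shuffled_test = split_indices(1237, seed=42)
```

Either index list can be passed to `dataset.select(...)`. The Datasets library also provides `Dataset.train_test_split`, which may be the more convenient route for a shuffled split.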


Import the tokenizer class from the Hugging Face Transformers library and load the tokenizer for the Flan-T5-Base checkpoint. It is used to preprocess and tokenize the dataset in preparation for fine-tuning; the model itself is loaded later, just before training.

In [3]:
from transformers import AutoTokenizer,AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")


In [4]:
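from random import randrange

# Inspect one random training example. The bill summary is the model input
# and the bill title is the target; the full bill text is printed last.
sample = train_dataset[randrange(len(train_dataset))]
print(f"summary (model input): \n{sample['summary']}\n---------------")
print(f"title (target): \n{sample['title']}\n---------------")
print(f"text: \n{sample['text']}\n--------------")

summary (model input): 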
Existing law requires the Commission on Peace Officer Standards and Training to establish and keep updated a continuing education classroom training course for peace officer interactions with persons with mental illnesses or developmental disabilities. Under existing law, this course consists of classroom instruction and utilizes interactive training methods to ensure that training is as realistic as possible. Under existing law, this course includes training in identifying indicators of mental disability, conflict resolution techniques, and alternatives to lethal force.
Existing law also requires the commission to develop, in consultation with specified entities, adequate instruction in the handling of persons with developmental disabilities or mental illnesses for inclusion in the basic training course for law enforcement officers.
This bill would require the commission, in collaboration with relevant stakeholders, to study and submit a report to the Legislature, on or bef

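This code tokenizes the combined train and test data to measure sequence lengths before preprocessing. It computes the maximum tokenized length of the inputs (summary) and of the targets (title). Because truncation=True is passed without an explicit max_length, the tokenizer cuts every sequence at its own model maximum, so a reported source length of 512 is a ceiling rather than the true length of the longest input. The two values are used in the next step as the lengths to which sequences are truncated and padded, which gives uniform input and output sizes for training.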

In [5]:
from datasets import concatenate_datasets

# Tokenize every input (summary) once to find the longest tokenized input.
# truncation=True without max_length caps sequences at the tokenizer's model maximum.
all_data = concatenate_datasets([train_dataset, test_dataset])
tokenized_inputs = all_data.map(
    lambda x: tokenizer(x["summary"], truncation=True),
    batched=True,
    remove_columns=["summary", "title"],
)
print(tokenized_inputs)
max_source_length = max(len(x) for x in tokenized_inputs["input_ids"])
print(f"Max source length: {max_source_length}")

# Do the same for the targets (title) to find the longest tokenized target.
tokenized_targets = all_data.map(
    lambda x: tokenizer(x["title"], truncation=True),
    batched=True,
    remove_columns=["summary", "title"],
)
max_target_length = max(len(x) for x in tokenized_targets["input_ids"])
print(f"Max target length: {max_target_length}")



Dataset({
    features: ['text', 'input_ids', 'attention_mask'],
    num_rows: 1237
})
Max source length: 512



Max target length: 190
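A source length of exactly 512 suggests that some inputs were cut off at the tokenizer limit. A quick way to check how often that happens, and whether a shorter padding length would cover most examples, is to look at the length distribution rather than only the maximum. The sketch below does this on a plain list of lengths; the helper `length_stats` and the sample lengths are hypothetical, and in the notebook the list would come from `len(x)` over `tokenized_inputs["input_ids"]`.

```python
import math


def length_stats(lengths, cap):
    """Summarize tokenized lengths: maximum, nearest-rank 95th percentile, count at the cap."""
    ordered = sorted(lengths)
    # Nearest-rank percentile: smallest value with at least 95% of the data at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "max": ordered[-1],
        "p95": ordered[rank - 1],
        "at_cap": sum(1 for n in ordered if n >= cap),
    }


# Hypothetical lengths for illustration only.
sample_lengths = [120, 340, 512, 512, 87, 260, 512, 401, 199, 305]
stats = length_stats(sample_lengths, cap=512)
print(stats)
```

If many sequences sit at the cap, the model never sees the end of those inputs; if very few do, padding everything to the maximum mostly spends compute on padding.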


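This function tokenizes the inputs (summary) and targets (title), pads or truncates them to the maximum lengths computed above, and replaces the padding token ids in the labels with -100 so that padded positions are ignored when the loss is computed.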

In [6]:
def preproces_function(samples, padding="max_length"):
    # Prefix each input with the T5 task instruction
    inputs = ["summarize: " + item for item in samples["summary"]]
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize the targets (titles)
    labels = tokenizer(text_target=samples["title"], max_length=max_target_length, padding=padding, truncation=True)

    # Replace pad token ids in the labels with -100 so the loss ignores padding
    if padding == "max_length":
        labels["input_ids"] = [
            [(tok if tok != tokenizer.pad_token_id else -100) for tok in label]
            for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [7]:
tokenized_train_data = train_dataset.map(preproces_function, batched=True, remove_columns=["text", "summary", "title"])
print(f"keys of tokenized train data: {list(tokenized_train_data.features)}")

tokenized_test_data = test_dataset.map(preproces_function, batched=True, remove_columns=["text", "summary", "title"])
print(f"keys of tokenized test data: {list(tokenized_test_data.features)}")


keys of tokenized train data: ['input_ids', 'attention_mask', 'labels']



keys of tokenized test data: ['input_ids', 'attention_mask', 'labels']
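The label masking in the preprocessing function can be illustrated without a tokenizer. The sketch below assumes a pad token id of 0 (the value T5 tokenizers use) and the ignore index -100 that PyTorch's cross-entropy loss skips by default. The token ids and per-token losses are invented for illustration, and `mask_padding` and `mean_loss` are hypothetical helpers.

```python
PAD_TOKEN_ID = 0      # assumption: pad id of the T5 tokenizer
IGNORE_INDEX = -100   # positions with this label are skipped by the loss


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace pad ids with the ignore index in a batch of label sequences."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


def mean_loss(per_token_losses, labels):
    """Average the loss over positions whose label is not the ignore index."""
    kept = [loss for loss, tok in zip(per_token_losses, labels) if tok != IGNORE_INDEX]
    return sum(kept) / len(kept)


# One padded label sequence with made-up token ids (1 is the end-of-sequence id).
masked = mask_padding([[71, 1155, 1, 0, 0]])
print(masked)

# Made-up per-token losses: the large values on padded positions are ignored.
loss = mean_loss([2.0, 1.0, 0.5, 9.0, 9.0], masked[0])
print(round(loss, 4))
```

Without the masking, the two padded positions would be treated as real targets and would dominate the average.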


The code below defines the evaluation metrics. It loads the ROUGE metric, decodes predictions and labels back to text (after restoring the pad token in place of -100 in the labels), strips surrounding whitespace, and puts each sentence on its own line, as rougeLsum expects. It returns the ROUGE scores as percentages together with the average length of the generated sequences in tokens.

In [8]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")
nltk.download("punkt_tab")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Multiply by 100 to express the scores as percentages
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result
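Among the scores returned by `compute_metrics`, ROUGE-L is based on the longest common subsequence (LCS) of words, so it rewards matching word order as well as matching words. The following is a minimal sketch of that idea, not the `rouge_score` implementation: it assumes lowercased whitespace tokens and no stemming, and the function names and example sentences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """F1 score based on the LCS of lowercased whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences: "rural" and "schools" appear in swapped order.
rouge_l = rouge_l_f1(
    "the bill funds rural schools",
    "the bill funds schools in rural areas",
)
print(round(rouge_l, 4))
```

Every predicted word occurs in the reference, but because two of them are in a different order the LCS is only four words long, which lowers the score.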


This code loads the pretrained sequence-to-sequence model, sets up the data collator, configures the training arguments, and starts training. The configuration sets the batch sizes and learning rate, disables fp16 because it overflows with this model, evaluates and saves a checkpoint after every epoch, keeps at most two checkpoints on disk, and reloads the best checkpoint when training ends.

In [9]:
from transformers import AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

label_pad_token_id = -100  # label padding id, ignored by the loss
# Data collator: batches examples and pads them consistently
data_collator = DataCollatorForSeq2Seq(
                  tokenizer, model=model,
                  label_pad_token_id=label_pad_token_id,
                  pad_to_multiple_of=8)


output_dir = "finetuned-flan-t5-base"
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    learning_rate=5e-5,
    num_train_epochs=3,
    # logging & evaluation strategies
    #logging_dir=f"{output_dir}/logs",
    #logging_strategy="steps",
    #logging_steps=500,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,  # keep at most 2 checkpoints on disk
    load_best_model_at_end=True,
    report_to="none",
  )

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_test_data,
    compute_metrics=compute_metrics,
)

# Start training
trainer.train()



| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 1.28191 | 57.7677 | 42.0198 | 56.3764 | 56.5616 | 18.995968 |
| 2 | No log | 1.198351 | 50.4458 | 37.0464 | 49.3112 | 49.3606 | 19.0 |
| 3 | No log | 1.180497 | 47.5108 | 33.2832 | 46.2584 | 46.3103 | 19.0 |


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


TrainOutput(global_step=372, training_loss=1.5213575465704805, metrics={'train_runtime': 753.0244, 'train_samples_per_second': 3.94, 'train_steps_per_second': 0.494, 'total_flos': 2031675064713216.0, 'train_loss': 1.5213575465704805, 'epoch': 3.0})
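The `global_step=372` in the training output can be checked with simple arithmetic. The sketch below assumes a single device and no gradient accumulation, which matches the arguments shown above as far as they are specified; `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_samples, batch_size, epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    # The last batch of an epoch may be smaller, so round up.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs


# 989 training examples, batch size 8, 3 epochs.
total_steps = total_optimizer_steps(989, 8, 3)
print(total_steps)
```

This gives 124 steps per epoch and 372 steps in total. Since the default `logging_steps` in the Trainer is 500, the run ends before the first training-loss log would be written, which is the likely reason the Training Loss column shows "No log".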

In [10]:
trainer.evaluate()



{'eval_loss': 1.1804972887039185,
 'eval_rouge1': 47.5108,
 'eval_rouge2': 33.2832,
 'eval_rougeL': 46.2584,
 'eval_rougeLsum': 46.3103,
 'eval_gen_len': 19.0,
 'eval_runtime': 40.1519,
 'eval_samples_per_second': 6.177,
 'eval_steps_per_second': 0.772,
 'epoch': 3.0}
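The evaluation result equals the epoch 3 row of the training log, which has the lowest validation loss but not the highest ROUGE scores. When `load_best_model_at_end=True` is set without `metric_for_best_model`, the Trainer compares checkpoints by evaluation loss. The sketch below uses the logged numbers to show how the choice of metric changes which epoch counts as best; `best_epoch` is a hypothetical helper, not part of Transformers.

```python
# Values copied from the training log above.
history = [
    {"epoch": 1, "eval_loss": 1.28191, "eval_rouge1": 57.7677},
    {"epoch": 2, "eval_loss": 1.198351, "eval_rouge1": 50.4458},
    {"epoch": 3, "eval_loss": 1.180497, "eval_rouge1": 47.5108},
]


def best_epoch(rows, metric, greater_is_better):
    """Return the epoch whose metric value is best under the given direction."""
    pick = max if greater_is_better else min
    return pick(rows, key=lambda row: row[metric])["epoch"]


by_loss = best_epoch(history, "eval_loss", greater_is_better=False)
by_rouge = best_epoch(history, "eval_rouge1", greater_is_better=True)
print(by_loss, by_rouge)
```

To keep the checkpoint with the highest ROUGE-1 instead, one option would be to pass `metric_for_best_model="rouge1"` and `greater_is_better=True` to `Seq2SeqTrainingArguments`. Whether that is preferable depends on which measure matters more for the application.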

In [11]:
saved_model_id="results"
trainer.model.save_pretrained(saved_model_id)
tokenizer.save_pretrained(saved_model_id)

('results/tokenizer_config.json',
 'results/special_tokens_map.json',
 'results/spiece.model',
 'results/added_tokens.json',
 'results/tokenizer.json')

This code loads the fine-tuned model and tokenizer from the local directory, initializes a summarization pipeline, selects a random sample from the test set, and generates a summary of it.

In [13]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
from random import randrange

# Load the tokenizer and model from local directory
model_name = saved_model_id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Create summarization pipeline
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, device=0)

# Select a random test sample (test_dataset was created earlier in the notebook)
sample = test_dataset[randrange(len(test_dataset))]
print(f"Summary: \n{sample['summary']}\n---------------")

# Generate a summary for the selected sample
res = summarizer(sample["summary"])

print(f"flan-t5-base summary:\n{res[0]['summary_text']}")


Token indices sequence length is longer than the specified maximum sequence length for this model (646 > 512). Running this sequence through the model will result in indexing errors


Summary: 
Existing law creates in the State Treasury the Indian Gaming Special Distribution Fund for the receipt and deposit of moneys received by the state from certain Indian tribes pursuant to the terms of gaming compacts entered into with the state. Existing law authorizes moneys in that fund to be used for specified purposes, including for grants for the support of state and local government agencies impacted by tribal government gaming. Existing law, until January 1, 2021, creates a County Tribal Casino Account in the treasury of each county that contains a tribal casino, which is funded according to specified formulas. Existing law requires the Controller, in consultation with the California Gambling Control Commission, to divide the County Tribal Casino Account for each county that has gaming devices that are subject to an obligation to make contributions to the Indian Gaming Special Distribution Fund into a separate account, known as an Individual Tribal Casino Account, for ea
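The length warning in this output (646 > 512) means the input is longer than the sequence length the tokenizer reports as the model maximum. Two common ways to handle this are to truncate the input, for example by passing `truncation=True` when calling the pipeline, or to split the input into windows and summarize each one. The sketch below illustrates the second option on a plain list of token ids. `chunk_token_ids` and its `overlap` parameter are assumptions for illustration; in the notebook the ids would come from the tokenizer.

```python
def chunk_token_ids(token_ids, max_len=512, overlap=32):
    """Split a token id list into windows of at most max_len ids that share `overlap` ids."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
    return chunks


# Stand-in for a 646-token input, using the position as the token id.
ids = list(range(646))
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])
```

For the 646-token input this yields one full window of 512 ids and a second window of 166 ids that starts 32 ids before the first one ends. Each window could then be decoded and summarized separately, at the cost of summaries that no longer see the whole document at once.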

Next, we run the model on unseen, out-of-domain text that was not used for training to see how well it summarizes.

In [15]:
indian_economy_summary = """
The Indian economy is one of the fastest-growing economies in the world, characterized by its vast population, diverse culture, and emerging market opportunities. It operates as a mixed economy, blending elements of socialism and capitalism. India's economy is driven by agriculture, manufacturing, and services, with significant contributions from sectors such as information technology, pharmaceuticals, and automotive.

Agriculture is a crucial sector, employing a large portion of the population and contributing to food security and rural livelihoods. However, the agricultural sector faces challenges such as low productivity, land fragmentation, and water scarcity, requiring reforms to enhance efficiency and sustainability.

Manufacturing is a key driver of economic growth, with India emerging as a global manufacturing hub. The "Make in India" initiative aims to promote domestic manufacturing and attract foreign investment, fostering industrial development and job creation.

Services play a vital role in the Indian economy, accounting for a significant share of GDP and employment. The information technology (IT) sector, in particular, has experienced rapid growth, making India a global leader in software services and outsourcing.

Infrastructure development is a priority for the Indian government, with investments in transportation, energy, and urban development. Initiatives such as the National Infrastructure Pipeline (NIP) aim to modernize infrastructure and support economic growth.

Despite significant progress, the Indian economy faces challenges such as income inequality, poverty, and environmental degradation. Policymakers focus on inclusive growth, social welfare programs, and sustainable development to address these challenges and unlock the country's full economic potential.

Overall, the Indian economy presents vast opportunities for investment and growth, driven by a young and dynamic workforce, entrepreneurial spirit, and ongoing reforms.
"""

print(f"Summary: \n{indian_economy_summary}\n---------------")

# Generate a summary for the new text
res = summarizer(indian_economy_summary)

print(f"flan-t5-base summary:\n{res[0]['summary_text']}")


Summary: 

The Indian economy is one of the fastest-growing economies in the world, characterized by its vast population, diverse culture, and emerging market opportunities. It operates as a mixed economy, blending elements of socialism and capitalism. India's economy is driven by agriculture, manufacturing, and services, with significant contributions from sectors such as information technology, pharmaceuticals, and automotive.

Agriculture is a crucial sector, employing a large portion of the population and contributing to food security and rural livelihoods. However, the agricultural sector faces challenges such as low productivity, land fragmentation, and water scarcity, requiring reforms to enhance efficiency and sustainability.

Manufacturing is a key driver of economic growth, with India emerging as a global manufacturing hub. The "Make in India" initiative aims to promote domestic manufacturing and attract foreign investment, fostering industrial development and job creation.

