# Summarization

In this tutorial, you will explore text summarization, a key task in natural language processing (NLP), using Hugging Face Transformers.

You will learn:

* **How to use a pretrained summarization model**
* **How to fine-tune a summarization model**

## Use a Pretrained Model

### Import the necessary libraries

In [1]:
from transformers import pipeline

### Create a summarization pipeline

We use Hugging Face's pipeline() function to create a summarization pipeline, specifying the task as "summarization".

In [2]:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")



### Provide some text for summarization

You can pass a long text that you want to summarize.

In [None]:
text = """
The field of natural language processing (NLP) is rapidly growing, with transformers being one of the most powerful tools.
They have revolutionized tasks like machine translation, text summarization, and question-answering. Transformers work by
using a mechanism called attention to focus on different parts of the input text. Since the introduction of the transformer
architecture, models such as BERT, GPT, and BART have emerged, each contributing to state-of-the-art performance in various
NLP tasks. Hugging Face provides easy access to these models through its library, making it accessible for both researchers
and developers to use pretrained models or fine-tune them on specific tasks.
"""

In [3]:
text="""


There was once a pleased elephant who generally harassed smaller animals. He would go
to the ant colony and shower water on the ants. The ants, with their size, could just cry.
The elephant laughed and threatened the ants that he would kill them
"""

### Generate a summary

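Now generate a summary for the provided text by calling the `summarizer` pipeline object.

**Parameters**:
*   `max_length`: The maximum length of the summary, in tokens.
*   `min_length`: The minimum length of the summary, in tokens.
*   `do_sample`: If `False`, decoding is deterministic: the model follows its most likely tokens (greedy search, or beam search when the model's generation config sets `num_beams` above 1). If `True`, each next token is sampled from the predicted distribution, so repeated runs can give different summaries.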
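
The effect of `do_sample` can be sketched on a single decoding step. This is a toy illustration, not the pipeline's internals: the `next_token_probs` table and both helper functions are hypothetical, and a real model produces a fresh distribution over its whole vocabulary at every step.

```python
import random

# Hypothetical next-token distribution for one decoding step.
next_token_probs = {"elephant": 0.6, "ant": 0.3, "water": 0.1}


def pick_greedy(probs):
    """Return the single most likely token (deterministic)."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
greedy_picks = [pick_greedy(next_token_probs) for _ in range(5)]
sampled_picks = [pick_sampled(next_token_probs, rng) for _ in range(5)]

print(greedy_picks)   # always the same token
print(sampled_picks)  # may mix in less likely tokens
```

Greedy selection returns the same token every time, which is why the `do_sample=False` call is reproducible, while sampling can vary from run to run unless a seed is fixed.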

In [4]:
summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print(summary[0]['summary_text'])

The elephant laughed and threatened the ants that he would kill them if they cried. The ants, with their size, could just cry.


In [6]:
summary = summarizer(text, max_length=50, min_length=25, do_sample=True)
print(summary[0]['summary_text'])

The elephant would go to the ant colony and shower water on the ants. The ants, with their size, could just cry. The elephant laughed and threatened to kill them.


## Fine-Tuning

### Install & Import Libraries

Summarization creates a shorter version of a document or an article that captures all the important information. Along with translation, it is another example of a task that can be formulated as a sequence-to-sequence task. Summarization can be:

* Extractive: extract the most relevant information from a document.
* Abstractive: generate new text that captures the most relevant information.
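
To make the extractive case concrete, here is a minimal sketch that scores each sentence by the average document-wide frequency of its words and returns the top-scoring sentence unchanged. The function `extractive_summary` and its scoring rule are illustrative assumptions, not the method used by any model in this tutorial. An abstractive model such as T5 generates new wording instead of copying sentences.

```python
import re
from collections import Counter


def extractive_summary(document, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s.strip()]
    word_counts = Counter(re.findall(r"[a-z']+", document.lower()))

    def score(sentence):
        # Average frequency of the sentence's words across the whole document.
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


doc = (
    "The elephant sprayed water on the ants. "
    "The ants could only cry. "
    "The elephant laughed at the ants."
)
summary = extractive_summary(doc, num_sentences=1)
print(summary)
```

The result is always a sentence copied from the input, which is the defining property of extractive summarization.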

This guide will show you how to:

1. Finetune T5 on the California state bill subset of the BillSum dataset for abstractive summarization.
2. Use your finetuned model for inference.

In [5]:
!pip install transformers datasets evaluate rouge_score




### Load the Dataset

In [6]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

billsum_subset = billsum.select(range(200))


### Let's Take a Look at the Data

In [7]:
billsum

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})

In [8]:
print(f'billsum has {len(billsum)} examples')
print(f'first title: {billsum[0]["title"]}')
print(f'first text: {billsum[0]["text"]}')
print(f'first summary: {billsum[0]["summary"]}')

billsum has 1237 examples
first title: An act to amend Section 215.1 of the Revenue and Taxation Code, relating to taxation, to take effect immediately, tax levy.
[remaining output truncated]

There are two fields that you’ll want to use:

* `text`: the text of the bill, which will be the input to the model.
* `summary`: a condensed version of `text`, which will be the model target.

### Load the Tokenizer

In [10]:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


### Preprocessing

The preprocessing function needs to:

1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Use the `text_target` keyword argument when tokenizing labels.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [11]:
prefix = "summarize: "


def preprocess_function(examples):
    # Prepend the task prefix so T5 treats each input as a summarization request.
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize the reference summaries as the targets.
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/v2.21.0/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up the map function by setting `batched=True` to process multiple elements of the dataset at once:

In [12]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)
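
With `batched=True`, the mapped function receives a dictionary of column lists rather than a single row, which is why `preprocess_function` iterates over `examples["text"]`. The sketch below imitates that calling convention in plain Python. `toy_batched_map` and `toy_preprocess` are hypothetical stand-ins: whitespace splitting replaces the real tokenizer, and unlike the real `map` method the toy version returns only the new column.

```python
def toy_batched_map(rows, function, batch_size=2):
    """Hypothetical stand-in for a batched map over a list of row dicts."""
    mapped = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batched function receives one dict of column lists, not one row.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = function(batch)
        for i in range(len(chunk)):
            mapped.append({key: values[i] for key, values in result.items()})
    return mapped


def toy_preprocess(examples, max_words=6):
    """Prefix each text, then truncate to max_words whitespace-separated words."""
    inputs = ["summarize: " + doc for doc in examples["text"]]
    return {"input_words": [text.split()[:max_words] for text in inputs]}


rows = [
    {"text": "An act to amend Section 215.1 of the Revenue and Taxation Code"},
    {"text": "An act relating to parole"},
    {"text": "An act relating to taxation"},
]
mapped = toy_batched_map(rows, toy_preprocess, batch_size=2)
for row in mapped:
    print(row["input_words"])
```

The first row is cut off after six words, mirroring what `truncation=True` with `max_length` does at the token level.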

Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [13]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")
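
Dynamic padding can be illustrated without the library. In this sketch, `pad_to` and `collate` are hypothetical helpers and the token ids are made up. Inputs are padded with T5's pad token id (0), and labels are padded with -100, the value the loss ignores.

```python
PAD_TOKEN_ID = 0     # T5's padding token id
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss


def pad_to(sequences, length, pad_value):
    """Right-pad every sequence to the given length."""
    return [seq + [pad_value] * (length - len(seq)) for seq in sequences]


def collate(batch):
    """Pad inputs and labels only to the longest example in this batch."""
    input_len = max(len(ex["input_ids"]) for ex in batch)
    label_len = max(len(ex["labels"]) for ex in batch)
    return {
        "input_ids": pad_to([ex["input_ids"] for ex in batch], input_len, PAD_TOKEN_ID),
        "labels": pad_to([ex["labels"] for ex in batch], label_len, LABEL_PAD_ID),
    }


examples = [
    {"input_ids": [21, 5, 9], "labels": [7, 1]},
    {"input_ids": [21, 8], "labels": [4, 6, 1]},
    {"input_ids": [21, 3, 3, 3, 3, 3, 3, 2], "labels": [9, 1]},
]

first_batch = collate(examples[:2])
second_batch = collate(examples[2:])

# Token slots used with per-batch padding versus padding everything to length 8.
dynamic_slots = sum(len(row) for b in (first_batch, second_batch) for row in b["input_ids"])
global_slots = len(examples) * max(len(ex["input_ids"]) for ex in examples)
print(first_batch)
print(dynamic_slots, global_slots)
```

The short examples in the first batch are padded only to length 3, so fewer padded positions are processed than if every example were padded to the longest sequence in the dataset.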

### Evaluate

Including a metric during training is often helpful for evaluating your model’s performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):

In [14]:
import evaluate

rouge = evaluate.load("rouge")


In [15]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
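
To give a feel for what the ROUGE numbers mean, the sketch below computes a simplified ROUGE-1 F1 score by hand: the overlap of single words between a prediction and a reference. `rouge1_f1` is a hypothetical helper that uses lowercasing and whitespace splitting only. The real metric applies its own tokenization and optional stemming, and also reports other variants such as ROUGE-2 and ROUGE-L, so its values can differ.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap between prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection counts each shared word at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the elephant threatened the ants",
    "the elephant laughed and threatened the ants",
)
print(round(score, 4))
```

Here all 5 predicted words appear in the 7-word reference, so precision is 1.0, recall is 5/7, and the F1 score is about 0.8333.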

### Train

To finetune a model in TensorFlow, start by setting up an optimizer with a learning rate and a weight decay rate:

In [16]:
from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load T5 with [TFAutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/v4.44.2/en/model_doc/auto#transformers.TFAutoModelForSeq2SeqLM):

In [17]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Convert your datasets to the `tf.data.Dataset` format with [prepare_tf_dataset()](https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset):

In [20]:
# Split the tokenized data into contiguous 70/15/15 train/validation/test slices.
train_size = int(0.7 * len(tokenized_billsum))
validation_size = int(0.15 * len(tokenized_billsum))
test_size = int(0.15 * len(tokenized_billsum))

tokenized_billsum = {
    "train": tokenized_billsum.select(range(train_size)),
    "validation": tokenized_billsum.select(range(train_size, train_size + validation_size)),
    "test": tokenized_billsum.select(
        range(train_size + validation_size, train_size + validation_size + test_size)
    ),
}

In [22]:
batch_size = 4

# Only the training set is shuffled; evaluation order stays fixed.
tf_train_set = model.prepare_tf_dataset(
    tokenized_billsum["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_billsum["validation"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_billsum["test"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)
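
The split sizes used above can be checked with plain arithmetic. This sketch assumes the 1237 examples reported for the `ca_test` split; `split_ranges` is a hypothetical helper that mirrors the `int(...)` computations. Because `int()` truncates, a couple of examples at the end are not assigned to any split, and since the slices are contiguous index ranges, the data is not shuffled before splitting.

```python
def split_ranges(num_examples, train_frac=0.7, validation_frac=0.15, test_frac=0.15):
    """Return contiguous index ranges for the train, validation, and test splits."""
    train_size = int(train_frac * num_examples)
    validation_size = int(validation_frac * num_examples)
    test_size = int(test_frac * num_examples)
    train = range(0, train_size)
    validation = range(train_size, train_size + validation_size)
    test = range(train_size + validation_size, train_size + validation_size + test_size)
    return train, validation, test


train, validation, test = split_ranges(1237)
unused = 1237 - (len(train) + len(validation) + len(test))
print(len(train), len(validation), len(test), unused)  # 865 185 185 2
```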



Configure the model for training with [compile](https://keras.io/api/models/model_training_apis/#compile-method). Note that Transformers models all have a default task-relevant loss function, so you don’t need to specify one unless you want to:

In [23]:
import tensorflow as tf

model.compile(optimizer=optimizer)  # No loss argument!

The last thing to set up before you start training is a way to compute the ROUGE score from the predictions. This is done with a [Keras callback](https://huggingface.co/docs/transformers/en/main_classes/keras_callbacks).

Pass your compute_metrics function to [KerasMetricCallback](https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/keras_callbacks#transformers.KerasMetricCallback):

In [24]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

Finally, you’re ready to start training your model! Call [fit](https://keras.io/api/models/model_training_apis/#fit-method) with your training and validation datasets, the number of epochs, and your callbacks to finetune the model:

In [25]:
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=[metric_callback])

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x7898e2394370>

In [26]:
# Save the model and tokenizer to the same directory so the pipeline can load both.
model.save_pretrained("/content/my_awesome_billsum_model")
tokenizer.save_pretrained("/content/my_awesome_billsum_model")

('/content/my_awesome_billsum_model/tokenizer_config.json',
 '/content/my_awesome_billsum_model/special_tokens_map.json',
 '/content/my_awesome_billsum_model/spiece.model',
 '/content/my_awesome_billsum_model/added_tokens.json',
 '/content/my_awesome_billsum_model/tokenizer.json')

### Inference

Great, now that you’ve finetuned a model, you can use it for inference!

Come up with some text you’d like to summarize. For T5, you need to prefix your input depending on the task you’re working on. For summarization you should prefix your input as shown below:

In [27]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a [pipeline](https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/pipelines#transformers.pipeline) for summarization with your model, and pass your text to it:

In [32]:
from transformers import pipeline

summarizer = pipeline("summarization", model="/content/my_awesome_billsum_model")
summarizer(text)

[{'summary_text': "the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs . it's the most aggressive action on tackling the climate crisis in American history . no one making under $400,000 per year will pay a penny more in taxes."}]