# Introduction

This notebook is adapted from: https://huggingface.co/docs/transformers/tasks/summarization

We skip the training part and jump straight into inference, because training crashes the Colab instance: the free version does not provide enough resources. Instead, we use the pretrained model from the tutorial above. Since that model was trained on the BillSum dataset, the evaluation uses this dataset as well. In theory, any dataset should work, as long as it provides a *summary of the input text*.

Running this takes about 20 minutes in total.


In [3]:
# Transformers installation
! pip install transformers datasets evaluate rouge_score
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git



# Summarization

In [4]:
from huggingface_hub import notebook_login

notebook_login()


## Load BillSum dataset

Start by loading the smaller California state bill subset of the BillSum dataset from the 🤗 Datasets library:

In [5]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

print(billsum)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})


Split the dataset into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [6]:
billsum = billsum.train_test_split(test_size=0.2)
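To get a feeling for what `test_size=0.2` means for the 1,237 rows loaded above, here is a minimal sketch of the size arithmetic. The helper `split_sizes` is hypothetical, and the rounding rule (test share rounded up, train set gets the rest) is an assumption rather than a statement about the library internals.

```python
import math


def split_sizes(num_rows: int, test_size: float) -> tuple[int, int]:
    """Return (train_rows, test_rows) for a fractional test_size.

    Assumption: the test share is rounded up and the train set gets the rest.
    """
    test_rows = math.ceil(num_rows * test_size)
    train_rows = num_rows - test_rows
    return train_rows, test_rows


print(split_sizes(1237, 0.2))  # (989, 248)
```

The split itself is random, so the rows that end up in each part differ between runs. Passing a fixed `seed` to `train_test_split` should make it reproducible.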

Then take a look at an example:

In [7]:
print(billsum)

billsum["train"][0]

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 989
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 248
    })
})


{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 14186.36 of the Welfare and Institutions Code is amended to read:\n14186.36.\n(a) It is the intent of the Legislature that a universal assessment process for LTSS be developed and tested. The initial uses of this tool may inform future decisions about whether to amend existing law regarding the assessment processes that currently apply to LTSS programs, including IHSS.\n(b) (1) In addition to the activities set forth in paragraph (9) of subdivision (a) of Section 14186.35, county agencies shall continue IHSS assessment and authorization processes, including making final determinations of IHSS hours pursuant to Article 7 (commencing with Section 12300) of Chapter 3 and regulations promulgated by the State Department of Social Services.\n(2) No sooner than January 1, 2015, for the counties and beneficiary categories specified in subdivision (e), counties shall also utilize the universal assessme

There are two fields that you'll want to use:

- `text`: the text of the bill, which will be the input to the model.
- `summary`: a condensed version of `text`, which will be the model target.

## Preprocess

The next step is to load a T5 tokenizer to process `text` and `summary`:

In [8]:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)



The preprocessing function you want to create needs to:

1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Use the `text_target` keyword argument when tokenizing labels.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [9]:
prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
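The effect of the three steps can be illustrated without downloading a tokenizer. The following sketch uses a hypothetical whitespace tokenizer (`toy_tokenize`) as a stand-in for the T5 tokenizer, with very small `max_length` values so that the truncation is visible. The example texts are made up, and the real tokenizer returns token ids instead of words.

```python
PREFIX = "summarize: "


def toy_tokenize(text: str, max_length: int) -> list[str]:
    """Hypothetical stand-in for the T5 tokenizer: split on whitespace, then truncate."""
    return text.split()[:max_length]


def toy_preprocess(examples: dict, max_input: int = 8, max_target: int = 4) -> dict:
    # Step 1: prefix every input document with the task prompt.
    inputs = [PREFIX + doc for doc in examples["text"]]
    return {
        # Step 3: inputs and labels are truncated to their own maximum lengths.
        "input_ids": [toy_tokenize(doc, max_input) for doc in inputs],
        # Step 2: the summaries are tokenized separately and become the labels.
        "labels": [toy_tokenize(s, max_target) for s in examples["summary"]],
    }


batch = {
    "text": ["This bill would amend the Welfare and Institutions Code to extend the program."],
    "summary": ["Extends an existing welfare program."],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])  # starts with 'summarize:' and has 8 entries
print(out["labels"][0])     # 4 entries
```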

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

In [10]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)


Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [11]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")
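Why dynamic padding is cheaper can be shown with plain Python lists. This is only a sketch of the idea, not the collator's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and the pad values (0 for inputs, -100 for labels) are assumed to match the defaults for T5 and the collator.

```python
PAD_ID = 0           # assumed pad token id for the inputs
LABEL_PAD_ID = -100  # assumed pad value for labels, ignored by the loss


def pad_batch(sequences, pad_value):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


# Four made-up tokenized examples of very different lengths.
dataset = [[5, 6], [7, 8, 9], [1] * 10, [2, 3, 4, 5]]
batches = [dataset[i:i + 2] for i in range(0, len(dataset), 2)]

# Dynamic padding: each batch is padded only to its own longest sequence.
dynamic = [pad_batch(b, PAD_ID) for b in batches]
dynamic_tokens = sum(len(row) for b in dynamic for row in b)

# Static padding: every example is padded to the longest sequence overall.
static_tokens = len(dataset) * max(len(seq) for seq in dataset)

print(dynamic_tokens, static_tokens)        # 26 40
print(pad_batch([[1, 2], [3]], LABEL_PAD_ID))  # [[1, 2], [3, -100]]
```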

## Train

**Note: it is not yet clear how much of this section is still needed. It is left in place and executed for now. We do not run the training ourselves, because it crashes the Colab instance due to insufficient computing power.**

<Tip>

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-a-tensorflow-model-with-keras)!

</Tip>
To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

In [13]:
from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load T5 with [TFAutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.TFAutoModelForSeq2SeqLM):

In [14]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Convert your datasets to the `tf.data.Dataset` format with [prepare_tf_dataset()](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset):

In [15]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_billsum["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_billsum["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
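As a rough check of what these two datasets contain, the number of batches can be computed by hand. The sketch below assumes that `drop_remainder` defaults to the value of `shuffle`, so the shuffled train set drops its last incomplete batch while the test set keeps it. That would be consistent with the 248 summaries generated further below. `num_batches` is a hypothetical helper, and the row counts are taken from the split shown earlier.

```python
import math


def num_batches(num_rows: int, batch_size: int, drop_remainder: bool) -> int:
    """Number of batches when iterating once over num_rows examples."""
    if drop_remainder:
        # The last, incomplete batch is discarded.
        return num_rows // batch_size
    # The last batch may be smaller than batch_size.
    return math.ceil(num_rows / batch_size)


print(num_batches(989, 16, drop_remainder=True))   # 61
print(num_batches(248, 16, drop_remainder=False))  # 16
```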

## Inference

Since we skip the training in this notebook, we use the already finetuned model from the tutorial for inference.

Come up with some text you'd like to summarize. For T5, you need to prefix your input depending on the task you're working on. For summarization you should prefix your input as shown below:

In [16]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for summarization with your model, and pass your text to it:

In [17]:
from transformers import pipeline

summarizer = pipeline("summarization", model="stevhliu/my_awesome_billsum_model")
summarizer(text)

Your max_length is set to 200, but your input_length is only 103. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


[{'summary_text': "the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country."}]

In [18]:
print(tf_test_set)

import tensorflow as tf
from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer

# Load pretrained model and tokenizer
model_name = "stevhliu/my_awesome_billsum_model"
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)


# function for inference
def generate_summary(batch):
    inputs = batch[0]  # extract the input features from the (inputs, labels) tuple
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]
    summaries = model.generate(input_ids, attention_mask=attention_mask, max_length=128)
    return summaries

# start inference
all_summaries = []  # holds all generated summaries after completion
for batch in tf_test_set:
    summaries = generate_summary(batch)
    decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True) for s in summaries]
    all_summaries.extend(decoded_summaries)




<_PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, None), dtype=tf.int64, name=None)}, TensorSpec(shape=(None, None), dtype=tf.int64, name=None))>


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFT5ForConditionalGeneration: ['decoder.embed_tokens.weight', 'encoder.embed_tokens.weight', 'lm_head.weight']
- This IS expected if you are initializing TFT5ForConditionalGeneration from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFT5ForConditionalGeneration from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [19]:
# just some debug prints
print(len(all_summaries))
print(all_summaries[0])

248
The United States Consumer Product Safety Commission identified window coverings as one of the top five hidden home hazards in the country. The United States Consumer Product Safety Commission has recorded 184 reported fatal strangulations nationwide from 1996 through 2012 involving window covering cords among children eight years and younger. The United States Consumer Product Safety Commission has recorded 184 reported fatal strangulations nationwide from 1996 through 2012 involving window covering cords among children eight years and younger. The Legislature has subsequently amended this measure to enact legislation to protect children from the preventable strangulation hazard.


In [20]:
#debug print
print(billsum["test"][0]["summary"])

# build a list of the reference summaries in order to compute the metrics
reference_summaries = []

for data in billsum["test"]:
    reference_summaries.append(data["summary"])

print(reference_summaries)

Existing law generally regulates various business activities and practices, including the sale within the state of cribs and bunk beds intended for use by children.
This bill would, beginning January 1, 2018, make it unlawful to sell to a purchaser located in the state a corded window covering. The bill would define “corded window covering” as a window covering, including, but not limited to, blinds, curtains, draperies, and shades, that has an accessible cord, and would define “accessible cord” as any cord determined to be accessible pursuant to the 2012 American National Standard for Safety of Corded Window Covering Products adopted by the United States Consumer Product Safety Commission and any successor standards.
Existing law provides for the licensing and regulation of community care facilities, including, among others, residential facilities, foster family homes, certified family homes, and group homes by the State Department of Social Services. Existing law also provides for th

In [21]:
# compute rouge score

import evaluate

rouge = evaluate.load("rouge")

results = rouge.compute(predictions=all_summaries, references=reference_summaries)



In [22]:
print(results)

{'rouge1': 0.29718226987457147, 'rouge2': 0.12891403592204048, 'rougeL': 0.20136778033557284, 'rougeLsum': 0.2519496526399201}
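To make the `rouge1` number easier to interpret, here is a simplified sketch of a ROUGE-1 F-score for a single prediction/reference pair. It is an illustration under assumptions, not the implementation behind `evaluate.load("rouge")`: it tokenizes on whitespace, lowercases, and applies no stemming or aggregation over the dataset. The two sentences are made up.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F-score based on clipped unigram overlap."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps, per word, the smaller of the two counts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f(
    "the bill bans corded window coverings",
    "this bill would ban the sale of corded window coverings",
)
print(round(score, 3))  # 0.625
```

Five of the six predicted words also occur in the ten-word reference, which gives a precision of 5/6, a recall of 5/10 and an F-score of 0.625.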


In [23]:
# define function to calculate the jaccard score

def jaccard_similarity(list1, list2):
    set1 = set(list1)
    set2 = set(list2)
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    if not union:
        return 0.0
    return len(intersection) / len(union)
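A small worked example shows how this score behaves. The definition is repeated in compact form so that the snippet runs on its own. The two sentences are made up and split on whitespace, whereas the notebook uses the model tokenizer.

```python
def jaccard_similarity(list1, list2):
    """Size of the intersection divided by the size of the union of two token sets."""
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0


ref_tokens = "this bill would ban corded window coverings".split()
gen_tokens = "the bill would ban ban ban window cords".split()

# 4 shared tokens (bill, would, ban, window) out of 9 distinct tokens overall.
example_score = jaccard_similarity(ref_tokens, gen_tokens)
print(round(example_score, 3))  # 0.444

# Repeated tokens do not matter, because both lists are converted to sets.
print(jaccard_similarity(["a", "a", "b"], ["a", "b"]))  # 1.0
```

Because the score works on sets, a summary that repeats the same sentence several times is not penalized for the repetition.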


In [26]:
# calculate jaccard
jaccard_scores = []

for ref, gen in zip(reference_summaries, all_summaries):
    ref_tokens = tokenizer.tokenize(ref)
    gen_tokens = tokenizer.tokenize(gen)
    score = jaccard_similarity(ref_tokens, gen_tokens)
    jaccard_scores.append(score)

# calculate average jaccard score and print
average_jaccard = sum(jaccard_scores) / len(jaccard_scores)
print(f'AVG Jaccard: {average_jaccard}')

# the rest of the notebook follows the original tutorial again

AVG Jaccard: 0.19226321717656186


You can also manually replicate the results of the `pipeline` if you'd like:


Tokenize the text and return the `input_ids` as TensorFlow tensors:

In [19]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_billsum_model")
inputs = tokenizer(text, return_tensors="tf").input_ids



Use the [generate()](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.TFGenerationMixin.generate) method to create the summarization. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](https://huggingface.co/docs/transformers/main/en/tasks/../main_classes/text_generation) API.

In [21]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("stevhliu/my_awesome_billsum_model", from_pt=True)
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFT5ForConditionalGeneration: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']
- This IS expected if you are initializing TFT5ForConditionalGeneration from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFT5ForConditionalGeneration from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Decode the generated token ids back into text:

In [22]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

"The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history. It'll ask the ultra-wealthy and corporations to pay their fair share."