In [60]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

In [81]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

Downloading and preparing dataset billsum/default to /root/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc...


Generating train split:   0%|          | 0/18949 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3269 [00:00<?, ? examples/s]

Generating ca_test split:   0%|          | 0/1237 [00:00<?, ? examples/s]

Dataset billsum downloaded and prepared to /root/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc. Subsequent calls will reuse this data.


In [82]:
billsum = billsum.train_test_split(test_size=0.2)
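
As a rough check on the split sizes, the arithmetic can be sketched as follows. This assumes the fractional test count is rounded up and the remainder goes to the training set, which is consistent with the 989 and 248 examples reported by the later map step; the variable names are illustrative only.

```python
import math

n_examples = 1237  # size of the ca_test split reported by the load log
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up and the rest goes to train
n_test = math.ceil(test_size * n_examples)
n_train = n_examples - n_test

print(n_train, n_test)
```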

In [83]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 9050 of the Elections Code is amended to read:\n9050.\nAfter the Secretary of State determines that a measure will appear on the ballot at the next statewide election, the Secretary of State shall promptly transmit a copy of the measure to the Legislative Analyst. The Legislative Analyst shall provide and return to the Secretary of State a ballot title and summary and ballot label for the measure. The Legislative Analyst shall prepare a ballot title and summary and ballot label for each measure submitted to the voters of the whole state by a date sufficient to meet the ballot pamphlet public display deadlines.\nSEC. 2.\nSection 9051 of the Elections Code is amended to read:\n9051.\n(a) (1) The ballot title and summary may differ from the legislative, circulating, or other title and summary of the measure and shall not exceed 100 words, not including the fiscal impact.\n(2) The ballot title and

In [84]:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [85]:
prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
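
To make the shape of the preprocessing output concrete, here is a minimal sketch that swaps the real tokenizer for a hypothetical whitespace tokenizer (`toy_tokenize`). It only illustrates the prefixing, the truncation to a maximum length, and the `labels` key; the real T5 tokenizer produces integer subword IDs, and the example summary is made up.

```python
PREFIX = "summarize: "


def toy_tokenize(texts, max_length):
    # Hypothetical stand-in for the real tokenizer: one "token" per word,
    # truncated to max_length like truncation=True would do
    batch = [text.split()[:max_length] for text in texts]
    return {
        "input_ids": batch,
        "attention_mask": [[1] * len(ids) for ids in batch],
    }


def toy_preprocess(examples, max_input=8, max_target=4):
    # Same flow as preprocess_function, with tiny illustrative limits
    inputs = [PREFIX + doc for doc in examples["text"]]
    model_inputs = toy_tokenize(inputs, max_length=max_input)
    labels = toy_tokenize(examples["summary"], max_length=max_target)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


batch = {
    "text": ["The people of the State of California do enact as follows"],
    "summary": ["An act to amend the Elections Code"],  # made-up summary
}
out = toy_preprocess(batch)
print(out["input_ids"][0])
print(out["labels"][0])
```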

In [86]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)

Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]

In [88]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")

In [92]:
!pip install evaluate


In [93]:
import evaluate

rouge = evaluate.load("rouge")
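
For intuition about what the metric measures, here is a simplified sketch of ROUGE-1 as an F1 score over clipped unigram overlap. It is not the `evaluate` implementation: it skips stemming and uses plain whitespace tokenization, and `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    # Count unigrams on each side (lowercased, whitespace tokenization)
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as in the reference
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the bill amends the elections code",
    "the bill changes the elections code",
)
print(round(score, 4))
```

Five of the six words overlap on each side, so precision and recall are both 5/6 in this toy case.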

In [94]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
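
The two array operations in `compute_metrics` can be illustrated on toy data. This sketch assumes a pad token ID of 0 (the value T5 tokenizers use) and made-up token IDs; it shows how `-100` label padding is swapped for the pad ID before decoding, and how `gen_len` counts non-pad tokens.

```python
import numpy as np

PAD_TOKEN_ID = 0  # assumption: stands in for tokenizer.pad_token_id

# Labels padded with -100 so the loss ignores those positions
labels = np.array([
    [37, 1093, 1, -100, -100],
    [8, 2876, 19, 3, 1],
])
# -100 is not a valid token ID, so replace it before decoding
restored = np.where(labels != -100, labels, PAD_TOKEN_ID)

# Generated sequences, padded with the pad ID
predictions = np.array([
    [0, 37, 1093, 1, 0, 0],
    [0, 8, 2876, 19, 1, 0],
])
# Generation length = number of non-pad tokens per prediction
gen_lens = [int(np.count_nonzero(pred != PAD_TOKEN_ID)) for pred in predictions]
mean_gen_len = float(np.mean(gen_lens))

print(restored.tolist(), gen_lens, mean_gen_len)
```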

In [95]:
from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

In [97]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [98]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_billsum["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_billsum["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [99]:
import tensorflow as tf

# No loss argument: Transformers TF models default to their built-in task loss
model.compile(optimizer=optimizer)

In [102]:
from huggingface_hub import notebook_login

notebook_login()

In [107]:
from transformers.keras_callbacks import KerasMetricCallback

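# compute_metrics decodes token IDs, so predictions must come from model.generate()
metric_callback = KerasMetricCallback(
    metric_fn=compute_metrics, eval_dataset=tf_test_set, predict_with_generate=True
)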

In [108]:
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_billsum_model",
    tokenizer=tokenizer,
)

/content/my_awesome_billsum_model is already a clone of https://huggingface.co/Stavan1402/my_awesome_billsum_model. Make sure you pull the latest changes with `repo.git_pull()`.


In [109]:
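# Pass this list as `callbacks=` to model.fit() so ROUGE is computed each epoch
# and checkpoints are pushed to the Hub
callbacks = [metric_callback, push_to_hub_callback]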

In [115]:
text = "summarize: Football, also known as soccer, is a captivating sport that ignites passion and unites people across the globe. With its rich history and widespread popularity, football has become a cultural phenomenon and a source of immense joy for millions. Played on a rectangular field, this team sport involves two competing sides, each striving to score goals by maneuvering the ball into the opponent's net using any part of their body except their arms and hands. The game demands a combination of skill, strategy, teamwork, and athleticism, captivating both players and spectators alike.Football transcends borders, races, and backgrounds, fostering a sense of camaraderie among supporters and players. It has the power to bring people from different walks of life together, as they cheer for their favorite teams and players, creating a vibrant atmosphere filled with excitement and anticipation. The game's simplicity, yet complexity, provides an endless array of possibilities and strategies, captivating fans with its unpredictability.The beauty of football lies not only in the exhilarating goals and skillful displays but also in the values it instills. Discipline, perseverance, and teamwork are fundamental principles essential for success in the sport. Players must work cohesively, communicating seamlessly on the field, showcasing their individual talents while upholding the collective objective of victory. This synergy between individuals serves as a valuable life lesson, teaching the importance of collaboration, dedication, and sportsmanship.Football has not only become a significant sporting event but also a catalyst for social change. It has the ability to inspire, break down barriers, and raise awareness about critical issues. From promoting gender equality and inclusivity to advocating for fair play and justice, football has played a pivotal role in driving positive transformations within society.Whether it's the mesmerizing skills of Lionel Messi, the commanding presence of Cristiano Ronaldo, or the tactical genius of coaches like Pep Guardiola, football continues to captivate hearts and minds worldwide. The game's universal appeal reaches far beyond the pitch, leaving an indelible mark on cultures, communities, and individuals. Football is more than just a sport; it is an embodiment of passion, unity, and the pursuit of greatness."

In [116]:
from transformers import pipeline

summarizer = pipeline("summarization", model="stevhliu/my_awesome_billsum_model")
summarizer(text)

[{'summary_text': "football, also known as soccer, is a captivating sport that ignites passion and unites people across the globe. It involves two competing sides, each striving to score goals by maneuvering the ball into the opponent's net using any part of their body except their arms and hands. The game demands a combination of skill, strategy, teamwork, and athleticism, captivating both players and spectators alike, fostering a sense of camaraderie among supporters and players, as they cheer for their favorite teams and players."}]