In [1]:
from huggingface_hub import notebook_login

notebook_login()


Set the paths so that Git LFS can be found when the code is executed inside a Jupyter Notebook.

In [2]:
# Setting path and git_LFS_path
import os
os.environ['PATH'] = '/opt/homebrew/bin:' + os.environ['PATH']
os.environ['GIT_LFS_PATH'] = '/opt/homebrew/bin/git-lfs'

Load the dataset for text summarization. This uses the ca_test split of BillSum, which is divided into training and test sets further down.

In [3]:
from datasets import load_dataset
billsum = load_dataset("billsum", split="ca_test")

In [4]:
billsum

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})

Split the dataset into training and test sets (80% / 20%).

In [5]:
billsum = billsum.train_test_split(test_size=0.2)

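Load the pre-trained tokenizer for the T5 model. T5 ("Text-To-Text Transfer Transformer") is a transformer-based neural network architecture for NLP that treats every task as text in, text out, which is why a task prefix is added to the inputs below.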

In [None]:
from transformers import AutoTokenizer
checkpoint = "t5-small" # name of the pre-trained model checkpoint
# Loading tokenizer by using T5 tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint) 

Create a function to preprocess the text:

Input Text Processing:
inputs = [prefix + doc for doc in examples["text"]]: This line adds the specified prefix to each document in the input text. It creates a list of strings where each string is formed by concatenating the prefix with the corresponding document.
model_inputs = tokenizer(inputs, max_length=1024, truncation=True): This line tokenizes the modified input texts using the specified tokenizer. The resulting model_inputs variable holds the tokenized inputs.

Label Text Processing (for Training):
labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True): This line tokenizes the target summaries (labels) using the same tokenizer. The resulting labels variable holds the tokenized labels.

Adding Labels to Model Inputs:
model_inputs["labels"] = labels["input_ids"]: This line adds the tokenized labels to the model_inputs dictionary under the key "labels."

Return Model Inputs:
return model_inputs: The function returns the processed inputs that can be used as input to a summarization model during training.

This preprocessing function takes a dictionary of examples containing "text" and "summary" keys, adds a prefix to the input texts, tokenizes both the modified inputs and the target summaries, and returns a dictionary of model inputs suitable for training a summarization model.

In [7]:
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

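Use the Hugging Face map function to apply the preprocessing function to the billsum dataset. The cell for this step is not shown here; later cells read the result from tokenized_billsum, so it was presumably tokenized_billsum = billsum.map(preprocess_function, batched=True).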
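As a rough illustration of what a batched map call hands to the preprocessing function, the sketch below applies the same steps to a small dict of lists. ToyTokenizer is a hypothetical stand-in (one integer id per whitespace-separated word), not the T5 tokenizer, and the short max lengths are chosen only so that truncation is visible.

```python
PREFIX = "summarize: "


class ToyTokenizer:
    """Hypothetical stand-in for the T5 tokenizer: one integer id per word."""

    def __init__(self):
        self.vocab = {}

    def _encode(self, text, max_length):
        # Assign ids in order of first appearance, then truncate.
        ids = [self.vocab.setdefault(word, len(self.vocab) + 1) for word in text.split()]
        return ids[:max_length]

    def __call__(self, text=None, text_target=None, max_length=None, truncation=False):
        batch = text if text is not None else text_target
        limit = max_length if truncation else None
        return {"input_ids": [self._encode(item, limit) for item in batch]}


def preprocess_batch(examples, tokenizer, max_input_len=6, max_target_len=3):
    # Same steps as preprocess_function, with the tokenizer passed in explicitly.
    inputs = [PREFIX + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=max_input_len, truncation=True)
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_len, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# A batch is a dict of lists, one entry per example.
batch = {
    "text": ["The bill amends the tax code to lower rates.", "Short bill."],
    "summary": ["Lowers tax rates.", "A short bill."],
}
out = preprocess_batch(batch, ToyTokenizer())
print(out)
```

The output keeps one list entry per example, with the long input cut to the maximum input length and every label sequence cut to the maximum target length.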

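Create a data collator for the sequence-to-sequence task. It pads the inputs and labels of each batch to the length of the longest example in that batch.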

In [9]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, 
                                       model=checkpoint, 
                                       return_tensors="tf")

2023-12-18 09:41:13.682074: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2, in other operations, rebuild TensorFlow with the appropriate compiler flags.
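The padding a seq2seq collator performs can be sketched in plain Python. This is a simplified illustration, not the library's implementation: pad_batch is a hypothetical helper, the token ids are made up, and it assumes T5's pad token id of 0 and the collator's default label padding value of -100.

```python
PAD_TOKEN_ID = 0     # T5's pad token id
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss


def pad_batch(features):
    """Pad a list of {"input_ids", "labels"} dicts to the longest item in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch


padded = pad_batch([
    {"input_ids": [21, 22, 23, 1], "labels": [31, 1]},
    {"input_ids": [24, 1], "labels": [32, 33, 34, 1]},
])
print(padded["input_ids"])
print(padded["attention_mask"])
print(padded["labels"])
```

Padding per batch rather than to a global maximum keeps short batches small, and the -100 label padding is the reason the metric function later has to swap that value out before decoding.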


Use the ROUGE metric (Recall-Oriented Understudy for Gisting Evaluation) for automatic evaluation of machine-generated text.

ROUGE metrics measure the overlap between the words, n-grams (contiguous sequences of n items, usually words), or word sequences in the generated summary and the reference summaries. The key ROUGE metrics include ROUGE-N (precision, recall, and F1 score for n-grams), ROUGE-L (based on the longest common subsequence), and ROUGE-W (a weighted longest common subsequence that favors consecutive matches).
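To make the definitions concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L on whitespace tokens. It is a simplified illustration only: the function names and example sentences are made up, and the evaluate implementation applies its own tokenization and optional stemming, so its scores can differ.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) for n-gram overlap between two strings."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # matches, clipped to the smaller count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, f1) based on the longest common subsequence."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    precision = lcs / max(len(cand), 1)
    recall = lcs / max(len(ref), 1)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


reference = "the bill lowers drug costs"
candidate = "the bill cuts drug costs for seniors"
print("ROUGE-1:", rouge_n(candidate, reference, n=1))
print("ROUGE-2:", rouge_n(candidate, reference, n=2))
print("ROUGE-L:", rouge_l(candidate, reference))
```

In this example four of the five reference words appear in the candidate, so ROUGE-1 recall is 4/5, while only two of the four reference bigrams survive, so ROUGE-2 recall drops to 2/4.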

In [10]:
import evaluate

rouge = evaluate.load("rouge")

Create a function to compute the ROUGE metric loaded above.

Decoding:
eval_pred is a tuple containing the model predictions and the labels.
decoded_preds and decoded_labels hold the predictions and labels decoded back to text with the tokenizer, skipping special tokens.

Handling Padding Tokens:
labels = np.where(labels != -100, labels, tokenizer.pad_token_id): This line replaces every label equal to -100 with the pad token ID. -100 is the value used to pad labels so that the loss ignores those positions. It is not a valid token ID, so it has to be replaced before the labels can be decoded.

ROUGE Computation:
result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True): This line computes the ROUGE scores from the decoded predictions and labels.

Generation Length Calculation:
prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]: This line counts the non-padding tokens in each generated summary.
result["gen_len"] = np.mean(prediction_lens): This line computes the average length of the generated summaries and adds it to the ROUGE result.

Result Formatting:
The function returns a dictionary in which every metric is rounded to four decimal places.

In [25]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
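The two NumPy steps in compute_metrics can be checked on a tiny hand-made batch. This is only a sketch with made-up token ids; it assumes T5's pad token id of 0, and the leading 0 in each prediction mimics the decoder start token that T5 places at the beginning of generated sequences.

```python
import numpy as np

PAD_TOKEN_ID = 0  # T5's pad token id

# Labels as they come out of the collator: -100 marks padded positions.
labels = np.array([[31, 1, -100, -100],
                   [32, 33, 34, 1]])

# Generated ids, padded with the pad token to a common length.
predictions = np.array([[0, 41, 42, 1, 0],
                        [0, 43, 1, 0, 0]])

# Replace -100 so the tokenizer can decode the labels.
decodable_labels = np.where(labels != -100, labels, PAD_TOKEN_ID)

# Count the non-padding tokens in each generated sequence.
prediction_lens = [np.count_nonzero(pred != PAD_TOKEN_ID) for pred in predictions]
gen_len = np.mean(prediction_lens)

print(decodable_labels)
print(prediction_lens, gen_len)
```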

# Train the model

Create an Adam optimizer with weight decay.

Adam Optimizer: an optimization algorithm that combines ideas from two common optimizers, RMSprop and momentum. Adam maintains two moving averages for each parameter: the first moment (mean) and the second moment (uncentered variance) of the gradient. These moving averages are used to adapt the step size for each parameter.

Adam Weight Decay: a variant of Adam that includes weight decay regularization. Weight decay is a regularization technique that penalizes large weights in the model to reduce overfitting. With plain SGD this is equivalent to adding an L2 penalty on the weights to the loss function. With Adam, an L2 term in the loss interacts with the moment estimates, so AdamWeightDecay instead shrinks the weights directly at each update (decoupled weight decay).
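The update described above can be written out as a single step in NumPy. This is a textbook sketch of Adam with decoupled weight decay, not the exact code inside AdamWeightDecay: adamw_step is a hypothetical helper, the learning rate and decay rate match the values used in this notebook, and beta1, beta2 and eps are illustrative defaults.

```python
import numpy as np


def adamw_step(param, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One Adam update with decoupled weight decay; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    adam_update = m_hat / (np.sqrt(v_hat) + eps)
    # The decay term acts on the weights directly instead of going through the loss.
    param = param - lr * (adam_update + weight_decay * param)
    return param, m, v


param = np.array([1.0, -2.0])
grad = np.array([0.5, -0.5])
m = np.zeros_like(param)
v = np.zeros_like(param)

new_param, m, v = adamw_step(param, grad, m, v, t=1)
print(new_param)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each weight moves by roughly the learning rate plus a small pull toward zero proportional to its own size.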

In [26]:
from transformers import AdamWeightDecay
# Adam with decoupled weight decay, using a small learning rate for fine-tuning
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Load the pre-trained seq2seq model for TensorFlow.

In [27]:
from transformers import TFAutoModelForSeq2SeqLM
model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Prepare the training and test datasets:

In [28]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_billsum["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_billsum["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Compile the model with TensorFlow:

In [29]:
import tensorflow as tf

# No loss argument: Transformers TF models fall back to their built-in task loss
model.compile(optimizer=optimizer)

Create a callback that computes the ROUGE metric on the evaluation data.

In [30]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)

Set up a callback to upload the trained model to a Hugging Face Hub repository.

In [31]:
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="amoshughugface/summarisation",
    tokenizer=tokenizer,
)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
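The huggingface/tokenizers warning in the output names its own remedy: set TOKENIZERS_PARALLELISM explicitly. A minimal sketch, assuming it runs in an early cell before the tokenizer is first used:

```python
import os

# Setting the variable explicitly silences the fork warning.
# "false" disables tokenizer parallelism, which avoids the deadlock the warning describes.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```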

In [37]:
# Only push_to_hub_callback is passed, so metric_callback (ROUGE) does not run during training.
model.fit(x=tf_train_set, validation_data=tf_test_set,
          epochs=3,
          callbacks=[push_to_hub_callback])

Epoch 1/3

Epoch 2/3


Epoch 3/3


<keras.callbacks.History at 0x7fd0541c1760>

In [43]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [51]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("amoshughugface/summarisation")
model = TFAutoModelForSeq2SeqLM.from_pretrained("amoshughugface/summarisation")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at amoshughugface/summarisation.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [52]:
inputs = tokenizer(text, return_tensors="tf", max_length=1024, truncation=True)

In [53]:
# Generate the summary
summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4, length_penalty=2.0, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

2023-12-19 12:13:18.615548: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_compile_on_demand_op.cc:178 : UNIMPLEMENTED: Could not find compiler for platform Host: NOT_FOUND: could not find registered compiler for platform Host -- was support for that platform linked in?


UnimplementedError: Exception encountered when calling layer 'SelfAttention' (type TFT5Attention).

{{function_node __wrapped__XlaDynamicSlice_device_/job:localhost/replica:0/task:0/device:CPU:0}} Could not find compiler for platform Host: NOT_FOUND: could not find registered compiler for platform Host -- was support for that platform linked in? [Op:XlaDynamicSlice]

Call arguments received by layer 'SelfAttention' (type TFT5Attention):
  • hidden_states=tf.Tensor(shape=(4, 1, 512), dtype=float32)
  • mask=tf.Tensor(shape=(4, 1, 1, 2), dtype=float32)
  • key_value_states=None
  • position_bias=None
  • past_key_value=('tf.Tensor(shape=(4, 8, 1, 64), dtype=float32)', 'tf.Tensor(shape=(4, 8, 1, 64), dtype=float32)')
  • layer_head_mask=None
  • query_length=None
  • use_cache=True
  • training=False
  • output_attentions=False