# Title Generation with T5 - Text Summarization - Custom LR Approach

Sequence-to-sequence language modeling: T5 is an encoder-decoder model. The encoder reads the whole abstract and the decoder generates the title one token at a time, so during training the decoder inputs are the labels shifted one position to the right. To make sure the model does not cheat, the decoder's self-attention is masked so that tokens cannot attend to tokens to their right, as this would result in label leakage.
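
The masking and label-shifting idea described above can be illustrated in plain Python. This is only a toy sketch: the helper names `causal_mask` and `shift_right` and the token ids are made up for illustration, and the real logic lives inside the model.

```python
def causal_mask(n):
    """Return an n x n mask where mask[i][j] is True if position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def shift_right(label_ids, start_id):
    """Build decoder inputs by prepending a start token and dropping the last label."""
    return [start_id] + label_ids[:-1]


# Each row shows which positions a token may look at ("x") and which are hidden (".")
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))

# Hypothetical token ids; 0 plays the role of the start token, 1 the end-of-sequence token
labels = [11, 12, 13, 1]
decoder_inputs = shift_right(labels, start_id=0)
print(decoder_inputs)
```

At every position the decoder sees only the tokens up to that point and is asked to predict the label at the same index, which is the next token of the title.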

In [None]:
import evaluate
from huggingface_hub import notebook_login
rouge_score = evaluate.load("rouge")
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, AdamWeightDecay, pipeline
from transformers import DefaultDataCollator, DataCollatorForSeq2Seq
import tensorflow as tf
from datasets import Dataset, DatasetDict, load_dataset
from sklearn.model_selection import train_test_split
import pandas as pd
import re
import math
import plotly.express as px
import plotly.io as pio
import os
import numpy as np
import nltk
from nltk import sent_tokenize
from tqdm import tqdm
import seaborn as sns
from clr_callback import *
%matplotlib inline
import matplotlib.pyplot as plt
pio.renderers.default = 'notebook_connected'
os.environ["TOKENIZERS_PARALLELISM"] = "false"
notebook_login()

## Importing the baseline model and tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")
#tokenizer.pad_token = tokenizer.eos_token
model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")#, pad_token_id=tokenizer.eos_token_id)

## Importing Data

In [None]:
# Dataset with longer texts to summarize (the model's max input length is 512 tokens)
data = load_dataset("CShorten/ML-ArXiv-Papers", split='train') #set to 100000
data

In [None]:
words = [len(x.split()) for x in data["title"]]
px.histogram(words, nbins=100, text_auto=True, labels={"value":"Title Length (words)"})

In [None]:
abstracts = [len(x.split()) for x in data["abstract"]]
px.histogram(abstracts, nbins=400, marginal="rug", labels={"value":"Abstract Length (words)"})

In [None]:
# Average title length in words
tot=0
for x in data["title"]:
    tot+=len(x.split())
tot/len(data)

In [None]:
# Average abstract length in words
tot=0
for x in data["abstract"]:
    tot+=len(x.split())
tot/len(data)

## Splitting into Train, Test, Validation

In [None]:
train = data.train_test_split(shuffle = True, seed = 200, test_size=0.2)
validation = train['test'].train_test_split(shuffle = True, seed = 200, test_size=0.5)

data_set = DatasetDict({
    'train': train['train'],
    'test': validation['train'],
    'validation': validation['test']})

data_set
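
The nested split above keeps 80% of the rows for training and divides the remaining 20% evenly between test and validation. A minimal sketch of the same arithmetic on plain row indices follows; `split_items` is a hypothetical helper, and the `datasets` library may round the split sizes differently.

```python
import random


def split_items(items, test_size, seed):
    """Shuffle a copy of `items` and split off a `test_size` fraction (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = range(1000)  # stand-in for dataset row indices
train_rows, held_out = split_items(rows, test_size=0.2, seed=200)
test_rows, validation_rows = split_items(held_out, test_size=0.5, seed=200)

print(len(train_rows), len(test_rows), len(validation_rows))  # 800 100 100
```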

## Tokenize Data with HF Tokenizer

In [None]:
def tokenization(data):
    model_inputs = tokenizer(data["abstract"], max_length=300, truncation=True)     #, padding="max_length")

    # Tokenize the targets (titles); `text_target` replaces the deprecated as_target_tokenizer() context
    labels = tokenizer(text_target=data["title"], max_length=17, truncation=True)   #, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
tokenized = data_set.map(tokenization, batched = True, num_proc=4)

In [None]:
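# remove_columns returns a new DatasetDict, so the result must be assigned back
tokenized = tokenized.remove_columns(["Unnamed: 0", "Unnamed: 0.1"])
tokenized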

## Converting Train and Val Sets to TF Datasets

Use `DataCollatorForSeq2Seq` to create batches of examples. It dynamically pads the inputs and labels to the length of the longest element in each batch, so they have a uniform length. While it is possible to pad in the tokenization function by setting `padding=True`, dynamic padding is more efficient because short batches are not padded to the global maximum length.

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
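
To make dynamic padding concrete, here is a small plain-Python sketch of what happens to one batch. It is not the collator's actual implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and the label padding value of -100 is assumed to be the collator's default, which the loss ignores.

```python
INPUT_PAD_ID = 0     # assumption: the tokenizer's pad token id
LABEL_PAD_ID = -100  # assumption: label padding value ignored by the loss


def pad_batch(sequences, pad_id):
    """Pad every sequence to the length of the longest one in this batch only."""
    longest = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return padded, mask


batch_input_ids = [[5, 6, 7, 1], [8, 1]]   # hypothetical token ids
batch_labels = [[21, 1], [22, 23, 24, 1]]

padded_inputs, attention_mask = pad_batch(batch_input_ids, INPUT_PAD_ID)
padded_labels, _ = pad_batch(batch_labels, LABEL_PAD_ID)

print(padded_inputs)
print(attention_mask)
print(padded_labels)
```

The batch is padded to length 4, the longest sequence it contains, rather than to the `max_length` used at tokenization time.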

Next, we convert our datasets to `tf.data.Dataset`, which Keras understands natively. There are two ways to do this - we can use the slightly more low-level [`Dataset.to_tf_dataset()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset) method, or we can use [`Model.prepare_tf_dataset()`](https://huggingface.co/docs/transformers/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset). The main difference between these two is that the `Model` method can inspect the model to determine which column names it can use as input, which means you don't need to specify them yourself. It also supplies a data collator by default which is appropriate for most tasks.

In [None]:
batch_size = 16

train_dataset = model.prepare_tf_dataset(
    tokenized["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator
)

validation_dataset = model.prepare_tf_dataset(
    tokenized["validation"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator
)

In [None]:
train_dataset

## Compiling, Fitting, and Evaluating the Model

Once we've done that, it's time for our optimizer! We can initialize our `AdamWeightDecay` optimizer directly, or we can use the [`create_optimizer`](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.create_optimizer) function to generate an `AdamWeightDecay` optimizer with a learning rate schedule. In this case, instead of a constant learning rate, we build a custom schedule with Keras' `ExponentialDecay` and pass it to `AdamWeightDecay` directly.

In [None]:
# exponential decay scheduler
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4, #1e-3 or 1e-4 or 3e-4
    decay_steps=500, #refers to iterations, not epochs - 1000 too large
    decay_rate=0.98,
    staircase=True)
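
With `staircase=True`, the schedule above keeps the learning rate constant for `decay_steps` optimizer steps and then multiplies it by `decay_rate`. A plain-Python sketch of that formula follows; the function name `staircase_exponential_decay` is made up for illustration and is not part of Keras.

```python
def staircase_exponential_decay(step, initial_lr=1e-4, decay_steps=500, decay_rate=0.98):
    """Learning rate after `step` optimizer steps: initial_lr * decay_rate ** floor(step / decay_steps)."""
    return initial_lr * decay_rate ** (step // decay_steps)


# Print the learning rate at a few training steps
for step in (0, 499, 500, 5000, 15000):
    print(step, staircase_exponential_decay(step))
```

Because the decay is applied per step rather than per epoch, the total reduction over training depends on the number of batches per epoch.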

In [None]:
optimizer = AdamWeightDecay(learning_rate=lr_schedule, beta_1=0.9, beta_2=0.99, epsilon=1e-9, weight_decay_rate=0.01)

In [None]:
model.compile(optimizer=optimizer)
model.summary()

In [None]:
# This cell is optional
from transformers.keras_callbacks import PushToHubCallback

model_name = "T5"
push_to_hub_model_id = f"{model_name}-finetuned-abstracts-custLR"

push_to_hub_callback = PushToHubCallback(
    output_dir="./clm_model_save_custLR",
    tokenizer=tokenizer,
    hub_model_id=push_to_hub_model_id,
    hub_token=os.environ.get("HF_TOKEN")  # assumed env var name; never hard-code access tokens in a notebook
)

In [None]:
# This cell is optional
from tensorflow.keras.callbacks import TensorBoard

tensorboard_callback = TensorBoard(
    log_dir="./tensorboard_custLR",
    update_freq=1,
    histogram_freq=1,
    profile_batch="2,10",
)

callbacks = [tensorboard_callback, push_to_hub_callback]

In [None]:
# Not used by model.fit below; kept for use with e.g. transformers' KerasMetricCallback
def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_predictions = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_predictions]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = rouge_score.compute(predictions=decoded_predictions, references=decoded_labels, use_stemmer=True)

    # Scale to percentages; assumes a recent `evaluate` release, where ROUGE scores are plain floats
    result = {key: value * 100 for key, value in result.items()}
    # Add mean generated length
    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions
    ]
    result["gen_len"] = np.mean(prediction_lens)

    return result

In [None]:
# Use 1-3 epochs, but more data
# `callbacks` includes push_to_hub_callback, so checkpoints are uploaded to the Hub.
# To train without uploading, pass callbacks=[tensorboard_callback] instead.
model.fit(train_dataset, validation_data=validation_dataset, epochs=3, callbacks=callbacks)

Evaluate the model to get its cross-entropy loss on the validation set.

In [None]:
eval_loss = model.evaluate(validation_dataset)

In [None]:
def compute_rouge(eval_set, max_length, min_length, model, tokenizer, n_iter=None):
    summarizer = pipeline(
        "summarization",
        model=model,
        tokenizer=tokenizer,
        framework="tf"
    )
    if n_iter is None:
        n_iter = len(eval_set)
    decoded_preds = []
    decoded_labels = []

    # Reshuffle on every call instead of reusing a cached shuffle
    eval_set = eval_set.shuffle(seed=None, load_from_cache_file=False)

    for i in tqdm(range(n_iter)):
        example = eval_set[i]  # fetch one row instead of loading the whole column each iteration
        summary = summarizer(example["abstract"], max_length=max_length, min_length=min_length)
        decoded_preds.append(summary[0]["summary_text"])
        labels = re.sub("\n", "", example["title"])
        decoded_labels.append(labels)


    result = rouge_score.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    return result
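
To give an intuition for what the ROUGE scores returned above measure, here is a simplified sketch of ROUGE-1 as a unigram-overlap F-measure. It is not the implementation used by `evaluate`: there is no stemming and tokenization is plain whitespace splitting. `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F-measure between a predicted and a reference title (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("deep learning for graphs", "deep learning on graphs"))  # 0.75
```

Three of the four words match in both directions, so precision and recall are both 0.75 and so is their harmonic mean.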

In [None]:
# results references from: https://paperswithcode.com/sota/abstractive-text-summarization-on-cnn-daily
# computed on the full test set
compute_rouge(tokenized["test"], 12, 8, model, tokenizer)

In [None]:
# computed on subset
compute_rouge(tokenized["test"], 13, 5, model, tokenizer, n_iter=20)

## Generating Text Using a Pipeline

In [None]:
summarizer = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
    framework="tf"
)

In [None]:
text = """The vision proposed by the Semantic Web and the use of Linked Data allows for large-scale merging and integration  of data, thus giving access to a large amount of information. All this requires open standards and interoperability, properties not easy to achieve. The difficulty when integrating data resides in the fact that databases rarely adopt the same attributes to represent the same objects.
As a result, an object is represented differently in different databases.
To overcome this problem we need to add new definitions to the data. This new information is known as vocabulary (or ontology). Consequently, standardization of different vocabularies is also a practical difficulty.
The benefits, however, concern greater efficiency in terms of search: the more data connections, the richer results are obtained. In addition, the schema flexibility of technologies like RDF  is an advantage over relational databases, where a change to the schema can pose difficulties. Other advantages include scalability and speed.
Another important property of linked data is called inference and refers to the ability to infer new connections between data from the existing ones."""

res = summarizer(text, max_length=100, min_length=20)
res

## Pushing to the Hub

In [None]:
from huggingface_hub import Repository
repo = Repository(local_dir="./clm_model_save")

In [None]:
repo.git_pull()
repo.push_to_hub(commit_message="Commit my-awesome-file to the Hub")
repo.git_push()

In [None]:
input_ids = tokenizer.encode(text, return_tensors="tf")
output = model.generate(input_ids, max_length=50, no_repeat_ngram_size=2, early_stopping=True)
tokenizer.decode(output[0], skip_special_tokens=True)
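
The `no_repeat_ngram_size=2` argument used above prevents any bigram from appearing twice in the generated sequence. The sketch below shows the underlying idea on plain token lists; `banned_next_tokens` is a hypothetical helper, not the function used inside `generate`.

```python
def banned_next_tokens(generated, ngram_size=2):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = generated[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


# Hypothetical token ids: the bigram (4, 7) already occurred, so 7 may not follow the final 4
print(banned_next_tokens([4, 7, 4]))  # {7}
```

During decoding, the scores of the banned tokens would be set so low that they can never be selected at that step.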

## Using a model from the hub

In [None]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("EdBianchi/T5-finetuned-abstracts")
model = TFAutoModelForSeq2SeqLM.from_pretrained("EdBianchi/T5-finetuned-abstracts")

In [None]:
model.config

In [None]:
prompt = """The vision proposed by the Semantic Web and the use of Linked Data allows for large-scale merging and integration  of data, thus giving access to a large amount of information. All this requires open standards and interoperability, properties not easy to achieve. The difficulty when integrating data resides in the fact that databases rarely adopt the same attributes to represent the same objects.
As a result, an object is represented differently in different databases.
To overcome this problem we need to add new definitions to the data. This new information is known as vocabulary (or ontology). Consequently, standardization of different vocabularies is also a practical difficulty.
The benefits, however, concern greater efficiency in terms of search: the more data connections, the richer results are obtained. In addition, the schema flexibility of technologies like RDF  is an advantage over relational databases, where a change to the schema can pose difficulties. Other advantages include scalability and speed.
Another important property of linked data is called inference and refers to the ability to infer new connections between data from the existing ones."""

In [None]:
summarizer1 = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
    framework="tf"
)

res = summarizer1(prompt, max_length=12, min_length=8)
res