In [2]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [4]:
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-large')

In [5]:
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-large')

In [17]:
my_text = """Summarize text:

US lawmakers are working to secure the votes needed to pass a bipartisan deal that will temporarily suspend the nation's debt ceiling.

Democrat and Republican leaders say they expect it will be approved, but some lawmakers have said they will vote against it.

The package must pass in the narrowly-divided House of Representatives before it is voted on in the Senate.

The US may default on its debt by 5 June without action being taken.

President Joe Biden called the agreement a "compromise" after a deal was reached over the weekend, while Republican House Speaker Kevin McCarthy said it was "worthy of the American people".

Negotiators have been working to sell the package on the Memorial Day federal holiday on Monday, according to US media, with both parties holding separate calls and meetings on the bill.

The House and Senate are expected to return to the Capitol on Tuesday. A vote on the bill in the House of Representative is scheduled for Wednesday, lawmakers said.

The proposed deal comes after long and bitter negotiations between Democrats and Republicans.

It includes suspending the debt ceiling until the first quarter of 2025, rather than raising it by a specific amount, as well as a cap on non-defence spending until 2024.

A text of the bill, titled the Fiscal Responsibility Act, was made public on Sunday.

A simple guide to the US debt ceiling
What's in the US debt ceiling deal?
That same day, Mr Biden told reporters that he does not believe his party made too many concessions in the agreement.

"This is a deal that is good news," Mr Biden said. "It takes the threat of catastrophic default off the table, protects our hard-earned and historic economic recovery."

Hakeem Jeffries, the Democratic House minority leader, told CBS that he believes his party will support it.

"I do expect that there will be Democratic support once we have the ability to actually be fully briefed by the White House," Mr Jeffries said on Sunday. "But I'm not going to predict what those numbers may ultimately look like."

But Ro Khanna, a California Democrat and member of the House Progressive Caucus, told NBC News on Sunday night that a "large majority" of House Democrats are "in flux" on whether they would lend their support.

Meanwhile, Mr McCarthy said on Sunday that he expects over 95% of House Republicans will support the bill.

In an opinion piece published in the Wall Street Journal late on Sunday, Mr McCarthy hailed the agreement as a hard-fought win for Republicans.

"We are changing the direction in Washington through a responsible debt-limit increase that cuts spending, saves taxpayers money and restores economic growth," he wrote.

During negotiations, Republicans had been seeking spending cuts in areas such as education and other social programmes in exchange for raising the $31.4tn (£25tn) debt limit.

As the 99-page proposed agreement was made public, some of the most conservative Republicans voiced concerns that the deal does not cut future spending enough. Republican Chip Roy of Texas said on Twitter that he and some others were going to try to stop it passing.

Some Democrats said they worried about changes in the agreement to the food stamps programme.

Aside from addressing the debt ceiling limit, the bill also proposed raising the age from 50 to 54 for those who are required to work in order to receive food benefits.

At the same time, it proposed eliminating work requirements for veterans and people who are homeless.

Republicans control the House by 222 to 213, while Democrats control the Senate by 51 to 49.

The Treasury had warned the US will run out of money if a deal is not passed.

The US must borrow money to fund the government because it spends more than it raises in taxes.

With the US dollar being the reserve currency of the world, a default would both upend the US economy and disrupt global markets.
"""

In [9]:
inputs = tokenizer(my_text, return_tensors='pt')
outputs = model.generate(**inputs,
                        min_length=200,
                        max_new_tokens=1000,
                        num_beams=16,
                        no_repeat_ngram_size=2,
                        early_stopping=True)
output_text_Flan_t5 = tokenizer.batch_decode(outputs,
                                            skip_special_tokens=True)

In [10]:
print(output_text_Flan_t5)

["Democratic House minority leader Hakeem Jeffries says he believes his party will support the bill. Republican House Speaker Kevin McCarthy hails the deal as a hard-fought win for Republicans. The US may default on its debt by 5 June without action being taken. Republicans had been seeking spending cuts in areas such as education and other social programmes in exchange for raising the $31.4tn debt limit. As the 99-page proposed agreement was made public on Sunday, some of the most conservative Republicans voiced concerns that it does not cut future spending enough. Some Democrats said they worried about changes in the agreement to the food stamps programme. A deal to temporarily suspend the nation's debt ceiling was reached over the weekend, with Democrats and Republicans holding separate calls and meetings on Monday. Negotiators have been working to sell the package on the Memorial Day federal holiday, according to US media. Democrat and Republican leaders say they expect it will be 

In [1]:
dataset_id = "samsum"

In [2]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")


Train dataset size: 14732
Test dataset size: 819





In [3]:
from random import randrange


sample = dataset['train'][randrange(len(dataset["train"]))]
print(f"dialogue: \n{sample['dialogue']}\n---------------")
print(f"summary: \n{sample['summary']}\n---------------")

dialogue: 
Camilla: I'm almost there
Diana: I won't make it before 7, traffic is horrible today
Elisabeth: Just don't race
Diana: too late
Elisabeth: what happened?
Diana: we just had an accident
Elisabeth: God, nothing serious I hope
Camilla: Diana, are you there?
Diana: sorry, no, a little crash but we're talking to this women that hit us
Diana: a very unpleasant situation
Camilla: is she insured 
Diana: she's not even sure
Camilla: what a moron
Diana: indeed
Elisabeth: Diana, we will start without you then, join us when you manage to get out of it
Diana: ok, sorry
---------------
summary: 
Diana had a car accident. Some woman hit them. She'll be late for her meeting with Camilla and Elisabeth. They'll start without her.
---------------


In [4]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="google/flan-t5-small"

# Load tokenizer of FLAN-T5-small
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [5]:
from datasets import concatenate_datasets

# Tokenize train and test together so the lengths cover every sample.
all_samples = concatenate_datasets([dataset["train"], dataset["test"]])

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
# Note: truncation=True without max_length cuts at the tokenizer's own limit,
# so the value printed here cannot exceed that limit.
tokenized_inputs = all_samples.map(
    lambda x: tokenizer(x["dialogue"], truncation=True),
    batched=True,
    remove_columns=["dialogue", "summary"],
)
max_source_length = max(len(x) for x in tokenized_inputs["input_ids"])
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = all_samples.map(
    lambda x: tokenizer(x["summary"], truncation=True),
    batched=True,
    remove_columns=["dialogue", "summary"],
)
max_target_length = max(len(x) for x in tokenized_targets["input_ids"])
print(f"Max target length: {max_target_length}")

                                                                                                                       

Max source length: 512


                                                                                                                       

Max target length: 95
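
The source length of exactly 512 is probably the truncation limit rather than the true length of the longest dialogue. The sketch below shows why a measured maximum can never exceed the limit. `observed_max_length` is a hypothetical helper, the token counts are made up for illustration, and the limit of 512 is an assumption about this tokenizer's `model_max_length`.

```python
def observed_max_length(true_lengths, truncation_limit=512):
    """Longest length seen after truncation; it can never exceed the limit."""
    return max(min(n, truncation_limit) for n in true_lengths)

# Hypothetical token counts, for illustration only (not measured on samsum).
lengths = [120, 340, 803, 95]
print(observed_max_length(lengths))
```

If the true distribution matters, tokenizing without `truncation=True` would be one way to measure it.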


In [6]:
def preprocess_function(sample, padding="max_length"):
    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["dialogue"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

                                                                                                                       

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
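
The step in `preprocess_function` that swaps padding ids for -100 can be seen in isolation with plain lists. This is a minimal sketch: `mask_padding` is a hypothetical helper, the token ids are toy values, and the pad id of 0 is an assumption that matches T5 tokenizers.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id=0):
    """Replace every padding id with IGNORE_INDEX so it does not contribute to the loss."""
    return [
        [(tok if tok != pad_token_id else IGNORE_INDEX) for tok in seq]
        for seq in label_ids
    ]

# Two toy label sequences padded to length 5 with pad id 0.
padded = [[71, 1712, 1, 0, 0], [8774, 1, 0, 0, 0]]
masked = mask_padding(padded)
print(masked)
```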




In [8]:
from transformers import AutoModelForSeq2SeqLM

# huggingface hub model id
model_id="google/flan-t5-small"

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

In [9]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Cremator\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
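
To make the reported Rouge1 numbers easier to read, here is a simplified sketch of a unigram-overlap F1 score. It is not the implementation behind `evaluate.load("rouge")`, which also normalizes and stems the text (`use_stemmer=True`) and computes ROUGE-2 and ROUGE-L. `rouge1_f1` is a hypothetical helper that only lowercases and splits on whitespace.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token is matched at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("diana had a car accident", "diana had a small accident")
print(round(score * 100, 4))
```

Four of the five tokens match on each side, so precision and recall are both 0.8 in this toy pair.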


In [10]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=2
)
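
What the collator's `label_pad_token_id` and `pad_to_multiple_of` arguments do to a batch of labels can be sketched with plain lists. This is a simplified illustration, not the `DataCollatorForSeq2Seq` implementation: `pad_labels` is a hypothetical helper and right-side padding is assumed. Because `preprocess_function` already pads everything to `max_length`, the collator has little padding left to do in this notebook.

```python
def pad_labels(batch, label_pad_token_id=-100, pad_to_multiple_of=2):
    """Right-pad every label sequence to a shared length that is a multiple of N."""
    longest = max(len(seq) for seq in batch)
    # Round the target length up to the next multiple of pad_to_multiple_of.
    target = -(-longest // pad_to_multiple_of) * pad_to_multiple_of
    return [seq + [label_pad_token_id] * (target - len(seq)) for seq in batch]

padded_batch = pad_labels([[5, 6, 7], [8]])
print(padded_batch)
```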

In [11]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Hugging Face repository id
repository_id = f"{model_id.split('/')[1]}-{dataset_id}"

# Define training args
training_args = Seq2SeqTrainingArguments(
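    # The run logged below used fp16=True and ended with training loss 0.0 and
    # eval loss nan; T5-family models are known to overflow in fp16, so it is off here.
    fp16=False,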
    output_dir=repository_id,
    predict_with_generate=True,
    learning_rate=5e-4,
    num_train_epochs=5,
    # logging & evaluation strategies
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    report_to="tensorboard",
    push_to_hub=False,
    hub_strategy="every_save",
    hub_model_id=repository_id,
    hub_token=HfFolder.get_token(),
    gradient_accumulation_steps=2,
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)

In [12]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,0.0,,41.7144,17.7106,34.2271,38.0914,16.886447
2,0.0,,41.7144,17.7106,34.2271,38.0914,16.886447
3,0.0,,41.7144,17.7106,34.2271,38.0914,16.886447
4,0.0,,41.7144,17.7106,34.2271,38.0914,16.886447
5,0.0,,41.7144,17.7106,34.2271,38.0914,16.886447


TrainOutput(global_step=4605, training_loss=0.0, metrics={'train_runtime': 2519.2074, 'train_samples_per_second': 29.239, 'train_steps_per_second': 1.828, 'total_flos': 1.369269457649664e+16, 'train_loss': 0.0, 'epoch': 5.0})
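
The `global_step=4605` in the output can be checked against the training arguments. The sketch below assumes a single device and the Trainer default `per_device_train_batch_size` of 8, since the notebook does not set a batch size. `total_optimizer_steps` is a hypothetical helper, not a Trainer function.

```python
import math

def total_optimizer_steps(num_samples, per_device_batch_size, grad_accum_steps,
                          num_epochs, num_devices=1):
    """Count optimizer updates for a run with gradient accumulation."""
    # Number of mini-batches the dataloader yields per epoch.
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_devices))
    # One optimizer update happens every grad_accum_steps mini-batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs

steps = total_optimizer_steps(14732, 8, 2, 5)
print(steps)  # 4605
```

With `gradient_accumulation_steps=2`, each update therefore sees an effective batch of 16 samples under these assumptions.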

In [13]:
trainer.evaluate()

{'eval_loss': nan,
 'eval_rouge1': 41.7144,
 'eval_rouge2': 17.7106,
 'eval_rougeL': 34.2271,
 'eval_rougeLsum': 38.0914,
 'eval_gen_len': 16.886446886446887,
 'eval_runtime': 66.8035,
 'eval_samples_per_second': 12.26,
 'eval_steps_per_second': 1.542,
 'epoch': 5.0}
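
A training loss of exactly 0.0 together with a `nan` evaluation loss usually means the run broke numerically rather than converged, and the identical ROUGE scores in every epoch point the same way. One plausible cause, assumed here rather than confirmed by the logs, is the `fp16=True` setting used for this run: half precision cannot store magnitudes above 65504. The sketch below uses only the standard library to show that limit; `fits_in_fp16` is a hypothetical helper.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def fits_in_fp16(value: float) -> bool:
    """Return True if `value` can be stored as a finite half-precision float."""
    if not math.isfinite(value):
        return False
    try:
        struct.pack("e", value)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False

print(fits_in_fp16(FP16_MAX))
print(fits_in_fp16(1e5))
```

Rerunning with `fp16=False`, or with `bf16=True` on hardware that supports it, would be one way to test this explanation.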

In [15]:
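# Save the model weights next to the tokenizer so the directory can be reloaded
# with from_pretrained() in the next cell
trainer.save_model(repository_id)
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()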

In [21]:
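# Reload the fine-tuned model and tokenizer from the local output directory
model_path = repository_id

model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Use the first 500 characters of the news article defined earlier as input
input_text = my_text[0:500]

input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate with default settings (no length or beam search arguments)
output_ids = model.generate(input_ids)

output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)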

In [22]:
print(output_text)

Senate bid to suspend the debt ceiling.
