# Data Collection

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
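import torch

# Release cached GPU memory and reset the peak-memory counter before starting.
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()  # replaces the deprecated reset_max_memory_allocated()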

In [4]:
from datasets import load_dataset

In [5]:
dataset = load_dataset("multi_news", trust_remote_code=True)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['document', 'summary'],
        num_rows: 44972
    })
    validation: Dataset({
        features: ['document', 'summary'],
        num_rows: 5622
    })
    test: Dataset({
        features: ['document', 'summary'],
        num_rows: 5622
    })
})


In [8]:
dataset["train"][0]

{'document': 'National Archives \n \n Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs. \n \n A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February. The unemployment rate is expected to hold steady at 8.3%. \n \n Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. And while you’re here, why don’t you sign up to follow us on Twitter. \n \n Enjoy the show. ||||| Employers pulled back sharply on hiring last month, a reminder that the U.S. economy 

In [11]:
dataset["test"][0]

{'document': 'GOP Eyes Gains As Voters In 11 States Pick Governors \n \n Enlarge this image toggle caption Jim Cole/AP Jim Cole/AP \n \n Voters in 11 states will pick their governors tonight, and Republicans appear on track to increase their numbers by at least one, with the potential to extend their hold to more than two-thirds of the nation\'s top state offices. \n \n Eight of the gubernatorial seats up for grabs are now held by Democrats; three are in Republican hands. Republicans currently hold 29 governorships, Democrats have 20, and Rhode Island\'s Gov. Lincoln Chafee is an Independent. \n \n Polls and race analysts suggest that only three of tonight\'s contests are considered competitive, all in states where incumbent Democratic governors aren\'t running again: Montana, New Hampshire and Washington. \n \n While those state races remain too close to call, Republicans are expected to wrest the North Carolina governorship from Democratic control, and to easily win GOP-held seats in

In [13]:
# Each "document" concatenates several source articles separated by "|||||".
articles = list(dataset["train"]["document"])

In [14]:
article_count = [len(y.split('|||||')) for y in articles]

In [15]:
for i in set(article_count):
    print(i,':',article_count.count(i))

1 : 504
2 : 23743
3 : 12577
4 : 4921
5 : 1845
6 : 707
7 : 371
8 : 194
9 : 81
10 : 29
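
The same distribution can be computed in a single pass with `collections.Counter`, rather than calling `list.count` once per distinct value. The sketch below is a minimal illustration: `count_articles` is a hypothetical helper and the toy documents are made up, not taken from the dataset.

```python
from collections import Counter

SEPARATOR = "|||||"  # article delimiter used in the Multi-News "document" field


def count_articles(documents):
    """Return {number_of_articles: number_of_documents}, sorted by article count."""
    counts = Counter(len(doc.split(SEPARATOR)) for doc in documents)
    return dict(sorted(counts.items()))


# Toy documents (made up for illustration, not taken from the dataset).
toy_docs = [
    "only one article",
    "first ||||| second",
    "a ||||| b",
    "a ||||| b ||||| c",
]

for n_articles, n_docs in count_articles(toy_docs).items():
    print(n_articles, ":", n_docs)
```

On the real data this would be called as `count_articles(articles)`.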


# Data Preprocessing

## Creating Dataframes using the Training and Validation Sets

In [18]:
import pandas as pd

In [20]:
# Using training set
news_train_df = pd.DataFrame(dataset["train"])

In [21]:
news_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44972 entries, 0 to 44971
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   document  44972 non-null  object
 1   summary   44972 non-null  object
dtypes: object(2)
memory usage: 702.8+ KB


In [25]:
# Using Validation Set
news_val_df = pd.DataFrame(dataset["validation"])

In [26]:
news_val_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5622 entries, 0 to 5621
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   document  5622 non-null   object
 1   summary   5622 non-null   object
dtypes: object(2)
memory usage: 88.0+ KB


## Text Cleaning

In [28]:
import re

In [29]:
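def clean_text(text):
    # Lowercase, normalize a few characters, and drop anything outside a small whitelist.
    text = text.lower()
    text = text.replace("_", "-").replace("…", "...")
    # Note: "|" is not in the whitelist, so this also strips the "|||||" article separator.
    text = re.sub(r"[^a-z0-9\s\.,!?':;\"-]", "", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text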
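
Because the whitelist in `clean_text` does not include `|`, cleaning a full document removes the `|||||` separator, so a later split on that separator finds nothing to split. The sketch below reproduces the effect on a made-up string and shows one possible alternative: split on the raw separator first, then clean each article. `split_then_clean` is a hypothetical helper, not part of the notebook.

```python
import re

SEPARATOR = "|||||"


def clean_text(text):
    # Same cleaning rules as the notebook's clean_text.
    text = text.lower()
    text = text.replace("_", "-").replace("…", "...")
    text = re.sub(r"[^a-z0-9\s\.,!?':;\"-]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def split_then_clean(document):
    # Hypothetical helper: split on the raw separator, then clean each article.
    parts = (clean_text(part) for part in document.split(SEPARATOR))
    return [part for part in parts if part]  # drop empty fragments


raw = "First Article. \n \n ||||| Second Article!"  # made-up example
print(clean_text(raw))        # the separator is gone
print(split_then_clean(raw))  # two cleaned articles
```

Whether per-article splitting helps depends on how the articles are consumed downstream, so treat this as an option rather than a required change.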

### Training Set

In [31]:
news_train_df["document"] = news_train_df["document"].apply(clean_text)
news_train_df["summary"] = news_train_df["summary"].apply(clean_text)

In [32]:
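# Split on the "|||||" separator. Because clean_text has already removed "|",
# each row ends up as a one-element list holding the whole document.
news_train_df["articles"] = news_train_df["document"].str.split(r"\s*\|\|\|\|\|\s*")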

In [33]:
news_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44972 entries, 0 to 44971
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   document  44972 non-null  object
 1   summary   44972 non-null  object
 2   articles  44972 non-null  object
dtypes: object(3)
memory usage: 1.0+ MB


### Validation Set

In [35]:
news_val_df["document"] = news_val_df["document"].apply(clean_text)
news_val_df["summary"] = news_val_df["summary"].apply(clean_text)

In [36]:
news_val_df["articles"] = news_val_df["document"].str.split(r"\s*\|\|\|\|\|\s*")

In [37]:
news_val_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5622 entries, 0 to 5621
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   document  5622 non-null   object
 1   summary   5622 non-null   object
 2   articles  5622 non-null   object
dtypes: object(3)
memory usage: 131.9+ KB


## Tokenization

In [43]:
from transformers import AutoTokenizer

In [60]:
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

### Tokenizing Training Set

In [63]:
# Encode every article in the row, padded or truncated to 512 tokens.
news_train_df["tokenized_article"] = news_train_df["articles"].apply(
    lambda x: tokenizer.batch_encode_plus(x, truncation=True, padding="max_length", max_length=512)
)

In [65]:
# Encode the reference summary, padded or truncated to 128 tokens.
news_train_df["target"] = news_train_df["summary"].apply(
    lambda x: tokenizer(x, truncation=True, padding="max_length", max_length=128)
)

In [67]:
news_train_df["tokenized_article"][0]

{'input_ids': [[0, 11535, 23697, 4420, 6, 63, 14, 86, 456, 6, 5450, 4, 63, 5, 78, 6664, 21746, 9, 5, 353, 6, 77, 13, 65, 655, 12, 2527, 12, 428, 24062, 1151, 5, 3168, 9, 2204, 2014, 6, 14784, 1054, 8, 1049, 2014, 32, 70, 14485, 15, 65, 631, 35, 1315, 4, 10, 2310, 2935, 15, 5, 1717, 4, 29, 4, 4042, 1068, 13, 10408, 16705, 2323, 5, 22893, 23, 290, 35, 541, 10, 4, 119, 4, 92, 1423, 9657, 86, 1839, 65, 9, 5, 144, 505, 40617, 15, 141, 5, 866, 24779, 148, 5, 986, 353, 4, 2113, 32, 13, 23041, 6, 151, 92, 1315, 7, 28, 1412, 6, 309, 7, 9019, 13829, 30, 38474, 1236, 6909, 340, 605, 7948, 6, 1118, 7, 30398, 6, 151, 1315, 355, 11, 10668, 428, 48540, 4, 5, 5755, 731, 16, 421, 7, 946, 5204, 23, 290, 4, 246, 4, 259, 23, 210, 13825, 1368, 1343, 6, 157, 28, 1839, 3195, 9765, 137, 8, 71, 5, 414, 20238, 5, 22893, 4, 619, 481, 7, 9832, 12, 179, 2512, 6, 1241, 5, 1450, 2810, 4, 8, 150, 47, 241, 259, 6, 596, 33976, 47, 1203, 62, 7, 1407, 201, 15, 7409, 4, 2254, 5, 311, 4, 6334, 2468, 124, 8104, 15, 5947, 94

In [69]:
news_train_df["target"][0]

{'input_ids': [0, 627, 5755, 731, 1882, 7, 290, 4, 176, 94, 353, 6, 53, 5, 866, 129, 355, 5962, 6, 151, 1315, 6, 77, 23041, 6, 151, 92, 1315, 56, 57, 6126, 6, 309, 7, 452, 18, 1315, 266, 4, 4289, 15, 5, 2204, 2014, 8812, 18, 210, 13825, 5059, 21, 14975, 35, 22, 14869, 895, 16506, 1099, 346, 72, 5, 5755, 731, 6, 959, 6, 16, 357, 340, 131, 24, 56, 57, 421, 7, 946, 5204, 23, 290, 4, 246, 4, 53, 5, 6256, 2775, 14, 5, 10645, 16, 2260, 528, 7, 55, 38187, 1253, 1311, 62, 15, 1818, 4042, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
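
The trailing run of `1`s in `input_ids` and `0`s in `attention_mask` above comes from `padding="max_length"`: short sequences are filled with the pad id and the mask marks which positions are real tokens. The sketch below mimics that behavior on plain lists. `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation, and it simplifies truncation to a plain slice (the real tokenizer also keeps the end-of-sequence token).

```python
def pad_or_truncate(token_ids, max_length, pad_token_id=1):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation (simplified)
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_token_id] * n_pad        # right padding with the pad id
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding positions
    return input_ids, attention_mask


# Toy sequence; pad id 1 matches the padding seen in the output above.
ids, mask = pad_or_truncate([0, 627, 5755, 2], max_length=6)
print(ids)
print(mask)
```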

In [71]:
news_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44972 entries, 0 to 44971
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   document           44972 non-null  object
 1   summary            44972 non-null  object
 2   articles           44972 non-null  object
 3   tokenized_article  44972 non-null  object
 4   target             44972 non-null  object
dtypes: object(5)
memory usage: 1.7+ MB


### Tokenizing Validation Set

In [74]:
# Encode every article in the row, padded or truncated to 512 tokens.
news_val_df["tokenized_article"] = news_val_df["articles"].apply(
    lambda x: tokenizer.batch_encode_plus(x, truncation=True, padding="max_length", max_length=512)
)

In [77]:
# Encode the reference summary, padded or truncated to 128 tokens.
news_val_df["target"] = news_val_df["summary"].apply(
    lambda x: tokenizer(x, truncation=True, padding="max_length", max_length=128)
)

# Model Training - BART

In [None]:
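from transformers import AutoTokenizer, BartForConditionalGeneration

In [None]:
import textwrap

In [None]:
def text_summarizer_before_ft(inputs, model_name="facebook/bart-large-cnn"):
    # Baseline: summarize with the pretrained checkpoint before any fine-tuning.
    # Uses the PyTorch model class, like the fine-tuning cells below.
    model = BartForConditionalGeneration.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # The stored encodings are plain Python lists; convert them to tensors first.
    input_ids = torch.tensor(inputs["input_ids"])
    attention_mask = torch.tensor(inputs["attention_mask"])
    with torch.no_grad():
        summary_ids = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=150,
            min_length=50,
            length_penalty=2.0,
            num_beams=4,
            early_stopping=True,
        )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    # Wrap to 80 columns for readable notebook output.
    return "\n".join(textwrap.wrap(summary, width=80))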

In [None]:
text_summarizer_before_ft(news_train_df["tokenized_article"][5])

## Fine-Tuning

### Getting only the tokenized data

In [80]:
tokenize_train_df = news_train_df[["tokenized_article","target"]]

In [82]:
tokenize_val_df = news_val_df[["tokenized_article","target"]]

In [84]:
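def flatten_input(data):
    # A batch encoding stores one list per article; keep only the first one.
    # Already-flat inputs (the summary encodings) are returned unchanged.
    if isinstance(data, list) and len(data) > 0 and isinstance(data[0], list):
        return data[0]
    return data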
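
To make the behavior of `flatten_input` concrete, the sketch below runs it on toy token ids (made up for illustration). A nested list, as produced by the batch encoding of articles, is reduced to its first inner list, so any further articles in the row are discarded. A flat list, as produced for a summary, passes through untouched.

```python
def flatten_input(data):
    # Keep only the first inner list of a nested list; return flat input as-is.
    if isinstance(data, list) and len(data) > 0 and isinstance(data[0], list):
        return data[0]
    return data


article_ids = [[0, 11, 12, 2], [0, 21, 22, 2]]  # two encoded articles (toy ids)
summary_ids = [0, 31, 32, 2]                    # one encoded summary (toy ids)

print(flatten_input(article_ids))  # first article only
print(flatten_input(summary_ids))  # unchanged
```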

In [86]:
tokenized_train_df = pd.DataFrame({
    'art_ip_id': tokenize_train_df['tokenized_article'].apply(lambda x: x['input_ids']),
    'art_attn_mask': tokenize_train_df['tokenized_article'].apply(lambda x: x['attention_mask']),
    'targ_ip_id': tokenize_train_df['target'].apply(lambda x: x['input_ids']),
    'targ_attn_mask': tokenize_train_df['target'].apply(lambda x: x['attention_mask'])
})

In [88]:
tokenized_val_df = pd.DataFrame({
    'art_ip_id': tokenize_val_df['tokenized_article'].apply(lambda x: x['input_ids']),
    'art_attn_mask': tokenize_val_df['tokenized_article'].apply(lambda x: x['attention_mask']),
    'targ_ip_id': tokenize_val_df['target'].apply(lambda x: x['input_ids']),
    'targ_attn_mask': tokenize_val_df['target'].apply(lambda x: x['attention_mask'])
})

In [90]:
tokenized_train_df['art_ip_id'] = tokenized_train_df['art_ip_id'].apply(flatten_input)
tokenized_train_df['art_attn_mask'] = tokenized_train_df['art_attn_mask'].apply(flatten_input)
tokenized_train_df['targ_ip_id'] = tokenized_train_df['targ_ip_id'].apply(flatten_input)
tokenized_train_df['targ_attn_mask'] = tokenized_train_df['targ_attn_mask'].apply(flatten_input)

In [92]:
tokenized_val_df['art_ip_id'] = tokenized_val_df['art_ip_id'].apply(flatten_input)
tokenized_val_df['art_attn_mask'] = tokenized_val_df['art_attn_mask'].apply(flatten_input)
tokenized_val_df['targ_ip_id'] = tokenized_val_df['targ_ip_id'].apply(flatten_input)
tokenized_val_df['targ_attn_mask'] = tokenized_val_df['targ_attn_mask'].apply(flatten_input)

In [94]:
tokenized_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44972 entries, 0 to 44971
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   art_ip_id       44972 non-null  object
 1   art_attn_mask   44972 non-null  object
 2   targ_ip_id      44972 non-null  object
 3   targ_attn_mask  44972 non-null  object
dtypes: object(4)
memory usage: 1.4+ MB


In [96]:
tokenized_val_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5622 entries, 0 to 5621
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   art_ip_id       5622 non-null   object
 1   art_attn_mask   5622 non-null   object
 2   targ_ip_id      5622 non-null   object
 3   targ_attn_mask  5622 non-null   object
dtypes: object(4)
memory usage: 175.8+ KB


In [98]:
# Work with a fixed 1,000-row random sample of each split.
train_sampled = tokenized_train_df.sample(n=1000, random_state=42)
val_sampled = tokenized_val_df.sample(n=1000, random_state=42)

### Converting to a format accepted by the Hugging Face Trainer API for fine-tuning

In [101]:
from datasets import Dataset

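# Drop the target attention mask; only input_ids, attention_mask, and labels are kept for training.
train_hf = Dataset.from_pandas(train_sampled.drop("targ_attn_mask", axis=1))
val_hf = Dataset.from_pandas(val_sampled.drop("targ_attn_mask", axis=1))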

In [103]:
train_hf = train_hf.rename_columns({
    "art_ip_id": "input_ids",
    "art_attn_mask": "attention_mask",
    "targ_ip_id": "labels"
})
val_hf = val_hf.rename_columns({
    "art_ip_id": "input_ids",
    "art_attn_mask": "attention_mask",
    "targ_ip_id": "labels"
})

In [105]:
print(train_hf.column_names)
print(val_hf.column_names)

['input_ids', 'attention_mask', 'labels', '__index_level_0__']
['input_ids', 'attention_mask', 'labels', '__index_level_0__']


In [107]:
train_hf = train_hf.remove_columns(["__index_level_0__"])
val_hf = val_hf.remove_columns(["__index_level_0__"])
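
At this point the `labels` column holds summaries padded to a fixed length with the pad id (`1` in the tokenized output shown earlier). PyTorch's cross-entropy loss ignores positions labelled `-100` by default, and a common practice for Hugging Face seq2seq models is to replace padded label positions with `-100` so that padding does not contribute to the loss. The sketch below shows that replacement on a plain list. `mask_pad_labels` is a hypothetical helper, and whether it changes the results reported in this notebook has not been measured here.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_pad_labels(labels, pad_token_id):
    """Replace padding ids in a label sequence with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in labels]


# Toy label sequence padded with pad id 1.
toy_labels = [0, 627, 5755, 2, 1, 1, 1]
print(mask_pad_labels(toy_labels, pad_token_id=1))
```

Under these assumptions it could be applied with something like `train_hf.map(lambda row: {"labels": mask_pad_labels(row["labels"], tokenizer.pad_token_id)})`.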

### The fine-tuning process

In [110]:
from transformers import BartForConditionalGeneration, AutoTokenizer

model_name = "facebook/bart-base"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [112]:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding="longest")

In [116]:
import torch
import gc
from transformers import Trainer, TrainingArguments

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

model = model.to(device)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    fp16=True,              
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=50,
    load_best_model_at_end=True,  
    remove_unused_columns=False,
)

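trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_hf,
    eval_dataset=val_hf,
    data_collator=data_collator,  # seq2seq collator defined in the previous cell
    processing_class=tokenizer,
)

# Caution: every trainer.train() call runs all num_train_epochs, so this loop
# trains for 2 x 2 = 4 epochs in total (hence the two loss tables in the output).
for epoch in range(training_args.num_train_epochs):
    print(f"Epoch {epoch+1}/{training_args.num_train_epochs}")
    trainer.train()
    trainer.evaluate()
    # Free cached GPU memory and collect garbage between runs.
    torch.cuda.empty_cache()
    gc.collect()
    print(f"Cleared GPU memory after epoch {epoch+1}")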

Using device: cuda
Epoch 1/2


Epoch,Training Loss,Validation Loss
1,3.1749,2.852478
2,2.7605,2.815744


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].


Cleared GPU memory after epoch 1
Epoch 2/2


Epoch,Training Loss,Validation Loss
1,2.7731,2.818698
2,2.3763,2.814389


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].


Cleared GPU memory after epoch 2
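
The number of optimizer steps per epoch follows from the sample size, the per-device batch size, and `gradient_accumulation_steps`. The sketch below works through that arithmetic for the settings used above. It assumes a single GPU, and `optimizer_steps_per_epoch` is a hypothetical helper, not a Trainer API. The result is consistent with the `checkpoint-250` directory loaded later in the notebook.

```python
import math


def optimizer_steps_per_epoch(n_samples, per_device_batch_size, grad_accum_steps, n_devices=1):
    """Optimizer updates per epoch; n_devices=1 is an assumption."""
    effective_batch = per_device_batch_size * grad_accum_steps * n_devices
    return math.ceil(n_samples / effective_batch)


steps = optimizer_steps_per_epoch(n_samples=1000, per_device_batch_size=1, grad_accum_steps=4)
print(steps)      # optimizer steps in one epoch
print(steps * 2)  # optimizer steps for num_train_epochs=2
```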


## Getting Summaries from the Fine-Tuned Model and Computing Evaluation Metrics

In [118]:
bart_metrics = trainer.evaluate()

In [124]:
for i in bart_metrics.keys():
    print(i,':',bart_metrics[i])

eval_loss : 2.8143887519836426
eval_runtime : 138.6794
eval_samples_per_second : 7.211
eval_steps_per_second : 7.211
epoch : 2.0
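
Since `eval_loss` is a mean per-token cross-entropy in nats, it can be converted to perplexity with `exp(loss)`, which is often easier to read. The sketch below applies that formula to the value printed above. `perplexity` is a hypothetical helper, and because the labels here include padded positions, the figure is only a rough indicator.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity corresponding to a mean per-token cross-entropy in nats."""
    return math.exp(mean_cross_entropy)


bart_eval_loss = 2.8143887519836426  # eval_loss printed above
print(round(perplexity(bart_eval_loss), 2))
```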


In [57]:
from transformers import BartForConditionalGeneration, AutoTokenizer

In [59]:
# Load the checkpoint written at optimizer step 250.
model_name = "./results/checkpoint-250"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [93]:
def text_summarizer_after_ft(text):
    # Accepts a string or a one-element list of strings (as stored in the "articles" column).
    inputs = tokenizer(text, return_tensors="pt", max_length=768, truncation=True)
    # Inference only, so gradients are disabled.
    with torch.no_grad():
        summary_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=200,
            min_length=50,
            length_penalty=2.0,
            num_beams=4,
            early_stopping=True
        )
    # Decode the highest-scoring beam of the first sequence.
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

In [103]:
print(news_val_df['articles'][0])

['whether a sign of a good read; or a comment on the \'pulp\' nature of some genres of fiction, the oxfam second-hand book charts have remained in the da vinci code author\'s favour for the past four years. dan brown has topped oxfam\'s \'most donated\' list again, his fourth consecutive year. having sold more than 80 million copies of the da vinci code and had all four of his novels on the new york times bestseller list in the same week, it\'s hardly surprising that brown\'s hefty tomes are being donated to charity by readers keen to make some room on their shelves. another cult crime writer responsible to heavy-weight hardbacks, stieg larsson, is oxfam\'s \'most sold\' author for the second time in a row. both the \'most donated\' and \'most sold\' lists are dominated by crime fiction, trilogies and fantasy, with jk rowling the only female author listed in either of the top fives. click here or on "view gallery" to see both charts in pictures a woman reads a copy of the newly release

In [105]:
sample_pred = text_summarizer_after_ft(news_val_df['articles'][0])

In [106]:
print(sample_pred)

a new york times bestseller has surpassed the previous record of 80 million sold. dan brown's "the lost symbol" broke its previous one-day sales record for adult fiction, reports the guardian. the book sold over one million hardcover copies across the united states, canada, and the united kingdom after it was released on tuesday, the bbc reports. brown's latest novel, the da vinci code, is set to be made into a film starring tom hanks that grossed more than 758 million, the guardian reports. "we are seeing historic, record-breaking sales across all types of our accounts


In [109]:
print(news_val_df["summary"][0])

the da vinci code has sold so many copiesthat would be at least 80 millionthat it's bound to turn up in book donation piles. but at one charity shop in the uk, it's been donated so heavily that the shop has posted a sign propped up on a tower of da vinci code copies that reads: "you could give us another da vinci code... but we would rather have your vinyl!" the manager of the oxfam shop in swansea tells the telegraph that people are laughing and taking pictures of the sizable display: "i would say that we get one copy of the book every day." he says people buy them "occasionally," but with vinyl sales up 25 in the past year, they'd rather take records. dan brown's book isn't the only one that shops like oxfam struggle to re-sell. last year, oxfam was hit with a large and steady supply of fifty shades of grey, and it similarly begged donors: "pleaseno more." but brown has a particular kind of staying power. the da vinci code was published in 2003, and within six years brown had booted 

In [188]:
# Note: this 100-row "test" sample is drawn from the validation dataframe, not the test split.
test_hf = Dataset.from_pandas(news_val_df[["articles", "summary"]].sample(n=100, random_state=42))

In [190]:
test_hf.column_names

['articles', 'summary', '__index_level_0__']
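
The only metric reported so far is the evaluation loss. Summaries are more commonly compared with overlap metrics such as ROUGE, so the sketch below shows a deliberately simplified unigram-overlap F1 as a stand-in. It uses whitespace tokenization with no stemming, so its scores will not match an official ROUGE implementation. `unigram_f1` is a hypothetical helper.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """F1 over clipped unigram overlap between two whitespace-tokenized strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token's count to the smaller of the two.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up sentences for illustration.
print(unigram_f1("the cat sat", "the cat sat on the mat"))
```

On the sampled rows, one score per pair could be obtained by calling `unigram_f1(text_summarizer_after_ft(row["articles"]), row["summary"])` for each `row` in `test_hf`.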

# Model Training - T5

In [51]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [53]:
def preprocess_function(examples):
    # T5 expects a task prefix; each row's article list is joined into one string.
    inputs = ["summarize: " + " ".join(doc) for doc in examples["articles"]]
    # Encode sources (512 tokens) and target summaries (150 tokens) at fixed length.
    model_inputs = tokenizer(
        inputs, max_length=512, padding="max_length", truncation=True, return_tensors="pt"
    )
    labels = tokenizer(
        examples["summary"], max_length=150, padding="max_length", truncation=True, return_tensors="pt"
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
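
The prefix-and-join step in `preprocess_function` can be isolated into a small function, which makes it easy to build inference inputs the same way as training inputs. The sketch below is one way to do that. `build_t5_input` is a hypothetical helper, not part of the notebook, and the example strings are made up.

```python
T5_PREFIX = "summarize: "  # task prefix used in preprocess_function


def build_t5_input(articles):
    """Join a list of article strings (or a single string) behind the T5 task prefix."""
    if isinstance(articles, str):
        articles = [articles]
    return T5_PREFIX + " ".join(articles)


print(build_t5_input(["first article.", "second article."]))
print(build_t5_input("a single article."))
```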

In [55]:
train_df_t5 = news_train_df[["articles","summary"]]
val_df_t5 = news_val_df[["articles","summary"]]

In [59]:
from datasets import Dataset
train_hf_t5 = Dataset.from_pandas(train_df_t5)
val_hf_t5 = Dataset.from_pandas(val_df_t5)

In [61]:
train_hf_t5 = train_hf_t5.map(preprocess_function, batched=True)
val_hf_t5 = val_hf_t5.map(preprocess_function, batched=True)


In [63]:
train_hf_t5.column_names

['articles', 'summary', 'input_ids', 'attention_mask', 'labels']

In [65]:
train_hf_t5_2 = train_hf_t5.remove_columns(['articles','summary'])

In [67]:
val_hf_t5_2 = val_hf_t5.remove_columns(['articles','summary'])

In [71]:
import torch
import gc
from transformers import Trainer, TrainingArguments, TrainerCallback

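class ClearMemoryCallback(TrainerCallback):
    # Frees cached GPU memory at the end of every epoch.
    def on_epoch_end(self, args, state, control, **kwargs):
        # state.epoch already counts the epoch that just finished, so the "+ 1"
        # makes the message read one epoch ahead (2.0 and 3.0 for a 2-epoch run).
        print(f"Clearing GPU memory after epoch {state.epoch + 1}")
        torch.cuda.empty_cache()
        gc.collect()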

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

training_args = TrainingArguments(
    output_dir="./t5-finetuned-multinews",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=1, 
    per_device_eval_batch_size=1,
    num_train_epochs=2,
    weight_decay=0.01,
    save_total_limit=2,
    logging_dir='./logs',
    logging_steps=100,
    fp16=True,  
    gradient_accumulation_steps=4, 
    report_to="none",  
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_hf_t5_2.shuffle(seed=42).select(range(1000)),
    eval_dataset=val_hf_t5_2.shuffle(seed=42).select(range(1000)),
    processing_class=tokenizer,
    callbacks=[ClearMemoryCallback()],
)

trainer.train()


Using device: cuda


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,3.369,3.032915
2,3.2028,2.987684


Clearing GPU memory after epoch 2.0
Clearing GPU memory after epoch 3.0


TrainOutput(global_step=500, training_loss=3.32998193359375, metrics={'train_runtime': 700.6964, 'train_samples_per_second': 2.854, 'train_steps_per_second': 0.714, 'total_flos': 270683602944000.0, 'train_loss': 3.32998193359375, 'epoch': 2.0})

In [75]:
t5_metrics = trainer.evaluate()

In [79]:
for i in t5_metrics.keys():
    print(i,':',t5_metrics[i])

eval_loss : 2.9876840114593506
eval_runtime : 84.9128
eval_samples_per_second : 11.777
eval_steps_per_second : 11.777
epoch : 2.0


In [113]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [123]:
model_name = "./t5-finetuned-multinews/checkpoint-500"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

In [135]:
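def text_summarizer_after_ft_t5(text):
    # Note: training inputs were prefixed with "summarize: " (see preprocess_function);
    # this call passes the text to the tokenizer without that prefix.
    inputs = tokenizer(text, return_tensors="pt", max_length=768, truncation=True)
    # Inference only, so gradients are disabled.
    with torch.no_grad():
        summary_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=300,
            min_length=50,
            length_penalty=2.0,
            num_beams=4,
            early_stopping=True
        )
    # Decode the highest-scoring beam of the first sequence.
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary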

In [137]:
sample_pred_t5 = text_summarizer_after_ft_t5(news_val_df['articles'][1])

In [138]:
print(sample_pred_t5)

, but the senate is still working to process more than 58,000 student aid claims. a senate bill that would prohibit schools from penalizing student veterans for late va assistance payments has been delayed or gone awol this fall for thousands of veterans and their families relying on the gi bill to get through school. the senate is expected to take this up soon, but the delay is inexcusable. a good start would be swift passage of a senate bill that would allow students with accounts in arrears due to the confluence of outdated systems to perform ever complex tasks. the va has lacked a permanent chief information officer since january, but the senate has yet to confirm that the senate is trying to fix the problem. a senate bill that would prohibit schools from penalizing student veterans for late va assistance payments, but the u.s. senate is expected to take this up soon, but the senate has not yet confirmed a permanent chief information officer, and the senate is a senate bill that wo
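
The sample above repeats several phrases (for example "a senate bill that would prohibit schools from penalizing student veterans"). A quick way to quantify this is the share of n-grams that occur more than once, sketched below. `repeated_ngram_ratio` is a hypothetical diagnostic, not a standard metric. To discourage repetition at generation time, `model.generate` accepts arguments such as `no_repeat_ngram_size` and `repetition_penalty`. Whether they improve these summaries has not been tested here.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=3):
    """Fraction of n-gram occurrences that belong to an n-gram seen more than once."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


# Made-up strings for illustration.
print(repeated_ngram_ratio("a b c a b c"))
print(repeated_ngram_ratio("a b c d"))
```

Calling `repeated_ngram_ratio(sample_pred_t5)` would give a rough repetition score for the generated summary.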

In [139]:
print(news_val_df['articles'][1])

['the deaths of three american soldiers in afghanistan this week are a tragic reminder of why its so important for the nation to keep its promises to military men and women, veterans, no matter where they served, have volunteered to put their lives on the line and some make the ultimate sacrifice. those courageous enough to go into battle should face zero delays in getting the education benefits theyve earned. unfortunately, financial aid has been late or gone awol this fall for thousands of veterans and their families relying on the gi bill to get through school. the reason: a big information-technology glitch that surfaced with what seemed like a relatively minor change in how the aid is calculated. while the u.s. department of veterans affairs va is trying to fix this, a lack of communication by the agency about the delays, coupled with inaction by the u.s. senate, has left service members-turned-students facing severe financial hardships. this week, the agency was still working to 

In [140]:
print(news_val_df['summary'][1])

a major snafu has hit benefit payments to student veterans under the gi billand congressional aides tells nbc that they have been told the veterans are never going to be paid back. the aides say they were told by the department of veterans affairs that the va will not be making retroactive payments to veterans who were underpaid for their housing allowance because it would mean reviewing around 2 millions claims, further delaying implementation of a new system, which has already been pushed back to dec. 2019. under the forever gi bill signed into law by president trump last year, students are supposed to be paid housing allowance based on where they take the most classes, not on where the school's main campus is located. tanya ang, vice president of veterans education success, tells the military times that the va's excuse of retroactive payments creating too much work isn't good enough. "that could be hundreds of dollars for some studentsper month," she says. "if this was a disability 