# Create Titles for News Articles

In this Jupyter notebook we fine-tune a pre-trained Large Language Model, Flan-T5 (an instruction-tuned version of T5, the Text-To-Text Transfer Transformer), to generate titles for news articles. We use Flan-T5 rather than, for example, BERT because Flan-T5 is an encoder-decoder model built for generating text, whereas BERT is an encoder-only model that is less suited to generation. Flan-T5 is also an improvement on the original T5.

As input we use the Dutch-news-articles dataset from Kaggle, which contains 255,524 news articles with their titles.

## Initialisation

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict, load_from_disk

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"

## Read the Data
The data set is the Dutch-news-articles dataset that can be found on Kaggle (https://www.kaggle.com/datasets/maxscheijen/dutch-news-articles).

In [3]:
# df_raw = pd.read_csv('/data/bb8-storage/LLM-project/datasets/dutch-news-articles.csv')
# df_raw['datetime'] = pd.to_datetime(df_raw['datetime'])
# df_raw.to_pickle('/data/bb8-storage/LLM-project/datasets/dutch-news-articles.pkl')
df_raw = pd.read_pickle('/data/bb8-storage/LLM-project/datasets/dutch-news-articles.pkl')

Now we split the data into 75% training cases and 25% test cases. We also convert the data from a pandas DataFrame to a `Dataset` object, the format expected by the training tools used below.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(df_raw['content'], df_raw['title'], test_size=0.25, random_state=1234)
train_dataset = Dataset.from_pandas(pd.DataFrame({'article': X_train, 'title': y_train}))
test_dataset = Dataset.from_pandas(pd.DataFrame({'article': X_test, 'title': y_test}))

datasets = DatasetDict({
    'train': train_dataset,
    'test': test_dataset
})

Now we tokenize the data.

In [5]:
# load the tokenizer of T5:
tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-small')

# tokenizer-function:
def tokenize_function_t5(examples):
    input_encodings = tokenizer(examples['article'], padding='max_length', truncation=True, max_length=512)
    target_encodings = tokenizer(examples['title'], padding='max_length', truncation=True, max_length=32)
    labels = target_encodings['input_ids']
    input_encodings['labels'] = labels
    return input_encodings

# apply the tokenizer-function. Since this costs some resources, we only do this once and save the data.
# Next time, we just load the tokenized data from disk.
# tokenized_datasets = datasets.map(tokenize_function_t5, batched=True)
# tokenized_datasets.save_to_disk("/data/bb8-storage/LLM-project/datasets/tokenized_datasets-flanT5")
tokenized_datasets = load_from_disk("/data/bb8-storage/LLM-project/datasets/tokenized_datasets-flanT5")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
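
Because the titles are padded to `max_length=32`, the labels produced by `tokenize_function_t5` contain padding ids, and as written those positions count toward the training loss. A common refinement, not part of the original notebook, is to replace padding ids in the labels with -100, the value that PyTorch's cross-entropy loss ignores by default. A minimal sketch, in which `mask_padding` and the toy token ids are hypothetical:

```python
# Hypothetical helper, not part of the original notebook.
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in seq]
        for seq in label_ids
    ]


# Toy batch: two padded label sequences, assuming pad id 0 and end-of-sequence id 1.
toy_labels = [[71, 204, 1, 0, 0], [9, 1, 0, 0, 0]]
masked_labels = mask_padding(toy_labels, pad_token_id=0)
print(masked_labels)
```

Inside `tokenize_function_t5` this would amount to `labels = mask_padding(target_encodings['input_ids'], tokenizer.pad_token_id)`. The tokenized dataset saved on disk would have to be regenerated for the change to take effect.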


## Set up the LLM

Below we load the Flan-T5-small model and set the training configuration.

In [6]:
model = T5ForConditionalGeneration.from_pretrained('google/flan-t5-small')
model = model.to(device)

training_args = Seq2SeqTrainingArguments(
    output_dir='/data/bb8-storage/LLM-project/datasets/results',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16, # looks good for GPU: RTX 2080
    per_device_eval_batch_size=16, # looks good for GPU: RTX 2080
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16 = True # added since we are using the RTX 2080
)
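
The `fp16=True` setting speeds up training, but T5-family checkpoints are often reported to overflow in half precision, and the training log further down shows a loss of 0.0, which is a commonly reported symptom. Whether fp16 is the cause in this run is an assumption; the notebook does not confirm it. One possible policy, sketched here with a hypothetical `choose_precision` helper, is to use bf16 where the hardware supports it and to fall back to full precision otherwise:

```python
def choose_precision(cuda_available, bf16_supported):
    """Return mixed-precision keyword arguments for Seq2SeqTrainingArguments.

    Hypothetical policy: use bf16 where the hardware supports it, otherwise
    stay in full fp32 precision instead of fp16.
    """
    use_bf16 = bool(cuda_available and bf16_supported)
    return {"bf16": use_bf16, "fp16": False}


# In the notebook the two flags could come from torch:
#   cuda_available = torch.cuda.is_available()
#   bf16_supported = cuda_available and torch.cuda.is_bf16_supported()
precision_kwargs = choose_precision(cuda_available=True, bf16_supported=False)
print(precision_kwargs)
```

The resulting dictionary could be passed as `Seq2SeqTrainingArguments(..., **precision_kwargs)` in place of the explicit `fp16 = True`. Full precision is slower and uses more memory, so the batch size of 16 may need to be reduced.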





Next we set up the data collator and the trainer.

In [7]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
    data_collator=data_collator
)


Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## Train the LLM
Now the training starts!

In [14]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.0,
2,0.0,
3,0.0,


TrainOutput(global_step=35934, training_loss=0.0, metrics={'train_runtime': 7201.6076, 'train_samples_per_second': 79.833, 'train_steps_per_second': 4.99, 'total_flos': 1.0687384197896602e+17, 'train_loss': 0.0, 'epoch': 3.0})
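
The `global_step=35934` in this output can be reproduced from figures stated elsewhere in the notebook, which is a quick way to confirm that the full training split was processed. A worked sketch, assuming a single GPU and no gradient accumulation (neither is stated explicitly):

```python
import math

n_articles = 255_524      # dataset size stated in the introduction
test_size = 0.25          # share passed to train_test_split
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs
train_runtime = 7201.6076 # seconds, from the TrainOutput metrics

# train_test_split rounds the test share up and gives the remainder to training.
n_test = math.ceil(test_size * n_articles)
n_train = n_articles - n_test

# Assumption: one GPU and no gradient accumulation, so one step is one batch.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

# Throughput implied by the runtime, to compare with train_samples_per_second.
samples_per_second = n_train * epochs / train_runtime

print(n_train, n_test)               # 191643 63881
print(steps_per_epoch, total_steps)  # 11978 35934
print(round(samples_per_second, 3))
```

Both the step count and the implied throughput agree with the reported metrics, so the run covered all training examples in each epoch. The loss of 0.0 therefore does not stem from a truncated dataset.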

Training consumes considerable resources, so we save the model and tokenizer once and reload them from disk in later runs:

In [9]:
# trainer.save_model("/data/bb8-storage/LLM-project/datasets/model-flan")
# tokenizer.save_pretrained("/data/bb8-storage/LLM-project/datasets/tokenizer-flan")
model = T5ForConditionalGeneration.from_pretrained("/data/bb8-storage/LLM-project/datasets/model-flan")
tokenizer = T5Tokenizer.from_pretrained("/data/bb8-storage/LLM-project/datasets/tokenizer-flan")
model = model.to(device)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Evaluation

In [10]:
for index in range(3):
    artikel = datasets['test'][index]['article']
    titel = datasets['test'][index]['title']

    inputs = tokenizer(artikel, return_tensors='pt', truncation=True, padding='max_length', max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    outputs = model.generate(**inputs, max_length=32, num_beams=5, early_stopping=True)
    generated_title = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    print(index, artikel)
    print(index, titel)
    print(index, generated_title)

0 Hij had er niks mee te maken. Dat was de belangrijkste boodschap van Willem Holleeder tijdens de eerste fase van zijn proces. Vijf zittingsdagen lang was hij aan het woord. Vanaf volgende week maandag staan de verhoren van zijn zussen gepland. Holleeder zegt dat hem allerlei zaken in de schoenen worden geschoven. En daarbij is zijn zus Astrid - die volgens hem uit is op geld - de kwade genius. Volgens hem heeft zij de boel aan elkaar gelogen en de andere vrouwen tegen hem opgezet, met het doel om hem achter de tralies te krijgen. Ze vertelde "vieze, smerige leugens", zegt Holleeder. In dit eerste deel van het proces werd Holleeder ondervraagd aan de hand van zijn eigen handgeschreven verklaring van 127 bladzijden. Hij blijkt het dossier heel goed te kennen. Hij weet precies wat er staat, vult dat aan en gebruikt het om zijn punt te maken. Af en toe zijn er kritische vragen van de officieren van justitie, die proberen hem op tegenstrijdigheden te betrappen. Holleeder reageert dan stee