## Machine Learning for Topic News Title Summarization 

- In the era of digital information, the volume of news content available to readers has grown exponentially, making it increasingly challenging for individuals to stay informed without becoming overwhelmed. The project's primary goal is to leverage machine learning (ML) techniques for the effective summarization of news articles, aiming to improve the efficiency, accuracy, and readability of these summaries, allowing readers to grasp the essence of news stories without dedicating extensive time to reading full articles.

- The stakeholders of this project include individual readers, news organizations, the education sector, and potentially government bodies that rely on swift and accurate information dissemination. Improved news summarization models can transform media consumption by providing accessible, succinct summaries of complex news stories, thereby enhancing public knowledge and engagement. More broadly, better news summarization techniques could pave the way for similar advances in summarizing other kinds of text, such as academic literature, legal documents, and social media feeds.

- Potential models we will explore:
    - BERT summarization
    - Fine-tuned T5-small
    - GPT-3.5 Turbo

We will fine-tune the T5-small model. The overview page of the Hugging Face T5 documentation offers the following tips:

- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format.
- T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g., for translation: translate English to German: …, for summarization: summarize: ….
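The prefix mechanism described in the tips can be sketched in a few lines. This is a minimal illustration, not part of the notebook: `add_task_prefix` is a hypothetical helper name, and the example sentence is made up.

```python
def add_task_prefix(documents, prefix="summarize: "):
    """Prepend a T5-style task prefix to every input document."""
    return [prefix + doc for doc in documents]


# Made-up example input, for illustration only
docs = ["The city council approved the new budget on Monday."]

# Same text, two different tasks selected purely by the prefix
print(add_task_prefix(docs))
print(add_task_prefix(docs, prefix="translate English to German: "))
```

The model sees the prefix as ordinary input text, so a custom prefix (as used later in this notebook) only becomes meaningful through fine-tuning.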

In [17]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
from transformers import AutoTokenizer, EarlyStoppingCallback
from transformers import EvalPrediction
# Metrics are loaded through `evaluate` below (datasets.load_metric is deprecated)
from datasets import Dataset, DatasetDict
from bert_score import score
import pandas as pd
import evaluate
import numpy as np
from tqdm import tqdm
import torch
import re
import os

rouge = evaluate.load('rouge')

torch.manual_seed(12345)
np.random.seed(12345)

In [8]:
# check if gpu is available
device = 'cpu' 
if torch.backends.mps.is_available():
    device = 'mps'
if torch.cuda.is_available():
    device = 'cuda'
print(f"Using '{device}' device")

Using 'mps' device


#### 1. Load the Tokenizer and Pre-trained Model

In [9]:
model_name = 't5-small'

# Load the tokenizer; model_max_length sets the tokenizer's length limit to 1024 tokens
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)

# Load the pre-trained model from the Hugging Face Model Hub and move it to the selected device
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

In [32]:
## Let's see how many parameters we are going to be changing
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(model)

trainable params: 60506624 || all params: 60506624 || trainable%: 100.0
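As a rough sanity check on what full fine-tuning of 60,506,624 parameters implies, the size of the weights alone can be estimated. This sketch assumes 4 bytes per parameter (fp32) and ignores gradients, optimizer state, and activations; `weights_size_mib` is a hypothetical helper name.

```python
def weights_size_mib(num_params, bytes_per_param=4):
    """Size of the raw weights in MiB (assumption: fp32 uses 4 bytes per parameter)."""
    return num_params * bytes_per_param / 2**20


n_params = 60_506_624  # parameter count printed above
print(f"fp32 weights: {weights_size_mib(n_params):.1f} MiB")
```

Training needs several times this amount, since gradients and optimizer state are stored per trainable parameter as well.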


#### 2. Load the data into datasets

In [4]:
train_df = pd.read_csv('data/train_set.csv', sep=',')
test_df = pd.read_csv('data/test_set.csv', sep=',')
dev_df = pd.read_csv('data/dev_set.csv', sep=',')

train_df['content'] = train_df['content'].apply(lambda x: re.sub(r'[\r\n]+', ' ', x))
test_df['content'] = test_df['content'].apply(lambda x: re.sub(r'[\r\n]+', ' ', x))
dev_df['content'] = dev_df['content'].apply(lambda x: re.sub(r'[\r\n]+', ' ', x))

labels = ['title']
target_col = ['data_id', 'content', 'title']    

train_ds = Dataset.from_pandas(train_df[target_col])    
dev_ds = Dataset.from_pandas(dev_df[target_col])
test_ds = Dataset.from_pandas(test_df[target_col].iloc[:1200])  

dataset_dict = DatasetDict({    
    'train': train_ds,
    'dev': dev_ds,
    'test': test_ds
})

In [5]:
def preprocess_function(examples):
    # Prepends the string "Generate title: " to each document in the 'content' field of the input examples.
    # This prefix tells the T5 model which task to perform, which in this case is title generation.
    inputs = ["Generate title: " + doc for doc in examples["content"]]

    # Tokenizes the prefixed input texts to convert them into a format that can be fed into the T5 model.
    # Sets a maximum token length of 1024, truncates longer texts, and pads to the longest item in the batch.
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True, return_tensors="pt", padding='longest')

    # Tokenizes the 'title' field of the input examples to prepare the target labels.
    # Sets a maximum token length of 32, and truncates any title longer than this limit.
    labels = tokenizer(text_target=examples["title"], max_length=32, truncation=True, return_tensors="pt", padding='longest')

    # Assigns the tokenized labels to the 'labels' field of model_inputs.
    # The 'labels' field is used during training to calculate the loss and guide model learning.
    model_inputs["labels"] = labels["input_ids"]

    # Returns the prepared inputs and labels as a single dictionary, ready for training.
    return model_inputs
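The preprocessing function pads the labels with the tokenizer's pad token (id 0 for T5). A common convention is to replace those padded positions with -100, the default `ignore_index` of PyTorch's cross-entropy loss, so that padding does not contribute to the loss. The sketch below shows that masking step on plain Python lists. `mask_label_padding` is a hypothetical helper and the token ids are made up; whether the original training run applied such masking is not stated in the notebook.

```python
def mask_label_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding ids in each label sequence with the loss ignore index."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]


# Made-up token ids; 1 is T5's end-of-sequence id and 0 is its pad id
padded_labels = [[8774, 296, 1, 0, 0], [71, 1, 0, 0, 0]]
print(mask_label_padding(padded_labels))
```

This is the inverse of the step in `compute_metrics` that maps -100 back to the pad token id before decoding.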

In [10]:
tokenized_datasets = dataset_dict.map(preprocess_function, batched=True)

In [11]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['data_id', 'content', 'title', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8400
    })
    dev: Dataset({
        features: ['data_id', 'content', 'title', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 600
    })
    test: Dataset({
        features: ['data_id', 'content', 'title', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1200
    })
})

In [37]:
example = tokenized_datasets['train'][0]

# We can use the tokenizer to reverse the ID mapping to see what the text was
print(tokenizer.decode(example['input_ids']))
print(tokenizer.decode(example['labels']))

Generate title: (Natural News) Pancreatic cancer is one of the most daunting varieties of the disease, with very few people surviving beyond five years after their diagnosis. With few symptoms, most patients aren’t diagnosed until they’ve already reached the metastatic stage. It’s the fourth leading cause of cancer deaths and the 12th most common type of cancer worldwide, and an effective treatment is desperately needed. Now, a new discovery could bring hope to future patients. Scientists at Tel Aviv University have developed a new treatment that could destroy pancreatic cancer cells. Their treatment involves a small molecule known as PJ34, which was originally developed to help stroke victims. They discovered that when it is injected, it causes human cancer cells to destroy themselves during cell division, or mitosis. They conducted their study using transplanted human pancreatic cancer in immunocompromised mice. They found that after two weeks of daily injection with the molecule, th

#### 3. Set up the model and training steps

In [38]:
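training_args = Seq2SeqTrainingArguments(
    seed=12345,
    do_eval=True,
    output_dir="my_fine_tuned_t5_model",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    # weight_decay=0.01,
    num_train_epochs=1,
    eval_steps=210,                        # evaluate every 210 steps
    save_steps=420,                        # must be a round multiple of eval_steps when load_best_model_at_end=True
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
    predict_with_generate=True,            # generate text during evaluation so ROUGE is computed on decoded titles
    metric_for_best_model="rouge2",
    greater_is_better=True,
    report_to='wandb',
    logging_dir='./t5_logs',
    run_name="t5-small-title-summarizer",  # the model fine-tuned here is t5-small
    fp16=torch.cuda.is_available(),        # fp16 mixed precision is enabled only when a CUDA GPU is present
)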
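The step-based settings above interact with the dataset size. The following sketch works through the arithmetic for the 8,400 training examples shown earlier, assuming a single device and no gradient accumulation (neither is set explicitly in the notebook). `steps_per_epoch` is a hypothetical helper name.

```python
import math


def steps_per_epoch(num_examples, batch_size, grad_accum=1):
    """Number of optimizer steps in one pass over the data (last partial batch is kept)."""
    return math.ceil(num_examples / (batch_size * grad_accum))


total_steps = steps_per_epoch(8400, 8)                  # one epoch, batch size 8
eval_points = list(range(210, total_steps + 1, 210))    # steps at which evaluation runs
save_points = list(range(420, total_steps + 1, 420))    # steps at which a checkpoint is saved

print("total steps:", total_steps)
print("evaluations at:", eval_points)
print("checkpoints at:", save_points)

# Every checkpoint step coincides with an evaluation step,
# which is what load_best_model_at_end needs to rank checkpoints.
assert all(step in eval_points for step in save_points)
```

Under these assumptions the run evaluates five times but only saves two checkpoints, so the "best" model is chosen between those two.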

In [39]:
def compute_metrics(eval_pred: EvalPrediction):
    # Unpacks the evaluation predictions tuple into predictions and labels.
    predictions, labels = eval_pred

    # Replaces any -100 values with the tokenizer's pad_token_id so that both arrays can be decoded.
    # -100 marks positions ignored by the loss, and it can also show up as padding in gathered predictions.
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decodes the tokenized predictions back to text, skipping any special tokens (e.g., padding tokens).
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Decodes the tokenized labels back to text, skipping any special tokens (e.g., padding tokens).
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Computes the ROUGE metric between the decoded predictions and decoded labels.
    # The use_stemmer parameter enables stemming, which reduces words to their root form before comparison.
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Calculates the length of each prediction by counting the non-padding tokens.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]

    # Computes the mean length of the predictions and adds it to the result dictionary under the key "gen_len".
    result["gen_len"] = np.mean(prediction_lens)

    # Rounds each value in the result dictionary to 4 decimal places for cleaner output, and returns the result.
    return {k: round(v, 4) for k, v in result.items()}
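To make the ROUGE numbers easier to interpret, here is a simplified ROUGE-N computation in plain Python. It is a sketch under stated assumptions: lowercase whitespace tokenization, no stemming, and a single reference per prediction. It is not the implementation used by `evaluate`, so its scores will not match exactly. `ngrams` and `rouge_n` are hypothetical helper names.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap as precision, recall, and F1."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # each n-gram counts at most min(pred, ref) times
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example pair
print(rouge_n("the cat sat", "the cat sat on the mat", n=1))
print(rouge_n("the cat sat", "the cat sat on the mat", n=2))
```

ROUGE-2, the metric used above to pick the best checkpoint, is the `n=2` case: it rewards predicted titles that reproduce two-word sequences from the reference title.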


In [40]:

t5_trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["dev"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=5)]
)

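# DataLoaderConfiguration comes from accelerate and needs this import to run.
# Note: the object is not passed to the trainer above, so it does not affect training as written.
from accelerate.utils import DataLoaderConfiguration
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)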
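The early-stopping patience is worth checking against the number of evaluations. With 8,400 training examples, a batch size of 8, one epoch, and `eval_steps=210`, the run performs about five evaluations (assuming a single device and no gradient accumulation). The sketch below uses a simplified patience counter, not the library's code, to show when a patience of 5 could trigger. `first_stop_index` is a hypothetical helper and the metric values are made up.

```python
def first_stop_index(metric_values, patience, greater_is_better=True):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    bad_evals = 0
    for i, value in enumerate(metric_values):
        if best is None:
            improved = True  # the first evaluation always sets the baseline
        elif greater_is_better:
            improved = value > best
        else:
            improved = value < best
        if improved:
            best = value
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Worst case for five evaluations: the metric never improves after the first one
five_evals = [0.20, 0.10, 0.10, 0.10, 0.10]
print(first_stop_index(five_evals, patience=5))          # no stop within five evaluations

# A sixth non-improving evaluation would be needed to reach patience=5
print(first_stop_index(five_evals + [0.10], patience=5))
```

Under these assumptions, a patience of 5 cannot end a one-epoch run early; a smaller patience or more frequent evaluation would be needed for the callback to have an effect.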


In [4]:
t5_trainer.train()

In [None]:
t5_trainer.evaluate()

#### 4. Generate the Predicted Title

In [None]:
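test_title = t5_trainer.predict(tokenized_datasets['test'])
print(test_title)

# Replace any -100 padding in the generated ids with the pad token id before decoding
pred_ids = np.where(test_title.predictions != -100, test_title.predictions, tokenizer.pad_token_id)
result = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
print(result[0])

# Keep the same 1,200 rows that were used to build the test split
test_set = test_df[['data_id', 'title']].iloc[:1200].copy()
test_set['predicted_title'] = result
# Note: the file name says "t5_large", but the model fine-tuned above is t5-small
test_set.to_csv('data/test_set_summary_t5_large.csv', index=False)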

#### 5. Evaluate the result

In [14]:
if os.path.exists('data/test_set_summary_t5_large.csv'):
    test_set_t5 = pd.read_csv('data/test_set_summary_t5_large.csv')
    print('Successfully loaded test_set_summary_t5_large.csv')
else:
    print('Failed to load test_set_summary_t5_large.csv')

Successfully loaded test_set_summary_t5_large.csv


In [16]:
rouge.compute(predictions=test_set_t5['predicted_title'], references=test_set_t5['title'], use_stemmer=True)

{'rouge1': 0.3812477152513559,
 'rouge2': 0.1745242266745301,
 'rougeL': 0.33269447741054975,
 'rougeLsum': 0.3324886814077494}

In [18]:
P, R, F1 = score(test_set_t5['predicted_title'].to_list(), test_set_t5['title'].to_list(), lang='en', verbose=True)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.
computing greedy matching.
done in 56.55 seconds, 21.22 sentences/sec


In [19]:
print(f'Precision: {P.mean():.4f}')
print(f'Recall: {R.mean():.4f}')
print(f'F1: {F1.mean():.4f}')

Precision: 0.8933
Recall: 0.8835
F1: 0.8882
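The "greedy matching" step in the log above is the core of BERTScore: each token in one text is matched to its most similar token in the other text, and the similarities are averaged. The sketch below illustrates that idea with toy 2-dimensional vectors standing in for contextual embeddings. It omits idf weighting and baseline rescaling and is not the `bert_score` implementation. `cosine` and `greedy_match_scores` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall, and F1 from greedy token matching on cosine similarity."""
    sims = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: each candidate token takes its best match among reference tokens
    precision = sum(max(row) for row in sims) / len(cand_vecs)
    # Recall: each reference token takes its best match among candidate tokens
    recall = sum(
        max(sims[i][j] for i in range(len(cand_vecs)))
        for j in range(len(ref_vecs))
    ) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy embeddings (assumption: real BERTScore uses contextual model embeddings)
candidate = [[1.0, 0.0], [0.0, 1.0]]   # two candidate tokens
reference = [[1.0, 0.0]]               # one reference token
print(greedy_match_scores(candidate, reference))
```

Because F1 is computed per sentence pair and then averaged, the mean F1 reported above need not equal the harmonic mean of the mean precision and mean recall.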
