In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Muthu Palaniappan M - 21011101079 - NLP Model Lab

## Installing Packages

In [2]:
!pip install -U transformers
!pip install -U datasets
!pip install tensorboard
!pip install sentencepiece
!pip install accelerate
!pip install evaluate
!pip install rouge_score

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl.metadata (29 kB)
Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 kB 3.5 MB/s eta 0:00:00
Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=91364340d93014222420d64ebd14e7340cd5eb53322d66f6c85a94ae365a17e9
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599

## Importing Packages

- evaluate: The evaluate library helps us quickly evaluate transformer models from the Hugging Face ecosystem on different tasks, such as text classification, question answering, and text summarization.

- rouge_score: Text summarization is primarily evaluated with the ROUGE score. The evaluate library loads the ROUGE metric code from this package, so it must be installed, but it does not need to be imported separately. The details of the ROUGE score are covered later in the notebook.


In [3]:
import torch
import pprint
import evaluate
import numpy as np
 
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset

2024-04-10 04:13:32.885680: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-10 04:13:32.885783: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-10 04:13:33.008835: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Data Preparation

- 80% of the samples are used for training and the rest for validation.
- After the split, there are 1779 samples in the training set and 445 samples in the validation set. The two splits are stored as `Dataset` objects in dataset_train and dataset_valid.

In [4]:
dataset = load_dataset('gopalkalpande/bbc-news-summary', split='train')
full_dataset = dataset.train_test_split(test_size=0.2, shuffle=True)
dataset_train = full_dataset['train']
dataset_valid = full_dataset['test']

Downloading readme:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Downloading data: 100%|██████████| 7.32M/7.32M [00:00<00:00, 12.9MB/s]


Generating train split: 0 examples [00:00, ? examples/s]

In [5]:
print(dataset_train)
print(dataset_valid)

Dataset({
    features: ['File_path', 'Articles', 'Summaries'],
    num_rows: 1779
})
Dataset({
    features: ['File_path', 'Articles', 'Summaries'],
    num_rows: 445
})
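
The split sizes printed above follow from the dataset size. Here is a minimal sketch of the arithmetic, assuming the 2,224 total rows implied by 1779 + 445 and assuming the test share is rounded up. The helpers `split_sizes` and `seeded_split` are hypothetical and are not part of the `datasets` library. The second helper shows a seeded shuffle in plain Python, since the `train_test_split` call above passes no seed and may therefore produce a different split on each run (`train_test_split` also accepts a `seed` argument for this purpose).

```python
import math
import random

def split_sizes(n_rows, test_size=0.2):
    """Return (n_train, n_test) for a fractional test size, rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

def seeded_split(indices, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed so the same split is produced on every run."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_train, _ = split_sizes(len(shuffled), test_size)
    return shuffled[:n_train], shuffled[n_train:]

n_train, n_test = split_sizes(2224)
print(n_train, n_test)  # 1779 445

train_idx, valid_idx = seeded_split(range(2224))
```

Fixing the seed matters here because the length statistics and ROUGE scores later in the notebook depend on which rows land in each split.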


## Dataset Analysis

In [6]:
def find_avg_sentence_length(dataset):
    """
    Return the average number of whitespace-separated words per text.
    Despite the name, this measures whole articles or summaries, not sentences.
    """
    lengths = [len(text.split()) for text in dataset]
    return sum(lengths) / len(lengths)

In [7]:
avg_article_length = find_avg_sentence_length(dataset_train['Articles'])
print(f"Average article length: {avg_article_length} words")
avg_summary_length = find_avg_sentence_length(dataset_train['Summaries'])
print(f"Average summary length: {avg_summary_length} words")

Average article length: 384.8555368184373 words
Average summary length: 167.48341765036537 words


- The average length of the summaries is around 167 words.
- The average length of the articles is around 385 words.
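
A mean alone says little about how the lengths are distributed. Here is a minimal sketch, on toy strings rather than the real dataset, that also reports the median and the share of texts under a word threshold. The function name `word_count_stats` and its `threshold` parameter are assumptions made for illustration.

```python
import statistics

def word_count_stats(texts, threshold=200):
    """Summarize whitespace-separated word counts for a collection of texts."""
    counts = [len(text.split()) for text in texts]
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        # Fraction of texts whose word count is strictly below the threshold.
        "share_below_threshold": sum(c < threshold for c in counts) / len(counts),
    }

toy_texts = ["one two three", "one two", "one two three four five six seven"]
stats = word_count_stats(toy_texts, threshold=5)
print(stats["mean"], stats["median"])  # 4 3
```

In the notebook, the same function could be called with `dataset_train['Summaries']` to check what fraction of summaries falls below 200 words.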

In [8]:
def find_longest_length(dataset):
    """
    Return the length in words of the longest text, plus the number of
    texts longer than 4000, 2000, 1000 and 500 words.
    """
    max_length = 0
    counter_4k = 0
    counter_2k = 0
    counter_1k = 0
    counter_500 = 0
    for text in dataset:
        corpus = [
            word for word in text.split()
        ]
        if len(corpus) > 4000:
            counter_4k += 1
        if len(corpus) > 2000:
            counter_2k += 1
        if len(corpus) > 1000:
            counter_1k += 1
        if len(corpus) > 500:
            counter_500 += 1
        if len(corpus) > max_length:
            max_length = len(corpus)
    return max_length, counter_4k, counter_2k, counter_1k, counter_500

In [10]:
longest_article_length, counter_4k, counter_2k, counter_1k, counter_500 = find_longest_length(dataset_train['Articles'])
print(f"Longest article length: {longest_article_length} words")
print(f"Articles larger than 4000 words: {counter_4k}")
print(f"Articles larger than 2000 words: {counter_2k}")
print(f"Articles larger than 1000 words: {counter_1k}")
print(f"Articles larger than 500 words: {counter_500}")
print("------------------------------")
longest_summary_length, counter_4k, counter_2k, counter_1k, counter_500 = find_longest_length(dataset_train['Summaries'])
print(f"Longest summary length: {longest_summary_length} words")
print(f"Summaries larger than 4000 words: {counter_4k}")
print(f"Summaries larger than 2000 words: {counter_2k}")
print(f"Summaries larger than 1000 words: {counter_1k}")
print(f"Summaries larger than 500 words: {counter_500}")

Longest article length: 4377 words
Articles larger than 4000 words: 1
Articles larger than 2000 words: 7
Articles larger than 1000 words: 18
Articles larger than 500 words: 369
------------------------------
Longest summary length: 2073 words
Summaries larger than 4000 words: 0
Summaries larger than 2000 words: 1
Summaries larger than 1000 words: 7
Summaries larger than 500 words: 14


- There is just one article above 4000 words and 369 articles above 500 words.
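
The per-threshold counters above can also be kept in a dictionary, which avoids one variable per threshold. This is a minimal sketch on toy strings; the name `count_longer_than` is hypothetical and the thresholds are passed in as a parameter.

```python
def count_longer_than(texts, thresholds=(500, 1000, 2000, 4000)):
    """Return the longest word count and, per threshold, how many texts exceed it."""
    lengths = [len(text.split()) for text in texts]
    counts = {t: sum(length > t for length in lengths) for t in thresholds}
    return max(lengths, default=0), counts

toy_texts = ["a b c", "a b c d e f", "a"]
longest, counts = count_longer_than(toy_texts, thresholds=(2, 5))
print(longest, counts)  # 6 {2: 2, 5: 1}
```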

## Configurations

In [22]:
MODEL = 't5-base'
BATCH_SIZE = 4
NUM_PROCS = 4
EPOCHS = 2
OUT_DIR = 'results_t5base'
MAX_LENGTH = 512 # Maximum context length to consider while preparing dataset.

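- I chose to fine-tune the t5-base model.
- The batch size is 4, and 4 processes are used for parallel preprocessing.
- I will train for 2 epochs, and the maximum context length of the articles will be 512 tokens.
- The average length of the articles is about 385 words. Word counts and token counts differ, because the tokenizer can split one word into several tokens, so an article usually has more tokens than words. Any article shorter than 512 tokens will be padded, and any article longer than 512 tokens will be truncated.
- I feel this is a reasonable size for this dataset, although the articles above 500 words (369 in the training set) are likely to be truncated.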
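
The padding and truncation described above can be sketched in plain Python. This is an illustration only: `pad_or_truncate` is a hypothetical helper, the pad id of 0 matches T5's pad token, and the sketch ignores details such as the end-of-sequence token that the real tokenizer handles.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Clip a token id list to max_length, then right-pad it to exactly max_length."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return clipped + [pad_id] * n_pad, attention_mask

short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
print(short_ids, short_mask)  # [11, 12, 13, 0, 0] [1, 1, 1, 0, 0]

long_ids, long_mask = pad_or_truncate(list(range(1, 9)), max_length=5)
print(long_ids, long_mask)  # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

Every example ends up with the same length, which lets the examples be batched without a custom collator, at the cost of computing over padding for short articles.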

## Tokenization
Tokenizing means splitting text into tokens and mapping each token to an integer ID. A single word may be broken down into several subword tokens.
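
To make the idea concrete, here is a toy tokenizer that splits words into known pieces by greedy longest match and maps each piece to an ID. It is not the SentencePiece algorithm that T5 uses; the vocabulary `TOY_VOCAB` and both functions are invented for illustration.

```python
TOY_VOCAB = {"<unk>": 0, "summar": 1, "ize": 2, "the": 3, "news": 4, "paper": 5}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split each word into vocabulary pieces using greedy longest match."""
    pieces = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest remaining substring first, then shrink it.
            for end in range(len(word), start, -1):
                if word[start:end] in vocab:
                    pieces.append(word[start:end])
                    start = end
                    break
            else:
                # No known piece starts here: emit <unk> for the rest of the word.
                pieces.append("<unk>")
                break
    return pieces

def toy_encode(text, vocab=TOY_VOCAB):
    """Map each piece to its integer ID."""
    return [vocab[piece] for piece in toy_tokenize(text, vocab)]

print(toy_tokenize("Summarize the newspaper"))  # ['summar', 'ize', 'the', 'news', 'paper']
print(toy_encode("Summarize the newspaper"))    # [1, 2, 3, 4, 5]
```

Three words become five tokens here, which is why a limit expressed in tokens is reached sooner than the same limit expressed in words.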

In [12]:
tokenizer = T5Tokenizer.from_pretrained(MODEL)

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [13]:
def preprocess_function(examples):
    # Prefix each article with the T5 task prompt for summarization.
    inputs = [f"summarize: {article}" for article in examples['Articles']]
    model_inputs = tokenizer(
        inputs,
        max_length=MAX_LENGTH,
        truncation=True,
        padding='max_length'
    )

    # Tokenize the target summaries. `text_target` replaces the deprecated
    # `as_target_tokenizer()` context manager.
    labels = tokenizer(
        text_target=examples['Summaries'],
        max_length=MAX_LENGTH,
        truncation=True,
        padding='max_length'
    )

    # Note: pad token ids are kept in the labels, so the loss also covers padding positions.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
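
The preprocessing cell above stores the padded label ids as they are. A common variation, which is not what this notebook does, replaces the pad ids in the labels with -100 so that the loss ignores padding positions. The `compute_metrics` function in this notebook already maps -100 back to the pad token before decoding. This is a minimal sketch on toy label lists, with `mask_padding` as a hypothetical helper.

```python
PAD_ID = 0            # T5's pad token id
IGNORE_INDEX = -100   # value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace pad ids with the ignore index in every label sequence."""
    return [
        [ignore_index if token == pad_id else token for token in sequence]
        for sequence in label_ids
    ]

toy_labels = [[71, 1712, 1, 0, 0], [8, 1, 0, 0, 0]]
print(mask_padding(toy_labels))
# [[71, 1712, 1, -100, -100], [8, 1, -100, -100, -100]]
```

Adopting this change would alter the training loss and the scores reported below, so the results in this notebook should not be compared directly with a run that uses masking.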

In [14]:
tokenized_train = dataset_train.map(
    preprocess_function,
    batched=True,
    num_proc=NUM_PROCS
)
tokenized_valid = dataset_valid.map(
    preprocess_function,
    batched=True,
    num_proc=NUM_PROCS
)

Map (num_proc=4):   0%|          | 0/1779 [00:00<?, ? examples/s]



Map (num_proc=4):   0%|          | 0/445 [00:00<?, ? examples/s]



## Model
- I load the T5 Base model and move it to the computation device. 
- The T5 Base model contains around 223 million parameters. 

In [15]:
model = T5ForConditionalGeneration.from_pretrained(MODEL)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

222,903,552 total parameters.
222,903,552 training parameters.


## Score Metric: Rouge
ROUGE is one of the most common metrics for evaluating deep learning based text summarization models.

- ROUGE1: Based on the unigrams (single words) shared by the prediction and the ground truth. Dividing the number of matches by the number of words in the prediction gives precision; dividing it by the number of words in the ground truth gives recall.
- ROUGE2: The same as ROUGE1, but computed over bigrams.
- ROUGEL: Based on the longest common subsequence between the prediction and the ground truth.

The scores reported by the evaluate library are F-measures that combine precision and recall.
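
The three scores can be worked through by hand on a short example. The sketch below is a simplified re-implementation for illustration: it uses whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` package exactly. All function names are invented for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prf(matches, n_pred, n_ref):
    """Precision, recall and F-measure from a match count."""
    if matches == 0:
        return 0.0, 0.0, 0.0
    precision, recall = matches / n_pred, matches / n_ref
    return precision, recall, 2 * precision * recall / (precision + recall)

def rouge_n(prediction, reference, n=1):
    """ROUGE-N: clipped n-gram overlap between prediction and reference."""
    pred = Counter(ngrams(prediction.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((pred & ref).values())
    return prf(overlap, sum(pred.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(prediction, reference):
    """ROUGE-L: scores based on the longest common subsequence of words."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return prf(lcs_length(pred, ref), len(pred), len(ref))

prediction = "the cat sat on the mat"
reference = "the cat lay on the mat today"
r1, r2, rl = rouge_n(prediction, reference, 1), rouge_n(prediction, reference, 2), rouge_l(prediction, reference)
print([round(score[2], 4) for score in (r1, r2, rl)])  # [0.7692, 0.5455, 0.7692]
```

For this pair, 5 of the 6 predicted words appear in the 7-word reference, so ROUGE1 precision is 5/6, recall is 5/7, and the F-measure is 10/13.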


In [16]:
rouge = evaluate.load("rouge")

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [17]:
def compute_metrics(eval_pred):
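    # `predictions` is the tuple returned by preprocess_logits_for_metrics;
    # its first element holds the argmax token ids.
    predictions, labels = eval_pred.predictions[0], eval_pred.label_ids

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Restore pad tokens where labels were masked with -100 so they can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)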

    result = rouge.compute(
        predictions=decoded_preds, 
        references=decoded_labels, 
        use_stemmer=True, 
        rouge_types=[
            'rouge1', 
            'rouge2', 
            'rougeL'
        ]
    )

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

In [18]:
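def preprocess_logits_for_metrics(logits, labels):
    """
    Reduce the logits to predicted token ids before the Trainer accumulates them.
    Keeping the full logits for every evaluation batch uses a lot of memory, so only
    the argmax over the vocabulary is stored. `logits[0]` selects the language-model
    logits from the tuple of model outputs.
    """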
    pred_ids = torch.argmax(logits[0], dim=-1)
    return pred_ids, labels
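
A rough calculation shows why reducing logits to token ids helps. The sketch below mirrors the argmax on nested Python lists and then estimates memory for the 445 validation examples. It assumes float32 logits, int64 ids, a sequence length of 512 and a vocabulary size of 32,128 (the value in the t5-base configuration); these figures are assumptions and the real footprint depends on the runtime.

```python
def argmax_ids(logits):
    """Map logits shaped [batch][position][vocab] to token ids shaped [batch][position]."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in sequence]
        for sequence in logits
    ]

toy_logits = [[[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]]]
print(argmax_ids(toy_logits))  # [[1, 0]]

# Back-of-the-envelope memory estimate for the whole validation set.
n_eval, seq_len, vocab_size = 445, 512, 32128
logits_bytes = n_eval * seq_len * vocab_size * 4   # float32 logits
ids_bytes = n_eval * seq_len * 8                   # int64 token ids
print(f"{logits_bytes / 1e9:.1f} GB vs {ids_bytes / 1e6:.1f} MB")  # 29.3 GB vs 1.8 MB
```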

## Training the Model

In [23]:
###Training Arguments
training_args = TrainingArguments(
    output_dir=OUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir=OUT_DIR,
    logging_steps=10,
    evaluation_strategy='steps',
    eval_steps=200,
    save_strategy='epoch',
    save_total_limit=2,
    report_to='tensorboard',
    learning_rate=0.0001,
    dataloader_num_workers=4
)
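
The training arguments above imply a fixed number of optimizer steps. Here is a minimal sketch of that arithmetic, assuming a single device and no gradient accumulation; the helper `schedule` is hypothetical and is not part of the Trainer API.

```python
import math

def schedule(n_train, batch_size, epochs, eval_steps, warmup_steps):
    """Derive step counts from dataset size and training arguments."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # Steps at which evaluation runs during training.
        "eval_at": list(range(eval_steps, total_steps + 1, eval_steps)),
        # Share of training spent in learning-rate warmup.
        "warmup_fraction": min(warmup_steps, total_steps) / total_steps,
    }

plan = schedule(n_train=1779, batch_size=4, epochs=2, eval_steps=200, warmup_steps=500)
print(plan["steps_per_epoch"], plan["total_steps"], plan["eval_at"])
# 445 890 [200, 400, 600, 800]
```

These numbers line up with the `checkpoint-445` and `checkpoint-890` directories saved at the end of each epoch. Under these assumptions, 500 warmup steps cover a little over half of the 890 training steps, so the learning rate reaches its peak only in the second epoch.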

In [24]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    compute_metrics=compute_metrics
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [25]:
history = trainer.train()

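| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Gen Len |
|------|---------------|-----------------|--------|--------|--------|---------|
| 200 | 0.4042 | 0.278079 | 0.9188 | 0.8569 | 0.9043 | 222.6202 |
| 400 | 0.4645 | 0.278967 | 0.9198 | 0.8579 | 0.905 | 222.6225 |
| 600 | 0.3315 | 0.277035 | 0.9198 | 0.858 | 0.9054 | 222.6202 |
| 800 | 0.2614 | 0.267132 | 0.922 | 0.8615 | 0.9078 | 222.6202 |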


## Saving Config

In [26]:
tokenizer.save_pretrained(OUT_DIR)

('results_t5base/tokenizer_config.json',
 'results_t5base/special_tokens_map.json',
 'results_t5base/spiece.model',
 'results_t5base/added_tokens.json')

## Post Processing

In [27]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

In [28]:
model_path = f"{OUT_DIR}/checkpoint-890"  # the path where I saved my model
model = T5ForConditionalGeneration.from_pretrained(model_path)
tokenizer = T5Tokenizer.from_pretrained(OUT_DIR)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [30]:
def summarize_text(text, model, tokenizer, max_length=512, num_beams=5):
    # Preprocess the text: add the task prefix and truncate to max_length input tokens.
    inputs = tokenizer.encode(
        "summarize: " + text,
        return_tensors='pt',
        max_length=max_length,
        truncation=True
    )

    # Generate the summary. min_length=100 with max_length=101 pins the output
    # to roughly 100 tokens, so a summary may stop mid-sentence.
    summary_ids = model.generate(
        inputs,
        min_length=100,
        max_length=101,
        num_beams=num_beams,
        # early_stopping=True,
    )

    # Decode and return the summary.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

## Inference

In [46]:
dataset_train['Articles'][200]

"Smith loses US box office crown..New comedy Diary of a Mad Black Woman has ended Will Smith's reign at the top of the North American box office...Based on a play by Tyler Perry, who also stars as a gun-toting grandmother, the film took $22.7m (£11.8m) in its first three days of release. After topping the chart for two consecutive weeks, Smith's romantic comedy Hitch dropped to second place with takings of $21m (£10.9m). Keanu Reeves' supernatural thriller Constantine dropped a place to three. Based on the Hellblazer comics, the film took $11.8m (£6.1m) on its second week of release. Two new entries came next in the chart, with Wes Craven's horror movie Cursed, about a werewolf loose in Los Angeles, in fourth position with $9.5m (£4.9m)...Action comedy Man of the House, starring Tommy Lee Jones as a Texas ranger assigned to protect a cheerleader squad, came in at fifth with $9m (£4.6m). Clint Eastwood's boxing drama Million Dollar Baby - recipient of four Academy Awards, including best

In [54]:
text = dataset_train['Articles'][200]

In [55]:
summary = summarize_text(text, model, tokenizer)

## Downloading the weights

In [50]:
!zip -r {OUT_DIR} {OUT_DIR}

  adding: results_t5base/ (stored 0%)
  adding: results_t5base/spiece.model (deflated 48%)
  adding: results_t5base/checkpoint-445/ (stored 0%)
  adding: results_t5base/checkpoint-445/trainer_state.json (deflated 80%)
  adding: results_t5base/checkpoint-445/generation_config.json (deflated 29%)
  adding: results_t5base/checkpoint-445/config.json (deflated 62%)
  adding: results_t5base/checkpoint-445/training_args.bin (deflated 51%)
  adding: results_t5base/checkpoint-445/model.safetensors (deflated 8%)
  adding: results_t5base/checkpoint-445/optimizer.pt (deflated 8%)
  adding: results_t5base/checkpoint-445/scheduler.pt (deflated 55%)
  adding: results_t5base/checkpoint-445/rng_state.pth (deflated 25%)
  adding: results_t5base/events.out.tfevents.1712724251.1d9c84c08798.34.1 (deflated 67%)
  adding: results_t5base/events.out.tfevents.1712723658.1d9c84c08798.34.0 (deflated 66%)
  adding: results_t5base/special_tokens_map.json (deflated 85%)
  adding: results_t5base/added_tokens.json (de