<h1 align="center" style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Inshorts News Headline Generation using BART</h1>

<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Download and Import Libraries</h3>

In [1]:
# Install the necessary libraries
# !pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q > null
!pip install -q transformers datasets evaluate rouge_score
!pip install -q --upgrade jupyterlab
!pip install -q --upgrade ipywidgets

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
beatrix-jupyterlab 2023.128.151533 requires jupyterlab~=3.6.0, but you have jupyterlab 4.3.2 which is incompatible.

In [2]:
import os
from tqdm import tqdm

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

import torch
import torch.nn as nn

from datasets import Dataset, DatasetDict

import evaluate

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq, 
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    pipeline
)

import wandb
from kaggle_secrets import UserSecretsClient
import huggingface_hub

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Setup API Tokens</h3>

In [3]:
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("inshort-text-summariser")
secret_value_1 = user_secrets.get_secret("wandb")

In [4]:
wandb.login(key=secret_value_1)

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [5]:
huggingface_hub.login(token=secret_value_0)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Load Dataset</h3>

In [6]:
df = pd.read_csv('/kaggle/input/inshorts-dataset-english/english_news_dataset.csv')

In [7]:
# Select a subset of 10,000 
df = df.sample(10_000).reset_index(drop=True)

<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Dataset Preview</h3>

In [8]:
df.head()

Unnamed: 0,Headline,Content,News Categories,Date
0,Markets gain for 6th straight session amid rat...,Global markets stay focused on a potential Fed...,['business'],2024-08-26
1,"2024 can be the year, you achieve your financi...",If the mass layoffs that happened last year we...,"['education', 'miscellaneous', 'business', 'na...",2024-01-01
2,Wife says husband’s affair saved their 24-year...,A woman said that the discovery of her husband...,"['hatke', 'miscellaneous']",2024-01-16
3,"Yen holds post-intervention surge, eyes Fed",The yen held its line against the dollar on Tu...,['business'],2024-04-30
4,PGIM India AMC modifies Systematic Investment ...,PGIM India Asset Management (AMC) has announce...,['business'],2024-03-28


In [9]:
# Print five random news headlines with their content
for headline, content in df[['Headline', 'Content']].sample(5).values:
    print(f"News content: {content}")
    print(f"News headline: {headline}", end='\n\n')

News content: Cred founder Kunal Shah recently sparked a heated debate on X with his controversial remark about "mediocre people." On May 25, the credit card bill payment app founder posted, "Mediocre people often have a clear tell: you'll often see them hanging out with other mediocre people, probably because the A+ folks avoid them." 
News headline: Cred founder's 'mediocre people' comment divides internet

News content: Amid rising flight delays and cancellations, the only thing left to push us over the edge is the prospect of damaged baggage. Recently, Shrankhla Srivastava, a passenger on an IndiGo airline, shared a photo of her damaged luggage on social media, blaming the airline for poor handling. If you ever find yourself in this situation, here is what to do
News headline: What to do when airlines damage/lose your baggage

News content: US may ban China-based DJI's drones citing national security threat, NYT reported. Government agencies reportedly found DJI provides informatio

In [10]:
df.shape

(10000, 4)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Headline         10000 non-null  object
 1   Content          10000 non-null  object
 2   News Categories  10000 non-null  object
 3   Date             10000 non-null  object
dtypes: object(4)
memory usage: 312.6+ KB


<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Split Dataset Into Training, Validation and Test Set</h3>

In [12]:
# Hold out 10% of the rows for validation, then 10% of the remainder for testing
X_train, X_val = train_test_split(df, test_size=0.1, random_state=42)
X_train, X_test = train_test_split(X_train, test_size=0.1, random_state=42)

In [13]:
print(f"The shape of the training set is {X_train.shape}")
print(f"The shape of the validation set is {X_val.shape}")
print(f"The shape of the test set is {X_test.shape}")

The shape of the training set is (8100, 4)
The shape of the validation set is (1000, 4)
The shape of the test set is (900, 4)
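
The three sizes follow from applying a 10% hold-out twice in a row. Here is a minimal sketch of that arithmetic, assuming the held-out share is rounded up to a whole number of rows; `split_sizes` is a hypothetical helper and not part of the notebook.

```python
def split_sizes(n_rows, val_percent=10, test_percent=10):
    """Return (train, validation, test) sizes for two successive hold-outs."""
    # Ceiling division keeps the arithmetic in integers
    n_val = -(-n_rows * val_percent // 100)        # first split: share of all rows
    remaining = n_rows - n_val
    n_test = -(-remaining * test_percent // 100)   # second split: share of what is left
    return remaining - n_test, n_val, n_test

print(split_sizes(10_000))  # (8100, 1000, 900)
```

Note that the test set is 10% of the remaining 9,000 rows, so it ends up slightly smaller than the validation set.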


<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Convert Data from Pandas to Hugging Face Dataset</h3>

In [14]:
X_train = Dataset.from_pandas(X_train)
X_val = Dataset.from_pandas(X_val)
X_test = Dataset.from_pandas(X_test)

In [15]:
raw_datasets = DatasetDict({
    'train': X_train,
    'validation': X_val,
    'test': X_test
})

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['Headline', 'Content', 'News Categories', 'Date', '__index_level_0__'],
        num_rows: 8100
    })
    validation: Dataset({
        features: ['Headline', 'Content', 'News Categories', 'Date', '__index_level_0__'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['Headline', 'Content', 'News Categories', 'Date', '__index_level_0__'],
        num_rows: 900
    })
})

<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Setup Model Evaluation Metric: Rouge Score</h3>

We’ll use ROUGE scores to track the performance of our model, but before doing that let’s do something every good NLP practitioner should do: create a strong, yet simple baseline! A common baseline for text summarization is to take the first three sentences of an article, often called the lead-3 baseline. We could use full stops to track the sentence boundaries, but this will fail on acronyms like “U.S.” or “U.N.”, so instead we’ll use the nltk library, which includes a better algorithm to handle these cases.
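
To make the metric concrete, here is a minimal sketch of ROUGE-1 as a unigram-overlap F1 score. It assumes lowercased whitespace tokenization and no stemming, so it is a simplification for illustration and will not match the `rouge_score` package exactly; `rouge1_f1` is a hypothetical helper, and the two example strings are made up.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "markets gain for sixth session",
    "markets gain for sixth straight session",
)
print(round(score, 4))
```

Here all five predicted words appear in the six-word reference, so precision is 1 and recall is 5/6. Two texts with no words in common score 0.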

In [16]:
rouge_score = evaluate.load('rouge')

In [17]:
def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

In [18]:
# Get the first three sentences for baseline
print(three_sentence_summary(raw_datasets["train"][1]["Content"]))

In Nepal, registration of a marriage between a transgender woman and a gay man set to open new avenues.
After having been rejected by the Kathmandu district court and then by the Patan High Court in July this year, transgender woman Maya Gurung and gay man Surendra Pandey became the first such couple to get their marriage registered in Nepal.


In [19]:
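def evaluate_baseline(dataset, metric):
    summaries = [three_sentence_summary(text) for text in dataset["Content"]]
    # NOTE: the references here are the full articles, not the headlines, so the
    # score measures how much of each article its first three sentences cover.
    # To benchmark against the target headlines, pass dataset["Headline"] instead.
    return metric.compute(predictions=summaries, references=dataset["Content"])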

In [20]:
score = evaluate_baseline(raw_datasets["validation"], rouge_score)
score

{'rouge1': 0.9736445619379048,
 'rouge2': 0.973049252853839,
 'rougeL': 0.9734827068392489,
 'rougeLsum': 0.967139341476095}

In [21]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    
    # Decode generated summaries into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    
    # Decode reference summaries into text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    
    # Compute ROUGE scores
    result = rouge_score.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    
    return {k: round(v, 4) for k, v in result.items()}
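
The `-100` replacement inside `compute_metrics` can be looked at in isolation on a toy batch. The sketch below is a plain-Python equivalent of the `np.where` call. It assumes a pad token id of 1 (check `tokenizer.pad_token_id` rather than relying on this) and uses made-up token ids; `restore_padding` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value that the loss skips
PAD_TOKEN_ID = 1     # assumed here for illustration

def restore_padding(labels, pad_token_id=PAD_TOKEN_ID):
    """Swap every ignore index for the pad id so the ids can be decoded."""
    return [
        [pad_token_id if tok == IGNORE_INDEX else tok for tok in row]
        for row in labels
    ]

toy_labels = [[0, 713, 16, 2, -100, -100], [0, 100, 200, 300, 400, 2]]
restored = restore_padding(toy_labels)
print(restored)
```

After the swap, decoding with `skip_special_tokens=True` drops the pad tokens, so the padded positions do not leak into the reference text.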

<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Initialize Tokenizer and BART Model</h3>

In [22]:
MAX_INPUT_LENGTH = 1024  # Maximum length of the input to the model
MIN_TARGET_LENGTH = 10  # Minimum length of the output by the model
MAX_TARGET_LENGTH = 128  # Maximum length of the output by the model
BATCH_SIZE = 32

In [23]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model_ckpt = "facebook/bart-base"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Setup Summarization Pipeline</h3>

In [24]:
summarizer = pipeline("summarization", model=model_ckpt)

In [25]:
def print_summary(idx):
    article = raw_datasets["test"][idx]["Content"]
    summary = raw_datasets["test"][idx]["Headline"]
    g_summary = summarizer(article, max_length=15)[0]["summary_text"]
    score = rouge_score.compute(predictions=[g_summary], references=[summary])
    scores = {k: round(v, 4) for k, v in score.items()}
    print(f"'>>> Article: {article}'")
    print(f"\n'>>> Summary: {summary}'")
    print(f"\n'>>> Generated Summary: {g_summary}'")
    print(f"\n'>>> ROUGE Score: {scores}'")

In [26]:
print_summary(5)

'>>> Article: After Maharashtra Deputy Chief Minister Devendra Fadnavis wrote to NCP rival faction chief and fellow Deputy CM Ajit Pawar, opposing Nawab Malik's induction into the ruling alliance in the state, Ajit Pawar responded saying he would present his point after Nawab Malik made his stand official.'

'>>> Summary: Ajit Pawar Plays Safe On Malik: 'Will Present My Point...'

'>>> Generated Summary: After Maharashtra Deputy Chief Minister Devendra Fadnavis wrote'

'>>> ROUGE Score: {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0, 'rougeLsum': 0.0}'


<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Prepare Dataset for Training</h3>

In [27]:
def tokenize_function(text):
    model_inputs = tokenizer(text['Content'], max_length=MAX_INPUT_LENGTH, truncation=True)
    
    # Tokenize the targets; text_target replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(text_target=text['Headline'], max_length=MAX_TARGET_LENGTH, truncation=True)
    
    model_inputs['labels'] = labels['input_ids']
    
    return model_inputs

In [28]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

In [29]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['Headline', 'Content', 'News Categories', 'Date', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8100
    })
    validation: Dataset({
        features: ['Headline', 'Content', 'News Categories', 'Date', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['Headline', 'Content', 'News Categories', 'Date', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 900
    })
})

In [30]:
# A data collator dynamically pads the inputs and the labels
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
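
As a rough picture of what dynamic padding means, the sketch below pads a toy batch to the length of its longest member instead of to a fixed global length. It is a plain-Python illustration, not the `DataCollatorForSeq2Seq` implementation; `pad_batch`, the pad id of 1, and the token ids are all assumptions. Labels are padded with -100 so that padded positions are ignored by the loss.

```python
def pad_batch(features, pad_token_id=1, label_pad_id=-100):
    """Right-pad input_ids, attention_mask and labels to the batch maximum."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_inputs - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_label_pad = max_labels - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_id] * n_label_pad)
    return batch

toy = [
    {"input_ids": [0, 11, 12, 13, 2], "labels": [0, 21, 2]},
    {"input_ids": [0, 14, 2], "labels": [0, 22, 23, 24, 2]},
]
batch = pad_batch(toy)
print(batch["input_ids"])
print(batch["labels"])
```

Padding per batch rather than to `MAX_INPUT_LENGTH` keeps short articles from being stretched to 1024 tokens.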

In [31]:
column_names = tokenized_datasets["train"].column_names
column_names

['Headline',
 'Content',
 'News Categories',
 'Date',
 '__index_level_0__',
 'input_ids',
 'attention_mask',
 'labels']

In [32]:
# Remove the columns with strings because the collator won't know how to pad these elements
tokenized_datasets = tokenized_datasets.remove_columns(
    raw_datasets["train"].column_names
)

<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Model Configuration</h3>

In [33]:
LEARNING_RATE = 5.6e-5
MAX_EPOCHS = 10

In [34]:
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // BATCH_SIZE

model_name = model_ckpt.split("/")[-1]
print(model_name)

bart-base


In [35]:
args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-inshort-news",
    evaluation_strategy="epoch",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=MAX_EPOCHS,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=True,
)

<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Model Building</h3>

In [36]:
# os.environ["CUDA_LAUNCH_BLOCKING"] = "0"

In [37]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [38]:
trainer.train()

wandb: Currently logged in as: oyebamijimicheal10. Use `wandb login --relogin` to force relogin


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
1,2.0796,1.652206,0.4837,0.2736,0.4424,0.4428
2,1.5691,1.567577,0.5168,0.3068,0.4755,0.4759
3,1.2654,1.531981,0.5267,0.3242,0.4859,0.4851
4,1.0453,1.567422,0.5404,0.3374,0.4988,0.4987
5,0.8786,1.591885,0.5431,0.3514,0.5046,0.5049
6,0.7411,1.589573,0.5519,0.3607,0.5146,0.5139
7,0.6453,1.657856,0.5563,0.3704,0.5197,0.5198
8,0.5659,1.648749,0.5612,0.3776,0.5247,0.5245
9,0.5066,1.661315,0.5694,0.3844,0.5322,0.5311
10,0.4693,1.676505,0.5686,0.3861,0.5329,0.5322
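
One way to read the table is to compare where the validation loss bottoms out with where ROUGE peaks. The sketch below copies the logged validation loss and ROUGE-L values from the table and finds both epochs; the variable names are illustrative.

```python
# (epoch, validation_loss, rougeL) copied from the training log above
history = [
    (1, 1.652206, 0.4424), (2, 1.567577, 0.4755), (3, 1.531981, 0.4859),
    (4, 1.567422, 0.4988), (5, 1.591885, 0.5046), (6, 1.589573, 0.5146),
    (7, 1.657856, 0.5197), (8, 1.648749, 0.5247), (9, 1.661315, 0.5322),
    (10, 1.676505, 0.5329),
]

best_loss_epoch = min(history, key=lambda row: row[1])[0]
best_rouge_epoch = max(history, key=lambda row: row[2])[0]
print(best_loss_epoch, best_rouge_epoch)  # 3 10
```

Validation loss is lowest at epoch 3 while ROUGE-L keeps rising through epoch 10, so which checkpoint counts as best depends on the metric chosen.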


Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


TrainOutput(global_step=2540, training_loss=0.9746753441067193, metrics={'train_runtime': 1128.2138, 'train_samples_per_second': 71.795, 'train_steps_per_second': 2.251, 'total_flos': 4967187093872640.0, 'train_loss': 0.9746753441067193, 'epoch': 10.0})
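
The reported `global_step` can be sanity-checked from the dataset and batch sizes. This sketch assumes a single device and no gradient accumulation, which is what the numbers suggest rather than something the notebook states.

```python
import math

train_rows = 8100
batch_size = 32
epochs = 10

# The last, smaller batch of each epoch still counts as a step
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
# Same floor division the notebook uses for logging_steps
logging_steps = train_rows // batch_size

print(steps_per_epoch, total_steps, logging_steps)  # 254 2540 253
```

Because `logging_steps` is 253 and an epoch is 254 steps, the training loss is logged slightly before each epoch boundary.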

<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">Model Inference</h3>

In [39]:
hub_model_id = "xgboost-lover/bart-base-finetuned-inshort-news"
summarizer = pipeline("summarization", model=hub_model_id)

In [40]:
def compare_summaries(input_text):
    # Load the original BART model and tokenizer
    bart_ckpt = "facebook/bart-base"
    bart_tokenizer = AutoTokenizer.from_pretrained(bart_ckpt)
    bart_model = AutoModelForSeq2SeqLM.from_pretrained(bart_ckpt).to("cuda" if torch.cuda.is_available() else "cpu")
    
    # Set up the original BART summarizer pipeline
    bart_summarizer = pipeline("summarization", model=bart_model, tokenizer=bart_tokenizer)
    
    # Load fine-tuned BART model from Hugging Face Hub
    finetuned_hub_model_id = "xgboost-lover/bart-base-finetuned-inshort-news"
    finetuned_summarizer = pipeline("summarization", model=finetuned_hub_model_id)
    
    # Generate summaries
    original_summary = bart_summarizer(input_text, max_length=50, min_length=25, do_sample=False)
    finetuned_summary = finetuned_summarizer(input_text, max_length=50, min_length=25, do_sample=False)
    
    # Print the results
    print("=== Original BART Summary ===")
    print(original_summary[0]['summary_text'])
    print("\n=== Fine-tuned BART Summary ===")
    print(finetuned_summary[0]['summary_text'])

In [41]:
input_text = """
Slim PlayStation triples sales..Sony PlayStation 2's slimmer shape has proved popular with UK gamers, with 50,000 sold in its first week on sale...
Sales have tripled since launch, outstripping Microsoft's Xbox, said market analysts Chart-Track. The numbers were also boosted by the release of the PS2-only game Grand Theft Auto: San Andreas.
The title broke the UK sales record for video games in its first weekend of release. Latest figures suggest it has sold more than 677,000 copies...
"It is obviously very, very encouraging for Sony because Microsoft briefly outsold them last week," John Houlihan, editor of Computerandvideogames.com told BBC News.
"And with Halo 2 [for Xbox] out next week, it really is a head-to-head contest between them and Xbox."..
Although Xbox sales over the last week also climbed, PS2 sales were more than double that. The figures mean Sony is reaching the seven million barrier for UK sales of the console.
Edinburgh-based developer, Rockstar, which is behind the GTA titles, has seen San Andreas pull in an estimated £24m in gross revenues over the weekend.
In comparison, blockbuster films like Harry Potter and The Prisoner Of Azkaban took £11.5m in its first three days at the UK box office.
The Lord of the Rings: The Return of the King took nearly £10m over its opening weekend, although games titles are four to five times more expensive than cinema tickets...
Gangster-themed GTA San Andreas is the sequel to Grand Theft Auto Vice City which previously held the record for the fastest-selling video game ever.
The Xbox game Halo 2, released on 11 November in the UK, is also widely tipped to be one of the best-selling games of the year.
The original title won universal acclaim in 2001, and sold more than four million copies...
Mr Houlihan added that Sony had done well with the PS2, but it definitely helped that the release of San Andreas coincided with the slimline PS2 hitting the shelves.
The run-up to Christmas is a huge battlefield for games consoles and titles. Microsoft's Xbox had been winning the race up until last week in sales.
The sales figures also suggest that it may be a largely adult audience driving demand, since GTA San Andreas has an 18 certificate.
Sony and Microsoft have both reduced console prices recently and are preparing the way for the launches of their next generation consoles in 2005.
"Both have hit crucial price points at around £100 and that really does open up new consoles to new audience, plus the release of two really important games in terms of development are also driving those sales," said Mr Houlihan.
"""

compare_summaries(input_text)

=== Original BART Summary ===
Slim PlayStation triples sales..Sony PlayStation 2's slimmer shape has proved popular with UK gamers, with 50,000 sold in its first week on sale... (Image: Sony)Sales have tripled since launch, outstripping

=== Fine-tuned BART Summary ===
Slim PlayStation triples sales in UK, Microsoft's 'Xbox One' rival hits 7 million sales mark: Report


<h3 style="background-color:#2fbbab;color:white;border-radius: 8px; padding:15px">References</h3>

- Text-Summarizer-BART-ROUGE-PyTorch: https://www.kaggle.com/code/mohamedmagdy191/text-summarizer-bart-rouge-pytorch

- Large Language Model's Architectural Diagrams: https://www.kaggle.com/datasets/suraj520/notebook-images

- HuggingFace NLP course: https://huggingface.co/learn/nlp-course/chapter7/5

- Rouge metric: https://huggingface.co/spaces/evaluate-metric/rouge

I'll just leave the test set as it is, since I didn't do any hyperparameter tuning.

See you in the next one 😉😴😴