<a href="https://colab.research.google.com/github/Astha32/News-Headline-Generation/blob/main/fine_tuned_T5_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install sentencepiece
!pip install transformers
!pip install rich


In [2]:
# Importing libraries
import os
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

from rich.table import Column, Table
from rich import box
from rich.console import Console

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

Data:
*   The dataset used is the BBC news dataset, available at http://mlg.ucd.ie/datasets/bbc.html.
*   It consists of 2225 documents from the BBC news website, corresponding to stories in five topical areas (business, entertainment, politics, sport, tech) from 2004-2005.
*   The files are processed to extract headline-article pairs, which are stored in .csv format.
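
The extraction step itself is not shown in this notebook. A minimal sketch of how it might look, assuming each raw file stores the headline on its first non-empty line followed by the article body (the function name `split_headline_article` and the sample text are hypothetical):

```python
def split_headline_article(raw_text):
    """Split one raw news document into (headline, article).

    Assumption: the first non-empty line is the headline and the
    remaining non-empty lines form the article body.
    """
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    if not lines:
        return "", ""
    headline = lines[0]
    article = " ".join(lines[1:])
    return headline, article


# Made-up sample document, for illustration only
sample = """Example headline

First paragraph of the story.

Second paragraph of the story.
"""

headline, article = split_headline_article(sample)
print(headline)
print(article)
```

Each (headline, article) pair would then become one row of the .csv file.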






In [3]:

# Load the preprocessed headline/article pairs and name the columns explicitly
df = pd.read_csv("/content/BBCNewsDataComplete.csv")

df.columns = ['Article', 'Headline', 'Cleaned_Article', 'Cleaned_Headline']

In [4]:
df.describe

<bound method NDFrame.describe of                                                 Article  ...                   Cleaned_Headline
0     Nike has reported its best second-quarter earn...  ...  strong quarterly growth for nike 
1     Barcelona's pursuit of the Spanish title took ...  ...    barcelona title hopes hit loss 
2     Celtic's Neil Lennon admits Rangers could be c...  ...  lennon brands rangers favourites 
3     The first convictions for piracy over peer-to-...  ...       peer peer pirates convicted 
4     The release of a film about the Mumbai (Bombay...  ...      mumbai bombs movie postponed 
...                                                 ...  ...                                ...
1428  Oil prices carried on rising on Wednesday afte...  ...     winter freeze keeps oil above 
1429  Leicester withstood a stunning Wasps comeback ...  ...                   wasps leicester 
1430  The majority of young people are interested in...  ...         youth interested politics 
1431  

In [5]:
df["Article"] = "summarize: "+df["Article"]

In [6]:
df = df.drop(columns=['Cleaned_Article', 'Cleaned_Headline'])

In [7]:
df.head()

Unnamed: 0,Article,Headline
0,summarize: Nike has reported its best second-q...,Strong quarterly growth for Nike\n
1,summarize: Barcelona's pursuit of the Spanish ...,Barcelona title hopes hit by loss\n
2,summarize: Celtic's Neil Lennon admits Rangers...,Lennon brands Rangers favourites\n
3,summarize: The first convictions for piracy ov...,US peer-to-peer pirates convicted\n
4,summarize: The release of a film about the Mum...,Mumbai bombs movie postponed\n


In [8]:
# defining a rich console logger
console=Console(record=True)

def display_df(df):

  console=Console()
  table = Table(Column("source_text", justify="center" ), Column("target_text", justify="center"), title="Sample Data",pad_edge=False, box=box.ASCII)

  for i, row in enumerate(df.values.tolist()):
    table.add_row(row[0], row[1])

  console.print(table)

training_logger = Table(Column("Epoch", justify="center" ), 
                        Column("Steps", justify="center"),
                        Column("Loss", justify="center"), 
                        title="Training Status",pad_edge=False, box=box.ASCII)


In [9]:
# Setting up the device for GPU 
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

### Prepare the data:

The dataset is preprocessed before being fed to the neural network.

The input dataframe is tokenized using the T5 tokenizer. It is split into:


*   Training dataset (80% of the original dataset): used for fine-tuning the model.
*   Validation dataset (20%): used for evaluating the performance of the model.

A DataLoader is created for each split so that only a controlled amount of data (one batch at a time) is loaded into memory and passed to the neural network.
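
The notebook performs the split with `DataFrame.sample` and the batching with `torch.utils.data.DataLoader`. The underlying idea can be sketched in plain Python; the helper names `split_indices` and `batches` are hypothetical and only illustrate the mechanics:

```python
import random


def split_indices(n_rows, train_frac=0.8, seed=42):
    """Return (train_idx, val_idx): a random, non-overlapping split of range(n_rows)."""
    indices = list(range(n_rows))
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    rng.shuffle(indices)
    n_train = int(round(n_rows * train_frac))
    return indices[:n_train], indices[n_train:]


def batches(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


train_idx, val_idx = split_indices(400)     # 400 rows are used later in this notebook
print(len(train_idx), len(val_idx))
first_batch = next(batches(train_idx, 4))   # batch size 4 is an arbitrary illustration
print(len(first_batch))
```

With 400 rows and an 80/20 split, 320 rows go to training and 80 to validation.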







In [10]:
class PrepareData(Dataset):

  def __init__(self, dataframe, tokenizer, source_len, target_len, source_text, target_text):
    self.tokenizer = tokenizer
    self.data = dataframe
    self.source_len = source_len
    self.summ_len = target_len
    self.target_text = self.data[target_text]
    self.source_text = self.data[source_text]

  def __len__(self):
    return len(self.target_text)

  def __getitem__(self, index):
    source_text = str(self.source_text[index])
    target_text = str(self.target_text[index])

    source_text = ' '.join(source_text.split())
    target_text = ' '.join(target_text.split())

    # tokenize each text, truncating or padding it to a fixed length
    source = self.tokenizer.batch_encode_plus([source_text], max_length=self.source_len, truncation=True, padding="max_length", return_tensors='pt')
    target = self.tokenizer.batch_encode_plus([target_text], max_length=self.summ_len, truncation=True, padding="max_length", return_tensors='pt')

    source_ids = source['input_ids'].squeeze()
    source_mask = source['attention_mask'].squeeze()
    target_ids = target['input_ids'].squeeze()
    target_mask = target['attention_mask'].squeeze()

    return {
        'source_ids': source_ids.to(dtype=torch.long), 
        'source_mask': source_mask.to(dtype=torch.long), 
        'target_ids': target_ids.to(dtype=torch.long),
        'target_ids_y': target_ids.to(dtype=torch.long)
    }
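
What `padding="max_length"` and `truncation=True` achieve can be illustrated on plain lists of token ids. This is a simplified sketch, not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, the token ids are made up, special-token handling is left out, and a pad id of 0 is assumed.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = list(token_ids[:max_length])   # truncate sequences that are too long
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # pad sequences that are too short
    mask += [0] * n_pad                  # 0 marks padding
    return ids, mask


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([5, 6, 7, 8], max_length=2)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every sequence ends up with the same length, the items returned by `__getitem__` can be stacked into batches.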

In [11]:
def createDataLoaders(dataframe, source_text, target_text, tokenizer, model_params):

    dataframe = dataframe[[source_text,target_text]]
    display_df(dataframe.head(2))

    # Data split into training and validation datasets
    train_size = 0.8
    train_dataset = dataframe.sample(frac=train_size, random_state=model_params["SEED"])
    val_dataset = dataframe.drop(train_dataset.index).reset_index(drop=True)
    train_dataset = train_dataset.reset_index(drop=True)

    console.print(f"FULL Dataset: {dataframe.shape}")
    console.print(f"TRAIN Dataset: {train_dataset.shape}")
    console.print(f"VALIDATION Dataset: {val_dataset.shape}\n")


    # Preparing the data for dataloaders using PrepareData() function
    training_set = PrepareData(train_dataset, tokenizer, model_params["MAX_SOURCE_TEXT_LENGTH"], model_params["MAX_TARGET_TEXT_LENGTH"], source_text, target_text)
    val_set = PrepareData(val_dataset, tokenizer, model_params["MAX_SOURCE_TEXT_LENGTH"], model_params["MAX_TARGET_TEXT_LENGTH"], source_text, target_text)


    # Defining the parameters for creation of dataloaders
    train_params = {
        'batch_size': model_params["TRAIN_BATCH_SIZE"],
        'shuffle': True,
        'num_workers': 0
        }


    val_params = {
        'batch_size': model_params["VALID_BATCH_SIZE"],
        'shuffle': False,
        'num_workers': 0
        }


    # Creation of Dataloaders for training and validation
    training_loader = DataLoader(training_set, **train_params)
    val_loader = DataLoader(val_set, **val_params)

    return training_loader, val_loader


### Fine Tuning the T5 Model

T5 (Text-to-Text Transfer Transformer) reported state-of-the-art results on many NLP benchmarks when it was introduced.

In the text-to-text approach, NLP tasks such as translation, classification, summarization and question answering are all treated as text-to-text conversion problems, rather than as separate problem statements.

Each news article is prefixed with "summarize: " to tell the model that it is being used for text summarization.

The model is defined with T5ForConditionalGeneration.from_pretrained("t5-base"). T5ForConditionalGeneration adds a language-modeling head on top of the T5 model, which allows it to generate text.
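
The text-to-text framing and the task prefix described above can be sketched as follows. The helper `to_text_to_text` is hypothetical, the article body is a placeholder, and the translation pair is only an illustrative example of a different task in the same format:

```python
def to_text_to_text(prefix, source_text, target_text):
    """Format one example as a (model input, expected output) pair of strings."""
    source = " ".join(source_text.split())   # collapse whitespace, as PrepareData does
    target = " ".join(target_text.split())
    return prefix + source, target


examples = [
    # summarization: placeholder article body, headline taken from the data shown earlier
    to_text_to_text("summarize: ", "An example article body.", "Strong quarterly growth for Nike\n"),
    # translation: a different task expressed in exactly the same string-to-string form
    to_text_to_text("translate English to German: ", "That is good.", "Das ist gut."),
]

for model_input, expected_output in examples:
    print(repr(model_input), "->", repr(expected_output))
```

Only the prefix changes between tasks; the model, loss and decoding procedure stay the same.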


In [12]:
def train(epoch, tokenizer, model, device, loader, optimizer):

  model.train()

  for step, data in enumerate(loader, 0):
    y = data['target_ids'].to(device, dtype=torch.long)
    y_ids = y[:, :-1].contiguous()                  # decoder inputs: all target tokens except the last
    lm_labels = y[:, 1:].clone().detach()           # language model labels: targets shifted left by one
    lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100  # padding positions are ignored by the loss
    ids = data['source_ids'].to(device, dtype=torch.long)
    mask = data['source_mask'].to(device, dtype=torch.long)

    outputs = model(input_ids=ids, attention_mask=mask, decoder_input_ids=y_ids, labels=lm_labels)
    loss = outputs[0]

    # log the loss every 10 steps
    if step % 10 == 0:
      training_logger.add_row(str(epoch), str(step), str(loss.item()))
      console.print(training_logger)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
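
The label preparation inside `train` can be illustrated on a single toy sequence using plain lists. This sketch assumes a pad token id of 0 and an end-of-sequence id of 1; the other token ids are made up and the helper `shift_for_decoder` is hypothetical:

```python
PAD_ID = 0            # assumed pad token id
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips


def shift_for_decoder(target_ids, pad_id=PAD_ID):
    """Mirror y[:, :-1] and y[:, 1:] from train() for a single sequence."""
    decoder_input_ids = target_ids[:-1]
    labels = [IGNORE_INDEX if tok == pad_id else tok for tok in target_ids[1:]]
    return decoder_input_ids, labels


# Toy target: three made-up token ids, end-of-sequence id 1, then padding
target = [37, 42, 99, 1, 0, 0]
dec_in, labels = shift_for_decoder(target)
print(dec_in)   # [37, 42, 99, 1, 0]
print(labels)   # [42, 99, 1, -100, -100]
```

At each position the decoder sees the tokens up to that point and is trained to predict the next one, while padded positions contribute nothing to the loss.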

In [13]:
def validate(epoch, tokenizer, model, device, loader):

  model.eval()
  predictions = []
  actuals = []
  articles = []
  with torch.no_grad():
      for _, data in enumerate(loader, 0):
          y = data['target_ids'].to(device, dtype = torch.long)
          ids = data['source_ids'].to(device, dtype = torch.long)
          mask = data['source_mask'].to(device, dtype = torch.long)

          generated_ids = model.generate(
              input_ids = ids,
              attention_mask = mask, 
              max_length=150, 
              num_beams=2,
              repetition_penalty=2.5, 
              length_penalty=1.0, 
              early_stopping=True
              )
          preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
          target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
          article = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in ids]
          
          if _%10==0:
              console.print(f'Completed {_}')

          predictions.extend(preds)
          actuals.extend(target)
          articles.extend(article)
  return predictions, actuals, articles

In [14]:
def T5Trainer(dataframe, source_text, target_text, model_params, output_dir="./outputs/" ):

  torch.manual_seed(model_params["SEED"]) 
  np.random.seed(model_params["SEED"]) 
  torch.backends.cudnn.deterministic = True


  # tokenizer for encoding the text
  tokenizer = T5Tokenizer.from_pretrained(model_params["MODEL"])

  # Defining the model. t5-base model used with a Language model layer on top for generation of Summary. 
  model = T5ForConditionalGeneration.from_pretrained(model_params["MODEL"])
  model = model.to(device)


  training_loader, val_loader = createDataLoaders(dataframe, source_text, target_text, tokenizer, model_params)

 

  # defining the Adam optimizer
  optimizer = torch.optim.Adam(params =  model.parameters(), lr=model_params["LEARNING_RATE"])


  # Training loop

  for epoch in range(model_params["TRAIN_EPOCHS"]):
      train(epoch, tokenizer, model, device, training_loader, optimizer)
      

  #Saving the model after training
  path = os.path.join(output_dir, "Finetuned_T5_model")
  model.save_pretrained(path)
  tokenizer.save_pretrained(path)
  console.print(f"""Model saved @ {os.path.join(output_dir, "Finetuned_T5_model")}\n""")

  # evaluating on the validation dataset

  for epoch in range(model_params["VAL_EPOCHS"]):
    predictions, actuals, articles = validate(epoch, tokenizer, model, device, val_loader)
    final_df = pd.DataFrame({'Generated Headline':predictions,'Actual Headline':actuals, 'News Article':articles})
    final_df.to_csv(os.path.join(output_dir,'predictions_t5.csv'))
  console.save_text(os.path.join(output_dir,'logs.txt'))

  
  console.print(f"""Predicted summaries on Validation data saved @ {os.path.join(output_dir,'predictions_t5.csv')}\n""")
  console.print(f"""Logs saved @ {os.path.join(output_dir,'logs.txt')}\n""")
  return final_df

In [15]:
model_params={
    "MODEL":"t5-base",             # model_type: t5-base
    "TRAIN_BATCH_SIZE":1,          # training batch size
    "VALID_BATCH_SIZE":1,          # validation batch size
    "TRAIN_EPOCHS":3,              # number of training epochs
    "VAL_EPOCHS":1,                # number of validation epochs
    "LEARNING_RATE":1e-4,          # learning rate
    "MAX_SOURCE_TEXT_LENGTH":512,  # max length of source text
    "MAX_TARGET_TEXT_LENGTH":50,   # max length of target text
    "SEED": 42                     # set seed for reproducibility 

}

In [16]:
final_df = T5Trainer(dataframe=df[:400], source_text="Article", target_text="Headline", model_params=model_params, output_dir="outputs")



In [17]:
#predicted summaries
final_df.head()

Unnamed: 0,Generated Headline,Actual Headline,News Article
0,Barca lose 2-0 at home to Atletico Madrid in s...,Barcelona title hopes hit by loss,summarize: Barcelona's pursuit of the Spanish ...
1,Raskin dies at Macintosh. head of Macintosh de...,Creator of first Apple Mac dies,"summarize: Jef Raskin, head of the team behind..."
2,Safin battles back to win Australian Open fina...,Safin relieved at Aussie recovery,summarize: Marat Safin admitted he thought he ...
3,worldCom directors pay $54m to settle class ac...,WorldCom bosses' $54m payout,summarize: Ten former directors at WorldCom ha...
4,Chile's copper industry earns record earnings ...,Record year for Chilean copper,summarize: Chile's copper industry has registe...


### Evaluating the performance:

The performance of the model is evaluated by calculating the BLEU and ROUGE scores.



*   BLEU (Bilingual Evaluation Understudy) indicates how similar the candidate text is to the reference texts, with values closer to one representing more similar texts. It is precision-oriented: it measures how many of the words (and longer n-grams) in the machine-generated summaries also appear in the human reference summaries.
*   ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating automatic summarization and machine translation. It is recall-oriented: it measures how many of the words in the human reference summaries also appear in the machine-generated summaries.
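
The precision/recall distinction can be made concrete with a unigram-only sketch. This is not the full BLEU or ROUGE computation (BLEU combines several n-gram orders with a brevity penalty, and ROUGE has several variants); the helper `unigram_overlap` is hypothetical and the candidate headline is made up:

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (precision, recall) of clipped unigram overlap between two strings."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Each candidate word is credited at most as often as it occurs in the reference
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    return precision, recall


# Made-up candidate scored against a short reference headline
precision, recall = unigram_overlap(
    "record year for copper industry in chile",
    "record year for chilean copper",
)
print(precision, recall)
```

Here four words overlap, so precision is 4/7 (share of candidate words found in the reference) and recall is 4/5 (share of reference words found in the candidate).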




In [18]:
!pip install datasets
!pip install rouge_score



In [19]:
from datasets import load_metric
metric = load_metric("rouge")
rouge_score = metric.compute(predictions=final_df['Generated Headline'], references=final_df['Actual Headline'])
print(rouge_score)



{'rouge1': AggregateScore(low=Score(precision=0.23883459509240745, recall=0.37370163690476177, fmeasure=0.28528302929333305), mid=Score(precision=0.2747558517871017, recall=0.4197023809523809, fmeasure=0.32473493461303043), high=Score(precision=0.31417254967254965, recall=0.47140922619047637, fmeasure=0.36808023691242536)), 'rouge2': AggregateScore(low=Score(precision=0.062243144416857674, recall=0.09854166666666668, fmeasure=0.07469778388482433), mid=Score(precision=0.08885492265087855, recall=0.14187500000000003, fmeasure=0.10689571151887328), high=Score(precision=0.11836620206794841, recall=0.1881354166666667, fmeasure=0.14185194365479295)), 'rougeL': AggregateScore(low=Score(precision=0.2195836472555222, recall=0.34488541666666656, fmeasure=0.26432607600303026), mid=Score(precision=0.2550425442612942, recall=0.390550595238095, fmeasure=0.3014207661426119), high=Score(precision=0.29338036616161617, recall=0.4380706845238095, fmeasure=0.3405277870054134)), 'rougeLsum': AggregateScor

In [20]:
!pip install nlp


In [21]:
import nlp
metric = nlp.load_metric('bleu')
bleu_score = metric.compute(predictions=final_df['Generated Headline'], references=final_df['Actual Headline'])
print(bleu_score)




{'bleu': 0.0, 'precisions': [0.2725795217088461, 0.0, 0.0, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 53.8375, 'translation_length': 4307, 'reference_length': 80}
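
A BLEU of 0.0 together with a translation_length of 4307 against a reference_length of 80 (the number of validation rows, 20% of the 400 rows used) suggests that the metric did not receive the headlines as sequences of words. This BLEU implementation appears to expect each prediction as a list of tokens and each reference entry as a list of one or more tokenized references, so raw strings would be read character by character. A minimal sketch of preparing the inputs in that layout, assuming simple whitespace tokenization (the helper `to_bleu_inputs` is hypothetical and the generated headlines are made up):

```python
def to_bleu_inputs(generated, actual):
    """Whitespace-tokenize headlines into a nested-list layout for BLEU.

    predictions: one token list per generated headline
    references:  one list of token lists per generated headline
                 (a single reference each here)
    """
    predictions = [text.split() for text in generated]
    references = [[text.split()] for text in actual]
    return predictions, references


# Made-up generated headlines paired with two actual headlines from the results table
generated = ["Apple Mac creator dies", "Record earnings for Chilean copper"]
actual = ["Creator of first Apple Mac dies", "Record year for Chilean copper"]

predictions, references = to_bleu_inputs(generated, actual)
print(predictions[0])
print(references[0])
```

The resulting lists could then be passed as `metric.compute(predictions=predictions, references=references)`. The score reported above was not recomputed this way.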
