#### Transformer

Data source:

https://data.world/opensnippets/cnn-news-dataset


Help and resources:

https://huggingface.co/docs/transformers/en/model_doc/bart

https://towardsdatascience.com/how-to-evaluate-text-generation-models-metrics-for-automatic-evaluation-of-nlp-models-e1c251b04ec1

https://huggingface.co/docs/transformers/notebooks




In [1]:
from sklearn.model_selection import train_test_split
import json
import pandas as pd
from fastai.text.all import *
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig, Seq2SeqTrainingArguments, Seq2SeqTrainer
import datasets
from datasets import Dataset, DatasetDict
import torch
import nltk
nltk.download('punkt', quiet=True)


True

In [2]:
# Load the dataset
data = "cnn_data.json"
df = pd.read_json(data)

# Check the keys in the dataset
print(df.keys())

Index(['title', 'url', 'published_at', 'last_modified_at', 'author',
       'short_description', 'header_image', 'raw_content', 'content',
       'crawled_at', '_id', 'source'],
      dtype='object')


In [3]:
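# Keep only the article text ('content') and its reference summary ('short_description'),
# then drop rows with missing values. Reassigning the result, rather than calling
# dropna(inplace=True) on a slice, avoids pandas' SettingWithCopyWarning.
df = df[['content', 'short_description']].dropna()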

In [4]:
df.head()

Unnamed: 0,content,short_description
0,"Fleetwood and Hoey lead big names in Dunhill LinksBy Updated 2019 GMT (0419 HKT) September 30, 2011 England's Tommy Fleetwood is all smiles after completing a nine-under 63 at KingsbarnsStory highlightsTommy Fleetwood and Michael Hoey share lead at Alfred Dunhill Links Championship in Scotland They are on 12-under 132 with England's Fleetwood carding a second round 63 at Kingsbarns 2010 British Open champion Louis Oosthuizen leads chasers on 11-under Martin Kaymer, Lee Westwood and Rory McIlroy all within five shots of the lead after two roundsUnheralded pair Tommy Fleetwood of England...",Unheralded pair Tommy Fleetwood of England and Northern Ireland's Michael Hoey share the lead at the halfway stage of the Alfred Dunhill Links Championship.
1,"Senna to replace Heidfeld in BelgiumBy Updated 1649 GMT (0049 HKT) August 26, 2011 Bruno Senna is the nephew of Ayrton Senna, the three-time world champion who died in 1994.Story highlightsBruno Senna will replace Nick Heidfeld for Renault at the Belgian Grand PrixBruno is the nephew of late Formula legend Ayrton SennaThe Brazilian has not raced since the final grand prix of the 2010 seasonBrazilian driver Bruno Senna will replace Nick Heidfeld for Renault at this weekend's Formula One grand prix in Belgium.Senna, the nephew of three-time world champion Ayrton Senna, joined the British-bas...",Brazilian driver Bruno Senna will replace Nick Heidfeld for Renault at this weekend's Formula One grand prix in Belgium.
2,"Zvonareva beats Kvitova to reach Tokyo finalBy Updated 2022 GMT (0422 HKT) September 30, 2011 Vera Zvonareva powers a shot during her straight sets win over Petra Kvitova in TokyoStory highlightsVera Zvonareva sees off Wimbledon champion Petra Kvitova in straight sets in Tokyo semis Russian comes from 5-1 down in opening set to win it on a tiebreaker Maria Sharapova pulls out of China Open with ankle injuryAndy Murray into the semifinals of ATP tournament in ThailandRussian Vera Zvonareva came from 5-1 down in the opening set to beat Wimbledon champion Petra Kvitova 7-6 6-0 in the semifin...",Russian Vera Zvonareva came from 5-1 down in the opening set to beat Wimbledon champion Petra Kvitova 7-6 6-0 in the semifinals of the WTA tournament in Tokyo.
3,"Country profile: MacedoniaBy Catriona Davies and Eoghan Macguire for CNNUpdated 0434 GMT (1234 HKT) September 30, 2011 Photos: Ancient ruins – The famous mosaics at the ancient Roman archeological site of Stobi, in southeast Macedonia. One of the country's many ancient relics.Hide Caption 1 of 8 Photos: Lake Ohrid – Two people fish on a boat on Macedonia's Lake Ohrid, one of the deepest and oldest freshwater lakes in Europe.Hide Caption 2 of 8 Photos: Muslim community – Muslim craftsmen perform their prayers in an alley in the Old Bazaar in Skopje. Accordinng to the CIA World Factbook a t...","Macedonia is a small landlocked country bordering Albania, Bulgaria, Greece, Kosovo and Serbia with aspirations of joining the European Union."
4,"Manchester City ban talk of TevezBy Updated 1655 GMT (0055 HKT) September 30, 2011 Carlos Tevez was Manchester City's top goalscorer last year as they won the FA Cup and reached the Champions League.Story highlightsManchester City ban questions on Carlos Tevez at Friday press conference Tevez is suspended after allegedly refusing to come on as substitute against Bayern MunichBayern Munich won Champions League group match 2-0 City turn down approach from West Ham to take Tevez on loanManchester City banned journalists from asking manager Roberto Mancini about striker Carlos Tevez Friday,...","Manchester City banned journalists from asking manager Roberto Mancini about striker Carlos Tevez Friday, as the fallout from the Argentine's apparent refusal to come on as a substitute continued."


In [5]:
# Reduce the size of the data to a 10% sample.
# This is not good practice: less training data only makes the model worse.
# However, fine-tuning BART takes a long time, and this is the only way the run
# could complete before my computer melted.
sampled_df = df.sample(frac=0.1, random_state=37)

# Split the data into train-test sets (90-10 split)
train_df, test_df = train_test_split(sampled_df, test_size=0.1, random_state=37)

print(len(train_df))
print(len(test_df))

315
35


In [6]:
# Tokenizer? I hardly know 'er!
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')


In [7]:
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

In [8]:
# Tokenize the input data
def tokenize_data(example):
    # Tokenize the input document
    inputs = tokenizer(example['content'], padding='max_length', truncation=True, max_length=512)
    # Tokenize the target summary
    targets = tokenizer(example['short_description'], padding='max_length', truncation=True, max_length=128)
    
    # Attach the tokenized summary as the labels the model is trained to predict
    inputs['labels'] = targets['input_ids']
    
    return inputs
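
A detail worth noting about the labels built above: because the summaries are padded to `max_length`, padding token ids end up inside `labels` and are scored like real tokens. The cross-entropy loss used by Hugging Face seq2seq models ignores positions whose label is `-100`, so a common refinement is to replace padding ids with that value. The sketch below shows the idea on plain lists. `mask_pad_tokens` is a hypothetical helper that is not part of the notebook, and the token ids are made up; the pad id is passed in explicitly (in practice it would come from `tokenizer.pad_token_id`).

```python
# Label value that PyTorch's cross-entropy loss skips by default
IGNORE_INDEX = -100

def mask_pad_tokens(label_batch, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss.

    Hypothetical helper for illustration; label_batch is a list of lists of token ids.
    """
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in labels]
        for labels in label_batch
    ]

# Two made-up label sequences, padded with id 1 to a common length
batch = [[0, 713, 16, 2, 1, 1], [0, 100, 2, 1, 1, 1]]
masked = mask_pad_tokens(batch, pad_token_id=1)
print(masked)
```

Inside `tokenize_data`, the same transformation could be applied to `targets['input_ids']` before assigning it to `inputs['labels']`. Doing so would change the reported loss values, since padding positions would no longer be counted.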

In [9]:

# Apply tokenization to the entire dataset
train_dataset = train_dataset.map(tokenize_data, batched=True)
test_dataset = test_dataset.map(tokenize_data, batched=True)

# Print the first example to check tokenization
print(train_dataset[0])



{'content': 'Police to investigate \'racist\' referee in Chelsea caseBy Updated 1811 GMT (0211 HKT) October 30, 2012 London\'s Metropolitan Police force have launched an investigation into the alleged comments made by referee Mark Clattenburg. Story highlightsPolice will investigate allegations of racial abuse by Premier League referee Mark ClattenburgClattenburg was in charge of Chelsea\'s 3-2 home defeat by Manchester United on SundaySociety of Black Lawyers referred complaint to the policeSeparate Football Association inquiry will run concurrentlyThe racism row involving a Premier League referee could dominate English football headlines for some months, after police announced on Tuesday that they had launched an investigation into the incident. Mark Clattenburg, who took charge of Chelsea\'s controversial 3-2 home defeat by Manchester United on Sunday, is alleged to have made "inappropriate" comments to two Chelsea players, one of which is claimed to have had a racial nature. The fo

In [10]:
# Fine-tune model
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
# Settings chosen to make training faster, at some cost to model quality.
# Note: this run has only 158 optimizer steps in total, so the 1000- and 2000-step
# intervals below never trigger during training, and the 200-step warmup never completes.
training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=4,  # Larger batches mean fewer steps per epoch
    per_device_eval_batch_size=4,   # Same batch size for evaluation
    output_dir='./results',
    logging_dir='./logs',
    num_train_epochs=2,             # Fewer epochs shorten training
    logging_steps=1000,             # Log the training loss every 1000 steps
    evaluation_strategy='steps',
    eval_steps=2000,                # Evaluate every 2000 steps
    save_steps=2000,                # Save a checkpoint every 2000 steps
    warmup_steps=200,               # Learning-rate warmup over the first 200 steps
    weight_decay=0.01,              # Keep this value if you want
    predict_with_generate=True
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()






{'train_runtime': 7325.7753, 'train_samples_per_second': 0.086, 'train_steps_per_second': 0.022, 'train_loss': 1.984714363194719, 'epoch': 2.0}


TrainOutput(global_step=158, training_loss=1.984714363194719, metrics={'train_runtime': 7325.7753, 'train_samples_per_second': 0.086, 'train_steps_per_second': 0.022, 'train_loss': 1.984714363194719, 'epoch': 2.0})
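
The `global_step=158` in the output follows directly from the dataset size and the training arguments. The sketch below reproduces that arithmetic, assuming a single device and no gradient accumulation (the defaults, which the notebook does not override). `total_optimizer_steps` is a hypothetical helper for illustration only.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps, assuming one device and no gradient accumulation."""
    # The last batch of an epoch may be smaller, hence the ceiling
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Values taken from the notebook: 315 training rows, batch size 4, 2 epochs
steps = total_optimizer_steps(num_examples=315, batch_size=4, num_epochs=2)
print(steps)

# Step intervals larger than the total never fire during training
for name, interval in [("logging_steps", 1000), ("eval_steps", 2000), ("save_steps", 2000)]:
    print(name, "triggers during training:", interval <= steps)
```

With only 158 steps, intervals of a few dozen steps would be needed to see intermediate losses or evaluations for a run of this size.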

In [11]:
# Evaluate model
results = trainer.evaluate()
print(results)


{'eval_loss': 0.15529805421829224, 'eval_runtime': 145.2126, 'eval_samples_per_second': 0.241, 'eval_steps_per_second': 0.062, 'epoch': 2.0}


In [None]:
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu, sentence_bleu

def bleu(ref, gen):
    '''
    Calculate a corpus-level BLEU score over paired reference and generated
    sentences. Uses the NLTK implementation.
    Args:
        ref : a list of reference sentences
        gen : a list of candidate (generated) sentences, aligned with ref
    Returns:
        bleu score (float)
    '''
    # corpus_bleu expects, for each candidate, a list of tokenized references
    ref_bleu = [[l.split()] for l in ref]
    gen_bleu = [l.split() for l in gen]
    cc = SmoothingFunction()
    # weights=(0, 1, 0, 0) scores bigram precision only
    score_bleu = corpus_bleu(ref_bleu, gen_bleu, weights=(0, 1, 0, 0), smoothing_function=cc.method4)
    return score_bleu
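
To make the `weights=(0, 1, 0, 0)` choice concrete, the sketch below computes clipped bigram precision by hand, which is the quantity that weighting selects. It is a simplified illustration, not NLTK's implementation: it handles a single sentence pair and leaves out the brevity penalty and smoothing that `corpus_bleu` applies. `bigram_precision` and the example sentences are hypothetical.

```python
from collections import Counter

def bigram_precision(reference, candidate):
    """Clipped bigram precision for one reference/candidate pair (no brevity penalty)."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    ref_counts = Counter(zip(ref_tokens, ref_tokens[1:]))
    cand_counts = Counter(zip(cand_tokens, cand_tokens[1:]))
    if not cand_counts:
        return 0.0
    # Each candidate bigram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[bigram]) for bigram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

reference = "the cat sat on the mat"
candidate = "the cat sat on a mat"
score = bigram_precision(reference, candidate)
print(score)
```

Three of the candidate's five bigrams appear in the reference, so the precision is 3/5.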

In [None]:

# ROUGE scores for a reference/generated sentence pair
# Source: Google seq2seq source code.

import itertools

#supporting function
def _split_into_words(sentences):
  """Splits multiple sentences into words and flattens the result"""
  return list(itertools.chain(*[_.split(" ") for _ in sentences]))

#supporting function
def _get_word_ngrams(n, sentences):
  """Calculates word n-grams for multiple sentences.
  """
  assert len(sentences) > 0
  assert n > 0

  words = _split_into_words(sentences)
  return _get_ngrams(n, words)

#supporting function
def _get_ngrams(n, text):
  """Calculates n-grams.
  Args:
    n: which n-grams to calculate
    text: An array of tokens
  Returns:
    A set of n-grams
  """
  ngram_set = set()
  text_length = len(text)
  max_index_ngram_start = text_length - n
  for i in range(max_index_ngram_start + 1):
    ngram_set.add(tuple(text[i:i + n]))
  return ngram_set

def rouge_n(reference_sentences, evaluated_sentences, n=2):
  """
  Computes ROUGE-N of two text collections of sentences.
  Source: http://research.microsoft.com/en-us/um/people/cyl/download/
  papers/rouge-working-note-v1.3.1.pdf
  Args:
    evaluated_sentences: The sentences that have been picked by the summarizer
    reference_sentences: The sentences from the reference set
    n: Size of ngram.  Defaults to 2.
  Returns:
    ROUGE-N recall score (float)
  Raises:
    ValueError: raises exception if a param has len <= 0
  """
  if len(evaluated_sentences) <= 0 or len(reference_sentences) <= 0:
    raise ValueError("Collections must contain at least 1 sentence.")

  evaluated_ngrams = _get_word_ngrams(n, evaluated_sentences)
  reference_ngrams = _get_word_ngrams(n, reference_sentences)
  reference_count = len(reference_ngrams)
  evaluated_count = len(evaluated_ngrams)

  # Gets the overlapping ngrams between evaluated and reference
  overlapping_ngrams = evaluated_ngrams.intersection(reference_ngrams)
  overlapping_count = len(overlapping_ngrams)

  # Handle edge case. This isn't mathematically correct, but it's good enough
  if evaluated_count == 0:
    precision = 0.0
  else:
    precision = overlapping_count / evaluated_count

  if reference_count == 0:
    recall = 0.0
  else:
    recall = overlapping_count / reference_count

  f1_score = 2.0 * ((precision * recall) / (precision + recall + 1e-8))

  # Precision and F1 are computed above, but only recall is returned,
  # which is all that is needed for our purpose
  return recall
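
A small worked example may help show how recall differs from precision here. The sketch below is a compact, standalone variant of the set-based logic above for a single sentence pair, returning all three quantities instead of recall alone. `ngram_set`, `rouge_n_scores`, and the example sentences are hypothetical, and it splits on any whitespace rather than on single spaces.

```python
def ngram_set(text, n):
    """Set of word n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n_scores(reference, generated, n=2):
    """Set-based ROUGE-N precision, recall and F1 for one sentence pair."""
    ref = ngram_set(reference, n)
    gen = ngram_set(generated, n)
    overlap = len(ref & gen)
    precision = overlap / len(gen) if gen else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "police launch investigation into referee comments"
generated = "police launch investigation"
precision, recall, f1 = rouge_n_scores(reference, generated)
print(precision, recall, f1)
```

Every bigram in the short generated summary appears in the reference, so precision is 1.0, but only two of the five reference bigrams are covered, so recall is 0.4. A recall-only score therefore rewards longer summaries that cover more of the reference.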

In [None]:
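# Generate summaries for 5 articles
# and collect the 5 real summaries to compare them against

realSummaries = []

generatedSummaries = []


for i in range(5):

    # Add summary to list (.iloc is positional, so rows removed by dropna don't matter)
    realSummaries.append(df['short_description'].iloc[i])

    # Tokenize input text, truncating to the same 512-token limit used for training
    # (without truncation, long articles can exceed BART's maximum input length)
    input_ids = tokenizer(df['content'].iloc[i], return_tensors="pt", truncation=True, max_length=512).input_ids

    # Generate output on whichever device the model is on
    output_ids = model.generate(input_ids.to(model.device))

    # Decode output
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Add generated summary to list
    generatedSummaries.append(output_text)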
