<a href="https://colab.research.google.com/github/HaywhyCoder/text-summarization-model/blob/main/news_headline_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **News Headline Model**

#### Import Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

from datasets import Dataset, DatasetDict
from evaluate import load
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling, pipeline
from sacrebleu import corpus_bleu
import torch

In [None]:
metric = load('bertscore')

#### Load the Dataset

In [None]:
data = pd.read_csv("/kaggle/input/news-summary/news_summary.csv", encoding='latin-1')
data.head()

Unnamed: 0,author,date,headlines,read_more,text,ctext
0,Chhavi Tyagi,"03 Aug 2017,Thursday",Daman & Diu revokes mandatory Rakshabandhan in...,http://www.hindustantimes.com/india-news/raksh...,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...
1,Daisy Mowke,"03 Aug 2017,Thursday",Malaika slams user who trolled her for 'divorc...,http://www.hindustantimes.com/bollywood/malaik...,Malaika Arora slammed an Instagram user who tr...,"From her special numbers to TV?appearances, Bo..."
2,Arshiya Chopra,"03 Aug 2017,Thursday",'Virgin' now corrected to 'Unmarried' in IGIMS...,http://www.hindustantimes.com/patna/bihar-igim...,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...
3,Sumedha Sehra,"03 Aug 2017,Thursday",Aaj aapne pakad liya: LeT man Dujana before be...,http://indiatoday.intoday.in/story/abu-dujana-...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4,Aarushi Maheshwari,"03 Aug 2017,Thursday",Hotel staff to get training to spot signs of s...,http://indiatoday.intoday.in/story/sex-traffic...,Hotels in Maharashtra will train their staff t...,Hotels in Mumbai and other Indian cities are t...


In [None]:
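# Keep only the article text and its headline; copy so the later column assignment does not act on a view
data = data[['text', 'headlines']].copy()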

In [None]:
data.head()

Unnamed: 0,text,headlines
0,The Administration of Union Territory Daman an...,Daman & Diu revokes mandatory Rakshabandhan in...
1,Malaika Arora slammed an Instagram user who tr...,Malaika slams user who trolled her for 'divorc...
2,The Indira Gandhi Institute of Medical Science...,'Virgin' now corrected to 'Unmarried' in IGIMS...
3,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,Aaj aapne pakad liya: LeT man Dujana before be...
4,Hotels in Maharashtra will train their staff t...,Hotel staff to get training to spot signs of s...


In [None]:
data['text'] = data['text'].map(lambda x: x + "\nTL;DR:")
data['text'][5]

'A 32-year-old man on Wednesday was found hanging inside the washroom of a Delhi police station after he was called for interrogation. His family alleged that he could have been emotionally and physically tortured. Police said the man was named as a suspect in the kidnapping case of a married woman with whom he had been in a relationship earlier.\nTL;DR:'
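
The `TL;DR:` suffix turns each article into a prompt that a causal language model is expected to continue with a summary. A minimal sketch of that prompt format follows; `PROMPT_SUFFIX`, `build_prompt`, and `strip_prompt` are illustrative names that do not appear elsewhere in this notebook.

```python
# Hypothetical helpers mirroring the lambda above; the names are illustrative.
PROMPT_SUFFIX = "\nTL;DR:"

def build_prompt(article: str) -> str:
    """Append the summarization cue so the model continues with a summary."""
    return article + PROMPT_SUFFIX

def strip_prompt(generated: str, prompt: str) -> str:
    """Return only the text produced after the prompt, if the prompt is present."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    return generated.strip()

prompt = build_prompt("Hotels in Maharashtra will train their staff.")
print(repr(prompt))
print(strip_prompt(prompt + " Hotel staff to get training", prompt))
```

Keeping the suffix in one constant makes it easier to use exactly the same cue at training and at generation time.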

#### Prepare Dataset

In [None]:
# Work on a 300-row sample, split 80/20 into train/test, then 80/20 again into train/eval
sample_data = data.sample(n=300, random_state=16, ignore_index=True)
train, test = train_test_split(sample_data, test_size=.2, random_state=42)
train, eval_df = train_test_split(train, test_size=.2, random_state=42)  # eval_df avoids shadowing the built-in eval()

datasets = DatasetDict({
    'train': Dataset.from_pandas(train, preserve_index=False),
    'eval': Dataset.from_pandas(eval_df, preserve_index=False),
    'test': Dataset.from_pandas(test, preserve_index=False)
})
datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'headlines'],
        num_rows: 192
    })
    eval: Dataset({
        features: ['text', 'headlines'],
        num_rows: 48
    })
    test: Dataset({
        features: ['text', 'headlines'],
        num_rows: 60
    })
})
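
The row counts above follow from applying a 20% hold-out twice. The sketch below reproduces the arithmetic, assuming scikit-learn sizes a float `test_size` by rounding the test share up and giving the remainder to the training split; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size (assumed rounding rule)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# First split: 300 sampled rows into train+eval and test
n_rest, n_test = split_sizes(300, 0.2)
# Second split: the remaining rows into train and eval
n_train, n_eval = split_sizes(n_rest, 0.2)

print(n_train, n_eval, n_test)  # 192 48 60
```

So the effective proportions are 64% train, 16% eval, and 20% test.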

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

In [None]:
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

In [None]:
def preprocess_function(examples):
  # Prompts already end with "\nTL;DR:"; pad or truncate them to 128 tokens
  model_inputs = tokenizer(examples['text'], max_length=128, truncation=True, padding='max_length')
  # Headlines are tokenized separately and capped at 32 tokens
  labels = tokenizer(text_target=examples['headlines'], max_length=32, truncation=True, padding='max_length')

  model_inputs['labels'] = labels['input_ids']
  return model_inputs

# Tokenize every split and drop the raw text columns
tokenized_datasets = datasets.map(preprocess_function, batched=True, remove_columns=datasets['train'].column_names)

In [None]:
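# With mlm=False the collator rebuilds `labels` as a copy of `input_ids` (pad positions set to -100),
# so the training signal is next-token prediction over the prompt text itself
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)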
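
For a decoder-only model such as GPT-2, a common way to teach the prompt-to-headline mapping is to place the headline in the same sequence as the prompt and mask the prompt positions in the labels. The sketch below shows the idea on plain lists of token ids. `build_causal_example`, the toy ids, and the decision to mask the prompt are assumptions for illustration, not the preprocessing used in this notebook; `-100` is the label value that PyTorch's cross-entropy loss ignores by default, and `50256` is GPT-2's end-of-text id.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def build_causal_example(prompt_ids, target_ids, eos_id, max_length):
    """Concatenate prompt + target + EOS and mask the prompt in the labels."""
    input_ids = (prompt_ids + target_ids + [eos_id])[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + target_ids + [eos_id])[:max_length]
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Toy ids standing in for a tokenized prompt and headline
example = build_causal_example(prompt_ids=[11, 12, 13], target_ids=[21, 22], eos_id=50256, max_length=8)
print(example["input_ids"])  # [11, 12, 13, 21, 22, 50256]
print(example["labels"])     # [-100, -100, -100, 21, 22, 50256]
```

Examples built this way would need a collator that leaves the `labels` field untouched, since a language-modeling collator typically derives labels from `input_ids`.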

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy='epoch',
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=10,
    weight_decay=.01,
    save_total_limit=1,
    logging_dir='./logs',
    logging_steps=10,
    report_to='none'
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['eval'],
    processing_class=tokenizer,  # newer transformers releases deprecate the tokenizer= argument in favor of this one
    data_collator=data_collator,
)

trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Epoch,Training Loss,Validation Loss
1,1.5606,3.098544
2,1.5669,3.091365
3,1.4122,3.103242
4,1.2404,3.135389
5,1.2106,3.174445
6,1.065,3.199709
7,0.9896,3.239661
8,1.0384,3.251663
9,1.0331,3.268338
10,1.052,3.278509




TrainOutput(global_step=480, training_loss=1.2203150729338328, metrics={'train_runtime': 74.191, 'train_samples_per_second': 25.879, 'train_steps_per_second': 6.47, 'total_flos': 125420175360000.0, 'train_loss': 1.2203150729338328, 'epoch': 10.0})
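
The log above shows training loss falling while validation loss rises after the second epoch, which is the usual sign of overfitting on a small sample. A short sketch that picks the epoch with the lowest validation loss from the logged values (the `history` list is copied by hand from the table):

```python
# (epoch, training loss, validation loss) copied from the log above
history = [
    (1, 1.5606, 3.098544), (2, 1.5669, 3.091365), (3, 1.4122, 3.103242),
    (4, 1.2404, 3.135389), (5, 1.2106, 3.174445), (6, 1.0650, 3.199709),
    (7, 0.9896, 3.239661), (8, 1.0384, 3.251663), (9, 1.0331, 3.268338),
    (10, 1.0520, 3.278509),
]

# The checkpoint worth keeping is the one with the lowest validation loss
best_epoch, _, best_val = min(history, key=lambda row: row[2])
print(best_epoch, best_val)  # 2 3.091365
```

In practice, options such as `load_best_model_at_end` in `TrainingArguments` or an early-stopping callback can automate this choice; whether they help here would need to be checked on a rerun.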

In [None]:
sample = datasets['test'][10]

# Detect the device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the detected device
model = model.to(device)

# Tokenize the input text and move tensors to the same device
inputs = tokenizer(
    sample['text'],
    return_tensors="pt",
    max_length=128,
    truncation=True,
    padding=True  # Ensures padding is applied
).to(device)

input_ids = inputs['input_ids']
att_mask = inputs['attention_mask']

model.eval()
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=att_mask,
    max_new_tokens=15,
    min_length=5,  # Counts prompt tokens too, so it does not constrain the summary here (min_new_tokens would)
    length_penalty=-3.0,  # Negative values favor shorter beams
    num_beams=4,  # Use beam search for better results
    early_stopping=True  # Stop as soon as num_beams finished candidates exist
)

# Drop the prompt tokens and decode only the newly generated continuation
prompt_len = input_ids.shape[1]
summary = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
target = sample['headlines']  # the reference headline is already plain text

print("Summary: ", summary, '\n', "Headline: ", target)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Summary:   A man was lynched by a mob outside Jamia Masjid in 
 Headline:  Police officer lynched in Srinagar outside a mosque
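
The effect of `length_penalty` in beam search can be checked with a small worked example. The sketch assumes the usual scoring rule, in which a finished beam's summed log-probability is divided by its length raised to `length_penalty`; `beam_score` and the two toy beams are hypothetical and only illustrate the sign convention.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score (assumed rule: sum_logprobs / length ** penalty)."""
    return sum_logprobs / (length ** length_penalty)

# Two toy finished beams: (summed log-probability, number of generated tokens)
beams = {"short": (-4.0, 5), "long": (-6.0, 10)}

def winner(length_penalty: float) -> str:
    """Name of the beam with the highest score under the given penalty."""
    return max(beams, key=lambda name: beam_score(*beams[name], length_penalty))

print(winner(-3.0))  # short
print(winner(3.0))   # long
```

Because log-probabilities are negative, a negative penalty makes long beams score much worse, while a positive penalty shrinks their scores toward zero and so favors them.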


In [None]:
model.eval()

outputs = model.generate(
    # These tensors were right-padded to 128 tokens during preprocessing
    input_ids=torch.tensor(tokenized_datasets['test']['input_ids']).to(device),
    attention_mask=torch.tensor(tokenized_datasets['test']['attention_mask']).to(device),
    max_new_tokens=15,
    min_length=5,  # Counts prompt tokens too, so it does not constrain the summary here (min_new_tokens would)
    length_penalty=3.0,  # Positive values favor longer beams; use a negative value to favor shorter ones
    num_beams=4,  # Use beam search for better results
    no_repeat_ngram_size=2,  # Each bigram may occur only once in a sequence
    early_stopping=True  # Stop as soon as num_beams finished candidates exist
)

inputs = tokenized_datasets['test']['input_ids']
summaries = []
targets = []
for idx, output in enumerate(outputs):
    summaries.append(tokenizer.decode(output[len(inputs[idx]):], skip_special_tokens=True))
    targets.append(tokenizer.decode(tokenized_datasets['test']['labels'][idx], skip_special_tokens=True))

df = pd.DataFrame({"Summary": summaries, "Headlines": targets})
df.head()

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Unnamed: 0,Summary,Headlines
0,The accused received an e-mail from an unident...,Man arrested for receiving 'anti-national' Wha...
1,The central government is refusing to reveal t...,Disclose Godse's statement in Gandhi murder tr...
2,Javier Hernandez has been arrested for attacki...,Atlético Madrid defender arrested on assault s...
3,Jammu & Kashmir Congress wants CBI to probe BJ...,Cong complains to EC against BJP on Sharmila's...
4,"The airline is looking to hire up to 50,000 pe...","Air India to serve wine, retrain chefs for 'In..."
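
The padding warning above matters because a decoder-only model appends new tokens after the last position of each row, so the last position should hold a real token rather than padding. The sketch below shows left padding on plain token-id lists; `pad_batch` is a hypothetical helper. With a Hugging Face tokenizer, the equivalent step is to set `tokenizer.padding_side = 'left'` and tokenize the test prompts again, as the warning suggests.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (width - len(seq))
        mask_pad = [0] * len(pad)
        if side == "left":
            input_ids.append(pad + seq)
            attention_mask.append(mask_pad + [1] * len(seq))
        else:
            input_ids.append(seq + pad)
            attention_mask.append([1] * len(seq) + mask_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[11, 12, 13], [21]], pad_id=50256, side="left")
print(ids)   # [[11, 12, 13], [50256, 50256, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With left padding every row ends at the same column, so slicing the generated tokens at the padded prompt width still removes exactly the prompt.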


In [None]:
for i in range(5):
    print("summary: ", df['Summary'][i])
    print("headline: ", df['Headlines'][i], "\n")

summary:  The accused received an e-mail from an unidentified person stating that he was
headline:  Man arrested for receiving 'anti-national' WhatsApp message 

summary:  The central government is refusing to reveal the identity of the accused. 

headline:  Disclose Godse's statement in Gandhi murder trial: CIC 

summary:  Javier Hernandez has been arrested for attacking girlfriend of ex-MLS player
headline:  Atlético Madrid defender arrested on assault suspicion 

summary:  Jammu & Kashmir Congress wants CBI to probe BJP's offer of ?37
headline:  Cong complains to EC against BJP on Sharmila's claim 

summary:  The airline is looking to hire up to 50,000 people by 2020.
headline:  Air India to serve wine, retrain chefs for 'Indian touch' 



In [None]:
from statistics import mean

# Calculate BLEU score; sacrebleu expects a list of reference streams and reports on a 0-100 scale
bleu = corpus_bleu(summaries, [targets]).score

# Calculate BERTScore
bert_score = metric.compute(predictions=summaries, references=targets, model_type='distilbert-base-uncased')  # use distilbert for semantic analysis
print(f"Precision: {mean(bert_score['precision']):.4f} Recall: {mean(bert_score['recall']):.4f} F1: {mean(bert_score['f1']):.4f} bleu: {bleu:.4f}")

Precision: 0.7092 Recall: 0.7058 F1: 0.7073 bleu: 0.1471


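On the test set, the model reaches an average BERTScore F1 of 0.71 (precision 0.71, recall 0.71, computed with `distilbert-base-uncased`), which suggests that the generated summaries are semantically related to the target headlines. The low BLEU score points to little exact n-gram overlap with the references, and the samples printed above read more like continuations of the article than like headlines, so the BERTScore alone should not be taken as evidence of headline quality.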
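
To make the BERTScore figures easier to interpret, the sketch below computes precision, recall, and F1 by greedy matching over a token-similarity matrix, which is the core of the metric. The matrix values are made up, `greedy_match_scores` is a hypothetical helper, and refinements such as idf weighting or baseline rescaling are left out.

```python
def greedy_match_scores(sim):
    """sim[i][j]: cosine similarity between candidate token i and reference token j."""
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(row) for row in sim) / len(sim)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(col) for col in zip(*sim)) / len(sim[0])
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy similarities for a 2-token candidate and a 3-token reference
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.4],
]
p, r, f = greedy_match_scores(sim)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.85 0.7 0.7677
```

The averages printed by the notebook are means over per-example scores, so the reported F1 is not necessarily the harmonic mean of the reported precision and recall.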