## Importing Libraries and data

In [1]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


## Reference

https://www.geeksforgeeks.org/text-summarizations-using-huggingface-model/?ref=ml_lbp

In [3]:
# !pip install rouge_score
# !pip install --upgrade torch transformers

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=c4817a53bec9ff84732f29cc6e86fa3f06ac713a2a3f3b978b42f862064f6dbb
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [114]:
import numpy as np
import pandas as pd
import torch
from tqdm import tqdm

from transformers import (T5Tokenizer, T5ForConditionalGeneration,
                          BartTokenizer, BartForConditionalGeneration)

from rouge_score import rouge_scorer
from multiprocessing import Pool, cpu_count

import random

In [5]:
# set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

# Clear the CUDA cache
# torch.cuda.empty_cache()

device(type='cpu')

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
df = pd.read_csv("/content/drive/MyDrive/BBCNewsAnalysis/df_cleaned.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,data_path,category,text,clean_text,clean_text_length
0,0,BBCNewsAnalysisTest-main/politics/318.txt,politics,Labour in constituency race row\n\nLabour's ch...,labour in constituency race row labour is choi...,304
1,1,BBCNewsAnalysisTest-main/politics/078.txt,politics,Asylum children to face returns\n\nThe UK gove...,asylum children to face returns the uk governm...,657
2,2,BBCNewsAnalysisTest-main/politics/154.txt,politics,Mayor will not retract Nazi jibe\n\nLondon may...,mayor will not retract nazi jibe london mayor ...,525
3,3,BBCNewsAnalysisTest-main/politics/231.txt,politics,Woolf murder sentence rethink\n\nPlans to give...,woolf murder sentence rethink plans to give mu...,403
4,4,BBCNewsAnalysisTest-main/politics/109.txt,politics,UK firms 'embracing e-commerce'\n\nUK firms ar...,uk firms embracing e commerce uk firms are emb...,312


## Functions

In [95]:
# Load models and tokenizers
def load_t5(device):
    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)
    return tokenizer, model

def load_bart(device):
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").to(device)
    return tokenizer, model


# # Summarize function
# def summarize(text, tokenizer, model, device):
#     inputs = tokenizer.encode(text, return_tensors="pt", max_length=1024, truncation=True).to(device)
#     summary_ids = model.generate(inputs, max_length=100, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
#     summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
#     return summary

# Define batch summarization function
def summarize_batch(texts, tokenizer, model, device, max_length=1024, summary_max_length=100):
    # Tokenize the batch of texts
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length).to(device)

    # Generate summaries for each text in the batch
    summary_ids = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=summary_max_length,
        min_length=40,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True
    )

    # Decode summaries
    summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
    return summaries

# Function to apply batch summarization to a DataFrame
def summarize_dataframe(df, tokenizer, model, device, batch_size=8):
    summaries = []
    for i in range(0, len(df), batch_size):
        batch_texts = df['clean_text'][i:i + batch_size].tolist()
        batch_summaries = summarize_batch(batch_texts, tokenizer, model, device)
        summaries.extend(batch_summaries)
    return summaries


def compute_rouge_f1(df, ref_col_name, gen_col_name):

    scores_all = []
    for i in tqdm(range(len(df))):
        scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        scores = scorer.score(df[ref_col_name][i], df[gen_col_name][i])

        rouge1 = round(scores['rouge1'].fmeasure,2)
        rouge2 = round(scores['rouge2'].fmeasure,2)
        rougeL = round(scores['rougeL'].fmeasure,2)

        scores_all.append([rouge1, rouge2, rougeL])

    return scores_all

## T5-base model

In [None]:
# Clear the CUDA cache
torch.cuda.empty_cache()

In [8]:
# Summarize using T5
t5_tokenizer, t5_model = load_t5(device)

# df['T5_summary'] = df['text'].apply(lambda x: summarize(x, t5_tokenizer, t5_model, device))

df['T5_summary'] = summarize_dataframe(df[['clean_text']], t5_tokenizer, t5_model, device, batch_size=16)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [14]:
# df.to_csv('/content/drive/MyDrive/BBCNewsAnalysis/df_w_T5_summary.csv')

In [8]:
df = pd.read_csv('/content/drive/MyDrive/BBCNewsAnalysis/df_w_T5_summary.csv')

 ## BART-base model

In [22]:
# Clear the CUDA cache
torch.cuda.empty_cache()

In [10]:
# Summarize using BART
bart_tokenizer, bart_model = load_bart(device)

# df['bart_summary'] = df['text'].apply(lambda x: summarize(x, bart_tokenizer, bart_model))

df['bart_summary'] = summarize_dataframe(df[['clean_text']], bart_tokenizer, bart_model, device, batch_size=32)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [18]:
# df.to_csv('/content/drive/MyDrive/BBCNewsAnalysis/df_w_T5_bart_summary.csv')

In [50]:
df = pd.read_csv('/content/drive/MyDrive/BBCNewsAnalysis/df_w_T5_bart_summary.csv')

## Model Evaluation

In [98]:
df[['T5_rouge1', 'T5_rouge2', 'T5_rougeL']] = compute_rouge_f1(df, 'clean_text', 'T5_summary')
df[['bart_rouge1', 'bart_rouge2', 'bart_rougeL']] = compute_rouge_f1(df, 'clean_text', 'bart_summary')

100%|██████████| 2225/2225 [00:46<00:00, 48.11it/s]
100%|██████████| 2225/2225 [00:57<00:00, 38.52it/s]


In [108]:
# df.to_csv('/content/drive/MyDrive/BBCNewsAnalysis/df_w_T5_bart_summary_w_score.csv')

## Mean Rouge scores by news category

In [102]:
df.groupby('category')[['T5_rouge1', 'T5_rouge2', 'T5_rougeL', 'bart_rouge1', 'bart_rouge2', 'bart_rougeL']].mean().round(2).reset_index()

Unnamed: 0,category,T5_rouge1,T5_rouge2,T5_rougeL,bart_rouge1,bart_rouge2,bart_rougeL
0,business,0.31,0.29,0.29,0.41,0.4,0.41
1,entertainment,0.33,0.3,0.31,0.43,0.41,0.42
2,politics,0.24,0.22,0.22,0.34,0.33,0.34
3,sport,0.31,0.28,0.29,0.42,0.4,0.41
4,tech,0.24,0.22,0.23,0.32,0.31,0.31


We can see that mean rouge scores of bart model are comparitively higher than T5 model.

It could also be because the length of generated summaries. Lets check that as well

In [107]:
df['T5_summary_len'] = df['T5_summary'].str.split().apply(len)

df['bart_summary_len'] = df['bart_summary'].str.split().apply(len)

In [110]:
# df.to_csv('/content/drive/MyDrive/BBCNewsAnalysis/df_w_T5_bart_summary_w_score.csv')

In [113]:
df.groupby('category')[['T5_summary_len', 'bart_summary_len']].mean().round(0).reset_index()

Unnamed: 0,category,T5_summary_len,bart_summary_len
0,business,56.0,80.0
1,entertainment,57.0,80.0
2,politics,54.0,82.0
3,sport,54.0,76.0
4,tech,62.0,85.0


We can see that bart summaries are relatively longer than T5 summaries.

Lets randomly check few samples for qualitative analysis

In [122]:
for i in range(1,6):
    idx = random.randint(0,2225)
    print(f'Sample{i}_idx:', idx)
    print('clean_text:\n', df['clean_text'][i])
    print('-'*100)
    print('T5_summary:\n', df['T5_summary'][i])
    print('-'*100)
    print('bart_summary:\n', df['bart_summary'][i])
    print('\n\n')

Sample1_idx: 1305
clean_text:
 asylum children to face returns the uk government is planning to return asylum seeker children without parents to albania. the trial scheme which could start in weeks may be extended to apply to children from other countries. children is charities have reacted with alarm saying the policy amounts to forcible removal and may not guarantee the safety of those affected. but the home office says it may be in the children is best interests if it reunites them with their communities. the pilot included in the government is five year immigration plan aims to return unaccompanied asylum seeking children from albania who have failed in their asylum claims. since 2002 at least 9 000 under 18s have arrived in the uk to seek asylum without other family members. these children automatically become the responsibility of social services. up to now ministers have held back from final removal orders against unaccompanied children until after they are legally adults at 18.

We can see that T5 summaries are more crisp and clear compared to bart summaries.