# **THExtended: Transformer-based Highlights Extraction for News Summarization**

---


**Authors:** Alessio Paone, Flavio Spuri, Luca Agnese, Luca Zilli

# Notebook to use the **THExtended** summarization method

This **demo** lets you test our methodology for extractive summarization on news articles.
To set up the environment, first run the ***Prepare the Repository and Dependencies*** code cells. They load all the dependencies and resources that the following sections require.

A brief recap of the project's key definitions:

* **Summarization** has received considerable attention in the field of Natural Language Processing (NLP) in recent years. It has been widely applied in various domains, including news summarization and extracting important sections from scientific papers, such as highlights.
* **Extractive summarization** is a specific approach to summarization where the goal is to select and extract the most important sentences or phrases from a given document or text corpus.
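
To make the definition above concrete, here is a minimal sketch of extractive summarization on a toy text. It is not the THExtended method: sentences are scored with a simple word-frequency heuristic instead of a fine-tuned transformer, and the names `split_sentences`, `tokenize`, and `extract_top_k` are hypothetical helpers introduced only for this illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace (illustration only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def extract_top_k(text, k=2):
    """Return the k highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    word_freq = Counter(tok for s in sentences for tok in tokenize(s))

    def score(sentence):
        toks = tokenize(sentence)
        # Average document frequency of the sentence's words (0 for empty sentences)
        return sum(word_freq[t] for t in toks) / len(toks) if toks else 0.0

    # Rank sentence indices by score, highest first; ties keep the earlier sentence
    ranked = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]


# Toy article (assumption: invented text, not taken from the dataset)
toy_article = (
    "The city council approved the new park budget. "
    "The park budget funds new trees and paths. "
    "Weather was mild on Tuesday. "
    "Residents praised the council for the park plan."
)
summary = extract_top_k(toy_article, k=2)
for sentence in summary:
    print(sentence)
```

The key property is that every output sentence is copied unchanged from the input; only the scoring function differs between this heuristic and a learned model.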

## **Prepare the Repository and Dependencies**

This section downloads the repository, all the necessary dependencies, and the dataset samples from [HuggingFace](https://huggingface.co/datasets/cnn_dailymail).

In [1]:
%%capture
# Download the repository from GitHub (the cell magic must be the first line of the cell)
!git clone https://github.com/Raffix-14/THExtended.git

In [2]:
%%capture
# Install all the required dependencies (the cell magic must be the first line of the cell)
!pip install rouge
!pip install transformers datasets accelerate nvidia-ml-py3 sentencepiece evaluate
!pip install bert_score
!pip install spacy
!python -m spacy download en_core_web_lg
!pip install -U sentence-transformers

In [3]:
# Load the test split of our preprocessed dataset
from datasets import load_dataset

dataset_name = "Raffix/cnndm_10k_semantic_rouge_labels"
dataset = load_dataset(dataset_name)['test']

Output (progress bars omitted): the dataset is downloaded as parquet files and cached under /root/.cache/huggingface/datasets, so subsequent calls reuse this data. The generated splits contain 382188 (train), 51189 (validation), and 50920 (test) examples.

## **Run the highlights extraction**

This section extracts the **highlights** from an example article contained in **CNN/DailyMail**.

The following cells, in order:

1.   Set up the model and load the **fine-tuned weights**
2.   Load a single article from our **cleaned dataset**
3.   Show the final **comparison** between the extracted sentences and the article

**Note:** Keep in mind that the dataset used in our work (and in this notebook) predominantly contains **abstractive** highlights, so the extracted sentences may not match the reference highlights word for word.
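
To illustrate the note above, the following sketch shows why an abstractive highlight usually cannot be found verbatim in the article, and how a word-overlap score can still point to the closest article sentence. This is only an illustration: `unigram_f1` is a simplified ROUGE-1-style measure written here by hand, `closest_sentence` is a hypothetical helper, and the toy sentences are invented. It does not reproduce how the labels of our dataset were computed.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_f1(candidate, reference):
    """ROUGE-1-style F1: harmonic mean of unigram precision and recall."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def closest_sentence(highlight, sentences):
    """Return (sentence, score) for the sentence that overlaps most with the highlight."""
    scored = [(unigram_f1(s, highlight), s) for s in sentences]
    best_score, best_sentence = max(scored, key=lambda pair: pair[0])
    return best_sentence, best_score


# Toy data (assumption: invented for this example)
toy_sentences = [
    "The council voted on Monday to approve a budget for the new park.",
    "Local shops reported steady sales during the weekend.",
]
toy_highlight = "Council approves budget for new park."

# The abstractive highlight is a rewrite, so an exact match fails
is_verbatim = toy_highlight in toy_sentences
best_sentence, best_score = closest_sentence(toy_highlight, toy_sentences)
print(is_verbatim, round(best_score, 3), best_sentence)
```

An extractive model can therefore only approximate such highlights, which is worth remembering when reading the comparison at the end of this notebook.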


In [4]:
%%capture
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sentence_transformers import SentenceTransformer
import torch
import math
import spacy

# Set up the model and the tokenizer (use our best fine-tuned weights) and prepare the model for the test phase
model_name = "Raffix/THExtended_alpha_05"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
similarity_model = SentenceTransformer("all-MiniLM-L6-v2")
# Use the first GPU when available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # Inference mode (disables dropout)

In [5]:
# Select the i-th article (0-based) contained in the test split of the dataset

# Recreate the original article following the original function (maintaining the sentence split)
i = 5
current_context = None
count_article = 0
for example in dataset:

  sentence = example['sentence']
  context = example['context']
  highlights = example['highlights'].split("\n")

  if context != current_context:

    if current_context is not None:
      count_article += 1
      if count_article == i + 1:
        # We have our article sentences
        break

    current_context = context
    current_highlights = highlights
    current_article_sentences = []

  current_article_sentences.append(sentence)

# Keep just the selected article
context = current_context
article_body = current_article_sentences
article_summary = current_highlights
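
The loop above rebuilds one article from a flat table in which each row holds a single sentence together with its article's `context` and `highlights`. As a hedged sketch of the same idea, the grouping can also be written with `itertools.groupby`. The helper `get_article` and the toy rows are hypothetical and only mirror the three fields used in the cell above; like the loop, the sketch assumes that rows of the same article are contiguous.

```python
from itertools import groupby, islice


def get_article(rows, index):
    """Return (context, sentences, highlights) for the index-th article (0-based).

    Assumes rows of the same article are contiguous and share the same 'context'.
    """
    groups = groupby(rows, key=lambda row: row["context"])
    # Skip the first `index` groups and take the next one, if any
    selected = next(islice(groups, index, index + 1), None)
    if selected is None:
        raise IndexError(f"fewer than {index + 1} articles in rows")
    context, article_rows = selected
    article_rows = list(article_rows)  # consume the group before advancing further
    sentences = [row["sentence"] for row in article_rows]
    highlights = article_rows[0]["highlights"].split("\n")
    return context, sentences, highlights


# Toy rows (assumption: invented data with the same fields as the dataset rows)
toy_rows = [
    {"sentence": "First sentence of A.", "context": "ctx A", "highlights": "hA1\nhA2"},
    {"sentence": "Second sentence of A.", "context": "ctx A", "highlights": "hA1\nhA2"},
    {"sentence": "Only sentence of B.", "context": "ctx B", "highlights": "hB1"},
]

toy_context, toy_body, toy_summary = get_article(toy_rows, 0)
print(toy_context, toy_body, toy_summary)
```

Unlike the explicit loop, this version raises an `IndexError` when the requested index exceeds the number of articles instead of silently returning the last one.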

In [6]:
# Import the needed function from our repository and visualize the highlights
from THExtended.utils import get_scores

ranked_sents, ranked_scores = get_scores(article_body, context, model, tokenizer)
ranked_sents = ranked_sents[:len(article_summary)]

print("The selected article is:\n")
for sentence in article_body:
  print(sentence)
print('\n##############################################################################################################################')
print("\nThe predicted highlights are:\n")
for idx, pred_s in enumerate(ranked_sents):
  print(f"{idx+1}. {pred_s.strip()}")
print('\n##############################################################################################################################')

The selected article is:

An Australian doctor is the face of the latest Islamic State propaganda video in which the terrorist organisation announces the launch of its own health service in Syria.
The propaganda video shows a man with an Australian accent who calls himself 'Abu Yusuf' and calls on foreign doctors to travel to the ISIS stronghold Raqqa to help launch the ISHS (the Islamic State Health Service), which appears to be mimicking Britain's National Health Service.
The vision shows Yusuf handling babies in a maternity ward while wearing western-style blue surgical scrubs and a stethoscope.
SCROLL DOWN FOR VIDEO .
An Australian doctor who calls himself 'Abu Yusuf' is geatured in the latest Islamic State propaganda video in which the terrorist organisation announces the launch of the Islamic State Health Service .
The video's poster shows a cropped image of a doctor, wearing an western-style blue surgical scrubs which appear to mimic Britain's National Health Service .
The visio