# **THExtended: Transformer-based Highlights Extraction for News Summarization**

---


**Authors:** Alessio Paone, Flavio Spuri, Luca Agnese, Luca Zilli

# Notebook for using the **THExtended** summarization method

This **demo** lets you run the code cells below to test the method on news articles.
Start by running the cells in the ***Prepare the Repository and Dependencies*** section: the later cells depend on the repository, packages, and dataset that they set up.

A brief recap of the project key definitions:

* **Summarization** has received considerable attention in the field of Natural Language Processing (NLP) in recent years. It has been widely applied in various domains, including news summarization and extracting important sections from scientific papers, such as highlights.
* **Extractive summarization** is a specific approach to summarization where the goal is to select and extract the most important sentences or phrases from a given document or text corpus.
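
To make the extractive idea concrete, here is a minimal sketch that scores each sentence and returns the top `k` verbatim. The word-frequency scorer and the regex sentence splitter are illustrative assumptions only; THExtended scores sentences with a fine-tuned transformer, and `toy_extractive_summary` is a hypothetical name that does not exist in the repository.

```python
import re
from collections import Counter

def toy_extractive_summary(text, k=2):
    """Return the k highest-scoring sentences, kept in document order."""
    # Naive split on end punctuation (an assumption; the notebook uses spaCy)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document act as a crude importance signal
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Pick the indices of the k best sentences, then restore document order
    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

summary = toy_extractive_summary("The cat sat. The cat ate the fish. Dogs bark.", k=2)
```

Every returned sentence is copied unchanged from the input, which is what distinguishes extractive from abstractive summarization.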

## **Prepare the Repository and Dependencies**

This section downloads the repository, all the required dependencies, and the dataset samples from [HuggingFace](https://huggingface.co/datasets/cnn_dailymail).

In [None]:
%%capture
# Download the repository from GitHub (the cell magic must be the first line of the cell)
!git clone https://github.com/Raffix-14/THExtended_.git

In [10]:
# CELL TO REMOVE BEFORE SUBMISSION - USED ONLY FOR DEBUGGING (the models folder is already inside Pao's GitHub repository)
!mkdir -p THExtended_/models

In [12]:
# CELL TO REMOVE BEFORE SUBMISSION - USED ONLY FOR DEBUGGING (copy the weights from Google Drive)
!cp -r /content/drive/MyDrive/THExtEnded/alpha_075/model/checkpoint-5972 THExtended_/models

In [14]:
%%capture
# Install all the required packages (the cell magic must be the first line of the cell)
!pip install rouge
!pip install transformers datasets accelerate nvidia-ml-py3 sentencepiece evaluate
!pip install bert_score
!pip install spacy
!python -m spacy download en_core_web_lg
!pip install -U sentence-transformers

In [None]:
# Download and load the 'cnn_dailymail' dataset (version 3.0.0) from HuggingFace
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", '3.0.0')

## **Run the highlights extraction**

This section extracts the **highlights** from an example article in **CNN/DailyMail**.

The following cells, in order:

1.   **Load and show** a single article from the dataset
2.   Set up the model and load the **fine-tuned weights**
3.   Declare a few functions to process the article **corpus** and **context**
4.   Use our model to **perform the extraction** and visualize the results
5.   **Compare** the extracted sentences with the golden highlights

**Note:** Keep in mind that the dataset used in our work (and in this notebook) contains predominantly **abstractive** highlights, so the extracted sentences are not expected to match the golden highlights word for word.


In [None]:
# Save the first article contained in the test split of the dataset
article_corpus = dataset['test'][0]['article']
article_summary = dataset['test'][0]['highlights']
print(f"The article text corpus contains: \n\n{article_corpus}\n\n")
print(f"The 'gold' highlights are: \n\n{article_summary}")

In [114]:
%%capture
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sentence_transformers import SentenceTransformer
import torch
import math
import spacy

# Set up the model and the tokenizer (use our best fine-tuned weights) and prepare the model for the test phase

model_name = '/content/THExtended_/models/checkpoint-5972'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
similarity_model = SentenceTransformer("all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model.to(torch.device("cuda:0"))

In [115]:
# ASK PAO!: the returned summary is not split (remove this comment before submission!!!)
# Declare shortened versions of our project functions to prepare the article for processing

nlp = spacy.load('en_core_web_lg') # Smart 'sentence splitter'

def split_sentence(text):
    # Split into sentences with spaCy and strip surrounding whitespace
    return [s.text.strip() for s in nlp(text).sents]

def extract_context(article_sentences):
    # The context is the first third of the article (rounded up)
    cut_off = math.ceil(len(article_sentences) / 3)
    return ' '.join(article_sentences[:cut_off])

def process_row(row):
    # Unpack the parameter tuple
    article, summary = row
    article_sentences = split_sentence(article)
    context = extract_context(article_sentences)
    # Return the processed 'article' object
    return {"sentences": article_sentences, "context": context, "highlights": summary}
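
As a small worked example of the context cut-off, the sketch below reproduces the `math.ceil(len(...) / 3)` rule on a toy list, so it runs without spaCy. The function name `first_third` and the placeholder sentences are assumptions introduced only for illustration.

```python
import math

def first_third(sentences):
    """Join the first ceil(n / 3) sentences, mirroring extract_context."""
    cut_off = math.ceil(len(sentences) / 3)
    return " ".join(sentences[:cut_off])

toy_sentences = ["S1.", "S2.", "S3.", "S4.", "S5.", "S6.", "S7."]
context = first_third(toy_sentences)  # ceil(7 / 3) = 3 sentences are kept
```

Rounding up means that even a one-sentence article yields a non-empty context.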

In [None]:
# Import the needed function from our repository and visualize the highlights
from THExtended_.utils import get_scores

processed_data = process_row((article_corpus, article_summary))
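# Highlights in CNN/DailyMail are newline-separated; splitting on "." would also break abbreviations
golden_highlights = [h.strip() for h in processed_data["highlights"].split("\n") if h.strip()]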
#num_highlights = len(golden_highlights)
num_highlights = 2 # The first article contains just 2 highlights
article_sentences, context = processed_data["sentences"], processed_data["context"]
ranked_sents, ranked_scores = get_scores(article_sentences, context, model, tokenizer)
print(f"The first {num_highlights} extracted sentences are: \n")

for sentence, score in zip(ranked_sents[:num_highlights], ranked_scores[:num_highlights]):
  print(f"Sentence > {sentence}")
  print(f"Model score > {round(score, 5)}\n")
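
The ranking step can be pictured with the sketch below: score every sentence against the context, then sort in descending order and return sentences and scores as two parallel lists. This is not the implementation of `get_scores` from the repository, whose internals are not shown in this notebook; `rank_sentences` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in for the fine-tuned model.

```python
def rank_sentences(sentences, context, score_fn):
    """Sort sentences by descending score.

    score_fn(sentence, context) -> float is a hypothetical scoring callable.
    """
    scored = [(s, float(score_fn(s, context))) for s in sentences]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    ranked_sents = [s for s, _ in scored]
    ranked_scores = [score for _, score in scored]
    return ranked_sents, ranked_scores

def overlap_score(sentence, context):
    # Placeholder scorer: fraction of the sentence's words found in the context
    context_words = set(context.lower().split())
    words = sentence.lower().split()
    return sum(w in context_words for w in words) / max(len(words), 1)

toy_ranked, toy_scores = rank_sentences(
    ["storm hits coast", "weather was mild", "coast storm damage"],
    "storm hits coast",
    overlap_score,
)
```

Keeping the scorer as a parameter makes it easy to swap the placeholder for a model-based one without touching the ranking logic.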

In [None]:
# Compare the 'golden' highlights with the extracted ones
print("The golden highlights are:\n")
for idx, real_s in enumerate(golden_highlights[:num_highlights]):
  print(f"{idx+1}. {real_s.strip()}")

print("\nThe predicted highlights are:\n")
for idx, pred_s in enumerate(ranked_sents[:num_highlights]):
  print(f"{idx+1}. {pred_s.strip()}")
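
Beyond reading the two lists side by side, the overlap can be quantified. The sketch below computes a unigram-overlap F1 in the spirit of ROUGE-1; it is a simplified stand-in (whitespace tokenization, no stemming) and not the scoring used by the `rouge` package installed earlier. The name `unigram_f1` is hypothetical.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """F1 over unigram counts shared by a candidate and a reference sentence."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

With the variables from the cells above, `unigram_f1(ranked_sents[0], golden_highlights[0])` would give a rough similarity between the top extracted sentence and the first golden highlight. Because the golden highlights are mostly abstractive, low scores do not necessarily indicate a poor extraction.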

In [None]:
# DEBUG CELL, TO REMOVE BEFORE SUBMISSION - Run the test from a pretrained model checkpoint (best version)
!python3 /content/test.py \
--dataset_path=/content/content/cnn_dm_10k \
--save_dataset_on_disk=0  \
--output_dir=/content/output \
--train_batch_size=64 \
--gradient_accumulation_steps=2 \
--num_train_example=10000 \
--num_val_example=1500 \
--num_test_examples=1500 \
--alpha=1.0 \
--model_name_or_path=/content/10k_a1_2epoch/train/model/checkpoint-5972