<a href="https://colab.research.google.com/github/MorenoLaQuatra/DeepNLP/blob/main/2022_2023/Practice_5_Automatic_Text_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 5:** Automatic Text Summarization

# Automatic Text Summarization

Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning. The summarization task is challenging because it requires a deep understanding of the text, including both its content and its style.

**Extractive Summarization** is the task of selecting a subset of the original text to form the summary. The selected sentences are concatenated to form the summary. Extractive summarization is the most common approach to text summarization because it only requires understanding the content of the text and estimating the importance of each sentence. It does not require **generating** new text, which is a more complex task that demands both a deeper understanding of the text and text generation capabilities.

**Abstractive Summarization** is the task of generating a summary that is not a direct copy of the original text. It requires a deeper understanding of the text and the ability to generate new text. The model needs to understand the content of the text and the style of the author to produce a summary that is fluent and coherent with the original text.

## Extractive Text Summarization

The model needs to estimate the importance of each sentence in the text and select the most important sentences to form the summary.

![](https://images.deepai.org/machine-learning-models/8f66b1eb608e4eb681b2ec0c0631385c/summarization.jpg)

In this practice we will use the BBC News Summary dataset available in [Kaggle](https://www.kaggle.com/pariza/bbc-news-summary) to create the extractive summarization models. The following cell downloads the dataset and extracts it in the current directory.

In [None]:
%%capture
! wget https://github.com/MorenoLaQuatra/DeepNLP/raw/main/practices/P5/bbc_news.zip
! unzip bbc_news.zip

**Important note**

In some of the questions you are asked to use PageRank to estimate the importance of each sentence. However, the graph is undirected and the PageRank algorithm may fail to converge. If this happens, you can use the following code to skip the PageRank algorithm and fall back to a trivial importance score for each sentence.

```python
import networkx as nx

try:
    pr_scores = nx.pagerank(G, max_iter=1000)
except nx.PowerIterationFailedConvergence:
    print("The pagerank algorithm failed to converge. Returning the top sentences according to their position in the text.")
    # Fallback: earlier sentences get higher scores.
    pr_scores = {i: len(sentences) - i for i in range(len(sentences))}
```

### **Question 1: split data collection**

The data collection contains news articles belonging to different categories (e.g., business, sport, tech, etc.) and the corresponding summary. In this question you will split the data collection into training, validation and test sets. The training set will be used to train the model, the validation set will be used to select the best model and the test set will be used to evaluate the final model. The goal is to stratify the data collection by category, so that each category is represented in the same proportion in each set. Be sure to select 10% of the data **of each category** for the test set. The remaining data can be split according to your preference.
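The stratified split described above can be sketched with the standard library alone. This is a minimal illustration, not the expected solution: the `stratified_split` helper, the `(category, text)` pair layout, the 10% validation share and the seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(items, test_frac=0.10, val_frac=0.10, seed=42):
    """Split (category, text) pairs so every category keeps the same proportions."""
    by_category = defaultdict(list)
    for category, text in items:
        by_category[category].append(text)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for category, texts in sorted(by_category.items()):
        texts = texts[:]  # copy, so the caller's data is not shuffled in place
        rng.shuffle(texts)
        n_test = round(len(texts) * test_frac)
        n_val = round(len(texts) * val_frac)
        splits["test"] += [(category, t) for t in texts[:n_test]]
        splits["val"] += [(category, t) for t in texts[n_test:n_test + n_val]]
        splits["train"] += [(category, t) for t in texts[n_test + n_val:]]
    return splits
```

scikit-learn's `train_test_split` with its `stratify` argument is a common alternative if you prefer not to write the loop yourself.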

**Note 1:** Some files may raise a `UnicodeDecodeError` when read. Feel free to ignore the offending characters via the `errors` parameter:
```python
f = open(FILENAME, 'r', encoding='utf-8', errors='ignore')
```

**Note 2:** You can fix encoding issues after reading a file by using the [ftfy](https://pypi.org/project/ftfy/) library:

```python
import ftfy
fixed_text = ftfy.fix_text(text)
```

The following cell installs the ftfy library, which can be used to fix encoding issues.

In [None]:
!pip install ftfy

In [None]:
# Your code here

### **Question 2: Unsupervised Text Summarization (TextRank algorithm)**

[TextRank](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) is an unsupervised text summarization approach that exploits graph-based ranking algorithms to estimate the importance of each sentence in the text. The algorithm is based on the following steps:

1. Each sentence is a node in a graph (undirected)
2. Each pair of sentences is connected by an edge whose weight is computed according to the number of common words (see Note 1).
3. Once the graph is built, the importance of each sentence is estimated by computing the PageRank score of each node (you can use the [networkx](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html) library to compute the PageRank score).
4. The sentences are ranked according to their importance and the top ranked sentences are selected to form the summary.
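
To make steps 3 and 4 concrete, the sketch below ranks the nodes of a small weighted, undirected graph with a plain power-iteration PageRank and then keeps the top `N` sentences. It is a dependency-free stand-in for `networkx.pagerank`, not a replacement for it: the `weighted_pagerank` and `select_top` helpers, the damping factor of 0.85, the choice to return sentences in document order and the toy weight matrix are all assumptions for illustration, and exact scores may differ slightly from networkx.

```python
def weighted_pagerank(weights, damping=0.85, max_iter=200, tol=1e-10):
    """Power-iteration PageRank on a symmetric matrix of edge weights."""
    n = len(weights)
    strength = [sum(row) for row in weights]  # total edge weight per node
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Score held by isolated nodes is redistributed uniformly.
        dangling = sum(s for s, w in zip(scores, strength) if w == 0)
        updated = []
        for i in range(n):
            incoming = sum(
                scores[j] * weights[j][i] / strength[j]
                for j in range(n)
                if strength[j] > 0
            )
            updated.append((1 - damping) / n + damping * (incoming + dangling / n))
        converged = sum(abs(a - b) for a, b in zip(updated, scores)) < tol
        scores = updated
        if converged:
            break
    return scores


def select_top(sentences, scores, N):
    """Pick the N best-scoring sentences, returned in their original order."""
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:N]
    return [sentences[i] for i in sorted(best)]


# Toy graph: sentence 0 shares words with both others, which share none.
toy_weights = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
toy_scores = weighted_pagerank(toy_weights)
summary = select_top(["first", "second", "third"], toy_scores, N=2)
```

In this toy graph the first sentence receives the highest score because it is the only one connected to every other sentence.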


Implement a `TextrankSummarizer` class that implements the TextRank algorithm. The class should have the following methods:

- `__init__(self, ...)`: the constructor of the class. You can add any parameter you want to the constructor if you think it is useful.
- `summarize(self, sentences, N)`: the method that computes the summary. The method takes as input the list of sentences `sentences` and the number of sentences `N` to select to form the summary. The method returns the list of selected sentences.
- Any other method you think is useful. For example, you can add an internal method to compute the similarity between two sentences (see the example below).

**Note 1:** An example of the similarity function that can be used to compute graph weights is reported below. The function takes as input two tokenized sentences and returns the number of words they share, divided by the sum of the logarithms of the two sentence lengths. The logarithm acts as a smoothing term that accounts for the relative length of the sentences.

In [None]:
import math

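def compute_similarity(tokens_sent_1, tokens_sent_2):
    # Empty sentences share nothing (and log10(0) is undefined).
    if not tokens_sent_1 or not tokens_sent_2:
        return 0

    # Number of distinct words shared by the two sentences.
    n_common_words = len(set(tokens_sent_1) & set(tokens_sent_2))

    log_s1 = math.log10(len(tokens_sent_1))
    log_s2 = math.log10(len(tokens_sent_2))

    # Both sentences contain a single token: avoid dividing by zero.
    if log_s1 + log_s2 == 0:
        return 0

    return n_common_words / (log_s1 + log_s2)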

In [None]:
'''
# class skeleton

class TextrankSummarizer:

    def __init__(self):
        pass

    def summarize(self, sentences, N=2):
        # TODO: implement summarization
        pass
'''

In [None]:
# Your code here

### **Question 3: Unsupervised Text Summarization (TextRank + TF-IDF)**

In this question you will improve the TextRank algorithm by using TF-IDF representations of the sentences to compute the graph weights. As in the previous question, the importance of each sentence is the PageRank score of its node. The difference is that each edge weight is the similarity between the TF-IDF vectors of the two sentences it connects.

Implement a `TextrankTFIDFSummarizer` class that exposes the same methods as the `TextrankSummarizer` class. Only the computation of the graph weights changes.

Implement the class similarly to Q2. This version uses a different similarity function to weigh edges connecting sentences. It uses [TF-IDF vectorization](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and [cosine similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) to compute sentence-to-sentence similarity.

The similarity function should take as input two sentences and compute the cosine similarity between the TF-IDF vectors of the two sentences. The similarity function can be implemented as a method of the `TextrankTFIDFSummarizer` class.

1. Compute a TF-IDF vector for each sentence.
2. Compute edge weights using the cosine similarity between the TF-IDF vector representations.
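
The two steps above can be illustrated without scikit-learn. The sketch below is a hand-rolled stand-in for `TfidfVectorizer` and `cosine_similarity`, written only to show what the edge weight measures. The helper names, the whitespace tokenization and the example sentences are assumptions, and the smoothed IDF formula is one common variant, so the values will not necessarily match scikit-learn's output.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_sentences):
    """Return one sparse {term: weight} dict per sentence."""
    n_sentences = len(tokenized_sentences)
    # Document frequency: in how many sentences each term appears.
    doc_freq = Counter(term for sent in tokenized_sentences for term in set(sent))
    vectors = []
    for sent in tokenized_sentences:
        term_freq = Counter(sent)
        vectors.append({
            term: count * (math.log((1 + n_sentences) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


sentences = [
    "the market fell sharply today",
    "shares on the market fell",
    "the team won the final",
]
vectors = tfidf_vectors([s.split() for s in sentences])
# Edge weights for the sentence graph: one value per pair of sentences.
edge_weights = {
    (i, j): cosine(vectors[i], vectors[j])
    for i in range(len(vectors))
    for j in range(i + 1, len(vectors))
}
```

The first two sentences share three terms and therefore get a larger edge weight than the first and third, which only share the very frequent word "the".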

In [None]:
# Your code here

### **Question 4: Unsupervised Text Summarization (Pretrained BERT)**

Both the TextRank and the TextRank + TF-IDF algorithms are based on the assumption that the importance of a sentence is related to the number of words it shares with other sentences. This assumption does not always hold, especially when sentences express similar ideas using different words (e.g., synonyms). In this question you will use a pretrained BERT model to compute the **semantic similarity** between sentences. You can use the `sentence-transformers` library to obtain sentence embeddings (https://www.sbert.net/) and compute the cosine similarity between the embeddings.

Implement a `BERTSummarizer` class that exposes the same methods as the `TextrankSummarizer` class. The only difference is that the graph weights are computed according to the semantic similarity between sentences.

Use the `sentence-transformers` library to encode sentences into semantic-aware vectors and use their semantic similarity to weight the edges connecting sentences in the graph.

In [None]:
# Your code here

### **Question 5: ROUGE-based evaluation**

Automatic summarization models are usually evaluated using automatic metrics. The idea is to compare the automatically generated summary with the reference summary provided by humans. The most common metric used to evaluate automatic summarization models is [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)). The ROUGE metric is based on the idea that the automatically generated summary is considered correct if it contains a high number of n-grams that are also present in the reference summary.
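
To make the n-gram overlap idea concrete, here is a minimal hand-rolled sketch of ROUGE-2 precision, recall and F1. The `rouge_2` helper, the lowercasing and the whitespace tokenization are assumptions for illustration. Library implementations handle tokenization and other details differently, so their numbers may not match this sketch.

```python
from collections import Counter


def bigrams(tokens):
    """Count the consecutive token pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(candidate, reference):
    """Bigram-overlap precision, recall and F1 between two texts."""
    cand = bigrams(candidate.lower().split())
    ref = bigrams(reference.lower().split())
    # Clipped overlap: a bigram counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}


scores = rouge_2("the cat sat on the mat", "the cat sat on a mat")
```

In the example, three of the five candidate bigrams also occur in the reference, which also has five bigrams, so precision, recall and F1 all equal 0.6.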

Using only the **test set** obtained in Q1 compare the performance of the three summarizers implemented in Q2, Q3 and Q4. 

Report their results in terms of average precision, recall and F1-score for the ROUGE-2 metric. Set the number of extracted sentences to 4 for all summarizers.

**Which method obtains the best scores?**

Note 1: You can use the Python implementation of ROUGE available [here](https://pypi.org/project/rouge/)

In [None]:
! pip install rouge

In [None]:
# Your code here

## Abstractive Text Summarization

Abstractive summarization models build an internal semantic representation of the original content, and then use this representation to generate a summary. The main difference with extractive summarization is that abstractive summarization models do not select sentences from the original text, but they generate new sentences.
Abstraction may transform the extracted content by paraphrasing sections of the source document, to condense a text more strongly than extraction. Such transformation, however, is computationally much more challenging than extraction and requires a more sophisticated model.

![https://techcommunity.microsoft.com/t5/image/serverpage/image-id/180981i9EA877DDFF97D50D?v=v2](https://techcommunity.microsoft.com/t5/image/serverpage/image-id/180981i9EA877DDFF97D50D?v=v2)

For this part of the practice we use the BBC News Summary dataset available in [Kaggle](https://www.kaggle.com/pariza/bbc-news-summary), as in the previous part. You don't need to download the dataset again: you can use the copy you already downloaded in the previous part.

### **Question 6: BART (pretrained) seq2seq model**

[BART](https://arxiv.org/abs/1910.13461) is a sequence-to-sequence model pretrained with a denoising objective. BART is based on the [Transformer](https://arxiv.org/abs/1706.03762) architecture and is trained on a large amount of text data.

The [huggingface transformers](https://huggingface.co/transformers/) library provides pretrained BART models that can be used to generate summaries. For this question you will use [BART model pre-trained on CNN-DailyMail dataset](https://huggingface.co/facebook/bart-large-cnn) to summarize the articles in the BBC test set.
You can use the `pipeline` function to create a summarization pipeline. The pipeline takes as input the text to summarize and returns the summary.

Note 1: for generated summaries set the maximum length to 100 and the minimum length to your preferred value (if you set the minimum length to a very low value, the model may generate summaries that are too short).

Note 2: **to speed up computation**, you can use the distilled version of the BART model (e.g., `sshleifer/distilbart-cnn-12-6` available [here](https://huggingface.co/sshleifer/distilbart-cnn-12-6)). Please note that the distilled version can be less effective than the larger version.

Note 3: You can use the [summarization pipeline](https://huggingface.co/transformers/main_classes/pipelines.html#transformers.SummarizationPipeline). Explicitly set truncation to True to avoid index errors (e.g. `summarizer(..., truncation=True)`).

Note 4: If you have a GPU runtime on Colab, you can use it to speed up computation. To use the GPU with the pipeline, you can set the `device` parameter to `0` (e.g. `pipeline(..., device=0)`).
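
Putting the notes above together, a call to the summarization pipeline might look like the sketch below. The helper names `build_summarizer` and `summarize_article`, the default `min_length=30` and the choice of the distilled checkpoint are assumptions for illustration. `device=-1` keeps the pipeline on the CPU.

```python
def build_summarizer(model_name="sshleifer/distilbart-cnn-12-6", device=-1):
    """Create a summarization pipeline (downloads the model on first use)."""
    # Imported here so that defining the helper does not load the library.
    from transformers import pipeline

    return pipeline("summarization", model=model_name, device=device)


def summarize_article(summarizer, text, max_length=100, min_length=30):
    """Summarize one article, truncating inputs longer than the model limit."""
    output = summarizer(
        text,
        max_length=max_length,
        min_length=min_length,
        truncation=True,
    )
    # The pipeline returns a list with one dict per input text.
    return output[0]["summary_text"]
```

With a GPU runtime, `summarizer = build_summarizer(device=0)` followed by `summarize_article(summarizer, article_text)` returns the generated summary as a string, where `article_text` stands for one article from your test set.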


In [None]:
# Your code here

### **Question 7 (bonus): Finetuning seq2seq model**

The BBC dataset is provided with a training set. You can use the training set to finetune a BART model to generate summaries. You can use the [transformers library](https://huggingface.co/transformers/) to finetune the model. You have examples of how to finetune a BART model for summarization in the [introduction to the HF library](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/%F0%9F%A4%97Transformers_Overview.ipynb) discussed in the previous lab.

You can use the [`datasets` library](https://huggingface.co/docs/datasets/) and [Trainer API](https://huggingface.co/transformers/training.html#fine-tuning-in-pytorch-with-the-trainer-api) for finetuning and [`evaluate` library](https://huggingface.co/docs/evaluate/index) to evaluate the model.

As in the extractive part, you can also evaluate the model using ROUGE-2 precision, recall and F1-score. You may want to use the [`compute_metrics`](https://huggingface.co/course/chapter3/3?fw=pt#evaluation) function to monitor the ROUGE-2 scores during training and select the best model according to the ROUGE-2 scores on the validation set.

In [None]:
# Your code here

### **Bonus**: Upload **YOUR** model to the [HuggingFace model hub](https://huggingface.co/models) and share the link on Discord!

The HuggingFace model hub is a repository of pretrained models that can be used to perform a wide range of NLP tasks. You can upload your model to the hub and share it with the community. You can find more information about the model hub [here](https://huggingface.co/docs/hub/main). You can also join the [Deep NLP organization](https://huggingface.co/DeepNLP-22-23) to share your model on the organization page. Use the link on the organization page to join the Deep NLP organization.

**Note 1**: If you want to extend the practice, you can try to finetune a BART model on other data collections that are available online or on the [huggingface datasets hub](https://huggingface.co/datasets) and share the results and the model on the Deep NLP organization page.

**How to upload a model to the HuggingFace model hub**

- **Step 1**: Create a [HuggingFace account](https://huggingface.co/join)
- **Step 2**: Login to your account from the notebook using the token provided in the account page ([more info](https://huggingface.co/settings/tokens))
```python
from huggingface_hub import notebook_login
notebook_login()
```
- **Step 3**: Prepare the model and tokenizer for upload
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
finetuned_model = AutoModelForSeq2SeqLM.from_pretrained("YOUR_LOCAL_FOLDER/checkpoint-...") # load the model
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
```
- **Step 4**: Install git-lfs on Colab
```python
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!sudo apt-get install git-lfs
!git lfs install
!git config --global credential.helper store
```

- **Step 5**: Upload the model to the HuggingFace model hub
```python
MODEL_NAME = "my-awesome-model-name"
tokenizer.push_to_hub(MODEL_NAME,use_temp_dir=True)
finetuned_model.push_to_hub(MODEL_NAME,use_temp_dir=True)
```

The model will be uploaded to the [HuggingFace model hub](https://huggingface.co/models) and you can share the link to the model with the community and on Discord. If you want to share the model on the DeepNLP organization page, you can join the organization and upload the model to the organization page (to do so, you need to specify the organization name in the `push_to_hub` function, e.g. `finetuned_model.push_to_hub(f"DeepNLP-22-23/{MODEL_NAME}", use_temp_dir=True)`).