<a href="https://colab.research.google.com/github/MorenoLaQuatra/DeepNLP/blob/main/practices/P5/Practice_5_Automatic_Text_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 5:** Automatic Text Summarization

## Extractive Text Summarization

Content is extracted from the original data, but the extracted content is not modified in any way.

![](https://images.deepai.org/machine-learning-models/8f66b1eb608e4eb681b2ec0c0631385c/summarization.jpg)

For this part of the practice we will use the BBC News Summary dataset available in [Kaggle](https://www.kaggle.com/pariza/bbc-news-summary).

In [None]:
%%capture
! wget https://github.com/MorenoLaQuatra/DeepNLP/raw/main/practices/P5/bbc_news.zip
! unzip bbc_news.zip

### **Question 1: split data collection**

Read the data collection and split it into train/test/eval sets. The data are organized by class (e.g., business, sport, tech); be sure to select 10% of the data for testing **for each class**.

**Note 1:** Some files may raise a `UnicodeError` when read; feel free to ignore these errors (see the `errors` parameter of `open`).

**Note 2:** You can fix the encoding after reading a file by using the [ftfy](https://pypi.org/project/ftfy/) library.
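
The per-class split can be sketched as follows. This is only a minimal illustration on toy data: the function name `stratified_split`, the 10% evaluation fraction, and the fixed seed are assumptions, not part of the assignment (only the 10% test share per class is stated above).

```python
import random
from collections import defaultdict

def stratified_split(items, test_frac=0.10, eval_frac=0.10, seed=42):
    """Split (label, payload) pairs into train/eval/test, class by class."""
    by_class = defaultdict(list)
    for label, payload in items:
        by_class[label].append((label, payload))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, evaluation, test = [], [], []
    for label in sorted(by_class):
        docs = by_class[label]
        rng.shuffle(docs)
        # At least one document per class goes to each held-out set.
        n_test = max(1, round(len(docs) * test_frac))
        n_eval = max(1, round(len(docs) * eval_frac))
        test.extend(docs[:n_test])
        evaluation.extend(docs[n_test:n_test + n_eval])
        train.extend(docs[n_test + n_eval:])
    return train, evaluation, test

# Toy collection with 20 documents per class. With the real dataset the
# payload would be an (article, summary) pair, read for instance with
# open(path, encoding="utf-8", errors="ignore").
toy = [(label, f"{label}-{i}")
       for label in ("business", "sport", "tech")
       for i in range(20)]
train, evaluation, test = stratified_split(toy)
print(len(train), len(evaluation), len(test))
```

Splitting inside each class, rather than over the whole collection, keeps the class proportions identical in every subset.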

In [None]:
!pip install ftfy

In [None]:
# Your code here

### **Question 2: Unsupervised Text Summarization (TextRank)**

[TextRank](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) is an unsupervised text summarization approach that relies on graph modelling. Implement a `TextrankSummarizer` class that exposes a `summarize(sentences, N)` method returning the `N` most relevant sentences from a list (`sentences`).

The main steps are reported here:

1. Each sentence is a node in a graph (undirected)
2. Each pair of sentences is connected by an edge whose weight is computed from the number of words they share (see Note 1).
3. PageRank is used to compute a relevance score for each node in the graph (i.e., for each sentence in the list).
4. The `summarize` function returns the summary obtained by concatenating the `N` most relevant sentences (according to the scores computed at step 3).
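
Steps 3 and 4 can be sketched without external libraries. The code below is a simplified, hand-written weighted PageRank (power iteration) over a symmetric weight matrix; the names `pagerank` and `top_n`, the damping factor 0.85, and the choice to return sentences in their original order are assumptions made for illustration. A library implementation such as `networkx.pagerank` may handle isolated nodes and convergence differently.

```python
def pagerank(weights, damping=0.85, n_iter=100, tol=1e-6):
    """Weighted PageRank by power iteration on a symmetric weight matrix."""
    n = len(weights)
    # Total edge weight attached to each node (self-loops are ignored).
    out_weight = [sum(w for j, w in enumerate(row) if j != i)
                  for i, row in enumerate(weights)]
    scores = [1.0 / n] * n
    for _ in range(n_iter):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score
            # proportional to the weight of the edge (j, i).
            incoming = sum(
                scores[j] * weights[j][i] / out_weight[j]
                for j in range(n)
                if j != i and out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

def top_n(sentences, scores, n=2):
    """Return the n best-scored sentences, kept in their original order."""
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]

# Toy graph: sentence 0 shares words with both other sentences.
weights = [[0, 2, 1],
           [2, 0, 0],
           [1, 0, 0]]
scores = pagerank(weights)
print(top_n(["first", "second", "third"], scores, n=2))
```

In this toy graph the first sentence is the most central one because it is the only node connected to both others.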

**Note 1:** An example of a similarity function that can be used to compute the edge weights is reported below.

In [None]:
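import math

def compute_similarity(tokens_sent_1, tokens_sent_2):
    # An empty sentence shares no words (and log10(0) is undefined).
    if not tokens_sent_1 or not tokens_sent_2:
        return 0

    # Number of distinct words shared by the two sentences.
    n_common_words = len(set(tokens_sent_1) & set(tokens_sent_2))

    # Normalize by the (log) sentence lengths.
    log_s1 = math.log10(len(tokens_sent_1))
    log_s2 = math.log10(len(tokens_sent_2))

    # Both sentences contain a single token: avoid division by zero.
    if log_s1 + log_s2 == 0:
        return 0

    return n_common_words / (log_s1 + log_s2)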

In [None]:
'''
# class skeleton

class TextrankSummarizer:

    def __init__(self):
        pass

    def summarize(self, sentences, N=2):
        # TODO: implement summarization
        pass
'''

In [None]:
# Your code here

### **Question 3: Unsupervised Text Summarization (TextRank + TF-IDF)**

Implement a `TextrankTFIDFSummarizer` class that exposes a `summarize(sentences, N)` method returning the `N` most relevant sentences from a list (`sentences`).

Implement the class similarly to Q2. This version uses a different similarity function to weigh edges connecting sentences. It uses [TF-IDF vectorization](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and [cosine similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) to compute sentence-to-sentence similarity.

- Compute TF-IDF vectors for each sentence
- Compute edges' weights using the cosine similarity between TF-IDF vector representations.
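
To make the two steps above concrete, here is a hand-rolled sketch that uses only the standard library. It is not the scikit-learn implementation: in the practice, `TfidfVectorizer` and `cosine_similarity` would replace these helpers. The function names and the whitespace tokenization are assumptions; the smoothed IDF follows the formula scikit-learn documents for `smooth_idf=True`.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_sentences):
    """One sparse TF-IDF vector (dict: term -> weight) per sentence."""
    n = len(tokenized_sentences)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for tokens in tokenized_sentences for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized_sentences
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_matrix(tokenized_sentences):
    """Edge weights of the sentence graph (no self-loops)."""
    vectors = tfidf_vectors(tokenized_sentences)
    n = len(vectors)
    return [[0.0 if i == j else cosine(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

sentences = [
    "the market rose sharply",
    "the market fell sharply",
    "the team won the match",
]
weights = similarity_matrix([s.split() for s in sentences])
print([round(w, 2) for w in weights[0]])
```

The two market sentences end up more similar to each other than to the sports sentence, since the only term they share with it ("the") appears everywhere and gets the lowest IDF.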

In [None]:
# Your code here

### **Question 4: Unsupervised Text Summarization (Pretrained BERT)**

Both TextRank and LexRank rely on lexical (word-overlap) scores to compute sentence similarity.
Use the Sentence-Transformers library to encode sentences into semantic-aware vectors and compute semantic similarity to connect sentences (e.g., use the cosine similarity of BERT encodings). Implement a `BERTSummarizer` class similarly to Q2 and Q3.

Note 1: use the `sentence-transformers` library to obtain sentence embeddings (https://www.sbert.net/).
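
Once sentence embeddings are available, the edge weights can be derived as in the sketch below. The helper name `cosine_similarity_matrix`, the clipping of negative similarities, and the model name in the comment are assumptions; the toy vectors are hypothetical stand-ins for real embeddings.

```python
import math

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity for a list of dense vectors."""
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in embeddings]
    n = len(embeddings)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            sim = dot / (norms[i] * norms[j]) if norms[i] and norms[j] else 0.0
            # Negative similarities are clipped to zero so that all
            # edge weights passed to PageRank are non-negative.
            matrix[i][j] = matrix[j][i] = max(sim, 0.0)
    return matrix

# With the real library the vectors would come from something like:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")  # model name is an assumption
#   embeddings = model.encode(sentences).tolist()
# Hypothetical 3-dimensional embeddings stand in for them here.
toy_embeddings = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
weights = cosine_similarity_matrix(toy_embeddings)
print(weights[0])
```

The resulting matrix can be used as the graph weights in the same way as the word-overlap and TF-IDF similarities of Q2 and Q3.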

In [None]:
# Your code here

### **Question 5: ROUGE-based evaluation**

Using only the **test set** obtained in Q1, compare the performance of the three summarizers implemented in Q2, Q3 and Q4.

Report their results in terms of average precision, recall and F1-score for the ROUGE-2 metric. Set the number of extracted sentences to 4 for all summarizers.

**Which method obtains the best scores?**

Note 1: You can use the Python implementation of ROUGE available [here](https://pypi.org/project/rouge/).
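
To clarify what is being measured, the sketch below computes ROUGE-2 by hand from bigram overlap. It is a simplified illustration (lowercasing plus whitespace tokenization, clipped bigram counts); the `rouge` package may tokenize and count differently, so its numbers need not match exactly. The function names are assumptions, and the keys `p`, `r`, `f` are chosen to resemble the package's output.

```python
from collections import Counter

def bigrams(text):
    """Multiset of consecutive token pairs."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(hypothesis, reference):
    """ROUGE-2 precision, recall and F1 from clipped bigram overlap."""
    hyp, ref = bigrams(hypothesis), bigrams(reference)
    overlap = sum((hyp & ref).values())  # Counter '&' keeps the minimum count
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"p": precision, "r": recall, "f": f1}

def average_rouge_2(hypotheses, references):
    """Average the three scores over a collection of summaries."""
    scores = [rouge_2(h, r) for h, r in zip(hypotheses, references)]
    return {key: sum(s[key] for s in scores) / len(scores) for key in ("p", "r", "f")}

result = rouge_2("the cat sat on the mat", "the cat sat on a mat")
print(result)
```

Precision is the share of generated bigrams found in the reference, recall is the share of reference bigrams that were generated, and F1 is their harmonic mean.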

In [None]:
! pip install rouge

In [None]:
# Your code here

## Abstractive Text Summarization

Abstractive methods build an internal semantic representation of the original content, and then use this representation to create a summary that is closer to what a human might express. Abstraction may transform the extracted content by paraphrasing sections of the source document, to condense a text more strongly than extraction. Such transformation, however, is computationally much more challenging than extraction.

![https://techcommunity.microsoft.com/t5/image/serverpage/image-id/180981i9EA877DDFF97D50D?v=v2](https://techcommunity.microsoft.com/t5/image/serverpage/image-id/180981i9EA877DDFF97D50D?v=v2)

This part of the practice also uses the BBC News Summary dataset available in [Kaggle](https://www.kaggle.com/pariza/bbc-news-summary).

### **Question 6: BART (pretrained) seq2seq model**

Exploit [BART](https://huggingface.co/facebook/bart-large-cnn) pretrained on the CNN/Daily Mail dataset to summarize the articles in the BBC test set. Compute the obtained scores in terms of average precision, recall and F1-score for the ROUGE-2 metric.

Note 1: for the generated summaries, set the maximum length to 100 and the minimum length to your preferred value.

Note 2: **to speed up computation**, you can use the distilled version of the BART model (e.g., `sshleifer/distilbart-cnn-12-6`, available [here](https://huggingface.co/sshleifer/distilbart-cnn-12-6)).

Note 3: You can use the [summarization pipeline](https://huggingface.co/transformers/main_classes/pipelines.html#transformers.SummarizationPipeline). Explicitly set truncation to True to avoid index errors (e.g., `summarizer(..., truncation=True)`).

Note 4: Explicitly set the device to use GPU acceleration when creating the pipeline object (e.g., `pipeline(..., device=0)`); the Colab runtime should also be set to GPU.
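
The notes above can be combined as in the following sketch. The helper names, the batch size, and `min_length=30` are assumptions; `test_articles` in the comment is a hypothetical list of article strings. The pipeline is created inside a function so that the batching logic can be read (and tried with a stand-in callable) without downloading the model.

```python
def build_summarizer(model_name="sshleifer/distilbart-cnn-12-6", device=0):
    """Create a Hugging Face summarization pipeline.

    device=0 selects the first GPU; use device=-1 to run on CPU.
    """
    from transformers import pipeline  # imported lazily: only needed for the real model
    return pipeline("summarization", model=model_name, device=device)

def summarize_articles(summarizer, articles, max_length=100, min_length=30, batch_size=8):
    """Summarize articles in small batches and return plain strings."""
    summaries = []
    for start in range(0, len(articles), batch_size):
        batch = articles[start:start + batch_size]
        # truncation=True cuts inputs longer than the model's maximum length.
        outputs = summarizer(batch, max_length=max_length,
                             min_length=min_length, truncation=True)
        summaries.extend(out["summary_text"] for out in outputs)
    return summaries

# Intended use (downloads the model, so it is left commented out here):
#   summarizer = build_summarizer()
#   summaries = summarize_articles(summarizer, test_articles)
```

The generated summaries can then be scored against the ground-truth summaries with the same ROUGE-2 procedure used in Q5.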

In [None]:
# Your code here

### **Question 7 (bonus): Fine-tuning a seq2seq model**

Exploit the BBC dataset to fine-tune a BART-based model. Create a fine-tuning procedure that uses the article text as the model input and the ground-truth summary as the target output.

Exploit the [Datasets framework](https://huggingface.co/docs/datasets/) and the [Trainer API](https://huggingface.co/transformers/training.html#fine-tuning-in-pytorch-with-the-trainer-api) for training and evaluating the model.

As before, evaluate the model using ROUGE-2 precision, recall and F1-score. Here, you may want to use the [metrics Python library](https://huggingface.co/metrics) to set the [`compute_metrics`](https://huggingface.co/transformers/main_classes/trainer.html#id1) parameter of the Trainer.
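
As a starting point, the input/output pairing described above could be prepared with a function like the one below. This sketch covers only the tokenization step, not the training loop. The function name, the column names `article` and `summary`, and the length limits are assumptions; the `text_target` argument assumes a reasonably recent `transformers` version.

```python
def preprocess(batch, tokenizer, max_input_length=1024, max_target_length=100):
    """Tokenize articles as inputs and summaries as labels for a seq2seq model."""
    model_inputs = tokenizer(batch["article"],
                             max_length=max_input_length, truncation=True)
    labels = tokenizer(text_target=batch["summary"],
                       max_length=max_target_length, truncation=True)
    # The Trainer reads the target token ids from the "labels" field.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Intended use with the Datasets library (variable names are hypothetical):
#   from datasets import Dataset
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
#   train_ds = Dataset.from_dict({"article": train_articles,
#                                 "summary": train_summaries})
#   train_ds = train_ds.map(lambda b: preprocess(b, tokenizer), batched=True)
```

The tokenized dataset can then be passed to the Trainer together with a data collator suited to seq2seq models.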

In [None]:
# Your code here