# Text Summarization Techniques: Overview

## Intro

Automatic text summarization is the task of producing a concise and fluent summary of a text while preserving key information content and overall meaning.

The internet age has brought massive amounts of information, mostly in the form of unstructured text, far more than anyone has time to read.
For this reason, there is a need for ML algorithms that can automatically shorten long texts into accurate, fluent summaries that convey the intended message.

Applying automatic text summarization can:
- Enhance the readability of documents
- Reduce reading time 
- Speed up the search for information
- Increase the amount of information that can fit in a given space

## Approaches

There are two approaches to automatic text summarization: 
- **Extractive** summarization systems select the subset of words, sentences, or paragraphs that best represents the text.
  - Pros: They are quite robust to semantic inconsistencies, since they reuse sentences taken straight from the input.
  - Cons: They lack flexibility, since they cannot use novel words or paraphrase.
- **Abstractive** summarization systems concisely paraphrase the content of the documents and generate a summary that captures the salient ideas of the source text. The generated summaries may contain new phrases and sentences that do not appear in the source text.
  - Pros: They can use words that were not in the original input, which allows for more fluent and natural summaries.
  - Cons: It is a much harder NLP problem, still under active research.

The great majority of existing approaches to automatic summarization are extractive, mostly because it is much easier to select text than it is to generate text from scratch.

On the other hand, the extractive approach is too restrictive to produce human-like summaries – especially of longer, more complex text where just selecting and rearranging sentences is not enough.

So although abstractive summarization is difficult, it remains an essential field of research.

## Extractive Techniques

The most widely used method for extractive text summarization is sentence extraction: the sentences in the text are ranked by relevance, and the top N sentences are selected and returned.

Extractive summarization techniques can be divided into several categories. The best known are the following:
- **Graph-based** - This approach builds a graph with sentences as nodes and edges weighted by some notion of similarity between sentences. Graph algorithms are then applied to find the sentences that are most "central" to the document. The idea is that a sentence that is very similar to many others is likely to be important. Its importance also depends on the importance of the sentences "recommending" it: to rank highly and be placed in a summary, a sentence must be similar to many sentences that are in turn similar to many other sentences.
- **Feature-based** - This approach extracts features from each sentence and then evaluates the sentence's importance based on those features. Several features can be considered, depending on the use case, such as:
  - Position of the sentence in the input document (e.g. the first sentence of a document is usually very important)
  - Presence of a verb, or of some other Part of Speech (POS), in the sentence
  - Length of the sentence
  - Term frequency (TF, TF-IDF)
  - Presence of named entity tags (NER)
- **Topic-based** - This approach calculates the topics of the document and evaluates each sentence by how representative it is for any of these topics.
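
As a rough illustration of the feature-based category, the sketch below scores each sentence from three of the features listed above: position, length, and term frequency. The function names, the weights, the `ideal_len` parameter, and the regex-based sentence splitter are all assumptions made for this example, not a reference implementation.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphanumeric tokens; no stemming or stop-word removal."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def feature_scores(text, w_position=1.0, w_length=0.5, w_tf=1.0, ideal_len=12):
    """Return (score, index, sentence) triples. Weights are illustrative guesses."""
    sentences = split_sentences(text)
    tokens = [tokenize(s) for s in sentences]
    tf = Counter(tok for toks in tokens for tok in toks)
    max_tf = max(tf.values()) if tf else 1
    scored = []
    for i, (sent, toks) in enumerate(zip(sentences, tokens)):
        position = 1.0 / (i + 1)  # earlier sentences score higher
        length = 1.0 / (1.0 + abs(len(toks) - ideal_len))  # penalize very short/long
        # Average document-level term frequency of the sentence's words, in [0, 1].
        avg_tf = sum(tf[t] for t in toks) / (len(toks) * max_tf) if toks else 0.0
        score = w_position * position + w_length * length + w_tf * avg_tf
        scored.append((score, i, sent))
    return scored


def summarize_by_features(text, n=2):
    """Pick the n best-scoring sentences and return them in document order."""
    top = sorted(feature_scores(text), key=lambda x: x[0], reverse=True)[:n]
    return [sent for _, _, sent in sorted(top, key=lambda x: x[1])]
```

In practice the weights would be tuned or learned from data, and features such as POS or NER tags would come from an NLP library rather than from regular expressions.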

Let's see some examples of extractive techniques.

### TextRank

In [None]:
# TODO
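
Until the cell above is filled in, here is a minimal, dependency-free sketch of the graph-based idea behind TextRank. It assumes a naive regex sentence splitter, the word-overlap similarity with log-length normalization from the original TextRank formulation, and a PageRank-style power iteration that uses the common `(1 - d) / n` normalization. The function names and the default values for `damping` and `iterations` are choices made for this example.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Set of lowercase alphanumeric tokens in the sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Shared words divided by the sum of the log sentence sizes."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denom if denom > 0 else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    words = [tokenize(s) for s in sentences]
    weights = [[similarity(words[i], words[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional
            # to the weight of edge (j, i); isolated sentences pass nothing.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarize_textrank(text, n=2):
    """Return the n highest-ranked sentences in their original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]
```

A sentence that shares no words with the rest of the document receives only the baseline `(1 - damping) / n` score, so it is the last candidate for the summary.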

### LSA

In [None]:
# TODO
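
Until the cell above is filled in, here is a deliberately simplified sketch of LSA-style sentence selection. Full LSA takes the singular value decomposition of a term-by-sentence matrix and works with several latent topics. This example keeps only the first (dominant) topic and approximates it with power iteration, so that it runs without NumPy. The function names, the raw term-count weighting, and the single-topic restriction are all assumptions of this example.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def lsa_top_topic_loadings(sentences, iterations=100):
    """Loading of each sentence on the dominant latent topic.

    With A the term-by-sentence count matrix, the dominant right singular
    vector of A is the dominant eigenvector of A^T A, which power iteration
    approximates.
    """
    n = len(sentences)
    if n == 0:
        return []
    counts = [Counter(tokenize(s)) for s in sentences]
    # gram[i][j] = dot product of the term-count columns of sentences i and j.
    gram = [[sum(counts[i][t] * counts[j][t] for t in counts[i])
             for j in range(n)] for i in range(n)]
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def summarize_lsa(text, n=1):
    """Return the n sentences with the highest loading, in document order."""
    sentences = split_sentences(text)
    loadings = lsa_top_topic_loadings(sentences)
    best = sorted(range(len(sentences)), key=lambda i: loadings[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]
```

A fuller version would compute the complete decomposition (for example with `numpy.linalg.svd`), keep the top k topics, and pick sentences per topic or weight the loadings by singular value.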

## Abstractive Techniques

Abstraction-based summarization approaches must address a wide variety of NLP problems, such as natural language generation and semantic representation.

In general, building abstractive summaries is a challenging task that relies on deep learning techniques and sophisticated language modeling. Despite recent progress in the deep learning domain, these systems are still far from reaching human-level quality in summary generation.

The most common deep learning approaches to abstractive summarization are encoder-decoder architectures, from traditional RNNs to more recent transformers.

### Encoder-Decoder architecture with RNN and attention mechanism

In [4]:
# TODO
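
Until the cell above is filled in, the sketch below shows only the attention step of such an architecture, in plain Python. It assumes dot-product scoring between the current decoder state and each encoder state, which is one of several attention variants. The RNN encoder and decoder themselves are omitted, and the vectors in the usage example are made-up toy values rather than trained hidden states.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(decoder_state, encoder_states):
    """Dot-product attention for one decoding step.

    Returns the attention weights (one per input position) and the context
    vector, i.e. the weighted average of the encoder states.
    """
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context


# Toy example: three encoder states of size 2 and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

In a full model, the context vector is combined with the decoder state to predict the next summary token, so each output word can focus on different parts of the input.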

### Transformer architectures

In [None]:
# TODO