# Summarization

Summarization requires a range of abilities, such as understanding long passages, reasoning about the contents, and producing fluent text that incorporates the main topics from the original document. Moreover, summarizing a news article is different from summarizing a legal document, and each requires a certain degree of domain generalization. For these reasons, summarization is a difficult task for natural language models, including transformers. Despite these challenges, text summarization has many applications:

* Helps domain experts speed up their workflow
* Helps enterprises summarize contracts and internal knowledge
* Generates content for social media releases, and more

Summarization is a classic sequence-to-sequence (seq2seq) task with an input text and a target text. This is where encoder-decoder transformers excel.

In this notebook we'll cover:

* The challenges involved
* Pretrained transformers for summarizing documents, leading up to our own encoder-decoder model that condenses dialogues between several people into a crisp summary

***Dataset to be used in this notebook*** --> [CNN/DailyMail corpus](https://huggingface.co/datasets/cnn_dailymail)

## The CNN/DailyMail Dataset

This dataset consists of roughly 300,000 pairs of news articles and their summaries from CNN and the Daily Mail. The summaries are the bullet points that the publishers attach to their articles. They are abstractive rather than extractive, meaning they consist of new sentences and not simple excerpts from the article. We'll load the dataset from the Hub and use version 3.0.0, a non-anonymized version set up for summarization. We can select the version with the `version` keyword.

In [20]:
!pip install transformers datasets nltk -q

In [21]:
import transformers
import datasets
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device used in this notebook: {device}")

Device used in this notebook: cpu


In [22]:
from datasets import load_dataset

dataset = load_dataset(
    path="cnn_dailymail",
    version="3.0.0"
)
print(f"Features: {dataset['train'].column_names}")

Features: ['article', 'highlights', 'id']


In [23]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

The dataset contains three columns: `article`, which contains the news articles; `highlights`, which contains the summaries; and `id`, which uniquely identifies each article. Let's look at an excerpt from a single sample.

In [24]:
sample = dataset["train"][0]
print(f"""Article (excerpt of 500 characters, total length: {len(sample['article'])})""")
print(sample["article"][:500])
print(f"\nSummary (length: {len(sample['highlights'])})")
print(sample["highlights"])

Article (excerpt of 500 characters, total length: 2527)
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as s

Summary (length: 217)
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .


In [25]:
# article length // summary length, using the character counts printed above
2527 // 217

11

Articles are very long compared to the target summary; in this particular case the difference is 11-fold.
Long articles pose a challenge to most transformer models, since their context size is limited to 1,000 tokens or so, which is equivalent to a few paragraphs of text. The standard, yet crude, way to deal with this is to truncate the text beyond the model's context size. Obviously there could be important information for the summary toward the end of the text, but for now we have to live with this limitation of the model architectures.
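
As a rough illustration of what truncation does, here is a minimal sketch in plain Python. It assumes whitespace-separated words as a stand-in for real subword tokens, and `truncate_to_context` is a hypothetical helper written for this note, not a function from `transformers`. In practice the model's own tokenizer would perform the truncation.

```python
def truncate_to_context(text: str, max_tokens: int = 1000) -> str:
    """Keep only the first `max_tokens` whitespace-separated tokens of `text`.

    Assumption: words approximate tokens. Real models count subword tokens,
    so the true cut-off point would usually come earlier in the text.
    Note that runs of whitespace are collapsed to single spaces.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


# Toy "article" of 1,500 words named word0 ... word1499.
article = " ".join(f"word{i}" for i in range(1500))
truncated = truncate_to_context(article, max_tokens=1000)

print(len(article.split()), "->", len(truncated.split()))
# Anything near the end of the article is lost after truncation.
print("word1499" in truncated.split())
```

The last check shows the cost of this approach: content that appears late in the article never reaches the model.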

## Text Summarization Pipelines

Let's check out how popular summarization transformer models perform on our single sample. Even though the models' maximum input sizes vary, let's restrict the input text to 2,000 characters so that every model gets the same input and the outputs are comparable.

In [26]:
sample_text = dataset["train"][0]["article"][:2000]
summaries = {} # Dict to store summaries from different models
sample_text

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details o

In [27]:
import nltk
# sent_tokenize splits a text into sentences
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

A common convention in summarization is to separate the summary sentences with a newline. We could add a newline after every period, but this simple heuristic fails for abbreviations that contain periods, such as "U.N." The Natural Language Toolkit (NLTK) includes a more sophisticated algorithm that can differentiate the end of a sentence from the punctuation that occurs in abbreviations.

In [28]:
# let's verify the above point
string = "The U.N is an organization. The U.s are a country."
sent_tokenize(string)

['The U.N is an organization.', 'The U.s are a country.']
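
To see why the naive period heuristic breaks, here is a minimal sketch in plain Python. `naive_sentence_split` is a hypothetical helper written for this illustration (it is not part of NLTK), and the example string is made up. It splits after every period that is followed by whitespace.

```python
import re
from typing import List


def naive_sentence_split(text: str) -> List[str]:
    """Split after every period followed by whitespace (the naive heuristic)."""
    return re.split(r"(?<=\.)\s+", text.strip())


text = "The U.N. is an organization. The U.S. is a country."
pieces = naive_sentence_split(text)

# The two sentences are cut into four fragments, because the periods
# inside "U.N." and "U.S." are mistaken for sentence boundaries.
for piece in pieces:
    print(repr(piece))
```

This is the failure mode that a trained sentence tokenizer such as `sent_tokenize` is designed to avoid.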

### Summarization Baseline

A common baseline for summarizing news articles is to simply take the first three sentences of the article. We can do this using NLTK's sentence tokenizer.

In [29]:
def three_sentence_summary(text):
  return "\n".join(sent_tokenize(text)[:3])

In [30]:
summaries["baseline"] = three_sentence_summary(
    text=sample_text
)
summaries["baseline"]

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him.\nDaniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties.\n"I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month.'

#### GPT-2

In [5-text-genreation.ipynb](https://github.com/JpChii/nlp-with-hugging-face/blob/main/notebooks/6_Summarization.ipynb) we saw how GPT-2 can generate text from an input prompt. Because GPT-2 was trained on web data that includes Reddit content, it has picked up summarization as well. We can trigger summarization with GPT-2 by appending "TL;DR" to the end of the prompt. "TL;DR" (too long; didn't read) is often used on blogs and forums to introduce a short version of a long post.

We'll start the summarization experiment with `pipeline()` from transformers.

In [31]:
from transformers import pipeline, set_seed

set_seed(42)
pipe = pipeline("text-generation", model="gpt2-large")
gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(
    gpt2_query,
    max_length=512,
    clean_up_tokenization_spaces=True,
)
pipe_out

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box off

In [32]:
summaries["gpt2"] = pipe_out[0]["generated_text"][len(gpt2_query):]
summaries["gpt2"]

"DANIEL RADCLIFFE: I'm not going to be an extravagant young man. I really want to do my best for his education, for his entertainment, and for him being a better person. I'm not going to be the one spending all of that on a Ferrari or"

#### T5

The T5 developers performed a comprehensive study of transfer learning and found that they could create a universal transformer by formulating all tasks as text-to-text tasks.

The T5 checkpoints are pretrained on a mixture of unsupervised data (to reconstruct masked words) and supervised data for several tasks, including summarization. These checkpoints can therefore be used directly for inference with the same prompts used during pretraining. A few of the input formats are as follows:

* To summarize: `"summarize: <ARTICLE>"`
* To translate: `"translate English to German: <TEXT>"`

This capability makes T5 extremely versatile: with a single model we can solve many tasks.
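
The prompt formats above can be sketched as a small string-building helper. This is only an illustration: `TASK_PREFIXES` and `build_t5_input` are hypothetical names introduced here, not part of `transformers`, and the lowercase prefixes are assumed to match the ones T5 saw during pretraining.

```python
# Hypothetical mapping from a short task name to a T5-style text prefix.
TASK_PREFIXES = {
    "summarize": "summarize",
    "en_to_de": "translate English to German",
}


def build_t5_input(task: str, text: str) -> str:
    """Prepend the task prefix so that one model can be steered to different tasks."""
    prefix = TASK_PREFIXES[task]
    return f"{prefix}: {text.strip()}"


print(build_t5_input("summarize", "Harry Potter star Daniel Radcliffe turns 18 on Monday."))
print(build_t5_input("en_to_de", "The house is wonderful."))
```

The same input text paired with a different prefix asks the model for a different task, which is what makes the text-to-text formulation so flexible.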

Let's load T5 and test it out.

In [34]:
pipe = pipeline("summarization", model="t5-large")
pipe_out = pipe(sample_text)
pipe_out

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


[{'summary_text': "Harry Potter star Daniel Radcliffe turns 18 on monday . the young actor says he has no plans to fritter his cash away . details of how he'll mark his landmark birthday are under wraps ."}]

In [36]:
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))
summaries

{'baseline': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him.\nDaniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties.\n"I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month.',
 'gpt2': "DANIEL RADCLIFFE: I'm not going to be an extravagant young man. I really want to do my best for his education, for his entertainment, and for him being a better person. I'm not going to be the one spending all of that on a Ferrari or",
 't5': "Harry Potter star Daniel Radcliffe turns 18 on monday .\nthe young actor

*All T5 capabilities*
![alt t5](../notes/images/6-summarization/t5-capabalities.png)