# 6. Summarization

## 6.1. The CNN/DailyMail Dataset

* Version 3.0.0 is the non-anonymized version of the dataset.
* By convention, the sentences of a summary are separated by newlines.

In [1]:
#hide_output
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", version="3.0.0")
print(f"Features: {dataset['train'].column_names}")

Found cached dataset cnn_dailymail (/root/.cache/huggingface/datasets/cnn_dailymail/default/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de)



Features: ['article', 'highlights', 'id']


* Each example has three columns: `article` (the news text), `highlights` (the reference summary), and `id`.

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [3]:
sample = dataset["train"][20]

* Full article text

In [4]:
sample['article']

'LAS VEGAS, Nevada (CNN)  -- Former football star O.J. Simpson will be held without bail after his arrest on robbery and assault charges, police announced late Sunday. Police released this mug shot of O.J. Simpson after his arrest. Simpson is accused of having directed several other men in an alleged armed robbery of sports memorabilia in a room at a Las Vegas hotel room. Las Vegas authorities said they have no information leading them to believe Simpson was carrying a firearm during the alleged incident at the Palace Station Hotel and Casino. Police said Simpson and other men burst into the room and walked out with the memorabilia, including some that was unrelated to Simpson, police said. "We don\'t believe that anyone was roughed up, but there were firearms involved," Lt. Clint Nichols told reporters. Nichols said the firearms were pointed at the victims. A reporter asked Nichols: Was "O.J. was the boss in that room?" Nichols responded, "That is what we believe, yes."  Watch Simpson

* Reference summary

In [5]:
print(sample['highlights'])

No bail for ex-NFL star accused of directing men in alleged armed robbery .
Simpson faces charges of robbery, assault, burglary and conspiracy .
Alleged robbery involved sports-related items, police say .
Simpson arrested Sunday in Las Vegas, but he says items were his .


## 6.2. Text Summarization Pipelines

In [6]:
import nltk

In [7]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

* NLTK's sentence-splitting tool

In [8]:
from nltk.tokenize import sent_tokenize

In [9]:
string = "The U.S. are a country. The U.N. is an organization."
sent_tokenize(string)

['The U.S. are a country.', 'The U.N. is an organization.']

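* The article is too long to feed to a transformer model, so only part of it is used.
    * Split it into sentences and keep only the leading sentences whose cumulative length stays under 2,000 characters.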

In [10]:
import numpy as np

In [11]:
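sents = sent_tokenize(sample['article'])  # split the article into sentences
sents_len_sum = np.cumsum(list(map(len, sents)))  # cumulative character count after each sentence
last_idx = np.where(sents_len_sum < 2000)[0][-1].item()  # index of the last sentence still under 2,000 characters
sample = ' '.join(sents[:last_idx])  # slice end is exclusive, so sentence last_idx itself is left out; `sample` is now a string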
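The cumulative-length selection can be checked on toy data. The sketch below uses a hypothetical helper, `select_sentences`, and made-up sentences, neither of which is part of the notebook. It counts how many leading sentences fit instead of taking the last matching index, which also covers the case where no sentence fits (where `[0][-1]` would raise an `IndexError`). Because the slice in the cell above is end-exclusive, `sents[:last_idx]` keeps one sentence fewer than this helper does.

```python
import numpy as np


def select_sentences(sents, max_chars=2000):
    """Keep the leading sentences whose cumulative length stays under max_chars.

    Hypothetical helper for illustration; joining spaces are not counted,
    matching the cumulative sum used in the notebook cell.
    """
    if not sents:
        return []
    cum_len = np.cumsum([len(s) for s in sents])
    # cum_len is non-decreasing, so the number of True entries is the
    # number of leading sentences that fit under the budget.
    n_keep = int(np.sum(cum_len < max_chars))
    return sents[:n_keep]


toy = ["aaaa.", "bbbb.", "cccc.", "dddd."]  # made-up 5-character sentences
kept = select_sentences(toy, max_chars=12)
print(kept)
```

With a 12-character budget the cumulative lengths are 5, 10, 15, 20, so the first two sentences are kept.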

### 6.2.1 Summarization Baseline - Select the First Three Sentences

In [12]:
summaries = dict()

In [13]:
def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

In [14]:
summaries['baseline'] = three_sentence_summary(sample)

In [15]:
print(three_sentence_summary(sample))

LAS VEGAS, Nevada (CNN)  -- Former football star O.J.
Simpson will be held without bail after his arrest on robbery and assault charges, police announced late Sunday.
Police released this mug shot of O.J.


### 6.2.2 GPT-2

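* A GPT model can be used for summarization by appending "TL;DR", which often introduces a summary in real-world text, to the input. The model's continuation then mimics a summary through plain text generation.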

In [16]:
#hide_output
from transformers import pipeline, set_seed

set_seed(42)
pipe = pipeline("text-generation", model="gpt2-xl")

* In the text generated by GPT-2, what appears to be a summary follows "TL;DR".

In [17]:
gpt2_query = sample + "\nTL;DR:\n"
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)
pipe_out

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'LAS VEGAS, Nevada (CNN)  -- Former football star O.J. Simpson will be held without bail after his arrest on robbery and assault charges, police announced late Sunday. Police released this mug shot of O.J. Simpson after his arrest. Simpson is accused of having directed several other men in an alleged armed robbery of sports memorabilia in a room at a Las Vegas hotel room. Las Vegas authorities said they have no information leading them to believe Simpson was carrying a firearm during the alleged incident at the Palace Station Hotel and Casino. Police said Simpson and other men burst into the room and walked out with the memorabilia, including some that was unrelated to Simpson, police said. "We don\'t believe that anyone was roughed up, but there were firearms involved," Lt. Clint Nichols told reporters. Nichols said the firearms were pointed at the victims. A reporter asked Nichols: Was "O.J. was the boss in that room?" Nichols responded, "That is what we believe, 

* Drop the last sentence, which was probably cut off mid-generation, and store only the complete sentences.

In [18]:
summaries["gpt2"] = "\n".join(sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query):])[:-1])

### 6.2.3 T5
* T5 was already trained on summarization-related tasks during pretraining.
* Given input of the form `summarize: <text>`, it produces a summary.

In [19]:
#hide_output
pipe = pipeline("summarization", model="t5-large")
pipe_out = pipe(sample)
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))     

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


### 6.2.4. Bart
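* BART is very good at reconstructing corrupted text.
* Summarization can be treated as a similar kind of task, so BART can be fine-tuned for it.
* The checkpoint used here is fine-tuned on CNN/DailyMail.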

In [20]:
#hide_output
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

### 6.2.5. Pegasus
* PEGASUS was pretrained with the specific goal of performing well on text summarization.
* It performs well on actual downstream tasks.

In [21]:
#hide_output
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")
pipe_out = pipe(sample)
summaries["pegasus"] = pipe_out[0]["summary_text"].replace(" .<n>", ".\n")

## 6.3. Comparing the Summaries

* Qualitatively, the summaries from every model except GPT-2 look plausible.
* PEGASUS produces the best summary.

In [22]:
print("GROUND TRUTH")
print(dataset["train"][20]["highlights"])
print("")

for model_name in summaries:
    print(model_name.upper())
    print(summaries[model_name])
    print("")
     

GROUND TRUTH
No bail for ex-NFL star accused of directing men in alleged armed robbery .
Simpson faces charges of robbery, assault, burglary and conspiracy .
Alleged robbery involved sports-related items, police say .
Simpson arrested Sunday in Las Vegas, but he says items were his .

BASELINE
LAS VEGAS, Nevada (CNN)  -- Former football star O.J.
Simpson will be held without bail after his arrest on robbery and assault charges, police announced late Sunday.
Police released this mug shot of O.J.

GPT2
Man arrested at airport in connection with OJ Simpson robbery.
I don't see anything illegal; I see something very peculiar.
We're in Vegas at the airport here now.
And when we see somebody at McCarran International with a bunch of guns, it really gets our juices flowing.
[on air]: Yeah, a lot of people have said it's like somebody got out of their room in a big rush of cocaine or something.
[on air] The TSA are having a bit of a meltdown right now with the incident.
Not from me, but I do g

## 6.4. Evaluating the Quality of Generated Text

### 6.4.1 BLEU
* How many of the words or n-grams in the generated text also occur in the reference text?
* BLEU is a precision-based metric.

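* Difference between BLEU and plain precision
* Example
    * Reference (GT): `the cat is on the mat`
    * Generated (GP): `the the the the the the`

* vanilla precision
    * (number of generated tokens that appear in the reference) / (number of generated tokens) = 6 / 6 = 1
    * This overestimates the model's actual performance.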

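* __modified precision__
    * (matches, with each token's count clipped to its count in the reference) / (number of generated tokens) = 2 / 6 = 1/3
    * Clipping prevents a word that is repeated more often than it occurs in the reference from inflating precision.

$$p_n = {\sum_{\text{n-gram} \in snt} Count_{clip}(\text{n-gram}) \over \sum_{\text{n-gram} \in snt} Count(\text{n-gram})}$$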
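The two precisions can be compared on the example sentences with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization and a single reference; the function names are illustrative and not taken from any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vanilla_precision(generated, reference, n=1):
    """Fraction of generated n-grams that occur anywhere in the reference."""
    gen = ngrams(generated.split(), n)
    ref = set(ngrams(reference.split(), n))
    return sum(g in ref for g in gen) / len(gen) if gen else 0.0


def modified_precision(generated, reference, n=1):
    """Same ratio, but each n-gram's count is clipped to its reference count."""
    gen_counts = Counter(ngrams(generated.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    clipped = sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
    total = sum(gen_counts.values())
    return clipped / total if total else 0.0


reference = "the cat is on the mat"
generated = "the the the the the the"
p_vanilla = vanilla_precision(generated, reference)
p_modified = modified_precision(generated, reference)
print(p_vanilla, p_modified)
```

`the` occurs twice in the reference, so only two of the six generated tokens count after clipping, which gives 2 / 6 instead of 6 / 6.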

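* __modified precision over a corpus C__
    * The clipped counts and the total counts are each summed over every generated sentence in the corpus C before dividing, so this is one corpus-level ratio rather than a mean of per-sentence precisions.

$$p_n = {\sum_{snt \in C} \sum_{\text{n-gram} \in snt} Count_{clip}(\text{n-gram}) \over \sum_{snt \in C} \sum_{\text{n-gram} \in snt} Count(\text{n-gram})}$$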
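The corpus-level formula sums numerators and denominators separately before dividing. The sketch below illustrates this under two assumptions that are not in the notes: whitespace tokenization, and exactly one reference per generated sentence. `corpus_modified_precision` is a hypothetical name.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams of a whitespace-tokenized string."""
    toks = text.split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))


def corpus_modified_precision(generated_sents, reference_sents, n=1):
    """Sum clipped and total n-gram counts over the corpus, then divide once."""
    clipped_total, count_total = 0, 0
    for gen, ref in zip(generated_sents, reference_sents):
        gen_counts = ngram_counts(gen, n)
        ref_counts = ngram_counts(ref, n)
        clipped_total += sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
        count_total += sum(gen_counts.values())
    return clipped_total / count_total if count_total else 0.0


gens = ["the the the the the the", "the cat is on mat"]
refs = ["the cat is on the mat", "the cat is on the mat"]
p1 = corpus_modified_precision(gens, refs, n=1)
print(p1)
```

Here the clipped counts are 2 and 5 and the totals are 6 and 5, so the corpus value is 7 / 11. Averaging the two per-sentence precisions (1/3 and 1) would give 2/3 instead, which is why the order of summing and dividing matters.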

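* However, the precision above ignores recall, so a short but precise generated sequence is favored over a longer one.
    * To penalize short outputs, a brevity penalty is applied.
    * The penalty applies only when the generated text is shorter than the reference.

$$BR = \min\left(1, e^{1 - {l_{ref} \over l_{gen}}}\right)$$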
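The brevity penalty is simple enough to evaluate by hand. The sketch below implements the formula directly, with lengths measured in tokens. Returning 0 for an empty generation is an assumption made here to avoid dividing by zero.

```python
import math


def brevity_penalty(gen_len, ref_len):
    """min(1, exp(1 - ref_len / gen_len)); lengths are token counts."""
    if gen_len == 0:
        return 0.0  # assumption: an empty generation gets the full penalty
    return min(1.0, math.exp(1 - ref_len / gen_len))


bp_short = brevity_penalty(5, 6)  # generated text shorter than the reference
bp_equal = brevity_penalty(6, 6)  # same length: no penalty
bp_long = brevity_penalty(8, 6)   # longer than the reference: no penalty
print(round(bp_short, 6), bp_equal, bp_long)
```

For a 5-token generation against a 6-token reference the penalty is exp(-0.2), about 0.8187. When the generation is at least as long as the reference, the exponent is non-negative and the `min` caps the value at 1.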



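* Putting BLEU together:
    * It is the brevity penalty times the geometric mean of the modified precisions for 1-grams through N-grams.
    * BLEU-4 is the most commonly used variant.

$$\text{BLEU-}N = BR \times \left(\prod^N_{n=1} p_n\right)^{1 \over N}$$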
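The pieces can be combined into a from-scratch BLEU-N for a single sentence pair. This is a sketch under stated assumptions: whitespace tokenization, one reference, no smoothing, and a score scaled to 0-100 as `sacrebleu` reports it. It is not the `sacrebleu` implementation, and `simple_bleu` is a hypothetical name. For the punctuation-free example sentences in this section, whitespace splitting is enough.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(generated, reference, max_n=4):
    """Brevity penalty times the geometric mean of modified precisions."""
    gen, ref = generated.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        gen_counts, ref_counts = ngram_counts(gen, n), ngram_counts(ref, n)
        total = sum(gen_counts.values())
        clipped = sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
        precisions.append(clipped / total if total else 0.0)
    # Without smoothing, a single zero precision makes the product zero.
    if min(precisions) == 0.0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(ref) / len(gen)))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return 100 * bp * geo_mean


reference = "the cat is on the mat"
score_good = simple_bleu("the cat is on mat", reference)
score_bad = simple_bleu("the the the the the the", reference)
print(round(score_good, 4), score_bad)
```

For `the cat is on mat` the four precisions are 1, 3/4, 2/3 and 1/2, whose product is 0.25. Its fourth root (about 0.7071) times the brevity penalty (about 0.8187) gives roughly 57.89. The repeated-`the` sentence has no matching bigrams, so its score is 0.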

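**Limitations of BLEU**
* It does not take synonyms into account.
* For other limitations, see:
  * https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213
* It expects text that has already been tokenized, so scores vary with the tokenizer.
  * SacreBLEU addresses this by handling tokenization internally.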
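The tokenization issue is easy to reproduce. In the sketch below, the same pair of sentences yields different clipped unigram matches depending on how the text is split. `punct_tokens` is a deliberately simple, hypothetical tokenizer and is not the one SacreBLEU uses.

```python
import re
from collections import Counter


def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def punct_tokens(text):
    """Hypothetical simple tokenizer: split punctuation into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def clipped_unigram_matches(gen_tokens, ref_tokens):
    """Number of generated tokens matched in the reference, with clipping."""
    return sum((Counter(gen_tokens) & Counter(ref_tokens)).values())


ref = "the cat is on the mat."
gen = "the cat is on the mat ."  # same words, but pre-tokenized punctuation

ws_matches = clipped_unigram_matches(whitespace_tokens(gen), whitespace_tokens(ref))
pt_matches = clipped_unigram_matches(punct_tokens(gen), punct_tokens(ref))
print(ws_matches, len(whitespace_tokens(gen)))
print(pt_matches, len(punct_tokens(gen)))
```

With whitespace splitting, `mat.` in the reference matches neither `mat` nor `.`, so only 5 of 7 generated tokens match. With punctuation split off on both sides, all 7 match. Fixing the tokenizer inside the metric removes this source of variation between reported scores.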

### Computing BLEU

In [23]:
# hide_output
from datasets import load_metric

bleu_metric = load_metric("sacrebleu")



In [24]:
import pandas as pd
import numpy as np

bleu_metric.add(
    prediction="the the the the the the", reference=["the cat is on the mat"])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

| | Value |
|---|---|
| score | 0.0 |
| counts | [2, 0, 0, 0] |
| totals | [6, 5, 4, 3] |
| precisions | [33.33, 0.0, 0.0, 0.0] |
| bp | 1.0 |
| sys_len | 6 |
| ref_len | 6 |


In [25]:
bleu_metric.add(
    prediction="the cat is on mat", reference=["the cat is on the mat"])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

| | Value |
|---|---|
| score | 57.893007 |
| counts | [5, 3, 2, 1] |
| totals | [5, 4, 3, 2] |
| precisions | [100.0, 75.0, 66.67, 50.0] |
| bp | 0.818731 |
| sys_len | 5 |
| ref_len | 6 |


In [29]:
# hide_output
rouge_metric = load_metric("rouge")

In [None]:
* Plain precision: (number of generated tokens that appear in the reference) / (number of generated tokens)

---

### Open Questions


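* What is the difference between the abstractive extraction task and the summarization task? Is it only a difference in terminology?

* Summarization usually condenses a single long document with one context. Can summarizing several short sentences be handled in the same way?

* On BLEU's limitations: many of the steps in the formula derived on p. 208 are ad hoc and fragile.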