# Summarization

## Deep Reinforced Model for Abstractive Summarization

[A Deep Reinforced Model for Abstractive Summarization (Romain Paulus, Caiming Xiong, Richard Socher, 2017)](https://arxiv.org/abs/1705.04304) is an encoder-decoder summarization model that is trained with reinforcement learning.

### Sources
1. Blog post describing the paper: https://blog.einstein.ai/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization/
2. Implementation: https://github.com/rohithreddy024/Text-Summarizer-Pytorch

In [None]:
!git clone https://github.com/rohithreddy024/Text-Summarizer-Pytorch

Cloning into 'Text-Summarizer-Pytorch'...
remote: Enumerating objects: 98, done.
remote: Total 98 (delta 0), reused 0 (delta 0), pack-reused 98
Unpacking objects: 100% (98/98), done.


In [None]:
!pip install tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
#!pip install tensorflow==2.3.0
from tensorflow import core

### Data
The model is trained on the Gigaword dataset: https://data.deepai.org/gigaword.zip

In [None]:
# find the share link of the file/folder on Google Drive
#file_share_link = "https://drive.google.com/open?id=0B6N7tANPyVeBNmlSX19Ld2xDU1E"

# extract the ID of the file
#file_id = file_share_link[file_share_link.find("=") + 1:]

# append the id to this REST command
#file_download_link = "https://docs.google.com/uc?export=download&id=" + file_id 

In [None]:
%cd /content/

/content


In [None]:
!wget https://data.deepai.org/gigaword.zip 

--2022-11-26 16:10:47--  https://data.deepai.org/gigaword.zip
Resolving data.deepai.org (data.deepai.org)... 5.9.140.253
Connecting to data.deepai.org (data.deepai.org)|5.9.140.253|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 297054860 (283M) [application/x-zip-compressed]
Saving to: ‘gigaword.zip.1’


2022-11-26 16:10:59 (23.7 MB/s) - ‘gigaword.zip.1’ saved [297054860/297054860]



In [None]:
#!tar -tvf summary.tar.gz
!unzip gigaword.zip

Archive:  gigaword.zip
  inflating: sumdata/DUC2003/input.txt  
  inflating: sumdata/DUC2003/task1_ref0.txt  
  inflating: sumdata/DUC2003/task1_ref1.txt  
  inflating: sumdata/DUC2003/task1_ref2.txt  
  inflating: sumdata/DUC2003/task1_ref3.txt  
  inflating: sumdata/DUC2004/input.txt  
  inflating: sumdata/DUC2004/task1_ref0.txt  
  inflating: sumdata/DUC2004/task1_ref1.txt  
  inflating: sumdata/DUC2004/task1_ref2.txt  
  inflating: sumdata/DUC2004/task1_ref3.txt  
  inflating: sumdata/Giga/input.txt  
  inflating: sumdata/Giga/task1_ref0.txt  
  inflating: sumdata/train/train.article.txt  
  inflating: sumdata/train/train.title.txt  
  inflating: sumdata/train/valid.article.filter.txt  
  inflating: sumdata/train/valid.title.filter.txt  


In [None]:
#!mkdir Text-Summarizer-Pytorch
#!mkdir Text-Summarizer-Pytorch/data
#!mkdir Text-Summarizer-Pytorch/data/unfinished

In [None]:
!mv sumdata/train/* Text-Summarizer-Pytorch/data/unfinished
!rm -rf sumdata/

In [None]:
%cd Text-Summarizer-Pytorch

/content/Text-Summarizer-Pytorch


In [None]:
!pip3 install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Create the .bin files with the data used to train the model
# In make_data_files.py, lines 59 and 60, use [abstract.encode('utf-8')] and [article.encode('utf-8')]
!python make_data_files.py

Completed shuffling train & valid text files
3803957it [02:24, 26342.37it/s]
189651it [00:03, 57382.63it/s]
Completed creating bin file for train & valid
Completed chunking main bin files into smaller ones


Examples of the training data (headlines and article abstracts):

In [None]:
N = 20
with open("data/unfinished/train.article.txt") as f:
    head = [next(f) for x in range(N)]

In [None]:
print(head[1])

at least two people were killed in a suspected bomb attack on a passenger bus in the strife-torn southern philippines on monday , the military said .



In [None]:
N = 20
with open("data/unfinished/train.title.txt") as f:
    head = [next(f) for x in range(N)]
print(head[1])

at least two dead in southern philippines blast



### Model
The idea is to train an encoder-decoder architecture to generate a summary of the input text.

The decoder uses an attention mechanism twice:
1. attention over the encoder states (intra-temporal attention) determines the weight of each word of the input sequence for the current position in the output summary
2. attention over the previous decoder states (intra-decoder attention) keeps the decoder from repeating words in its output.
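
The intra-temporal attention from item 1 can be sketched in plain Python. According to section 2.1 of the paper, the exponentiated score of each input position is divided by the sum of its exponentiated scores from earlier decoding steps, so positions that were already attended to receive less weight later. The function name and the list-of-lists input format are illustrative assumptions, not part of the repository code (where the running sum corresponds to `sum_temporal_srcs`).

```python
import math

def intra_temporal_attention(scores_per_step):
    """Normalize raw attention scores with a temporal penalty.

    scores_per_step[t][i] is the raw score of decoder step t for input
    position i (hypothetical input format). Returns one list of attention
    weights per decoder step; each list sums to 1.
    """
    n_inputs = len(scores_per_step[0])
    past_exp_sum = [0.0] * n_inputs  # sum of exp(score) over earlier steps
    weights = []
    for t, scores in enumerate(scores_per_step):
        exp_scores = [math.exp(s) for s in scores]
        if t == 0:
            penalized = exp_scores  # first step: no history to penalize with
        else:
            penalized = [e / p for e, p in zip(exp_scores, past_exp_sum)]
        total = sum(penalized)
        weights.append([p / total for p in penalized])
        past_exp_sum = [p + e for p, e in zip(past_exp_sum, exp_scores)]
    return weights

# Same raw scores at both steps: the first input position dominates at
# step 0 but loses its advantage at step 1 because of the penalty.
w = intra_temporal_attention([[2.0, 0.0], [2.0, 0.0]])
print(w)
```

In this toy case the second step ends up with equal weights for both positions, which shows how the penalty discourages attending to the same input word repeatedly.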

During training, teacher forcing is used to account for the error at the level of each generated word (Negative Log Likelihood Loss), and reinforcement learning is used to evaluate the quality of the generated text as a whole against the target summary.
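
How the two training signals are combined can be sketched as follows. This is a minimal illustration that mirrors the expressions in `train_one_batch` from the training code in this notebook (the self-critical loss and the weighted sum with `rl_weight = 1 - mle_weight`); the function names and the scalar inputs are assumptions made for readability, since the real code works on batched tensors.

```python
def self_critical_loss(sample_reward, baseline_reward, sample_log_prob):
    """Policy-gradient loss with a greedy-decoding baseline.

    sample_reward:   reward (e.g. ROUGE) of the sampled summary
    baseline_reward: reward of the greedily decoded summary
    sample_log_prob: length-normalized log probability of the sampled summary
    """
    return -(sample_reward - baseline_reward) * sample_log_prob

def mixed_loss(mle_loss, rl_loss, mle_weight):
    """Weighted sum of the teacher-forcing (MLE) loss and the RL loss."""
    rl_weight = 1.0 - mle_weight
    return mle_weight * mle_loss + rl_weight * rl_loss

# Illustrative numbers only: the sample beats the baseline, so minimizing
# the loss pushes the log probability of the sampled summary up.
rl = self_critical_loss(sample_reward=0.5, baseline_reward=0.3, sample_log_prob=-2.0)
print(rl, mixed_loss(mle_loss=2.0, rl_loss=rl, mle_weight=0.25))
```

When the sampled summary scores lower than the greedy one, the sign of the RL term flips and the same update lowers the probability of that sample.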

For reinforcement learning, the ROUGE score is used as the reward metric. ROUGE counts the word n-grams shared by the target and the generated sequence (ROUGE-1 for unigrams, ROUGE-2 for word bigrams, ...). The training code in this notebook uses the ROUGE-L F1 score as the reward.
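
The n-gram overlap behind ROUGE-N can be illustrated with a short sketch. It is a simplified version written for this notebook, assuming whitespace tokenization and clipped n-gram counts; the `rouge` package installed in the next cell may differ in details, so the numbers are not guaranteed to match it exactly. The example pair is taken from the sample outputs listed later in this notebook.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of word n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N precision, recall and F1 for two strings."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped count of shared n-grams
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

ref = "dialogue with dprk important says australian foreign minister"
dec = "australian fm says dialogue with dprk important"
print(rouge_n(dec, ref, n=1))  # 6 shared unigrams: p = 6/7, r = 6/8
print(rouge_n(dec, ref, n=2))  # 3 shared bigrams:  p = 3/6, r = 3/7
```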


![summ_attentions](summ-attentions.svg)

In [None]:
# rouge package for computing the ROUGE metric
!pip install rouge

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


Model training (train.py)

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"    #Set cuda device

import time

import torch as T
import torch.nn as nn
import torch.nn.functional as F
from model import Model

from data_util import config, data
from data_util.batcher import Batcher
from data_util.data import Vocab
from train_util import *
from torch.distributions import Categorical
from rouge import Rouge
from numpy import random
import argparse

random.seed(123)
T.manual_seed(123)
if T.cuda.is_available():
    T.cuda.manual_seed_all(123)

class Train(object):
    def __init__(self, opt):
        self.vocab = Vocab(config.vocab_path, config.vocab_size)
        self.batcher = Batcher(config.train_data_path, self.vocab, mode='train',
                               batch_size=config.batch_size, single_pass=False)
        self.opt = opt
        self.start_id = self.vocab.word2id(data.START_DECODING)
        self.end_id = self.vocab.word2id(data.STOP_DECODING)
        self.pad_id = self.vocab.word2id(data.PAD_TOKEN)
        self.unk_id = self.vocab.word2id(data.UNKNOWN_TOKEN)
        time.sleep(5)

    def save_model(self, iter):
        save_path = config.save_model_path + "/%07d.tar" % iter
        T.save({
            "iter": iter + 1,
            "model_dict": self.model.state_dict(),
            "trainer_dict": self.trainer.state_dict()
        }, save_path)

    def setup_train(self):
        self.model = Model()
        self.model = get_cuda(self.model)
        self.trainer = T.optim.Adam(self.model.parameters(), lr=config.lr)
        start_iter = 0
        if self.opt.load_model is not None:
            load_model_path = os.path.join(config.save_model_path, self.opt.load_model)
            checkpoint = T.load(load_model_path)
            start_iter = checkpoint["iter"]
            self.model.load_state_dict(checkpoint["model_dict"])
            self.trainer.load_state_dict(checkpoint["trainer_dict"])
            print("Loaded model at " + load_model_path)
        if self.opt.new_lr is not None:
            self.trainer = T.optim.Adam(self.model.parameters(), lr=self.opt.new_lr)
        return start_iter

    def train_batch_MLE(self, enc_out, enc_hidden, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, batch):
        ''' Calculate Negative Log Likelihood Loss for the given batch. In order to reduce exposure bias,
                pass the previous generated token as input with a probability of 0.25 instead of ground truth label
        Args:
        :param enc_out: Outputs of the encoder for all time steps (batch_size, length_input_sequence, 2*hidden_size)
        :param enc_hidden: Tuple containing final hidden state & cell state of encoder. Shape of h & c: (batch_size, hidden_size)
        :param enc_padding_mask: Mask for encoder input; Tensor of size (batch_size, length_input_sequence) with values of 0 for pad tokens & 1 for others
        :param ct_e: encoder context vector for time_step=0 (eq 5 in https://arxiv.org/pdf/1705.04304.pdf)
        :param extra_zeros: Tensor used to extend vocab distribution for pointer mechanism
        :param enc_batch_extend_vocab: Input batch that stores OOV ids
        :param batch: batch object
        '''
        dec_batch, max_dec_len, dec_lens, target_batch = get_dec_data(batch)                        #Get input and target batches for training decoder
        step_losses = []
        s_t = (enc_hidden[0], enc_hidden[1])                                                        #Decoder hidden states
        x_t = get_cuda(T.LongTensor(len(enc_out)).fill_(self.start_id))                             #Input to the decoder
        prev_s = None                                                                               #Used for intra-decoder attention (section 2.2 in https://arxiv.org/pdf/1705.04304.pdf)
        sum_temporal_srcs = None                                                                    #Used for intra-temporal attention (section 2.1 in https://arxiv.org/pdf/1705.04304.pdf)
        for t in range(min(max_dec_len, config.max_dec_steps)):
            use_ground_truth = get_cuda((T.rand(len(enc_out)) > 0.25)).long()                       #Per-example 0/1 mask: 1 (probability 0.75) uses the ground truth token, 0 uses the previously sampled token
            x_t = use_ground_truth * dec_batch[:, t] + (1 - use_ground_truth) * x_t                 #Select decoder input based on the use_ground_truth mask
            x_t = self.model.embeds(x_t)
            final_dist, s_t, ct_e, sum_temporal_srcs, prev_s = self.model.decoder(x_t, s_t, enc_out, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, sum_temporal_srcs, prev_s)
            target = target_batch[:, t]
            log_probs = T.log(final_dist + config.eps)
            step_loss = F.nll_loss(log_probs, target, reduction="none", ignore_index=self.pad_id)
            step_losses.append(step_loss)
            x_t = T.multinomial(final_dist, 1).squeeze()                                            #Sample words from final distribution which can be used as input in next time step
            is_oov = (x_t >= config.vocab_size).long()                                              #Mask indicating whether sampled word is OOV
            x_t = (1 - is_oov) * x_t.detach() + (is_oov) * self.unk_id                              #Replace OOVs with [UNK] token

        losses = T.sum(T.stack(step_losses, 1), 1)                                                  #unnormalized losses for each example in the batch; (batch_size)
        batch_avg_loss = losses / dec_lens                                                          #Normalized losses; (batch_size)
        mle_loss = T.mean(batch_avg_loss)                                                           #Average batch loss
        return mle_loss

    def train_batch_RL(self, enc_out, enc_hidden, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, article_oovs, greedy):
        '''Generate sentences from decoder entirely using sampled tokens as input. These sentences are used for ROUGE evaluation
        Args
        :param enc_out: Outputs of the encoder for all time steps (batch_size, length_input_sequence, 2*hidden_size)
        :param enc_hidden: Tuple containing final hidden state & cell state of encoder. Shape of h & c: (batch_size, hidden_size)
        :param enc_padding_mask: Mask for encoder input; Tensor of size (batch_size, length_input_sequence) with values of 0 for pad tokens & 1 for others
        :param ct_e: encoder context vector for time_step=0 (eq 5 in https://arxiv.org/pdf/1705.04304.pdf)
        :param extra_zeros: Tensor used to extend vocab distribution for pointer mechanism
        :param enc_batch_extend_vocab: Input batch that stores OOV ids
        :param article_oovs: Batch containing list of OOVs in each example
        :param greedy: If true, performs greedy based sampling, else performs multinomial sampling
        Returns:
        :decoded_strs: List of decoded sentences
        :log_probs: Log probabilities of sampled words
        '''
        s_t = enc_hidden                                                                            #Decoder hidden states
        x_t = get_cuda(T.LongTensor(len(enc_out)).fill_(self.start_id))                             #Input to the decoder
        prev_s = None                                                                               #Used for intra-decoder attention (section 2.2 in https://arxiv.org/pdf/1705.04304.pdf)
        sum_temporal_srcs = None                                                                    #Used for intra-temporal attention (section 2.1 in https://arxiv.org/pdf/1705.04304.pdf)
        inds = []                                                                                   #Stores sampled indices for each time step
        decoder_padding_mask = []                                                                   #Stores padding masks of generated samples
        log_probs = []                                                                              #Stores log probabilities of generated samples
        mask = get_cuda(T.LongTensor(len(enc_out)).fill_(1))                                        #Values that indicate whether [STOP] token has already been encountered; 1 => Not encountered, 0 otherwise

        for t in range(config.max_dec_steps):
            x_t = self.model.embeds(x_t)
            probs, s_t, ct_e, sum_temporal_srcs, prev_s = self.model.decoder(x_t, s_t, enc_out, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, sum_temporal_srcs, prev_s)
            if greedy is False:
                multi_dist = Categorical(probs)
                x_t = multi_dist.sample()                                                           #perform multinomial sampling
                log_prob = multi_dist.log_prob(x_t)
                log_probs.append(log_prob)
            else:
                _, x_t = T.max(probs, dim=1)                                                        #perform greedy sampling
            x_t = x_t.detach()
            inds.append(x_t)
            mask_t = get_cuda(T.zeros(len(enc_out)))                                                #Padding mask of batch for current time step
            mask_t[mask == 1] = 1                                                                   #If [STOP] is not encountered till previous time step, mask_t = 1 else mask_t = 0
            mask[(mask == 1) & (x_t == self.end_id)] = 0                                            #If [STOP] is not encountered till previous time step and current word is [STOP], make mask = 0 (logical AND; adding boolean tensors and comparing with 2 never matches in recent PyTorch)
            decoder_padding_mask.append(mask_t)
            is_oov = (x_t>=config.vocab_size).long()                                                #Mask indicating whether sampled word is OOV
            x_t = (1-is_oov)*x_t + (is_oov)*self.unk_id                                             #Replace OOVs with [UNK] token

        inds = T.stack(inds, dim=1)
        decoder_padding_mask = T.stack(decoder_padding_mask, dim=1)
        if greedy is False:                                                                         #If multinomial based sampling, compute log probabilities of sampled words
            log_probs = T.stack(log_probs, dim=1)
            log_probs = log_probs * decoder_padding_mask                                            #Not considering sampled words with padding mask = 0
            lens = T.sum(decoder_padding_mask, dim=1)                                               #Length of sampled sentence
            log_probs = T.sum(log_probs, dim=1) / lens  # (bs,)                                     #compute normalized log probability of a sentence
        decoded_strs = []
        for i in range(len(enc_out)):
            id_list = inds[i].cpu().numpy()
            oovs = article_oovs[i]
            S = data.outputids2words(id_list, self.vocab, oovs)                                     #Generate sentence corresponding to sampled words
            try:
                end_idx = S.index(data.STOP_DECODING)
                S = S[:end_idx]
            except ValueError:
                pass                                                                                 #No [STOP] token was generated; keep the full sequence
            if len(S) < 2:                                                                           #If length of sentence is less than 2 words, replace it with "xxx"; Avoids sentences like "." which throw an error while calculating ROUGE
                S = ["xxx"]
            S = " ".join(S)
            decoded_strs.append(S)

        return decoded_strs, log_probs

    def reward_function(self, decoded_sents, original_sents):
        rouge = Rouge()
        try:
            scores = rouge.get_scores(decoded_sents, original_sents)
        except Exception:
            print("Rouge failed for multi sentence evaluation.. Finding exact pair")
            scores = []
            for i in range(len(decoded_sents)):
                try:
                    score = rouge.get_scores(decoded_sents[i], original_sents[i])
                except Exception:
                    print("Error occurred at:")
                    print("decoded_sents:", decoded_sents[i])
                    print("original_sents:", original_sents[i])
                    score = [{"rouge-l":{"f":0.0}}]
                scores.append(score[0])
        rouge_l_f1 = [score["rouge-l"]["f"] for score in scores]
        rouge_l_f1 = get_cuda(T.FloatTensor(rouge_l_f1))
        return rouge_l_f1

    # def write_to_file(self, decoded, max, original, sample_r, baseline_r, iter):
    #     with open("temp.txt", "w") as f:
    #         f.write("iter:"+str(iter)+"\n")
    #         for i in range(len(original)):
    #             f.write("dec: "+decoded[i]+"\n")
    #             f.write("max: "+max[i]+"\n")
    #             f.write("org: "+original[i]+"\n")
    #             f.write("Sample_R: %.4f, Baseline_R: %.4f\n\n"%(sample_r[i].item(), baseline_r[i].item()))


    def train_one_batch(self, batch, iter):
        enc_batch, enc_lens, enc_padding_mask, enc_batch_extend_vocab, extra_zeros, context = get_enc_data(batch)

        enc_batch = self.model.embeds(enc_batch)                                                    #Get embeddings for encoder input
        enc_out, enc_hidden = self.model.encoder(enc_batch, enc_lens)

        # -------------------------------Summarization-----------------------
        if self.opt.train_mle == "yes":                                                             #perform MLE training
            mle_loss = self.train_batch_MLE(enc_out, enc_hidden, enc_padding_mask, context, extra_zeros, enc_batch_extend_vocab, batch)
        else:
            mle_loss = get_cuda(T.FloatTensor([0]))
        # --------------RL training-----------------------------------------------------
        if self.opt.train_rl == "yes":                                                              #perform reinforcement learning training
            # multinomial sampling
            sample_sents, RL_log_probs = self.train_batch_RL(enc_out, enc_hidden, enc_padding_mask, context, extra_zeros, enc_batch_extend_vocab, batch.art_oovs, greedy=False)
            with T.autograd.no_grad():
                # greedy sampling
                greedy_sents, _ = self.train_batch_RL(enc_out, enc_hidden, enc_padding_mask, context, extra_zeros, enc_batch_extend_vocab, batch.art_oovs, greedy=True)

            sample_reward = self.reward_function(sample_sents, batch.original_abstracts)
            baseline_reward = self.reward_function(greedy_sents, batch.original_abstracts)
            # if iter%200 == 0:
            #     self.write_to_file(sample_sents, greedy_sents, batch.original_abstracts, sample_reward, baseline_reward, iter)
            rl_loss = -(sample_reward - baseline_reward) * RL_log_probs                             #Self-critic policy gradient training (eq 15 in https://arxiv.org/pdf/1705.04304.pdf)
            rl_loss = T.mean(rl_loss)

            batch_reward = T.mean(sample_reward).item()
        else:
            rl_loss = get_cuda(T.FloatTensor([0]))
            batch_reward = 0

    # ------------------------------------------------------------------------------------
        self.trainer.zero_grad()
        (self.opt.mle_weight * mle_loss + self.opt.rl_weight * rl_loss).backward()
        self.trainer.step()

        return mle_loss.item(), batch_reward

    def trainIters(self):
        iter = self.setup_train()
        count = mle_total = r_total = 0
        while iter <= config.max_iterations:
            batch = self.batcher.next_batch()
            try:
                mle_loss, r = self.train_one_batch(batch, iter)
            except KeyboardInterrupt:
                print("-------------------Keyboard Interrupt------------------")
                exit(0)

            mle_total += mle_loss
            r_total += r
            count += 1
            iter += 1

            if iter % 1000 == 0:
                mle_avg = mle_total / count
                r_avg = r_total / count
                print("iter:", iter, "mle_loss:", "%.3f" % mle_avg, "reward:", "%.4f" % r_avg)
                count = mle_total = r_total = 0

            if iter % 5000 == 0:
                self.save_model(iter)


# parser = argparse.ArgumentParser()
# parser.add_argument('--train_mle', type=str, default="yes")
# parser.add_argument('--train_rl', type=str, default="no")
# parser.add_argument('--mle_weight', type=float, default=1.0)
# parser.add_argument('--load_model', type=str, default=None)
# parser.add_argument('--new_lr', type=float, default=None)
# opt = parser.parse_args()
# opt.rl_weight = 1 - opt.mle_weight
# print("Training mle: %s, Training rl: %s, mle weight: %.2f, rl weight: %.2f"%(opt.train_mle, opt.train_rl, opt.mle_weight, opt.rl_weight))
# print("intra_encoder:", config.intra_encoder, "intra_decoder:", config.intra_decoder)

# train_processor = Train(opt)
# train_processor.trainIters()

First, the encoder-decoder model is trained without reinforcement learning.

In [None]:
!python train.py --train_mle=yes --train_rl=no --mle_weight=1.0

2022-11-26 07:11:25.842555: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2022-11-26 07:11:25.842596: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Training mle: yes, Training rl: no, mle weight: 1.00, rl weight: 0.00
intra_encoder: True intra_decoder: True
Traceback (most recent call last):
  File "train.py", line 269, in <module>
    train_processor = Train(opt)
  File "train.py", line 27, in __init__
    self.vocab = Vocab(config.vocab_path, config.vocab_size)
  File "/content/Text-Summarizer-Pytorch/data_util/data.py", line 35, in __init__
    with open(vocab_file, 'r') as vocab_f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/vocab'


The best model (among those trained for different numbers of iterations) is selected by its ROUGE score on the validation set.

In [None]:
!python eval.py --task=validate --start_from=0005000.tar

The best model is then fine-tuned with RL.

In [None]:
# MLE + RL training
!python train.py --train_mle=yes --train_rl=yes --mle_weight=0.25 --load_model=0100000.tar --new_lr=0.0001 

# RL training
!python train.py --train_mle=no --train_rl=yes --mle_weight=0.0 --load_model=0100000.tar --new_lr=0.0001

The model trained with RL only reaches higher ROUGE scores but generates text that is less coherent and natural, so the authors of the paper recommend the combined training strategy.

* Results reported by the authors of the repository:

Rouge scores obtained by using best MLE trained model on test set:

{
'rouge-1': {'f': 0.4412018559893622, 'p': 0.4814799494024485, 'r': 0.4232331027817015}, 

'rouge-2': {'f': 0.23238981595683728, 'p': 0.2531296070596062, 'r': 0.22407861554997008},

'rouge-l': {'f': 0.40477682528278364, 'p': 0.4584684491434479, 'r': 0.40351107200202596}
}


Rouge scores obtained by using best MLE + RL trained model on test set:

{
'rouge-1': {'f': 0.4499047033247696, 'p': 0.4853756369556345, 'r': 0.43544461386607497},

'rouge-2': {'f': 0.24037014314625643, 'p': 0.25903387205387235, 'r': 0.23362662645146298},

'rouge-l': {'f': 0.41320241732946406, 'p': 0.4616655167980162, 'r': 0.4144419466382236}
}

* Examples (article - source text, ref - target summary, dec - text generated by the model):

article: russia 's lower house of parliament was scheduled friday to debate an appeal to the prime minister that challenged the right of u.s.-funded radio liberty to operate in russia following its introduction of broadcasts targeting chechnya .

ref: russia 's lower house of parliament mulls challenge to radio liberty

dec: russian parliament to debate on banning radio liberty


article: continued dialogue with the democratic people 's republic of korea is important although australia 's plan to open its embassy in pyongyang has been shelved because of the crisis over the dprk 's nuclear weapons program , australian foreign minister alexander downer said on friday .

ref: dialogue with dprk important says australian foreign minister

dec: australian fm says dialogue with dprk important

article: water levels in the zambezi river are rising due to heavy rains in its catchment area , prompting zimbabwe 's civil protection unit -lrb- cpu -rrb- to issue a flood alert for people living in the zambezi valley , the herald reported on friday .

ref: floods loom in zambezi valley

dec: water levels rising in zambezi river

## Transformers seq2seq models for summarization
[Source](https://rubikscode.net/2022/04/25/text-summarization-with-huggingface-transformers/)

In [1]:
!pip install transformers
import transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/



### Pegasus
Pegasus is a standard Transformer encoder-decoder. Its pre-training task resembles extractive summarization: important sentences are removed (masked) from the input document, and the model generates them together as one output sequence from the remaining sentences.

In other words, the encoder receives the document with the selected sentences masked, and the decoder generates these gap sentences. The Pegasus paper introduces this gap-sentence generation objective and explains strategies for selecting the sentences. More details can be found in [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu](https://arxiv.org/pdf/1912.08777.pdf).
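
The gap-sentence idea can be sketched in a few lines of plain Python. This is a rough illustration, not the Pegasus implementation: it assumes the document is already split into sentences, scores every sentence by a simplified ROUGE-1 F1 against the rest of the document (one of the selection strategies discussed in the paper), and uses `<mask_1>` as an illustrative placeholder for the sentence mask. All function names are hypothetical.

```python
from collections import Counter

def rouge1_f(candidate_tokens, reference_tokens):
    """Simplified ROUGE-1 F1 between two token lists."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    p = overlap / len(candidate_tokens)
    r = overlap / len(reference_tokens)
    return 2 * p * r / (p + r)

def gap_sentence_example(sentences, num_gaps=1, mask_token="<mask_1>"):
    """Build one (source, target) pre-training pair from a list of sentences."""
    scores = []
    for i, sent in enumerate(sentences):
        # Tokens of every sentence except the one being scored
        rest = [tok for j, s in enumerate(sentences) if j != i for tok in s.split()]
        scores.append(rouge1_f(sent.split(), rest))
    # Pick the highest-scoring sentences, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:num_gaps])
    source = " ".join(mask_token if i in selected else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in selected)
    return source, target

doc = ["the eiffel tower is in paris", "the tower is tall", "paris is a city"]
source, target = gap_sentence_example(doc)
print(source)  # encoder input with the selected sentence masked
print(target)  # decoder target: the masked sentence
```

The first sentence shares the most words with the rest of the toy document, so it becomes the gap sentence that the decoder has to generate.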

### BART
BART is a sequence-to-sequence model pre-trained as a denoising autoencoder: the input text is corrupted with a noising function, and the model learns to reconstruct the original text. Because it maps an input sequence to an output sequence, BART has found applications in many tasks besides text summarization, such as question answering, machine translation, etc.

The `facebook/bart-large-cnn` checkpoint used in this notebook is pre-trained on English text and fine-tuned on CNN/Daily Mail. More information about the model can be found in the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Lewis et al.](https://arxiv.org/abs/1910.13461)
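
To make the denoising objective more concrete, here is a small sketch of one noising scheme described in the BART paper, text infilling, where a span of tokens is replaced by a single mask token. The function name, the `<mask>` string and the explicit `(start, length)` spans are assumptions for illustration; in the paper the spans are sampled randomly, and this sketch expects them to be non-overlapping.

```python
def text_infilling(tokens, spans, mask_token="<mask>"):
    """Replace each (start, length) span of tokens with a single mask token.

    A span of length 0 inserts a mask without removing any token.
    Spans are assumed not to overlap.
    """
    corrupted = []
    position = 0
    for start, length in sorted(spans):
        corrupted.extend(tokens[position:start])  # copy the untouched tokens
        corrupted.append(mask_token)              # one mask for the whole span
        position = start + length
    corrupted.extend(tokens[position:])
    return corrupted

original = "the tower is 324 meters tall".split()
corrupted = text_infilling(original, spans=[(1, 2)])
print(" ".join(corrupted))  # the <mask> 324 meters tall
# During pre-training the model receives `corrupted` as input
# and is trained to output `original`.
```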

### mT5
XL-Sum is a dataset of about 1 million annotated article-summary pairs from the BBC. It covers 44 languages and is described as the largest dataset in terms of the number of samples collected from a single source.

The `csebuetnlp/mT5_multilingual_XLSum` checkpoint used in this notebook is a pre-trained multilingual T5 (mT5) model fine-tuned on the XL-Sum dataset. More details can be found in [XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages](https://aclanthology.org/2021.findings-acl.413.pdf).


*The tower is 324 meters (1,063 ft) tall, about the same height 
as an 81-storey building, and the tallest structure in Paris. Its base is square, 
measuring 125 meters (410 ft) on each side. During its construction, the Eiffel 
Tower surpassed the Washington Monument to become the tallest man-made structure 
in the world, a title it held for 41 years until the Chrysler Building in New York
City was finished in 1930. It was the first structure to reach a height of 300 meters. 
Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is 
now taller than the Chrysler Building by 5.2 meters (17 ft). Excluding transmitters, 
the Eiffel Tower is the second tallest free-standing structure in France
after the Millau Viaduct.*

In [7]:
text_example = 'The tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 meters (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 meters. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 meters (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.'

text_example_ru = 'В первом тайме голландцы забили дважды: на 10-й минуте отличился форвард «Барселоны» Мемфис Депай, а на 45-й — полузащитник Дэйли Блинд из «Аякса». Американцы отыграли один мяч на 76-й минуте после точного удара Хаджи Райта из «Антальяспора», а окончательный счет спустя пять минут установил Дензел Дюмфрис.'

Using Pipeline

In [4]:
from transformers import pipeline

In [5]:
summarizer = pipeline("summarization", model = "google/pegasus-xsum")
summarizer(text_example)

[{'summary_text': 'The Eiffel Tower is a free-standing structure in Paris, France.'}]

In [None]:
summarizer = pipeline("summarization", model = "facebook/bart-large-cnn")
summarizer(text_example)

In [5]:
summarizer = pipeline("summarization", model= "csebuetnlp/mT5_multilingual_XLSum")
summarizer(text_example)



[{'summary_text': 'The Eiffel Tower has become the tallest free-standing building in the world.'}]

In [8]:
summarizer(text_example_ru)

[{'summary_text': 'В матче группового этапа Лиги чемпионов "Барселоны" и "Аякса" победили голландские клубы.'}]

Using AutoModel

In [9]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained('google/pegasus-xsum')
tokenizer = AutoTokenizer.from_pretrained('google/pegasus-xsum')

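# Note: "summarize: " is a T5-style task prefix; Pegasus does not need it, and it is kept here only to match the other cells
tokens_input = tokenizer.encode("summarize: "+ text_example, return_tensors='pt', max_length=512, truncation=True)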
ids = model.generate(tokens_input, min_length=80, max_length=120)
summary = tokenizer.decode(ids[0], skip_special_tokens=True)

print(summary)

In [10]:
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

tokens_input = tokenizer.encode("summarize: "+text_example, return_tensors='pt', max_length=512, truncation=True)
ids = model.generate(tokens_input, min_length=80, max_length=120)
summary = tokenizer.decode(ids[0], skip_special_tokens=True)

print(summary)

The tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building. Its base is square, measuring 125 meters (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world. It held the title for 41 years until the Chrysler Building in New York City was finished in 1930.


In [12]:
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")

model = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")

tokens_input = tokenizer.encode("summarize: "+text_example, return_tensors='pt', max_length=512, truncation=True)
summary_ids = model.generate(tokens_input, min_length=80, max_length=120)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

В матче группового этапа чемпионата мира по футболу в Лиге чемпионов Голландия и США одержали победу над английским «Аякс» со счетом 2:1. В первом тайме голландцы забили два мяча, а американцы - один мяч, но их сыграли вничью.


In [None]:
tokens_input = tokenizer.encode("summarize: "+text_example_ru, return_tensors='pt', max_length=512, truncation=True)
summary_ids = model.generate(tokens_input, min_length=80, max_length=120)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

## BERT Extractive Summarization

### Sources:
https://deeplearninganalytics.org/text-summarization/

https://github.com/nlpyang/BertSum


Idea: use BERT embeddings of the source sentences in a binary classification task that selects the most important sentences to include in the summary.

To obtain embeddings for several sentences of one text, each sentence is preceded by its own start-of-sentence token **[CLS]** and followed by a **[SEP]** token. The segment embeddings (which BERT uses during pre-training to tell the first sentence from the second in a sentence pair) alternate between the two segment vectors, i.e. segment ids 0 and 1, along the sequence of sentences.

_[sent1, sent2, sent3, sent4, sent5] -> [EA, EB, EA, EB, EA]._
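The input layout described above can be sketched in plain Python. This is a minimal illustration, not the BertSum preprocessing code: the function name `build_bertsum_input` and the toy sentences are assumptions, and real inputs would use BERT word-piece ids rather than strings.

```python
def build_bertsum_input(sentences):
    """Wrap each tokenized sentence in [CLS] ... [SEP] and build
    alternating segment ids (0 for E_A, 1 for E_B).

    `sentences` is a list of token lists (hypothetical toy input).
    """
    tokens, segment_ids, cls_positions = [], [], []
    for i, sent in enumerate(sentences):
        cls_positions.append(len(tokens))         # index of this sentence's [CLS]
        piece = ["[CLS]"] + sent + ["[SEP]"]
        tokens.extend(piece)
        segment_ids.extend([i % 2] * len(piece))  # 0, 1, 0, 1, ... per sentence
    return tokens, segment_ids, cls_positions


sents = [["the", "cat", "sat"], ["it", "purred"], ["then", "it", "slept"]]
tokens, segs, clss = build_bertsum_input(sents)
print(tokens)
print(segs)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(clss)  # [0, 5, 9]
```

The `segs` and `clss` lists have the same shape as the `segs` and `clss` fields of the preprocessed CNN/Daily Mail samples shown later in this notebook.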

The vectors of the [CLS] tokens from the last BERT layer are used as the sentence vectors. These sentence vectors are fed to a classifier (the paper describes three variants):
1. linear layer + sigmoid
2. Transformer + sigmoid
3. LSTM + sigmoid

![bertsum](bertsum.png)
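The simplest of the three variants, a linear layer followed by a sigmoid, can be illustrated without any deep learning framework. This is only a sketch under stated assumptions: the vectors, `weights`, and `bias` below are made-up toy values, and `score_sentences` is a hypothetical helper, not a function from BertSum.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_sentences(cls_vectors, weights, bias):
    """Apply one linear layer plus sigmoid to every sentence vector.

    Returns the probability that each sentence belongs in the summary.
    """
    return [
        sigmoid(sum(w * h for w, h in zip(weights, vec)) + bias)
        for vec in cls_vectors
    ]


# Toy 3-dimensional "[CLS] vectors" for three sentences (assumed values).
cls_vectors = [[0.2, -0.1, 0.4], [1.0, 0.5, -0.3], [-0.6, 0.0, 0.1]]
weights = [0.5, -1.0, 2.0]
bias = 0.0

probs = score_sentences(cls_vectors, weights, bias)
ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
print(probs)
print(ranking)  # sentence indices, most summary-worthy first
```

In training, the probabilities would be compared with the 0/1 sentence labels using a binary cross-entropy loss; at inference time the top-ranked sentences form the extractive summary.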

In [None]:
!pip install --force-reinstall torch==1.1.0


In [None]:
!pip install pytorch-pretrained-bert


In [None]:
!pip install tensorboardX



In [None]:
!pip install pyrouge




In [None]:
!pip install multiprocess



### Data

The CNN and Daily Mail dataset.

Download the preprocessed data.

In [None]:
%cd /content

/content


In [None]:
!wget --no-check-certificate --load-cookies /tmp/cookies.txt "http://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'http://docs.google.com/uc?export=download&id=1x0d61LP9UAN389YN00z0Pv-7jQgirVg6' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1x0d61LP9UAN389YN00z0Pv-7jQgirVg6" -O bertsum_data.zip && rm -rf /tmp/cookies.txt

URL transformed to HTTPS due to an HSTS policy
--2022-11-26 15:52:17--  https://docs.google.com/uc?export=download&confirm=t&id=1x0d61LP9UAN389YN00z0Pv-7jQgirVg6
Resolving docs.google.com (docs.google.com)... 173.194.195.101, 173.194.195.113, 173.194.195.138, ...
Connecting to docs.google.com (docs.google.com)|173.194.195.101|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-04-0g-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/6phpe5e0ifccad2m7rk402md4bsm1tq6/1669477875000/02403291851892694101/*/1x0d61LP9UAN389YN00z0Pv-7jQgirVg6?e=download&uuid=4b99cf05-e603-4f81-820a-d25d9728f198 [following]
--2022-11-26 15:52:17--  https://doc-04-0g-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/6phpe5e0ifccad2m7rk402md4bsm1tq6/1669477875000/02403291851892694101/*/1x0d61LP9UAN389YN00z0Pv-7jQgirVg6?e=download&uuid=4b99cf05-e603-4f81-820a-d25d9728f198
Resolving doc-04-0g-docs.googleusercontent.com (doc-04

In [None]:
!git clone https://github.com/nlpyang/BertSum

Cloning into 'BertSum'...
remote: Enumerating objects: 301, done.[K
remote: Counting objects: 100% (293/293), done.[K
remote: Compressing objects: 100% (124/124), done.[K
remote: Total 301 (delta 165), reused 290 (delta 164), pack-reused 8[K
Receiving objects: 100% (301/301), 15.05 MiB | 18.90 MiB/s, done.
Resolving deltas: 100% (165/165), done.


In [None]:
%cd BertSum

/content/BertSum


In [None]:
!unzip ../bertsum_data.zip -d ./bert_data

Archive:  ../bertsum_data.zip
  inflating: ./bert_data/cnndm.test.0.bert.pt  
  inflating: ./bert_data/cnndm.test.1.bert.pt  
  inflating: ./bert_data/cnndm.test.2.bert.pt  
  inflating: ./bert_data/cnndm.test.3.bert.pt  
  inflating: ./bert_data/cnndm.test.4.bert.pt  
  inflating: ./bert_data/cnndm.test.5.bert.pt  
  inflating: ./bert_data/cnndm.train.0.bert.pt  
  inflating: ./bert_data/cnndm.train.100.bert.pt  
  inflating: ./bert_data/cnndm.train.101.bert.pt  
  inflating: ./bert_data/cnndm.train.102.bert.pt  
  inflating: ./bert_data/cnndm.train.103.bert.pt  
  inflating: ./bert_data/cnndm.train.104.bert.pt  
  inflating: ./bert_data/cnndm.train.105.bert.pt  
  inflating: ./bert_data/cnndm.train.106.bert.pt  
  inflating: ./bert_data/cnndm.train.107.bert.pt  
  inflating: ./bert_data/cnndm.train.108.bert.pt  
  inflating: ./bert_data/cnndm.train.109.bert.pt  
  inflating: ./bert_data/cnndm.train.10.bert.pt  
  inflating: ./bert_data/cnndm.train.110.bert.pt  
  inflating: ./bert_da

In [None]:
%cd src

/content/BertSum/src


Example of the input data

In [None]:
import torch
cnn_test_samp = torch.load("/content/BertSum/bert_data/cnndm.test.0.bert.pt")

In [None]:
cnn_test_samp0 = cnn_test_samp[0]

In [None]:
cnn_test_samp0.keys()

dict_keys(['src', 'labels', 'segs', 'clss', 'src_txt', 'tgt_txt'])

In [None]:
print(cnn_test_samp0['clss']) # positions of the [CLS] tokens for the sentences of the input text
print(cnn_test_samp0['labels']) # target labels for the sentences (1 = included in the summary, 0 = not included)
print(cnn_test_samp0['segs']) # segment ids of the sentences
print(cnn_test_samp0['src']) # token ids

[0, 25, 57, 78, 112, 136, 174, 197, 223, 245, 285, 301, 337, 358, 382, 416, 452]
[0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,

In [None]:
cnn_test_samp0['src_txt'] # input text


['a university of iowa student has died nearly three months after a fall in rome in a suspected robbery attack in rome .',
 'andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program in italy when the incident happened in january .',
 'he was flown back to chicago via air ambulance on march 20 , but he died on sunday .',
 'andrew mogni , 20 , from glen ellyn , illinois , a university of iowa student has died nearly three months after a fall in rome in a suspected robbery',
 'he was taken to a medical facility in the chicago area , close to his family home in glen ellyn .',
 "he died on sunday at northwestern memorial hospital - medical examiner 's office spokesman frank shuftan says a cause of death wo n't be released until monday at the earliest .",
 'initial police reports indicated the fall was an accident but authorities are investigating the possibility that mogni was robbed .',
 "on sunday , his cousin abby wrote online : ` this morning my cous

In [None]:
cnn_test_samp0['tgt_txt'] # target summary

'andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program when the incident happened in january<q>he was flown back to chicago via air on march 20 but he died on sunday<q>initial police reports indicated the fall was an accident but authorities are investigating the possibility that mogni was robbed<q>his cousin claims he was attacked and thrown 40ft from a bridge'

Model training

In [None]:
!python train.py -mode train -encoder classifier -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_classifier -lr 2e-3 -visible_gpus 0  -gpu_ranks 0 -world_size 1 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 2000 -accum_count 2 -log_file ../logs/bert_classifier -use_interval true -warmup_steps 10000

[2022-11-26 15:53:46,518 INFO] Device ID 0
[2022-11-26 15:53:46,518 INFO] Device cuda
[2022-11-26 15:53:46,746 INFO] https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz not found in cache, downloading to /tmp/tmp4i8q0k4f
100% 407873900/407873900 [00:06<00:00, 66661322.98B/s]
[2022-11-26 15:53:53,041 INFO] copying /tmp/tmp4i8q0k4f to cache at ../temp/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
[2022-11-26 15:53:54,347 INFO] creating metadata file for ../temp/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
[2022-11-26 15:53:54,348 INFO] removing temp file /tmp/tmp4i8q0k4f
[2022-11-26 15:53:54,415 INFO] loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at ../temp/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca

Evaluation on the validation and test data

In [None]:
!python train.py -mode validate -bert_data_path ../bert_data/cnndm -model_path ../models/bert_classifier  -visible_gpus 0  -gpu_ranks 0 -batch_size 30000  -log_file ../logs/bert_classifier_valid  -result_path ../results/cnndm -test_all -block_trigram true
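The `-block_trigram true` option in the command above refers to trigram blocking: when picking the top-scoring sentences, a candidate is skipped if it shares a trigram with a sentence already selected, which reduces redundancy. The sketch below shows the idea only; the function names, the toy sentences, and the scores are assumptions, and the actual BertSum code may differ in details such as tokenization.

```python
def trigrams(sentence):
    """Set of word trigrams of a whitespace-tokenized sentence."""
    words = sentence.split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def select_with_trigram_blocking(sentences, scores, max_sentences=3):
    """Greedily pick the highest-scoring sentences, skipping any sentence
    that repeats a trigram from the sentences picked so far."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    selected, seen = [], set()
    for i in order:
        tri = trigrams(sentences[i])
        if tri & seen:           # overlap with the summary so far: block it
            continue
        selected.append(i)
        seen |= tri
        if len(selected) == max_sentences:
            break
    return selected


sentences = [
    "the tower is 324 meters tall",
    "the tower is 324 meters high and made of iron",
    "it was built in 1889",
    "paris is the capital of france",
]
scores = [0.9, 0.8, 0.7, 0.2]  # hypothetical classifier outputs
print(select_with_trigram_blocking(sentences, scores))  # [0, 2, 3]
```

The second sentence has the second-highest score but is blocked because it repeats the trigram "the tower is" from the first one.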

Summary examples

In [None]:
# extracted summary
N = 20
with open("/content/BertSum/results/cnndm_step2000.candidate") as f:
    dec = [next(f) for x in range(N)]

In [None]:
# target summary
N = 20
with open("/content/BertSum/results/cnndm_step2000.gold") as f:
    ref = [next(f) for x in range(N)]

In [None]:
ref[0].split('<q>')

['the 79th masters tournament gets underway at augusta national on thursday',
 'rory mcilroy and tiger woods will be the star attractions in the field bidding for the green jacket at 2015 masters',
 'mcilroy , justin rose , ian poulter , graeme mcdowell and more gave sportsmail the verdict on each hole at augusta',
 'click on the brilliant interactive graphic below for details on each hole of the masters 2015 course',
 'click here for all the latest news from the masters 2015\n']

In [None]:
dec[0].split('<q>')

['to help get you in the mood for the first major of the year , rory mcilroy , ian poulter , graeme mcdowell and justin rose , plus past masters champions nick faldo and charl schwartzel , give the lowdown on every hole at the world-famous augusta national golf club .',
 'the masters 2015 is almost here .',
 'click on the graphic below to get a closer look at what the biggest names in the game will face when they tee off on thursday .\n']

In [None]:
ref[1].split('<q>')

["jeff powell looks ahead to saturday 's fight at the mgm grand",
 'floyd mayweather takes on manny pacquiao in $ 300m showdown',
 'both fighters arrived in las vegas on tuesday with public appearances',
 'read : mayweather makes official arrival ahead of manny pacquiao fight',
 'al haymon : the man behind mayweather who is revolutionising boxing',
 "mayweather vs pacquiao takes centre stage ... but who 's on the undercard ?\n"]

In [None]:
dec[1].split('<q>')

["powell reflects on the pair 's arrivals on the las vegas strip and looks forward to the rest of the week .",
 'both boxers made public appearances on tuesday as their $ 300million showdown draws ever closer , and our man powell was there .',
 "sportsmail 's boxing correspondent jeff powell looks ahead to saturday 's mega-fight at the mgm grand after witnessing floyd mayweather and manny pacquiao 's grand arrivals in las vegas .\n"]

In [None]:
ref[2].split('<q>')

['gary locke has been interim manager since start of february',
 'locke has won two and drawn four of his seven games in charge',
 'the 37-year-old took over when allan johnston quit\n']

In [None]:
dec[2].split('<q>')

['the former hearts boss joined the club as assistant boss to allan johnston last summer but took control of the team when his ex-tynecastle team-mate quit at the start of february .',
 'gary locke has been given the job at kilmarnock on a permanent basis after a successful interim spell',
 'the 39-year-old - who will speak at a press conference on friday morning - has lost just once in seven games since taking over at rugby park .\n']

## Sumy
Sumy is one of the most complete and best-maintained libraries for extractive summarization. It implements several algorithms, has a command-line interface, and has a [web demo](https://huggingface.co/spaces/issam9/sumy_space) you can experiment with. It accepts both raw text and web links as input. Sumy also includes the necessary preprocessing tools (parsers, tokenizers, and stemmers) and supports many languages.

In [None]:
!pip install gensim spacy numpy nltk sumy rouge


In [None]:
!pip install datasets
import datasets
import numpy as np
import rouge



In [None]:
dataset = datasets.load_dataset("cnn_dailymail", '3.0.0')
first_entry = dataset['train'][0]

Downloading and preparing dataset cnn_dailymail/3.0.0 to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de...


Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

Dataset cnn_dailymail downloaded and prepared to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de. Subsequent calls will reuse this data.


In [None]:
print(first_entry)

{'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office char

### Luhn's summarizer

Luhn's summarizer was one of the first attempts at automatic text summarization. In his 1958 paper "The Automatic Creation of Literature Abstracts", Luhn proposes that a word's frequency determines its significance.

At the preprocessing stage, words are stemmed and stop words are removed. The stems are then listed in order of decreasing frequency, and the most frequent ones are treated as significant. A sentence is considered representative when it contains a cluster of significant words in which neighbouring significant words are separated by no more than 4 or 5 non-significant words. Only such portions, bracketed by significant words, are scored rather than the whole sentence: the significance factor of a portion is the square of the number of significant words in it divided by the total number of words in the portion. If a sentence contains several portions, it is assigned the maximum significance factor among them. Finally, the top-scoring sentences are included in the summary.
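The scoring rule can be sketched as follows. In Luhn's paper the significance factor of a portion is the squared number of significant words in it divided by the total number of words in the portion. The code is a simplified illustration and is not Sumy's implementation; the function name `luhn_sentence_score`, the `max_gap` parameter, and the toy input are assumptions.

```python
def luhn_sentence_score(words, significant, max_gap=4):
    """Maximum significance factor over all portions of one sentence.

    A portion starts and ends with a significant word, and consecutive
    significant words inside it are at most `max_gap` other words apart.
    """
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:   # still inside the same portion
            count += 1
        else:                           # gap too large: close the portion
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))


# Toy sentence: "x" stands for any non-significant word.
words = ["model", "x", "x", "summary", "model", "x", "x", "x", "x", "x", "text"]
significant = {"model", "summary", "text"}
print(luhn_sentence_score(words, significant))  # 1.8
```

Here the first portion spans 5 words and contains 3 significant ones, giving 3² / 5 = 1.8; the last significant word is 5 words away, so it forms its own portion with factor 1, and the sentence gets the maximum, 1.8.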

In [None]:
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.parsers.plaintext import PlaintextParser
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
def summarize_luhn(article: str, sentence_count: int) -> str:
    ''' Utility function to perform Luhn's summarization.

        By default, LuhnSummarizer treats 100% of the non-stop-word stems as
        significant, but you can override the significant_percentage attribute
        with a fraction: summarizerLuhn.significant_percentage = 1/3

    '''
        
    parser = PlaintextParser.from_string(article, Tokenizer('english'))
    summarizerLuhn = LuhnSummarizer(Stemmer('english'))
    summarizerLuhn.stop_words = get_stop_words('english')
    luhn_summary = summarizerLuhn(parser.document, sentences_count = sentence_count)
    return ' '.join([str(sentence) for sentence in luhn_summary])

In [None]:
summarize_luhn(first_entry['article'], sentence_count = 2)

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties.'

### Task 1.
Compute ROUGE1-3 for this article.

### Task 2.
Implement LSA summarization.