<a href="https://colab.research.google.com/github/Anita1017/Abstractive-Text-Summarization/blob/main/Bert%20Classifier%20Extractive%20Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Extractive Summarization on CNN/DM Dataset using Transformer Version of BertSum


### Summary

This notebook demonstrates how to fine-tune Transformers for extractive text summarization. Utility functions and classes in the NLP Best Practices repo are used to facilitate data preprocessing, model training, model scoring, result postprocessing, and model evaluation.

BertSum refers to [Fine-tune BERT for Extractive Summarization](https://arxiv.org/pdf/1903.10318.pdf) and its [published example](https://github.com/nlpyang/BertSum/). The Transformer version of BertSum refers to our modification of BertSum; the source code is available at https://github.com/daden-ms/BertSum/.

Extractive summarization is usually applied to document summarization, where each input document consists of multiple sentences. Preprocessing the training data involves assigning a label of 0 or 1 to each document sentence based on the given summary. The summarization problem is thereby simplified to classifying whether a document sentence should be included in the summary.
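
To make the 0/1 labeling concrete, here is a minimal sketch of a greedy selection strategy. It is an illustration only: it scores candidates with a plain unigram-overlap F1 rather than real ROUGE, the function names `unigram_f1` and `greedy_labels` are hypothetical, and it is not the implementation used by the `utils_nlp` package.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """F1 of unigram overlap, a simplified stand-in for ROUGE-1."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_labels(doc_sentences, summary, max_selected=3):
    """Return one 0/1 label per document sentence (hypothetical helper)."""
    reference = summary.lower().split()
    tokenized = [s.lower().split() for s in doc_sentences]
    selected, best_score = [], 0.0
    while len(selected) < max_selected:
        best_idx = None
        for idx in range(len(tokenized)):
            if idx in selected:
                continue
            # Score the already selected sentences plus this candidate.
            candidate = [tok for i in selected + [idx] for tok in tokenized[i]]
            score = unigram_f1(candidate, reference)
            if score > best_score:
                best_score, best_idx = score, idx
        if best_idx is None:  # no remaining sentence improves the score
            break
        selected.append(best_idx)
    return [1 if i in selected else 0 for i in range(len(doc_sentences))]


doc = [
    "the cat sat on the mat .",
    "stocks fell sharply on monday .",
    "the dog chased the cat .",
]
labels = greedy_labels(doc, "the cat sat on the mat . the dog chased the cat .")
print(labels)
```

Each pass adds the sentence that most improves the overlap with the reference summary and stops as soon as no sentence helps, so unrelated sentences keep label 0.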

The figure below illustrates how BertSum can be fine-tuned for the extractive summarization task. A [CLS] token is inserted at the beginning of each sentence and a [SEP] token at the end. Interval segment embeddings and positional embeddings are added to the token embeddings to form the input of the BERT model. The [CLS] token representation is used as the sentence embedding, and only the [CLS] tokens are used as input to the summarization model. The summarization layer predicts the probability of each sentence being included in the summary. Techniques like trigram blocking can be used to improve model accuracy.
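
Trigram blocking can be sketched in a few lines. The example below assumes sentences have already been scored by the summarization layer (the scores here are made up) and uses whitespace tokenization. `select_with_trigram_blocking` is a hypothetical helper name, not part of `utils_nlp`.

```python
def trigrams(tokens):
    """Set of all consecutive 3-token windows in a token list."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_with_trigram_blocking(sentences, scores, max_sentences=3):
    """Pick top-scoring sentences, skipping any that repeat an already selected trigram."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, seen = [], set()
    for idx in ranked:
        sentence_trigrams = trigrams(sentences[idx].lower().split())
        if sentence_trigrams & seen:
            continue  # overlaps with the summary so far, so block it
        chosen.append(idx)
        seen |= sentence_trigrams
        if len(chosen) == max_sentences:
            break
    return [sentences[i] for i in sorted(chosen)]  # restore document order


sentences = [
    "he died on sunday in chicago .",
    "police say he died on sunday .",
    "the fall is under investigation .",
]
summary = select_with_trigram_blocking(sentences, scores=[0.9, 0.8, 0.4], max_sentences=2)
print(summary)
```

The second sentence has the second-highest score but shares the trigram "he died on" with the first, so it is skipped in favor of the third.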

<img src="https://nlpbp.blob.core.windows.net/images/BertSum.PNG">


### Before You Start

The running times shown in this notebook were measured on a Standard_NC24s_v3 Azure Ubuntu Virtual Machine with 4 NVIDIA Tesla V100 GPUs. 
> **Tip**: If you want to run through the notebook quickly, you can set the **`QUICK_RUN`** flag in the cell below to **`True`** to run the notebook on a small subset of the data and for fewer epochs. 

With a single NVIDIA Tesla V100 GPU (16 GB of GPU memory):
- Data preprocessing takes around 1 minute for a quick run and ~20 minutes otherwise. This estimate assumes that the chosen transformer model is "distilbert-base-uncased" and that the sentence selection method is "greedy", which is the default. Preprocessing can take significantly longer if the sentence selection method is "combination", which can achieve better model performance.

- Model fine-tuning takes around 2 minutes for a quick run and ~3 hours otherwise. This estimate assumes the chosen encoder method is "transformer". Fine-tuning can be faster with another encoder method, which may result in worse model performance. 

### Additional Notes

* **ROUGE Evaluation**: To run ROUGE evaluation, please refer to the compute_rouge_perl section in [summarization_evaluation.ipynb](./summarization_evaluation.ipynb) for setup.

* **Distributed Training**:
Please note that the Jupyter notebook only allows the use of PyTorch [DataParallel](https://pytorch.org/docs/master/nn.html#dataparallel). Faster speed and a larger batch size can be achieved with PyTorch [DistributedDataParallel](https://pytorch.org/docs/master/notes/ddp.html) (DDP). The script [extractive_summarization_cnndm_distributed_train.py](./extractive_summarization_cnndm_distributed_train.py) shows an example of how to use DDP.



### Installation


---



#### Run these and then restart the runtime

In [None]:
!pip install -e git+https://github.com/microsoft/nlp-recipes.git@master#egg=utils_nlp



Obtaining utils_nlp from git+https://github.com/microsoft/nlp-recipes.git@master#egg=utils_nlp
  Cloning https://github.com/microsoft/nlp-recipes.git (to revision master) to ./src/utils-nlp
  Running command git clone -q https://github.com/microsoft/nlp-recipes.git /content/src/utils-nlp
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Installing collected packages: utils-nlp
  Running setup.py develop for utils-nlp
Successfully installed utils-nlp


In [None]:
!pip install -e git+https://github.com/pytorch/text.git@master#egg=torchtext


Obtaining torchtext from git+https://github.com/pytorch/text.git@master#egg=torchtext
  Cloning https://github.com/pytorch/text.git (to revision master) to ./src/torchtext
  Running command git clone -q https://github.com/pytorch/text.git /content/src/torchtext
  Running command git submodule update --init --recursive -q
Installing collected packages: torchtext
  Found existing installation: torchtext 0.3.1
    Uninstalling torchtext-0.3.1:
      Successfully uninstalled torchtext-0.3.1
  Running setup.py develop for torchtext
  Rolling back uninstall of torchtext
  Moving to /usr/local/lib/python3.6/dist-packages/test/common/
   from /usr/local/lib/python3.6/dist-packages/test/~ommon
  Moving to /usr/local/lib/python3.6/dist-packages/test/data/
   from /usr/local/lib/python3.6/dist-packages/test/~ata
  Moving to /usr/local/lib/python3.6/dist-packages/torchtext-0.3.1.dist-info/
   from /usr/local/lib/python3.6/dist-packages/~orchtext-0.3.1.dist-info
  Moving to /usr/local/lib/python3.6

In [None]:
!pip install -e git+https://github.com/wbolster/jsonlines#egg=jsonlines

Obtaining jsonlines from git+https://github.com/wbolster/jsonlines#egg=jsonlines
  Cloning https://github.com/wbolster/jsonlines to ./src/jsonlines
  Running command git clone -q https://github.com/wbolster/jsonlines /content/src/jsonlines
Installing collected packages: jsonlines
  Running setup.py develop for jsonlines
Successfully installed jsonlines


#### After restarting the runtime

In [None]:
pip install pyrouge


Collecting pyrouge
  Downloading https://files.pythonhosted.org/packages/11/85/e522dd6b36880ca19dcf7f262b22365748f56edc6f455e7b6a37d0382c32/pyrouge-0.1.3.tar.gz (60kB)
Building wheels for collected packages: pyrouge
  Building wheel for pyrouge (setup.py) ... done
  Created wheel for pyrouge: filename=pyrouge-0.1.3-cp36-none-any.whl size=191614 sha256=36d90184b3dbdc68b325dd350c16792bf0ba10c0a6cbc7544ea376ac850ea770
  Stored in directory: /root/.cache/pip/wheels/75/d3/0c/e5b04e15b6b87c42e980de3931d2686e14d36e045058983599
Successfully built pyrouge
Installing collected p

In [None]:
pip install rouge

Collecting rouge
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [None]:

!pip install indic-nlp-library


Collecting indic-nlp-library
  Downloading https://files.pythonhosted.org/packages/2f/51/f4e4542a226055b73a621ad442c16ae2c913d6b497283c99cae7a9661e6c/indic_nlp_library-0.71-py3-none-any.whl
Collecting morfessor
  Downloading https://files.pythonhosted.org/packages/39/e6/7afea30be2ee4d29ce9de0fa53acbb033163615f849515c0b1956ad074ee/Morfessor-2.0.6-py3-none-any.whl
Installing collected packages: morfessor, indic-nlp-library
Successfully installed indic-nlp-library-0.71 morfessor-2.0.6


In [None]:
pip install transformers


Collecting transformers
  Downloading https://files.pythonhosted.org/packages/50/0c/7d5950fcd80b029be0a8891727ba21e0cd27692c407c51261c3c921f6da3/transformers-4.1.1-py3-none-any.whl (1.5MB)

In [None]:
pip install tensorboardX


Collecting tensorboardX
  Downloading https://files.pythonhosted.org/packages/af/0c/4f41bcd45db376e6fe5c619c01100e9b7531c55791b7244815bac6eac32c/tensorboardX-2.1-py2.py3-none-any.whl (308kB)

In [None]:
pip install scrapbook

Collecting scrapbook
  Downloading https://files.pythonhosted.org/packages/eb/22/5c55a90934780daedc2747c8e40bae814adff393da1aca0fae577d13c00d/scrapbook-0.2.0.tar.gz
Collecting parsel>=1.2.0
  Downloading https://files.pythonhosted.org/packages/23/1e/9b39d64cbab79d4362cdd7be7f5e9623d45c4a53b3f7522cd8210df52d8e/parsel-1.6.0-py2.py3-none-any.whl
Collecting cssselect>=0.9
  Downloading https://files.pythonhosted.org/packages/3b/d4/3b5c17f00cce85b9a1e6f91096e1cc8e8ede2e1be8e96b87ce1ed09e92c5/cssselect-1.1.0-py2.py3-none-any.whl
Collecting w3lib>=1.19.0
  Downloading https://files.pythonhosted.org/packages/a3/59/b6b14521090e7f42669cafdb84b0ab89301a42f1f1a82fcf5856661ea3a7/w3lib-1.22.0-py2.py3-none-any.whl
Building wheels for collected packages: scrapbook
  Building wheel for scrapbook (setup.py) ... done
  Created wheel for scrapbook: filename=scrapbook-0.2.0-cp36-none-any.whl size=4886 sha256=dba2d4d70ba4f3d7560ecd87fb7e49d9c90e94b4f42ce72c715d043832495429
  Stored in directory:

In [None]:
pip install py-rouge

Collecting py-rouge
  Downloading https://files.pythonhosted.org/packages/9c/1d/0bdbaf559fb7afe32308ebc84a2028600988212d7eb7fb9f69c4e829e4a0/py_rouge-1.1-py3-none-any.whl (56kB)
Installing collected packages: py-rouge
Successfully installed py-rouge-1.1


In [None]:
pip install https://github.com/pytorch/text/archive/master.zip


Collecting https://github.com/pytorch/text/archive/master.zip
  Using cached https://github.com/pytorch/text/archive/master.zip
Building wheels for collected packages: torchtext
  Building wheel for torchtext (setup.py) ... error
  ERROR: Failed building wheel for torchtext
  Running setup.py clean for torchtext
Failed to build torchtext


In [None]:
pip install requests==2.21.0




In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

### First

In [None]:
%load_ext autoreload

%autoreload 2

In [None]:
## Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of epochs.
QUICK_RUN = True
## Set USE_PREPROCSSED_DATA = True to skip the data preprocessing
USE_PREPROCSSED_DATA = True


### Configuration: choose the transformer model to be used

In [None]:
import os
import shutil
import sys
from tempfile import TemporaryDirectory
import torch

nlp_path = os.path.abspath("../../")
print(nlp_path)
if nlp_path not in sys.path:
    sys.path.insert(0, nlp_path)

from utils_nlp.dataset.cnndm import CNNDMBertSumProcessedData, CNNDMSummarizationDataset
from utils_nlp.eval import compute_rouge_python, compute_rouge_perl
from utils_nlp.models.transformers.extractive_summarization import (
    ExtractiveSummarizer,
    ExtSumProcessedData,
    ExtSumProcessor,)

from utils_nlp.models.transformers.datasets import SummarizationDataset
import nltk
from nltk import tokenize

import pandas as pd
import scrapbook as sb
import pprint

/


In [None]:
''' If an error such as "extract_files not defined" is raised, copy the extract_files function from
    https://github.com/pytorch/text/blob/master/torchtext/utils.py and paste it into the cnndm.py file. '''

Several pretrained models have been made available by [Hugging Face](https://github.com/huggingface/transformers). For extractive summarization, the following pretrained models are supported. 

In [None]:
pd.DataFrame({"model_name": ExtractiveSummarizer.list_supported_models()})

Unnamed: 0,model_name
0,bert-base-uncased
1,distilbert-base-uncased


In [None]:
# Transformer model being used
MODEL_NAME = "distilbert-base-uncased"

In [None]:
# notebook parameters
# the cache data path used during fine-tuning
CACHE_DIR = TemporaryDirectory().name
processor = ExtSumProcessor(model_name=MODEL_NAME, cache_dir=CACHE_DIR)



### Data Preprocessing

The dataset used in this notebook is the CNN/DM dataset, which contains documents and accompanying questions from CNN and Daily Mail news articles. The highlights in each article are used as the summary. The dataset consists of ~289K training examples, ~11K validation examples and ~11K test examples. You can choose [Option 1] below to preprocess the data, or [Option 2] to use the preprocessed version from the [BERTSum published example](https://github.com/nlpyang/BertSum/). You don't need to manually download either of these data sets, as the code below handles downloading. The functions defined in [cnndm.py](../../utils_nlp/dataset/cnndm.py) are specific to the CNN/DM dataset preprocessed by harvardnlp. However, they provide a skeleton for how to preprocess text into the format that the model preprocessor takes: sentence tokenization and word tokenization. 

##### Details of Data Preprocessing

The purpose of preprocessing is to convert the input articles into the format required for model fine-tuning. Assuming you have (1) all articles and (2) target summaries, each in its own file with one example per line, the preprocessing steps are:
1. sentence tokenization
2. word tokenization
3. **label** the sentences in the article, with 1 meaning the sentence is selected and 0 meaning it is not. The sentence selection algorithms are "greedy" and "combination" and can be found in [sentence_selection.py](../../utils_nlp/dataset/sentence_selection.py)
4. convert each example to the desired format for extractive summarization
    - filter the sentences in the example based on the min_src_ntokens argument. If the number of remaining sentences is less than min_nsents, the example is discarded.
    - truncate each sentence in the example if its length is greater than max_src_ntokens
    - truncate the sentences in the example and the labels if the total number of sentences is greater than max_nsents
    - [CLS] and [SEP] are inserted before and after each sentence
    - WordPiece tokenization or Byte Pair Encoding (BPE) subword tokenization
    - truncate the example to 512 tokens
    - convert the tokens into token indices corresponding to the transformer tokenizer's vocabulary
    - segment ids are generated and added
    - [CLS] token positions are logged
    - labels for [CLS] tokens that fall beyond position 512 are dropped; 512 is the maximum input length the transformer model accepts
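
The conversion sub-steps above can be sketched as follows. This is a simplified illustration, not the `ExtSumProcessor` implementation: whitespace splitting stands in for WordPiece/BPE tokenization, and `build_model_input` and `toy_vocab` are hypothetical names introduced only for this example.

```python
def build_model_input(sentences, vocab, max_len=512):
    """Wrap each sentence in [CLS]/[SEP] and derive ids, segments, and [CLS] positions."""
    tokens, segment_ids, cls_positions = [], [], []
    for sent_idx, sentence in enumerate(sentences):
        sent_tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
        cls_positions.append(len(tokens))  # index of this sentence's [CLS]
        tokens.extend(sent_tokens)
        # Interval segments: alternate 0 and 1 from one sentence to the next.
        segment_ids.extend([sent_idx % 2] * len(sent_tokens))
    # Truncate to the maximum input length and drop [CLS] positions past it.
    tokens = tokens[:max_len]
    segment_ids = segment_ids[:max_len]
    cls_positions = [p for p in cls_positions if p < max_len]
    token_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return token_ids, segment_ids, cls_positions


# Toy vocabulary, assumed purely for illustration.
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "the": 3, "cat": 4, "sat": 5, "it": 6}
token_ids, segment_ids, cls_positions = build_model_input(
    ["the cat sat", "it purred", "then slept"], toy_vocab
)
print(token_ids)
print(segment_ids)
print(cls_positions)
```

The sentence labels would then be cut to `len(cls_positions)` so that every remaining [CLS] token has exactly one label.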
    
    
Note that the original BERTSum paper uses Stanford CoreNLP for data preprocessing, whereas here we use NLTK. 

##### [Option 1] Preprocess data (Please skip this part if you choose to use preprocessed data)
The code in the following cell downloads the CNN/DM dataset listed at https://github.com/harvardnlp/sent-summary/.

In [None]:
# the data path used to save the downloaded data file
DATA_PATH = TemporaryDirectory().name
# The number of lines at the head of data file used for preprocessing. -1 means all the lines.
TOP_N = 1000
if not QUICK_RUN:
    TOP_N = -1

In [None]:
train_dataset, test_dataset = CNNDMSummarizationDataset(top_n=TOP_N, local_cache_path=DATA_PATH)

Preprocess the data.

In [None]:
ext_sum_train = processor.preprocess(train_dataset, oracle_mode="greedy")
ext_sum_test = processor.preprocess(test_dataset, oracle_mode="greedy")


In [None]:
# Optionally save the preprocessed data so later runs can reload it
# instead of preprocessing again.
# save_path = os.path.join(DATA_PATH, "processed")
# os.makedirs(save_path, exist_ok=True)
# torch.save(ext_sum_train, os.path.join(save_path, "train_full.pt"))
# torch.save(ext_sum_test, os.path.join(save_path, "test_full.pt"))

# ext_sum_train = torch.load(os.path.join(save_path, "train_full.pt"))
# ext_sum_test = torch.load(os.path.join(save_path, "test_full.pt"))

In [None]:
len(ext_sum_train)

995

In [None]:
len(ext_sum_test)

1000

#### Inspect Data

In [None]:
ext_sum_train[0]

In [None]:
ext_sum_train[0].keys()

dict_keys(['src', 'src_txt', 'tgt', 'tgt_txt', 'oracle_ids'])

##### [Option 2] Reuse Preprocessed  data from [BERTSUM Repo](https://github.com/nlpyang/BertSum)

In [None]:
# the data path used to download the preprocessed data from the BERTSUM repo.
# if you have already downloaded the dataset, change the code to point to the path where it is stored.
PROCESSED_DATA_PATH = TemporaryDirectory().name
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)
#data_path = "./temp_data5/"
#PROCESSED_DATA_PATH = data_path

In [None]:
if USE_PREPROCSSED_DATA:
    download_path = CNNDMBertSumProcessedData.download(local_path=PROCESSED_DATA_PATH)
    ext_sum_train, ext_sum_test = ExtSumProcessedData().splits(root=download_path, train_iterable=True)
    

Downloading 1x0d61LP9UAN389YN00z0Pv-7jQgirVg6 into /tmp/tmp9edwa9hb/bertsum_data.zip... Done.


### Model training
To start model training, we need to create an instance of ExtractiveSummarizer.
#### Choose the transformer model.
Currently ExtractiveSummarizer supports two models:
- distilbert-base-uncased
- bert-base-uncased

Potentially, RoBERTa-based models and XLNet can be supported, but this needs to be tested.
#### Choose the encoder algorithm.
There are four options:
- baseline: uses a smaller transformer model in place of the BERT model, with a transformer summarization layer
- classifier: fine-tunes pretrained BERT with a **simple logistic classification** summarization layer
- transformer: fine-tunes pretrained BERT with a **transformer** summarization layer
- rnn: fine-tunes pretrained BERT with an **LSTM** summarization layer
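
As a rough illustration of the simplest option, the "classifier" summarization layer amounts to a logistic regression applied to each sentence's [CLS] vector. The sketch below uses plain Python with made-up 2-dimensional vectors and weights. `sentence_probabilities` is a hypothetical name, and the real layer operates on BERT's hidden states with learned parameters.

```python
import math


def sentence_probabilities(cls_vectors, weights, bias, mask):
    """Sigmoid of a linear score per sentence; padded sentences (mask 0) get probability 0."""
    probs = []
    for vector, keep in zip(cls_vectors, mask):
        logit = sum(w * x for w, x in zip(weights, vector)) + bias
        probs.append(keep * (1.0 / (1.0 + math.exp(-logit))))
    return probs


# Three sentence vectors; the last one is padding and is masked out.
cls_vectors = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
probs = sentence_probabilities(cls_vectors, weights=[2.0, -2.0], bias=0.0, mask=[1, 1, 0])
print(probs)
```

The transformer and rnn options replace this single linear score with additional layers that let each sentence representation attend to, or be conditioned on, the other sentences before the sigmoid.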

In [None]:
BATCH_SIZE = 5 # batch size, unit is the number of samples
MAX_POS_LENGTH = 512
if USE_PREPROCSSED_DATA: #if bertsum published data is used
    BATCH_SIZE = 3000 # batch size, unit is the number of tokens
    MAX_POS_LENGTH = 512
    


# Number of GPUs used for training
NUM_GPUS = torch.cuda.device_count()

# Encoder name. Options are: baseline, classifier, transformer, rnn.
ENCODER = "classifier"

# Learning rate
LEARNING_RATE=2e-3

# How often the statistics reports show up in training, unit is step.
REPORT_EVERY=100

# total number of steps for training
MAX_STEPS=1e2
# number of steps for warm up
WARMUP_STEPS=5e2
    
if not QUICK_RUN:
    MAX_STEPS=5e4
    WARMUP_STEPS=5e3
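
To get a feel for how `LEARNING_RATE`, `MAX_STEPS` and `WARMUP_STEPS` interact, the sketch below assumes a Noam-style warmup schedule like the one described in the BertSum paper. Whether `ExtractiveSummarizer.fit` uses exactly this schedule is an assumption, since the scheduler is not shown in this notebook. `noam_learning_rate` is a hypothetical helper.

```python
def noam_learning_rate(step, peak_factor=2e-3, warmup_steps=500):
    """Learning rate that rises linearly during warmup, then decays as step ** -0.5."""
    step = max(step, 1)  # avoid 0 ** -0.5
    return peak_factor * min(step ** -0.5, step * warmup_steps ** -1.5)


# Quick-run settings from the cell above: 100 training steps, 500 warmup steps.
lr_at_last_quick_step = noam_learning_rate(100, peak_factor=2e-3, warmup_steps=500)
lr_at_end_of_warmup = noam_learning_rate(500, peak_factor=2e-3, warmup_steps=500)
print(lr_at_last_quick_step, lr_at_end_of_warmup)
```

Under this assumption, a quick run with `MAX_STEPS` smaller than `WARMUP_STEPS` ends while the learning rate is still ramping up, which is one reason its results should not be compared with a full run.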
 

In [None]:
summarizer = ExtractiveSummarizer(processor, MODEL_NAME, ENCODER, MAX_POS_LENGTH, CACHE_DIR)





In [None]:

summarizer.fit(
            ext_sum_train,
            num_gpus=NUM_GPUS,
            batch_size=BATCH_SIZE,
            gradient_accumulation_steps=2,
            max_steps=MAX_STEPS,
            learning_rate=LEARNING_RATE,
            warmup_steps=WARMUP_STEPS,
            verbose=True,
            report_every=REPORT_EVERY,
            clip_grad_norm=False,
            use_preprocessed_data=USE_PREPROCSSED_DATA
        )

Iteration: 200it [01:00,  3.30it/s]

timestamp: 16/12/2020 13:57:48, average loss: 41.630460, time duration: 60.163095,
                            number of examples in current reporting: 1006, step 100
                            out of total 100


Iteration: 201it [01:00,  3.31it/s]


In [None]:
summarizer.save_model(
    os.path.join(
        CACHE_DIR,
        "extsum_modelname_{0}_usepreprocess{1}_steps_{2}.pt".format(
            MODEL_NAME, USE_PREPROCSSED_DATA, MAX_STEPS
        ),
    )
)

saving through pytorch


In [None]:
# for loading a previous saved model
"""
import torch
model_path = os.path.join(
        CACHE_DIR,
        "extsum_modelname_{0}_usepreprocess{1}_steps_{2}.pt".format(
            MODEL_NAME, USE_PREPROCSSED_DATA, MAX_STEPS
        ))
summarizer = ExtractiveSummarizer(processor, MODEL_NAME, ENCODER, MAX_POS_LENGTH, CACHE_DIR)
summarizer.model.load_state_dict(torch.load(model_path, map_location="cpu"))
"""

### Model Evaluation

[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)), or Recall-Oriented Understudy for Gisting Evaluation, is commonly used for evaluating text summarization.
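
For intuition about what the ROUGE-N scores below measure, here is a minimal sketch that counts overlapping n-grams between a candidate and a reference. It is a simplification: the packages used in this notebook may apply extra processing such as stemming or multi-reference handling, and `rouge_n` here is a hypothetical helper rather than their API.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Precision, recall and F1 of clipped n-gram overlap."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f}


rouge_1 = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
rouge_2 = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)
print(rouge_1)
print(rouge_2)
```

Recall rewards covering the reference, precision penalizes extra material, and the F score balances the two. ROUGE-L replaces n-gram counts with the longest common subsequence.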

In [None]:
ext_sum_test[0].keys()

dict_keys(['src', 'labels', 'segs', 'clss', 'src_txt', 'tgt_txt'])

In [None]:
if "segs" in ext_sum_test[0]: # preprocessed_data
    source = [i['src_txt'] for i in ext_sum_test]
    target = ["\n".join(i['tgt_txt'].split("<q>")) for i in ext_sum_test]
else:
    source = []
    target = []
    for i in ext_sum_test:
        source.append(i["src_txt"])
        # Join tokens with spaces and sentences with newlines so the reference
        # uses the same sentence separator as the preprocessed-data branch.
        target.append("\n".join(" ".join(j) for j in i["tgt"]))

In [None]:
%%time
sentence_separator = "\n"
prediction = summarizer.predict(ext_sum_test, num_gpus=NUM_GPUS, batch_size=5, sentence_separator=sentence_separator)

Scoring: 100%|██████████| 2298/2298 [03:40<00:00, 10.43it/s]


CPU times: user 2min 23s, sys: 1min 18s, total: 3min 42s
Wall time: 3min 42s


In [None]:
len(prediction)

11489

In [None]:
rouge_scores = compute_rouge_python(cand=prediction, ref=target)
pprint.pprint(rouge_scores)


Number of candidates: 11489
Number of references: 11489
{'rouge-1': {'f': 0.37951371919343274,
             'p': 0.33519564920293937,
             'r': 0.4763957218118309},
 'rouge-2': {'f': 0.15674161668321718,
             'p': 0.13895800749069379,
             'r': 0.19613756647926556},
 'rouge-l': {'f': 0.34404438545078103,
             'p': 0.304160037979591,
             'r': 0.4314214293014429}}


In [None]:
pprint.pprint("ROUGE-1: {}".format(rouge_scores["rouge-1"]))
pprint.pprint("ROUGE-2: {}".format(rouge_scores["rouge-2"]))
pprint.pprint("ROUGE-L: {}".format(rouge_scores["rouge-l"]))

### Inspection

In [None]:
source[0]

['a university of iowa student has died nearly three months after a fall in rome in a suspected robbery attack in rome .',
 'andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program in italy when the incident happened in january .',
 'he was flown back to chicago via air ambulance on march 20 , but he died on sunday .',
 'andrew mogni , 20 , from glen ellyn , illinois , a university of iowa student has died nearly three months after a fall in rome in a suspected robbery',
 'he was taken to a medical facility in the chicago area , close to his family home in glen ellyn .',
 "he died on sunday at northwestern memorial hospital - medical examiner 's office spokesman frank shuftan says a cause of death wo n't be released until monday at the earliest .",
 'initial police reports indicated the fall was an accident but authorities are investigating the possibility that mogni was robbed .',
 "on sunday , his cousin abby wrote online : ` this morning my cous

In [None]:
target[0]

'andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program when the incident happened in january\nhe was flown back to chicago via air on march 20 but he died on sunday\ninitial police reports indicated the fall was an accident but authorities are investigating the possibility that mogni was robbed\nhis cousin claims he was attacked and thrown 40ft from a bridge'

In [None]:
prediction[0]

"andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program in italy when the incident happened in january .\nhe died on sunday at northwestern memorial hospital - medical examiner 's office spokesman frank shuftan says a cause of death wo n't be released until monday at the earliest .\ninitial police reports indicated the fall was an accident but authorities are investigating the possibility that mogni was robbed ."

In [None]:
# for testing
sb.glue("rouge_2_f_score", rouge_scores['rouge-2']['f'])

## Prediction on a single input sample

In [None]:
source = """
But under the new rule, set to be announced in the next 48 hours, Border Patrol agents would immediately return anyone to Mexico — without any detainment and without any due process — who attempts to cross the southwestern border between the legal ports of entry. The person would not be held for any length of time in an American facility.

Although they advised that details could change before the announcement, administration officials said the measure was needed to avert what they fear could be a systemwide outbreak of the coronavirus inside detention facilities along the border. Such an outbreak could spread quickly through the immigrant population and could infect large numbers of Border Patrol agents, leaving the southwestern border defenses weakened, the officials argued.
The Trump administration plans to immediately turn back all asylum seekers and other foreigners attempting to enter the United States from Mexico illegally, saying the nation cannot risk allowing the coronavirus to spread through detention facilities and Border Patrol agents, four administration officials said.
The administration officials said the ports of entry would remain open to American citizens, green-card holders and foreigners with proper documentation. Some foreigners would be blocked, including Europeans currently subject to earlier travel restrictions imposed by the administration. The points of entry will also be open to commercial traffic."""

In [None]:
test_dataset = SummarizationDataset(
    None,
    source=[source],
    source_preprocessing=[tokenize.sent_tokenize],
    word_tokenize=nltk.word_tokenize,
)
processor = ExtSumProcessor(model_name=MODEL_NAME,  cache_dir=CACHE_DIR)
preprocessed_dataset = processor.preprocess(test_dataset)

In [None]:
preprocessed_dataset[0].keys()

dict_keys(['src', 'src_txt'])

In [None]:
prediction = summarizer.predict(preprocessed_dataset, num_gpus=0, batch_size=1, sentence_separator="\n")

In [None]:
prediction

## Clean up temporary folders

In [None]:
if os.path.exists(DATA_PATH):
    shutil.rmtree(DATA_PATH, ignore_errors=True)
if os.path.exists(CACHE_DIR):
    shutil.rmtree(CACHE_DIR, ignore_errors=True)
if USE_PREPROCSSED_DATA:
    if os.path.exists(PROCESSED_DATA_PATH):
        shutil.rmtree(PROCESSED_DATA_PATH, ignore_errors=True)

## Exporting BERT Extractive Output to a File

In [None]:
import numpy as np
import pandas as pd 
import re

In [None]:
def text_cleaner(text):
    """Lowercase the text and replace newlines with spaces."""
    new_string = text.lower()
    new_string = re.sub(r"\n", " ", new_string)
    return new_string

In [None]:
stri=text_cleaner(prediction[0])
print(stri)

andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program in italy when the incident happened in january . he died on sunday at northwestern memorial hospital - medical examiner 's office spokesman frank shuftan says a cause of death wo n't be released until monday at the earliest . initial police reports indicated the fall was an accident but authorities are investigating the possibility that mogni was robbed .


In [None]:
cleaned_text = []
for t in range(len(prediction)):
    cleaned_text.append(text_cleaner(prediction[t]))

In [None]:
cleaned_summary = []
for t in target:
    cleaned_summary.append(text_cleaner(t))

In [None]:
# 'text' holds the predicted extractive summaries; 'summary' holds the references.
text = np.array(prediction)
summary = np.array(target)
df = pd.DataFrame({'text': text, 'summary': summary})

In [None]:
len(df)

11489

In [None]:
# Replace newlines column-wise; this avoids chained-assignment warnings from df['text'][i] = ...
df['text'] = df['text'].str.replace('\n', ' ', regex=False)

In [None]:
# Replace newlines column-wise; this avoids chained-assignment warnings from df['summary'][i] = ...
df['summary'] = df['summary'].str.replace('\n', ' ', regex=False)

In [None]:
df.head(50)


Unnamed: 0,text,summary
0,"andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program in italy when the incident happened in january . he died on sunday at northwestern memorial hospital -...","andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program when the incident happened in january he was flown back to chicago via air on march 20 but he died on ..."
1,"and for those looking to step back in time during a weekend getaway , there are plenty of options available at your fingertips . whether you prefer the earthy aesthetic of the atomic age or the bo...",the landgate cottage in rye is outfitted in an earthy atomic age aesthetic travel to brighton to experience the era of brit-pop at wonderland cottage and nothing says 1970s nostalgia like renting ...
2,"the 27-year-old models her new summer sleepwear collection , which is full of mix and match pieces featuring sophisticated hues of slate blue and silver with colour pops of peach and floral prints...","rosie , 27 , shows off her new lingerie and sleepwear range shares her daily diet and says eating organic is an important investment stars in mad max - fury road film , which is out this year"
3,manchester city are coming up against statistically the deadliest striker in the premier league when they face crystal palace on easter monday . harry kane and diego costa are locked on 19 goals i...,"glenn murray has scored four goals in 364 minutes this season crystal palace striker has best minutes-per-goal ratio in premier league olivier giroud third on the list , harry kane fourth , diego ..."
4,"on a day when a new quinnipiac university poll found only 38 per cent of americans believe she is honest and trustworthy , the leading democratic party candidate for president has a mess to clean ...","hillary clinton seen as honest and trustworthy by just 38 per cent of americans , new poll shows new headaches include revelations about foreign funds flowing into the clintons ' family foundation..."
5,"however , you may be able to wave goodbye tp sleepless nights thanks to one father who has come up with a trick that can get your little one to sleep in less than one minute . at the beginning of ...",nathan dailo has found a way to get his son to sleep in 42 seconds in a youtube video he demonstrates how stroking his 3-month-old son 's face with a white piece of tissue paper sends him to sleep...
6,louis van gaal has found the best way to keep manchester united fans happy is to win at home . powering past aston villa was their 13th victory in 16 premier league matches at old trafford and the...,ander herrera put manchester united ahead just before half-time with a low left-footed effort wayne rooney doubled united 's lead with a beautiful half-volley on 79 minutes at old trafford christi...
7,"a prisoner who shared a ride to jail with freddie gray claims the 25-year-old was trying to injure himself inside a police van before he died from unexplained spinal cord injuries , according to a...",the prisoner who rode in a police van with freddie gray on april 12 in baltimore says gray was trying to hurt himself prisoner 's statement to investigators was part of an affidavit obtained wedne...
8,"on wednesday , mel posted a blog post on her website titled ` are we embarrassed of ivf ? ' where she detailed how she was undergoing the treatment , and that it was nothing for women to be ashame...",australian radio host stands up for women going through ivf treatment mel started her own ivf treatment and was showing solidarity with others she uploaded photographs of injecting herself for the...
9,""" i saw the top of the funnel cloud , and it was absolutely massive , "" she said . she watched the hulking gray twister grind past her town thursday , tearing up its fringes . "" we 're a community .","at least one person died as a result of storms in illinois , an official says fire department : rescuers searching for trapped victims in kirkland , illinois"


In [None]:
from google.colab import files

df.to_csv('bert_sum.csv')
files.download('bert_sum.csv')

In [None]:
sources=[]
for s in range(len(source)):
  new=' '.join(source[s])
  sources.append(new)

In [None]:
df=pd.DataFrame({'source':sources})
from google.colab import files

df.to_csv('source.csv')
files.download('source.csv')
