Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

## Extractive Text Summarization on CNN/DM Dataset using BertSum


### Summary

This notebook demonstrates how to fine-tune BERT for extractive text summarization. Utility functions and classes in the NLP Best Practices repo are used to facilitate data preprocessing, model training, model scoring, result post-processing, and model evaluation.

BertSum refers to [Fine-tune BERT for Extractive Summarization](https://arxiv.org/pdf/1903.10318.pdf) with a [published example](https://github.com/nlpyang/BertSum/). Extractive summarization is usually used in document summarization, where each input document consists of multiple sentences. Preprocessing the training data involves assigning a label of 0 or 1 to each document sentence based on the given summary. The summarization problem is thereby simplified to classifying whether each document sentence should be included in the summary.
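The labeling step can be illustrated with a small sketch. This is not the BertSum implementation: it greedily adds the sentence that most improves a unigram-overlap F1 against the reference summary, whereas the published code relies on ROUGE-based scores. The names `unigram_f1`, `greedy_labels`, and `max_selected` are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """F1 of the clipped unigram overlap between two token lists."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_labels(doc_sentences, summary_tokens, max_selected=3):
    """Return one 0/1 label per sentence using greedy selection (illustrative only)."""
    selected = []
    best_score = 0.0
    for _ in range(max_selected):
        best_idx = None
        for idx in range(len(doc_sentences)):
            if idx in selected:
                continue
            # Score the summary that would result from adding this sentence.
            candidate = [
                tok for i in sorted(selected + [idx]) for tok in doc_sentences[i]
            ]
            score = unigram_f1(candidate, summary_tokens)
            if score > best_score:
                best_score, best_idx = score, idx
        if best_idx is None:  # no remaining sentence improves the score
            break
        selected.append(best_idx)
    return [1 if i in selected else 0 for i in range(len(doc_sentences))]


doc = [
    "the cat sat on the mat".split(),
    "stocks fell sharply on monday".split(),
    "the dog barked at the cat".split(),
]
summary = "the cat sat on the mat while the dog barked".split()
labels = greedy_labels(doc, summary)
print(labels)
```

Sentences labeled 1 form the oracle summary that the classifier is trained to reproduce.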

The figure below illustrates how BERTSum can be fine-tuned for the extractive summarization task. A [CLS] token is inserted at the beginning of each sentence and a [SEP] token at the end. Interval segment embeddings and positional embeddings are added to the token embeddings before they are fed to the BERT model. The [CLS] token representation is used as the sentence embedding, and only the [CLS] representations are used as input to the summarization layer. The summarization layer predicts the probability that each sentence should be included in the summary. Techniques like trigram blocking can be used to improve model accuracy.
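Trigram blocking can be sketched as follows: sentences are considered in descending score order, and a candidate is skipped when it shares any trigram with the sentences already selected. This is a minimal illustration under that assumption; `trigrams` and `select_with_trigram_blocking` are hypothetical helpers, not functions from BertSum or `utils_nlp`.

```python
def trigrams(tokens):
    """Set of all consecutive 3-token tuples in a token list."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_with_trigram_blocking(sentences, scores, max_sentences=3):
    """Pick top-scoring sentences, skipping any that repeats a selected trigram."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, seen = [], set()
    for idx in ranked:
        candidate = trigrams(sentences[idx].split())
        if candidate & seen:
            continue  # overlaps with the summary built so far
        chosen.append(idx)
        seen |= candidate
        if len(chosen) == max_sentences:
            break
    return chosen


sentences = [
    "a student has died after a fall in rome",
    "a student has died in chicago on sunday",
    "police are investigating the fall",
]
picked = select_with_trigram_blocking(sentences, [0.9, 0.8, 0.5], max_sentences=2)
print(picked)
```

The second sentence is skipped because it repeats the trigram "a student has", so the lower-scored third sentence is selected instead.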

<img src="https://nlpbp.blob.core.windows.net/images/BertSum.PNG">


### Before You Start

The running time shown in this notebook is on a Standard_NC24s_v3 Azure Deep Learning Virtual Machine with 4 NVIDIA Tesla V100 GPUs. 
> **Tip**: If you want to run through the notebook quickly, you can set the **`QUICK_RUN`** flag in the cell below to **`True`** and **`USE_PREPROCESSED_DATA`** to **`True`** to run the notebook on a small subset of the data and a smaller number of epochs. 

The table below provides reference running times on different machine configurations.

|QUICK_RUN|USE_PREPROCESSED_DATA|encoder|Machine Configuration|Running time|
|:---------|:---------|:---------|:----------------------|:------------|
|True|True|baseline|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| ~ 20 minutes |
|False|True|baseline|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| ~ 60 minutes |
|True|False|baseline|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| ~ 20 minutes |
|True|True|transformer|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| ~ 80 minutes |
|False|True|transformer|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| ~ 6.5 hours |
|True|False|transformer|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| ~ 80 minutes |
|False|False|any|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| > 24 hours |


In [1]:
## Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of epochs.
QUICK_RUN = True
USE_PREPROCESSED_DATA = True

### Configuration

First, you need to clone a modified version of BertSum so that it works for prediction and can run on any GPU device ID on your machine. The cell below downloads the BERT configuration file from the BertSum repository.

In [2]:
!wget https://raw.githubusercontent.com/nlpyang/BertSum/master/bert_config_uncased_base.json

--2019-10-15 17:28:02--  https://raw.githubusercontent.com/nlpyang/BertSum/master/bert_config_uncased_base.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.124.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.124.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 313 [text/plain]
Saving to: ‘bert_config_uncased_base.json’


2019-10-15 17:28:02 (59.7 MB/s) - ‘bert_config_uncased_base.json’ saved [313/313]



In [3]:
BERT_CONFIG_PATH = "./bert_config_uncased_base.json"

We also need to set the environment variables to make sure the notebook can access the GPUs on your machine.

In [4]:
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

In [5]:
import sys
import os

nlp_path = os.path.abspath("../../")
if nlp_path not in sys.path:
    sys.path.insert(0, nlp_path)
sys.path.insert(0, "./")

Also, we need to install the dependencies for pyrouge.

In [6]:
# dependencies for ROUGE-1.5.5.pl
!sudo apt-get update
!sudo apt-get install expat
!sudo apt-get install libexpat-dev -y

Hit:1 http://azure.archive.ubuntu.com/ubuntu bionic InRelease
Get:2 http://azure.archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Ign:3 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:4 http://azure.archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:5 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]    
Hit:6 https://packages.microsoft.com/repos/microsoft-ubuntu-xenial-prod xenial InRelease
Hit:7 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Fetched 252 kB in 1s (350 kB/s)                    
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
expat is already the newest version (2.2.5-3ubuntu0.2).
The following packages were automatically installed and are no longer required:
  linux-azure-cloud-tools-5.0.0-1018 linux-azure-headers-5.0.0-1018
  linux-azure-tools-5.0.0-1018
Use 'sudo apt 

Run the following commands in your terminal to install the prerequisites for pyrouge.
1. sudo cpan install XML::Parser
1. sudo cpan install XML::Parser::PerlSAX
1. sudo cpan install XML::DOM

Download ROUGE-1.5.5 from https://github.com/andersjo/pyrouge/tree/master/tools/ROUGE-1.5.5.
Run the following command in your terminal.
* pyrouge_set_rouge_path $ABSOLUTE_DIRECTORY_TO_ROUGE-1.5.5.pl


### Data Preprocessing

The dataset used in this notebook is the CNN/DM dataset, which contains the documents and accompanying questions from the news articles of CNN and the Daily Mail. The highlights in each article are used as the summary in this experiment. The dataset consists of ~289K training examples, ~11K validation examples, and ~11K test examples. You can either use the preprocessed version from the [BERTSum published example](https://github.com/nlpyang/BertSum/) or use the following section to preprocess the data. Since preprocessing the training data takes up to 28 hours on 10 Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz processors, we suggest that you keep USE_PREPROCESSED_DATA set to True and experiment with data preprocessing with QUICK_RUN set to True.



#### Download Preprocessed Data
Please go to the [BERTSum published example](https://github.com/nlpyang/BertSum/) to download the preprocessed data and unzip it to the folder "./bert_data" under the current path.

In [7]:
USE_PREPROCESSED_DATA = True
if USE_PREPROCESSED_DATA:
    BERT_DATA_PATH = "./bert_data/"

If you choose to use the preprocessed data, continue to the Model training section.


#### Run Data Preprocessing
To continue with the data preprocessing, run the following command to download from https://github.com/harvardnlp/sent-summary and unzip the data to folder ./harvardnlp_cnndm.

In [8]:
!wget https://s3.amazonaws.com/opennmt-models/Summary/cnndm.tar.gz &&\
    mkdir -p harvardnlp_cnndm &&\
    mv cnndm.tar.gz ./harvardnlp_cnndm && cd ./harvardnlp_cnndm &&\
    tar -xvf cnndm.tar.gz 

--2019-10-15 17:28:19--  https://s3.amazonaws.com/opennmt-models/Summary/cnndm.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.238.221
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.238.221|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 500375629 (477M) [application/x-gzip]
Saving to: ‘cnndm.tar.gz’


2019-10-15 17:28:24 (91.4 MB/s) - ‘cnndm.tar.gz’ saved [500375629/500375629]

test.txt.src
test.txt.tgt.tagged
train.txt.src
train.txt.tgt.tagged
val.txt.src
val.txt.tgt.tagged


##### Details of Data Preprocessing

The purpose of preprocessing is to convert the input articles into the format that BertSum takes. The harvardnlp_cnndm_preprocess function is specific to the CNN/DM dataset as processed by harvardnlp. However, it provides a skeleton for preprocessing data into the format that BertSum takes. Assuming you have all articles and all target summaries each in a file, separated by line breaks, the steps to preprocess the data are:
1. sentence tokenization
2. word tokenization
3. label the sentences in the article, with 1 meaning the sentence is selected and 0 meaning the sentence is not selected. The options for the selection algorithm are "greedy" and "combination"
4. convert each example to BertSum format
    - filter the sentences in the example based on the min_src_ntokens argument. If the number of sentences left after filtering is less than min_nsents, the example is discarded.
    - truncate each sentence in the example if its length is greater than max_src_ntokens
    - truncate the sentences in the example and the labels if the total number of sentences is greater than max_nsents
    - [CLS] and [SEP] are inserted before and after each sentence
    - WordPiece tokenization
    - truncate the example to 512 tokens
    - convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary.
    - segment ids are generated
    - [CLS] token positions are logged
    - [CLS] token labels are truncated if the example is longer than 512 tokens, which is the maximum input length that the BERT model accepts.
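The formatting steps above can be sketched on already tokenized sentences. This is a simplified illustration, not the `BertData.preprocess` implementation: WordPiece tokenization and vocabulary lookup are omitted, the strict `>` comparison for `min_src_ntokens` is an assumption, and a hard cut at `max_len` may leave the final sentence without its closing [SEP]. The function name `to_bertsum_layout` and the `max_len` parameter are hypothetical.

```python
def to_bertsum_layout(sentences, min_src_ntokens=5, max_src_ntokens=200,
                      min_nsents=3, max_nsents=100, max_len=512):
    """Lay out tokenized sentences as [CLS] s1 [SEP] [CLS] s2 [SEP] ...

    Returns (tokens, segment_ids, cls_positions), or None when too few
    sentences survive the length filter.
    """
    # Drop very short sentences, then truncate long sentences and long documents.
    kept = [s[:max_src_ntokens] for s in sentences if len(s) > min_src_ntokens]
    if len(kept) < min_nsents:
        return None
    kept = kept[:max_nsents]

    tokens = []
    for sent in kept:
        tokens += ["[CLS]"] + sent + ["[SEP]"]
    tokens = tokens[:max_len]  # simplified truncation to the model's input limit

    # Interval segments: consecutive sentences alternate between 0 and 1.
    segment_ids, current = [], 0
    for tok in tokens:
        segment_ids.append(current)
        if tok == "[SEP]":
            current = 1 - current

    cls_positions = [i for i, tok in enumerate(tokens) if tok == "[CLS]"]
    return tokens, segment_ids, cls_positions


example_sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["ok"],
    ["a", "dog", "barked", "at", "the", "cat", "."],
]
tokens, segment_ids, cls_positions = to_bertsum_layout(
    example_sentences, min_src_ntokens=2, min_nsents=2
)
print(cls_positions)
print(segment_ids)
```

The one-word sentence is filtered out, and the two remaining sentences receive segment ids 0 and 1 respectively, with one [CLS] position recorded per sentence.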
    
    
Note that the original BERTSum paper uses Stanford CoreNLP for data preprocessing. Here we show how to use NLTK instead, as NLTK requires no additional setup.

In [9]:
from utils_nlp.dataset.harvardnlp_cnndm import harvardnlp_cnndm_preprocess
from utils_nlp.models.bert.extractive_text_summarization import bertsum_formatting

[nltk_data] Downloading package punkt to /home/daden/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [10]:
%%time
max_train_job_number = -1
max_test_job_number = -1
if QUICK_RUN:
    max_train_job_number = 100
    max_test_job_number = 100

CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 7.87 µs


##### Preprocess training data

In [11]:
%%time
TRAIN_SRC_FILE = "./harvardnlp_cnndm/train.txt.src"
TRAIN_TGT_FILE = "./harvardnlp_cnndm/train.txt.tgt.tagged"
PROCESSED_TRAIN_FILE = f"./harvardnlp_cnndm/train.bertdata_{QUICK_RUN}"


import multiprocessing

n_cpus = multiprocessing.cpu_count() - 1
jobs = harvardnlp_cnndm_preprocess(
    n_cpus, TRAIN_SRC_FILE, TRAIN_TGT_FILE, max_train_job_number
)
print("total length of training data:", len(jobs))
from bertsum.prepro.data_builder import BertData
from utils_nlp.models.bert.extractive_text_summarization import Bunch

default_preprocessing_parameters = {
    "max_nsents": 200,
    "max_src_ntokens": 2000,
    "min_nsents": 3,
    "min_src_ntokens": 5,
    "use_interval": True,
}
args = Bunch(default_preprocessing_parameters)
bertdata = BertData(args)
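# A negative job limit means "use everything"; slicing with jobs[0:-1] would drop the last example.
train_jobs = jobs if max_train_job_number < 0 else jobs[:max_train_job_number]
bertsum_formatting(n_cpus, bertdata, "combination", train_jobs, PROCESSED_TRAIN_FILE)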

total length of training data: 100
CPU times: user 2.98 s, sys: 2.49 s, total: 5.46 s
Wall time: 32.5 s


##### Preprocess test data

In [12]:
%%time
TEST_SRC_FILE = "./harvardnlp_cnndm/test.txt.src"
TEST_TGT_FILE = "./harvardnlp_cnndm/test.txt.tgt.tagged"
PROCESSED_TEST_FILE = f"./harvardnlp_cnndm/test.bertdata_{QUICK_RUN}"

import multiprocessing

n_cpus = multiprocessing.cpu_count() - 1
jobs = harvardnlp_cnndm_preprocess(
    n_cpus, TEST_SRC_FILE, TEST_TGT_FILE, max_test_job_number
)
print("total length of test data:", len(jobs))
from bertsum.prepro.data_builder import BertData
from utils_nlp.models.bert.extractive_text_summarization import Bunch

default_preprocessing_parameters = {
    "max_nsents": 200,
    "max_src_ntokens": 2000,
    "min_nsents": 3,
    "min_src_ntokens": 5,
    "use_interval": True,
}
args = Bunch(default_preprocessing_parameters)
bertdata = BertData(args)
# A negative job limit means "use everything"; slicing with jobs[0:-1] would drop the last example.
test_jobs = jobs if max_test_job_number < 0 else jobs[:max_test_job_number]
bertsum_formatting(n_cpus, bertdata, "combination", test_jobs, PROCESSED_TEST_FILE)

total length of test data: 100
CPU times: user 586 ms, sys: 1.01 s, total: 1.59 s
Wall time: 15.6 s


In [13]:
jobs[0]

{'src': [['marseille',
   ',',
   'france',
   '(',
   'cnn',
   ')',
   'the',
   'french',
   'prosecutor',
   'leading',
   'an',
   'investigation',
   'into',
   'the',
   'crash',
   'of',
   'germanwings',
   'flight',
   '9525',
   'insisted',
   'wednesday',
   'that',
   'he',
   'was',
   'not',
   'aware',
   'of',
   'any',
   'video',
   'footage',
   'from',
   'on',
   'board',
   'the',
   'plane',
   '.'],
  ['marseille',
   'prosecutor',
   'brice',
   'robin',
   'told',
   'cnn',
   'that',
   '``',
   'so',
   'far',
   'no',
   'videos',
   'were',
   'used',
   'in',
   'the',
   'crash',
   'investigation',
   '.',
   '``'],
  ['he',
   'added',
   ',',
   '``',
   'a',
   'person',
   'who',
   'has',
   'such',
   'a',
   'video',
   'needs',
   'to',
   'immediately',
   'give',
   'it',
   'to',
   'the',
   'investigators',
   '.',
   '``'],
  ['robin',
   "'s",
   'comments',
   'follow',
   'claims',
   'by',
   'two',
   'magazines',
   ',',
   'german'

#### Inspect the data

In [14]:
import torch

bert_format_data = torch.load(PROCESSED_TRAIN_FILE)
print(len(bert_format_data))
bert_format_data[0].keys()

100


dict_keys(['src', 'labels', 'segs', 'clss', 'src_txt', 'tgt_txt'])

In [15]:
bert_format_data[5]["src"]

[101,
 1006,
 13229,
 1007,
 1011,
 1011,
 1996,
 2120,
 2374,
 2223,
 2038,
 20733,
 6731,
 5865,
 14929,
 9074,
 2745,
 10967,
 2243,
 2302,
 3477,
 1010,
 4584,
 2007,
 1996,
 2223,
 2056,
 5958,
 1012,
 102,
 101,
 5088,
 2732,
 2745,
 10967,
 2243,
 2003,
 2275,
 2000,
 3711,
 1999,
 2457,
 6928,
 1012,
 102,
 101,
 1037,
 3648,
 2097,
 2031,
 1996,
 2345,
 2360,
 2006,
 1037,
 14865,
 3066,
 1012,
 102,
 101,
 3041,
 1010,
 10967,
 2243,
 4914,
 2000,
 8019,
 1999,
 1037,
 3899,
 22158,
 3614,
 2004,
 2112,
 1997,
 1037,
 14865,
 3820,
 2007,
 2976,
 19608,
 1999,
 3448,
 1012,
 1036,
 1036,
 102,
 101,
 2115,
 4914,
 6204,
 2001,
 2025,
 2069,
 6206,
 1010,
 2021,
 2036,
 10311,
 1998,
 16360,
 2890,
 10222,
 19307,
 1012,
 102,
 101,
 2115,
 2136,
 1010,
 1996,
 5088,
 1010,
 1998,
 5088,
 4599,
 2031,
 2035,
 2042,
 3480,
 2011,
 2115,
 4506,
 1010,
 1036,
 1036,
 5088,
 5849,
 5074,
 2204,
 5349,
 2056,
 1999,
 1037,
 3661,
 2000,
 10967,
 2243,
 1012,
 102,
 101,
 2204,
 534

In [16]:
bert_format_data[5]["tgt_txt"]

"new : nfl chief , atlanta falcons owner critical of michael vick 's conduct .<q>nfl suspends falcons quarterback indefinitely without pay .<q>vick admits funding dogfighting operation but says he did not gamble .<q>vick due in federal court monday ; future in nfl remains uncertain .<q>"

In [17]:
bert_format_data[5]["labels"]

[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
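To see which sentences the labels mark for the summary, the `labels` list can be paired with the `src_txt` list of the same example. A minimal sketch, assuming the two lists are aligned by position; `zip` stops at the shorter list, which matters if the labels were truncated. `selected_sentences` is a hypothetical helper, and the toy dictionary stands in for an entry such as `bert_format_data[5]`.

```python
def selected_sentences(example):
    """Return the source sentences whose label is 1."""
    return [
        sentence
        for sentence, label in zip(example["src_txt"], example["labels"])
        if label == 1
    ]


toy_example = {
    "src_txt": ["first sentence .", "second sentence .", "third sentence ."],
    "labels": [1, 0, 1],
}
picked = selected_sentences(toy_example)
print(picked)
```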

In [18]:
bert_format_data[5]["src_txt"]

['( cnn ) -- the national football league has indefinitely suspended atlanta falcons quarterback michael vick without pay , officials with the league said friday .',
 'nfl star michael vick is set to appear in court monday .',
 'a judge will have the final say on a plea deal .',
 'earlier , vick admitted to participating in a dogfighting ring as part of a plea agreement with federal prosecutors in virginia . ``',
 'your admitted conduct was not only illegal , but also cruel and reprehensible .',
 'your team , the nfl , and nfl fans have all been hurt by your actions , `` nfl commissioner roger goodell said in a letter to vick .',
 'goodell said he would review the status of the suspension after the legal proceedings are over .',
 'in papers filed friday with a federal court in virginia , vick also admitted that he and two co-conspirators killed dogs that did not fight well .',
 "falcons owner arthur blank said vick 's admissions describe actions that are `` incomprehensible and unaccep

### Model training
To start model training, we need to create an instance of BertSumExtractiveSummarizer, a wrapper for running BertSum-based fine-tuning. You can select any device ID on your machine, but make sure that you include the string version of the device ID in the gpu_ranks argument.




In [19]:
## choose which GPU device to use
device_id = 0
gpu_ranks = str(device_id)

#### Choose the encoder algorithm. There are four options:
- baseline: uses a smaller transformer model in place of BERT, with a transformer summarization layer
- classifier: uses pretrained BERT and fine-tunes it with a **simple logistic classification** summarization layer
- transformer: uses pretrained BERT and fine-tunes it with a **transformer** summarization layer
- RNN: uses pretrained BERT and fine-tunes it with an **LSTM** summarization layer

In [20]:
encoder = "transformer"
model_base_path = "./models/"
log_base_path = "./logs/"
result_base_path = "./results"

import os

# Create the output directories if they do not exist yet.
for path in (model_base_path, log_base_path, result_base_path):
    os.makedirs(path, exist_ok=True)

from random import random

random_number = random()

In [21]:
from utils_nlp.models.bert.extractive_text_summarization import (
    BertSumExtractiveSummarizer,
)

bertsum_model = BertSumExtractiveSummarizer(
    encoder=encoder,
    model_path=model_base_path + encoder + str(random_number),
    log_file=log_base_path + encoder + str(random_number),
    bert_config_path=BERT_CONFIG_PATH,
    gpu_ranks=gpu_ranks,
)

['0']
{0: 0}


Here we use the fully processed CNN/DM dataset to train the model. You can stop training at any time and resume from the most recently saved checkpoint.
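Resuming needs the path of the most recent checkpoint. Checkpoints in this run are written as `model_step_<N>.pt` inside the model directory, so the latest one can be located with a small helper like the sketch below. `latest_checkpoint` is a hypothetical function, not part of `utils_nlp`; the assumption is that its result would be passed as the `train_from` argument of `fit`.

```python
import glob
import os
import re
import tempfile


def latest_checkpoint(model_dir):
    """Return the model_step_<N>.pt path with the largest N, or "" if none exist."""
    best_step, best_path = -1, ""
    for path in glob.glob(os.path.join(model_dir, "model_step_*.pt")):
        match = re.search(r"model_step_(\d+)\.pt$", os.path.basename(path))
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (500, 4000, 10000):
        open(os.path.join(demo_dir, f"model_step_{step}.pt"), "w").close()
    newest = os.path.basename(latest_checkpoint(demo_dir))
print(newest)
```

Comparing step numbers as integers avoids the lexicographic ordering problem where `model_step_500.pt` would sort after `model_step_10000.pt`.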

In [22]:
if USE_PREPROCESSED_DATA is False:
    training_data_files = [PROCESSED_TRAIN_FILE]
else:
    import glob

    pts = sorted(glob.glob(BERT_DATA_PATH + "cnndm.train" + ".[0-9]*.pt"))
    training_data_files = pts

In [23]:
training_data_files

['./bert_data/cnndm.train.0.bert.pt',
 './bert_data/cnndm.train.1.bert.pt',
 './bert_data/cnndm.train.10.bert.pt',
 './bert_data/cnndm.train.100.bert.pt',
 './bert_data/cnndm.train.101.bert.pt',
 './bert_data/cnndm.train.102.bert.pt',
 './bert_data/cnndm.train.103.bert.pt',
 './bert_data/cnndm.train.104.bert.pt',
 './bert_data/cnndm.train.105.bert.pt',
 './bert_data/cnndm.train.106.bert.pt',
 './bert_data/cnndm.train.107.bert.pt',
 './bert_data/cnndm.train.108.bert.pt',
 './bert_data/cnndm.train.109.bert.pt',
 './bert_data/cnndm.train.11.bert.pt',
 './bert_data/cnndm.train.110.bert.pt',
 './bert_data/cnndm.train.111.bert.pt',
 './bert_data/cnndm.train.112.bert.pt',
 './bert_data/cnndm.train.113.bert.pt',
 './bert_data/cnndm.train.114.bert.pt',
 './bert_data/cnndm.train.115.bert.pt',
 './bert_data/cnndm.train.116.bert.pt',
 './bert_data/cnndm.train.117.bert.pt',
 './bert_data/cnndm.train.118.bert.pt',
 './bert_data/cnndm.train.119.bert.pt',
 './bert_data/cnndm.train.12.bert.pt',
 './ber

In [24]:
## train_steps is (number of epochs * the total number of batches in the training data) / accumulation_steps
## batch_size is the maximum number of tokens among all the training examples * number of training examples,
## the training steps used by each GPU should be divided by the number of GPUs used for a fair comparison.
## usually 10K steps can yield a model with decent performance
if QUICK_RUN:
    train_steps = 10000
else:
    train_steps = 50000
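The relationship described in the comments of the cell above can be written out as a small calculation. The numbers used here are made-up placeholders rather than measurements from this dataset, and `training_steps` is a hypothetical helper.

```python
def training_steps(num_epochs, batches_per_epoch, accumulation_steps, num_gpus=1):
    """Optimizer steps per GPU for the given amount of training data."""
    total_steps = (num_epochs * batches_per_epoch) // accumulation_steps
    # For a fair comparison, split the step budget across the GPUs in use.
    return total_steps // num_gpus


single_gpu_steps = training_steps(2, 10000, 2)
four_gpu_steps = training_steps(2, 10000, 2, num_gpus=4)
print(single_gpu_steps, four_gpu_steps)
```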

In [25]:
bertsum_model.fit(
    device_id, training_data_files, train_steps=train_steps, train_from=""
)

[2019-10-15 17:30:45,152 INFO] loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at ./temp/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
[2019-10-15 17:30:45,154 INFO] extracting archive file ./temp/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba to temp dir /tmp/tmpjp15yzc0


{'accum_count': 2, 'batch_size': 3000, 'beta1': 0.9, 'beta2': 0.999, 'block_trigram': True, 'decay_method': 'noam', 'dropout': 0.1, 'encoder': 'transformer', 'ff_size': 512, 'gpu_ranks': '0', 'heads': 4, 'hidden_size': 128, 'inter_layers': 2, 'lr': 0.002, 'max_grad_norm': 0, 'max_nsents': 100, 'max_src_ntokens': 200, 'min_nsents': 3, 'min_src_ntokens': 10, 'optim': 'adam', 'oracle_mode': 'combination', 'param_init': 0.0, 'param_init_glorot': True, 'recall_eval': False, 'report_every': 50, 'report_rouge': True, 'rnn_size': 512, 'save_checkpoint_steps': 500, 'seed': 42, 'temp_dir': './temp', 'test_all': False, 'test_from': '', 'train_from': '', 'use_interval': True, 'visible_gpus': '0', 'warmup_steps': 2000, 'world_size': 1, 'mode': 'train', 'model_path': './models/transformer0.37907517824181713', 'log_file': './logs/transformer0.37907517824181713', 'bert_config_path': './bert_config_uncased_base.json', 'gpu_ranks_map': {0: 0}}


[2019-10-15 17:30:48,988 INFO] Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

[2019-10-15 17:30:54,399 INFO] * number of parameters: 115790849
[2019-10-15 17:30:54,400 INFO] Start training...


device_id 0
gpu_rank 0


[2019-10-15 17:31:15,819 INFO] Step 50/10000; xent: 3.89; lr: 0.0000011;  48 docs/s;     21 sec
[2019-10-15 17:31:36,569 INFO] Step 100/10000; xent: 3.29; lr: 0.0000022;  49 docs/s;     42 sec
[2019-10-15 17:31:57,342 INFO] Step 150/10000; xent: 3.29; lr: 0.0000034;  48 docs/s;     63 sec
[2019-10-15 17:32:18,802 INFO] Step 200/10000; xent: 3.29; lr: 0.0000045;  47 docs/s;     84 sec
[2019-10-15 17:32:39,486 INFO] Step 250/10000; xent: 3.22; lr: 0.0000056;  48 docs/s;    105 sec
[2019-10-15 17:33:00,459 INFO] Step 300/10000; xent: 3.14; lr: 0.0000067;  48 docs/s;    126 sec
[2019-10-15 17:33:21,471 INFO] Step 350/10000; xent: 3.19; lr: 0.0000078;  49 docs/s;    147 sec
[2019-10-15 17:33:42,961 INFO] Step 400/10000; xent: 3.20; lr: 0.0000089;  46 docs/s;    168 sec
[2019-10-15 17:34:03,756 INFO] Step 450/10000; xent: 3.22; lr: 0.0000101;  48 docs/s;    189 sec
[2019-10-15 17:34:24,637 INFO] Step 500/10000; xent: 3.16; lr: 0.0000112;  48 docs/s;    210 sec
[2019-10-15 17:34:24,643 INFO] 

[2019-10-15 17:58:22,621 INFO] Step 3900/10000; xent: 2.79; lr: 0.0000320;  48 docs/s;   1648 sec
[2019-10-15 17:58:43,322 INFO] Step 3950/10000; xent: 2.90; lr: 0.0000318;  49 docs/s;   1668 sec
[2019-10-15 17:59:04,810 INFO] Step 4000/10000; xent: 2.80; lr: 0.0000316;  47 docs/s;   1690 sec
[2019-10-15 17:59:04,814 INFO] Saving checkpoint ./models/transformer0.37907517824181713/model_step_4000.pt
[2019-10-15 17:59:26,773 INFO] Step 4050/10000; xent: 2.89; lr: 0.0000314;  45 docs/s;   1712 sec
[2019-10-15 17:59:47,458 INFO] Step 4100/10000; xent: 2.86; lr: 0.0000312;  49 docs/s;   1733 sec
[2019-10-15 18:00:08,124 INFO] Step 4150/10000; xent: 2.87; lr: 0.0000310;  48 docs/s;   1753 sec
[2019-10-15 18:00:29,734 INFO] Step 4200/10000; xent: 2.93; lr: 0.0000309;  47 docs/s;   1775 sec
[2019-10-15 18:00:50,554 INFO] Step 4250/10000; xent: 2.86; lr: 0.0000307;  49 docs/s;   1796 sec
[2019-10-15 18:01:11,237 INFO] Step 4300/10000; xent: 2.80; lr: 0.0000305;  49 docs/s;   1816 sec
[2019-10-1

[2019-10-15 18:24:44,627 INFO] Step 7650/10000; xent: 2.76; lr: 0.0000229;  49 docs/s;   3230 sec
[2019-10-15 18:25:05,523 INFO] Step 7700/10000; xent: 2.82; lr: 0.0000228;  48 docs/s;   3251 sec
[2019-10-15 18:25:27,089 INFO] Step 7750/10000; xent: 2.86; lr: 0.0000227;  47 docs/s;   3272 sec
[2019-10-15 18:25:47,876 INFO] Step 7800/10000; xent: 2.80; lr: 0.0000226;  48 docs/s;   3293 sec
[2019-10-15 18:26:08,708 INFO] Step 7850/10000; xent: 2.83; lr: 0.0000226;  48 docs/s;   3314 sec
[2019-10-15 18:26:29,493 INFO] Step 7900/10000; xent: 2.83; lr: 0.0000225;  49 docs/s;   3335 sec
[2019-10-15 18:26:51,045 INFO] Step 7950/10000; xent: 2.76; lr: 0.0000224;  47 docs/s;   3356 sec
[2019-10-15 18:27:11,842 INFO] Step 8000/10000; xent: 2.78; lr: 0.0000224;  48 docs/s;   3377 sec
[2019-10-15 18:27:11,846 INFO] Saving checkpoint ./models/transformer0.37907517824181713/model_step_8000.pt
[2019-10-15 18:27:33,976 INFO] Step 8050/10000; xent: 2.82; lr: 0.0000223;  45 docs/s;   3399 sec
[2019-10-1
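
The learning rates printed in the log are consistent with the `noam` decay schedule named in the configuration dump (`'decay_method': 'noam'`, `'lr': 0.002`, `'warmup_steps': 2000`). The sketch below uses the common form of that schedule, `lr * min(step^-0.5, step * warmup_steps^-1.5)`. This is an assumption about what the trainer computes, checked only against the logged values; `noam_lr` is a hypothetical helper.

```python
def noam_lr(step, base_lr=0.002, warmup_steps=2000):
    """Linear warmup for warmup_steps, then decay proportional to step ** -0.5."""
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (50, 500, 4000, 8000):
    print(f"step {step}: lr {noam_lr(step):.7f}")
```

The learning rate grows linearly until step 2000 and decays afterwards, which is why the value logged at step 8000 is lower than the one at step 4000.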

### Model Evaluation

[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)), or Recall-Oriented Understudy for Gisting Evaluation, is commonly used for evaluating text summarization.
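As a rough intuition for the metric, ROUGE-1 compares the unigrams of a candidate summary with those of a reference. The sketch below computes a simplified recall, precision, and F-score from clipped unigram counts. It is an illustration only: the ROUGE-1.5.5 script used through pyrouge applies its own tokenization and options, so its numbers will not match this sketch exactly. `rouge_1` is a hypothetical helper.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Simplified ROUGE-1 on whitespace tokens (illustrative only)."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f_score = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f": f_score}


scores = rouge_1("the cat sat on the mat", "the cat lay on the mat today")
print(scores)
```

Recall rewards covering the reference, precision penalizes extra words, and the F-score balances the two.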

In [26]:
import torch
from utils_nlp.models.bert.extractive_text_summarization import get_data_iter
import os

if USE_PREPROCESSED_DATA is False:
    test_dataset = torch.load(PROCESSED_TEST_FILE)
else:
    test_dataset = []
    for i in range(0, 6):
        filename = os.path.join(BERT_DATA_PATH, "cnndm.test.{0}.bert.pt".format(i))
        test_dataset.extend(torch.load(filename))

In [27]:
checkpoint_to_test = 10000
model_for_test = os.path.join(
    model_base_path + encoder + str(random_number),
    f"model_step_{checkpoint_to_test}.pt",
)
from utils_nlp.models.bert.extractive_text_summarization import Bunch

target = [test_dataset[i]["tgt_txt"] for i in range(len(test_dataset))]
prediction = bertsum_model.predict(
    device_id,
    get_data_iter(test_dataset),
    test_from=model_for_test,
    sentence_seperator="<q>",
)

[2019-10-15 18:42:17,589 INFO] * number of parameters: 115790849


device_id 0
gpu_rank 0


In [28]:
len(prediction)

11489

In [29]:
from utils_nlp.eval.evaluate_summarization import get_rouge

rouge_baseline = get_rouge(prediction, target, "./results/")

11489
11489


2019-10-15 18:45:18,185 [MainThread  ] [INFO ]  Writing summaries.
[2019-10-15 18:45:18,185 INFO] Writing summaries.
2019-10-15 18:45:18,194 [MainThread  ] [INFO ]  Processing summaries. Saving system files to ./results/tmpsubsloro/system and model files to ./results/tmpsubsloro/model.
[2019-10-15 18:45:18,194 INFO] Processing summaries. Saving system files to ./results/tmpsubsloro/system and model files to ./results/tmpsubsloro/model.
2019-10-15 18:45:18,195 [MainThread  ] [INFO ]  Processing files in ./results/rouge-tmp-2019-10-15-18-45-16/candidate/.
[2019-10-15 18:45:18,195 INFO] Processing files in ./results/rouge-tmp-2019-10-15-18-45-16/candidate/.
2019-10-15 18:45:19,514 [MainThread  ] [INFO ]  Saved processed files to ./results/tmpsubsloro/system.
[2019-10-15 18:45:19,514 INFO] Saved processed files to ./results/tmpsubsloro/system.
2019-10-15 18:45:19,516 [MainThread  ] [INFO ]  Processing files in ./results/rouge-tmp-2019-10-15-18-45-16/reference/.
[2019-10-15 18:45:19,516 INF

---------------------------------------------
1 ROUGE-1 Average_R: 0.53085 (95%-conf.int. 0.52800 - 0.53379)
1 ROUGE-1 Average_P: 0.37857 (95%-conf.int. 0.37621 - 0.38098)
1 ROUGE-1 Average_F: 0.42727 (95%-conf.int. 0.42510 - 0.42940)
---------------------------------------------
1 ROUGE-2 Average_R: 0.24509 (95%-conf.int. 0.24219 - 0.24801)
1 ROUGE-2 Average_P: 0.17560 (95%-conf.int. 0.17344 - 0.17784)
1 ROUGE-2 Average_F: 0.19750 (95%-conf.int. 0.19532 - 0.19981)
---------------------------------------------
1 ROUGE-L Average_R: 0.48578 (95%-conf.int. 0.48310 - 0.48855)
1 ROUGE-L Average_P: 0.34703 (95%-conf.int. 0.34466 - 0.34939)
1 ROUGE-L Average_F: 0.39138 (95%-conf.int. 0.38922 - 0.39357)



In [30]:
prediction[0]

'andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program in italy when the incident happened in january .<q>he was flown back to chicago via air ambulance on march 20 , but he died on sunday .<q>he was taken to a medical facility in the chicago area , close to his family home in glen ellyn .'

In [31]:
target[0]

'andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program when the incident happened in january<q>he was flown back to chicago via air on march 20 but he died on sunday<q>initial police reports indicated the fall was an accident but authorities are investigating the possibility that mogni was robbed<q>his cousin claims he was attacked and thrown 40ft from a bridge'
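
Both the prediction and the target use `<q>` as the sentence separator, so splitting on it gives a sentence-level view that is easier to compare by eye. A minimal sketch with a hypothetical `split_summary` helper and a made-up input string:

```python
def split_summary(summary, separator="<q>"):
    """Split a summary string into sentences, dropping empty trailing pieces."""
    return [part.strip() for part in summary.split(separator) if part.strip()]


parts = split_summary("first sentence .<q>second sentence .<q>")
for sentence in parts:
    print(sentence)
```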

### Prediction

In [32]:
from utils_nlp.models.bert.extractive_text_summarization import Bunch

args = Bunch(
    {
        "max_nsents": int(1e5),
        "max_src_ntokens": int(2e6),
        "min_nsents": -1,
        "min_src_ntokens": -1,
        "use_interval": True,
    }
)

In [33]:
from bertsum.prepro.data_builder import BertData

bertdata = BertData(args)

[2019-10-15 18:47:09,248 INFO] loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /home/daden/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


In [34]:
import nltk
from utils_nlp.dataset.harvardnlp_cnndm import preprocess
from nltk import tokenize
from bertsum.others.utils import clean


def preprocess_source(line):
    return preprocess((line, [clean, tokenize.sent_tokenize], nltk.word_tokenize))

In [35]:
text = "\n".join(test_dataset[0]["src_txt"])

In [36]:
new_src = preprocess_source(text)
b_data = bertdata.preprocess(new_src, None, None)
indexed_tokens, labels, segments_ids, cls_ids, src_txt, tgt_txt = b_data
b_data_dict = {
    "src": indexed_tokens,
    "labels": labels,
    "segs": segments_ids,
    "clss": cls_ids,
    "src_txt": src_txt,
    "tgt_txt": tgt_txt,
}

In [37]:
new_src

[['a',
  'university',
  'of',
  'iowa',
  'student',
  'has',
  'died',
  'nearly',
  'three',
  'months',
  'after',
  'a',
  'fall',
  'in',
  'rome',
  'in',
  'a',
  'suspected',
  'robbery',
  'attack',
  'in',
  'rome',
  '.'],
 ['andrew',
  'mogni',
  ',',
  '20',
  ',',
  'from',
  'glen',
  'ellyn',
  ',',
  'illinois',
  ',',
  'had',
  'only',
  'just',
  'arrived',
  'for',
  'a',
  'semester',
  'program',
  'in',
  'italy',
  'when',
  'the',
  'incident',
  'happened',
  'in',
  'january',
  '.'],
 ['he',
  'was',
  'flown',
  'back',
  'to',
  'chicago',
  'via',
  'air',
  'ambulance',
  'on',
  'march',
  '20',
  ',',
  'but',
  'he',
  'died',
  'on',
  'sunday',
  '.'],
 ['andrew',
  'mogni',
  ',',
  '20',
  ',',
  'from',
  'glen',
  'ellyn',
  ',',
  'illinois',
  ',',
  'a',
  'university',
  'of',
  'iowa',
  'student',
  'has',
  'died',
  'nearly',
  'three',
  'months',
  'after',
  'a',
  'fall',
  'in',
  'rome',
  'in',
  'a',
  'suspected',
  'robbery',

In [38]:
b_data_dict["src_txt"]

['a university of iowa student has died nearly three months after a fall in rome in a suspected robbery attack in rome .',
 'andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program in italy when the incident happened in january .',
 'he was flown back to chicago via air ambulance on march 20 , but he died on sunday .',
 'andrew mogni , 20 , from glen ellyn , illinois , a university of iowa student has died nearly three months after a fall in rome in a suspected robbery he was taken to a medical facility in the chicago area , close to his family home in glen ellyn .',
 "he died on sunday at northwestern memorial hospital - medical examiner 's office spokesman frank shuftan says a cause of death wo n't be released until monday at the earliest .",
 'initial police reports indicated the fall was an accident but authorities are investigating the possibility that mogni was robbed .',
 "on sunday , his cousin abby wrote online : ` this morning my cousin a

In [39]:
b_data_dict["tgt_txt"]

In [40]:
model_for_test = os.path.join(
    model_base_path + encoder + str(random_number),
    f"model_step_{checkpoint_to_test}.pt",
)
# get_data_iter(output,batch_size=30000)
prediction = bertsum_model.predict(
    device_id,
    get_data_iter([b_data_dict], False),
    test_from=model_for_test,
    sentence_seperator="<q>",
)

[2019-10-15 18:47:11,792 INFO] * number of parameters: 115790849


device_id 0
gpu_rank 0


In [41]:
prediction[0]

'andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program in italy when the incident happened in january .<q>he was flown back to chicago via air ambulance on march 20 , but he died on sunday .<q>a university of iowa student has died nearly three months after a fall in rome in a suspected robbery attack in rome .'

In [42]:
test_dataset[0]["tgt_txt"]

'andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program when the incident happened in january<q>he was flown back to chicago via air on march 20 but he died on sunday<q>initial police reports indicated the fall was an accident but authorities are investigating the possibility that mogni was robbed<q>his cousin claims he was attacked and thrown 40ft from a bridge'