## OpenNMT for English text summarization

The [OpenNMT](http://opennmt.net/OpenNMT-py/main.html) framework is an open-source neural machine translation system that was designed for machine translation research and later extended to text summarization. This notebook uses the open-source OpenNMT code to summarize CNN-DM paragraphs into simple summaries of fewer sentences.

In [0]:
import torch 

# use only if the packages need to be updated

!pip install pandas --upgrade
!pip install imgaug --upgrade
!pip install numpy --upgrade

Clone the source code from the GitHub repository [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py) and install the requirements.

In [0]:
!git clone https://github.com/OpenNMT/OpenNMT-py

Cloning into 'OpenNMT-py'...
remote: Enumerating objects: 13531, done.
remote: Total 13531 (delta 0), reused 0 (delta 0), pack-reused 13531
Receiving objects: 100% (13531/13531), 145.37 MiB | 34.73 MiB/s, done.
Resolving deltas: 100% (9648/9648), done.


In [0]:
%cd OpenNMT-py

/content/OpenNMT-py


In [0]:
# install all the required libraries

!pip install -r requirements.txt

Collecting git+https://github.com/pytorch/text.git@master#wheel=torchtext (from -r requirements.txt (line 4))
  Cloning https://github.com/pytorch/text.git (to revision master) to /tmp/pip-req-build-6x6ronul
Collecting tqdm==4.30.* (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/76/4c/103a4d3415dafc1ddfe6a6624333971756e2d3dd8c6dc0f520152855f040/tqdm-4.30.0-py2.py3-none-any.whl (47kB)
    100% |████████████████████████████████| 51kB 2.1MB/s
Collecting configargparse (from -r requirements.txt (line 6))
  Downloading https://files.pythonhosted.org/packages/55/ea/f0ade52790bcd687127a302b26c1663bf2e0f23210d5281dbfcd1dfcda28/ConfigArgParse-0.14.0.tar.gz
Building wheels for collected packages: configargparse, torchtext
  Building wheel for configargparse (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/aa/9c/ce/7e904dddb8c7595ffbe3409d24455bc5005852850e36011bda
  Building wheel for torchtext (setup.py) ...

In [0]:
# download the pre-trained models from Amazon S3

%cd available_models

! wget https://s3.amazonaws.com/opennmt-models/gigaword_copy_acc_51.78_ppl_11.71_e20.pt # gigaword model
  
! wget https://s3.amazonaws.com/opennmt-models/ada6_bridge_oldcopy_tagged_larger_acc_54.84_ppl_10.58_e17.pt # CNN-DM model

/content/OpenNMT-py/available_models
--2019-03-21 16:50:32--  https://s3.amazonaws.com/opennmt-models/gigaword_copy_acc_51.78_ppl_11.71_e20.pt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.104.229
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.104.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 346576972 (331M) [application/x-www-form-urlencoded]
Saving to: ‘gigaword_copy_acc_51.78_ppl_11.71_e20.pt’


2019-03-21 16:50:35 (95.6 MB/s) - ‘gigaword_copy_acc_51.78_ppl_11.71_e20.pt’ saved [346576972/346576972]

--2019-03-21 16:50:40--  https://s3.amazonaws.com/opennmt-models/ada6_bridge_oldcopy_tagged_larger_acc_54.84_ppl_10.58_e17.pt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.20.101
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.20.101|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 904867573 (863M) [application/x-www-form-urlencoded]
Saving to: ‘ada6_bridge_oldcopy_tagged_larger_acc_54.84_ppl

In [0]:
# download the zip file of CNN-DM datasets and unzip it

%cd ../data

! wget https://s3.amazonaws.com/opennmt-models/Summary/cnndm.tar.gz # CNN-DM corpus, used here for its test set
  
! tar -xvzf cnndm.tar.gz

/content/OpenNMT-py/data
--2019-03-21 16:40:26--  https://s3.amazonaws.com/opennmt-models/Summary/cnndm.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.184.173
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.184.173|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 500375629 (477M) [application/x-gzip]
Saving to: ‘cnndm.tar.gz’


2019-03-21 16:40:38 (40.0 MB/s) - ‘cnndm.tar.gz’ saved [500375629/500375629]

test.txt.src
test.txt.tgt.tagged
train.txt.src
train.txt.tgt.tagged
val.txt.src
val.txt.tgt.tagged
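
The `.tgt.tagged` files hold the reference summaries with sentence markers. A minimal sketch for turning one tagged line into a list of plain sentences follows. It assumes the markers are `<t>` and `</t>`, as the `-ignore_when_blocking` arguments of the translation command in this notebook suggest; the helper name `split_tagged_summary` is hypothetical and not part of OpenNMT-py.

```python
def split_tagged_summary(line):
    """Split one tagged summary line into plain sentences.

    Assumption: each sentence is wrapped as "<t> ... </t>".
    """
    sentences = []
    for chunk in line.split("</t>"):
        # Drop the opening tag and normalize whitespace.
        text = " ".join(chunk.replace("<t>", " ").split())
        if text:
            sentences.append(text)
    return sentences


# Example with a made-up tagged line (not taken from the dataset):
example = "<t> first sentence . </t> <t> second sentence . </t>"
print(split_tagged_summary(example))
```

The same helper could be applied to the generated summaries if they carry the same markers.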


In [0]:
# create an empty file to store the summarized output

! touch test_cnn.out
%cd ..

/content/OpenNMT-py


In [0]:
# translate the source test dataset into short text summaries using the pre-trained CNN-DM model

!python translate.py -gpu 0 \
                    -batch_size 20 \
                    -beam_size 10 \
                    -model available_models/ada6_bridge_oldcopy_tagged_larger_acc_54.84_ppl_10.58_e17.pt \
                    -src data/test.txt.src \
                    -output data/test_cnn.out \
                    -min_length 35 \
                    -verbose \
                    -stepwise_penalty \
                    -coverage_penalty summary \
                    -beta 5 \
                    -length_penalty wu \
                    -alpha 0.9 \
                    -block_ngram_repeat 3 \
                    -ignore_when_blocking "." "</t>" "<t>"

[2019-03-21 16:54:48,102 INFO] Translating shard 0.
  var = torch.tensor(arr, dtype=self.dtype, device=device)

SENT 1: ['marseille', ',', 'france', '-lrb-', 'cnn', '-rrb-', 'the', 'french', 'prosecutor', 'leading', 'an', 'investigation', 'into', 'the', 'crash', 'of', 'germanwings', 'flight', '9525', 'insisted', 'wednesday', 'that', 'he', 'was', 'not', 'aware', 'of', 'any', 'video', 'footage', 'from', 'on', 'board', 'the', 'plane', '.', 'marseille', 'prosecutor', 'brice', 'robin', 'told', 'cnn', 'that', '``', 'so', 'far', 'no', 'videos', 'were', 'used', 'in', 'the', 'crash', 'investigation', '.', "''", 'he', 'added', ',', '``', 'a', 'person', 'who', 'has', 'such', 'a', 'video', 'needs', 'to', 'immediately', 'give', 'it', 'to', 'the', 'investigators', '.', "''", 'robin', "'s", 'comments', 'follow', 'claims', 'by', 'two', 'magazines', ',', 'german', 'daily', 'bild', 'and', 'french', 'paris', 'match', ',', 'of', 'a', 'cell', 'phone', 'video', 'showing', 'the', 'harrowing', 'final', 'secon
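
Two of the decoding options passed to `translate.py` above can be illustrated in plain Python. The sketch below is a simplified illustration, not OpenNMT-py's own code. The length penalty follows the GNMT formulation of Wu et al. (2016), which the `wu` option name refers to. The n-gram check shows the idea behind `-block_ngram_repeat 3`; skipping n-grams that contain an ignored token is an assumption about how `-ignore_when_blocking` behaves. Both function names are hypothetical.

```python
def wu_length_penalty(length, alpha=0.9):
    """GNMT-style length penalty: ((5 + length) / 6) ** alpha.

    In GNMT the hypothesis log-probability is divided by this value,
    so a larger alpha favours longer outputs.
    """
    return ((5.0 + length) / 6.0) ** alpha


def repeats_ngram(prefix, candidate, n=3, ignore=()):
    """Return True if appending `candidate` to `prefix` repeats an n-gram.

    Assumption: n-grams containing any token from `ignore` are never blocked.
    """
    tokens = list(prefix) + [candidate]
    if len(tokens) < n:
        return False
    new_gram = tuple(tokens[-n:])
    if any(tok in ignore for tok in new_gram):
        return False
    # All n-grams that occur before the newly formed one.
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n)}
    return new_gram in seen


prefix = "the plane crashed and the plane".split()
print(repeats_ngram(prefix, "crashed"))   # trigram "the plane crashed" already occurred
print(repeats_ngram(prefix, "landed"))    # new trigram
print(wu_length_penalty(35, alpha=0.9))   # penalty at the minimum length used above
```

In a beam search, a candidate for which the check returns True would be discarded or given a very low score, which is what keeps the summaries from looping over the same phrase.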

Because the test dataset is huge and caused memory issues, the notebook was actually run on a computational server (4 NVIDIA GeForce GTX 1080 Ti GPUs, 768 GB RAM, and 80×2.2 GHz CPUs) of the Data and Web Science Group at the University of Mannheim.
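
For a quick trial on a machine with less memory, one option is to summarize only the first few articles of the test set. The sketch below copies the first lines of a source file into a smaller file, assuming one article per line, as the `SENT 1` output above suggests. The helper `write_head` and the output name `data/test_small.txt.src` are hypothetical and not part of the original setup.

```python
from itertools import islice


def write_head(src_path, dst_path, n_lines):
    """Copy the first n_lines lines of src_path to dst_path.

    Returns the number of lines actually written, which is smaller
    than n_lines if the source file is shorter.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written


# Hypothetical usage from the OpenNMT-py directory:
# write_head("data/test.txt.src", "data/test_small.txt.src", 100)
```

The smaller file can then be passed to `translate.py` through `-src` in place of `data/test.txt.src`.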