## TensorFlow English text summarization

This notebook uses the summarization system from the Natural Language Processing Lab at Tsinghua University, China, to summarize three different datasets. The implementation is a TensorFlow sequence-to-sequence model with a bidirectional GRU encoder and a GRU decoder. The notebook also evaluates the summaries of the test datasets with pyrouge.
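To make the encoder description concrete, here is a minimal pure-Python sketch of a bidirectional GRU encoder. It is not the repository's TensorFlow code: the `GRUCell` class, the helper names, the random weights and the omission of bias terms are all assumptions made for illustration. It only shows the general idea that one GRU reads the sequence left to right, a second reads it right to left, and the two hidden states are concatenated at each time step.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Toy GRU cell with random weights and no biases (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # One weight pair per gate: update (z), reset (r), candidate (c).
        self.W = {g: rand_matrix(hidden_size, input_size) for g in "zrc"}
        self.U = {g: rand_matrix(hidden_size, hidden_size) for g in "zrc"}
        self.hidden_size = hidden_size

    def step(self, x, h):
        def gate(name, state, activation):
            pre = [a + b for a, b in zip(matvec(self.W[name], x), matvec(self.U[name], state))]
            return [activation(p) for p in pre]

        z = gate("z", h, sigmoid)
        r = gate("r", h, sigmoid)
        c = gate("c", [ri * hi for ri, hi in zip(r, h)], math.tanh)
        # One common convention; some libraries swap the roles of z and (1 - z).
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]


def run(cell, inputs):
    """Run a cell over a sequence from a zero state and collect every hidden state."""
    h = [0.0] * cell.hidden_size
    states = []
    for x in inputs:
        h = cell.step(x, h)
        states.append(h)
    return states


def bidirectional_encode(fw_cell, bw_cell, inputs):
    forward = run(fw_cell, inputs)
    backward = run(bw_cell, inputs[::-1])[::-1]  # re-align with the input order
    return [f + b for f, b in zip(forward, backward)]  # concatenate per time step


rng = random.Random(0)
inputs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
states = bidirectional_encode(GRUCell(4, 3, rng), GRUCell(4, 3, rng), inputs)
print(len(states), len(states[0]))  # 5 time steps, 6 = 2 * hidden_size values each
```

In a sequence-to-sequence model, a decoder GRU would then consume these encoder states to generate the summary token by token. That part is left out of the sketch.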

In [2]:
# Mount Google Drive; the source code is stored there

from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
%cd gdrive/My Drive/Thesis

/content/gdrive/My Drive/Thesis


Clone the source code from the GitHub repository [TensorFlow-Summarization](https://github.com/thunlp/TensorFlow-Summarization.git) into the mounted Drive folder.

In [7]:
!git clone https://github.com/thunlp/TensorFlow-Summarization.git

Cloning into 'TensorFlow-Summarization'...
remote: Enumerating objects: 218, done.
remote: Total 218 (delta 0), reused 0 (delta 0), pack-reused 218
Receiving objects: 100% (218/218), 871.09 KiB | 11.03 MiB/s, done.
Resolving deltas: 100% (115/115), done.


In [8]:
%cd TensorFlow-Summarization

/content/gdrive/My Drive/Thesis/TensorFlow-Summarization


In [9]:
# Install the CPU build of TensorFlow 1.1

!pip3 install -U tensorflow==1.1

Collecting tensorflow==1.1
  Downloading https://files.pythonhosted.org/packages/cd/e4/b2a8bcd1fa689489050386ec70c5c547e4a75d06f2cc2b55f45463cd092c/tensorflow-1.1.0-cp36-cp36m-manylinux1_x86_64.whl (31.4MB)
     |████████████████████████████████| 31.4MB 59.8MB/s
ERROR: stable-baselines 2.2.1 has requirement tensorflow>=1.5.0, but you'll have tensorflow 1.1.0 which is incompatible.
ERROR: magenta 0.3.19 has requirement tensorflow>=1.12.0, but you'll have tensorflow 1.1.0 which is incompatible.
Installing collected packages: tensorflow
  Found existing installation: tensorflow 1.14.0
    Uninstalling tensorflow-1.14.0:
      Successfully uninstalled tensorflow-1.14.0
Successfully installed tensorflow-1.1.0


In [10]:
!pip3 uninstall -y tensorflow-gpu  # Uninstall the GPU build of TensorFlow if it is installed; -y skips the confirmation prompt



In [0]:
import tensorflow
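
Because the install step replaced TensorFlow 1.14.0 with 1.1.0, it can be worth confirming which version is actually imported. The helper below is a small sketch, and `matches_pinned_version` is a hypothetical name introduced here. It compares version components rather than string prefixes, since a plain prefix check would wrongly accept `1.14.0` as matching `1.1`.

```python
def matches_pinned_version(installed, pinned="1.1"):
    """Return True if `installed` (e.g. "1.1.0") starts with the pinned release components."""
    installed_parts = installed.split(".")
    pinned_parts = pinned.split(".")
    return installed_parts[:len(pinned_parts)] == pinned_parts


# In the notebook this could be called as (assumes TensorFlow is importable):
#   import tensorflow
#   assert matches_pinned_version(tensorflow.__version__)
print(matches_pinned_version("1.1.0"))   # True
print(matches_pinned_version("1.14.0"))  # False
```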

The datasets are downloaded from [here](https://github.com/harvardnlp/sent-summary) and the pretrained model is downloaded from [here](https://drive.google.com/drive/folders/1IiwyHBzK7xvUtMrRY7VHzIRJzkzXn4LM?usp=sharing).

The dataset zip file is extracted into the `data` folder and the models are extracted into the `model` folder. The files are arranged as shown in the GitHub [repository](https://github.com/thunlp/TensorFlow-Summarization.git).
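
Before running the test script, a quick check that the extracted files are where they are expected can save a failed run. The sketch below assumes it is run from the repository root. The file list is taken only from paths that appear in this notebook's test log and evaluation commands, so it may not be complete, and `find_missing` is a hypothetical helper name.

```python
from pathlib import Path

# Paths that appear in this notebook's test log and files2rouge commands.
EXPECTED_FILES = [
    "data/doc_dict.txt",
    "data/sum_dict.txt",
    "data/test.giga.txt",
    "data/task1_ref0_duc2003.txt",
    "data/task1_ref0_duc2004.txt",
]


def find_missing(root, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist as files under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


missing = find_missing(".")
print("missing files:", missing if missing else "none")
```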

In [14]:
# Run the test script, which summarizes the test datasets in the data folder using the pre-trained model.
# The machine-generated summaries for the three datasets (Gigaword, DUC 2003 and DUC 2004) are saved in the output folder.

!python script/test.py

[['model', '300000']]
Aug 09 13:12 test.py[line:49] INFO Test model/model.ckpt-300000. 
Aug 09 13:12 test.py[line:53] INFO Test data/test.giga.txt with beam_size = 1
Aug 09 13:12 data_util.py[line:17] INFO Try load dict from data/doc_dict.txt.
Aug 09 13:12 data_util.py[line:33] INFO Load dict data/doc_dict.txt with 30000 words.
Aug 09 13:12 data_util.py[line:17] INFO Try load dict from data/sum_dict.txt.
Aug 09 13:12 data_util.py[line:33] INFO Load dict data/sum_dict.txt with 30000 words.
Aug 09 13:12 data_util.py[line:172] INFO Load test document from data/test.giga.txt.
Aug 09 13:12 data_util.py[line:178] INFO Load 1951 testing documents.
Aug 09 13:12 data_util.py[line:183] INFO Doc dict covers 97.70% words.
2019-08-09 13:12:29.697472: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2019-08-09 13:12:29.697528: W tensorflow/core/platform/c

**Evaluation using ROUGE**

Pyrouge is used here for evaluation. Scoring is done per document rather than per sentence.
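
To illustrate what the recall (R), precision (P) and F scores reported below measure, here is a simplified ROUGE-N computation for a single candidate/reference pair. It is a sketch only: the actual evaluation runs the ROUGE-1.5.5 toolkit, which has its own tokenization, options and confidence-interval estimation, so its numbers will generally differ. The function name `rouge_n` and the two example sentences are made up for illustration.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N for one pair of whitespace-tokenized strings."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1


r, p, f = rouge_n("police kill the gunman", "the gunman was killed by police")
print(f"R={r:.3f} P={p:.3f} F={f:.3f}")  # R=0.500 P=0.750 F=0.600
```

Three of the candidate's four unigrams appear among the six reference unigrams, which gives a recall of 3/6 and a precision of 3/4.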

In [15]:
# install pyrouge using pip

!pip install -U git+https://github.com/pltrdy/pyrouge

Collecting git+https://github.com/pltrdy/pyrouge
  Cloning https://github.com/pltrdy/pyrouge to /tmp/pip-req-build-_zoml8mq
  Running command git clone -q https://github.com/pltrdy/pyrouge /tmp/pip-req-build-_zoml8mq
Building wheels for collected packages: pyrouge
  Building wheel for pyrouge (setup.py) ... done
  Created wheel for pyrouge: filename=pyrouge-0.1.3-cp36-none-any.whl size=191883 sha256=c34d4d9b512626d71672b6084fffe37b37345cf65b6dbf8f4d925c7c64295889
  Stored in directory: /tmp/pip-ephem-wheel-cache-9qgzgcn4/wheels/d7/6a/52/2534a6d70b54de6e94933267ff415b0b85cdb83836af9f74c3
Successfully built pyrouge
Installing collected packages: pyrouge
Successfully installed pyrouge-0.1.3


In [16]:
# download files2rouge code and install it

!git clone https://github.com/pltrdy/files2rouge.git     
%cd files2rouge
!python setup_rouge.py
!python setup.py install

Cloning into 'files2rouge'...
remote: Enumerating objects: 202, done.
remote: Total 202 (delta 0), reused 0 (delta 0), pack-reused 202
Receiving objects: 100% (202/202), 195.34 KiB | 980.00 KiB/s, done.
Resolving deltas: 100% (84/84), done.
/content/gdrive/My Drive/Thesis/TensorFlow-Summarization/files2rouge
files2rouge uses scripts and tools that will not be stored with the python package
where do you want to save it? [default: /root/.files2rouge/]/content/gdrive/My Drive/Thesis/TensorFlow-Summarization/files2rouge
Copying './files2rouge/RELEASE-1.5.5/' to '/content/gdrive/My Drive/Thesis/TensorFlow-Summarization/files2rouge'
Traceback (most recent call last):
  File "setup_rouge.py", line 40, in <module>
    data = copy_rouge()
  File "setup_rouge.py", line 33, in copy_rouge
    shutil.copytree(src_rouge_root, path)
  File "/usr/lib/python3.6/shutil.py", line 321, in copytree
    os.makedirs(dst)
  File "/usr/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
Fil

In [0]:
# Evaluate the DUC 2003 summaries

! files2rouge ../output/duc2003.10_300000.txt ../data/task1_ref0_duc2003.txt #--verbose

Preparing documents...
Running ROUGE...
---------------------------------------------
1 ROUGE-1 Average_R: 0.25589 (95%-conf.int. 0.24063 - 0.27148)
1 ROUGE-1 Average_P: 0.27932 (95%-conf.int. 0.26346 - 0.29460)
1 ROUGE-1 Average_F: 0.26012 (95%-conf.int. 0.24529 - 0.27475)
---------------------------------------------
1 ROUGE-2 Average_R: 0.08260 (95%-conf.int. 0.07293 - 0.09203)
1 ROUGE-2 Average_P: 0.09443 (95%-conf.int. 0.08331 - 0.10499)
1 ROUGE-2 Average_F: 0.08663 (95%-conf.int. 0.07662 - 0.09651)
---------------------------------------------
1 ROUGE-L Average_R: 0.23126 (95%-conf.int. 0.21694 - 0.24600)
1 ROUGE-L Average_P: 0.25173 (95%-conf.int. 0.23722 - 0.26633)
1 ROUGE-L Average_F: 0.23430 (95%-conf.int. 0.22075 - 0.24771)

Elapsed time: 5.567 seconds


In [0]:
# Evaluate the DUC 2004 summaries

! files2rouge ../output/duc2004.10_300000.txt ../data/task1_ref0_duc2004.txt #--verbose

Preparing documents...
Running ROUGE...
---------------------------------------------
1 ROUGE-1 Average_R: 0.27251 (95%-conf.int. 0.25767 - 0.28864)
1 ROUGE-1 Average_P: 0.29057 (95%-conf.int. 0.27528 - 0.30775)
1 ROUGE-1 Average_F: 0.27619 (95%-conf.int. 0.26169 - 0.29189)
---------------------------------------------
1 ROUGE-2 Average_R: 0.09100 (95%-conf.int. 0.08003 - 0.10155)
1 ROUGE-2 Average_P: 0.09839 (95%-conf.int. 0.08678 - 0.10993)
1 ROUGE-2 Average_F: 0.09251 (95%-conf.int. 0.08187 - 0.10312)
---------------------------------------------
1 ROUGE-L Average_R: 0.23900 (95%-conf.int. 0.22509 - 0.25470)
1 ROUGE-L Average_P: 0.25367 (95%-conf.int. 0.23872 - 0.26959)
1 ROUGE-L Average_F: 0.24145 (95%-conf.int. 0.22769 - 0.25601)

Elapsed time: 3.122 seconds
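
To compare runs across datasets, the printed reports can be parsed into a dictionary. The following is a sketch that assumes the line format shown in the outputs above. The regular expression and the `parse_rouge` helper are written for this notebook and are not part of files2rouge. The sample lines are copied from the DUC 2003 report.

```python
import re

# Matches lines such as:
# 1 ROUGE-1 Average_F: 0.26012 (95%-conf.int. 0.24529 - 0.27475)
LINE_RE = re.compile(
    r"ROUGE-(?P<metric>\S+) Average_(?P<kind>[RPF]): (?P<value>\d+\.\d+) "
    r"\(95%-conf\.int\. (?P<low>\d+\.\d+) - (?P<high>\d+\.\d+)\)"
)


def parse_rouge(report):
    """Map (metric, kind) to (value, lower bound, upper bound) for each report line."""
    scores = {}
    for match in LINE_RE.finditer(report):
        key = (f"ROUGE-{match['metric']}", match["kind"])
        scores[key] = (float(match["value"]), float(match["low"]), float(match["high"]))
    return scores


sample = """\
1 ROUGE-1 Average_F: 0.26012 (95%-conf.int. 0.24529 - 0.27475)
1 ROUGE-2 Average_F: 0.08663 (95%-conf.int. 0.07662 - 0.09651)
"""
for (metric, kind), (value, low, high) in parse_rouge(sample).items():
    print(metric, kind, value, low, high)
```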
