# Improving Text Summarisation on WikiHow Data using Transfer Learning



**This notebook presents our implementation for the COMP0087 - Statistical Natural Language Processing project.**

We showcase the power of transfer learning for text summarisation, using the BERT-based BertSum models ([Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345)) on the WikiHow dataset ([WikiHow: A Large Scale Text Summarization Dataset](https://arxiv.org/abs/1810.09305)).

The implementation includes code from [PreSumm GitHub](https://github.com/nlpyang/PreSumm), modified to suit our research purposes.

We include the pre-trained BertSumExt model obtained from [here](https://drive.google.com/file/d/1kKWoV0QCbeIuFt85beQgJ4v0lujaXobJ/view), the model we trained from scratch and our best performing model trained using transfer learning.

A demo version comparing these last two models on a small WikiHow test dataset is available [here](https://drive.google.com/open?id=1mwpa8DIFEB2aO43AbbFwlVE9YZy_fpK4).

Download the archive containing the code, data and models:

In [1]:
!wget --load-cookies /tmp/cookies.txt "https://drive.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.google.com/uc?export=download&id=1-Wgbe4fLdh4TWSrMkQ21HCsxi4qixh-i' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1-Wgbe4fLdh4TWSrMkQ21HCsxi4qixh-i" -O Team36.zip && rm -rf /tmp/cookies.txt
!unzip Team36.zip

--2020-04-02 22:26:45--  https://drive.google.com/uc?export=download&confirm=tk3X&id=1-Wgbe4fLdh4TWSrMkQ21HCsxi4qixh-i
Resolving drive.google.com (drive.google.com)... 64.233.188.101, 64.233.188.113, 64.233.188.138, ...
Connecting to drive.google.com (drive.google.com)|64.233.188.101|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-10-b4-docs.googleusercontent.com/docs/securesc/jrdm4pvp76j2algtveo3i6mhg87n4hvi/hinkk4r7nu8kttf9gi0phnj46hvsh765/1585866375000/13490934451747665095/08700569489280539038Z/1-Wgbe4fLdh4TWSrMkQ21HCsxi4qixh-i?e=download [following]
--2020-04-02 22:26:45--  https://doc-10-b4-docs.googleusercontent.com/docs/securesc/jrdm4pvp76j2algtveo3i6mhg87n4hvi/hinkk4r7nu8kttf9gi0phnj46hvsh765/1585866375000/13490934451747665095/08700569489280539038Z/1-Wgbe4fLdh4TWSrMkQ21HCsxi4qixh-i?e=download
Resolving doc-10-b4-docs.googleusercontent.com (doc-10-b4-docs.googleusercontent.com)... 108.177.125.132, 2404:6800:4008:c01::84

## Installing dependencies:

In [2]:
!pip install pytorch_pretrained_bert
!pip install tensorboardX
!pip install pytorch_transformers
!pip install torch==1.1.0 torchvision==0.3.0

Collecting pytorch_pretrained_bert
  Downloading https://files.pythonhosted.org/packages/d7/e0/c08d5553b89973d9a240605b9c12404bcf8227590de62bae27acbcfe076b/pytorch_pretrained_bert-0.6.2-py3-none-any.whl (123kB)

Proper installation of pyrouge

In [3]:
!git clone https://github.com/bheinzerling/pyrouge
%cd pyrouge
!pip install -e .

!git clone https://github.com/andersjo/pyrouge.git rouge

!pyrouge_set_rouge_path /content/pyrouge/rouge/tools/ROUGE-1.5.5/

!sudo apt-get install libxml-parser-perl

%cd rouge/tools/ROUGE-1.5.5/data
!rm WordNet-2.0.exc.db
!./WordNet-2.0-Exceptions/buildExeptionDB.pl ./WordNet-2.0-Exceptions ./smart_common_words.txt ./WordNet-2.0.exc.db

Cloning into 'pyrouge'...
remote: Enumerating objects: 551, done.
remote: Total 551 (delta 0), reused 0 (delta 0), pack-reused 551
Receiving objects: 100% (551/551), 123.17 KiB | 229.00 KiB/s, done.
Resolving deltas: 100% (198/198), done.
/content/pyrouge
Obtaining file:///content/pyrouge
Installing collected packages: pyrouge
  Running setup.py develop for pyrouge
Successfully installed pyrouge
Cloning into 'rouge'...
remote: Enumerating objects: 393, done.
remote: Total 393 (delta 0), reused 0 (delta 0), pack-reused 393
Receiving objects: 100% (393/393), 298.74 KiB | 544.00 KiB/s, done.
Resolving deltas: 100% (109/109), done.
2020-04-02 22:29:46,673 [MainThread  ] [INFO ]  Set ROUGE home directory to /content/pyrouge/rouge/tools/ROUGE-1.5.5/.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libauthen-sasl-perl libdata-dump-perl libencode-locale-perl
  libfile-listing-perl

Change the default directory back

In [0]:
import os
os.chdir('/content')

## Pre-process the WikiHow dataset

We used the `wikihowAll.csv` version of the dataset, which contains the concatenated articles and summaries. The code in this section was run locally, because some of the intermediate files it generates take up a lot of memory and caused Colab to crash.

### Obtain the .story files

In [0]:
!python /content/Team36/wikihow_prepro/process.py
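
`process.py` is not reproduced in this notebook. As a rough sketch of what this step produces, the snippet below writes one `.story` file per article. It assumes the CNN/DailyMail layout (article text followed by `@highlight` blocks) and assumes that `wikihowAll.csv` has `title`, `headline` (summary) and `text` (article) columns. The helper names `safe_name`, `write_story` and `csv_to_stories`, and the file-naming scheme, are hypothetical and may differ from the actual script.

```python
import csv
import os
import re


def safe_name(title):
    """Turn an article title into a file-system friendly name (hypothetical scheme)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")


def write_story(row, out_dir):
    """Write one article and its summary sentences as a CNN/DailyMail-style .story file."""
    article = row["text"].strip()
    # Assumption: each non-empty line of the headline field is one summary sentence.
    highlights = [s.strip() for s in row["headline"].splitlines() if s.strip()]
    if not article or not highlights:
        return None  # skip rows with a missing article or summary
    name = safe_name(row["title"]) + ".story"
    with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
        f.write(article + "\n\n")
        for sentence in highlights:
            f.write("@highlight\n\n%s\n\n" % sentence)
    return name


def csv_to_stories(csv_path, out_dir):
    """Convert every row of the CSV and return the list of written file names."""
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as f:
        written = [write_story(row, out_dir) for row in csv.DictReader(f)]
    return [name for name in written if name is not None]
```

A list like the one returned by `csv_to_stories` could serve as the contents of a titles file for the train/validate/test split.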

### Check that Stanford CoreNLP works

In [0]:
import os
os.environ['CLASSPATH']="/content/Team36/wikihow_prepro/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar"
!echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

### Sentence Splitting and Tokenisation

In [0]:
!python /content/Team36/src/preprocess.py -mode tokenize -raw_path /content/Team36/wikihow_prepro/raw_data -save_path /content/Team36/wikihow_prepro/json_data

### Obtain the mapping for train/validate/test datasets

In [0]:
from sklearn.model_selection import train_test_split
import numpy

In [0]:
# Read the file containing all the story titles (one file name per line)
with open("/content/Team36/wikihow_prepro/titles.txt", "r") as f:
    titles = numpy.array(f.read().split('\n'))

# Hold out 4.5% of the stories for testing, then 4% of the remainder for validation
train0, test = train_test_split(titles, test_size=0.045)
train, valid = train_test_split(train0, test_size=0.040)

# Write the mapping files, keeping only entries that name a .story file
for split_name, split in [("test", test), ("valid", valid), ("train", train)]:
    with open("/content/Team36/wikihow_prepro/mapping/mapping_%s.txt" % split_name, "w") as file:
        for title in split:
            if title.endswith('story'):
                file.write("%s\n" % title)
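
Because the second split is taken from what remains after the first, the 4% validation share applies to 95.5% of the data rather than to all of it. The effective proportions can be worked out as follows (the helper name `effective_split` is hypothetical, used for illustration only):

```python
def effective_split(test_size, valid_size):
    """Return the overall (train, valid, test) fractions of two chained splits."""
    remaining = 1.0 - test_size      # share left after the test split
    valid = remaining * valid_size   # validation share of the full dataset
    train = remaining - valid        # everything else is used for training
    return train, valid, test_size


train_frac, valid_frac, test_frac = effective_split(test_size=0.045, valid_size=0.040)
print("train: %.2f%%, valid: %.2f%%, test: %.2f%%"
      % (100 * train_frac, 100 * valid_frac, 100 * test_frac))
# prints: train: 91.68%, valid: 3.82%, test: 4.50%
```

These figures are approximate: `train_test_split` rounds to whole samples, and the `endswith('story')` filter drops any entries that are not story files.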

### Format to simpler json files

In [0]:
!python /content/Team36/src/preprocess.py -mode format_to_lines -raw_path /content/Team36/wikihow_prepro/json_data -save_path /content/Team36/wikihow_prepro/merged_json_data/cnndm -n_cpus 1 -use_bert_basic_tokenizer false -map_path /content/Team36/wikihow_prepro/mapping

### Format to PyTorch files

In [0]:
!python /content/Team36/src/preprocess.py -mode format_to_bert -raw_path /content/Team36/wikihow_prepro/merged_json_data/merged_json_data -save_path /content/Team36/bert_data  -lower -n_cpus 1 -log_file /content/Team36/logs/preprocess.log

## Model Training and Evaluation

We first train the BertSumExt model from scratch on the WikiHow dataset. We then take the BertSumExt model pre-trained for 18,000 steps on the CNN/DailyMail dataset (provided [here](https://drive.google.com/file/d/1kKWoV0QCbeIuFt85beQgJ4v0lujaXobJ/view)) and train it on WikiHow for 10,000 more steps with each of our 4 transfer learning approaches:

*   Warmstarting 
*   Freezing BERT layers
*   Freezing encoder layers
*   Freezing positional embeddings

Every saved checkpoint was evaluated on the validation dataset, and the 3 best-performing checkpoints were then scored on the test dataset.
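
The selection step can be pictured as follows. This is a minimal sketch, assuming checkpoints are ranked by validation loss (lower is better), which is our understanding of what the `-test_all` option in the validation commands does. The function name and the loss values are made up for illustration.

```python
def select_top_checkpoints(val_losses, k=3):
    """Return the k checkpoint steps with the lowest validation loss."""
    ranked = sorted(val_losses.items(), key=lambda item: item[1])
    return [step for step, _ in ranked[:k]]


# Placeholder losses, NOT results from our runs.
example_losses = {9000: 3.12, 10000: 3.05, 11000: 3.01, 12000: 3.09, 13000: 3.07}
print(select_top_checkpoints(example_losses))  # prints [11000, 10000, 13000]
```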

### Model trained from scratch on WikiHow 

In [0]:
!python /content/Team36/src/train.py -task ext -mode train -bert_data_path /content/Team36/bert_data/cnndm -ext_dropout 0.1 -model_path /content/Team36/models_scratch -lr 2e-3 -visible_gpus 0 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -train_steps 20000 -accum_count 6 -log_file /content/Team36/logs/bertext_log -use_interval true -warmup_steps 10000 -max_pos 512
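
With `-lr 2e-3` and `-warmup_steps 10000`, the learning rate is not constant during training. The sketch below assumes the training code keeps the Noam-style schedule of the original PreSumm implementation, `lr * min(step^-0.5, step * warmup_steps^-1.5)`; if the modified code in `Team36/src` differs, the numbers will too. The function name `noam_lr` is hypothetical.

```python
def noam_lr(step, base_lr=2e-3, warmup_steps=10000):
    """Learning rate at a given step: linear warm-up, then inverse square-root decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


# Learning rate during warm-up, at its peak, and at the end of training
for step in (1000, 10000, 20000):
    print(step, "%.2e" % noam_lr(step))
```

Under this assumption the rate climbs linearly to a peak of about 2e-5 at step 10,000 and then decays slowly, so the effective learning rate is far smaller than the nominal 2e-3.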

In [0]:
!python /content/Team36/src/train.py -task ext -mode validate -test_all -batch_size 3000 -test_batch_size 500 -bert_data_path /content/Team36/bert_data/cnndm -log_file /content/Team36/logs/val_abs_bert_cnndm -model_path /content/Team36/models_scratch -sep_optim true -use_interval true -visible_gpus 0 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path /content/Team36/logs/abs_bert_cnndm

Training this model for 20,000 steps on GPU took 11h, with a checkpoint saved every 1,000 steps. The scores obtained by the top 3 models on the test dataset:

<table class="tg">
  <tr>
    <th class="tg-0pky">Model Step</th>
    <th class="tg-0pky">ROUGE-1</th>
    <th class="tg-0pky">ROUGE-2</th>
    <th class="tg-0pky">ROUGE-L</th>
  </tr>
  <tr>
    <td class="tg-0pky">11,000</td>
    <td class="tg-0pky">29.67</td>
    <td class="tg-0pky">8.20</td>
    <td class="tg-0pky">27.44</td>
  </tr>
  <tr>
    <td class="tg-0pky">10,000</td>
    <td class="tg-0pky">29.60</td>
    <td class="tg-0pky">8.17</td>
    <td class="tg-0pky">27.35</td>
  </tr>
  <tr>
    <td class="tg-0pky">13,000</td>
    <td class="tg-0pky">29.58</td>
    <td class="tg-0pky">8.18</td>
    <td class="tg-0pky">27.41</td>
  </tr>
  <tr>
    <td class="tg-0pky"><b>Mean</b></td>
    <td class="tg-0pky"><b>29.61</b></td>
    <td class="tg-0pky"><b>8.18</b></td>
    <td class="tg-0pky"><b>27.40</b></td>
  </tr>
</table>
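
For readers unfamiliar with the metric, ROUGE-N scores the n-gram overlap between a generated summary and the reference summary, and ROUGE-L is based on their longest common subsequence. The following is a simplified illustration of ROUGE-N only. The scores in this notebook come from the ROUGE-1.5.5 Perl script via pyrouge, which may apply additional processing that this sketch omits, so the two will not agree exactly. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """F1 of the clipped n-gram overlap between two whitespace-tokenised strings."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counts at most as often as in both
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# 3 of 3 candidate unigrams match, 3 of 6 reference unigrams are covered
print(round(rouge_n_f1("preheat the oven", "preheat the oven to 200 degrees", n=1), 4))
# prints 0.6667
```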

### Model with Warmstarting

In [0]:
!python /content/Team36/src/train.py -task ext -mode train -train_from /content/Team36/models/bert_ext.pt -bert_data_path /content/Team36/bert_data/cnndm -ext_dropout 0.1 -model_path /content/Team36/models_warmstart -lr 2e-3 -visible_gpus 0 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -train_steps 28000 -accum_count 6 -log_file /content/Team36/logs/bertext_log -use_interval true -warmup_steps 10000 -max_pos 512

In [0]:
!python /content/Team36/src/train.py -task ext -mode validate -test_all -batch_size 3000 -test_batch_size 500 -bert_data_path /content/Team36/bert_data/cnndm -log_file /content/Team36/logs/val_abs_bert_cnndm -model_path /content/Team36/models_warmstart -sep_optim true -use_interval true -visible_gpus 0 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path /content/Team36/logs/abs_bert_cnndm

Training this model took 5.5h. The scores obtained by the top 3 models on the test dataset:

<table class="tg">
  <tr>
    <th class="tg-0pky">Model Step</th>
    <th class="tg-0pky">ROUGE-1</th>
    <th class="tg-0pky">ROUGE-2</th>
    <th class="tg-0pky">ROUGE-L</th>
  </tr>
  <tr>
    <td class="tg-0pky">26,000</td>
    <td class="tg-0pky">29.75</td>
    <td class="tg-0pky">8.30</td>
    <td class="tg-0pky">27.56</td>
  </tr>
  <tr>
    <td class="tg-0pky">24,000</td>
    <td class="tg-0pky">29.76</td>
    <td class="tg-0pky">8.28</td>
    <td class="tg-0pky">27.53</td>
  </tr>
  <tr>
    <td class="tg-0pky">25,000</td>
    <td class="tg-0pky">29.83</td>
    <td class="tg-0pky">8.33</td>
    <td class="tg-0pky">27.59</td>
  </tr>
  <tr>
    <td class="tg-0pky"><b>Mean</b></td>
    <td class="tg-0pky"><b>29.78</b></td>
    <td class="tg-0pky"><b>8.30</b></td>
    <td class="tg-0pky"><b>27.56</b></td>
  </tr>
</table>

### Model with Freezing BERT layers

In [0]:
!python /content/Team36/src/train.py -task ext -mode train -train_from /content/Team36/models/bert_ext.pt -freeze bert -bert_data_path /content/Team36/bert_data/cnndm -ext_dropout 0.1 -model_path /content/Team36/models_bert -lr 2e-3 -visible_gpus 0 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -train_steps 28000 -accum_count 6 -log_file /content/Team36/logs/bertext_log -use_interval true -warmup_steps 10000 -max_pos 512

In [0]:
!python /content/Team36/src/train.py -task ext -mode validate -test_all -batch_size 3000 -test_batch_size 500 -bert_data_path /content/Team36/bert_data/cnndm -log_file /content/Team36/logs/val_abs_bert_cnndm -model_path /content/Team36/models_bert -sep_optim true -use_interval true -visible_gpus 0 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path /content/Team36/logs/abs_bert_cnndm

Training this model took 2h. The scores obtained by the top 3 models on the test dataset:

<table class="tg">
  <tr>
    <th class="tg-0pky">Model Step</th>
    <th class="tg-0pky">ROUGE-1</th>
    <th class="tg-0pky">ROUGE-2</th>
    <th class="tg-0pky">ROUGE-L</th>
  </tr>
  <tr>
    <td class="tg-0pky">28,000</td>
    <td class="tg-0pky">28.55</td>
    <td class="tg-0pky">7.63</td>
    <td class="tg-0pky">26.43</td>
  </tr>
  <tr>
    <td class="tg-0pky">27,000</td>
    <td class="tg-0pky">28.54</td>
    <td class="tg-0pky">7.62</td>
    <td class="tg-0pky">26.40</td>
  </tr>
  <tr>
    <td class="tg-0pky">26,000</td>
    <td class="tg-0pky">28.48</td>
    <td class="tg-0pky">7.58</td>
    <td class="tg-0pky">26.37</td>
  </tr>
  <tr>
    <td class="tg-0pky"><b>Mean</b></td>
    <td class="tg-0pky"><b>28.52</b></td>
    <td class="tg-0pky"><b>7.61</b></td>
    <td class="tg-0pky"><b>26.40</b></td>
  </tr>
</table>

### Model with Freezing encoder layers

In [0]:
!python /content/Team36/src/train.py -task ext -mode train -train_from /content/Team36/models/bert_ext.pt -freeze encoder -bert_data_path /content/Team36/bert_data/cnndm -ext_dropout 0.1 -model_path /content/Team36/models_encoder -lr 2e-3 -visible_gpus 0 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -train_steps 28000 -accum_count 6 -log_file /content/Team36/logs/bertext_log -use_interval true -warmup_steps 10000 -max_pos 512

In [0]:
!python /content/Team36/src/train.py -task ext -mode validate -test_all -batch_size 3000 -test_batch_size 500 -bert_data_path /content/Team36/bert_data/cnndm -log_file /content/Team36/logs/val_abs_bert_cnndm -model_path /content/Team36/models_encoder -sep_optim true -use_interval true -visible_gpus 0 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path /content/Team36/logs/abs_bert_cnndm

Training this model took 5.5h. The scores obtained by the top 3 models on the test dataset:

<table class="tg">
  <tr>
    <th class="tg-0pky">Model Step</th>
    <th class="tg-0pky">ROUGE-1</th>
    <th class="tg-0pky">ROUGE-2</th>
    <th class="tg-0pky">ROUGE-L</th>
  </tr>
  <tr>
    <td class="tg-0pky">26,000</td>
    <td class="tg-0pky">29.78</td>
    <td class="tg-0pky">8.30</td>
    <td class="tg-0pky">27.58</td>
  </tr>
  <tr>
    <td class="tg-0pky">24,000</td>
    <td class="tg-0pky">29.74</td>
    <td class="tg-0pky">8.28</td>
    <td class="tg-0pky">27.52</td>
  </tr>
  <tr>
    <td class="tg-0pky">25,000</td>
    <td class="tg-0pky">29.78</td>
    <td class="tg-0pky">8.28</td>
    <td class="tg-0pky">27.53</td>
  </tr>
  <tr>
    <td class="tg-0pky"><b>Mean</b></td>
    <td class="tg-0pky"><b>29.77</b></td>
    <td class="tg-0pky"><b>8.29</b></td>
    <td class="tg-0pky"><b>27.54</b></td>
  </tr>
</table>

### Model with Freezing positional embeddings

In [0]:
!python /content/Team36/src/train.py -task ext -mode train -train_from /content/Team36/models/bert_ext.pt -freeze positional -bert_data_path /content/Team36/bert_data/cnndm -ext_dropout 0.1 -model_path /content/Team36/models_position -lr 2e-3 -visible_gpus 0 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -train_steps 28000 -accum_count 6 -log_file /content/Team36/logs/bertext_log -use_interval true -warmup_steps 10000 -max_pos 512

In [0]:
!python /content/Team36/src/train.py -task ext -mode validate -test_all -batch_size 3000 -test_batch_size 500 -bert_data_path /content/Team36/bert_data/cnndm -log_file /content/Team36/logs/val_abs_bert_cnndm -model_path /content/Team36/models_position -sep_optim true -use_interval true -visible_gpus 0 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path /content/Team36/logs/abs_bert_cnndm

Training this model took 5.5h. The scores obtained by the top 3 models on the test dataset:

<table class="tg">
  <tr>
    <th class="tg-0pky">Model Step</th>
    <th class="tg-0pky">ROUGE-1</th>
    <th class="tg-0pky">ROUGE-2</th>
    <th class="tg-0pky">ROUGE-L</th>
  </tr>
  <tr>
    <td class="tg-0pky">26,000</td>
    <td class="tg-0pky">29.76</td>
    <td class="tg-0pky">8.31</td>
    <td class="tg-0pky">27.58</td>
  </tr>
  <tr>
    <td class="tg-0pky">24,000</td>
    <td class="tg-0pky">29.76</td>
    <td class="tg-0pky">8.31</td>
    <td class="tg-0pky">27.54</td>
  </tr>
  <tr>
    <td class="tg-0pky">25,000</td>
    <td class="tg-0pky">29.82</td>
    <td class="tg-0pky">8.32</td>
    <td class="tg-0pky">27.57</td>
  </tr>
  <tr>
    <td class="tg-0pky"><b>Mean</b></td>
    <td class="tg-0pky"><b>29.78</b></td>
    <td class="tg-0pky"><b>8.31</b></td>
    <td class="tg-0pky"><b>27.56</b></td>
  </tr>
</table>
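
To compare the five training regimes side by side, the per-checkpoint test scores can be averaged and set against the from-scratch baseline. The numbers below are copied from the tables in this section; the dictionary keys and the helper name `mean_scores` are ours, chosen for illustration.

```python
# Test-set scores (ROUGE-1, ROUGE-2, ROUGE-L) of the top 3 checkpoints per model,
# copied from the tables in this section.
scores = {
    "scratch":           [(29.67, 8.20, 27.44), (29.60, 8.17, 27.35), (29.58, 8.18, 27.41)],
    "warmstart":         [(29.75, 8.30, 27.56), (29.76, 8.28, 27.53), (29.83, 8.33, 27.59)],
    "freeze_bert":       [(28.55, 7.63, 26.43), (28.54, 7.62, 26.40), (28.48, 7.58, 26.37)],
    "freeze_encoder":    [(29.78, 8.30, 27.58), (29.74, 8.28, 27.52), (29.78, 8.28, 27.53)],
    "freeze_positional": [(29.76, 8.31, 27.58), (29.76, 8.31, 27.54), (29.82, 8.32, 27.57)],
}


def mean_scores(rows):
    """Average each ROUGE column over the checkpoints."""
    return tuple(sum(col) / len(col) for col in zip(*rows))


means = {name: mean_scores(rows) for name, rows in scores.items()}
baseline = means["scratch"]
for name, m in means.items():
    # Signed difference to the from-scratch model, per metric
    deltas = ", ".join("%+.2f" % (x - b) for x, b in zip(m, baseline))
    print("%-18s R1=%.2f R2=%.2f RL=%.2f  (vs scratch: %s)" % ((name,) + m + (deltas,)))
```

On these figures, warmstarting and the two partial-freezing variants improve on the from-scratch model by roughly 0.1 to 0.2 ROUGE points, while freezing all BERT layers lowers ROUGE-1 and ROUGE-L by about one point and ROUGE-2 by about 0.6.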

## Test pre-trained BertSum models on WikiHow data (out-of-domain)

### BertSumExt

In [0]:
!python /content/Team36/src/train.py -task ext -mode test -test_from /content/Team36/models/bert_ext.pt -batch_size 3000 -test_batch_size 500 -bert_data_path /content/Team36/bert_data/cnndm -log_file /content/Team36/logs/val_abs_bert_cnndm -sep_optim true -use_interval true -visible_gpus 0 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path /content/Team36/results/abs_bert_cnndm

### BertSumExtAbs

The pre-trained BertSumExtAbs model can be obtained from [here](https://drive.google.com/file/d/1-IKVCtc4Q-BdZpjXc4s70_fRsWnjtYLr/view).

In [0]:
!python /content/Team36/src/train.py -task abs -mode test -test_from /content/Team36/models/bert_ext_abs.pt -batch_size 3000 -test_batch_size 500 -bert_data_path /content/Team36/bert_data/cnndm -log_file /content/Team36/logs/val_abs_bert_cnndm -sep_optim true -use_interval true -visible_gpus 0 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path /content/Team36/results/abs_bert_cnndm