(COLING'18) The source code for the paper "Structure-Infused Copy Mechanisms for Abstractive Summarization".
Structure-Infused Copy Mechanisms for Abstractive Summarization

We provide the source code for the paper "Structure-Infused Copy Mechanisms for Abstractive Summarization", accepted at COLING'18. If you find the code useful, please cite the following paper.

@inproceedings{song-zhao-liu:2018,
 Author = {Kaiqiang Song and Lin Zhao and Fei Liu},
 Title = {Structure-Infused Copy Mechanisms for Abstractive Summarization},
 Booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING)},
 Year = {2018}}

Goal

  • Our system seeks to re-write a lengthy sentence, often the 1st sentence of a news article, to a concise, title-like summary. The average input and output lengths are 31 words and 8 words, respectively.

  • The code takes as input a text file with one sentence per line. As output, it generates a text file in the same directory, named after the input file with the suffix ".result.summary", in which each source sentence is replaced by a title-like summary.

  • Example input and output are shown below.

    An estimated 4,645 people died in Hurricane Maria and its aftermath in Puerto Rico , according to an academic report published Tuesday in a prestigious medical journal .

    hurricane maria kills 4,645 in puerto rico .
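The input/output convention above can be sketched in a few lines of Python (illustrative only; summarize_line is a hypothetical stand-in for the trained model, not a function in this repo):

```python
def output_path(input_path):
    # Each input file "X" produces a summary file "X.result.summary"
    # in the same directory as the input.
    return input_path + ".result.summary"

def summarize_file(input_path, summarize_line):
    # summarize_line is a hypothetical stand-in for the model's decoder.
    with open(input_path) as fin, open(output_path(input_path), "w") as fout:
        for sentence in fin:                  # one source sentence per line
            fout.write(summarize_line(sentence.strip()) + "\n")
```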

A Quick Demo

Demo of Sentence Summarizer

Dependencies

The code is written in Python (v2.7) and Theano (v1.0.1). We suggest the following environment:

To install Python (v2.7), run the command:

$ wget https://repo.continuum.io/archive/Anaconda2-5.0.1-Linux-x86_64.sh
$ bash Anaconda2-5.0.1-Linux-x86_64.sh
$ source ~/.bashrc

To install Theano and its dependencies, run the commands below (you may want to add export MKL_THREADING_LAYER=GNU to "~/.bashrc" for future use).

$ conda install numpy scipy mkl nose sphinx pydot-ng
$ conda install theano pygpu
$ export MKL_THREADING_LAYER=GNU

To download the Stanford CoreNLP toolkit and run it as a server, execute the commands below. The CoreNLP toolkit helps derive structure information (part-of-speech tags, dependency parse trees) from source sentences.

$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
$ unzip stanford-corenlp-full-2018-02-27.zip
$ cd stanford-corenlp-full-2018-02-27
$ nohup java -mx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 &
$ cd -
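The server answers with JSON in CoreNLP's documented output format: one entry per sentence, each token carrying its part-of-speech tag, and a "basicDependencies" list holding the dependency arcs. A sketch of building the server's query parameters and pulling out the structural features (the helper names and the sample response are illustrative, not part of this repo):

```python
import json

def corenlp_params(annotators="tokenize,ssplit,pos,depparse"):
    # Query-string "properties" expected by the CoreNLP server.
    return {"properties": json.dumps({"annotators": annotators,
                                      "outputFormat": "json"})}

def extract_structure(response):
    # Returns (word, POS) pairs and (relation, governor, dependent) arcs
    # for the first sentence of a CoreNLP JSON response.
    sent = response["sentences"][0]
    pos_tags = [(t["word"], t["pos"]) for t in sent["tokens"]]
    arcs = [(d["dep"], d["governor"], d["dependent"])
            for d in sent["basicDependencies"]]
    return pos_tags, arcs
```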

To install Pyrouge, run the command below. Pyrouge is a Python wrapper for the ROUGE toolkit, an automatic metric used for summary evaluation.

$ pip install pyrouge
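pyrouge drives the Perl ROUGE-1.5.5 script (bundled in this repo under ROUGE-RELEASE-1.5.5). As a rough intuition for what the metric measures, ROUGE-1 recall is the clipped fraction of reference unigrams that also appear in the system summary; a minimal, illustrative reimplementation (not a substitute for pyrouge):

```python
from collections import Counter

def rouge1_recall(system_summary, reference_summary):
    # Clipped unigram overlap divided by reference length.
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(reference_summary.lower().split())
    overlap = sum(min(count, sys_counts[word])
                  for word, count in ref_counts.items())
    return overlap / float(sum(ref_counts.values()))
```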

I Want to Generate Summaries..

  1. Clone this repo. Download this TAR file (model_coling18.tar.gz) containing vocabulary files and pretrained models. Move the TAR file to folder "struct_infused_summ" and uncompress.

    $ git clone https://github.com/KaiQiangSong/struct_infused_summ/
    $ mv model_coling18.tar.gz struct_infused_summ
    $ cd struct_infused_summ
    $ tar -xvzf model_coling18.tar.gz
    $ rm model_coling18.tar.gz
    
  2. Extract structural features from a list of input files. The file ./test_data/test_filelist.txt contains absolute (or relative) paths to individual files (test_000.txt and test_001.txt are toy files). Each file contains a number of source sentences, one sentence per line. Then, execute the command:

    $ python toolkit.py -f ./test_data/test_filelist.txt
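    The filelist itself is plain text with one path per line; a sketch of how such a list can be read (illustrative; toolkit.py's actual bookkeeping may differ):

```python
def read_filelist(filelist_path):
    # One input-file path per line; blank lines are skipped.
    with open(filelist_path) as f:
        return [line.strip() for line in f if line.strip()]
```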
    
  3. Generate the model configuration file in the ./settings/ folder.

    $ python genTestDataSettings.py ./test_data/test_filelist.txt ./settings/my_test_settings
    

    After that, you need to modify the "dataset" field of the options_loader.py file to point it to the new settings file: 'dataset':'settings/my_test_settings.json'.

  4. Run the testing script. The summary files, located in the same directory as the input, end with ".result.summary".

    $ python generate.py
    

    struct_edge is the default model. It corresponds to the "2way+relation" architecture described in the paper. To enable the "2way+word" architecture instead, modify the file generate.py (Lines 152-153), globally replacing struct_edge with struct_node.

I Want to Train the Model..

  1. Create a folder to save the model files. ./model/struct_node is for the "2way+word" architecture and ./model/struct_edge for the "2way+relation" architecture.

    $ mkdir -p ./model/struct_node ./model/struct_edge
    
  2. Extract structural features from the input files. source_file.txt and summary_file.txt in the ./train_data/ folder are toy files containing source and summary sentences, one sentence per line. Often, tens of thousands of (source, summary) pairs are required for training.

    $ python toolkit.py ./train_data/source_file.txt
    $ python toolkit.py ./train_data/summary_file.txt
    

    Rename the files using the commands below. The .Ndocument, .dfeature, and .Nsummary files contain, respectively, the source sentences, the structural features of the source sentences, and the summary sentences.

    $ cd ./train_data/
    $ mv source_file.txt.Ndocument train.Ndocument
    $ mv source_file.txt.feature train.dfeature
    $ mv summary_file.txt.Ndocument train.Nsummary
    $ cd -
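    Line i of train.Ndocument, train.dfeature, and train.Nsummary are expected to describe the same training pair, so the three files should have identical line counts. A small sanity check (illustrative; not part of the repo):

```python
def check_aligned(*paths):
    # Raise if the files do not have identical line counts.
    counts = []
    for path in paths:
        with open(path) as f:
            counts.append(sum(1 for _ in f))
    if len(set(counts)) != 1:
        raise ValueError("line counts differ: %s" % dict(zip(paths, counts)))
    return counts[0]
```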
    
  3. Repeat the previous step for the validation data, which are used for early stopping. The ./valid_data/ folder contains toy files.

    $ python toolkit.py ./valid_data/source_file.txt
    $ python toolkit.py ./valid_data/summary_file.txt
    $ cd ./valid_data/
    $ mv source_file.txt.Ndocument valid.Ndocument
    $ mv source_file.txt.feature valid.dfeature
    $ mv summary_file.txt.Ndocument valid.Nsummary
    $ cd -
    
  4. Generate the model configuration file in the ./settings/ folder.

    $ python genTrainDataSettings.py ./train_data/train ./valid_data/valid ./settings/my_train_settings
    

    After that, you need to modify the "dataset" field of the options_loader.py file to point to the new settings file: 'dataset':'settings/my_train_settings.json'.

  5. Download the GloVe embeddings and uncompress.

    $ wget http://nlp.stanford.edu/data/glove.6B.zip
    $ unzip glove.6B.zip
    $ rm glove.6B.zip
    

    Modify the "vocab_emb_init_path" field in the file ./settings/vocabulary.json from "vocab_emb_init_path": "../../vocab/glove.6B.100d.txt" to "vocab_emb_init_path": "glove.6B.100d.txt".

  6. Create a vocabulary file from ./train_data/train.Ndocument and ./train_data/train.Nsummary. Words appearing fewer than 5 times are excluded.

    $ python get_vocab.py my_vocab
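    The frequency cutoff amounts to the following (a sketch; get_vocab.py's actual handling of special tokens and word IDs may differ):

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    # Keep only words appearing at least min_count times.
    counts = Counter(word for sent in sentences for word in sent.split())
    return {word for word, count in counts.items() if count >= min_count}
```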
    
  7. Modify the path to the vocabulary file in train.py from Vocab_Giga = loadFromPKL('../../dataset/gigaword_eng_5/giga_new.Vocab') to Vocab_Giga = loadFromPKL('my_vocab.Vocab').

  8. To train the model, run the command below.

    $ THEANO_FLAGS='floatX=float32' python train.py
    

    The training program stops when it reaches the maximum number of epochs (30 epochs). This number can be modified by changing the "max_epochs" field in ./settings/training.json. The model files are saved in the ./model/ folder.

    "2way+relation" is the default architecture. It uses the settings file ./settings/network_struct_edge.json. You can change the 'network' field of options_loader.py from 'settings/network_struct_edge.json' to 'settings/network_struct_node.json' to train the "2way+word" architecture.

  9. (Optional) train the model with early stopping.

    You might want to change the parameters used for early stopping. These are specified in ./settings/earlyStop.json and explained below. If early stopping is enabled, the best model files, model_best.npz and options_best.json, will be saved in the ./model/struct_edge/ folder.

{
	"sample":true, # enable model checkpoint
	"sampleMin":10000, # the first checkpoint occurs after 10K batches
	"sampleFreq":2000, # there is a checkpoint every 2K batches afterwards
	"sample_path":"./sample/",
	"earlyStop":true, # enable early stopping 
	"earlyStop_method":"valid_err", # based on validation loss
	"earlyStop_bound":62000, # the training program stops if the valid loss has no improvement after 62K batches
	"rate_bound":24000 # halve the learning rate if the valid loss has no improvement after 24K batches
}

62K batches (used for earlyStop_bound) correspond to about 1 epoch for our dataset. 24K batches (used for rate_bound) is slightly less than half an epoch.
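Putting the knobs together, the checkpoint logic can be sketched as follows (batch-counter bookkeeping only; names and structure are illustrative, not the repo's implementation):

```python
def early_stop_step(state, batch, valid_err, opts):
    # state: {"best", "best_batch", "lr", "last_lr_cut"}; called at each checkpoint.
    if valid_err < state["best"]:
        state["best"], state["best_batch"] = valid_err, batch
    elif batch - state["best_batch"] >= opts["earlyStop_bound"]:
        return "stop"                 # no improvement for earlyStop_bound batches
    if batch - max(state["best_batch"], state["last_lr_cut"]) >= opts["rate_bound"]:
        state["lr"] /= 2.0            # halve LR after rate_bound stale batches
        state["last_lr_cut"] = batch
    return "continue"
```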

I Want to Apply the Coverage Mechanism in a 2nd Training Stage..

  1. You will switch to the file train_2.py. Modify the path to the vocabulary file in train_2.py from Vocab_Giga = loadFromPKL('../../dataset/gigaword_eng_5/giga_new.Vocab') to Vocab_Giga = loadFromPKL('my_vocab.Vocab') to point it to your vocabulary file.

  2. Run the command below to perform the 2nd-stage training. Two files ./model/struct_edge/model_check2_best.npz and ./model/struct_edge/options_check2_best.json will be generated, containing the best model parameters and system configurations for the "2way+relation" architecture.

    $ python train_2.py
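
    The coverage mechanism, as commonly formulated for attention-based summarizers (this is a generic sketch in the spirit of See et al., 2017, not necessarily this paper's exact formulation), accumulates past attention over source positions and penalizes the decoder for re-attending to them:

```python
def coverage_step(coverage, attention):
    # coverage: running sum of past attention per source position.
    # The coverage loss sum_i min(a_i, c_i) grows when the decoder
    # re-attends to positions it has already covered.
    loss = sum(min(a, c) for a, c in zip(attention, coverage))
    new_coverage = [a + c for a, c in zip(attention, coverage)]
    return loss, new_coverage
```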
    

License

This project is licensed under the BSD License - see the LICENSE.md file for details.

Acknowledgments

We gratefully acknowledge the work of Kelvin Xu, whose code in part inspired this project.