(COLING'18) The source code for the paper "Structure-Infused Copy Mechanisms for Abstractive Summarization".

Structure-Infused Copy Mechanisms for Abstractive Summarization

We provide the source code for the paper "Structure-Infused Copy Mechanisms for Abstractive Summarization", accepted at COLING'18. If you find the code useful, please cite the following paper.

@inproceedings{song-zhao-liu:2018,
 Author = {Kaiqiang Song and Lin Zhao and Fei Liu},
 Title = {Structure-Infused Copy Mechanisms for Abstractive Summarization},
 Booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING)},
 Year = {2018}}

Goal

  • Our system seeks to rewrite a lengthy sentence, often the first sentence of a news article, into a concise, title-like summary. The average input and output lengths are 31 words and 8 words, respectively.

  • The code takes as input a text file with one sentence per line. It generates an output text file in the same directory as the input, ending with ".result.summary", in which each source sentence is replaced by a title-like summary.

  • Example input and output are shown below.

    An estimated 4,645 people died in Hurricane Maria and its aftermath in Puerto Rico , according to an academic report published Tuesday in a prestigious medical journal .

    hurricane maria kills 4,645 in puerto rico .

A Quick Demo

Demo of Sentence Summarizer

Dependencies

The code is written in Python (v2.7) and Theano (v1.0.1). We suggest the following environment:

To install Python (v2.7), run the command:

$ wget https://repo.continuum.io/archive/Anaconda2-5.0.1-Linux-x86_64.sh
$ bash Anaconda2-5.0.1-Linux-x86_64.sh
$ source ~/.bashrc

To install Theano and its dependencies, run the commands below (you may want to add export MKL_THREADING_LAYER=GNU to "~/.bashrc" for future use).

$ conda install numpy scipy mkl nose sphinx pydot-ng
$ conda install theano pygpu
$ export MKL_THREADING_LAYER=GNU
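
As a quick optional sanity check that Theano is installed, you can import it from Python and print the version (the expected value below assumes the versions suggested above):

import theano
print(theano.__version__)  # expect 1.0.1 with the suggested setup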

To download the Stanford CoreNLP toolkit and start it as a server, run the commands below. The CoreNLP toolkit helps derive structure information (part-of-speech tags, dependency parse trees) from source sentences.

$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
$ unzip stanford-corenlp-full-2018-02-27.zip
$ cd stanford-corenlp-full-2018-02-27
$ nohup java -mx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 &
$ cd -
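
Once the server is up, structural annotations can be requested over HTTP. The repository ships its own PyCoreNLP wrapper; as a rough sketch of the kind of request involved (using the standalone pycorenlp package, installed with pip install pycorenlp, rather than the repository's wrapper):

# Sketch only: query the local CoreNLP server for the POS tags and dependency
# arcs that serve as structural features (not the repository's exact interface).
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
ann = nlp.annotate(
    'An estimated 4,645 people died in Hurricane Maria .',
    properties={'annotators': 'tokenize,ssplit,pos,depparse', 'outputFormat': 'json'})

for sent in ann['sentences']:
    print([(tok['word'], tok['pos']) for tok in sent['tokens']])
    print([(d['dep'], d['governorGloss'], d['dependentGloss'])
           for d in sent['basicDependencies']])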

To install Pyrouge, run the command below. Pyrouge is a Python wrapper for the ROUGE toolkit, an automatic metric used for summary evaluation.

$ pip install pyrouge
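
Pyrouge needs to know where the ROUGE-1.5.5 toolkit lives (a copy is included in this repository under ROUGE-RELEASE-1.5.5). A minimal evaluation sketch, with hypothetical directories and file-name patterns, might look like this:

# Sketch only: score system summaries against reference summaries with pyrouge.
# The directories and file-name patterns below are hypothetical examples.
from pyrouge import Rouge155

r = Rouge155('./ROUGE-RELEASE-1.5.5')      # path to the ROUGE-1.5.5 toolkit
r.system_dir = './eval/system'             # one system summary per file, e.g. sent.001.txt
r.model_dir = './eval/reference'           # one reference summary per file, e.g. sent.A.001.txt
r.system_filename_pattern = r'sent.(\d+).txt'
r.model_filename_pattern = 'sent.[A-Z].#ID#.txt'

output = r.convert_and_evaluate()
print(output)                              # full ROUGE report
print(r.output_to_dict(output))            # the same scores as a Python dict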

I Want to Generate Summaries..

  1. Clone this repo. Download this TAR file (model_coling18.tar.gz) containing vocabulary files and pretrained models. Move the TAR file to folder "struct_infused_summ" and uncompress.

    $ git clone https://github.com/KaiQiangSong/struct_infused_summ/
    $ mv model_coling18.tar.gz struct_infused_summ
    $ cd struct_infused_summ
    $ tar -xvzf model_coling18.tar.gz
    $ rm model_coling18.tar.gz
    
  2. Extract structural features from a list of input files. The file ./test_data/test_filelist.txt contains absolute (or relative) paths to individual files (test_000.txt and test_001.txt are toy files). Each file contains a number of source sentences, one sentence per line. Then, execute the command:

    $ python toolkit.py -f ./test_data/test_filelist.txt
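
    For reference, a filelist pointing at the two toy files would simply list their paths, one per line, e.g.:

    ./test_data/test_000.txt
    ./test_data/test_001.txt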
    
  3. Generate the model configuration file in the ./settings/ folder.

    $ python genTestDataSettings.py ./test_data/test_filelist.txt ./settings/my_test_settings
    

    After that, you need to modify the "dataset" field of the options_loader.py file to point it to the new settings file: 'dataset':'settings/my_test_settings.json'.

  4. Run the testing script. The summary files, located in the same directory as the input files, end with ".result.summary".

    $ python generate.py
    

    struct_edge is the default model. It corresponds to the "2way+relation" architecture described in the paper. You can modify the file generate.py (Lines 152-153) by globally replacing struct_edge with struct_node to enable the "2way+word" architecture.
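
    If you prefer not to edit the file by hand, one (hypothetical) way to do the global replacement from Python:

    # Switch generate.py from the "2way+relation" to the "2way+word" architecture
    # by replacing the model name globally. Keep a backup copy of generate.py first.
    text = open('generate.py').read()
    open('generate.py', 'w').write(text.replace('struct_edge', 'struct_node'))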

I Want to Train the Model..

  1. Create a folder to save the model files. ./model/struct_node is for the "2way+word" architecture and ./model/struct_edge for the "2way+relation" architecture.

    $ mkdir -p ./model/struct_node ./model/struct_edge
    
  2. Extract structural features from the input files. source_file.txt and summary_file.txt in the ./train_data/ folder are toy files containing source and summary sentences, one sentence per line. Often, tens of thousands of (source, summary) pairs are required for training.

    $ python toolkit.py ./train_data/source_file.txt
    $ python toolkit.py ./train_data/summary_file.txt
    

    Adjust the file names using the commands below. The .Ndocument, .dfeature, and .Nsummary files contain the source sentences, the structural features of the source sentences, and the summary sentences, respectively.

    $ cd ./train_data/
    $ mv source_file.txt.Ndocument train.Ndocument
    $ mv source_file.txt.feature train.dfeature
    $ mv summary_file.txt.Ndocument train.Nsummary
    $ cd -
    
  3. Repeat the previous step for the validation data, which is used for early stopping. The ./valid_data folder contains toy files.

    $ python toolkit.py ./valid_data/source_file.txt
    $ python toolkit.py ./valid_data/summary_file.txt
    $ cd ./valid_data/
    $ mv source_file.txt.Ndocument valid.Ndocument
    $ mv source_file.txt.feature valid.dfeature
    $ mv summary_file.txt.Ndocument valid.Nsummary
    $ cd -
    
  4. Generate the model configuration file in the ./settings/ folder.

    $ python genTrainDataSettings.py ./train_data/train ./valid_data/valid ./settings/my_train_settings
    

    After that, you need to modify the "dataset" field of the options_loader.py file to point to the new settings file: 'dataset':'settings/my_train_settings.json'.

  5. Download the GloVe embeddings and uncompress.

    $ wget http://nlp.stanford.edu/data/glove.6B.zip
    $ unzip glove.6B.zip
    $ rm glove.6B.zip
    

    Modify the "vocab_emb_init_path" field in the file ./settings/vocabulary.json from "vocab_emb_init_path": "../../vocab/glove.6B.100d.txt" to "vocab_emb_init_path": "glove.6B.100d.txt".

  6. Create a vocabulary file from ./train_data/train.Ndocument and ./train_data/train.Nsummary. Words appearing fewer than 5 times are excluded.

    $ python get_vocab.py my_vocab
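
    The repository's get_vocab.py takes care of this step; purely to illustrate the frequency threshold, counting words over both training files and keeping those seen at least 5 times could look like:

    # Illustration only (not the repository's get_vocab.py): count word frequencies
    # over the training source and summary files and keep words seen at least 5 times.
    from collections import Counter

    counts = Counter()
    for path in ['./train_data/train.Ndocument', './train_data/train.Nsummary']:
        with open(path) as f:
            for line in f:
                counts.update(line.strip().split())

    vocab = [w for w, c in counts.items() if c >= 5]
    print(len(vocab))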
    
  7. Modify the path to the vocabulary file in train.py from Vocab_Giga = loadFromPKL('../../dataset/gigaword_eng_5/giga_new.Vocab') to Vocab_Giga = loadFromPKL('my_vocab.Vocab').

  8. To train the model, run the command below.

    $ THEANO_FLAGS='floatX=float32' python train.py
    

    The training program stops when it reaches the maximum number of epochs (30 epochs). This number can be modified by changing the "max_epochs" field in ./settings/training.json. The model files are saved in the ./model/ folder.

    "2way+relation" is the default architecture. It uses the settings file ./settings/network_struct_edge.json. You can modify the 'network' field of the options_loader.py from 'settings/network_struct_edge.json' to './settings/network_struct_node.json' to train the "2way+word" architecture.

  9. (Optional) Train the model with early stopping.

    You might want to change the parameters used for early stopping. These are specified in ./settings/earlyStop.json and explained below. If early stopping is enabled, the best model files, model_best.npz and options_best.json, will be saved in the ./model/struct_edge/ folder.

{
	"sample":true, # enable model checkpoint
	"sampleMin":10000, # the first checkpoint occurs after 10K batches
	"sampleFreq":2000, # there is a checkpoint every 2K batches afterwards
	"sample_path":"./sample/",
	"earlyStop":true, # enable early stopping 
	"earlyStop_method":"valid_err", # based on validation loss
	"earlyStop_bound":62000, # the training program stops if the valid loss has no improvement after 62K batches
	"rate_bound":24000 # halve the learning rate if the valid loss has no improvement after 2K batches
}

62K batches (used for earlyStop_bound) correspond to about 1 epoch for our dataset. 24K batches (used for rate_bound) is slightly less than half of an epoch.
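
To make the interaction of these settings concrete, here is a small sketch of the checkpoint logic they describe (a paraphrase of the settings above, not the repository's training loop):

# Sketch only: how the early-stopping settings interact. A "checkpoint" runs
# every sampleFreq batches once sampleMin batches have passed; "stale" counts
# batches since the best validation loss so far.
best_err = float('inf')
best_batch = 0

def checkpoint(batch, valid_err, lr, opts):
    global best_err, best_batch
    if valid_err < best_err:                       # new best: model_best.npz is saved here
        best_err, best_batch = valid_err, batch
        return lr, False
    stale = batch - best_batch
    if stale >= opts['rate_bound']:                # no improvement for 24K batches
        lr = lr * 0.5                              # -> halve the learning rate
    stop = stale >= opts['earlyStop_bound']        # no improvement for 62K batches -> stop
    return lr, stop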

I Want to Apply the Coverage Mechanism in a 2nd Training Stage..

  1. You will switch to the file train_2.py. Modify the path to the vocabulary file in train_2.py from Vocab_Giga = loadFromPKL('../../dataset/gigaword_eng_5/giga_new.Vocab') to Vocab_Giga = loadFromPKL('my_vocab.Vocab') to point it to your vocabulary file.

  2. Run the command below to perform the 2nd-stage training. Two files, ./model/struct_edge/model_check2_best.npz and ./model/struct_edge/options_check2_best.json, will be generated, containing the best model parameters and system configurations for the "2way+relation" architecture.

    $ python train_2.py
    

License

This project is licensed under the BSD License - see the LICENSE.md file for details.

Acknowledgments

We gratefully acknowledge the work of Kelvin Xu, whose code in part inspired this project.