(COLING'18) The source code for the paper "Structure-Infused Copy Mechanisms for Abstractive Summarization".

Structure-Infused Copy Mechanisms for Abstractive Summarization

We provide the source code for the paper "Structure-Infused Copy Mechanisms for Abstractive Summarization", accepted at COLING'18. If you find the code useful, please cite the following paper.

@inproceedings{song-zhao-liu:2018,
 Author = {Kaiqiang Song and Lin Zhao and Fei Liu},
 Title = {Structure-Infused Copy Mechanisms for Abstractive Summarization},
 Booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING)},
 Year = {2018}}

Goal

  • Our system seeks to rewrite a lengthy sentence, often the first sentence of a news article, into a concise, title-like summary. The average input and output lengths are 31 words and 8 words, respectively.

  • The code takes as input a text file with one sentence per line. It generates an output text file in the same directory as the input, ending with ".result.summary", in which each source sentence is replaced by a title-like summary.

  • Example input and output are shown below.

    An estimated 4,645 people died in Hurricane Maria and its aftermath in Puerto Rico , according to an academic report published Tuesday in a prestigious medical journal .

    hurricane maria kills 4,645 in puerto rico .

A Quick Demo

Demo of Sentence Summarizer

Dependencies

The code is written in Python (v2.7) and Theano (v1.0.1). We suggest the following environment:

To install Python (v2.7), run the command:

$ wget https://repo.continuum.io/archive/Anaconda2-5.0.1-Linux-x86_64.sh
$ bash Anaconda2-5.0.1-Linux-x86_64.sh
$ source ~/.bashrc

To install Theano and its dependencies, run the commands below (you may want to add export MKL_THREADING_LAYER=GNU to "~/.bashrc" for future use).

$ conda install numpy scipy mkl nose sphinx pydot-ng
$ conda install theano pygpu
$ export MKL_THREADING_LAYER=GNU
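
As a quick optional sanity check that Theano is installed, you can import it from Python and print the version (the expected value below assumes the versions suggested above):

import theano
print(theano.__version__)  # expect 1.0.1 with the suggested setup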

To download the Stanford CoreNLP toolkit and start it as a server, run the commands below. The CoreNLP toolkit helps derive structure information (part-of-speech tags, dependency parse trees) from source sentences.

$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
$ unzip stanford-corenlp-full-2018-02-27.zip
$ cd stanford-corenlp-full-2018-02-27
$ nohup java -mx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 &
$ cd -
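
Once the server is up, structural annotations can be requested over HTTP. The repository ships its own PyCoreNLP wrapper; as a rough sketch of the kind of request involved (using the standalone pycorenlp package, installed with pip install pycorenlp, rather than the repository's wrapper):

# Sketch only: query the local CoreNLP server for the POS tags and dependency
# arcs that serve as structural features (not the repository's exact interface).
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
ann = nlp.annotate(
    'An estimated 4,645 people died in Hurricane Maria .',
    properties={'annotators': 'tokenize,ssplit,pos,depparse', 'outputFormat': 'json'})

for sent in ann['sentences']:
    print([(tok['word'], tok['pos']) for tok in sent['tokens']])
    print([(d['dep'], d['governorGloss'], d['dependentGloss'])
           for d in sent['basicDependencies']])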

To install Pyrouge, run the command below. Pyrouge is a Python wrapper for the ROUGE toolkit, an automatic metric used for summary evaluation.

$ pip install pyrouge
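
Pyrouge needs to know where the ROUGE-1.5.5 toolkit lives (a copy is included in this repository under ROUGE-RELEASE-1.5.5). A minimal evaluation sketch, with hypothetical directories and file-name patterns, might look like this:

# Sketch only: score system summaries against reference summaries with pyrouge.
# The directories and file-name patterns below are hypothetical examples.
from pyrouge import Rouge155

r = Rouge155('./ROUGE-RELEASE-1.5.5')      # path to the ROUGE-1.5.5 toolkit
r.system_dir = './eval/system'             # one system summary per file, e.g. sent.001.txt
r.model_dir = './eval/reference'           # one reference summary per file, e.g. sent.A.001.txt
r.system_filename_pattern = r'sent.(\d+).txt'
r.model_filename_pattern = 'sent.[A-Z].#ID#.txt'

output = r.convert_and_evaluate()
print(output)                              # full ROUGE report
print(r.output_to_dict(output))            # the same scores as a Python dict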

I Want to Generate Summaries..

  1. Clone this repo. Download this TAR file (model_coling18.tar.gz) containing vocabulary files and pretrained models. Move the TAR file to folder "struct_infused_summ" and uncompress.

    $ git clone https://github.com/KaiQiangSong/struct_infused_summ/
    $ mv model_coling18.tar.gz struct_infused_summ
    $ cd struct_infused_summ
    $ tar -xvzf model_coling18.tar.gz
    $ rm model_coling18.tar.gz
    
  2. Extract structural features from a list of input files. The file ./test_data/test_filelist.txt contains absolute (or relative) paths to individual files (test_000.txt and test_001.txt are toy files). Each file contains a number of source sentences, one sentence per line. Then, execute the command:

    $ python toolkit.py -f ./test_data/test_filelist.txt
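
    For reference, a filelist pointing at the two toy files would simply list their paths, one per line, e.g.:

    ./test_data/test_000.txt
    ./test_data/test_001.txt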
    
  3. Generate the model configuration file in the ./settings/ folder.

    $ python genTestDataSettings.py ./test_data/test_filelist.txt ./settings/my_test_settings
    

    After that, you need to modify the "dataset" field of the options_loader.py file to point it to the new settings file: 'dataset':'settings/my_test_settings.json'.

  4. Run the testing script. The summary files, located in the same directory as the input files, end with ".result.summary".

    $ python generate.py
    

    struct_edge is the default model. It corresponds to the "2way+relation" architecture described in the paper. You can modify the file generate.py (Lines 152-153) by globally replacing struct_edge with struct_node to enable the "2way+word" architecture.
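
    If you prefer not to edit the file by hand, one (hypothetical) way to do the global replacement from Python:

    # Switch generate.py from the "2way+relation" to the "2way+word" architecture
    # by replacing the model name globally. Keep a backup copy of generate.py first.
    text = open('generate.py').read()
    open('generate.py', 'w').write(text.replace('struct_edge', 'struct_node'))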

I Want to Train the Model..

  1. Create a folder to save the model files. ./model/struct_node is for the "2way+word" architecture and ./model/struct_edge for the "2way+relation" architecture.

    $ mkdir -p ./model/struct_node ./model/struct_edge
    
  2. Extract structural features from the input files. source_file.txt and summary_file.txt in the ./train_data/ folder are toy files containing source and summary sentences, one sentence per line. Often, tens of thousands of (source, summary) pairs are required for training.

    $ python toolkit.py ./train_data/source_file.txt
    $ python toolkit.py ./train_data/summary_file.txt
    

    Adjust the file names using the commands below. The .Ndocument, .dfeature, and .Nsummary files contain the source sentences, the structural features of the source sentences, and the summary sentences, respectively.

    $ cd ./train_data/
    $ mv source_file.txt.Ndocument train.Ndocument
    $ mv source_file.txt.feature train.dfeature
    $ mv summary_file.txt.Ndocument train.Nsummary
    $ cd -
    
  3. Repeat the previous step for the validation data, which is used for early stopping. The ./valid_data folder contains toy files.

    $ python toolkit.py ./valid_data/source_file.txt
    $ python toolkit.py ./valid_data/summary_file.txt
    $ cd ./valid_data/
    $ mv source_file.txt.Ndocument valid.Ndocument
    $ mv source_file.txt.feature valid.dfeature
    $ mv summary_file.txt.Ndocument valid.Nsummary
    $ cd -
    
  4. Generate the model configuration file in the ./settings/ folder.

    $ python genTrainDataSettings.py ./train_data/train ./valid_data/valid ./settings/my_train_settings
    

    After that, you need to modify the "dataset" field of the options_loader.py file to point to the new settings file: 'dataset':'settings/my_train_settings.json'.

  5. Download the GloVe embeddings and uncompress.

    $ wget http://nlp.stanford.edu/data/glove.6B.zip
    $ unzip glove.6B.zip
    $ rm glove.6B.zip
    

    Modify the "vocab_emb_init_path" field in the file ./settings/vocabulary.json from "vocab_emb_init_path": "../../vocab/glove.6B.100d.txt" to "vocab_emb_init_path": "glove.6B.100d.txt".

  6. Create a vocabulary file from ./train_data/train.Ndocument and ./train_data/train.Nsummary. Words appearing fewer than 5 times are excluded.

    $ python get_vocab.py my_vocab
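
    The repository's get_vocab.py takes care of this step; purely to illustrate the frequency threshold, counting words over both training files and keeping those seen at least 5 times could look like:

    # Illustration only (not the repository's get_vocab.py): count word frequencies
    # over the training source and summary files and keep words seen at least 5 times.
    from collections import Counter

    counts = Counter()
    for path in ['./train_data/train.Ndocument', './train_data/train.Nsummary']:
        with open(path) as f:
            for line in f:
                counts.update(line.strip().split())

    vocab = [w for w, c in counts.items() if c >= 5]
    print(len(vocab))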
    
  7. Modify the path to the vocabulary file in train.py from Vocab_Giga = loadFromPKL('../../dataset/gigaword_eng_5/giga_new.Vocab') to Vocab_Giga = loadFromPKL('my_vocab.Vocab').

  8. To train the model, run the command below.

    $ THEANO_FLAGS='floatX=float32' python train.py
    

    The training program stops when it reaches the maximum number of epochs (30 epochs). This number can be modified by changing the "max_epochs" field in ./settings/training.json. The model files are saved in the ./model/ folder.

    "2way+relation" is the default architecture. It uses the settings file ./settings/network_struct_edge.json. You can modify the 'network' field of the options_loader.py from 'settings/network_struct_edge.json' to './settings/network_struct_node.json' to train the "2way+word" architecture.

  9. (Optional) Train the model with early stopping.

    You might want to change the parameters used for early stopping. These are specified in ./settings/earlyStop.json and explained below. If early stopping is enabled, the best model files, model_best.npz and options_best.json, will be saved in the ./model/struct_edge/ folder.

{
	"sample":true, # enable model checkpoint
	"sampleMin":10000, # the first checkpoint occurs after 10K batches
	"sampleFreq":2000, # there is a checkpoint every 2K batches afterwards
	"sample_path":"./sample/",
	"earlyStop":true, # enable early stopping 
	"earlyStop_method":"valid_err", # based on validation loss
	"earlyStop_bound":62000, # the training program stops if the valid loss has no improvement after 62K batches
	"rate_bound":24000 # halve the learning rate if the valid loss has no improvement after 2K batches
}

62K batches (used for earlyStop_bound) correspond to about 1 epoch for our dataset. 24K batches (used for rate_bound) is slightly less than half of an epoch.
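
To make the interaction of these settings concrete, here is a small sketch of the checkpoint logic they describe (a paraphrase of the settings above, not the repository's training loop):

# Sketch only: how the early-stopping settings interact. A "checkpoint" runs
# every sampleFreq batches once sampleMin batches have passed; "stale" counts
# batches since the best validation loss so far.
best_err = float('inf')
best_batch = 0

def checkpoint(batch, valid_err, lr, opts):
    global best_err, best_batch
    if valid_err < best_err:                       # new best: model_best.npz is saved here
        best_err, best_batch = valid_err, batch
        return lr, False
    stale = batch - best_batch
    if stale >= opts['rate_bound']:                # no improvement for 24K batches
        lr = lr * 0.5                              # -> halve the learning rate
    stop = stale >= opts['earlyStop_bound']        # no improvement for 62K batches -> stop
    return lr, stop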

I Want to Apply the Coverage Mechanism in a 2nd Training Stage..

  1. You will switch to the file train_2.py. Modify the path to the vocabulary file in train_2.py from Vocab_Giga = loadFromPKL('../../dataset/gigaword_eng_5/giga_new.Vocab') to Vocab_Giga = loadFromPKL('my_vocab.Vocab') to point it to your vocabulary file.

  2. Run the command below to perform the 2nd-stage training. Two files, ./model/struct_edge/model_check2_best.npz and ./model/struct_edge/options_check2_best.json, will be generated, containing the best model parameters and system configurations for the "2way+relation" architecture.

    $ python train_2.py
    

License

This project is licensed under the BSD License - see the LICENSE.md file for details.

Acknowledgments

We gratefully acknowledge the work of Kelvin Xu, whose code in part inspired this project.