Skip to content

RuifengYuan/FactExsum-coling2020

Repository files navigation

FactExsum-coling2020

Code for Fact-level Extractive Summarization with Hierarchical Graph Mask on BERT (coling 2020)

The CNN/DaliyMail dataset we use is directly from the chunked data in https://github.com/JafferWilson/Process-Data-of-CNN-DailyMailv, Download FINISHED FILES. The chunked data is put in /data/DMCNN/...

If you are interested in the fact-level CNN/DaliyMail dataset described in our paper, you can download them here: https://drive.google.com/file/d/1ma0uuXd5b2EgMUslRIGGF6pVPFHBCIs-/view?usp=sharing.

The best checkpoint can be download with the link below, you can use it with the (3) Test the model. https://drive.google.com/file/d/1TbXdybrV1K1-oOvoOaUmosa5MoZ-YDkX/view?usp=sharing

The output summary of our model "our s+f" (the checkpoint above) is in result folder, the our s+f_cand refers to the standard setting described in our paper and our s+f 6_cand represents the result that extract 6 facts rather than 4 facts.


Introduction for the files:

/data/DMCNN/...: use to store the chunked CNN/DaliyMail dataset.

/data/raw_data_loader.py: use to extract article-summary pair from the chunked data.

/data_file/DMCNN/...: use to store the pickle files that contain processed data generated by make_data.py, and there are some examples in the folder. You can obtain the complete organized fact-level data with the link above.

/model/BERT.py: it contains BERT encoder with Hierarchical Graph Mask and the classifier for extractive summarization.

/utility/pyrougex.py: use to evaluate the result with ROUGE.

/utility/utility.py: it contains some functions used in make_data.py.

call_rouge.py: use to evaluate the result with ROUGE.

data_loader.py: data loader for training and testing the model, and it convert the data in pickle files into the form that used for BERT. It also construct the mask matrix.

make_data.py: split the chunked data into fact level and process the data. The output are pickle files stored in data_file.

run.py: use to train and test the model.


Dependency:

stanford_openie 1.0.3

stanfordcorenlp 3.9.1.1

pyrouge 0.1.3

rouge 1.0.0

pytorch 1.6.0

pytorch-pretrained-bert 0.6.2

transformers 2.11.0

The use of the code:

(1) Create the processed pickle files to train the model:

python make_data.py --nlp_path /home/ziqiang/stanfordnlp_resources/stanford-corenlp-full-2018-10-05 --data_path 'data/DMCNN/train*' --output_path 'data_file/DMCNN/train_file/'

nlp_path: the path for the stanfordnlp.

data_path: where you store the chunked CNN/DaliyMail dataset, for training files, 'path.../train*', for val files , 'path.../val*', for test files, 'path.../test*'.

output_path: where you store the processed pickle file.

The process will take a relatively long time (one to several days).

(2) Train the model:

python run.py --do_train --device 0 --train_size 32 --checkmin 60000 --checkfreq 6000

device: the gpu used for the training, no support for multiple gpus.

train_size: the batch size.

checkmin: the minimum step for saving the checkpoint.

checkfreq: for every checkfreq step, save a checkpoint.

The checkpoint will be saved in save_model folder.

(3) Test the model:

python run.py --do_test --device 0 --test_model 1---1-126000-0.2923-0.3453.pth.tar --block_trigram 1 --ext_num 4

device: the gpu used for the testing, no support for multiple gpus.

test_model: the name of the checkpoint used for testing, the checkpoint should be in the save_model folder.

block_trigram: whether to use the block trigram to reduce the redundancy.

ext_num: the number of the sentences that composed of the summary.

The program will output the result in result folder, checkpoint_cand.txt and checkpoint_gold.txt.

About

Code for Fact-level Extractive Summarization with Hierarchical Graph Mask on BERT (coling 2020)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages