SalienceSum

Installation

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

We use the raw version of the BBC data.

Split the single-file dataset into one file per document for PKUSUM processing. The path is not yet exposed as a command-line argument (argparse support is still to be added), so for now edit the path manually inside the script.

cd preprocess
python generate_docs_pkusum.py
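Until the script accepts its paths on the command line, the change could look roughly like the sketch below. It is not the actual contents of generate_docs_pkusum.py: the flag names `--input` and `--output-dir`, the `doc_000001.txt` naming scheme, and the assumption that the raw file holds one document per line are all hypothetical.

```python
import argparse
from pathlib import Path


def split_documents(input_path, output_dir):
    """Write each non-empty line of input_path to its own file in output_dir.

    Assumption: the raw file holds one document per line. Adjust the
    splitting rule if the real layout differs.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(input_path, encoding="utf-8") as source:
        for line in source:
            text = line.strip()
            if not text:
                continue  # skip blank lines between documents
            count += 1
            # Hypothetical naming scheme: doc_000001.txt, doc_000002.txt, ...
            target = output_dir / f"doc_{count:06d}.txt"
            target.write_text(text + "\n", encoding="utf-8")
    return count


def parse_args(argv=None):
    """Read the two paths from the command line instead of hard-coding them."""
    parser = argparse.ArgumentParser(
        description="Split a single-file corpus into one file per document."
    )
    parser.add_argument("--input", required=True, help="single-file corpus")
    parser.add_argument("--output-dir", required=True, help="folder for per-document files")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    written = split_documents(args.input, args.output_dir)
    print(f"wrote {written} documents to {args.output_dir}")
```

Calling `main()` from the script's entry point would then replace the manual path edit.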

Run the scripts. Each training script takes 30 hours to run; each validation (val) script takes ~3 hours.

./centroid.sh
./centroid_val.sh
./submodular.sh
./submodular_val.sh
./textrank.sh
./textrank_val.sh
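Because each run is long, it can help to launch the six scripts from one driver that stops at the first failure. The sketch below is one way to do that, assuming the scripts are executable and are started from the directory that contains them; `run_all` and its `runner` parameter are hypothetical names, not part of this repository.

```python
import subprocess

# Script names as listed above, in the same order.
SCRIPTS = (
    "./centroid.sh",
    "./centroid_val.sh",
    "./submodular.sh",
    "./submodular_val.sh",
    "./textrank.sh",
    "./textrank_val.sh",
)


def run_all(scripts=SCRIPTS, runner=subprocess.run):
    """Run each script in order; check=True raises on a non-zero exit code,
    so a failed script stops the remaining ones from starting."""
    for script in scripts:
        runner([script], check=True)
```

Run back to back, the stated timings add up to roughly 3 x 30 + 3 x 3 = 99 hours, so launching the driver under nohup or a job scheduler is likely worthwhile.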

Once all scripts are done, run

python preprocess_salience_from_pkupu.py

Then train the model:

allennlp train model_config/exp_01.jsonnet --serialization-dir data/train_01 --include-package salience_sum --file-friendly-logging

To generate unsupervised noisy salience labels, we process the full set of source files.

python preprocess-salience.py -input <path_to_source_folder> --submodular --NER --textrank --compression -max-words 30

The process generates labeled source files that use |#| as the separator.
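The exact layout of the labeled files is not shown here, so the following is only a sketch of how such a file might be read. It assumes, hypothetically, that every whitespace-separated token carries one or more labels appended with the |#| separator (for example `cat|#|1`); `parse_labeled_token` and `parse_labeled_line` are illustrative names, not functions from this repository.

```python
SEPARATOR = "|#|"


def parse_labeled_token(token):
    """Split 'word|#|label1|#|label2' into ('word', ['label1', 'label2']).

    Assumption: the first field is the word and every later field is a label.
    """
    word, *labels = token.split(SEPARATOR)
    return word, labels


def parse_labeled_line(line):
    """Parse one line of a labeled source file into (word, labels) pairs."""
    return [parse_labeled_token(token) for token in line.split()]
```

If the real files attach labels per sentence rather than per token, only the splitting in `parse_labeled_line` would need to change.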

Manually replace the file name inside preprocess/preprocessing.py, then run it.

python preprocess/preprocessing.py

Create a smaller subset of the data for development.

awk 'NR%200==1' data/bbc/train.tsv > data/dev_bbc/train.dev.tsv
awk 'NR%100==1' data/bbc/val.tsv > data/dev_bbc/val.dev.tsv
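The awk filters above keep lines 1, 201, 401, ... of the training file and lines 1, 101, 201, ... of the validation file. For anyone without awk, the same selection can be sketched in Python; `subsample_lines` and `subsample_file` are hypothetical helper names, not part of this repository.

```python
def subsample_lines(lines, step):
    """Keep lines whose 1-based index i satisfies i % step == 1,
    matching awk 'NR%step==1'."""
    return [line for i, line in enumerate(lines, start=1) if i % step == 1]


def subsample_file(src, dst, step):
    """Write the subsampled lines of src to dst and return how many were kept."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        kept = subsample_lines(fin, step)
        fout.writelines(kept)
    return len(kept)
```

For example, `subsample_file("data/bbc/train.tsv", "data/dev_bbc/train.dev.tsv", 200)` keeps about 0.5% of the training lines, and a step of 100 keeps about 1% of the validation lines. The data/dev_bbc directory must already exist in both the awk and the Python versions.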

python preprocess/add_noisy_dummy.py
