python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
We use the BBC data, which comes in a raw version:
train.tsv
test.tsv
val.tsv
and a version tagged by the salience model:
train.tsv.tagged
test.tsv.tagged
val.tsv.tagged
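Split the single-file dataset into one file per document for PKUSUM processing. The script does not yet read the path from the command line (argparse support still needs to be added), so for now change the path inside the script manually.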
cd preprocess
python generate_docs_pkusum.py
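The argparse support mentioned above could look roughly like the sketch below. It is not the actual generate_docs_pkusum.py: the function names, the `--input`, `--output-dir` and `--column` flags, the assumption that the source text sits in one tab-separated column, and the `<row index>.txt` naming are all hypothetical and would need to be adapted to what PKUSUM expects.

```python
import argparse
from pathlib import Path


def split_tsv(tsv_path, out_dir, column=0):
    """Write one column of every non-empty TSV row to its own text file.

    Returns the number of files written. Files are named after the
    zero-based row index so they can be matched back to the TSV.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(tsv_path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if not line.strip():
                continue  # skip blank rows
            fields = line.rstrip("\n").split("\t")
            # Assumption: the document text is in `column` (default: first).
            (out / f"{index}.txt").write_text(fields[column], encoding="utf-8")
            written += 1
    return written


def main(argv=None):
    """Parse the (hypothetical) command-line flags and run the split."""
    parser = argparse.ArgumentParser(
        description="Split a TSV file into one text file per document."
    )
    parser.add_argument("--input", required=True, help="path to the TSV file")
    parser.add_argument("--output-dir", required=True, help="folder for the document files")
    parser.add_argument("--column", type=int, default=0, help="zero-based column holding the text")
    args = parser.parse_args(argv)
    written = split_tsv(args.input, args.output_dir, args.column)
    print(f"Wrote {written} document files to {args.output_dir}")
    return written
```

To use it as a script, call `main()` under an `if __name__ == "__main__":` guard and pass, for example, `--input data/bbc/train.tsv` together with an output folder of your choice.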
Run the scripts. Each training script takes 30 hours to run; each validation (val) script takes ~3 hours.
./centroid.sh
./centroid_val.sh
./submodular.sh
./submodular_val.sh
./textrank.sh
./textrank_val.sh
Once all the scripts have finished, run:
python preprocess_salience_from_pkupu.py
allennlp train model_config/exp_01.jsonnet --serialization-dir data/train_01 --include-package salience_sum --file-friendly-logging
To generate unsupervised noisy salience labels, we process the full set of source files.
python preprocess-salience.py -input <path_to_source_folder> --submodular --NER --textrank --compression -max-words 30
The process generates labeled source files that use |#| as the separator.
Manually replace the file name inside preprocess/preprocessing.py, then run:
python preprocess/preprocessing.py
Create a smaller subset for development (dev).
awk 'NR%200==1' data/bbc/train.tsv > data/dev_bbc/train.dev.tsv
awk 'NR%100==1' data/bbc/val.tsv > data/dev_bbc/val.dev.tsv
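The awk filters above keep line 1 and then every 200th (or 100th) line after it. For anyone who prefers to do the subsampling in Python, here is a minimal sketch of the same selection; `subsample` and `write_subset` are hypothetical helper names, not part of this repository.

```python
from itertools import islice
from pathlib import Path


def subsample(lines, step):
    """Keep lines 1, step + 1, 2 * step + 1, ...

    For step >= 2 this matches awk 'NR % step == 1', where NR is the
    one-based line number.
    """
    return islice(lines, 0, None, step)


def write_subset(src_path, dst_path, step):
    """Copy the subsampled lines of src_path to dst_path; return how many were kept."""
    dst = Path(dst_path)
    dst.parent.mkdir(parents=True, exist_ok=True)  # create the dev folder if it is missing
    kept = 0
    with open(src_path, encoding="utf-8") as src, open(dst, "w", encoding="utf-8") as out:
        for line in subsample(src, step):
            out.write(line)
            kept += 1
    return kept
```

For example, `write_subset("data/bbc/train.tsv", "data/dev_bbc/train.dev.tsv", 200)` would select the same rows as the first awk command.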
python preprocess/add_noisy_dummy.py