This repo contains the code for our paper "A Character-Level Length-Control Algorithm for Non-Autoregressive Sentence Summarization".
The scripts were developed with Python 3.8 under Anaconda; the working environment can be configured with the following commands.
git clone
conda create -n NACC_MANGA python=3.8
conda activate NACC_MANGA
pip install gdown
pip install git+
conda install pytorch cudatoolkit=10.2 -c pytorch
git clone --recursive
cd ctcdecode && pip install .
cd ..
rm -rf ctcdecode
pip install -e .
The search-based summaries on the Gigaword dataset and the pre-trained model weights are available in this publicly accessible Google Drive folder. They can be downloaded and organized automatically with the following commands. If the commands do not work, download and extract the files manually.
chmod +x
Execute the following command to preprocess the data.
chmod +x
Our training script is
. We introduce its most important training parameters below; the other parameters can be found here.
data_source: (Required) Directory containing the pre-processed training data (e.g., data-bin/gigaword_10).
arch: (Required) Model architecture. This must be set to nat_encoder_only_ctc.
task: (Required) The task we are training for. This must be set to summarization.
best-checkpoint-metric: (Required) Criterion for saving the best checkpoint. This can be set to either rouge or loss (we used rouge).
criterion: (Required) Criterion for computing the training loss. This must be set to summarization_ctc.
max-valid-steps: (Optional) Maximum number of steps during validation (e.g., 100). Limiting this number avoids time-consuming validation on a large validation dataset.
batch-size-valid: (Optional) Batch size during validation (e.g., 5). Set this parameter to 1 if you want to measure non-parallel (one sample at a time) inference efficiency.
decoding_algorithm: (Optional) Decoding algorithm applied to the model's output (logits) sequence. This can be set to ctc_char_greedy_decoding or ctc_char_length_control.
truncate_summary: (Optional) Whether to truncate the generated summaries. This parameter only takes effect when decoding_algorithm is set to ctc_char_greedy_decoding.
desired_length: (Optional) Desired (maximum) number of characters in the output summary. If decoding_algorithm is set to ctc_char_greedy_decoding and truncate_summary is True, the model truncates longer summaries to desired_length. When decoding_algorithm is ctc_char_length_control, the model's decoding strategy depends on the parameter force_length, which is explained in the next item.
force_length: (Optional) This parameter only takes effect when decoding_algorithm is set to ctc_char_length_control; it determines whether to force the length of the generated summaries to be desired_length. If force_length is set to False, the model returns the greedily decoded summary when its length does not exceed desired_length. Otherwise, the model searches for the (approximately) most probable summary of the desired_length with a length-control algorithm.
bucket_size: (Optional) This parameter only takes effect when decoding_algorithm is set to ctc_char_length_control. It is the bucket size used by the length-control algorithm.
valid_subset: (Optional) Comma-separated names of the validation subsets (e.g., test,valid).
max_token: (Optional) Maximum number of tokens in each training batch.
max_update: (Optional) Maximum training steps.
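The interaction between decoding_algorithm, truncate_summary, desired_length and force_length described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the blank symbol name, the function names, joining words with single spaces, counting those spaces toward the character length, and the length_control_search callable (a stand-in for the actual length-control algorithm) are all assumptions made for the example.

```python
from typing import Callable, List, Optional

BLANK = "<blank>"  # hypothetical blank symbol; the real vocabulary entry may differ


def ctc_greedy_collapse(tokens: List[str], blank: str = BLANK) -> List[str]:
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    out: List[str] = []
    prev: Optional[str] = None
    for tok in tokens:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out


def decode(
    argmax_tokens: List[str],
    decoding_algorithm: str,
    desired_length: int,
    truncate_summary: bool = False,
    force_length: bool = False,
    length_control_search: Optional[Callable[[int], str]] = None,
) -> str:
    """Choose a decoding path according to the parameters described above."""
    # Assumption: words are joined by spaces and spaces count as characters.
    greedy = " ".join(ctc_greedy_collapse(argmax_tokens))

    if decoding_algorithm == "ctc_char_greedy_decoding":
        if truncate_summary and len(greedy) > desired_length:
            # Assumption: plain character-level truncation.
            return greedy[:desired_length]
        return greedy

    if decoding_algorithm == "ctc_char_length_control":
        if not force_length and len(greedy) <= desired_length:
            return greedy  # the greedy summary already fits the budget
        if length_control_search is None:
            raise ValueError("a length-control search routine is required")
        # Placeholder for the (approximate) search over summaries of the
        # desired length; the real algorithm works on the model's logits.
        return length_control_search(desired_length)

    raise ValueError(f"unknown decoding_algorithm: {decoding_algorithm}")


if __name__ == "__main__":
    toks = ["police", "police", BLANK, "arrest", BLANK, "suspect", "suspect"]
    print(decode(toks, "ctc_char_greedy_decoding", 13, truncate_summary=True))
```

In this sketch the search routine is only called when the greedy summary is too long or when force_length is True, which matches the description of force_length above.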
For example, to train NACC with 10-word searched summaries, ctc_char_length_control decoding, a desired length of 50 and a bucket size of 2, we can use the following training command.
CUDA_VISIBLE_DEVICES=0 nohup python data-bin/$data_source --source-lang article --target-lang summary --save-dir NACC_${data_source}_${max_token}_${decoding_algorithm}_${desired_length}_truncate_summary_${truncate_summary}_label_smoothing_${label_smoothing}_dropout_${drop_out}_checkpoints --keep-interval-updates 5 --save-interval-updates 5000 --validate-interval-updates 5000 --scoring rouge --maximize-best-checkpoint-metric --best-checkpoint-metric rouge --log-format simple --log-interval 100 --keep-last-epochs 5 --keep-best-checkpoints 5 --share-all-embeddings --encoder-learned-pos --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --weight-decay 0.01 --fp16 --clip-norm 2.0 --max-update $max_update --task summarization --criterion summarization_ctc --arch nat_encoder_only_ctc --activation-fn gelu --dropout 0.1 --max-tokens $max_token --valid-subset $valid_subset --decoding_algorithm $decoding_algorithm --desired_length $desired_length --bucket_size $bucket_size --force_length $force_length --truncate_summary $truncate_summary --max-valid-steps $max_valid_steps&
Our evaluation script is fairseq_cli/
, and it inherits the training arguments related to the data source, model architecture and decoding strategy.
In addition, it requires the following arguments.
path: (Required) Directory to the trained model (e.g., NACC/
gen-subset: (Required) Name of the dataset split to generate summaries for (e.g., test).
scoring: (Required) Similar to the criteria in the training arguments; it must be set to rouge.
For example, the following command evaluates the performance of our pretrained
on the Gigaword test dataset.
CUDA_VISIBLE_DEVICES=0 python fairseq_cli/ $data_source --seed $seed --source-lang article --target-lang summary --path $path --task summarization --scoring rouge --arch nat_encoder_only_ctc --gen-subset $gen_subset --model-overrides "{'decoding_algorithm': '$decoding_algorithm', 'desired_length': $desired_length, 'bucket_size': $bucket_size, 'truncate_summary': $truncate_summary, 'beam_size': 1, 'generator_type': 'ctc'}"
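The --model-overrides argument in the command above is a string that the shell fills in before Python sees it. Assuming the expanded string is evaluated as a Python dict literal (our understanding of how Fairseq treats this option, not something this README states), the shell variables must expand to valid Python literals, for example True or False rather than true or false. The sketch below mimics the shell substitution with string.Template; the helper name expand_overrides is hypothetical.

```python
import ast
from string import Template

# Same layout as the --model-overrides string in the evaluation command.
OVERRIDES_TEMPLATE = Template(
    "{'decoding_algorithm': '$decoding_algorithm', "
    "'desired_length': $desired_length, "
    "'bucket_size': $bucket_size, "
    "'truncate_summary': $truncate_summary, "
    "'beam_size': 1, 'generator_type': 'ctc'}"
)


def expand_overrides(**shell_vars: str) -> dict:
    """Substitute the variables as the shell would, then parse the result
    as a Python literal (assumed to mirror what the evaluation script does)."""
    text = OVERRIDES_TEMPLATE.substitute(**shell_vars)
    return ast.literal_eval(text)


if __name__ == "__main__":
    overrides = expand_overrides(
        decoding_algorithm="ctc_char_length_control",
        desired_length="50",
        bucket_size="2",
        truncate_summary="False",  # must be a Python literal, not "false"
    )
    print(overrides)
```

Checking the expanded string this way before launching a long evaluation run can catch quoting mistakes early; a lowercase false, for instance, makes ast.literal_eval raise a ValueError.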
By default, the evaluation results are saved in the folder *_evaluation_result, including the generated summaries and their statistics (e.g., ROUGE scores).
Note: if you want to measure non-parallel (one sample at a time) inference efficiency, add the extra parameter --batch-size 1 to the evaluation command.
As you may notice, our scripts are built on Fairseq, a useful and extensible package for developing Seq2Seq models. We left its built-in functionality intact so that it remains extensible.