CS5803 Natural Language Processing - Document Level Text Simplification - A Two-stage Plan-Guided Approach
To activate the conda environment using environment.yml files, navigate to the SimSum directory and run the following command:
cd SimSum
conda env create -f environment.yml
We have used code from Document Level Planning for Text Simplification and SimSum.
- A smaller version of the Wiki-Auto dataset, which can be found here.
- The PLABA dataset has been taken from here.
- The D-Wikipedia dataset has been taken from here.
The outputs can be evaluated using the following command:
python evaluate.py --model MODEL --dataset DATASET
MODEL: BART, SIMSUM, PG_SIMSUM
DATASET: WIKI_AUTO_REDUCED, PLABA, D_WIKI
The model outputs on the corresponding test datasets and TensorBoard logs are available in the SimSum/experiments folder.
The model files can be downloaded from here; replace the experiments directory with the downloaded directory.
We have used the existing model weights (from Hugging Face) for the Plan Guided Model for evaluation.
python generate.py dynamic --clf_model_ckpt=liamcripwell/pgdyn-plan --model_ckpt=liamcripwell/pgdyn-simp --test_file=data_pg/wiki_auto/wikiauto_sents_test_reduced.csv --doc_id_col=pair_id --context_doc_id=pair_id --context_dir=context_save_dir/wiki_auto/test --out_file=data_pg/wiki_auto/wiki_auto_reduced_output.csv
python generate.py dynamic --clf_model_ckpt=liamcripwell/pgdyn-plan --model_ckpt=liamcripwell/pgdyn-simp --test_file=data_pg/D_wiki/DWiki_sents_test.csv --doc_id_col=pair_id --context_doc_id=pair_id --context_dir=context_save_dir/D_wiki/test --out_file=data_pg/D_wiki/D_wiki_output.csv
python generate.py dynamic --clf_model_ckpt=liamcripwell/pgdyn-plan --model_ckpt=liamcripwell/pgdyn-simp --test_file=data_pg/plaba/plaba_sents_test.csv --doc_id_col=pair_id --context_doc_id=pair_id --context_dir=context_save_dir/plaba/test --out_file=data_pg/plaba/plaba_output.csv
Every time you run generate.py, make sure to delete the temp_embeds/ directory:
rm -r temp_embeds/
We have trained three models (BART, SIMSUM, PG_SIMSUM) on the Reduced Wiki-Auto dataset.
- Uncomment relevant lines in SimSum/main.py.
- Use dataset = WIKI_AUTO_REDUCED.
- Run python main.py.
[OPTIONAL] To obtain OPERATION TOKENS for each sentence:
- Uncomment relevant lines in SimSum/prepend_tokens.py.
- Run python prepend_tokens.py.
- Results are stored in data/wiki_auto_reduced_control.
- This step is necessary even for generating output during evaluation; results are stored in the SimSum/data folder with the '_control' suffix added to the dataset name.
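The prepend step can be illustrated with a minimal sketch. Note that the operation-token names and the row layout below are assumptions for illustration only; the exact tokens and CSV schema used by SimSum/prepend_tokens.py may differ.

```python
# Hypothetical operation tokens; the actual labels produced by
# prepend_tokens.py may differ.
OPS = {"copy": "<COPY>", "rephrase": "<REPHRASE>",
       "split": "<SPLIT>", "delete": "<DELETE>"}

def prepend_tokens(rows):
    """Prefix each complex sentence with its planned operation token."""
    out = []
    for row in rows:
        token = OPS[row["operation"]]
        # Build a new dict so the input rows are left unmodified.
        out.append({**row, "complex": f"{token} {row['complex']}"})
    return out

rows = [{"complex": "The cat sat.", "operation": "copy"}]
print(prepend_tokens(rows)[0]["complex"])  # <COPY> The cat sat.
```

The downstream simplification model then sees the control token as part of its input sequence.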
- Uncomment relevant lines in SimSum/main.py.
- Use dataset = WIKI_AUTO_REDUCED_CONTROL.
- Run python main.py.
IMPORTANT: If you run into errors while training, downgrade transformers to version 4.21.1: pip install transformers==4.21.1
To obtain results for PG_SIMSUM and the pretrained plan-guided module, we need context representations for the surrounding sentences. This can be done with the following command:
python encode_contexts.py --data=DATASET_FILE.csv --x_col=complex --id_col=pair_id --save_dir=CONTEXT_DIR
We have already done this for all the datasets; the embeddings are stored in SimSum/context_save_dir/.
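As a rough illustration of what this caching step does (one context vector per document id, saved under a directory), here is a toy sketch. The bag-of-characters "encoder" and the JSON file layout are stand-in assumptions; encode_contexts.py uses a real sentence encoder and its own storage format.

```python
import json
import tempfile
from pathlib import Path

def toy_encode(text, dim=16):
    """Stand-in for a real sentence encoder: bag-of-characters vector."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec

def encode_contexts(rows, x_col, id_col, save_dir):
    """Cache one context vector per document id, mirroring the idea
    behind encode_contexts.py (actual encoder and format differ)."""
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    for row in rows:
        vec = toy_encode(row[x_col])
        (save_dir / f"{row[id_col]}.json").write_text(json.dumps(vec))

demo_dir = tempfile.mkdtemp()
rows = [{"pair_id": "doc1", "complex": "A complex sentence."}]
encode_contexts(rows, x_col="complex", id_col="pair_id", save_dir=demo_dir)
print(sorted(p.name for p in Path(demo_dir).iterdir()))  # ['doc1.json']
```

At generation time the model can then look up each document's cached vector by id instead of re-encoding its context.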
| Model | SARI | D-SARI |
|---|---|---|
| BART | 38.84 | 24.32 |
| SIMSUM | 35.07 | 32.47 |
| Plan-Guided | 25.57 | 24.27 |
| PG-SIMSUM | 43.56 | 38.52 |
Table: Results on R-Wiki-auto
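For reference, SARI (the metric reported above) scores an output against both the source and the reference, rewarding correctly kept, added, and deleted words. The following is a toy single-reference, unigram-only sketch of the idea; it is not the official implementation used to produce the numbers in the table, which operates over n-grams up to length 4 and multiple references.

```python
def unigram_sari(source, output, reference):
    """Toy single-reference, unigram-only SARI-style score in [0, 100]."""
    s, o, r = (set(t.split()) for t in (source, output, reference))

    def f1(tp, pred, gold):
        if not pred or not gold:
            return 1.0 if pred == gold else 0.0
        p, rec = tp / len(pred), tp / len(gold)
        return 2 * p * rec / (p + rec) if p + rec else 0.0

    # keep: of the words kept from the source, which should be kept?
    keep = f1(len((s & o) & r), s & o, s & r)
    # add: of the words added, which appear in the reference?
    add = f1(len((o - s) & r), o - s, r - s)
    # delete: precision of the deleted words (SARI uses precision here).
    dele = (len((s - o) & (s - r)) / len(s - o)) if s - o else 1.0
    return 100 * (keep + add + dele) / 3
```

A perfect output (identical to a reference that needs no change) scores 100; outputs that keep, add, or delete the wrong words score lower on the corresponding component.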