Original implementation and dataset for the paper 'You Don’t Have Time to Read This: An Exploration of Document Reading Time Prediction' by Weller et al., published at ACL 2020.
Datasets can be found in `src/data`. If you want all of the data, use the `src/data/screening_study_with_article_info.csv` file, which contains the article texts and the demographic info. The other data files are explained in the readme located in `src/data`.
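If you only want to look at the combined dataset, a minimal sketch for loading it is below (assuming `pandas` is installed; the column names are documented in the `src/data` readme, so none are assumed here):

```python
# Minimal sketch: load and inspect the combined screening-study dataset.
# Assumes pandas is installed; column meanings are documented in src/data, so we only print them.
import pandas as pd

df = pd.read_csv("src/data/screening_study_with_article_info.csv")
print(df.shape)               # rows x columns
print(df.columns.tolist())    # see the src/data readme for what each column means
print(df.head())
```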
- Create a new virtual environment.
- Run `pip3 install -r requirements.txt` and `bash nltk_downloader.sh` to load the required packages.
- Use the preloaded data files found in `src/data`, or re-create them by following the Neural Complexity steps below.
- Go into the `src` directory and create the document embedders by running `python3 document_embedder.py`.
- Run `bash run_n_seeds` to generate the results in `src/results`.
- Compile the results by running `python3 utils.py --gather_results` to collect the results into `src/results/final_results.csv`. (The whole sequence is sketched as a single shell session after this list.)
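Taken together, the reproduction steps above might look like the following shell session. This is a sketch rather than a script from the repo: it assumes Python 3 with the `venv` module and that everything is run from the repository root.

```bash
# Sketch of the reproduction pipeline, run from the repository root.
# Assumes Python 3 with the venv module is available.
python3 -m venv .venv
source .venv/bin/activate

pip3 install -r requirements.txt
bash nltk_downloader.sh            # download the required NLTK resources

cd src
python3 document_embedder.py       # create the document embedders
bash run_n_seeds                   # generate per-seed results in src/results
python3 utils.py --gather_results  # collect them into src/results/final_results.csv
```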
- Split the article text by going into `src` and running `python3 utils.py --split_text`.
- Clone the repo at https://github.com/vansky/neural-complexity.git and follow its instructions to train on Wikitext-2:

```bash
time python main.py --model_file 'wiki_2_model.pt' --vocab_file 'wiki_2_vocab.txt' --tied --cuda --data_dir './data/wikitext-2/' --trainfname 'train.txt' --validfname 'valid.txt'
```

- Evaluate on the text from this repo by running the following inside of the `neural-complexity` folder:

```bash
# gets the non-tokenized version for the LMM over the surprisal
for ((i=1; i<33; i++));
do
time python main.py --model_file 'wiki_2_model.pt' --vocab_file 'wiki_2_vocab.txt' --cuda --data_dir "../src/data/article_texts/" --testfname "$i.txt" --test --words --nopp > "../src/data/article_texts/$i.output"
done

# gets the tokenized version for the sum of the surprisal
for ((i=1; i<33; i++));
do
time python main.py --model_file 'wiki_2_model.pt' --vocab_file 'wiki_2_vocab.txt' --cuda --data_dir "../src/data/article_texts/" --testfname "$i-tok.txt" --test --words --nopp > "../src/data/article_texts/$i.output-tok"
done
```

- Re-gather the surprisals for each article by running `python3 utils.py --gather_suprisal` inside of `src`, which will place the data into `src/data/article_num_to_surprisal_only.csv`. (A sketch for inspecting the per-article output files follows this list.)
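If you want to sanity-check the generated surprisal files before re-gathering them, a hedged sketch is below. The whitespace-separated layout and the `surp` column name are assumptions about the `--words` output of neural-complexity; check the header of your files and adjust accordingly.

```python
# Hedged sketch: aggregate per-word surprisal from one of the generated output files.
# The column layout of neural-complexity's --words output may differ by version;
# the 'surp' column name is an assumption -- inspect the printed header and adjust.
import pandas as pd

path = "src/data/article_texts/1.output-tok"   # any file produced by the loops above
df = pd.read_csv(path, sep=r"\s+")
print(df.columns.tolist())                     # check the actual column names first
print("total surprisal:", df["surp"].sum())    # sum-of-surprisal feature for this article
```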
- Go into `src/naturalstories` and run `bash gather_data_from_natural_stories` and `python3 process_natural_stories.py --create_story_files`. This creates the individual story files.
- Go to the cloned `neural-complexity` repo and, from the root of that clone, run the script found in `src/naturalstories/model_all_stories.sh`.
- Gather and process the model's output by going back to `src/naturalstories` and running `python3 process_natural_stories.py --create_data_file_for_lmm --train_lmm`. This trains the LMM on the Natural Stories corpus and then generates predictions for the new data. (These steps are sketched as a shell session after this list.)
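Because the Natural Stories steps above involve two repositories, a sketch of the whole sequence is given below. The two directory variables are assumptions about where each repo was cloned; adjust them to your layout (or copy `model_all_stories.sh` into the neural-complexity root instead of pointing at it by path).

```bash
# Sketch of the Natural Stories / LMM workflow. The two paths below are assumptions:
# set them to wherever you cloned this repo and the neural-complexity repo.
READING_TIME_REPO=~/doc-reading-time        # hypothetical clone location of this repo
NEURAL_COMPLEXITY_REPO=~/neural-complexity  # hypothetical clone location of vansky/neural-complexity

cd "$READING_TIME_REPO/src/naturalstories"
bash gather_data_from_natural_stories
python3 process_natural_stories.py --create_story_files    # writes the individual story files

# model_all_stories.sh is meant to be run from the root of the neural-complexity clone
cd "$NEURAL_COMPLEXITY_REPO"
bash "$READING_TIME_REPO/src/naturalstories/model_all_stories.sh"

cd "$READING_TIME_REPO/src/naturalstories"
python3 process_natural_stories.py --create_data_file_for_lmm --train_lmm   # trains the LMM and generates predictions
```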
If you find this work useful or related to yours, please cite the following:
```bibtex
@inproceedings{weller2020you,
    title={You Don’t Have Time to Read This: An Exploration of Document Reading Time Prediction},
    author={Weller, Orion and Hildebrandt, Jordan and Reznik, Ilya and Challis, Christopher and Tass, E Shannon and Snell, Quinn and Seppi, Kevin},
    booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
    pages={1789--1794},
    year={2020}
}
```