Natural Stories Corpus

Update 2020-11-11: The repository has been updated with corrected alignments between SPR RTs and tokens in Story 3. In the previously-released files, the words after position 230 in Story 3 were mis-aligned with SPR RTs by one position. In the new files and alignment scripts, these errors have been corrected. For more details, see the README file in the naturalstories_RT directory. Thanks to Luca Campanelli and Cory Shain for helping us find and correct this issue.

Update 2021-10-01: Related to the update above, we removed erroneous RT datapoints corresponding to a blank token shown to SPR participants at position 231 in Story 3.

This is a corpus of naturalistic stories meant to contain varied, low-frequency syntactic constructions. There are a variety of annotations and psycholinguistic measures available for the stories.

The stories in with their various annotations are coordinated around the file words.tsv, which specifies a unique code for each token in the story under a variety of different tokenization schemes. For example, the following lines in words.tsv cover the phrase the long-bearded mill owners.:

1.54.whole      the
1.54.word       the
1.54.1  the
1.55.whole      long - bearded
1.55.word       long - bearded
1.55.1  long
1.55.2  -
1.55.3  bearded
1.56.whole      mill
1.56.word       mill
1.56.1  mill
1.57.whole      owners .
1.57.word       owners
1.57.1  owners
1.57.2  .

The first column is the token code; the second is the token itself. For example, 1.57.whole represents the token owners. and 1.57.word represents the token owners. The token code consists of three fields:

The id of the story the token is found in,
The number of the token in the story,
An additional field whose value is whole for the entire token including punctuation, word for the token stripped of punctuation to the left and right, and then 1 through n for each sub-token in whole as segmented by NLTK's TreebankWordTokenizer.

The various annotations (frequencies, parses, RTs, etc.) should reference these codes so that we can track tokens uniformly.

If you use the corpus please cite:

@article{futrell2021natural,
author={Richard Futrell and Edward Gibson and Harry J. Tily and Idan Blank and Anastasia Vishnevetsky and Steven T. Piantadosi and Evelina Fedorenko},
year={2021},
title={The Natural Stories corpus: A reading-time corpus of English texts containing rare syntactic constructions},
journal={Language Resources and Evaluation},
volume={55},
number={1},
pages={63--77}}

Deep syntactic annotations following a categorial grammar are also available here (see paper).

Name		Name	Last commit message	Last commit date
Latest commit History 50 Commits
audio		audio
freqs		freqs
naturalstories_RTS		naturalstories_RTS
parses		parses
probs		probs
LICENSE.txt		LICENSE.txt
README.md		README.md
parsed-natural-stories.xlsx		parsed-natural-stories.xlsx
words.tsv		words.tsv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

audio

audio

freqs

freqs

naturalstories_RTS

naturalstories_RTS

parses

parses

probs

probs

LICENSE.txt

LICENSE.txt

README.md

README.md

parsed-natural-stories.xlsx

parsed-natural-stories.xlsx

words.tsv

words.tsv

Repository files navigation

Natural Stories Corpus

About

Releases

Packages

Contributors 4

Languages

License

languageMIT/naturalstories

Folders and files

Latest commit

History

Repository files navigation

Natural Stories Corpus

About

Resources

License

Stars

Watchers

Forks

Languages