Lossy-context surprisal experiments and simulations.
This repository contains Python 3 code for replicating experiments and simulations for lossy-context surprisal, introduced in Futrell & Levy (2017, EACL) and fleshed out in Futrell, Gibson & Levy (2020, Cognitive Science).
@article{futrell2020lossy,
author={Richard Futrell and Edward Gibson and Roger P. Levy},
title={Lossy-context surprisal: An information-theoretic model of memory effects in sentence processing},
year={2020},
journal={Cognitive Science},
volume={44},
pages={e12814}}
@inproceedings{futrell2017noisy,
author={Richard Futrell and Roger P. Levy},
title={Noisy-context surprisal as a human sentence processing cost model},
year={2017},
booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers},
pages={688--698},
address={Valencia, Spain}}
To get started: pip3 install -r requirements.txt
Lossy-context surprisal values for grammatical and ungrammatical completions of a string generated by a PCFG grammar. The results show a crossover between English and German, whereby English has a grammaticality illusion and German doesn't. This phenomenon is called structural forgetting. To generate the model values:
import experiments
_, english = experiments.verb_forgetting_conditions(m=.5, r=.5, e=.2, s=.8)
_, german = experiments.verb_forgetting_conditions(m=.5, r=.5, e=.2, s=0)
The resulting numbers, divided by log 2, are plotted against reading-time data in shravanplot.R.
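If the model values come out in nats (natural log), dividing by log 2 converts them to bits. A minimal sketch of that conversion (the helper name here is ours, not part of the repo):

```python
import math

def nats_to_bits(values):
    """Convert surprisal values from nats to bits by dividing by log 2."""
    return [v / math.log(2) for v in values]

# a surprisal of log(4) nats is exactly 2 bits
print(nats_to_bits([math.log(4)]))
```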
Regions of different model behavior for the structural forgetting case, based on PCFG parameters and depth of embedding.
import experiments
df = experiments.verb_forgetting_grid()
experiments.verb_forgetting_plot(df)
This will bring up a matplotlib plot of the phase diagram.
Supposing you have the ngrams at $PATH, use syntngrams_depmi.py to extract the appropriate counts:
$ zcat $PATH/arcs* | python3 syntngrams_depmi.py 01 01 | sort | sh uniqsum.sh > arcs_01-01
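uniqsum.sh appears to play the role of a uniq-and-sum step: given sorted lines whose last field is a count, it collapses runs of identical keys by summing their counts. A hedged Python equivalent of that assumed behavior (check the script itself for the exact field layout):

```python
from itertools import groupby

def uniqsum(lines):
    """Assumed behavior of uniqsum.sh: for sorted input lines of the form
    "key ... count", sum the final count field over runs of identical keys."""
    def key(line):
        return line.rsplit(None, 1)[0]
    for k, group in groupby(lines, key=key):
        total = sum(int(line.rsplit(None, 1)[1]) for line in group)
        yield "%s %d" % (k, total)

print(list(uniqsum(["a b 2", "a b 3", "c d 1"])))
```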
The script takes two arguments, match_code and get_code. match_code tells the script which dependency structures to filter for. For example, 01 means a head and its direct dependent; 012 means a chain of a word w_0, w_0's dependent w_1, and w_1's dependent w_2; and 011 means structures with one head and two dependents. get_code tells the script which two words to extract wordforms for. The example above looks for direct dependencies and takes the wordforms of head and dependent. The table in the paper uses codes 012 01, 012 02, and 011 12.
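As an illustration of the code notation (our reconstruction for exposition, not the script's actual parser), each digit d > 0 can be read as a dependent whose head is the nearest preceding digit d - 1:

```python
def edges_from_code(code):
    """Interpret a match_code as head -> dependent edges, returned as
    (head_position, dep_position) pairs: each digit d > 0 attaches to
    the nearest preceding position labeled d - 1."""
    edges = []
    for i, d in enumerate(int(ch) for ch in code):
        if d == 0:
            continue
        head = max(j for j in range(i) if int(code[j]) == d - 1)
        edges.append((head, i))
    return edges

print(edges_from_code("012"))  # chain: [(0, 1), (1, 2)]
print(edges_from_code("011"))  # one head, two dependents: [(0, 1), (0, 2)]
```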
The resulting file arcs_01-01 contains joint counts of word pairs in the specified dependency relationship.
Now generate the vocabulary file for the frequency cutoff:
$ cat arcs_01-01 | sed "s/^.* //g" | sort | sh uniqsum.sh | sort -rnk 2 > vocab
Then use the vocab file to calculate MI with a frequency cutoff:
$ cat arcs_01-01 | python3 compute_mi.py vocab 10000
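The quantity being computed is the mutual information of the two word slots under the joint count distribution, with low-frequency forms dropped. A rough sketch of that computation (not compute_mi.py itself; the script's exact cutoff and smoothing details may differ):

```python
import math
from collections import Counter

def mi_with_cutoff(pair_counts, vocab, cutoff):
    """Mutual information in bits between the two word slots, restricted
    to pairs where both words are among the `cutoff` most frequent forms
    in `vocab` (assumed sorted by descending frequency)."""
    keep = set(vocab[:cutoff])
    counts = {(x, y): c for (x, y), c in pair_counts.items()
              if x in keep and y in keep}
    n = sum(counts.values())
    px, py = Counter(), Counter()
    for (x, y), c in counts.items():
        px[x] += c
        py[y] += c
    return sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
               for (x, y), c in counts.items())
```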
To compare the MI of two sets of counts using a permutation test, do (for example):
$ python3 compare_mi.py arcs_012-01 arcs_012-02 vocab 10000 500
This runs a permutation test with 500 samples comparing the MI in the files arcs_012-01 and arcs_012-02, with the vocabulary from the file vocab cut off at the 10,000 most frequent forms.
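The logic of such a permutation test can be sketched as follows (an illustrative version, not compare_mi.py itself: pool the observations, repeatedly reassign them at random to two pseudo-datasets of the original sizes, and count how often the permuted MI difference is at least as large as the observed one):

```python
import math
import random
from collections import Counter

def plugin_mi(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) pairs."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def permutation_test(pairs_a, pairs_b, n_samples=500, seed=0):
    """Permutation test for a difference in MI between two pair samples."""
    rng = random.Random(seed)
    observed = abs(plugin_mi(pairs_a) - plugin_mi(pairs_b))
    pooled = list(pairs_a) + list(pairs_b)
    hits = 0
    for _ in range(n_samples):
        rng.shuffle(pooled)
        a, b = pooled[:len(pairs_a)], pooled[len(pairs_a):]
        if abs(plugin_mi(a) - plugin_mi(b)) >= observed:
            hits += 1
    return (hits + 1) / (n_samples + 1)  # smoothed p-value
```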
Mutual information over part-of-speech tags for different dependency relations in several UD corpora. To replicate these numbers, do in Python:
import hdmi
hdmi.hdmi_topologies_with_permutation_tests()
Average PMI values of POS tags at different distances in the UD corpora. To replicate these numbers, do in Python:
import hdmi
hdmi.skip_mi_sweep()
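The quantity in question can be sketched as the average PMI of tags separated by k positions, which is the mutual information of the tag pair under the empirical joint distribution. An illustrative version (not hdmi.skip_mi_sweep itself):

```python
import math
from collections import Counter

def avg_pmi_at_distance(tag_sequences, k):
    """Average PMI, in bits, of tags k positions apart across a corpus of
    tag sequences; equivalently, the mutual information of the tag pair
    under the empirical joint distribution."""
    pairs = [(s[i], s[i + k]) for s in tag_sequences for i in range(len(s) - k)]
    if not pairs:
        return 0.0
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
               for (x, y), c in pxy.items())
```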
To get the numbers for wordforms:
import cliqs.conditioning as cond
import hdmi
hdmi.skip_mi_sweep(cond.get_word)