Lossy-context surprisal experiments and simulations.
This repository contains Python 3 code for replicating experiments and simulations for lossy-context surprisal, introduced in Futrell & Levy (2017, EACL) and fleshed out in Futrell, Gibson & Levy (2020, Cognitive Science).
@article{futrell2020lossy,
author={Richard Futrell and Edward Gibson and Roger P. Levy},
title={Lossy-context surprisal: An information-theoretic model of memory effects in sentence processing},
year={2020},
journal={Cognitive Science},
volume={44},
pages={e12814}}
@inproceedings{futrell2017noisy,
author={Richard Futrell and Roger P. Levy},
title={Noisy-context surprisal as a human sentence processing cost model},
year={2017},
booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers},
pages={688--698},
address={Valencia, Spain}}
To get started: pip3 install -r requirements.txt
Lossy-context surprisal values for grammatical and ungrammatical completions of a string generated by a PCFG grammar. The results show a crossover between English and German, whereby English has a grammaticality illusion and German doesn't. This phenomenon is called structural forgetting. To generate the model values:
import experiments
_, english = experiments.verb_forgetting_conditions(m=.5, r=.5, e=.2, s=.8)
_, german = experiments.verb_forgetting_conditions(m=.5, r=.5, e=.2, s=0)
The resulting numbers, divided by log 2, are plotted against reading-time data in shravanplot.R.
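If the model values come out in nats (natural log), dividing by log 2 converts them to bits. A minimal sketch of that conversion (the helper name here is ours, not part of the repo):

```python
import math

def nats_to_bits(values):
    """Convert surprisal values from nats to bits by dividing by log 2."""
    return [v / math.log(2) for v in values]

# a surprisal of log(4) nats is exactly 2 bits
print(nats_to_bits([math.log(4)]))
```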
Regions of different model behavior for the structural forgetting case, based on PCFG parameters and depth of embedding.
import experiments
df = experiments.verb_forgetting_grid()
experiments.verb_forgetting_plot(df)
This will bring up a matplotlib plot of the phase diagram.
Supposing you have the ngrams at $PATH, use syntngrams_depmi.py to extract the appropriate counts:
$ zcat $PATH/arcs* | python3 syntngrams_depmi.py 01 01 | sort | sh uniqsum.sh > arcs_01-01
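uniqsum.sh appears to play the role of a uniq-and-sum step: given sorted lines whose last field is a count, it collapses runs of identical keys by summing their counts. A hedged Python equivalent of that assumed behavior (check the script itself for the exact field layout):

```python
from itertools import groupby

def uniqsum(lines):
    """Assumed behavior of uniqsum.sh: for sorted input lines of the form
    "key ... count", sum the final count field over runs of identical keys."""
    def key(line):
        return line.rsplit(None, 1)[0]
    for k, group in groupby(lines, key=key):
        total = sum(int(line.rsplit(None, 1)[1]) for line in group)
        yield "%s %d" % (k, total)

print(list(uniqsum(["a b 2", "a b 3", "c d 1"])))
```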
The script takes two arguments, match_code and get_code. match_code tells the script which dependency structures to filter for. For example, 01 means a head and its direct dependent; 012 means a chain of a word w_0, w_0's dependent w_1, and w_1's dependent w_2; and 011 means structures with one head and two dependents. get_code tells the script which two words to extract wordforms for. The example above looks for direct dependencies and takes the wordforms of head and dependent. The table in the paper uses codes 012 01, 012 02, and 011 12.
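As an illustration of the code notation (our reconstruction for exposition, not the script's actual parser), each digit d > 0 can be read as a dependent whose head is the nearest preceding digit d - 1:

```python
def edges_from_code(code):
    """Interpret a match_code as head -> dependent edges, returned as
    (head_position, dep_position) pairs: each digit d > 0 attaches to
    the nearest preceding position labeled d - 1."""
    edges = []
    for i, d in enumerate(int(ch) for ch in code):
        if d == 0:
            continue
        head = max(j for j in range(i) if int(code[j]) == d - 1)
        edges.append((head, i))
    return edges

print(edges_from_code("012"))  # chain: [(0, 1), (1, 2)]
print(edges_from_code("011"))  # one head, two dependents: [(0, 1), (0, 2)]
```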
The resulting file arcs_01-01 contains joint counts of word pairs in the specified dependency relationship.
Now generate the vocabulary file for the frequency cutoff:
$ cat arcs_01-01 | sed "s/^.* //g" | sort | sh uniqsum.sh | sort -rnk 2 > vocab
Then use the vocab file to calculate MI with a frequency cutoff:
$ cat arcs_01-01 | python3 compute_mi.py vocab 10000
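The quantity being computed is the mutual information of the two word slots under the joint count distribution, with low-frequency forms dropped. A rough sketch of that computation (not compute_mi.py itself; the script's exact cutoff and smoothing details may differ):

```python
import math
from collections import Counter

def mi_with_cutoff(pair_counts, vocab, cutoff):
    """Mutual information in bits between the two word slots, restricted
    to pairs where both words are among the `cutoff` most frequent forms
    in `vocab` (assumed sorted by descending frequency)."""
    keep = set(vocab[:cutoff])
    counts = {(x, y): c for (x, y), c in pair_counts.items()
              if x in keep and y in keep}
    n = sum(counts.values())
    px, py = Counter(), Counter()
    for (x, y), c in counts.items():
        px[x] += c
        py[y] += c
    return sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
               for (x, y), c in counts.items())
```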
To compare the MI of two sets of counts using a permutation test, do (for example):
$ python3 compare_mi.py arcs_012-01 arcs_012-02 vocab 10000 500
This runs a permutation test with 500 samples comparing the MI in the files arcs_012-01 and arcs_012-02, with the vocabulary from the file vocab cut off at the 10,000 most frequent forms.
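The logic of such a permutation test can be sketched as follows (an illustrative version, not compare_mi.py itself: pool the observations, repeatedly reassign them at random to two pseudo-datasets of the original sizes, and count how often the permuted MI difference is at least as large as the observed one):

```python
import math
import random
from collections import Counter

def plugin_mi(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) pairs."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def permutation_test(pairs_a, pairs_b, n_samples=500, seed=0):
    """Permutation test for a difference in MI between two pair samples."""
    rng = random.Random(seed)
    observed = abs(plugin_mi(pairs_a) - plugin_mi(pairs_b))
    pooled = list(pairs_a) + list(pairs_b)
    hits = 0
    for _ in range(n_samples):
        rng.shuffle(pooled)
        a, b = pooled[:len(pairs_a)], pooled[len(pairs_a):]
        if abs(plugin_mi(a) - plugin_mi(b)) >= observed:
            hits += 1
    return (hits + 1) / (n_samples + 1)  # smoothed p-value
```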
Mutual information over part-of-speech tags for different dependency relations in several UD corpora. To replicate these numbers, do in Python:
import hdmi
hdmi.hdmi_topologies_with_permutation_tests()
Average PMI values of POS tags at different distances in the UD corpora. To replicate these numbers, do in Python:
import hdmi
hdmi.skip_mi_sweep()
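The quantity in question can be sketched as the average PMI of tags separated by k positions, which is the mutual information of the tag pair under the empirical joint distribution. An illustrative version (not hdmi.skip_mi_sweep itself):

```python
import math
from collections import Counter

def avg_pmi_at_distance(tag_sequences, k):
    """Average PMI, in bits, of tags k positions apart across a corpus of
    tag sequences; equivalently, the mutual information of the tag pair
    under the empirical joint distribution."""
    pairs = [(s[i], s[i + k]) for s in tag_sequences for i in range(len(s) - k)]
    if not pairs:
        return 0.0
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
               for (x, y), c in pxy.items())
```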
To get the numbers for wordforms:
import cliqs.conditioning as cond
import hdmi
hdmi.skip_mi_sweep(cond.get_word)