Calculate predictors of adjective order and test them in large dependency treebanks.
In this study, we extracted data from the following external data sources, not included here:
- A parsed Common Crawl corpus in CoNLLU format, introduced in Futrell et al. (2019).
- Universal Dependencies 2.4, in particular the English Web Treebank
- GloVe embeddings from `glove.42B.300d.zip` (only if performing your own clustering)
English adjective and noun wordforms from CELEX are provided in `data/english_adjectives.txt` and `data/english_nouns.txt`.
The files `data/subjectivity*` are from Scontras et al. (2017): these are subjectivity ratings collected in previous experiments. The subjectivity ratings collected for this study are at `experiments/1-UD-subjectivity/results/adjective-subjectivity.csv`.
The tarball `data/clust_pairs.tar.gz` contains pre-clustered pairs of adjectives and nouns from the Common Crawl corpus.
The directory `corpus_extraction` has scripts for pulling relevant data out of CoNLLU-formatted dependency treebank files. Supposing you have a set of treebank files at location `$CORPORA`, run the following in the directory `corpus_extraction` to get all the adjective--adjective--noun triples:

```
cat $CORPORA | python extract_conllu.py aan > aan.csv
sh csvcount.sh aan.csv > aan_counts.csv
```
and run the following to get just the adjectives:

```
cat $CORPORA | python extract_conllu.py a > a.csv
sh csvcount.sh a.csv > a_counts.csv
```
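The counting step collapses identical extracted rows into count-prefixed records. A minimal sketch of that operation in Python, assuming `csvcount.sh` tallies duplicate rows (the repo's script is the authoritative implementation; the helper name and inputs here are hypothetical):

```python
import csv
from collections import Counter

def count_rows(lines):
    """Collapse identical CSV rows into (count, row) pairs, most frequent first.

    Hypothetical stand-in for csvcount.sh; `lines` is an iterable of CSV lines.
    """
    counts = Counter(tuple(row) for row in csv.reader(lines))
    return [(n, row) for row, n in counts.most_common()]

# Two identical adjective-adjective-noun rows collapse into one counted record.
rows = count_rows(["big,red,ball", "big,red,ball", "old,gray,mare"])
print(rows)  # [(2, ('big', 'red', 'ball')), (1, ('old', 'gray', 'mare'))]
```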
Adjectives and nouns are clustered with `measures/cluster.py` with the following arguments:

- `-v $GLOVE` -- a file containing space-delimited wordforms and their vectors
- `-p $PAIRS` -- a file containing comma-delimited `count,adj,noun` rows
- `-k ($ADJ_K,$NOUN_K)` -- [optional] a tuple listing what k to use for adjectives and nouns; default is `(300,1000)`
- `-c $PCA` -- [optional] the amount of information to preserve when running PCA; default is 1.0 (no reduction)
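The `-v` file follows the standard GloVe text layout: one wordform per line followed by its vector components, all space-delimited. A minimal sketch of reading that format (the parsing inside `measures/cluster.py` may differ):

```python
def load_vectors(lines):
    """Parse GloVe-style lines of the form 'word v1 v2 ... vd' into a dict."""
    vecs = {}
    for line in lines:
        word, *vals = line.split()
        vecs[word] = [float(v) for v in vals]
    return vecs

# Toy 2-dimensional vectors for illustration.
vecs = load_vectors(["red 0.1 0.2", "ball -0.3 0.5"])
print(vecs["red"])  # [0.1, 0.2]
```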
Output is a comma-delimited `clust_pairs.csv` with the following columns:

- `count` -- the count of this pair in `$PAIRS`
- `awf` -- adjective wordform
- `nwf` -- noun wordform
- `acl` -- adjective cluster ID
- `ncl` -- noun cluster ID
Predictors are calculated using `measures/score_adj_pairs.py` with the following arguments:

- `-t $TRIPLES` -- a comma-delimited file with at least the columns `count,adj1_word,adj2_word,noun_word`
- `-s $SUBJ` -- a comma-delimited file containing at least the columns `predicate,response`
Output is a comma-delimited `scores.csv` with the following columns:

- `id` -- the ID of a triple in `$TRIPLES`
- `idx` -- 0 or 1 depending on the position of this adjective in `$TRIPLES`
- `count` -- the count of this triple in `$TRIPLES`
- `awf` -- adjective wordform
- `nwf` -- noun wordform
- `acl` -- adjective cluster ID
- `ncl` -- noun cluster ID
- various predictors named according to the following scheme:
  - `ic_` -- integration cost
  - `ig_` -- information gain
  - `p_` -- log probability
  - `pmi_` -- pointwise mutual information
  - `subj_` -- subjectivity rating
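As an illustration of one predictor in the scheme, pointwise mutual information for an adjective--noun pair is `log p(a,n) - log p(a) - log p(n)`, with probabilities estimated from pair counts. A minimal sketch under that standard definition (the actual estimator in `measures/score_adj_pairs.py` may smooth or condition differently):

```python
import math
from collections import Counter

def pmi(pair_counts, adj, noun):
    """PMI of (adj, noun) under maximum-likelihood estimates from pair counts."""
    total = sum(pair_counts.values())
    p_joint = pair_counts[(adj, noun)] / total
    p_adj = sum(c for (a, _), c in pair_counts.items() if a == adj) / total
    p_noun = sum(c for (_, n), c in pair_counts.items() if n == noun) / total
    return math.log(p_joint) - math.log(p_adj) - math.log(p_noun)

# Toy counts: p(red,ball)=8/12, p(red)=10/12, p(ball)=10/12, so PMI = log 0.96.
counts = Counter({("red", "ball"): 8, ("red", "car"): 2, ("old", "ball"): 2})
print(pmi(counts, "red", "ball"))
```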
Predictions based on the scores in `scores.csv` can be generated with `python measures/predict.py scores.csv`. Output is `deltas.csv`, a comma-delimited file with the following columns:
- `id` -- the ID of a triple in `$TRIPLES`
- `predictor` -- the predictor being run
- `delta` -- the (absolute) difference between the predictor scores for the two adjectives
- `result` -- whether the adjective with the smaller predictor score comes first (0) or second (1)
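A minimal sketch of how `delta` and `result` relate to the two per-adjective scores, assuming `score0` and `score1` are one predictor's scores for the first and second adjective of a triple (the helper name is hypothetical; `measures/predict.py` is the authoritative implementation):

```python
def delta_row(score0, score1):
    """Return (delta, result) for one predictor on one triple.

    delta: absolute score difference between the two adjectives.
    result: 0 if the smaller-scored adjective comes first, else 1.
    """
    delta = abs(score0 - score1)
    result = 0 if score0 < score1 else 1
    return delta, result

print(delta_row(1.0, 3.0))  # (2.0, 0): smaller score is in first position
print(delta_row(3.0, 1.0))  # (2.0, 1): smaller score is in second position
```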
Note that predictors with `None` values in `scores.csv` will not be included in `deltas.csv`. This can happen due to out-of-vocabulary words, adjectives not rated for subjectivity, and so on.
Plots can be generated by running `plots/plot_logistic.py deltas.csv`. A single image (`predictors.png`) will be generated with a plot for each predictor, showing predictive accuracy and area under the curve (AUC) for a logistic regression giving the predicted probability (y-axis) as a function of the difference between the two adjectives' scores (x-axis). Note that if accuracy is less than 0.5 for a given predictor, the polarity of the predictions -- and the resulting logistic regression -- is switched.
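A sketch of that polarity convention, assuming raw accuracy is the fraction of triples where the smaller-scored adjective comes first (i.e. `result == 0` in `deltas.csv`); the plotting script's exact computation may differ:

```python
def oriented_accuracy(results):
    """Accuracy of the 'smaller score comes first' prediction, with polarity flip.

    results: iterable of 0/1 values from the deltas file, where 0 means the
    smaller-scored adjective came first. If raw accuracy falls below 0.5, the
    prediction's polarity is flipped so reported accuracy is always >= 0.5.
    """
    results = list(results)
    acc = sum(r == 0 for r in results) / len(results)
    return max(acc, 1 - acc)

print(oriented_accuracy([0, 0, 0, 1]))  # 0.75: predictor kept as-is
print(oriented_accuracy([1, 1, 1, 0]))  # 0.75: polarity flipped
```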
If you are here for the code used in Futrell (2019), check out the previous version of this repo at #464e24d.