langprocgroup/adjorder

Calculate predictors of adjective order and test them in large dependency treebanks.

Data

External data requirements

In this study, we extracted data from the following external data sources, not included here:

  • A parsed Common Crawl corpus in CoNLLU format, introduced in Futrell et al. (2019).
  • Universal Dependencies 2.4, in particular the English Web Treebank.
  • GloVe embeddings from glove.42B.300d.zip (only needed if you perform your own clustering).

Data provided

English adjective and noun wordforms from CELEX are provided in data/english_adjectives.txt and data/english_nouns.txt.

The files data/subjectivity* are from Scontras et al. (2017): these are subjectivity ratings collected in previous experiments.

The subjectivity ratings collected for this study are at experiments/1-UD-subjectivity/results/adjective-subjectivity.csv.

The tarball data/clust_pairs.tar.gz contains pre-clustered pairs of adjectives and nouns from the Common Crawl corpus.

Extracting adjective data from corpora

The directory corpus_extraction contains scripts for pulling the relevant data out of CoNLLU-formatted dependency treebank files. Assuming your treebank files are at $CORPORA, run the following from within corpus_extraction to extract all adjective-adjective-noun triples:

cat $CORPORA | python extract_conllu.py aan > aan.csv
sh csvcount.sh aan.csv > aan_counts.csv

To extract only the adjectives, run:

cat $CORPORA | python extract_conllu.py a > a.csv
sh csvcount.sh a.csv > a_counts.csv
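
For readers unfamiliar with the CoNLLU format, the following is a minimal sketch of what extracting adjective-adjective-noun triples can look like. It is not the logic of extract_conllu.py: the function names and the selection criteria (a NOUN with exactly two preceding ADJ dependents attached by amod, wordforms lowercased) are assumptions made for illustration.

```python
from collections import Counter

def sentences(lines):
    """Yield each CoNLL-U sentence as a list of 10-column token rows."""
    sent = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line ends the current sentence.
            if sent:
                yield sent
            sent = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Keep plain tokens only; skip multiword ranges (1-2) and empty nodes (1.1).
            if len(cols) == 10 and cols[0].isdigit():
                sent.append(cols)
    if sent:
        yield sent

def aan_triples(sent):
    """Yield (adj1, adj2, noun) for nouns with exactly two preceding amod adjectives."""
    for tok in sent:
        if tok[3] != "NOUN":  # column 4 is UPOS
            continue
        adjs = [
            t for t in sent
            if t[6] == tok[0]          # column 7 is HEAD
            and t[7] == "amod"         # column 8 is DEPREL
            and t[3] == "ADJ"
            and int(t[0]) < int(tok[0])
        ]
        if len(adjs) == 2:
            yield adjs[0][1].lower(), adjs[1][1].lower(), tok[1].lower()

def count_triples(lines):
    """Count each distinct triple across all sentences."""
    return Counter(t for s in sentences(lines) for t in aan_triples(s))
```

Since count_triples accepts any iterable of lines, it can be called on an open file handle or on sys.stdin when CoNLLU data is piped in.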

Clustering (k-means)

Adjectives and nouns are clustered using measures/cluster.py, which takes the following arguments:

  • -v $GLOVE -- a file containing space-delimited wordforms and their vectors
  • -p $PAIRS -- a file containing comma-delimited count,adj,noun rows
  • -k ($ADJ_K,$NOUN_K) -- [optional] a tuple giving the number of clusters k to use for adjectives and for nouns; default is (300,1000)
  • -c $PCA -- [optional] the amount of information to preserve when running PCA; default is 1.0 (no reduction)

Output is a comma-delimited clust_pairs.csv with the following columns:

  1. count -- the count of this pair in $PAIRS
  2. awf -- adjective wordform
  3. nwf -- noun wordform
  4. acl -- adjective cluster ID
  5. ncl -- noun cluster ID
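
As an illustration of how the clust_pairs.csv columns fit together, here is a small sketch that sums pair counts per (adjective cluster, noun cluster) combination. The function name cluster_pair_counts and the sample rows are made up for the example, and whether the real file starts with a header row is an assumption (controlled here by has_header).

```python
import csv
import io
from collections import Counter

# Column order as documented for clust_pairs.csv.
COLUMNS = ["count", "awf", "nwf", "acl", "ncl"]

def cluster_pair_counts(handle, has_header=True):
    """Sum the pair counts for each (acl, ncl) combination."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS)
    if has_header:
        next(reader, None)  # assumption: the first row holds the column names
    totals = Counter()
    for row in reader:
        totals[(row["acl"], row["ncl"])] += int(row["count"])
    return totals

# Hypothetical rows, for illustration only.
sample = io.StringIO(
    "count,awf,nwf,acl,ncl\n"
    "12,big,dog,7,41\n"
    "3,large,cat,7,41\n"
    "5,red,car,19,230\n"
)
totals = cluster_pair_counts(sample)
```

Cluster IDs are kept as strings here, since nothing in the example depends on their numeric value.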

Calculating predictors

Predictors are calculated using measures/score_adj_pairs.py with the following arguments:

  • -t $TRIPLES -- a comma-delimited file with at least the columns [count,adj1_word,adj2_word,noun_word]
  • -s $SUBJ -- a comma-delimited file containing at least the columns [predicate,response]

Output is a comma-delimited scores.csv with the following columns:

  1. id -- the ID of a triple in $TRIPLES
  2. idx -- 0 or 1, depending on the position of this adjective within the triple in $TRIPLES
  3. count -- the count of this triple in $TRIPLES
  4. awf -- adjective wordform
  5. nwf -- noun wordform
  6. acl -- adjective cluster ID
  7. ncl -- noun cluster ID
  8. various predictors named according to the following scheme:
    • ic_ -- integration cost
    • ig_ -- information gain
    • p_ -- log probability
    • pmi_ -- pointwise mutual information
    • subj_ -- subjectivity rating
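
To make one of these predictor families concrete, the sketch below computes pointwise mutual information between adjectives and nouns from count,adj,noun rows, using the textbook maximum-likelihood definition PMI(a, n) = log [p(a, n) / (p(a) p(n))]. It is an assumption-laden illustration, not the code in score_adj_pairs.py: the function name pmi_table is hypothetical, and the repository's estimator may differ in smoothing, log base, or in using clusters rather than wordforms.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Map each (adj, noun) to its PMI in nats, given (count, adj, noun) rows."""
    joint = Counter()
    adj_total = Counter()
    noun_total = Counter()
    for count, adj, noun in pairs:
        joint[(adj, noun)] += count
        adj_total[adj] += count
        noun_total[noun] += count
    n = sum(joint.values())
    # log[(c/n) / ((c_adj/n) * (c_noun/n))] simplifies to log[c * n / (c_adj * c_noun)].
    return {
        (adj, noun): math.log(c * n / (adj_total[adj] * noun_total[noun]))
        for (adj, noun), c in joint.items()
    }
```

With this definition, a pair that co-occurs exactly as often as independence would predict gets a PMI of 0, and pairs that co-occur more often than chance get positive values.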

Running predictors

The predictors calculated and reported in scores.csv can be run with python measures/predict.py scores.csv. Output is deltas.csv, a comma-delimited file with the following columns:

  1. id -- the ID of a triple in $TRIPLES
  2. predictor -- the predictor being run
  3. delta -- the absolute difference between the two adjectives' predictor scores
  4. result -- whether the adjective with the smaller predictor score comes first (0) or second (1).

Note that predictors with None values in scores.csv will not be included in deltas.csv. This can happen due to out-of-vocabulary words, adjectives not rated for subjectivity, and so on.
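
The delta and result columns can be pictured with the following sketch, which is a reading of the column descriptions above rather than the code in predict.py. The function name delta_row is hypothetical, and the handling of ties (treated as result 0) is an assumption.

```python
def delta_row(score_first, score_second):
    """Return (delta, result) for one predictor, or None if a score is missing."""
    # Predictors with a None score are skipped, mirroring their absence from deltas.csv.
    if score_first is None or score_second is None:
        return None
    delta = abs(score_first - score_second)
    # 0: the adjective with the smaller score comes first; 1: it comes second.
    # Assumption: ties count as 0.
    result = 0 if score_first <= score_second else 1
    return delta, result
```

For example, scores of 1.0 for the first adjective and 3.5 for the second give a delta of 2.5 and a result of 0.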

Generating plots

Plots can be generated by running plots/plot_logistic.py deltas.csv. This produces a single image (predictors.png) containing one plot per predictor. Each plot shows a logistic regression of the predicted probability (y-axis) against the difference between the two adjectives' scores (x-axis), along with its predictive accuracy and area under the curve (AUC). Note that if accuracy is less than 0.5 for a given predictor, the polarity of the predictions -- and the resulting logistic regression -- is switched.
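
The polarity switch can be illustrated with a short sketch. It assumes binary (0/1) predictions and outcomes and is not taken from plot_logistic.py; the function name accuracy_with_polarity is hypothetical.

```python
def accuracy_with_polarity(predictions, outcomes):
    """Return (accuracy, flipped), inverting the predictions if accuracy is below 0.5."""
    if not predictions or len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must be non-empty and equal in length")
    correct = sum(p == o for p, o in zip(predictions, outcomes))
    accuracy = correct / len(outcomes)
    if accuracy < 0.5:
        # Inverting every binary prediction turns each miss into a hit and vice versa.
        return 1.0 - accuracy, True
    return accuracy, False
```

The idea is that a predictor that is reliably wrong is as informative as one that is reliably right, so only the direction of its effect changes.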

Previous work

If you are here for the code used in Futrell (2019), check out the previous version of this repo at #464e24d.

About

Predicting adjective order using mutual information and subjectivity
