Calculate predictors of adjective order and test them in large dependency treebanks.
In this study, we extracted data from the following external data sources, not included here:
- A parsed Common Crawl corpus in CoNLLU format, introduced in Futrell et al. (2019).
- Universal Dependencies 2.4, in particular the English Web Treebank
- GloVe embeddings from `glove.42B.300d.zip` (only if performing your own clustering)
English adjective and noun wordforms from CELEX are provided in `data/english_adjectives.txt` and `data/english_nouns.txt`.
The files `data/subjectivity*` are from Scontras et al. (2017): these are subjectivity ratings collected in previous experiments. The subjectivity ratings collected for this study are at `experiments/1-UD-subjectivity/results/adjective-subjectivity.csv`.
The tarball `data/clust_pairs.tar.gz` contains pre-clustered pairs of adjectives and nouns from the Common Crawl corpus.
The directory `corpus_extraction` has scripts for pulling relevant data out of CoNLLU-formatted dependency treebank files. Supposing you have a set of treebank files at location `$CORPORA`, run the following in the directory `corpus_extraction` to get all the adjective--adjective--noun triples:

```
cat $CORPORA | python extract_conllu.py aan > aan.csv
sh csvcount.sh aan.csv > aan_counts.csv
```
and run the following to get just the adjectives:

```
cat $CORPORA | python extract_conllu.py a > a.csv
sh csvcount.sh a.csv > a_counts.csv
```
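The counting step collapses identical extracted rows into count-prefixed records. A minimal sketch of that operation in Python, assuming `csvcount.sh` tallies duplicate rows (the repo's script is the authoritative implementation; the helper name and inputs here are hypothetical):

```python
import csv
from collections import Counter

def count_rows(lines):
    """Collapse identical CSV rows into (count, row) pairs, most frequent first.

    Hypothetical stand-in for csvcount.sh; `lines` is an iterable of CSV lines.
    """
    counts = Counter(tuple(row) for row in csv.reader(lines))
    return [(n, row) for row, n in counts.most_common()]

# Two identical adjective-adjective-noun rows collapse into one counted record.
rows = count_rows(["big,red,ball", "big,red,ball", "old,gray,mare"])
print(rows)  # [(2, ('big', 'red', 'ball')), (1, ('old', 'gray', 'mare'))]
```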
Adjectives and nouns are clustered with `measures/cluster.py` with the following arguments:

- `-v $GLOVE` -- a file containing space-delimited wordforms and their vectors
- `-p $PAIRS` -- a file containing comma-delimited `count,adj,noun` rows
- `-k ($ADJ_K,$NOUN_K)` -- [optional] a tuple listing what k to use for adjectives and nouns; default is `(300,1000)`
- `-c $PCA` -- [optional] the amount of information to preserve when running PCA; default is 1.0 (no reduction)
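The `-v` file follows the standard GloVe text layout: one wordform per line followed by its vector components, all space-delimited. A minimal sketch of reading that format (the parsing inside `measures/cluster.py` may differ):

```python
def load_vectors(lines):
    """Parse GloVe-style lines of the form 'word v1 v2 ... vd' into a dict."""
    vecs = {}
    for line in lines:
        word, *vals = line.split()
        vecs[word] = [float(v) for v in vals]
    return vecs

# Toy 2-dimensional vectors for illustration.
vecs = load_vectors(["red 0.1 0.2", "ball -0.3 0.5"])
print(vecs["red"])  # [0.1, 0.2]
```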
Output is a comma-delimited `clust_pairs.csv` with the following columns:

- `count` -- the count of this pair in `$PAIRS`
- `awf` -- adjective wordform
- `nwf` -- noun wordform
- `acl` -- adjective cluster ID
- `ncl` -- noun cluster ID
Predictors are calculated using `measures/score_adj_pairs.py` with the following arguments:

- `-t $TRIPLES` -- a comma-delimited file with at least the columns `count,adj1_word,adj2_word,noun_word`
- `-s $SUBJ` -- a comma-delimited file containing at least the columns `predicate,response`
Output is a comma-delimited `scores.csv` with the following columns:

- `id` -- the ID of a triple in `$TRIPLES`
- `idx` -- 0 or 1 depending on the position of this adjective in `$TRIPLES`
- `count` -- the count of this triple in `$TRIPLES`
- `awf` -- adjective wordform
- `nwf` -- noun wordform
- `acl` -- adjective cluster ID
- `ncl` -- noun cluster ID
- various predictors named according to the following scheme:
  - `ic_` -- integration cost
  - `ig_` -- information gain
  - `p_` -- log probability
  - `pmi_` -- pointwise mutual information
  - `subj_` -- subjectivity rating
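As an illustration of one predictor in the scheme, pointwise mutual information for an adjective--noun pair is `log p(a,n) - log p(a) - log p(n)`, with probabilities estimated from pair counts. A minimal sketch under that standard definition (the actual estimator in `measures/score_adj_pairs.py` may smooth or condition differently):

```python
import math
from collections import Counter

def pmi(pair_counts, adj, noun):
    """PMI of (adj, noun) under maximum-likelihood estimates from pair counts."""
    total = sum(pair_counts.values())
    p_joint = pair_counts[(adj, noun)] / total
    p_adj = sum(c for (a, _), c in pair_counts.items() if a == adj) / total
    p_noun = sum(c for (_, n), c in pair_counts.items() if n == noun) / total
    return math.log(p_joint) - math.log(p_adj) - math.log(p_noun)

# Toy counts: p(red,ball)=8/12, p(red)=10/12, p(ball)=10/12, so PMI = log 0.96.
counts = Counter({("red", "ball"): 8, ("red", "car"): 2, ("old", "ball"): 2})
print(pmi(counts, "red", "ball"))
```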
Predictions based on the scores in `scores.csv` can be generated with `python measures/predict.py scores.csv`. Output is `deltas.csv`, a comma-delimited file with the following columns:
- `id` -- the ID of a triple in `$TRIPLES`
- `predictor` -- the predictor being run
- `delta` -- the (absolute) difference between the predictor scores for the two adjectives
- `result` -- whether the adjective with the smaller predictor score comes first (0) or second (1)
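A minimal sketch of how `delta` and `result` relate to the two per-adjective scores, assuming `score0` and `score1` are one predictor's scores for the first and second adjective of a triple (the helper name is hypothetical; `measures/predict.py` is the authoritative implementation):

```python
def delta_row(score0, score1):
    """Return (delta, result) for one predictor on one triple.

    delta: absolute score difference between the two adjectives.
    result: 0 if the smaller-scored adjective comes first, else 1.
    """
    delta = abs(score0 - score1)
    result = 0 if score0 < score1 else 1
    return delta, result

print(delta_row(1.0, 3.0))  # (2.0, 0): smaller score is in first position
print(delta_row(3.0, 1.0))  # (2.0, 1): smaller score is in second position
```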
Note that predictors with `None` values in `scores.csv` will not be included in `deltas.csv`. This can happen due to out-of-vocabulary words, adjectives not rated for subjectivity, and so on.
Plots can be generated by running `plots/plot_logistic.py deltas.csv`. A single image (`predictors.png`) will be generated with a plot for each predictor, showing predictive accuracy and area under the curve (AUC) for a logistic regression giving the predicted probability (y-axis) as a function of the difference between the two adjectives' scores (x-axis). Note that if accuracy is less than 0.5 for a given predictor, the polarity of the predictions -- and the resulting logistic regression -- is switched.
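A sketch of that polarity convention, assuming raw accuracy is the fraction of triples where the smaller-scored adjective comes first (i.e. `result == 0` in `deltas.csv`); the plotting script's exact computation may differ:

```python
def oriented_accuracy(results):
    """Accuracy of the 'smaller score comes first' prediction, with polarity flip.

    results: iterable of 0/1 values from the deltas file, where 0 means the
    smaller-scored adjective came first. If raw accuracy falls below 0.5, the
    prediction's polarity is flipped so reported accuracy is always >= 0.5.
    """
    results = list(results)
    acc = sum(r == 0 for r in results) / len(results)
    return max(acc, 1 - acc)

print(oriented_accuracy([0, 0, 0, 1]))  # 0.75: predictor kept as-is
print(oriented_accuracy([1, 1, 1, 0]))  # 0.75: polarity flipped
```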
If you are here for the code used in Futrell (2019), check out the previous version of this repo at #464e24d.