The embedding spaces used in this study were trained using the PPMI+SVD implementation from Levy et al. (2015). The English 260 MB embeddings are contained in the embeddings directory. The complete set of pre-trained embedding spaces can be downloaded via
wget https://www.ling.uni-potsdam.de/~beyer/embeddings.zip
Other resources (e.g. vocab files, corpora and embeddings from the simulation study) are available upon request (anne.beyer@uni-potsdam.de).
CCA measure:
- python3
- numpy
- sklearn
- gensim
Simulation study:
The PPMI+SVD embeddings in the simulation study require Python 2.7 (see link above for further dependencies).
As the other project scripts are written in Python 3.6, conda is used to switch between environments (ebeddings and embeddings2) in corpus_simulation.sh
mapping_correlation.sh
assumes a directory "embeddings" containing the pre-trained embedding spaces (adapt paths as appropriate) and creates a directory "correlations". There it computes the correlation scores for all corpus combinations described in the paper and creates a visualization of the dimension-wise correlations. The CCA measure scores are printed to stdout.
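To make the idea of dimension-wise correlations more concrete, the following sketch maps one space onto another and then correlates each mapped dimension with its counterpart. It assumes a plain least-squares linear mapping and toy data; the actual mapping and scoring used by mapping_correlation.sh may differ, and `dimensionwise_correlations` is a hypothetical name.

```python
import numpy as np


def dimensionwise_correlations(X, Y):
    """Map X onto Y by least squares, then correlate each mapped dimension with Y.

    Hypothetical helper for illustration; rows of X and Y must be aligned.
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    mapped = X @ W
    # Pearson correlation per column, computed on centered data.
    mapped_c = mapped - mapped.mean(axis=0)
    Y_c = Y - Y.mean(axis=0)
    num = (mapped_c * Y_c).sum(axis=0)
    den = np.sqrt((mapped_c ** 2).sum(axis=0) * (Y_c ** 2).sum(axis=0))
    return num / den


# Toy data (assumption): Y is a noisy linear transformation of X.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 8))
Y = X @ rng.standard_normal((8, 8)) + 0.1 * rng.standard_normal((300, 8))

corrs = dimensionwise_correlations(X, Y)
print(corrs.round(3))
```

The result is one correlation value per dimension, which is the kind of vector that could be plotted as a bar chart or heatmap row.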