AnneBeyer/emb_sim

Code for experiments on measuring domain similarity based on embedding spaces

Embedding spaces

The embedding spaces used in this study were trained using the PPMI+SVD implementation from Levy et al. (2015). The English 260 MB embeddings are contained in the embeddings directory. The complete set of pre-trained embedding spaces can be downloaded via

wget https://www.ling.uni-potsdam.de/~beyer/embeddings.zip
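For orientation, the general PPMI+SVD idea can be sketched in a few lines of numpy. This is a toy illustration only, not the Levy et al. (2015) implementation used for the embeddings above. The function name `ppmi_svd`, the dense co-occurrence matrix input and the `eig` weighting parameter are assumptions made for this example.

```python
import numpy as np


def ppmi_svd(counts, dim, eig=0.5):
    """Toy PPMI+SVD embedding from a dense (words x contexts) count matrix.

    `eig` is a hypothetical parameter controlling how strongly the singular
    values weight the resulting dimensions (0 = not at all, 1 = fully).
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)  # word marginals
    col = counts.sum(axis=0, keepdims=True)  # context marginals

    # PMI = log( P(w, c) / (P(w) * P(c)) ); zero counts yield -inf or nan.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))

    # Positive PMI: keep only finite, positive values.
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

    # Truncated SVD: keep the top `dim` singular vectors.
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim] * s[:dim] ** eig


if __name__ == "__main__":
    toy_counts = np.array([[4, 0, 1], [0, 3, 1], [1, 1, 2]])
    print(ppmi_svd(toy_counts, dim=2).shape)
```

A real corpus would need sparse matrices and a sparse SVD routine. The dense version here is only meant to make the two steps (PPMI weighting, then dimensionality reduction) visible.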

Other resources (e.g. vocab files, corpora and embeddings from the simulation study) are available upon request (anne.beyer@uni-potsdam.de).

Requirements

CCA measure:

  • python3
  • numpy
  • sklearn
  • gensim

Simulation study:
The PPMI+SVD embeddings in the simulation study require Python 2.7 (see link above for further dependencies).
As the other project scripts are written in Python 3.6, conda is used to switch environments (ebeddings and embeddings2) in corpus_simulation.sh.

CCA Measure

mapping_correlation.sh assumes a directory "embeddings" containing the pre-trained embedding spaces (adapt paths as appropriate). It creates a directory "correlations", where it computes the correlation scores for all corpus combinations described in the paper and creates a visualization of the dimension-wise correlations. The CCA measure scores are printed to stdout.
