Code for experiments on measuring domain similarity based on embedding spaces

Embedding spaces

The embedding spaces used in this study were trained using the PPMI+SVD implementation from Levy et al. (2015). The English 260 MB embeddings are contained in the embeddings directory. The complete set of pre-trained embedding spaces can be downloaded via

wget https://www.ling.uni-potsdam.de/~beyer/embeddings.zip

Other resources (e.g. vocab files, corpora and embeddings from the simulation study) are available upon request (anne.beyer@uni-potsdam.de).

Requirements

CCA measure:

python3
numpy
sklearn
gensim

Simulation study:
The PPMI+SVD embeddings in the simulation study require Python 2.7 (see link above for further dependencies).
As the other project scripts are written in Python 3.6, corpus_simulation.sh uses conda to switch between environments (ebeddings and embeddings2).

CCA Measure

mapping_correlation.sh expects a directory "embeddings" containing the pre-trained embedding spaces (adapt paths as appropriate). It creates a directory "correlations", in which it computes the correlation scores for all corpus combinations described in the paper and creates a visualization of the dimension-wise correlations. The CCA measure scores are printed to stdout.
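For readers unfamiliar with CCA, the following is a minimal sketch of how dimension-wise canonical correlations between two embedding spaces could be computed with numpy alone. It is not the code from mapping_correlation.py. The helper names (align_on_shared_vocab, canonical_correlations), the word-to-vector dictionaries used as input, the synthetic data, and the choice to average the correlations into a single score are all assumptions made for illustration.

```python
import numpy as np


def align_on_shared_vocab(emb_a, emb_b):
    """Stack the vectors of words present in both spaces, in the same row order.

    emb_a and emb_b are hypothetical dicts mapping word -> 1-D numpy vector.
    """
    shared = sorted(set(emb_a) & set(emb_b))
    A = np.stack([emb_a[w] for w in shared])
    B = np.stack([emb_b[w] for w in shared])
    return A, B


def canonical_correlations(X, Y):
    """Canonical correlations between two row-aligned matrices.

    Assumes more rows than columns and full column rank in both matrices.
    """
    # Center each space so correlations are measured around the mean.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the two column spaces.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx^T Qy are the canonical correlations,
    # returned in descending order.
    corrs = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(corrs, 0.0, 1.0)


# Synthetic stand-ins for embedding spaces (assumption: 500 words, 20 dimensions).
rng = np.random.default_rng(0)
words = [f"w{i}" for i in range(500)]
base = rng.normal(size=(500, 20))
rotation, _ = np.linalg.qr(rng.normal(size=(20, 20)))

space_a = dict(zip(words, base))
# A rotated, slightly noisy copy of space_a: a "related" space.
space_b = dict(zip(words, base @ rotation + 0.1 * rng.normal(size=(500, 20))))
# An independent random space: an "unrelated" space.
space_c = dict(zip(words, rng.normal(size=(500, 20))))

A, B = align_on_shared_vocab(space_a, space_b)
score_related = canonical_correlations(A, B).mean()

A, C = align_on_shared_vocab(space_a, space_c)
score_unrelated = canonical_correlations(A, C).mean()

print(f"related:   {score_related:.3f}")
print(f"unrelated: {score_unrelated:.3f}")
```

In this sketch the related pair scores higher than the unrelated pair, because CCA is invariant to the rotation applied to the second space. The individual values returned by canonical_correlations correspond to the dimension-wise correlations that a visualization could plot.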
