CLUSE: Cross-Lingual Unsupervised Sense Embeddings


CLUSE is an unsupervised learning framework for cross-lingual sense embeddings. Its goal is to provide the community with:

  • state-of-the-art multilingual sense embeddings aligned in a common space
  • a large-scale, high-quality English-Chinese contextual similarity evaluation dataset


Get training & evaluation datasets

Get training dataset for English-German parallel corpus: Europarl.

Get training dataset for English-Chinese parallel corpus: UM-Corpus.

Get mono-lingual sense embeddings evaluation dataset: SCWS.

Get cross-lingual sense embeddings evaluation dataset: BCWS.

Please cite the corresponding papers if you use the above datasets.

Data preprocessing

All the data are in the data/ directory. You can safely download the preprocessed data from Unzip it and replace the old data/ directory.

Alternatively, you can preprocess the data yourself.

First put the bcws.txt from BCWS into data/en_ch/. Then put the ratings.txt from SCWS into data/en_ch/ and data/en_de/.

Since this work requires a parallel corpus, you have to prepare two files for each language pair. The two files must have the same number of lines, so that the sentences on the same line number form a parallel sentence pair.
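Before running the preprocessing, it can help to verify that the two files really are line-aligned. The following is a minimal sketch of such a check, not part of this repository; the function names `count_sentence_pairs` and `check_parallel_files` are hypothetical.

```python
from itertools import zip_longest


def count_sentence_pairs(src_lines, tgt_lines):
    """Count aligned sentence pairs; raise ValueError if one side ends first."""
    pairs = 0
    for line_no, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        # zip_longest pads the shorter input with None once it is exhausted.
        if src is None or tgt is None:
            raise ValueError(f"line counts differ: one file ends before line {line_no}")
        pairs += 1
    return pairs


def check_parallel_files(src_path, tgt_path):
    """Stream both files line by line so large corpora need not fit in memory."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        return count_sentence_pairs(src, tgt)
```

For instance, `check_parallel_files("en.txt", "ch.txt")` would return the number of sentence pairs, or raise if the two files differ in length.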

For example, to prepare the training and evaluation data for the English-German language pair,

cd data/en_de/
bash english_parallel german_parallel english_vocab_size german_vocab_size

To reproduce the results in the paper,

bash 6000 6000

will generate all the training and evaluation files.
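The two vocabulary-size arguments cap how many distinct words each language keeps. As a rough illustration only, the sketch below assumes whitespace tokenization, keeps the most frequent words, and maps everything else to an unknown token; the actual preprocessing script may work differently, and `build_vocab`, `encode`, and `<unk>` are hypothetical names.

```python
from collections import Counter


def build_vocab(sentences, vocab_size, unk_token="<unk>"):
    """Map the (vocab_size - 1) most frequent tokens to ids; id 0 is the unknown token."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    most_frequent = [tok for tok, _ in counts.most_common(vocab_size - 1)]
    return {tok: idx for idx, tok in enumerate([unk_token] + most_frequent)}


def encode(sentence, vocab, unk_token="<unk>"):
    """Replace each token with its id, falling back to the unknown id."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in sentence.split()]
```

With a cap of 6000, as in the command above, each language would keep at most 6000 entries under this assumption, including the unknown token.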


cd data/en_ch/
bash en.txt ch.txt 6000 6000

will generate all the training and evaluation files for the English-Chinese language pair. Note that UM-Corpus covers several domains; we simply concatenate all the files.


To train the English-German sense embeddings:

cd en_de/
bash checkpoint_dir major_weight reg_weight

For example,

bash log 0.5 1.0

will train the model and save the checkpoint files to the log directory with the specified major weight and regularization weight. For details, please refer to the paper.


cd en_ch/
bash checkpoint_dir major_weight reg_weight

will train the model for the English-Chinese sense embeddings.


You will see the Spearman correlation scores on SCWS/BCWS during the training process.
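For readers unfamiliar with the metric, Spearman correlation is the Pearson correlation of the rank-transformed scores, here between model similarities and human ratings. The sketch below is a dependency-free illustration, not the evaluation code of this repository; the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(model_scores, human_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(model_scores), average_ranks(human_ratings))
```

Only the ordering of the scores matters, so any monotonic rescaling of the model similarities leaves the result unchanged.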

To evaluate the trained models:

cd en_de/ or cd en_ch/
bash path_to_ckpt

will evaluate on SCWS/BCWS again and dump the trained sense embeddings.

To decode the sense for a specific word with its context,

cd en_de/ or cd en_ch/
bash path_to_ckpt

Note that only English input is currently supported.

Results (AvgSimC / MaxSimC)

Model                   Bilingual Weight   Bilingual (BCWS)
Luong et al. (2015)     -                  50.4
Conneau et al. (2017)   -                  54.7
CLUSE                   0.1                58.3 / 58.3
                        0.3                58.8 / 58.8
                        0.5                58.5 / 58.5
                        0.7                58.3 / 58.4
                        0.9                58.3 / 58.3
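AvgSimC and MaxSimC are the standard contextual similarity measures for multi-sense embeddings: AvgSimC weights the similarity of every sense pair by the sense probabilities given each context, while MaxSimC compares only the most probable sense of each word. The sketch below illustrates both with cosine similarity; it is not taken from this repository, and the function names and the toy inputs are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def avg_sim_c(senses_a, probs_a, senses_b, probs_b):
    """Expected similarity over all sense pairs, weighted by sense probabilities."""
    return sum(
        pa * pb * cosine(sa, sb)
        for sa, pa in zip(senses_a, probs_a)
        for sb, pb in zip(senses_b, probs_b)
    )


def max_sim_c(senses_a, probs_a, senses_b, probs_b):
    """Similarity between the most probable sense of each word."""
    best_a = max(range(len(probs_a)), key=lambda i: probs_a[i])
    best_b = max(range(len(probs_b)), key=lambda j: probs_b[j])
    return cosine(senses_a[best_a], senses_b[best_b])


# Toy example: two words with two senses each, in a 2-dimensional space.
senses_a, probs_a = [[1.0, 0.0], [0.0, 1.0]], [0.8, 0.2]
senses_b, probs_b = [[1.0, 0.0], [0.0, 1.0]], [0.3, 0.7]
print(avg_sim_c(senses_a, probs_a, senses_b, probs_b))
print(max_sim_c(senses_a, probs_a, senses_b, probs_b))
```

The two numbers in each CLUSE row of the table correspond to these two measures, in that order.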


Please cite [1] if you find the resources in this repository useful, and cite [2] if you use the BCWS dataset.

[1] Ta-Chung Chi and Yun-Nung Chen, CLUSE: Cross-Lingual Unsupervised Sense Embeddings

  author    = {Chi, Ta-Chung  and  Chen, Yun-Nung},
  title     = {Cluse: Cross-lingual unsupervised sense embeddings},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},

BCWS: Bilingual Contextual Word Similarity

[2] Ta-Chung Chi, Ching-Yen Shih and Yun-Nung Chen, BCWS: Bilingual Contextual Word Similarity

  title={BCWS: Bilingual Contextual Word Similarity},
  author={Ta-Chung Chi and Ching-Yen Shih and Yun-Nung Chen},
  journal={arXiv preprint arXiv:},


This project is supported by Google Faculty Research Awards and MOST.


