This project will no longer be maintained by Intel.
Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.
Intel no longer accepts patches to this project.
If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.
This repository hosts the source code for an efficient implementation of "Word Mover's Distance" (WMD) using the Sinkhorn-Knopp algorithm. Paper reference will be added upon publication.
- Requirement: GCC 7.1.0 or higher.
- Source your Intel compiler (icc) environment script: source your_icc_compiler
- Build by sourcing the compile script: source compile

Download the embedding file from https://www.kaggle.com/datasets/yekenot/fasttext-crawl-300d-2m. The file is not included in this repository because it is large.
Then perform the following steps to prepare the input file:

- Take the first 100001 lines (the header line plus 100000 word vectors): head -n100001 crawl-300d-2M.vec > test.out
- Remove the first (header) line: sed '1d' test.out > test2.out
- Remove the first column (the word) from each line: cut -d" " -f2- test2.out > data/vecs.out
- Discard the temporary files: rm test.out test2.out
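
The preparation steps above can also be run as a single pipeline without temporary files. The following is a minimal sketch, not part of this repository: the function names prepare_vecs and check_vecs are hypothetical, and the default dimension of 300 is an assumption based on the name of the embedding file.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this repository).

# prepare_vecs INPUT OUTPUT [COUNT]
# Skips the header line of INPUT, keeps the next COUNT lines (default 100000),
# and drops the first column (the word), writing the vectors to OUTPUT.
prepare_vecs() {
    in_file=$1
    out_file=$2
    count=${3:-100000}
    # Create the output directory (e.g. data/) if it does not exist yet.
    mkdir -p "$(dirname "$out_file")"
    tail -n +2 "$in_file" | head -n "$count" | cut -d' ' -f2- > "$out_file"
}

# check_vecs FILE [DIM]
# Succeeds only if every line of FILE has exactly DIM fields
# (default 300, assumed from the name of the embedding file).
check_vecs() {
    awk -v dim="${2:-300}" \
        'BEGIN { bad = 0 } NF != dim { bad = 1; exit } END { exit bad }' "$1"
}
```

For example: prepare_vecs crawl-300d-2M.vec data/vecs.out && check_vecs data/vecs.out. This selects the same lines and columns as the head, sed, and cut steps above.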

Set KMP_AFFINITY before running. For example: export KMP_AFFINITY=compact,1,0,granularity=fine

Run the executable: ./name_of_executable

A small test input is also provided in the data directory (v2, r2, sample.mat). To use it, set the input files in the program accordingly and set the word2vec size to 3.

@article{tithi2020efficient,
  title={An Efficient Shared-memory Parallel Sinkhorn-Knopp Algorithm to Compute the Word Mover's Distance},
  author={Tithi, Jesmin Jahan and Petrini, Fabrizio},
  journal={arXiv preprint arXiv:2005.06727},
  year={2020}
}