GitHub - AotY/MPyWE: Morpheme, Pinyin Enhanced Word Embedding

A Python implementation of the Continuous Bag of Words (CBOW) and skip-gram neural network architectures, and the hierarchical softmax and negative sampling learning algorithms for efficient learning of word vectors (Mikolov, et al., 2013a, b, c; http://code.google.com/p/word2vec/).

Usage

To train word vectors:

word2vec.py [-h] -train FI -model FO [-cbow CBOW] [-negative NEG]
            [-dim DIM] [-alpha ALPHA] [-window WIN]
            [-min-count MIN_COUNT] [-processes NUM_PROCESSES]
            [-binary BINARY]

required arguments:
  -train FI                 Training file
  -model FO                 Output model file

optional arguments:
  -h, --help                show this help message and exit
  -cbow CBOW                1 for CBOW, 0 for skip-gram
  -negative NEG             Number of negative examples (>0) for negative sampling, 
                            0 for hierarchical softmax
  -dim DIM                  Dimensionality of word embeddings
  -alpha ALPHA              Starting learning rate
  -window WIN               Max window length
  -min-count MIN_COUNT      Min count for words used to learn <unk>
  -processes NUM_PROCESSES  Number of processes
  -binary BINARY            1 for output model in binary format, 0 otherwise

Each sentence in the training file is expected to be newline separated.

Implementation Details

Written in Python 2.7.6 and NumPy 1.9.1.

Evaluation

Accuracy (%) on the word analogy task compared against the original C implementation (in parentheses). Trained on a preprocessed version of the first 10⁸ bytes of the English Wikipedia dump on March 3, 2006 (http://mattmahoney.net/dc/textdata.html).

Model	Total	Semantic	Syntactic
CBOW HS	6.76 (6.90)	4.86 (3.61)	7.93 (8.93)
CBOW NS	4.52 (6.72)	3.94 (3.74)	4.88 (8.56)
Skip-gram HS	14.76 (14.59)	11.40 (10.40)	16.83 (17.18)
Skip-gram NS	8.43 (7.72)	4.91 (4.62)	10.62 (9.63)

References

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013a). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. http://arxiv.org/pdf/1301.3781.pdf

Mikolov, T., Yih, W., & Zweig, G. (2013c). Linguistic Regularities in Continuous Space Word Representations. HLT-NAACL. http://msr-waypoint.com/en-us/um/people/gzweig/Pubs/NAACL2013Regularities.pdf

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.idea		.idea
CA8		CA8
dict		dict
people's_daily		people's_daily
t_sne		t_sne
word_analogy		word_analogy
word_score		word_score
word_similarity		word_similarity
.gitignore		.gitignore
cpywe.py		cpywe.py
cpywtfidfe.py		cpywtfidfe.py
cpywtfidfe2.py		cpywtfidfe2.py
ctfidfwe.py		ctfidfwe.py
ctfidfwtfidfe.py		ctfidfwtfidfe.py
cwe.py		cwe.py
cwtfidfe.py		cwtfidfe.py
gradient_check.py		gradient_check.py
mpywe.py		mpywe.py
mpywtextranke.py		mpywtextranke.py
mpywtfidfe.py		mpywtfidfe.py
mtfidfwe.py		mtfidfwe.py
mwe.py		mwe.py
mwtfidfe.py		mwtfidfe.py
pytfidfwe.py		pytfidfwe.py
pytfidfwtfidfe.py		pytfidfwtfidfe.py
pywe.py		pywe.py
pywtfidfe.py		pywtfidfe.py
readme.md		readme.md
word2vec.py		word2vec.py
wtfidfe.py		wtfidfe.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Usage

Implementation Details

Evaluation

References

About

Releases

Packages

Contributors 2

Languages

AotY/MPyWE

Folders and files

Latest commit

History

Repository files navigation

Usage

Implementation Details

Evaluation

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages