ngram2vec

Embeddings for n-grams via sampling.

Learning a word2vec model (gensim word2vec)

Parameters for extraction are set inside learnmdl.py and can be changed there.

$ python3 learnmdl.py preproc.data.en model.en

The resulting model.en can then be loaded and used in Python as a gensim word2vec model.
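
A minimal loading sketch, assuming learnmdl.py saves the model with gensim's model.save() and that n-grams appear in the vocabulary as plain tokens (both are assumptions, not verified against the script):

from gensim.models import Word2Vec

# Load the trained model (assumes learnmdl.py saved it with model.save("model.en"))
model = Word2Vec.load("model.en")

# Look up a vector and nearest neighbours; the exact n-gram token format
# (e.g. words joined by a separator) depends on the extraction settings.
vec = model.wv["the"]
print(model.wv.most_similar("the", topn=5))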

Extracting only ngrams

Extracting only the n-grams is useful mainly because fastText in Python is very slow, so the compiled C++ code is used for training. It also adds modularity: the n-grams are extracted once, and then any embeddings can be trained on them, whether different models (word2vec, GloVe, fastText, etc.) or different hyperparameters.

Example use with fasttext (C++ compiled):

$ python3 extract_ngrams.py data.clean.en data.ngrams.en
$ ./fasttext cbow -input data.ngrams.en -output ngram.mdl.d3.en -thread 16 -dim 300
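
To illustrate the modularity, the same extracted file can also be fed to gensim word2vec instead of fastText. A minimal sketch, assuming data.ngrams.en holds one sentence of space-separated n-gram tokens per line (the same plain-text format the fastText command above reads); the hyperparameters are illustrative:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the extracted n-gram corpus line by line
sentences = LineSentence("data.ngrams.en")

# Train a CBOW word2vec model (sg=0) with 300-dimensional vectors
model = Word2Vec(sentences, vector_size=300, sg=0, workers=16)
model.save("ngram.w2v.en")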
