ngram2vec

Embeddings for n-grams via sampling.

Learning a word2vec model (gensim word2vec)

Parameters for extraction are set inside learnmdl.py and can be changed there.

$ python3 learnmdl.py preproc.data.en model.en

The resulting model.en can then be loaded and used in Python as a gensim word2vec model.
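
A minimal loading sketch, assuming learnmdl.py saves the model with gensim's model.save() and that n-grams appear in the vocabulary as plain tokens (both are assumptions, not verified against the script):

from gensim.models import Word2Vec

# Load the trained model (assumes learnmdl.py saved it with model.save("model.en"))
model = Word2Vec.load("model.en")

# Look up a vector and nearest neighbours; the exact n-gram token format
# (e.g. words joined by a separator) depends on the extraction settings.
vec = model.wv["the"]
print(model.wv.most_similar("the", topn=5))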

Extracting only ngrams

Extracting only the n-grams is useful mainly because fastText in Python is very slow, so the compiled C++ code is used for training. It also adds modularity: the n-grams are extracted once, and then any embeddings can be trained on them, whether different models (word2vec, GloVe, fastText, etc.) or different hyperparameters.

Example use with fasttext (C++ compiled):

$ python3 extract_ngrams.py data.clean.en data.ngrams.en
$ ./fasttext cbow -input data.ngrams.en -output ngram.mdl.d3.en -thread 16 -dim 300
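
To illustrate the modularity, the same extracted file can also be fed to gensim word2vec instead of fastText. A minimal sketch, assuming data.ngrams.en holds one sentence of space-separated n-gram tokens per line (the same plain-text format the fastText command above reads); the hyperparameters are illustrative:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the extracted n-gram corpus line by line
sentences = LineSentence("data.ngrams.en")

# Train a CBOW word2vec model (sg=0) with 300-dimensional vectors
model = Word2Vec(sentences, vector_size=300, sg=0, workers=16)
model.save("ngram.w2v.en")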
