Word2Vec

Julia interface to word2vec

Word2Vec takes a text corpus as input and produces word vectors as output. Training is done using the original C code; all other functionality is pure Julia. See demo for more details.

Installation

using Pkg
Pkg.add("Word2Vec")

Note: Only Linux and OS X are supported.

Functions

All exported functions are documented: type ?functionname at the Julia prompt to get help. For a list of functions, see here.

Examples

First, download and unzip a text corpus, for example http://mattmahoney.net/dc/text8.zip.

Suppose the file text8 is stored in the current working directory. We can train the model with the function word2vec.

julia> word2vec("text8", "text8-vec.txt", verbose = true)
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.04%  Words/thread/sec: 350.44k  

Now we can load the word vectors from text8-vec.txt into Julia.

julia> model = wordvectors("./text8-vec.txt")
WordVectors 71291 words, 100-element Float64 vectors
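
To make the contents of the vector file more concrete, here is a minimal sketch of a reader for word2vec's plain-text output. It assumes the usual layout: a header line with the vocabulary size and the vector size, then one line per word holding the word followed by its components. The function read_text_vectors and the toy input are hypothetical and only illustrate the format; they are not the package's own loader.

```julia
# Hypothetical helper, not part of Word2Vec.jl: parse word2vec's plain-text output.
# Assumed layout: a header line "<vocab_size> <vector_size>", then one line per
# word of the form "<word> <v1> <v2> ... <vN>".
function read_text_vectors(io::IO)
    header = split(readline(io))
    vocab_size = parse(Int, header[1])
    vector_size = parse(Int, header[2])
    vocab = Vector{String}(undef, vocab_size)
    vectors = Matrix{Float64}(undef, vector_size, vocab_size)  # one column per word
    for i in 1:vocab_size
        fields = split(readline(io))
        vocab[i] = fields[1]
        vectors[:, i] = parse.(Float64, fields[2:end])
    end
    return vocab, vectors
end

# Toy input standing in for a real file such as text8-vec.txt.
toy = IOBuffer("2 3\nbook 0.1 0.2 0.3\nnovel 0.2 0.1 0.4\n")
vocab, vectors = read_text_vectors(toy)
```

Storing one word per column keeps each vector contiguous in memory, since Julia arrays are column-major.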

The vector representation of a word can be obtained using get_vector.

julia> get_vector(model, "book")
100-element Array{Float64,1}:
 -0.05446138539336186
  0.001090934639284009
  0.06498087707990222
  ⋮
 -0.0024113040415322516
  0.04755140828570571
  0.039764719065723826

The words most similar to book, ranked by cosine similarity, can be retrieved using cosine_similar_words.

julia> cosine_similar_words(model, "book")
10-element Array{String,1}:
 "book"
 "books"
 "diary"
 "story"
 "chapter"
 "novel"
 "preface"
 "poem"
 "tale"
 "bible"
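
As a rough illustration of the ranking behind such a query, the sketch below computes cosine similarity by hand on a few made-up vectors. The names cosine, most_similar, and toy_vectors are hypothetical and the numbers are not taken from text8; this shows the idea, not the package's implementation.

```julia
using LinearAlgebra

# Cosine similarity: dot product divided by the product of the norms.
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Toy 3-dimensional vectors, made up for illustration (not taken from text8).
toy_vectors = Dict(
    "book"  => [0.9, 0.1, 0.0],
    "novel" => [0.8, 0.3, 0.1],
    "river" => [0.0, 0.2, 0.9],
)

# Rank every word by its similarity to the query, highest first.
function most_similar(vectors, query, n)
    scores = [(word, cosine(vectors[query], v)) for (word, v) in vectors]
    sort!(scores, by = last, rev = true)
    return first.(scores[1:min(n, length(scores))])
end

most_similar(toy_vectors, "book", 2)
```

The query word itself ranks first because its similarity to itself is 1, which matches "book" heading the list above.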

Word vectors have many interesting properties. For example, vector("king") - vector("man") + vector("woman") is close to vector("queen").

julia> analogy_words(model, ["king", "woman"], ["man"])
5-element Array{String,1}:
 "queen"
 "empress"
 "prince"
 "princess"
 "throne"
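
The arithmetic behind such an analogy can be sketched on toy data: add the "positive" vectors, subtract the "negative" ones, and rank the remaining words by cosine similarity to the result. Everything in this sketch (analogy, toy_vectors, the two-dimensional values) is a made-up assumption for illustration. Leaving the query words out of the candidates is a choice made here, not a statement about the package's internals.

```julia
using LinearAlgebra

cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Toy 2-dimensional vectors, made up so that the first axis loosely tracks
# "royalty" and the second "maleness".
toy_vectors = Dict(
    "king"  => [0.9, 0.9],
    "queen" => [0.9, 0.1],
    "man"   => [0.1, 0.9],
    "woman" => [0.1, 0.1],
    "apple" => [0.2, 0.5],
)

# Return the n words closest to sum(positive) - sum(negative),
# skipping the words used to build the query.
function analogy(vectors, positive, negative, n)
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in keys(vectors) if !(w in positive) && !(w in negative)]
    sort!(candidates, by = w -> cosine(vectors[w], target), rev = true)
    return candidates[1:min(n, length(candidates))]
end

analogy(toy_vectors, ["king", "woman"], ["man"], 1)
```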

References

  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space". In Proceedings of Workshop at ICLR, 2013.

  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. "Distributed Representations of Words and Phrases and their Compositionality". In Proceedings of NIPS, 2013.

  • Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations". In Proceedings of NAACL HLT, 2013.

Acknowledgements

The design of the package is inspired by Daniel Rodriguez (@danielfrg)'s Python word2vec interface.

Reporting Bugs

Please file an issue to report a bug or request a feature.
