# Python2Vec Example

This example uses a pretrained model, which was trained with Spark MLlib's `Word2Vec` using the following parameters:

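```
from pyspark.mllib.feature import Word2Vec

word2vec = Word2Vec()
word2vec.setMinCount(25)         # ignore words seen fewer than 25 times
word2vec.setLearningRate(0.025)  # initial learning rate
word2vec.setVectorSize(50)       # each word maps to a 50-dimensional vector
model = word2vec.fit(words)
```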

Here, `words` is a collection of lists, each containing the words from a single line of Python code. The training code came from the following libraries:

- [matplotlib](https://github.com/matplotlib/matplotlib)
- [scikit-learn](https://github.com/scikit-learn/scikit-learn)
- [numpy](https://github.com/numpy/numpy)
- [pandas](https://github.com/pydata/pandas)
- [django](https://github.com/django/django)
- [scipy](https://github.com/scipy/scipy)
- [flask](https://github.com/mitsuhiko/flask)
- [requests](https://github.com/kennethreitz/requests)
- [ansible](https://github.com/ansible/ansible)
- [sentry](https://github.com/getsentry/sentry)
- [scrapy](https://github.com/scrapy/scrapy)
- [Mailpile](https://github.com/mailpile/Mailpile)
- [sshuttle](https://github.com/apenwarr/sshuttle)
- [salt](https://github.com/saltstack/salt)
- [NewsBlur](https://github.com/samuelclay/NewsBlur)
- [beets](https://github.com/beetbox/beets)
- [SublimeCodeIntel](https://github.com/SublimeCodeIntel/SublimeCodeIntel)
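
The preprocessing that produced `words` is not shown here. As a rough sketch of the "one list of words per line" shape described above, the snippet below splits Python source into identifier-like tokens. The `lines_to_words` helper and its regular expression are assumptions made for illustration, not the tokenizer actually used for training.

```python
import re

# Hypothetical tokenizer: keeps identifier-like tokens (keywords and names)
# and drops numbers, operators, and punctuation.
TOKEN_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def lines_to_words(source):
    """Return one list of tokens per non-empty line of Python source."""
    words = []
    for line in source.splitlines():
        tokens = TOKEN_PATTERN.findall(line)
        if tokens:  # skip blank lines and lines with no identifiers
            words.append(tokens)
    return words


sample = "if x > 0:\n    return x\nelse:\n    return -x\n"
print(lines_to_words(sample))
```

With Spark, a list like this would typically be turned into an RDD (for example with `sc.parallelize`) before being passed to `fit`.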

## Load the Model

```
from src.model import Py2Vec

json_file = "./data/blog_model.json"
model = Py2Vec(json_file)
```

## Play with the Model

The model can find the vector for a word as follows:

```
WORD = "if"
model[WORD]
```

```
array([ 0.09466234,  0.28889984,  0.1308112 , -0.05844196, -0.16983712,
       -0.0166158 , -0.16119699, -0.10156288, -0.05623753, -0.14935999,
       -0.11714134,  0.14477304,  0.28775242,  0.12222207, -0.09567802,
        0.08175498, -0.09353275, -0.12397329, -0.08333923,  0.04878221,
       -0.38459352, -0.09947983, -0.2902336 , -0.06798651,  0.14922994,
        0.30129457, -0.09702022,  0.17557855,  0.22417529, -0.18842494,
        0.12816922, -0.20911421, -0.24793662,  0.09082287, -0.07911459,
       -0.09016705,  0.13365737,  0.05189047,  0.42096484, -0.09606148,
       -0.29605725,  0.12067144,  0.01540663,  0.22810976,  0.08252437,
        0.02883578,  0.10888317, -0.16921936, -0.06220951,  0.06783944])
```

It can also find the closest words in the model:

```
model.closest_words(WORD, 5)
```

```
[(0.418, u'elif'),
 (0.657, u'else'),
 (0.857, u'module_dir_switch'),
 (0.966, u'repo_opts'),
 (0.969, u'bzip2')]
```

The number in each pair is the Euclidean distance between `WORD` and the returned word.
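
To make the distance ranking concrete, here is a minimal sketch of a nearest-word lookup over a plain dictionary of NumPy vectors. It is not the `Py2Vec` implementation: the standalone `closest_words` function, the toy two-dimensional vectors, and the choice to skip the query word are all assumptions made for illustration.

```python
import numpy as np


def closest_words(vectors, word, n):
    """Return the n words nearest to `word` as (distance, word) pairs."""
    target = vectors[word]
    distances = [
        (round(float(np.linalg.norm(vec - target)), 3), other)
        for other, vec in vectors.items()
        if other != word  # assumption: the query word itself is excluded
    ]
    # Tuples sort by their first element, so the smallest distance comes first.
    return sorted(distances)[:n]


# Toy vectors, invented for this example only.
toy_vectors = {
    "if": np.array([0.0, 0.0]),
    "elif": np.array([0.3, 0.4]),
    "else": np.array([0.6, 0.8]),
    "import": np.array([3.0, 4.0]),
}
print(closest_words(toy_vectors, "if", 2))
```

A linear scan like this is fine for a small vocabulary; a larger one would usually stack the vectors into a single matrix and compute all distances in one vectorized call.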

The model can also find the closest words to a vector:

```
target_vector = model.null_vector

model.closest_words(target_vector, 5)
```

```
[(0.3, u'mismatched'),
 (0.312, u'mxtype'),
 (0.332, u'lns'),
 (0.361, u'nonsense'),
 (0.378, u'system_frames')]
```

```
target_vector = model['for'] - model['continue']

model.closest_words(target_vector, 5)
```

```
[(0.891, u'for'),
 (1.691, u'contrast'),
 (1.697, u'measurement'),
 (1.699, u'ripple'),
 (1.731, u'stock')]
```
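
Note that `for` itself appears in the last result: when the query is a raw vector, there is no query word to exclude, so an operand of the arithmetic can be its own nearest neighbour. The sketch below reproduces that behaviour on toy data. The `closest_to_vector` helper and the vectors are hypothetical and only illustrate the idea; they are not taken from `Py2Vec`.

```python
import numpy as np


def closest_to_vector(vectors, target, n):
    """Rank every word by Euclidean distance to an arbitrary target vector."""
    target = np.asarray(target, dtype=float)
    ranked = sorted(
        (round(float(np.linalg.norm(vec - target)), 3), word)
        for word, vec in vectors.items()
    )
    return ranked[:n]


# Toy vectors, invented for this example only.
toy_vectors = {
    "for": np.array([1.0, 0.0]),
    "continue": np.array([0.0, 0.5]),
    "while": np.array([2.0, 2.0]),
}

# Subtracting one word vector from another yields a new point in the space.
difference = toy_vectors["for"] - toy_vectors["continue"]
print(closest_to_vector(toy_vectors, difference, 2))
```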