In [None]:
import word2vec # pip install word2vec

# word2vec  

This notebook is equivalent to `demo-word.sh`, `demo-analogy.sh`, `demo-phrases.sh` and `demo-classes.sh` from the original [Google code](https://code.google.com/p/word2vec/).

Source: http://nbviewer.ipython.org/github/danielfrg/word2vec/blob/master/examples/word2vec.ipynb

## Training

In [23]:
# You can skip this part and use the pretrained model; see below.

# Download some data, for example: 
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip

# Run `word2phrase` to join words that frequently occur together into phrases, e.g. "Los Angeles" becomes "Los_Angeles"
word2vec.word2phrase('text8', 'text8-phrases', verbose=True)
# This creates a `text8-phrases` file that we can use as a better input for word2vec.
# Note that you could skip the previous step and use the original data as input for word2vec.

# Train the model using the word2phrase output.
word2vec.word2vec('text8-phrases', 'text8.bin', size=100, verbose=True)
# This generates a `text8.bin` file containing the word vectors in binary format.
print('training complete :-)')

--2015-11-26 16:51:59--  http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net... 98.139.135.129
Connecting to mattmahoney.net|98.139.135.129|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: 'text8.zip'


2015-11-26 16:52:22 (1.30 MB/s) - 'text8.zip' saved [31344016/31344016]

Archive:  text8.zip
  inflating: text8                   
Starting training using file text8
Words processed: 17000K     Vocab size: 4399K  
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
Starting training using file text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 245.75k  
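
The phrase-grouping step used above can be illustrated with a small, self-contained sketch. It scores each adjacent word pair by its discounted co-occurrence count relative to the counts of the individual words, and joins pairs that score above a threshold. The function name `merge_phrases`, the toy sentence, and the `min_count` and `threshold` values are assumptions chosen for illustration; this is not the code that `word2vec.word2phrase` runs, and the real tool is meant for much larger corpora with much larger parameter values.

```python
from collections import Counter


def merge_phrases(tokens, min_count=1, threshold=1.0):
    """Join adjacent word pairs whose co-occurrence score exceeds `threshold`.

    Hypothetical helper for illustration only; not part of the word2vec package.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def score(a, b):
        # Discounted bigram count, normalised by how often each word occurs alone.
        return (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * total

    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and score(tokens[i], tokens[i + 1]) > threshold:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the second word of the phrase we just emitted
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy corpus (made up): only "los angeles" occurs more than once as a pair.
tokens = "we love los angeles and they like los angeles".split()
phrases = merge_phrases(tokens)
print(phrases)
```

With these toy settings only the repeated pair is joined; pairs seen once get a score of zero because of the `min_count` discount.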

## Predictions

Use the model file created above, or download it from https://dl.dropboxusercontent.com/u/975350/tmp/text8.bin

In [24]:
model = word2vec.load('text8.bin')

We can take a look at the vocabulary as a numpy array

In [25]:
model.vocab

array([u'</s>', u'the', u'of', ..., u'dakotas', u'nias', u'burlesques'], 
      dtype='<U78')

Or take a look at the whole matrix

In [26]:
model.vectors.shape

(98331, 100)

In [27]:
model.vectors

array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.14694442, -0.00921887,  0.04583032, ..., -0.03086229,
        -0.05368164, -0.10046262],
       [ 0.16525993,  0.04536457,  0.04944701, ...,  0.05263765,
         0.1678641 ,  0.0518899 ],
       ..., 
       [-0.06522302,  0.07457283,  0.03537172, ...,  0.15117142,
        -0.04217494, -0.07174871],
       [ 0.00877762,  0.0425899 ,  0.14123389, ..., -0.0597503 ,
        -0.09166503, -0.0475128 ],
       [-0.01401589, -0.07021915,  0.10536838, ...,  0.03014692,
        -0.08648375, -0.09778806]])

We can retrieve the vector of individual words

In [28]:
model['dog'].shape

(100,)

In [29]:
model['dog'][:10]

array([ 0.10332464,  0.06703388,  0.05134746,  0.08328107,  0.08478329,
       -0.029021  ,  0.01412035,  0.13168371,  0.08635285, -0.05263404])

We can do simple queries to retrieve words similar to "red" based on cosine similarity:

In [30]:
indexes, metrics = model.cosine('red')
model.generate_response(indexes, metrics).tolist()

[(u'blue', 0.9102310124344305),
 (u'yellow', 0.8952224651112171),
 (u'green', 0.8222502950427031),
 (u'purple', 0.8189022284021563),
 (u'orange', 0.8147085127359422),
 (u'white', 0.8103996728750976),
 (u'black', 0.7809016351589374),
 (u'pink', 0.7522223146405485),
 (u'grey', 0.7417127727132912),
 (u'colored', 0.7313320169929519)]

In [31]:
indexes, metrics = model.cosine('sausage')
model.generate_response(indexes, metrics).tolist()

[(u'tofu', 0.8876018411826816),
 (u'grilled', 0.8825909072946603),
 (u'noodles', 0.8727839960713353),
 (u'mustard', 0.8660121928025903),
 (u'salted', 0.8633289673644092),
 (u'broth', 0.8582858806598598),
 (u'liqueur', 0.8582677050451581),
 (u'soda', 0.8552275215675876),
 (u'eggplant', 0.855012967375358),
 (u'soy_sauce', 0.8549815457362573)]
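
To make the ranking in the queries above concrete, here is a minimal sketch of a cosine-similarity lookup in plain Python. The three-dimensional vectors in `toy_vectors` are made up for illustration and the helper names `cosine_similarity` and `most_similar` are hypothetical; the trained model stores 100-dimensional vectors, and `model.cosine` may compute the same quantity differently internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, n=3):
    """Return the n words closest to `word`, excluding the word itself."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up toy embeddings, not taken from the trained model.
toy_vectors = {
    "red": [0.9, 0.1, 0.0],
    "blue": [0.8, 0.2, 0.1],
    "sausage": [0.0, 0.9, 0.4],
    "tofu": [0.1, 0.8, 0.5],
}

neighbours = most_similar("red", toy_vectors, n=2)
print(neighbours)
```

Because cosine similarity depends only on the direction of the vectors, two words score close to 1 when their vectors point the same way, regardless of vector length.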

### Phrases

Since we trained the model with the output of `word2phrase`, we can ask for the similarity of "phrases"

In [33]:
indexes, metrics = model.cosine('los_angeles')
model.generate_response(indexes, metrics).tolist()

[(u'san_francisco', 0.8907173960415067),
 (u'san_diego', 0.8764709136380514),
 (u'seattle', 0.8387995826597778),
 (u'las_vegas', 0.8315778508753239),
 (u'california', 0.8295829170840956),
 (u'chicago', 0.8292007306509285),
 (u'cleveland', 0.8267398709931983),
 (u'miami', 0.8244456208850379),
 (u'detroit', 0.8200824457281422),
 (u'chicago_illinois', 0.8173753945669766)]

### Analogies

It's possible to do more complex queries like analogies, such as `king - man + woman = queen`.
Like `cosine`, this method returns the indexes of the words in the vocab and the metric.

In [34]:
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
model.generate_response(indexes, metrics).tolist()

[(u'queen', 0.2880241341290669),
 (u'prince', 0.27092894724490246),
 (u'son', 0.26934393426434866),
 (u'empress', 0.2684484402477858),
 (u'wife', 0.2646606373064051),
 (u'emperor', 0.2633477131940194),
 (u'regent', 0.25994703729205215),
 (u'throne', 0.25966095236175724),
 (u'aragon', 0.25878382226993896),
 (u'monarch', 0.25648878233523753)]
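
The general idea behind such an analogy query can be sketched as vector arithmetic followed by a nearest-neighbour search. The two-dimensional `toy_vectors` below are made up (their axes loosely stand for "royalty" and "maleness"), and the helper `toy_analogy` is hypothetical. This is one common way to compute analogies, assumed here for illustration; it is not necessarily how `model.analogy` derives its metric, so the scores will not match the output above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_analogy(pos, neg, vectors, n=1):
    """Add the `pos` vectors, subtract the `neg` vectors, rank the rest by cosine."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in pos:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in neg:
        target = [t - x for t, x in zip(target, vectors[word])]
    exclude = set(pos) | set(neg)  # do not return the query words themselves
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up toy embeddings: [royalty, maleness].
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "prince": [0.7, 0.8],
}

result = toy_analogy(pos=["king", "woman"], neg=["man"], vectors=toy_vectors, n=2)
print(result)
```

Excluding the query words matters in practice: without it, one of the input words is often the nearest neighbour of the combined vector.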

We can get all the words grouped in a specific cluster