# word2vec 

This notebook is equivalent to `demo-word.sh`, `demo-analogy.sh`, `demo-phrases.sh` and `demo-classes.sh` from Google.

In [9]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Training

Download some data, for example: [http://mattmahoney.net/dc/text8.zip](http://mattmahoney.net/dc/text8.zip)

In [10]:
import word2vec

Run `word2phrase` to join words that frequently appear together into single tokens, for example "Los Angeles" becomes "Los_Angeles":

In [13]:
word2vec.word2phrase('/home/bear/bigdata/tox30000.txt', '/home/bear/bigdata/tox30000-phrases', verbose=True)

Starting training using file /home/bear/bigdata/tox30000.txt
Words processed: 1800K     Vocab size: 1162K  
Vocab size (unigrams + bigrams): 638379
Words in train file: 1890940
Words written: 1800K

This created a `tox30000-phrases` file that we can use as a better input for `word2vec`.
Note that you could easily skip this previous step and use the text data as input for `word2vec` directly.
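
To make the phrase step less of a black box, here is a minimal sketch of bigram scoring on a toy sentence. It is not the `word2phrase` implementation: the helper name `merge_phrases`, the `min_count` and `threshold` values, and the scaling by the total token count are assumptions chosen for this example. The score follows the general idea of discounting the pair count and dividing by the product of the individual word counts.

```python
from collections import Counter

def merge_phrases(tokens, min_count=1, threshold=1.0):
    """Join adjacent word pairs whose co-occurrence score exceeds a threshold.

    Hypothetical helper for illustration only; parameter values are
    assumptions picked for the toy sentence below.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def score(a, b):
        # Pairs seen together more often than their individual
        # frequencies suggest receive a higher score.
        return (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * total

    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and score(tokens[i], tokens[i + 1]) > threshold:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text = "i moved to los angeles because los angeles is sunny and i like the sun"
print(" ".join(merge_phrases(text.split())))
```

In this toy input only the pair "los angeles" occurs more than once, so it is the only pair rewritten with an underscore.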

Now actually train the word2vec model.

In [14]:
word2vec.word2vec('/home/bear/bigdata/tox30000-phrases', '/home/bear/bigdata/tox30000.bin', size=300, verbose=True)

Starting training using file /home/bear/bigdata/tox30000-phrases
Vocab size: 26973
Words in train file: 1644602
Alpha: 0.000002  Progress: 100.20%  Words/thread/sec: 103.19k  65k  

That created a `tox30000.bin` file containing the word vectors in a binary format.

Now we generate the clusters of the vectors based on the trained model.

In [6]:
word2vec.word2clusters('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-clusters.txt', 100, verbose=True)

Starting training using file /Users/drodriguez/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.04%  Words/thread/sec: 317.72k  

That created a `text8-clusters.txt` file with the cluster assigned to every word in the vocabulary.

## Predictions

In [1]:
%load_ext autoreload
%autoreload 2

In [3]:
import word2vec

Import the `word2vec` binary file created above

In [15]:
model = word2vec.load('/home/bear/bigdata/tox.bin')

We can take a look at the vocabulary as a numpy array

In [16]:
model.vocab

array(['</s>', 'exposure', 'effects', ..., 'r665', 'oobiphenol',
       'recipavrin'], dtype='<U78')

Or take a look at the whole matrix

In [17]:
model.vectors.shape

(135794, 300)

In [18]:
model.vectors

array([[ 0.08015626,  0.08850129, -0.07670335, ..., -0.02626957,
        -0.03316621,  0.0614953 ],
       [-0.06314697, -0.0020846 ,  0.01790366, ...,  0.06257509,
        -0.05106722,  0.01221308],
       [-0.00866614, -0.00828651,  0.03921769, ...,  0.04933339,
         0.0625715 ,  0.03236538],
       ...,
       [-0.01647791, -0.12637718, -0.05695276, ...,  0.05563355,
         0.05544947,  0.01216888],
       [-0.08948104, -0.10090277,  0.00732808, ..., -0.03293901,
         0.00705413, -0.07581794],
       [ 0.00969618, -0.09953561,  0.05869497, ...,  0.09755769,
         0.01010175, -0.09685887]])

We can retrieve the vector of an individual word:

In [19]:
model['liver'].shape

(300,)

In [20]:
model['liver'][:10]

array([ 0.00105867, -0.07775147, -0.05253823, -0.00679678,  0.03260306,
        0.03492132,  0.04879236,  0.06382352,  0.09693788, -0.10532206])

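We can calculate the distance between two or more words; with more than two words, every pairwise combination is returned. Note that the reported values behave like cosine similarities (the value for "liver" and "heart" matches the similarity query further down), so a higher value means the words are closer.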

In [22]:
model.distance("liver", "heart", "cell")

[('liver', 'heart', 0.46211296256692835),
 ('liver', 'cell', 0.16932151990970476),
 ('heart', 'cell', 0.01176196145551267)]
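
As a rough illustration of the pairwise computation, the sketch below computes cosine similarity for every combination of words in a small dictionary. The two-dimensional `toy_vectors` and the helper names `cosine` and `pairwise` are made up for this example; they are not taken from the trained model or from the library.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 2-D vectors, only to show the mechanics.
toy_vectors = {
    "liver": [1.0, 0.0],
    "heart": [1.0, 1.0],
    "cell": [0.0, 1.0],
}

def pairwise(words, vectors):
    """Return (word_a, word_b, similarity) for every unordered pair."""
    return [(a, b, cosine(vectors[a], vectors[b]))
            for a, b in combinations(words, 2)]

for a, b, value in pairwise(["liver", "heart", "cell"], toy_vectors):
    print(a, b, round(value, 3))
```

`itertools.combinations` yields the pairs in the same order as the output above: first with second, first with third, then second with third.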

## Similarity

We can do simple queries to retrieve words similar to "liver" based on cosine similarity:

In [23]:
indexes, metrics = model.similar("liver")
indexes, metrics

(array([  175,  1008,  2797,   206,  1069,  1706,  3362,  3234, 10341,
          678]),
 array([0.7051077 , 0.66506537, 0.57917927, 0.57059308, 0.51442374,
        0.48549201, 0.48429055, 0.47507805, 0.46735535, 0.46211296]))

This returned a tuple with 2 items:
1. numpy array with the indexes of the similar words in the vocabulary
2. numpy array with cosine similarity to each word
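
The shape of that result can be mimicked with a small top-n search. This is a sketch under assumptions, not the library's code: it assumes unit-length vectors (so the dot product equals the cosine similarity), and `top_similar`, `toy_vocab` and `toy_vectors` are hypothetical names with made-up values.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Made-up vocabulary and 2-D vectors for illustration.
toy_vocab = ["liver", "hepatic", "kidney", "solvent"]
toy_vectors = [unit(v) for v in ([1.0, 0.1], [0.9, 0.2], [0.6, 0.6], [0.0, 1.0])]

def top_similar(word, vocab, vectors, n=2):
    """Return (indexes, metrics) of the n words closest to `word`."""
    query_ix = vocab.index(word)
    query = vectors[query_ix]
    # With unit vectors the dot product is the cosine similarity.
    scores = [sum(a * b for a, b in zip(query, v)) for v in vectors]
    # Rank every other word by score, best first, and keep the top n.
    candidates = [i for i in range(len(vocab)) if i != query_ix]
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)[:n]
    return ranked, [scores[i] for i in ranked]

indexes, metrics = top_similar("liver", toy_vocab, toy_vectors)
print(indexes, [toy_vocab[i] for i in indexes])
```

Returning positions rather than words keeps the result compact; the words are recovered by indexing the vocabulary with those positions.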

We can get the words for those indexes

In [24]:
model.vocab[indexes]

array(['hepatic', 'livers', 'hepatocellular', 'kidney', 'organ', 'testis',
       'pancreatic', 'pancreas', 'liver_homogenate', 'heart'],
      dtype='<U78')

There is a helper function to create a combined response as a numpy [record array](http://docs.scipy.org/doc/numpy/user/basics.rec.html)

In [25]:
model.generate_response(indexes, metrics)

rec.array([('hepatic', 0.7051077 ), ('livers', 0.66506537),
           ('hepatocellular', 0.57917927), ('kidney', 0.57059308),
           ('organ', 0.51442374), ('testis', 0.48549201),
           ('pancreatic', 0.48429055), ('pancreas', 0.47507805),
           ('liver_homogenate', 0.46735535), ('heart', 0.46211296)],
          dtype=[('word', '<U78'), ('metric', '<f8')])

It is easy to convert that numpy array into a pure Python response:

In [26]:
model.generate_response(indexes, metrics).tolist()

[('hepatic', 0.7051076955702468),
 ('livers', 0.6650653680083782),
 ('hepatocellular', 0.5791792659172474),
 ('kidney', 0.570593078628412),
 ('organ', 0.5144237445663715),
 ('testis', 0.48549201288328525),
 ('pancreatic', 0.4842905491948804),
 ('pancreas', 0.4750780490688806),
 ('liver_homogenate', 0.46735534591553507),
 ('heart', 0.46211296256692835)]

### Phrases

Since we trained the model with the output of `word2phrase`, we can ask for the similarity of "phrases", which are combined words such as "Los Angeles":

In [28]:
indexes, metrics = model.similar('cell_line')
model.generate_response(indexes, metrics).tolist()

[('cell_lines', 0.8452627475811616),
 ('hepg2', 0.7789955019696182),
 ('cellline', 0.7654203295789872),
 ('fibroblasts', 0.7449751605717496),
 ('lung_fibroblasts', 0.7369620083890245),
 ('fibroblast', 0.7307756186508869),
 ('hela_cells', 0.7303405837250463),
 ('a549', 0.7097297177166331),
 ('cells', 0.7053972818334586),
 ('immortalized', 0.7014494405101693)]

### Analogies

It's possible to do more complex queries like analogies, such as `king - man + woman = queen`.
This method returns the same structure as `similar`: the indexes of the words in the vocabulary and the metric.

In [33]:
indexes, metrics = model.analogy(pos=['bioinformatics', 'organ'], neg=['biology'])
indexes, metrics

(array([ 7771, 15053,   823,   206, 85441, 17197, 27571,  3847, 35492,
           23]),
 array([0.20956429, 0.19867973, 0.19831104, 0.19617407, 0.19575099,
        0.1922857 , 0.18965993, 0.18874681, 0.18825623, 0.18733298]))

In [34]:
model.generate_response(indexes, metrics).tolist()

[('semiquantitative', 0.20956428692481882),
 ('multivariate_statistical', 0.19867973385807608),
 ('organs', 0.19831104340750566),
 ('kidney', 0.19617406961301156),
 ('countercurrent', 0.1957509887745693),
 ('uplcmsms', 0.19228569608626525),
 ('hplcec', 0.18965992851009603),
 ('lcms', 0.18874681332399962),
 ('metabolomics_approach', 0.18825623492047724),
 ('liver', 0.1873329787903628)]
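
The arithmetic behind an analogy query can be sketched on toy data: add the positive vectors, subtract the negative ones, then rank the remaining words by cosine similarity to the result. The three-dimensional `toy_vectors` and the function `toy_analogy` are invented for this example, and the library may combine or normalize the vectors differently.

```python
import math

# Made-up 3-D vectors; the axes loosely mean (royal, male, female).
toy_vectors = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "prince": [0.7, 1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_analogy(pos, neg, vectors, n=2):
    """Rank words by similarity to sum(pos) - sum(neg), skipping the inputs."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in pos:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in neg:
        target = [t - x for t, x in zip(target, vectors[word])]
    exclude = set(pos) | set(neg)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

print(toy_analogy(pos=["king", "woman"], neg=["man"], vectors=toy_vectors))
```

Excluding the query words matters: without it, one of the inputs would often rank first simply because it contributes directly to the target vector.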

### Clusters

In [18]:
clusters = word2vec.load_clusters('/Users/drodriguez/Downloads/text8-clusters.txt')

We can look at the vocabulary covered by the clusters:

In [19]:
clusters.vocab

array(['</s>', 'the', 'of', ..., 'bredon', 'skirting', 'santamaria'],
      dtype='<U29')

We can get all the words grouped in a specific cluster:

In [20]:
clusters.get_words_on_cluster(90).shape

(206,)

In [21]:
clusters.get_words_on_cluster(90)[:10]

array(['along', 'associated', 'relations', 'relationship', 'deal',
       'combined', 'contact', 'connection', 'respect', 'mixed'],
      dtype='<U29')
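
Conceptually, a cluster lookup is a mapping between words and cluster ids. The sketch below builds both directions from a few lines of text, assuming one `word cluster_id` pair per line. That file layout, the helper `parse_clusters`, and the sample lines (including the cluster ids) are assumptions made for illustration rather than a description of the actual clusters file.

```python
from collections import defaultdict

# Made-up sample lines, assuming a "word cluster_id" layout.
cluster_lines = [
    "along 90",
    "associated 90",
    "the 3",
    "relations 90",
    "of 3",
]

def parse_clusters(lines):
    """Build word -> cluster and cluster -> words lookups."""
    word_to_cluster = {}
    cluster_to_words = defaultdict(list)
    for line in lines:
        word, cluster_id = line.split()
        cluster_id = int(cluster_id)
        word_to_cluster[word] = cluster_id
        cluster_to_words[cluster_id].append(word)
    return word_to_cluster, dict(cluster_to_words)

word_to_cluster, cluster_to_words = parse_clusters(cluster_lines)
print(word_to_cluster["along"])
print(cluster_to_words[90])
```

Keeping both lookups makes it cheap to answer either question: which cluster a word belongs to, and which words share a cluster.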

We can add the clusters to the word2vec model and generate a response that includes the clusters

In [22]:
model.clusters = clusters

In [23]:
indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"])

In [24]:
model.generate_response(indexes, metrics).tolist()

[('berlin', 0.3187078682472152, 15),
 ('vienna', 0.28562803640143397, 12),
 ('munich', 0.28527806428082675, 21),
 ('moscow', 0.27085681100243797, 74),
 ('leipzig', 0.2697639527846636, 8),
 ('st_petersburg', 0.25841328545046965, 61),
 ('prague', 0.2571333430942206, 72),
 ('bonn', 0.2546126113385251, 8),
 ('dresden', 0.2471285069069249, 71),
 ('warsaw', 0.2450778083401204, 74)]