# word2vec 

This notebook is equivalent to `demo-word.sh`, `demo-analogy.sh`, `demo-phrases.sh` and `demo-classes.sh` from Google.

In [9]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Training

Download some data, for example: [http://mattmahoney.net/dc/text8.zip](http://mattmahoney.net/dc/text8.zip)

In [83]:
import word2vec
from summa.summarizer import summarize

Run `word2phrase` to join words that frequently occur together into single tokens, for example "Los Angeles" becomes "Los_Angeles".

In [13]:
word2vec.word2phrase('/home/bear/bigdata/tox30000.txt', '/home/bear/bigdata/tox30000-phrases', verbose=True)

Starting training using file /home/bear/bigdata/tox30000.txt
Words processed: 1800K     Vocab size: 1162K  
Vocab size (unigrams + bigrams): 638379
Words in train file: 1890940
Words written: 1800K

This created a `tox30000-phrases` file that we can use as a better input for `word2vec`.
Note that you can skip this step and use the raw text data as input for `word2vec` directly.
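To make the phrase step less of a black box, here is a minimal pure-Python sketch of bigram joining. The scoring formula is an assumption based on the heuristic commonly attributed to the original tool, and `join_phrases`, `min_count` and `threshold` are hypothetical names with toy defaults; this is not the package's implementation.

```python
from collections import Counter

def join_phrases(tokens, min_count=1, threshold=1.0):
    """Join adjacent word pairs whose co-occurrence score exceeds threshold."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            # Assumed score: discounted pair count relative to the unigram counts.
            score = (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * total
            if score > threshold:
                out.append(a + "_" + b)
                i += 2  # skip both words of the joined pair
                continue
        out.append(tokens[i])
        i += 1
    return out

tokens = "i flew to los angeles and los angeles was warm and i was happy".split()
joined = join_phrases(tokens)
print(joined)
```

Only "los angeles" occurs more than once as a pair in this toy sentence, so it is the only pair that gets joined.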

Now actually train the word2vec model.

In [109]:
word2vec.word2vec('/home/bear/bigdata/tox-phrases', '/home/bear/bigdata/tox_512.bin', size=512, verbose=True)

Starting training using file /home/bear/bigdata/tox-phrases
Vocab size: 135794
Words in train file: 15055260
Alpha: 0.000002  Progress: 100.05%  Words/thread/sec: 65.48k

That created a `tox_512.bin` file containing the word vectors in a binary format.

Now we generate the clusters of the vectors based on the trained model.

In [59]:
word2vec.word2clusters('/home/bear/bigdata/tox.txt', '/home/bear/bigdata/tox-clusters.txt', 300, verbose=True)

Starting training using file /home/bear/bigdata/tox.txt
Vocab size: 91435
Words in train file: 16869865


Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 164.55k

That created a `tox-clusters.txt` file with the cluster for every word in the vocabulary.
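As a rough illustration of what such a file holds, the sketch below parses a few in-memory lines. It assumes one `word cluster_id` pair per line separated by whitespace; that layout, the sample rows and the `read_clusters` helper are assumptions for illustration, not something confirmed by this notebook.

```python
import io

# Hypothetical sample in the assumed "word cluster_id" layout.
sample = io.StringIO("liver 12\nhepatic 12\nheart 7\n")

def read_clusters(handle):
    """Return a dict mapping each word to its integer cluster id."""
    mapping = {}
    for line in handle:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, cluster = parts
        mapping[word] = int(cluster)
    return mapping

word_to_cluster = read_clusters(sample)
print(word_to_cluster)
```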

## Predictions

In [1]:
%load_ext autoreload
%autoreload 2

In [3]:
import word2vec

Load a `word2vec` binary file like the one created above

In [15]:
model = word2vec.load('/home/bear/bigdata/tox.bin')

We can take a look at the vocabulary as a numpy array

In [16]:
model.vocab

array(['</s>', 'exposure', 'effects', ..., 'r665', 'oobiphenol',
       'recipavrin'], dtype='<U78')

Or take a look at the whole matrix

In [17]:
model.vectors.shape

(135794, 300)

In [18]:
model.vectors

array([[ 0.08015626,  0.08850129, -0.07670335, ..., -0.02626957,
        -0.03316621,  0.0614953 ],
       [-0.06314697, -0.0020846 ,  0.01790366, ...,  0.06257509,
        -0.05106722,  0.01221308],
       [-0.00866614, -0.00828651,  0.03921769, ...,  0.04933339,
         0.0625715 ,  0.03236538],
       ...,
       [-0.01647791, -0.12637718, -0.05695276, ...,  0.05563355,
         0.05544947,  0.01216888],
       [-0.08948104, -0.10090277,  0.00732808, ..., -0.03293901,
         0.00705413, -0.07581794],
       [ 0.00969618, -0.09953561,  0.05869497, ...,  0.09755769,
         0.01010175, -0.09685887]])

We can retrieve the vector of individual words

In [19]:
model['liver'].shape

(300,)

In [20]:
model['liver'][:10]

array([ 0.00105867, -0.07775147, -0.05253823, -0.00679678,  0.03260306,
        0.03492132,  0.04879236,  0.06382352,  0.09693788, -0.10532206])
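Conceptually, looking up a word amounts to selecting one row of the vector matrix. The sketch below shows that idea with toy data; `vocab`, `vectors`, `word_to_row` and `get_vector` are stand-ins defined here, and the package may organize its lookup differently.

```python
import numpy as np

# Toy vocabulary and a matching matrix with one row per word.
vocab = np.array(["</s>", "liver", "heart"])
vectors = np.array([[0.0, 1.0],
                    [0.6, 0.8],
                    [1.0, 0.0]])

# Map each word to its row index once, so lookups are O(1).
word_to_row = {word: i for i, word in enumerate(vocab)}

def get_vector(word):
    """Return the row of `vectors` that belongs to `word`."""
    return vectors[word_to_row[word]]

print(get_vector("liver"))
```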

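We can calculate the similarity between two or more words; every pairwise combination is returned. Despite the method name, the values are cosine similarities: the "liver"/"heart" value below matches the one `similar` reports for the same pair.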

In [22]:
model.distance("liver", "heart", "cell")

[('liver', 'heart', 0.46211296256692835),
 ('liver', 'cell', 0.16932151990970476),
 ('heart', 'cell', 0.01176196145551267)]
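The all-combinations behaviour can be sketched with `itertools.combinations` and a plain cosine similarity. The vectors in `toy` are made up for illustration, and `cosine` and `pairwise` are hypothetical helpers rather than the package's own code.

```python
import itertools
import numpy as np

# Made-up 2-dimensional vectors, for illustration only.
toy = {
    "liver": np.array([1.0, 0.0]),
    "heart": np.array([0.6, 0.8]),
    "cell": np.array([0.0, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pairwise(*words):
    """Return (word_a, word_b, similarity) for every unordered pair."""
    return [(a, b, cosine(toy[a], toy[b]))
            for a, b in itertools.combinations(words, 2)]

pairs = pairwise("liver", "heart", "cell")
print(pairs)
```

Three words yield three pairs, in the same order as the output above: first with second, first with third, second with third.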

## Similarity

We can do simple queries to retrieve words similar to "liver" based on cosine similarity:

In [23]:
indexes, metrics = model.similar("liver")
indexes, metrics

(array([  175,  1008,  2797,   206,  1069,  1706,  3362,  3234, 10341,
          678]),
 array([0.7051077 , 0.66506537, 0.57917927, 0.57059308, 0.51442374,
        0.48549201, 0.48429055, 0.47507805, 0.46735535, 0.46211296]))

This returned a tuple with 2 items:
1. numpy array with the indexes of the similar words in the vocabulary
2. numpy array with cosine similarity to each word
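A query of this shape can be sketched as a dot product against every row followed by a sort. The example assumes the rows are already unit length, so the dot product equals the cosine similarity; the data and the `most_similar` helper are hypothetical.

```python
import numpy as np

vocab = np.array(["liver", "hepatic", "kidney", "socks"])
# Toy rows, each already of unit length.
vectors = np.array([[1.0, 0.0],
                    [0.8, 0.6],
                    [0.6, 0.8],
                    [0.0, 1.0]])

def most_similar(word, n=2):
    """Return (indexes, metrics) of the n rows most similar to `word`."""
    query_row = int(np.where(vocab == word)[0][0])
    scores = vectors @ vectors[query_row]   # cosine for unit-length rows
    order = np.argsort(scores)[::-1]        # best match first
    order = order[order != query_row][:n]   # drop the query word itself
    return order, scores[order]

indexes, metrics = most_similar("liver")
print(vocab[indexes], metrics)
```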

We can get the words for those indexes

In [24]:
model.vocab[indexes]

array(['hepatic', 'livers', 'hepatocellular', 'kidney', 'organ', 'testis',
       'pancreatic', 'pancreas', 'liver_homogenate', 'heart'],
      dtype='<U78')

There is a helper function to create a combined response as a numpy [record array](http://docs.scipy.org/doc/numpy/user/basics.rec.html)

In [25]:
model.generate_response(indexes, metrics)

rec.array([('hepatic', 0.7051077 ), ('livers', 0.66506537),
           ('hepatocellular', 0.57917927), ('kidney', 0.57059308),
           ('organ', 0.51442374), ('testis', 0.48549201),
           ('pancreatic', 0.48429055), ('pancreas', 0.47507805),
           ('liver_homogenate', 0.46735535), ('heart', 0.46211296)],
          dtype=[('word', '<U78'), ('metric', '<f8')])

It is easy to turn that numpy array into a pure Python response:

In [26]:
model.generate_response(indexes, metrics).tolist()

[('hepatic', 0.7051076955702468),
 ('livers', 0.6650653680083782),
 ('hepatocellular', 0.5791792659172474),
 ('kidney', 0.570593078628412),
 ('organ', 0.5144237445663715),
 ('testis', 0.48549201288328525),
 ('pancreatic', 0.4842905491948804),
 ('pancreas', 0.4750780490688806),
 ('liver_homogenate', 0.46735534591553507),
 ('heart', 0.46211296256692835)]
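For readers unfamiliar with record arrays, the sketch below builds one directly with `np.rec.fromarrays` and converts it to a list of tuples. The two rows are made-up values, and this is only one way to obtain such a structure, not necessarily how `generate_response` does it.

```python
import numpy as np

# Made-up words and scores, for illustration only.
words = np.array(["hepatic", "livers"])
scores = np.array([0.71, 0.67])

# Combine the parallel arrays into one record array with named fields.
response = np.rec.fromarrays([words, scores], names="word,metric")

print(response.word)      # fields are accessible as attributes
as_list = response.tolist()
print(as_list)            # plain Python tuples
```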

### Phrases

Since we trained the model with the output of `word2phrase`, we can ask for the similarity of "phrases", that is, combined words such as "Los Angeles"

In [28]:
indexes, metrics = model.similar('cell_line')
model.generate_response(indexes, metrics).tolist()

[('cell_lines', 0.8452627475811616),
 ('hepg2', 0.7789955019696182),
 ('cellline', 0.7654203295789872),
 ('fibroblasts', 0.7449751605717496),
 ('lung_fibroblasts', 0.7369620083890245),
 ('fibroblast', 0.7307756186508869),
 ('hela_cells', 0.7303405837250463),
 ('a549', 0.7097297177166331),
 ('cells', 0.7053972818334586),
 ('immortalized', 0.7014494405101693)]

### Analogies

It's possible to do more complex queries such as analogies, for example: `king - man + woman = queen`
Like `similar`, this method returns the indexes of the words in the vocabulary and the metric for each.

In [106]:
indexes, metrics = model.analogy(pos=["breast_cancer", "liver"], neg=['breast'])
indexes, metrics

(array([ 947,  175,  412,  843, 1008, 1645, 1655, 1461, 2117,  370]),
 array([0.21516253, 0.20577065, 0.16443839, 0.1630304 , 0.16087712,
        0.15813997, 0.1553662 , 0.15516005, 0.1545503 , 0.15339436]))

In [107]:
model.generate_response(indexes, metrics).tolist()

[('hepatotoxicity', 0.21516252542431225, 281),
 ('hepatic', 0.20577064842314757, 123),
 ('hepatocytes', 0.16443839324541545, 13),
 ('liver_microsomes', 0.16303039670439448, 119),
 ('livers', 0.1608771203520329, 108),
 ('hepatocyte', 0.15813997420322706, 67),
 ('cyp2e1', 0.1553661984648455, 137),
 ('ccl4', 0.15516004641554593, 188),
 ('apap', 0.15455030257951768, 246),
 ('oxidative', 0.15339436227934383, 55)]
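The vector arithmetic behind an analogy query can be sketched as follows: add the positive vectors, subtract the negative ones, and rank the remaining words by their dot product with the result. The three-dimensional toy vectors and the `analogy` function here are assumptions for illustration; the package's own scoring may differ in detail.

```python
import numpy as np

vocab = np.array(["king", "man", "woman", "queen", "apple"])
# Toy dimensions: (royal, male, female).
vectors = np.array([[1.0, 1.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 0.0, 1.0],
                    [-1.0, 0.2, 0.2]])

def analogy(pos, neg, n=1):
    """Rank words by similarity to sum(pos) - sum(neg), excluding the inputs."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    row = {w: i for i, w in enumerate(vocab)}
    target = sum(unit[row[w]] for w in pos) - sum(unit[row[w]] for w in neg)
    scores = unit @ target
    exclude = {row[w] for w in pos + neg}
    order = [i for i in np.argsort(scores)[::-1] if i not in exclude][:n]
    return np.array(order), scores[order]

idx, score = analogy(pos=["king", "woman"], neg=["man"])
print(vocab[idx], score)
```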

### Clusters

In [60]:
clusters = word2vec.load_clusters('/home/bear/bigdata/tox-clusters.txt')

We can inspect the vocabulary covered by the clusters

In [61]:
clusters.vocab

array(['</s>', 'exposure', 'cells', ..., 'r665', 'oobiphenol',
       'recipavrin'], dtype='<U66')

We can get all the words grouped in a specific cluster

In [62]:
clusters.get_words_on_cluster(90).shape

(180,)

In [64]:
clusters.get_words_on_cluster(90)[:10]

array(['phenobarbital', 'inducer', 'dexamethasone', 'dosedependently',
       'ac', '3methylcholanthrene', 'tpa', '3mc', 'ketoconazole', 'bnf'],
      dtype='<U66')
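Selecting the words of one cluster can be pictured as a boolean mask over parallel arrays. The arrays and the `words_on_cluster` helper below are toy stand-ins, not the package's internals.

```python
import numpy as np

# Parallel toy arrays: cluster_ids[i] is the cluster of vocab[i].
vocab = np.array(["phenobarbital", "liver", "inducer", "heart"])
cluster_ids = np.array([90, 12, 90, 7])

def words_on_cluster(cluster):
    """Return every word whose cluster id equals `cluster`."""
    return vocab[cluster_ids == cluster]

print(words_on_cluster(90))
```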

We can add the clusters to the word2vec model and generate a response that includes the clusters

In [65]:
model.clusters = clusters

In [102]:
indexes, metrics = model.analogy(pos=["breast_cancer", "liver"], neg=["breast"])

In [103]:
model.generate_response(indexes, metrics).tolist()

[('hepatotoxicity', 0.21516252542431225, 281),
 ('hepatic', 0.20577064842314757, 123),
 ('hepatocytes', 0.16443839324541545, 13),
 ('liver_microsomes', 0.16303039670439448, 119),
 ('livers', 0.1608771203520329, 108),
 ('hepatocyte', 0.15813997420322706, 67),
 ('cyp2e1', 0.1553661984648455, 137),
 ('ccl4', 0.15516004641554593, 188),
 ('apap', 0.15455030257951768, 246),
 ('oxidative', 0.15339436227934383, 55)]

In [99]:
text = "Race and ancestry have long been associated with differential risk and outcomes to disease as well as responses to medications. These differences in drug response are multifactorial with some portion associated with genomic variation. The field of pharmacogenomics aims to predict drug response in patients prior to medication administration and to uncover the biological underpinnings of drug response. The field of human genetics has long recognized that genetic variation differs in frequency between ancestral populations, with some single nucleotide polymorphisms found solely in one population. Thus far, most pharmacogenomic studies have focused on individuals of European and East Asian ancestry, resulting in a substantial disparity in the clinical utility of genetic prediction for drug response in US minority populations. In this review, we discuss the genetic factors that underlie variability to drug response and known pharmacogenomic associations and how these differ between populations, with an emphasis on the current knowledge in cardiovascular pharmacogenomics"
summarize(text, words=19)

'In this review, we discuss the genetic factors that underlie variability to drug response and known pharmacogenomic associations and how these differ between populations, with an emphasis on the current knowledge in cardiovascular pharmacogenomics'