<a href="https://colab.research.google.com/github/CMWENLIU/Drug_label_embedding/blob/master/word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# word2vec 

This notebook is equivalent to `demo-word.sh`, `demo-analogy.sh`, `demo-phrases.sh` and `demo-classes.sh` from Google.

In [0]:
!pip install word2vec

Collecting word2vec
  Using cached https://files.pythonhosted.org/packages/ce/51/5e2782b204015c8aef0ac830297c2f2735143ec90f592b9b3b909bb89757/word2vec-0.10.2.tar.gz
Building wheels for collected packages: word2vec
  Building wheel for word2vec (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/ef/9f/06/aec42532c9c37e05f936d4d586b15cfdfc9f2ffb62bd7fed1c
Successfully built word2vec
Installing collected packages: word2vec
Successfully installed word2vec-0.10.2


In [0]:
%load_ext autoreload
%autoreload 2

## Training

Download some data, for example: [http://mattmahoney.net/dc/text8.zip](http://mattmahoney.net/dc/text8.zip)
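
If you prefer to fetch the corpus from Python, the download can be sketched with the standard library alone. The helper name `download_and_extract` is hypothetical, and the sketch assumes the archive unpacks to a file named `text8`, which is what the later cells expect.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, dest_dir="."):
    """Download a zip archive (unless already present) and extract it.

    Returns the list of file names contained in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    zip_path = dest / url.rsplit("/", 1)[-1]
    if not zip_path.exists():
        urllib.request.urlretrieve(url, str(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Example call (not executed here because it needs network access):
# download_and_extract("http://mattmahoney.net/dc/text8.zip")
```

The download is skipped when the archive already exists locally, so re-running the cell does not fetch the file again.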

In [0]:
import word2vec



Run `word2phrase` to join words that frequently appear together into single tokens, for example "Los Angeles" becomes "Los_Angeles".

In [0]:
word2vec.word2phrase('text8', 'text8-phrases', verbose=True)

Starting training using file text8

Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206


This created a `text8-phrases` file that we can use as a better input for `word2vec`.
Note that you could easily skip this previous step and use the text data as input for `word2vec` directly.
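
To give an intuition for the phrase step, here is a minimal sketch of bigram joining on a toy sentence. It is not the library's implementation: the function `join_phrases`, its default parameters, and the scoring rule (co-occurrence count relative to the product of the individual word counts) are simplifying assumptions, and the real tool uses different defaults.

```python
from collections import Counter


def join_phrases(tokens, min_count=1, threshold=1.0):
    """Join adjacent word pairs that co-occur more often than expected."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            # Discounted pair count, scaled by how frequent each word is alone.
            score = (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * total
            if score > threshold:
                out.append(a + "_" + b)
                i += 2  # consume both words of the joined pair
                continue
        out.append(tokens[i])
        i += 1
    return out


tokens = "i flew to los angeles and los angeles was sunny and warm".split()
phrased = join_phrases(tokens)
print(" ".join(phrased))
```

In this toy sentence only the pair "los angeles" repeats, so it is the only pair whose score exceeds the threshold.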

Now actually train the word2vec model.

In [0]:
word2vec.word2vec('text8-phrases', 'text8.bin', size=100, verbose=True)

Starting training using file text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 200.37k  

That created a `text8.bin` file containing the word vectors in a binary format.

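Now we generate clusters of the word vectors. Note that the call below is given the raw `text8` file rather than `text8-phrases`, and its log shows a separate training run.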

In [0]:
word2vec.word2clusters('text8', 'text8-clusters.txt', 100, verbose=True)

Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 201.42k  

That created a `text8-clusters.txt` file with the cluster number for every word in the vocabulary.
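
Clustering word vectors is commonly done with k-means, and the original C tool describes its classes mode that way. Assuming the Python wrapper behaves similarly, the idea can be sketched in plain numpy. The function `kmeans_labels` and the toy data are hypothetical and only illustrate the technique.

```python
import numpy as np


def kmeans_labels(vectors, k, n_iter=10, seed=0):
    """Assign each row of `vectors` to one of k clusters with plain k-means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (fancy indexing copies them).
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every vector to every center, shape (n_vectors, k).
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = vectors[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels


# Toy "word vectors": two animals and two cities in 2 dimensions.
toy_vocab = np.array(["dog", "cat", "paris", "berlin"])
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = kmeans_labels(toy_vectors, k=2)
print(dict(zip(toy_vocab.tolist(), labels.tolist())))
```

The cluster numbers themselves are arbitrary; only the grouping of words is meaningful.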

## Predictions

In [0]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [0]:
import word2vec

Import the `word2vec` binary file created above

In [0]:
model = word2vec.load('text8.bin')

We can take a look at the vocabulary as a numpy array

In [0]:
model.vocab

array(['</s>', 'the', 'of', ..., 'denishawn', 'tamiris', 'dolophine'],
      dtype='<U78')

Or take a look at the whole matrix

In [0]:
model.vectors.shape

(98331, 100)

In [0]:
model.vectors

array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.16697241,  0.00419569,  0.03345228, ..., -0.08384706,
         0.02464598,  0.04118611],
       [ 0.20209818,  0.0518965 ,  0.19446139, ..., -0.08062559,
         0.0891011 ,  0.14994313],
       ...,
       [-0.16343404,  0.07923861, -0.07191788, ...,  0.08626001,
        -0.32177019,  0.03699231],
       [-0.0362827 , -0.04966592, -0.02323416, ...,  0.16307202,
        -0.26759657, -0.06015199],
       [ 0.0251137 , -0.01108364,  0.13510559, ...,  0.11882564,
        -0.06248424,  0.02373932]])

We can retrieve the vector of an individual word:

In [0]:
model['dog'].shape

(100,)

In [0]:
model['dog'][:10]

array([-0.05243927, -0.02918311,  0.01063815,  0.09861034,  0.01886699,
       -0.02386677,  0.01734321,  0.16603038,  0.09383278,  0.05912584])
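
Conceptually, a word lookup pairs the vocabulary array with the rows of the vector matrix. The sketch below uses a made-up three-word vocabulary and 4-dimensional vectors standing in for `model.vocab` and `model.vectors`; the helper `vector_for` is hypothetical and not part of the library.

```python
import numpy as np

# Hypothetical toy data standing in for model.vocab and model.vectors.
vocab = np.array(["dog", "cat", "fish"])
vectors = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
])

# Map each word to its row index once, then index into the matrix.
word_to_index = {word: i for i, word in enumerate(vocab.tolist())}


def vector_for(word):
    """Return the row of `vectors` that belongs to `word`."""
    return vectors[word_to_index[word]]


cat_vector = vector_for("cat")
print(cat_vector.shape)
```

A word that is missing from the vocabulary raises a `KeyError` in this sketch.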

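We can compute the similarity between two or more words; when more than two are given, every pairwise combination is returned. Despite the method name `distance`, the values below behave like cosine similarities: higher means closer, and a word compared with itself scores approximately 1.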

In [0]:
model.distance("dog", "cat", "fish", "us", "puppy", "fish")

[('dog', 'cat', 0.8651788554338464),
 ('dog', 'fish', 0.6026456066703929),
 ('dog', 'us', 0.04316332385533227),
 ('dog', 'puppy', 0.6040419769629664),
 ('dog', 'fish', 0.6026456066703929),
 ('cat', 'fish', 0.649937052867408),
 ('cat', 'us', -0.08818993474386388),
 ('cat', 'puppy', 0.6118911176147784),
 ('cat', 'fish', 0.649937052867408),
 ('fish', 'us', 0.06723651128426716),
 ('fish', 'puppy', 0.47403451451476797),
 ('fish', 'fish', 0.9999999835563418),
 ('us', 'puppy', 0.07460756186017278),
 ('us', 'fish', 0.06723651128426716),
 ('puppy', 'fish', 0.47403451451476797)]
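
The all-combinations behaviour can be reproduced with `itertools.combinations` and a cosine similarity function. This is a sketch under the assumption that the metric is cosine similarity; the names `cosine` and `pairwise_similarity` and the toy vectors are hypothetical.

```python
from itertools import combinations

import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def pairwise_similarity(lookup, *words):
    """Return (word_a, word_b, similarity) for every pair of positions."""
    return [(a, b, cosine(lookup[a], lookup[b])) for a, b in combinations(words, 2)]


# Toy 2-dimensional vectors, not taken from the trained model.
toy = {
    "dog": np.array([1.0, 0.0]),
    "cat": np.array([1.0, 1.0]),
    "us": np.array([0.0, 1.0]),
}
pairs = pairwise_similarity(toy, "dog", "cat", "us")
for row in pairs:
    print(row)
```

Because the pairs are formed by position, passing the same word twice yields repeated rows, which would explain the duplicated "fish" entries in the output above.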

## Similarity

We can run simple queries to retrieve the words most similar to "president" based on cosine similarity:

In [0]:
indexes, metrics = model.similar("president")
indexes, metrics

(array([  692,  3369,  4675,  2841,  9702,  1265,  2636,  8028, 13667,
         9551]),
 array([0.87103229, 0.84904996, 0.7873654 , 0.7805514 , 0.76927457,
        0.76881406, 0.76800883, 0.74190841, 0.73325711, 0.72971273]))

This returned a tuple with 2 items:
1. numpy array with the indexes of the similar words in the vocabulary
2. numpy array with cosine similarity to each word
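
A nearest-neighbour query of this kind can be sketched as a normalised dot product followed by a sort. The function `most_similar` and the toy vocabulary are hypothetical; this illustrates the technique rather than the library's code.

```python
import numpy as np


def most_similar(vocab, vectors, word, n=10):
    """Return (indexes, metrics) of the n nearest words by cosine similarity."""
    # Normalise rows so that a dot product equals the cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query_index = int(np.where(vocab == word)[0][0])
    metrics = unit @ unit[query_index]
    # Sort in descending order and drop the query word itself.
    order = np.argsort(metrics)[::-1]
    order = order[order != query_index][:n]
    return order, metrics[order]


# Toy data, not taken from the trained model.
vocab = np.array(["president", "governor", "chairman", "banana"])
vectors = np.array([[1.0, 0.2], [0.9, 0.3], [0.7, 0.6], [-0.5, 1.0]])
indexes, metrics = most_similar(vocab, vectors, "president", n=2)
print(vocab[indexes], metrics)
```

As in the output above, the result is a pair of arrays: positions in the vocabulary and the matching similarity scores.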

We can get the words for those indexes

In [0]:
model.vocab[indexes]

array(['prime_minister', 'vice_president', 'governor_general',
       'presidency', 'secretary_general', 'governor', 'chairman',
       'attorney_general', 'presidential_candidate', 'chief_executive'],
      dtype='<U78')

There is a helper function to create a combined response as a numpy [record array](http://docs.scipy.org/doc/numpy/user/basics.rec.html)

In [0]:
model.generate_response(indexes, metrics)

rec.array([('prime_minister', 0.87103229), ('vice_president', 0.84904996),
           ('governor_general', 0.7873654 ), ('presidency', 0.7805514 ),
           ('secretary_general', 0.76927457), ('governor', 0.76881406),
           ('chairman', 0.76800883), ('attorney_general', 0.74190841),
           ('presidential_candidate', 0.73325711),
           ('chief_executive', 0.72971273)],
          dtype=[('word', '<U78'), ('metric', '<f8')])

It is easy to convert that numpy array into a pure Python response:

In [0]:
model.generate_response(indexes, metrics).tolist()

[('prime_minister', 0.8710322861130149),
 ('vice_president', 0.8490499626679484),
 ('governor_general', 0.7873654018634985),
 ('presidency', 0.7805514004217579),
 ('secretary_general', 0.7692745667585241),
 ('governor', 0.7688140585104973),
 ('chairman', 0.7680088255106219),
 ('attorney_general', 0.7419084113413443),
 ('presidential_candidate', 0.7332571127925567),
 ('chief_executive', 0.7297127263843398)]
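
For reference, a record array like the one above can be built directly with numpy. This sketch uses two of the words from the output with rounded scores; how `generate_response` builds its result internally is not shown in this notebook, so treat the construction as an assumption.

```python
import numpy as np

# Values rounded from the output above, for illustration only.
words = np.array(["prime_minister", "vice_president"])
scores = np.array([0.87, 0.85])

# Combine the parallel arrays into one record array with named fields.
response = np.rec.fromarrays([words, scores], names="word,metric")

print(response.word)    # field access by attribute
as_list = response.tolist()
print(as_list)          # list of plain Python tuples
```

Record arrays allow access by field name, while `tolist()` produces plain tuples that are easy to serialise.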

### Phrases

Since we trained the model with the output of `word2phrase`, we can ask for the similarity of "phrases", which are combined words such as "Los Angeles":

In [0]:
indexes, metrics = model.similar('los_angeles')
model.generate_response(indexes, metrics).tolist()

[('san_francisco', 0.892287914295503),
 ('san_diego', 0.8748981469833962),
 ('las_vegas', 0.8423311311019948),
 ('miami', 0.8271736458495658),
 ('seattle', 0.826007480538819),
 ('st_louis', 0.8190536079413904),
 ('california', 0.8102486091288328),
 ('chicago_illinois', 0.8089109900829013),
 ('atlanta', 0.8064165348227031),
 ('chicago', 0.8061796974720457)]

### Analogies

It's possible to do more complex queries, such as analogies: `king - man + woman = queen`.
This method returns the same kind of result as `similar`: the indexes of the words in the vocabulary and the metric for each.

In [0]:
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'])
indexes, metrics

(array([1088, 1145, 7540, 1335,  344, 3141,  648, 1827, 4978, 1427]),
 array([0.3009306 , 0.28403791, 0.2811162 , 0.27420477, 0.27184272,
        0.27068602, 0.26695572, 0.26684499, 0.26527098, 0.26492222]))

In [0]:
model.generate_response(indexes, metrics).tolist()

[('queen', 0.30093059684471635),
 ('prince', 0.2840379143612827),
 ('empress', 0.2811162012283034),
 ('wife', 0.2742047661529334),
 ('son', 0.27184272259250386),
 ('monarch', 0.27068601704401174),
 ('emperor', 0.2669557155381216),
 ('throne', 0.26684498948027807),
 ('heir', 0.2652709790272968),
 ('bishop', 0.26492221588511666)]
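
One common way to answer an analogy query is to average the positive vectors with the negated negative vectors and then rank all words against that target. The sketch below assumes this approach; the function name `analogy`, the toy vocabulary, and the 2-dimensional vectors (one axis for royalty, one for gender) are hypothetical.

```python
import numpy as np


def analogy(vocab, vectors, pos, neg, n=3):
    """Rank words against mean(pos vectors, negated neg vectors)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(vocab.tolist())}
    signed = [unit[index[w]] for w in pos] + [-unit[index[w]] for w in neg]
    target = np.mean(signed, axis=0)
    metrics = unit @ target
    # Skip the words that were part of the query itself.
    exclude = {index[w] for w in pos + neg}
    order = [int(i) for i in np.argsort(metrics)[::-1] if int(i) not in exclude]
    order = np.array(order[:n])
    return order, metrics[order]


# Toy vectors: first axis is "royalty", second axis is "gender".
vocab = np.array(["king", "queen", "man", "woman", "prince"])
vectors = np.array([[1.0, 1.0], [1.0, -1.0], [0.0, 1.0], [0.0, -1.0], [1.0, 0.5]])
indexes, metrics = analogy(vocab, vectors, pos=["king", "woman"], neg=["man"], n=2)
print(vocab[indexes], metrics)
```

Averaging three unit vectors shrinks the target, which is one possible reason analogy scores are lower than plain similarity scores.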

### Clusters

In [0]:
clusters = word2vec.load_clusters('text8-clusters.txt')

We can inspect the vocabulary of words that were assigned a cluster:

In [0]:
clusters.vocab

array(['</s>', 'the', 'of', ..., 'denishawn', 'tamiris', 'dolophine'],
      dtype='<U29')

We can get all the words grouped in a specific cluster:

In [0]:
clusters.get_words_on_cluster(90).shape

(277,)

In [0]:
clusters.get_words_on_cluster(90)[:10]

array(['common', 'popular', 'important', 'complex', 'personal', 'simple',
       'perhaps', 'direct', 'likely', 'difficult'], dtype='<U29')
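
A cluster lookup of this kind amounts to a boolean mask over two parallel arrays. The sketch below uses a small made-up assignment; the helpers `words_on_cluster` and `cluster_of` are hypothetical and not part of the library.

```python
import numpy as np

# Toy parallel arrays: one cluster id per vocabulary entry.
vocab = np.array(["common", "popular", "berlin", "simple", "munich"])
cluster_ids = np.array([90, 90, 20, 90, 2])


def words_on_cluster(cluster):
    """Return every word assigned to the given cluster id."""
    return vocab[cluster_ids == cluster]


def cluster_of(word):
    """Return the cluster id of a single word."""
    return int(cluster_ids[vocab == word][0])


print(words_on_cluster(90))
print(cluster_of("berlin"))
```

Keeping the ids in an array aligned with the vocabulary makes both directions of the lookup a single numpy expression.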

We can add the clusters to the word2vec model and generate a response that includes the cluster of each word:

In [0]:
model.clusters = clusters

In [0]:
indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"])

In [0]:
model.generate_response(indexes, metrics).tolist()

[('berlin', 0.32906900744509854, 20),
 ('munich', 0.2932948277328571, 2),
 ('vienna', 0.2930819847373299, 82),
 ('leipzig', 0.28171054300657594, 41),
 ('moscow', 0.2760764595617845, 59),
 ('st_petersburg', 0.2717150667508963, 63),
 ('z_rich', 0.26163907170100476, 42),
 ('prague', 0.2612307315522932, 45),
 ('bonn', 0.2533462051722798, 10),
 ('dresden', 0.2518359568299504, 86)]