<a href="https://colab.research.google.com/github/CMWENLIU/Drug_label_embedding/blob/master/word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
! pip install --quiet word2vec
! pip install --quiet wget

  Building wheel for wget (setup.py) ... [?25l[?25hdone


In [0]:
import wget
import word2vec
import re

# word2vec 

This notebook is equivalent to `demo-word.sh`, `demo-analogy.sh`, `demo-phrases.sh` and `demo-classes.sh` from Google.

In [7]:
%load_ext autoreload
%autoreload 2
urls = ["https://raw.githubusercontent.com/CMWENLIU/Drug_label_embedding/master/pubmed/abs1.txt",
       "https://raw.githubusercontent.com/CMWENLIU/Drug_label_embedding/master/pubmed/abs2.txt",
       ]
[wget.download(u) for u in urls]   

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


['abs1.txt', 'abs2.txt']

In [0]:
def clean_str(string):
   
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    string = string.replace("\\","")
    return string.strip().lower()

def split_record(s):
  journal = s.split('-!!-', 1)[0]
  temp = s.split('-!!-', 1)[1]
  year = temp.split('-##-', 1)[0]
  abstract = s.split('-##-', 1)[1]
  return journal, year, abstract

In [0]:
with open('w2vdata', 'w') as wf:
  with open('abs1.txt') as f:
    for line in f:
      line = clean_str(line)
      wf.write(line)

## Training

Run `word2phrase` to group up similar words "Los Angeles" to "Los_Angeles"

In [18]:
word2vec.word2phrase('w2vdata', 'w2vdata-phrases', verbose=True)

Starting training using file w2vdata

Vocab size (unigrams + bigrams): 335899
Words in train file: 1716190


This created a `text8-phrases` file that we can use as a better input for `word2vec`.
Note that you could easily skip this previous step and use the text data as input for `word2vec` directly.

Now actually train the word2vec model.

In [19]:
word2vec.word2vec('w2vdata-phrases', 'w2vdata.bin', size=100, verbose=True)

Starting training using file w2vdata-phrases
Vocab size: 21335
Words in train file: 1537115
Alpha: 0.000002  Progress: 100.20%  Words/thread/sec: 193.62k  

That created a `text8.bin` file containing the word vectors in a binary format.

Now we generate the clusters of the vectors based on the trained model.

In [20]:
word2vec.word2clusters('w2vdata', 'w2vdata-clusters.txt', 100, verbose=True)

Starting training using file w2vdata
Vocab size: 17654
Words in train file: 1652908
Alpha: 0.000002  Progress: 100.28%  Words/thread/sec: 190.35k  

That created a `text8-clusters.txt` with the cluster for every word in the vocabulary

## Predictions

In [21]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Import the `word2vec` binary file created above

In [0]:
model = word2vec.load('w2vdata.bin')

We can take a look at the vocabulary as a numpy array

In [24]:
model.vocab

array(['</s>', 'the', ',', ..., 'sw', 'lfcs', 'usat'], dtype='<U78')

Or take a look at the whole matrix

In [25]:
model.vectors.shape

(21335, 100)

In [26]:
model.vectors

array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.08136111, -0.07500461,  0.01063167, ..., -0.04574704,
         0.0701974 , -0.12510048],
       [-0.0591997 , -0.04052412,  0.00540176, ...,  0.13127078,
        -0.00493684,  0.06459591],
       ...,
       [ 0.08967227, -0.14987729,  0.09585784, ...,  0.10959856,
        -0.11328791, -0.01284291],
       [-0.02349118, -0.23456866,  0.17473452, ...,  0.09311981,
        -0.02780636, -0.01349567],
       [ 0.00334223, -0.14207917,  0.01476534, ...,  0.13891329,
        -0.05570399, -0.00767328]])

We can retreive the vector of individual words

In [27]:
model['dog'].shape

(100,)

In [28]:
model['dog'][:10]

array([ 0.03412343, -0.21800287,  0.09563856, -0.05208769,  0.02066191,
        0.00465136,  0.01780929,  0.01945632, -0.11154303,  0.11058243])

We can calculate the distance between two or more (all combinations) words.

In [30]:
model.distance("dog", "cat", "fish")

[('dog', 'cat', 0.8139222866656213),
 ('dog', 'fish', 0.7836921646553869),
 ('cat', 'fish', 0.6833860218539964)]

## Similarity

We can do simple queries to retreive words similar to "socks" based on cosine similarity:

In [0]:
indexes, metrics = model.similar("dog")
indexes, metrics

(array([ 2437,  5478,  7593, 10230,  3964,  9963,  2428, 10309,  4812,
         2391]),
 array([0.86937327, 0.83396105, 0.77854628, 0.7692265 , 0.76743628,
        0.7612772 , 0.7600788 , 0.75935677, 0.75693881, 0.75438956]))

This returned a tuple with 2 items:
1. numpy array with the indexes of the similar words in the vocabulary
2. numpy array with cosine similarity to each word

We can get the words for those indexes

In [0]:
model.vocab[indexes]

array(['cat', 'cow', 'goat', 'pig', 'dogs', 'rabbit', 'bear', 'rat',
       'wolf', 'girl'], dtype='<U78')

There is a helper function to create a combined response as a numpy [record array](http://docs.scipy.org/doc/numpy/user/basics.rec.html)

In [0]:
model.generate_response(indexes, metrics)

rec.array([('cat', 0.86937327), ('cow', 0.83396105), ('goat', 0.77854628),
           ('pig', 0.7692265 ), ('dogs', 0.76743628),
           ('rabbit', 0.7612772 ), ('bear', 0.7600788 ),
           ('rat', 0.75935677), ('wolf', 0.75693881),
           ('girl', 0.75438956)],
          dtype=[('word', '<U78'), ('metric', '<f8')])

Is easy to make that numpy array a pure python response:

In [0]:
model.generate_response(indexes, metrics).tolist()

[('cat', 0.8693732680572173),
 ('cow', 0.8339610529888226),
 ('goat', 0.7785462766666428),
 ('pig', 0.7692265048531302),
 ('dogs', 0.7674362783482181),
 ('rabbit', 0.7612771996422674),
 ('bear', 0.7600788045286304),
 ('rat', 0.7593567655129181),
 ('wolf', 0.7569388070301634),
 ('girl', 0.754389556345068)]

### Phrases

Since we trained the model with the output of `word2phrase` we can ask for similarity of "phrases", basically compained words such as "Los Angeles"

In [0]:
indexes, metrics = model.similar('los_angeles')
model.generate_response(indexes, metrics).tolist()

[('san_francisco', 0.8876351265573288),
 ('san_diego', 0.8652920422732189),
 ('seattle', 0.8387625165949533),
 ('las_vegas', 0.8325965377422355),
 ('california', 0.8252775393303263),
 ('miami', 0.8167069457881345),
 ('detroit', 0.8164911899252103),
 ('chicago', 0.813283620659967),
 ('cincinnati', 0.8116379669114295),
 ('cleveland', 0.810708205429068)]

### Analogies

Its possible to do more complex queries like analogies such as: `king - man + woman = queen` 
This method returns the same as `cosine` the indexes of the words in the vocab and the metric

In [0]:
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'])
indexes, metrics

(array([1087, 6768, 1145, 7523, 1335, 8419, 3141, 1827,  344, 4980]),
 array([0.28823424, 0.26614362, 0.26265608, 0.26111525, 0.26091172,
        0.25844542, 0.25781944, 0.25678284, 0.25424551, 0.2529607 ]))

In [0]:
model.generate_response(indexes, metrics).tolist()

[('queen', 0.28823424120681784),
 ('regent', 0.26614361576778933),
 ('prince', 0.2626560787162791),
 ('empress', 0.2611152451318436),
 ('wife', 0.26091172315990346),
 ('aragon', 0.25844541581050506),
 ('monarch', 0.25781944140528035),
 ('throne', 0.256782835877586),
 ('son', 0.25424550637754495),
 ('heir', 0.25296070456687614)]

### Clusters

In [0]:
clusters = word2vec.load_clusters('/Users/drodriguez/Downloads/text8-clusters.txt')

We can see get the cluster number for individual words

In [0]:
clusters.vocab

array(['</s>', 'the', 'of', ..., 'bredon', 'skirting', 'santamaria'],
      dtype='<U29')

We can see get all the words grouped on an specific cluster

In [0]:
clusters.get_words_on_cluster(90).shape

(206,)

In [0]:
clusters.get_words_on_cluster(90)[:10]

array(['along', 'associated', 'relations', 'relationship', 'deal',
       'combined', 'contact', 'connection', 'respect', 'mixed'],
      dtype='<U29')

We can add the clusters to the word2vec model and generate a response that includes the clusters

In [0]:
model.clusters = clusters

In [0]:
indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"])

In [0]:
model.generate_response(indexes, metrics).tolist()

[('berlin', 0.3187078682472152, 15),
 ('vienna', 0.28562803640143397, 12),
 ('munich', 0.28527806428082675, 21),
 ('moscow', 0.27085681100243797, 74),
 ('leipzig', 0.2697639527846636, 8),
 ('st_petersburg', 0.25841328545046965, 61),
 ('prague', 0.2571333430942206, 72),
 ('bonn', 0.2546126113385251, 8),
 ('dresden', 0.2471285069069249, 71),
 ('warsaw', 0.2450778083401204, 74)]