### Word2Vec

A team at Google led by Tomas Mikolov created Word2Vec (word to vector) in 2013. It offers two model architectures:

* Continuous bag-of-words (CBOW) model: predicts the current word from a window of surrounding context words. In other words, given a set of context words, it predicts the missing word that is likely to appear in that context. For example, given "cigarette smoking is ??? to health", the model will likely predict the missing word as "injurious", based on the training set.

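* Continuous skip-gram model: predicts the surrounding window of context words from the current word. In other words, given a single word, it predicts the probability of other words that are likely to appear near it in that context. For example, given the word "smoking" from the sentence above, the model should assign high probability to nearby words such as "cigarette" and "injurious".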
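
To make the difference between the two models concrete, here is a minimal sketch of the training examples each one would see for a single sentence. The helper names (`context_window`, `cbow_examples`, `skipgram_examples`) are my own for illustration and are not part of gensim; real implementations add details such as subsampling that are left out here.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    # CBOW: a list of context words is the input, the current word is the target.
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    # Skip-gram: the current word is the input, each context word is a target.
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]


sentence = "cigarette smoking is injurious to health".split()
cbow = cbow_examples(sentence)
skipgram = skipgram_examples(sentence)

print(cbow[3])       # (['smoking', 'is', 'to', 'health'], 'injurious')
print(skipgram[:3])  # [('cigarette', 'smoking'), ('cigarette', 'is'), ('smoking', 'cigarette')]
```

CBOW produces one example per word, while skip-gram produces one example per (word, neighbour) pair, which is why skip-gram sees more training examples from the same text.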


I'll use Google's pre-trained model, which you can download from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit. This model includes a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset.

In [1]:
import gensim

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('Data/GoogleNews-vectors-negative300.bin', binary=True)

In [2]:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431607246399),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133)]

In [3]:
model.most_similar(positive=['girl', 'father'], negative=['boy'], topn=3)

[('mother', 0.831214427947998),
 ('daughter', 0.8000643253326416),
 ('husband', 0.769158124923706)]
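
The analogy queries above boil down to vector arithmetic followed by a cosine-similarity ranking. Below is a rough sketch of that idea in plain Python. The 2-d vectors in `toy` are made up for illustration (think of the axes as "royalty" and "gender"), and the `analogy` helper is hypothetical. gensim's own implementation differs in details such as how the vectors are normalized before they are combined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-d embeddings, NOT taken from the Google News model.
toy = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def analogy(vectors, positive, negative, topn=1):
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:   # add the positive words
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:   # subtract the negative words
        query = [q - x for q, x in zip(query, vectors[word])]
    # Rank every other word by similarity to the query vector.
    skip = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in skip]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


result = analogy(toy, positive=["woman", "king"], negative=["man"])
print(result[0][0])  # queen
```

Note that the query words themselves are excluded from the candidates; otherwise "king" would usually be its own nearest neighbour.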

In [4]:
model.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'
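
One way to think about this odd-one-out query: average the word vectors and pick the word that is least similar to that average. Here is a small sketch of that idea. The vectors in `toy` and the `odd_one_out` helper are made up for illustration, so treat it as an approximation of the approach rather than gensim's actual code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    # Mean of the unit vectors for all words in the query.
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    # The word whose vector points least in the direction of the mean.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical 2-d embeddings, NOT taken from the Google News model.
toy = {
    "breakfast": [0.9, 0.1],
    "lunch":     [0.8, 0.2],
    "dinner":    [0.85, 0.15],
    "cereal":    [0.2, 0.9],
}

odd = odd_one_out(toy, "breakfast cereal dinner lunch".split())
print(odd)  # cereal
```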

### Training a Word2Vec model on your own custom data

Parameters
* size: The dimensionality of the vectors (renamed to `vector_size` in gensim 4.0 and later). Larger values require more training data but can lead to more accurate models
* window: The maximum distance between the current and predicted word within a sentence
* min_count: Ignore all words with total frequency lower than this
* sg: 0 for the CBOW model, 1 for the skip-gram model
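
To see what `min_count` does in practice, here is a minimal sketch of frequency-based vocabulary pruning in plain Python. The `prune_vocabulary` helper is hypothetical and only mimics the filtering idea; it is not gensim's implementation.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count):
    """Keep only the words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


corpus = [
    ['cigarette', 'smoking', 'is', 'injurious', 'to', 'health'],
    ['cigarette', 'smoking', 'causes', 'cancer'],
    ['cigarette', 'are', 'not', 'to', 'be', 'sold', 'to', 'kids'],
]

print(sorted(prune_vocabulary(corpus, min_count=2)))  # ['cigarette', 'smoking', 'to']
print(len(prune_vocabulary(corpus, min_count=1)))     # 13
```

On a tiny corpus like this, any threshold above 1 throws away most of the vocabulary, which is why a toy example needs `min_count=1`.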

In [5]:
sentences = [
    ['cigarette', 'smoking', 'is', 'injurious', 'to', 'health'],
    ['cigarette', 'smoking', 'causes', 'cancer'],
    ['cigarette', 'are', 'not', 'to', 'be', 'sold', 'to', 'kids'],
]

# Train a skip-gram Word2Vec model (sg=1) on the three sentences.
# min_count=1 keeps every word, since most words appear only once in this tiny corpus.
model = gensim.models.Word2Vec(sentences, min_count=1, sg=1, window=3)

model.wv.most_similar(positive=['cigarette', 'smoking'], negative=['kids'], topn=1)

[('causes', 0.10139459371566772)]