# **Word2Vec Tutorial**

* __Based on Lecture 2 from the Stanford Deep Learning Series by Chris Manning__
* __Word2Vec: Set of algorithms developed by Mikolov et al. (2013)__
* __Original Word2Vec Paper: https://arxiv.org/abs/1301.3781__

### Question: How do we represent the meaning of a word computationally?
* Often we make use of a taxonomy like WordNet, using hypernym relationships and synonym sets.
* Problems with this approach:
    * It misses nuances (e.g. 'good' and 'expert' are listed as synonyms in WordNet, but "he's an expert at deep learning" and "he's good at deep learning" clearly have different meanings).
    * It misses new words, so it must constantly be kept up to date, which requires a lot of human labor.
    
    
* This can be considered a 'discrete' representation, using atomic symbols.
* In vector space terms, words like 'hotel', 'conference', and 'walk' are each represented by a vector containing a single 1 and zeros everywhere else, e.g. [0 0 0 0 0 0 0 0 1 0 0 0]: a 'one-hot' vector representation.
* These vectors are very long: their length equals the size of the vocabulary.
* Another problem: one-hot vectors show no inherent relationship between words. Any two distinct one-hot vectors are orthogonal, so 'motel' and 'hotel' look completely unrelated, even though a search for 'Seattle motel' should also match 'Seattle hotel'.
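
The lack of similarity in one-hot vectors can be checked directly. The following is a minimal sketch with a made-up five-word vocabulary (the words and their order are assumptions for illustration): the dot product of any two different one-hot vectors is zero, so the representation carries no notion of relatedness.

```python
import numpy as np

# Hypothetical toy vocabulary; the words and their order are illustrative only.
vocab = ["conference", "hotel", "motel", "seattle", "walk"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

hotel, motel = one_hot("hotel"), one_hot("motel")
print(hotel)           # a single 1 at the position of 'hotel'
print(hotel @ motel)   # 0.0: distinct one-hot vectors are always orthogonal
print(hotel @ hotel)   # 1.0: a word only "matches" itself
```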

### Better approach:
* Find a way for the vectors to directly encode similarity &mdash; distributional similarity (the meaning of a word is the context it is in).
* "You shall know a word by the company it keeps." J. R. Firth 1957

* __Our goal:__ build a dense vector for each word, chosen so that it is good at predicting other words appearing in its context &mdash; the other words also represented by vectors of their own.
* Eg. linguistics = [0.286 0.792 -0.177 -0.107 0.109 -0.542 0.349 0.271]
* This type of dense vector representation is distributed &mdash; instead of an atomic 'one-hot' vector representation, the meaning of the word is 'smeared' over the whole vector with different numbers.
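
To see how dense vectors can encode similarity, here is a small sketch using cosine similarity. The three 4-dimensional vectors are invented for illustration and are not learned embeddings; the point is only that related words can be given vectors that point in similar directions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical dense vectors, made up for this example (not trained).
hotel = np.array([0.8, 0.1, -0.3, 0.5])
motel = np.array([0.7, 0.2, -0.2, 0.4])
walk = np.array([-0.4, 0.9, 0.6, -0.1])

print(cosine_similarity(hotel, motel))  # close to 1: the vectors are nearly parallel
print(cosine_similarity(hotel, walk))   # negative: the vectors point in different directions
```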

### Word2Vec: set of algorithms developed by Tomas Mikolov et al. (2013)

* These algorithms build dense vector representations for words, as outlined above.
* Became highly influential in statistical NLP.

### Basic idea:

* Define a model that predicts the context words from a center word $w_t$, in terms of word vectors:
    * $P(\text{context} \mid w_{t}) = \ldots$
    * The model has a loss function, e.g.
    * $J = 1 - P(w_{-t} \mid w_{t})$, where $w_{-t}$ refers to everything other than $w_t$, i.e. all the words in its context.
    * We look at many positions $t$ in a big language corpus.
    * We keep adjusting the vector representations of words to minimize this loss.
    * As a result, we obtain dense word vectors that are very powerful in representing meaning.

### Two algorithms of Word2Vec:

* __Skip-Gram (SG)__ &mdash; predicts context words given target
* __Continuous Bag of Words (CBOW)__ &mdash; predicts target word from bag-of-words context
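
The difference between the two algorithms shows up in how training examples are formed. The sketch below is illustrative only: the helper names and the `radius` parameter (how many words on each side count as context) are assumptions, not part of gensim or the original paper's code.

```python
def context_window(tokens, t, radius):
    """Words within `radius` positions of index t, excluding the center word."""
    left = tokens[max(0, t - radius):t]
    right = tokens[t + 1:t + 1 + radius]
    return left + right

def skipgram_examples(tokens, radius):
    """Skip-gram: one (center, context_word) pair per context word."""
    return [(tokens[t], c)
            for t in range(len(tokens))
            for c in context_window(tokens, t, radius)]

def cbow_examples(tokens, radius):
    """CBOW: one (bag_of_context_words, center) pair per position."""
    return [(context_window(tokens, t, radius), tokens[t])
            for t in range(len(tokens))]

tokens = ["the", "jury", "said", "friday"]
print(skipgram_examples(tokens, radius=1))
print(cbow_examples(tokens, radius=1))
```

Skip-gram produces several small prediction problems per position, while CBOW produces one prediction per position from the pooled context.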

### Two training methods:

* __Hierarchical softmax__
* __Negative sampling__

## Skip-Gram Model

<img src="SkipGramPrediction.png">

* For each position $t = 1, \ldots, T$ in the corpus, predict the surrounding words within a window of 'radius' $m$ around the center word $w_t$.
* Objective function: maximize the probability of every context word given the current center word.


* $\displaystyle J'(\theta) = \prod_{t=1}^T \prod_{\substack{-m \leq j \leq m\\ j \neq 0}} P(w_{t+j} \mid w_t ; \theta)$
* __Negative Log-Likelihood:__ $\displaystyle J(\theta) = -\frac{1}{T}\sum_{t=1}^T \sum_{\substack{-m \leq j \leq m\\ j \neq 0}} \log P(w_{t+j} \mid w_t)$
    * Where $\theta$ represents all variables we will optimize (the vector representations of the words)
    * Notes on second equation:
        * $-$ changes it to a minimization problem, since machine learning people like to minimize things, rather than maximizing them
        * $\frac{1}{T}$ &mdash; normalization, making it 'per word', instead of a probability for the whole corpus
        * $\log$ &mdash; our products turn into sums, making the math a lot easier
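
The relationship between the two objectives can be checked numerically. In this sketch the probabilities are made-up stand-ins for $P(w_{t+j} \mid w_t)$ on a three-word corpus with $m = 1$ (an assumption for illustration, including the choice to simply drop context positions that fall outside the corpus). Besides simplifying the math, working with sums of logs also avoids the numerical underflow that a product of many small probabilities would cause on a real corpus.

```python
import math

# Hypothetical values of P(w_{t+j} | w_t) for T = 3 center positions with m = 1.
# These numbers are invented; a real model would compute them with the softmax.
probs = [
    [0.20],        # t = 1: only a right-hand neighbour exists
    [0.10, 0.05],  # t = 2: left and right neighbours
    [0.30],        # t = 3: only a left-hand neighbour exists
]
T = len(probs)

# J'(theta): product over all positions and context offsets.
likelihood = math.prod(p for row in probs for p in row)

# J(theta): average negative log-likelihood per center position.
nll = -sum(math.log(p) for row in probs for p in row) / T

print(likelihood)
print(nll)
print(math.isclose(nll, -math.log(likelihood) / T))  # the two forms agree
```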

* __How is the probability itself measured?__


* For $P(w_{t+j} \mid w_t)$:
    * $P(o \mid c) = \frac{\exp(u_{o}^T v_c)}{\sum_{w=1}^{V} \exp(u_{w}^T v_c)}$ (_softmax_ form, where $V$ is the vocabulary size)


* $o$: the outside/output word index
* $c$: the center word index
* $v_c$ and $u_o$: "center" and "outside" vectors of indices $c$ and $o$


* We put the dot product of the two vectors into a softmax form.
* The dot product will be bigger if $u$ and $v$ are more similar.
* Softmax form: standard way to turn numbers into a probability distribution.

    * $P_i = \frac{e^{u_i}}{\sum_{j}e^{u_j}}$


* Exponentiation makes sure everything is positive.
* Dividing by the sum of all the numbers gives a probability distribution.

* As a result of this function, each word has two representations: one for when it is the context word, another for when it is the center word.
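
The softmax probability above can be sketched in a few lines of NumPy. The sizes and the randomly initialized matrices are assumptions for illustration (a toy vocabulary of 6 words with 4-dimensional vectors); the names `outside_vectors` and `center_vectors` stand for the $u$ and $v$ vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4                                # toy sizes, assumed for illustration
outside_vectors = rng.normal(size=(vocab_size, dim))  # row w holds u_w
center_vectors = rng.normal(size=(vocab_size, dim))   # row w holds v_w

def softmax(scores):
    """Turn arbitrary real scores into a probability distribution."""
    shifted = scores - scores.max()   # subtracting the max avoids overflow; result is unchanged
    exp_scores = np.exp(shifted)      # exponentiation makes every entry positive
    return exp_scores / exp_scores.sum()

def prob_outside_given_center(c):
    """P(o | c) for every outside word o, returned as a vector of length vocab_size."""
    scores = outside_vectors @ center_vectors[c]   # dot products u_w^T v_c for all w
    return softmax(scores)

p = prob_outside_given_center(c=2)
print(p)
print(p.sum())   # sums to 1 (up to floating-point rounding)
```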

## Final picture of the Skip-Gram model:


<img src="SkipGramDiagram.png">

## Training the model

$$\theta = \left[\begin{array}{c}
v_{aardvark} \\
v_{a} \\
\vdots \\
v_{zebra} \\
u_{aardvark} \\
u_{a} \\
\vdots \\
u_{zebra}
\end{array}\right]
\in \mathbb{R}^{2dV}
$$

* $V$ &mdash; size of vocabulary
* $d$ &mdash; dimensionality of vector representations
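* To train, compute the gradient of $J(\theta)$ with respect to every vector in $\theta$ and adjust the vectors in the direction that reduces the loss.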
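
As an illustration of one such gradient: differentiating the softmax form gives $\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{w=1}^{V} P(w \mid c)\, u_w$, i.e. the observed outside vector minus the expected outside vector under the model (the gradient of the loss $J$ has the opposite sign). The sketch below checks this formula against a finite-difference approximation on random toy vectors; the sizes and function names are assumptions for the example.

```python
import numpy as np

def log_prob(U, v_c, o):
    """log P(o | c) under the softmax model, with U holding one outside vector per row."""
    scores = U @ v_c
    scores = scores - scores.max()                 # stabilize before exponentiating
    return scores[o] - np.log(np.exp(scores).sum())

def grad_center(U, v_c, o):
    """Analytic gradient of log P(o | c) with respect to the center vector v_c."""
    scores = U @ v_c
    p = np.exp(scores - scores.max())
    p /= p.sum()                                   # p[w] = P(w | c)
    return U[o] - p @ U                            # u_o - sum_w P(w | c) u_w

rng = np.random.default_rng(1)
U = rng.normal(size=(5, 3))     # toy setup: 5 words, 3 dimensions
v_c = rng.normal(size=3)
o = 2

# Central finite differences along each coordinate of v_c.
eps = 1e-6
numeric = np.array([
    (log_prob(U, v_c + eps * e, o) - log_prob(U, v_c - eps * e, o)) / (2 * eps)
    for e in np.eye(3)
])
analytic = grad_center(U, v_c, o)
print(np.allclose(numeric, analytic, atol=1e-6))
```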

## Model Result

* The final word vectors obtained are able to capture some general and quite useful semantic information about words and their relationship to one another.
* In the latent vector space, different directions specialize towards different semantic and even grammatical relationships (male-female, verb tense, country-capital, and so on).
* This can be seen in the following graphs, using t-SNE for dimensionality reduction (from https://www.tensorflow.org/tutorials/word2vec):

<img src="W2VSemanticRelationships.png">

In addition, the following is a t-SNE visualization of learned Word2Vec embeddings, showing that similar words do indeed cluster near each other:

<img src="W2VVisualization.png">

# Building a Word2Vec Model

In [50]:
import gensim, logging, multiprocessing
import pandas as pd
import numpy as np
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)



In [51]:
## Configuring the word2vec model
w2vconfig = {
    'min_count': 15, # minimum frequency of a word to be included in the model
    'hs': 0, # training method: 0 --- negative sampling, 1 --- hierarchical softmax
    'negative': 5, # number of "negative" words selected to update model weights
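    'size': 300, # dimension of the vectors (this parameter is named 'vector_size' in gensim >= 4.0)
    'sg': 0, # 0 --- CBOW, 1 --- skip-gram
    'batch_words': 10000, # batch size for training the model
    'iter': 100, # number of epochs (this parameter is named 'epochs' in gensim >= 4.0)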
    'window': 10, # context window
    'workers': multiprocessing.cpu_count(),
}

In [52]:
## Initializing the word2vec model using the above configuration
model = gensim.models.Word2Vec(**w2vconfig)

In [64]:
## Preparing the word2vec input: the gensim word2vec module expects a sequence of tokenized sentences (a list of lists of strings)
# E.g. sentences = [['sentence', 'one'], ['sentence', 'two'], ...]
# Before building the model, it helps to pre-process the data (tokenization, stemming, POS-tagging, stop word/punctuation removal, etc.)
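
For raw text that is not already tokenized, a very rough pre-processing sketch might look like the following. It uses only the standard library; the `tokenize` helper and its regular expressions are assumptions for illustration and are far cruder than a proper tokenizer such as the ones shipped with NLTK.

```python
import re

def tokenize(text):
    """Split raw text into sentences, then into lowercase word tokens (punctuation dropped)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences if s]

raw = "The jury said Friday. No irregularities took place!"
print(tokenize(raw))
# [['the', 'jury', 'said', 'friday'], ['no', 'irregularities', 'took', 'place']]
```

The result has the list-of-lists shape described above and could be passed to `build_vocab` and `train` in place of the corpus used in the next cell.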

In [97]:
# Here the Brown corpus is used as a demonstration (requires nltk.download('brown') on first use)
from nltk.corpus import brown
sentences = brown.sents(categories=brown.categories())
print(len(sentences))
print(sentences[:2])

57340
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.']]


In [63]:
# Building the vocabulary and training the model
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)  # in gensim >= 4.0, use model.epochs instead of model.iter

2018-02-01 17:04:06,736 : INFO : collecting all words and their counts
2018-02-01 17:04:06,739 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-02-01 17:04:07,562 : INFO : PROGRESS: at sentence #10000, processed 219770 words, keeping 23488 word types
2018-02-01 17:04:08,297 : INFO : PROGRESS: at sentence #20000, processed 430477 words, keeping 34367 word types
2018-02-01 17:04:09,046 : INFO : PROGRESS: at sentence #30000, processed 669056 words, keeping 42365 word types
2018-02-01 17:04:09,797 : INFO : PROGRESS: at sentence #40000, processed 888291 words, keeping 49136 word types
2018-02-01 17:04:10,353 : INFO : PROGRESS: at sentence #50000, processed 1039920 words, keeping 53024 word types
2018-02-01 17:04:10,808 : INFO : collected 56057 word types from a corpus of 1161192 raw words and 57340 sentences
2018-02-01 17:04:10,810 : INFO : Loading a fresh vocabulary
2018-02-01 17:04:10,892 : INFO : min_count=15 retains 6528 unique words (11% of original 56057

2018-02-01 17:05:17,998 : INFO : PROGRESS: at 11.20% examples, 118556 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:05:19,035 : INFO : PROGRESS: at 11.36% examples, 118509 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:05:20,083 : INFO : PROGRESS: at 11.52% examples, 118647 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:05:21,117 : INFO : PROGRESS: at 11.66% examples, 118539 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:05:22,140 : INFO : PROGRESS: at 11.88% examples, 118484 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:05:23,164 : INFO : PROGRESS: at 12.07% examples, 118479 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:05:24,195 : INFO : PROGRESS: at 12.25% examples, 118614 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:05:25,225 : INFO : PROGRESS: at 12.42% examples, 118755 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:05:26,268 : INFO : PROGRESS: at 12.59% examples, 118969 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:05:27,280 : INFO : PROGRESS: at 12.79% examples, 119157 wor

2018-02-01 17:06:40,680 : INFO : PROGRESS: at 24.30% examples, 114789 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:06:41,761 : INFO : PROGRESS: at 24.45% examples, 114759 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:06:42,782 : INFO : PROGRESS: at 24.58% examples, 114708 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:06:43,824 : INFO : PROGRESS: at 24.71% examples, 114562 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:06:44,827 : INFO : PROGRESS: at 24.93% examples, 114583 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:06:45,849 : INFO : PROGRESS: at 25.08% examples, 114488 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:06:46,869 : INFO : PROGRESS: at 25.22% examples, 114395 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:06:47,879 : INFO : PROGRESS: at 25.39% examples, 114503 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:06:48,898 : INFO : PROGRESS: at 25.55% examples, 114573 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:06:49,936 : INFO : PROGRESS: at 25.74% examples, 114705 wor

2018-02-01 17:08:03,003 : INFO : PROGRESS: at 37.94% examples, 115550 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:08:04,018 : INFO : PROGRESS: at 38.07% examples, 115408 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:08:05,064 : INFO : PROGRESS: at 38.21% examples, 115356 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:08:06,112 : INFO : PROGRESS: at 38.39% examples, 115402 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:08:07,124 : INFO : PROGRESS: at 38.53% examples, 115423 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:08:08,149 : INFO : PROGRESS: at 38.68% examples, 115410 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:08:09,178 : INFO : PROGRESS: at 38.88% examples, 115379 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:08:10,214 : INFO : PROGRESS: at 39.08% examples, 115410 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:08:11,253 : INFO : PROGRESS: at 39.24% examples, 115385 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:08:12,270 : INFO : PROGRESS: at 39.35% examples, 115274 wor

2018-02-01 17:09:25,406 : INFO : PROGRESS: at 50.62% examples, 113840 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:09:26,435 : INFO : PROGRESS: at 50.77% examples, 113776 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:09:27,435 : INFO : PROGRESS: at 51.00% examples, 113809 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:09:28,467 : INFO : PROGRESS: at 51.16% examples, 113832 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:09:29,473 : INFO : PROGRESS: at 51.34% examples, 113889 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:09:30,486 : INFO : PROGRESS: at 51.51% examples, 113963 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:09:31,501 : INFO : PROGRESS: at 51.69% examples, 114039 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:09:32,526 : INFO : PROGRESS: at 51.93% examples, 114079 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:09:33,564 : INFO : PROGRESS: at 52.12% examples, 114107 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:09:34,571 : INFO : PROGRESS: at 52.29% examples, 114142 wor

2018-02-01 17:10:47,627 : INFO : PROGRESS: at 64.02% examples, 113977 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:10:48,631 : INFO : PROGRESS: at 64.17% examples, 113973 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:10:49,669 : INFO : PROGRESS: at 64.34% examples, 113979 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:10:50,721 : INFO : PROGRESS: at 64.49% examples, 113979 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:10:51,752 : INFO : PROGRESS: at 64.62% examples, 113963 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:10:52,768 : INFO : PROGRESS: at 64.81% examples, 113963 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:10:53,820 : INFO : PROGRESS: at 64.98% examples, 113896 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:10:54,857 : INFO : PROGRESS: at 65.15% examples, 113915 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:10:55,896 : INFO : PROGRESS: at 65.32% examples, 113935 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:10:56,921 : INFO : PROGRESS: at 65.49% examples, 113988 wor

2018-02-01 17:12:09,986 : INFO : PROGRESS: at 76.94% examples, 113417 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:12:11,002 : INFO : PROGRESS: at 77.12% examples, 113442 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:12:12,010 : INFO : PROGRESS: at 77.31% examples, 113491 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:12:13,050 : INFO : PROGRESS: at 77.49% examples, 113545 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:12:14,078 : INFO : PROGRESS: at 77.64% examples, 113570 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:12:15,115 : INFO : PROGRESS: at 77.88% examples, 113606 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:12:16,123 : INFO : PROGRESS: at 78.06% examples, 113596 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:12:17,153 : INFO : PROGRESS: at 78.22% examples, 113602 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:12:18,180 : INFO : PROGRESS: at 78.41% examples, 113658 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:12:19,197 : INFO : PROGRESS: at 78.57% examples, 113707 wor

2018-02-01 17:13:32,630 : INFO : PROGRESS: at 90.35% examples, 113591 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:13:33,650 : INFO : PROGRESS: at 90.47% examples, 113565 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:13:34,752 : INFO : PROGRESS: at 90.61% examples, 113540 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:13:35,773 : INFO : PROGRESS: at 90.76% examples, 113507 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:13:36,835 : INFO : PROGRESS: at 90.96% examples, 113492 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:13:37,871 : INFO : PROGRESS: at 91.11% examples, 113464 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:13:38,872 : INFO : PROGRESS: at 91.25% examples, 113454 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:13:39,955 : INFO : PROGRESS: at 91.39% examples, 113428 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:13:41,039 : INFO : PROGRESS: at 91.53% examples, 113416 words/s, in_qsize 0, out_qsize 0
2018-02-01 17:13:42,045 : INFO : PROGRESS: at 91.63% examples, 113365 wor

70560623

In [123]:
## Testing the model

# Similarity queries are made through the model's KeyedVectors object (model.wv)
print("Most similar words to 'person':", model.wv.most_similar(positive=["person"]), "\n")
print("Most similar words to 'Asia':", model.wv.most_similar(positive=["Asia"]), "\n")
print("Most similar words to 'two':", model.wv.most_similar(positive=["two"]), "\n")
print("Most similar words to 'Fred':", model.wv.most_similar(positive=["Fred"]), "\n")
print("'England' - 'London' + 'Germany' (expected 'Berlin'):", model.wv.most_similar(positive=["England", "Germany"], negative=["London"]), "\n")

# Similar words:
print("Similarity of 'two' and 'three':", model.wv.similarity("two", "three"), "\n")
print("Similarity of 'go' and 'come':", model.wv.similarity("go", "come"), "\n")

# Dissimilar words:
print("Similarity of 'go' and 'orange':", model.wv.similarity("go", "orange"))

Most similar words to 'person': [('man', 0.3397452235221863), ('child', 0.311055451631546), ('word', 0.2815074324607849), ('woman', 0.2698192298412323), ('lady', 0.24463726580142975), ('dancer', 0.23879790306091309), ('people', 0.23758931457996368), ('award', 0.23718714714050293), ('anyone', 0.23104585707187653), ('teacher', 0.2303423285484314)] 

Most similar words to 'Asia': [('Southeast', 0.43082574009895325), ('Europe', 0.396241158246994), ('Africa', 0.3862916827201843), ('China', 0.3359772562980652), ('Organization', 0.3180088400840759), ('Foreign', 0.30951645970344543), ('peoples', 0.3089768886566162), ('Maryland', 0.2858218550682068), ('NATO', 0.2852109372615814), ('Free', 0.2796500623226166)] 

Most similar words to 'two': [('four', 0.4273889660835266), ('several', 0.37681204080581665), ('three', 0.35842716693878174), ('six', 0.3434075117111206), ('thirty', 0.3404092490673065), ('few', 0.3227190375328064), ('dozen', 0.3199787735939026), ('forty', 0.3110663890838623), ('Two', 0.

In [96]:
# Checking individual vectors
orange_vector = model.wv["orange"]  # look up the learned vector for a word directly
print(orange_vector)

[  1.89136040e+00  -1.13161635e+00   5.34476101e-01  -6.50712788e-01
  -5.39923370e-01  -2.00670436e-01  -5.16958773e-01  -1.74992669e+00
   3.16177279e-01   3.29478770e-01   9.42208827e-01  -4.04226422e-01
   1.01700008e+00  -4.18334454e-01   7.59061158e-01   3.86817828e-02
  -8.75214994e-01   1.35940850e+00  -4.35161322e-01   1.23223066e+00
  -1.00116587e+00   4.72312748e-01   1.99902400e-01  -6.12646788e-02
  -8.72191608e-01  -8.76358628e-01  -8.53626490e-01  -3.95954311e-01
   1.16534472e+00   5.43274283e-01  -4.84839916e-01   5.45051098e-01
  -7.80732274e-01  -1.82947624e+00   9.00417686e-01  -7.68705726e-01
  -9.75545585e-01   4.69934523e-01  -1.19707450e-01   5.54120600e-01
  -2.07117051e-01  -1.96374446e-01  -1.16820574e+00  -7.21928999e-02
  -4.04828846e-01   1.98103964e-01   1.16515708e+00   1.18716514e+00
   1.15367913e+00  -1.26554132e+00   1.38506591e-01   5.30824065e-02
   2.70326912e-01   2.07977265e-01  -4.65884686e-01   5.20242095e-01
   9.31067392e-02  -8.85635674e-01

* Alternatively, you can use a pre-trained word2vec model.
* W2V model made by Google (1.5 GB, vocabulary of 3 million words and phrases, trained on about 100 billion words of Google News text): https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
* W2V models for many different languages: https://github.com/Kyubyong/wordvectors
* More W2V models: https://sites.google.com/site/rmyeid/projects/polyglot

* Example: loading the above word2vec model by Google:
    * `large_model = gensim.models.KeyedVectors.load_word2vec_format('directory/GoogleNews-vectors-negative300.bin.gz', binary=True)`
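
Because the full Google model is large, it can help to load only part of it. The sketch below wraps the call in a hypothetical helper, `load_pretrained`, that returns `None` when the file is missing; the path is the same placeholder as above, and the `limit` value of 200,000 is an arbitrary choice for illustration. `limit` is an optional argument of `load_word2vec_format` that reads only the first N vectors from the file.

```python
import os

def load_pretrained(path, limit=None):
    """Load word2vec-format vectors from `path`, or return None if the file is missing.

    `limit` caps how many vectors are read, which reduces memory use.
    """
    if not os.path.exists(path):
        return None
    from gensim.models import KeyedVectors  # imported lazily so the check above runs without gensim
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)

# Placeholder path: replace 'directory' with wherever the file was downloaded.
vectors = load_pretrained("directory/GoogleNews-vectors-negative300.bin.gz", limit=200000)
if vectors is not None:
    print(vectors.most_similar(positive=["England", "Germany"], negative=["London"], topn=3))
```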