# **Word2Vec Tutorial**

* __Based on Lecture 2 from the Stanford Deep Learning Series by Chris Manning__
* __Word2Vec: Set of algorithms developed by Mikolov et al. (2013)__
* __Original Word2Vec Paper: https://arxiv.org/abs/1301.3781__

### Question: How do we represent the meaning of a word computationally?
* A common approach is a taxonomy such as WordNet, which encodes hypernym ("is-a") relationships and synonym sets.
* Problems with this approach:
    * It misses nuances (e.g. 'good' and 'expert' are considered synonyms in WordNet, but "he's an expert at deep learning" and "he's good at deep learning" clearly have different meanings).
    * It misses new words, so it must constantly be kept up to date, which requires a lot of human labor.
    
    
* This can be considered a 'discrete' representation that treats words as atomic symbols.
* In vector space terms, words like 'hotel', 'conference' and 'walk' are each represented by a vector with a single 1 and zeros everywhere else, e.g. [0 0 0 0 0 0 0 0 1 0 0 0]. This is called a 'one-hot' vector representation.
* These vectors end up being very long: each one has a dimension for every word in the vocabulary.
* Another problem: one-hot vectors show no inherent relationship between words. Any two distinct one-hot vectors are orthogonal, so nothing indicates that 'Seattle motel' and 'Seattle hotel' should be considered similar.
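
The orthogonality problem can be checked directly. The following is a minimal sketch with a made-up four-word vocabulary (the ordering of the words is arbitrary and chosen only for illustration); it shows that the dot product of two different one-hot vectors is always zero.

```python
import numpy as np

# Toy vocabulary: the words appear in the text, the ordering is an assumption.
vocab = ["hotel", "motel", "conference", "walk"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[word]] = 1
    return vec

hotel, motel = one_hot("hotel"), one_hot("motel")
print(hotel)               # [1 0 0 0]
print(motel)               # [0 1 0 0]
print(int(hotel @ motel))  # 0: the vectors share no dimension, so no similarity
```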

### Better approach:
* Find a way for the vectors themselves to encode similarity. The idea used here is distributional similarity: the meaning of a word is given by the contexts in which it appears.
* "You shall know a word by the company it keeps." J. R. Firth 1957

* __Our goal:__ build a dense vector for each word, chosen so that it is good at predicting other words appearing in its context &mdash; the other words also represented by vectors of their own.
* E.g. linguistics = [0.286 0.792 -0.177 -0.107 0.109 -0.542 0.349 0.271]
* This type of dense vector representation is distributed &mdash; instead of an atomic 'one-hot' vector representation, the meaning of the word is 'smeared' over the whole vector with different numbers.

### Word2Vec: a set of algorithms developed by Tomas Mikolov et al. in 2013

* These algorithms build dense vector representations for words, as outlined above.
* Became highly influential in statistical NLP.

### Basic idea:

* Define a model that predicts the context words around a center word $w_t$ in terms of word vectors:
    * $P(context \mid w_{t}) = \ldots$
    * Which has a loss function, e.g.
    * $J = 1 - P(w_{-t} \mid w_{t})$, where $w_{-t}$ refers to everything that is not $w_t$, i.e. all the words in the context
    * We look at many positions $t$ in a big language corpus.
    * We keep adjusting the vector representations of words to minimize this loss.
    * As a result, we obtain dense word vectors that are very powerful in representing meaning.

### Two algorithms of Word2Vec:

* __Skip-Gram (SG)__ &mdash; predicts context words given target
* __Continuous Bag of Words (CBOW)__ &mdash; predicts target word from bag-of-words context

### Two training methods:

* __Hierarchical softmax__
* __Negative sampling__
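
As a rough illustration of the second method, the negative-sampling loss for a single (center, outside) pair, as given in the follow-up paper by Mikolov et al. (2013), is $-\log \sigma(u_o^T v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^T v_c)$. Here $v_c$ is the vector of the center word, $u_o$ the vector of the observed context word, and $u_k$ the vectors of $K$ randomly drawn "noise" words. The sketch below evaluates this loss; the random vectors and the sizes `d` and `K` are assumptions made only for the example, and the noise distribution used to draw negatives is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_c, u_o, u_neg):
    """Loss for one (center, outside) pair with K sampled negative words.

    v_c:   (d,)   vector of the center word
    u_o:   (d,)   vector of the observed outside word
    u_neg: (K, d) vectors of K sampled noise words
    """
    positive = -np.log(sigmoid(u_o @ v_c))               # pull the true pair together
    negative = -np.sum(np.log(sigmoid(-(u_neg @ v_c))))  # push noise words away
    return positive + negative

rng = np.random.default_rng(0)
d, K = 8, 5  # illustrative sizes; K plays the role of gensim's `negative` setting
v_c = rng.normal(scale=0.1, size=d)
u_o = rng.normal(scale=0.1, size=d)
u_neg = rng.normal(scale=0.1, size=(K, d))
print(negative_sampling_loss(v_c, u_o, u_neg))
```

Only $K + 1$ dot products are needed per training pair, instead of one per vocabulary word as in the full softmax.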

## Skip-Gram Model

<img src="SkipGramPrediction.png">

* For each position $t = 1 \ldots T$ in the corpus, predict the surrounding words within a window of 'radius' $m$ around the center word $w_t$.
* Objective function: maximize the probability of any context word given the current center word.
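
The windowing described above can be sketched as a small generator that yields every (center, context) pair. The function name `skipgram_pairs` and the toy sentence are assumptions made for illustration; positions that would fall off either end of the text are simply skipped.

```python
def skipgram_pairs(tokens, m):
    """Yield (center, context) pairs for a window of radius m."""
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not 0 <= t + j < len(tokens):
                continue  # skip the center itself and positions off the ends
            yield center, tokens[t + j]

tokens = ["you", "shall", "know", "a", "word"]
pairs = list(skipgram_pairs(tokens, m=1))
print(pairs)
```

Each pair contributes one factor $P(w_{t+j} \mid w_t)$ to the objective.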


* $\displaystyle J'(\theta) = \prod_{t=1}^T \prod_{\substack{-m \leq j \leq m\\ j \neq 0}} P(w_{t+j} \mid w_t ; \theta)$
* __Negative Log-Likelihood:__ $\displaystyle J(\theta) = -\frac{1}{T}\sum_{t=1}^T \sum_{\substack{-m \leq j \leq m\\ j \neq 0}} \log P(w_{t+j} \mid w_t)$
    * Where $\theta$ represents all variables we will optimize (the vector representations of the words)
    * Notes on second equation:
        * $-$ changes it to a minimization problem, since machine learning people like to minimize things, rather than maximizing them
        * $\frac{1}{T}$ &mdash; normalization, making it 'per word', instead of a probability for the whole corpus
        * $\log$ &mdash; our products turn into sums, making the math a lot easier

* __How is the probability itself measured?__


* For $P(w_{t+j} \mid w_t)$:
    * $P(o \mid c) = \frac{\exp(u_{o}^T v_c)}{\sum_{w=1}^{V} \exp(u_{w}^T v_c)}$ (the _softmax_ form, where $V$ is the size of the vocabulary)


* $o$: the outside/output word index
* $c$: the center word index
* $v_c$ and $u_o$: "center" and "outside" vectors of indices $c$ and $o$


* We put the dot product of the two vectors into a softmax form.
* The dot product will be bigger if $u$ and $v$ are more similar.
* Softmax form: standard way to turn numbers into a probability distribution.

    * $P_i = \frac{e^{x_i}}{\sum_{j}e^{x_j}}$ for arbitrary real numbers $x_i$


* Exponentiation makes sure everything is positive.
* Dividing by the sum of all the exponentiated numbers makes the results add up to 1, which gives a probability distribution.

* As a result of this function, each word has two representations: one for when it is the context word, another for when it is the center word.
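
To make the pieces above concrete, here is a minimal NumPy sketch of the softmax, of $P(o \mid c)$, and of the average negative log-likelihood $J(\theta)$. The matrices `U` and `V` hold the "outside" and "center" vectors as rows; their random values, the sizes, and the toy corpus of word indices are all assumptions made for the example. Subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Turn a 1-D array of real-valued scores into a probability distribution."""
    exps = np.exp(scores - np.max(scores))  # subtracting the max avoids overflow
    return exps / exps.sum()

def prob_given_center(U, V, c):
    """P(. | c): softmax over the dot products u_w^T v_c for every word w."""
    return softmax(U @ V[c])

def average_nll(token_ids, U, V, m):
    """J(theta): average negative log-likelihood over all positions t."""
    T = len(token_ids)
    total = 0.0
    for t, c in enumerate(token_ids):
        probs = prob_given_center(U, V, c)
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < T:
                total -= np.log(probs[token_ids[t + j]])
    return total / T

rng = np.random.default_rng(0)
vocab_size, d = 6, 4                              # illustrative sizes
U = rng.normal(scale=0.1, size=(vocab_size, d))   # "outside" vectors u_w
V = rng.normal(scale=0.1, size=(vocab_size, d))   # "center" vectors v_w
token_ids = [0, 3, 1, 4, 2, 5, 0, 1]              # made-up corpus of word indices
print(average_nll(token_ids, U, V, m=2))
```

With all vectors set to zero, every word is equally likely, so each pair costs $\log V$; training lowers the loss from that baseline.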

## Final picture of the Skip-Gram model:


<img src="SkipGramDiagram.png">

## Training the model

$$\mathbf{\theta} = \left[\begin{array}{r}
v_{aardvark} \\
v_{a} \\
\vdots \\
v_{zebra} \\
u_{aardvark} \\
u_{a} \\
\vdots \\
u_{zebra}
\end{array}\right]
\in \mathbb{R}^{2dV}
$$

* $V$ &mdash; size of vocabulary
* $d$ &mdash; dimensionality of vector representations
* To train, compute the gradient of $J(\theta)$ with respect to every vector in $\theta$ and update the vectors by gradient descent.
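
For the softmax model, the gradient with respect to the center vector has a simple closed form: $\frac{\partial}{\partial v_c}\left[-\log P(o \mid c)\right] = \sum_{w=1}^{V} P(w \mid c)\,u_w - u_o$, i.e. the expected outside vector under the model minus the observed one. The sketch below computes this gradient and takes one gradient-descent step; the random vectors, sizes, and the step size `learning_rate` are assumptions chosen for illustration, and only $v_c$ is updated (a full implementation would also update the outside vectors).

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

def loss(U, v_c, o):
    """-log P(o | c) for one (center, outside) pair."""
    return -np.log(softmax(U @ v_c)[o])

def grad_center(U, v_c, o):
    """Gradient of the loss w.r.t. v_c: expected outside vector minus observed one."""
    probs = softmax(U @ v_c)
    return U.T @ probs - U[o]

rng = np.random.default_rng(0)
vocab_size, d, o = 6, 4, 2           # illustrative sizes and outside-word index
U = rng.normal(size=(vocab_size, d))
v_c = rng.normal(size=d)

learning_rate = 0.05                 # hypothetical step size
before = loss(U, v_c, o)
v_c_new = v_c - learning_rate * grad_center(U, v_c, o)
after = loss(U, v_c_new, o)
print(before, after)  # the loss should go down after the step
```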

## Model Result

* The final word vectors obtained are able to capture some general and quite useful semantic information about words and their relationship to one another.
* In the latent vector space, different directions specialize towards different semantic and even grammatical relationships (male-female, verb tense, country-capital, and so on).
* This can be seen in the following graphs, using t-SNE for dimensionality reduction (from https://www.tensorflow.org/tutorials/word2vec):

<img src="W2VSemanticRelationships.png">

In addition, the following is a t-SNE visualization of learned Word2Vec embeddings, showing that similar words do indeed cluster near each other:

<img src="W2VVisualization.png">

# Building a Word2Vec Model

In [1]:
import gensim, logging, multiprocessing
import pandas as pd
import numpy as np
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Using TensorFlow backend.


In [2]:
## Configuring the word2vec model
w2vconfig = {
    'min_count': 6, # minimum frequency of a word to be included in the model
    'hs': 0, # training method: 0 --- negative sampling, 1 --- hierarchical softmax
    'negative': 5, # number of "negative" words selected to update model weights
    'size': 100, # dimension of the vectors
    'sg': 0, # 0 --- CBOW, 1 --- skip-gram 
    'batch_words': 10000, # batch size for training the model
    'iter': 20, # number of epochs
    'window': 5, # context window
    'workers': multiprocessing.cpu_count(),
}
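
The configuration above uses the parameter names of the gensim 3.x releases. In gensim 4.0 and later, `size` was renamed to `vector_size` and `iter` to `epochs` (and the trained model exposes `model.epochs` instead of `model.iter`). A small helper along the following lines could translate an old-style configuration; the names `GENSIM4_RENAMES` and `to_gensim4_kwargs` are hypothetical and not part of gensim.

```python
# Renames introduced in gensim 4.0 (old name -> new name).
GENSIM4_RENAMES = {"size": "vector_size", "iter": "epochs"}

def to_gensim4_kwargs(config):
    """Return a copy of `config` with pre-4.0 parameter names translated."""
    return {GENSIM4_RENAMES.get(key, key): value for key, value in config.items()}

legacy = {"min_count": 6, "size": 100, "iter": 20, "window": 5}
print(to_gensim4_kwargs(legacy))
# {'min_count': 6, 'vector_size': 100, 'epochs': 20, 'window': 5}
```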

In [3]:
## Initializing the word2vec model using the above configuration
model = gensim.models.Word2Vec(**w2vconfig)

In [64]:
## Preparing the word2vec input: the gensim word2vec module expects a sequence of tokenized sentences (a list of lists)
# E.g. sentences = [['sentence', 'one'], ['sentence', 'two'], ...]
# Before building the model, it helps to pre-process the data (tokenization, stemming, POS-tagging, stop word/punctuation removal, etc.)
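
As one possible way to get raw text into the list-of-lists shape, the sketch below lowercases the text, splits it into sentences, and drops punctuation using only the standard library. The function name `tokenize`, the regular expressions, and the sample text are assumptions for illustration; the sentence split is deliberately naive, and a dedicated tokenizer such as NLTK's will handle abbreviations and quotes better.

```python
import re

def tokenize(text):
    """Split raw text into sentences, then into lowercase word tokens."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for sentence in raw_sentences:
        # Keep runs of letters/digits, optionally with an apostrophe suffix.
        tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", sentence.lower())
        if tokens:
            sentences.append(tokens)
    return sentences

text = "The jury said it found no evidence. The election was conducted well!"
print(tokenize(text))
```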

In [4]:
# Here the Brown corpus will be used as a demonstration
# (requires the corpus data: run nltk.download('brown') once beforehand)
from nltk.corpus import brown
sentences = brown.sents(categories=brown.categories())
print(len(sentences))
print(sentences[:2])

57340
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.']]


In [5]:
# Building the vocabulary and training the model
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)

2018-02-03 16:22:40,243 : INFO : collecting all words and their counts
2018-02-03 16:22:40,247 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-02-03 16:22:40,952 : INFO : PROGRESS: at sentence #10000, processed 219770 words, keeping 23488 word types
2018-02-03 16:22:41,562 : INFO : PROGRESS: at sentence #20000, processed 430477 words, keeping 34367 word types
2018-02-03 16:22:42,306 : INFO : PROGRESS: at sentence #30000, processed 669056 words, keeping 42365 word types
2018-02-03 16:22:42,944 : INFO : PROGRESS: at sentence #40000, processed 888291 words, keeping 49136 word types
2018-02-03 16:22:43,438 : INFO : PROGRESS: at sentence #50000, processed 1039920 words, keeping 53024 word types
2018-02-03 16:22:43,882 : INFO : collected 56057 word types from a corpus of 1161192 raw words and 57340 sentences
2018-02-03 16:22:43,883 : INFO : Loading a fresh vocabulary
2018-02-03 16:22:44,067 : INFO : min_count=6 retains 13160 unique words (23% of original 56057

2018-02-03 16:23:50,918 : INFO : PROGRESS: at 73.70% examples, 171610 words/s, in_qsize 0, out_qsize 0
2018-02-03 16:23:51,925 : INFO : PROGRESS: at 75.19% examples, 171822 words/s, in_qsize 0, out_qsize 0
2018-02-03 16:23:52,949 : INFO : PROGRESS: at 76.43% examples, 172145 words/s, in_qsize 0, out_qsize 0
2018-02-03 16:23:53,971 : INFO : PROGRESS: at 77.43% examples, 172155 words/s, in_qsize 0, out_qsize 0
2018-02-03 16:23:54,987 : INFO : PROGRESS: at 78.22% examples, 171767 words/s, in_qsize 0, out_qsize 0
2018-02-03 16:23:56,014 : INFO : PROGRESS: at 79.50% examples, 171447 words/s, in_qsize 0, out_qsize 0
2018-02-03 16:23:57,035 : INFO : PROGRESS: at 80.71% examples, 171495 words/s, in_qsize 0, out_qsize 0
2018-02-03 16:23:58,047 : INFO : PROGRESS: at 81.65% examples, 171194 words/s, in_qsize 0, out_qsize 0
2018-02-03 16:23:59,063 : INFO : PROGRESS: at 82.60% examples, 171167 words/s, in_qsize 0, out_qsize 0
2018-02-03 16:24:00,075 : INFO : PROGRESS: at 83.84% examples, 171428 wor

15409703

In [6]:
## Testing the model
# Query the trained vectors through `model.wv`; calling these methods on the model itself is deprecated.

print("Most similar words to 'person':", model.wv.most_similar(positive=["person"]), "\n")
print("Most similar words to 'Asia':", model.wv.most_similar(positive=["Asia"]), "\n")
print("Most similar words to 'two':", model.wv.most_similar(positive=["two"]), "\n")
print("Most similar words to 'Fred':", model.wv.most_similar(positive=["Fred"]), "\n")
# Analogy: London is to England as ? is to Germany
print("'London' - 'England' + 'Germany' (expected 'Berlin'):", model.wv.most_similar(positive=["London", "Germany"], negative=["England"]), "\n")

# Similar words:
print("Similarity of 'two' and 'three':", model.wv.similarity("two", "three"), "\n")
print("Similarity of 'go' and 'come':", model.wv.similarity("go", "come"), "\n")

# Dissimilar words:
print("Similarity of 'go' and 'orange':", model.wv.similarity("go", "orange"))

2018-02-03 16:27:59,492 : INFO : precomputing L2-norms of word weight vectors


Most similar words to 'person': [('child', 0.715959906578064), ('man', 0.6727136373519897), ('woman', 0.6627339124679565), ('difficulty', 0.616508960723877), ('patient', 0.612080454826355), ('teacher', 0.6093043684959412), ('artist', 0.5953537225723267), ('word', 0.5873693227767944), ('case', 0.5868275761604309), ('dancer', 0.5713365077972412)] 

Most similar words to 'Asia': [('Africa', 0.8071120977401733), ('Southeast', 0.7689347863197327), ('East', 0.7660007476806641), ('South', 0.7628188133239746), ('Viet', 0.7604211568832397), ('Europe', 0.7526821494102478), ('Nam', 0.7451930046081543), ('western', 0.7263959646224976), ('North', 0.7213359475135803), ('France', 0.7085673809051514)] 

Most similar words to 'two': [('three', 0.8061007261276245), ('four', 0.715324342250824), ('several', 0.6597403287887573), ('six', 0.6394720077514648), ('five', 0.6207407712936401), ('few', 0.5648008584976196), ('Two', 0.5595917701721191), ('eight', 0.5593658685684204), ('seven', 0.520887017250061), ('
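
The scores reported by `most_similar` and `similarity` are cosine similarities between word vectors. As a minimal sketch of what that measure computes, here it is on small hand-picked vectors (the vectors are made up so that the answers are easy to verify by hand).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))  # same direction: 1 up to rounding
print(cosine_similarity(a, -a))     # opposite direction: -1 up to rounding
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal: 0.0
```

Because the measure depends only on direction, scaling a vector does not change its similarity to other vectors.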

In [18]:
model.most_similar(positive=["Korea"])

[('1883', 0.7338681221008301),
 ('Nebraska', 0.714241087436676),
 ('Buffalo', 0.7076287269592285),
 ('Horn', 0.6886376738548279),
 ('1921', 0.6870503425598145),
 ('Missouri', 0.684490442276001),
 ('Kentucky', 0.6808568835258484),
 ('Birmingham', 0.6747636795043945),
 ('Embassy', 0.6742736101150513),
 ('Plains', 0.6679900884628296)]

In [22]:
# Checking individual vectors (indexing `model.wv` by word returns that word's vector)
orange_vector = model.wv["orange"]
print(orange_vector)

[-0.10400999 -0.28723761 -0.39321992 -0.07453883 -0.16070177  0.02608893
 -0.36562774  0.53426212 -0.10667827  0.40968025 -0.08523978  0.21267067
 -0.17858098 -0.08899118 -0.3743138   0.4602769   0.00985924 -0.52807665
  0.13695192  0.08563213 -0.37231773  0.09058453 -0.47096005  0.14891569
  0.07235859 -0.5782941  -0.27261162  0.62884688 -0.03339595 -0.40897548
  0.33745039 -0.09804494  0.49227691  0.37999123 -0.13616557 -0.24272589
 -0.22699361 -0.08595905 -0.34847173 -0.17083147  0.43903652 -0.1142322
 -0.04916618  0.32933396  0.08061542  0.51817828 -0.0835963  -0.23297638
 -0.23575093  0.19030149 -0.11645392 -0.1596815   0.10548559 -0.22564806
 -0.18969122 -0.16085964 -0.27025726  0.24448828 -0.07644074  0.25189272
  0.11502693 -0.07111529  0.41744712  0.24963957 -0.12875514  0.60899067
  0.07185046 -0.04045171  0.37370163 -0.24526568 -0.07696132  0.53539491
 -0.01748968  0.25131464 -0.05654695  0.08259479  0.14162362  0.05591664
  0.28641215  0.15084352 -0.22377631  0.47152653 -0.

* Alternatively, you can use a pre-trained word2vec model.
* W2V model released by Google (about 1.5 GB, vocabulary of 3 million words and phrases, trained on roughly 100 billion words): https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
* W2V models for many different languages: https://github.com/Kyubyong/wordvectors
* More W2V models: https://sites.google.com/site/rmyeid/projects/polyglot

* Example: loading the above word2vec model by Google:
    * `large_model = gensim.models.KeyedVectors.load_word2vec_format('directory/GoogleNews-vectors-negative300.bin.gz', binary=True)`