# Word2Vec

# Author
Julien ROSSI for the University of Amsterdam (2021).


Word2Vec is a model described by Mikolov et al. in 2013. Google also holds a patent on the algorithm:
* "Efficient Estimation of Word Representations in Vector Space" [ArXiv](https://arxiv.org/abs/1301.3781)
* "Distributed representations of words and phrases and their compositionality" [ArXiv](https://arxiv.org/abs/1310.4546)
* "Computing numeric representations of words in a high-dimensional space" [Patent](https://patents.google.com/patent/US9037464B1/en)

A neural network with 1 hidden layer is trained to predict words from their neighbors in a sentence:
* For example, with a window of size 5
* The sample is a part of a sentence "my blue ship sails faster"
  * Context words: `my` `blue` `sails` `faster`
  * Central word: `ship`
* **CBOW**: predict `ship` from `my` `blue` `sails` `faster`
* **Skip-Gram**: predict `my` `blue` `sails` `faster` from `ship`
* From a complete corpus, extract as many samples as possible
* The sample loss measures how far the predicted probabilities over the words of the dictionary are from the ground truth (negative log-likelihood)
* Minimize the loss over the whole dataset
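
The sample extraction described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `extract_samples` and the parameter `half_window` are assumptions made for this example, and a window of size 5 is read here as 2 words on each side of the central word. The CBOW variant predicts the central word from its context, while Skip-Gram predicts each context word from the central word.

```python
def extract_samples(tokens, half_window=2):
    """Yield (center, context) pairs, taking half_window words on each side."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        yield center, left + right


tokens = "my blue ship sails faster".split()
samples = list(extract_samples(tokens))

# The sample centered on "ship" matches the example in the text.
center, context = samples[2]

# CBOW: one training sample, context words -> central word
cbow_sample = (context, center)

# Skip-Gram: one training pair per context word, central word -> context word
skipgram_pairs = [(center, word) for word in context]

print(cbow_sample)
print(skipgram_pairs)
```

Words near the beginning or the end of a sentence simply get a shorter context in this sketch.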

Once the neural network is trained:
* Read the weights of the hidden layer as word embeddings
* These are also the values in the neurons of the hidden layer when the word is given as input (green area on the illustration)
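
To see why the hidden-layer weights and the hidden-layer activations coincide, here is a minimal sketch with a toy vocabulary. The vocabulary, the weight values and the names `W_in`, `one_hot` and `hidden_activations` are all hypothetical. It assumes the input is a one-hot vector and the hidden layer is linear (no activation function), so the activations reduce to one row of the weight matrix.

```python
# Hypothetical toy setup: a 4-word vocabulary and 3 hidden neurons.
vocab = ["my", "blue", "ship", "sails"]
W_in = [
    [0.1, 0.2, 0.3],  # my
    [0.4, 0.5, 0.6],  # blue
    [0.7, 0.8, 0.9],  # ship
    [1.0, 1.1, 1.2],  # sails
]


def one_hot(word):
    """Input vector with a 1 at the position of the word, 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]


def hidden_activations(x, weights):
    """Linear hidden layer: h = x . W"""
    n_hidden = len(weights[0])
    return [
        sum(x[i] * weights[i][j] for i in range(len(x)))
        for j in range(n_hidden)
    ]


h = hidden_activations(one_hot("ship"), W_in)
embedding = W_in[vocab.index("ship")]

# The activations for "ship" are exactly the row of weights for "ship".
print(h == embedding)
```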

<img src="https://miro.medium.com/max/700/1*HQeN5Q9FhN_XPbM4QuWIRg.jpeg"></img>

Image source: https://medium.com/@zeeshanmulla/word-embeddings-in-natural-language-processing-nlp-5be7d6fb1d73

The contribution of Mikolov et al. is mainly a set of training optimizations that make the training tractable. We will not go into these details.



# Use an existing model

Training good embeddings takes a large corpus and a lot of compute, so it is usually worth using a pretrained model.

A pretrained model is essentially:
* a dictionary
* each key is a word
* each value is a vector

**Warning**

Loading the model below will download **1.6GB** of data.

In [None]:
import gensim.downloader as api

In [None]:
model = api.load('word2vec-google-news-300')

In [None]:
print(type(model))

Have a look at the vectors.

In [None]:
model['cat']

In [None]:
# A word that is not in the vocabulary raises a KeyError
model['sklsajhdgfjkhsosiuerhksjdhfkjsh']

## Most similar words

The similarity between words is computed as the cosine similarity between the vectors representing these words.

In [None]:
model.similarity('investment', 'flower')

In [None]:
model.most_similar(positive=['cat'])
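
As a reminder of what the similarity measures, here is a small sketch of cosine similarity in plain Python. The 3-dimensional vectors in `toy_vectors` are made up for illustration and do not come from the pretrained model, and `cosine_similarity` is a helper written for this example, not a gensim function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, laid out like a pretrained model: word -> vector.
toy_vectors = {
    "cat": [1.0, 2.0, 0.0],
    "dog": [2.0, 4.0, 0.0],  # same direction as "cat"
    "car": [0.0, 0.0, 3.0],  # orthogonal to "cat"
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

The value depends only on the direction of the vectors, not on their length: vectors pointing the same way score close to 1, orthogonal vectors score 0.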

## Composition

A few well-known vector relations hold approximately, for example:

$\overrightarrow{\textrm{king}} - \overrightarrow{\textrm{man}} + \overrightarrow{\textrm{woman}} \approx \overrightarrow{\textrm{queen}}$

In [None]:
model.most_similar(positive=['king', 'woman'], negative=['man'])

$\overrightarrow{\textrm{paris}} - \overrightarrow{\textrm{france}} + \overrightarrow{\textrm{germany}} \approx \overrightarrow{\textrm{berlin}}$

In [None]:
model.most_similar(positive=['paris', 'germany'], negative=['france'])
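
The idea behind these queries can be sketched with toy vectors: add the positive words, subtract the negative ones, and look for the nearest remaining word by cosine similarity. The 2-dimensional vectors below are hypothetical and hand-picked so that the analogy works, and `analogy` is a simplified helper written for this example. It is not how gensim implements `most_similar`, although it also leaves the input words out of the candidates.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical vectors: axis 0 stands for "royalty", axis 1 for "gender".
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # The query words themselves are not valid answers.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy(positive=["king", "woman"], negative=["man"], vectors=toy))
```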

# Training with a corpus

We will use the [Brown Corpus](http://korpus.uib.no/icame/manuals/BROWN/INDEX.HTM) as illustration.

This corpus is made of samples of American English text published in 1961, written by native English speakers.

We will generate 100-dimensional vectors for the words in the corpus.

In [None]:
import nltk
nltk.download('brown')

In [None]:
from gensim.models.word2vec import BrownCorpus

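# Default nltk download location when running as root; adjust the path if nltk stored the data elsewhere
brown = BrownCorpus('/root/nltk_data/corpora/brown')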

It is an iterable of tokenized sentences. Each word is also flagged with its Part-of-Speech (POS) tag, in the form `word/tag`.

* `pp` = personal pronoun
* `vb` = verb
* etc...

In [None]:
all_brown = list(brown)
print(all_brown[:1000])

In [None]:
# Count on the list already in memory instead of reading the corpus from disk again
print(f'Brown Corpus contains {len(all_brown)} sentences, and a total of {sum(map(len, all_brown))} tokens.')

In [None]:
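from gensim.models import Word2Vec

# Parameter names follow gensim >= 4.0 (older releases used `size` and `iter`)
w2v = Word2Vec(
    sentences=brown,
    vector_size=100,  # dimension of the word vectors
    window=3,         # maximum distance between the central word and a context word
    epochs=20         # number of passes over the corpus
)

In [None]:
# `key_to_index` replaces the `vocab` attribute of gensim < 4.0
print(f'Word2Vec created for a vocabulary of {len(w2v.wv.key_to_index)} unique terms.')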

In [None]:
w2v.wv['happy/jj']

A model trained on such a small corpus does not perform well, as the nearest neighbors below show.

We would need:
* More data
* More compute to train the neural network

In [None]:
w2v.wv.most_similar(positive=['happy/jj'])

## Evaluate

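Evaluation checks whether relations between words established by humans are reflected in the vectors: `questions-words.txt` contains analogy questions, and `wordsim353.tsv` contains pairs of words with human similarity ratings.

Both files contain plain words, while the vocabulary of the Brown model holds `word/tag` tokens, so the test words are not found as-is in this model's vocabulary.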

In [None]:
from gensim.test.utils import datapath
w2v.wv.evaluate_word_analogies(datapath('questions-words.txt'))

In [None]:
w2v.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
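
For word pairs, the agreement between human ratings and model similarities is typically summarized with a correlation coefficient such as Spearman's rank correlation. The sketch below shows the principle on made-up numbers: `human_scores` and `model_scores` are hypothetical, and the simple formula used here assumes there are no tied values. It is an illustration, not gensim's implementation.

```python
def ranks(values):
    """Rank of each value (1 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation without ties: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical data: four word pairs, rated by humans and scored by a model.
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.80, 0.85, 0.30, 0.10]

print(spearman(human_scores, model_scores))
```

A value of 1 means the model orders the word pairs exactly as the human raters do. Here the model swaps the two most similar pairs, so the correlation is lower.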