Word2Vec Tutorial
==============

In [None]:
%matplotlib inline
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Word2Vec is widely featured as a member of the
“new wave” of Machine Learning algorithms based on Neural Networks, commonly
referred to as "Deep Learning" (though Word2Vec itself is rather shallow).
<br>
<br>
Using large amounts of unannotated plain text, Word2Vec learns relationships
between words automatically. The output is a set of vectors, one per word,
with remarkable linear relationships that allow us to do things like:

* vec("king") - vec("man") + vec("woman") =~ vec("queen")
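
To make the arithmetic concrete, here is a minimal sketch in plain Python. The 2-D vectors are made up by hand for illustration (real Word2Vec vectors are learned and have far more dimensions), and ``cosine`` and ``toy`` are hypothetical names, not part of Gensim.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (an assumption for illustration, not learned values).
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.2, 1.0],
    "woman": [0.2, -1.0],
    "apple": [-1.0, 0.2],
}

# vec("king") - vec("man") + vec("woman")
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank the remaining words by cosine similarity to the target vector.
candidates = [w for w in toy if w not in ("king", "man", "woman")]
nearest = max(candidates, key=lambda w: cosine(toy[w], target))
print(nearest)
```

With a real model, the nearest word is only approximately "queen", which is why the relationship above is written with ``=~``.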

<br>
Word2Vec is very useful in _automatic text tagging_, _recommender systems_ and _machine translation_.

This tutorial is based on the Python package [_Gensim_](https://github.com/RaRe-Technologies/gensim) and:

1. Introduces Word2Vec as an improvement over traditional bag-of-words
2. Shows off a demo of Word2Vec using a pre-trained model
3. Demonstrates training a new model from your own data
4. Demonstrates loading and saving models
5. Introduces several training parameters and demonstrates their effect
6. Discusses memory requirements
7. Visualizes Word2Vec embeddings by applying dimensionality reduction

<br>
*Amazon uses Gensim to evaluate document similarity.

Review: Bag-of-words
--------------------

This model transforms each document to a fixed-length vector of integers.
For example, given the sentences:

- ``John likes to watch movies. Mary likes movies too.``
- ``John also likes to watch football games. Mary hates football.``

<br>
The model outputs the vectors:

- ``[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]``
- ``[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]``

<br>
Each vector has 11 elements, one per distinct word in the two sentences, where each element counts the number of times a particular word occurred in the document.
The order of elements is arbitrary.
In the example above, the order of the elements corresponds to the words:
``["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"]``.
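
The counting can be reproduced in a few lines of plain Python. This is only a sketch: ``tokenize``, ``build_vocabulary`` and ``bag_of_words`` are hypothetical helpers, and ordering the vocabulary by first appearance is an assumption chosen so that the result matches the word order listed above.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

def build_vocabulary(documents):
    """Collect distinct words in order of first appearance."""
    vocabulary = []
    for document in documents:
        for token in tokenize(document):
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]

documents = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games. Mary hates football.",
]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(document, vocabulary) for document in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```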
<br>
<br>
Bag-of-words models are surprisingly effective, but have several weaknesses.
<br>
<br>
1) No information about word order: "John likes Mary" and "Mary likes John" are mapped to identical vectors.
<br>
<br>
Solution: *bag of n-grams* models represent documents as fixed-length vectors over word phrases of length n. This captures local word order, but suffers from data sparsity and high dimensionality.
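
A small sketch shows both the problem and the n-gram remedy. ``ngrams`` is a hypothetical helper written for this illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous runs of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "John likes Mary".split()
b = "Mary likes John".split()

# Single-word counts ignore order, so the two sentences look identical.
unigrams_equal = Counter(a) == Counter(b)

# Bigram counts keep local order, so the sentences now differ.
bigrams_equal = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))

print(unigrams_equal, bigrams_equal)
```

The price is a much larger vocabulary: every distinct word pair becomes its own dimension, which is where the sparsity comes from.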
<br>
<br>
2) The model does not attempt to learn the _meaning_ of the underlying words, and as a consequence, the distance between vectors doesn't always reflect the difference in meaning.

The Word2Vec model addresses this second problem.

Introducing: Word2Vec
-----------------------------------

Word2Vec is a recent model that embeds words in a lower-dimensional
vector space using a shallow neural network (shallow means the neural network has few hidden layers).
<br>
<br>
The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant to each other have differing meanings.
<br>
<br>
Example: ``strong`` and ``powerful`` would be close together, while ``strong`` and ``Paris`` would be relatively far apart.
<br>
There are two versions of this model and _Gensim_ implements them both:

1. _Skip-grams_ (SG)
2. _Continuous bag-of-words_ (CBOW)
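
This tutorial does not go into how the two variants differ. As they are commonly described, skip-gram predicts the surrounding context words from a center word, while CBOW predicts the center word from its surrounding context. The sketch below only lists the training examples each variant would look at for a tiny sentence; the function names and the window of 1 are assumptions for illustration, and no neural network is involved.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: one pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(context words, center) examples: one example per center word."""
    examples = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, center))
    return examples

tokens = ["john", "likes", "movies"]
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)

print(sg)
print(cbow)
```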

Word2Vec Demo
-------------

To see what Word2Vec can do, let's download a pre-trained model and play around with it.
<br>
<br>
We will fetch the Word2Vec model trained on part of the Google News dataset, covering approximately 3 million words and phrases.

Such a model can take _hours_ to train, but since it's already available, we can download and load it using Gensim.
<br>
<br>
NB: The model is approximately 2GB, so you'll need a decent network connection to proceed (do it at home).

In [None]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

We can easily obtain vectors for terms the model is familiar with:




In [None]:
vec_king = wv['king']

Word2Vec has one limitation: it is unable to infer vectors for unfamiliar words.

In [None]:
try:
    vec_cameroon = wv['cameroon']
except KeyError:
    print("The word 'cameroon' does not appear in this model")

Gensim Word2Vec supports several word similarity tasks out of the
box. You can see how the similarity intuitively decreases as the words get
less and less similar.

In [None]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

Print the 5 words most similar to "car" and "minivan" taken together:



In [None]:
print(wv.most_similar(positive=['car', 'minivan'], topn=5))

Which of the below does not belong in the sequence?



In [None]:
print(wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))
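
One way to think about ``doesnt_match``: scale every word vector to unit length, average them, and report the word whose vector points least in the direction of that average. The sketch below illustrates this idea in plain Python with made-up 2-D vectors. ``unit``, ``odd_one_out`` and ``toy`` are hypothetical names, and this is not Gensim's own code.

```python
import math

def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def odd_one_out(vectors):
    """Return the word least similar to the mean of all unit vectors."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[k] for u in units.values()) / len(units) for k in range(dim)])
    # Dot product of two unit vectors is their cosine similarity.
    scores = {word: sum(a * b for a, b in zip(u, mean)) for word, u in units.items()}
    return min(scores, key=scores.get)

# Hand-made toy vectors (an assumption for illustration, not learned values).
toy = {
    "fire":  [0.90, 0.10],
    "water": [0.80, 0.20],
    "air":   [0.85, 0.15],
    "car":   [0.10, 0.90],
}
result = odd_one_out(toy)
print(result)
```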

Training Your Own Model
-----------------------

To start, you'll need some data for training the model.
<br>
<br>
For the following examples, we'll use the [Lee Corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor) (which you already have if you've installed Gensim).
<br>
<br>
This corpus is small enough to fit entirely in memory (RAM), but we'll implement a memory-friendly [_iterator_](https://www.w3schools.com/python/python_iterators.asp) that reads it line-by-line to demonstrate how you would handle a larger corpus.

In [1]:
from gensim.test.utils import datapath
from gensim import utils

class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        # the with-block closes the file once iteration finishes
        with open(corpus_path) as corpus_file:
            for line in corpus_file:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)

Any custom preprocessing, e.g. decoding a non-standard encoding, lowercasing, removing numbers or extracting named entities, can be done inside the ``MyCorpus`` iterator, and ``Word2Vec`` doesn’t need to know.
<br>
<br>
All that is required is that the input yields one sentence (list of utf-8 words) after another.
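
One detail is worth spelling out: training reads the corpus more than once (a pass to build the vocabulary, then the training passes), so the input has to be restartable. A class with ``__iter__`` starts again from the top each time it is looped over, whereas a plain generator is exhausted after one pass. The sketch below shows the difference using only the standard library; ``LineCorpus`` and the two-line sample file are hypothetical stand-ins for ``MyCorpus`` and the Lee Corpus.

```python
import os
import tempfile

class LineCorpus:
    """Restartable iterable: re-opens the file on every pass."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                # one document per line, tokens separated by whitespace
                yield line.lower().split()

# Write a tiny sample corpus to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Human machine interface\nGraph minors survey\n")
    corpus_path = tmp.name

corpus = LineCorpus(corpus_path)
first_pass = list(corpus)
second_pass = list(corpus)   # same sentences again

generator = (line.split() for line in ["a b", "c d"])
gen_first = list(generator)
gen_second = list(generator)  # empty: the generator is used up

os.remove(corpus_path)
print(first_pass == second_pass, gen_second)
```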
<br>
<br>
Let's go ahead and train a model on our corpus. Don't worry about the training parameters much for now; we'll revisit them later.

In [6]:
import gensim.models

sentences = MyCorpus()  # iterator

# Train Word2Vec on our corpus.
# Pass sg=1 to use skip-gram; the default (sg=0) is CBOW.
model = gensim.models.Word2Vec(sentences=sentences)

Once we have our model, we can use it in the same way as in the demo above.

The main part of the model is ``model.wv``, where "wv" stands for "word vectors".

In [None]:
vec_king = model.wv['king']

[('claims', 0.9989023208618164), ('days', 0.9988892078399658), ('as', 0.9988892078399658), ('believe', 0.9988856911659241), ('is', 0.9988771677017212)]


Storing and loading models
--------------------------

Training non-trivial models can take time. 

Once you've trained your model and it works as expected, you can save it to disk. That way, you don't have to spend time training it all over again later.

You can store/load models using standard methods provided by Gensim:

In [None]:
import tempfile

# Create a temporary file only to reserve a unique path, then close it.
with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:
    temporary_filepath = tmp.name

# Save the model after the handle is closed; some platforms do not
# allow a file to be opened again while it is still open.
model.save(temporary_filepath)

# The model is now safely stored in the filepath.
# You can copy it to other machines, share it with others, etc.

# Load a saved model
new_model = gensim.models.Word2Vec.load(temporary_filepath)

Training Parameters
-------------------

``Word2Vec`` accepts several parameters that affect both training speed and quality.

min_count
---------

``min_count`` is for pruning the internal dictionary. Words that appear only
once or twice in a billion-word corpus are probably uninteresting typos and
garbage. In addition, there’s not enough data to do any meaningful training
on those words, so it’s best to ignore them.

The default value is ``min_count=5``.



In [None]:
model = gensim.models.Word2Vec(sentences, min_count=10)
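
The effect of this pruning can be sketched in plain Python: count every token and keep only the words seen at least ``min_count`` times. ``prune_vocabulary`` and the sample sentences are hypothetical and only illustrate the idea; they are not how Gensim builds its vocabulary internally.

```python
from collections import Counter

def prune_vocabulary(sentences, min_count):
    """Keep only words whose total count is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}

sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "catt", "ran"],   # "catt" is a typo that occurs only once
]
kept = prune_vocabulary(sentences, min_count=2)
print(sorted(kept))
```

Note how the one-off typo disappears together with the legitimate but rare words. On a small corpus, a high ``min_count`` can remove most of the vocabulary.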

size
----

``size`` is the number of dimensions (N) of the N-dimensional space that
gensim Word2Vec maps the words onto. In Gensim 4.0 and later, this parameter
is named ``vector_size``.

Bigger size values require more training data, but can lead to better (more
accurate) models. Reasonable values are in the tens to hundreds.

In [None]:
# default value of size=100
# (in Gensim 4.0+, pass vector_size=200 instead)
model = gensim.models.Word2Vec(sentences, size=200)

workers
-------

``workers``, the last of the major parameters (full list [here](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)), is for training parallelization, to speed up training.
<br>
<br>
The ``workers`` parameter only has an effect if you have [Cython](http://cython.org/) installed. Without Cython, you’ll only be able to use one core because of the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) (and ``word2vec``
training will be [miserably slow](http://rare-technologies.com/word2vec-in-python-part-two-optimizing/)).

In [None]:
# default value of workers=3
model = gensim.models.Word2Vec(sentences, workers=4)

Memory
------

At its core, ``word2vec`` model parameters are stored as matrices ([NumPy](https://numpy.org/) arrays). Each array holds **#vocabulary** (controlled by the ``min_count`` parameter) times **#size** (the ``size`` parameter) floats (single precision, i.e. 4 bytes each).
<br>
<br>
Three such matrices are held in RAM. So if your input contains 100,000 unique words, and you asked for layer ``size=200`` , the model will require approx.
``100,000*200*4*3 bytes = ~229MB``.
<br>
<br>
There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely long, memory footprint will be dominated by the three matrices above.
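
The estimate is easy to redo for your own vocabulary and vector size. The helper below is a hypothetical convenience function that just repeats the arithmetic above; it ignores the vocabulary tree and any other overhead.

```python
def estimate_matrix_bytes(vocab_size, vector_size, bytes_per_float=4, num_matrices=3):
    """Rough RAM needed for the model matrices, in bytes."""
    return vocab_size * vector_size * bytes_per_float * num_matrices

total_bytes = estimate_matrix_bytes(100_000, 200)
total_mib = total_bytes / 2**20   # 1 MiB = 2**20 bytes

print(f"{total_bytes:,} bytes = {total_mib:.0f} MiB")
```

The figure of roughly 229MB quoted above corresponds to dividing by 2**20; in decimal megabytes the same number of bytes is 240 MB.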

Visualising the Word Embeddings
-------------------------------

The word embeddings made by the model can be visualised by reducing
the dimensionality of the word vectors to 2 dimensions using t-SNE.

Visualisations can be used to notice semantic and syntactic trends in the data.

Example:

* Semantic: words like cat, dog, cow, etc. have a tendency to lie close by
* Syntactic: words like run, running or cut, cutting lie close together.

Vector relations like vKing - vMan = vQueen - vWoman can also be noticed.

NB: The model used for the visualisation is trained on a small corpus. Thus some of the relations might not be so clear.




In [None]:
from sklearn.manifold import TSNE                   # dimensionality reduction
import numpy as np                                  # array handling


def reduce_dimensions(model):
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    vectors = []  # positions in vector space
    labels = []  # keep track of words to label our data again later
    # Gensim 4.0+ removed wv.vocab; iterate over model.wv.key_to_index there instead
    for word in model.wv.vocab:
        vectors.append(model.wv[word])
        labels.append(word)

    # convert both lists into numpy arrays for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)

    # reduce using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels


x_vals, y_vals, labels = reduce_dimensions(model)

def plot_with_plotly(x_vals, y_vals, labels, plot_in_notebook=True):
    from plotly.offline import init_notebook_mode, iplot, plot
    import plotly.graph_objs as go

    trace = go.Scatter(x=x_vals, y=y_vals, mode='text', text=labels)
    data = [trace]

    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')


def plot_with_matplotlib(x_vals, y_vals, labels):
    import matplotlib.pyplot as plt
    import random

    random.seed(0)

    plt.figure(figsize=(12, 12))
    plt.scatter(x_vals, y_vals)

    #
    # Label randomly subsampled 25 data points
    #
    indices = list(range(len(labels)))
    selected_indices = random.sample(indices, 25)
    for i in selected_indices:
        plt.annotate(labels[i], (x_vals[i], y_vals[i]))

try:
    get_ipython()
except Exception:
    plot_function = plot_with_matplotlib
else:
    plot_function = plot_with_plotly

plot_function(x_vals, y_vals, labels)

The End
----------