This notebook lays out one possible method for generating word embeddings, that is, a way of representing a dictionary of words (drawn from a training set) as numerical data.

# Word2Vec and t-SNE


[_Word embeddings_](http://arxiv.org/pdf/1301.3781.pdf) provide a meaningful representation of text. They are so called because they embed each word in some high-dimensional space; that is, they map each word to a vector. Word embeddings are learned for a particular task, so they end up being representations that are meaningful for that task.

For example, the relationships between words can be meaningful (image from the [TensorFlow documentation](https://www.tensorflow.org/versions/r0.9/tutorials/word2vec/index.html)):

![Word embedding relationships](https://www.tensorflow.org/versions/r0.9/images/linear-relationships.png)

A notable property that emerges is that vector arithmetic can also be meaningful. Perhaps the most well-known example of this is:

$$
\text{king} - \text{man} + \text{woman} \approx \text{queen}
$$

So the positioning of these words in this space actually tells us something about *how these words are used*.
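
To make the arithmetic concrete, here is a minimal sketch using hand-made 2D vectors. The vectors in `toy_embeddings` and the helper `nearest_word` are made up for illustration; real embeddings are learned, have many more dimensions, and only satisfy the relationship approximately.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
# These values are invented for illustration, not learned from data.
toy_embeddings = {
    'king':  np.array([1.0,  1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0,  1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.0]),
}

def nearest_word(vector, exclude=()):
    """Return the word whose toy embedding is closest (Euclidean) to `vector`."""
    candidates = [w for w in toy_embeddings if w not in exclude]
    return min(candidates,
               key=lambda w: np.linalg.norm(toy_embeddings[w] - vector))

result = toy_embeddings['king'] - toy_embeddings['man'] + toy_embeddings['woman']
# exclude the query words themselves, as is commonly done for analogies
analogy = nearest_word(result, exclude=('king', 'man', 'woman'))
print(analogy)  # queen
```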

This allows us to do things like find the words most similar to a given word by looking at its nearest neighbours in the space. You can also project the resulting embeddings down to 2D so that usage patterns can be visualized.

- Try using t-SNE ("t-Distributed Stochastic Neighbor Embedding") for this; it is a dimensionality reduction method designed for visualizing high-dimensional data.
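
As a rough sketch of what such a projection might look like, the snippet below runs scikit-learn's `TSNE` on random vectors standing in for embeddings. The random data and the particular settings (`perplexity=10`, `init='random'`) are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: 50 random 25-dimensional vectors playing the role of embeddings.
rng = np.random.RandomState(0)
fake_embeddings = rng.normal(size=(50, 25))

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=10, init='random', random_state=0)
points_2d = tsne.fit_transform(fake_embeddings)

# one (x, y) pair per input vector, ready for a scatter plot
print(points_2d.shape)
```

With real embeddings you would pass the learned matrix instead of `fake_embeddings` and label each point with its word.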

As mentioned earlier, these word embeddings are trained to help with a particular task, which is learned through a neural network. Two tasks developed for training embeddings are _CBOW_ ("Continuous Bag Of Words") and _skip-grams_; together these methods of learning word embeddings are called "Word2Vec".

For the CBOW task, we take the context words (the words around the target word) together with a candidate target word, and predict whether or not the target word belongs to that context.

The skip-gram task is basically the inverse: we take the target word (the "pivot") together with a candidate context, and predict whether or not the context belongs to the word.

They are quite similar but have different properties, e.g. CBOW works better on smaller datasets, whereas skip-grams work better on larger ones. In any case, the idea with word embeddings is that they can be trained to help with any task.
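
One way to see the difference is to look at what each task treats as input and what it treats as output. The sketch below uses the common "predict the output from the input" framing rather than the yes/no framing used in this notebook; `training_examples` is a hypothetical helper written for this illustration.

```python
def training_examples(tokens, window=2):
    """Build (input, output) examples for both tasks from one tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # words within `window` positions of the target, excluding the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # CBOW: the whole context is the input, the target is the output
        cbow.append((context, target))
        # skip-gram: the target is the input, each context word is an output
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

cbow_examples, skipgram_examples = training_examples(
    ['the', 'quick', 'brown', 'fox'], window=1)

print(cbow_examples[1])       # (['the', 'brown'], 'quick')
print(skipgram_examples[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```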

We're going to be using the skip-gram task here.

## Corpus

We need a reasonably-sized text corpus to learn from. Here we'll use State of the Union addresses retrieved from [The American Presidency Project](http://www.presidency.ucsb.edu/sou.php). These addresses tend to use similar patterns so we should be able to learn some decent word embeddings. Since the skip-gram task looks at context, we'll be able to learn better embeddings from texts that use words in a consistent way (i.e. in consistent contexts).

[The corpus is available here](/guides/data/sotu.tar.gz) as a compressed archive of .txt files. The texts were preprocessed a bit (mainly removing URL-encoded characters). (Note: this isn't the complete collection of texts, but it's enough to work with here.)

## Skip-grams

Before we go any further, let's get a bit more concrete about what the skip-gram task is.

Let's consider the sentence "I think cats are cool".

The skip-gram task is as follows:

- We take a word, e.g. `'cats'`, which we'll represent as $w_i$. We feed this as input into our neural network.
- We take the word's context, e.g. `['I', 'think', 'are', 'cool']`. We'll represent this as $\{w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}\}$ and we also feed this into our neural network.
- Then we just want our network to predict (i.e. classify) whether or not $\{w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}\}$ is the true context of $w_i$.

For this particular example we'd want the network to output 1 (i.e. yes, that is the true context).

If we set $w_i$ to 'frogs', then we'd want the network to output 0. In our one-sentence corpus, `['I', 'think', 'are', 'cool']` is not the true context for 'frogs'. Sorry frogs 🐸.
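
The example above can be sketched in a few lines of plain Python. `context_of` and `label` are hypothetical helpers that only compute the labels we'd want the network to learn; they are not part of the model.

```python
def context_of(tokens, i, window=2):
    """Words within `window` positions of tokens[i], excluding tokens[i] itself."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def label(tokens, word, context, window=2):
    """1 if `context` is the true context of some occurrence of `word`, else 0."""
    return int(any(tok == word and context_of(tokens, i, window) == context
                   for i, tok in enumerate(tokens)))

sentence = 'I think cats are cool'.split()
context = context_of(sentence, sentence.index('cats'))

print(context)                            # ['I', 'think', 'are', 'cool']
print(label(sentence, 'cats', context))   # 1
print(label(sentence, 'frogs', context))  # 0
```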

## Building the model

We'll use `keras` to build the neural network that we'll use to learn the embeddings.

First we'll import everything:

In [1]:
import sklearn
import matplotlib.pyplot as plt
import scipy

import numpy as np
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Flatten, Activation, Merge
from keras.preprocessing.text import Tokenizer, base_filter
from keras.preprocessing.sequence import skipgrams, make_sampling_table

Using Theano backend.


Depending on which environment you're running this from, you may find yourself needing to upgrade one of these libraries, which you can do by opening a terminal and running `pip install [name of library] --upgrade`.

Then load in our data. We're actually going to define a generator to load our data in on-demand; this way we'll avoid having all our data sitting around in memory when we don't need it.

In [5]:
from glob import glob
text_files = glob('sotu/*.txt')

def text_generator():
    for path in text_files:
        with open(path, 'r') as f:
            yield f.read()
            
len(text_files)

84

Before we go any further, we need to map the words in our corpus to numbers, so that we have a consistent way of referring to them. First we'll fit a tokenizer to the corpus:

In [6]:
# our corpus is small enough that we don't need
# to worry about this, but it's good practice
max_vocab_size = 50000

# `filters` specify what characters to get rid of
# `base_filter()` includes basic punctuation;
# I like to extend it with common unicode punctuation
tokenizer = Tokenizer(nb_words=max_vocab_size,
                      filters=base_filter()+'“”–')

# fit the tokenizer
tokenizer.fit_on_texts(text_generator())

# we also want to keep track of the actual vocab size
# we'll need this later
# note: we add one because `0` is a reserved index in keras' tokenizer
vocab_size = len(tokenizer.word_index) + 1

Now the tokenizer knows what tokens (words) are in our corpus and has mapped them to numbers. The `keras` tokenizer also indexes them in order of frequency (most common first, i.e. index 1 is usually a word like "the"), which will come in handy later.
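
To illustrate the frequency-ordered indexing, here is a simplified stand-in for what the tokenizer does. `build_word_index` is a hypothetical helper: it only lowercases and splits on whitespace, whereas the real tokenizer also filters punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # index 0 is left unused, mirroring the reserved index mentioned above
    return {word: i
            for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_index = build_word_index(['the cat sat', 'the cat ran', 'the end'])
print(toy_index['the'], toy_index['cat'])  # 1 2
```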

At this point, let's define the dimensions of our embeddings. It's up to you and your task to choose this number. Like many neural network hyperparameters, you may just need to play around with it.

In [7]:
embedding_dim = 25

Now let's define the model. When I described the skip-gram task, I mentioned two inputs: the target word (also called the "pivot") and the context. So we're going to build a separate model for each input and then merge the two into one.

In [8]:
pivot_model = Sequential()
pivot_model.add(Embedding(vocab_size, embedding_dim, input_length=1))

context_model = Sequential()
context_model.add(Embedding(vocab_size, embedding_dim, input_length=1))

# merge the pivot and context models
model = Sequential()
model.add(Merge([pivot_model, context_model], mode='dot', dot_axes=2))
model.add(Flatten())

# the task as we've framed it here is
# just binary classification,
# so we want the output to be in [0,1],
# and we can use binary crossentropy as our loss
model.add(Activation('sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')
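
To see what this architecture computes, here is a NumPy sketch of the forward pass and the loss, under the assumption that each `Embedding` layer acts as a lookup table and the merge is a per-pair dot product. The tables are random and the names (`pivot_table`, `context_table`, `predict`) are made up for illustration; nothing here is trained.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_vocab_size, toy_dim = 10, 4

# one embedding table per input, as in the two Embedding layers
pivot_table = rng.normal(scale=0.1, size=(toy_vocab_size, toy_dim))
context_table = rng.normal(scale=0.1, size=(toy_vocab_size, toy_dim))

def predict(pivots, contexts):
    """Sigmoid of the dot product between each pivot/context embedding pair."""
    scores = np.sum(pivot_table[pivots] * context_table[contexts], axis=1)
    return 1.0 / (1.0 + np.exp(-scores))

def binary_crossentropy(labels, probs):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

pivots = np.array([1, 1, 2])
contexts = np.array([3, 4, 5])
labels = np.array([1, 0, 1])

probs = predict(pivots, contexts)
loss = binary_crossentropy(labels, probs)
print(probs.shape, loss)
```

Because the tables start out small and random, the predictions sit near 0.5; training pushes the dot product up for true couples and down for false ones.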

Finally, we can train the model.

In [None]:
n_epochs = 1

# used to sample words (indices)
sampling_table = make_sampling_table(vocab_size)

for i in range(n_epochs):
    loss = 0
    for seq in tokenizer.texts_to_sequences_generator(text_generator()):
        # generate skip-gram training examples
        # - `couples` consists of the pivots (i.e. target words) and surrounding contexts
        # - `labels` represent if the context is true or not
        # - `window_size` determines how far to look between words
        # - `negative_samples` specifies the ratio of negative couples
        #    (i.e. couples where the context is false)
        #    to generate with respect to the positive couples;
        #    i.e. `negative_samples=4` means "generate 4 times as many negative samples"
        couples, labels = skipgrams(seq, vocab_size, window_size=5, negative_samples=4, sampling_table=sampling_table)
        if couples:
            pivot, context = zip(*couples)
            pivot = np.array(pivot, dtype='int32')
            context = np.array(context, dtype='int32')
            labels = np.array(labels, dtype='int32')
            loss += model.train_on_batch([pivot, context], labels)
    print('epoch %d, %0.02f'%(i, loss))
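
For intuition about the couples and labels described in the comments above, here is a simplified pure-Python sketch. `toy_skipgrams` is a hypothetical stand-in, not the `keras` implementation: it ignores the sampling table and does not shuffle its output.

```python
import random

def toy_skipgrams(sequence, vocab_size, window_size=2,
                  negative_samples=1.0, seed=0):
    """Return (couples, labels): true (pivot, context) pairs labelled 1 and
    pairs with a randomly drawn word in the context slot labelled 0."""
    rng = random.Random(seed)
    couples, labels = [], []
    for i, pivot in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                couples.append((pivot, sequence[j]))
                labels.append(1)
    n_negative = int(len(couples) * negative_samples)
    pivots = [c[0] for c in couples]  # only positive couples exist so far
    for _ in range(n_negative):
        # index 0 is reserved, so draw random words from 1..vocab_size-1;
        # a random draw can occasionally coincide with a true context word
        couples.append((rng.choice(pivots), rng.randint(1, vocab_size - 1)))
        labels.append(0)
    return couples, labels

toy_couples, toy_labels = toy_skipgrams(
    [1, 2, 3, 4], vocab_size=10, window_size=1, negative_samples=2)
print(len(toy_couples), sum(toy_labels))  # 18 6
```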

Wait a few minutes for training...

Now we can extract the embeddings, which are just the weights of the pivot embedding layer:

In [8]:
embeddings = model.get_weights()[0]

We also want to set aside the tokenizer's `word_index` for later use (so we can get indices for words) and also create a reverse word index (so we can get words from indices):

In [9]:
word_index = tokenizer.word_index
reverse_word_index = {v: k for k, v in word_index.items()}

That's it for learning the embeddings. Now we can try using them.

## Getting similar words

Each word embedding is just a mapping of a word to some point in space. So if we want to find words similar to some target word, we literally just need to look at the closest embeddings to that target word's embedding.

An example will make this clearer.

First, let's write a simple function to retrieve an embedding for a word:

In [10]:
def get_embedding(word):
    idx = word_index[word]
    # make it 2d
    return embeddings[idx][:,np.newaxis].T

Then we can define a function to get the most similar word for an input word:

In [11]:
from scipy.spatial.distance import cdist

ignore_n_most_common = 50

def get_closest(word):
    embedding = get_embedding(word)

    # get the distance from the embedding
    # to every other embedding
    distances = cdist(embedding, embeddings)[0]

    # pair each embedding index and its distance
    distances = list(enumerate(distances))

    # sort from closest to furthest
    distances = sorted(distances, key=lambda d: d[1])

    # skip the first one; it's the target word
    for idx, dist in distances[1:]:
        # ignore the n most common words;
        # they can get in the way.
        # because the tokenizer organized indices
        # from most common to least, we can just do this
        if idx > ignore_n_most_common:
            return reverse_word_index[idx]
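
If you want more than one neighbour, the same idea extends naturally. This is a small NumPy-only sketch on a made-up matrix; `k_nearest` is a hypothetical helper that works on row indices and leaves out the common-word filter used above.

```python
import numpy as np

def k_nearest(query_idx, embedding_matrix, k=3):
    """Indices of the k rows closest (Euclidean) to row `query_idx`, nearest first."""
    dists = np.linalg.norm(embedding_matrix - embedding_matrix[query_idx], axis=1)
    dists[query_idx] = np.inf  # never return the query itself
    return np.argsort(dists)[:k].tolist()

toy_matrix = np.array([[0.0, 0.0],
                       [1.0, 0.0],
                       [0.0, 3.0],
                       [5.0, 5.0]])
neighbours = k_nearest(0, toy_matrix, k=2)
print(neighbours)  # [1, 2]
```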

Now let's give it a try (you may get different results):

In [12]:
print(get_closest('freedom'))
print(get_closest('justice'))
print(get_closest('america'))
print(get_closest('citizens'))
print(get_closest('citizen'))

advancement
appeasement
very
undo
intentioned


Do words have relations?
