# Word Embeddings with Gluon

Word embeddings are one of the most interesting research areas in deep learning
at the moment. The concept of word embeddings originates in the domain of natural language processing (NLP).

In this tutorial, we will discuss embeddings, why they are needed, and the model
architectures commonly used to produce distributed word representations.
We will also implement one of the most widely used word2vec algorithms, called
"Continuous Bag-of-Words" (CBOW).

In creating this tutorial, I've borrowed heavily from the PyTorch tutorial:

[Word Embeddings using Pytorch](http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html)

## Introduction

Word embeddings are dense vectors of real numbers, one per word in your vocabulary.
So where can we use these word embeddings? We can use them as features
for many machine learning and NLP applications, such as sentiment analysis and document classification.

The semantic information in the vectors can be efficiently used for these tasks.
We can also measure the semantic similarity between two words by calculating the
distance between the corresponding word vectors.

Let’s see an example.

Suppose we are building a language model. Suppose we have seen the sentences

- The mathematician ran to the store.
- The physicist ran to the store.
- The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before seen in our training data:

- The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn’t it be much better if we could use the following two facts:

- We have seen mathematician and physicist in the same role in a sentence. Somehow they have a semantic relation.
- We have seen mathematician in the same role in this new unseen sentence as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen sentence? This is what we mean by a notion of similarity: we mean semantic similarity, not simply having similar orthographic representations.

## Word Embeddings with semantic information

[TODO]

Now let's see how we can actually encode semantic similarity in words. Maybe we think up some semantic attributes. For example, we see that both mathematicians and physicists can run, so maybe we give these words a high score for the “is able to run” semantic attribute. Think of some other attributes, and imagine what you might score some common words on those attributes.
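As a rough illustration of hand-scored semantic attributes, the sketch below assigns made-up scores to three words and compares them with cosine similarity. The attribute names, the scores, and the helper `cosine_similarity` are all assumptions chosen for illustration, not values learned from data.

```python
import math

# Hypothetical attribute scores, in the order:
# [is_able_to_run, works_with_equations, is_edible]
attributes = {
    "mathematician": [0.8, 1.0, 0.0],
    "physicist": [0.7, 0.9, 0.0],
    "apple": [0.0, 0.0, 1.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_math_phys = cosine_similarity(attributes["mathematician"], attributes["physicist"])
sim_math_apple = cosine_similarity(attributes["mathematician"], attributes["apple"])
print(sim_math_phys, sim_math_apple)
```

With these scores, "mathematician" lands much closer to "physicist" than to "apple". Learned embeddings play the same role, except that the dimensions are found by training rather than chosen by hand.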

## Word Embeddings using Gluon package

Before we get to examples, a few quick notes about how to use embeddings in MXNet and in deep learning programming in general. Similar to how we defined a unique index for each word when making one-hot vectors, we also need to define an index for each word when using embeddings. These will be keys into a lookup table. That is, embeddings are stored as a $|V| \times D$ matrix, where $|V|$ is the vocabulary size and $D$ is the dimensionality of the embeddings, such that the word assigned index $i$ has its embedding stored in the $i$-th row of the matrix.
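The lookup-table idea can be sketched in plain Python before bringing in any framework. The four-word vocabulary, the dimensionality of 3, and the names `embedding_matrix` and `lookup` are assumptions made for this illustration only.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

vocab = ["the", "quick", "brown", "fox"]  # |V| = 4
word_to_index = {word: i for i, word in enumerate(vocab)}
embedding_dim = 3  # D = 3

# |V| x D matrix stored as a list of rows; row i belongs to the word with index i
embedding_matrix = [[random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
                    for _ in vocab]


def lookup(word):
    """Return the embedding row for `word`."""
    return embedding_matrix[word_to_index[word]]


fox_vector = lookup("fox")
print(fox_vector)
```

An embedding layer in a deep learning framework does essentially this row lookup, with the difference that the rows are trainable parameters.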

Now let's import the required dependencies, including Gluon.

In [3]:
# import dependencies
from __future__ import print_function
import numpy as np
import mxnet as mx
import mxnet.ndarray as F
from mxnet import gluon, autograd
from mxnet.gluon import nn
import logging
logging.getLogger().setLevel(logging.INFO)

The module that allows you to use embeddings is `nn.Embedding`, which takes two arguments: the vocabulary size, and the dimensionality of the embeddings.

We will use the [Trainer](http://mxnet.io/api/python/gluon.html#trainer) class to apply the
[SGD optimizer](http://mxnet.io/api/python/optimization.html#mxnet.optimizer.SGD) to the
initialized parameters.

In [9]:
# Mapping of word to indices
word_map = {"The": 0, "quick": 1, "brown":2, "fox": 3, "jumps": 4, "over":5, "the": 6, "lazy": 7, "dog":8 }

# define simple network with embedding layer
net = nn.Sequential()
with net.name_scope():
    net.add(nn.Embedding(len(word_map), 10))  # 9 words in vocab, 10-dimensional embeddings

ctx = [mx.cpu(0), mx.cpu(1)]

# Initialize parameters on given context
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

# let's see the word embedding for "quick" now
data = mx.nd.array([word_map["quick"]])
with autograd.record():
    z = net(data)
    z.backward()

print(z)


[[-0.48877144  0.1629284  -0.54642481 -0.51822478  0.41569489  0.08368403
  -0.09226221 -0.13501686  0.4453131   0.2817077 ]]
<NDArray 1x10 @cpu(0)>


## N-Gram Language Modeling

An N-gram language model is a type of probabilistic language model that predicts the next
word given the $n-1$ words that precede it:

  $$ P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1}) $$

where $w_i$ is the $i$-th word of the sequence.
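To make the conditional probability concrete, here is a minimal count-based sketch for $n = 3$. It estimates the probability from raw trigram counts rather than with a neural network. The toy token list and the helper `trigram_probability` are assumptions for illustration.

```python
from collections import Counter, defaultdict

tokens = ("the mathematician ran to the store "
          "the physicist ran to the store").split()

# counts[(w_{i-2}, w_{i-1})][w_i] = how often w_i followed that two-word context
counts = defaultdict(Counter)
for i in range(2, len(tokens)):
    context = (tokens[i - 2], tokens[i - 1])
    counts[context][tokens[i]] += 1


def trigram_probability(word, context):
    """Maximum-likelihood estimate of P(word | context) from raw counts."""
    following = counts[tuple(context)]
    total = sum(following.values())
    return following[word] / float(total) if total else 0.0


p_store = trigram_probability("store", ("to", "the"))
p_solved = trigram_probability("solved", ("the", "physicist"))
print(p_store, p_solved)
```

A purely count-based estimate assigns zero probability to any continuation it has never seen, such as "solved" after "the physicist" in this toy corpus. A model built on embeddings can instead share information between similar words.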

In this example, we compute the loss on the training data and update the parameters with backpropagation. The loss should decrease over the iterations.


In [10]:
# Build trigrams from the text, then train a small network that predicts the third word from the first two

context_size = 2
embeding_dim = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
vocab_size = len(vocab)
print(vocab_size)

word_to_ix = {word: i for i, word in enumerate(vocab)}


class Net(nn.Block):
    def __init__(self, **kwargs):
        super(Net, self).__init__(**kwargs)
        with self.name_scope():
            self.embed = nn.Embedding(vocab_size, embeding_dim)
            self.fc1 = nn.Dense(embeding_dim * context_size)
            self.fc2 = nn.Dense(vocab_size)

    def forward(self, x):
        x = self.embed(x)
        # Flatten the context embeddings into one row vector;
        # -1 means infer the size from the rest of the dimensions.
        x = x.reshape((1, -1))
        out = F.relu(self.fc1(x))
        out = self.fc2(out)
        # No log_softmax here: SoftmaxCrossEntropyLoss applies it internally.
        return out



net = Net()
net.collect_params().initialize(mx.init.Xavier())
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.001})

loss = gluon.loss.SoftmaxCrossEntropyLoss()

losses = []
for epoch in range(10):
    total_loss = 0
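    for context, target in trigrams:
        context_idxs = [word_to_ix[w] for w in context]
        context_var = mx.nd.array(context_idxs)
        label = mx.nd.array([word_to_ix[target]])
        with autograd.record():
            log_probs = net(context_var)
            L = loss(log_probs, label)
            L.backward()

        # Trainer.step(batch_size) rescales the gradient by 1 / batch_size.
        # Each update here uses a single example, so passing vocab_size shrinks
        # the effective learning rate and the loss falls very slowly.
        trainer.step(vocab_size)
        total_loss += L.asscalar()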
    losses.append(total_loss)
print(losses)

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
97
[517.7741060256958, 517.77305126190186, 517.77198457717896, 517.770920753479, 517.76986646652222, 517.76880025863647, 517.76774597167969, 517.76668167114258, 517.76562452316284, 517.764564037323]


## Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict words given the context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to quickly train word embeddings, and these embeddings are used to initialize the embeddings of some more complicated model. Usually, this is referred to as pretraining embeddings. It almost always helps performance a couple of percent.

The CBOW model is as follows. Given a target word $w_i$ and a context window of $N$ words on each side, $w_{i-1}, \dots, w_{i-N}$ and $w_{i+1}, \dots, w_{i+N}$, referring to all context words collectively as $C$, CBOW tries to minimize

  $$ - \log P(w_i | C) = - \log \text{Softmax}(A(\sum_{w \in C} q_w) + b)$$

where $q_w$ is the embedding for word $w$.

So basically, CBOW predicts a word given its context. The target word vector is now the output vector of the word at index i; the predicted word vector is the sum over all context input vectors.
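The objective above can be traced step by step in plain Python. This is only a sketch of the forward pass under assumed toy sizes: the five-word vocabulary, the random initial values, and the helper names `random_matrix` and `cbow_loss` are all hypothetical, and no training is performed.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

vocab = ["we", "are", "about", "to", "study"]
word_to_index = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 3  # vocabulary size and embedding dimensionality


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


Q = random_matrix(V, D)  # embeddings q_w, one row per word
A = random_matrix(V, D)  # output projection
b = [0.0] * V            # output bias


def cbow_loss(context_words, target_word):
    """Return (-log P(target | context), log-probabilities over the vocabulary)."""
    # Sum the embeddings of all context words: sum_{w in C} q_w
    summed = [sum(Q[word_to_index[w]][d] for w in context_words) for d in range(D)]
    # Affine map to one score per vocabulary word: A * summed + b
    scores = [sum(A[k][d] * summed[d] for d in range(D)) + b[k] for k in range(V)]
    # Log-softmax, shifted by the max score for numerical stability
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    log_probs = [s - log_norm for s in scores]
    return -log_probs[word_to_index[target_word]], log_probs


loss_value, log_probs = cbow_loss(["we", "are", "to", "study"], "about")
print(loss_value)
```

Training would adjust `Q`, `A`, and `b` to lower this loss across all (context, target) pairs, and the rows of `Q` are what gets reused as word embeddings.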

Now let's implement this model using MXNet's Gluon package.

In [7]:
# Build (context, target) pairs, then train a CBOW model that predicts a word from the two words on each side

context_size = 4  # 2 words to the left, 2 to the right
embeding_dim = 10
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
context_arr = []
target_arr = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]

    context_arr.append([word_to_ix[word] for word in context])
    target_arr.append([word_to_ix[target]])

    data.append((context, target))
print(data[:5])
print(vocab_size)

batch_size = 5
def get_batch(data, batch_size, i):
    return mx.nd.array(data[i:i+batch_size])

class Net(nn.Block):
    def __init__(self, **kwargs):
        super(Net, self).__init__(**kwargs)
        with self.name_scope():
            self.embed = nn.Embedding(vocab_size, embeding_dim)
            self.fc1 = nn.Dense(vocab_size, in_units=embeding_dim)

    def forward(self, x):
        x = self.embed(x)
        # Sum the context word embeddings into a single vector
        x = F.sum(x, axis=0)
        x = x.reshape((1, -1))
        out = self.fc1(x)
        return out

ctx = mx.cpu(0)

net = Net()
net.collect_params().initialize(mx.init.Xavier(), ctx=ctx)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.003})

losses = []
loss = gluon.loss.SoftmaxCrossEntropyLoss()

for epoch in range(10):
    total_loss = 0
    for ibatch in range(0, len(context_arr), batch_size):
        context_batch = get_batch(context_arr, batch_size, ibatch)
        target_batch = get_batch(target_arr, batch_size, ibatch)
        with autograd.record():
            for x, label in zip(context_batch, target_batch):
                log_probs = net(x)
                L = loss(log_probs, label)
                L.backward()

                total_loss += L.asscalar()

        trainer.step(batch_size)
    losses.append(total_loss)
print(losses)

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]
49
[228.41619825363159, 228.39261412620544, 228.36905169487, 228.34551334381104, 228.32200312614441, 228.29851388931274, 228.27505040168762, 228.25160884857178, 228.22819089889526, 228.20479822158813]
