# Using Pre-trained Word Embeddings

This notebook will demonstrate how to use pre-trained word embeddings.

To see why word embeddings are useful, it's worth comparing them to the alternative.
Without word embeddings, we might represent each word with a one-hot vector `[0, ..., 0, 1, 0, ..., 0]`, which takes value `1` at the index corresponding to that vocabulary word
and value `0` everywhere else.
The weight matrices connecting our word-level inputs to the network's hidden layers would each be $v \times h$,
where $v$ is the size of the vocabulary and $h$ is the size of the hidden layer. 
With 100,000 words feeding into an LSTM layer with $1000$ nodes, the model would need to learn
$4$ different input weight matrices (one for each of the three LSTM gates and one for the candidate cell state), each with 100M weights, and thus 400 million parameters in total.
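The arithmetic above can be checked directly. The sketch below also shows, for comparison, the count when each word is first mapped to a dense vector; the dimension `d = 100` is an illustrative assumption, and only input-to-hidden weights are counted (biases and recurrent weights are ignored).

```python
# Input-to-hidden parameter counts for an LSTM layer, as discussed in the text.
v = 100_000            # vocabulary size
h = 1_000              # hidden units in the LSTM layer
d = 100                # embedding dimension (illustrative assumption)
num_input_matrices = 4 # four input weight matrices, as in the text

# One-hot inputs: each of the four matrices is v x h.
one_hot_params = num_input_matrices * v * h

# Embedded inputs: one v x d embedding table plus four d x h matrices.
embedding_params = v * d + num_input_matrices * d * h

print(one_hot_params)    # 400000000
print(embedding_params)  # 10400000
```

With these numbers the embedded variant needs roughly 40 times fewer input parameters, and the embedding table itself can be taken from a pre-trained source instead of being learned from scratch.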

Fortunately, it turns out that a number of efficient techniques 
can quickly discover broadly useful word embeddings in an *unsupervised* manner.
These embeddings map each word onto a low-dimensional vector $w \in \mathbb{R}^d$ with $d$ commonly chosen to be roughly $100$.
Intuitively, these embeddings are chosen based on the contexts in which words appear. 
Words that appear in similar contexts (like "tennis" and "racquet") should have similar embeddings
while words that do not (like "rat" and "gourmet") should have dissimilar embeddings.

Practitioners of deep learning for NLP typically initialize their models
using *pretrained* word embeddings, bringing in outside information and reducing the number of parameters that a neural network needs to learn from scratch.


Two popular word embeddings are Word2Vec and fastText. 
The following examples use pre-trained word embeddings drawn from the following sources:

* Word2Vec: https://arxiv.org/abs/1301.3781
* fastText project website: https://fasttext.cc/

To begin, let's first import a few packages that we'll need for this example:

In [None]:
import mxnet as mx
from mxnet import gluon, nd
from mxnet.gluon import rnn, nn
import gluonnlp as nlp
import re
import d2l

import warnings
warnings.filterwarnings('ignore')

## Pre-trained Word Embeddings

GluonNLP provides a number of pre-trained word embeddings. The cell below lists the first five available fastText sources.

In [None]:
nlp.embedding.list_sources('fasttext')[:5]

Here we load the fastText embedding identified by `source='wiki.en'`;
any other source from the list above can be substituted.

In [None]:
emb = nlp.embedding.create('fasttext', source='wiki.en')

### Word Similarity

Given an input word, we can find its nearest neighbors in the vocabulary by similarity.
The similarity between any pair of words can be measured by the cosine similarity
of their vectors.
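As a minimal, library-free sketch of the same idea, the snippet below computes cosine similarity on a few made-up three-dimensional vectors and picks the nearest neighbor of a query word. The vectors and the helper names `cosine_similarity` and `nearest` are hypothetical and exist only for illustration.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Made-up toy vectors, not taken from any pre-trained embedding.
toy_vectors = {
    'baby':   [0.9, 0.1, 0.0],
    'babies': [0.8, 0.2, 0.1],
    'car':    [0.0, 0.1, 0.9],
}

def nearest(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(nearest('baby', toy_vectors))  # babies
```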

In [None]:
def norm_vecs_by_row(x):
    return x / nd.sqrt(nd.sum(x * x, axis=1) + 1E-10).reshape((-1,1))

In [None]:
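def get_knn(emb, k, word):
    # Column vector for the query word.
    word_vec = emb[word].reshape((-1, 1))
    # Unit-normalize every vocabulary vector so that dot products rank by cosine similarity.
    vocab_vecs = norm_vecs_by_row(emb.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_vec)
    # Ask for k+1 results because the query word itself is expected to rank first.
    indices = nd.topk(dot_prod.reshape((len(emb.idx_to_token), )), k=k+1, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    return [emb.idx_to_token[i] for i in indices[1:]]  # Drop the query word.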

Let us find the 5 words most similar to 'baby' in the embedding's vocabulary
(its size is given by `len(emb.idx_to_token)`).

In [None]:
get_knn(emb, 5, 'baby')

We can verify the cosine similarity of vectors of 'baby' and 'babies'.

In [None]:
from mxnet import nd
def cos_sim(x, y):
    return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))

cos_sim(emb['baby'], emb['babies'])

### Word Analogy

We can also apply pre-trained word embeddings to the word
analogy problem. 

For instance, "man : woman :: son : daughter" is an analogy.

![](../img/analogy.svg)

The word analogy completion problem is defined as: for analogy 'a : b :: c : d',
given the first three words 'a', 'b', 'c', find 'd'. The idea is to find the
most similar word vector for vec('c') + (vec('b')-vec('a')).
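The vector arithmetic can be illustrated on a toy example. The two-dimensional vectors below are made up so that the "gender" offset is exactly `[0, 1]`; real embeddings are only approximately this regular. Unlike the notebook's function, this sketch excludes the three query words from the candidates, which is a design choice of the example.

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Made-up toy vectors for illustration only.
toy = {
    'man':      [1.0, 0.0],
    'woman':    [1.0, 1.0],
    'son':      [3.0, 0.0],
    'daughter': [3.0, 1.0],
    'tree':     [0.0, 5.0],
}

def complete_analogy(a, b, c, vectors):
    """Find d in 'a : b :: c : d' as the word closest to vec(c) + (vec(b) - vec(a))."""
    target = [vc + vb - va for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

print(complete_analogy('man', 'woman', 'son', toy))  # daughter
```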

In this example, we search for the answer among all tokens indexed by the embedding (`emb.idx_to_token`).

In [None]:
def get_top_k_by_analogy(emb, k, word1, word2, word3):
    word_vecs = emb[word1, word2, word3]
    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2]).reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(emb.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_diff)
    indices = nd.topk(dot_prod.reshape((len(emb.idx_to_token), )), k=k, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    return [emb.idx_to_token[i] for i in indices]

Complete the word analogy 'man : woman :: son : ?'.

In [None]:
get_top_k_by_analogy(emb, 1, 'man', 'woman', 'son')

# Text Classification and Data Sets

Text classification is a common task in natural language processing: it maps a text sequence of variable length to a category.

## Text Sentiment Classification Data

We use Stanford's Large Movie Review Dataset as the data set for text sentiment classification.
- It contains a training set and a test set, each with 25,000 movie reviews downloaded from IMDb.
- In each set, the number of reviews labeled "positive" equals the number labeled "negative".

## Dataset in `gluon`

Datasets in Gluon have the following basic structure:

``` python
class Dataset(object):
    def __getitem__(self, idx):
        ...
    
    def __len__(self):
        ...

    def transform(self, fn, lazy=True):
        # Returns a new dataset with each sample
        # transformed by the function `fn`.
```
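To make this interface concrete, here is a minimal pure-Python sketch of a class with the same three methods. `ListDataset` is a hypothetical stand-in, not part of Gluon; it applies `fn` eagerly and always passes the whole sample to `fn`, which is a simplification.

```python
class ListDataset:
    """Hypothetical stand-in for a Gluon-style dataset backed by a list."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __getitem__(self, idx):
        return self._samples[idx]

    def __len__(self):
        return len(self._samples)

    def transform(self, fn):
        # Eager variant: apply `fn` to every sample and wrap the results.
        return ListDataset(fn(sample) for sample in self._samples)

# Two made-up (text, score) samples.
reviews = ListDataset([('a great movie', 9), ('dull and slow', 2)])
tokenized = reviews.transform(lambda sample: (sample[0].split(), sample[1]))

print(len(tokenized))  # 2
print(tokenized[0])    # (['a', 'great', 'movie'], 9)
```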

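Any list-like Python object (i.e. one that implements `__getitem__` and `__len__`, so that it can be subscripted like `x[0]`)
can be turned into a `gluon` `Dataset` by wrapping it in `gluon.data.SimpleDataset`. Below, we load the IMDB data set that ships with GluonNLP and tokenize each review with `transform`: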

In [None]:
def tokenize_while_preserving_label(sample):
    sentence, label = sample
    return sentence.split(), label

train_dataset, test_dataset = [nlp.data.IMDB(segment=segment).transform(
    tokenize_while_preserving_label, lazy=False)
    for segment in ('train', 'test')]

In [None]:
print(train_dataset[0])

In [None]:
def get_text(text, label):
    return text
train_tokens = train_dataset.transform(get_text)

import itertools
vocab = nlp.Vocab(nlp.data.count_tokens(itertools.chain.from_iterable(train_tokens)), min_freq=10)
print(vocab)
print(vocab.idx_to_token[:10] + ["..."])

In [None]:
# `length_clip` takes as input a list and outputs a list with maximum length 500.
length_clip = nlp.data.ClipSequence(500)

def preprocess(data, label):
    label = int(label > 5)
    data = vocab[length_clip(data)]
    return [data, label]


train_dataset = train_dataset.transform(preprocess)
test_dataset = test_dataset.transform(preprocess)
print(train_dataset[0])

#### `gluonnlp.data.batchify.Pad`

`gluonnlp.data.batchify.Pad` can be instantiated with the padding value.
The resulting function takes a list of variable length lists (or NDArrays or numpy arrays)
as input and returns a single NDArray where all shorter inputs are padded to the length
of the maximum length input element.
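The padding behavior described here can be sketched in plain Python. The function `pad_batch` below is a hypothetical illustration working on nested lists, not the GluonNLP implementation; it also returns the original lengths, similar in spirit to the `ret_length=True` option used later in this notebook.

```python
def pad_batch(sequences, pad_val=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

padded, lengths = pad_batch([[4, 8, 15], [16], [23, 42]], pad_val=0)
print(padded)   # [[4, 8, 15], [16, 0, 0], [23, 42, 0]]
print(lengths)  # [3, 1, 2]
```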

In [None]:
print(len(train_dataset[0][0]), len(train_dataset[1][0]))

In [None]:
# Pad data, stack label and lengths
pad_val = vocab[vocab.padding_token]
batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(axis=0, ret_length=True, pad_val=pad_val),
    nlp.data.batchify.Stack(dtype='float32'))
print('pad_val:', pad_val)

In [None]:
batchify_fn([train_dataset[0], train_dataset[1], train_dataset[2]])

#### `gluon.data.DataLoader`

Manually sampling sentences from the dataset and applying the pad function may be bothersome.
`gluon.data.DataLoader` automates this process.

In [None]:
print(gluon.data.DataLoader.__doc__)

In [None]:
batch_size = 128
data_loader = gluon.data.DataLoader(train_dataset, batchify_fn=batchify_fn,
                                    batch_size=batch_size)
print('Average length of batches is', sum(batch[0][0].shape[1] for batch in data_loader) / len(data_loader))

<img src="../img/fixed_bucket_strategy_ratio0.7.png" style="width: 100%;"/>

##### Custom `Sampler` for `DataLoader`

Sampling random sentences from the data set and padding them is suboptimal, because the number of padding elements in a batch is determined by its longest sentence.
`DataLoader` accepts a `Sampler` that specifies which sentences are selected for each batch.

For example, `gluonnlp.data.FixedBucketSampler` assigns each data sample (sentence)
to a fixed bucket based on its length. Resulting batches will only contain sentences
from a single bucket.

Because the sentences in a batch then have similar lengths, less padding is needed. This can significantly reduce the average number of elements per batch and consequently the amount of computation.
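The effect on padding can be estimated with a small simulation. The sketch below uses made-up sentence lengths and, as a simplified stand-in for fixed buckets, groups sentences by sorting them by length before batching; it does not reproduce the actual bucketing scheme of `FixedBucketSampler`.

```python
import random

def padding_per_batch(batches):
    """Average number of padding elements per batch, given lists of lengths."""
    total = 0
    for batch in batches:
        longest = max(batch)
        total += sum(longest - length for length in batch)
    return total / len(batches)

def chunks(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

random.seed(0)
lengths = [random.randint(10, 500) for _ in range(1024)]  # made-up lengths
batch_size = 128

random_batches = chunks(lengths, batch_size)            # arbitrary order
grouped_batches = chunks(sorted(lengths), batch_size)   # similar lengths together

print('random order :', padding_per_batch(random_batches))
print('length groups:', padding_per_batch(grouped_batches))
```

Grouping by length leaves each batch with a narrow range of lengths, so far fewer padding elements are needed than when long and short sentences are mixed.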

In [None]:
print(nlp.data.FixedBucketSampler.__doc__)

In [None]:
train_sampler = nlp.data.FixedBucketSampler(
    lengths=train_dataset.transform(lambda x: len(x[0]), lazy=False),
    batch_size=batch_size, shuffle=True)
test_sampler = nlp.data.FixedBucketSampler(
    lengths=test_dataset.transform(lambda x: len(x[0]), lazy=False),
    batch_size=batch_size, shuffle=True)

print(train_sampler.stats())

In [None]:
train_dataloader = gluon.data.DataLoader(train_dataset, batchify_fn=batchify_fn, batch_sampler=train_sampler)
test_dataloader = gluon.data.DataLoader(test_dataset, batchify_fn=batchify_fn, batch_sampler=test_sampler)
print('Average length of batches is',
      sum(batch[0][0].shape[1] for batch in train_dataloader) / len(train_dataloader))

In [None]:
next(iter(train_dataloader))

# Classification Models: Using a Bag of Context Free Embeddings

### Train and Evaluate the Model



In [None]:
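def evaluate(test_data, ctx, net):
    accuracy = 0
    # `ctx` may be a list of devices (see `train`); evaluation runs on a single device.
    ctx = d2l.try_gpu()
    for i, ((inputs, _), labels) in enumerate(test_data):
        inputs = inputs.as_in_context(ctx)
        labels = labels.as_in_context(ctx)
        outs = net(inputs)
        # Fraction of correct predictions in this batch.
        accuracy += (outs.argmax(axis=1).squeeze() == labels).mean()
    # Average the per-batch accuracies.
    print("Test Acc {}".format(accuracy.asscalar()/(i+1)))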

In [None]:
def train(net, train_iter, test_iter, loss, trainer, num_epochs, ctx):
    num_batches = len(train_iter)
    params = [p for p in net.collect_params().values() if p.grad_req != 'null']
    for epoch in range(num_epochs):
        accuracy = mx.metric.Accuracy()
        running_loss = 0
        for i, ((features, _), labels) in enumerate(train_iter):           
            features = gluon.utils.split_and_load(features, ctx, even_split=False)
            labels = gluon.utils.split_and_load(labels, ctx, even_split=False)
            losses, preds = [], []
            with mx.autograd.record():
                for feature, label in zip(features, labels):
                    y = net(feature)
                    l = loss(y, label)
                    losses.append(l)
                    preds.append(y)
            mx.autograd.backward(losses)
            for l in losses:
                running_loss += l.mean().asscalar() / len(losses)
            # Gradient clipping
            trainer.allreduce_grads()
            nlp.utils.clip_grad_global_norm(params, 1)
            trainer.update(1)
            accuracy.update(labels, preds)
            if i % 25 == 0:
                print("Batch", i, "Acc", accuracy.get()[1],"Train Loss", running_loss/(i+1))
        print("Epoch {}, Acc {}, Train Loss {}".format(epoch, accuracy.get(), running_loss/(i+1)))
        evaluate(test_iter, ctx, net)

We define a prediction function:

In [None]:
def predict_sentiment(net, vocab, sentence):
    sentence = nd.array(vocab[sentence.split()], ctx=d2l.try_gpu())
    label = nd.argmax(net(sentence.reshape((1, -1))), axis=1)
    return 'positive' if label.asscalar() == 1 else 'negative'

# Using Recurrent Neural Networks

In this section, we will apply
pre-trained word vectors and bidirectional recurrent neural networks with
multiple hidden layers:

Maas, Andrew L., et al. "Learning word vectors for sentiment analysis." Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1. Association for Computational Linguistics, 2011.

## Bidirectional RNNs

![BiRNN](../img/birnn.svg)


## Use a Recurrent Neural Network Model

In this model, each word first obtains a feature vector from the embedding
layer. Then, we further encode the feature sequence using a bidirectional
recurrent neural network to obtain sequence information. Finally, we pass
the encoded sequence information through a fully connected layer to produce
the output. Specifically, we concatenate the hidden states of the bidirectional
long short-term memory (LSTM) at the initial time step and the final time step
and pass the result to the output layer as the encoding of the sequence. In
the `BiRNN` class implemented below, the `Embedding` instance is the embedding
layer, the `LSTM` instance is the hidden layer for sequence encoding, and the
`Dense` instance is the output layer that generates the classification results.
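It can help to trace the array shapes through these three layers. The sketch below only does shape bookkeeping in plain Python; it assumes the LSTM uses a time-major layout `(seq_len, batch, features)` and that a bidirectional layer outputs `2 * num_hiddens` features per time step. The function name `birnn_shapes` and the example sizes are hypothetical.

```python
def birnn_shapes(batch_size, seq_len, embed_size, num_hiddens):
    """Shapes after each step of the described model (assumes time-major LSTM)."""
    return {
        'inputs':     (batch_size, seq_len),               # token indices
        'transposed': (seq_len, batch_size),               # time-major indices
        'embeddings': (seq_len, batch_size, embed_size),   # one vector per token
        'outputs':    (seq_len, batch_size, 2 * num_hiddens),  # both directions
        # First and last time steps concatenated along the feature axis.
        'encoding':   (batch_size, 4 * num_hiddens),
        'outs':       (batch_size, 2),                     # two sentiment classes
    }

shapes = birnn_shapes(batch_size=128, seq_len=500, embed_size=300, num_hiddens=100)
for name, shape in shapes.items():
    print(name, shape)
```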

In [None]:
class BiRNN(nn.Block):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, **kwargs):
        super(BiRNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.encoder = rnn.LSTM(num_hiddens, num_layers=num_layers,
                                bidirectional=True, input_size=embed_size)
        self.decoder = nn.Dense(2)

    def forward(self, inputs):
        embeddings = self.embedding(mx.nd.transpose(inputs))
        outputs = self.encoder(embeddings)
        encoding = mx.nd.concat(outputs[0], outputs[-1])
        outs = self.decoder(encoding)
        return outs

Create a bidirectional recurrent neural network with two hidden layers.

In [None]:
ctx, embed_size = d2l.try_all_gpus(), 300
num_hiddens, num_layers = 100, 2
net = BiRNN(len(vocab), embed_size, num_hiddens, num_layers)
net.initialize(mx.init.Xavier(), ctx=ctx)

In [None]:
emb = nlp.embedding.create('fasttext', source='wiki.simple', load_ngrams=True)
idx_to_vec = emb[vocab.idx_to_token]
idx_to_vec.shape

### Load Pre-trained Word Vectors

Because the training data set for sentiment classification is not very large, in order to deal with overfitting, we will directly use word vectors pre-trained on a larger corpus as the feature vectors of all words. Here, we set the embedding layer's weights to the fastText vectors in `idx_to_vec`, which holds one 300-dimensional vector (matching `embed_size`) for each word in the dictionary `vocab`, and exclude them from training.

In [None]:
net.embedding.weight.set_data(idx_to_vec)
net.embedding.collect_params().setattr('grad_req', 'null')

### Train and Evaluate the Model

Now, we can start training.

In [None]:
lr, num_epochs = 0.001, 5
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr}, update_on_kvstore=False)
loss = gluon.loss.SoftmaxCrossEntropyLoss()
train(net, train_dataloader, test_dataloader, loss, trainer, num_epochs, ctx)

In [None]:
predict_sentiment(net, vocab, 'this movie is so great')

In [None]:
predict_sentiment(net, vocab, 'this movie is so bad')

# Using Convolutional Neural Networks (textCNN)

Idea: treat text as a one-dimensional "image".

Then we can use one-dimensional convolutional neural networks to capture associations between adjacent words.

This section describes a groundbreaking approach to applying
convolutional neural networks to text analysis: textCNN (Kim, 2014).

## One-dimensional Convolutional Layer

![One-dimensional cross-correlation operation. The shaded parts are the first output element as well as the input and kernel array elements used in its calculation: $0\times1+1\times2=2$. ](../img/conv1d.svg)

Before introducing the model, let us explain how a one-dimensional convolutional layer works. Like a two-dimensional convolutional layer, a one-dimensional convolutional layer uses a one-dimensional cross-correlation operation. In this operation, the convolution window starts from the leftmost side of the input array and slides across it from left to right. At each position, the input subarray in the window and the kernel array are multiplied elementwise and summed to give the element at the corresponding location in the output array. As shown in the figure above, the input is a one-dimensional array with a width of 7 and the width of the kernel array is 2. The output width is therefore $7-2+1=6$, and the first element is obtained by multiplying the leftmost input subarray of width 2 elementwise with the kernel array and then summing the results.


Next, we implement one-dimensional cross-correlation in the `corr1d` function. It accepts the input array `X` and kernel array `K` and outputs the array `Y`.

In [None]:
def corr1d(X, K):
    w = K.shape[0]
    Y = nd.zeros((X.shape[0] - w + 1))
    for i in range(Y.shape[0]):
        Y[i] = (X[i: i + w] * K).sum()
    return Y

Let's reproduce the results of the one-dimensional cross-correlation operation shown in the figure above.

In [None]:
X, K = nd.array([0, 1, 2, 3, 4, 5, 6]), nd.array([1, 2])
corr1d(X, K)

## One-dimensional Convolutional Layer with multiple input channels


![One-dimensional cross-correlation operation with three input channels. The shaded parts are the first output element as well as the input and kernel array elements used in its calculation: $0\times1+1\times2+1\times3+2\times4+2\times(-1)+3\times(-3)=2$. ](../img/conv1d-channel.svg)

The one-dimensional cross-correlation operation for multiple input channels is similar to the two-dimensional cross-correlation operation for multiple input channels. On each channel, it performs the one-dimensional cross-correlation operation on the kernel and its corresponding input and adds the results of the channels to get the output. The figure above shows a one-dimensional cross-correlation operation with three input channels.

Now, we reproduce the results of the one-dimensional cross-correlation operation with multiple input channels.

In [None]:
def corr1d_multi_in(X, K):
    return nd.add_n(*[corr1d(x, k) for x, k in zip(X, K)])

In [None]:
X = nd.array([[0, 1, 2, 3, 4, 5, 6],
              [1, 2, 3, 4, 5, 6, 7],
              [2, 3, 4, 5, 6, 7, 8]])
K = nd.array([[1, 2], [3, 4], [-1, -3]])
corr1d_multi_in(X, K)

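This is equivalent to a two-dimensional cross-correlation with a single input channel, in which the kernel height equals the input height (the number of channels), so the kernel can only slide along the width.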

![Two-dimensional cross-correlation operation with a single input channel. The highlighted parts are the first output element and the input and kernel array elements used in its calculation: $2\times(-1)+3\times(-3)+1\times3+2\times4+0\times1+1\times2=2$. ](../img/conv1d-2d.svg)
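This equivalence can be checked numerically. The sketch below reimplements both operations on nested Python lists (so it does not depend on MXNet) and applies them to the same three-channel input and kernel used above; the list-based helpers are written only for this illustration.

```python
def corr1d(x, k):
    """One-dimensional cross-correlation of a list with a kernel list."""
    w = len(k)
    return [sum(x[i + j] * k[j] for j in range(w)) for i in range(len(x) - w + 1)]

def corr1d_multi_in(X, K):
    """Sum of per-channel one-dimensional cross-correlations."""
    per_channel = [corr1d(x, k) for x, k in zip(X, K)]
    return [sum(values) for values in zip(*per_channel)]

def corr2d(X, K):
    """Two-dimensional cross-correlation with a single input channel."""
    kh, kw = len(K), len(K[0])
    out_h, out_w = len(X) - kh + 1, len(X[0]) - kw + 1
    return [[sum(X[r + a][c + b] * K[a][b] for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

X = [[0, 1, 2, 3, 4, 5, 6],
     [1, 2, 3, 4, 5, 6, 7],
     [2, 3, 4, 5, 6, 7, 8]]
K = [[1, 2], [3, 4], [-1, -3]]

print(corr1d_multi_in(X, K))  # [2, 8, 14, 20, 26, 32]
print(corr2d(X, K))           # [[2, 8, 14, 20, 26, 32]]
```

Because the kernel is as tall as the input, the two-dimensional output has a single row, and that row matches the multi-channel one-dimensional result.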

We can obtain multiple output channels by applying the cross-correlation multiple times with different kernels.

## Max-Over-Time Pooling Layer

Similarly, we have a one-dimensional pooling layer. The max-over-time pooling layer used in TextCNN corresponds to a one-dimensional global maximum pooling layer. Assuming that the input contains multiple channels, and each channel consists of values at different time steps, the output of each channel is the largest value over all time steps in that channel. Therefore, the input of the max-over-time pooling layer can have a different number of time steps on each channel.

To improve computing performance, we often combine sequence examples of different lengths into a mini-batch and make all examples in the batch the same length by appending special padding symbols (such as 0) to the end of the shorter examples. Naturally, the added padding symbols have no intrinsic meaning. Because the max-over-time pooling layer keeps only the strongest response in each channel, it usually allows the model to be unaffected by the manually added symbols.
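A small pure-Python sketch illustrates both points: channels of different lengths each reduce to a single value, and appending zeros does not change the result as long as every channel's maximum is at least zero. The feature values are made up for illustration.

```python
def max_over_time(channels):
    """Keep the largest value of each channel, whatever its length."""
    return [max(channel) for channel in channels]

# Two channels with different numbers of time steps (made-up values).
channels = [[0.2, 1.5, 0.7],
            [0.0, 0.3, 0.9, 0.1, 0.4]]
print(max_over_time(channels))  # [1.5, 0.9]

# Append zeros so that both channels have five time steps.
target_len = 5
padded = [channel + [0.0] * (target_len - len(channel)) for channel in channels]
print(max_over_time(padded))    # [1.5, 0.9]
```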

## The TextCNN Model

![TextCNN design. ](../img/textcnn.svg)

Kim, Yoon. "Convolutional neural networks for sentence classification." EMNLP 2014.

TextCNN mainly uses a one-dimensional convolutional layer and max-over-time pooling layer. Suppose the input text sequence consists of $n$ words, and each word is represented by a $d$-dimension word vector. Then the input example has a width of $n$, a height of 1, and $d$ input channels. The calculation of textCNN can be mainly divided into the following steps:

1. Define multiple one-dimensional convolution kernels and use them to perform convolution calculations on the inputs. Convolution kernels with different widths may capture the correlation of different numbers of adjacent words.
2. Perform max-over-time pooling on all output channels, and then concatenate the pooling output values of these channels in a vector.
3. The concatenated vector is transformed into the output for each category through the fully connected layer. A dropout layer can be used in this step to deal with overfitting.

The figure above gives an example to illustrate textCNN. The input here is a sentence with 11 words, with each word represented by a 6-dimensional word vector. Therefore, the input sequence has a width of 11 and 6 input channels. We assume there are two one-dimensional convolution kernels with widths of 2 and 4, and 4 and 5 output channels, respectively. Therefore, after one-dimensional convolution calculation, the width of the four output channels is $11-2+1=10$, while the width of the other five channels is $11-4+1=8$. Even though the width of each channel is different, we can still perform max-over-time pooling for each channel and concatenate the pooling outputs of the 9 channels into a 9-dimensional vector. Finally, we use a fully connected layer to transform the 9-dimensional vector into a 2-dimensional output: positive sentiment and negative sentiment predictions.
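The shape arithmetic in this example can be checked with a few lines of Python. The helper `textcnn_output_sizes` is a hypothetical name; it assumes convolutions without padding and with stride 1, as in the example.

```python
def textcnn_output_sizes(seq_len, kernel_widths, num_channels):
    """Width of each convolution output and size of the pooled vector."""
    conv_widths = [seq_len - k + 1 for k in kernel_widths]  # no padding, stride 1
    pooled_size = sum(num_channels)  # one value per output channel after pooling
    return conv_widths, pooled_size

# The example from the text: 11 words, kernel widths 2 and 4 with 4 and 5 channels.
conv_widths, pooled_size = textcnn_output_sizes(11, [2, 4], [4, 5])
print(conv_widths)  # [10, 8]
print(pooled_size)  # 9
```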

Next, we will implement a textCNN model. Compared with the previous section, in addition to replacing the recurrent neural network with a one-dimensional convolutional layer, here we use two embedding layers, one with a fixed weight and another that participates in training.

In [None]:
class TextCNN(nn.Block):
    def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels,
                 **kwargs):
        super(TextCNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.constant_embedding = nn.Embedding(vocab_size, embed_size)
        self.dropout = nn.Dropout(0.5)
        self.decoder = nn.Dense(2)
        self.pool = nn.GlobalMaxPool1D()
        self.convs = nn.Sequential()
        for c, k in zip(num_channels, kernel_sizes):
            self.convs.add(nn.Conv1D(c, k, activation='relu'))

    def forward(self, inputs):
        embeddings = nd.concat(
            self.embedding(inputs), self.constant_embedding(inputs), dim=2)
        embeddings = embeddings.transpose((0, 2, 1))
        encoding = nd.concat(*[nd.flatten(
            self.pool(conv(embeddings))) for conv in self.convs], dim=1)
        outputs = self.decoder(self.dropout(encoding))
        return outputs

Create a TextCNN instance. It has 3 convolutional layers with kernel widths of 3, 4, and 5, all with 100 output channels.

In [None]:
kernel_sizes, nums_channels = [3, 4, 5], [100, 100, 100]
net = TextCNN(len(vocab), embed_size, kernel_sizes, nums_channels)
net.initialize(mx.init.Xavier(), ctx=ctx)

### Load Pre-trained Word Vectors

As in the previous section, we use the pre-trained 300-dimensional fastText word vectors in `idx_to_vec` to initialize the embedding layers `embedding` and `constant_embedding`. Here, the former participates in training while the latter has a fixed weight.

In [None]:
net.embedding.weight.set_data(idx_to_vec)
net.constant_embedding.weight.set_data(idx_to_vec)
net.constant_embedding.collect_params().setattr('grad_req', 'null')

### Train and Evaluate the Model

Now we can train the model.

In [None]:
lr, num_epochs = 0.001, 5
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr},
                        update_on_kvstore=False)
loss = gluon.loss.SoftmaxCrossEntropyLoss()
train(net, train_dataloader, test_dataloader, loss, trainer, num_epochs, ctx)

Below, we use the trained model to classify the sentiment of two simple sentences.

In [None]:
predict_sentiment(net, vocab, 'this movie is so great')

In [None]:
predict_sentiment(net, vocab, 'this movie is so bad')