# Using Pre-trained Word Embeddings

This tutorial shows how to use pre-trained word embeddings, in which each word is represented by a vector. Two popular word embeddings are GloVe and fastText. The pre-trained GloVe and fastText embeddings used here come from the following sources:

* GloVe project website: https://nlp.stanford.edu/projects/glove/
* fastText project website: https://fasttext.cc/

Let us first import the following packages used in this example.

In [1]:
from mxnet import gluon
from mxnet import nd
import gluonnlp as nlp
import re

## Creating Vocabulary with Word Embeddings

As a common use case, let us index words, attach pre-trained word embeddings to them, and use these embeddings in Gluon. We will assign a unique ID and a word vector to each word in the vocabulary in just a few lines of code.

### Creating Vocabulary from Data Sets

To begin with, suppose that we have a simple text data set in the string format. 

In [2]:
text = " hello world \n hello nice world \n hi world \n"

We need a tokenizer to split this string into words, and a counter to tally how often each word occurs.

In [3]:
def simple_tokenize(source_str, token_delim=' ', seq_delim='\n'):
    return filter(None, re.split(token_delim + '|' + seq_delim, source_str))
counter = nlp.data.count_tokens(simple_tokenize(text))

In [4]:
counter

Counter({'hello': 2, 'world': 3, 'nice': 1, 'hi': 1})
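To see what the tokenizer produces before counting, the following standalone sketch repeats the same splitting logic with only the standard library. It uses `collections.Counter` in place of `nlp.data.count_tokens`, which is an assumption made so the snippet runs without gluonnlp; the name `plain_counter` is ours.

```python
import re
from collections import Counter

text = " hello world \n hello nice world \n hi world \n"

def simple_tokenize(source_str, token_delim=' ', seq_delim='\n'):
    # Split on either delimiter, then drop the empty strings that
    # leading, trailing, or consecutive delimiters leave behind.
    pieces = re.split(token_delim + '|' + seq_delim, source_str)
    return [piece for piece in pieces if piece]

tokens = simple_tokenize(text)
# Stand-in for nlp.data.count_tokens (assumption: a plain Counter suffices here).
plain_counter = Counter(tokens)

print(tokens)
print(plain_counter)
```

Returning a list instead of a `filter` object makes the tokens easy to inspect, at the cost of materializing them in memory.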

The resulting `counter` holds key-value pairs whose keys are words and whose values are word frequencies. This allows us to filter out infrequent words via `Vocab` arguments such as `max_size` and `min_freq`. Suppose that we want to build indices for all the keys in `counter`. We need a `Vocab` instance with `counter` as its argument.

In [5]:
vocab = nlp.Vocab(counter)
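As a rough mental model of what indexing does, the sketch below builds an index from a counter in plain Python. The helper `build_index` is hypothetical and is not gluonnlp's implementation. It assumes that tokens are ordered by decreasing frequency with alphabetical tie-breaking (which matches the `idx_to_token` listing in this section), and that `max_size` does not count the special tokens.

```python
from collections import Counter

SPECIAL_TOKENS = ['<unk>', '<pad>', '<bos>', '<eos>']

def build_index(counter, max_size=None, min_freq=1):
    """Hypothetical helper that mimics frequency-based indexing."""
    # Most frequent tokens first; ties are broken alphabetically.
    ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
    # Drop tokens that occur fewer than min_freq times.
    kept = [token for token, freq in ranked if freq >= min_freq]
    # Keep at most max_size regular tokens (assumption: specials not counted).
    if max_size is not None:
        kept = kept[:max_size]
    idx_to_token = SPECIAL_TOKENS + kept
    token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

counter = Counter({'hello': 2, 'world': 3, 'nice': 1, 'hi': 1})
idx_to_token, token_to_idx = build_index(counter)
frequent_only, _ = build_index(counter, min_freq=2)
top_one, _ = build_index(counter, max_size=1)

print(idx_to_token)
print(frequent_only)
```

With `min_freq=2`, only 'world' and 'hello' survive; with `max_size=1`, only 'world' does.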

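The vocabulary now maps every token to an index. Listing `idx_to_token` shows the four special tokens first, followed by the words in order of decreasing frequency.

In [6]:
vocab.idx_to_token

['<unk>', '<pad>', '<bos>', '<eos>', 'world', 'hello', 'hi', 'nice']

To attach word embeddings to the indexed words in `vocab`, let us create a fastText word embedding instance by specifying the embedding name `fasttext` and the source name `wiki.simple`.
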
In [7]:
fasttext_simple = nlp.embedding.create('fasttext', source='wiki.simple')

Now we can attach the word embedding `fasttext_simple` to the indexed words in `vocab`.

In [14]:
vocab.set_embedding(fasttext_simple)

In [15]:
vocab.idx_to_token

['<unk>', '<pad>', '<bos>', '<eos>', 'world', 'hello', 'hi', 'nice']

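A word that is not in the vocabulary, such as 'beautiful', is treated as the unknown token, whose vector here is all zeros.

In [16]: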
vocab.embedding['beautiful']


[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 300 @cpu(0)>

In [17]:
vocab.embedding['hello', 'world'][:, :5]


[[ 0.39567   0.21454  -0.035389 -0.24299  -0.095645]
 [ 0.10444  -0.10858   0.27212   0.13299  -0.33165 ]]
<NDArray 2x5 @cpu(0)>
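The lookups above can be pictured as a table with one row per vocabulary index. The sketch below is a plain-Python illustration, not gluonnlp's code: the helper `attach_embedding`, the function `lookup`, and the 3-dimensional vectors are all made up for the example, and the zero vector for tokens without a pre-trained entry is an assumption chosen to match the output shown for 'beautiful'.

```python
def attach_embedding(idx_to_token, pretrained, dim):
    """Hypothetical helper: build one vector per vocabulary index."""
    zero = [0.0] * dim
    # Tokens without a pre-trained vector (e.g. special tokens) get zeros.
    return [list(pretrained.get(token, zero)) for token in idx_to_token]

idx_to_token = ['<unk>', '<pad>', '<bos>', '<eos>', 'world', 'hello', 'hi', 'nice']
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}

# Made-up toy vectors standing in for a real pre-trained source.
pretrained = {
    'world': [0.1, -0.1, 0.3],
    'hello': [0.4, 0.2, 0.0],
    'hi': [0.3, 0.2, 0.1],
    'nice': [0.0, 0.5, 0.2],
    'beautiful': [0.9, 0.9, 0.9],  # in the source, but not in our vocabulary
}

idx_to_vec = attach_embedding(idx_to_token, pretrained, dim=3)

def lookup(word):
    # Out-of-vocabulary words fall back to index 0, the unknown token.
    return idx_to_vec[token_to_idx.get(word, 0)]

print(lookup('hello'))
print(lookup('beautiful'))
```

Note that 'beautiful' returns zeros even though the toy source contains a vector for it: only words indexed by the vocabulary receive a row in the table.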

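### Application of Pre-trained Word Embeddings

Next, we load the 50-dimensional GloVe embeddings from the `glove.6B.50d` source and build a vocabulary from all of their tokens.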

In [18]:
embedding = nlp.embedding.create('glove', source='glove.6B.50d')

In [19]:
vocab = nlp.Vocab(nlp.data.Counter(embedding.idx_to_token))
vocab.set_embedding(embedding)

The size of `vocab` shown below includes the four special tokens (`<unk>`, `<pad>`, `<bos>`, `<eos>`) in addition to the 400,000 GloVe tokens.

In [20]:
len(vocab.idx_to_token)

400004

We can look up the index of a token in `vocab`, and the token stored at a given index.

In [21]:
print(vocab['beautiful'])
print(vocab.idx_to_token[71424])

71424
beautiful


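To compare two word vectors, we use cosine similarity: the dot product of the vectors divided by the product of their norms.

![Cosine similarity formula](support/cosinesimilarity.png)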

In [22]:
def cos_sim(x, y):
    return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))
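For intuition about the values `cos_sim` returns, here is a dependency-free sketch of the same formula on small hand-picked vectors. The name `cos_sim_plain` and the example vectors are ours and serve only as an illustration.

```python
import math

def cos_sim_plain(x, y):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Same direction, different length: similarity close to 1.
same_direction = cos_sim_plain([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0.
orthogonal = cos_sim_plain([1.0, 0.0], [0.0, 3.0])
# Opposite directions: similarity close to -1.
opposite = cos_sim_plain([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Because the norms are divided out, cosine similarity depends only on the direction of the vectors, not on their length.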

### Word Similarity

Given an input word, we can find the $k$ nearest words in the vocabulary (400,000 words, excluding the special tokens) by similarity. The similarity between any pair of words can be measured by the cosine similarity of their vectors.

In [23]:
def norm_vecs_by_row(x):
    # Scale every row to unit L2 norm.
    return x / nd.sqrt(nd.sum(x * x, axis=1)).reshape((-1,1))

def get_knn(vocab, k, word):
    word_vec = vocab.embedding[word].reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    # Skip the four special tokens at indices 0-3.
    dot_prod = nd.dot(vocab_vecs[4:], word_vec)
    # Ask for k+1 hits because the input word is its own nearest neighbour.
    indices = nd.topk(dot_prod.squeeze(), k=k+1, ret_typ='indices')
    # Shift the indices by 4 to undo the slice above.
    indices = [int(i.asscalar())+4 for i in indices]
    # Drop the first hit, which is the input word itself.
    return vocab.to_tokens(indices[1:])
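The same nearest-neighbour idea can be tried on a handful of vectors without MXNet. The sketch below is a toy illustration: the helper names `normalize` and `knn` and the 2-dimensional vectors are made up, and unlike `get_knn` it excludes the query word explicitly instead of dropping the top hit.

```python
import math

def normalize(vec):
    # Scale a vector to unit L2 norm.
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]

def knn(vectors, word, k):
    query = normalize(vectors[word])
    scores = {}
    for other, vec in vectors.items():
        if other == word:
            continue  # skip the query word itself
        # Dot product of unit vectors equals their cosine similarity.
        scores[other] = sum(a * b for a, b in zip(query, normalize(vec)))
    # Highest similarity first.
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up toy vectors, not real GloVe values.
toy_vectors = {
    'baby': [1.0, 0.9],
    'babies': [0.9, 1.0],
    'boy': [1.0, 0.3],
    'car': [-1.0, 0.2],
}

neighbours = knn(toy_vectors, 'baby', 2)
print(neighbours)
```

Excluding the query by name is slightly more robust than dropping the first result, since it does not rely on the query always ranking first.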

Let us find the 5 words most similar to 'baby' in the vocabulary (size: 400,000 words).

In [24]:
get_knn(vocab, 5, 'baby')

['babies', 'boy', 'girl', 'newborn', 'pregnant']

We can verify the cosine similarity of vectors of 'baby' and 'babies'.

In [25]:
cos_sim(vocab.embedding['baby'], vocab.embedding['babies'])


[0.838713]
<NDArray 1 @cpu(0)>

Let us find the 5 words most similar to 'run' in the vocabulary.

In [26]:
get_knn(vocab, 5, 'run')

['running', 'runs', 'went', 'start', 'ran']

Let us find the 5 words most similar to 'beautiful' in the vocabulary.

In [27]:
get_knn(vocab, 5, 'beautiful')

['lovely', 'gorgeous', 'wonderful', 'charming', 'beauty']

### Word Analogy

We can also apply pre-trained word embeddings to the word analogy problem. For instance, "man : woman :: son : daughter" is an analogy. The word analogy completion problem is defined as follows: for the analogy 'a : b :: c : d', given the first three words 'a', 'b', and 'c', find 'd'. The idea is to find the word whose vector is most similar to vec('c') + (vec('b') - vec('a')).

In this example, we will find words by analogy from the 400,000 indexed words in `vocab`.

In [28]:
def get_top_k_by_analogy(vocab, k, word1, word2, word3):
    word_vecs = vocab.embedding[word1, word2, word3]

    # Target vector: vec(word2) - vec(word1) + vec(word3).
    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2])

    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    # Skip the four special tokens at indices 0-3.
    dot_prod = nd.dot(vocab_vecs[4:], word_diff.squeeze()).squeeze()

    indices = dot_prod.topk(k=k, ret_typ='indices')
    # Shift the indices by 4 to undo the slice above.
    indices = [int(i.asscalar())+4 for i in indices]
    return vocab.to_tokens(indices)
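To make the vector arithmetic concrete, the sketch below solves an analogy on made-up 3-dimensional vectors whose axes loosely stand for "person", "female", and "younger generation". The helper names `cosine` and `analogy` are hypothetical. Excluding the three input words from the candidates is a design choice of this sketch; `get_top_k_by_analogy` above does not do so.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def analogy(vectors, a, b, c, k=1):
    # Target vector: vec(b) - vec(a) + vec(c).
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Assumption of this sketch: the input words are not valid answers.
    candidates = [word for word in vectors if word not in (a, b, c)]
    ranked = sorted(candidates,
                    key=lambda word: cosine(vectors[word], target),
                    reverse=True)
    return ranked[:k]

# Made-up toy vectors, not real GloVe values.
toy_vectors = {
    'man': [1.0, 0.0, 0.0],
    'woman': [1.0, 1.0, 0.0],
    'son': [1.0, 0.0, 1.0],
    'daughter': [1.0, 1.0, 1.0],
    'car': [0.0, -1.0, 0.5],
}

answer = analogy(toy_vectors, 'man', 'woman', 'son')
print(answer)
```

Here vec('woman') - vec('man') isolates the "female" axis, and adding it to vec('son') lands exactly on vec('daughter'); with real embeddings the match is only approximate.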

#### Semantic Analogy

In [29]:
get_top_k_by_analogy(vocab, 1, 'man', 'woman', 'son')

['daughter']

Let us verify the cosine similarity between vec('son') + vec('woman') - vec('man') and vec('daughter').

In [30]:
def cos_sim_word_analogy(vocab, word1, word2, word3, word4):
    words = [word1, word2, word3, word4]
    vecs = vocab.embedding[words]
    return cos_sim(vecs[1] - vecs[0] + vecs[2], vecs[3])

cos_sim_word_analogy(vocab, 'man', 'woman', 'son', 'daughter')


[0.9658343]
<NDArray 1 @cpu(0)>

In [31]:
get_top_k_by_analogy(vocab, 3, 'argentina', 'messi', 'france')

['anelka', 'ribery', 'zidane']

In [32]:
get_top_k_by_analogy(vocab, 1, 'argentina', 'football', 'india')

['cricket']

In [33]:
get_top_k_by_analogy(vocab, 1, 'france', 'crepes', 'argentina')

['quesadillas']

#### Syntactic Analogy

In [34]:
get_top_k_by_analogy(vocab, 1, 'bad', 'worst', 'big')

['biggest']

In [35]:
get_top_k_by_analogy(vocab, 1, 'do', 'did', 'go')

['went']

## Applications

Pre-trained word embeddings are commonly used in tasks such as:

- Language Modelling
- Neural Machine Translation
- Text classification