Install (or update) the following libraries on your SageMaker
instance.

In [None]:
!pip install -U mxnet-cu101mkl==1.6.0  # updating mxnet to at least v1.6
!pip install ..  # installing d2l


# Finding Synonyms and Analogies

:label:`sec_synonyms`


In :numref:`sec_word2vec_gluon` we trained a word2vec word embedding model
on a small-scale dataset and searched for synonyms using the cosine similarity
of word vectors. In practice, word vectors pre-trained on a large-scale corpus
can often be applied to downstream natural language processing tasks. This
section will demonstrate how to use these pre-trained word vectors to find
synonyms and analogies. We will continue to apply pre-trained word vectors in
subsequent sections.

## Using Pre-Trained Word Vectors

MXNet's `contrib.text` package provides functions and classes related to natural
language processing (see the [GluonNLP](https://gluon-nlp.mxnet.io/) tool package for more details). Here,
we instead load pre-trained word embeddings with a small class of our own, so we first import the packages it needs.

In [1]:
import d2l
from mxnet import np, npx
import os

npx.set_np()

Several pre-trained models are available for each word embedding. The models may differ in word vector dimension or in the dataset used for pre-training. Below, we register the download locations of three GloVe models and one fastText model.

In [2]:
# Saved in the d2l package for later use
d2l.DATA_HUB['GloVe.6B.50d'] = ('http://www.seal.ac.cn/glove.6B.50d.zip',
                       '0b8703943ccdb6eb788e6f091b8946e82231bc4d')

# Saved in the d2l package for later use
d2l.DATA_HUB['GloVe.6B.100d'] = ('http://www.seal.ac.cn/glove.6B.100d.zip',
                       'cd43bfb07e44e6f27cbcc7bc9ae3d80284fdaf5a')

# Saved in the d2l package for later use
d2l.DATA_HUB['GloVe.42B.300d'] = ('http://www.seal.ac.cn/glove.42B.300d.zip',
                       '99af83e02ad44850374880549768d89b66c1e0d1')

# Saved in the d2l package for later use
d2l.DATA_HUB['fastText.crawl'] = ('http://www.seal.ac.cn/crawl-300d-2M.zip',
                       '9898cc74f433d4da01cd04942aef57afc7710b7c')
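
Each entry above pairs a download URL with a 40-character hexadecimal string. Assuming this string is the SHA-1 checksum of the archive (as its length suggests), the following minimal sketch shows how such a check can work. The helper `sha1_matches` and the payload are made up for illustration and are not part of `d2l`.

```python
import hashlib

def sha1_matches(data, expected_hex):
    """Hypothetical helper: compare the SHA-1 digest of bytes to a hex string."""
    return hashlib.sha1(data).hexdigest() == expected_hex

# Made-up payload standing in for a downloaded archive
payload = b'toy archive contents'
registered = hashlib.sha1(payload).hexdigest()

print(len(registered))                           # 40
print(sha1_matches(payload, registered))         # True
print(sha1_matches(payload + b'!', registered))  # False
```

A mismatch of this kind would indicate a corrupted or outdated download.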

In [3]:
# Saved in the d2l package for later use
class Embedding:
    """Pre-trained word embedding loaded from a text file."""
    def __init__(self, embedding_name):
        self.idx_to_token, self.idx_to_vec = self._load_embedding(
            embedding_name)
        # Index 0 is reserved for the unknown token '<unk>'
        self.unknown_idx = 0
        self.token_to_idx = {token: idx for idx, token in
                             enumerate(self.idx_to_token)}

    def _load_embedding(self, embedding_name):
        idx_to_token, idx_to_vec = [], []
        data_dir = d2l.download_extract(embedding_name)
        # Each line of 'vec.txt' holds a token followed by its vector elements
        with open(os.path.join(data_dir, 'vec.txt'), 'r',
                  encoding='utf-8') as f:
            for line in f:
                elems = line.rstrip().split(' ')
                token, elems = elems[0], [float(i) for i in elems[1:]]
                # Skip header information, such as the top row in fastText
                if len(elems) > 1:
                    idx_to_token.append(token)
                    idx_to_vec.append(elems)
        # Prepend the unknown token, represented by the all-zero vector
        idx_to_token = ['<unk>'] + idx_to_token
        idx_to_vec = [[0] * len(idx_to_vec[0])] + idx_to_vec
        return idx_to_token, np.array(idx_to_vec)

    def __getitem__(self, tokens):
        # `tokens` is a list of strings; unknown tokens map to index 0
        indices = [self.token_to_idx.get(token, self.unknown_idx)
                   for token in tokens]
        vecs = self.idx_to_vec[np.array(indices)]
        return vecs

    def __len__(self):
        return len(self.idx_to_token)
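
To make the file format concrete, the following minimal sketch parses a tiny in-memory stand-in for `vec.txt`, using plain Python lists instead of MXNet arrays so that it runs without any dependencies. The tokens and values are made up, and `load_toy_embedding` is a hypothetical helper that is not part of `d2l`.

```python
import io

# Made-up stand-in for 'vec.txt': one token per line, followed by its vector
toy_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

def load_toy_embedding(f):
    """Hypothetical helper mirroring the parsing steps of `_load_embedding`."""
    idx_to_token, idx_to_vec = [], []
    for line in f:
        elems = line.rstrip().split(' ')
        idx_to_token.append(elems[0])
        idx_to_vec.append([float(v) for v in elems[1:]])
    # Index 0 is reserved for the unknown token, mapped to the zero vector
    idx_to_token = ['<unk>'] + idx_to_token
    idx_to_vec = [[0.0] * len(idx_to_vec[0])] + idx_to_vec
    return idx_to_token, idx_to_vec

idx_to_token, idx_to_vec = load_toy_embedding(toy_file)
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}

# A token missing from the vocabulary falls back to index 0
dog_idx = token_to_idx.get('dog', 0)
print(idx_to_token)         # ['<unk>', 'the', 'cat']
print(idx_to_vec[dog_idx])  # [0.0, 0.0, 0.0]
```

Reserving index 0 for `<unk>` lets any out-of-vocabulary word be looked up without raising an error, at the cost of returning an uninformative zero vector.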

Pre-trained GloVe models are generally named "model.(dataset.)number of words in dataset.word vector dimension.txt". For more information, please refer to the GloVe and fastText project sites [2, 3]. Below, we use 50-dimensional GloVe word vectors pre-trained on a Wikipedia subset. The corresponding word vectors are downloaded automatically the first time we create an instance of the pre-trained embedding.

In [4]:
glove_6b50d = Embedding('GloVe.6B.50d')

Downloading ../data/glove.6B.50d.zip from http://www.seal.ac.cn/glove.6B.50d.zip...


Print the dictionary size. The dictionary contains 400,000 words and a special unknown token.

In [5]:
len(glove_6b50d)

400001

We can use a word to get its index in the dictionary, or we can get the word from its index.

In [6]:
glove_6b50d.token_to_idx['beautiful'], glove_6b50d.idx_to_token[3367]

(3367, 'beautiful')

## Applying Pre-Trained Word Vectors

Below, we demonstrate the application of pre-trained word vectors, using GloVe as an example.

### Finding Synonyms

Here, we re-implement the algorithm used to search for synonyms by cosine
similarity introduced in :numref:`sec_word2vec`.

To reuse the logic for finding the $k$ nearest neighbors when
searching for analogies, we encapsulate this part of the logic separately in the `knn`
($k$-nearest neighbors) function.
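
For reference, the cosine similarity that the function below computes between the query vector $\mathbf{x}$ and the $i$-th row $\mathbf{w}_i$ of the embedding matrix $\mathbf{W}$ can be written as follows. The symbols $\mathbf{w}_i$ and $\epsilon$ are notation introduced here, with $\epsilon = 10^{-9}$ being the stability constant that keeps the first factor of the denominator nonzero for the all-zero row of the unknown token.

$$\cos(\mathbf{w}_i, \mathbf{x}) = \frac{\mathbf{w}_i^\top \mathbf{x}}{\sqrt{\|\mathbf{w}_i\|^2 + \epsilon}\,\|\mathbf{x}\|}.$$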

In [7]:
def knn(W, x, k):
    # The added 1e-9 is for numerical stability
    cos = np.dot(W, x.reshape(-1,)) / (
        np.sqrt(np.sum(W * W, axis=1) + 1e-9) * np.sqrt((x * x).sum()))
    topk = npx.topk(cos, k=k, ret_typ='indices')
    return topk, [cos[int(i)] for i in topk]
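
As a rough illustration of what `knn` returns, here is a dependency-free sketch on made-up 2-D vectors. It uses plain Python lists and `sorted` in place of MXNet arrays and `npx.topk`, and the names `cosine` and `knn_toy` are hypothetical.

```python
import math

def cosine(u, v, eps=1e-9):
    """Cosine similarity of two equal-length lists; eps mirrors the 1e-9 above."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u) + eps)
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_toy(W, x, k):
    """Return the indices of the k rows of W most similar to x, with scores."""
    cos = [cosine(w, x) for w in W]
    topk = sorted(range(len(W)), key=lambda i: cos[i], reverse=True)[:k]
    return topk, [cos[i] for i in topk]

# Made-up 2-D vectors; row 0 plays the role of the all-zero unknown token
W = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
topk, cos = knn_toy(W, W[1], k=2)
print(topk)  # [1, 2]: the query's own row ranks first
```

Because a vector is always most similar to itself, a query taken from the vocabulary comes back as its own nearest neighbor, which is why a synonym search has to request one extra neighbor and discard the first result.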

Then, we search for synonyms using the pre-trained word vector instance `embed`.

In [8]:
def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec,
                    embed[[query_token]], k+1)
    for i, c in zip(topk[1:], cos[1:]):  # Remove input words
        print('cosine sim=%.3f: %s' % (c, (embed.idx_to_token[int(i)])))

The dictionary of the pre-trained word vector instance `glove_6b50d` created above contains 400,000 words and a special unknown token. Excluding the input word and the unknown token, we search for the three words that are most similar in meaning to "chip".

In [9]:
get_similar_tokens('chip', 3, glove_6b50d)

cosine sim=0.856: chips
cosine sim=0.749: intel
cosine sim=0.749: electronics


Next, we search for the synonyms of "baby" and "beautiful".

In [10]:
get_similar_tokens('baby', 3, glove_6b50d)

cosine sim=0.839: babies
cosine sim=0.800: boy
cosine sim=0.792: girl


In [11]:
get_similar_tokens('beautiful', 3, glove_6b50d)

cosine sim=0.921: lovely
cosine sim=0.893: gorgeous
cosine sim=0.830: wonderful


### Finding Analogies

In addition to finding synonyms, we can also use pre-trained word vectors to find analogies between words. For example, "man":"woman"::"son":"daughter" is an analogy: "man" is to "woman" as "son" is to "daughter". The analogy problem can be defined as follows: for four words in the analogical relationship $a : b :: c : d$, given the first three words $a$, $b$, and $c$, we want to find $d$. Let $\text{vec}(w)$ denote the word vector of the word $w$. To solve the analogy problem, we find the word whose vector is most similar to the result vector $\text{vec}(c)+\text{vec}(b)-\text{vec}(a)$.

In [12]:
def get_analogy(token_a, token_b, token_c, embed):
    vecs = embed[[token_a, token_b, token_c]]
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.idx_to_vec, x, 1)
    return embed.idx_to_token[int(topk[0])]  # Return the most similar token
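
The vector arithmetic can be checked by hand on a toy vocabulary. In the sketch below, the 2-D vectors are made up so that one axis encodes gender and the other encodes generation, and `toy_analogy` is a hypothetical helper that uses plain Python lists instead of MXNet arrays.

```python
import math

# Made-up 2-D "word vectors": axis 0 encodes gender, axis 1 encodes generation
vocab = {
    'man': [1.0, 1.0], 'woman': [-1.0, 1.0],
    'son': [1.0, -1.0], 'daughter': [-1.0, -1.0],
}

def cosine(u, v):
    """Cosine similarity of two equal-length, nonzero lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def toy_analogy(a, b, c):
    """Hypothetical helper: nearest token to vec(c) + vec(b) - vec(a)."""
    x = [vc + vb - va for va, vb, vc in zip(vocab[a], vocab[b], vocab[c])]
    return max(vocab, key=lambda token: cosine(vocab[token], x))

print(toy_analogy('man', 'woman', 'son'))  # daughter
```

Here $\text{vec}(\text{son})+\text{vec}(\text{woman})-\text{vec}(\text{man}) = (-1, -1)$, which coincides exactly with the toy vector for "daughter". Like `get_analogy`, the sketch does not exclude the three input words from the candidates.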

Verify the "male-female" analogy.

In [13]:
get_analogy('man', 'woman', 'son', glove_6b50d)

'daughter'

"Capital-country" analogy: "beijing" is to "china" as "tokyo" is to what? The answer should be "japan".

In [14]:
get_analogy('beijing', 'china', 'tokyo', glove_6b50d)

'japan'

"Adjective-superlative adjective" analogy: "bad" is to "worst" as "big" is to what? The answer should be "biggest".

In [15]:
get_analogy('bad', 'worst', 'big', glove_6b50d)

'biggest'

"Present tense verb-past tense verb" analogy: "do" is to "did" as "go" is to what? The answer should be "went".

In [16]:
get_analogy('do', 'did', 'go', glove_6b50d)

'went'

## Summary

* Word vectors pre-trained on a large-scale corpus can often be applied to downstream natural language processing tasks.
* We can use pre-trained word vectors to seek synonyms and analogies.


## Exercises

1. Test the fastText results.
1. If the dictionary is extremely large, how can we accelerate finding synonyms and analogies?


## [Discussions](https://discuss.mxnet.io/t/2390)

![](../img/qr_similarity-analogy.svg)