# Gluon-NLP

1. Pre-trained word embeddings
2. Pre-trained language models
3. Fine-tuning BERT 

http://gluon-nlp.mxnet.io/

## Pre-trained word embeddings

Here we introduce how to use pre-trained word embeddings, where each word is represented by a vector. Two popular word embeddings are GloVe and fastText. The pre-trained GloVe and fastText embeddings used here come from the following sources:

* GloVe project website: https://nlp.stanford.edu/projects/glove/
* fastText project website: https://fasttext.cc/

Let us first import the following packages used in this example.

![](https://cdn-images-1.medium.com/max/1600/1*2r1yj0zPAuaSGZeQfG6Wtw.png)

In [None]:
import mxnet as mx
from mxnet import gluon, nd
import gluonnlp as nlp
import re

We pick a specific pre-trained embedding, here the 50-dimensional GloVe vectors (`glove.6B.50d`).

In [None]:
embedding = nlp.embedding.create('glove', source='glove.6B.50d')

In [None]:
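# Build a vocabulary from the tokens known to the embedding.
vocab = nlp.Vocab(nlp.data.Counter(embedding.idx_to_token))
# Attach the pre-trained vectors to the vocabulary.
vocab.set_embedding(embedding)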

The size of `vocab`, which includes special tokens such as the unknown token, is shown below.

In [None]:
len(vocab.idx_to_token)

We can access attributes of `vocab`.

In [None]:
print(vocab['beautiful'])
print(vocab.idx_to_token[71424])
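
To make the two lookups above concrete, here is a minimal sketch of a token/index mapping in plain Python. `ToyVocab` and its contents are hypothetical and written only for illustration; this is not how `nlp.Vocab` is implemented, and it reserves just one special token.

```python
class ToyVocab:
    """Hypothetical two-way token/index mapping with a reserved unknown token."""

    def __init__(self, tokens, unknown_token='<unk>'):
        # Index 0 is reserved for the unknown token (an assumption of this sketch).
        self.idx_to_token = [unknown_token] + list(tokens)
        self.token_to_idx = {tok: i for i, tok in enumerate(self.idx_to_token)}

    def __getitem__(self, token):
        # Tokens that are not in the vocabulary map to the unknown index.
        return self.token_to_idx.get(token, 0)

    def __len__(self):
        return len(self.idx_to_token)


toy_vocab = ToyVocab(['the', 'baby', 'beautiful'])
print(len(toy_vocab))             # size, counting the unknown token
print(toy_vocab['beautiful'])     # token -> index
print(toy_vocab.idx_to_token[3])  # index -> token
print(toy_vocab['zebra'])         # out-of-vocabulary word -> unknown index
```

Indexing by token and indexing `idx_to_token` by position are inverse operations for every token that is in the vocabulary.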

![](support/cosinesimilarity.png)

In [None]:
def cos_sim(x, y):
    return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))
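
As a sanity check on the formula, the same computation can be sketched without MXNet. The function name `cos_sim_py` and the three vectors below are made up for illustration and are not real embeddings.

```python
import math


def cos_sim_py(x, y):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Toy vectors chosen so the results are easy to check by hand.
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u, twice as long
w = [2.0, -1.0, 0.0]  # orthogonal to u

print(cos_sim_py(u, v))
print(cos_sim_py(u, w))
```

Vectors that point in the same direction score 1 regardless of their length, and orthogonal vectors score 0, which is why cosine similarity is a convenient measure for comparing word vectors.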

### Word Similarity

Given an input word, we can find the nearest $k$ words from the vocabulary (400,000 words excluding the unknown token) by similarity. The similarity between any pair of words can be represented by the cosine similarity of their vectors.

In [None]:
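def norm_vecs_by_row(x):
    # Divide each row by its L2 norm so that every row has unit length.
    return x / nd.sqrt(nd.sum(x * x, axis=1)).reshape((-1,1))

def get_knn(vocab, k, word):
    word_vec = vocab.embedding[word].reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    # Skip the first 4 rows (reserved for special tokens such as the unknown token).
    dot_prod = nd.dot(vocab_vecs[4:], word_vec)
    # Ask for k+1 hits because the best match is the input word itself.
    indices = nd.topk(dot_prod.squeeze(), k=k+1, ret_typ='indices')
    # Add 4 to map positions in the sliced matrix back to vocabulary indices.
    indices = [int(i.asscalar())+4 for i in indices]
    # Drop the first hit, which is the input word.
    return vocab.to_tokens(indices[1:])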
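
The search above relies on the fact that the dot product of unit-length vectors equals their cosine similarity. A minimal plain-Python sketch of the same idea follows; `toy_knn`, `normalize`, and the four 3-dimensional vectors are hypothetical and exist only to make the ranking easy to verify by hand.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(a * a for a in vec))
    return [a / length for a in vec]


def toy_knn(table, k, word):
    """Return the k tokens in `table` most similar to `word` by cosine similarity."""
    query = normalize(table[word])
    scores = []
    for token, vec in table.items():
        if token == word:
            continue  # skip the query word itself
        unit = normalize(vec)
        # Dot product of two unit vectors is their cosine similarity.
        scores.append((sum(a * b for a, b in zip(query, unit)), token))
    scores.sort(reverse=True)
    return [token for _, token in scores[:k]]


# Made-up vectors, not real GloVe embeddings.
toy_table = {
    'baby':   [1.0, 0.9, 0.0],
    'babies': [0.9, 1.0, 0.1],
    'child':  [0.7, 0.6, 0.3],
    'car':    [0.0, 0.1, 1.0],
}
print(toy_knn(toy_table, 2, 'baby'))
```

Note that `get_knn` normalizes only the vocabulary rows, not the query vector. Dividing by the query norm would scale every score by the same positive constant, so the ranking is unchanged.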

Let us find the 5 words most similar to 'baby' in the vocabulary (size: 400,000 words).

In [None]:
get_knn(vocab, 5, 'baby')

We can verify the cosine similarity of vectors of 'baby' and 'babies'.

In [None]:
cos_sim(vocab.embedding['baby'], vocab.embedding['babies'])

Let us find the 5 words most similar to 'research' in the vocabulary.

In [None]:
get_knn(vocab, 5, 'research')

Let us find the 5 words most similar to 'computer' in the vocabulary.

In [None]:
get_knn(vocab, 5, 'computer')

**Challenge**

Try out the `get_knn` function with a word of your own.

### Word Analogy

We can also apply pre-trained word embeddings to the word analogy problem. For instance, "man : woman :: son : daughter" is an analogy. The word analogy completion problem is defined as follows: for the analogy 'a : b :: c : d', given the first three words 'a', 'b', 'c', find 'd'. The idea is to find the word whose vector is most similar to vec('c') + (vec('b') - vec('a')).

In this example, we will find words by analogy from the 400,000 indexed words in `vocab`.

In [None]:
def get_top_k_by_analogy(vocab, k, word1, word2, word3):
    word_vecs = vocab.embedding[word1, word2, word3]
    # Target vector: vec(word3) + (vec(word2) - vec(word1)).
    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2])
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    # Skip the first 4 rows, as in get_knn.
    dot_prod = nd.dot(vocab_vecs[4:], word_diff.squeeze()).squeeze()
    # Ask for k+1 hits so that k remain after word3 is filtered out.
    indices = dot_prod.topk(k=k+1, ret_typ='indices')
    indices = [int(i.asscalar())+4 for i in indices]
    words = [w for w in vocab.to_tokens(indices) if w != word3]
    return words[:k]
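
The vector arithmetic behind this function can be checked on a toy example in plain Python. Everything below is hypothetical: `toy_analogy` and `cosine` are illustrative names, and the 3-dimensional vectors are hand-made so that the first coordinate loosely encodes "female" and the second "offspring". Real embeddings have no such clean axes.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def toy_analogy(table, a, b, c):
    """Solve a : b :: c : ? by ranking tokens against vec(c) + (vec(b) - vec(a))."""
    target = [vb - va + vc for va, vb, vc in zip(table[a], table[b], table[c])]
    # As in the function above, only the third word is excluded from the candidates.
    candidates = [tok for tok in table if tok != c]
    return max(candidates, key=lambda tok: cosine(table[tok], target))


# Made-up vectors, not real GloVe embeddings.
toy_vectors = {
    'man':      [0.0, 0.0, 1.0],
    'woman':    [1.0, 0.0, 1.0],
    'son':      [0.0, 1.0, 1.0],
    'daughter': [1.0, 1.0, 1.0],
    'niece':    [1.0, 0.5, 1.0],
    'car':      [0.2, 0.1, -1.0],
}
print(toy_analogy(toy_vectors, 'man', 'woman', 'son'))
```

Here the target vector is exactly the vector of 'daughter', so it wins with a cosine similarity of 1. With real embeddings the match is only approximate, which is why the result is chosen by ranking rather than by equality.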

### Semantic Analogy

![analogy](https://user-images.githubusercontent.com/3716307/53924875-a1497880-4032-11e9-847c-2d826d0ee0ee.png)


In [None]:
get_top_k_by_analogy(vocab, 1, 'man', 'woman', 'son')

Let us verify the cosine similarity between vec('son') + vec('woman') - vec('man') and vec('daughter').

In [None]:
def cos_sim_word_analogy(vocab, word1, word2, word3, word4):
    words = [word1, word2, word3, word4]
    vecs = vocab.embedding[words]
    return cos_sim(vecs[1] - vecs[0] + vecs[2], vecs[3])

cos_sim_word_analogy(vocab, 'man', 'woman', 'son', 'daughter')

In [None]:
get_top_k_by_analogy(vocab, 1, 'celtics', 'nba', 'patriots')

In [None]:
get_top_k_by_analogy(vocab, 1, 'france', 'football', 'india')

In [None]:
get_top_k_by_analogy(vocab, 1, 'wine', 'red', 'sky')

In [None]:
get_top_k_by_analogy(vocab, 1, 'russia', 'moscow', 'france')

### Syntactic Analogy

In [None]:
get_top_k_by_analogy(vocab, 1, 'bad', 'worst', 'big')

In [None]:
get_top_k_by_analogy(vocab, 1, 'do', 'did', 'go')

**Challenge**

Write one semantic and one syntactic analogy using `get_top_k_by_analogy`.