## Advanced Word Embedding


In the last notebook, I trained a Word2Vec word embedding model on a small-scale data set and searched for synonyms based on the cosine similarity of the word vectors. 

Word2Vec successfully converts discrete words into continuous word vectors and preserves approximate relationships between words, but the model is not perfect and can be improved in several ways:


1.  **Subword Embedding**: FastText represents each word as a collection of character-level subwords, using **fixed-size n-grams**, while BPE (Byte Pair Encoding) automatically and dynamically builds a set of high-frequency subwords from the statistics of the corpus.


2.  [GloVe (Global Vectors for Word Representation)](https://nlp.stanford.edu/pubs/glove.pdf): By rewriting the conditional probability formula of the Word2Vec model, we obtain a loss expressed in terms of global corpus statistics, and then modify that loss to get a better model.
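To make the first point concrete, here is a minimal sketch of splitting a word into character n-grams in the FastText style. The function name `char_ngrams` and its defaults are illustrative choices, not part of any library; the boundary markers `<` and `>` and the 3 to 6 range follow the convention described in the FastText paper.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, with '<' and '>' as boundary markers."""
    token = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the marked-up token.
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

A word vector can then be built by combining the vectors of its n-grams, which lets rare or unseen words share parameters with words that have similar spelling.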

In practice, we often train these word embedding models on large-scale corpora, and apply the pre-trained word vectors to downstream NLP tasks. This notebook will use the **GloVe model** as an example to demonstrate how to use pre-trained word vectors to find synonyms and analogies.



### [GloVe Model](https://nlp.stanford.edu/pubs/glove.pdf)

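Let's first look at the loss function for Word2Vec (Skip-Gram, without negative sampling):

$$-\sum_{t=1}^T\sum_{-m\le j\le m,j\ne 0} \log P(w^{(t+j)}\mid w^{(t)})$$

where $T$ is the length of the text sequence, $m$ is the context window size, and the conditional probability is given by a softmax:

$$P(w_j\mid w_i) = \frac{\exp(\boldsymbol{u}_j^\top\boldsymbol{v}_i)}{\sum_{k\in\mathcal{V}}\exp(\boldsymbol{u}_k^\top\boldsymbol{v}_i)}$$

Here $w_i$ is the center word, $w_j$ is the context word, $\boldsymbol{v}_i$ and $\boldsymbol{u}_j$ are their center and context vectors, and we abbreviate this probability as $q_{ij}$.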
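As a small illustration of the softmax above, the following sketch computes $q_{ij}$ for every candidate context word of one center word. The vectors are made-up toy values and `skipgram_prob` is a hypothetical helper, written in plain Python so the arithmetic is easy to follow.

```python
import math


def skipgram_prob(center_vec, context_vecs):
    """Softmax over the dot products u_k^T v_i for every candidate context word k."""
    scores = [sum(u * v for u, v in zip(u_k, center_vec)) for u_k in context_vecs]
    max_score = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-word vocabulary with 2-dimensional vectors (values are made up).
v_i = [1.0, 0.0]                             # center vector of w_i
U = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]    # context vectors u_k
q_i = skipgram_prob(v_i, U)
print(q_i)
```

The probabilities sum to one, and the context word whose vector has the largest dot product with $\boldsymbol{v}_i$ receives the highest probability.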


Note that the loss function contains two summations, which enumerate each center word in the corpus and each of its context words. Equivalently, we can group identical (center word, context word) pairs and enumerate every pair of words in the vocabulary instead:

$$-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{V}} x_{ij}\log q_{ij}$$


where $x_{ij}$ is the total number of times $w_j$ appears as a context word of $w_i$ in the entire dataset.
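The counts $x_{ij}$ can be gathered in a single pass over the corpus. The sketch below assumes a symmetric context window and a pre-tokenized sentence; `cooccurrence_counts` is an illustrative name, not a library function.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count x[(center, context)] over all positions, using a symmetric window."""
    x = Counter()
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for pos in range(lo, hi):
            if pos != t:  # a word is not its own context
                x[(center, tokens[pos])] += 1
    return x


tokens = "the cat sat on the mat".split()
x = cooccurrence_counts(tokens, window=1)
print(x[("the", "cat")], x[("the", "on")], x[("the", "mat")])
```

Both occurrences of "the" contribute to the same row of counts, which is exactly the grouping that turns the per-position sum into a sum over vocabulary pairs.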

We can then rewrite the formula into the form of cross-entropy as follows:


$$-\sum_{i\in\mathcal{V}}x_i\sum_{j\in\mathcal{V}}p_{ij} \log q_{ij}$$


where $x_i=\sum_{j\in\mathcal{V}}x_{ij}$ is the total number of context words of $w_i$ over all of its context windows, and $p_{ij}=x_{ij}/x_i$ is the proportion of those context words that are $w_j$.


This form is easy to interpret: the model's prediction $q_{ij}$ of how likely $w_j$ is to be a context word of $w_i$ is compared against the distribution $p_{ij}$ actually observed in the corpus, which plays the role of the ground-truth label. In addition, each center word is weighted in the loss according to its count $x_i$.
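A quick numerical check can confirm that the count-weighted form and the cross-entropy form are the same quantity. The counts `x` and model probabilities `q` below are arbitrary toy values chosen only for this check.

```python
import math

# Hypothetical co-occurrence counts x[i][j] and model probabilities q[i][j].
x = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]
q = [[0.2, 0.5, 0.3],
     [0.6, 0.1, 0.3],
     [0.3, 0.5, 0.2]]

# Form 1: -sum_i sum_j x_ij * log q_ij
loss_counts = -sum(x[i][j] * math.log(q[i][j]) for i in range(3) for j in range(3))

# Form 2: -sum_i x_i * sum_j p_ij * log q_ij, with x_i = sum_j x_ij and p_ij = x_ij / x_i
loss_xent = 0.0
for i in range(3):
    x_i = sum(x[i])
    p_i = [x_ij / x_i for x_ij in x[i]]
    loss_xent -= x_i * sum(p * math.log(q_ij) for p, q_ij in zip(p_i, q[i]))

print(loss_counts, loss_xent)
```

The two values agree up to floating-point rounding, since the second form only factors $x_i$ out of each row.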



So far we have only rewritten Skip-Gram's loss function, without making any substantive change to the model. GloVe makes the following changes to this form:

1. Use the unnormalized quantities $p'_{ij}=x_{ij}$ and $q'_{ij}=\exp(\boldsymbol{u}^\top_j\boldsymbol{v}_i)$ in place of the probability distributions, and take their logarithms,

2. Add two scalar model parameters for each word $w_i$: a center bias $b_i$ and a context bias $c_i$, which relaxes the normalization required by the probability definition,

3. Replace the weight $x_i$ of each loss term with $h(x_{ij})$, where the weight function $h(x)$ is monotonically increasing with values in $[0,1]$, which relaxes the implicit assumption that the importance of a term grows linearly with $x_i$,

4. Use the squared loss instead of the cross-entropy loss.

This gives the loss function for GloVe:


$$\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{V}} h(x_{ij}) (\boldsymbol{u}^\top_j\boldsymbol{v}_i+b_i+c_j-\log x_{ij})^2$$

Because the non-zero $x_{ij}$ are computed in advance from the entire dataset and therefore carry global statistics, the model is named "Global Vectors".
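The loss above can be sketched in a few lines of plain Python. The weight function used here, $h(x)=(x/c)^{\alpha}$ for $x<c$ and $1$ otherwise with $c=100$ and $\alpha=0.75$, is the choice reported in the GloVe paper; the function names and the dictionary layout of `cooc` are assumptions made for this illustration, not the reference implementation.

```python
import math


def glove_weight(x, c=100.0, alpha=0.75):
    """Weight h(x): grows as (x / c) ** alpha below the cutoff c, then saturates at 1."""
    return (x / c) ** alpha if x < c else 1.0


def glove_loss(cooc, v, u, b, c_bias):
    """Sum of h(x_ij) * (u_j . v_i + b_i + c_j - log x_ij)^2 over the non-zero counts.

    cooc maps (i, j) index pairs to counts x_ij > 0; pairs with a zero count
    are simply absent, which matches h(0) = 0.
    """
    total = 0.0
    for (i, j), x_ij in cooc.items():
        dot = sum(uj * vi for uj, vi in zip(u[j], v[i]))
        err = dot + b[i] + c_bias[j] - math.log(x_ij)
        total += glove_weight(x_ij) * err ** 2
    return total


# Toy check: with all parameters at zero, a single pair with x_ij = 100
# contributes h(100) * (0 - log 100)^2 = (log 100)^2.
zeros = [[0.0, 0.0], [0.0, 0.0]]
loss = glove_loss({(0, 1): 100.0}, zeros, zeros, [0.0, 0.0], [0.0, 0.0])
print(loss, math.log(100.0) ** 2)
```

Because $h$ saturates at 1, very frequent pairs do not dominate the loss, while rare pairs are down-weighted rather than ignored.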


### Load pre-trained GloVe vectors

[GloVe](https://nlp.stanford.edu/projects/glove/) provides a variety of pre-trained word vectors. The training corpora are drawn from Wikipedia, CommonCrawl, and Twitter, with the total number of tokens in a corpus ranging from 6 billion to 840 billion. Vectors are also available in several dimensions.

[`torchtext.vocab`](https://torchtext.readthedocs.io/en/latest/vocab.html) already supports GloVe, FastText, CharNGram and other commonly used pre-trained word vectors. We can load pre-trained GloVe word vectors by calling [`torchtext.vocab.GloVe`](https://torchtext.readthedocs.io/en/latest/vocab.html#glove).

In [1]:
# load torchtext module
import torch
import torchtext.vocab as vocab

print([key for key in vocab.pretrained_aliases.keys() if "glove" in key])
# e.g. 'glove.42B.300d' was trained on a 42-billion-token corpus and has 300-dim vectors
cache_dir = "/home/kesci/input/GloVe6B5429"
glove = vocab.GloVe(name='6B', dim=50, cache=cache_dir) # load the 6B-token, 50-dim GloVe vectors from the cache

print("Contain %d words in total." % len(glove.stoi)) 
# bidirectional mapping: string to idx & idx to string
print(glove.stoi['beautiful'], glove.itos[3366]) 


['glove.42B.300d', 'glove.840B.300d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.6B.50d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d']



Contain 400000 words in total.
3366 beautiful


## Find Synonyms and Analogies

### Find Synonyms

Since the cosine similarity of word vectors can measure the similarity of words' meaning, we can find the synonyms of a word by finding its K nearest neighbors in the vector space. 
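For reference, the cosine similarity of two word vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ is the standard

$$\cos(\boldsymbol{x},\boldsymbol{y}) = \frac{\boldsymbol{x}^\top\boldsymbol{y}}{\|\boldsymbol{x}\|\,\|\boldsymbol{y}\|} \in [-1, 1]$$

A value close to 1 means the two vectors point in nearly the same direction, regardless of their lengths. The `knn` function below evaluates this quantity for every row of the embedding matrix at once.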

In [23]:
def knn(W, x, k):
    '''
    @params:
        W: all vectors in the space, shape (n, d)
        x: the query vector, shape (d,)
        k: number of neighbors
    @outputs:
        topk: indices of the k vectors with the highest cosine similarity
        cos: the corresponding cosine similarities
    '''
    cos = torch.matmul(W, x.view((-1,))) / (  # flatten x to 1-D before taking dot products
        (torch.sum(W * W, dim=1) + 1e-9).sqrt() * torch.sum(x * x).sqrt())  # 1e-9 avoids division by zero
    _, topk = torch.topk(cos, k=k)
    topk = topk.cpu().numpy()
    return topk, [cos[i].item() for i in topk]

def get_similar_tokens(query_token, k, embed):
    '''
    @params:
        query_token: the input word 
        k: number of synonyms
        embed: pre-trained word vectors
    '''
    topk, cos = knn(embed.vectors,
                    embed.vectors[embed.stoi[query_token]], k+1) # k+1 to include input token itself
    for i, c in zip(topk[1:], cos[1:]):  # remove the input token
        print('cosine sim=%.3f: %s' % (c, (embed.itos[i])))

get_similar_tokens('beautiful', 4, glove)

cosine sim=0.921: lovely
cosine sim=0.893: gorgeous
cosine sim=0.830: wonderful
cosine sim=0.825: charming


In [24]:
get_similar_tokens('computer', 5, glove)

cosine sim=0.917: computers
cosine sim=0.881: software
cosine sim=0.853: technology
cosine sim=0.813: electronic
cosine sim=0.806: internet


In [25]:
get_similar_tokens('coffee', 5, glove)

cosine sim=0.819: drink
cosine sim=0.818: drinks
cosine sim=0.814: wine
cosine sim=0.808: tea
cosine sim=0.804: beer


In [26]:
get_similar_tokens('sad', 3, glove)

cosine sim=0.810: awful
cosine sim=0.788: sorry
cosine sim=0.772: terrible


### Find Analogies


In addition to finding synonyms, we can also use pre-trained word vectors to complete analogies. For example, "man" is to "woman" as "son" is to "daughter". The analogy problem can be defined as follows: for the 4 words in the analogy relationship ***"$a$ is to $b$ as $c$ is to $d$"***, given the first 3 words $a, b, c$, find $d$.

The idea is to search for the word vector that is most similar to the result vector **$\text{vec}(c)+\text{vec}(b)-\text{vec}(a)$**, where $\text{vec}(w)$ is the word vector of $w$.
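The arithmetic is easiest to see on a tiny hand-made example. The 3-dimensional vectors below are invented for illustration (their axes loosely mean "male", "female", and "young") and `toy_analogy` is a hypothetical helper; real GloVe vectors are not this clean.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical embeddings; the axes loosely mean (male, female, young).
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "son": [1.0, 0.0, 1.0],
    "daughter": [0.0, 1.0, 1.0],
}


def toy_analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    return max(vectors, key=lambda w: cosine(vectors[w], target))


print(toy_analogy("man", "woman", "son", toy))  # daughter
```

Subtracting "man" and adding "woman" flips the gender axis of "son" while leaving the "young" axis untouched, so the target lands exactly on "daughter".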


In [27]:
def get_analogy(token_a, token_b, token_c, embed):
    '''
    @params:
        token_a: word a
        token_b: word b
        token_c: word c
        embed: pre-trained word vectors
    @outputs:
        res: analogy d
    '''
    vecs = [embed.vectors[embed.stoi[t]] 
                for t in [token_a, token_b, token_c]]
    x = vecs[1] - vecs[0] + vecs[2] # vec(b) - vec(a) + vec(c), the estimate of vec(d)
    topk, cos = knn(embed.vectors, x, 1) # reuse the nearest-neighbor search to find d
    res = embed.itos[topk[0]] # map the index back to its string
    return res

get_analogy('man', 'woman', 'son', glove)

'daughter'

In [28]:
get_analogy('beijing', 'china', 'tokyo', glove)

'japan'

In [29]:
get_analogy('bad', 'worst', 'big', glove)

'biggest'

In [30]:
get_analogy('do', 'did', 'go', glove)

'went'

In [31]:
get_analogy('coffee', 'morning', 'beer', glove)

'evening'

In this notebook, we explored how Word2Vec can be improved by more advanced word embedding models, and used pre-trained GloVe vectors to find synonyms and analogies.