<a href="https://colab.research.google.com/github/Shurui-Zhang/Deep_learning/blob/main/Lab7_2_WordEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 2: Word Embeddings

In [None]:
# Execute this code block to install dependencies when running on colab
try:
    import torch
except:
    from os.path import exists
    from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
    platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
    cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
    accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

    !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision

try: 
    import torchbearer
except:
    !pip install torchbearer
    
try:
    import torchtext
except:
    !pip install torchtext
    
try:
    import spacy
except:
    !pip install spacy

try:
    spacy.load('en')
except:
    !python -m spacy download en


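Word embeddings transform a one-hot encoded vector (a vector that is 0 in all elements except one, which is 1) into a much lower-dimensional vector of real numbers. The one-hot encoded vector is a *sparse vector*, whilst the real-valued vector is a *dense vector*.

A dense vector is stored as an ordinary array of doubles, whereas a sparse vector is stored as two parallel arrays, `indices` and `values`. For example, the vector (1.0, 0.0, 1.0, 3.0) is written in dense format as [1.0, 0.0, 1.0, 3.0] and in sparse format as (4, [0, 2, 3], [1.0, 1.0, 3.0]). The leading 4 is the length of the vector (the number of elements), [0, 2, 3] is the `indices` array and [1.0, 1.0, 3.0] is the `values` array. Together they say that position 0 holds 1.0, position 2 holds 1.0, position 3 holds 3.0, and every other position is 0.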
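As a small illustration of the two storage formats, the sketch below converts between a dense list and a `(size, indices, values)` sparse triple. The helper names `to_sparse` and `to_dense` are made up for this example and are not part of any library.

```python
def to_sparse(dense):
    """Convert a dense list into a (size, indices, values) triple."""
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the dense list from a (size, indices, values) triple."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


sparse = to_sparse([1.0, 0.0, 1.0, 3.0])
print(sparse)             # (4, [0, 2, 3], [1.0, 1.0, 3.0])
print(to_dense(*sparse))  # [1.0, 0.0, 1.0, 3.0]

# A one-hot vector is the extreme case: a single index with value 1.0
print(to_sparse([0.0, 0.0, 1.0, 0.0, 0.0]))  # (5, [2], [1.0])
```

For a 400,000-word vocabulary, a one-hot vector in dense form would hold 400,000 numbers, while its sparse form needs only one index and one value.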

The key concept in these word embeddings is that words that appear in similar _contexts_ appear nearby in the vector space, i.e. the Euclidean distance between these two word vectors is small. By context here, we mean the surrounding words. For example in the sentences "I purchased some items at the shop" and "I purchased some items at the store" the words 'shop' and 'store' appear in the same context and thus should be close together in vector space.

We'll talk about some of the well-known algorithms for learning embeddings in the lectures, but you might have already heard of a popular model called *word2vec*, which was first published in a rejected ICLR submission (it has some pretty damning reviews, but also has thousands of citations!). In this lab we'll use pre-trained *GloVe* vectors. *GloVe* is a different algorithm for computing word vectors, although the outcome is similar to *word2vec*. These pre-trained embeddings have been trained on a gigantic corpus. We can use these pre-trained vectors within any of our models, with the idea that as they have already learned the context of each word they will give us a better starting point for our word vectors. This usually leads to faster training time and/or improved accuracy.

In PyTorch, we use word vectors with the `nn.Embedding` layer, which takes a _**[sentence length, batch size]**_ tensor of word indices and transforms it into a _**[sentence length, batch size, embedding dimensions]**_ tensor. `nn.Embedding` layers can be trained from scratch, or they can be initialised (and optionally fixed) with pre-trained embedding data. The key thing to remember about an `nn.Embedding` is that it does not need to explicitly use a one-hot vector representation at any point; it just maps an index to a vector. This is important because it implies massive computational savings. More concretely, an embedding is essentially a linear map in which the weight matrix of the linear layer is multiplied by a one-hot sparse vector to produce a lower-dimensional (dense) output. This is exactly equivalent to just selecting the column of the weight matrix corresponding to the index represented by the sparse vector.
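To see why no one-hot vector is ever needed, the following plain-Python sketch multiplies a small weight matrix by a one-hot vector and compares the result with simply reading out the matching column. The matrix values and the helper `matvec` are invented for illustration. This is not how PyTorch implements `nn.Embedding` internally, and PyTorch stores its embedding table with one row per index rather than one column.

```python
# Weight matrix of a linear map from a 4-word vocabulary to 2 dimensions:
# 2 rows (output dimensions) by 4 columns (one per vocabulary index).
W = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


index = 2
one_hot = [1.0 if i == index else 0.0 for i in range(4)]

via_matmul = matvec(W, one_hot)         # multiplies every entry of W
via_lookup = [row[index] for row in W]  # just reads column 2

print(via_matmul)  # [0.3, 0.7]
print(via_lookup)  # [0.3, 0.7]
```

The multiplication touches every entry of the matrix, whereas the lookup reads only as many values as there are embedding dimensions, which is where the savings come from when the vocabulary is large.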

In this part of the lab we won't be training any models; instead we'll be looking at the word embeddings and investigating a few interesting things we can do with them.

## Loading the GloVe vectors

First, we'll load the pre-trained GloVe vectors. The `name` argument specifies what the vectors have been trained on; here `6B` means a corpus of 6 billion words (tokens). The `dim` argument specifies the dimensionality of the word vectors. The `6B` GloVe vectors are available in 50, 100, 200 and 300 dimensions. There are also `42B` and `840B` GloVe vectors, but they are only available in 300 dimensions. The first time you run this it will take a while, as the vectors need to be downloaded:

In [None]:
import torchtext.vocab

glove = torchtext.vocab.GloVe(name='6B', dim=100)

print(f'There are {len(glove.itos)} words in the vocabulary')


There are 400000 words in the vocabulary


As shown above, there are 400,000 unique words in the GloVe vocabulary. These are the most common words found in the corpus the vectors were trained on. **In this set of GloVe vectors, every single word is lower-case only.**

`glove.vectors` is the actual tensor containing the values of the embeddings.

In [None]:
glove.vectors.shape

torch.Size([400000, 100])

We can see what word is associated with each row by checking the `itos` (int to string) list. 

Row 0 is the vector associated with the word 'the', row 1 with ',' (comma), row 2 with '.' (period), and so on.

In [None]:
glove.itos[:10]  # take the first 10 entries of itos
print(type(glove.itos))

<class 'list'>


We can also use the `stoi` (string to int) dictionary, in which we input a word and receive the associated integer/index. If you try to get the index of a word that is not in the vocabulary, you receive an error (a `KeyError`).

In [None]:
glove.stoi['the']

0
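Because a missing word raises an error, it can be convenient to look words up defensively. The following is a minimal sketch that assumes `glove.stoi` behaves like an ordinary Python dictionary. `toy_stoi` is a tiny hypothetical stand-in for it, and `word_index` is a helper name invented for this example.

```python
# Tiny stand-in for glove.stoi (the real dictionary has 400,000 entries)
toy_stoi = {'the': 0, ',': 1, '.': 2}


def word_index(stoi, word):
    """Return the index of `word`, or None if it is out of vocabulary."""
    # The vocabulary is lower-case only, so normalise the query first
    return stoi.get(word.lower())


print(word_index(toy_stoi, 'The'))      # 0
print(word_index(toy_stoi, 'pytorch'))  # None
```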

We can get the vector of a word by first getting the integer associated with it and then indexing into the word embedding tensor with that index.

In [None]:
glove.vectors[glove.stoi['the']]

tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459,  0.8278,  0.27

We'll be doing this a lot. __Use the following block to create a function that takes in word embeddings and a word and returns the associated vector.__ You should throw an error if the word doesn't exist in the vocabulary:

In [None]:
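def get_vector(embeddings, word):
    # Fail loudly if the word is not in the vocabulary
    if word not in embeddings.stoi:
        raise KeyError(f"'{word}' is not in the vocabulary")
    # Map the word to its row index, then index into the embedding tensor
    return embeddings.vectors[embeddings.stoi[word]]
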

As before, we use a word to get the associated vector.

In [None]:
get_vector(glove, 'the')

tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459,  0.8278,  0.27

## Similar Contexts

Now let's start looking at the context of different words.

If we want to find the words similar to a certain input word, we first find the vector of this input word, then we scan through our vocabulary finding any vectors similar to this input word vector.
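To make the scan concrete, here is a toy sketch on a hypothetical four-word vocabulary with hand-picked two-dimensional vectors. The names `toy_vectors`, `euclidean` and `toy_closest` are made up for illustration, and the numbers are not real GloVe values.

```python
import math

# Hypothetical 2-dimensional "embeddings", chosen by hand for illustration
toy_vectors = {
    'shop':  [1.0, 1.0],
    'store': [1.1, 0.9],
    'river': [-2.0, 0.5],
    'bank':  [0.0, 0.0],
}


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_closest(vectors, query, n=3):
    """Scan the whole vocabulary and return the n nearest (word, distance) pairs."""
    distances = [(w, euclidean(query, v)) for w, v in vectors.items()]
    return sorted(distances, key=lambda pair: pair[1])[:n]


# Nearest neighbours of 'shop', closest first: shop, store, bank
for word, dist in toy_closest(toy_vectors, toy_vectors['shop']):
    print(f'{word}: {dist:.3f}')
```

The query word itself always comes first with distance 0, followed by 'store', which was deliberately placed close to 'shop'.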

The function below returns the closest 10 words to an input word vector:

In [None]:
import torch

def closest_words(embeddings, vector, n=10):
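    # Build a list of (word, distance) tuples by scanning the whole vocabulary
    distances = [(w, torch.dist(vector, get_vector(embeddings, w)).item()) for w in embeddings.itos]
    # Sort by distance and keep the n closest
    return sorted(distances, key=lambda w: w[1])[:n]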

Let's try it out with 'korea'. The closest word is the word 'korea' itself (not very interesting), however all of the words are related in some way. Pyongyang is the capital of North Korea, DPRK is the official name of North Korea, etc.

Interestingly, we also get 'japan' and 'china', which implies that Korea, Japan and China are frequently talked about together in similar contexts. This makes sense as they are geographically situated near each other.

In [None]:
closest_words(glove, get_vector(glove, 'korea'))

[('korea', 0.0),
 ('pyongyang', 3.9039554595947266),
 ('korean', 4.068886756896973),
 ('dprk', 4.2631049156188965),
 ('seoul', 4.340494155883789),
 ('japan', 4.551243782043457),
 ('koreans', 4.6156086921691895),
 ('south', 4.65822696685791),
 ('china', 4.839518070220947),
 ('north', 4.986356735229492)]

Looking at another country, India, we also get nearby countries: Thailand, Malaysia and Sri Lanka (as two separate words). Australia is relatively close to India (geographically), but Thailand and Malaysia are closer. So why is Australia closer to India in vector space? A plausible explanation is that India and Australia appear together in the context of [cricket](https://en.wikipedia.org/wiki/Cricket) matches.

In [None]:
closest_words(glove, get_vector(glove, 'india'))

[('india', 0.0),
 ('pakistan', 3.6954822540283203),
 ('indian', 4.114313125610352),
 ('delhi', 4.155975818634033),
 ('bangladesh', 4.261017799377441),
 ('lanka', 4.435846328735352),
 ('sri', 4.515717029571533),
 ('australia', 4.806082725524902),
 ('thailand', 4.994781494140625),
 ('malaysia', 5.009334087371826)]

We'll also create another function that will nicely print out the tuples returned by our `closest_words` function.

In [None]:
def print_tuples(tuples):
    for w, d in tuples:
        print(f'({d:02.04f}) {w}') 

Using the `print_tuples` function, use the code block below to print out the 10 nearest neighbours of 'jaguar':

In [None]:
print_tuples(closest_words(glove, get_vector(glove, 'jaguar')))
closest_words(glove, get_vector(glove, 'jaguar'))

(0.0000) jaguar
(4.0384) rover
(4.2649) mustang
(4.3939) e-type
(4.4494) xk8
(4.4579) xjs
(4.4906) xj6
(4.5109) xkr
(4.5336) sepecat
(4.5409) xk120


[('jaguar', 0.0),
 ('rover', 4.038439750671387),
 ('mustang', 4.264894962310791),
 ('e-type', 4.39393424987793),
 ('xk8', 4.4493913650512695),
 ('xjs', 4.457940578460693),
 ('xj6', 4.4905686378479),
 ('xkr', 4.510916233062744),
 ('sepecat', 4.533621788024902),
 ('xk120', 4.540858268737793)]

__Use the following block to explain the results.__ (hint: use Google if you don't know what any of the terms are!)

YOUR ANSWER HERE

## Analogies

Another property of word embeddings is that we can apply standard arithmetic operations. This can give interesting results.

We'll show an example of this first, and then explain it:

In [None]:
def analogy(embeddings, word1, word2, word3, n=5):
    
    candidate_words = closest_words(embeddings, get_vector(embeddings, word2) - get_vector(embeddings, word1) + get_vector(embeddings, word3), n+3)
    
    candidate_words = [x for x in candidate_words if x[0] not in [word1, word2, word3]][:n]
    
    print(f'{word1} is to {word2} as {word3} is to...')
    
    return candidate_words

In [None]:
print_tuples(analogy(glove, 'man', 'king', 'woman'))

man is to king as woman is to...
(4.0811) queen
(4.6429) monarch
(4.9055) throne
(4.9216) elizabeth
(4.9811) prince


This is the canonical example which shows off this property of word embeddings. So why does it work? Why does the vector of 'woman' added to the vector of 'king' minus the vector of 'man' give us 'queen'?

If we think about it, the vector calculated from 'king' minus 'man' gives us a "royalty vector". This is the vector associated with travelling from a man to his royal counterpart, a king. If we add this "royalty vector" to 'woman', this should travel to her royal equivalent, which is a queen!
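The arithmetic can be checked by hand on an idealised toy example. In the sketch below the vectors are hypothetical and hand-made, with one axis standing for gender and the other for royalty. Real GloVe vectors are not this tidy, so `woman + (king - man)` only lands near 'queen' rather than exactly on it.

```python
# Hand-made 2-d vectors: axis 0 ~ "gender", axis 1 ~ "royalty" (illustrative only)
toy = {
    'man':   [1.0, 0.0],
    'woman': [-1.0, 0.0],
    'king':  [1.0, 1.0],
    'queen': [-1.0, 1.0],
}


def sub(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    """Element-wise a + b."""
    return [x + y for x, y in zip(a, b)]


royalty = sub(toy['king'], toy['man'])  # the "royalty vector"
result = add(toy['woman'], royalty)     # woman + (king - man)

print(royalty)                 # [0.0, 1.0]
print(result == toy['queen'])  # True
```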

We can do this with other analogies too. For example, this gets an "acting career vector":

In [None]:
print_tuples(analogy(glove, 'man', 'actor', 'woman'))

man is to actor as woman is to...
(2.8133) actress
(5.0039) comedian
(5.1399) actresses
(5.2773) starred
(5.3085) screenwriter


__Use the following block to compute a 'capital city vector' that predicts the capital of England based on the capital and name of another country__:

In [None]:
print('washington' in glove.itos)
print_tuples(analogy(glove, 'china', 'beijing', 'england'))

True
china is to beijing as england is to...
(4.4431) birmingham
(4.5930) melbourne
(4.5997) manchester
(4.7599) leeds
(4.7985) perth


__Use the following block to compute a 'musical genre vector' that predicts the genre of music played by Eminem based on another musician/band and their genre__:

In [None]:
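# One possible answer: 'beatles' is to 'rock' as 'eminem' is to...
# (the reference band and genre are an arbitrary choice, not from the original lab)
print_tuples(analogy(glove, 'beatles', 'rock', 'eminem'))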