### GloVe vectors


Embeddings transform a one-hot encoded vector (a vector that is 0 in every element except one, which is 1) into a much lower-dimensional vector of real numbers. The one-hot encoded vector is also known as a sparse vector, whilst the real-valued vector is known as a dense vector.
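
To make the sparse-versus-dense distinction concrete, here is a minimal sketch in plain Python. The four words and all the numbers in the table are made up for illustration and are not taken from GloVe. It shows that multiplying a one-hot vector by an embedding table simply selects one row of the table.

```python
# Toy vocabulary and embedding table; all values are invented for illustration.
vocab = ['cat', 'dog', 'car', 'bus']
embedding_table = [
    [0.9, 0.1, 0.3],   # cat
    [0.8, 0.2, 0.4],   # dog
    [0.1, 0.9, 0.7],   # car
    [0.2, 0.8, 0.6],   # bus
]

def one_hot(index, size):
    """Sparse vector: 1.0 at `index`, 0.0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(n_cols)]

dog_index = vocab.index('dog')
sparse = one_hot(dog_index, len(vocab))                # length 4, mostly zeros
dense = vector_times_matrix(sparse, embedding_table)   # length 3, real values

print(sparse)
print(dense)   # same numbers as embedding_table[dog_index]
```

In practice the multiplication is never carried out; the row is looked up directly by index, which is what the indexing into glove.vectors later in this notebook does.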

The key concept in these word embeddings is that words that appear in similar contexts lie close together in the vector space, i.e. the Euclidean distance between their vectors is small.
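
As a small illustration of "close together", the sketch below computes Euclidean distances between made-up 3-dimensional vectors (the GloVe vectors used in this notebook have 100 dimensions). The values were chosen by hand so that 'cat' and 'dog' sit near each other; they are assumptions, not real embeddings.

```python
import math

# Made-up vectors for illustration only.
toy_vectors = {
    'cat': [0.9, 0.1, 0.3],
    'dog': [0.8, 0.2, 0.4],
    'car': [0.1, 0.9, 0.7],
}

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.dist(a, b)

d_cat_dog = euclidean(toy_vectors['cat'], toy_vectors['dog'])
d_cat_car = euclidean(toy_vectors['cat'], toy_vectors['car'])

# 'cat' and 'dog' were placed close together on purpose,
# so the first distance comes out smaller than the second.
print(f'cat-dog: {d_cat_dog:.4f}, cat-car: {d_cat_car:.4f}')
```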

https://github.com/spro/practical-pytorch/blob/master/glove-word-vectors/glove-word-vectors.ipynb

In [1]:
import torch
import torchtext.vocab

#### Load GloVe vectors
First, we'll load the GloVe vectors. The name argument specifies which corpus the vectors were trained on; here 6B means a corpus of 6 billion tokens.

In [6]:
glove = torchtext.vocab.GloVe(name='6B', dim=100)
# trained on 6 billion tokens; each word is represented by a 100-dimensional vector

print(f'There are {len(glove.itos)} words in the vocabulary')

There are 400000 words in the vocabulary


['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]

### In this set of GloVe vectors, every word is lower-case.
- 400,000 words, each represented by a 100-dimensional vector

In [7]:
glove.vectors.shape

torch.Size([400000, 100])
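
The shape also gives a rough idea of how much memory the embedding matrix needs. The estimate below assumes the values are stored as 32-bit floats, which is PyTorch's default; checking glove.vectors.dtype would confirm this.

```python
# Back-of-the-envelope memory estimate for the embedding matrix.
num_words = 400_000      # rows, from glove.vectors.shape
dim = 100                # columns, from glove.vectors.shape
bytes_per_value = 4      # assumption: float32 storage

size_bytes = num_words * dim * bytes_per_value
print(f'{size_bytes / 1e6:.0f} MB')
```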

The output below shows the word order: row 0 of glove.vectors is the vector for the word 'the', row 1 for ',' (comma), row 2 for '.' (period), etc.

In [24]:
# first 15 words in GloVe vocabulary
glove.itos[:15]

['the',
 ',',
 '.',
 'of',
 'to',
 'and',
 'in',
 'a',
 '"',
 "'s",
 'for',
 '-',
 'that',
 'on',
 'is']

### Numeric index of given words

In [8]:
glove.stoi['the']

0

In [9]:
glove.stoi['dazzle']

36623

In [17]:
glove.stoi['behnam']

166593
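
The two lookups are inverses of each other: itos maps a row index to a word, and stoi maps a word back to its row index. A minimal sketch with a toy list (the first five entries shown above) illustrates how such a reverse mapping can be built; it is not claimed to be how torchtext builds it internally.

```python
# Toy vocabulary, mirroring the first entries of glove.itos.
itos = ['the', ',', '.', 'of', 'to']

# Build the reverse mapping: word -> row index.
stoi = {word: index for index, word in enumerate(itos)}

# Round trip: looking a word up and mapping the index back returns the same word.
round_trip_ok = all(itos[stoi[word]] == word for word in itos)

print(stoi['of'], round_trip_ok)
```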

### Vector representation of a given word

In [19]:
glove.vectors[glove.stoi['behnam']]

tensor([-0.3169,  0.0307, -0.0961,  0.2953, -0.0797, -0.6273,  0.2274, -0.2679,
         0.1378, -0.0654,  0.1710,  0.7600, -0.3771, -0.2558,  0.2803,  0.0839,
        -0.0527,  0.1139,  0.0614,  0.0408, -0.4747,  0.4947,  0.2939,  0.1276,
        -0.9034,  0.5451, -0.5878,  0.0788,  0.1740,  0.1525,  0.2043, -0.8871,
         0.0424, -0.0807,  0.2236, -0.8022,  0.2143,  0.3548,  0.2322, -0.1880,
        -0.0302,  0.2244,  0.5588,  0.5244,  0.0565,  0.0974,  0.2689,  0.6710,
        -0.0384,  0.6108,  0.0954, -0.1304, -0.0603, -0.3533, -0.1242,  0.5028,
        -0.3027,  0.2162, -0.7939, -0.6337, -0.1156, -0.6282,  0.1086, -0.0864,
        -0.5286,  0.1353, -0.0608, -0.3036, -0.0048,  0.3765,  0.2213,  0.6235,
         0.6105, -0.8010,  0.2631, -0.4587, -0.1978,  0.1259,  0.0366,  0.3063,
         0.0573, -0.0450,  0.0377,  0.1521,  0.5279,  0.1083,  0.2403, -0.1780,
        -0.1725,  0.0634,  0.1030,  0.0447,  0.4559, -0.0587,  0.0996,  0.0363,
         0.1626,  0.5560, -0.7709,  0.11

### We'll create a function that takes in word embeddings and a word and returns the associated vector. It raises an AssertionError if the word doesn't exist in the vocabulary.

In [26]:
def get_vector(embeddings, word):
    
    assert word in embeddings.stoi, f'*{word}* is not in the vocab!'
    
    return embeddings.vectors[embeddings.stoi[word]]
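
Because this vocabulary is lower-case only, a lookup such as 'Cat' would fail the assert even though 'cat' may be present. The sketch below shows one possible way to handle that. The helper name get_vector_or_none is hypothetical, and the toy object only imitates the itos, stoi and vectors attributes so the example runs without downloading anything.

```python
from types import SimpleNamespace

# Stand-in for the GloVe object: same attribute names (itos, stoi, vectors),
# but with a tiny made-up vocabulary.
itos = ['the', 'cat', 'dog']
toy = SimpleNamespace(
    itos=itos,
    stoi={word: i for i, word in enumerate(itos)},
    vectors=[[0.1, 0.2], [0.9, 0.1], [0.8, 0.2]],
)

def get_vector_or_none(embeddings, word):
    """Hypothetical helper: lower-case the word, return None if it is unknown."""
    index = embeddings.stoi.get(word.lower())
    if index is None:
        return None
    return embeddings.vectors[index]

print(get_vector_or_none(toy, 'Cat'))      # found after lower-casing
print(get_vector_or_none(toy, 'unicorn'))  # not in the toy vocabulary
```

Returning None instead of raising is a design choice: it suits code that wants to skip unknown words, whereas the assert above is better when a missing word indicates a mistake.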

In [40]:
get_vector(glove, 'behnam')

tensor([-0.3169,  0.0307, -0.0961,  0.2953, -0.0797, -0.6273,  0.2274, -0.2679,
         0.1378, -0.0654,  0.1710,  0.7600, -0.3771, -0.2558,  0.2803,  0.0839,
        -0.0527,  0.1139,  0.0614,  0.0408, -0.4747,  0.4947,  0.2939,  0.1276,
        -0.9034,  0.5451, -0.5878,  0.0788,  0.1740,  0.1525,  0.2043, -0.8871,
         0.0424, -0.0807,  0.2236, -0.8022,  0.2143,  0.3548,  0.2322, -0.1880,
        -0.0302,  0.2244,  0.5588,  0.5244,  0.0565,  0.0974,  0.2689,  0.6710,
        -0.0384,  0.6108,  0.0954, -0.1304, -0.0603, -0.3533, -0.1242,  0.5028,
        -0.3027,  0.2162, -0.7939, -0.6337, -0.1156, -0.6282,  0.1086, -0.0864,
        -0.5286,  0.1353, -0.0608, -0.3036, -0.0048,  0.3765,  0.2213,  0.6235,
         0.6105, -0.8010,  0.2631, -0.4587, -0.1978,  0.1259,  0.0366,  0.3063,
         0.0573, -0.0450,  0.0377,  0.1521,  0.5279,  0.1083,  0.2403, -0.1780,
        -0.1725,  0.0634,  0.1030,  0.0447,  0.4559, -0.0587,  0.0996,  0.0363,
         0.1626,  0.5560, -0.7709,  0.11

If we want to find the words similar to a certain input word, we first find the vector of this input word, then we scan through our vocabulary finding any vectors similar to this input word vector.

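The function below returns the n closest words (6 by default) to an input word vector. If the input vector belongs to a word in the vocabulary, that word itself comes first, at distance 0: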

In [50]:
def closest(embeddings, vector, n = 6):
    
    distances = []
    
    for neighbor in embeddings.itos: # iterates through all 400,000 words
        # (each word in vocab, distance of each word in vocab with the input word)
        distances.append((neighbor, torch.dist(vector, get_vector(embeddings, neighbor))))
    
    return sorted(distances, key = lambda x: x[1])[:n]
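
The loop above calls torch.dist once per word, which is slow for 400,000 words. A possible alternative, sketched below, computes all distances in one batched operation. The name closest_vectorized and the tiny toy embeddings are assumptions made for illustration; the function should give the same ranking as closest, except possibly in the order of exact ties.

```python
import torch
from types import SimpleNamespace

def closest_vectorized(embeddings, vector, n=6):
    """Nearest words by Euclidean distance, computed for all rows at once."""
    # Subtract the query from every row, then take each row's Euclidean norm.
    distances = (embeddings.vectors - vector).pow(2).sum(dim=1).sqrt()
    # topk with largest=False returns the n smallest distances and their row indices.
    values, indices = torch.topk(distances, k=n, largest=False)
    return [(embeddings.itos[i], d) for i, d in zip(indices.tolist(), values)]

# Tiny made-up embeddings with the same attributes as the GloVe object.
itos = ['cat', 'dog', 'car', 'bus']
toy = SimpleNamespace(
    itos=itos,
    stoi={w: i for i, w in enumerate(itos)},
    vectors=torch.tensor([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]),
)

result = closest_vectorized(toy, toy.vectors[toy.stoi['cat']], n=3)
for word, distance in result:
    print(f'({distance:.4f}) {word}')
```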

In [51]:
get_vector(glove, 'ufc')

tensor([ 0.5941, -0.2980,  0.4845, -1.0484, -0.1234, -0.7606, -0.3824,  0.0547,
        -0.5145,  0.4720,  1.1300, -0.6752,  0.5322, -1.4753,  0.7268, -1.0578,
         0.7085, -0.0226,  0.6492, -0.5271,  0.5378, -1.3725,  0.6805,  0.3684,
         0.3946,  0.3264,  0.7379,  0.3787,  1.4240,  0.8743, -0.9630, -0.2236,
        -0.2297,  0.4366,  0.1340,  0.7371, -0.6388,  0.0802, -0.4933,  0.5903,
        -0.1161,  0.1407,  0.2577, -0.6073,  0.1068, -0.4910, -0.5081,  0.0193,
        -0.8991,  0.5141, -0.6505,  0.0244, -0.4841,  0.4380,  1.1311, -0.9572,
         0.2540,  0.0479, -1.1566,  0.1825,  0.1217, -0.2933, -0.9783,  0.0429,
        -0.1324, -1.1153, -0.1110, -0.4143, -0.7511,  0.3366,  0.0281, -0.4760,
        -1.0994,  1.0847,  0.1064,  0.0838, -0.3123,  0.3206, -0.6370,  0.4812,
         0.0880, -0.9473,  0.8987, -0.4394,  0.5618, -0.5753, -0.1427,  0.1543,
         0.3749, -0.0160, -0.0100, -0.1311, -0.1834,  1.4182, -0.0115, -0.5173,
         0.2775,  1.1060,  0.6482, -0.11

In [52]:
closest(glove, get_vector(glove, 'ufc'))

[('ufc', tensor(0.)),
 ('wec', tensor(4.1984)),
 ('strikeforce', tensor(5.0995)),
 ('wcw', tensor(5.1773)),
 ('ecw', tensor(5.3557)),
 ('wbc', tensor(5.3616))]

In [45]:
closest(glove, get_vector(glove, 'behnam'))

[('behnam', tensor(0.)),
 ('khedr', tensor(2.8868)),
 ('bahnam', tensor(2.8913)),
 ('bottinelli', tensor(2.9203)),
 ('samuela', tensor(2.9273)),
 ('shongwe', tensor(2.9329))]

In [39]:
closest(glove, get_vector(glove, 'shenanigans'))

[('shenanigans', tensor(0.)),
 ('chicanery', tensor(2.3785)),
 ('hijinks', tensor(2.6764)),
 ('escapades', tensor(2.7821)),
 ('machinations', tensor(2.8699)),
 ('gamesmanship', tensor(2.9044))]

We'll also create another function that will nicely print out the tuples returned by our closest function.

In [54]:
def print_tuples(tuples):
    
    for t in tuples:
        print('(%.4f) %s' % (t[1], t[0]))

In [57]:
print_tuples(closest(glove, get_vector(glove, 'ronaldo')))

(0.0000) ronaldo
(3.1684) ronaldinho
(3.2217) rivaldo
(3.2934) beckham
(3.4223) cristiano
(3.4708) robinho


#### Analogies

With a well-trained word vector space, certain semantic relationships can be captured with regular vector arithmetic.

In [61]:
def analogy(embeddings, w1, w2, w3, n = 6):
    
    print('\n[%s : %s :: %s : ?]' % (w1, w2, w3))
   
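    # w2 - w1 + w3 is the point where the answer is expected to lie
    target = (get_vector(embeddings, w2)
              - get_vector(embeddings, w1)
              + get_vector(embeddings, w3))

    # ask for 3 extra candidates so that n remain after dropping w1, w2, w3
    closest_words = closest(embeddings, target, n + 3)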
 
    closest_words = [x for x in closest_words if x[0] not in [w1, w2, w3]][:n]
        
    return closest_words
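
The arithmetic inside analogy can be checked by hand on a toy vocabulary. In the sketch below the 2-D vectors are hand-made (axis 0 loosely stands for "royalty", axis 1 for "gender") and are not GloVe values; toy_analogy is a hypothetical, simplified stand-in for the function above.

```python
import math

# Hand-made 2-D vectors, for illustration only.
toy_vectors = {
    'man':   [0.0, 0.0],
    'woman': [0.0, 1.0],
    'king':  [1.0, 0.0],
    'queen': [1.0, 1.0],
    'apple': [-1.0, 0.5],
}

def toy_analogy(w1, w2, w3):
    """Solve w1 : w2 :: w3 : ? as the word nearest to w2 - w1 + w3."""
    target = [b - a + c for a, b, c in
              zip(toy_vectors[w1], toy_vectors[w2], toy_vectors[w3])]
    ranked = sorted(toy_vectors, key=lambda w: math.dist(toy_vectors[w], target))
    # Drop the three query words, as the analogy function above does.
    return [w for w in ranked if w not in (w1, w2, w3)]

answers = toy_analogy('man', 'king', 'woman')
print(answers)
```

Here king - man + woman lands exactly on queen. With real embeddings the target point rarely coincides with any word, and the query words themselves can rank among its nearest neighbours, which is why they are filtered out.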

night - moon == ? - sun
- expected answer: day or morning

In [43]:
print_tuples(analogy(glove, 'moon', 'night', 'sun'))


[moon : night :: sun : ?]
(5.7069) morning
(5.7276) afternoon
(5.8023) evening
(6.1410) hours
(6.2797) saturday
(6.3056) sunday


In [44]:
print_tuples(analogy(glove, 'fly', 'bird', 'swim'))


[fly : bird :: swim : ?]
(5.9754) swimming
(6.2409) shark
(6.4822) dolphin
(6.5421) whale
(6.6276) cat
(6.6457) gorilla


#### Interesting failure mode
- GloVe appears to treat sun as a surname and completes the analogy with other surnames

In [45]:
print_tuples(analogy(glove, 'earth', 'moon', 'sun')) 


[earth : moon :: sun : ?]
(6.2294) lee
(6.4125) kang
(6.4644) tan
(6.4757) yang
(6.4853) lin
(6.5220) chong


### The problem with the previous analogy is that the sun, unlike the earth, doesn't have a moon.
- We replace sun with jupiter, which has moons named io, ganymede, europa, ...

In [62]:
print_tuples(analogy(glove, 'earth', 'moon', 'jupiter')) 


[earth : moon :: jupiter : ?]
(5.7522) io
(5.9079) moons
(5.9303) ganymede
(6.2325) saturn
(6.2599) neptune
(6.2854) uranus


In [63]:
print_tuples(analogy(glove, 'ufc', 'submission', 'boxing')) 


[ufc : submission :: boxing : ?]
(6.4599) courts
(6.6362) storytelling
(6.6554) mind
(6.6827) hand
(6.7017) tradition
(6.7031) practice
