# GloVe walkthrough

This notebook serves as an overview of GloVe algorithm, which is an unsupervised algorithm for learning vector representations for words, otherwise known as a "[word embedding](https://en.wikipedia.org/wiki/Word_embedding)".

In [179]:
import math
from collections import defaultdict

import numpy as np

from sklearn.cluster import KMeans

## Loading Pretrained Embeddings

I downloaded glove.twitter.27B.zip from the official website, which provides a number of different dimensions aka latent factors for the vectors learned for each word in the vocab.

In [182]:
%ls -lh data/*txt

-rw-rw-r--  1 lex  staff   331M  5 Aug  2014 data/glove.6B.100d.txt
-rw-rw-r--  1 lex  staff   586M  5 Aug  2014 data/glove.6B.200d.txt
-rw-rw-r--  1 lex  staff   163M  5 Aug  2014 data/glove.6B.50d.txt
-r--r--r--  1 lex  staff   246M 14 Aug  2014 data/glove.twitter.27B.25d.txt
-rw-rw-r--  1 lex  staff   487M 14 Aug  2014 data/glove.twitter.27B.50d.txt


Each text file includes each word (or character) in the corpus as the first character on a line, followed by a number of floats (based on the dimension size of the file) which represents the trained word vector.

In [183]:
fh = open('data/glove.twitter.27B.25d.txt', 'r')
next(fh)

'<user> 0.62415 0.62476 -0.082335 0.20101 -0.13741 -0.11431 0.77909 2.6356 -0.46351 0.57465 -0.024888 -0.015466 -2.9696 -0.49876 0.095034 -0.94879 -0.017336 -0.86349 -1.3348 0.046811 0.36999 -0.57663 -0.48469 0.40078 0.75345\n'

In [184]:
next(fh)

'. 0.69586 -1.1469 -0.41797 -0.022311 -0.023801 0.82358 1.2228 1.741 -0.90979 1.3725 0.1153 -0.63906 -3.2252 0.61269 0.33544 -0.57058 -0.50861 -0.16575 -0.98153 -0.8213 0.24333 -0.14482 -0.67877 0.7061 0.40833\n'

In [185]:
next(fh)

': 1.1242 0.054519 -0.037362 0.10046 0.11923 -0.30009 1.0938 2.537 -0.072802 1.0491 1.0931 0.066084 -2.7036 -0.14391 -0.22031 -0.99347 -0.65072 -0.030948 -1.0817 -0.64701 0.32341 -0.41612 -0.5268 -0.047166 0.71549\n'

In [186]:
def build_embedding(fh, vocab_size, dims):
    """Function used to load embedding from disk into NumPy matrix."""
    embedding = np.zeros((vocab_size, dims))

    labels = []

    for i, line in enumerate(fh):
        if i >= vocab_size:
            break

        sr = line.split()

        if len(sr[1:]) == dims - 1:
            sr = [' '] + sr  
            
        labels.append(sr[0])
            
        embedding[i] = np.array([float(j) for j in sr[1:]])
        
    return embedding, labels

In [187]:
embedding, labels = build_embedding(
    open('data/glove.twitter.27B.25d.txt', 'r'),
    vocab_size=10000,
    dims=25
)

In [188]:
embedding[0]

array([ 0.62415 ,  0.62476 , -0.082335,  0.20101 , -0.13741 , -0.11431 ,
        0.77909 ,  2.6356  , -0.46351 ,  0.57465 , -0.024888, -0.015466,
       -2.9696  , -0.49876 ,  0.095034, -0.94879 , -0.017336, -0.86349 ,
       -1.3348  ,  0.046811,  0.36999 , -0.57663 , -0.48469 ,  0.40078 ,
        0.75345 ])

In [189]:
labels[0]

'<user>'

In [190]:
print(' '.join(labels[:100]))

<user> . : rt , <repeat> <hashtag> <number> <url> ! i a " the ? you to ( <allcaps> <elong> ) me de <smile> ！ que and 。 - my no 、 is it … in n for / of la 's * do n't that on y ' e o u en this el so be 'm with just > your ^ like have te at ？ love se are < m r if all b ・ not but we es ya & follow up what get lol un ♥ lo when was “ ” one por si out


In [191]:
np.save(f'data/embedding.{10000}.glove.twitter.27B', embedding)
%ls -lh ./data/embedding*

-rw-r--r--  1 lex  staff   1.9M 25 Mar 14:29 ./data/embedding.10000.glove.twitter.27B.npy
-rw-r--r--  1 lex  staff    19M 24 Mar 23:57 ./data/embedding.100000.glove.twitter.27B.npy


## Examples

### Word Clustering

Build a K-Means model to find the nearest neighbours for each word

In [17]:
kmeans_model = KMeans(init='k-means++', n_clusters=1000, n_jobs=-1, n_init=10)
kmeans_model.fit(embedding)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=1000, n_init=10, n_jobs=-1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [18]:
cluster_labels = kmeans_model.labels_

In [19]:
cluster_to_words = defaultdict(list)
for c, i in enumerate(cluster_labels):
    cluster_to_words[i].append(labels[c])

In [20]:
for i, words in enumerate(cluster_to_words.values()):
    words_str = ', '.join(words)
    print(f'Cluster {i + 1}: {words_str}')

Cluster 1: <user>, rt, …, n, >, <, “, ”, >>, $, =, +, %, --, ---
Cluster 2: ., ,, <repeat>, !, ", ?, ;, normal
Cluster 3: :, <number>, (, <allcaps>, ), -, /, |, @
Cluster 4: <hashtag>, <url>, #
Cluster 5: i, you, it, n't, that, like, if, but, what, when, know, about, how, want, 're, she, why, there, really, think, never, did, say, them, someone, always, him, where, dont, tell, talk, mean, remember
Cluster 6: a, me, no
Cluster 7: the, and, my, for, of, 's, this, with, your, all, one, out, new, from, more, by, or, some, its, his, our, us, first, great, their, world, every, big, other, any, old, another, own, whole
Cluster 8: to, up, get, go, back, see, ca, need, make, 'll, then, come, let, way, take, wanna, better, gonna, give, keep, call, put, try, stay, trying, leave, ill, forget, bring, lets, hold, pick
Cluster 9: <elong>, <smile>, ya, <sadface>, <neutralface>, ah, xd, :o
Cluster 10: de, que, te, se, nos
Cluster 11: ！, ～, ★, ♪, ☆, ＆
Cluster 12: 。, 、, です, 今, そして, 今日, 明日, 今日は, また
Cluste

In [42]:
big_embedding, big_labels = build_embedding(
    open('data/glove.twitter.27B.25d.txt', 'r'),
    100000,
    25
)

In [43]:
np.save(f'data/embedding.{100000}.glove.twitter.27B', big_embedding)
%ls -lh ./data/embedding*

-rw-r--r--  1 lex  staff   1.9M 24 Mar 23:37 ./data/embedding.10000.glove.twitter.27B.npy
-rw-r--r--  1 lex  staff    19M 24 Mar 23:57 ./data/embedding.100000.glove.twitter.27B.npy


### Nearest Neighbours

Few examples of finding the nearest neighbours of the vectors. In the examples, we use cosine similarity as the distance metric between the two vectors.

#### Cosine similarity

From Wikipedia:

"
Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

$$ \frac{\mathbf{A}*\mathbf{B}}{||\mathbf{A}||\mathbf{B}||} = \frac{\sum\limits_{i=1}^{n} A_i B_i}{\sqrt{\sum\limits_{i=1}^{n} A_i^2} \sqrt{\sum\limits_{i=1}^{n} B_i^2}} $$

It returns a number between -1 (exact opposite) and 1 (the same). In information retrieval, a TF-IDF vector would not have negative numbers in the vector, so the cosine similarity would be between 0 and 1.

If we had two Tweets represented as simple word count, let's calculate the distance by hand, then with NumPy and then using the Scipy cosine function:

* Tweet 1: "Hello world, what's up?"
* Tweet 2: "What's up, world?"

Word    |  Vec 1  | Vec 2
--------|:-------:|-------:
        |         |
hello   | 1       | 0     
world   | 1       | 1
what's  | 1       | 1
up      | 1       | 1

In [175]:
# Solution in pure Python

vect_1 = [1, 1, 1, 1]
vect_2 = [0, 1, 1, 1]

dot = sum([a * b for (a, b) in zip(vect_1, vect_2)])

norm_1 = math.sqrt(sum([i**2 for i in vect_1]))
norm_2 = math.sqrt(sum([i**2 for i in vect_2]))

dot / (norm_1 * norm_2)

0.8660254037844387

In [176]:
# Solution in Numpy

from numpy import dot
from numpy.linalg import norm

a = np.array(vect_1)
b = np.array(vect_2)

dot(a, b)/(norm(a)*norm(b))

0.8660254037844387

In [177]:
# Solution in Scipy
from scipy.spatial.distance import cosine

# Computes distance, not similarity, so you need to subtract from 1 to get the latter.
1 - cosine(a, b)

0.8660254037844387

#### Finding NN of words in the embedding

Firstly need to build a mapping of labels to ids.

In [65]:
label_to_id = {label: i for (i, label) in enumerate(big_labels)}

In [68]:
big_embedding[label_to_id['fashion']]

array([-1.0618e+00, -7.4601e-01, -2.8375e-01, -1.2646e+00,  1.2012e+00,
        1.1948e-01,  7.5129e-01,  2.2459e-01,  4.2224e-01, -7.7092e-01,
        3.5408e-01, -2.2724e-01, -3.4628e+00, -5.4871e-01, -8.1186e-01,
        1.6895e-01,  1.1222e-03, -7.4510e-01, -1.4037e-01,  1.1259e+00,
       -1.1187e+00, -7.8684e-01, -1.2874e+00, -8.4603e-01, -2.5663e-01])

In [123]:
def get_closest_n_for_word(word, n, label_to_id, embedding, labels):    
    word_id = label_to_id[word]
    word_vector = embedding[word_id]
    
    return get_closest_n(word_vector, n, label_to_id, embedding, labels)
    

def get_closest_n(word_vector, n, label_to_id, embedding, labels):
    distances = []
        
    for i, vect in enumerate(embedding):
        dist = cosine(word_vector, vect)
        distances.append((dist, i))
        
    top_n = sorted(distances, key=lambda r: r[0])[:n]
        
    return [(dist, labels[word_id]) for dist, word_id in top_n]

##### Twitter set

In [124]:
get_closest_n_for_word('fashion', 10, label_to_id, big_embedding, big_labels)

[(0.0, 'fashion'),
 (0.11190469989651741, 'clothing'),
 (0.1187808663462051, 'designer'),
 (0.12109410226928441, 'art'),
 (0.12462542597575377, 'design'),
 (0.12729323025077588, 'photography'),
 (0.12768769439763372, 'models'),
 (0.13050617357645844, 'style'),
 (0.13300507371641712, 'collection'),
 (0.13377968671232177, 'urban')]

In [126]:
get_closest_n_for_word('technology', 10, label_to_id, big_embedding, big_labels)

[(0.0, 'technology'),
 (0.05116892368819159, 'systems'),
 (0.059884984791405804, 'development'),
 (0.06896174308464964, 'tech'),
 (0.07422950971306108, 'management'),
 (0.07441314944022426, 'system'),
 (0.07464261828360197, 'based'),
 (0.07930825619577364, 'industry'),
 (0.0821470074938977, 'solutions'),
 (0.08470183076648641, 'resources')]

In [127]:
get_closest_n_for_word('food', 10, label_to_id, big_embedding, big_labels)

[(0.0, 'food'),
 (0.07890411373118678, 'coffee'),
 (0.09949482327427706, 'wine'),
 (0.10243739947630437, 'breakfast'),
 (0.10485338571785352, 'drink'),
 (0.1071093788351859, 'tea'),
 (0.11201323994634982, 'starbucks'),
 (0.11314538035245958, 'beer'),
 (0.11525931057212013, 'eat'),
 (0.12103496672975256, 'drinks')]

In [128]:
get_closest_n_for_word('sports', 10, label_to_id, big_embedding, big_labels)

[(0.0, 'sports'),
 (0.0655900492228515, 'football'),
 (0.08204431781553279, 'soccer'),
 (0.08520921025986539, 'tech'),
 (0.09501407059278166, 'games'),
 (0.09730687129513615, 'baseball'),
 (0.09952443942114464, 'basketball'),
 (0.11310389857437686, 'golf'),
 (0.11485610398447621, 'player'),
 (0.11901453714009425, 'teams')]

In [129]:
get_closest_n_for_word('drugs', 10, label_to_id, big_embedding, big_labels)

[(0.0, 'drugs'),
 (0.09824980130172845, 'drug'),
 (0.11359491972249869, 'meth'),
 (0.11480835569515646, 'smoking'),
 (0.11861642222632496, 'weed'),
 (0.12874489890566798, 'feed'),
 (0.1312791384324209, 'cocaine'),
 (0.13276736076344542, 'marijuana'),
 (0.13398141374625738, 'junk'),
 (0.13968763029470954, 'illegal')]

In [130]:
get_closest_n_for_word('frog', 10, label_to_id, big_embedding, big_labels)

[(0.0, 'frog'),
 (0.057426365817694225, 'turtle'),
 (0.08396356648444625, 'rabbit'),
 (0.08423825426104536, 'monkey'),
 (0.0845456071455245, 'dinosaur'),
 (0.08893787816517595, 'unicorn'),
 (0.09344485925733348, 'demon'),
 (0.0974034543947101, 'snake'),
 (0.10358710139020255, 'goat'),
 (0.10431914282370536, 'witch')]

##### Wikipedia 2014 + Gigaword 5 set

In [194]:
embedding_6b, labels_6b = build_embedding(
    open('data/glove.6B.50d.txt', 'r'),
    vocab_size=1000000,
    dims=50
)

In [195]:
label_to_id_6b = {label: i for (i, label) in enumerate(labels_6b)}

In [196]:
get_closest_n_for_word('president', 10, label_to_id_6b, embedding_6b, labels_6b)

  dist = 1.0 - uv / np.sqrt(uu * vv)


[(0.0, 'president'),
 (0.14428115871894032, 'vice'),
 (0.1830153983848719, 'met'),
 (0.1904861477448797, 'secretary'),
 (0.22021414207893242, 'presidency'),
 (0.23046086977833125, 'chairman'),
 (0.23805329181108092, 'leader'),
 (0.23959343258166754, 'administration'),
 (0.24563664635866045, 'former'),
 (0.2545339628159693, 'clinton')]

In [197]:
get_closest_n_for_word('king', 10, label_to_id_6b, embedding_6b, labels_6b)

  dist = 1.0 - uv / np.sqrt(uu * vv)


[(0.0, 'king'),
 (0.17638203066643, 'prince'),
 (0.21609569890358837, 'queen'),
 (0.2253769969364895, 'ii'),
 (0.2263752375127076, 'emperor'),
 (0.2332806045393414, 'son'),
 (0.23728490559349258, 'uncle'),
 (0.2457838875991536, 'kingdom'),
 (0.24600857317183578, 'throne'),
 (0.250758815387503, 'brother')]

In [198]:
get_closest_n_for_word('fashion', 10, label_to_id_6b, embedding_6b, labels_6b)

  dist = 1.0 - uv / np.sqrt(uu * vv)


[(0.0, 'fashion'),
 (0.23926544544129025, 'style'),
 (0.24712236850455038, 'fashions'),
 (0.24841786388296982, 'designer'),
 (0.24885286852908617, 'chic'),
 (0.2549340749641188, 'designers'),
 (0.2778697451660558, 'shows'),
 (0.28345347853858394, 'couture'),
 (0.2844381708689645, 'glamour'),
 (0.28575764985460117, 'show')]

### Word arithmetic

Another really interesting property of word embeddings is their ability to support word arithmetic. Common examples:

* king - man + woman = queen
* wolf + pet = dog

In [199]:
result_vector = (
    embedding_6b[label_to_id_6b['king']] - embedding_6b[label_to_id_6b['man']] + embedding_6b[label_to_id_6b['woman']])

In [202]:
# Note that the nearest neighbour is usually the actual reference word, so we're most interested in the 2 word, usually.
get_closest_n(result_vector, 2, label_to_id_6b, embedding_6b, labels_6b)[1]

  dist = 1.0 - uv / np.sqrt(uu * vv)


(0.1390418741421059, 'queen')

In [218]:
result_vector = (
    embedding_6b[label_to_id_6b['wolf']] +
    embedding_6b[label_to_id_6b['pet']])
get_closest_n(result_vector, 4, label_to_id_6b, embedding_6b, labels_6b)[1]

  dist = 1.0 - uv / np.sqrt(uu * vv)


(0.1336948473612377, 'dog')