# Gensim word vector visualization

This material is lifted almost wholesale from CS224n, Natural Language Processing with Deep Learning. In it we investigate how word embeddings can encode contextual similarity in a high-dimensional vector space.

In [None]:
import numpy as np
import os

import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

Gensim is a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. It is efficient, scalable, and quite widely used.

One homegrown Stanford offering is GloVe word vectors. They are similar to word2vec embeddings, but are fit to global word co-occurrence counts rather than trained by predicting words within local context windows.

Gensim doesn't give them first-class support, but it lets you convert a file of GloVe vectors into word2vec format. You can download the GloVe vectors from [the GloVe page](https://nlp.stanford.edu/projects/glove/).

I'm hosting a compressed copy of the 50d and 100d word vectors here:
* http://web.stanford.edu/~sjespers/mse231/glove.6B.50d.txt.gz
* http://web.stanford.edu/~sjespers/mse231/glove.6B.100d.txt.gz

The 100d vectors are pretty useful for our purposes today.

(I use the 100d vectors below as a compromise between speed and file size on the one hand and quality on the other. If you try out the 50d vectors, they basically work for similarity but clearly aren't as good for analogy problems. If you load the 300d vectors, they're even better than the 100d vectors.)

### Convert GloVe to word2vec

In [None]:
# glove.6B.100d.txt should be in the working directory
glove_file = datapath(os.path.abspath('glove.6B.100d.txt'))
word2vec_glove_file = get_tmpfile("glove.6B.100d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)
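
For intuition, the two text formats differ only by a header: a word2vec text file starts with a line giving the vocabulary size and the vector dimension, and a GloVe file does not. Here is a minimal sketch of that conversion in plain Python. The function name `glove_to_word2vec_text` and the tiny two-word file are made up for illustration; this is not Gensim's implementation.

```python
import os
import tempfile

def glove_to_word2vec_text(glove_path, out_path):
    """Copy a GloVe text file, prepending a '<vocab_size> <dim>' header line.

    Hypothetical helper for illustration only; it reads the whole file into memory.
    """
    with open(glove_path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    vocab_size = len(lines)
    # Each line is a token followed by its vector components, space-separated
    dim = len(lines[0].rstrip().split(" ")) - 1 if lines else 0
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{vocab_size} {dim}\n")
        f.writelines(lines)
    return vocab_size, dim

# Tiny made-up file standing in for glove.6B.100d.txt
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy_w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

header = glove_to_word2vec_text(toy_glove, toy_w2v)
print(header)
```

The body of the file is copied unchanged, so the vectors themselves are identical before and after conversion.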

### Load word2vec model

In [None]:
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)

### EDA

Let's try some things out:

In [None]:
model.most_similar('trump')

In [None]:
model.most_similar('banana')
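
Under the hood, "most similar" here means highest cosine similarity between word vectors. The sketch below shows that ranking on three made-up 3-dimensional vectors; `most_similar_toy` and the vector values are assumptions for illustration, not Gensim code or real GloVe numbers.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar_toy(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word` (hypothetical helper)."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: the two fruits point in roughly the same direction
toy_vectors = {
    "banana": np.array([0.9, 0.1, 0.0]),
    "mango": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}
neighbors = most_similar_toy("banana", toy_vectors)
print(neighbors)
```

Because cosine similarity ignores vector length, only the direction of each embedding matters for the ranking.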

Sometimes the results are a little ridiculous:

In [None]:
model.most_similar(negative='banana')

Here is where the embedding clearly encodes some interesting information:

In [None]:
model.most_similar(positive=['woman', 'king'], negative=['man'])

In [None]:
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]
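
To see why this works, here is a rough sketch of the vector arithmetic behind an analogy query: add the (unit-length) positive vectors, subtract the negative one, and pick the nearest remaining word by cosine similarity. The helper `analogy_toy` and the 3-dimensional vectors are invented for illustration; the details of Gensim's own scoring may differ.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def analogy_toy(x1, x2, y1, vectors):
    """Answer 'x1 is to x2 as y1 is to ?' on a toy vocabulary (hypothetical helper)."""
    target = unit(vectors[x2]) - unit(vectors[x1]) + unit(vectors[y1])
    # Exclude the query words themselves, as an analogy should return a new word
    candidates = {w: float(np.dot(unit(target), unit(v)))
                  for w, v in vectors.items() if w not in (x1, x2, y1)}
    return max(candidates, key=candidates.get)

# Made-up vectors: axis 1 loosely encodes gender, axis 2 loosely encodes royalty
toy_vectors = {
    "man": np.array([1.0, 0.0, 0.1]),
    "woman": np.array([1.0, 1.0, 0.1]),
    "king": np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([-1.0, 0.2, 0.0]),
}
answer = analogy_toy("man", "king", "woman", toy_vectors)
print(answer)
```

Excluding the input words matters in practice: without it, the nearest vector to the target is often one of the query words.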

In [None]:
analogy('japan', 'japanese', 'australia')

In [None]:
analogy('germany', 'beer', 'france')

In [None]:
analogy('australia', 'beer', 'france')

In [None]:
analogy('denmark', 'lutefisk', 'france')

In [None]:
analogy('obama', 'clinton', 'reagan')

In [None]:
analogy('tall', 'tallest', 'long')

In [None]:
analogy('good', 'fantastic', 'bad')

In [None]:
model.doesnt_match(['breakfast', 'cereal', 'dinner', 'lunch'])
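
One way to pick the odd one out is to average the unit-length vectors of all the words and return the word least similar to that average. The sketch below does this on made-up 2-dimensional vectors; `doesnt_match_toy` is a hypothetical helper, and I am assuming rather than asserting that this mirrors what Gensim does internally.

```python
import numpy as np

def doesnt_match_toy(words, vectors):
    """Return the word farthest (by cosine) from the mean direction of the group."""
    unit_vectors = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    center = unit_vectors.mean(axis=0)
    center /= np.linalg.norm(center)
    # Dot products of unit vectors are cosine similarities
    similarities = unit_vectors @ center
    return words[int(np.argmin(similarities))]

# Made-up vectors: the three meals cluster together, the food item does not
toy_vectors = {
    "breakfast": np.array([1.0, 0.1]),
    "lunch": np.array([0.9, 0.2]),
    "dinner": np.array([1.0, 0.2]),
    "cereal": np.array([0.2, 1.0]),
}
odd_one = doesnt_match_toy(["breakfast", "cereal", "dinner", "lunch"], toy_vectors)
print(odd_one)
```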

## Visualization

Here we will use PCA (principal component analysis) to project the word embeddings down to two dimensions so that we can see which words end up near each other.

In [None]:
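def display_pca_scatterplot(model, words=None, sample=0):
    """Project word vectors onto their first two principal components and plot them."""
    # Note: model.vocab is the Gensim 3.x API; Gensim 4+ exposes model.key_to_index instead.
    if words is None:
        if sample > 0:
            # Draw without replacement so no word is plotted twice
            words = np.random.choice(list(model.vocab.keys()), sample, replace=False)
        else:
            words = [word for word in model.vocab]

    word_vectors = np.array([model[w] for w in words])

    # Fit PCA on the selected words only and keep the first two components
    twodim = PCA().fit_transform(word_vectors)[:,:2]

    plt.figure(figsize=(12,12))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    # Label each point, offset slightly so the text does not cover the marker
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)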
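
If you are curious what the PCA step is doing, it amounts to centering the vectors and projecting them onto the top two right singular vectors. Here is a small NumPy sketch on random data standing in for word vectors; `pca_2d` is a hypothetical helper, and its output may differ from scikit-learn's by a sign flip on each axis.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random data standing in for 20 word vectors of dimension 100
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(20, 100))
projected = pca_2d(toy_vectors)
print(projected.shape)
```

The first column always carries at least as much variance as the second, which is why the horizontal axis of the plot tends to show the broadest separation between groups of words.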

Now for the actual plot:

In [None]:
plt.figure(figsize=(20, 20))

Let's see what it looks like on a small selection of words:

In [None]:
display_pca_scatterplot(
    model, 
    ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',
     'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',
     'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',
     'frog', 'toad', 'ape', 'kangaroo', 'wombat', 'wolf',
     'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',
     'homework', 'assignment', 'problem', 'exam', 'test', 'class',
     'school', 'college', 'university', 'institute'])

Pretty nice! It definitely captures the notion that similar words should be near each other.

Now let's take a look at a random sample:

In [None]:
display_pca_scatterplot(model, sample=200)