# Class 10.1: Intro to Word Embeddings

## Installing gensim and its dependencies and launching a Jupyter notebook

``python3 -m pip install numpy``

``python3 -m pip install scipy``

``python3 -m pip install gensim``

``python3 -m pip install scikit-learn``

``python3 -m pip install matplotlib``

``python3 -m pip install nltk``

``jupyter notebook``


## Getting a pre-trained word2vec model

You can get a pre-trained word2vec model built on billions of words of Google News from here:

https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz

Just click on the "download" icon next to where it says "Raw".


## Importing some libraries

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import gensim
import re
import nltk
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from gensim.models import Word2Vec



## Loading and using the pre-trained word2vec model

<b>Note: When you run the code below, it will take a minute or two to load the model!</b> Wait until you see <code>"big model loaded!"</code> printed out below the cell. While the cell is still running, you will see a <code>*</code> in the brackets to the left of it.

In [None]:
bigmodel = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300-SLIM.bin.gz", binary=True)
print("big model loaded!")

Now we can look at the word embedding (i.e., word vector) for any word that's in the model like this: 

In [None]:
bigmodel["dog"]

This is not especially interesting because these numbers are not human-interpretable. Instead, we might like to look at how similar words are to each other, like this:

In [None]:
bigmodel.similarity('dog', 'computer')

In [None]:
bigmodel.similarity("dog", "cat")

In [None]:
bigmodel.similarity("dog", "leash")

In [None]:
bigmodel.similarity("dog", "barking")

In [None]:
bigmodel.similarity("dog", "enhance")
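
The similarity score that gensim reports is the cosine similarity between the two word vectors: the dot product divided by the product of the vector lengths. Here is a minimal sketch of that calculation. The 3-dimensional vectors in `toy` are made up for illustration (the real model uses 300 dimensions), and `cosine_similarity` is our own helper, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, NOT taken from the real model
toy = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "computer": [0.1, 0.2, 0.9],
}

sim_dog_cat = cosine_similarity(toy["dog"], toy["cat"])
sim_dog_computer = cosine_similarity(toy["dog"], toy["computer"])

# Vectors that point in similar directions get a higher score
print(sim_dog_cat > sim_dog_computer)  # True
```

With the real model, you could try the same helper on `bigmodel["dog"]` and `bigmodel["cat"]` and compare the result with `bigmodel.similarity("dog", "cat")`.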

We can also see what words are most similar to some word, like this:

In [None]:
bigmodel.most_similar("dog")

In [None]:
bigmodel.most_similar("soccer")

We can also ask the model to pick out the word that doesn't belong.

In [None]:
bigmodel.doesnt_match(["fret", "neck", "string", "key"])

In [None]:
bigmodel.doesnt_match(["breakfast", "lunch", "dinner", "chair"])

It doesn't always work the way you think it might. 

In [None]:
bigmodel.doesnt_match(["set", "list", "dictionary", "elephant"])
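
One way to think about "the word that doesn't belong" is to average all the word vectors and then pick the word whose vector is least similar to that average. The sketch below does this with made-up 2-dimensional vectors. The names `unit`, `odd_one_out`, and `toy_meals` are hypothetical, and this is an assumption about the general idea rather than a copy of gensim's actual implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def odd_one_out(word_vectors):
    """Return the word whose unit vector is least similar to the mean vector."""
    units = {w: unit(v) for w, v in word_vectors.items()}
    dims = len(next(iter(units.values())))
    mean = [sum(u[d] for u in units.values()) / len(units) for d in range(dims)]
    # Dot product of each unit vector with the mean; lowest score is the outlier
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(scores, key=scores.get)

# Made-up toy vectors, NOT taken from the real model
toy_meals = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "chair": [0.1, 0.9],
}

result = odd_one_out(toy_meals)
print(result)  # chair
```

This also hints at why the result can be surprising: the answer depends entirely on where the vectors sit relative to their average, not on any category we have in mind.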

We can also do cool things by combining word vectors. We can solve analogies just like on a standardized test.

"*Woman* is to *man* as ________ is to *boy*"

In [None]:
bigmodel.most_similar(positive=['woman', 'boy'], negative=['man'])

"*Woman* is to *man* as ________ is to *king*"

In [None]:
bigmodel.most_similar(positive=['woman', 'king'], negative=['man'])
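
Under the hood, the analogy is vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and look for the word closest to the result. Here is a minimal sketch with made-up 3-dimensional vectors. The `toy` dictionary and the `cosine` helper are hypothetical, and gensim's own implementation differs in details (for example, it normalizes the vectors first).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up toy vectors, NOT taken from the real model
toy = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 0.0, 1.0],
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 1.0, 1.0],
    "chair": [0.1, 0.2, 0.1],
}

# woman - man + king
target = [w - m + k for w, m, k in zip(toy["woman"], toy["man"], toy["king"])]

# Leave out the three input words, then take the closest remaining word
candidates = {word: vec for word, vec in toy.items() if word not in ("woman", "man", "king")}
best = max(candidates, key=lambda word: cosine(candidates[word], target))
print(best)  # queen
```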

## Visualizing word vectors

In class, I showed you some plots of word vectors, where the 300 dimensions had been projected down to 2 dimensions. Here we will use a dimensionality reduction method to reduce our word vectors to 2 dimensions so that we can visualize them. The code in the cell below will take word pairs, in which the first word is related in some way to the second word, and plot them in two dimensions. Execute this cell, and you should see a nice graph underneath.

In [None]:
wordpairs = {"Madrid":"Spain", "Paris":"France",  "Berlin":"Germany", "Beijing":"China", "Tokyo":"Japan"}

# Go get the word vectors for these words and 
# then store them so you can use them later on.
vecwords = []  # stores the words above
vecs = []      # stores the vectors for each word
for k,v in wordpairs.items():
    kvec = bigmodel[k]
    vvec = bigmodel[v]
    vecs.append(kvec)
    vecwords.append(k)
    vecs.append(vvec)
    vecwords.append(v)
    
# PCA is a way to project multiple dimensions down to 
# fewer dimensions, which we are doing here so we can 
# visualize the word vectors.
pca = PCA(n_components=2, whiten=True)
vectors2d = pca.fit(vecs).transform(vecs)


# This is just some ugly matplotlib code for plotting
# the 2-D vectors and visualizing them with different colors.
for i, (point, word) in enumerate(zip(vectors2d, vecwords)):
    # Even positions hold the first word of each pair (red),
    # odd positions hold the second word (blue).
    color = 'r' if i % 2 == 0 else 'b'
    plt.scatter(point[0], point[1], c=color)
    
    plt.annotate(
            word, 
            xy=(point[0], point[1]),
            xytext=(7, 6),
            textcoords='offset points',
            ha='left' ,
            va='top',
            size="medium"
            )
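
If you are curious what PCA is doing, here is a rough sketch using only numpy: center the data, find the directions along which it varies the most, and project onto the top two. The `pca_2d` function is a hypothetical helper, the input is random numbers standing in for word vectors, and unlike the cell above it does not apply whitening.

```python
import numpy as np

def pca_2d(vectors):
    """Project vectors onto their top two principal components (no whitening)."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by variance explained
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# Random stand-in data: 10 fake "word vectors" with 300 dimensions each
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(10, 300))

points2d = pca_2d(fake_vectors)
print(points2d.shape)  # (10, 2)
```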


The cell below does something similar: it takes two lists of words and plots one in red and the other in blue. If the words within each list are closely related to one another, you should see the red and blue dots clustering in different parts of the space.

In [None]:
# Some words associated with 2 different categories: work and school

vecwords1 = "commute boss office paperwork".split()  
vecwords2 = "teacher studying library exams".split()
vecs = []
vecwords = []

# Get their vectors
for w in vecwords1:
    v = bigmodel[w]
    vecs.append(v)
    vecwords.append(w)

for w in vecwords2:
    v = bigmodel[w]
    vecs.append(v)
    vecwords.append(w)

    
#tsne = TSNE(n_components=2, random_state=0)
#vectors2d = tsne.fit_transform(vecs)

# Do the PCA to reduce to 2 dimensions
pca = PCA(n_components=2, whiten=True)
vectors2d = pca.fit(vecs).transform(vecs)

# Again, ugly matplotlib code to create visualization
for i, (point, word) in enumerate(zip(vectors2d, vecwords)):
    # The first len(vecwords1) words are the work words (red);
    # the rest are the school words (blue).
    color = 'r' if i < len(vecwords1) else 'b'
    plt.scatter(point[0], point[1], c=color)
    
    plt.annotate(
            word, 
            xy=(point[0], point[1]),
            xytext=(7, 6),
            textcoords='offset points',
            ha='left' ,
            va='top',
            size="medium"
            )