# Word Vectors, Part 1

In this notebook, you'll build word vectors using counts.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter

from scipy import sparse
from scipy.sparse.linalg import svds
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity

from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize

We'll work with the full text of *Moby Dick*.

In [2]:
with open('../data/moby_dick.txt') as fi:
    moby = fi.read()

Let's first create a Counter object, `moby_counter`, which stores the number of times each token appears in the book.

In [4]:
moby_words = [x.lower() for x in regexp_tokenize(moby, r'\w+')]
moby_counter = Counter(moby_words)

Since it will help us out later, we'll build a couple of dictionaries that will let us translate between tokens and indices.

In [5]:
word_index = {word: i for i, word in enumerate(moby_counter.keys())}
index_word = {i: word for i, word in enumerate(moby_counter.keys())}

In [6]:
word_index['whale']

59

In [8]:
index_word[59]

'whale'
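The two dictionaries are inverses of each other, which is easy to check on a small example. The sketch below uses a made-up token list (`toy_tokens` and the other `toy_` names are illustrative, not part of the notebook) and assumes Python 3.7+, where `Counter` keeps keys in first-seen order.

```python
from collections import Counter

# Hypothetical toy token list, standing in for moby_words
toy_tokens = ['call', 'me', 'ishmael', 'call', 'me']
toy_counter = Counter(toy_tokens)

# Same construction as above: one index per distinct token, in first-seen order
toy_word_index = {word: i for i, word in enumerate(toy_counter.keys())}
toy_index_word = {i: word for i, word in enumerate(toy_counter.keys())}

# Round trip: token -> index -> token gives back the original token
for word, i in toy_word_index.items():
    assert toy_index_word[i] == word

print(toy_word_index)
```

Because both dictionaries enumerate the same keys in the same order, an index looked up in one always maps back to the same token in the other.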

Now, let's split the text into sentences.

In [9]:
sentences = sent_tokenize(moby)

Our next goal is to create a Counter object, `coocurrence_counter`, whose keys are ordered pairs of tokens and whose values are the number of times the second token appears within 2 positions of the first in the same sentence.

To do this, fill in the following for loop.

In [None]:
window_size = 2

coocurrence_counter = Counter()

for sentence in sentences:
    # First, tokenize the sentence
    sentence = # Fill this in
    for i, word in enumerate(sentence):
        window = # Grab the two words before and the two words after as a list

        for other_word in window:
            coocurrence_counter[(word, other_word)] += 1

In [10]:
window_size = 2

coocurrence_counter = Counter()

for sentence in sentences:
    # First, tokenize the sentence
    sentence = [x.lower() for x in regexp_tokenize(sentence, r'\w+')]
    
    # Then, we'll build the window around each word
    for i, word in enumerate(sentence):
        window = sentence[max(0, i - window_size): i] + sentence[i + 1: i + 1 + window_size]

        # Then, we'll up the counter value for that pair
        for other_word in window:
            coocurrence_counter[(word, other_word)] += 1
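To see what the slicing does, here is a minimal sketch on a made-up five-token sentence (all `toy_` names are illustrative and not part of the notebook). The `max(0, ...)` guard matters because a negative start index is interpreted relative to the end of the list, which would silently drop left-hand neighbors near the start of the sentence.

```python
from collections import Counter

toy_window = 2
# Hypothetical pre-tokenized sentence (not taken from the book)
toy_sentence = ['the', 'white', 'whale', 'swam', 'away']

toy_pairs = Counter()
for i, word in enumerate(toy_sentence):
    # Up to toy_window tokens to the left, then up to toy_window to the right
    left = toy_sentence[max(0, i - toy_window): i]
    right = toy_sentence[i + 1: i + 1 + toy_window]
    for other_word in left + right:
        toy_pairs[(word, other_word)] += 1

# 'the' is two positions to the left of 'whale', so the pair is counted
print(toy_pairs[('whale', 'the')])
# 'the' and 'swam' are three positions apart, so the pair never appears
print(toy_pairs[('the', 'swam')])
```

Each pair is recorded in both orders, so the resulting counts are symmetric: `toy_pairs[(a, b)]` equals `toy_pairs[(b, a)]`.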

Now that we have our co-occurrence counts, we need to build our co-occurrence matrix.
For this task, we'll use a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_array.html#scipy-sparse-csr-array) from scipy.

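A sparse matrix can be built from a tuple of the form `(values, (row_indices, column_indices))`, where entry `k` places `values[k]` at row `row_indices[k]` and column `column_indices[k]`.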

In [20]:
row_idx = []
col_idx = []
counts = []

for (word1, word2) in coocurrence_counter.keys():
    row_idx.append(word_index[word1])
    col_idx.append(word_index[word2])
    counts.append(coocurrence_counter[(word1, word2)])
    
# We also need to add the diagonal entries (each word's total count)
for word in moby_counter:
    row_idx.append(word_index[word])
    col_idx.append(word_index[word])
    counts.append(moby_counter[word])

cooccurence_matrix = sparse.csc_matrix((counts, (row_idx, col_idx)), dtype = 'float')
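One detail of this constructor is worth seeing on a small case: if the same (row, column) position is listed more than once, the values are summed. That is relevant here because the diagonal entries are appended to lists that may already contain `(word, word)` pairs. The sketch below uses a hypothetical 3-word vocabulary; the `toy_` names are not part of the notebook.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3-word vocabulary; position (0, 1) is listed twice on purpose
toy_rows = [0, 1, 0, 2]
toy_cols = [1, 0, 1, 2]
toy_vals = [2.0, 2.0, 3.0, 5.0]

toy_matrix = sparse.csc_matrix((toy_vals, (toy_rows, toy_cols)),
                               shape=(3, 3), dtype='float')

# The two values given for (0, 1) are added together: 2.0 + 3.0
print(toy_matrix.toarray())
```

Passing `shape` explicitly is optional; without it, the shape is inferred from the largest row and column indices.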

You can extract the row at a particular index by using the `.getrow` method.

In [21]:
cooccurence_matrix.getrow(word_index['whale'])

<1x17429 sparse matrix of type '<class 'numpy.float64'>'
	with 1002 stored elements in Compressed Sparse Row format>

You can use the [`cosine_similarity`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) function to compute similarity between two vectors.

In [22]:
cosine_similarity(cooccurence_matrix.getrow(word_index['boat']),
                  cooccurence_matrix.getrow(word_index['ship']))

array([[0.34003176]])
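For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy on two made-up count vectors; `toy_a`, `toy_b`, and `toy_cosine` are illustrative names, and this is the standard definition rather than a look inside scikit-learn's implementation.

```python
import numpy as np

# Hypothetical count vectors for two words
toy_a = np.array([1.0, 2.0, 0.0])
toy_b = np.array([2.0, 1.0, 1.0])

def toy_cosine(u, v):
    # Dot product divided by the product of the Euclidean norms
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(toy_cosine(toy_a, toy_b))

# Rescaling a vector does not change the result: only direction matters
print(toy_cosine(toy_a, 3 * toy_a))
```

Because only the direction of each vector matters, a word's overall frequency does not by itself make it more or less similar to another word.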

**Question:** Which word is most similar to "ocean"?

In [23]:
# Your Code Here

In [29]:
word = 'ocean'
similarities = pd.DataFrame({
    'word': word_index.keys(),
    'similarity': cosine_similarity(
        cooccurence_matrix.getrow(word_index[word]),
        cooccurence_matrix)[0]
})

similarities.sort_values('similarity', ascending = False).head(5)

|       | word     | similarity |
|-------|----------|------------|
| 525   | ocean    | 1.0        |
| 10407 | divides  | 0.616405   |
| 12371 | overruns | 0.607599   |
| 10    | the      | 0.600684   |
| 2154  | inns     | 0.533608   |


Now, let's take the singular value decomposition to get lower-dimensional vectors. We can use the [`svds`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html) function from `scipy.sparse.linalg`, which computes only the `k` largest singular values and their vectors.

In [30]:
dimension = 50

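# svds returns the left singular vectors, a 1-D array of singular values,
# and the transposed right singular vectors
U, D, V = svds(cooccurence_matrix, k = dimension)

# Scale each column of U by its singular value to get the word vectors
word_vectors = U * D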
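The shapes involved, and what `U * D` does, are easier to follow on a small matrix. The sketch below runs `svds` on a hypothetical 6 x 5 random matrix (all `toy_` names are illustrative) and compares the result against NumPy's full dense SVD. The singular values are sorted before comparing, since the sketch does not rely on any particular ordering from `svds`.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Hypothetical small matrix standing in for the co-occurrence matrix
toy_rng = np.random.default_rng(0)
toy_dense = toy_rng.random((6, 5))
toy_sparse = sparse.csc_matrix(toy_dense)

toy_k = 2
toy_U, toy_D, toy_Vt = svds(toy_sparse, k=toy_k)

# One row of toy_U per matrix row; toy_D is a 1-D array of toy_k values
print(toy_U.shape, toy_D.shape, toy_Vt.shape)

# Broadcasting scales column j of toy_U by toy_D[j], like toy_U @ np.diag(toy_D)
toy_vectors = toy_U * toy_D
print(np.allclose(toy_vectors, toy_U @ np.diag(toy_D)))

# The toy_k values agree with the largest ones from a full dense SVD
toy_full = np.linalg.svd(toy_dense, compute_uv=False)
print(np.allclose(np.sort(toy_D)[::-1], toy_full[:toy_k]))
```

Each row of `toy_vectors` plays the role of one word vector: a `toy_k`-dimensional summary of the corresponding row of the original matrix.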

**Question:** Using these new word vectors, which word is most similar to "ocean"?

In [31]:
# Your Code Here

In [33]:
word = 'ocean'

similarities = pd.DataFrame({
    'word': word_index.keys(),
    'similarity': cosine_similarity(word_vectors[word_index[word], :].reshape(1, -1), word_vectors)[0]
})

similarities.sort_values('similarity', ascending = False).head(5)

|      | word  | similarity |
|------|-------|------------|
| 525  | ocean | 1.0        |
| 330  | sea   | 0.974975   |
| 10   | the   | 0.974912   |
| 1120 | line  | 0.972599   |
| 460  | seas  | 0.97148    |


Recall that the Positive Pointwise Mutual Information (PPMI) between two words is given by 

$$PPMI(w_1, w_2) = \max\left(0, \log_2\frac{P(w_1, w_2)}{P(w_1)\cdot P(w_2)}\right)$$

Write a function that takes as input two words and returns the PPMI between those words.

In [34]:
# Your Code Here

In [35]:
total_words = sum(moby_counter.values())
total_pairs = sum(coocurrence_counter.values())

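def ppmi(word1, word2):
    pair_count = coocurrence_counter[(word1, word2)]
    # A pair that never co-occurs has a PMI of -infinity, which PPMI clips to 0
    if pair_count == 0:
        return 0.0
    numerator = pair_count / total_pairs
    denominator = moby_counter[word1] * moby_counter[word2] / total_words**2
    return max(0.0, np.log2(numerator / denominator))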
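A worked example with round numbers may help make the formula concrete. The counts below are invented for illustration (all `toy_` names are hypothetical); the first pair co-occurs more often than independence would predict, and the second less often.

```python
import math

# Hypothetical totals, chosen so the arithmetic comes out round
toy_total_pairs = 100
toy_total_words = 200
toy_count_w1, toy_count_w2 = 10, 20      # P(w1) = 0.05, P(w2) = 0.10

def toy_ppmi(toy_pair_count):
    p_joint = toy_pair_count / toy_total_pairs
    p_independent = (toy_count_w1 / toy_total_words) * (toy_count_w2 / toy_total_words)
    # PMI is the log-ratio of observed to expected co-occurrence probability
    pmi = math.log2(p_joint / p_independent)
    # PPMI clips negative values to zero
    return max(0.0, pmi)

# Observed 0.04 vs expected 0.005: ratio 8, so the PMI is about log2(8) = 3
print(toy_ppmi(4))

# Observed 0.0025 vs expected 0.005: ratio 0.5, PMI about -1, clipped to 0
print(toy_ppmi(0.25))
```

A positive value means the two words appear together more often than their individual frequencies alone would suggest.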

Now, use this function to build a PPMI matrix.

In [37]:
# Your Code Here

In [38]:
row_idx = []
col_idx = []
ppmis = []

for (word1, word2) in coocurrence_counter.keys():
    row_idx.append(word_index[word1])
    col_idx.append(word_index[word2])
    ppmis.append(ppmi(word1, word2))

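# Pass the shape explicitly so that every vocabulary word gets a row and a column
n_words = len(word_index)
ppmi_matrix = sparse.csc_matrix((ppmis, (row_idx, col_idx)), shape = (n_words, n_words), dtype = 'float')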

Apply singular value decomposition to this matrix to get 50-dimensional word vectors.

In [39]:
# Your Code Here

In [62]:
dimension = 50

U, D, V = svds(ppmi_matrix, k = dimension)

word_vectors_ppmi = U * D

**Question:** How similar are "sea" and "ocean" using these vectors?

In [63]:
# Your Code Here

In [64]:
def word_similarity(word1, word2):
    return cosine_similarity(word_vectors_ppmi[word_index[word1], :].reshape(1, -1),
                             word_vectors_ppmi[word_index[word2], :].reshape(1, -1))[0][0]

In [65]:
word_similarity('sea', 'ocean')

0.7506885163022181