# Word Vectors, Part 1

In this notebook, you'll build word vectors using counts.

In [None]:
import pandas as pd
from collections import Counter

from scipy import sparse
from scipy.sparse.linalg import svds
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity

from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize

We'll be working with the full text of Moby Dick.

In [None]:
with open('moby_dick.txt', encoding = 'utf-8') as fi:
    moby = fi.read()

Let's first create a Counter object, `moby_counter`, which records the number of times each token appears in the book.

In [None]:
moby_words = [x.lower() for x in regexp_tokenize(moby, r'\w+')]
moby_counter = Counter(moby_words)
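
To see what this kind of counter holds, here is a minimal sketch on a made-up two-sentence text. The names `toy_text`, `toy_words`, and `toy_counter` are illustrative only, and `re.findall` stands in for `regexp_tokenize` so that the example runs without NLTK; for the pattern `\w+` the two should produce the same tokens.

```python
import re
from collections import Counter

# A made-up text, used only for illustration
toy_text = "Call me Ishmael. Call me whatever you like."

# Lowercase every run of word characters, mirroring the tokenization above
toy_words = [w.lower() for w in re.findall(r"\w+", toy_text)]

# Map each distinct token to the number of times it occurs
toy_counter = Counter(toy_words)

print(toy_counter["call"])     # count for a token that appears in the text
print(toy_counter["whale"])    # a Counter returns 0 for tokens it has never seen
```

Because a missing key returns 0 instead of raising an error, a Counter is convenient for counts that are built up incrementally.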

Since it will help us out later, we'll build a couple of dictionaries that will let us translate between tokens and indices.

In [None]:
word_index = {word: i for i, word in enumerate(moby_counter.keys())}
index_word = {i: word for i, word in enumerate(moby_counter.keys())}

In [None]:
word_index['whale']

In [None]:
index_word[59]

Now, let's split the text into sentences.

In [None]:
sentences = sent_tokenize(moby)

Our next goal is to create a Counter object, `cooccurrence_counter`, whose keys are pairs of tokens and whose values are the number of times the two tokens appear within 2 positions of each other in a sentence (a window size of 2).

To do this, fill in the following for loop.

In [None]:
window_size = 2

cooccurrence_counter = Counter()

for sentence in sentences:
    sentence = [x.lower() for x in regexp_tokenize(sentence, r'\w+')]
    for i, word in enumerate(sentence):
        window = # Grab the two words before and the two words after as a list

        for other_word in window:
            cooccurrence_counter[(word, other_word)] += 1

# Also, add the counts for each word with itself
for word in moby_counter.keys():
    cooccurrence_counter[(word, word)] += moby_counter[word]
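
One possible way to build the window is with list slicing. The sketch below runs on a single made-up, already tokenized sentence; `toy_sentence` and `toy_cooccurrence` are illustrative names, and this is only one of several reasonable ways to fill in the loop.

```python
from collections import Counter

window_size = 2

# A made-up, already tokenized sentence
toy_sentence = ["the", "white", "whale", "swam", "away"]
toy_cooccurrence = Counter()

for i, word in enumerate(toy_sentence):
    # Up to window_size tokens before position i (clamped at the sentence start)
    before = toy_sentence[max(0, i - window_size):i]
    # Up to window_size tokens after position i (slicing past the end is safe)
    after = toy_sentence[i + 1:i + 1 + window_size]

    for other_word in before + after:
        toy_cooccurrence[(word, other_word)] += 1

print(toy_cooccurrence[("whale", "the")])   # two positions apart, so counted
print(toy_cooccurrence[("the", "swam")])    # three positions apart, so not counted
```

The `max(0, ...)` guard matters because a negative start index would wrap around to the end of the list.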

Now that we have our cooccurrence counts, we need to build our cooccurrence matrix.
For this task, we'll use a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_array.html#scipy-sparse-csr-array) from scipy.

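A sparse matrix can be created by passing in a tuple `(values, (row_indices, column_indices))`: entry `k` of `values` is placed at row `row_indices[k]` and column `column_indices[k]`.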

In [None]:
row_idx = []
col_idx = []
counts = []

for (word1, word2) in cooccurrence_counter.keys():
    row_idx.append(word_index[word1])
    col_idx.append(word_index[word2])
    counts.append(cooccurrence_counter[(word1, word2)])

# The diagonal entries were already added to cooccurrence_counter in the loop above,
# so they are not appended again here (duplicate entries would be summed).

# CSR format stores the matrix row by row, which makes extracting a word's row efficient
cooccurrence_matrix = sparse.csr_matrix((counts, (row_idx, col_idx)), dtype = 'float')
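
As a small illustration of the `(values, (row indices, column indices))` constructor, the sketch below builds a 2 x 3 matrix from made-up numbers (`toy_values`, `toy_rows`, `toy_cols`, and `toy_matrix` are illustrative names). One coordinate is deliberately listed twice to show that repeated entries are added together, which matters whenever the same (row, column) pair is appended more than once.

```python
from scipy import sparse

# Made-up entries; the coordinate (0, 0) is listed twice on purpose
toy_values = [2.0, 1.0, 3.0, 4.0]
toy_rows = [0, 0, 1, 0]
toy_cols = [0, 2, 1, 0]

toy_matrix = sparse.csr_matrix((toy_values, (toy_rows, toy_cols)), shape=(2, 3))

# Convert to a dense array only for display; the two (0, 0) entries are summed
print(toy_matrix.toarray())
```

Passing `shape` explicitly is optional; without it, scipy infers the shape from the largest row and column index.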

You can extract the row at a particular index using the `.getrow` method.

In [None]:
cooccurrence_matrix.getrow(word_index['whale'])

You can use the [`cosine_similarity`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) function to compute similarity between two vectors.

In [None]:
cosine_similarity(cooccurrence_matrix.getrow(word_index['boat']),
                  cooccurrence_matrix.getrow(word_index['ship']))
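
`cosine_similarity` can also compare one row against every row of a matrix in a single call, which is handy for ranking words. The sketch below uses a made-up 3 x 3 count matrix; `toy_vocab`, `toy_counts`, and the numbers in it are illustrative assumptions, not values from the book.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Made-up vocabulary and count matrix (one row per word)
toy_vocab = ["sea", "ocean", "harpoon"]
toy_counts = sparse.csr_matrix(np.array([
    [4.0, 3.0, 0.0],
    [3.0, 4.0, 1.0],
    [0.0, 1.0, 5.0],
]))

query = toy_vocab.index("sea")

# Similarity of the query row to every row; the result has shape (1, 3), so flatten it
sims = cosine_similarity(toy_counts.getrow(query), toy_counts).ravel()

# Exclude the query itself, which always has similarity 1 with itself
sims[query] = -np.inf

best = int(np.argmax(sims))
print(toy_vocab[best], sims[best])
```

To list several neighbours instead of just one, `np.argsort(sims)[::-1]` gives the indices in order of decreasing similarity.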

**Question:** Which word is most similar to "ocean"?

In [None]:
# Your Code Here

Now, let's take the singular value decomposition to get lower-dimensional vectors. We can use the [`svds`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html) function from `scipy.sparse.linalg`, which computes only the `k` largest singular values and their vectors.

In [None]:
dimension = 50

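# svds returns the left singular vectors (U), the singular values (D),
# and the transposed right singular vectors (V)
U, D, V = svds(cooccurrence_matrix, k = dimension)

# Scale each column of U by its singular value; row i is the vector for index_word[i]
word_vectors = U * D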
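
To make the shapes concrete, here is a minimal sketch of the same steps on a small made-up matrix (`toy`, `toy_U`, `toy_D`, `toy_Vt`, and `toy_vectors` are illustrative names). It also reorders the columns by decreasing singular value, on the assumption that `svds` does not promise a particular ordering.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# A made-up 20 x 20 count matrix, used only to illustrate the shapes
rng = np.random.default_rng(0)
toy = sparse.csr_matrix(rng.poisson(1.0, size=(20, 20)).astype(float))

toy_U, toy_D, toy_Vt = svds(toy, k=3)
print(toy_U.shape, toy_D.shape, toy_Vt.shape)   # (20, 3) (3,) (3, 20)

# Broadcasting multiplies column j of toy_U by toy_D[j],
# which is the same as toy_U @ np.diag(toy_D)
scaled = toy_U * toy_D

# Put the columns in order of decreasing singular value
order = np.argsort(toy_D)[::-1]
toy_vectors = scaled[:, order]
```

Reordering the columns does not change cosine similarities between rows; it only makes the first dimension correspond to the largest singular value.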

**Question:** Using these new word vectors, which word is most similar to "ocean"?

In [None]:
# Your Code Here

Recall that the Positive Pointwise Mutual Information (PPMI) between two words is given by

$$\text{PPMI}(w_1, w_2) = \max\left(\log_2\frac{P(w_1, w_2)}{P(w_1)\cdot P(w_2)},\ 0\right)$$

Write a function that takes as input two words and returns the PPMI between those words.

**Hint:** You will need to find the total number of words and the total number of pairs of words in order to compute the PPMI.
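
As a worked example of the formula, the sketch below computes PPMI from made-up counts. It assumes that $P(w)$ is estimated as the word's count divided by the total number of words and that $P(w_1, w_2)$ is the pair's count divided by the total number of pairs; `toy_word_counts`, `toy_pair_counts`, and `toy_ppmi` are illustrative names, and other estimates are possible.

```python
import math
from collections import Counter

# Made-up counts, used only for illustration
toy_word_counts = Counter({"sea": 4, "ocean": 2, "the": 10})
toy_pair_counts = Counter({
    ("sea", "ocean"): 3, ("ocean", "sea"): 3,
    ("the", "sea"): 1, ("sea", "the"): 1,
})

total_words = sum(toy_word_counts.values())   # 16
total_pairs = sum(toy_pair_counts.values())   # 8


def toy_ppmi(w1, w2):
    """PPMI of (w1, w2) under the count-based probability estimates above."""
    joint = toy_pair_counts[(w1, w2)] / total_pairs
    if joint == 0:
        # The log of 0 is undefined; an unseen pair is assigned a PPMI of 0
        return 0.0
    p1 = toy_word_counts[w1] / total_words
    p2 = toy_word_counts[w2] / total_words
    # Clip negative PMI values to 0
    return max(math.log2(joint / (p1 * p2)), 0.0)


print(toy_ppmi("sea", "ocean"))   # positive: the pair occurs more often than chance
print(toy_ppmi("the", "sea"))     # PMI is negative here, so PPMI is clipped to 0
```

For ("sea", "ocean") the ratio is $0.375 / (0.25 \cdot 0.125) = 12$, so the PPMI is $\log_2 12 \approx 3.58$.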

In [None]:
# Your Code Here

Now, use this function to build a PPMI matrix.

In [None]:
# Your Code Here
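
One way to keep the PPMI matrix sparse is to loop only over pairs that were actually observed, since every unseen pair has a PPMI of 0. The sketch below does this for made-up counts and inlines the PPMI calculation rather than calling a separate function; all names prefixed with `toy_` are illustrative, and the probability estimates (counts divided by totals) are an assumption.

```python
import math
from collections import Counter
from scipy import sparse

# Made-up counts, used only for illustration
toy_word_counts = Counter({"sea": 4, "ocean": 2, "the": 10})
toy_pair_counts = Counter({
    ("sea", "ocean"): 3, ("ocean", "sea"): 3,
    ("the", "sea"): 1, ("sea", "the"): 1,
})
toy_index = {word: i for i, word in enumerate(toy_word_counts)}

total_words = sum(toy_word_counts.values())
total_pairs = sum(toy_pair_counts.values())

rows, cols, vals = [], [], []
for (w1, w2), n in toy_pair_counts.items():
    joint = n / total_pairs
    expected = (toy_word_counts[w1] / total_words) * (toy_word_counts[w2] / total_words)
    pmi = math.log2(joint / expected)
    # Store only positive values; everything else stays an implicit zero
    if pmi > 0:
        rows.append(toy_index[w1])
        cols.append(toy_index[w2])
        vals.append(pmi)

size = len(toy_index)
toy_ppmi_matrix = sparse.csr_matrix((vals, (rows, cols)), shape=(size, size))
print(toy_ppmi_matrix.toarray())
```

Looping over observed pairs scales with the number of distinct pairs, not with the square of the vocabulary size.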

Apply singular value decomposition to this matrix to get 50-dimensional word vectors.

In [None]:
# Your Code Here

How similar are "sea" and "ocean" using these vectors?

In [None]:
# Your Code Here

**Bonus:** While we can get some reasonable word vectors using just Moby Dick, we could probably do much better with a larger corpus. Build word vectors using the larger corpus of Project Gutenberg books and explore the results.