In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from scipy import spatial
%matplotlib inline

# Project 3: Word2Vec (70 pt)
The goal of this project is to obtain the vector representations for words from text.

The main idea is that words appearing in similar contexts have similar meanings. Because of that, word vectors of similar words should be close together. Models that use word vectors can utilize these properties, e.g., in sentiment analysis a model will learn that "good" and "great" are positive words, but will also generalize to other words that it has not seen (e.g. "amazing") because they should be close together in the vector space.

Vectors can keep other language properties as well, like analogies. The question "a is to b as c is to ...?", where the answer is d, can be answered by looking into word vector space and calculating $\mathbf{u}_b - \mathbf{u}_a + \mathbf{u}_c$, and finding the word vector that is the closest to the result.

## Your task
Complete the missing code in this notebook. Make sure that all the functions follow the provided specification, i.e. the output of the function exactly matches the description in the docstring. 

We are given a text that contains $N$ unique words $\{ x_1, ..., x_N \}$. We will focus on the Skip-Gram model in which the goal is to predict the context window $S = \{ x_{i-l}, ..., x_{i-1}, x_{i+1}, ..., x_{i+l} \}$ from current word $x_i$, where $l$ is the window size. 

We get a word embedding $\mathbf{u}_i$ by multiplying the matrix $\mathbf{U}$ with a one-hot representation $\mathbf{x}_i$ of a word $x_i$. Then, to get output probabilities for context window, we multiply this embedding with another matrix $\mathbf{V}$ and apply softmax. The objective is to minimize the loss: $-\mathop{\mathbb{E}}[P(S|x_i;\mathbf{U}, \mathbf{V})]$.

You are given a dataset with positive and negative reviews. Your task is to:
+ Construct input-output pairs corresponding to the current word and a word in the context window
+ Implement forward and backward propagation with parameter updates for Skip-Gram model
+ Train the model
+ Test it on word analogies and sentiment analysis task

## General remarks
Do not add or modify any code outside of the following comment blocks, or where otherwise explicitly stated.

``` python
##########################################################
# YOUR CODE HERE
...
##########################################################
```
After you fill in all the missing code, restart the kernel and re-run all the cells in the notebook.

The following things are **NOT** allowed:
- Using additional `import` statements
- Copying / reusing code from other sources (e.g. code by other students)

If you plagiarise even for a single project task, you won't be eligible for the bonus this semester.

# 1. Load data (5 pts)

We'll be working with a subset of reviews for restaurants in Las Vegas. The reviews that we'll be working with are either 1-star or 5-star. You can download the used data set (`task03_data.npy`) from:

* ([download link](https://syncandshare.lrz.de/getlink/fiQWKmLp3RmNbJEoLtBr3DFu/task03_data.npy)) the preprocessed set of 1-star and 5-star reviews 

In [4]:
data = np.load("task03_data.npy", allow_pickle=True)
reviews_1star = [[x.lower() for x in s] for s in data.item()["reviews_1star"]]
reviews_5star = [[x.lower() for x in s] for s in data.item()["reviews_5star"]]

In [22]:
vocabulary = [x for s in reviews_1star + reviews_5star for x in s]
vocabulary, counts = zip(*Counter(vocabulary).most_common(500))

We generate the vocabulary by taking the top 500 words by their frequency from both positive and negative sentences. We could also use the whole vocabulary, but that would be slower.

In [6]:
VOCABULARY_SIZE = len(vocabulary)
EMBEDDING_DIM = 100

VOCABULARY_SIZE

500

In [7]:
print('Number of positive reviews:', len(reviews_1star))
print('Number of negative reviews:', len(reviews_5star))
print('Number of unique words:', VOCABULARY_SIZE)

Number of positive reviews: 1000
Number of negative reviews: 2000
Number of unique words: 500


You have to create two dictionaries: `word_to_ind` and `ind_to_word` so we can go from text to numerical representation and vice versa. The input into the model will be the index of the word denoting the position in the vocabulary.

In [32]:
"""
Implement
---------
word_to_ind: dict
    The keys are words (str) and the value is the corresponding position in the vocabulary
ind_to_word: dict
    The keys are indices (int) and the value is the corresponding word from the vocabulary
ind_to_freq: dict
    The keys are indices (int) and the value is the corresponding count in the vocabulary
"""

##########################################################
# YOUR CODE HERE

#Version 1
# ind_to_word = {}
# word_to_ind = {}
# ind_to_freq = {}
# for index, word in enumerate(vocabulary):
#     word_to_ind[word] = index
#     ind_to_word[index] = word
#     ind_to_freq[index] = counts[index]


# Version 2: But it seems that we are supposed to use Counter for frequency
ind_to_word = {ind: word for ind, word in enumerate(vocabulary)}
word_to_ind = {word: ind for ind, word in enumerate(vocabulary)}
# ind_to_freq = {ind: freq for ind, freq in enumerate(dict(Counter(counts)))}
# But also seems no need to use Counter
ind_to_freq = {ind: freq for ind, freq in enumerate(counts)}
##########################################################

'\nImplement\n---------\nword_to_ind: dict\n    The keys are words (str) and the value is the corresponding position in the vocabulary\nind_to_word: dict\n    The keys are indices (int) and the value is the corresponding word from the vocabulary\nind_to_freq: dict\n    The keys are indices (int) and the value is the corresponding count in the vocabulary\n'

In [33]:
print('Word \"%s\" is at position %d appearing %d times' % 
      (ind_to_word[word_to_ind['the']], word_to_ind['the'], ind_to_freq[word_to_ind['the']]))

Word "the" is at position 0 appearing 2017 times


# 2. Create word pairs (10pts)

We need all the word pairs $\{ x_i, x_j \}$, where $x_i$ is the current word and $x_j$ is from its context window. These will correspond to input-output pairs. We want them to be represented numericaly so you should use `word_to_ind` dictionary.

In [34]:
def get_window(sentence, window_size):
    sentence = [x for x in sentence if x in vocabulary]
    pairs = []

    """
    Iterate over all the sentences
    Take all the words from (i - window_size) to (i + window_size) and save them to pairs
    
    Parameters
    ----------
    sentence: list
        A list of sentences, each sentence containing a list of words of str type
    window_size: int
        A positive scalar
        
    Returns
    -------
    pairs: list
        A list of tuple (word index, word index from its context) of int type
    """

    ##########################################################
    # YOUR CODE HERE
    
    
    for ind, word in enumerate(sentence):
        start = max(0, ind-window_size)
        end = min(len(sentence), ind+window_size+1)
        for j in range(start, ind):
            pairs.append((word_to_ind[word], word_to_ind[sentence[j]]))
        for j in range(ind+1, end):
            pairs.append((word_to_ind[word], word_to_ind[sentence[j]]))
    ##########################################################

    return pairs

In [35]:
data = []
for x in reviews_1star + reviews_5star:
    data += get_window(x, window_size=3)
data = np.array(data)

print('First 5 pairs:', data[:5].tolist())
print('Total number of pairs:', data.shape[0])

First 5 pairs: [[10, 6], [10, 64], [10, 320], [6, 10], [6, 64]]
Total number of pairs: 152322


We calculate a weighting score to counter the imbalance between the rare and frequent words. Rare words will be sampled more frequently. See https://arxiv.org/pdf/1310.4546.pdf

In [36]:
probabilities = [1 - np.sqrt(1e-3 / ind_to_freq[x]) for x in data[:,0]]
probabilities /= np.sum(probabilities)

# 3. Model definition (45 pts)

In this part you should implement forward and backward propagation together with update of the parameters i.e.:
+ One-hot encoding of the words(5 pts)
+ Loss implementation & computation (10 pts)
+ Softmax (5 pts)
+ Forward pass (10 pts)
+ Backward pass (10 pts)
+ Parameter update (5 pts)

In [37]:
class Embedding():
    def __init__(self, N, D, seed=None):
        """
        Parameters
        ----------
        N: int
            Number of unique words in the vocabulary
        D: int
            Dimension of the word vector embedding
        seed: int
            Sets the random seed, if omitted weights will be random
        """

        self.N = N
        self.D = D
        
        self.init_weights(seed)
    
    def init_weights(self, seed=None):
        if seed is not None:
            np.random.seed(seed)

        """
        We initialize weight matrices U and V of dimension (D, N) and (N, D) respectively
        """
        self.U = np.random.normal(0, np.sqrt(2. / self.D / self.N), (self.D, self.N))
        self.V = np.random.normal(0, np.sqrt(2. / self.D / self.N), (self.N, self.D))

    def one_hot(self, x, N):
        """
        Given a vector returns a matrix with rows corresponding to one-hot encoding
        
        Parameters
        ----------
        x: array
            M-dimensional vector containing integers from [0, N]
        N: int
            Number of posible classes
        
        Returns
        -------
        one_hot: array
            (N, M) matrix where each column is N-dimensional one-hot encoding of elements from x 
        """

        ##########################################################
        
        # YOUR CODE HERE
        M = x.shape[0]
        one_hot = np.zeros((N, M))
        for i in range(M):
            one_hot[x[i]-1][i] = 1
        ##########################################################

        assert one_hot.shape == (N, x.shape[0])
        return one_hot

    def loss(self, y, prob):
        """
        Parameters
        ----------
        y: array
            (N, M) matrix of M samples where columns are one-hot vectors for true values
        prob: array
            (N, M) column of M samples where columns are probability vectors after softmax

        Returns
        -------
        loss: int
            Cross-entropy loss calculated as: 1 / M * sum_i(sum_j(y_ij * log(prob_ij)))
        """

        ##########################################################
        # YOUR CODE HERE
        M = y.shape[1]
        
        loss = - np.multiply(y, np.log(prob)).sum() / M
        ##########################################################
        
        return loss
    
    def softmax(self, x, axis):
        """
        Parameters
        ----------
        x: array
            A non-empty matrix of any dimension
        axis: int
            Dimension on which softmax is performed
            
        Returns
        -------
        y: array
            Matrix of same dimension as x with softmax applied to 'axis' dimension
        """
        
        ##########################################################
        # YOUR CODE HERE
        x_exp = np.exp(x)
        den = x_exp.sum(axis=axis)
        y = x_exp / den
        ##########################################################

        return y
    
    def step(self, x, y, learning_rate=1e-3):
        """
        Performs forward and backward propagation and updates weights
        
        Parameters
        ----------
        x: array
            M-dimensional mini-batched vector containing input word indices of int type
        y: array
            Output words, same dimension and type as 'x'
        learning_rate: float
            A positive scalar determining the update rate
            
        Returns
        -------
        loss: float
            Cross-entropy loss
        d_U: array
            Partial derivative of loss w.r.t. U
        d_V: array
            Partial derivative of loss w.r.t. V
        """
        
        # Input transformation
        """
        Input is represented with M-dimensional vectors
        We convert them to (N, M) matrices such that columns are one-hot 
        representations of the input
        """
        x = self.one_hot(x, self.N) # size N*M
        y = self.one_hot(y, self.N) # size N*M

        
        # Forward propagation
        """
        Returns
        -------
        embedding: array
            (D, M) matrix where columns are word embedding from U matrix
        logits: array
            (N, M) matrix where columns are output logits
        prob: array
            (N, M) matrix where columns are output probabilities
        """
        
        ##########################################################
        # YOUR CODE HERE
        # note that
        # self.U with size (D, N)
        # self.V with size (N, D)
        
        embedding = np.dot(self.U, x)
        logits = np.dot(self.V, embedding)
        prob = self.softmax(logits, axis=0)
        
        ##########################################################

        assert embedding.shape == (self.D, x.shape[1])
        assert logits.shape == (self.N, x.shape[1])
        assert prob.shape == (self.N, x.shape[1])
    
        # Loss calculation
        """
        Returns
        -------
        loss: int
            Cross-entropy loss using true values and probabilities
        """
        ##########################################################
        # YOUR CODE HERE
        loss = self.loss(y, prob)
        ##########################################################

        # Backward propagation
        """
        Returns
        -------
        d_U: array
            (N, D) matrix of partial derivatives of loss w.r.t. U
        d_V: array
            (D, N) matrix of partial derivatives of loss w.r.t. V
        """
        ##########################################################
        # YOUR CODE HERE
        d_V = (prob-y).dot(embedding.T)
        d_U = self.V.T.dot((prob-y).dot(x.T))
        ##########################################################
        
        assert d_V.shape == (self.N, self.D)
        assert d_U.shape == (self.D, self.N)

        # Update the parameters
        """
        Updates the weights with gradient descent such that W_new = W - alpha * dL/dW, 
        where alpha is the learning rate and dL/dW is the partial derivative of loss w.r.t. 
        the weights W
        """
        ##########################################################
        # YOUR CODE HERE
        self.U -= learning_rate * d_U
        self.V -= learning_rate * d_V
        ##########################################################

        return loss, d_U, d_V

## 3.1 Gradient check

The following code checks whether the updates for weights are implemented correctly. It should run without an error.

In [38]:
def get_loss(model, old, variable, epsilon, x, y, i, j):
    delta = np.zeros_like(old)
    delta[i, j] = epsilon

    model.init_weights(seed=132) # reset weights
    setattr(model, variable, old + delta) # change one weight by a small amount
    loss, _, _ = model.step(x, y) # get loss

    return loss

def gradient_check_for_weight(model, variable, i, j, k, l):
    x, y = np.array([i]), np.array([j]) # set input and output
    
    old = getattr(model, variable)
    
    model.init_weights(seed=132) # reset weights
    _, d_U, d_V = model.step(x, y) # get gradients with backprop
    grad = { 'U': d_U, 'V': d_V }
    
    eps = 1e-4
    loss_positive = get_loss(model, old, variable, eps, x, y, k, l) # loss for positive change on one weight
    loss_negative = get_loss(model, old, variable, -eps, x, y, k, l) # loss for negative change on one weight
    
    true_gradient = (loss_positive - loss_negative) / 2 / eps # calculate true derivative wrt one weight

    assert abs(true_gradient - grad[variable][k, l]) < 1e-5 # require that the difference is small

def gradient_check():
    N, D = VOCABULARY_SIZE, EMBEDDING_DIM
    model = Embedding(N, D)

    # check for V
    for _ in range(20):
        i, j, k = [np.random.randint(0, d) for d in [N, N, D]] # get random indices for input and weights
        gradient_check_for_weight(model, 'V', i, j, i, k)

    # check for U
    for _ in range(20):
        i, j, k = [np.random.randint(0, d) for d in [N, N, D]]
        gradient_check_for_weight(model, 'U', i, j, k, i)

    print('Gradients checked - all good!')

gradient_check()

Gradients checked - all good!


# 4. Training

We train our model using stochastic gradient descent. At every step we sample a mini-batch from data and update the weights.

The following function samples words from data and creates mini-batches. It subsamples frequent words based on previously calculated probabilities.

In [39]:
def get_batch(data, size, prob):
    i = np.random.choice(data.shape[0], size, p=prob)
    return data[i, 0], data[i, 1]

Training the model can take some time so plan accordingly.

In [40]:
np.random.seed(123)
model = Embedding(N=VOCABULARY_SIZE, D=EMBEDDING_DIM)

losses = []

MAX_ITERATIONS = 150000
PRINT_EVERY = 10000

for i in range(MAX_ITERATIONS):
    x, y = get_batch(data, 128, probabilities)
    loss, _, _ = model.step(x, y, 1e-2)
    losses.append(loss)

    if (i + 1) % PRINT_EVERY == 0:
        print('Iteration:', i + 1, 'Loss:', np.mean(losses[-PRINT_EVERY:]))

Iteration: 10000 Loss: 5.188057727781102
Iteration: 20000 Loss: 4.968123764064562
Iteration: 30000 Loss: 4.877519371601387
Iteration: 40000 Loss: 4.8171588197707
Iteration: 50000 Loss: 4.774618553000592
Iteration: 60000 Loss: 4.748090886455194
Iteration: 70000 Loss: 4.733156209061149
Iteration: 80000 Loss: 4.722194902183064
Iteration: 90000 Loss: 4.7150081182584636
Iteration: 100000 Loss: 4.71196030097628
Iteration: 110000 Loss: 4.707932552891967
Iteration: 120000 Loss: 4.70409871477294
Iteration: 130000 Loss: 4.704832245718909
Iteration: 140000 Loss: 4.702872467912909
Iteration: 150000 Loss: 4.699801183471255


The embedding matrix is given by $\mathbf{U}^T$, where the $i$th row is the vector for $i$th word in the vocabulary.

In [41]:
emb_matrix = model.U.T
emb_matrix

array([[-0.1692214 ,  0.2357828 ,  0.01328998, ..., -0.07980522,
        -0.01014586, -0.18005849],
       [ 0.04080247,  0.0569008 ,  0.17361053, ...,  0.20508263,
         0.11050595, -0.23530198],
       [-0.07614155,  0.06376927,  0.27290664, ...,  0.3755116 ,
         0.0059213 , -0.26390035],
       ...,
       [ 0.52681371,  0.1099546 ,  0.56406356, ...,  0.03838889,
         0.59215841,  0.32092684],
       [-0.00341859, -0.40480223,  0.47133028, ...,  0.59510354,
        -0.17981483, -0.53306781],
       [-0.08016061, -0.0148033 ,  0.25950851, ...,  0.11054143,
         0.05997889, -0.10148754]])

# 5. Analogies (10 pts)

As mentioned before, vectors can keep some language properties like analogies. Given a relation a:b and a query c, we can find d such that c:d follows the same relation. We hope to find d by using vector operations. In this case, finding the real word vector $\mathbf{u}_d$ closest to $\mathbf{u}_b - \mathbf{u}_a + \mathbf{u}_c$ gives us d. Note that the quality of the analogy results is not expected to be excellent.

In [44]:
triplets = [['go', 'going', 'come'], ['look', 'looking', 'come'], ['i', 'you', 'we'], 
            ['what', 'that', 'when'], ['find', 'found', 'enjoy']]

for triplet in triplets:
    a, b, c = triplet

    """
    Returns
    -------
    candidates: list
        A list of 5 closest words, measured with cosine similarity, to the vector u_b - u_a + u_c
    """
    ##########################################################
    # YOUR CODE HERE
    candidates = []

    emb_new = emb_matrix[word_to_ind[b]] - emb_matrix[word_to_ind[a]] + emb_matrix[word_to_ind[c]]

    dis=[]
    for emb in emb_matrix:
      dis.append(spatial.distance.cosine(emb, emb_new))
    
    index = np.argsort(dis)[0:5]
    for ind in index:
      candidates.append(ind_to_word[ind])

    ##########################################################
    
    print('%s is to %s as %s is to [%s]' % (a, b, c, '|'.join(candidates)))    

'\n    Returns\n    -------\n    candidates: list\n        A list of 5 closest words, measured with cosine similarity, to the vector u_b - u_a + u_c\n    '

go is to going as come is to [come|going|take|pho|it's]


'\n    Returns\n    -------\n    candidates: list\n        A list of 5 closest words, measured with cosine similarity, to the vector u_b - u_a + u_c\n    '

look is to looking as come is to [looking|come|whole|well|worth]


'\n    Returns\n    -------\n    candidates: list\n        A list of 5 closest words, measured with cosine similarity, to the vector u_b - u_a + u_c\n    '

i is to you as we is to [you|we|there's|but|tasted]


'\n    Returns\n    -------\n    candidates: list\n        A list of 5 closest words, measured with cosine similarity, to the vector u_b - u_a + u_c\n    '

what is to that as when is to [when|that|taste|has|both]


'\n    Returns\n    -------\n    candidates: list\n        A list of 5 closest words, measured with cosine similarity, to the vector u_b - u_a + u_c\n    '

find is to found as enjoy is to [found|enjoy|town|all|asian]
