# Semantic word representations

Semantic word representations are vectors learned to capture the 'meaning' of a word from the contexts in which it occurs. They are low-dimensional vectors that encode some semantic properties of words. In this notebook we build state-of-the-art semantic word representations using the `word2vec` modelling approach, and then use these vectors in some tasks to understand their utility. Later we will also learn how to use pre-trained representations (BERT) in those tasks.

We begin by loading the libraries needed to build our model. We use [PyTorch](https://pytorch.org/), an open source deep learning platform, as the backbone library in this course.

In [1]:
import torch
from torch.autograd import Variable
import numpy as np
import torch.nn.functional as F
from scipy.spatial.distance import euclidean, cosine
from tqdm import tqdm 
import codecs
from sklearn.metrics.pairwise import cosine_distances
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/billchen/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
corpus = [
    'he is a king',
    'she is a queen',
    'he is a Man',
    'she Is a woman',
    'london is, the capital of England',
    'Berlin is ... the capital of germany',
    'paris is the capital of france.',
    'He will eat cake, pie, and/or brownies',
    "she didn't like the brownies"
]

We will reuse the pre-processing and vocabulary methods we have created in the previous lab.

In [3]:
import string
import re
from nltk import word_tokenize

def preprocess_corpus(corpus):
  tokenized_corpus = [] 
  for sentence in corpus:
    sentence = " ".join(word_tokenize(sentence))
    tokenized_sentence = []
    for token in sentence.split(' '):
      token = token.lower()
      tokenized_sentence.append(token)
    tokenized_corpus.append(tokenized_sentence)
  return tokenized_corpus

def get_vocabulary(tokenized_corpus):
  vocabulary = []
  for sentence in tokenized_corpus:
    for token in sentence:
        vocabulary.append(token)
  return set(vocabulary)

tokenized_corpus = preprocess_corpus(corpus)
print(tokenized_corpus)

vocabulary = get_vocabulary(tokenized_corpus)
vocabulary_size = len(vocabulary)
print(vocabulary_size)




[['he', 'is', 'a', 'king'], ['she', 'is', 'a', 'queen'], ['he', 'is', 'a', 'man'], ['she', 'is', 'a', 'woman'], ['london', 'is', ',', 'the', 'capital', 'of', 'england'], ['berlin', 'is', '...', 'the', 'capital', 'of', 'germany'], ['paris', 'is', 'the', 'capital', 'of', 'france', '.'], ['he', 'will', 'eat', 'cake', ',', 'pie', ',', 'and/or', 'brownies'], ['she', 'did', "n't", 'like', 'the', 'brownies']]
29


## Helper functions
These are some other common helper structures used in NLP models:

`word2idx`: a dictionary that maps each word to its index

`idx2word`: a dictionary that maps each index back to its word

Print `word2idx` and `idx2word`; we will be using these in future exercises.

In [4]:
word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}

print(word2idx)
print(idx2word)

{'man': 0, 'she': 1, 'london': 2, 'capital': 3, 'germany': 4, 'paris': 5, 'and/or': 6, '...': 7, 'france': 8, ',': 9, 'queen': 10, 'cake': 11, 'the': 12, 'brownies': 13, 'berlin': 14, 'like': 15, 'will': 16, '.': 17, 'is': 18, 'king': 19, 'he': 20, 'pie': 21, 'england': 22, 'a': 23, 'of': 24, 'eat': 25, 'did': 26, "n't": 27, 'woman': 28}
{0: 'man', 1: 'she', 2: 'london', 3: 'capital', 4: 'germany', 5: 'paris', 6: 'and/or', 7: '...', 8: 'france', 9: ',', 10: 'queen', 11: 'cake', 12: 'the', 13: 'brownies', 14: 'berlin', 15: 'like', 16: 'will', 17: '.', 18: 'is', 19: 'king', 20: 'he', 21: 'pie', 22: 'england', 23: 'a', 24: 'of', 25: 'eat', 26: 'did', 27: "n't", 28: 'woman'}
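
Because `vocabulary` is a Python `set`, the order in which `enumerate` visits it can differ between interpreter runs, so the indices printed above may not reproduce exactly. A minimal sketch, assuming reproducible indices are wanted, is to sort the vocabulary before numbering it. The helper name `build_index` and the four-word toy vocabulary are hypothetical and only serve the illustration.

```python
def build_index(vocabulary):
    """Return (word2idx, idx2word) with indices assigned in sorted order."""
    ordered = sorted(vocabulary)  # sorting fixes the order across runs
    word2idx = {w: idx for idx, w in enumerate(ordered)}
    idx2word = {idx: w for idx, w in enumerate(ordered)}
    return word2idx, idx2word


# Hypothetical toy vocabulary, not the one built from the corpus above.
toy_vocabulary = {"king", "queen", "he", "she"}
toy_word2idx, toy_idx2word = build_index(toy_vocabulary)

print(toy_word2idx)
print(toy_idx2word)
```

With this variant the two dictionaries are always inverses of each other and a saved embedding matrix can be matched back to its words in a later session.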


## Look-up table

This is a table that maps an index to a one-hot vector. A one-hot vector is a binary representation of a categorical variable: exactly one entry has the value 1 and all other entries are 0. The position of the 1 identifies the category, which here is the word.

**Q. Print the one-hot vectors corresponding to the words 'the', 'he' and 'england'.**

*A. See code below*

In [5]:
def look_up_table(word_idx):
    x = torch.zeros(vocabulary_size).float()
    x[word_idx] = 1.0
    return x
  
# This is a one hot representation

# Q. try printing it for word_idx = 1

word_idx = word2idx['he']
print(look_up_table(word_idx))

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.])
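
One reason the one-hot form is convenient is that multiplying a matrix by a one-hot vector picks out a single column of that matrix. The sketch below illustrates this in plain Python with a made-up 2 x 4 matrix; `one_hot`, `matvec` and the values in `toy_W1` are hypothetical and are not taken from the model trained in this notebook.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


# Hypothetical matrix: 2 embedding dimensions, 4 vocabulary entries.
toy_W1 = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
]

x = one_hot(2, 4)
selected = matvec(toy_W1, x)          # matrix times one-hot vector
column = [row[2] for row in toy_W1]   # direct look-up of column 2

print(selected)
print(column)
```

Both prints show the same two numbers, so the matrix product is just an indirect way of reading one column.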


## Extracting contexts and the focus word

We will be building the **skip-gram** model. Given a corpus, this model uses the current word (the focus word) to predict its neighbors (its context).

We first begin by obtaining the set of contexts and focus words.

Let us say we have a sentence (represented as vocabulary indices): [0, 2, 3, 6, 7].
For every word in the sentence, we want to get the words that lie within window_size positions of it.
So if window_size==2, for the word '0', we obtain: [[0, 2], [0, 3]]
For the word '2', we obtain: [[2, 0], [2, 3], [2, 6]]
For the word '3', we obtain: [[3, 0], [3, 2], [3, 6], [3, 7]]
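
The windowing rule described above can be sketched as a small standalone function. This is an illustrative variant that clips the window to the sentence boundaries; the function name `context_pairs` is hypothetical.

```python
def context_pairs(indices, window_size):
    """Return (focus, context) pairs for every position in `indices`."""
    pairs = []
    for center_pos, center in enumerate(indices):
        # Clip the window so it never runs past either end of the sentence.
        start = max(0, center_pos - window_size)
        end = min(len(indices), center_pos + window_size + 1)
        for context_pos in range(start, end):
            if context_pos != center_pos:  # a word is not its own context
                pairs.append((center, indices[context_pos]))
    return pairs


sentence = [0, 2, 3, 6, 7]
pairs = context_pairs(sentence, window_size=2)

for focus in (0, 2, 3):
    print(focus, [p for p in pairs if p[0] == focus])
```

Words near the edges of a sentence get fewer pairs than words in the middle, which is why the three examples above have two, three and four pairs respectively.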

**Q. Print some of the index pairs and trace them back to their words.**

In [6]:
window_size = 2
idx_pairs = []

# variables of interest: 
#   center_word_pos: center word position
#   context_word_pos: context_word_position
#   add sentence length as a constraint

for sentence in tokenized_corpus:
    indices = [word2idx[word] for word in sentence]
    
    for center_word_pos in range(len(indices)):
        
        for w in range(-window_size, window_size + 1):
            context_word_pos = center_word_pos + w
            
            if context_word_pos < 0 or context_word_pos >= len(indices) or center_word_pos == context_word_pos:
                continue
                
            context_word_idx = indices[context_word_pos]
            idx_pairs.append((indices[center_word_pos], context_word_idx))

idx_pairs = np.array(idx_pairs) # it will be useful to have this as numpy array

print(idx_pairs)

[[20 18]
 [20 23]
 [18 20]
 [18 23]
 [18 19]
 [23 20]
 [23 18]
 [23 19]
 [19 18]
 [19 23]
 [ 1 18]
 [ 1 23]
 [18  1]
 [18 23]
 [18 10]
 [23  1]
 [23 18]
 [23 10]
 [10 18]
 [10 23]
 [20 18]
 [20 23]
 [18 20]
 [18 23]
 [18  0]
 [23 20]
 [23 18]
 [23  0]
 [ 0 18]
 [ 0 23]
 [ 1 18]
 [ 1 23]
 [18  1]
 [18 23]
 [18 28]
 [23  1]
 [23 18]
 [23 28]
 [28 18]
 [28 23]
 [ 2 18]
 [ 2  9]
 [18  2]
 [18  9]
 [18 12]
 [ 9  2]
 [ 9 18]
 [ 9 12]
 [ 9  3]
 [12 18]
 [12  9]
 [12  3]
 [12 24]
 [ 3  9]
 [ 3 12]
 [ 3 24]
 [ 3 22]
 [24 12]
 [24  3]
 [24 22]
 [22  3]
 [22 24]
 [14 18]
 [14  7]
 [18 14]
 [18  7]
 [18 12]
 [ 7 14]
 [ 7 18]
 [ 7 12]
 [ 7  3]
 [12 18]
 [12  7]
 [12  3]
 [12 24]
 [ 3  7]
 [ 3 12]
 [ 3 24]
 [ 3  4]
 [24 12]
 [24  3]
 [24  4]
 [ 4  3]
 [ 4 24]
 [ 5 18]
 [ 5 12]
 [18  5]
 [18 12]
 [18  3]
 [12  5]
 [12 18]
 [12  3]
 [12 24]
 [ 3 18]
 [ 3 12]
 [ 3 24]
 [ 3  8]
 [24 12]
 [24  3]
 [24  8]
 [24 17]
 [ 8  3]
 [ 8 24]
 [ 8 17]
 [17 24]
 [17  8]
 [20 16]
 [20 25]
 [16 20]
 [16 25]
 [16 11]
 

In [7]:
# We'll sample 5 elements at random and trace these back to their word pairs
from random import sample

random_pairs = sample(list(idx_pairs), 5)
print(random_pairs)

tokens_from_idx_pairs = []
for random_pair in random_pairs:
    focus_word_idx, context_word_idx = random_pair[0], random_pair[1]
    focus_word = idx2word[focus_word_idx]
    context_word = idx2word[context_word_idx]
    tokens_from_idx_pairs.append([focus_word, context_word])

print()
print(tokens_from_idx_pairs)

[array([9, 9]), array([18, 20]), array([10, 18]), array([25, 20]), array([20, 18])]

[[',', ','], ['is', 'he'], ['queen', 'is'], ['eat', 'he'], ['he', 'is']]


## Parameters and hyperparameters
For our toy task, let us set the embedding dimension to 5, run the algorithm for 100 epochs (an epoch is one pass of the training algorithm over all the training data), and use a learning rate of 0.001. We have two parameter matrices, $W_1$ and $W_2$: the embedding matrix and the output weight matrix.

**Q. What are the dimensionalities of $W_1$ and $W_2$?** 

*A. shape(W1) = [embedding_dims x vocabulary_size]
shape(W2) = [vocabulary_size x embedding_dims]*

 


In [8]:
# Hyperparameters:
embedding_dims = 5
num_epochs = 100
learning_rate = 0.001

# The two weight matrices:
W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)
W2 = torch.randn(vocabulary_size, embedding_dims, requires_grad=True)




## Training the model

In the code below, we are going to compute the log probability of the correct context (target) given the word.

**Q. Fill in the gaps.**
 
**Q. Print the loss and see if the loss goes down.**


**Q. What is the meaning of the negative log-likelihood loss (NLL) ?**

The better the prediction, the lower the NLL loss. Minimising the negative log-likelihood is the same as maximising the average log probability of the correct context words in the window. We negate the log-likelihood to follow the convention that loss functions are minimised. Working with log probabilities rather than raw probabilities is numerically more stable, because a product of many small probabilities becomes a sum. Likelihood refers to the probability that the model assigns to the observed targets.
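
As a small worked illustration of this loss, the sketch below computes the log-softmax and the NLL in plain Python for three hand-made score vectors over 29 words (the size of the toy vocabulary). The function names and the score values are hypothetical; the training cell itself uses `F.log_softmax` and `F.nll_loss`.

```python
import math


def log_softmax(scores):
    """Log-softmax computed with the max-subtraction trick for stability."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]


def nll(scores, target):
    """Negative log probability assigned to the target index."""
    return -log_softmax(scores)[target]


toy_vocab_size = 29
target = 3

uniform = [0.0] * toy_vocab_size   # every word equally likely
confident = list(uniform)
confident[target] = 5.0            # high score on the correct word
wrong = list(uniform)
wrong[7] = 5.0                     # high score on an incorrect word

print(nll(uniform, target), math.log(toy_vocab_size))
print(nll(confident, target))
print(nll(wrong, target))
```

A model that spreads its probability uniformly scores ln(29), roughly 3.37. Putting more mass on the correct word lowers the loss, and putting it on a wrong word raises it.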

**Q. Which of the weight matrices will be extracted as word vectors ?**

A. W1
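
Since `W1` has shape [embedding_dims x vocabulary_size], the vector for a word is the column of `W1` at that word's index. The following sketch shows how such columns could be read out and compared with cosine similarity. It uses plain Python and a made-up matrix; `toy_W1`, `toy_word2idx` and both helper functions are hypothetical, and the numbers are chosen by hand rather than learned.

```python
import math


def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-dimensional embeddings for three words (one column each).
toy_word2idx = {"king": 0, "queen": 1, "cake": 2}
toy_W1 = [
    [1.0, 2.0, -1.0],
    [2.0, 4.0, 0.5],
]

king = column(toy_W1, toy_word2idx["king"])
queen = column(toy_W1, toy_word2idx["queen"])
cake = column(toy_W1, toy_word2idx["cake"])

print(cosine_similarity(king, queen))  # parallel vectors, close to 1
print(cosine_similarity(king, cake))   # orthogonal vectors, close to 0
```

Cosine similarity ignores vector length, so it compares only the directions of two embeddings.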

In [9]:
for epoch in range(num_epochs):
  
    loss_val = 0
    
    for data, target in idx_pairs:
      
        # one-hot vector for the focus word
        x = look_up_table(data)

        # Q. what would y_true be?
        # A. [index] of the target word
        y_true = torch.tensor([target]).long()

        # embedding of the focus word: the column of W1 selected by x
        z1 = torch.matmul(W1, x)

        # one score per vocabulary word
        z2 = torch.matmul(W2, z1)
        # Q. what is the above operation?
        # A. matrix multiplication
 
        # Let us obtain prediction over the vocabulary
        log_softmax = F.log_softmax(z2, dim=0)
                
        loss = F.nll_loss(log_softmax.view(1,-1), y_true)
        loss_val += loss.item()
        
        # propagate the error
        loss.backward()
        
        # gradient descent
        W1.data -= learning_rate * W1.grad.data
        W2.data -= learning_rate * W2.grad.data

        # zero out gradient accumulation
        W1.grad.data.zero_()
        W2.grad.data.zero_()

    print(f'\nFinal epoch loss: {loss_val/len(idx_pairs)}')



Final epoch loss: 5.367302208800208

Final epoch loss: 5.28241472991256

Final epoch loss: 5.203510970554569

Final epoch loss: 5.130393026033779

Final epoch loss: 5.062788114919291

Final epoch loss: 5.000346880067479

Final epoch loss: 4.94265700779952

Final epoch loss: 4.889269152632007

Final epoch loss: 4.83972472029847

Final epoch loss: 4.793579118205355

Final epoch loss: 4.750420883491442

Final epoch loss: 4.709881956314112

Final epoch loss: 4.671639164933911

Final epoch loss: 4.635414730805855

Final epoch loss: 4.60097231028916

Final epoch loss: 4.5681133490878265

Final epoch loss: 4.536669080133562

Final epoch loss: 4.506497740745544

Final epoch loss: 4.477479602609362

Final epoch loss: 4.44951267443694

Final epoch loss: 4.4225097213472635

Final epoch loss: 4.396395971248676

Final epoch loss: 4.371106797224515

Final epoch loss: 4.346585970420342

Final epoch loss: 4.3227835669145955

Final epoch loss: 4.299656014937859

Final epoch loss: 4.277164452261739

Fi

# Assignment 1: Training with negative sampling

Fill in the gaps in the code below to train the skip-gram model with negative sampling. Train the model on the following corpus: https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2/train.txt. You can use the GPU Colab runtime. Submit the code and a short report (150 words) answering the following questions:

1. How did you preprocess your dataset and why?
2. How did you pick negative examples? How many of them did you choose? Why did you decide to do so?
3. How did you compute the final loss? 
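
As a point of reference for questions 2 and 3, here is one possible way to draw negatives and combine them into a loss, written in plain Python. It assumes negatives are drawn from the unigram distribution raised to the power 0.75, the choice used in the original word2vec work; other choices are equally valid answers. All function names and the toy token list are hypothetical.

```python
import math
import random
from collections import Counter


def unigram_table(tokens, power=0.75):
    """Return words and their sampling weights (count ** power)."""
    counts = Counter(tokens)
    words = sorted(counts)
    weights = [counts[w] ** power for w in words]
    return words, weights


def sample_negatives(words, weights, exclude, k, rng):
    """Draw k words outside `exclude`; assumes at least one such word exists."""
    negatives = []
    while len(negatives) < k:
        candidate = rng.choices(words, weights=weights, k=1)[0]
        if candidate not in exclude:
            negatives.append(candidate)
    return negatives


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(center_vec, positive_vec, negative_vecs):
    """-log s(c.p) minus the sum over negatives of log s(-c.n)."""
    loss = -log_sigmoid(dot(center_vec, positive_vec))
    for negative_vec in negative_vecs:
        loss -= log_sigmoid(-dot(center_vec, negative_vec))
    return loss


tokens = "he is a king she is a queen".split()
words, weights = unigram_table(tokens)
rng = random.Random(0)  # fixed seed so the draw is repeatable
negatives = sample_negatives(words, weights, exclude={"he", "is"}, k=5, rng=rng)
print(negatives)

zero = [0.0, 0.0]
print(negative_sampling_loss(zero, zero, [zero, zero]))
```

With all-zero vectors every sigmoid equals 0.5, so one positive and two negatives give a loss of 3 ln 2, a handy sanity check for an implementation.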


In [None]:
!wget https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2/train.txt


In [None]:
with open("train.txt", "r") as f:
    big_corpus = f.read()

print(len(big_corpus))

In [None]:
from nltk import word_tokenize
nltk.download('wordnet')

tokenized = word_tokenize(big_corpus.lower())
vocabulary = set(tokenized)

vocabulary_size = len(vocabulary)
print(vocabulary_size)

In [None]:
# Hyperparameters for Negative Sample:
embedding_dims = 5
num_epochs = 100
learning_rate = 0.001

# The two weight matrices:
W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)
W2 = torch.randn(vocabulary_size, embedding_dims, requires_grad=True)


In [10]:
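# The two weight matrices. Note that W2 is [embedding_dims x vocabulary_size] here,
# so that multiplying it by a one-hot vector selects a context embedding.
W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)
W2 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)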

for epoch in range(num_epochs):
    print("EPOCH{}".format(epoch))
    epoch_loss = 0
    for data, target in idx_pairs:
        x_var = Variable(look_up_table(data)).float() 
        
        y_pos = Variable(torch.from_numpy(np.array([target])).long())
        y_pos_var = Variable(look_up_table(target)).float()

        # TO DO: pick a negative sample
        context_words = []
        for one in idx_pairs:
            if (one[0] == data):
                context_words.append(one[1])

        non_context_words = [i for i in range(vocabulary_size) if i not in set(context_words)]
        # 5 negative samples
        neg_indices = []
        for i in range(5):
            neg_indices.append(non_context_words[np.random.randint(len(non_context_words))])

        neg_sample = Variable(torch.from_numpy(np.array(neg_indices)).long())
        # TODO: use all five negative samples in the loss.
        # For testing only: overwrite them with a single negative sample.
        neg_sample = Variable(torch.from_numpy(np.array([non_context_words[np.random.randint(len(non_context_words))]])).long())
        y_neg_var = Variable(look_up_table(neg_sample)).float()
        
        #TO DO: use the look up table
         
        x_emb = torch.matmul(W1, x_var) 
        y_pos_emb = torch.matmul(W2, y_pos_var)
        y_neg_emb = torch.matmul(W2, y_neg_var)
        
        # get positive sample score
        pos_loss = F.logsigmoid(torch.matmul(x_emb, y_pos_emb))
        
        # get negsample score
        neg_loss = F.logsigmoid(-1 * torch.matmul(x_emb, y_neg_emb))
        
        # TO DO: compute loss
        loss = -(pos_loss + neg_loss)
        epoch_loss += loss.item()
        
        # propagate the error
        loss.backward()
        
        # gradient descent
        W1.data -= learning_rate * W1.grad.data
        W2.data -= learning_rate * W2.grad.data

        # zero out gradient accumulation
        W1.grad.data.zero_()
        W2.grad.data.zero_()
        
    if epoch % 10 == 0:    
        print(f'Loss at epoch {epoch}: {epoch_loss/len(idx_pairs)}')

EPOCH0
Loss at epoch 0: 1.9394938046862553
EPOCH1
EPOCH2
EPOCH3
EPOCH4
EPOCH5
EPOCH6
EPOCH7
EPOCH8
EPOCH9
EPOCH10
Loss at epoch 10: 1.87205912743683
EPOCH11
EPOCH12
EPOCH13
EPOCH14
EPOCH15
EPOCH16
EPOCH17
EPOCH18
EPOCH19
EPOCH20
Loss at epoch 20: 2.019749760337464
EPOCH21
EPOCH22
EPOCH23
EPOCH24
EPOCH25
EPOCH26
EPOCH27
EPOCH28
EPOCH29
EPOCH30
Loss at epoch 30: 1.868371529958223
EPOCH31
EPOCH32
EPOCH33
EPOCH34
EPOCH35
EPOCH36
EPOCH37
EPOCH38
EPOCH39
EPOCH40
Loss at epoch 40: 1.7476570551078041
EPOCH41
EPOCH42
EPOCH43
EPOCH44
EPOCH45
EPOCH46
EPOCH47
EPOCH48
EPOCH49
EPOCH50
Loss at epoch 50: 1.5949684595810127
EPOCH51
EPOCH52
EPOCH53
EPOCH54
EPOCH55
EPOCH56
EPOCH57
EPOCH58
EPOCH59
EPOCH60
Loss at epoch 60: 1.520374598586327
EPOCH61
EPOCH62
EPOCH63
EPOCH64
EPOCH65
EPOCH66
EPOCH67
EPOCH68
EPOCH69
EPOCH70
Loss at epoch 70: 1.6091697681937125
EPOCH71
EPOCH72
EPOCH73
EPOCH74
EPOCH75
EPOCH76
EPOCH77
EPOCH78
EPOCH79
EPOCH80
Loss at epoch 80: 1.553691165304029
EPOCH81
EPOCH82
EPOCH83
EPOCH84
EPOC