# Preprocessing and Embeddings

We will explore text preprocessing using the `nltk` package.

Note that other NLP tools, such as [SpaCy](https://spacy.io/) or [Stanza](https://stanfordnlp.github.io/stanza/), are popular alternatives which provide higher levels of abstraction than NLTK. When working on an NLP task, the use of one of those two libraries is recommended over NLTK.

The code below will run through implementing a Word2Vec algorithm from scratch. A fuller, more complete tutorial can be found in the "DeeperDiveIntoWordEmbeddings.zip" folder. Feel free to read through the notebook there if you want more information, or if you found yourself struggling to understand every concept from the taught theory.

Finally, we will evaluate word representations from pre-trained word embeddings.


In [4]:
!pip install nltk



# Semantic word representations


Recall that *semantic word representations* are representations that are learned to capture the 'meaning' of a word. These are low-dimensional vectors that contain some semantic properties. In this notebook we are going to build state-of-the-art approaches to obtain semantic word representations using the **word2vec** modelling approach. We will also use these vectors in some tasks to understand the utility of these representations.

We begin by loading some of the libraries that are necessary for building our model. We are using [pytorch](https://pytorch.org/), an open source deep learning platform, as our backbone library in the course.



In [5]:
#@title Loading packages

import torch
from torch.autograd import Variable
import numpy as np
import torch.nn.functional as F
from scipy.spatial.distance import euclidean, cosine
from tqdm import tqdm
import codecs
from sklearn.metrics.pairwise import cosine_distances

In [6]:
#@title Sample corpora

corpus = [
    'he is a king',
    'she is a queen',
    'he is a Man',
    'she Is a woman',
    'london is, the capital of England',
    'Berlin is ... the capital of germany',
    'paris is the capital of france.',
    'He will eat cake, pie, and/or brownies',
    "she didn't like the brownies"
]

# Tokenization

Q: What is a token and why do we need to tokenize?

Q: Print the tokenized corpus above. What mistakes do you find in the code below?

Q: What could be a nice way of fixing these mistakes?


In [None]:
tokenized_corpus = [] # Let us put the tokenized corpus in a list
for sentence in corpus:
  tokenized_sentence = []
  for token in sentence.split(' '): # simplest split is on single spaces
    lastletter = token[-1:]
    stopwords = ["..."]
    if token in stopwords:
      pass
    elif lastletter == ",":
      token = token[:-1]
      tokenized_sentence.append(token)
    else:
      tokenized_sentence.append(token)
  tokenized_corpus.append(tokenized_sentence)

print(tokenized_corpus)

[['he', 'is', 'a', 'king'], ['she', 'is', 'a', 'queen'], ['he', 'is', 'a', 'Man'], ['she', 'Is', 'a', 'woman'], ['london', 'is', 'the', 'capital', 'of', 'England'], ['Berlin', 'is', 'the', 'capital', 'of', 'germany'], ['paris', 'is', 'the', 'capital', 'of', 'france.'], ['He', 'will', 'eat', 'cake', 'pie', 'and/or', 'brownies'], ['she', "didn't", 'like', 'the', 'brownies']]
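One possible way to address the issues visible in the output above (mixed casing such as 'Man' and 'Is', the trailing period in 'france.', and the unsplit 'and/or') is to lowercase the text and extract tokens with a regular expression. This is only a minimal sketch, not the intended solution: the helper name `simple_tokenize` and the pattern are assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: a run of letters/digits, optionally followed by an
# apostrophe part, so that "didn't" stays a single token.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def simple_tokenize(sentence):
    """Lowercase the sentence and return the tokens matched by TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(sentence.lower())

sample_corpus = [
    'london is, the capital of England',
    'Berlin is ... the capital of germany',
    'paris is the capital of france.',
    'He will eat cake, pie, and/or brownies',
    "she didn't like the brownies",
]

tokenized_sample = [simple_tokenize(s) for s in sample_corpus]
for tokens in tokenized_sample:
    print(tokens)
```

Lowercasing merges 'He' and 'he' into one vocabulary entry, which helps on a tiny corpus but discards information such as proper-noun casing.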


# Pre-processing

Tokenization is a crucial pre-processing step in the NLP domain`*`. However, other pre-processing techniques also exist, many of which were extensively employed in rule-based and statistical NLP. While we no longer typically use these pre-processing techniques in neural NLP, they are still worth a recap. Typically, **stop word** and **punctuation** removal are employed, along with *either* **stemming** or **lemmatization**. However, in the following code, we will demonstrate each of the techniques separately (mainly because our corpus is so small).

### Stop Word removal
Stop words are generally the most common words in the language, whose meaning in a sequence is ambiguous. Some examples of stop words are: the, a, an, that.

### Punctuation removal
Old-school NLP techniques (and some modern ones) struggle to understand the semantics of punctuation. Thus, punctuation was also removed.

## Stemming and Lemmatization
Stemming and Lemmatization are two distinct word normalization techniques. Essentially this means that, given our corpora, we wish to have variants of a word in a 'normal' form. For example, [playing, plays, played] may be normalised to "Play". The sentence "the boy's cars are different colours" may be normalised to "the boy car be differ colour"

### Stemming
In the case of stemming, we want to normalise all words to their stem (or root). The stem is the part of the word to which affixes (suffixes or prefixes) are assigned. Stemming a word may result in the word not actually being a word. For example, some stemming algorithms may stem [trouble, troubling, troubled] as "troubl".

### Lemmatization
Lemmatization attempts to properly reduce unnormalized tokens to a word that belongs in the language. The root word is called a **lemma**, and is the canonical form of a set of words. For example, [runs, running, ran] are all forms of the word "run".



Q. Think of two or three other stop words, and add them to the list of stop words below.

Q. Using the examples provided below, write some code which both removes stop words and punctuation from our corpus

Q. The examples of stemming and lemmatization below are on words/sequences not in our corpus. Extend the code so it works on our corpus.


N.B. We are not going to use these techniques in this file after this section, so we will demonstrate how to perform these techniques distinctly on our toy corpus.

`*`Recently there have been newer approaches to "tokenization" which go further than one token being one word. One example is [SentencePiece](https://github.com/google/sentencepiece). These approaches are out of scope for this lab session, but may appear in future sessions.

In [8]:
import nltk
nltk.download('punkt') # Download the tokenizer model
nltk.download('wordnet') # Download the wordnet corpora

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [9]:
# STOP WORD REMOVAL
stop_words_list = ["the", "a", "an", "that", "or", "of", "and", "..."] # Add stop words from Q1 here.
punctuation = [",", ".", ")", "(", ":", ";"]
# SWR = stop words removed
tokenized_corpus_SWR = []
for sentence in corpus:
    tokenized_sentence_SWR = []
    for token in sentence.split(" "):
        lastletter = token[-1:]
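        # NOTE: this condition can never be true (a character cannot be both in and not in
        # `punctuation`), so this branch never runs and only stop words are removed here.
        if lastletter in punctuation and lastletter not in punctuation: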
          token = token[:-1]
          tokenized_sentence_SWR.append(token)
        elif token not in stop_words_list:
            tokenized_sentence_SWR.append(token)

    if tokenized_sentence_SWR: # Only append to corpus if tokenized_sentence_SWR isn't empty
        tokenized_corpus_SWR.append(tokenized_sentence_SWR)

print(tokenized_corpus_SWR)

[['he', 'is', 'king'], ['she', 'is', 'queen'], ['he', 'is', 'Man'], ['she', 'Is', 'woman'], ['london', 'is,', 'capital', 'England'], ['Berlin', 'is', 'capital', 'germany'], ['paris', 'is', 'capital', 'france.'], ['He', 'will', 'eat', 'cake,', 'pie,', 'and/or', 'brownies'], ['she', "didn't", 'like', 'brownies']]


In [10]:
# PUNCTUATION REMOVAL
import re # regex

re_punctuation_string = r"[\s,/.']" # raw string, so that \s is not treated as an invalid string escape

# PR = punctuation removed
tokenized_corpus_PR = []
for sentence in corpus:
    tokenized_sentence_PR = re.split(re_punctuation_string, sentence) # in python's regex, [...] is an alternative to writing .|.|.
    tokenized_sentence_PR = list(filter(None, tokenized_sentence_PR)) # remove empty strings from list
    tokenized_corpus_PR.append(tokenized_sentence_PR)

print(tokenized_corpus_PR)

[['he', 'is', 'a', 'king'], ['she', 'is', 'a', 'queen'], ['he', 'is', 'a', 'Man'], ['she', 'Is', 'a', 'woman'], ['london', 'is', 'the', 'capital', 'of', 'England'], ['Berlin', 'is', 'the', 'capital', 'of', 'germany'], ['paris', 'is', 'the', 'capital', 'of', 'france'], ['He', 'will', 'eat', 'cake', 'pie', 'and', 'or', 'brownies'], ['she', 'didn', 't', 'like', 'the', 'brownies']]


In [11]:
# ANSWER Q2 HERE
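One way to approach Q2 is to combine the regex split from the punctuation cell with the stop-word filter from the cell before it. The following is a sketch under assumptions rather than a reference answer: the function name `remove_stop_words_and_punctuation` is hypothetical, and comparing tokens in lowercase is an added design choice.

```python
import re

# Same stop words and split pattern as the cells above ("..." is omitted from
# the stop words because the regex split already removes the dots).
stop_words = {"the", "a", "an", "that", "or", "of", "and"}
split_pattern = r"[\s,/.']"

def remove_stop_words_and_punctuation(sentence):
    # Split on whitespace/punctuation and drop the empty strings this leaves.
    tokens = [t for t in re.split(split_pattern, sentence) if t]
    # Compare in lowercase so that "The" is treated like "the" (an assumption).
    return [t for t in tokens if t.lower() not in stop_words]

example_sentences = [
    'london is, the capital of England',
    'He will eat cake, pie, and/or brownies',
]
cleaned = [remove_stop_words_and_punctuation(s) for s in example_sentences]
print(cleaned)
```

Note that splitting on the apostrophe also breaks "didn't" into "didn" and "t", exactly as in the punctuation-removal output above.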

In [12]:
# STEMMING
from nltk.stem import PorterStemmer

porter = PorterStemmer()

stemming_word_list = ["friend", "friendship", "friends", "friendships","stabil","destabilize","misunderstanding","railroad","moonlight","football"]
print("{0:20}{1:20}".format("Word","Stemmed variant"))
print()

for word in stemming_word_list:
      print("{0:20}{1:20}".format(word,porter.stem(word)))

Word                Stemmed variant     

friend              friend              
friendship          friendship          
friends             friend              
friendships         friendship          
stabil              stabil              
destabilize         destabil            
misunderstanding    misunderstand       
railroad            railroad            
moonlight           moonlight           
football            footbal             


In [13]:
# LEMMATIZATION
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

to_lemmatize_sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

# lemmatization requires punctuation removal
to_lemmatize_sentence = re.split(re_punctuation_string, to_lemmatize_sentence)
to_lemmatize_sentence = list(filter(None, to_lemmatize_sentence))

print("{0:20}{1:20}".format("Word","Lemma"))
print()

for word in to_lemmatize_sentence:
    print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word)))

Word                Lemma               

He                  He                  
was                 wa                  
running             running             
and                 and                 
eating              eating              
at                  at                  
same                same                
time                time                
He                  He                  
has                 ha                  
bad                 bad                 
habit               habit               
of                  of                  
swimming            swimming            
after               after               
playing             playing             
long                long                
hours               hour                
in                  in                  
the                 the                 
Sun                 Sun                 


In [14]:
# Why didn't the above do anything?
# It's because the lemmatizer requires parts of speech (POS) context about the word it is currently parsing.
# We would need to use a POS model to identify what the POS for a token in its context is.
# In the above example (and for Q3), we'll just pass in the VERB context for every token

print("{0:20}{1:20}".format("Word","Lemma"))
print()

for word in to_lemmatize_sentence:
    print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word, pos="v")))

Word                Lemma               

He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 


In [15]:
# ANSWER Q3 HERE (make sure not to overwrite our "tokenized_corpus" variable from the "")

re_punctuation_string = r"[\s,/.']" # raw string, so that \s is not treated as an invalid string escape

# For Stemming -- Use "tokenized_corpus".
print("{0:20}{1:20}".format("Word","Stemmed variant"))
print()

for line in tokenized_corpus:
  for word in line:
      print("{0:20}{1:20}".format(word,porter.stem(word)))
# For Lemmatization -- Remove punctuation first (See example above).
print("---------------------------")
print("{0:20}{1:20}".format("Word","Lemma"))
print()
for line in corpus:
  to_lemmatize_sentence = re.split(re_punctuation_string, line)
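  # Filter the split list, not `line`: filtering the raw string would yield individual characters
  to_lemmatize_sentence = list(filter(None, to_lemmatize_sentence)) # remove empty strings from list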

  for word in to_lemmatize_sentence:
    print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word, pos="v")))


Word                Stemmed variant     

he                  he                  
is                  is                  
a                   a                   
king                king                
she                 she                 
is                  is                  
a                   a                   
queen               queen               
he                  he                  
is                  is                  
a                   a                   
Man                 man                 
she                 she                 
Is                  is                  
a                   a                   
woman               woman               
london              london              
is                  is                  
the                 the                 
capital             capit               
of                  of                  
England             england             
Berlin              berlin              
is             

# Vocabulary

The code below obtains the vocabulary of the corpus.

Q. Print the size of the vocabulary.

Q. A programmatically cleaner (and shorter) way of writing the code below is to use a set instead of a list. Can you implement the code below using a set?

##### 3 mins

In [16]:
vocabulary = [] # Let us put all the tokens (mostly words)
                # appearing in the vocabulary in a list

for sentence in tokenized_corpus:
    for token in sentence:
        if token not in vocabulary:
            vocabulary.append(token)


print(vocabulary)



# Q. what is the size of the vocabulary?

vocabulary_size = len(vocabulary)
vocabulary_size

['he', 'is', 'a', 'king', 'she', 'queen', 'Man', 'Is', 'woman', 'london', 'the', 'capital', 'of', 'England', 'Berlin', 'germany', 'paris', 'france.', 'He', 'will', 'eat', 'cake', 'pie', 'and/or', 'brownies', "didn't", 'like']


27
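For the set-based question above, a set comprehension is one compact option. This sketch uses a three-sentence excerpt of the tokenized corpus so that it runs on its own; the names `tokenized_example`, `vocabulary_set` and `vocabulary_sorted` are illustrative.

```python
tokenized_example = [
    ['he', 'is', 'a', 'king'],
    ['she', 'is', 'a', 'queen'],
    ['he', 'is', 'a', 'Man'],
]

# A set comprehension keeps each distinct token exactly once.
vocabulary_set = {token for sentence in tokenized_example for token in sentence}

# Sets are unordered; sort them if a reproducible index order is needed.
vocabulary_sorted = sorted(vocabulary_set)
print(len(vocabulary_set), vocabulary_sorted)
```

Membership tests on a set take roughly constant time, whereas `token not in vocabulary` on a list scans the whole list, which matters once the corpus grows. If first-seen order must be preserved, `dict.fromkeys` over the flattened tokens is an alternative.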

# Helper functions

* These are some of the common helper functions that are used for NLP models:

    * `word2idx`:  Maintains a dictionary of word and the corresponding index
    
    * `idx2word`: Maintains a mapping from index to word
    
    
* Print the word2idx and idx2word, we will be using these in future exercises.

##### 3 mins

In [17]:

word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}

print(word2idx)
print(idx2word)



{'he': 0, 'is': 1, 'a': 2, 'king': 3, 'she': 4, 'queen': 5, 'Man': 6, 'Is': 7, 'woman': 8, 'london': 9, 'the': 10, 'capital': 11, 'of': 12, 'England': 13, 'Berlin': 14, 'germany': 15, 'paris': 16, 'france.': 17, 'He': 18, 'will': 19, 'eat': 20, 'cake': 21, 'pie': 22, 'and/or': 23, 'brownies': 24, "didn't": 25, 'like': 26}
{0: 'he', 1: 'is', 2: 'a', 3: 'king', 4: 'she', 5: 'queen', 6: 'Man', 7: 'Is', 8: 'woman', 9: 'london', 10: 'the', 11: 'capital', 12: 'of', 13: 'England', 14: 'Berlin', 15: 'germany', 16: 'paris', 17: 'france.', 18: 'He', 19: 'will', 20: 'eat', 21: 'cake', 22: 'pie', 23: 'and/or', 24: 'brownies', 25: "didn't", 26: 'like'}


# Look-up table

* This is a table that maps from an index to a one hot vector.

Q. Print one-hot vectors corresponding to the words 'this', 'he' and 'england'

##### 3 mins

In [18]:

def look_up_table(word_idx):
    x = torch.zeros(vocabulary_size).float()
    x[word_idx] = 1.0
    return x

# This is a one hot representation

# See an example for 'is'
word_idx = word2idx['is']
print(look_up_table(word_idx))

# Q: Print one-hot vectors for 'this', 'he', and 'england'
words = ["this", "he", "england"]
for word in words:
  try:
    print(word, ":")
    print(look_up_table(word2idx[word]))
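  except KeyError:
    # The word is not in the vocabulary (lookups are case-sensitive), so there is no vector to print
    pass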

tensor([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])
this :
he :
tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])
england :
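In the output above nothing is printed for 'this' or 'england': the first is not in the corpus, and the second only exists as 'England'. A possible way to make this explicit is sketched below. It uses plain Python lists instead of torch tensors to stay dependency-free, and the names `toy_vocabulary`, `toy_word2idx` and `one_hot` are hypothetical.

```python
# Toy vocabulary; keys are lowercased so that lookups ignore casing.
toy_vocabulary = ['he', 'is', 'a', 'king', 'England']
toy_word2idx = {w.lower(): i for i, w in enumerate(toy_vocabulary)}

def one_hot(word):
    """Return a one-hot list for `word`, or None if it is out of vocabulary."""
    idx = toy_word2idx.get(word.lower())
    if idx is None:
        return None
    vector = [0.0] * len(toy_word2idx)
    vector[idx] = 1.0
    return vector

for word in ['this', 'he', 'england']:
    print(word, ':', one_hot(word))
```

Real systems usually reserve a dedicated unknown-word index rather than returning `None`; that choice is left open here.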


# Extracting contexts and the focus word


Recall that we are building the skip-gram model.

**We first begin by obtaining the set of contexts and focus words.**
* Let's say we have a sentence (represented as vocabulary indices): `[0, 2, 3, 6, 7]`.
* For every word in the sentence, we want to get the words which are `window_size` around it.
* So if `window_size==2`, for the word '0', we obtain: `[[0, 2], [0, 3]]`
* For the word '2', we obtain: `[[2, 0], [2, 3], [2, 6]]`
* For the word '3', we obtain: `[[3, 0], [3, 2], [3, 6], [3, 7]]`
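The worked example in the list above can be checked with a small standalone function. This is a sketch with the hypothetical name `skipgram_pairs`; it clips the window at the sentence boundaries instead of skipping out-of-range positions, which yields the same pairs.

```python
def skipgram_pairs(indices, window_size):
    """Return [center, context] pairs for every position in `indices`."""
    pairs = []
    for center_pos, center in enumerate(indices):
        # Clip the window so that it never leaves the sentence.
        start = max(0, center_pos - window_size)
        end = min(len(indices), center_pos + window_size + 1)
        for context_pos in range(start, end):
            if context_pos != center_pos:
                pairs.append([center, indices[context_pos]])
    return pairs

example_sentence = [0, 2, 3, 6, 7]
example_pairs = skipgram_pairs(example_sentence, window_size=2)
print([p for p in example_pairs if p[0] == 0])
print([p for p in example_pairs if p[0] == 2])
print([p for p in example_pairs if p[0] == 3])
```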

Q. Print some of the index pairs and trace them back to their words.


##### 10 mins

In [19]:
window_size = 2

idx_pairs = []

# variables of interest:
#   center_word_pos: center word position
#   context_word_pos: context_word_position
#   add sentence length as a constraint

for sentence in tokenized_corpus:
    indices = [word2idx[word] for word in sentence]

    for center_word_pos in range(len(indices)):

        for w in range(-window_size, window_size + 1):
            context_word_pos = center_word_pos + w

            if context_word_pos < 0 or context_word_pos >= len(indices) or center_word_pos == context_word_pos:
                continue

            context_word_idx = indices[context_word_pos]
            idx_pairs.append((indices[center_word_pos], context_word_idx))

idx_pairs = np.array(idx_pairs) # it will be useful to have this as numpy array

print(idx_pairs)

[[ 0  1]
 [ 0  2]
 [ 1  0]
 [ 1  2]
 [ 1  3]
 [ 2  0]
 [ 2  1]
 [ 2  3]
 [ 3  1]
 [ 3  2]
 [ 4  1]
 [ 4  2]
 [ 1  4]
 [ 1  2]
 [ 1  5]
 [ 2  4]
 [ 2  1]
 [ 2  5]
 [ 5  1]
 [ 5  2]
 [ 0  1]
 [ 0  2]
 [ 1  0]
 [ 1  2]
 [ 1  6]
 [ 2  0]
 [ 2  1]
 [ 2  6]
 [ 6  1]
 [ 6  2]
 [ 4  7]
 [ 4  2]
 [ 7  4]
 [ 7  2]
 [ 7  8]
 [ 2  4]
 [ 2  7]
 [ 2  8]
 [ 8  7]
 [ 8  2]
 [ 9  1]
 [ 9 10]
 [ 1  9]
 [ 1 10]
 [ 1 11]
 [10  9]
 [10  1]
 [10 11]
 [10 12]
 [11  1]
 [11 10]
 [11 12]
 [11 13]
 [12 10]
 [12 11]
 [12 13]
 [13 11]
 [13 12]
 [14  1]
 [14 10]
 [ 1 14]
 [ 1 10]
 [ 1 11]
 [10 14]
 [10  1]
 [10 11]
 [10 12]
 [11  1]
 [11 10]
 [11 12]
 [11 15]
 [12 10]
 [12 11]
 [12 15]
 [15 11]
 [15 12]
 [16  1]
 [16 10]
 [ 1 16]
 [ 1 10]
 [ 1 11]
 [10 16]
 [10  1]
 [10 11]
 [10 12]
 [11  1]
 [11 10]
 [11 12]
 [11 17]
 [12 10]
 [12 11]
 [12 17]
 [17 11]
 [17 12]
 [18 19]
 [18 20]
 [19 18]
 [19 20]
 [19 21]
 [20 18]
 [20 19]
 [20 21]
 [20 22]
 [21 19]
 [21 20]
 [21 22]
 [21 23]
 [22 20]
 [22 21]
 [22 23]
 [22 24]
 

In [20]:
# Answer Q here.
# Sample 5 elements at random and trace them back to their word pairs.
from random import sample

random_pairs = sample(list(idx_pairs), 5)
print(random_pairs)

# TODO: Trace these samples back to their word pairs.

word_pairs = []
for random_pair in random_pairs:
  word1 = idx2word[random_pair[0]]
  word2 = idx2word[random_pair[1]]
  word_pairs.extend([[word1, word2]])
print(word_pairs)

[array([12, 10]), array([12, 11]), array([11, 12]), array([20, 19]), array([ 4, 26])]
[['of', 'the'], ['of', 'capital'], ['capital', 'of'], ['eat', 'will'], ['she', 'like']]


# Parameters and hyperparameters

* For our toy task, let us set the embedding dimensions to 20
* Let us run the algorithm for 1000 epochs (number of times the training algorithm looks at the corpus/training data)
* Let us choose the learning rate as 0.001

We have two parameter matrices $W_1$ and $W_2$ - the embedding matrix and the weight matrix.

Q. What are the dimensionalities of $W_1$ and $W_2$?
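As a hint for the question above, it can help to trace the shapes on a toy example. The sketch below uses plain Python lists, with illustrative sizes `V = 4` and `d = 2` and made-up weights, to show that multiplying a `(d, V)` matrix by a one-hot vector selects one column, and that a `(V, d)` matrix then produces one score per vocabulary word.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

V, d = 4, 2  # toy vocabulary size and embedding dimension (illustrative)

# Shape (d, V): one column per vocabulary word.
W1_toy = [[0.1, 0.2, 0.3, 0.4],
          [0.5, 0.6, 0.7, 0.8]]
# Shape (V, d): one row per vocabulary word.
W2_toy = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 1.0],
          [0.5, 0.5]]

x = [0.0, 0.0, 1.0, 0.0]            # one-hot vector for word index 2
embedding = matvec(W1_toy, x)       # column 2 of W1_toy, length d
scores = matvec(W2_toy, embedding)  # one score per word, length V
print(embedding, len(scores))
```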



##### 3 mins

In [21]:
# Hyperparameters:
embedding_dims = 20
num_epochs = 1000
learning_rate = 0.001

# The two weight matrices:
W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)
W2 = torch.randn(vocabulary_size, embedding_dims, requires_grad=True)


# Training the model

(Refer to Lecture 2 slides 30-31)

In the code below, we are going to compute the log probability of the correct context (target) given the word.

Before running the code, answer the question commented in the code -> fill `y_true`.

Print the loss and see if the loss goes down.
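To make the loss concrete, the following standalone sketch computes a log-softmax and the negative log-likelihood for a single toy example using only the standard library. The scores and target index are made-up values, and the function names merely mirror the PyTorch ones for readability.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    max_score = max(scores)
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return [s - log_sum_exp for s in scores]

def nll_loss(log_probs, target_idx):
    """Negative log-likelihood of the target class."""
    return -log_probs[target_idx]

toy_scores = [2.0, 1.0, 0.1]  # illustrative scores, one per vocabulary word
toy_target = 0                # index of the true context word
toy_log_probs = log_softmax(toy_scores)
toy_loss = nll_loss(toy_log_probs, toy_target)
print(toy_loss)
```

The loss is small when the model assigns high probability to the true context word and grows as that probability shrinks.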

##### 10 mins



In [22]:
for epoch in range(num_epochs):

    loss_val = 0

    for data, target in idx_pairs:

        x = torch.Tensor(look_up_table(data)) # x is a One-hot tensor

        y_true = torch.Tensor([target]).long()
        # Q. what would y_true be? The index of the true context (target) word, as a LongTensor

        z1 = torch.matmul(W1, x)
        # Q. what is z1? The embedding of the input word: multiplying W1 by a one-hot vector selects the matching column of W1

        z2 = torch.matmul(W2, z1)
        # Q. what is the above operation? Multiplying the embedding by the context (output) matrix W2 gives one score (logit) per vocabulary word

        # Let us obtain prediction over the vocabulary
        log_softmax = F.log_softmax(z2, dim=0)

        # Our loss is a negative log-likelihood loss
        # (what does this mean?) We minimise the negative log of the probability assigned to the correct class; working in log space is numerically more stable

        loss = F.nll_loss(log_softmax.view(1,-1), y_true)

        loss_val += loss.item()

        # propagate the error
        loss.backward()

        # gradient descent
        W1.data -= learning_rate * W1.grad.data
        W2.data -= learning_rate * W2.grad.data

        # zero out gradient accumulation
        W1.grad.data.zero_()
        W2.grad.data.zero_()

    if epoch % 10 == 0:
        print(f'Loss at epoch {epoch}: {loss_val/len(idx_pairs)}')

print(f'\nFinal epoch loss: {loss_val/len(idx_pairs)}')

Loss at epoch 0: 9.182967710566635
Loss at epoch 10: 7.00546034265023
Loss at epoch 20: 5.745183305155773
Loss at epoch 30: 4.92009689784967
Loss at epoch 40: 4.316935610312682
Loss at epoch 50: 3.868524597126704
Loss at epoch 60: 3.5323581285201584
Loss at epoch 70: 3.2798559844493864
Loss at epoch 80: 3.084092174126552
Loss at epoch 90: 2.926767704807795
Loss at epoch 100: 2.8002222780997936
Loss at epoch 110: 2.6954578062662713
Loss at epoch 120: 2.605726859202752
Loss at epoch 130: 2.5270517855882644
Loss at epoch 140: 2.456618318878687
Loss at epoch 150: 2.3924814393887153
Loss at epoch 160: 2.3333781843002024
Loss at epoch 170: 2.2784937069966245
Loss at epoch 180: 2.2272899439701668
Loss at epoch 190: 2.1793969039733594
Loss at epoch 200: 2.134549072614083
Loss at epoch 210: 2.0925440618625055
Loss at epoch 220: 2.053217203112749
Loss at epoch 230: 2.0164282874419137
Loss at epoch 240: 1.9820541395590856
Loss at epoch 250: 1.9499837877658697
Loss at epoch 260: 1.9201165543152736

Q. Given that we are interested in distributed representations, what is the major bottleneck in our setup? Is it the dimensionality of the representations? Is it the learning rate? Is it the corpus?

Q. What hyperparameters would you tune to improve the representations? The embedding size, which is too small.

Q. Train the algorithm with a bigger corpus.

##### 10 mins

You can either copy and paste the corpus and bring it to the same format as the corpus above, or use the following code.

In [23]:
# Example code for getting corpora from the internet
# import urllib.request
# txt = [line.strip() for line in urllib.request.urlopen('https://raw.githubusercontent.com/luonglearnstocode/Seinfeld-text-corpus/master/corpus.txt').readlines()]
# decoded_txt = [line.decode('utf-8') for line in txt]
# lines = 0
# for line in decoded_txt:
#   lines += 1
# print(lines)
# print(decoded_txt)

In [24]:
# re_punctuation_string = '[\s,/.\']'

# # PR = punctuation removed
# tokenized_corpus_txt = []
# for sentence in decoded_txt:
#     tokenized_sentence_PR = re.split(re_punctuation_string, sentence) # in python's regex, [...] is an alternative to writing .|.|.
#     tokenized_sentence_PR = list(filter(None, tokenized_sentence_PR)) # remove empty strings from list
#     tokenized_corpus_txt.append(tokenized_sentence_PR)

In [25]:
# tokenized_corpus_txt

In [26]:
# vocabulary = set() # Let us put all the tokens (mostly words)
#                 # appearing in the corpus in a set

# for sentence in tokenized_corpus_txt:
#     for token in sentence:
#       vocabulary.add(token)
# print(vocabulary)
# print(len(vocabulary))  # Huge difference in execution time when using a set
# vocabulary_size = len(vocabulary)

In [27]:
# def look_up_table(word_idx):
#     x = torch.zeros(vocabulary_size).float()
#     x[word_idx] = 1.0
#     return x

In [28]:

# word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
# idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}

# print(word2idx)
# print(idx2word)

In [29]:
# window_size = 2

# idx_pairs = []

# # variables of interest:
# #   center_word_pos: center word position
# #   context_word_pos: context_word_position
# #   add sentence length as a constraint

# for sentence in tokenized_corpus_txt:
#     indices = [word2idx[word] for word in sentence]

#     for center_word_pos in range(len(indices)):

#         for w in range(-window_size, window_size + 1):
#             context_word_pos = center_word_pos + w

#             if context_word_pos < 0 or context_word_pos >= len(indices) or center_word_pos == context_word_pos:
#                 continue

#             context_word_idx = indices[context_word_pos]
#             idx_pairs.append((indices[center_word_pos], context_word_idx))

# idx_pairs = np.array(idx_pairs) # it will be useful to have this as numpy array

# print(idx_pairs)

In [30]:
# # Hyperparameters:
# embedding_dims = 15
# num_epochs = 5
# learning_rate = 0.01

# # The two weight matrices:
# W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)
# W2 = torch.randn(vocabulary_size, embedding_dims, requires_grad=True)

# Using word embeddings

One of the simplest ways of exploiting word representations is to find similar words. There are many ways of measuring the semantic similarity between two words. As we are using word representations which are vectors in Euclidean space, distance metrics defined in Euclidean space are the most popular choice. This is because words that share common contexts in the corpus are located in close proximity to one another in that space. One such metric is the Euclidean distance.

Q. What is the Euclidean distance between 'the' and 'a' (in the sample corpus and the new corpus)? Reflect on their differences.

Q. What other distance metrics can we use for two vectors? Cosine distance, for example.
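To see how Euclidean and cosine distance differ, the sketch below implements both with the standard library. The cosine distance is defined as one minus the cosine similarity, the same definition `scipy.spatial.distance.cosine` (imported at the top of this notebook) uses. The example vectors are made up.

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

u = [1.0, 2.0]
v = [2.0, 4.0]  # same direction as u, twice the length
print(euclidean_distance(u, v))  # non-zero: the vectors differ in length
print(cosine_distance(u, v))     # approximately 0: the vectors point the same way
```

Cosine distance ignores vector length, which is often desirable for embeddings, where the norm tends to correlate with word frequency rather than meaning.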



##### 10 mins


In [31]:
# Let us get two vectors from the trained model

x = torch.Tensor(look_up_table(0))
x_emb = torch.matmul(W1, x).detach().numpy()
y = torch.Tensor(look_up_table(1))
y_emb = torch.matmul(W1, y).detach().numpy()

# let us print the euclidean distance
print(euclidean(x_emb, y_emb)) # distance between indices 0 ('he') and 1 ('is'); result: about 5.28

5.278243541717529


In [32]:
# Q. What is the euclidean distance between 'the' and 'a' (sample corpus)

In [33]:
# Q. What is the euclidean distance between 'the' and 'a' (new corpus)
# x = torch.Tensor(look_up_table(0))
# x_emb = torch.matmul(W1, x).detach().numpy()
# y = torch.Tensor(look_up_table(1))
# y_emb = torch.matmul(W1, y).detach().numpy()

# # let us print the euclidean distance
# # print(euclidean(x_emb, y_emb)) # Result: 7.17 (possibly due to the higher dimensionality? to be checked)

# ADVANCED: Training with negative sampling




Refer to skipgram models in the slides.

Q. What happens when we have a very large vocabulary? Computing the softmax function becomes extremely expensive, because it normalises over every word in the vocabulary.

Q. What is a negative sample? A negative sample is a randomly drawn "noise" word that is not the true context of the focus word. Negative sampling replaces the softmax over the whole vocabulary with a sigmoid for the true context word and for each of the k noise words, which greatly reduces the computational cost.
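The negative-sampling objective for a single (focus, context) pair can be written out without any deep learning library. The sketch below is illustrative only: the embeddings are made-up two-dimensional vectors, and the function names are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(center_emb, pos_emb, neg_embs):
    """-log sigmoid(c.p) - sum_k log sigmoid(-c.n_k) for one training pair."""
    loss = -log_sigmoid(dot(center_emb, pos_emb))
    for neg_emb in neg_embs:
        loss -= log_sigmoid(-dot(center_emb, neg_emb))
    return loss

center = [1.0, 0.0]
good_context = [2.0, 0.0]                 # aligned with the focus word
noise_words = [[-2.0, 0.0], [-1.0, 1.0]]  # k = 2 illustrative noise vectors
print(negative_sampling_loss(center, good_context, noise_words))
```

Each pair touches only 1 + k output vectors instead of the whole vocabulary, which is where the saving over the full softmax comes from.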


Below is the code for training the model with negative sampling.


##### 10 mins


In [34]:
vocabulary_size

27

In [35]:
# The two weight matrices:
W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)
W2 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)

for epoch in range(num_epochs):
    epoch_loss = 0
    for data, target in idx_pairs:
        x_var = Variable(look_up_table(data)).float()

        y_pos = Variable(torch.from_numpy(np.array([target])).long())
        y_pos_var = Variable(look_up_table(target)).float()

        neg_sample = np.random.choice(list(range(vocabulary_size)),size=(1))[0]
        y_neg = Variable(torch.from_numpy(np.array([neg_sample])))
        y_neg_var = Variable(look_up_table(neg_sample)).float()

        x_emb = torch.matmul(W1, x_var)
        y_pos_emb = torch.matmul(W2, y_pos_var)
        y_neg_emb = torch.matmul(W2, y_neg_var)

        # get positive sample score
        pos_loss = F.logsigmoid(torch.matmul(x_emb, y_pos_emb))

        # get negsample score
        neg_loss = F.logsigmoid(-1 * torch.matmul(x_emb, y_neg_emb))

        loss = - (pos_loss + neg_loss)
        epoch_loss += loss.item()

        # propagate the error
        loss.backward()

        # gradient descent
        W1.data -= learning_rate * W1.grad.data
        W2.data -= learning_rate * W2.grad.data

        # zero out gradient accumulation
        W1.grad.data.zero_()
        W2.grad.data.zero_()

    if epoch % 10 == 0:
        print(f'Loss at epoch {epoch}: {epoch_loss/len(idx_pairs)}')

Loss at epoch 0: 3.4653482764565315
Loss at epoch 10: 3.1629490110611256
Loss at epoch 20: 2.7135122798383238
Loss at epoch 30: 2.8822830957002363
Loss at epoch 40: 2.5501048560325916
Loss at epoch 50: 2.4085746922136213
Loss at epoch 60: 2.2398041631363763
Loss at epoch 70: 2.22645751937794
Loss at epoch 80: 1.844106201043066
Loss at epoch 90: 1.9829661493404553
Loss at epoch 100: 1.9941923734937939
Loss at epoch 110: 1.7761964209377765
Loss at epoch 120: 1.7467388963523822
Loss at epoch 130: 1.670601920210398
Loss at epoch 140: 1.6782184679491015
Loss at epoch 150: 1.7948420465350725
Loss at epoch 160: 1.3815086213585275
Loss at epoch 170: 1.3519017037299748
Loss at epoch 180: 1.3884420606092764
Loss at epoch 190: 1.3425367538745587
Loss at epoch 200: 1.467068009594312
Loss at epoch 210: 1.2983398839974634
Loss at epoch 220: 1.3470229580711859
Loss at epoch 230: 1.165724522641036
Loss at epoch 240: 1.3344883180008484
Loss at epoch 250: 1.3262900595195017
Loss at epoch 260: 1.26118968

KeyboardInterrupt: 

* In the current setup, we are only exploiting a very small sample of negative examples. This is suboptimal.

* Given a sufficiently large vocabulary, we would ideally draw the negative samples from a noise distribution whose probabilities match the frequencies of the words in the vocabulary.
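A frequency-based noise distribution could be sketched as follows, using only the standard library. The token list is a made-up excerpt, and the exponent 0.75 is the smoothing used in the original word2vec work; treat it here as a tunable assumption rather than part of this notebook's setup.

```python
import random
from collections import Counter

toy_tokens = ['he', 'is', 'a', 'king', 'she', 'is', 'a', 'queen', 'he', 'is']
counts = Counter(toy_tokens)

noise_vocab = list(counts)
# Smoothed unigram weights: frequent words are sampled more often, but less
# than proportionally to their raw counts.
noise_weights = [counts[w] ** 0.75 for w in noise_vocab]

def sample_negatives(k, exclude, rng):
    """Draw k noise words by frequency, skipping the true context word."""
    negatives = []
    while len(negatives) < k:
        candidate = rng.choices(noise_vocab, weights=noise_weights, k=1)[0]
        if candidate != exclude:
            negatives.append(candidate)
    return negatives

rng = random.Random(0)  # fixed seed for reproducibility
print(sample_negatives(k=3, exclude='king', rng=rng))
```

Excluding the true context word avoids the case, possible in the training loop above, where the randomly chosen negative sample coincides with the positive target.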


Q. Using this code as the basis, build an object-oriented, negative-sampling-based model and train it on a fairly large corpus.



In [36]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2025-01-22 13:02:40--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2025-01-22 13:02:40--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-01-22 13:02:40--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip.1’



In [37]:
import re

def tokenize(corpus):
  # split on whitespace, commas, slashes, periods and apostrophes
  re_punctuation_string = r"[\s,/.']"
  tokenized_corpus_txt = []
  #decoded_txt = [line.decode('utf-8') for line in corpus]
  for line in tqdm(corpus.readlines()):
    # Skip header-like lines (e.g. "vocab_size dimensionality") with three or fewer fields
    if len(line.strip().split()) > 3:
        tokenized_sentence_PR = re.split(re_punctuation_string, line) # in Python's regex, [...] matches any one of the listed characters
        tokenized_sentence_PR = list(filter(None, tokenized_sentence_PR)) # remove empty strings from list
        tokenized_corpus_txt.append(tokenized_sentence_PR)
  return tokenized_corpus_txt

def set_vocabulary(tokenized_corpus):
  vocabulary = set() # collect every distinct token (mostly words) in the corpus

  for sentence in tokenized_corpus:
      for token in sentence:
        vocabulary.add(token)
  vocabulary_size = len(vocabulary)
  return vocabulary, vocabulary_size # previously nothing was returned

def look_up_table(word_idx):
    x = torch.zeros(vocabulary_size).float()
    x[word_idx] = 1.0
    return x

def set_idx_pairs(tokenized_corpus):

  window_size = 2

  idx_pairs = []

  for sentence in tokenized_corpus:
      indices = [word2idx[word] for word in sentence]

      for center_word_pos in range(len(indices)):

          for w in range(-window_size, window_size + 1):
              context_word_pos = center_word_pos + w

              if context_word_pos < 0 or context_word_pos >= len(indices) or center_word_pos == context_word_pos:
                  continue

              context_word_idx = indices[context_word_pos]
              idx_pairs.append((indices[center_word_pos], context_word_idx))

  idx_pairs = np.array(idx_pairs) # it will be useful to have this as numpy array
  return idx_pairs

In [38]:
# # this is a large file, it will take a while to load in the memory!
# with codecs.open('glove.6B.50d.txt', 'r','utf-8') as f:
#   index = 0
#   tokenized_corpus = tokenize(f)
#   print(tokenized_corpus)
#   # set_vocabulary(tokenized_corpus=tokenized_corpus)
#   # print(vocabulary_size)
#   # word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
#   # idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}
#   # set_idx_pairs(tokenized_corpus=tokenized_corpus)

In [39]:
vocabulary_size

27

In [42]:
# The two weight matrices:
W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)
W2 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)


for epoch in range(num_epochs):
    epoch_loss = 0
    for data, target in idx_pairs:
        x_var = Variable(look_up_table(data)).float()

        y_pos = Variable(torch.from_numpy(np.array([target])).long())
        y_pos_var = Variable(look_up_table(target)).float()

        # accumulate the log-sigmoid scores of 20 uniformly drawn negative samples
        neg_loss = 0
        for i in range(20):
          neg_sample = np.random.choice(list(range(vocabulary_size)),size=(1))[0]
          y_neg_var = Variable(look_up_table(neg_sample)).float()
          y_neg_emb = torch.matmul(W2, y_neg_var)
          # careful: this line needs to be changed for it to work, because of a problem with the gradient graph
          neg_loss += F.logsigmoid(-torch.matmul(W1 @ x_var, y_neg_emb))


        x_emb = torch.matmul(W1, x_var)
        y_pos_emb = torch.matmul(W2, y_pos_var)

        # get positive sample score
        pos_loss = F.logsigmoid(torch.matmul(x_emb, y_pos_emb))

        loss = - (pos_loss + neg_loss)
        epoch_loss += loss.item()

        # propagate the error
        loss.backward()

        # gradient descent
        W1.data -= learning_rate * W1.grad.data
        W2.data -= learning_rate * W2.grad.data

        # zero out gradient accumulation
        W1.grad.data.zero_()
        W2.grad.data.zero_()

    if epoch % 10 == 0:
        print(f'Loss at epoch {epoch}: {epoch_loss/len(idx_pairs)}')

Loss at epoch 0: 34.637506396953874
Loss at epoch 10: 20.489668721419115
Loss at epoch 20: 13.417021434123699
Loss at epoch 30: 9.81436321735382
Loss at epoch 40: 8.35036869599269
Loss at epoch 50: 6.2288029854114235
Loss at epoch 60: 5.310239357214708
Loss at epoch 70: 4.652064749827752
Loss at epoch 80: 4.098232934108148
Loss at epoch 90: 3.926960084071526
Loss at epoch 100: 3.701324500487401
Loss at epoch 110: 3.4742994849498454
Loss at epoch 120: 3.3432182486240682
Loss at epoch 130: 3.269650130088513
Loss at epoch 140: 3.211483122752263
Loss at epoch 150: 3.059455436926622
Loss at epoch 160: 3.023767922474788
Loss at epoch 170: 3.0083436058117794
Loss at epoch 180: 2.936440192736112
Loss at epoch 190: 2.911389868076031
Loss at epoch 200: 2.8877983065751884
Loss at epoch 210: 2.8458718098126923
Loss at epoch 220: 2.772542239152468
Loss at epoch 230: 2.7805497077795174
Loss at epoch 240: 2.766512055580433
Loss at epoch 250: 2.7000071883201597


KeyboardInterrupt: 

# Pre-trained representations

We have seen above that word embeddings are learned in an unsupervised manner, i.e., without any labelled data. These representations can be used to 'bootstrap' models in NLP. There are many algorithms for inducing word representations: [word2vec](https://arxiv.org/abs/1301.3781), [GloVe](https://nlp.stanford.edu/pubs/glove.pdf), and [Fasttext](https://arxiv.org/abs/1607.04606) are some of the popular choices. The algorithms differ in their details, but they are all based on the distributional hypothesis.

We will now use one of these pre-trained representations: GloVe.

Q. What is the dimensionality of the representations below?

##### 2 mins

In [43]:
w2i = [] # word2index
i2w = [] # index2word
wvecs = [] # word vectors

# this is a large file, it will take a while to load in the memory!
with codecs.open('glove.6B.50d.txt', 'r','utf-8') as f:
  index = 0
  for line in tqdm(f.readlines()):
    # Skip any header line (some embedding formats start with "vocab_size dimensionality"); real entries have more than 3 fields
    if len(line.strip().split()) > 3:

      (word, vec) = (line.strip().split()[0],
                     list(map(float,line.strip().split()[1:])))

      wvecs.append(vec)
      w2i.append((word, index))
      i2w.append((index, word))
      index += 1

w2i = dict(w2i)
i2w = dict(i2w)
wvecs = np.array(wvecs)

100%|██████████| 400000/400000 [00:15<00:00, 25038.58it/s]


For the following experiments, we recommend using `wvecs`, the pretrained representations.

# Evaluating word representation models

## Intrinsic Evaluation

* Intrinsic evaluation of word representations involves evaluating a set of word vectors generated by an embedding technique on specific subtasks that are in some way directly related to the distributional hypothesis. These subtasks are typically simple and fast to compute, and thereby help us understand representation learning algorithms.

* An intrinsic evaluation should typically return a scalar quantity that measures the performance of those word vectors on the evaluation subtask.



## Word Similarity

The first task we consider is evaluating whether the representations are good at telling if two words are similar. In this task, you will use both Euclidean distance and cosine distance as similarity measures.

* Print similarity scores for word pairs in https://github.com/iraleviant/eval-multilingual-simlex/blob/master/evaluation/ws-353/wordsim353-english-sim.txt

     (Format of the file: two words and the corresponding human score for the two words)

* Compute Pearson's correlation between the predicted scores and the human-generated scores.
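
One possible shape for this evaluation is sketched below. It assumes the word pairs have already been parsed into `(word1, word2, human_score)` triples; the function name `evaluate_similarity`, the choice to skip out-of-vocabulary pairs, and the toy vectors and scores are all illustrative assumptions (they are neither GloVe vectors nor WordSim-353 scores).

```python
import numpy as np
from scipy.stats import pearsonr

def cosine_similarity(u, v):
    """Cosine similarity: 1.0 for vectors pointing in the same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_similarity(pairs, word_vectors, word2index):
    """pairs: iterable of (word1, word2, human_score).

    Returns (Pearson r, number of pairs scored). Pairs containing an
    out-of-vocabulary word are skipped (one possible policy among several).
    """
    predicted, human = [], []
    for w1, w2, score in pairs:
        if w1 not in word2index or w2 not in word2index:
            continue
        predicted.append(cosine_similarity(word_vectors[word2index[w1]],
                                           word_vectors[word2index[w2]]))
        human.append(score)
    r, _ = pearsonr(predicted, human)
    return r, len(predicted)

# Toy vectors and scores, made up for illustration only
toy_w2i = {'cat': 0, 'tiger': 1, 'pencil': 2, 'water': 3}
toy_vecs = np.array([[1.0, 0.1, 0.0],
                     [0.9, 0.3, 0.1],
                     [0.0, 1.0, 0.2],
                     [0.1, 0.0, 1.0]])
toy_pairs = [('tiger', 'cat', 7.0), ('water', 'pencil', 1.0),
             ('cat', 'pencil', 2.0), ('cat', 'unicorn', 5.0)]

r, n_used = evaluate_similarity(toy_pairs, toy_vecs, toy_w2i)
print(f'Pearson r = {r:.3f} on {n_used} pairs')
```

With the real data, `toy_vecs` and `toy_w2i` would be replaced by `wvecs` and `w2i`. When using a distance instead of a similarity, the expected correlation with human scores is negative.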


##### 15 mins



In [50]:
from scipy.stats import pearsonr

In [52]:
words = [["tiger", "cat"], ["plane", "plane"], ["water", "pencil"], ["lion", "dinosaur"]]
# cosine() and euclidean() return distances (0 means identical vectors);
# pearsonr() here correlates the components of the two embedding vectors
print("Word 1\tWord2\tCosine\tEuclidean\tPearson")
for word in words:
  emb1 = wvecs[w2i[word[0]]]
  emb2 = wvecs[w2i[word[1]]]
  print(word[0], "\t", word[1], "\t", cosine(emb1, emb2), "\t", euclidean(emb1, emb2), "\t", pearsonr(emb1, emb2)[0])

Word 1	Word2	Cosine	Euclidean	Pearson
tiger 	 cat 	 0.38492675815948374 	 4.122908683917116 	 0.6271592806407994
plane 	 plane 	 0.0 	 0.0 	 0.9999999999999999
water 	 pencil 	 0.7877518228411655 	 7.017885353644521 	 0.2680004470852179
lion 	 dinosaur 	 0.6055773368557676 	 5.33600626592766 	 0.39051543656939486


##  Exploring Analogies

The second task we consider is **completing analogies**. We are given an incomplete analogy of the form:


* $a : b : : c :~?$


We then identify the word $d$ whose vector maximizes a cosine similarity.
This metric has an intuitive interpretation. Ideally, we want $\phi(b) - \phi(a) = \phi(d) - \phi(c)$, where $\phi(\cdot)$ is the word vector.
For instance,

* *london $-$ england = paris $-$ france*.

Thus we identify the vector $\phi(d)$ that maximizes the normalized dot product (i.e. cosine similarity) between the two difference vectors $\phi(b) - \phi(a)$ and $\phi(d) - \phi(c)$.



* You can either use your own method to compute the correct word or use the code below.

* Use original analogies dataset https://github.com/svn2github/word2vec/blob/master/questions-words.txt

Q. When does it fail?

Q. What are the possible reasons for failure?

##### 15 mins



In [53]:
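def cosine_distance(u, v):
    # Despite its name, this returns the cosine *similarity* of u and v
    # (1.0 for vectors pointing the same way), which find_analogy maximizes.
    dot = np.dot(u,v)
    norm_u = np.sqrt(np.sum(u**2))
    norm_v = np.sqrt(np.sum(v**2))
    similarity = dot/(norm_u)/norm_v
    return similarity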


def find_analogy(word_a, word_b, word_c, word_vectors, word2index):
    word_a = word_a.lower()
    word_b = word_b.lower()
    word_c = word_c.lower()

    (e_a, e_b, e_c) = (word_vectors[word2index[word_a]],
                       word_vectors[word2index[word_b]],
                       word_vectors[word2index[word_c]])


    max_cosine_sim = -999
    best_word = None

    for (w, i) in word2index.items():
        if w in [word_a, word_b, word_c]:
            continue
        cosine_sim = cosine_distance(e_b - e_a, word_vectors[i] - e_c)

        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w

    return best_word

find_analogy('france', 'paris', 'england', wvecs, w2i)

'london'
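
The loop in `find_analogy` visits every vocabulary entry in Python, which is slow for 400,000 words. A vectorized variant of the same criterion might look like the sketch below; the function name `find_analogy_vectorized` and the 2-dimensional toy vectors are assumptions made for illustration, not GloVe values.

```python
import numpy as np

def find_analogy_vectorized(word_a, word_b, word_c, word_vectors, word2index):
    """Return the word d maximizing cos(e_b - e_a, e_d - e_c)."""
    a, b, c = (word2index[w.lower()] for w in (word_a, word_b, word_c))
    target = word_vectors[b] - word_vectors[a]       # shape (dim,)
    candidates = word_vectors - word_vectors[c]      # shape (vocab, dim)

    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(target)
    # Rows with zero norm (e.g. word_c itself) keep a similarity of -inf
    sims = np.full(len(word_vectors), -np.inf)
    valid = norms > 0
    sims[valid] = candidates[valid] @ target / norms[valid]

    sims[[a, b, c]] = -np.inf                        # exclude the query words
    index2word = {i: w for w, i in word2index.items()}
    return index2word[int(np.argmax(sims))]

# Toy 2-d vectors, made up for illustration only
toy_w2i = {'france': 0, 'paris': 1, 'england': 2, 'london': 3, 'berlin': 4}
toy_vecs = np.array([[1.0, 0.0],
                     [1.0, 1.0],
                     [2.0, 0.0],
                     [2.0, 1.0],
                     [5.0, 3.0]])

result = find_analogy_vectorized('france', 'paris', 'england', toy_vecs, toy_w2i)
print(result)  # london
```

The call signature mirrors `find_analogy`, so it can be tried with `wvecs` and `w2i` as a drop-in comparison.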

# Advanced: Compositionality

* Given access to only word representations, how can we build representations for phrases and sentences?

  (Hint: algebraic operations on the word vectors are one way)


* Compute the similarity score between two sentences on the STS.input.MSRpar.txt dataset from https://github.com/alvations/stasis/tree/master/STS-data/STS2012-train

  (Please use 00-readme.txt in the corpus for details on the format)
  
* Measure the Pearson correlation with the human scores in STS.gs.MSRpar.txt
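
As one hedged starting point for the hint above, a sentence can be represented by the mean of its word vectors and two sentences compared by cosine similarity. The names `sentence_vector` and `sentence_similarity`, the whitespace tokenization, the handling of unknown words, and the toy vectors are all assumptions for illustration; other composition methods are equally valid answers to the exercise.

```python
import numpy as np

def sentence_vector(sentence, word_vectors, word2index):
    """Mean of the vectors of in-vocabulary tokens, or None if none are known."""
    tokens = sentence.lower().split()
    rows = [word_vectors[word2index[t]] for t in tokens if t in word2index]
    if not rows:
        return None
    return np.mean(rows, axis=0)

def sentence_similarity(s1, s2, word_vectors, word2index):
    """Cosine similarity of the two averaged vectors (0.0 if either is empty)."""
    v1 = sentence_vector(s1, word_vectors, word2index)
    v2 = sentence_vector(s2, word_vectors, word2index)
    if v1 is None or v2 is None:
        return 0.0
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom > 0 else 0.0

# Toy vectors, made up for illustration only
toy_w2i = {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'ran': 4}
toy_vecs = np.array([[0.1, 0.1, 0.1],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.0, 0.8, 0.2]])

same = sentence_similarity('the cat sat', 'The cat sat', toy_vecs, toy_w2i)
close = sentence_similarity('the cat sat', 'the dog ran', toy_vecs, toy_w2i)
unknown = sentence_similarity('the cat sat', 'xyzzy plugh', toy_vecs, toy_w2i)
print(same, close, unknown)
```

Averaging ignores word order and treats every token equally, which is worth keeping in mind when answering the questions that follow.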

Q. What problems did you encounter when computing the scores?

Q. What are alternative ways of computing the scores?

Q. Using your composition method, compute representations for the following expressions and also list the top-5 most similar words:

* New York
* kick the bucket
* post office

  Does it work? What are the possible reasons?






In [None]:
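import urllib.request
# NOTE: the tree URL returns an HTML page; this raw-file URL is an assumption derived from the repository path and the file name given above
txt = [line.strip() for line in urllib.request.urlopen('https://raw.githubusercontent.com/alvations/stasis/master/STS-data/STS2012-train/STS.input.MSRpar.txt').readlines()]
decoded_txt = [line.decode('utf-8') for line in txt]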

# References


* [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf): word2vec reference

* [Elucidating the properties of semantic word representations](http://www.offconvex.org/2016/02/14/word-embeddings-1/): A global perspective

* [Understanding the algebraic notions of semantic word representations](http://www.offconvex.org/2015/12/12/word-embeddings-1/): Why does the word-analogies task work with simple algebraic manipulations?

* [Stemming And Lemmatization](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python)