<a href="https://colab.research.google.com/github/ChanMunFai/NLPCourse/blob/main/lab01-preprocessing-and-word-embeddings/lab01_PreprocessingAndEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install nltk



# Semantic word representations


Recall that *semantic word representations* are representations that are learned to capture the 'meaning' of a word. These are low-dimensional vectors that contain some semantic properties. In this notebook we are going to build state-of-the-art approaches to obtain semantic word representations using the **word2vec** modelling approach. We will also use these vectors in some tasks to understand the utility of these representations.

We begin by loading some of the libraries that are necessary for building our model. We are using [PyTorch](https://pytorch.org/), an open source deep learning platform, as our backbone library in the course.



In [None]:
#@title Loading packages

import torch
from torch.autograd import Variable
import numpy as np
import torch.nn.functional as F
from scipy.spatial.distance import euclidean, cosine
from tqdm import tqdm 
import codecs
from sklearn.metrics.pairwise import cosine_distances

In [None]:
#@title Sample corpora

corpus = [
    'he is a king',
    'she is a queen',
    'he is a Man',
    'she Is a woman',
    'london is, the capital of England',
    'Berlin is ... the capital of germany',
    'paris is the capital of france.',
    'He will eat cake, pie, and/or brownies',
    "she didn't like the brownies"
]

# Tokenization

Q: What is a token and why do we need to tokenize? 

Tokens are subunits of a text, such as words, subwords, or characters. We tokenise because a model cannot operate on raw strings: splitting text into tokens gives us discrete units that can be mapped to vocabulary indices, much like splitting text into features.

Q: Print the tokenized corpus above. What mistakes do you find in the code below? 
1. ~Contains duplicates.~ This is not a mistake!  
2. Contains punctuation -> though this is not necessarily wrong! 
3. Lower and upper words are considered different tokens. 

Q: What could be a nice way of fixing these mistakes? 
See code below. 



##### 10 mins 

In [None]:
tokenized_corpus = [] # Let us put the tokenized corpus in a list
for sentence in corpus:
  tokenized_sentence = []
  for token in sentence.split(' '): # simplest split is on single spaces
    # Add code here for Q3.
    if any(c.isalpha() for c in token): 
      token = token.lower()
      tokenized_sentence.append(token)
  # tokenized_corpus.extend(tokenized_sentence)
  tokenized_corpus.append(tokenized_sentence)


# tokenized_corpus = list(dict.fromkeys(tokenized_corpus)) # remove duplicates

print(tokenized_corpus)

[['he', 'is', 'a', 'king'], ['she', 'is', 'a', 'queen'], ['he', 'is', 'a', 'man'], ['she', 'is', 'a', 'woman'], ['london', 'is,', 'the', 'capital', 'of', 'england'], ['berlin', 'is', 'the', 'capital', 'of', 'germany'], ['paris', 'is', 'the', 'capital', 'of', 'france.'], ['he', 'will', 'eat', 'cake,', 'pie,', 'and/or', 'brownies'], ['she', "didn't", 'like', 'the', 'brownies']]
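
The output above still contains tokens such as `'is,'`, `'france.'` and `'and/or'`, because filtering on `isalpha()` only drops tokens that contain no letters at all. One possible fix, sketched below, is to extract tokens with a regular expression instead of splitting on spaces. The helper name `simple_tokenize` and the pattern are assumptions made for illustration, not the lab's reference solution.

```python
import re

def simple_tokenize(sentence):
    """Lowercase a sentence and keep runs of letters, allowing one internal apostrophe."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())

sample_sentences = [
    "london is, the capital of England",
    "she didn't like the brownies",
    "He will eat cake, pie, and/or brownies",
]
sample_tokens = [simple_tokenize(s) for s in sample_sentences]
print(sample_tokens)
```

Note that this choice splits `and/or` into two tokens and keeps `didn't` whole; whether that is desirable depends on the downstream task.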


# Pre-processing

Tokenization is a crucial pre-processing step in the NLP domain`*`. However, other pre-processing techniques also exist, many of which were extensively employed in rule-based and statistical NLP. While we rarely use these techniques in neural NLP anymore, they are still worth a recap. Typically, **stop word** and **punctuation** removal are employed, along with *either* **stemming** or **lemmatization**. However, in the following code, we will demonstrate each of the techniques separately (mainly because our corpus is so small).

### Stop Word removal
Stop words are generally the most common words in a language, whose meaning in a sequence is ambiguous. Some examples of stop words are: the, a, an, that.

### Punctuation removal
Old-school NLP techniques (and some modern ones) struggle to understand the semantics of punctuation. Thus, punctuation was also removed.

## Stemming and Lemmatization
Stemming and Lemmatization are two distinct word normalization techniques. Essentially this means that, given our corpora, we wish to have variants of a word in a 'normal' form. For example, [playing, plays, played] may be normalised to "Play". The sentence "the boy's cars are different colours" may be normalised to "the boy car be differ colour"

### Stemming
In the case of stemming, we want to normalise all words to their stem (or root). The stem is the part of the word to which affixes (suffixes or prefixes) are assigned. Stemming a word may result in the word not actually being a word. For example, some stemming algorithms may stem [trouble, troubling, troubled] as "troubl".
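
To make the idea concrete, here is a deliberately naive suffix stripper. It is only a sketch: the function name `naive_stem`, the suffix list and the minimum stem length are all assumptions chosen for this example, and it is far cruder than the Porter algorithm used later in this notebook.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s", "e")):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

trouble_stems = [naive_stem(w) for w in ["trouble", "troubling", "troubled"]]
play_stems = [naive_stem(w) for w in ["playing", "plays", "played"]]
print(trouble_stems)
print(play_stems)
```

All three variants of "trouble" collapse to `troubl`, which is not an English word. That is exactly the behaviour described above.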

### Lemmatization
Lemmatization attempts to properly reduce unnormalized tokens to a word that belongs in the language. The root word is called a **lemma**, and is the canonical form of a set of words. For example, [runs, running, ran] are all forms of the word "run".



Q. Think of two or three other stop words, and add them to the list of stop words below.

Done


Q. Write some code which both removes stop words and punctuation from our corpus

Done


Q. The examples of stemming and lemmatization below are on words/sequences not in our corpus. Extend the code so it works on our corpus.

##### 10 mins 

N.B. We are not going to use these techniques in this file after this section, so we will demonstrate how to perform these techniques distinctly on our toy corpus.

`*`Recently, newer approaches to "tokenization" have appeared which go further than one token being one word. One example is [SentencePiece](https://github.com/google/sentencepiece). These approaches are out of scope for this lab session, but may appear in future sessions.

In [None]:
import nltk
nltk.download('punkt') # Download the tokenizer model
nltk.download('wordnet') # Download the wordnet corpora

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/chanmunfai/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/chanmunfai/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# STOP WORD REMOVAL
stop_words_list = ["the", "a", "an", "that", "is", "will"] # Add stop words from Q1 here.

# SWR = stop words removed
tokenized_corpus_SWR = []
for sentence in corpus:
    tokenized_sentence_SWR = []
    
    for token in sentence.split(" "):
        if token not in stop_words_list:
            tokenized_sentence_SWR.append(token)

    if tokenized_sentence_SWR: # Only append to corpus if tokenized_sentence_SWR isn't empty
        tokenized_corpus_SWR.append(tokenized_sentence_SWR)
        
print(tokenized_corpus_SWR)

[['he', 'king'], ['she', 'queen'], ['he', 'Man'], ['she', 'Is', 'woman'], ['london', 'is,', 'capital', 'of', 'England'], ['Berlin', '...', 'capital', 'of', 'germany'], ['paris', 'capital', 'of', 'france.'], ['He', 'eat', 'cake,', 'pie,', 'and/or', 'brownies'], ['she', "didn't", 'like', 'brownies']]


In [None]:
# PUNCTUATION REMOVAL
import re # regex

re_punctuation_string = r"[\s,/.']"  # raw string, so "\s" is not treated as an invalid string escape

# PR = punctuation removed
tokenized_corpus_PR = []
for sentence in corpus:
    tokenized_sentence_PR = re.split(re_punctuation_string, sentence) # in python's regex, [...] is an alternative to writing .|.|.
    tokenized_sentence_PR = list(filter(None, tokenized_sentence_PR)) # remove empty strings from list 
    tokenized_corpus_PR.append(tokenized_sentence_PR)
        
print(tokenized_corpus_PR)

[['he', 'is', 'a', 'king'], ['she', 'is', 'a', 'queen'], ['he', 'is', 'a', 'Man'], ['she', 'Is', 'a', 'woman'], ['london', 'is', 'the', 'capital', 'of', 'England'], ['Berlin', 'is', 'the', 'capital', 'of', 'germany'], ['paris', 'is', 'the', 'capital', 'of', 'france'], ['He', 'will', 'eat', 'cake', 'pie', 'and', 'or', 'brownies'], ['she', 'didn', 't', 'like', 'the', 'brownies']]


In [None]:
# ANSWER Q2 HERE

tokenized_corpus_PR_SWR = []
tokenized_sentence_PR_SWR = []

for sentence in corpus:
    tokenized_sentence_PR = re.split(re_punctuation_string, sentence) # in python's regex, [...] is an alternative to writing .|.|.
    tokenized_sentence_PR = list(filter(None, tokenized_sentence_PR)) # remove empty strings from list 

    for token in tokenized_sentence_PR:
        if token not in stop_words_list:
            tokenized_corpus_PR_SWR.append(token)
        
print(tokenized_corpus_PR_SWR)

['he', 'king', 'she', 'queen', 'he', 'Man', 'she', 'Is', 'woman', 'london', 'capital', 'of', 'England', 'Berlin', 'capital', 'of', 'germany', 'paris', 'capital', 'of', 'france', 'He', 'eat', 'cake', 'pie', 'and', 'or', 'brownies', 'she', 'didn', 't', 'like', 'brownies']
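
The answer above flattens the corpus into a single list and compares tokens case-sensitively, which is why `'Is'` survives. A minimal sketch of a variant that lowercases first and keeps one list per sentence is shown below. The names `clean_sentence`, `STOP_WORDS` and `demo_corpus` are assumptions introduced for this example only.

```python
import re

STOP_WORDS = {"the", "a", "an", "that", "is", "will"}
PUNCT_OR_SPACE = re.compile(r"[\s,/.']+")

def clean_sentence(sentence):
    """Lowercase, split on whitespace/punctuation, then drop stop words."""
    tokens = [t for t in PUNCT_OR_SPACE.split(sentence.lower()) if t]
    return [t for t in tokens if t not in STOP_WORDS]

demo_corpus = ["she Is a woman", "Berlin is ... the capital of germany"]
cleaned = [clean_sentence(s) for s in demo_corpus]
print(cleaned)
```

Keeping the sentence boundaries matters later on, because the skip-gram context windows should not run across two unrelated sentences.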


In [None]:
# STEMMING
from nltk.stem import PorterStemmer

porter = PorterStemmer()

stemming_word_list = ["friend", "friendship", "friends", "friendships","stabil","destabilize","misunderstanding","railroad","moonlight","football"]
print("{0:20}{1:20}".format("Word","Stemmed variant"))
print()

for word in stemming_word_list:
      print("{0:20}{1:20}".format(word,porter.stem(word)))

Word                Stemmed variant     

friend              friend              
friendship          friendship          
friends             friend              
friendships         friendship          
stabil              stabil              
destabilize         destabil            
misunderstanding    misunderstand       
railroad            railroad            
moonlight           moonlight           
football            footbal             


In [None]:
# LEMMATIZATION
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

to_lemmatize_sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

# lemmatization requires punctuation removal
to_lemmatize_sentence = re.split(re_punctuation_string, to_lemmatize_sentence)
to_lemmatize_sentence = list(filter(None, to_lemmatize_sentence))

print("{0:20}{1:20}".format("Word","Lemma"))
print()

for word in to_lemmatize_sentence:
    print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word)))

Word                Lemma               

He                  He                  
was                 wa                  
running             running             
and                 and                 
eating              eating              
at                  at                  
same                same                
time                time                
He                  He                  
has                 ha                  
bad                 bad                 
habit               habit               
of                  of                  
swimming            swimming            
after               after               
playing             playing             
long                long                
hours               hour                
in                  in                  
the                 the                 
Sun                 Sun                 


In [None]:
# Why didn't the above do anything?
# It's because the lemmatizer requires parts of speech (POS) context about the word it is currently parsing.
# We would need to use a POS model to identify what the POS for a token in its context is.
# In the above example (and for Q3), we'll just pass in the VERB context for every token

print("{0:20}{1:20}".format("Word","Lemma"))
print()

for word in to_lemmatize_sentence:
    print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word, pos="v")))

Word                Lemma               

He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 


In [None]:
# ANSWER Q3 HERE (make sure not to overwrite our "tokenized_corpus" variable from the "Tokenization" section)

to_lemmatize_sentence = " ".join(corpus)

# lemmatization requires punctuation removal
to_lemmatize_sentence = re.split(re_punctuation_string, to_lemmatize_sentence)
to_lemmatize_sentence = list(filter(None, to_lemmatize_sentence))
to_lemmatize_sentence = list(dict.fromkeys(to_lemmatize_sentence))
# print(to_lemmatize_sentence)

print("{0:20}{1:20}".format("Word","Lemma"))
print()

for word in to_lemmatize_sentence:
    print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word, pos = "v")))

Word                Lemma               

he                  he                  
is                  be                  
a                   a                   
king                king                
she                 she                 
queen               queen               
Man                 Man                 
Is                  Is                  
woman               woman               
london              london              
the                 the                 
capital             capital             
of                  of                  
England             England             
Berlin              Berlin              
germany             germany             
paris               paris               
france              france              
He                  He                  
will                will                
eat                 eat                 
cake                cake                
pie                 pie                 
and            

Note that other NLP tools, such as [spaCy](https://spacy.io/) or [Stanza](https://stanfordnlp.github.io/stanza/), are popular alternatives which provide higher-level abstractions than NLTK. When working on an NLP task, the use of one of those two libraries is recommended over NLTK.

The code below will run through implementing a Word2Vec algorithm from scratch. A fuller, more thorough tutorial can be found in the "DeeperDiveIntoWordEmbeddings.zip" folder. Feel free to read through the notebook there if you want more information, or if you struggled to follow any concept from the taught theory.

# Vocabulary

The code below obtains the vocabulary of the corpus. 

Q. Print the size of the vocabulary.

Q. A programmatically cleaner (and shorter) way of writing the code below is to use a set instead of a list. Can you implement the code below using a set?

##### 3 mins

In [None]:
vocabulary = [] # Let us put all the tokens (mostly words) 
                # appearing in the vocabulary in a list
  
for sentence in tokenized_corpus:
    for token in sentence:
        if token not in vocabulary:
            vocabulary.append(token)


print(vocabulary)



# Q. what is the size of the vocabulary?

# A. uncomment and fill below - 26

vocabulary_size = len(vocabulary)
print(vocabulary_size)

['he', 'is', 'a', 'king', 'she', 'queen', 'man', 'woman', 'london', 'is,', 'the', 'capital', 'of', 'england', 'berlin', 'germany', 'paris', 'france.', 'will', 'eat', 'cake,', 'pie,', 'and/or', 'brownies', "didn't", 'like']
26


In [None]:
vocabulary = set()
for sentence in tokenized_corpus: 
    for token in sentence: 
        vocabulary.add(token)

print(len(vocabulary))

26


# Helper functions 

* These are some of the common helper functions that are used for NLP models:

    * `word2idx`:  Maintains a dictionary of word and the corresponding index
    
    * `idx2word`: Maintains a mapping from index to word 
    
    
* Print `word2idx` and `idx2word`; we will be using these in future exercises.

##### 3 mins

In [None]:

word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}

print(word2idx)
print(idx2word)



{'eat': 0, 'brownies': 1, 'like': 2, 'will': 3, 'is': 4, 'she': 5, 'the': 6, 'berlin': 7, 'london': 8, 'cake,': 9, 'capital': 10, 'he': 11, 'king': 12, 'pie,': 13, "didn't": 14, 'paris': 15, 'and/or': 16, 'is,': 17, 'woman': 18, 'man': 19, 'queen': 20, 'a': 21, 'england': 22, 'germany': 23, 'france.': 24, 'of': 25}
{0: 'eat', 1: 'brownies', 2: 'like', 3: 'will', 4: 'is', 5: 'she', 6: 'the', 7: 'berlin', 8: 'london', 9: 'cake,', 10: 'capital', 11: 'he', 12: 'king', 13: 'pie,', 14: "didn't", 15: 'paris', 16: 'and/or', 17: 'is,', 18: 'woman', 19: 'man', 20: 'queen', 21: 'a', 22: 'england', 23: 'germany', 24: 'france.', 25: 'of'}
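
Because `vocabulary` is a set, the order in which `enumerate` visits its elements is not guaranteed to be the same from one interpreter session to the next, so the indices printed above may differ on your machine. Below is a minimal sketch of one way to make the mapping reproducible, assuming that sorting the vocabulary is acceptable. The toy vocabulary and the `_sorted` names are introduced here purely for illustration.

```python
toy_vocabulary = {"king", "queen", "he", "she"}

# Sorting fixes the iteration order, so every run assigns the same indices.
sorted_vocab = sorted(toy_vocabulary)
word2idx_sorted = {word: idx for idx, word in enumerate(sorted_vocab)}
idx2word_sorted = {idx: word for word, idx in word2idx_sorted.items()}

print(word2idx_sorted)
print(idx2word_sorted)
```

Building `idx2word_sorted` by inverting `word2idx_sorted` also guarantees that the two dictionaries stay consistent with each other.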


# Look-up table 

* This is a table that maps from an index to a one hot vector. 

Q. Print the one-hot vectors corresponding to the words 'this', 'he' and 'england'.

##### 3 mins

In [None]:

def look_up_table(word_idx):
    x = torch.zeros(vocabulary_size).float()
    x[word_idx] = 1.0
    return x
  
# This is a one hot representation

# Q. try printing it for word_idx = 1
print(look_up_table(1))

word_idx = word2idx['he']
print(look_up_table(word_idx))

try: 
    word_idx = word2idx['this']
    print(look_up_table(word_idx))
except KeyError:  # 'this' is not in the vocabulary
    print("Word does not exist")

word_idx = word2idx['england']
print(look_up_table(word_idx))


tensor([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0.])
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0.])
Word does not exist
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0.])
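
Multiplying a weight matrix by a one-hot vector simply selects one column of that matrix, which is why a one-hot look-up is enough to retrieve a word's embedding. The sketch below checks this on a small random matrix. The sizes and variable names are arbitrary choices for the example, not values from the model trained later.

```python
import torch

torch.manual_seed(0)
demo_vocab_size, demo_dims = 6, 3
W_demo = torch.randn(demo_dims, demo_vocab_size)

one_hot = torch.zeros(demo_vocab_size)
one_hot[4] = 1.0  # pretend the word of interest has index 4

via_matmul = torch.matmul(W_demo, one_hot)  # matrix-vector product
via_index = W_demo[:, 4]                    # direct column selection
columns_match = torch.allclose(via_matmul, via_index)
print(columns_match)
```

In practice, libraries index the matrix directly rather than performing the multiplication, since the product wastes work on all the zero entries.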


# Extracting contexts and the focus word


Recall that we are building the skip-gram model. 

**We first begin by obtaining the set of contexts and focus words.**
* Let's say we have a sentence (represented as vocabulary indices): `[0, 2, 3, 6, 7]`.
* For every word in the sentence, we want to get the words which are `window_size` around it.
* So if `window_size==2`, for the word '0', we obtain: `[[0, 2], [0, 3]]`
* For the word '2', we obtain: `[[2, 0], [2, 3], [2, 6]]`
* For the word '3', we obtain: `[[3, 0], [3, 2], [3, 6], [3, 7]]`
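
The worked example in the bullets can be checked with a few lines of code. This is a sketch only: the function name `skipgram_pairs` is an assumption, and it clips the window at the sentence boundaries rather than skipping out-of-range positions one by one.

```python
def skipgram_pairs(indices, window_size):
    """Return [center, context] pairs for every position within the window."""
    pairs = []
    for center_pos, center in enumerate(indices):
        start = max(0, center_pos - window_size)
        stop = min(len(indices), center_pos + window_size + 1)
        for context_pos in range(start, stop):
            if context_pos != center_pos:
                pairs.append([center, indices[context_pos]])
    return pairs

example_pairs = skipgram_pairs([0, 2, 3, 6, 7], window_size=2)
print(example_pairs[:9])
```

The first nine pairs reproduce the three bullets above: two pairs for word 0, three for word 2 and four for word 3.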

Q. Print some of the index pairs and trace them back to their words. 


##### 10 mins

In [None]:
window_size = 2

idx_pairs = []

# variables of interest: 
#   center_word_pos: center word position
#   context_word_pos: context_word_position
#   add sentence length as a constraint

for sentence in tokenized_corpus:
    indices = [word2idx[word] for word in sentence]
    
    for center_word_pos in range(len(indices)):
        
        for w in range(-window_size, window_size + 1):
            context_word_pos = center_word_pos + w
            
            if context_word_pos < 0 or context_word_pos >= len(indices) or center_word_pos == context_word_pos:
                continue
                
            context_word_idx = indices[context_word_pos]
            idx_pairs.append((indices[center_word_pos], context_word_idx))

idx_pairs = np.array(idx_pairs) # it will be useful to have this as numpy array

# print(idx_pairs)

selected_idx_pairs = idx_pairs[:5]
# print(selected_idx_pairs)

for pair in selected_idx_pairs: 
    print(idx2word[pair[0]],idx2word[pair[1]] )

he is
he a
is he
is a
is king


# Parameters and hyperparameters 

* For our toy task, let us set the embedding dimensions to 5
* Let us run the algorithm for 100 epochs (number of times the training algorithm looks at the corpus/training data)
* Let us choose the learning rate as 0.001

We have two parameter matrices $W_1$ and $W_2$ - the embedding matrix and the weight matrix. 

Q. What are the dimensionalities of $W_1$ and $W_2$?

The embedding matrix $W_1$ is |vocabulary| × 5, as it maps |vocabulary| inputs to 5 outputs. 
The context matrix $W_2$ is 5 × |vocabulary|. 

* Why is this different here?
The code below stores the transposes: `W1` is 5 × |vocabulary| and `W2` is |vocabulary| × 5, because it left-multiplies a one-hot column vector (`torch.matmul(W1, x)`) instead of right-multiplying a row vector. The two conventions are equivalent, so the answer above still holds.
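
A quick shape check makes the convention explicit. The sketch below uses the same sizes as the toy setup (26 words, 5 dimensions), but the `_demo` tensors are fresh random matrices introduced for illustration, not the trained parameters.

```python
import torch

demo_vocab_size, demo_dims = 26, 5
W1_demo = torch.randn(demo_dims, demo_vocab_size)  # embedding matrix, stored as dims x vocab
W2_demo = torch.randn(demo_vocab_size, demo_dims)  # context matrix, stored as vocab x dims

x_demo = torch.zeros(demo_vocab_size)
x_demo[3] = 1.0  # one-hot vector for an arbitrary word index

z1_demo = torch.matmul(W1_demo, x_demo)   # embedding of the word
z2_demo = torch.matmul(W2_demo, z1_demo)  # one score per vocabulary item
print(z1_demo.shape, z2_demo.shape)
```

The first product yields a 5-dimensional embedding and the second yields 26 scores, one per candidate context word.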


##### 3 mins

In [None]:
# Hyperparameters:
embedding_dims = 5
num_epochs = 100
learning_rate = 0.001

# The two weight matrices:
W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)
W2 = torch.randn(vocabulary_size, embedding_dims, requires_grad=True)


# Training the model

(Refer to Lecture 2 slides 30-31)

In the code below, we are going to compute the log probability of the correct context (target) given the word. 

Before running the code, answer the question commented in the code -> fill `y_true`.

Print the loss and see if the loss goes down.

###### 10 mins



In [185]:

for epoch in tqdm(range(num_epochs)):
  
    loss_val = 0
    
    for data, target in idx_pairs: # data is current word, target is context word 
      
        x = torch.Tensor(look_up_table(data)) #, requires_grad=True) # x is a One-hot tensor

        # Q. what would y_true be? 
        y_true = torch.Tensor([target]).long()

        # A. 
        # This is wrong and seems to be unnecessary 
        # y_true = look_up_table(y_true).view(1, -1)
        # print(y_true.shape)

        # 
        z1 = torch.matmul(W1, x) 
        # Q. what is z1? # new vector representation of x
        
        z2 = torch.matmul(W2, z1)
        # print(z2.shape) # 26 
        # Q. what is the above operation? 
    
        # Let us obtain prediction over the vocabulary
        log_softmax = F.log_softmax(z2, dim=0)
        
        # Our loss is a negative log-likelihood loss 
        # (what does this mean?)
        
        # print(log_softmax.view(1,-1))
        # print(log_softmax.view(1,-1).shape)
        loss = F.nll_loss(log_softmax.view(1,-1), y_true)
        
        loss_val += loss.item()
        
        # propagate the error
        loss.backward()
        
        # gradient descent
        W1.data -= learning_rate * W1.grad.data
        W2.data -= learning_rate * W2.grad.data

        # zero out gradient accumulation
        W1.grad.data.zero_()
        W2.grad.data.zero_()

    # print(loss_val/len(idx_pairs))

print(f'\nFinal epoch loss: {loss_val/len(idx_pairs)}')        

100%|██████████| 100/100 [00:03<00:00, 26.44it/s]


Final epoch loss: 3.3253594691936788
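
The negative log-likelihood of the correct context word is just the negative of its log-softmax score, which is also what PyTorch's cross-entropy computes from raw scores. The sketch below verifies this on random scores; all tensor names are illustrative. For reference, a model that predicts a uniform distribution over the 26 vocabulary items scores ln(26) ≈ 3.26, so a loss close to that value suggests the model has learned very little.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(1, 26)        # unnormalised scores over a 26-word vocabulary
target_idx = torch.tensor([3])     # index of the true context word

log_probs = F.log_softmax(scores, dim=1)
nll = F.nll_loss(log_probs, target_idx)        # as in the training loop
cross_entropy = F.cross_entropy(scores, target_idx)
manual = -log_probs[0, 3]                      # pick the target's log-probability by hand

uniform_loss = math.log(26)  # loss of a uniform guess over 26 words
print(nll.item(), cross_entropy.item(), manual.item(), uniform_loss)
```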





Q. Given that we are interested in distributed representations, what is the major bottleneck in our setup? Is it the dimensionality of the representations? Is it the learning rate? Is it the corpus? 

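The major bottleneck is the corpus: nine short sentences provide far too few (word, context) pairs to learn meaningful distributed representations. A secondary cost is the number of parameters, since both weight matrices grow linearly with the vocabulary size.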

Q. What hyperparameters would you tune to improve the representations? 

Q. Train the algorithm with a bigger corpus. 

(You can either copy and paste the corpus and bring it to the same format as the corpus above or use the hint below)

###### 10 mins

In [186]:
# Example code for getting corpora from the internet
import urllib.request  # importing the submodule explicitly makes urllib.request.urlopen available
txt = [line.strip() for line in urllib.request.urlopen('https://raw.githubusercontent.com/luonglearnstocode/Seinfeld-text-corpus/master/corpus.txt').readlines()]
txt = [i.decode("utf-8") for i in txt]

In [187]:
print(txt[2])

JERRY: Do you know what this is all about? Do you know, why we're here? To be out, this is out...and out is one of the single most enjoyable experiences of life. People...did you ever hear people talking about "We should go out"? This is what they're talking about...this whole thing, we're all out now, no one is home. Not one person here is home, we're all out! There are people tryin' to find us, they don't know where we are. (on an imaginary phone) "Did you ring?, I can't find him." "Where did he go?" "He didn't tell me where he was going". He must have gone out. You wanna go out: you get ready, you pick out the clothes, right? You take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...Then you're standing around, whatta you do? You go: "We gotta be getting back". Once you're out, you wanna get back! You wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right? Where ever you are in life, it's my feeling, you've gott

In [188]:
# Tokenise text with punctuation removal and stop word removal

tokenized_corpus_PR_SWR = []
tokenized_sentence_PR_SWR = []


for sentence in txt:
    tokenized_sentence_PR = re.split(re_punctuation_string, sentence) # in python's regex, [...] is an alternative to writing .|.|.
    tokenized_sentence_PR = list(filter(None, tokenized_sentence_PR)) # remove empty strings from list 

    for token in tokenized_sentence_PR:
        if token not in stop_words_list:
            tokenized_corpus_PR_SWR.append(token)
        
# print(tokenized_corpus_PR_SWR)

In [189]:
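# NOTE: this loop still iterates over the toy `tokenized_corpus`, which is why the size
# printed below is 26. To build the vocabulary of the larger corpus, iterate over
# `tokenized_corpus_PR_SWR` (a flat list of tokens) instead.
vocabulary = set()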
for sentence in tokenized_corpus: 
    for token in sentence: 
        vocabulary.add(token)

print(len(vocabulary))

26


# Using word embeddings

One of the simplest ways of exploiting word representations is to find similar words. There are many ways of measuring the semantic similarity between two words. As we are using word representations which are vectors in Euclidean space, distance metrics defined in Euclidean space are the most popular choice. This is because words that share common contexts in the corpus are located in close proximity to one another in that space. One such metric is the Euclidean distance.

Q. What is the euclidean distance between 'the' and 'a' (in the sample corpus and the new corpus)? 
1.7463502883911133 in the sample corpus. 

Q. What other distance metrics can we use for two vectors? 

Cosine similarity, which normalises for the length of the vectors (a quantity that is influenced by word frequency).



###### 10 mins


In [190]:
from scipy import spatial

# Let us get two vectors from the trained model

x = torch.Tensor(look_up_table(0))
x_emb = torch.matmul(W1, x).detach().numpy()
y = torch.Tensor(look_up_table(1))
y_emb = torch.matmul(W1, y).detach().numpy()

# let us print the euclidean distance
print(euclidean(x_emb, y_emb))

idx_the = word2idx["the"]
idx_the = torch.Tensor(look_up_table(idx_the))
idx_a = word2idx["a"]
idx_a = torch.Tensor(look_up_table(idx_a))

the_emb = torch.matmul(W1, idx_the).detach().numpy()
a_emb = torch.matmul(W1, idx_a).detach().numpy()
print(euclidean(the_emb, a_emb))

print(spatial.distance.cosine(the_emb, a_emb)) # cosine distance 

1.9633183479309082
3.240849733352661
1.6134350895881653
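
The last value printed above is a cosine *distance*, which SciPy defines as one minus the cosine similarity. It therefore ranges from 0 (same direction) to 2 (opposite directions), and a value above 1 means the two vectors have negative cosine similarity. A minimal NumPy sketch follows; the function names `cosine_similarity` and `cosine_dist` are assumptions for this example.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_dist(u, v):
    """Cosine distance, following the convention 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)

u_demo = np.array([1.0, 2.0, 3.0])
same_direction = cosine_dist(u_demo, 2.0 * u_demo)   # scaling does not change the angle
opposite = cosine_dist(u_demo, -u_demo)
orthogonal = cosine_dist(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(same_direction, opposite, orthogonal)
```

Unlike the Euclidean distance, this measure ignores vector length entirely, so doubling a vector leaves the distance unchanged.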


# ADVANCED: Training with negative sampling 
 
 
 

Refer to skipgram models in the slides. 

Q. What happens when we have a very large vocabulary? 

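The size of the matrices increases, and computing the softmax over the whole vocabulary for every training pair becomes expensive.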

Q. What is a negative sample? 

It is a sample for which the desired output is 0, i.e. it does not appear in the context window of the target word. 


Below is the code for training the model with negative sampling. 


##### 10 mins 


In [191]:
# The two weight matrices:
W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)
W2 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)

for epoch in range(num_epochs):
    epoch_loss = 0
    for data, target in idx_pairs:
        x_var = Variable(look_up_table(data)).float() 
        
        y_pos = Variable(torch.from_numpy(np.array([target])).long())
        y_pos_var = Variable(look_up_table(target)).float()
        
        neg_sample = np.random.choice(list(range(vocabulary_size)),size=(1))[0] # How do I ensure that neg is not y_pos?

        y_neg = Variable(torch.from_numpy(np.array([neg_sample])))
        y_neg_var = Variable(look_up_table(neg_sample)).float()
         
        x_emb = torch.matmul(W1, x_var) 
        y_pos_emb = torch.matmul(W2, y_pos_var)
        y_neg_emb = torch.matmul(W2, y_neg_var)
        
        # get positive sample score
        pos_loss = F.logsigmoid(torch.matmul(x_emb, y_pos_emb)) # sigmoid loss on positive sample
        
        # get negsample score
        neg_loss = F.logsigmoid(-1 * torch.matmul(x_emb, y_neg_emb)) # sigmoid loss on negative example
        
        loss = - (pos_loss + neg_loss)
        epoch_loss += loss.item()
        
        # propagate the error
        loss.backward()
        
        # gradient descent
        W1.data -= learning_rate * W1.grad.data
        W2.data -= learning_rate * W2.grad.data

        # zero out gradient accumulation
        W1.grad.data.zero_()
        W2.grad.data.zero_()
        
    if epoch % 10 == 0:    
        print(f'Loss at epo {epoch}: {epoch_loss/len(idx_pairs)}')

Loss at epo 0: 2.7675078245023124
Loss at epo 10: 2.451729760433619
Loss at epo 20: 2.485786211118102
Loss at epo 30: 2.345728284235184
Loss at epo 40: 2.219196591913127
Loss at epo 50: 2.121277701224272
Loss at epo 60: 2.049541345381966
Loss at epo 70: 2.0268701075074764
Loss at epo 80: 1.9597840020862909
Loss at epo 90: 1.8807630566593545


* In the current setup, we are only exploiting a very small sample of negative examples. This is suboptimal. 

* Given a sufficiently large vocabulary, we would ideally sample the negative samples from a noise distribution whose probabilities match the frequency of vocabulary.
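
The training loop above draws one negative uniformly at random and may accidentally pick the positive context word. The sketch below shows one way to sample several negatives from a frequency-based noise distribution while excluding the positive. The function name, the toy counts and the `power=0.75` smoothing exponent are assumptions for illustration; the exponent follows the choice reported in the original word2vec work rather than anything specified in this lab.

```python
import numpy as np

def sample_negatives(target, counts, k, rng, power=0.75):
    """Draw k negative word indices in proportion to count**power, never the target."""
    probs = np.asarray(counts, dtype=float) ** power
    probs[target] = 0.0          # the positive word must not be drawn as a negative
    probs /= probs.sum()         # renormalise to a valid distribution
    return rng.choice(len(probs), size=k, p=probs)

rng = np.random.default_rng(0)
toy_counts = [5, 3, 1, 1]        # hypothetical corpus frequencies for a 4-word vocabulary
negatives = sample_negatives(target=0, counts=toy_counts, k=1000, rng=rng)
print(np.bincount(negatives, minlength=len(toy_counts)))
```

Index 0 never appears among the draws, and the more frequent word (index 1) is drawn more often than the rare ones.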


Q.  Using this code as the basis, build an object oriented negative sampling based model and train it on the fairly large corpus. 
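
One possible starting point for the object-oriented version is sketched below. It is an assumption-laden outline rather than a reference solution: the class name `SkipGramNegSampling`, the use of `nn.Embedding` in place of one-hot products, the batch layout and the SGD settings are all choices made for this example, and the indices in the tiny batch are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNegSampling(nn.Module):
    """Skip-gram with negative sampling, using separate center and context embeddings."""

    def __init__(self, vocab_size, embedding_dims):
        super().__init__()
        self.center = nn.Embedding(vocab_size, embedding_dims)
        self.context = nn.Embedding(vocab_size, embedding_dims)

    def forward(self, center_idx, pos_idx, neg_idx):
        # center_idx, pos_idx: (batch,); neg_idx: (batch, k)
        v = self.center(center_idx)                  # (batch, dims)
        u_pos = self.context(pos_idx)                # (batch, dims)
        u_neg = self.context(neg_idx)                # (batch, k, dims)
        pos_score = (v * u_pos).sum(dim=1)           # (batch,)
        neg_score = torch.bmm(u_neg, v.unsqueeze(2)).squeeze(2)  # (batch, k)
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=1))
        return loss.mean()

torch.manual_seed(0)
model = SkipGramNegSampling(vocab_size=26, embedding_dims=5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# A tiny fixed batch: three (center, positive) pairs with two negatives each.
center_batch = torch.tensor([0, 1, 2])
positive_batch = torch.tensor([1, 2, 3])
negative_batch = torch.tensor([[10, 11], [12, 13], [14, 15]])

first_loss = model(center_batch, positive_batch, negative_batch).item()
for _ in range(50):
    optimizer.zero_grad()
    loss = model(center_batch, positive_batch, negative_batch)
    loss.backward()
    optimizer.step()
print(first_loss, loss.item())
```

To train on a real corpus, the fixed batch would be replaced by batches of index pairs together with freshly sampled negatives for each pair.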



In [193]:
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove.6B.zip

# Pre-trained representations

We have seen from the above that word embeddings are learned in an unsupervised manner, i.e., we don't have any labelled data. These representations can be used to 'bootstrap' models in NLP. There are many algorithms for inducing word representations: [word2vec](https://arxiv.org/abs/1301.3781), [GloVe](https://nlp.stanford.edu/pubs/glove.pdf) and [fastText](https://arxiv.org/abs/1607.04606) are some of the popular choices. The algorithms differ, but they are all based on the distributional hypothesis.

We will now use one of these pre-trained representations: GloVe. 

Q. What is the dimensionality of the representations below? 
50 
##### 2 mins

In [None]:
w2i = [] # word2index
i2w = [] # index2word
wvecs = [] # word vectors

# this is a large file, it will take a while to load in the memory!
with codecs.open('glove.6B.50d.txt', 'r','utf-8') as f: 
  index = 0
  for line in tqdm(f.readlines()):
    # Ignore the first line - first line typically contains vocab, dimensionality
    if len(line.strip().split()) > 3:
      
      (word, vec) = (line.strip().split()[0], 
                     list(map(float,line.strip().split()[1:]))) 
      
      wvecs.append(vec)
      w2i.append((word, index))
      i2w.append((index, word))
      index += 1

w2i = dict(w2i)
i2w = dict(i2w)
wvecs = np.array(wvecs)

100%|██████████| 400000/400000 [00:12<00:00, 30971.40it/s]


In [226]:
print(len(wvecs))
print(w2i["chicken"])
print(i2w[37751])
print(wvecs[37751])
print(wvecs[37751].shape)

400000
4298
kernels
[-0.0053221 -1.2119     0.23615    0.66215    0.097367   0.09544
 -0.046128  -0.79512   -0.17979    0.34579    0.14464    0.16585
  1.0863    -0.097819   0.65772    0.36766   -0.33688    0.089284
  0.26065   -0.93269   -0.87663   -1.5887     1.5867    -0.021235
  0.081774   1.1644     0.5547     0.76308    0.7955    -0.30241
  0.72303   -0.69054   -0.77696    1.4036    -0.31992    0.22547
 -0.80432    0.44514    0.73697   -0.26593    0.36781   -0.34146
 -1.385      1.2478     0.066696   1.1887     1.0335     0.12906
 -0.99628   -1.1382   ]
(50,)


For the following experiments, we recommend  using `wvecs` - the pretrained representations. 
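
A common first experiment with pre-trained vectors is a nearest-neighbour query. The sketch below shows the idea on three made-up 2-dimensional vectors so that it runs without the GloVe file; the function `nearest_neighbours` and all `toy_` names are assumptions for illustration. The same function should work with the real `w2i`, `i2w` and `wvecs` loaded above, although normalising all 400,000 vectors on every call is wasteful and would be better done once.

```python
import numpy as np

def nearest_neighbours(word, w2i, i2w, wvecs, k=3):
    """Return the k words whose vectors have the highest cosine similarity to `word`."""
    unit = wvecs / np.linalg.norm(wvecs, axis=1, keepdims=True)  # row-normalise
    sims = unit @ unit[w2i[word]]                                # cosine similarity to every word
    order = np.argsort(-sims)                                    # most similar first
    return [(i2w[int(i)], float(sims[i])) for i in order if int(i) != w2i[word]][:k]

toy_words = ["cat", "dog", "car"]
toy_w2i = {w: i for i, w in enumerate(toy_words)}
toy_i2w = {i: w for w, i in toy_w2i.items()}
toy_vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])  # made-up vectors

toy_neighbours = nearest_neighbours("cat", toy_w2i, toy_i2w, toy_vecs, k=2)
print(toy_neighbours)
```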

# Evaluating word representation models

## Intrinsic Evaluation

* Intrinsic evaluation of word representations involves evaluating a set of word vectors generated by an embedding technique on specific subtasks that are in some way directly related to the distributional hypothesis. These are typically simple and fast to compute, and thereby help us understand representation learning algorithms.

* An intrinsic evaluation should typically return to us a scalar quantity that measures the performance of those word vectors on the evaluation subtask.



## Word Similarity

The first task we consider is evaluating whether the representations are good at judging if two words are similar. In this task, you will use either Euclidean distance or cosine distance as the similarity measure.

* Print similarity scores for word pairs in https://github.com/iraleviant/eval-multilingual-simlex/blob/master/evaluation/ws-353/wordsim353-english-sim.txt

     (Format of the file: two words and the corresponding human score for the two words)

* Obtain Pearson's correlation between the predicted scores and the human-generated scores.
  Result: (0.4246259666521494, 6.801092849706633e-10), where the first value is the correlation coefficient and the second value is the p-value.
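The two measures mentioned above can behave quite differently. The sketch below is a minimal illustration on toy vectors, not part of the lab's reference solution: the function names `cosine_similarity` and `euclidean_similarity` are hypothetical, and mapping a Euclidean distance $d$ to a similarity via $1 / (1 + d)$ is just one assumed convention among several.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_similarity(u, v):
    """Map Euclidean distance d into (0, 1] via 1 / (1 + d) (an assumed convention)."""
    return float(1.0 / (1.0 + np.linalg.norm(u - v)))

# Toy 3-d vectors standing in for rows of a real embedding matrix.
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])  # same direction as u, twice the length

cos_uv = cosine_similarity(u, v)
euc_uv = euclidean_similarity(u, v)
print("cosine similarity:   ", cos_uv)
print("euclidean similarity:", euc_uv)
```

Cosine similarity ignores vector length, so `u` and `v` look identical to it, while the Euclidean-based score penalizes the difference in magnitude. This is one reason the two measures can rank word pairs differently.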

##### 15 mins



In [None]:
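def cosine_distance(u, v):
    """Return the cosine similarity between vectors u and v.

    Note: despite the function's name, the value returned is a similarity
    (1.0 for vectors pointing in the same direction), not a distance.
    """
    dot = np.dot(u, v)
    norm_u = np.sqrt(np.sum(u**2))
    norm_v = np.sqrt(np.sum(v**2))
    return dot / (norm_u * norm_v)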

In [269]:
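sim_table = []

with open("word_sim.txt") as f:
    next(f)  # skip header
    for line in f:
        fields = line.split()
        try:
            idx1 = w2i[fields[0]]  # row index of the first word in wvecs
            idx2 = w2i[fields[1]]  # row index of the second word in wvecs
            human_score = float(fields[2]) / 10  # rescale human score to [0, 1]
        except (KeyError, IndexError, ValueError):
            # skip out-of-vocabulary words and malformed lines
            continue
        # despite its name, cosine_distance returns a cosine similarity
        sim = cosine_distance(wvecs[idx1], wvecs[idx2])
        print(fields[0], fields[1], human_score, sim)
        sim_table.append([fields[0], fields[1], human_score, sim])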

tiger cat 0.8310000000000001 0.6150732418405161
tiger tiger 1.0 1.0
plane car 0.585 0.6656114206912427
train car 0.631 0.7658227528614754
television radio 0.6849999999999999 0.8709275278715246
media radio 0.808 0.7614966025988954
bread butter 0.754 0.8402199989344501
cucumber potato 0.592 0.7118516503650522
doctor nurse 0.8150000000000001 0.7977497347874546
professor doctor 0.5309999999999999 0.5824731746059034
student professor 0.654 0.672044328547566
smart student 0.658 0.40652697584786424
smart stupid 0.592 0.6181943426538674
stock phone 0.154 0.5236597198261208
stock jaguar 0.154 0.207980845411382
stock egg 0.162 0.22105772094366385
stock live 0.369 0.2711624205965431
stock life 0.185 0.38823228726019743
wood forest 0.8380000000000001 0.5854366178142411
money cash 0.9380000000000001 0.8989869711257938
professor cucumber 0.023 -0.130357802619789
king cabbage 0.11499999999999999 0.11323214733125984
king queen 0.8460000000000001 0.7839043010964117
king rook 0.592 0.2544332119182013
bi

In [292]:
from scipy.stats import pearsonr

sim_table = np.array(sim_table)
x = np.double(sim_table[:, 2])
y = np.double(sim_table[:, 3])

pearsonr(x, y) 

(0.4246259666521494, 6.801092849706633e-10)
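Pearson's correlation measures linear agreement between the two score lists. A rank-based alternative is Spearman's correlation, which only asks whether the model orders the word pairs the same way humans do. The sketch below is an illustration under assumptions, not part of the exercise: it uses six rows rounded from the printed output above rather than the full table, so its numbers are not comparable to the result reported for the whole dataset.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Six (human, model) score pairs rounded from the printed output above:
# tiger/cat, tiger/tiger, plane/car, train/car, stock/phone, professor/cucumber
human = np.array([0.831, 1.000, 0.585, 0.631, 0.154, 0.023])
model = np.array([0.615, 1.000, 0.666, 0.766, 0.524, -0.130])

r, r_pvalue = pearsonr(human, model)        # linear agreement
rho, rho_pvalue = spearmanr(human, model)   # agreement of rankings

print("Pearson r:   ", r)
print("Spearman rho:", rho)
```

With the full table, the same two calls can be applied to the `x` and `y` arrays from the cell above.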

##  Exploring Analogies

The second task we consider is **completing analogies**. We are given an incomplete analogy of the form: 


* $a : b : : c :~?$


We then identify the word $d$ whose vector best completes the analogy. 
The criterion has an intuitive interpretation. Ideally, we want $\phi(b) - \phi(a) = \phi(d) - \phi(c)$, where $\phi(\cdot)$ is the word vector. 
For instance, 

* *london $-$ england = paris $-$ france* .

Thus we pick the vector $\phi(d)$ which maximizes the cosine similarity (i.e. the normalized dot product) between the two difference vectors $\phi(b) - \phi(a)$ and $\phi(d) - \phi(c)$.



* You can either use your own method to compute the correct word or use the code below. 

* Use original analogies dataset https://github.com/svn2github/word2vec/blob/master/questions-words.txt 

Q. When does it fail? 

The vector space does not capture all semantic relationships between words. For example, an antonym analogy such as happy : sad :: bad : __ will fail. 

Q. What are the possible reasons for failure?

##### 15 mins



In [293]:
def find_analogy(word_a, word_b, word_c, word_vectors, word2index):
    word_a = word_a.lower()
    word_b = word_b.lower()
    word_c = word_c.lower()
    
    (e_a, e_b, e_c) = (word_vectors[word2index[word_a]], 
                       word_vectors[word2index[word_b]], 
                       word_vectors[word2index[word_c]])
    
    
    max_cosine_sim = -999
    best_word = None
    
    for (w, i) in word2index.items():
        if w in [word_a, word_b, word_c]:
            continue
        cosine_sim = cosine_distance(e_b - e_a, word_vectors[i] - e_c)
        
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
            
    return best_word
  
# find_analogy('france', 'paris', 'england', wvecs, w2i)

In [294]:
find_analogy('france', 'paris', 'england', wvecs, w2i)

'london'

In [296]:
find_analogy('happy', 'sad', 'good', wvecs, w2i)

'irony'

In [297]:
find_analogy("cat", "animal", "rose", wvecs, w2i)

'agricultural'
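The loop in `find_analogy` visits every vocabulary word in Python, which is slow for a large vocabulary. The sketch below shows one way the same objective, maximizing $\cos(\phi(b) - \phi(a), \phi(d) - \phi(c))$ over all $d$, could be computed with matrix operations. The name `find_analogy_vectorized` and the toy 2-d vocabulary are hypothetical and chosen only for illustration.

```python
import numpy as np

def find_analogy_vectorized(word_a, word_b, word_c, word_vectors, word2index):
    """Return the word d maximizing cos(e_b - e_a, e_d - e_c)."""
    words = [w.lower() for w in (word_a, word_b, word_c)]
    e_a, e_b, e_c = (word_vectors[word2index[w]] for w in words)

    target = e_b - e_a                 # the relation we want to reproduce
    candidates = word_vectors - e_c    # e_d - e_c for every word d at once
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(target)

    # Zero-length difference vectors give 0/0; treat them as worst matches.
    with np.errstate(divide="ignore", invalid="ignore"):
        sims = candidates @ target / norms
    sims = np.where(np.isnan(sims), -np.inf, sims)

    # Never return one of the three query words.
    for w in words:
        sims[word2index[w]] = -np.inf

    index2word = {i: w for w, i in word2index.items()}
    return index2word[int(np.argmax(sims))]

# Toy 2-d vocabulary (made-up vectors, not GloVe).
toy_w2i = {"france": 0, "paris": 1, "england": 2, "london": 3, "berlin": 4}
toy_vecs = np.array([[2.0, 0.0],
                     [2.0, 1.0],
                     [1.0, 0.0],
                     [1.0, 1.0],
                     [5.0, 0.5]])

answer = find_analogy_vectorized("france", "paris", "england", toy_vecs, toy_w2i)
print(answer)
```

On the real data the call would take `wvecs` and `w2i` in place of the toy arrays. The results should match the loop version up to ties and floating-point differences.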

# Advanced: Compositionality 

* Given access to only word representations, how can we build representations for phrases and sentences? 

  (Hint: algebraic operation is one way) 


* Compute the similarity score between two sentences on the STS.input.MSRpar.txt dataset from https://github.com/alvations/stasis/tree/master/STS-data/STS2012-train 

  (Please use 00-readme.txt in the corpus for details on the format)
  
* Measure the Pearson correlation with the human scores in STS.gs.MSRpar.txt

Q. What problems did you encounter when computing the scores? 

Q. What are alternative ways of computing the scores? 

Q. Using your composition method, compute representations for the following expressions and also list the top-5 most similar words: 

* New York 
* kick the bucket
* post office

  Does it work? What are the possible reasons? 
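As a starting point for the hint above, one simple algebraic composition is to average the vectors of the words in a phrase. The sketch below is a minimal baseline under assumptions, not the expected answer: the names `sentence_vector` and `sentence_similarity` are hypothetical, tokenization is plain whitespace splitting, out-of-vocabulary tokens are silently dropped, and the 2-d vocabulary is made up.

```python
import numpy as np

def sentence_vector(sentence, word_vectors, word2index):
    """Average the vectors of in-vocabulary tokens; None if no token is known."""
    tokens = sentence.lower().split()
    rows = [word_vectors[word2index[t]] for t in tokens if t in word2index]
    if not rows:
        return None
    return np.mean(rows, axis=0)

def sentence_similarity(s1, s2, word_vectors, word2index):
    """Cosine similarity of the two averaged vectors (0.0 if undefined)."""
    v1 = sentence_vector(s1, word_vectors, word2index)
    v2 = sentence_vector(s2, word_vectors, word2index)
    if v1 is None or v2 is None:
        return 0.0
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0.0:
        return 0.0
    return float(np.dot(v1, v2) / denom)

# Toy 2-d vocabulary (made-up vectors, not GloVe).
toy_w2i = {"the": 0, "cat": 1, "dog": 2, "stock": 3, "market": 4}
toy_vecs = np.array([[0.1, 0.1],
                     [1.0, 0.0],
                     [0.9, 0.2],
                     [0.0, 1.0],
                     [0.1, 0.9]])

close = sentence_similarity("the cat", "the dog", toy_vecs, toy_w2i)
far = sentence_similarity("the cat", "stock market", toy_vecs, toy_w2i)
print(close, far)
```

Averaging ignores word order and treats every word as equally important, which is worth keeping in mind when judging how it handles expressions such as *kick the bucket*.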






# References


* [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf): word2vec reference

* [Elucidating the properties of semantic word representations](http://www.offconvex.org/2016/02/14/word-embeddings-1/): A global perspective

* [Understanding the algebraic notions of semantic word representations](http://www.offconvex.org/2015/12/12/word-embeddings-1/): Why does the word-analogies task work with simple algebraic manipulations?

* [Stemming And Lemmatization](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python)