# Homework 9 - Descriptive Notebook

In this homework notebook, we will create and train our own SkipGram embedding, using the speech from Martin Luther King in the text.txt file.

Get familiar with the code and write a small report (2 pages max), with answers to the questions listed at the end of the notebook.

**The report must be submitted in PDF format, before April 4th, 11.59pm!**

Do not forget to write your name and student ID on the report.

You may also submit your own copy of the notebook along with the report. If you do so, please add your name and ID to the cell below.

In [4]:
# Name: Amarjyot Kaur
# Student ID: 1003084

### Imports needed

Note: we strongly advise using a CUDA/GPU machine for this notebook.

Technically, this can be done on CPU only, but it will be very slow!

If you decide to run it on CPU, you might also have to change some of the .cuda() methods used on torch tensors and models in this notebook!

In [5]:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import functools
import matplotlib.pyplot as plt
CUDA = torch.cuda.is_available()

### Step 1. Produce some data based on a given text for training our SkipGram model    

The functions below will be used to produce our dataset for training the SkipGram model.

In [6]:
def text_to_train(text, context_window):
    """
    This function receives the text as a list of words, in lowercase format.
    It then returns data, a list of all the possible (x,y) pairs with
    - x being the middle word of a window of 2*context_window+1 consecutive words,
    - y being a list of 2k words (k = context_window), containing the k words
    preceding x and the k words following it.
    """
    
    # Get data from list of words in text, using a context window of size k = context_window
    data = []
    for i in range(context_window, len(text) - context_window):
        target = [text[i+e] for e in range(-context_window, context_window+1) if e != 0]
        input_word = text[i]
        data.append((input_word, target))
        
    return data
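
To make the windowing concrete, here is a minimal standalone sketch applied to the first seven words of the speech. The helper name `make_pairs` is illustrative (it is not part of the notebook) and simply mirrors the logic of `text_to_train`.

```python
def make_pairs(words, k):
    """Illustrative copy of the windowing logic used in text_to_train."""
    pairs = []
    # The first and last k words have no full window, so they are never middle words
    for i in range(k, len(words) - k):
        context = [words[i + e] for e in range(-k, k + 1) if e != 0]
        pairs.append((words[i], context))
    return pairs

words = ['i', 'am', 'happy', 'to', 'join', 'with', 'you']
pairs = make_pairs(words, 2)
for center, context in pairs:
    print(center, context)
# happy ['i', 'am', 'to', 'join']
# to ['am', 'happy', 'join', 'with']
# join ['happy', 'to', 'with', 'you']
```

A list of n words yields n - 2k pairs, which is why only three pairs come out of seven words here.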

In [7]:
def create_text():
    """
    This function loads the string of text from the text.txt file,
    and produces a list of words in string format, as variable text.
    """
    
    # Load corpus from file
    with open("./text.txt", 'r', encoding="utf8") as f:
        corpus = f.readlines()
    # No explicit f.close() needed: the with block closes the file
    
    # Join corpus into a single string
    text = ""
    for s in corpus:
        l = s.split()
        for s2 in l:
            # Removes all special characters from string
            s2 = ''.join(filter(str.isalnum, s2))
            s2 += ' '
            text += s2.lower()
    text = text.split()
    
    return text
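
The cleaning step keeps only alphanumeric characters, which has a few side effects worth knowing about. The sketch below isolates that step in a hypothetical helper `clean_token` (not part of the notebook) and applies it to made-up sample tokens.

```python
def clean_token(token):
    """Illustrative helper: same filtering and lowercasing as in create_text."""
    return ''.join(filter(str.isalnum, token)).lower()

samples = ["Nation.", "don't", "one-hundred", "--", "Freedom!"]
cleaned = [clean_token(t) for t in samples]
print(cleaned)
# ['nation', 'dont', 'onehundred', '', 'freedom']
```

Apostrophes and hyphens are dropped without inserting a space, so the two halves of a hyphenated word are merged. Tokens made only of punctuation become empty strings, which disappear in the final `split()` of `create_text`.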

In [8]:
text = create_text()
print(text)

['i', 'am', 'happy', 'to', 'join', 'with', 'you', 'today', 'in', 'what', 'will', 'go', 'down', 'in', 'history', 'as', 'the', 'greatest', 'demonstration', 'for', 'freedom', 'in', 'the', 'history', 'of', 'our', 'nation', 'five', 'score', 'years', 'ago', 'a', 'great', 'american', 'in', 'whose', 'symbolic', 'shadow', 'we', 'stand', 'today', 'signed', 'the', 'emancipation', 'proclamation', 'this', 'momentous', 'decree', 'came', 'as', 'a', 'great', 'beacon', 'of', 'hope', 'to', 'millions', 'of', 'slaves', 'who', 'had', 'been', 'seared', 'in', 'the', 'flames', 'of', 'whithering', 'injustice', 'it', 'came', 'as', 'a', 'joyous', 'daybreak', 'to', 'end', 'the', 'long', 'night', 'of', 'their', 'captivity', 'but', 'one', 'hundred', 'years', 'later', 'the', 'colored', 'america', 'is', 'still', 'not', 'free', 'one', 'hundred', 'years', 'later', 'the', 'life', 'of', 'the', 'colored', 'american', 'is', 'still', 'sadly', 'crippled', 'by', 'the', 'manacle', 'of', 'segregation', 'and', 'the', 'chains', '

In [9]:
def generate_data(text, context_window):
    """
    This function receives the text and context window size.
    It produces four outputs:
    - vocab, a set containing the words found in text.txt,
    without any duplicates,
    - data, containing our (x,y) pairs for training,
    - word2index, a dictionary to convert words to their integer index,
    - index2word, a dictionary to convert integer indexes to their respective words.
    """
    
    # Create vocabulary set V
    vocab = set(text)
    
    # Word to index and index 2 word converters
    word2index = {w:i for i,w in enumerate(vocab)}
    index2word = {i:w for i,w in enumerate(vocab)}
    
    # Generate data
    data = text_to_train(text, context_window)
    
    return vocab, data, word2index, index2word
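
One caveat: the iteration order of a set of strings can differ between Python sessions, because string hashing is randomized by default. The indices assigned to words may therefore change from run to run, which matters if embeddings are saved and reloaded later. A possible variation, sketched here on a toy word list, is to sort the vocabulary before numbering it.

```python
words = ['the', 'dream', 'of', 'the', 'nation', 'of', 'freedom']

# Sorting the set makes the word -> index assignment reproducible across runs
sorted_vocab = sorted(set(words))
word2index = {w: i for i, w in enumerate(sorted_vocab)}
index2word = {i: w for w, i in word2index.items()}

print(word2index)
# {'dream': 0, 'freedom': 1, 'nation': 2, 'of': 3, 'the': 4}
```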

In [10]:
vocab, data, word2index, index2word = generate_data(text, context_window = 2)

In [11]:
print(vocab)

{'peoples', 'bodies', 'free', 'revolt', 'children', 'bad', 'discontent', 'usual', 'heavy', 'engage', 'go', 'hew', 'thank', 'crooked', 'old', 'changed', 'inalienable', 'character', 'city', 'only', 'dark', 'straight', 'nations', 'colorado', 'end', 'crippled', 'hallowed', 'happy', 'luxury', 'somehow', 'struggle', 'thee', 'society', 'righteousness', 'autumn', 'engulfed', 'i', 'able', 'spot', 'tranquilizing', 'places', 'highways', 'rolls', 'america', 'storms', 'ghettos', 'until', 'manacle', 'freedom', 'of', 'vaults', 'has', 'shadow', 'he', 'insufficient', 'nullification', 'oppression', 'rock', 'ago', 'well', 'us', 'join', 'staggered', 'millions', 'hundred', 'jail', 'sense', 'quest', 'never', 'alleghenies', 'beginning', 'give', 'sing', 'condition', 'island', 'true', 'exile', 'off', 'that', 'nor', 'every', 'white', 'content', 'rights', 'citizenship', 'note', 'louisiana', 'wallow', 'one', 'equality', 'pursuit', 'rough', 'as', 'poverty', 'would', 'bankrupt', 'nation', 'knowing', 'right', 'catho

In [12]:
print(word2index)

{'peoples': 0, 'bodies': 1, 'free': 2, 'revolt': 3, 'children': 4, 'bad': 5, 'discontent': 6, 'usual': 7, 'heavy': 8, 'engage': 9, 'go': 10, 'hew': 11, 'thank': 12, 'crooked': 13, 'old': 14, 'changed': 15, 'inalienable': 16, 'character': 17, 'city': 18, 'only': 19, 'dark': 20, 'straight': 21, 'nations': 22, 'colorado': 23, 'end': 24, 'crippled': 25, 'hallowed': 26, 'happy': 27, 'luxury': 28, 'somehow': 29, 'struggle': 30, 'thee': 31, 'society': 32, 'righteousness': 33, 'autumn': 34, 'engulfed': 35, 'i': 36, 'able': 37, 'spot': 38, 'tranquilizing': 39, 'places': 40, 'highways': 41, 'rolls': 42, 'america': 43, 'storms': 44, 'ghettos': 45, 'until': 46, 'manacle': 47, 'freedom': 48, 'of': 49, 'vaults': 50, 'has': 51, 'shadow': 52, 'he': 53, 'insufficient': 54, 'nullification': 55, 'oppression': 56, 'rock': 57, 'ago': 58, 'well': 59, 'us': 60, 'join': 61, 'staggered': 62, 'millions': 63, 'hundred': 64, 'jail': 65, 'sense': 66, 'quest': 67, 'never': 68, 'alleghenies': 69, 'beginning': 70, 'g

In [13]:
print(index2word)

{0: 'peoples', 1: 'bodies', 2: 'free', 3: 'revolt', 4: 'children', 5: 'bad', 6: 'discontent', 7: 'usual', 8: 'heavy', 9: 'engage', 10: 'go', 11: 'hew', 12: 'thank', 13: 'crooked', 14: 'old', 15: 'changed', 16: 'inalienable', 17: 'character', 18: 'city', 19: 'only', 20: 'dark', 21: 'straight', 22: 'nations', 23: 'colorado', 24: 'end', 25: 'crippled', 26: 'hallowed', 27: 'happy', 28: 'luxury', 29: 'somehow', 30: 'struggle', 31: 'thee', 32: 'society', 33: 'righteousness', 34: 'autumn', 35: 'engulfed', 36: 'i', 37: 'able', 38: 'spot', 39: 'tranquilizing', 40: 'places', 41: 'highways', 42: 'rolls', 43: 'america', 44: 'storms', 45: 'ghettos', 46: 'until', 47: 'manacle', 48: 'freedom', 49: 'of', 50: 'vaults', 51: 'has', 52: 'shadow', 53: 'he', 54: 'insufficient', 55: 'nullification', 56: 'oppression', 57: 'rock', 58: 'ago', 59: 'well', 60: 'us', 61: 'join', 62: 'staggered', 63: 'millions', 64: 'hundred', 65: 'jail', 66: 'sense', 67: 'quest', 68: 'never', 69: 'alleghenies', 70: 'beginning', 71

In [14]:
print(data)

[('happy', ['i', 'am', 'to', 'join']), ('to', ['am', 'happy', 'join', 'with']), ('join', ['happy', 'to', 'with', 'you']), ('with', ['to', 'join', 'you', 'today']), ('you', ['join', 'with', 'today', 'in']), ('today', ['with', 'you', 'in', 'what']), ('in', ['you', 'today', 'what', 'will']), ('what', ['today', 'in', 'will', 'go']), ('will', ['in', 'what', 'go', 'down']), ('go', ['what', 'will', 'down', 'in']), ('down', ['will', 'go', 'in', 'history']), ('in', ['go', 'down', 'history', 'as']), ('history', ['down', 'in', 'as', 'the']), ('as', ['in', 'history', 'the', 'greatest']), ('the', ['history', 'as', 'greatest', 'demonstration']), ('greatest', ['as', 'the', 'demonstration', 'for']), ('demonstration', ['the', 'greatest', 'for', 'freedom']), ('for', ['greatest', 'demonstration', 'freedom', 'in']), ('freedom', ['demonstration', 'for', 'in', 'the']), ('in', ['for', 'freedom', 'the', 'history']), ('the', ['freedom', 'in', 'history', 'of']), ('history', ['in', 'the', 'of', 'our']), ('of', [

In [15]:
def words_to_tensor(words: list, word2index: dict, dtype = torch.FloatTensor):
    """
    This function converts a word or a list of words into a torch tensor,
    with appropriate format.
    It reuses the word2index dictionary.
    """
    
    tensor =  dtype([word2index[word] for word in words])
    # Only move the tensor to the GPU when one is available
    if CUDA:
        tensor = tensor.cuda()
    
    return Variable(tensor)

### Step 2. Create a SkipGram model and train

Task #1: Write your own model for the SkipGram model below.

In [16]:
class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(SkipGram, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, embedding_dim)
        # One block of vocab_size scores per context position
        self.linear2 = nn.Linear(embedding_dim, 2 * context_size * vocab_size)
        self.context_size = context_size
        self.embedding_dim = embedding_dim
        self.vocab_size = vocab_size

    def forward(self, inputs):
        # inputs: tensor holding the index of a single middle word
        embeds = self.embeddings(inputs).view((1, -1))
        out1 = F.relu(self.linear1(embeds))  # output of first layer
        out2 = self.linear2(out1)            # output of second layer
        # One row of scores per context position: (2*context_size, vocab_size)
        out2 = torch.reshape(out2, (2 * self.context_size, -1))
        log_probs = F.log_softmax(out2, dim=1)
        return log_probs
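
The tensor shapes in the forward pass can be checked with a short standalone sketch. It rebuilds the same three layers with toy sizes (10 words, 4 dimensions), which are chosen only for illustration and are not the values used in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes, chosen for illustration only
vocab_size, embedding_dim, context_size = 10, 4, 2

embeddings = nn.Embedding(vocab_size, embedding_dim)
linear1 = nn.Linear(embedding_dim, embedding_dim)
linear2 = nn.Linear(embedding_dim, 2 * context_size * vocab_size)

center = torch.tensor([3], dtype=torch.long)             # index of one middle word
embeds = embeddings(center).view((1, -1))                # (1, embedding_dim)
hidden = F.relu(linear1(embeds))                         # (1, embedding_dim)
scores = linear2(hidden)                                 # (1, 2*context_size*vocab_size)
scores = torch.reshape(scores, (2 * context_size, -1))   # (2*context_size, vocab_size)
log_probs = F.log_softmax(scores, dim=1)

print(log_probs.shape)              # torch.Size([4, 10])
print(log_probs.exp().sum(dim=1))   # each row sums to 1, up to rounding
```

Each of the four rows is a separate distribution over the vocabulary, one per context position. Note that the input must be a single word index: several indices at once would be flattened by `.view((1, -1))` into a vector longer than `embedding_dim`, which `linear1` cannot accept.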


In [17]:
# Create model (the move to CUDA below is commented out, so the model stays on the CPU)
model = SkipGram(context_size = 2, embedding_dim = 20, vocab_size = len(vocab))
#model = model.cuda()
model.train()


SkipGram(
  (embeddings): Embedding(467, 20)
  (linear1): Linear(in_features=20, out_features=20, bias=True)
  (linear2): Linear(in_features=20, out_features=1868, bias=True)
)

In [18]:
# Define training parameters
learning_rate = 0.001
epochs = 50
torch.manual_seed(28)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr = learning_rate)
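
For reference, this is what a negative log-likelihood loss computes when it receives one row of log-probabilities per context position. The sketch uses plain Python and made-up probabilities over a 3-word vocabulary. The averaging over positions corresponds to the default `reduction='mean'` of `nn.NLLLoss`.

```python
import math

# Toy predicted distributions, one row per context position (made-up numbers)
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
]
targets = [0, 1]   # index of the true context word at each position

log_probs = [[math.log(p) for p in row] for row in probs]
# Pick the log-probability of the true word in each row, negate, and average
nll = -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)
print(round(nll, 4))
# 0.2899
```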

In [19]:
len(vocab)

467

In [22]:
def get_prediction(input_word, model, word2index, index2word):
    """
    Helper: returns the most likely word at each of the 2*context_size
    context positions, given the middle word as input.
    """
    device = next(model.parameters()).device
    input_idx = torch.tensor([word2index[input_word]], dtype=torch.long, device=device)
    with torch.no_grad():
        log_probs = model(input_idx)
    return [index2word[i] for i in log_probs.argmax(dim=1).tolist()]


def calc_accuracy(model, data, word2index, index2word):
    """
    Helper: fraction of context positions predicted correctly over all pairs.
    """
    correct = 0
    total = 0
    for input_word, context in data:
        prediction = get_prediction(input_word, model, word2index, index2word)
        correct += sum(p == c for p, c in zip(prediction, context))
        total += len(context)
    return correct / total


def train(data, word2index, model, epochs, loss_func, optimizer):
    # Inverse mapping, needed to turn predicted indices back into words
    index2word = {i: w for w, i in word2index.items()}
    device = next(model.parameters()).device
    losses = []
    accuracies = []
    for epoch in range(epochs):
        total_loss = 0
        for input_word, context in data:
            # SkipGram: the middle word is the input, its context words are the targets
            input_idx = torch.tensor([word2index[input_word]], dtype=torch.long, device=device)
            context_idxs = torch.tensor([word2index[w] for w in context], dtype=torch.long, device=device)
            model.zero_grad()
            log_probs = model(input_idx)  # shape: (2*context_size, vocab_size)
            loss = loss_func(log_probs, context_idxs)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        # Record loss and accuracy every 10 epochs
        if epoch % 10 == 0:
            accuracy = calc_accuracy(model, data, word2index, index2word)
            print("Epoch:", epoch, "Accuracy:", accuracy)
            accuracies.append(accuracy)
            losses.append(total_loss)

    return losses, accuracies, model

losses, accuracies, model = train(data, word2index, model, epochs, loss_function, optimizer)

Task #2: Write your own training function for the SkipGram model in the cell below. It should return a list of losses and accuracies for display later on, along with your trained model. You may also write a helper function for computing the accuracy of your model during training.

### Step 3. Visualization

In [None]:
# Display losses over time
plt.figure()
plt.plot(losses)
plt.show()

In [None]:
# Display accuracies over time
plt.figure()
plt.plot(accuracies)
plt.show()

### Questions and expected answers for the report

A. Copy and paste your SkipGram class code (Task #1 in the notebook)

B. Copy and paste your train function (Task #2 in the notebook), along with any helper functions you might have used (e.g. a function to compute the accuracy of your model after each iteration). Please also copy and paste the function call with the parameters you used for the train() function.

C. Why is the SkipGram model much more difficult to train than the CBoW? Is it problematic if it does not reach a 100% accuracy on the task it is being trained on?

D. If we were to evaluate this model by using intrinsic methods, what could be a possible approach to do so? Please submit some code that will demonstrate the performance/problems of the word embedding you have trained!

**My Answers**

C: The SkipGram model predicts the context words given the middle word as input, while CBoW predicts the middle word given the context words as input.
In SkipGram, all N context words are predicted for each middle word, and the loss is computed over these N predictions. In CBoW, the N context words are combined to predict a single middle word, so only one prediction is made per input and the loss is computed on that prediction alone.
SkipGram therefore has to produce N outputs from a single input word, which makes it more computationally costly than CBoW and more difficult to train.
It is not problematic if SkipGram does not reach 100% accuracy. The same middle word appears with different context words in the text (for example, 'in' occurs with several different contexts in the data printed above), so no model can predict every context word correctly from the middle word alone. The accuracy score is averaged across all the predicted context words, so even if it is below 100% (say 90%), the model still produces some correct context words, as well as other words of similar meaning that may also make sense linguistically.
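
The point about repeated middle words can be quantified with a small sketch. It assumes accuracy is measured as the fraction of context positions predicted correctly, and it computes the best score any predictor could reach if it sees only the middle word. The function name `best_possible_accuracy` is illustrative. The five pairs are copied from the `data` output printed earlier in the notebook.

```python
from collections import Counter, defaultdict

def best_possible_accuracy(pairs):
    """
    Upper bound on per-position accuracy for a predictor that always outputs
    the same context words for a given middle word.
    """
    counts = defaultdict(Counter)   # (middle word, position) -> context word counts
    total = 0
    for center, context in pairs:
        for pos, word in enumerate(context):
            counts[(center, pos)][word] += 1
            total += 1
    # Best strategy: always predict the most frequent word at each position
    best = sum(c.most_common(1)[0][1] for c in counts.values())
    return best / total

pairs = [
    ('in', ['you', 'today', 'what', 'will']),
    ('in', ['go', 'down', 'history', 'as']),
    ('in', ['for', 'freedom', 'the', 'history']),
    ('history', ['down', 'in', 'as', 'the']),
    ('history', ['in', 'the', 'of', 'our']),
]
print(best_possible_accuracy(pairs))
# 0.4
```

On this small subset the ceiling is 40%, because 'in' and 'history' never repeat a context word at the same position. Running the same function on the full `data` list would give the ceiling for the whole speech.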



D: Intrinsic methods of evaluation are experiments in which word embeddings are compared with human judgement on word relations.
A good word embedding should model semantic proximity and word analogies, and should be able to handle variations of the words.
We can use the cosine similarity metric, i.e. the cosine of the angle between two embedding vectors. Values close to 1 mean that the words carry close semantic meaning, while values close to 0 mean they are largely unrelated.
We then check that semantic proximity is preserved by printing the 10 words with the highest cosine similarity to a given word v, and checking that these words make sense and are similar in meaning to v.
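
A minimal sketch of this nearest-neighbour check is given below. The function name `most_similar` and the toy embedding matrix are assumptions made for illustration. With the trained model, the matrix would presumably be `model.embeddings.weight.detach()`, used together with the notebook's `word2index` and `index2word`.

```python
import torch
import torch.nn.functional as F

def most_similar(word, embedding_matrix, word2index, index2word, top_k=10):
    """Return the top_k (word, cosine similarity) pairs closest to `word`."""
    # Normalize rows to unit length so that dot products equal cosine similarities
    normed = F.normalize(embedding_matrix, dim=1)
    sims = normed @ normed[word2index[word]]        # shape: (vocab_size,)
    sims[word2index[word]] = float('-inf')          # exclude the query word itself
    top_k = min(top_k, len(index2word) - 1)
    values, indices = torch.topk(sims, top_k)
    return [(index2word[i], round(v, 4)) for i, v in zip(indices.tolist(), values.tolist())]

# Toy 2-dimensional embeddings (made-up values), standing in for the trained ones
toy_word2index = {'freedom': 0, 'liberty': 1, 'chains': 2}
toy_index2word = {i: w for w, i in toy_word2index.items()}
toy_matrix = torch.tensor([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]])

neighbours = most_similar('freedom', toy_matrix, toy_word2index, toy_index2word, top_k=2)
print(neighbours)
```

In the toy example 'liberty' comes out as the closest neighbour of 'freedom' because their vectors point in nearly the same direction. With an embedding trained on a single short speech, the neighbours may well look noisy, which is itself a useful observation for the report.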


