# Homework 9 - Descriptive Notebook

In this homework notebook, we will create and train our own SkipGram embedding, using the text of a Martin Luther King speech stored in the text.txt file.

Get familiar with the code and write a small report (2 pages max), with answers to the questions listed at the end of the notebook.

**The report must be submitted in PDF format before April 4th, 11:59 pm!**

Do not forget to write your name and student ID on the report.

You may also submit your own copy of the notebook along with the report. If you do so, please add your name and ID to the cell below.

In [1]:
# Name:
# Student ID:

### Imports needed

Note: we strongly advise using a CUDA/GPU machine for this notebook.

Technically, this can be done on CPU only, but it will be very slow!

If you decide to run it on CPU, you might also have to change some of the .cuda() calls made on torch tensors and models in this notebook!
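
One way to avoid editing every `.cuda()` call is to pick a device once and move tensors and models to it with `.to(device)`. The sketch below only illustrates that pattern and is not part of the provided code; the names `device`, `layer`, and `x` are hypothetical.

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Illustrative model and tensor (hypothetical names, not used elsewhere in the notebook).
layer = nn.Linear(4, 2).to(device)
x = torch.zeros(3, 4).to(device)

# Both live on the same device, so the forward pass works on CPU and GPU alike.
y = layer(x)
print(y.shape, y.device.type)
```

With this pattern, the same cell runs unchanged on machines with or without a GPU.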

In [2]:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import functools
import matplotlib.pyplot as plt
CUDA = torch.cuda.is_available()

### Step 1. Produce some data based on a given text for training our SkipGram model

The functions below will be used to produce our dataset for training the SkipGram model.

In [3]:
def text_to_train(text, context_window):
    """
    This function receives the text as a list of lowercase words.
    It returns data, a list of all possible (x, y) pairs, where
    - x is the middle word of a window of 2*context_window+1 consecutive words,
    - y is a list of 2k words (k = context_window), containing the k preceding
    words and the k following words.
    """
    
    # Get data from list of words in text, using a context window of size k = context_window
    data = []
    for i in range(context_window, len(text) - context_window):
        target = [text[i+e] for e in range(-context_window, context_window+1) if e != 0]
        input_word = text[i]
        data.append((input_word, target))
        
    return data
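
To see what these pairs look like, here is a minimal, self-contained sketch that mirrors the logic of `text_to_train` on a seven-word toy sentence. The helper name `skipgram_pairs` and the variable `toy` are illustrative only.

```python
def skipgram_pairs(words, k):
    """Return (center, context) pairs, mirroring text_to_train."""
    pairs = []
    for i in range(k, len(words) - k):
        # k words before and k words after position i, excluding the center word
        context = [words[i + e] for e in range(-k, k + 1) if e != 0]
        pairs.append((words[i], context))
    return pairs

toy = "i am happy to join with you".split()
pairs = skipgram_pairs(toy, 2)
for center, context in pairs:
    print(center, context)
```

Note that the first and last k words of the text never appear as a center word, because they do not have a full window on both sides.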

In [4]:
def create_text():
    """
    This function loads the string of text from the text.txt file,
    and produces a list of words in string format, as variable text.
    """
    
    # Load corpus from file
    with open("./text.txt", 'r', encoding="utf8") as f:
        corpus = f.readlines()
    
    # Join corpus into a single string
    text = ""
    for s in corpus:
        l = s.split()
        for s2 in l:
            # Removes all special characters from string
            s2 = ''.join(filter(str.isalnum, s2))
            s2 += ' '
            text += s2.lower()
    text = text.split()
    
    return text
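
The cleaning step can be tried on an in-memory string, without the text.txt file. This is a minimal sketch of the same filtering idea; the helper `clean_words` and the `sample` sentence are illustrative and not part of the notebook.

```python
def clean_words(raw):
    """Lowercase each whitespace-separated token and keep only alphanumeric characters."""
    cleaned = []
    for token in raw.split():
        word = "".join(filter(str.isalnum, token)).lower()
        if word:  # skip tokens that were pure punctuation
            cleaned.append(word)
    return cleaned

sample = "We hold these truths to be self-evident -- that all men are created equal."
words = clean_words(sample)
print(words)
```

Because hyphens are removed rather than replaced by spaces, hyphenated words are merged into a single token, which explains vocabulary entries such as `selfevident`.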

In [5]:
text = create_text()
print(text)

['i', 'am', 'happy', 'to', 'join', 'with', 'you', 'today', 'in', 'what', 'will', 'go', 'down', 'in', 'history', 'as', 'the', 'greatest', 'demonstration', 'for', 'freedom', 'in', 'the', 'history', 'of', 'our', 'nation', 'five', 'score', 'years', 'ago', 'a', 'great', 'american', 'in', 'whose', 'symbolic', 'shadow', 'we', 'stand', 'today', 'signed', 'the', 'emancipation', 'proclamation', 'this', 'momentous', 'decree', 'came', 'as', 'a', 'great', 'beacon', 'of', 'hope', 'to', 'millions', 'of', 'slaves', 'who', 'had', 'been', 'seared', 'in', 'the', 'flames', 'of', 'whithering', 'injustice', 'it', 'came', 'as', 'a', 'joyous', 'daybreak', 'to', 'end', 'the', 'long', 'night', 'of', 'their', 'captivity', 'but', 'one', 'hundred', 'years', 'later', 'the', 'colored', 'america', 'is', 'still', 'not', 'free', 'one', 'hundred', 'years', 'later', 'the', 'life', 'of', 'the', 'colored', 'american', 'is', 'still', 'sadly', 'crippled', 'by', 'the', 'manacle', 'of', 'segregation', 'and', 'the', 'chains', '

In [6]:
def generate_data(text, context_window):
    """
    This function receives the text and context window size.
    It produces four outputs:
    - vocab, a set containing the words found in text.txt,
    without any duplicates,
    - data, containing our (x,y) pairs for training,
    - word2index, a dictionary to convert words to their integer index,
    - index2word, a dictionary to convert integer indices to their respective words.
    """
    
    # Create vocabulary set V
    vocab = set(text)
    
    # Word to index and index to word converters
    word2index = {w:i for i,w in enumerate(vocab)}
    index2word = {i:w for i,w in enumerate(vocab)}
    
    # Generate data
    data = text_to_train(text, context_window)
    
    return vocab, data, word2index, index2word
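
The iteration order of a Python set of strings can differ from one interpreter session to the next, because string hashing is randomized per process by default. The indices in `word2index` may therefore change between runs. If reproducible indices matter to you (for example, to reload saved embeddings), one option is to sort the vocabulary first. This is a sketch of that variation, not the notebook's method; `build_index` is a hypothetical helper.

```python
def build_index(words):
    """Build word<->index maps from a sorted vocabulary so indices are stable across runs."""
    ordered = sorted(set(words))
    w2i = {w: i for i, w in enumerate(ordered)}
    i2w = {i: w for w, i in w2i.items()}
    return w2i, i2w

w2i, i2w = build_index(["the", "dream", "the", "freedom", "dream"])
print(w2i)
```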

In [7]:
vocab, data, word2index, index2word = generate_data(text, context_window = 2)

In [8]:
print(vocab)

{'mountainside', 'emerges', 'no', 'color', 'as', 'whithering', 'gentiles', 'slaves', 'hold', 'constitution', 'this', 'white', 'americans', 'heir', 'go', 'shadow', 'autumn', 'nullification', 'state', 'beginning', 'now', 'larger', 'security', 'but', 'veterans', 'fathers', 'some', 'sacred', 'magnificent', 'real', 'satisfied', 'later', 'created', 'selfevident', 'underestimate', 'island', 'skin', 'vicious', 'needed', 'cooling', 'blow', 'made', 'were', 'judged', 'which', 'us', 'stream', 'daybreak', 'end', 'quest', 'gradualism', 'equality', 'opportunity', 'motels', 'our', 'hilltops', 'by', 'lord', 'revealed', 'he', 'heavy', 'brothers', 'emancipation', 'engage', 'snowcapped', 'areas', 'signing', 'old', 'their', 'only', 'character', 'rough', 'freedom', 'deeply', 'discords', 'republic', 'lives', 'happiness', 'score', 'staggered', 'stone', 'interposition', 'given', 'unmindful', 'sunlit', 'peoples', 'whirlwinds', 'rolls', 'on', 'i', 'defaulted', 'today', 'demonstration', 'stripped', 'determination

In [9]:
print(word2index)

{'mountainside': 0, 'emerges': 1, 'no': 2, 'color': 3, 'as': 4, 'whithering': 5, 'gentiles': 6, 'slaves': 7, 'hold': 8, 'constitution': 9, 'this': 10, 'white': 11, 'americans': 12, 'heir': 13, 'go': 14, 'shadow': 15, 'autumn': 16, 'nullification': 17, 'state': 18, 'beginning': 19, 'now': 20, 'larger': 21, 'security': 22, 'but': 23, 'veterans': 24, 'fathers': 25, 'some': 26, 'sacred': 27, 'magnificent': 28, 'real': 29, 'satisfied': 30, 'later': 31, 'created': 32, 'selfevident': 33, 'underestimate': 34, 'island': 35, 'skin': 36, 'vicious': 37, 'needed': 38, 'cooling': 39, 'blow': 40, 'made': 41, 'were': 42, 'judged': 43, 'which': 44, 'us': 45, 'stream': 46, 'daybreak': 47, 'end': 48, 'quest': 49, 'gradualism': 50, 'equality': 51, 'opportunity': 52, 'motels': 53, 'our': 54, 'hilltops': 55, 'by': 56, 'lord': 57, 'revealed': 58, 'he': 59, 'heavy': 60, 'brothers': 61, 'emancipation': 62, 'engage': 63, 'snowcapped': 64, 'areas': 65, 'signing': 66, 'old': 67, 'their': 68, 'only': 69, 'characte

In [10]:
print(index2word)

{0: 'mountainside', 1: 'emerges', 2: 'no', 3: 'color', 4: 'as', 5: 'whithering', 6: 'gentiles', 7: 'slaves', 8: 'hold', 9: 'constitution', 10: 'this', 11: 'white', 12: 'americans', 13: 'heir', 14: 'go', 15: 'shadow', 16: 'autumn', 17: 'nullification', 18: 'state', 19: 'beginning', 20: 'now', 21: 'larger', 22: 'security', 23: 'but', 24: 'veterans', 25: 'fathers', 26: 'some', 27: 'sacred', 28: 'magnificent', 29: 'real', 30: 'satisfied', 31: 'later', 32: 'created', 33: 'selfevident', 34: 'underestimate', 35: 'island', 36: 'skin', 37: 'vicious', 38: 'needed', 39: 'cooling', 40: 'blow', 41: 'made', 42: 'were', 43: 'judged', 44: 'which', 45: 'us', 46: 'stream', 47: 'daybreak', 48: 'end', 49: 'quest', 50: 'gradualism', 51: 'equality', 52: 'opportunity', 53: 'motels', 54: 'our', 55: 'hilltops', 56: 'by', 57: 'lord', 58: 'revealed', 59: 'he', 60: 'heavy', 61: 'brothers', 62: 'emancipation', 63: 'engage', 64: 'snowcapped', 65: 'areas', 66: 'signing', 67: 'old', 68: 'their', 69: 'only', 70: 'char

In [11]:
print(data)

[('happy', ['i', 'am', 'to', 'join']), ('to', ['am', 'happy', 'join', 'with']), ('join', ['happy', 'to', 'with', 'you']), ('with', ['to', 'join', 'you', 'today']), ('you', ['join', 'with', 'today', 'in']), ('today', ['with', 'you', 'in', 'what']), ('in', ['you', 'today', 'what', 'will']), ('what', ['today', 'in', 'will', 'go']), ('will', ['in', 'what', 'go', 'down']), ('go', ['what', 'will', 'down', 'in']), ('down', ['will', 'go', 'in', 'history']), ('in', ['go', 'down', 'history', 'as']), ('history', ['down', 'in', 'as', 'the']), ('as', ['in', 'history', 'the', 'greatest']), ('the', ['history', 'as', 'greatest', 'demonstration']), ('greatest', ['as', 'the', 'demonstration', 'for']), ('demonstration', ['the', 'greatest', 'for', 'freedom']), ('for', ['greatest', 'demonstration', 'freedom', 'in']), ('freedom', ['demonstration', 'for', 'in', 'the']), ('in', ['for', 'freedom', 'the', 'history']), ('the', ['freedom', 'in', 'history', 'of']), ('history', ['in', 'the', 'of', 'our']), ('of', [

In [None]:
def words_to_tensor(words: list, word2index: dict, dtype = torch.FloatTensor):
    """
    This function converts a word or a list of words into a torch tensor,
    with appropriate format.
    It reuses the word2index dictionary.
    """
    
    tensor = dtype([word2index[word] for word in words])
    # Only move the tensor to the GPU when CUDA is available
    if CUDA:
        tensor = tensor.cuda()
    
    return Variable(tensor)
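
Keep in mind that `nn.Embedding` inputs and `nn.NLLLoss` targets must be integer (long) index tensors, so a float tensor of indices will not work there. The sketch below shows an index lookup with long indices. It assumes you choose an `nn.Embedding` layer, which the notebook does not require; `toy_word2index` is a hypothetical toy vocabulary.

```python
import torch
import torch.nn as nn

toy_word2index = {"i": 0, "have": 1, "a": 2, "dream": 3}  # hypothetical toy vocabulary

# Integer (long) indices, as required by nn.Embedding.
indices = torch.tensor([toy_word2index[w] for w in ["have", "dream"]], dtype=torch.long)

# One 5-dimensional vector per vocabulary word; the lookup returns one row per index.
embedding = nn.Embedding(num_embeddings=len(toy_word2index), embedding_dim=5)
vectors = embedding(indices)
print(indices.dtype, vectors.shape)
```

With the helper above, the equivalent is to pass `dtype = torch.LongTensor` when the result is used as indices.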

### Step 2. Create a SkipGram model and train

Task #1: Write your own model for the SkipGram model below.

In [None]:
class SkipGram(nn.Module):
    """
    Your skipgram model here!
    """
    
    def __init__(self, context_size, embedding_dim, vocab_size):
        pass

    def forward(self, inputs):
        pass

In [None]:
# Create model and move it to the GPU when CUDA is available
model = SkipGram(context_size = 2, embedding_dim = 20, vocab_size = len(vocab))
if CUDA:
    model = model.cuda()
model.train()

In [None]:
# Define training parameters
learning_rate = 0.001
epochs = 50
torch.manual_seed(28)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr = learning_rate)
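
A reminder that may help when designing your model: `nn.NLLLoss` expects log-probabilities as input (for example, the output of `F.log_softmax`) and long integer class indices as targets. The sketch below checks this on random scores. All tensors are hypothetical stand-ins and this is not a SkipGram implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(3, 6)         # hypothetical raw scores: 3 examples, 6-word vocabulary
targets = torch.tensor([1, 0, 4])  # hypothetical target word indices (long dtype)

# NLLLoss takes log-probabilities, so normalize the raw scores first.
log_probs = F.log_softmax(scores, dim=1)
nll = nn.NLLLoss()(log_probs, targets)

# CrossEntropyLoss combines log_softmax and NLLLoss, so both values agree.
ce = nn.CrossEntropyLoss()(scores, targets)
print(nll.item(), ce.item())
```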

In [None]:
def get_prediction(context, model, word2index, index2word):
    """
    This is a helper function to get prediction from our model.
    """
    
    return None

Task #2: Write your own training function for the SkipGram model in the cell below. It should return a list of losses and accuracies for display later on, along with your trained model. You may also write a helper function for computing the accuracy of your model during training.

In [None]:
def train(data, word2index, model, epochs, loss_func, optimizer):
    losses = []
    accuracies = []
    pass
    return losses, accuracies, model

losses, accuracies, model = train(data, word2index, model, epochs, loss_function, optimizer)

### Step 3. Visualization

In [None]:
# Display losses over time
plt.figure()
plt.plot(losses)
plt.show()

In [None]:
# Display accuracies over time
plt.figure()
plt.plot(accuracies)
plt.show()

### Questions and expected answers for the report

A. Copy and paste your SkipGram class code (Task #1 in the notebook)

B. Copy and paste your train function (Task #2 in the notebook), along with any helper functions you might have used (e.g. a function to compute the accuracy of your model after each iteration). Please also copy and paste the function call with the parameters you used for the train() function.

C. Why is the SkipGram model much more difficult to train than the CBoW model? Is it problematic if it does not reach 100% accuracy on the task it is being trained on?

D. If we were to evaluate this model using intrinsic methods, what would be a possible approach? Please submit some code that demonstrates the performance/problems of the word embedding you have trained!