# CS549 Machine Learning - Irfan Khan 
# Assignment 11: Word Embeddings

**Total points: 15**

This is an updated version of an assignment designed by Ex-Prof. Yang Xu, Computer Science department, SDSU. In this assignment, you will implement a simple continuous bag-of-words (CBOW) model that uses the surrounding context words to predict the target word in the middle. For background, see the PyTorch word embeddings tutorial: https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html?highlight=embeddings

You will also use pre-trained GloVe embeddings from https://nlp.stanford.edu/projects/glove/ to predict the target word in the middle.

To index into the embeddings, you must use torch.LongTensor (since the indices are integers, not floats).
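As a minimal sketch of what indexing into an embedding layer looks like, the snippet below builds a toy table and looks up three rows. The sizes and the name `toy_embeddings` are illustrative assumptions, not part of the assignment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy embedding table: 5 words, 3 dimensions each (sizes are arbitrary).
toy_embeddings = nn.Embedding(num_embeddings=5, embedding_dim=3)

# Integer indices select rows of the table; torch.long is the conventional dtype.
indices = torch.tensor([0, 2, 4], dtype=torch.long)
vectors = toy_embeddings(indices)

print(indices.dtype)   # torch.int64
print(vectors.shape)   # torch.Size([3, 3])
```

Each row of `vectors` is simply the corresponding row of `toy_embeddings.weight`.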

# Load Libraries

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from typing import List
from collections import Counter

# build_vocab function

In [2]:
def build_vocab(words: List[str]):
    vocab = Counter()
    for w in words:
        vocab[w] += 1
    return vocab

## Task 1. Prepare data
**2 points**

In this task, you should prepare your data, which is a list of tuples. Each tuple has two elements: a list of context words, and the target word.
Here both context words and target word are represented by word indices.
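To make the data layout concrete, here is a small sketch on a made-up index sequence with a window of 1 word on each side (the assignment uses 3). The names `toy_indices`, `window`, and `pairs` are hypothetical and only serve the illustration.

```python
# Toy illustration with a window of 1 word on each side (the assignment uses 3).
toy_indices = [10, 11, 12, 13, 14]
window = 1

pairs = []
for i in range(window, len(toy_indices) - window):
    left = toy_indices[i - window:i]           # words before the target
    right = toy_indices[i + 1:i + window + 1]  # words after the target
    pairs.append((left + right, toy_indices[i]))

print(pairs)
# [([10, 12], 11), ([11, 13], 12), ([12, 14], 13)]
```

Positions too close to either end have no full window, so a sequence of length `n` yields `n - 2 * window` samples.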

In [3]:
CONTEXT_SIZE = 3  # Context size (default 3): 3 words to the left of the target and 3 to the right

#In Python, enclosing a string within triple quotes (either single ' or double " quotes)
#allows you to create a multiline string literal

raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

print (raw_text) #- try it out
vocab = build_vocab(raw_text)
print (vocab) #- try it out
vocab_size = len(vocab)

print ('vocab_size',vocab_size)

word_to_idx = {word: i for i, word in enumerate(vocab)}  # index for each word in vocab
idx_to_word = list(vocab)  # list of words in the vocab

print ('word_to_idx', word_to_idx) #- try it out
print ('idx_to_word',idx_to_word) #- try it out

word_indices = [word_to_idx[w] for w in raw_text]#Every word in the raw text has an index

print ('word_indices',word_indices)
print ('len(word_indices)', len(word_indices))

#The function prepare_data creates a data sample with 6 context words and the target word between them
def prepare_data(word_indices):
    data = []
    for i in range(CONTEXT_SIZE, len(word_indices) - CONTEXT_SIZE):

        #### START YOUR CODE ####
        # Hint: You can initialize context to an empty list
        # and then use a for loop to append elements to context properly.

        """
        [..CONTEXT] target [CONTEXT..]
        """

        target = word_indices[i]

        context = []
        for j in range(i - CONTEXT_SIZE, i + CONTEXT_SIZE + 1):
            if j != i:
                context.append(word_indices[j])
        #### END YOUR CODE ####

        data.append((context, target))

    return data



['We', 'are', 'about', 'to', 'study', 'the', 'idea', 'of', 'a', 'computational', 'process.', 'Computational', 'processes', 'are', 'abstract', 'beings', 'that', 'inhabit', 'computers.', 'As', 'they', 'evolve,', 'processes', 'manipulate', 'other', 'abstract', 'things', 'called', 'data.', 'The', 'evolution', 'of', 'a', 'process', 'is', 'directed', 'by', 'a', 'pattern', 'of', 'rules', 'called', 'a', 'program.', 'People', 'create', 'programs', 'to', 'direct', 'processes.', 'In', 'effect,', 'we', 'conjure', 'the', 'spirits', 'of', 'the', 'computer', 'with', 'our', 'spells.']
Counter({'of': 4, 'a': 4, 'the': 3, 'are': 2, 'to': 2, 'processes': 2, 'abstract': 2, 'called': 2, 'We': 1, 'about': 1, 'study': 1, 'idea': 1, 'computational': 1, 'process.': 1, 'Computational': 1, 'beings': 1, 'that': 1, 'inhabit': 1, 'computers.': 1, 'As': 1, 'they': 1, 'evolve,': 1, 'manipulate': 1, 'other': 1, 'things': 1, 'data.': 1, 'The': 1, 'evolution': 1, 'process': 1, 'is': 1, 'directed': 1, 'by': 1, 'pattern':

### Expected Output

vocab_size 49

In [4]:
print(type(vocab))
print(len(vocab))
print(type(word_indices))
print(len(word_indices))

<class 'collections.Counter'>
49
<class 'list'>
62


### Expected Output

<class 'collections.Counter'><br>
49<br>
<class 'list'><br>
62

In [5]:
# Test Task 1. Do not change the code below.
data = prepare_data(word_indices)
print ('length of data', len(data))
print('data[0]:', data[0])
ctx, tgt = data[0]
print('context words:', [idx_to_word[c] for c in ctx])
print('target word:', idx_to_word[tgt])

length of data 56
data[0]: ([0, 1, 2, 4, 5, 6], 3)
context words: ['We', 'are', 'about', 'study', 'the', 'idea']
target word: to


### Expected Output

length of data 56<br>
data[0]: ([0, 1, 2, 4, 5, 6], 3)<br>
context words: ['We', 'are', 'about', 'study', 'the', 'idea']<br>
target word: to


## Task 2: Implement a CBOW model

**5 points**

In this task, you will implement a CBOW model. In the `__init__()` method, define the size of `self.embeddings` and `self.linear` properly.

The `self.linear` layer takes the sum of the embeddings of all context words as input, and its output size is `vocab_size`.
It is followed by a log-softmax activation (`nn.LogSoftmax`).

The `forward()` method has an input argument `inputs`, which holds the context word indices (in a `torch.long` tensor).
You should look up the embeddings of all context words and compute their sum (into the `embeds` variable).

Make this class definition flexible enough to accept `embed_weights` from a pre-trained model, which will be needed in Section 2.
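The tensor shapes involved in the forward pass can be sketched outside of a class as follows. This is an illustration under assumed toy sizes (`vocab_size = 10`, `embedding_dim = 4`), not the graded solution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embedding_dim = 10, 4           # toy sizes, chosen for illustration
embeddings = nn.Embedding(vocab_size, embedding_dim)
linear = nn.Linear(embedding_dim, vocab_size)
log_softmax = nn.LogSoftmax(dim=-1)

context = torch.tensor([1, 2, 3], dtype=torch.long)

looked_up = embeddings(context)             # shape (3, 4): one row per context word
summed = looked_up.sum(dim=0).view(1, -1)   # shape (1, 4): a single bag-of-words vector
log_probs = log_softmax(linear(summed))     # shape (1, 10): log-probability per vocab word

print(looked_up.shape, summed.shape, log_probs.shape)
print(log_probs.exp().sum().item())         # approximately 1.0
```

Because the output is a log-softmax, exponentiating it gives a probability distribution over the vocabulary.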

In [14]:
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim, no_of_samples, embed_weights):
        super(CBOW, self).__init__()

        #### START YOUR CODE ####
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        # Use nn.Embedding.from_pretrained() if pre-trained embed_weights are provided.
        if embed_weights is not None:
            self.embeddings = nn.Embedding.from_pretrained(embed_weights, freeze=False)
        else:
            # Hint: use nn.Embedding() if no pre-trained embeddings are provided.
            self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)
        #### END YOUR CODE ####
        
        self.act = nn.LogSoftmax(dim = -1)

    def forward(self, inputs):
        
        #### START YOUR CODE ####
        embeds = sum(self.embeddings(inputs)).view(1,-1)
        #### END YOUR CODE ####
        
        out = self.linear(embeds)
        out = self.act(out)

        return out

In [15]:
# Test Task 2. Do not change the code below
torch.manual_seed(0)

m = CBOW(10, 20, 3, embed_weights=None)
test_input = torch.tensor([1,2,3], dtype=torch.long)

test_output = m(test_input)

print('test_output.shape', test_output.shape)
print('test_output', test_output.data)

test_output.shape torch.Size([1, 10])
test_output tensor([[-1.6878, -4.2108, -5.0252, -2.9802, -3.1362, -1.5436, -1.4120, -3.2485,
         -1.6490, -4.5009]])


### Expected Output

test_output.shape torch.Size([1, 10])<br>
test_output tensor([[-1.6878, -4.2108, -5.0252, -2.9802, -3.1362, -1.5436, -1.4120, -3.2485,
         -1.6490, -4.5009]])

## Task 3. Training loop
**2 points**

In this task, you will complete the training loop. 

You should create `ctx_tensor` and `tgt_tensor` out of `ctx` and `tgt`, respectively. *Hint*: wrap `tgt` in a list before creating `tgt_tensor`, so that the resulting tensor has the shape that `nn.NLLLoss()` accepts.

`ctx_tensor` is used to compute `output`. `loss_function()` is called on `output` and `tgt_tensor` to compute the loss. `total_loss` accumulates the loss over all samples within an epoch.

We are using an embedding dimension of 50.
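The hint about wrapping `tgt` in a list can be illustrated with a standalone sketch. The probabilities below are made up for the example; `nn.NLLLoss()` receives log-probabilities of shape `(1, 4)` and a target of shape `(1,)`.

```python
import torch
import torch.nn as nn

loss_function = nn.NLLLoss()

# Log-probabilities for one sample over a 4-word vocabulary: shape (1, 4).
log_probs = torch.log(torch.tensor([[0.1, 0.2, 0.6, 0.1]]))

tgt = 2                                             # a plain Python int, as stored in `data`
tgt_tensor = torch.tensor([tgt], dtype=torch.long)  # shape (1,), matching the batch dimension

loss = loss_function(log_probs, tgt_tensor)
print(tgt_tensor.shape)   # torch.Size([1])
print(loss.item())        # -log(0.6), roughly 0.51
```

The loss is the negative log-probability assigned to the correct word, so it shrinks as the model grows more confident in the right answer.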

In [21]:
torch.manual_seed(0)

EMBEDDING_DIM = 50
model = CBOW(vocab_size, EMBEDDING_DIM, CONTEXT_SIZE*2, embed_weights=None)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
optimizer.zero_grad()
# Training
for epoch in range(1000):
    total_loss = 0
    for ctx, tgt in data:
        #### START YOUR CODE ####
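        # Build the tensors named in the task description
        ctx_tensor = torch.tensor(ctx, dtype=torch.long)
        tgt_tensor = torch.tensor([tgt], dtype=torch.long)
        output = model(ctx_tensor)
        loss = loss_function(output, tgt_tensor)
        # Accumulate the loss over all samples in this epoch
        total_loss += loss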
        #### END YOUR CODE ####

    #optimize at the end of each epoch
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    # print training information
    if epoch % 5 == 0 and epoch > 0:
        print(f'Loss within epoch {epoch}: ', total_loss.item())

Loss within epoch 5:  182.17068481445312
Loss within epoch 10:  128.0272674560547
Loss within epoch 15:  93.20354461669922
Loss within epoch 20:  69.58182525634766
Loss within epoch 25:  53.30769729614258
Loss within epoch 30:  41.98340606689453
Loss within epoch 35:  33.98306655883789
Loss within epoch 40:  28.213909149169922
Loss within epoch 45:  23.947216033935547
Loss within epoch 50:  20.706602096557617
Loss within epoch 55:  18.183002471923828
Loss within epoch 60:  16.173582077026367
Loss within epoch 65:  14.542236328125
Loss within epoch 70:  13.195314407348633
Loss within epoch 75:  12.066819190979004
Loss within epoch 80:  11.109158515930176
Loss within epoch 85:  10.287293434143066
Loss within epoch 90:  9.574957847595215
Loss within epoch 95:  8.952116012573242
Loss within epoch 100:  8.403237342834473
Loss within epoch 105:  7.916141510009766
Loss within epoch 110:  7.481141090393066
Loss within epoch 115:  7.0904388427734375
Loss within epoch 120:  6.737704277038574
Los

### Expected output:

You should observe the loss decreasing from ~180 (at epoch 5) to ~0.7 (at epoch 995).
The absolute values do not matter.



## Task 4
**1 point**

In this final task of Section 1, you need to find the index of the maximum value in the model output. *Hint*: use `torch.argmax()`.
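As a rough sketch of how `torch.argmax()` maps an output back to a word, consider the toy example below. The vocabulary `toy_idx_to_word` and the values in `toy_output` are hypothetical.

```python
import torch

toy_idx_to_word = ['apple', 'banana', 'cherry']   # hypothetical vocabulary
toy_output = torch.tensor([[-2.3, -0.2, -1.9]])   # hypothetical log-probabilities, shape (1, 3)

# Without a dim argument, argmax returns the index into the flattened tensor.
best = torch.argmax(toy_output)

print(best.item())             # 1
print(toy_idx_to_word[best])   # banana
```

Since the model output has shape `(1, vocab_size)`, the flattened index coincides with the word index.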

In [25]:
def get_predicted_word(model_output, idx_to_word):
    #### START YOUR CODE ####
    word = idx_to_word[torch.argmax(model_output)]
    #### END YOUR CODE ####

    return word

In [26]:
# Test Task. Do not change the code below
ctx_words = 'processes manipulate other things called data.'.split()
ctx_indices = [word_to_idx[w] for w in ctx_words]
ctx_tensor = torch.tensor(ctx_indices, dtype=torch.long)

out = model(ctx_tensor)
pred = get_predicted_word(out, idx_to_word)
print(f'The predicted word is: \"{pred}\"')

The predicted word is: "abstract"


### Expected Output

The predicted word is: "abstract"


In [27]:
# Test Task. Do not change the code below
ctx_words = 'we conjure the of the computer'.split()
ctx_indices = [word_to_idx[w] for w in ctx_words]
ctx_tensor = torch.tensor(ctx_indices, dtype=torch.long)

out = model(ctx_tensor)
pred = get_predicted_word(out, idx_to_word)
print(f'The predicted word is: \"{pred}\"')

The predicted word is: "spirits"


### Expected output

The predicted word is: "spirits"

## Section 2

Here we repeat the same exercise as in Section 1, but start from the pre-trained GloVe embeddings in the included text file `glove.6B.50d.txt`.
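Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up lines from an in-memory string to show the layout. The numbers are not real GloVe values, and the names `fake_file`, `toy_vocab`, and `toy_embeddings` are illustrative only.

```python
import io

# Two made-up lines in the "word v1 v2 ... vN" layout; the numbers are NOT real GloVe values.
fake_file = io.StringIO("the 0.1 -0.2 0.3\nking 0.5 0.6 -0.7\n")

toy_embeddings = {}
toy_vocab = []
for line in fake_file:
    values = line.split()
    toy_vocab.append(values[0])                                  # first token is the word
    toy_embeddings[values[0]] = [float(v) for v in values[1:]]   # the rest is the vector

print(toy_vocab)               # ['the', 'king']
print(toy_embeddings['king'])  # [0.5, 0.6, -0.7]
```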

In [28]:
def load_glove_embeddings(embedding_file):
    g_embeddings = {}
    g_vocab = []
    with open(embedding_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            g_vocab.append(word)
            vector = torch.tensor([float(val) for val in values[1:]])
            g_embeddings[word] = vector
    return g_embeddings, g_vocab

# Load pretrained GloVe embeddings and extract vocabulary
glove_file = 'glove.6B.50d.txt'
g_embeddings, g_vocab = load_glove_embeddings(glove_file)



## Test to Make Sure GloVe Embeddings are Loaded Correctly

In [29]:
print ('g_vocab_size', len(g_vocab))

g_word_to_idx = {word: i for i, word in enumerate(g_vocab)}
g_idx_to_word = list(g_vocab)

g_vocab_size, g_embedding_dim = len(g_vocab), 50

g_embedding_matrix = torch.stack(list(g_embeddings.values()))

print ('g_embedding_matrix.shape',g_embedding_matrix.shape)
# Create PyTorch embedding layer

g_embedding_layer = nn.Embedding.from_pretrained(g_embedding_matrix)
word = 'the'
g_word_index = g_vocab.index(word)

print ('g_word_index', g_word_index)
# Access pretrained embeddings for specific words
#g_word_embedding = g_embedding_layer(torch.tensor([g_word_index]))

#print('g_word_embedding',g_word_embedding)

g_vocab_size 400001
g_embedding_matrix.shape torch.Size([400001, 50])
g_word_index 0


### Expected Output

g_vocab_size 400001<br>
g_embedding_matrix.shape torch.Size([400001, 50])<br>
g_word_index 0

# Cosine similarity

**2 points**

The cosine similarity metric ranges from -1 to 1. Larger values indicate greater similarity.

*Hint*: use `F.cosine_similarity`.
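To get a feel for the range of the metric, here is a minimal sketch with hand-picked 2-D vectors (not GloVe embeddings). Cosine similarity depends only on direction, not on vector length.

```python
import torch
import torch.nn.functional as F

a = torch.tensor([[1.0, 0.0]])
b = torch.tensor([[2.0, 0.0]])    # same direction as a, different length
c = torch.tensor([[0.0, 3.0]])    # orthogonal to a
d = torch.tensor([[-1.0, 0.0]])   # opposite direction to a

print(F.cosine_similarity(a, b).item())   # 1.0
print(F.cosine_similarity(a, c).item())   # 0.0
print(F.cosine_similarity(a, d).item())   # -1.0
```

The inputs have shape `(1, 2)`, and `F.cosine_similarity` compares along `dim=1` by default, which matches the `(1, 50)` tensors returned by the embedding layer.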



In [31]:
word_king = 'king'
word_man ='man'
word_woman = 'woman'
word_queen = 'queen'


king_embedding = g_embedding_layer(torch.tensor([g_vocab.index(word_king)]))
man_embedding = g_embedding_layer(torch.tensor([g_vocab.index(word_man)]))
woman_embedding = g_embedding_layer(torch.tensor([g_vocab.index(word_woman)]))
queen_embedding = g_embedding_layer(torch.tensor([g_vocab.index(word_queen)]))

word_embedding_case_1 = king_embedding - man_embedding + woman_embedding

##Start your Code
#Obtain the cosine similarity between king-man+woman and queen in "cosine_sim1"
cosine_sim1 = F.cosine_similarity(word_embedding_case_1, queen_embedding)
#Obtain the cosine similarity between king and queen in "cosine_sim2"
cosine_sim2 = F.cosine_similarity(king_embedding, queen_embedding)

#End your code
print("Cosine sim1:", cosine_sim1.item())
print("Cosine sim2:", cosine_sim2.item())



Cosine sim1: 0.8609580993652344
Cosine sim2: 0.7839043736457825


### Expected Output

Cosine sim1: 0.8609580993652344<br>
Cosine sim2: 0.7839043140411377


In [32]:
#First we have to convert indexes from our toy vocabulary to those used in GloVe Vocabulary
#and create a new embedding_matrix that contains embeddings only for words in the smaller 
#vocabulary
import re

#print ('len(vocab)', len(vocab))
def preprocess_word(word):
    # Convert word to lowercase
    word = word.lower()
    # Remove punctuation
    word = re.sub(r'[^\w\s]', '', word)
    return word

n_embedding_matrix = torch.zeros(len(vocab), g_embedding_matrix.shape[1])
#n_embedding_layer = nn.Embedding.from_pretrained(n_embedding_matrix)
print ('g', g_embedding_matrix.shape)
print ('n', n_embedding_matrix.shape)

new_vocab_indices = {}
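for i, word in enumerate(vocab):
    word = preprocess_word(word)
    if word in g_vocab:
        #new_vocab_indices[word] = g_vocab.index(word)
        index = g_vocab.index(word)
        # Copy the GloVe vector into row i. Calling torch.tensor() on an existing
        # tensor is what triggers the copy-construct UserWarning in the output.
        n_embedding_matrix[i] = torch.tensor(g_embedding_matrix[index], dtype=torch.float)
    else:
        # Rows for words missing from GloVe stay all zeros
        print('This word not in GloVe', word)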
#n_embedding_layer = nn.Embedding.from_pretrained(n_embedding_matrix)    
#new_word_indices = [new_vocab_indices[preprocess_word(w)] for w in raw_text]        
        
#print ('new_vocab_indices',new_vocab_indices)
#print ('new_word_indices',new_word_indices)
#print ('len(new_word_indices)',len(new_word_indices)) 

g torch.Size([400001, 50])
n torch.Size([49, 50])


  n_embedding_matrix[i] = torch.tensor(g_embedding_matrix[index], dtype=torch.float)


### Expected Output

g torch.Size([400001, 50])<br>
n torch.Size([49, 50])<br>

Ignore any warnings
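One side effect of `preprocess_word` is worth noting: toy-vocabulary entries that differ only in case or trailing punctuation map to the same GloVe word, so they start from identical embedding rows. The sketch below re-declares the function so that it runs on its own; the sample words are taken from `raw_text`.

```python
import re

def preprocess_word(word):
    # Same steps as the cell above: lowercase, then drop punctuation.
    return re.sub(r'[^\w\s]', '', word.lower())

for w in ['process.', 'process', 'We', 'we', 'evolve,']:
    print(repr(w), '->', repr(preprocess_word(w)))
# 'process.' -> 'process'
# 'process' -> 'process'
# 'We' -> 'we'
# 'we' -> 'we'
# 'evolve,' -> 'evolve'
```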


# Training Loop Starting with Pre-Trained Embeddings from GloVe

**3 points**
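The `CBOW` class defined earlier passes `freeze=False` to `nn.Embedding.from_pretrained()`, so the GloVe vectors keep being updated during training. A small sketch of the difference, using a made-up 3 x 2 weight matrix:

```python
import torch
import torch.nn as nn

# Made-up pre-trained weights: 3 words, 2 dimensions.
weights = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

frozen = nn.Embedding.from_pretrained(weights)                   # freeze=True is the default
trainable = nn.Embedding.from_pretrained(weights, freeze=False)  # weights receive gradients

print(frozen.weight.requires_grad)      # False
print(trainable.weight.requires_grad)   # True
print(trainable(torch.tensor([1])))     # looks up row 1, i.e. [3., 4.]
```

With the default `freeze=True`, the optimizer would leave the embeddings untouched and only the linear layer would learn.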


In [33]:
torch.manual_seed(0)
EMBEDDING_DIM = 50
#print ('vocab_size',vocab_size)
#print ('NE Matrix.shape', n_embedding_matrix.shape)
model2 = CBOW(vocab_size, EMBEDDING_DIM, CONTEXT_SIZE*2, n_embedding_matrix)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model2.parameters(), lr=0.001)
optimizer.zero_grad()
# Training
for epoch in range(1000):
    total_loss = 0

    for ctx, tgt in data:
        #### START YOUR CODE ####
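        # Build the tensors, as in Task 3
        ctx_tensor = torch.tensor(ctx, dtype=torch.long)
        tgt_tensor = torch.tensor([tgt], dtype=torch.long)
        output = model2(ctx_tensor)
        loss = loss_function(output, tgt_tensor)
        # Accumulate the loss over all samples in this epoch
        total_loss += loss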
        #### END YOUR CODE ####

    #optimize at the end of each epoch
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    # print training information
    if epoch % 5 == 0 and epoch > 0:
        print(f'Loss within epoch {epoch}: ', total_loss.item())

Loss within epoch 5:  201.59201049804688
Loss within epoch 10:  178.2009735107422
Loss within epoch 15:  161.1248779296875
Loss within epoch 20:  146.86416625976562
Loss within epoch 25:  134.7068328857422
Loss within epoch 30:  124.19705963134766
Loss within epoch 35:  115.00506591796875
Loss within epoch 40:  106.88453674316406
Loss within epoch 45:  99.64949798583984
Loss within epoch 50:  93.15776062011719
Loss within epoch 55:  87.29912567138672
Loss within epoch 60:  81.98626708984375
Loss within epoch 65:  77.14881134033203
Loss within epoch 70:  72.72893524169922
Loss within epoch 75:  68.67851257324219
Loss within epoch 80:  64.95677947998047
Loss within epoch 85:  61.528926849365234
Loss within epoch 90:  58.364959716796875
Loss within epoch 95:  55.43872833251953
Loss within epoch 100:  52.727420806884766
Loss within epoch 105:  50.21091842651367
Loss within epoch 110:  47.87141036987305
Loss within epoch 115:  45.69314956665039
Loss within epoch 120:  43.662010192871094
Los

### Expected Output

You should observe the loss decreasing from ~201 (at epoch 5) to ~2.8 (at epoch 995).
The absolute values do not matter.


In [34]:
def new_get_predicted_word(output, idx_to_word):
    #### START YOUR CODE ####
    #Use the code from Task 4 above
    word = idx_to_word[torch.argmax(output)]
    #### END YOUR CODE ####

    return word

In [35]:
# Test Task. Do not change the code below
ctx_words = 'we conjure the of the computer'.split()
ctx_indices = [word_to_idx[w] for w in ctx_words]
ctx_tensor = torch.tensor(ctx_indices, dtype=torch.long)

out = model2(ctx_tensor)
pred = new_get_predicted_word(out, idx_to_word)
print(f'The predicted word is: \"{pred}\"')

The predicted word is: "spirits"


### Expected Output

The predicted word is: "spirits"


In [36]:
# Test Task. Do not change the code below
ctx_words = 'processes manipulate other things called data.'.split()
ctx_indices = [word_to_idx[w] for w in ctx_words]
ctx_tensor = torch.tensor(ctx_indices, dtype=torch.long)

out = model2(ctx_tensor)
pred = new_get_predicted_word(out, idx_to_word)
print(f'The predicted word is: \"{pred}\"')

The predicted word is: "abstract"


### Expected Output

The predicted word is: "abstract"

## Note:
Because our example paragraph is short and has a small vocabulary, the real benefit of pre-trained embeddings is not evident here. It would become evident with a much larger text file. The purpose of Section 2 is to familiarize you with how to use pre-trained embeddings.