<a href="https://colab.research.google.com/github/NastasiaMazur/VU_1_1/blob/main/exercises/HomeExercise3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exercise 3: Creating a Neural Model in PyTorch

In this home exercise, you will first learn how to create a neural model in PyTorch and then you will train and improve a mini-implementation of an embedding model.


---

## **Exercise 3a: Neural Network Model**

First we need to import the neural network module of PyTorch:

In [161]:
import torch
import torch.nn as nn

We can use `nn.Linear(H_in, H_out)` to create a linear layer. It takes a tensor of shape `(N, *, H_in)` and outputs a tensor of shape `(N, *, H_out)`, where `*` denotes an arbitrary number of dimensions in between. The linear layer performs the operation `Ax+b`, where `A` and `b` are initialized randomly. If we don't want the linear layer to learn the bias parameters, we can initialize our layer with `bias=False`.

In [162]:
# Create the inputs
input = torch.ones(2,3,4)
print("Input ", input)

# (N, *, H_in) -> (N, *, H_out)
linear = nn.Linear(4, 2)
linear_output = linear(input)
linear_output

Input  tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])


tensor([[[-0.6670, -0.1194],
         [-0.6670, -0.1194],
         [-0.6670, -0.1194]],

        [[-0.6670, -0.1194],
         [-0.6670, -0.1194],
         [-0.6670, -0.1194]]], grad_fn=<ViewBackward0>)

In [163]:
list(linear.parameters()) # Ax + b

[Parameter containing:
 tensor([[-0.3725, -0.3647, -0.2755, -0.1436],
         [ 0.0393,  0.2583,  0.1404, -0.1708]], requires_grad=True),
 Parameter containing:
 tensor([ 0.4893, -0.3865], requires_grad=True)]
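The parameters printed above can be used to check the `Ax+b` operation by hand. The following is a minimal sketch, not part of the original exercise: it assumes an arbitrary seed and the hypothetical names `manual_output` and `linear_no_bias`, and it relies on the fact that `nn.Linear` stores its weight with shape `(H_out, H_in)`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only to make the sketch repeatable

linear = nn.Linear(4, 2)
x = torch.ones(2, 3, 4)

# The weight has shape (H_out, H_in), so the layer computes
# x @ weight.T + bias over the last dimension of x.
manual_output = x @ linear.weight.T + linear.bias
print(torch.allclose(linear(x), manual_output))

# With bias=False the layer keeps only the weight matrix as a parameter.
linear_no_bias = nn.Linear(4, 2, bias=False)
print(linear_no_bias.bias)
print(len(list(linear_no_bias.parameters())))
```

Because the input consists only of ones, each output value is simply the sum of one weight row plus the corresponding bias, which is why all rows of the output are identical.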

Let's add an activation function:

In [164]:
sigmoid = nn.Sigmoid()
output = sigmoid(linear_output)
output

tensor([[[0.3392, 0.4702],
         [0.3392, 0.4702],
         [0.3392, 0.4702]],

        [[0.3392, 0.4702],
         [0.3392, 0.4702],
         [0.3392, 0.4702]]], grad_fn=<SigmoidBackward0>)
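As a quick sanity check, the sigmoid can also be computed by hand as `1 / (1 + exp(-z))`. The sketch below copies the two values from the linear output shown earlier (rounded to four decimals); the names `z` and `manual_sigmoid` are only illustrative.

```python
import torch
import torch.nn as nn

# Values copied (rounded) from the linear layer output shown above.
z = torch.tensor([-0.6670, -0.1194])

# Element-wise sigmoid written out explicitly.
manual_sigmoid = 1 / (1 + torch.exp(-z))
print(manual_sigmoid)

# The hand-written version should agree with the nn.Sigmoid module.
print(torch.allclose(manual_sigmoid, nn.Sigmoid()(z)))
```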

Instead of creating intermediate layers and passing variables around, we can create a sequence:

In [165]:
block = nn.Sequential(
    nn.Linear(4, 2),
    nn.Sigmoid()
)

input = torch.ones(2,3,4)
output = block(input)
output

tensor([[[0.3823, 0.6162],
         [0.3823, 0.6162],
         [0.3823, 0.6162]],

        [[0.3823, 0.6162],
         [0.3823, 0.6162],
         [0.3823, 0.6162]]], grad_fn=<SigmoidBackward0>)
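Note that the values differ from the previous cell because the `nn.Linear` inside the sequence is a new layer with its own random initialization. The sketch below (illustrative names, not part of the exercise) suggests that a `nn.Sequential` simply applies its sub-modules in order, since indexing into the block and applying the layers one by one gives the same result.

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(4, 2),
    nn.Sigmoid()
)
x = torch.ones(2, 3, 4)

# Sub-modules can be accessed by index, so the forward pass
# can be rebuilt step by step for comparison.
step_by_step = block[1](block[0](x))
print(torch.allclose(block(x), step_by_step))
```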

---

## **Exercise 3b: Word Embeddings**


Instead of using the predefined modules of `nn`, we can define our own modules and build custom neural networks. As a toy example, we will convert words to word embeddings. The preprocessing below is deliberately simple; a real application would need something more careful.

In [166]:
import string

# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold."""

training_sentence = test_sentence.translate(str.maketrans('', '', string.punctuation)).lower().split()
print(training_sentence)

['when', 'forty', 'winters', 'shall', 'besiege', 'thy', 'brow', 'and', 'dig', 'deep', 'trenches', 'in', 'thy', 'beautys', 'field', 'thy', 'youths', 'proud', 'livery', 'so', 'gazed', 'on', 'now', 'will', 'be', 'a', 'totterd', 'weed', 'of', 'small', 'worth', 'held', 'then', 'being', 'asked', 'where', 'all', 'thy', 'beauty', 'lies', 'where', 'all', 'the', 'treasure', 'of', 'thy', 'lusty', 'days', 'to', 'say', 'within', 'thine', 'own', 'deep', 'sunken', 'eyes', 'were', 'an', 'alleating', 'shame', 'and', 'thriftless', 'praise', 'how', 'much', 'more', 'praise', 'deservd', 'thy', 'beautys', 'use', 'if', 'thou', 'couldst', 'answer', 'this', 'fair', 'child', 'of', 'mine', 'shall', 'sum', 'my', 'count', 'and', 'make', 'my', 'old', 'excuse', 'proving', 'his', 'beauty', 'by', 'succession', 'thine', 'this', 'were', 'to', 'be', 'new', 'made', 'when', 'thou', 'art', 'old', 'and', 'see', 'thy', 'blood', 'warm', 'when', 'thou', 'feelst', 'it', 'cold']


Next let's find our vocabulary, i.e., all the unique words in the training data:

In [167]:
vocabulary = set(training_sentence)
vocabulary

{'a',
 'all',
 'alleating',
 'an',
 'and',
 'answer',
 'art',
 'asked',
 'be',
 'beauty',
 'beautys',
 'being',
 'besiege',
 'blood',
 'brow',
 'by',
 'child',
 'cold',
 'couldst',
 'count',
 'days',
 'deep',
 'deservd',
 'dig',
 'excuse',
 'eyes',
 'fair',
 'feelst',
 'field',
 'forty',
 'gazed',
 'held',
 'his',
 'how',
 'if',
 'in',
 'it',
 'lies',
 'livery',
 'lusty',
 'made',
 'make',
 'mine',
 'more',
 'much',
 'my',
 'new',
 'now',
 'of',
 'old',
 'on',
 'own',
 'praise',
 'proud',
 'proving',
 'say',
 'see',
 'shall',
 'shame',
 'small',
 'so',
 'succession',
 'sum',
 'sunken',
 'the',
 'then',
 'thine',
 'this',
 'thou',
 'thriftless',
 'thy',
 'to',
 'totterd',
 'treasure',
 'trenches',
 'use',
 'warm',
 'weed',
 'were',
 'when',
 'where',
 'will',
 'winters',
 'within',
 'worth',
 'youths'}

We introduce a special token, `<unk>`, to tackle the words that are out of vocabulary. We could pick another string for our unknown token if we wanted. The only requirement here is that our token should be unique: we should only be using this token for unknown words. We will also add this special token to our vocabulary.

In [168]:
vocabulary.add("<unk>")
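To illustrate what the unknown token is for, here is a minimal sketch of a lookup that falls back to `<unk>` for unseen words. It assumes a word-to-index dictionary of the kind built in the next step; the toy vocabulary and the helper `encode` are hypothetical and not part of the exercise.

```python
# Hypothetical toy vocabulary, standing in for the sonnet vocabulary.
toy_vocabulary = {"thy", "beauty", "<unk>"}
toy_ix_to_word = sorted(toy_vocabulary)
toy_word_to_ix = {word: ix for ix, word in enumerate(toy_ix_to_word)}


def encode(words, word_to_ix, unk_token="<unk>"):
    """Map words to indices, using the index of unk_token for unseen words."""
    unk_ix = word_to_ix[unk_token]
    return [word_to_ix.get(word, unk_ix) for word in words]


# 'summer' is not in the toy vocabulary, so it is mapped to '<unk>'.
print(encode(["thy", "summer", "beauty"], toy_word_to_ix))  # [2, 0, 1]
```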

Now we will create two indexes for our vocabulary, one from index to word and one from word to index, to make looking up words easier:

In [169]:
ix_to_word = sorted(vocabulary)
word_to_ix = {word: ind for ind, word in enumerate(ix_to_word)}
word_to_ix

{'<unk>': 0,
 'a': 1,
 'all': 2,
 'alleating': 3,
 'an': 4,
 'and': 5,
 'answer': 6,
 'art': 7,
 'asked': 8,
 'be': 9,
 'beauty': 10,
 'beautys': 11,
 'being': 12,
 'besiege': 13,
 'blood': 14,
 'brow': 15,
 'by': 16,
 'child': 17,
 'cold': 18,
 'couldst': 19,
 'count': 20,
 'days': 21,
 'deep': 22,
 'deservd': 23,
 'dig': 24,
 'excuse': 25,
 'eyes': 26,
 'fair': 27,
 'feelst': 28,
 'field': 29,
 'forty': 30,
 'gazed': 31,
 'held': 32,
 'his': 33,
 'how': 34,
 'if': 35,
 'in': 36,
 'it': 37,
 'lies': 38,
 'livery': 39,
 'lusty': 40,
 'made': 41,
 'make': 42,
 'mine': 43,
 'more': 44,
 'much': 45,
 'my': 46,
 'new': 47,
 'now': 48,
 'of': 49,
 'old': 50,
 'on': 51,
 'own': 52,
 'praise': 53,
 'proud': 54,
 'proving': 55,
 'say': 56,
 'see': 57,
 'shall': 58,
 'shame': 59,
 'small': 60,
 'so': 61,
 'succession': 62,
 'sum': 63,
 'sunken': 64,
 'the': 65,
 'then': 66,
 'thine': 67,
 'this': 68,
 'thou': 69,
 'thriftless': 70,
 'thy': 71,
 'to': 72,
 'totterd': 73,
 'treasure': 74,
 'trenc

👋 ⚒ How can we now look up which word is the fifth word in our index list?

In [194]:
# Your code here

fifth_word = ix_to_word[4]
fifth_word

# The list is zero-indexed and index 0 holds the OOV token '<unk>', so I took the fifth entry to be the one at index 4


'an'

We will use a very simple solution of building trigrams to train our model.

In [171]:
trigrams = [([training_sentence[i], training_sentence[i + 1]], training_sentence[i + 2])
            for i in range(len(training_sentence)-2)]
print(trigrams)

[(['when', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege'), (['shall', 'besiege'], 'thy'), (['besiege', 'thy'], 'brow'), (['thy', 'brow'], 'and'), (['brow', 'and'], 'dig'), (['and', 'dig'], 'deep'), (['dig', 'deep'], 'trenches'), (['deep', 'trenches'], 'in'), (['trenches', 'in'], 'thy'), (['in', 'thy'], 'beautys'), (['thy', 'beautys'], 'field'), (['beautys', 'field'], 'thy'), (['field', 'thy'], 'youths'), (['thy', 'youths'], 'proud'), (['youths', 'proud'], 'livery'), (['proud', 'livery'], 'so'), (['livery', 'so'], 'gazed'), (['so', 'gazed'], 'on'), (['gazed', 'on'], 'now'), (['on', 'now'], 'will'), (['now', 'will'], 'be'), (['will', 'be'], 'a'), (['be', 'a'], 'totterd'), (['a', 'totterd'], 'weed'), (['totterd', 'weed'], 'of'), (['weed', 'of'], 'small'), (['of', 'small'], 'worth'), (['small', 'worth'], 'held'), (['worth', 'held'], 'then'), (['held', 'then'], 'being'), (['then', 'being'], 'asked'), (['being', 'asked'], 'where'), (['asked', 'where'

In [172]:
class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, window_size, hidden_dim):
        super(NGramLanguageModeler, self).__init__()          # calls the constructor of the parent class (nn.Module) to initialize the model

        self.embeddings = nn.Embedding(vocab_size, embedding_dim) # creates an embedding layer;  converts input indices into dense vectors
        self.hidden_layer = nn.Sequential(                        # hidden layer of the model as a sequential container
            nn.Linear(window_size * embedding_dim, hidden_dim),
            nn.ReLU()
        )
        self.output_layer = nn.Linear(hidden_dim, vocab_size)     # creates the output layer; takes the hidden representation and produces output scores for each word in the vocabulary

    def forward(self, inputs):                            # specifies how the input data should be processed through the layers to produce the output
        embeds = self.embeddings(inputs).view((1, -1))    # looks up the embeddings of the input indices and flattens them into a single row of shape (1, window_size * embedding_dim)
        layer1 = self.hidden_layer(embeds)                # passes the flattened embeddings through the hidden layer, applying the linear transformation followed by the ReLU activation
        output = self.output_layer(layer1)                # passes the result of the hidden layer through the output layer, producing the final output scores for each word in the vocabulary
        log_probabilities = nn.functional.log_softmax(output, dim=1)  # converts the scores into log-probabilities, which is what NLLLoss expects
        return log_probabilities
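To make the tensor shapes in `forward` more concrete, the following sketch traces the embedding lookup and the flattening step in isolation. It is only an illustration under stated assumptions: the hidden layer is skipped for brevity, and the sizes (87 words, 10 dimensions) and the indices 71 and 10 (`'thy'` and `'beauty'` in the index above) are taken from this notebook.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, window_size = 87, 10, 2

embeddings = nn.Embedding(vocab_size, embedding_dim)
context_idxs = torch.tensor([71, 10], dtype=torch.long)  # 'thy', 'beauty'

# One embedding vector per context word: shape (window_size, embedding_dim).
embeds = embeddings(context_idxs)
print(embeds.shape)

# view((1, -1)) concatenates the vectors into one row: shape (1, 20).
flat = embeds.view((1, -1))
print(flat.shape)

# Simplification: map straight to vocabulary scores, without a hidden layer.
scores = nn.Linear(window_size * embedding_dim, vocab_size)(flat)
log_probs = nn.functional.log_softmax(scores, dim=1)

# Exponentiating log-probabilities gives a distribution that sums to 1.
print(log_probs.exp().sum())
```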

👋 ⚒ Now let's train the model. Try to adapt the hyperparameters so that `total_loss` is reduced. The initial settings are:

*   Dimensionality of the embeddings: 10
*   Dimensionality of the hidden layer: 128
*   Number of epochs: 10
*   Learning rate: 0.002
*   Loss function (`NLLLoss()` Negative Log Likelihood right now)
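Regarding the loss function: since the model already applies `log_softmax`, `NLLLoss()` on its output should give the same value as `CrossEntropyLoss()` applied to the raw scores. The sketch below checks this on random scores; the seed, the tensor `logits` and the target index 10 are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for repeatability

logits = torch.randn(1, 87)   # raw scores for one example over 87 words
target = torch.tensor([10])   # arbitrary gold index

# Route 1: log_softmax followed by NLLLoss (the setup used in this notebook).
nll = nn.NLLLoss()(nn.functional.log_softmax(logits, dim=1), target)

# Route 2: CrossEntropyLoss applied directly to the raw scores.
ce = nn.CrossEntropyLoss()(logits, target)

print(torch.allclose(nll, ce))
```

So switching to `CrossEntropyLoss()` would only make sense together with removing the `log_softmax` from the model's `forward` method.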

In this toy example the window size cannot be changed, since we train on trigrams (so we only have two context words for each target word).
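If one did want to experiment with other window sizes, the training pairs would have to be rebuilt as well. A possible sketch is shown below; `make_ngrams` is a hypothetical helper, not part of the exercise, that generalizes the trigram construction to any number of context words.

```python
def make_ngrams(tokens, window_size):
    """Return (context, target) pairs with window_size context words each."""
    return [
        (tokens[i:i + window_size], tokens[i + window_size])
        for i in range(len(tokens) - window_size)
    ]


tokens = ["when", "forty", "winters", "shall", "besiege"]

# window_size=2 reproduces the trigram format used in this notebook.
print(make_ngrams(tokens, 2))

# window_size=3 gives three context words per target.
print(make_ngrams(tokens, 3))
```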

The previous code cell, which defines `class NGramLanguageModeler(nn.Module)`, specifies the exact setup of the network. One thing to change in this setup could be the activation functions (ReLU and Log-Softmax right now). Others can be found [here](https://pytorch.org/docs/stable/nn.html) for sequential layers and in [nn.functional](https://pytorch.org/docs/stable/nn.functional.html) for other uses.
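As one example of such a change, the hidden layer could use `nn.Tanh()` instead of `nn.ReLU()`. The sketch below builds only that layer in isolation, with the initial sizes from the list above; the name `hidden_layer_tanh` and the random input are illustrative, and whether this actually lowers the loss would have to be tried out.

```python
import torch
import torch.nn as nn

window_size, embedding_dim, hidden_dim = 2, 10, 128

# Same structure as the hidden layer of the model, with Tanh instead of ReLU.
hidden_layer_tanh = nn.Sequential(
    nn.Linear(window_size * embedding_dim, hidden_dim),
    nn.Tanh()
)

# Random stand-in for the flattened context embeddings.
flat_embeds = torch.randn(1, window_size * embedding_dim)
hidden = hidden_layer_tanh(flat_embeds)

print(hidden.shape)
# Unlike ReLU, Tanh keeps all activations between -1 and 1.
print(hidden.min().item() >= -1, hidden.max().item() <= 1)
```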

Just for comparison, this is the initial output with the first settings of the notebook:

`See how loss decreases with each epoch:  [4.502952638980561, 4.453238930322428, 4.404419487556525, 4.356399970771992, 4.309086953644204, 4.262444785210938, 4.216389432417608, 4.170906921403598, 4.125928629816106, 4.081349758975274]
Loss of the last epoch: 4.081349758975274`




In [184]:
window_size = 2
#embedding_dim = 10
embedding_dim = 100
#hidden_dim = 128
hidden_dim = 300
num_epochs = 30

losses = []
# Negative log likelihood
loss_function = nn.NLLLoss()
ngram_model = NGramLanguageModeler(len(vocabulary), embedding_dim, window_size, hidden_dim)

# What do SGD and lr mean? What happens if you change them?
optimizer = torch.optim.SGD(ngram_model.parameters(), lr=0.002)

for epoch in range(num_epochs):
    total_loss = 0
    for context, target in trigrams:
        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        optimizer.zero_grad()

        # Step 3. Run the forward pass, getting log-probabilities over next
        # words - the tensor has shape (1, 87), where 87 is the size of the vocabulary
        probabilities = ngram_model(context_idxs)

        # Step 4. Compute your loss function. The target word (gold standard label) needs to be
        # wrapped in a tensor.
        loss = loss_function(probabilities, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss / len(trigrams))

print("See how loss decreases with each epoch: ", losses)
print("Loss of the last epoch:", losses[-1])

See how loss decreases with each epoch:  [4.487901048322694, 4.340056104997618, 4.195154086678429, 4.052556094870103, 3.9115185020244225, 3.771405781264854, 3.63199930275436, 3.492593377037386, 3.3530263457678062, 3.2127565103294575, 3.0720596229080606, 2.930731737508183, 2.789144310276065, 2.6478008538220834, 2.5069692177055156, 2.3669838304013275, 2.2284098467995634, 2.0918758100112984, 1.9580623702665345, 1.827540348061418, 1.7005750534281265, 1.5780810207392262, 1.4604113154706702, 1.3480396914271127, 1.241676820062958, 1.1414693835298573, 1.0478215756933247, 0.960797579678814, 0.88043060780099, 0.8067999749848273]
Loss of the last epoch: 0.8067999749848273
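On the question in the training cell about what SGD and `lr` mean: SGD stands for stochastic gradient descent, and `lr` is the learning rate, i.e. the step size of each update. The sketch below is an illustration on a tiny stand-in model (not the n-gram model), checking that one step of plain `torch.optim.SGD`, without momentum or weight decay, moves every parameter by `-lr * gradient`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for repeatability
lr = 0.002

model = nn.Linear(4, 2)  # tiny stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Any scalar works as a loss for this illustration.
loss = model(torch.ones(1, 4)).sum()
optimizer.zero_grad()
loss.backward()

# Expected result of a plain SGD step: parameter - lr * gradient.
expected = [p.detach() - lr * p.grad for p in model.parameters()]

optimizer.step()
print(all(torch.allclose(p.detach(), e)
          for p, e in zip(model.parameters(), expected)))
```

A larger learning rate therefore takes bigger steps per update, which can speed up training but may also overshoot.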


👋 ⚒ Implement the cosine similarity for the trained embeddings. For this you need to find the indices of the two words you wish to compare, get their embeddings, convert them into NumPy arrays, and run them through the cosine function provided.
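Before applying it to the embeddings, the cosine formula can be sanity-checked on vectors whose similarity is known in advance. This is a minimal sketch for two 1-D vectors; the function name `cosine_similarity` is an assumption and not necessarily the function provided with the exercise.

```python
import numpy as np
from numpy.linalg import norm


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return np.dot(a, b) / (norm(a) * norm(b))


# Same direction, different length: similarity 1.
print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   # 1.0
# Orthogonal vectors: similarity 0.
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))   # 0.0
# Opposite directions: similarity -1.
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0
```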

In [210]:
import torch
import numpy as np
from numpy.linalg import norm

word1 = "beauty"
word2 = "treasure"
index_word1 = word_to_ix[word1]
index_word2 = word_to_ix[word2]

# Tensor to look up specific embeddings
lookup_tensor = torch.tensor(list(word_to_ix.values()), dtype=torch.long)

# Embedding for word1 ('beauty', index 10 in the vocabulary):
lookup_embeds_word1 = ngram_model.embeddings(lookup_tensor)[index_word1]         # index 10
print(f"Embedding for word '{word1}':\n {lookup_embeds_word1}\n")

# Embedding for word2 ('treasure', index 74 in the vocabulary):
lookup_embeds_word2 = ngram_model.embeddings(lookup_tensor)[index_word2]         # index 74
print(f"Embedding for word '{word2}':\n {lookup_embeds_word2}\n")

# Convert the embedding tensor to a NumPy array:
embedding_array_word1 = lookup_embeds_word1.detach().numpy()
print(f"Embedding for word '{word1}' as NumPy array:\n {embedding_array_word1}\n")

embedding_array_word2 = lookup_embeds_word2.detach().numpy()
print(f"Embedding for word '{word2}' as NumPy array:\n {embedding_array_word2}\n")

print("__________________________________________________")

#cosine = np.dot(A,B)/(norm(A, axis=1)*norm(B))
cosine = np.dot(embedding_array_word1, embedding_array_word2) / (norm(embedding_array_word1) * norm(embedding_array_word2))

print(f"Cosine similarity between '{word1}' and '{word2}':\n{cosine}")
print(cosine)  # Why is there a difference? The output might not display all the decimal places





Embedding for word 'beauty':
 tensor([-0.3680, -0.0494,  1.2171, -1.5308, -0.6764, -0.5091,  0.6183,  1.0730,
         0.2889, -0.1775,  1.1132,  2.2134,  1.2887, -0.1574,  0.7096, -0.3561,
        -0.5816, -1.8774, -0.8243, -0.2820, -0.5314,  1.1153, -0.5679, -0.2579,
         2.5918, -0.2809,  0.5132,  0.3900,  0.5723, -0.4696, -0.6464,  0.1209,
        -0.2123, -0.6325, -0.8746,  1.0276,  0.4383,  0.7355, -0.7584,  0.4323,
         0.0372, -0.5926,  1.5436,  2.1966, -0.6841, -2.0652,  0.4222,  0.0199,
         0.4301,  0.2294,  0.4967, -1.0874,  0.1335,  1.2891, -1.8101, -1.1112,
        -0.6953, -0.9038,  1.3517,  1.9084,  0.5619, -0.1569, -0.7110, -0.4901,
         0.0235,  0.9996,  0.5041, -1.3395,  0.9625, -0.7438, -0.9277, -2.8309,
        -0.1226, -0.3207,  1.4782,  0.8456, -1.1601,  0.4863, -0.0860, -0.1175,
        -0.8490,  0.4749, -0.1351, -2.6311,  0.6834,  0.0379, -1.9022, -0.1040,
         1.0719,  0.5834,  0.8683, -1.2159,  0.7880,  0.5849, -1.3296,  1.7386,
        -0


**Comment:**

For the purposes of visualization, I have decided to use words instead of just indexes in the code, although I think it would make sense to do the opposite for a more complex program.