# Week 3 - NLP and Deep Learning

---

# Lecture 5 - Language Identification with a Feedforward Neural Network


In this exercise, you will implement the forward step of an FFNN from scratch and compare your solution to PyTorch on a small toy example that predicts the language of a given word.

It is very important that you understand the basic building blocks (input/output: how to encode your instances, the labels; the model: what the neural network consists of, how to learn its weights, how to do a forward pass for prediction). 

##  1. Representing the data

We are assuming multi-class classification tasks for the assignments of this week. The labels are: $$ y \in \{da,nl,en\}$$

We will use the same data as in week 2, from:
* English [Wookipedia](https://starwars.fandom.com/wiki/Main_Page)  
* Danish [Kraftens Arkiver](https://starwars.fandom.com/da/wiki) 
* Dutch [Yodapedia](https://starwars.fandom.com/nl/wiki)


In [26]:
def load_langid(path):
    text = []
    labels = []
    for line in open(path, encoding="utf-8"):
        tok = line.strip().split('\t')
        labels.append(tok[0])
        text.append(tok[1])
    return text, labels

wooki_train_text, wooki_train_labels = load_langid(f'langid-data/wookipedia_langid.train.tok.txt')


* a): Convert the training data into n-hot format, where each feature represents whether a **single character** is present or not. Similarly, convert the labels into numeric format. For simplicity, you can assume a closed vocabulary (only the characters in `wooki_train_text`, no unknown-character handling). Keep the original casing, and assign the character indices in order of first appearance in the training data.

  * What is the vocabulary size?
  
**Hint:** It is easier for the rest of the assignment if you store the features directly in a torch tensor ([tutorial](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py), another [introduction](https://towardsdatascience.com/an-easy-introduction-to-pytorch-for-neural-networks-3ea08516bff2)). A 2D torch tensor filled with 0's can be created with `torch.zeros(dim1, dim2, dtype=torch.float)`. Note the use of a float dtype instead of an integer one: `torch.mm` requires both operands to have the same dtype, and the weights defined later are floats.
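To make the n-hot format concrete, here is a minimal sketch on two hypothetical toy strings (`toy_text`, `char2idx_toy` and `toy_features` are illustrative names, not part of the assignment data). Each row is one instance, each column one character, and a 1 marks that the character occurs at least once.

```python
import torch

# Hypothetical toy data standing in for the real training text
toy_text = ["abca", "cd"]

# Index characters in order of first appearance
char2idx_toy = {}
for sentence in toy_text:
    for char in sentence:
        if char not in char2idx_toy:
            char2idx_toy[char] = len(char2idx_toy)

# One row per instance, one column per character; 1.0 marks presence
toy_features = torch.zeros((len(toy_text), len(char2idx_toy)), dtype=torch.float)
for row, sentence in enumerate(toy_text):
    for char in sentence:
        toy_features[row, char2idx_toy[char]] = 1.0

print(char2idx_toy)
print(toy_features)
```

Repeated characters (the second `a` in `"abca"`) do not change the result, because the encoding records presence rather than counts.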

In [27]:
import torch

all_text_combined = ''.join(wooki_train_text)
first_occurrence = {}
for char in all_text_combined:
    if char not in first_occurrence:
        first_occurrence[char] = len(first_occurrence)

# Now first_occurrence contains characters as keys and their chronological indices as values.

# Initialize the feature tensor with zeros, using the efficient mapping
features = torch.zeros((len(wooki_train_text), len(first_occurrence)), dtype=torch.float)

# Fill the feature tensor with 1s where the character is present in the sentence
for i, sentence in enumerate(wooki_train_text):
    for char in set(sentence): # We use set to avoid duplicate character counting
        features[i, first_occurrence[char]] = 1

label_to_idx = {'da': 0, 'nl': 1, 'en': 2}
labels = torch.tensor([label_to_idx[label] for label in wooki_train_labels], dtype=torch.long)  # class indices are integers

print("Vocab size:", len(first_occurrence))

Vocab size: 131


##  2. Forward pass (from scratch)

### Feedforward Neural Networks (FNNs) or MLPs

Feedforward Neural Networks (FNNs) are also called Multilayer Perceptrons (MLPs). They are the most basic type of neural network. They are called feedforward because information flows in one direction, from the input nodes through the network to the output nodes.

It is essential to understand that a neural network is a non-linear classification model which is based upon function application. Each layer in a neural network is an application of a function.

Summary (by J.Frellsen):
<img src="pics/fnn_jf.png">

You are going to implement the forward step manually on a small dataset. You will create a network following the design in the figure below (note that the input should be the same size as the number of characters found in the previous assignment, instead of 4):

<img src="pics/nn.svg">

a) How many neurons do hidden layer 1 and hidden layer 2 have? Note: the bias node is not shown in the figure, you do not have to count them for this assignment.

b) How many neurons does the output layer have? And the input layer? (Note: the figure shows only 4 input nodes, in this example your input size is defined in the previous assignment - what is the input layer size?)

c) Specify the size of layers of the feedforward neural network:

In [28]:
## helper functions to determine the input and output dimensions of each layer
input_dim = features.shape[1] # 131

hidden_dim1 = 15
hidden_dim2 = 20

output_dim = 3

d) Now initialize the layers themselves as torch tensors (do not use a `torch.nn.Module` here!). You can define the bias and the weights in separate tensors. The weights should be initialized randomly (`torch.randn((dim1, dim2), dtype=torch.float)`, see also [torch.randn](https://pytorch.org/docs/stable/generated/torch.randn.html)) and the biases can be set to 1 (`torch.ones(dim1, dtype=torch.float)`, see also [torch.ones](https://pytorch.org/docs/stable/generated/torch.ones.html)). Confirm that their sizes match your answers to `a)` and `b)` by printing the `.shape` of the tensors.


In [29]:
## define all parameters of this NN

hidden1 = torch.randn((input_dim,hidden_dim1),dtype=torch.float)
bias1 = torch.ones((1,hidden_dim1),dtype=torch.float)
hidden2 = torch.randn((hidden_dim1,hidden_dim2),dtype=torch.float)
bias2 = torch.ones((1,hidden_dim2),dtype=torch.float)
output = torch.randn((hidden_dim2,output_dim),dtype=torch.float)
output_bias = torch.ones((1,output_dim),dtype=torch.float)

# print the shapes of all parameters
print("features:", features.shape)
print("hidden1:", hidden1.shape)
print("bias1:", bias1.shape)
print("hidden2:", hidden2.shape)
print("bias2:", bias2.shape)
print("output:", output.shape)
print("output_bias:", output_bias.shape)

features: torch.Size([15000, 131])
hidden1: torch.Size([131, 15])
bias1: torch.Size([1, 15])
hidden2: torch.Size([15, 20])
bias2: torch.Size([1, 20])
output: torch.Size([20, 3])
output_bias: torch.Size([1, 3])


Now that we have defined the shape of all parameters, we are ready to "connect the dots" and build the network. 

It is instructive to break the computation of each layer down into two steps: the scores $a_1$ are obtained by a linear function, and the activation function $\sigma$ is then applied to obtain the representation $z_1$, as in:

$$ a_1 = xW_1 + b_1$$
$$ z_1 = \sigma(a_1)$$

e) Specify the entire network up to the output layer $z_3$, **up to but excluding** the final application of the softmax (the last activation function), which is provided. For matrix multiplication, [torch.mm](https://pytorch.org/docs/stable/generated/torch.mm.html) can be used. Use a tanh activation function: [torch.tanh](https://pytorch.org/docs/stable/generated/torch.tanh.html).

The exact implementation of the softmax might differ from toolkit to toolkit (due to variations in implementation details in order to obtain numerical stability). Therefore, we will use the Pytorch implementation for the softmax calculation ([torch.nn.Softmax](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html)).
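As an illustration of what such a numerical-stability detail can look like, the sketch below subtracts the row-wise maximum before exponentiating, which is one common trick (it is not necessarily what PyTorch does internally). `stable_softmax` and `toy_scores` are hypothetical names introduced here.

```python
import torch

def stable_softmax(scores):
    # Subtracting the row-wise maximum leaves the result unchanged
    # but keeps exp() from overflowing on large scores
    shifted = scores - scores.max(dim=1, keepdim=True).values
    exp_scores = torch.exp(shifted)
    return exp_scores / exp_scores.sum(dim=1, keepdim=True)

# The second row would overflow with a naive exp(scores) in float32
toy_scores = torch.tensor([[1.0, 2.0, 3.0], [1000.0, 1001.0, 1002.0]])

manual = stable_softmax(toy_scores)
reference = torch.nn.Softmax(dim=1)(toy_scores)
print(torch.allclose(manual, reference))
```

Both rows yield the same probabilities, since softmax only depends on the differences between the scores in a row.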

In [32]:
## implement the forward pass (up to but excluding the softmax)
## apply it to the training data `features` - use vectorization

z1 = torch.tanh(torch.mm(features,hidden1) + bias1)
z2 = torch.tanh(torch.mm(z1,hidden2) + bias2)
z3 = torch.mm(z2,output) + output_bias

print(z3)

tensor([[-0.8842, -2.7971, -0.3018],
        [-5.1118,  2.4100, -0.4130],
        [-3.0317, -8.0785, -0.5843],
        ...,
        [-3.1447, -1.1899, -0.4766],
        [-4.2456, -0.6283, -0.2588],
        [-2.1948, -0.1600, -0.0308]])


In [33]:
def forward_pass(x):
    input_dim = x.shape[1]
    hidden_dim1 = 15
    hidden_dim2 = 20
    output_dim = 3
    hidden1 = torch.randn((input_dim,hidden_dim1),dtype=torch.float)
    bias1 = torch.ones((1,hidden_dim1),dtype=torch.float)
    hidden2 = torch.randn((hidden_dim1,hidden_dim2),dtype=torch.float)
    bias2 = torch.ones((1,hidden_dim2),dtype=torch.float)
    output = torch.randn((hidden_dim2,output_dim),dtype=torch.float)
    output_bias = torch.ones((1,output_dim),dtype=torch.float)
    z1 = torch.tanh(torch.mm(x,hidden1) + bias1)
    z2 = torch.tanh(torch.mm(z1,hidden2) + bias2)
    z3 = torch.mm(z2,output) + output_bias
    softmax = torch.nn.Softmax(dim=1)
    return softmax(z3)

We can check that all predictions sum up to approximately 1 (hint: use `torch.sum` with `axis=1`)



In [34]:
print(torch.sum(forward_pass(features), axis=1))

tensor([1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000])



Congrats! You have made it through the manual construction of the forward pass. Note that these weights are still random, so performance is not expected to be good. Now let's compare your implementation to a set of pre-determined weights.
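Note that `forward_pass` as written draws new random weights on every call, so two calls give different predictions and the function cannot be reused with other weights. One possible variant, sketched below, passes the parameters in explicitly. `init_params`, `forward_with_params` and the `seed` argument are hypothetical additions, not part of the exercise template.

```python
import torch

def init_params(input_dim, hidden_dim1=15, hidden_dim2=20, output_dim=3, seed=0):
    """Random weights and all-ones biases, mirroring the initialization above."""
    generator = torch.Generator().manual_seed(seed)
    dims = [input_dim, hidden_dim1, hidden_dim2, output_dim]
    params = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        weight = torch.randn((d_in, d_out), generator=generator, dtype=torch.float)
        bias = torch.ones((1, d_out), dtype=torch.float)
        params.append((weight, bias))
    return params

def forward_with_params(x, params):
    """tanh on the hidden layers, softmax on the output layer."""
    hidden = x
    for weight, bias in params[:-1]:
        hidden = torch.tanh(torch.mm(hidden, weight) + bias)
    out_weight, out_bias = params[-1]
    logits = torch.mm(hidden, out_weight) + out_bias
    return torch.softmax(logits, dim=1)

# Toy input with the same number of columns as the vocabulary size above
toy_x = torch.zeros((4, 131), dtype=torch.float)
toy_params = init_params(toy_x.shape[1])
toy_probs = forward_with_params(toy_x, toy_params)
print(toy_probs.shape)
```

With the parameters held fixed, repeated calls on the same input return identical probabilities.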

##  3. Where do the weights come from?  Loading existing weights

So far, the model you used had randomly initialized weights. In this step, we will load pre-trained model weights and run the forward pass with those weights, in order to check your implementation against the predictions computed by the toolkit.

Now we are going to:
* load pretrained weights for all parameters
* apply the weights to the evaluation data
* check that your manual softmax scores match the ones obtained by the pre-trained model `model` that we will load
* convert the output to labels and calculate the accuracy score

First, let's load the pre-trained model:

In [10]:
import torch
import torch.nn as nn

# use the character indexing from assignment 3
idx2char = ['H', 'e', ' ', 'v', 'n', 'w', 't', 's', 'o', 'f', 'a', 'r', 'u', 'g', 'h', ',', 'i', 'c', 'y', 'd', 'b', 'm', 'p', 'l', 'k', '.', 'D', 'E', 'C', 'j', 'R', 'S', 'U', '1', "'", 'æ', 'å', 'q', '`', 'I', '(', ')', 'M', 'F', '-', 'x', 'K', '9', '5', 'B', 'W', 'z', 'G', 'P', 'L', '/', 'O', '6', 'T', '7', 'Z', '2', '0', 'J', 'V', 'A', 'ø', 'X', '–', 'N', 'ë', ':', '&', '3', 'Y', 'é', '4', '[', ']', '’', ';', '8', 'É', 'Æ', 'Q', '!', '—', 'ï', '°', 'ō', '\u200b', '‘', 'ń', '“', '”', '?', 'Å', '<', '>', '#', '%', '+', 'ʊ', 'ɹ', 'ə', 'ɑ', 'ö', 'à', 'á', 'è', '=', 'ü', 'Ø', '∑', '^', 'ś', 'ñ', '|', '½', '$', '«', '™', 'ó', '´', '…', '―', '»', 'ː', 'θ', '²', 'Θ']
char2idx = {'H': 0, 'e': 1, ' ': 2, 'v': 3, 'n': 4, 'w': 5, 't': 6, 's': 7, 'o': 8, 'f': 9, 'a': 10, 'r': 11, 'u': 12, 'g': 13, 'h': 14, ',': 15, 'i': 16, 'c': 17, 'y': 18, 'd': 19, 'b': 20, 'm': 21, 'p': 22, 'l': 23, 'k': 24, '.': 25, 'D': 26, 'E': 27, 'C': 28, 'j': 29, 'R': 30, 'S': 31, 'U': 32, '1': 33, "'": 34, 'æ': 35, 'å': 36, 'q': 37, '`': 38, 'I': 39, '(': 40, ')': 41, 'M': 42, 'F': 43, '-': 44, 'x': 45, 'K': 46, '9': 47, '5': 48, 'B': 49, 'W': 50, 'z': 51, 'G': 52, 'P': 53, 'L': 54, '/': 55, 'O': 56, '6': 57, 'T': 58, '7': 59, 'Z': 60, '2': 61, '0': 62, 'J': 63, 'V': 64, 'A': 65, 'ø': 66, 'X': 67, '–': 68, 'N': 69, 'ë': 70, ':': 71, '&': 72, '3': 73, 'Y': 74, 'é': 75, '4': 76, '[': 77, ']': 78, '’': 79, ';': 80, '8': 81, 'É': 82, 'Æ': 83, 'Q': 84, '!': 85, '—': 86, 'ï': 87, '°': 88, 'ō': 89, '\u200b': 90, '‘': 91, 'ń': 92, '“': 93, '”': 94, '?': 95, 'Å': 96, '<': 97, '>': 98, '#': 99, '%': 100, '+': 101, 'ʊ': 102, 'ɹ': 103, 'ə': 104, 'ɑ': 105, 'ö': 106, 'à': 107, 'á': 108, 'è': 109, '=': 110, 'ü': 111, 'Ø': 112, '∑': 113, '^': 114, 'ś': 115, 'ñ': 116, '|': 117, '½': 118, '$': 119, '«': 120, '™': 121, 'ó': 122, '´': 123, '…': 124, '―': 125, '»': 126, 'ː': 127, 'θ': 128, '²': 129, 'Θ': 130}

# the label indexes that were used during training
label2idx = {'da':0, 'nl':1, 'en':2}
idx2label = ['da', 'nl', 'en']

# This is the definition of an FNN model in PyTorch, and can mostly be ignored for now.
# We will focus on how to create Torch models in week 5
class LangId(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.input = nn.Linear(vocab_size, 15)
        self.hidden1 = nn.Linear(15, 20)
        self.hidden2 = nn.Linear(20, 3)

    def forward(self, x):
        x = torch.tanh(self.input(x))
        x = torch.tanh(self.hidden1(x))
        x = self.hidden2(x)
        return x

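# Note: on recent PyTorch versions, loading a fully pickled model may require
# passing weights_only=False to torch.load (only do this for files you trust).
lang_classifier = torch.load('model.th')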


Inspect the weights you just loaded using the `state_dict()` function of the model: 

In [11]:
lang_classifier.state_dict()


OrderedDict([('input.weight',
              tensor([[ 0.1274,  0.2723,  0.4691,  ...,  0.0754,  0.0201,  0.0813],
                      [-0.1876,  0.3465,  0.4979,  ..., -0.0436, -0.0362, -0.0866],
                      [ 0.1779,  0.3311,  0.3578,  ..., -0.0705,  0.0656,  0.0415],
                      ...,
                      [-0.0264,  0.2019,  0.1753,  ...,  0.0335,  0.0764,  0.0222],
                      [-0.0810, -0.3535, -0.1255,  ..., -0.0645,  0.0299,  0.0438],
                      [ 0.0740, -0.1535,  0.1290,  ..., -0.0464, -0.0612,  0.0650]])),
             ('input.bias',
              tensor([ 0.4091,  0.8057,  0.4696,  0.3282,  0.4459, -0.3094, -0.7575, -0.3531,
                      -0.3175,  0.2946,  0.7420,  0.1358,  0.1037, -0.2193, -0.3283])),
             ('hidden1.weight',
              tensor([[ 0.2575,  0.6953,  0.5631,  0.2704, -0.3716, -0.9438, -0.4709, -0.9932,
                       -0.7564, -0.0925, -2.2822,  0.2297,  0.2956,  0.0241, -1.9843],
            

* a) Convert the following dev data into the input format for the neural network above. 

**Hint:** The character indices are based on the order in the training data and must be the same for the development data. We provide the `idx2char` and `char2idx` mappings that were used to train the model in the code above.

In [13]:
# a)
wooki_dev_text, wooki_dev_labels = load_langid('langid-data/wookipedia_langid.dev.tok.txt')

# Initialize the feature tensor with zeros, using the efficient mapping
dev_features = torch.zeros((len(wooki_dev_text), len(char2idx)), dtype=torch.float)

# Fill the feature tensor with 1s where the character is present in the sentence
for i, sentence in enumerate(wooki_dev_text):
    for char in set(sentence): # We use set to avoid duplicate character counting
        dev_features[i, char2idx[char]] = 1

dev_labels = torch.tensor([label_to_idx[label] for label in wooki_dev_labels], dtype=torch.long)  # class indices are integers


* b) Run a forward pass on the dev data with `lang_classifier`, using the forward() function

* c) Apply your manual implementation of the forward pass to the evaluation data by using the parameters (weights) you just loaded with `state_dict()`. This allows you to check if you get the same results back as the model implemented in Torch. If the outputs match, you implemented the forward pass correctly, congratulations!

**Hint**: internally, the torch model stores each weight matrix in transposed form for efficiency reasons. This means that W1 will have the shape (15,131). To use your previous implementation, you have to call the transpose function in PyTorch ([`.t()`](https://pytorch.org/docs/stable/generated/torch.t.html)), which will convert the shape to (131,15).

* d) Now apply softmax to the resulting scores and convert the output to label predictions.
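The hint about transposed weights can be checked on a small layer before touching the loaded model. The sketch below uses a hypothetical toy `nn.Linear` (`toy_layer`, `toy_x`) and compares its output with a manual `torch.mm` on the transposed weight.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_layer = nn.Linear(in_features=4, out_features=2)
toy_x = torch.randn(3, 4)

# nn.Linear stores its weight as (out_features, in_features)
print(toy_layer.weight.shape)

W = toy_layer.weight.detach().t()   # shape (4, 2)
b = toy_layer.bias.detach()         # shape (2,), broadcast over the rows
manual = torch.mm(toy_x, W) + b

print(torch.allclose(manual, toy_layer(toy_x), atol=1e-6))
```

The bias is one-dimensional, so transposing it has no effect; only the weight matrices need `.t()`.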

In [14]:
# b)

lang_classifier.forward(dev_features)

tensor([[-2.3604,  0.7146,  0.3900],
        [-2.2557,  1.4831, -0.5462],
        [-2.0805,  1.3734, -0.5807],
        ...,
        [ 2.0050, -1.2380, -1.6113],
        [ 2.2639, -1.3041, -1.8159],
        [-1.7680, -1.5234,  2.6722]], grad_fn=<AddmmBackward0>)

In [15]:
lang_classifier.state_dict().keys()

odict_keys(['input.weight', 'input.bias', 'hidden1.weight', 'hidden1.bias', 'hidden2.weight', 'hidden2.bias'])

In [16]:
# starting point for c)
W1 = lang_classifier.state_dict()['input.weight'].t()
B1 = lang_classifier.state_dict()['input.bias'].t()
W2 = lang_classifier.state_dict()['hidden1.weight'].t()
B2 = lang_classifier.state_dict()['hidden1.bias'].t()
W3 = lang_classifier.state_dict()['hidden2.weight'].t()
B3 = lang_classifier.state_dict()['hidden2.bias'].t()

z1 = torch.tanh(torch.mm(dev_features,W1) + B1)
z2 = torch.tanh(torch.mm(z1,W2) + B2) 
z3 = torch.mm(z2,W3) + B3
print(z3)

tensor([[-2.3604,  0.7146,  0.3900],
        [-2.2557,  1.4831, -0.5462],
        [-2.0805,  1.3734, -0.5807],
        ...,
        [ 2.0050, -1.2380, -1.6113],
        [ 2.2639, -1.3041, -1.8159],
        [-1.7680, -1.5234,  2.6722]])


In [19]:
# d)
softmax = torch.nn.Softmax(dim=1) 
preds = [torch.argmax(x) for x in softmax(z3)]
print("Accuracy:",(sum(a == b for a, b in zip(preds, dev_labels))/len(preds)).item())

Accuracy: 0.8136666417121887
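The accuracy can also be computed without Python loops. The sketch below uses hypothetical toy logits and gold labels (`toy_logits`, `toy_gold`), not the dev data. It also shows that taking the argmax of the raw scores gives the same predictions as taking it after the softmax, since softmax preserves the ordering within each row.

```python
import torch

# Hypothetical scores for 4 instances and 3 classes
toy_logits = torch.tensor([[2.0, 0.5, -1.0],
                           [0.1, 0.2, 3.0],
                           [-1.0, 1.5, 0.0],
                           [0.3, 0.2, 0.1]])
toy_gold = torch.tensor([0, 2, 0, 0])

probs = torch.softmax(toy_logits, dim=1)
pred_from_probs = probs.argmax(dim=1)
pred_from_logits = toy_logits.argmax(dim=1)

# Fraction of instances where the prediction equals the gold label
accuracy = (pred_from_probs == toy_gold).float().mean().item()
print(pred_from_probs.tolist(), accuracy)
```

Three of the four toy predictions match the gold labels, so the accuracy here is 0.75.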


# Lecture 6 - What do word embeddings represent?
In the following exercises, you are going to explore what is represented with word embeddings. You are going to make use of the python gensim package and two sets of pre-trained embeddings. The embeddings can be downloaded from:

* http://itu.dk/people/robv/data/embeds/twitter.bin.tar.gz
* http://itu.dk/people/robv/data/embeds/GoogleNews-50k.bin.tar.gz

The first embeddings are skip-gram embeddings trained on a collection of 2 billion words from English tweets collected during 2012 and 2018 with the default settings of word2vec. The second embeddings are trained on 100 billion words from Google News. They have both been truncated to the most frequent 500,000 words. Note that loading each of these embeddings requires approximately 2GB of RAM.

The embeddings can be loaded in gensim as follows:

In [21]:
#!tar xvzf twitter.bin.tar.gz
#!tar xvzf GoogleNews-50k.bin.tar.gz

GoogleNews-50k.bin


In [34]:
import gensim.models

twitEmbs = gensim.models.KeyedVectors.load_word2vec_format(
                                'twitter.bin', binary=True)
googEmbs = gensim.models.KeyedVectors.load_word2vec_format(
                                'GoogleNews-50k.bin', binary=True)
print('loading finished')

loading finished


You can now use the index operator ``[]`` or the function ``get_vector()`` to access the individual word embeddings.

In [25]:
twitEmbs['cat']

array([ 4.64285821e-01,  2.37979457e-01, -4.24226150e-02, -4.35831666e-01,
       -4.06450212e-01, -1.43117514e-02,  1.22334510e-01, -5.59092343e-01,
        1.23332568e-01,  2.36625358e-01,  3.58797014e-02, -9.40739065e-02,
       -2.04128489e-01, -1.81295779e-02, -1.08792759e-01, -2.70818472e-01,
        1.05479717e-01,  1.37095019e-01,  1.79271579e-01,  2.91243941e-01,
       -5.87746739e-01,  2.90462654e-02,  6.89281642e-01, -1.80917114e-01,
       -2.57750720e-01, -2.01395631e-01, -5.16403615e-01,  5.85804135e-03,
       -1.67768478e-01,  2.17095211e-01,  2.22494245e-01,  1.56742647e-01,
       -3.60864878e-01,  3.94283593e-01,  8.04448500e-03,  1.11518592e-01,
       -1.85592070e-01, -1.16088443e-01,  3.24357510e-01,  4.00876179e-02,
        9.14092362e-02, -1.04118213e-01, -6.89513862e-01,  1.54412836e-01,
        4.57625002e-01,  2.55037360e-02, -3.84058757e-03,  7.12698698e-02,
       -2.25590184e-01, -1.96693689e-01, -3.88458431e-01, -2.27625713e-01,
        6.94357634e-01, -

## 4. Word similarities
Cosine similarity can be used to measure how similar two word vectors are. It is defined as:
\begin{equation}
\cos(\vec{a},\vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| |\vec{b}|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}
\end{equation}
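
As a small worked example of the formula, take the illustrative vectors $\vec{a} = (1, 2, 2)$ and $\vec{b} = (2, 1, 2)$ (chosen here only for easy arithmetic):

\begin{equation}
\frac{\vec{a} \cdot \vec{b}}{|\vec{a}| |\vec{b}|} = \frac{1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2}{\sqrt{1 + 4 + 4} \sqrt{4 + 1 + 4}} = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.889
\end{equation}

The corresponding distance is $1 - 8/9 = 1/9 \approx 0.111$.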

* a) Implement the cosine similarity using pure Python (only the ``math`` package is allowed). Note that `similarity == 1 - distance`.

You can compare your scores to the gensim implementation to check whether they are correct. The following code should print the same value twice:

In [32]:
#a)
import math

def cosine(vec1, vec2):
    """Return the cosine distance (1 - cosine similarity) between two vectors."""
    magnitude_vec1 = math.sqrt(sum(a**2 for a in vec1))
    magnitude_vec2 = math.sqrt(sum(b**2 for b in vec2))
    if magnitude_vec1 == 0 or magnitude_vec2 == 0:
        # Cosine similarity is undefined for a zero vector; treat it as 0 (distance 1)
        return 1.0
    dot_product = sum(a*b for a, b in zip(vec1, vec2))
    similarity = dot_product / (magnitude_vec1 * magnitude_vec2)
    return 1 - similarity

# gensim's distance() is 1 - similarity, so both lines should print the same value
print("{:.7f}".format(twitEmbs.distance('cat', 'dog')))
print("{:.7f}".format(cosine(twitEmbs['cat'], twitEmbs['dog'])))

0.1044651
0.1044651


In WordNet, the distance between two senses can be based on their distance in the taxonomy. The most common metric for this is:

* Wu-Palmer Similarity: denotes how similar two word senses are, based on the depth of the two senses in the taxonomy and of their Least Common Subsumer (most specific ancestor node).

It can be obtained in python like this:

In [33]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

first_word = wordnet.synsets('cat')[0] #0 means: most common sense
second_word = wordnet.synsets('dog')[0]
print('WordNet similarity: ' + str(first_word.wup_similarity(second_word)))

print('Twitter similarity: ' + str(twitEmbs.similarity('cat', 'dog')))


[nltk_data] Downloading package wordnet to
[nltk_data]     /home/alanispani/nltk_data...


WordNet similarity: 0.8571428571428571
Twitter similarity: 0.8955349
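
To make the Wu-Palmer definition concrete, the sketch below computes `2 * depth(LCS) / (depth(s1) + depth(s2))` on a hypothetical toy taxonomy (`toy_parent` is invented for illustration). Real WordNet senses can have several hypernyms, and NLTK handles such details differently, so the numbers will not match the WordNet output.

```python
# Hypothetical toy taxonomy: child -> parent (the root has parent None)
toy_parent = {
    "entity": None,
    "animal": "entity",
    "carnivore": "animal",
    "feline": "carnivore",
    "canine": "carnivore",
    "cat": "feline",
    "dog": "canine",
}

def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = toy_parent[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wu_palmer(node1, node2):
    ancestors1 = set(path_to_root(node1))
    # The first shared node on the way up from node2 is the least common subsumer
    lcs = next(n for n in path_to_root(node2) if n in ancestors1)
    return 2 * depth(lcs) / (depth(node1) + depth(node2))

print(wu_palmer("cat", "dog"))
```

In this toy tree, `cat` and `dog` both have depth 5 and their least common subsumer `carnivore` has depth 3, giving 6/10 = 0.6. The deeper the shared ancestor, the closer the score gets to 1.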




* b) Think of 5 word pairs which, in your opinion, have a high similarity. Compare the similarity of these pairs in WordNet as well as in the Twitter embeddings and the Google News embeddings. Which method is closest to your own intuition? (You are allowed to use the gensim implementation of cosine similarity here.)


In [42]:
list1 = ["Joy", "Quick", "Ocean", "Intelligent", "House"]
list2 = ["Happiness", "Fast", "Sea", "Smart", "Home"]

for one, two in zip(list1, list2):
    first_word = wordnet.synsets(one)[0]
    second_word = wordnet.synsets(two)[0]
    print(f'WordNet similarity of {one} to {two}: ' + str(first_word.wup_similarity(second_word)))
    print(f'Twitter similarity of {one} to {two}: ' + str(twitEmbs.similarity(one, two)))
    print(f'Google similarity of {one} to {two}: ' + str(googEmbs.similarity(one, two)))



WordNet similarity of Joy to Happiness: 0.8
Twitter similarity of Joy to Happiness: 0.5528852
Google similarity of Joy to Happiness: 0.27504772
WordNet similarity of Quick to Fast: 0.11764705882352941
Twitter similarity of Quick to Fast: 0.6936942
Google similarity of Quick to Fast: 0.43288276
WordNet similarity of Ocean to Sea: 0.8
Twitter similarity of Ocean to Sea: 0.76649284
Google similarity of Ocean to Sea: 0.50030446
WordNet similarity of Intelligent to Smart: 0.16666666666666666
Twitter similarity of Intelligent to Smart: 0.7548678
Google similarity of Intelligent to Smart: 0.5120937
WordNet similarity of House to Home: 0.35294117647058826
Twitter similarity of House to Home: 0.64104354
Google similarity of House to Home: 0.22996894


## 5. Analogies

Analogies have often been used to demonstrate the power of word embeddings. Analogies have the form ``A :: B : C :: D``. In this setting `A`, `B` and `C` are usually given and the fourth term `D` is extracted from the embeddings by using ``3cosadd``:

\begin{equation}
\underset{d}{\mathrm{argmax}} (\cos (d, c) - \cos (d, a) + \cos (d, b))
\end{equation}

You can query analogies with gensim:

In [43]:
# Man is to king as woman is to ...?
twitEmbs.most_similar(positive=['woman', 'king'], negative=['man'], 
                                                         topn=10)

[('queen', 0.8401796221733093),
 ('goddess', 0.7309160232543945),
 ('king…', 0.7233694195747375),
 ('princess', 0.715788722038269),
 ('kings', 0.707615852355957),
 ('godess', 0.6952609419822693),
 ('Queen', 0.6902579069137573),
 ('queen,', 0.6876209378242493),
 ('quee…', 0.68569016456604),
 ('queens', 0.6832401752471924)]
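
For intuition, the ``3cosadd`` formula can also be applied by hand. The sketch below follows the formula directly on hypothetical 3-dimensional toy embeddings (`toy_embs`, chosen by hand so that the dimensions loosely mean royalty, male and female). It mirrors the formula rather than gensim's internals.

```python
import math

# Hypothetical toy embeddings, chosen by hand for illustration
toy_embs = {
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "boy":   [0.0, 1.0, 0.1],
    "apple": [0.1, 0.1, 0.1],
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def three_cos_add(a, b, c, embs):
    """Rank candidates d for the analogy a :: b : c :: d, excluding the query words."""
    scores = {
        d: cos(vec, embs[c]) - cos(vec, embs[a]) + cos(vec, embs[b])
        for d, vec in embs.items()
        if d not in (a, b, c)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Man is to king as woman is to ...?
ranking = three_cos_add("man", "king", "woman", toy_embs)
print(ranking[0][0])
```

The query words themselves are excluded from the candidates; otherwise `king` or `woman` would often rank first.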

``3cosadd`` can be used to solve semantic as well as syntactic analogies:

| Semantic            |                                      |
|---------------------|--------------------------------------|
| Country-capital     | Denmark :: Copenhagen : England :: X |
| Family-relations    | boy :: girl : he :: X                |
| Object-color        | sky :: blue : grass :: X             |

| Syntactic           |                                      |
|---------------------|--------------------------------------|
| Comparatives        | nice :: nicer : good :: X            |
| Present-past tense  | work :: worked : drink :: X          |
| Country-nationality | Brazil :: Brazilian : Denmark :: X   |


Try the analogies from the table. Is the correct answer returned for all queries? 
If not: are the answers at least ranked high?

* a) Think of another category of *semantic* analogies that might be encoded in the embeddings and test this empirically by thinking of 5 example analogies. Which embeddings are better at predicting your category (Twitter versus Google News)?

* b) Think of another category of *syntactic* analogies that might be encoded in the embeddings and test this empirically by thinking of 5 example analogies. Which embeddings are better at predicting your category (Twitter versus Google News)?


In [65]:
def cosadd(vec1, vec2, vec3):
    tw = twitEmbs.most_similar(positive=[vec2, vec3], negative=[vec1], topn=2)
    go = googEmbs.most_similar(positive=[vec2, vec3], negative=[vec1], topn=2)
    return tw, go 
print(cosadd('Denmark', 'Copenhagen', 'England'))
print(cosadd('boy', 'girl', 'he'))
print(cosadd('sky', 'blue', 'grass'))
print(cosadd('nice', 'nicer', 'good'))
print(cosadd('work', 'worked', 'drink'))
print(cosadd('Brazil', 'Brazilian', 'Denmark'))

([('Dublin', 0.7911565899848938), ('London', 0.7881398797035217)], [('London', 0.4688378870487213), ('Twickenham', 0.45996522903442383)])
([('she', 0.9052495360374451), ('she…', 0.782346785068512)], [('she', 0.7908554673194885), ('She', 0.6058069467544556)])
([('mustard', 0.7218776941299438), ('green', 0.7193830609321594)], [('brown', 0.506827712059021), ('Bermuda_grass', 0.5015184283256531)])


([('better', 0.7841683626174927), ('worse', 0.7769781351089478)], [('better', 0.7236929535865784), ('worse', 0.5950594544410706)])
([('drank', 0.8187708854675293), ('shotgunned', 0.7269829511642456)], [('drinks', 0.6172131299972534), ('drank', 0.5977991819381714)])
([('Romanian', 0.8317359089851379), ('Bulgarian', 0.8283323645591736)], [('Danish', 0.8729315996170044), ('Swedish', 0.7290647625923157)])


In [73]:
# a) Semantic analogies

print(cosadd('doctor', 'hospital', 'teacher'))
print(cosadd('lion', 'roars', 'cat'))
print(cosadd('painter', 'brush', 'writer'))
print(cosadd('chef', 'kitchen', 'pilot'))
print(cosadd('fish', 'ocean', 'bird'))

# Google wins by 1

([('student', 0.7544766068458557), ('school', 0.7415605187416077)], [('elementary', 0.6052683591842651), ('school', 0.5762953758239746)])
([('squeaks', 0.7607443928718567), ('bursts', 0.7025244235992432)], [('roar', 0.4860982298851013), ('meows', 0.4726291596889496)])
([('screen', 0.6950099468231201), ('screen,', 0.692493736743927)], [('voiceover_narration', 0.4480317533016205), ('cameras', 0.4420016407966614)])


([('cockpit', 0.6811436414718628), ('sprinkler', 0.6798239350318909)], [('cockpit', 0.46297597885131836), ('airplane', 0.4398004710674286)])
([('aurora', 0.6533317565917969), ('owl', 0.6438174843788147)], [('Atlantic_Ocean', 0.5046343207359314), ('Pacific_Ocean', 0.49234429001808167)])


In [70]:
# b) Syntactic analogies

print(cosadd('walk', 'walked', 'jump'))
print(cosadd('happy', 'happily', 'slow'))
print(cosadd('mouse', 'mice', 'goose'))
print(cosadd('swim', 'swimmer', 'teach')) # Google
print(cosadd('quick', 'quicker', 'cold'))

# Google wins by 1

([('jumped', 0.8450671434402466), ('flipped', 0.8145832419395447)], [('jumped', 0.7264297008514404), ('leaped', 0.6826007962226868)])
([('slowed', 0.5967739224433899), ('idly', 0.595432698726654)], [('painfully_slow', 0.5080717206001282), ('glacial_pace', 0.5063737034797668)])
([('geese', 0.6491351127624512), ('hares', 0.6470092535018921)], [('geese', 0.6305144429206848), ('Canada_geese', 0.5270898342132568)])
([('teacher,', 0.6527166366577148), ('teaching', 0.6240153908729553)], [('teaches', 0.60946124792099), ('taught', 0.6061096787452698)])
([('colder', 0.8040951490402222), ('warmer', 0.7245747447013855)], [('colder', 0.7085322737693787), ('warmer', 0.5902280807495117)])


# Bonus: Learning Word Embeddings


So far you've learned about distributional semantics (vector semantics) in both the traditional and modern neural way, and you qualitatively worked with pre-trained (off-the-shelf) word embeddings in the last assignment.

In this assignment, you will learn how to implement a neural network to learn word embeddings, namely the *Continuous Bag of Words* (CBOW) model for word embeddings. More specifically, you will:

* learn how to represent text for window-based language modeling
* learn how to design a PyTorch model (`torch.nn.Module`)
* learn how to implement an FNN for learning embeddings with CBOW which *sums* the context embedding vectors
* train the model for a few epochs using stochastic gradient descent (SGD)
* read off the learned embeddings $W$, store them in a gensim-readable file and inspect them


### CBOW



CBOW is a model proposed by [Mikolov et al., 2013](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf).

It is a simple neural method to learn word embeddings and it is one of the two core algorithms in the `word2vec` toolkit (see figure below). Note that, besides its usage here to learn word embeddings, CBOW is also used more generally to refer to any input representation that aggregates a set of word embeddings in some way; hence its name, the continuous BOW representation. You can in fact use a similar representation (e.g., the average of the embeddings of the words) for other tasks as well, such as text classification. Here, CBOW is meant in its original formulation: a network over the *sum* of the embeddings of the context words, aimed at predicting the middle target word. It is related in spirit to a language model, but is instead framed as a classification task (with context available on both sides) and hence bears more similarity to a *[cloze test](https://en.wikipedia.org/wiki/Cloze_test)*.

Illustration of the CBOW model (in comparison to the skip-gram):
<img src="pics/cbow-vs-skipgram.png">

##  Bonus 6. Representing the data

Given a corpus, extract the training data for the CBOW model using a window size of 2 words on each side of the target word. The following image shows what the input of the training algorithm (`Input`) should look like (`Training window`):


<img src="pics/cbow-window.jpg">

Hints:
* Remember to `"<pad>"` the input when the window size is smaller than the expected window size. This also means that the `"<pad>"` token should be in the vocabulary; reserve the first `0` index for this special token.
* In Pytorch, all input is expected to be a `torch.tensor`. You can create these beforehand with `torch.zeros()`, or just convert a resulting python list by using `torch.tensor(train_data)`.

Example:

Given the following tiny corpus:
```
tiny_corpus = ["this is an example", "this is a longer example sentence", "I love deep learning"]
```

To create the `train_X` data, you first need to extract n-gram windows and the target words:

```
label,context
this ['<pad>', '<pad>', 'is', 'an']
is ['<pad>', 'this', 'an', 'example']
an ['this', 'is', 'example', '<pad>']
...
```

And convert them into numeric format, where each word token is represented by its unique index:

```
train_labels = [ 1,  2,  3,  4,  1,  2,  5,  6,  4,  7,  8,  9, 10, 11]
train_data = [[ 0,  0,  2,  3],
 [ 0,  1,  3,  4],
 [ 1,  2,  4,  0],
 [ 2,  3,  0,  0],
 [ 0,  0,  2,  5],
 [ 0,  1,  5,  6],
 [ 1,  2,  6,  4],
 [ 2,  5,  4,  7],
 [ 5,  6,  7,  0],
 [ 6,  4,  0,  0],
 [ 0,  0,  9, 10],
 [ 0,  8, 10, 11],
 [ 8,  9, 11,  0],
 [ 9, 10,  0,  0]]
```

In [16]:
tiny_corpus = ["this is an example", "this is a longer example sentence", "I love deep learning"]

In [37]:
tiny_corpus = []
with open('sample.txt') as f:
    for line in f.readlines():
        tiny_corpus.append(line.strip())

Suggestion: Implement all your steps first on the `tiny_corpus` data. Then test your implementation on the provided data `sample.txt`.

In [38]:
## global settings
PAD = "<PAD>"
window_size=2

### your code here
def generate_ngrams(sentence, n):
    tokens = [PAD]*n + sentence.split() + [PAD]*n
    ngrams = [([tokens[i-j-1] for j in reversed(range(n))] + [tokens[i+j+1] for j in range(n)], tokens[i])
              for i in range(n, len(tokens)-n)]
    return ngrams

# Tokenize the sentences and create a vocabulary
tokens = [word for sentence in tiny_corpus for word in sentence.split()]

word2idx = {PAD: 0}

# Assign indices starting from 1 in the order they appear in the corpus
for word in tokens:
    if word not in word2idx:
        word2idx[word] = len(word2idx)

ngrams = [generate_ngrams(sentence, window_size) for sentence in tiny_corpus]
ngrams = [item for sublist in ngrams for item in sublist]  # Flatten the list of n-grams

# Convert words to their numeric indices using the mapping
train_data = torch.tensor([[word2idx[word] for word in context] for context, _ in ngrams])
train_labels = torch.tensor([word2idx[target] for _, target in ngrams])

# Results
train_data, train_labels

(tensor([[  0,   0,   2,   3],
         [  0,   1,   3,   4],
         [  1,   2,   4,   2],
         ...,
         [ 11,  43,   2, 840],
         [ 43, 817, 840,   0],
         [817,   2,   0,   0]]),
 tensor([  1,   2,   3,  ..., 817,   2, 840]))

##  Bonus 7. Implement the continuous bag of words model for estimating word embeddings

Implement the CBOW model for word embeddings: a CBOW with window size 2, which `sums` the input embeddings and from that hidden representation `predicts` the target token. 

The steps for CBOW are as follows:
* Convert your data to the center/window format (done in the previous assignment)
* The model should have an embedding layer and a linear layer (and optionally a loss function; you can also put the loss function in the training loop)
* In the forward function, the model should look up the embeddings, sum them, convert the sum to logits (in the linear layer), and optionally calculate the loss (this can also be done in the training loop)
* In the training loop (assignment 4b), we have one for loop over the epochs and one over the data. Inside the inner loop, we call the forward function and compute the loss, after which the backward pass can be called
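
Written out, the forward pass described in the steps above amounts to the following. The symbols are notation introduced here for illustration: $c_1, \dots, c_4$ are the indices of the four context words, $E$ is the embedding matrix, $W$ and $b$ are the parameters of the linear layer, and $t$ is the index of the target word.

$$ h = \sum_{i=1}^{4} E_{c_i}, \qquad z = W h + b, \qquad \mathcal{L} = -\log \frac{\exp(z_t)}{\sum_{v} \exp(z_v)} $$

`torch.nn.CrossEntropyLoss` computes the last expression directly from the logits $z$, so the model itself does not need a softmax layer.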


To train a model in PyTorch, one has to define a sub-class of `torch.nn.Module` (see also [assignment3](https://github.itu.dk/robv/intro-nlp2023/blob/main/assignments/week3/train.py)). The constructor `__init__()` and the `forward()` function can then be defined to specify the structure of the network. In the `__init__` function, the layers are specified and initialized, whereas the `forward` function defines how the layers interact during a forward pass. You can use [`torch.nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) for the embedding layer, [`torch.nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) for the linear layer, and [`torch.nn.CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) as the loss function.

For some examples we refer to this [tutorial](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py) and this [introduction](https://towardsdatascience.com/an-easy-introduction-to-pytorch-for-neural-networks-3ea08516bff2).

* a) Implement the CBOW network as described above:

**Hint**: you can print the structure of the model by simply printing the initialized variable. Make sure all the layers are represented in the forward pass.

In [39]:
import torch.nn as nn
import torch.optim as optim

embed_dim = 64
vocab_dim = len(word2idx)

class CBOW(nn.Module):
    def __init__(self, embed_dim, vocab_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_dim, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_dim)
    
    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        embeds_sum = embeds.sum(dim=0).view(1, -1) # Sum and reshape embeddings
        logits = self.linear(embeds_sum)
        return logits

cbow_model = CBOW(embed_dim,vocab_dim)
print(cbow_model)

CBOW(
  (embeddings): Embedding(14532, 64)
  (linear): Linear(in_features=64, out_features=14532, bias=True)
)
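
The model above sums over `dim=0` and therefore processes one context window at a time. A possible batched variant, sketched below under the assumption that the input has shape `(batch, 2 * window_size)`, sums over `dim=1` instead. The class name `BatchedCBOW` is hypothetical, and `padding_idx=0` is an optional choice (not required by the assignment) that keeps the embedding of the padding token at zero so it does not contribute to the sum.

```python
import torch
import torch.nn as nn

class BatchedCBOW(nn.Module):
    """Hypothetical batched variant of the CBOW model (illustration only)."""

    def __init__(self, embed_dim, vocab_dim):
        super().__init__()
        # padding_idx=0 fixes the padding embedding to the zero vector
        self.embeddings = nn.Embedding(vocab_dim, embed_dim, padding_idx=0)
        self.linear = nn.Linear(embed_dim, vocab_dim)

    def forward(self, inputs):
        # inputs: (batch, context_size) tensor of word indices
        embeds = self.embeddings(inputs)   # (batch, context_size, embed_dim)
        summed = embeds.sum(dim=1)         # (batch, embed_dim)
        return self.linear(summed)         # (batch, vocab_dim) logits

torch.manual_seed(0)
model = BatchedCBOW(embed_dim=8, vocab_dim=12)
batch = torch.tensor([[0, 0, 2, 3],
                      [0, 1, 3, 4]])
logits = model(batch)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1, 2]))
print(logits.shape, loss.item())
```

With this shape convention, `train_data` and `train_labels` could be fed in slices without the `view` calls needed for the single-instance model.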


* b) Now implement the training procedure with gradient descent (`learning rate=0.001`). Go through the dataset `10` times, and update the weights after each line (`batch size = 1`). An example of a training procedure can be found on: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#train-the-network

**Hint**: you have to convert the lists created in assignment 3 to be able to do the forward pass. The forward pass expects its input to be in tensors. So for the gold labels this means we have to ensure that we do not pass a zero-dimension tensor which looks like: `tensor(1)`, but convert this to `tensor([1])`. Similarly for the training data, we convert `tensor([0, 0, 2, 3])` to `tensor([[0], [0], [2], [3]])`. This can be done with [tensor views](https://pytorch.org/docs/stable/generated/torch.Tensor.view.html#torch.Tensor.view).
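
The reshaping described in the hint can be tried in isolation. The variable names below are illustrative.

```python
import torch

target = torch.tensor(1)               # zero-dimensional tensor
context = torch.tensor([0, 0, 2, 3])   # one context window

target_1d = target.view(1)             # one-dimensional tensor with a single element
context_col = context.view(-1, 1)      # column layout: one index per row

print(target.shape, target_1d.shape)
print(context.shape, context_col.shape)
```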


In [40]:
# b)
from tqdm import tqdm
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(cbow_model.parameters(), lr=0.001)

epochs = 10
for epoch in tqdm(range(epochs)):
    total_loss = 0
    for context, target in zip(train_data, train_labels):
        # Prepare data
        context_var = context.view(-1, 1)

        # Reset gradients before each iteration
        cbow_model.zero_grad()

        # Forward pass: the model returns unnormalized logits
        logits = cbow_model(context_var)

        # Compute the loss, gradients, and update the parameters
        # (CrossEntropyLoss applies log-softmax to the logits internally)
        loss = loss_function(logits, target.view(1))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    print(f"Epoch {epoch}, Loss: {total_loss}")

Epoch 0, Loss: 795465.8421205501
Epoch 1, Loss: 703642.9686597846
Epoch 2, Loss: 659211.8396050427
Epoch 3, Loss: 629916.4330505263
Epoch 4, Loss: 607484.8932853411
Epoch 5, Loss: 589339.6070791595
Epoch 6, Loss: 574111.0862202514
Epoch 7, Loss: 560905.0768480739
Epoch 8, Loss: 549154.8300352106
Epoch 9, Loss: 538506.09074019
100%|██████████| 10/10 [1:06:14<00:00, 397.47s/it]
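
Once training has finished, the learned word vectors are the rows of the embedding layer's weight matrix (`cbow_model.embeddings.weight` in the code above). One common way to inspect them is to look up nearest neighbours by cosine similarity. The sketch below uses a small hand-made matrix in place of trained weights; the function name `nearest_neighbors` and the toy vocabulary are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def nearest_neighbors(weights, word2idx, query, k=2):
    """Return the k words whose vectors are most cosine-similar to `query`."""
    idx2word = {i: w for w, i in word2idx.items()}
    query_vec = weights[word2idx[query]].unsqueeze(0)      # (1, dim)
    sims = F.cosine_similarity(query_vec, weights, dim=1)  # (vocab,)
    sims[word2idx[query]] = float("-inf")                  # exclude the query itself
    top = torch.topk(sims, k).indices.tolist()
    return [idx2word[i] for i in top]

# Toy stand-in for `cbow_model.embeddings.weight.detach()`
toy_word2idx = {"<PAD>": 0, "jedi": 1, "sith": 2, "ship": 3}
toy_weights = torch.tensor([[0.0, 0.0],
                            [1.0, 0.1],
                            [0.9, 0.2],
                            [0.0, 1.0]])

print(nearest_neighbors(toy_weights, toy_word2idx, "jedi", k=1))
```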



