# Week 3 - NLP and Deep Learning

---

# Lecture 5 - Language Identification with a Feedforward Neural Network


In this exercise, you will implement the forward step of an FFNN from scratch and compare your solution to PyTorch on a small toy example to predict the language for a given word.

It is very important that you understand the basic building blocks (input/output: how to encode your instances, the labels; the model: what the neural network consists of, how to learn its weights, how to do a forward pass for prediction). 

##  1. Representing the data

We are assuming multi-class classification tasks for the assignments of this week. The labels are: $$ y \in \{da,nl,en\}$$

We will use the same data as in week2, from:
* English [Wookipedia](https://starwars.fandom.com/wiki/Main_Page)  
* Danish [Kraftens Arkiver](https://starwars.fandom.com/da/wiki) 
* Dutch [Yodapedia](https://starwars.fandom.com/da/wiki)


In [17]:
def load_langid(path):
    text = []
    labels = []
    for line in open(path, encoding="utf-8"):
        tok = line.strip().split('\t')
        labels.append(tok[0])
        text.append(tok[1])
    return text, labels

wooki_train_text, wooki_train_labels = load_langid('langid-data/wookipedia_langid.train.tok.txt')

* a): Convert the training data into n-hot format, where each feature represents whether a **single character** is present or not. Similarly, convert the labels into numeric format. For simplicity, you can assume a closed vocabulary (only the characters in `wooki_train_text`, no unknown-character handling). Keep the original casing, and assign the character indices in order of first appearance.

  * What is the vocabulary size?
  
**Hint:** It is easier for the rest of the assignment if you directly use a torch tensor to store the features ([tutorial](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py)). A 2D torch tensor filled with 0's can be created with `torch.zeros(dim1, dim2, dtype=float)`. Note the use of `float` instead of `int` here: the features are later multiplied with floating-point weights in `torch.mm`, and both operands need to have the same dtype.
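
As a rough illustration of the n-hot format, the sketch below encodes two toy strings against a vocabulary built in order of first appearance. The names `toy_texts`, `toy_char2idx` and `n_hot` are made up for this example and are not part of the assignment data.

```python
import torch

# Hypothetical toy data, only for illustration.
toy_texts = ["abca", "cd"]

# Closed vocabulary: characters indexed by order of first appearance.
toy_char2idx = {}
for text in toy_texts:
    for ch in text:
        if ch not in toy_char2idx:
            toy_char2idx[ch] = len(toy_char2idx)

def n_hot(texts, char2idx):
    """Return a (num_texts, vocab_size) float tensor with 1.0 where a character occurs."""
    features = torch.zeros(len(texts), len(char2idx), dtype=torch.float)
    for row, text in enumerate(texts):
        for ch in set(text):  # repeated characters still give a single 1.0
            features[row, char2idx[ch]] = 1.0
    return features

toy_features = n_hot(toy_texts, toy_char2idx)
print(toy_char2idx)
print(toy_features)
```

Each row is one instance and each column one character, so a character that occurs several times in a text still yields a single 1.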

In [18]:
# a)
import torch
import numpy as np
import pandas as pd

class OneHotCharEncoder:
    """n-hot encoder: one feature per vocabulary character (present or absent)."""

    def __init__(self) -> None:
        self.vocab = None  # set by fit(), or assigned directly from a known index

    def fit(self, text: str) -> torch.Tensor:
        self.text = text
        uniq = pd.unique(np.array(list(text)))  # unique characters in order of first appearance
        self.vocab = uniq

        return self.encode(text)  # n-hot vector of the text the vocabulary was built from

    def _validate(self):
        if self.vocab is None:
            raise ValueError("Call fit first")

    def encode(self, text: str | list) -> torch.Tensor:
        self._validate()
        # Not vectorized, but the vocabulary is small enough that this is not a bottleneck.
        return torch.tensor([1 if c in text else 0 for c in self.vocab], dtype=float)

    def transform(self, text: list) -> torch.Tensor:
        # One row per text, one column per vocabulary character.
        return torch.stack([self.encode(t) for t in text])

enc = OneHotCharEncoder()
enc.fit("".join(wooki_train_text))
wooki_train_tensors = enc.transform(wooki_train_text)
print(wooki_train_tensors) # each row is a datapoint
print(len(enc.vocab)) # 131 chars

labels = pd.unique(np.array(wooki_train_labels))
print(labels)

tensor([[1., 1., 1.,  ..., 0., 0., 0.],
        [0., 1., 1.,  ..., 0., 0., 0.],
        [0., 1., 1.,  ..., 0., 0., 0.],
        ...,
        [1., 1., 1.,  ..., 0., 0., 0.],
        [0., 1., 1.,  ..., 0., 0., 0.],
        [0., 1., 1.,  ..., 0., 0., 0.]], dtype=torch.float64)
131
['en' 'nl' 'da']


##  2: Forward pass (from scratch)

### Feedforward Neural Networks (FNNs) or MLPs

Feedforward Neural Networks (FNNs) are also called Multilayer Perceptrons (MLPs). These are the most basic types of neural networks. They are called feedforward because the information flows in one direction, from the input nodes through the network to the output nodes.

It is essential to understand that a neural network is a non-linear classification model which is based upon function application. Each layer in a neural network is an application of a function.

Summary (by J.Frellsen):
<img src="pics/fnn_jf.png">

You are going to implement the forward step manually on a small dataset. You will create a network following the design in the figure below (note that the input layer should have the same size as the number of characters found in the previous assignment, instead of 4):

<img src="pics/nn.svg">

a) How many neurons do hidden layer 1 and hidden layer 2 have? Note: the bias nodes are not shown in the figure, and you do not have to count them for this assignment.

b) How many neurons does the output layer have? And the input layer? (Note: the figure shows only 4 input nodes, in this example your input size is defined in the previous assignment - what is the input layer size?)

c) Specify the size of layers of the feedforward neural network:

In [19]:
## helper functions to determine the input and output dimensions of each layer
input_dim = len(enc.vocab)

hidden_dim1 = 15
hidden_dim2 = 20

output_dim = len(labels)

d) Now initialize the layers themselves as torch tensors (do not use a `torch.nn.Module` here!). You can define the bias and the weights in separate tensors. The weights should be initialized randomly (`torch.randn((dim1, dim2), dtype=torch.float)`, see also [torch.randn](https://pytorch.org/docs/stable/generated/torch.randn.html)) and the biases can be set to 1 (`torch.ones(dim1, dtype=torch.float)`, see also [torch.ones](https://pytorch.org/docs/stable/generated/torch.ones.html)). Confirm that their sizes match the answers to `a)` and `b)` by printing the `.shape` of the tensors.


In [20]:
## define all parameters of this NN

# Each weight matrix has shape (inputs, outputs): one row per input feature, one column per neuron of the layer.
w1 = torch.randn((input_dim, hidden_dim1), dtype=float)
b1 = torch.ones(hidden_dim1, dtype=float)

w2 = torch.randn((hidden_dim1, hidden_dim2), dtype=float)
b2 = torch.ones(hidden_dim2, dtype=float)

w3 = torch.randn((hidden_dim2, output_dim), dtype=float)
b3 = torch.ones(output_dim, dtype=float)  # biases are set to 1, like b1 and b2

Now that we have defined the shape of all parameters, we are ready to "connect the dots" and build the network. 

It is instructive to break the computation of each layer down into two steps: the scores $a1$ are obtained by a linear function, and the activation function $\sigma$ is then applied to them to obtain the representation $z1$, as in:

$$ a1 = xW_1 + b_1$$
$$ z1 = \sigma(a1)$$
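
To make the shapes in these two steps concrete, here is a small sketch of a single layer on random toy data. The sizes (`batch_size`, `in_dim`, `out_dim`) are arbitrary assumptions for illustration, not the sizes of the network in the figure.

```python
import torch

torch.manual_seed(0)

batch_size, in_dim, out_dim = 4, 6, 3   # hypothetical toy sizes
x = torch.rand(batch_size, in_dim)      # one row per instance
W1 = torch.randn(in_dim, out_dim)       # one column per neuron of the layer
b1 = torch.ones(out_dim)                # one bias per neuron

a1 = torch.mm(x, W1) + b1               # linear scores, shape (batch_size, out_dim)
z1 = torch.tanh(a1)                     # element-wise activation, same shape as a1

print(a1.shape, z1.shape)
```

The bias vector is broadcast over the rows, so every instance in the batch gets the same bias added.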

e) Specify the entire network up to the output layer $z3$, **up to but excluding** the final application of the softmax, the last activation function, which is provided. For the matrix multiplication, [torch.mm](https://pytorch.org/docs/stable/generated/torch.mm.html) can be used. Use a tanh activation function: [torch.tanh](https://pytorch.org/docs/stable/generated/torch.tanh.html).

The exact implementation of the softmax might differ from toolkit to toolkit (due to variations in implementation details in order to obtain numerical stability). Therefore, we will use the PyTorch implementation for the softmax calculation ([torch.nn.Softmax](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html)).
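
For intuition about the numerical-stability remark, the sketch below shows one common trick: subtracting the row maximum before exponentiating, which leaves the result mathematically unchanged but avoids overflow. The function `stable_softmax` is a hypothetical helper written for this example; it is not claimed to be what PyTorch does internally.

```python
import torch

def stable_softmax(scores: torch.Tensor) -> torch.Tensor:
    """Row-wise softmax that subtracts the row maximum before exponentiating."""
    shifted = scores - scores.max(dim=1, keepdim=True).values
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=1, keepdim=True)

# The second row would overflow with a naive exp() in float32.
toy_scores = torch.tensor([[1.0, 2.0, 3.0], [1000.0, 1001.0, 1002.0]])
manual = stable_softmax(toy_scores)
reference = torch.softmax(toy_scores, dim=1)

print(torch.allclose(manual, reference))
print(manual.sum(dim=1))
```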

In [21]:
## implement the forward pass (up to but excluding the softmax)
## apply it to the training data (`wooki_train_tensors` here) - use vectorization

def pred(x: torch.Tensor):
    a1 = torch.matmul(x, w1) + b1
    z1 = torch.tanh(a1)

    a2 = torch.matmul(z1, w2) + b2
    z2 = torch.tanh(a2)

    a3 = torch.matmul(z2, w3) + b3
    return a3

wooki_preds_a3 = pred(wooki_train_tensors)

We can check that all predictions sum up to approximately 1 (hint: use `torch.sum` with `axis=1`)



In [22]:
print(torch.sum(wooki_preds_a3, axis=1)) # raw scores do not sum to 1: the softmax has not been applied yet
m = torch.nn.Softmax(dim=1)
wooki_preds = m(wooki_preds_a3)
print(torch.sum(wooki_preds, axis=1))

tensor([-0.1223,  4.1720,  0.2384,  ...,  0.3482,  1.6423,  5.1073],
       dtype=torch.float64)
tensor([1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
       dtype=torch.float64)



Congrats! You have made it through the manual construction of the forward pass. Note that these weights are still random, so performance is not expected to be good. Now let's compare your implementation to a set of pre-determined weights.

##  3. Where do the weights come from?  Loading existing weights

So far, the model that you used had randomly initialized weights. In this step we will load pre-trained model weights and do the forward pass with those weights, in order to check your implementation against model predictions computed by the toolkit.

Now we are going to:
* load pretrained weights for all parameters
* apply the weights to the evaluation data
* check that your manual softmax scores match the ones obtained by the pre-trained model `model` that we will load
* convert the output to labels and calculate the accuracy score

First, let's load the pre-trained model:

In [23]:
import torch
import torch.nn as nn

# use the character indexing from assignment 3
idx2char = ['H', 'e', ' ', 'v', 'n', 'w', 't', 's', 'o', 'f', 'a', 'r', 'u', 'g', 'h', ',', 'i', 'c', 'y', 'd', 'b', 'm', 'p', 'l', 'k', '.', 'D', 'E', 'C', 'j', 'R', 'S', 'U', '1', "'", 'æ', 'å', 'q', '`', 'I', '(', ')', 'M', 'F', '-', 'x', 'K', '9', '5', 'B', 'W', 'z', 'G', 'P', 'L', '/', 'O', '6', 'T', '7', 'Z', '2', '0', 'J', 'V', 'A', 'ø', 'X', '–', 'N', 'ë', ':', '&', '3', 'Y', 'é', '4', '[', ']', '’', ';', '8', 'É', 'Æ', 'Q', '!', '—', 'ï', '°', 'ō', '\u200b', '‘', 'ń', '“', '”', '?', 'Å', '<', '>', '#', '%', '+', 'ʊ', 'ɹ', 'ə', 'ɑ', 'ö', 'à', 'á', 'è', '=', 'ü', 'Ø', '∑', '^', 'ś', 'ñ', '|', '½', '$', '«', '™', 'ó', '´', '…', '―', '»', 'ː', 'θ', '²', 'Θ']
char2idx = {'H': 0, 'e': 1, ' ': 2, 'v': 3, 'n': 4, 'w': 5, 't': 6, 's': 7, 'o': 8, 'f': 9, 'a': 10, 'r': 11, 'u': 12, 'g': 13, 'h': 14, ',': 15, 'i': 16, 'c': 17, 'y': 18, 'd': 19, 'b': 20, 'm': 21, 'p': 22, 'l': 23, 'k': 24, '.': 25, 'D': 26, 'E': 27, 'C': 28, 'j': 29, 'R': 30, 'S': 31, 'U': 32, '1': 33, "'": 34, 'æ': 35, 'å': 36, 'q': 37, '`': 38, 'I': 39, '(': 40, ')': 41, 'M': 42, 'F': 43, '-': 44, 'x': 45, 'K': 46, '9': 47, '5': 48, 'B': 49, 'W': 50, 'z': 51, 'G': 52, 'P': 53, 'L': 54, '/': 55, 'O': 56, '6': 57, 'T': 58, '7': 59, 'Z': 60, '2': 61, '0': 62, 'J': 63, 'V': 64, 'A': 65, 'ø': 66, 'X': 67, '–': 68, 'N': 69, 'ë': 70, ':': 71, '&': 72, '3': 73, 'Y': 74, 'é': 75, '4': 76, '[': 77, ']': 78, '’': 79, ';': 80, '8': 81, 'É': 82, 'Æ': 83, 'Q': 84, '!': 85, '—': 86, 'ï': 87, '°': 88, 'ō': 89, '\u200b': 90, '‘': 91, 'ń': 92, '“': 93, '”': 94, '?': 95, 'Å': 96, '<': 97, '>': 98, '#': 99, '%': 100, '+': 101, 'ʊ': 102, 'ɹ': 103, 'ə': 104, 'ɑ': 105, 'ö': 106, 'à': 107, 'á': 108, 'è': 109, '=': 110, 'ü': 111, 'Ø': 112, '∑': 113, '^': 114, 'ś': 115, 'ñ': 116, '|': 117, '½': 118, '$': 119, '«': 120, '™': 121, 'ó': 122, '´': 123, '…': 124, '―': 125, '»': 126, 'ː': 127, 'θ': 128, '²': 129, 'Θ': 130}

# the label indexes that were used during training
label2idx = {'da':0, 'nl':1, 'en':2}
idx2label = ['da', 'nl', 'en']

# This is the definition of an FNN model in PyTorch, and can mostly be ignored for now.
# We will focus on how to create Torch models in week 5
class LangId(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.input = nn.Linear(vocab_size, 15)
        self.hidden1 = nn.Linear(15, 20)
        self.hidden2 = nn.Linear(20, 3)

    def forward(self, x):
        x = torch.tanh(self.input(x))
        x = torch.tanh(self.hidden1(x))
        x = self.hidden2(x)
        return x

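# Note: from PyTorch 2.6 on, torch.load defaults to weights_only=True, so loading a
# fully pickled model like this one may need torch.load('model.th', weights_only=False).
lang_classifier = torch.load('model.th')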


Inspect the weights you just loaded using the `state_dict()` function of the model: 

In [24]:
lang_classifier.state_dict()
# those sure are weights

OrderedDict([('input.weight',
              tensor([[ 0.1274,  0.2723,  0.4691,  ...,  0.0754,  0.0201,  0.0813],
                      [-0.1876,  0.3465,  0.4979,  ..., -0.0436, -0.0362, -0.0866],
                      [ 0.1779,  0.3311,  0.3578,  ..., -0.0705,  0.0656,  0.0415],
                      ...,
                      [-0.0264,  0.2019,  0.1753,  ...,  0.0335,  0.0764,  0.0222],
                      [-0.0810, -0.3535, -0.1255,  ..., -0.0645,  0.0299,  0.0438],
                      [ 0.0740, -0.1535,  0.1290,  ..., -0.0464, -0.0612,  0.0650]])),
             ('input.bias',
              tensor([ 0.4091,  0.8057,  0.4696,  0.3282,  0.4459, -0.3094, -0.7575, -0.3531,
                      -0.3175,  0.2946,  0.7420,  0.1358,  0.1037, -0.2193, -0.3283])),
             ('hidden1.weight',
              tensor([[ 0.2575,  0.6953,  0.5631,  0.2704, -0.3716, -0.9438, -0.4709, -0.9932,
                       -0.7564, -0.0925, -2.2822,  0.2297,  0.2956,  0.0241, -1.9843],
            

* a) Convert the following dev data into the input format for the neural network above. 

**Hint:** The indices of the characters are based on their order in the training data, and should be the same for the development data. We provide the correct `idx2char` and `char2idx` that were used to train the model in the code above.

In [25]:
wooki_dev_text, wooki_dev_labels = load_langid('langid-data/wookipedia_langid.dev.tok.txt')
enc = OneHotCharEncoder()
enc.vocab = idx2char # not taking chances that the order is messed up
wooki_dev_tensors = enc.transform(wooki_dev_text)
print(wooki_dev_tensors) # those sure are numbers

tensor([[1., 1., 1.,  ..., 0., 0., 0.],
        [0., 1., 1.,  ..., 0., 0., 0.],
        [0., 1., 1.,  ..., 0., 0., 0.],
        ...,
        [0., 1., 1.,  ..., 0., 0., 0.],
        [0., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.]], dtype=torch.float64)


* b) Run a forward pass on the dev data with `lang_classifier`, using the `forward()` function.

* c) Apply your manual implementation of the forward pass to the evaluation data by using the parameters (weights) you just loaded with `state_dict()`. This allows you to check if you get the same results back as the model implemented in Torch. If the outputs match, you implemented the forward pass correctly, congratulations!

**Hint**: internally the torch model stores the weights as a transposed matrix for efficiency reasons. This means that W1 will have the dimensions (15,131). To use your previous implementation you have to call the transpose function in PyTorch ([`.t()`](https://pytorch.org/docs/stable/generated/torch.t.html)), which converts the shape to (131,15).
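
The transposed storage can be checked on a fresh `nn.Linear` layer. The sketch below uses a randomly initialized layer with the same sizes as the first layer of the loaded model (not the loaded weights themselves) and a made-up batch `x`, and compares the manual computation to the layer's own output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(131, 15)        # same sizes as the first layer, random weights
x = torch.rand(5, 131)            # hypothetical batch of 5 instances

W = layer.weight.detach().t()     # stored as (15, 131); transposed to (131, 15)
b = layer.bias.detach()           # shape (15,)

manual = torch.mm(x, W) + b
reference = layer(x).detach()

print(tuple(layer.weight.shape), tuple(W.shape))
print(torch.allclose(manual, reference, atol=1e-5))
```

Comparing with `torch.allclose` rather than `==` tolerates tiny rounding differences between the two computation orders.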

* d) Now apply softmax to the resulting scores and convert the output to label predictions.

In [26]:
# b)
wooki_dev_tensors = wooki_dev_tensors.type(torch.float) # the loaded weights are float32, so cast the float64 features to match
torch_result = lang_classifier.forward(wooki_dev_tensors)

# starting point for c)
w = lambda x: lang_classifier.state_dict()[x].t()
w1 = w('input.weight')
b1 = w('input.bias')
w2 = w('hidden1.weight')
b2 = w('hidden1.bias')
w3 = w('hidden2.weight')
b3 = w('hidden2.bias')
my_result = pred(wooki_dev_tensors)

In [27]:
torch_result == my_result
# wow that is actually pretty unexpected, first try too

tensor([[True, True, True],
        [True, True, True],
        [True, True, True],
        ...,
        [True, True, True],
        [True, True, True],
        [True, True, True]])

In [28]:
# d)
m = nn.Softmax(dim=1)
my_prob = m(my_result)
my_pred_idx = torch.argmax(my_prob, axis=1)
my_pred = [idx2label[idx] for idx in my_pred_idx]

n_correct = np.sum(np.array(my_pred) == np.array(wooki_dev_labels))
acc = n_correct / len(my_pred)
print(f"Accuracy: {acc*100:.2f}%")

Accuracy: 81.37%


# Lecture 6: What do word embeddings represent?
In the following exercises, you are going to explore what is represented with word embeddings. You are going to make use of the python gensim package and two sets of pre-trained embeddings. The embeddings can be downloaded from:

* http://itu.dk/people/robv/data/embeds/twitter.bin.tar.gz
* http://itu.dk/people/robv/data/embeds/GoogleNews-50k.bin.tar.gz

The first embeddings are skip-gram embeddings trained on a collection of 2 billion words from English tweets collected between 2012 and 2018 with the default settings of word2vec. The second embeddings are trained on 100 billion words from Google News. Both have been truncated to the most frequent 500,000 words. Note that loading each of these embeddings requires approximately 2GB of RAM.

The embeddings can be loaded in gensim as follows:

In [37]:
import gensim.models

twitEmbs = gensim.models.KeyedVectors.load_word2vec_format(
    'embeddings/twitter.bin', binary=True
)
googleEmbs = gensim.models.KeyedVectors.load_word2vec_format(
    'embeddings/GoogleNews-50k.bin', binary=True
)

print('loading finished')

loading finished


You can now use the index operator ``[]`` or the function ``get_vector()`` to access the individual word embeddings.

In [30]:
twitEmbs['cat']

array([ 4.64285821e-01,  2.37979457e-01, -4.24226150e-02, -4.35831666e-01,
       -4.06450212e-01, -1.43117514e-02,  1.22334510e-01, -5.59092343e-01,
        1.23332568e-01,  2.36625358e-01,  3.58797014e-02, -9.40739065e-02,
       -2.04128489e-01, -1.81295779e-02, -1.08792759e-01, -2.70818472e-01,
        1.05479717e-01,  1.37095019e-01,  1.79271579e-01,  2.91243941e-01,
       -5.87746739e-01,  2.90462654e-02,  6.89281642e-01, -1.80917114e-01,
       -2.57750720e-01, -2.01395631e-01, -5.16403615e-01,  5.85804135e-03,
       -1.67768478e-01,  2.17095211e-01,  2.22494245e-01,  1.56742647e-01,
       -3.60864878e-01,  3.94283593e-01,  8.04448500e-03,  1.11518592e-01,
       -1.85592070e-01, -1.16088443e-01,  3.24357510e-01,  4.00876179e-02,
        9.14092362e-02, -1.04118213e-01, -6.89513862e-01,  1.54412836e-01,
        4.57625002e-01,  2.55037360e-02, -3.84058757e-03,  7.12698698e-02,
       -2.25590184e-01, -1.96693689e-01, -3.88458431e-01, -2.27625713e-01,
        6.94357634e-01, -

## 4. Word similarities
Cosine similarity can be used to measure how similar two word vectors are. It is defined as:
\begin{equation}
\cos(\vec{a},\vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| |\vec{b}|} = \frac{\sum_{i=1}^n a_i b_i}{\sqrt{\sum_{i=1}^n a_i^2} \sqrt{\sum_{i=1}^n b_i^2}}
\end{equation}

* a) Implement the cosine similarity using pure python (only the ``math`` package is allowed). Note that `similarity == 1-distance`.
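
A minimal pure-Python sketch of the formula above, using only the ``math`` package. The function name `cosine_similarity` and the toy vectors are made up for illustration; with real embeddings you would pass in the vectors returned by the embedding lookup.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not taken from the embeddings.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction)      # similarity
print(1 - same_direction)  # corresponding distance
print(orthogonal)
```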

You can compare your scores to the gensim implementation to check whether it is correct. The following code should give matching results (keeping in mind that gensim's `distance` is `1 - similarity`):

In [31]:
def cos(a: np.ndarray, b: np.ndarray):
    # I *know* the vectors are already numpy arrays,
    # so is using the numpy dot product instead of a loop really cheating?
    numerator = a @ b
    len_a = np.sqrt(a @ a) # this function is also in math so definitely not cheating
    len_b = np.sqrt(b @ b)
    denominator = len_a * len_b
    return numerator / denominator

print(twitEmbs.distance('cat', 'dog'))
print(cos(twitEmbs['cat'], twitEmbs['dog']))
cat = np.copy(twitEmbs['cat'])
dog = np.copy(twitEmbs['dog'])
print(np.dot(cat, dog)/(np.linalg.norm(cat)*np.linalg.norm(dog))) # yay

0.10446518659591675
0.8955349
0.8955349


In WordNet, the distance between two senses can be based on their distance in the taxonomy. The most common metric for this is:

* Wu-Palmer Similarity: denotes how similar two word senses are, based on the depth of the two senses in the taxonomy and of their Least Common Subsumer (most specific ancestor node).
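
Written out, the usual textbook form of this measure is shown below, where $s_1$ and $s_2$ are the two senses and $\mathrm{LCS}$ is their Least Common Subsumer. How depth is counted (for example, whether the root has depth 0 or 1) varies between implementations, so treat this as a sketch of the idea and not as the exact computation of a particular library.

\begin{equation}
\mathrm{wup}(s_1, s_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCS}(s_1, s_2))}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)}
\end{equation}

The score is 1 when both senses are the same node and shrinks as their most specific common ancestor moves closer to the root.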

It can be obtained in python like this:

In [32]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

first_word = wordnet.synsets('cat')[0] #0 means: most common sense
second_word = wordnet.synsets('dog')[0]
print('WordNet similarity: ' + str(first_word.wup_similarity(second_word)))

print('Twitter similarity: ' + str(twitEmbs.similarity('cat', 'dog')))


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\thore\AppData\Roaming\nltk_data...


WordNet similarity: 0.8571428571428571
Twitter similarity: 0.8955348




* b) Think of 5 word pairs which have a high similarity according to you. Compare the similarity of these pairs in WordNet as well as in the Twitter embeddings and the Google News embeddings. Which method is closest to your own intuition? (You are allowed to use the gensim implementation of cosine similarity here.)


In [41]:
# b)
pairs = [
    ('mouse', 'keyboard'), # 70%?
    ('mouse', 'hamster'), #  80%?
    ('door', 'window'), #   75%?
    ('hand', 'foot'), #     85%?
    ('television', 'monitor'), #  80%?
]
def get_sims(w1, w2):
    twit = twitEmbs.similarity(w1, w2)
    goog = googleEmbs.similarity(w1, w2)
    wdnt = wordnet.synsets(w1)[0].wup_similarity(wordnet.synsets(w2)[0])
    return twit, goog, wdnt

for w1, w2 in pairs:
    t, g, w = get_sims(w1, w2)
    print(w1, "-", w2)
    print("Twitter:", t)
    print("Google: ", g)
    print("Wordnet:", w)

print("""
mouse - keyboard:
  Twitter has a much higher similarity than the other two. That actually does make sense, as you'd expect Twitter users to talk more about computers than the average newspaper.
mouse - hamster:
  Twitter scores about the same as for the previous pair. So does Google News; I guess there aren't that many news stories about hamsters, and mice are probably dominated by lab experiments. WordNet is pretty high, unsurprisingly.
door - window:
  ¯\_(ツ)_/¯
television - monitor:
  I think it would have been a lot higher if I had used TV. You know what, let's try.
""")
print(get_sims('TV', 'monitor'))
print("""A little higher, yeah""")

mouse - keyboard
Twitter: 0.76052743
Google:  0.47375855
Wordnet: 0.38095238095238093
mouse - hamster
Twitter: 0.7560636
Google:  0.48806348
Wordnet: 0.9230769230769231
door - window
Twitter: 0.8368629
Google:  0.62127966
Wordnet: 0.631578947368421
hand - foot
Twitter: 0.7235881
Google:  0.22718483
Wordnet: 0.8235294117647058
television - monitor
Twitter: 0.38444385
Google:  0.111713566
Wordnet: 0.38095238095238093

mouse - keyboard:
  Twitter has a much higher similarity than the other two. That actually does make sense, as you'd expect Twitter users to talk more about computers than the average newspaper.
mouse - hamster:
  Twitter scores about the same as for the previous pair. So does Google News; I guess there aren't that many news stories about hamsters, and mice are probably dominated by lab experiments. WordNet is pretty high, unsurprisingly.
door - window:
  ¯\_(ツ)_/¯
television - monitor:
  I think it would have been a lot higher if I had used TV. You know what, let's try.

(0.42712036, 0.13722876, 0.3809523

## 5. Analogies

Analogies have often been used to demonstrate the power of word embeddings. Analogies have the form ``A :: B : C :: D``. In this setting `A`, `B` and `C` are usually given and the fourth term `D` is extracted from the embeddings by using ``3cosadd``:

\begin{equation}
\underset{d}{\mathrm{argmax}} (\cos (d, c) - \cos (d, a) + \cos (d, b))
\end{equation}
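
The formula can be sketched directly on a handful of hand-made vectors. Everything in the block below (the `toy` vectors, the helper names `cos` and `three_cos_add`) is an assumption for illustration; the query words themselves are excluded from the candidates, which is a common convention for analogy queries.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def three_cos_add(a, b, c, embeddings):
    """Return the word d maximizing cos(d, c) - cos(d, a) + cos(d, b), excluding a, b and c."""
    best_word, best_score = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        score = cos(vec, embeddings[c]) - cos(vec, embeddings[a]) + cos(vec, embeddings[b])
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Hypothetical 3-d vectors with dimensions (royal, female, male).
toy = {
    "man":   np.array([0.0, 0.0, 1.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 0.0]),
    "apple": np.array([1.0, 1.0, 1.0]),
}

# man :: king : woman :: ?
answer, score = three_cos_add("man", "king", "woman", toy)
print(answer, score)
```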

You can query analogies with gensim:

In [33]:
# Man is to king as woman is to ...?
twitEmbs.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)

[('queen', 0.8401797413825989),
 ('goddess', 0.7309160232543945),
 ('king…', 0.7233694195747375),
 ('princess', 0.7157886624336243),
 ('kings', 0.707615852355957),
 ('godess', 0.6952610015869141),
 ('Queen', 0.6902579069137573),
 ('queen,', 0.687620997428894),
 ('quee…', 0.6856901049613953),
 ('queens', 0.6832401752471924)]

``3cosadd`` can be used to solve semantic as well as syntactic analogies:

| Semantic            |                                      |
|---------------------|--------------------------------------|
| Country-capital     | Denmark :: Copenhagen : England :: X |
| Family-relations    | boy :: girl : he :: X                |
| Object-color        | sky :: blue : grass :: X             |

| Syntactic           |                                      |
|---------------------|--------------------------------------|
| Comparatives        | nice :: nicer : good :: X            |
| Present-past tense  | work :: worked : drink :: X          |
| Country-nationality | Brazil :: Brazilian : Denmark :: X   |


Try the analogies from the table. Is the correct answer returned for all queries? 
If not: are the answers at least ranked high?

* a) Think of another category of *semantic* analogies that might be encoded in the embeddings and test this empirically by thinking of 5 example analogies. Which embeddings are better at predicting your category (Twitter versus Google News)?

* b) Think of another category of *syntactic* analogies that might be encoded in the embeddings and test this empirically by thinking of 5 example analogies. Which embeddings are better at predicting your category (Twitter versus Google News)?


In [79]:
def anlg(pos1, neg, pos2):
    return twitEmbs.most_similar(positive=[pos1, pos2], negative=[neg], topn=5)
# not doing all 3 

# I like to imagine this as "Take A. Subtract B. Add C."
print(anlg('Copenhagen', 'Denmark', 'England')) # Dublin, i guess?
print(anlg('he',         'boy',     'girl'   )) # She, perfect.
print(anlg('blue',       'sky',     'grass'  )) # MUSTARD????? blue-sky+grass=mustard. (at least green is #2)
print(anlg('nicer',      'nice',    'good'   )) # better, perfect.
print(anlg('worked',     'work',    'drink'  )) # drank, perfect.
print(anlg('Brazillian', 'Brazil',  'Denmark')) # Dutch. Burn twitter to the ground.
print()

# a) - homonyms? (is this even semantic?)
# interested to see what kind of "alternatives/synonyms" it finds if I try to specify which meaning of a homonym I'm looking for
print(anlg('mouse',   'computer',   'animal'  )) # hedgehog haha. Not very mouse-like animals.
print(anlg('fly',     'animal',     'machine' )) # jet, makes sense. I see parachute too down there.
print(anlg('command', 'slave',      'linux'   )) # config and lots of nerd terms lol
print(anlg('remote',  'television', 'forest'  )) # i guess some of these make sense
print(anlg('web',     'spider',     'computer')) # software, I had expected internet. intrAnet is there...
print()

# b) - adding s
print(anlg('runs', 'run', 'jump')) # jumps, 3rd person singular present tense
print(anlg('jumps', 'jump', 'explode')) # explodes
print(anlg('shoes', 'shoe', 'pant')) # pants, mixing it up here. Now the s is plural
print(anlg('pants', 'pant', 'run')) # running, mixing the two. Unsurprisingly, didn't quite get it but running is close.
print(anlg('phones', 'phone', 'think')) # adds ellipses for some reason

[('Dublin', 0.791156530380249), ('London', 0.7881399989128113), ('Glasgow', 0.779718816280365), ('Edinburgh', 0.7671084403991699), ('Antwerp', 0.75947505235672)]
[('she', 0.9052496552467346), ('she…', 0.782346785068512), ('he/she', 0.7575312256813049), ('it.She', 0.7428714036941528), ('*he*', 0.7421621680259705)]
[('mustard', 0.7218776345252991), ('green', 0.719383180141449), ('brown', 0.7046061754226685), ('greens', 0.7032448649406433), ('alfalfa', 0.6829589009284973)]
[('better', 0.7841683626174927), ('worse', 0.7769781351089478), ('shittier', 0.7208447456359863), ('better,', 0.7019132971763611), ('crappier', 0.7017417550086975)]
[('drank', 0.8187708854675293), ('shotgunned', 0.7269829511642456), ('chugged', 0.7176188826560974), ('drinked', 0.7070199251174927), ('sipped', 0.7053581476211548)]
[('Dutch', 0.6439822912216187), ('Moroccan', 0.6369330883026123), ('Tanned', 0.6136460304260254), ('Czech', 0.6110645532608032), ('brazillian', 0.6096874475479126)]

[('hedgehog', 0.718944430351

# Bonus: Learning Word Embeddings


So far you've learned about distributional semantics (vector semantics) in both the traditional and modern neural way, and you qualitatively worked with pre-trained (off-the-shelf) word embeddings in the last assignment.

In this assignment, you will learn how to implement a neural network to learn word embeddings, namely the *Continuous Bag of Words* (CBOW) model for word embeddings. More specifically, you will:

* learn how to represent text for window-based language modeling
* learn how to design a PyTorch model (`torch.nn.Module`)
* learn how to implement a FNN for learning embeddings with CBOW which *sums* the context embedding vectors
* train the model for a few epochs using stochastic gradient descent (SGD)
* read off the learned embeddings $W$, store them in a gensim-readable file and inspect them


### CBOW



CBOW is a model proposed by [Mikolov et al., 2013](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf).

It is a simple neural method to learn word embeddings and it is one of the two core algorithms in the `word2vec` toolkit (see figure below). Note that, besides its usage here to learn word embeddings, CBOW is also a more general term for any input representation that aggregates a set of word embeddings in some way. Hence its name, the continuous BOW representation. You can in fact use a similar representation (e.g., the average of the embeddings of the words) for other tasks as well, such as text classification. Here, CBOW is meant in its original formulation: a network over the *sum* of the embeddings of the context words, aimed at predicting the middle target word. It is related in spirit to a language model, but framed as a classification task (with context available on both sides), and hence bears more similarity to a *[word cloze test](https://en.wikipedia.org/wiki/Cloze_test)*.

Illustration of the CBOW model (in comparison to the skip-gram):
<img src="pics/cbow-vs-skipgram.png">

##  Bonus 6. Representing the data

Given a corpus, extract the training data for the CBOW model using a window size of 2 words on each side of the target word. The following image shows what the input of the training algorithm (`Input`) should look like (`Training window`):


<img src="pics/cbow-window.jpg">

Hints:
* Remember to pad the input with `"<pad>"` when the available context is smaller than the expected window size. This also means that the `"<pad>"` token should be in the vocabulary; reserve index `0` for this special token.
* In PyTorch, all input is expected to be a `torch.tensor`. You can create these beforehand with `torch.zeros()`, or just convert a resulting Python list by using `torch.tensor(train_data)`.

Example:

Given the following tiny corpus:
```
tiny_corpus = ["this is an example", "this is a longer example sentence", "I love deep learning"]
```

To create the `train_X` data, you first need to extract n-gram windows and the target words:

```
label,context
this ['<pad>', '<pad>', 'is', 'an']
is ['<pad>', 'this', 'an', 'example']
an ['this', 'is', 'example', '<pad>']
...
```

And convert them into numeric format, where each word token is represented by its unique index:

```
train_labels = [ 1,  2,  3,  4,  1,  2,  5,  6,  4,  7,  8,  9, 10, 11]
train_data = [[ 0,  0,  2,  3],
 [ 0,  1,  3,  4],
 [ 1,  2,  4,  0],
 [ 2,  3,  0,  0],
 [ 0,  0,  2,  5],
 [ 0,  1,  5,  6],
 [ 1,  2,  6,  4],
 [ 2,  5,  4,  7],
 [ 5,  6,  7,  0],
 [ 6,  4,  0,  0],
 [ 0,  0,  9, 10],
 [ 0,  8, 10, 11],
 [ 8,  9, 11,  0],
 [ 9, 10,  0,  0]]
```
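
As a quick sanity check of the numeric format, the lists from the example above can be converted to tensors and their shapes inspected. The variable names `train_X` and `train_y` are assumptions for this sketch.

```python
import torch

train_labels = [1, 2, 3, 4, 1, 2, 5, 6, 4, 7, 8, 9, 10, 11]
train_data = [[0, 0, 2, 3],
              [0, 1, 3, 4],
              [1, 2, 4, 0],
              [2, 3, 0, 0],
              [0, 0, 2, 5],
              [0, 1, 5, 6],
              [1, 2, 6, 4],
              [2, 5, 4, 7],
              [5, 6, 7, 0],
              [6, 4, 0, 0],
              [0, 0, 9, 10],
              [0, 8, 10, 11],
              [8, 9, 11, 0],
              [9, 10, 0, 0]]

# Python ints become int64 tensors, which is the index type an embedding lookup expects.
train_X = torch.tensor(train_data)
train_y = torch.tensor(train_labels)

print(train_X.shape, train_X.dtype)
print(train_y.shape)
```

There is one row of four context indices per target word, so the first dimension of `train_X` equals the number of labels.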

In [80]:
tiny_corpus = ["this is an example", "this is a longer example sentence", "I love deep learning"]

Suggestion: Implement all your steps first on the `tiny_corpus` data. Then test your implementation on the provided data `sample.txt`.

In [121]:
import pandas as pd
import numpy as np

## global settings
PAD = "<PAD>"
window_size=2

### your code here
def tokenize(corpus: list[str]):
    # Whitespace tokenization; the vocabulary keeps the order of first appearance.
    tokens = [tok for sentence in corpus for tok in sentence.split()]
    words = [PAD] + list(dict.fromkeys(tokens))  # index 0 is reserved for PAD

    wrd2idx = {word: index for index, word in enumerate(words)}
    idx2wrd = words

    return wrd2idx, idx2wrd

wrd2idx, idx2wrd = tokenize(tiny_corpus)
print(wrd2idx)
print(idx2wrd)

def create_train_data(corpus):
    wrd2idx, idx2wrd = tokenize(corpus)
    labels_idx = []
    labels_wrd = []
    train_idx = []
    train_wrd = []

    for sentence in corpus:
        sentence = sentence.split()
        pad = [PAD] * window_size  # window_size context words on each side of the target
        sentence = pad + sentence + pad

        for i, word in enumerate(sentence):
            if word == PAD:
                continue  # padding positions are never prediction targets

            before = sentence[i-window_size:i]
            after = sentence[i+1:i+window_size+1]
            x = before + after
            xi = [wrd2idx[w] for w in x]
            yi = wrd2idx[word]

            train_idx.append(xi)
            labels_idx.append(yi)
            train_wrd.append(x)
            labels_wrd.append(word)
    return labels_idx, train_idx, labels_wrd, train_wrd
labels_idx, train_idx, labels_wrd, train_wrd = create_train_data(tiny_corpus)
print(labels_wrd)
print(train_wrd)

{'<PAD>': 0, 'this': 1, 'is': 2, 'an': 3, 'example': 4, 'a': 5, 'longer': 6, 'sentence': 7, 'I': 8, 'love': 9, 'deep': 10, 'learning': 11}
['<PAD>', 'this', 'is', 'an', 'example', 'a', 'longer', 'sentence', 'I', 'love', 'deep', 'learning']
['this', 'is', 'an', 'example', 'this', 'is', 'a', 'longer', 'example', 'sentence', 'I', 'love', 'deep', 'learning']
[['<PAD>', '<PAD>', 'is', 'an'], ['<PAD>', 'this', 'an', 'example'], ['this', 'is', 'example', '<PAD>'], ['is', 'an', '<PAD>', '<PAD>'], ['<PAD>', '<PAD>', 'is', 'a'], ['<PAD>', 'this', 'a', 'longer'], ['this', 'is', 'longer', 'example'], ['is', 'a', 'example', 'sentence'], ['a', 'longer', 'sentence', '<PAD>'], ['longer', 'example', '<PAD>', '<PAD>'], ['<PAD>', '<PAD>', 'love', 'deep'], ['<PAD>', 'I', 'deep', 'learning'], ['I', 'love', 'learning', '<PAD>'], ['love', 'deep', '<PAD>', '<PAD>']]


##  Bonus 7. Implement the continuous bag of words model for estimating word embeddings

Implement the CBOW model for word embeddings: a CBOW with window size 2, which `sums` the input embeddings and from that hidden representation `predicts` the target token. 

The steps for CBOW are as follows:
* Convert your data to (center word, context window) pairs (done in bonus 6)
* The model should have an embedding layer and a linear layer (and optionally a loss function; you can also compute the loss in the training loop)
* The forward function of the model should: look up the embeddings, sum them, convert the sum to logits (in the linear layer), and optionally calculate the loss (this can also be done in the training loop)
* The training loop consists of a for loop over the epochs and a nested for loop over the data. Inside the inner loop, we call the forward function and obtain the loss, after which the backward pass can be called and the weights updated


To train a model in Pytorch, one has to define a sub-class of `torch.nn.Module` (see also train.py). The constructor `__init__()` and the `forward()` function can then be defined to specify the structure of the network. In the `__init__` function, the layers are specified and initialized, whereas the `forward` function defines how the layers interact during a forward pass. You can use [`torch.nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) for the embedding layer, [`torch.nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) for the linear layer, and [`torch.nn.CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) as loss function. 

For some examples we refer to this [tutorial](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py) and this [introduction](https://towardsdatascience.com/an-easy-introduction-to-pytorch-for-neural-networks-3ea08516bff2).
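As a rough illustration of the structure described above, the following sketch shows one possible way to wire an embedding layer, a sum, a linear layer, and the loss together. It is a sketch under assumptions rather than the reference solution: the class name `CBOWSketch`, the attribute names, the vocabulary size of 12 (the tiny corpus plus the pad token), and the input layout (a column of context indices per instance) are chosen for illustration only.

```python
import torch
import torch.nn as nn

class CBOWSketch(nn.Module):  # hypothetical name, not part of the assignment template
    def __init__(self, emb_dim, vocab_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_dim, emb_dim)  # one vector per vocabulary entry
        self.linear = nn.Linear(emb_dim, vocab_dim)         # hidden representation -> logits
        self.loss_function = nn.CrossEntropyLoss()

    def forward(self, inputs, gold):
        # inputs: (context_len, batch) word indices, gold: (batch,) target indices
        embeds = self.embeddings(inputs)   # (context_len, batch, emb_dim)
        hidden = embeds.sum(dim=0)         # sum over the context words -> (batch, emb_dim)
        logits = self.linear(hidden)       # (batch, vocab_dim)
        loss = self.loss_function(logits, gold)
        return logits, loss

torch.manual_seed(0)
model = CBOWSketch(emb_dim=64, vocab_dim=12)
context = torch.tensor([0, 0, 2, 3]).view(-1, 1)  # one instance, shape (4, 1)
gold = torch.tensor([1])                          # its target word, shape (1,)
logits, loss = model(context, gold)
print(model)
print(logits.shape, loss.item())
```

Printing the model lists the embedding and linear layers along with the loss module, which is a quick way to check that every layer was registered. `CrossEntropyLoss` applies the softmax internally, so the forward pass returns raw logits.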

* a) Implement the CBOW network as described above:

**Hint**: you can print the structure of the model by simply printing the initialized variable. Make sure all the layers are represented in the forward pass.

In [36]:
import torch
import torch.nn as nn
embed_dim = 64

class CBOW(nn.Module):
    def __init__(self, emb_dim, vocab_dim):
        super(CBOW, self).__init__()
        pass

    
    def forward(self, inputs, gold):
        return inputs

cbow_model = CBOW(embed_dim, len(wrd2idx))
print(cbow_model)

* b) Now implement the training procedure with gradient descent (`learning rate=0.001`). Go through the dataset `10` times, and update the weights after each training instance (`batch size = 1`). An example of a training procedure can be found at: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#train-the-network

**Hint**: you have to convert the lists created in assignment 3 to be able to do the forward pass. The forward pass expects its input to be in tensors. So for the gold labels this means we have to ensure that we do not pass a zero-dimension tensor which looks like: `tensor(1)`, but convert this to `tensor([1])`. Similarly for the training data, we convert `tensor([0, 0, 2, 3])` to `tensor([[0], [0], [2], [3]])`. This can be done with [tensor views](https://pytorch.org/docs/stable/generated/torch.Tensor.view.html#torch.Tensor.view).
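The training procedure and the tensor conversion from the hint can be sketched as below. This is a minimal sketch under assumptions: to stay self-contained it uses a bare embedding layer and linear layer instead of the `CBOW` class, only the four instances of the first tiny-corpus sentence, and an arbitrary embedding size of 16. The optimizer choice (`torch.optim.SGD`) is one plausible reading of "gradient descent".

```python
import torch
import torch.nn as nn

# Toy data in the numeric format of bonus 6 (first sentence of the tiny corpus only)
train_data = [[0, 0, 2, 3], [0, 1, 3, 4], [1, 2, 4, 0], [2, 3, 0, 0]]
train_labels = [1, 2, 3, 4]
vocab_size, emb_dim = 5, 16  # illustrative sizes

torch.manual_seed(0)
embeddings = nn.Embedding(vocab_size, emb_dim)
linear = nn.Linear(emb_dim, vocab_size)
loss_function = nn.CrossEntropyLoss()
params = list(embeddings.parameters()) + list(linear.parameters())
optimizer = torch.optim.SGD(params, lr=0.001)

epoch_losses = []
for epoch in range(10):                                   # go through the dataset 10 times
    total_loss = 0.0
    for context, label in zip(train_data, train_labels):  # batch size 1
        inputs = torch.tensor(context).view(-1, 1)  # tensor([0, 0, 2, 3]) -> shape (4, 1)
        gold = torch.tensor(label).view(1)          # tensor(1) -> tensor([1])
        optimizer.zero_grad()                       # clear gradients from the previous step
        logits = linear(embeddings(inputs).sum(dim=0))  # forward pass: look up, sum, project
        loss = loss_function(logits, gold)
        loss.backward()                             # backward pass
        optimizer.step()                            # weight update after each instance
        total_loss += loss.item()
    epoch_losses.append(total_loss)
    print(f"epoch {epoch}: loss {total_loss:.4f}")
```

Calling `optimizer.zero_grad()` before each forward pass matters because PyTorch accumulates gradients across `backward()` calls by default.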
