# Assignment 1

<b>Group 58</b>
* <b> Student 1 </b> : Luc Reinink, 1068948
* <b> Student 2 </b> : Gerrit Merz, 1553410

**Reading material**
* [1] Mikolov, Tomas, et al. "[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)" Advances in neural information processing systems. 2013. 

<b><font color='red'>NOTE</font></b> When submitting your notebook, please make sure that the training history of your model is visible in the output. This means that you should **NOT** clean your output cells of the notebook. Make sure that your notebook runs without errors in linear order.



# Question 1 - Keras implementation (10 pt)

### Word embeddings
Build word embeddings with a Keras implementation, using embedding vectors of length 50, 150 and 300. Use the text of Alice in Wonderland for training, with a window size of 2 (`window_size` in the Jupyter notebook).

1. Build word embeddings of length 50, 150 and 300 using the Skipgram model
2. Build word embeddings of length 50, 150 and 300 using CBOW model
3. Analyze the different word embeddings:
    - Implement your own function to perform the analogy task (see [1] for concrete examples). Use the same distance metric as in the paper. Do not use existing libraries for this task such as Gensim. 
Your function should be able to answer whether an analogy like: "a king is to a queen as a man is to a woman" ($e_{king} - e_{queen} + e_{woman} \approx e_{man}$) is true. $e_{x}$ denotes the embedding of word $x$. We want to find the word $p$ in the vocabulary, where the embedding of $p$ ($e_p$) is the closest to the predicted embedding (i.e. result of the formula). Then, we can check if $p$ is the same word as the true word $t$.
    - Give at least 5 different examples of analogies.
    - Compare the performance on the analogy tasks between the word embeddings and briefly discuss your results.

4. Discuss:
  - Given the same number of sentences as input, CBOW and Skipgram arrange the data into a different number of training samples. Which one has more and why?


<b>HINT</b> See practical 3.1 for some helpful code to start this assignment.


### Import libraries

In [0]:
%tensorflow_version 2.x

In [0]:
import numpy as np
import keras.backend as K
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Reshape, Lambda
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing import sequence

# other helpful libraries
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors as nn
from matplotlib import pylab
import pandas as pd

Using TensorFlow backend.


In [0]:
print(tf.__version__) #  check what version of TF is imported

2.2.0


### Import file

If you use Google Colab, you need to mount your Google Drive to the notebook when you want to use files that are located in your Google Drive. Paste the authorization code, from the new tab page that opens automatically when running the cell, in the cell below.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Navigate to the folder in which `alice.txt` is located. Make sure to start path with '/content/drive/My Drive/' if you want to load the file from your Google Drive.

In [0]:
cd '/content/drive/My Drive/2IMM10 Deep Learning/Assignments/Assignment 1'

/content/drive/My Drive/2IMM10 Deep Learning/Assignments/Assignment 1


In [0]:
# cd '/content/drive/My Drive/Deep Learning'

In [0]:
file_name = 'Resources/alice.txt'
# file_name = 'alice.txt'
corpus = open(file_name).readlines()

### Data preprocessing

See Practical 3.1 for an explanation of the preprocessing steps done below.

In [0]:
# Removes sentences with fewer than 3 words
corpus = [sentence for sentence in corpus if sentence.count(" ") >= 2]

# remove punctuation in text and fit tokenizer on entire corpus
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'+"'")
tokenizer.fit_on_texts(corpus)

# convert text to sequence of integer values
corpus = tokenizer.texts_to_sequences(corpus)
n_samples = sum(len(s) for s in corpus) # total number of words in the corpus
V = len(tokenizer.word_index) + 1 # vocabulary size: unique words + 1, since index 0 is never assigned to a word

In [0]:
n_samples, V

(27165, 2557)

In [0]:
# example of what the word-to-integer mapping in the tokenizer looks like
print(list((tokenizer.word_index.items()))[:5])

[('the', 1), ('and', 2), ('to', 3), ('a', 4), ('it', 5)]


In [0]:
# parameters
window_size = 2
window_size_corpus = 4

## Task 1.1 - Skipgram
Build word embeddings of length 50, 150 and 300 using the Skipgram model.

### Explanation
The sections below motivate some of the choices that were made.

#### 1-2 Preparing data for Skipgram
While preparing data for the Skipgram model, we loop over all words of every sentence and treat each word as an input word. For each input word, we take the two words before it and the two words after it, and use the one-hot encoding of each of them as a target. Hence, each word of the corpus has four target words, except for words near the edges of a sentence, which have fewer.
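The pairing described above can be illustrated on a single toy sentence. This is a minimal sketch in plain Python: the helper `skipgram_pairs` and the word indices in `toy_sentence` are made up for the example, and the targets are left as integer indices instead of one-hot vectors to keep the output readable.

```python
def skipgram_pairs(sentence, window_size):
    """Return (input_word, context_word) pairs for one tokenised sentence."""
    pairs = []
    for index, word in enumerate(sentence):
        # Clip the window to the sentence boundaries.
        lo = max(0, index - window_size)
        hi = min(len(sentence), index + window_size + 1)
        for i in range(lo, hi):
            if i != index:  # the input word is not its own context
                pairs.append((word, sentence[i]))
    return pairs


toy_sentence = [11, 12, 13, 14, 15]  # hypothetical word indices
toy_pairs = skipgram_pairs(toy_sentence, window_size=2)

print(len(toy_pairs))  # 14: the middle word gives 4 pairs, edge words give 2 or 3
print(toy_pairs[:2])   # [(11, 12), (11, 13)]
```

Only the middle word (13) has a full window of four context words; the words at the edges contribute two or three pairs each.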

#### 3 Creating Skipgram architecture
Following the architecture in [1], we first add an `Embedding` layer with `input_dim=V`, so that every word has an embedding vector of the provided `embed_length`. Next, we add a `Reshape` layer to flatten the output of the embedding so that it can be connected to a `Dense` layer with a softmax activation. The paper uses a hierarchical softmax; a full softmax over the vocabulary is the closest standard activation available in Keras. We use Glorot uniform initialisers where possible, since this initialiser scales the variance of the initial weights to the number of inputs and outputs of the layer, which often results in faster learning.
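To make the shapes in this layer stack explicit, the forward pass can be sketched in plain NumPy. This is an illustration only, not the Keras model: the sizes `toy_V` and `toy_embed_length`, the random weights `E`, `W` and `b`, and the function `skipgram_forward` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_V, toy_embed_length = 6, 3                  # hypothetical toy sizes

E = rng.normal(size=(toy_V, toy_embed_length))  # plays the role of the Embedding weights
W = rng.normal(size=(toy_embed_length, toy_V))  # plays the role of the Dense kernel
b = np.zeros(toy_V)                             # plays the role of the Dense bias


def skipgram_forward(word_index):
    """Embedding lookup followed by a dense softmax layer."""
    h = E[word_index]                    # shape (embed_length,)
    logits = h @ W + b                   # shape (V,)
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()               # probability for every word in the vocabulary


probs = skipgram_forward(2)
print(probs.shape)  # (6,)
```

After training, only `E` is kept: row `i` of the embedding matrix is the embedding of the word with index `i`.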

#### 4-5 Training
We train the model without a validation set, since the goal is not to predict the words surrounding the input word accurately, but to obtain a good embedding layer. We create three models, with embedding lengths of 50, 150 and 300.


In [0]:
# 1. Prepare data for Skipgram
def generate_skipgram(corpus, window_size, V):
    all_in = []
    all_out = []
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            p = index - window_size      # first context position (may be < 0)
            n = index + window_size + 1  # one past the last context position

            for i in range(p, n):
                # Skip the input word itself and positions outside the sentence
                if i != index and 0 <= i < L:
                    # Add the input word
                    all_in.append(word)
                    # Add the one-hot encoding of the context word
                    all_out.append(to_categorical(words[i], V))

    return (np.array(all_in), np.array(all_out))

# 2. Create training data
x, y = generate_skipgram(corpus, window_size, V)

# 3. Create Skipgram architecture
def create_skipgram_model(V, window_size, embed_length):
  skipgram = Sequential(name="SKIPGRAM_" + str(embed_length))
  skipgram.add(Embedding(input_dim=V, output_dim=embed_length, embeddings_initializer='glorot_uniform', input_length=1))
  skipgram.add(Reshape((embed_length, )))
  skipgram.add(Dense(V, kernel_initializer='glorot_uniform', activation='softmax'))
  skipgram.compile(loss='categorical_crossentropy', optimizer='adadelta')
  return skipgram

# 4. Train skipgram model
def train_skipgram_model(skipgram, epochs):
  skipgram.fit(x, y, batch_size=64, epochs=epochs, verbose = 1)
  return skipgram

# 5. Save embeddings for vectors of length 50, 150 and 300 using skipgram model.
embed_lengths = [50, 150, 300]
epochs = 10
skipgram_models = []

for embed_length in embed_lengths:
    skipgram = create_skipgram_model(V, window_size, embed_length)
    skipgram = train_skipgram_model(skipgram, epochs)
    skipgram_models.append(skipgram)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Task 1.2 - CBOW

Build word embeddings of length 50, 150 and 300 using CBOW model.

### Explanation
The sections below motivate some of the choices that were made.

#### 1-2 Preparing data for CBOW
While preparing data for the CBOW model, we loop over all words of every sentence and treat each word as a target word. For each target word, we take the two words before it and the two words after it as the context. If a context position falls outside the sentence, we fill it with 0, an index that the tokenizer never assigns to a word, to mark it as missing. We keep the context words in their original order. This is not necessary for CBOW, because the context vectors are averaged and the order therefore makes no difference, but it makes it easier to check that the generated pairs are correct.
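The padding behaviour is easiest to see on a toy sentence. This is a minimal sketch in plain Python: the helper `cbow_samples` and the word indices are made up for the example, and the targets are left as integer indices instead of one-hot vectors.

```python
def cbow_samples(sentence, window_size):
    """Return (context, target) samples, padding missing context positions with 0."""
    samples = []
    n = len(sentence)
    for i, target in enumerate(sentence):
        # Positions i-window_size .. i-1, in sentence order.
        before = [sentence[i - j] if i - j >= 0 else 0
                  for j in range(window_size, 0, -1)]
        # Positions i+1 .. i+window_size, in sentence order.
        after = [sentence[i + j] if i + j < n else 0
                 for j in range(1, window_size + 1)]
        samples.append((before + after, target))
    return samples


toy_samples = cbow_samples([11, 12, 13, 14], window_size=2)
for context, target in toy_samples:
    print(context, "->", target)
# [0, 0, 12, 13] -> 11
# [0, 11, 13, 14] -> 12
# [11, 12, 14, 0] -> 13
# [12, 13, 0, 0] -> 14
```

Every word produces exactly one sample, and every context has the fixed length `2 * window_size`, which is what the `input_length` of the `Embedding` layer expects.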

#### 3 Creating CBOW architecture
Following the architecture in [1], we first add an `Embedding` layer with `input_dim=V`, so that every word has an embedding vector of the provided `embed_length`. Next, we add a `Lambda` layer that averages the context vectors, as described in the paper. For the last layer, we add a `Dense` layer with a softmax activation. The paper uses a hierarchical softmax; a full softmax over the vocabulary is the closest standard activation available in Keras. We use Glorot uniform initialisers where possible, since this initialiser scales the variance of the initial weights to the number of inputs and outputs of the layer, which often results in faster learning.
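The averaging step is what makes CBOW insensitive to word order. The NumPy sketch below checks this on made-up data: the embedding matrix `E` and the two context arrays are assumptions for the example, and `mean(axis=0)` plays the role of the `Lambda` layer for a single sample.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 3))         # hypothetical embedding matrix: 6 words, 3 dimensions

context = np.array([1, 2, 4, 5])    # context word indices in sentence order
shuffled = np.array([5, 1, 4, 2])   # the same words in a different order

h = E[context].mean(axis=0)           # average of the four context embeddings
h_shuffled = E[shuffled].mean(axis=0)

print(h.shape)                     # (3,)
print(np.allclose(h, h_shuffled))  # True
```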

#### 4-5 Training
We train the model without a validation set, since the goal is not to classify accurately which word fits in the context, but to obtain a good embedding layer. We create three models, with embedding lengths of 50, 150 and 300.


In [0]:
# 1. Prepare data for CBOW.
def generate_data_cbow(corpus, window_size, V):
    X = []
    y = []

    for sentence in corpus:
        for i in range(len(sentence)):
            context_before = []
            context_after = []
            for j in range(1, window_size + 1):
                if (i - j >= 0):
                    context_before.append(sentence[i - j])
                else:
                    context_before.append(0)
                if (i + j < len(sentence)):
                    context_after.append(sentence[i + j])
                else:
                    context_after.append(0)

            context_before.reverse()
            context = context_before
            context.extend(context_after)
            target = sentence[i]
            X.append(context)
            y.append(to_categorical(target, V))
    
    return np.array(X), np.array(y)

# 2. Create training data.
X, y = generate_data_cbow(corpus, window_size, V)

# 3. Create CBOW architecture
def create_cbow_model(V, window_size, embed_length):
    cbow = Sequential(name="CBOW_" + str(embed_length))
    cbow.add(Embedding(input_dim=V, output_dim=embed_length, 
                       embeddings_initializer="glorot_uniform", 
                       input_length=window_size*2))
    # Average the context word vectors into a single vector.
    cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_length,)))
    cbow.add(Dense(V, activation="softmax", 
                   kernel_initializer="glorot_uniform"))
    cbow.compile(optimizer="adadelta", loss="categorical_crossentropy",
                metrics=["accuracy"])
    return cbow

# 4. Train CBOW model.
def train_cbow_model(cbow, epochs):
    cbow.fit(X, y, batch_size=64, epochs=epochs)
    return cbow

# 5. Save embeddings for vectors of length 50, 150 and 300 using CBOW model.
embed_lengths = [50, 150, 300]
epochs = 10
cbow_models = []

for embed_length in embed_lengths:
    cbow = create_cbow_model(V, window_size, embed_length)
    cbow = train_cbow_model(cbow, epochs)
    cbow_models.append(cbow)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [0]:
pd.DataFrame(cbow_models[0].get_weights()[0])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49
0,-0.003160,-0.044598,-0.022691,-0.027301,-0.045028,0.033939,-0.029093,0.036648,0.045196,0.016631,0.010622,0.003019,-0.026558,-0.026050,-0.020083,-0.018497,0.033937,-0.010353,-0.042910,0.010363,0.007339,0.035711,0.002487,0.027277,-0.044572,0.011258,-0.043617,0.021661,-0.026586,-0.032172,-0.009612,-0.042703,0.024925,0.011072,0.001941,-0.022489,-0.035658,-0.008519,-0.009015,0.022527,-0.029850,0.034572,0.018652,-0.002302,0.008422,0.043052,0.030261,-0.038872,-0.037510,-0.029392
1,-0.021964,0.025391,-0.036893,0.041336,-0.012319,-0.027251,-0.008523,-0.030166,-0.020421,0.024410,-0.017594,0.027649,0.011513,-0.012193,0.045672,-0.004220,-0.044305,-0.015557,-0.040187,0.039597,-0.017017,-0.042293,-0.040556,0.046749,0.017214,-0.026304,-0.025587,-0.041689,-0.035455,0.020387,0.004394,0.011694,0.036447,-0.006199,-0.047093,-0.017880,-0.030506,-0.019972,-0.018389,0.013909,-0.032367,-0.028685,-0.027665,-0.026521,0.036848,0.000955,-0.000757,0.013274,0.025449,0.001055
2,-0.047277,0.033200,0.004926,-0.043496,-0.038368,-0.019265,0.033222,-0.000083,-0.028099,-0.042271,-0.026451,0.009882,0.036982,-0.018226,-0.006882,0.028230,0.028179,-0.036253,0.045479,-0.012998,-0.003839,-0.027244,-0.013604,-0.010748,0.026709,0.034305,0.010073,-0.032167,0.022005,0.007604,0.021992,-0.032735,-0.033295,0.009248,0.042476,0.001713,0.018134,-0.015443,0.026741,0.041811,0.009153,-0.008215,0.010049,0.018712,-0.006159,-0.020338,-0.005121,-0.042298,0.041980,0.018065
3,0.016407,-0.017723,0.031362,0.045387,0.021351,-0.024102,0.034785,0.017627,-0.020827,0.001635,0.023610,0.040219,0.002360,-0.004440,0.015783,-0.015061,0.033655,0.031443,0.000940,-0.005680,0.018744,0.023590,-0.014390,0.038353,0.016115,0.022340,-0.015692,-0.005659,0.047526,-0.001613,-0.006103,-0.035971,-0.033398,0.045693,-0.026787,0.008014,0.046522,-0.010308,-0.030144,0.020764,-0.017119,0.023611,-0.001372,-0.013454,0.029037,-0.000050,0.025433,0.012419,-0.022872,0.029419
4,0.047382,0.011854,-0.025662,-0.005297,0.007176,-0.042802,-0.043173,-0.018175,0.036246,0.035380,-0.003234,-0.024898,-0.008043,-0.015511,-0.013995,0.025899,0.027597,0.015219,-0.016537,0.036126,0.029253,-0.005997,0.034750,0.007414,0.021962,-0.008539,-0.029028,-0.015255,-0.026723,0.003779,0.002824,-0.004126,-0.045323,0.015125,-0.013146,-0.019170,0.039758,-0.019789,-0.046448,-0.030153,0.020372,0.013563,-0.018378,-0.025887,0.023056,0.034963,0.017486,-0.022901,0.017075,-0.017986
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2552,0.016058,-0.023422,0.000718,0.016451,0.031005,-0.025403,0.007734,0.018306,0.039828,0.030876,-0.045993,-0.013638,0.027023,0.035126,-0.014713,-0.037673,-0.014547,-0.017329,0.004547,-0.046253,-0.041963,-0.035031,-0.011598,-0.026943,0.013734,-0.037233,0.039462,0.015614,-0.036819,0.003304,0.046912,0.023142,-0.040668,0.003700,0.031717,0.015668,0.046502,0.040458,0.018650,-0.023837,0.004692,-0.007501,0.040772,0.029929,0.040461,-0.047618,0.012199,0.020311,0.011519,0.028983
2553,0.012133,-0.018549,-0.029472,-0.004021,0.021166,-0.022940,0.001314,0.034995,-0.019117,0.002123,-0.042949,-0.016343,-0.016063,-0.033093,0.008955,-0.026190,0.007561,0.017338,-0.021278,0.027950,-0.039476,-0.015620,0.035698,-0.034358,0.042389,0.039134,0.024436,0.024816,-0.006029,0.044174,0.022035,0.037292,-0.008732,-0.003617,0.011482,0.035882,-0.016076,0.017827,-0.037035,-0.021931,-0.000728,-0.012227,0.036359,0.043644,0.026137,-0.018276,0.010664,0.047744,0.020993,0.018589
2554,-0.044441,0.032632,-0.003166,-0.025483,-0.019613,-0.019278,0.000999,-0.005454,0.047833,0.037355,-0.003038,0.036702,0.028664,0.030424,0.025799,0.016142,-0.032917,0.027476,-0.022885,0.021594,-0.027195,0.026230,-0.047075,-0.039135,-0.040725,0.015439,0.042844,-0.030973,-0.041198,0.031295,0.034244,0.021109,-0.003210,0.025499,0.011775,-0.022886,-0.023223,-0.033364,0.040503,0.014590,-0.022475,-0.014596,-0.005806,-0.016817,-0.005206,-0.026769,0.027993,0.023866,-0.021638,0.035950
2555,0.031610,0.043039,0.004712,0.011941,-0.034621,0.000077,-0.006990,0.008463,-0.044059,0.026455,0.034073,-0.030494,-0.003949,-0.038616,0.024017,-0.028927,0.002829,0.019450,0.005806,0.023589,-0.006409,0.022190,-0.029825,-0.018320,-0.046743,0.012812,-0.000242,0.038028,-0.039503,-0.042036,-0.031100,-0.000142,-0.002833,-0.007873,0.013970,-0.013227,-0.037101,0.007464,0.035066,-0.014960,0.025119,-0.022639,0.026728,0.047484,-0.004632,-0.044966,-0.025029,-0.024690,0.040140,0.013173


## Task 1.3 - Analogy function

Implement your own function to perform the analogy task (see [1] for concrete examples). Use the same distance metric as in [1]. Do not use existing libraries for this task such as Gensim. Your function should be able to answer whether an analogy like: "a king is to a queen as a man is to a woman" ($e_{king} - e_{queen} + e_{woman} \approx e_{man}$) is true. 

In a perfect scenario, we would like that this analogy ( $e_{king} - e_{queen} + e_{woman}$) results in the embedding of the word "man". However, it does not always result in exactly the same word embedding. The result of the formula is called the expected or the predicted word embedding. In this context, "man" is called the true or the actual word $t$. We want to find the word $p$ in the vocabulary, where the embedding of $p$ ($e_p$) is the closest to the predicted embedding (i.e. result of the formula). Then, we can check if $p$ is the same word as the true word $t$.  

You have to answer an analogy function using each embedding for both CBOW and Skipgram model. This means that for each analogy we have 6 outputs. Show the true word (with distance similarity value between predicted embedding and true word embedding, i.e. `sim1`) , the predicted word (with distance similarity value between predicted embedding and the embedding of the word in the vocabulary that is closest to this predicted embedding, i.e. `sim2`) and a boolean answer whether the predicted word **exactly** equals the true word. 

<b>HINT</b>: to visualize the results of the analogy tasks , you can print them in a table. An example is given below.


| Analogy task | True word (sim1)  | Predicted word (sim2) | Embedding | Correct?|
|------|------|------|------|------|
| queen is to king as woman is to ? | man (sim1) | predicted_word (sim2) | SG_50 | True / False |

* Give at least 5 different examples of analogies.
* Compare the performance on the analogy tasks between the word embeddings and briefly discuss your results.

### Explanation
We use cosine similarity to compare the `predicted_embedding` with every row of the embedding matrix. This yields a similarity vector $v$, where each value $v_i$ is the similarity between the predicted embedding and the embedding of the word with index $i$. We return the word whose index has the largest value in $v$, together with that value.
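The same search can be written without a loop. The sketch below is a NumPy-only variant under stated assumptions: the function name `most_similar_index` and the toy matrix are made up for the example, and every row of the matrix is assumed to be non-zero so that the division is defined.

```python
import numpy as np


def most_similar_index(predicted_embedding, embedding_matrix):
    """Return (row index, cosine similarity) of the row closest to the prediction."""
    pred = np.asarray(predicted_embedding, dtype=float).ravel()
    row_norms = np.linalg.norm(embedding_matrix, axis=1)
    # Cosine similarity of every row with the predicted embedding
    # (assumes no row and no prediction has zero norm).
    sims = embedding_matrix @ pred / (row_norms * np.linalg.norm(pred))
    best = int(np.argmax(sims))
    return best, float(sims[best])


toy_matrix = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
idx, sim = most_similar_index([2.0, 2.1], toy_matrix)
print(idx)  # 2: the row [1, 1] points in almost the same direction as [2, 2.1]
```

Cosine similarity only compares directions, so the length of the predicted embedding does not influence which word is returned.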

In [0]:
def get_embedding(model):
    weights = model.get_weights()
    return weights[0]

In [0]:
def embed(word, embedding):
    # Get the index of the word from the tokenizer,
    # i.e. convert the string to its corresponding integer in the vocabulary.
    int_word = tokenizer.texts_to_sequences([word])[0]
    # Get the one-hot encoding of the word.
    bin_word = to_categorical(int_word, V)
    return np.dot(bin_word, embedding)

In [0]:
def get_most_similar(predicted_embedding, embedding_matrix, reverse_word_map):
    similarity_vector = cosine_similarity(embedding_matrix, predicted_embedding)
    max_sim = similarity_vector[0][0]
    max_index = 0
    for i in range(1, len(similarity_vector)):
        sim = similarity_vector[i][0]
        if sim > max_sim:
            max_sim = sim
            max_index = i
    
    # Return tuple with (most_similar_word, similarity_value)
    return (reverse_word_map.get(max_index), max_sim)

In [0]:
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))
all_models = cbow_models + skipgram_models
embeddings = []

In [0]:
for model in all_models:
    embedding = get_embedding(model)
    true_word = "man"

    # Get the actual embedding of the word.
    true_embedding = embed(true_word, embedding)
    # Get the predicted embedding of the word.
    predicted_embedding = embed("king", embedding) - \
        embed("queen", embedding) + embed("woman", embedding)

    # Calculate the similarity between the predicted embedding,
    # and the true embedding, and save it as a (word, similarity_val) tuple.
    sim1 = (true_word, cosine_similarity(predicted_embedding, 
                                         true_embedding)[0][0])
    # Get the most similar word to the predicted_embedding 
    # in (most_similar_word, similarity_val) tuple.
    predicted_word_tuple = get_most_similar(predicted_embedding, 
                                            embedding, reverse_word_map)
    sim2 = predicted_word_tuple
    embedding_name = model.name
    is_correct = predicted_word_tuple[0] == true_word

    # Combine everything for easy DataFrame addition.
    embedding_tuple = ("a king is to a queen as a man is to a woman", 
                sim1, sim2, embedding_name, is_correct)

    embeddings.append(embedding_tuple)

In [0]:
for model in all_models:
    embedding = get_embedding(model)
    true_word = "queen"

    # Get the actual embedding of the word.
    true_embedding = embed(true_word, embedding)
    # Get the predicted embedding of the word.
    predicted_embedding = embed("king", embedding) - \
        embed("man", embedding) + embed("woman", embedding)

    # Calculate the similarity between the predicted embedding,
    # and the true embedding, and save it as a (word, similarity_val) tuple.
    sim1 = (true_word, cosine_similarity(predicted_embedding, 
                                         true_embedding)[0][0])
    # Get the most similar word to the predicted_embedding 
    # in (most_similar_word, similarity_val) tuple.
    predicted_word_tuple = get_most_similar(predicted_embedding, 
                                            embedding, reverse_word_map)
    sim2 = predicted_word_tuple
    embedding_name = model.name
    is_correct = predicted_word_tuple[0] == true_word

    # Combine everything for easy DataFrame addition.
    embedding_tuple = ("a king is to a man as a queen is to a woman", 
                sim1, sim2, embedding_name, is_correct)

    embeddings.append(embedding_tuple)

In [0]:
for model in all_models:
    embedding = get_embedding(model)
    true_word = "spoke"

    # Get the actual embedding of the word.
    true_embedding = embed(true_word, embedding)
    # Get the predicted embedding of the word.
    predicted_embedding = embed("went", embedding) - \
        embed("go", embedding) + embed("speak", embedding)

    # Calculate the similarity between the predicted embedding,
    # and the true embedding, and save it as a (word, similarity_val) tuple.
    sim1 = (true_word, cosine_similarity(predicted_embedding, 
                                         true_embedding)[0][0])
    # Get the most similar word to the predicted_embedding 
    # in (most_similar_word, similarity_val) tuple.
    predicted_word_tuple = get_most_similar(predicted_embedding, 
                                            embedding, reverse_word_map)
    sim2 = predicted_word_tuple
    embedding_name = model.name
    is_correct = predicted_word_tuple[0] == true_word

    # Combine everything for easy DataFrame addition.
    embedding_tuple = ("went is to go as spoke is to speak", 
                sim1, sim2, embedding_name, is_correct)

    embeddings.append(embedding_tuple)

In [0]:
for model in all_models:
    embedding = get_embedding(model)
    true_word = "before"

    # Get the actual embedding of the word.
    true_embedding = embed(true_word, embedding)
    # Get the predicted embedding of the word.
    predicted_embedding = embed("up", embedding) - \
        embed("down", embedding) + embed("after", embedding)

    # Calculate the similarity between the predicted embedding,
    # and the true embedding, and save it as a (word, similarity_val) tuple.
    sim1 = (true_word, cosine_similarity(predicted_embedding, 
                                         true_embedding)[0][0])
    # Get the most similar word to the predicted_embedding 
    # in (most_similar_word, similarity_val) tuple.
    predicted_word_tuple = get_most_similar(predicted_embedding, 
                                            embedding, reverse_word_map)
    sim2 = predicted_word_tuple
    embedding_name = model.name
    is_correct = predicted_word_tuple[0] == true_word

    # Combine everything for easy DataFrame addition.
    embedding_tuple = ("up is to down as before is to after", 
                sim1, sim2, embedding_name, is_correct)

    embeddings.append(embedding_tuple)

In [0]:
for model in all_models:
    embedding = get_embedding(model)
    true_word = "smallest"

    # Get the actual embedding of the word.
    true_embedding = embed(true_word, embedding)
    # Get the predicted embedding of the word.
    predicted_embedding = embed("largest", embedding) - \
        embed("large", embedding) + embed("small", embedding)

    # Calculate the similarity between the predicted embedding,
    # and the true embedding, and save it as a (word, similarity_val) tuple.
    sim1 = (true_word, cosine_similarity(predicted_embedding, 
                                         true_embedding)[0][0])
    # Get the most similar word to the predicted_embedding 
    # in (most_similar_word, similarity_val) tuple.
    predicted_word_tuple = get_most_similar(predicted_embedding, 
                                            embedding, reverse_word_map)
    sim2 = predicted_word_tuple
    embedding_name = model.name
    is_correct = predicted_word_tuple[0] == true_word

    # Combine everything for easy DataFrame addition.
    embedding_tuple = ("largest is to large as smallest is to small", 
                sim1, sim2, embedding_name, is_correct)

    embeddings.append(embedding_tuple)

### Discussion of results
The first observation is that none of the results are correct. This was expected because the corpus is very small. However, we also observe that the predicted word is always one of the "added" words in the formula, never the subtracted word. This is somewhat promising, because it suggests that the models have not learned total nonsense, although it is at least partly a by-product of the formula itself: the predicted embedding is a sum that contains the added vectors, so they score high under cosine similarity. For three mutually orthogonal vectors of equal length, the similarity to each added vector is $1/\sqrt{3} \approx 0.58$, which is close to the values in the table. The similarity to the true word is almost always close to zero. This is probably because the true word is too rare in the corpus for the model to learn a useful embedding for it.

We can also observe that the larger embeddings often have a higher similarity value. This is expected, since a higher-dimensional embedding can encode more information. Finally, we observe that the Skipgram model in general achieves somewhat higher similarity values, which is also expected due to the larger amount of training data (as explained below).

The results could probably be improved with more epochs. Overfitting is not a big problem for the models of this assignment, since their goal is to learn a compact representation of the text. CBOW could overfit on frequent words, but this is a problem with CBOW in general. The best way to get better results is to use a (much) larger dataset.
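One way to look past the effect of the input words is to leave them out of the search; as far as we can tell, the evaluation in [1] also discards the input question words. The sketch below is not the code used for the table in this notebook: the function `solve_analogy`, its `exclude_inputs` flag and the vectors in `toy_vectors` are all hypothetical and chosen by hand to show the difference.

```python
import numpy as np


def solve_analogy(a, b, c, vectors, exclude_inputs=True):
    """Return (word, cosine similarity) closest to vectors[a] - vectors[b] + vectors[c]."""
    predicted = vectors[a] - vectors[b] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if exclude_inputs and word in (a, b, c):
            continue  # do not allow an input word as the answer
        sim = vec @ predicted / (np.linalg.norm(vec) * np.linalg.norm(predicted))
        if sim > best_sim:
            best_word, best_sim = word, float(sim)
    return best_word, best_sim


# Hand-made toy vectors, not trained embeddings.
toy_vectors = {
    "king":   np.array([1.0, 0.0, 0.0, 0.0]),
    "queen":  np.array([0.0, 1.0, 0.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 0.9, 0.0]),
    "man":    np.array([0.3, -0.3, 0.3, 0.9]),
    "rabbit": np.array([0.0, 0.0, 0.0, 1.0]),
}

with_inputs = solve_analogy("king", "queen", "woman", toy_vectors, exclude_inputs=False)
without_inputs = solve_analogy("king", "queen", "woman", toy_vectors)
print(with_inputs[0])     # king
print(without_inputs[0])  # man
```

In this toy setup the unrestricted search returns an input word, just as in the table, while the restricted search returns the intended answer. Whether the same would happen with the embeddings trained in this notebook is not something this sketch shows.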

In [0]:
df = pd.DataFrame(columns=["Analogy Task", "True word (sim1)", "Predicted word (sim2)", "Embedding", "Correct?"])

for i in range(len(embeddings)):
    df.loc[i] = embeddings[i]

df

Unnamed: 0,Analogy Task,True word (sim1),Predicted word (sim2),Embedding,Correct?
0,a king is to a queen as a man is to a woman,"(man, -0.13910052)","(king, 0.5858841)",CBOW_50,False
1,a king is to a queen as a man is to a woman,"(man, -0.015250909)","(woman, 0.53653497)",CBOW_150,False
2,a king is to a queen as a man is to a woman,"(man, -0.027084215)","(woman, 0.60925084)",CBOW_300,False
3,a king is to a queen as a man is to a woman,"(man, 0.026587266)","(woman, 0.6127698)",SKIPGRAM_50,False
4,a king is to a queen as a man is to a woman,"(man, -0.0278957)","(king, 0.59490204)",SKIPGRAM_150,False
5,a king is to a queen as a man is to a woman,"(man, 0.069269456)","(king, 0.66545063)",SKIPGRAM_300,False
6,a king is to a man as a queen is to a woman,"(queen, 0.054794166)","(king, 0.5376383)",CBOW_50,False
7,a king is to a man as a queen is to a woman,"(queen, 0.1011336)","(king, 0.5778287)",CBOW_150,False
8,a king is to a man as a queen is to a woman,"(queen, -0.004370478)","(woman, 0.59991646)",CBOW_300,False
9,a king is to a man as a queen is to a woman,"(queen, 0.10654198)","(king, 0.6757214)",SKIPGRAM_50,False


## Task 1.4 - Discussion
Answer the following question:
* Given the same number of sentences as input, CBOW and Skipgram arrange the data into a different number of training samples. Which one has more and why?


### Answer
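With Skipgram, every (input word, context word) pair is a separate training sample, so each input word yields up to four samples with `window_size = 2` (fewer near the edges of a sentence). With CBOW, the whole context is combined into a single input, so each target word yields exactly one sample. Hence, given the same sentences, the Skipgram model has more training samples, up to `2 * window_size` times as many.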
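The difference can be counted directly without building any training arrays. This is a minimal sketch: the helper `count_samples` and the two sentences in `toy_corpus` are made up for the example, and the counting follows the data preparation described for the two models in this notebook.

```python
def count_samples(corpus, window_size):
    """Count Skipgram and CBOW training samples for a tokenised corpus."""
    skipgram_count = 0
    cbow_count = 0
    for sentence in corpus:
        n = len(sentence)
        cbow_count += n  # one (context, target) sample per word
        for index in range(n):
            lo = max(0, index - window_size)
            hi = min(n, index + window_size + 1)
            skipgram_count += (hi - lo) - 1  # context words that fall inside the sentence
    return skipgram_count, cbow_count


toy_corpus = [[1, 2, 3, 4, 5], [6, 7, 8]]  # hypothetical word indices
sg, cb = count_samples(toy_corpus, window_size=2)
print(sg, cb)  # 20 8
```

For a sentence of $n \geq w + 1$ words and window size $w$, the Skipgram count works out to $2wn - w(w+1)$, so for long sentences the ratio approaches $2w$ samples per CBOW sample.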

# Question 2 - Peer review (0 pt):
Finally, each group member must write a single paragraph outlining their opinion on the work distribution within the group. Did every group member contribute equally? Did you split up tasks in a fair manner, or did you work through the exercises jointly? Do you think that some members of your group deserve a different grade from others? You can use the table below to give an overview of how the tasks were divided:



__Luc__: Gerrit has been sick for the last couple of days so I will be speaking on his behalf as well. In my opinion, I did some more work than Gerrit. However, this was out of my own motivation, and because of the fact that Gerrit has been sick for a couple of days. Moreover, the tasks we split up were finished on time. I think we deserve an equal grade.

| Student name | Task  |
|------|------|
|  Luc Reinink  | CBOW implementation, code for generating similarity values and analogy table, analyse results. |
| Gerrit Merz  | Skipgram implementation, checking correctness of CBOW implementation. |
| Everyone | Try out different analogies to find interesting results. |
