In this notebook, we explore the models proposed by Mikolov et al. in [1]. We will build the Skipgram and CBOW models from scratch, train them on a relatively small corpus, and take a closer look at some analogies using the trained models. We will look at three different embedding dimensions to get a better intuition for how the number of dimensions influences the results. The goal is not to obtain high performance but to get a better understanding of the models. For that reason, Skipgram does not use negative sampling, even though it would be used in practice.


[1] Mikolov, Tomas, et al. "Efficient Estimation of Word Representations in Vector Space." arXiv preprint arXiv:1301.3781. 2013.

In [None]:
%tensorflow_version 2.x

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


In [None]:
import numpy as np
import keras.backend as K
import tensorflow as tf
import operator
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Reshape, Lambda
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing import sequence
from sklearn.metrics.pairwise import cosine_distances

from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors as nn
from matplotlib import pylab
import pandas as pd

### Import file

In [None]:
file_name = '/content/alice.txt'
# Read the corpus line by line; the context manager closes the file afterwards
with open(file_name) as f:
    corpus = f.readlines()

### Data preprocessing


In [None]:
# Remove sentences with fewer than 3 words
corpus = [sentence for sentence in corpus if sentence.count(" ") >= 2]

# Remove punctuation in text and fit tokenizer on entire corpus
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'+"'")
tokenizer.fit_on_texts(corpus)

# Convert text to sequence of integer values
corpus = tokenizer.texts_to_sequences(corpus)
n_samples = sum(len(s) for s in corpus) # Total number of words (tokens) in the corpus
V = len(tokenizer.word_index) + 1 # Vocabulary size: unique words plus index 0, which the tokenizer reserves (used for padding)

In [None]:
n_samples, V

(27165, 2557)

In [None]:
# Example of what the word-to-integer mapping in the tokenizer looks like
print(list((tokenizer.word_index.items()))[:5])

[('the', 1), ('and', 2), ('to', 3), ('a', 4), ('it', 5)]


In [None]:
# Parameters
window_size = 2
window_size_corpus = 4

# Set numpy seed for reproducible results
np.random.seed(42)

## Skipgram

In [None]:
# Prepare data for the skipgram model
def generate_data_skipgram(corpus, window_size, V):
    """Return (target word, one-hot context word) pairs for every word in the corpus."""
    all_in = []
    all_out = []
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            # First and one-past-last index of the window around the target word
            p = index - window_size
            n = index + window_size + 1

            for i in range(p, n):
                # Skip the target word itself and positions outside the sentence
                if i != index and 0 <= i < L:
                    # Add the input (target) word
                    all_in.append(word)
                    # Add the one-hot encoding of the context word
                    all_out.append(to_categorical(words[i], V))

    return (np.array(all_in), np.array(all_out))

We break down each (target word, context word**s**) pair into separate (target word, context word) pairs. This is done with the `generate_data_skipgram` function above, which returns two NumPy arrays: `x` (the input, i.e., the target word) and `y` (the output, i.e., the one-hot encoded context word). We can now use this function to generate our training data.
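
To make the pair generation concrete, here is a minimal sketch on a toy sentence. The helper `skipgram_pairs` and the word indices are hypothetical and only mirror the windowing logic; unlike the notebook's function, the sketch returns plain index pairs instead of one-hot vectors.

```python
def skipgram_pairs(sentence, window_size):
    """Return (target, context) index pairs for one tokenized sentence."""
    pairs = []
    for index, target in enumerate(sentence):
        # Clip the window to the sentence boundaries
        start = max(0, index - window_size)
        end = min(len(sentence), index + window_size + 1)
        for i in range(start, end):
            if i != index:  # the target word is not its own context
                pairs.append((target, sentence[i]))
    return pairs


toy_sentence = [11, 12, 13, 14]  # hypothetical word indices
pairs = skipgram_pairs(toy_sentence, window_size=1)
print(pairs)
# [(11, 12), (12, 11), (12, 13), (13, 12), (13, 14), (14, 13)]
```

Words at the sentence boundaries contribute fewer pairs than words in the middle, which is why the number of training pairs is smaller than `2 * window_size` times the number of words.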


In [None]:
# Create training data
X_skip, y_skip = generate_data_skipgram(corpus, window_size, V)
X_skip.shape, y_skip.shape

((94556,), (94556, 2557))

In [None]:
# Create skipgram architecture
dims = [50, 150, 300]
skipgram_models = []

for dim in dims:
    # Initialize a Keras Sequential model
    skipgram = Sequential()

    # Add an Embedding layer
    skipgram.add(Embedding(input_dim=V,
                           output_dim=dim,
                           input_length=1,
                           embeddings_initializer='glorot_uniform'))

    # Add a Reshape layer, which reshapes the output of the embedding layer (1,dim) to (dim,)
    skipgram.add(Reshape((dim, )))

    # Add a final Dense layer with the same size as in [1]
    skipgram.add(Dense(V, activation='softmax', kernel_initializer='glorot_uniform'))

    # Compile the model with a suitable loss function and select an optimizer.
    # The paper [1] used Adagrad; we use Adam instead (see the discussion below)
    skipgram.compile(optimizer=keras.optimizers.Adam(),
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])

    skipgram.summary()
    print("")
    skipgram_models.append(skipgram)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1, 50)             127850    
                                                                 
 reshape (Reshape)           (None, 50)                0         
                                                                 
 dense (Dense)               (None, 2557)              130407    
                                                                 
Total params: 258257 (1008.82 KB)
Trainable params: 258257 (1008.82 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 1, 150)            383550    
                                                         

We create a list that stores all dimensions (50, 150, and 300). We iterate over this list and create models for each dimension. Note that the base model we created is equal to the model used in Practical 3.1.
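
The parameter counts in the summary above can be checked by hand. The sketch below assumes the vocabulary size `V = 2557` reported earlier; the helper `count_parameters` is introduced here only for illustration.

```python
V = 2557  # vocabulary size reported earlier in the notebook


def count_parameters(vocab_size, dim):
    """Parameter counts for an Embedding layer followed by a Dense softmax layer."""
    embedding = vocab_size * dim           # one dim-sized vector per vocabulary index
    dense = dim * vocab_size + vocab_size  # weight matrix plus one bias per output unit
    return embedding, dense, embedding + dense


for dim in [50, 150, 300]:
    print(dim, count_parameters(V, dim))
# 50 (127850, 130407, 258257)
# 150 (383550, 386107, 769657)
# 300 (767100, 769657, 1536757)
```

The numbers for `dim = 50` match the printed summary (127850, 130407, and 258257 parameters), and the embedding count for `dim = 150` matches the 383550 shown for `embedding_1`.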

Since we want to predict a context word using a single input word, our `input_length` equals 1. The input dimension equals the size of our vocabulary, namely `V`. Conceptually, each input word is represented by a one-hot vector of dimension `V`, in which the index of the input word equals 1 and all other indices equal 0. We do not need to build these one-hot vectors ourselves: the embedding layer takes the integer index of the word and looks up the corresponding row of its weight matrix, which gives the same result as multiplying the one-hot vector by that matrix.
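
The relation between a one-hot input and an index lookup can be checked with a small NumPy sketch. The matrix `E` is a random stand-in for an embedding matrix, and the sizes are toy values rather than the notebook's `V` and `dims`.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 3                   # toy sizes, not the notebook's V and dims
E = rng.normal(size=(vocab_size, dim))   # stand-in for an embedding matrix

word_index = 4
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

via_product = one_hot @ E    # explicit one-hot vector times the matrix
via_lookup = E[word_index]   # selecting the row directly

print(np.allclose(via_product, via_lookup))  # True
```

Selecting a row avoids building and multiplying a vector of length `V` for every input word, which is why passing integer indices to the embedding layer is sufficient.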

Since we want to predict a certain context word, which can have `V` possible outcomes, we use a Softmax layer with `V` units such that we can map a probability to each word in `V`. Since our problem is a typical multiclass classification problem, we use **categorical** cross-entropy as our loss function.
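
As a small worked example of these two ingredients, the sketch below computes a softmax and the categorical cross-entropy by hand for a hypothetical four-word vocabulary. The logits are made up; in the models above they would be the output of the final `Dense` layer before the activation.

```python
import numpy as np


def softmax(logits):
    """Map raw scores to a probability distribution."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()


def categorical_cross_entropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -np.sum(one_hot_target * np.log(probabilities))


logits = np.array([2.0, 1.0, 0.1, -1.0])  # made-up scores for a 4-word vocabulary
target = np.array([0.0, 1.0, 0.0, 0.0])   # the true context word is word 1

probs = softmax(logits)
loss = categorical_cross_entropy(target, probs)
print(probs, loss)
```

Because the target is one-hot, the loss reduces to the negative log of the probability assigned to the true word, so it only reaches zero when the model puts all probability mass on that word.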

We use `adam` as our optimizer, an adaptive learning rate optimization algorithm designed for training deep neural networks [2]. We tried various optimizers, namely SGD, RMSProp, Adam, and Adagrad, and Adam seemed to perform best with respect to loss and accuracy. We also tried varying the learning rate. We initialize the weights with the `glorot_uniform` initializer since we were also requested to use it in Practical 3.1, which tackled a problem very similar to the one in this assignment.

[2] https://arxiv.org/pdf/1412.6980.pdf

In [None]:
# Training the skipgram models
for skipgram in skipgram_models:
    skipgram.fit(X_skip, y_skip, batch_size=64, epochs=13, verbose=1)
    print("")

Epoch 1/13
Epoch 2/13
Epoch 3/13
Epoch 4/13
Epoch 5/13
Epoch 6/13
Epoch 7/13
Epoch 8/13
Epoch 9/13
Epoch 10/13
Epoch 11/13
Epoch 12/13
Epoch 13/13

Epoch 1/13
Epoch 2/13
Epoch 3/13
Epoch 4/13
Epoch 5/13
Epoch 6/13
Epoch 7/13
Epoch 8/13
Epoch 9/13
Epoch 10/13
Epoch 11/13
Epoch 12/13
Epoch 13/13

Epoch 1/13
Epoch 2/13
Epoch 3/13
Epoch 4/13
Epoch 5/13
Epoch 6/13
Epoch 7/13
Epoch 8/13
Epoch 9/13
Epoch 10/13
Epoch 11/13
Epoch 12/13
Epoch 13/13



We fit each model that considers a varying number of dimensions with a batch size of 64 and 13 epochs.

We observe that the loss decreases very slowly and that the accuracy increases slowly. We track accuracy as a metric to get a rough idea of when to stop training. A drawback of our approach is that all models are trained for the same number of epochs, which is not necessarily optimal. We accept this because training does not take long, because we only want to create word embeddings, and because an equal number of epochs makes it easier to compare the losses and accuracies of the different models. For the model with 50 dimensions, the accuracy no longer increases significantly after 12 epochs (it even decreases slightly during one epoch). The same holds for the model with 150 dimensions after epoch 6. For the model with 300 dimensions, we could already stop after 4 epochs, since the accuracy only decreases afterwards. Hence, training these models for more epochs does not necessarily lead to better accuracy.
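
The stopping rule described informally above can be written down as a patience-based check. Keras ships an `EarlyStopping` callback for this purpose; the sketch below only illustrates the underlying rule in plain Python. The function `stopping_epoch` and the accuracy values are hypothetical and are not taken from the training runs in this notebook.

```python
def stopping_epoch(accuracies, patience=2):
    """Return the 1-based epoch with the best accuracy seen before patience ran out."""
    best_acc = float("-inf")
    best_epoch = 0
    waited = 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best_acc:
            # New best value: remember it and reset the waiting counter
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch


history = [0.10, 0.12, 0.13, 0.125, 0.12, 0.14]  # made-up accuracies per epoch
print(stopping_epoch(history, patience=2))  # 3
print(stopping_epoch(history, patience=3))  # 6
```

The two calls show the trade-off: a small patience stops at epoch 3 and misses the later improvement, while a larger patience keeps training long enough to find it.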

A major reason for this low accuracy is the relatively small size of our corpus. Our corpus, `alice.txt`, has only 26,283 words before processing, far fewer than the millions of words mentioned by Mikolov et al. in [1]. We suspect that our models are simply not able to learn much from such a small corpus. Tables 2 and 3 in [1] highlight this effect further. Since we trained our models on a relatively small corpus, low accuracies were to be expected. Rezaeinia et al. [3] further support this statement.

[3] https://arxiv.org/ftp/arxiv/papers/1711/1711.08609.pdf

In [None]:
for skipgram in skipgram_models:
    # Save embeddings for vectors of length 50, 150 and 300 using skipgram model
    weights = skipgram.get_weights()

    # Get the embedding matrix
    embedding = weights[0]

    # Create columns for the words and the values in the matrix, makes it easier to read as dataframe
    columns = ["word"] + [f"value_{i+1}" for i in range(embedding.shape[1])]

    # Write one line per word; the context manager closes the file even if a write fails
    with open(f"vectors_skipgram_{embedding.shape[1]}.txt", "w") as f:
        # Start with the column names
        f.write(" ".join(columns) + "\n")

        # Write each word followed by the values of its embedding vector
        for word, i in tokenizer.word_index.items():
            f.write(word + " " + " ".join(map(str, embedding[i, :])) + "\n")

## CBOW


In [None]:
# Prepare the data for the CBOW model
def generate_data_cbow(corpus, window_size, V):
    all_in = []
    all_out = []

    # Iterate over all sentences
    for sentence in corpus:
        L = len(sentence)
        for index, word in enumerate(sentence):
            start = index - window_size
            end = index + window_size + 1

            # Empty list which will store the context words
            context_words = []
            for i in range(start, end):
                # Skip the 'same' word
                if i != index:
                    # Add a word as a context word if it is within the window size
                    if 0 <= i < L:
                        context_words.append(sentence[i])
                    else:
                        # Pad with zero if there are no words
                        context_words.append(0)
            # Append the list with context words
            all_in.append(context_words)

            # Add one-hot encoding of the target word
            all_out.append(to_categorical(word, V))

    return (np.array(all_in), np.array(all_out))

For the CBOW model, we generate the training data differently than for the Skipgram model. With the CBOW model, we want to predict words based on their context. We do this by using a window around the word we want to predict. In our code, the size of this window is given by `window_size`. All words contained in the window are the context words. Note that when we predict the first or final few words of a sentence (how many depends on the window size), the window may reach into the previous or next sentence, respectively. In such a case, the window around the target word is restricted to the words in the same sentence. In the code, we solve this by using padding, which ensures that all sequences of context words have the same length. For the padding, we simply use a value of 0.
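
The effect of the padding is easiest to see on a toy sentence. The helper `cbow_contexts` below is a hypothetical, stripped-down version of the windowing logic that returns only the padded context lists.

```python
def cbow_contexts(sentence, window_size, pad_value=0):
    """Return one fixed-length context list per word in the sentence."""
    contexts = []
    for index in range(len(sentence)):
        context = []
        for i in range(index - window_size, index + window_size + 1):
            if i == index:
                continue  # the target word is not part of its own context
            # Use the neighbouring word if it exists, otherwise pad
            context.append(sentence[i] if 0 <= i < len(sentence) else pad_value)
        contexts.append(context)
    return contexts


toy_sentence = [11, 12, 13]  # hypothetical word indices
for target, context in zip(toy_sentence, cbow_contexts(toy_sentence, window_size=2)):
    print(target, context)
# 11 [0, 0, 12, 13]
# 12 [0, 11, 13, 0]
# 13 [11, 12, 0, 0]
```

Every context list has exactly `2 * window_size` entries, which is what allows the inputs to be stacked into a single array of shape `(number of words, 2 * window_size)`.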

In [None]:
# Create the training data
X_cbow, y_cbow = generate_data_cbow(corpus, window_size, V)
X_cbow.shape, y_cbow.shape

((27165, 4), (27165, 2557))

In [None]:
# Create the CBOW architecture
cbow_models = []

for dim in dims:
    cbow = Sequential()

    # Add an Embedding layer
    cbow.add(Embedding(input_dim=V,
                       output_dim=dim,
                       input_length=window_size*2, # Each input entry holds 2 * window_size context words
                       embeddings_initializer='glorot_uniform'))

    cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(dim, )))

    cbow.add(Dense(V, activation='softmax', kernel_initializer='glorot_uniform'))

    cbow.compile(optimizer=keras.optimizers.Adam(),
                 loss='categorical_crossentropy',
                 metrics=['accuracy'])

    cbow.summary()
    print("")
    cbow_models.append(cbow)

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 4, 50)             127850    
                                                                 
 lambda (Lambda)             (None, 50)                0         
                                                                 
 dense_3 (Dense)             (None, 2557)              130407    
                                                                 
Total params: 258257 (1008.82 KB)
Trainable params: 258257 (1008.82 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 4, 150)            383550    
                                                       

Again, we create a list for all the dimensions (50, 150, and 300). We iterate over this list and create models for each dimension.

Since we want to predict a target word using `window_size` $\cdot$ 2 context words, our `input_length` equals `window_size` $\cdot$ 2. Again, the input dimension simply equals the size of our vocabulary, namely `V`. The reasoning for this is analogous to the reasoning we provided for this matter when discussing the Skipgram model. This reasoning also applies to why we use Softmax and categorical cross-entropy. Furthermore, `adam` is used as our optimizer for the same reasons given in this section for the Skipgram model. Similar reasoning applies to why we use `glorot_uniform` initializers.
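
To illustrate what happens between the embedding layer and the softmax layer in the CBOW models, the sketch below reproduces the averaging step with NumPy. The matrix `E` and the index batch are toy stand-ins; the shapes follow the pattern of the model summary above, `(batch, 2 * window_size, dim)` before the mean and `(batch, dim)` after it.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 3                  # toy sizes
E = rng.normal(size=(vocab_size, dim))  # stand-in for an embedding matrix

# Two training examples with 2 * window_size = 4 context indices each;
# index 0 plays the role of the padding value
batch = np.array([[0, 1, 2, 3],
                  [2, 4, 5, 0]])

looked_up = E[batch]                # shape (2, 4, 3): one vector per context word
averaged = looked_up.mean(axis=1)   # shape (2, 3): one vector per training example

print(looked_up.shape, averaged.shape)  # (2, 4, 3) (2, 3)
```

Note that, in this sketch, the row for index 0 takes part in the average like any other row. As far as we can tell, the same holds for the models above, since the embedding layer is created without masking.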

In [None]:
# Train CBOW model
for cbow in cbow_models:
    cbow.fit(X_cbow, y_cbow, batch_size=64, epochs=50, verbose=1)
    print("")

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/

**Motivation**. The motivation for training the CBOW models the way we do is the same as for the Skipgram models. The only difference is that, for the CBOW models, we still see the accuracy increase and the loss decrease after several epochs, which was not the case for the Skipgram models. This is one of the reasons why we train CBOW for more epochs than Skipgram. Furthermore, we can afford more epochs because CBOW is significantly faster to train than Skipgram. The results of these models, together with those of the Skipgram models, will be explored further in Task 1.3.

In [None]:
for cbow in cbow_models:
    # Save embeddings for vectors of length 50, 150 and 300 using cbow model
    weights = cbow.get_weights()

    # Get the embedding matrix
    embedding = weights[0]

    # Create columns for the words and the values in the matrix, makes it easier to read as dataframe
    columns = ["word"] + [f"value_{i+1}" for i in range(embedding.shape[1])]

    # Write one line per word; the context manager closes the file even if a write fails
    with open(f"vectors_cbow_{embedding.shape[1]}.txt", "w") as f:
        # Start with the column names
        f.write(" ".join(columns) + "\n")

        # Write each word followed by the values of its embedding vector
        for word, i in tokenizer.word_index.items():
            f.write(word + " " + " ".join(map(str, embedding[i, :])) + "\n")

## Analogy function

Let us implement our own function to perform the analogy task. We will use the same distance metric as in [1]. With this function, we want to check whether an analogy such as "a king is to a queen as a man is to a woman" ($e_{king} - e_{queen} + e_{woman} \approx e_{man}$) holds. Ideally, the expression $e_{king} - e_{queen} + e_{woman}$ would result exactly in the embedding of the word "man". In practice, it rarely yields exactly the same embedding. In this context, we call "man" the true (or actual) word $t$. We look for the word $p$ in the vocabulary whose embedding $e_p$ is closest to the predicted embedding (i.e., the result of the formula). Then, we can check whether $p$ is the same word as the true word $t$.
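
Before applying this to the trained models, the idea can be illustrated with hand-made vectors. The two-dimensional embeddings below are hypothetical and constructed so that the analogy holds exactly (the first axis loosely stands for "royalty", the second for "gender"); trained embeddings will not be this clean.

```python
import numpy as np


def cosine_distance(u, v):
    """One minus the cosine similarity of two vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Hypothetical embeddings, chosen by hand for illustration
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

# e_king - e_queen + e_woman should land on e_man
predicted = toy["king"] - toy["queen"] + toy["woman"]

# The predicted word p is the vocabulary word closest to the predicted embedding
closest = min(toy, key=lambda word: cosine_distance(predicted, toy[word]))
print(predicted, closest)  # [0. 1.] man
```

Here the predicted embedding coincides with $e_{man}$, so the cosine distance to the true word is 0 and $p = t$.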

### Computing the distance between the predicted and true word

In [None]:
def embed(word, embedding, vocab_size=V, tokenizer=tokenizer):
    """ Embed a word (given as a string) by taking the dot product of its one-hot
        encoding with the embedding matrix
    """
    # Get the index of the word from the tokenizer, i.e. convert the string to its corresponding integer in the vocabulary
    int_word = tokenizer.texts_to_sequences([word])[0]
    # Get the one-hot encoding of the word
    bin_word = to_categorical(int_word, vocab_size)
    return np.dot(bin_word, embedding)

In [None]:
def compute_distance(word_a, word_b, word_c, word_d):
    """ Returns the cosine distance between the predicted and the true word (word_d)

    Our analogy function is: 'word_a is to word_b as word_c is to ?'
    Here, ? is predicted based on the embeddings. Then, we compare ? to word_d, which is the true word.
    """
    models = skipgram_models + cbow_models
    embeddings = [model.get_weights()[0] for model in models]
    for embedding in embeddings:
        predicted_embedding = embed(word_b, embedding) - embed(word_a, embedding) + embed(word_c, embedding)
        dist_exp_true = cosine_distances(predicted_embedding, embed(word_d, embedding))
        print(dist_exp_true[0][0])

In [None]:
# Example distances between the predicted and true word for skipgram 50, 150, 300, and cbow 50, 150, 300
# (the predicted embedding is e_word_b - e_word_a + e_word_c)
compute_distance('king', 'queen', 'woman', 'man')

0.85569066
0.88777417
0.8977496
0.85658985
0.879178
0.91641605


### Listing the top $z$ closest words based on an analogy function

In [None]:
from scipy.spatial.distance import cosine, cdist


def embed(word, embedding, vocab_size=V, tokenizer=tokenizer):
    # Get the index of the word from the tokenizer, i.e. convert the string to its corresponding integer in the vocabulary
    int_word = tokenizer.texts_to_sequences([word])[0]
    # Get the one-hot encoding of the word
    bin_word = to_categorical(int_word, vocab_size)
    # Flatten the result to a 1-D vector of length `dim`
    return np.dot(bin_word, embedding).reshape(-1)


def get_nearest_words(model_name, embed_word, used_words, nr=10):
    """Returns the `nr` nearest words to the `embed_word` for a certain `model_name`
    """
    # Load the model embedding matrix and create a list of all the words
    df = pd.read_csv(f"vectors_{model_name}.txt", sep=" ")

    # Filter out words that are in the analogy
    df = df[~(df["word"].isin(used_words))]

    # Store the embedded representation of the words
    embedded_words = df.iloc[:, 1:].values
    embedded_word = embed_word.reshape(1, -1)

    # Get the distances between the input embedding and the embedded words such that we can look for the smallest one
    # cdist makes it easy for us to compute the cosine distance between each pair of the two collections of inputs
    distances = cdist(embedded_word, embedded_words, "cosine").reshape(-1)

    # Sort distances and store the indices of the `nr` lowest distances
    top_sorted_indices = distances.argsort()[:nr]

    # Convert the indices to actual words
    top_words = [list(df["word"])[i] for i in top_sorted_indices]

    # Keep the rounded values of those indices
    values = [round(distances[i], 4) for i in top_sorted_indices]
    # Concatenate the top words together with their values and return it as a list
    return list(zip(top_words, values))


def print_analogy(analogy, embeddings, models, model_names, nr=10):
    # Retrieve the words from the analogy we need to compute
    word_a, word_b, word_c, word_true = analogy

    # Formulate the analogy task
    analogy_task = f"{word_a} is to {word_b} as {word_c} is to ?"

    print(f"Analogy Task: {analogy_task}")
    print("---------------------------------------------------")

    # Iterate over all models available
    for model_name, embedding in zip(model_names, embeddings):
        # Obtain embeddings for all the words
        embed_true = embed(word_true, embedding)
        embed_a, embed_b, embed_c = embed(word_a, embedding), embed(word_b, embedding), embed(word_c, embedding)

        # Obtain the predicted embedding based on the analogy function
        embed_prediction = embed_b - embed_a + embed_c

        # `sim1`: cosine distance between the predicted embedding and the embedding of the true word
        # (despite the name, this is a distance, so lower means closer)
        sim1 = round(cosine(embed_true, embed_prediction), 4)

        # `sim2`: cosine distance between the predicted embedding and the embedding of the predicted word,
        # i.e. the word in the vocabulary that is closest to the predicted embedding
        word_prediction, sim2 = get_nearest_words(model_name, embed_prediction, [word_a, word_b, word_c], 1)[0]

        # Get the top `nr` nearest words
        nearest_words = get_nearest_words(model_name, embed_prediction, [word_a, word_b, word_c], nr)

        # Print whether or not the true word was in the top nr
        partially_correct = word_true in [word[0] for word in nearest_words]

        print(f"Embedding: {model_name}")
        # Print all top nr words with their distance
        for word in nearest_words:
            print(f"{word[0]} => {round(word[1], 4)}")
        print(f"Predicted: {word_prediction} ({round(sim2, 4)}) - True: {word_true} ({sim1})")
        print(f"Correct? {word_prediction == word_true} - In the top {nr}? {partially_correct}")
        print("---------------------------------------------------")

The procedure is relatively simple. It boils down to the following steps:

1. Concatenate all models so that we can iterate over them.
2. Get the embedding matrix of each model.
3. Store the model names in a list.
4. Create a list of four-word tuples, where each tuple represents one analogy.
5. Iterate over the analogy tuples.
6. Compute the embedding of each word in the tuple.
7. Evaluate the analogy function on the first three words.
8. Return the `nr` words nearest to the resulting embedding under the cosine distance (we use this measure because it is the one mentioned in [1]).
9. Check whether the actual word (given as an input parameter) equals the predicted word.

For each prediction, we also print the top `nr` nearest words together with their cosine distances, which gives a better idea of what the model is predicting.

Important: note that we filter out the words that are in the analogy. That is, for some analogy "$x$ is to $y$ as $z$ is to ?", we filter out $x$, $y$, and $z$ from the top `nr` of entries since Mikolov et al. [1] state that they discard the input question words (the words used in the analogy).
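
The effect of discarding the input words can be shown with a small sketch. The function `nearest_words` and the two-dimensional vectors are hypothetical; the vectors are chosen by hand so that, without filtering, one of the input words is the nearest neighbour of the predicted embedding.

```python
import numpy as np


def nearest_words(query, embeddings, exclude=(), k=3):
    """Return the k (word, cosine distance) pairs closest to `query`, skipping `exclude`."""
    candidates = [word for word in embeddings if word not in exclude]
    matrix = np.stack([embeddings[word] for word in candidates])
    similarities = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    distances = 1.0 - similarities
    order = np.argsort(distances)[:k]
    return [(candidates[i], float(distances[i])) for i in order]


# Hypothetical embeddings, chosen by hand for illustration
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "woman": np.array([0.8, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "snail": np.array([-1.0, 0.1]),
}

query = toy["king"] - toy["queen"] + toy["woman"]  # equals [0.8, 1.0]

with_inputs = nearest_words(query, toy, k=1)
without_inputs = nearest_words(query, toy, exclude=("king", "queen", "woman"), k=1)
print(with_inputs[0][0], without_inputs[0][0])  # king man
```

Without the filter, the input word "king" is returned. With the filter, the search returns "man", which is the behaviour we want when evaluating an analogy.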

In [None]:
# Concatenate all models such that we can easily iterate over all models
models = skipgram_models + cbow_models

# Store the embeddings of all models such that we can easily iterate over them
word_embeddings = [model.get_weights()[0] for model in models]

# Store the model names such that we can easily iterate over them
model_names = ["skipgram_50", "skipgram_150", "skipgram_300", "cbow_50", "cbow_150", "cbow_300"]

# Set the number of top words to print
nr = 10

print_analogy(analogy=('queen', 'king', 'woman', 'man'), embeddings=word_embeddings, models=models, model_names=model_names, nr=nr)

Analogy Task: queen is to king as woman is to ?
---------------------------------------------------
Embedding: skipgram_50
snail => 0.3851
vii => 0.4129
doze => 0.4504
guessed => 0.4739
readily => 0.488
lory => 0.4897
handsome => 0.4925
five => 0.4936
furious => 0.498
steam => 0.5014
Predicted: snail (0.3851) - True: man (0.7142)
Correct? False - In the top 10? False
---------------------------------------------------
Embedding: skipgram_150
unusually => 0.6119
childhood => 0.6158
eaglet => 0.6328
lory => 0.6361
odd => 0.6413
deeply => 0.6432
cautiously => 0.6484
snail => 0.6511
begin => 0.6548
dodo => 0.6619
Predicted: unusually (0.6119) - True: man (0.7403)
Correct? False - In the top 10? False
---------------------------------------------------
Embedding: skipgram_300
eaglet => 0.625
m => 0.6408
snail => 0.6409
readily => 0.6848
uncommonly => 0.686
unusually => 0.693
conger => 0.6959
childhood => 0.7022
thousand => 0.7059
thimble => 0.7097
Predicted: eaglet (0.625) - True: man (0.80

In [None]:
analogies = [('he', 'is', 'we', 'are'), ('love', 'hate', 'little', 'large'), ('small', 'smaller', 'large', 'larger'), ('man', 'woman', 'king', 'queen'), ('mouse', 'mice', 'cat', 'cats')]
for analogy in analogies:
    print_analogy(analogy=analogy, embeddings=word_embeddings, models=models, model_names=model_names)

Analogy Task: he is to is as we is to ?
---------------------------------------------------
Embedding: skipgram_50
size => 0.5029
barrowful => 0.5464
knowing => 0.5502
delightful => 0.5512
pity => 0.5528
chatte => 0.5558
hold => 0.5713
flame => 0.5719
say => 0.5723
mean => 0.574
Predicted: size (0.5029) - True: are (0.8379)
Correct? False - In the top 10? False
---------------------------------------------------
Embedding: skipgram_150
pretending => 0.633
oh => 0.642
four => 0.6506
chuckled => 0.6753
cleared => 0.6834
whenever => 0.6861
used => 0.6903
lullaby => 0.6962
delightful => 0.6978
nurse => 0.7029
Predicted: pretending (0.633) - True: are (0.9271)
Correct? False - In the top 10? False
---------------------------------------------------
Embedding: skipgram_300
chuckled => 0.694
cheshire => 0.7195
used => 0.7348
shore => 0.7435
longed => 0.7436
barrowful => 0.7496
oh => 0.7505
pretending => 0.7512
nine => 0.7513
poker => 0.7567
Predicted: chuckled (0.694) - True: are (0.916)
Corr

### Different examples of analogies

Let us consider some of the more interesting analogies. We pick analogies whose relations are syntactically different from each other. This means that we do not, for example, pick five adjective-comparative analogies (such as small - smaller <-> large - larger and large - larger <-> clear - clearer). We also pick words that occur relatively frequently in the corpus, since we expect our models to perform better on those than on words that appear only once. Many analogies other than the ones we list here were experimented with; these can be found as comments in the code above. The five categories we cover for this task, together with their respective analogies, are:

**Pronoun to Verb:**
He is to is as we is to ...? (are)

**Antonyms:**
Love is to hate as little is to ...? (large)

**Adjective to its Comparative:**
Small is to smaller as large is to ...? (larger)

**Gender structure:**
A man is to a woman as a king is to a ...? (queen)

**Singular to Plural:**
A mouse is to mice as a cat is to ...? (cats)

### Comparing the performance of the word embeddings on the analogies

Let us consider the results of using the six models to predict the five analogies above. The results are presented in the table below. The number in parentheses in the "True word" and "Predicted word" columns is the cosine distance between the predicted embedding (the result of the analogy function) and the embedding of the word next to it; the predicted word is the nearest word with respect to this distance. All numbers are rounded to four decimal places.

| Analogy task | True word  | Predicted word | Embedding | Correct?|
|------|------|------|------|------|
|  He is to is as we is to ?  | are (0.7187) | do (0.4837) | SG_50 | False|
|  He is to is as we is to ?   | are (0.9673) | used (0.6482)  | SG_150 | False|
|  He is to is as we is to ?   | are (0.8766) | longed (0.6806) | SG_300 | False|
|  He is to is as we is to ? | are (0.9965) | knowing (0.4759) | CBOW_50 | False|
|  He is to is as we is to ?   | are (0.9177) | jogged (0.6363) | CBOW_150 | False|
|  He is to is as we is to ?   | are (0.9072) | guessed (0.7075) | CBOW_300 | False|
|  Love is to hate as little is to ?     | large (0.7033) | horse (0.4196) | SG_50 | False|
|  Love is to hate as little is to ?     | large (0.7775) | unlocking (0.569) | SG_150 | False|
|  Love is to hate as little is to ?     | large (0.9304) | rock (0.603) | SG_300 | False|
|  Love is to hate as little is to ?     | large (0.6845) | melancholy (0.3903) | CBOW_50 | False|
|  Love is to hate as little is to ?     | large (0.943) | kissed (0.486) | CBOW_150 | False|
|  Love is to hate as little is to ?     | large (0.9081) | kissed (0.6259)| CBOW_300 | False|
| Small is to smaller as large is to ?   | larger (0.5697) | our (0.4819) | SG_50 | False|
| Small is to smaller as large is to ?   | larger (0.6191) (third one) | panting (0.6100) | SG_150 | False|
| Small is to smaller as large is to ?   |  larger (0.6625) (second one) | giddy (0.6602) | SG_300 | False|
| Small is to smaller as large is to ?   | larger (0.6007) | ourselves (0.5046) | CBOW_50 | False|
| Small is to smaller as large is to ?   | larger (0.715) | flamingoes (0.6184) | CBOW_150 | False|
| Small is to smaller as large is to ?   | larger (0.6984) | bitter (0.6398) | CBOW_300 | False|
|  A man is to a woman as a king is to a ?   | queen (0.5978) | leave (0.4269) | SG_50 | False|
|  A man is to a woman as a king is to a ?   | queen (0.7337) | childhood (0.4903) | SG_150 | False|
|  A man is to a woman as a king is to a ?   | queen (0.8092) | childhood (0.5515)  | SG_300 | False|
|  A man is to a woman as a king is to a ?   | queen (0.5559) | thimble (0.3271) | CBOW_50 | False|
|  A man is to a woman as a king is to a ?   | queen (0.7842) | neatly (0.4794)  | CBOW_150 | False|
|  A man is to a woman as a king is to a ?   | queen (0.737) | odd (0.5122) | CBOW_300 | False|
| A mouse is to mice as a cat is to ?    | cats (0.7198) | sent (0.4014) | SG_50 | False|
| A mouse is to mice as a cat is to ?    | cats (0.7134) | mistake (0.5799) | SG_150 | False|
| A mouse is to mice as a cat is to ?    | cats (0.8525) | seeing (0.6483) | SG_300 | False|
| A mouse is to mice as a cat is to ?    | cats (0.8261) | existence (0.4654) | CBOW_50 | False|
| A mouse is to mice as a cat is to ?    | cats (0.9492)| mistake (0.5566) | CBOW_150 | False|
| A mouse is to mice as a cat is to ?    | cats (0.9827)| mistake (0.6009)  | CBOW_300 | False|

In terms of performance, none of the six models managed to predict any of the analogies correctly. The main reason is the size of our corpus. Our corpus, `alice.txt`, has only 26,283 words before processing, which is far fewer than the millions of words mentioned by Mikolov et al. in [1]. This suggests that our models cannot learn much because the corpus simply does not contain enough examples of each word in context.

Note, however, that for the analogy "*small* is to *smaller* as *large* is to ? (*larger*)", SG_150 and SG_300 came close to predicting the correct word: the true word was the third-nearest and second-nearest word, respectively. This can be observed in the output of the code cell above this cell and in the table. SG_300 in particular was very close, with a distance of 0.6625 for "larger" against 0.6602 for the predicted word "giddy". A possible reason is that this type of analogy is relatively easy to solve compared to some of our other analogies: "smaller" is simply "small" + "er", and "larger" is simply "large" + "r". Keep in mind, though, that our models treat each word as an atomic token and never see its spelling, so such a regularity can only be picked up when the words occur in similar contexts. While these predictions were not correct (the true words were not the nearest words), we still found the result worth mentioning.

Given the poor performance of both models on the analogies (on our corpus), it is hard to say which model performs better. To draw sound conclusions on which type of model is more suitable for predicting analogies, we believe we would need a larger corpus. If we were to draw a conclusion from the results we have right now, we would argue that Skipgram performs slightly better since it almost predicted the correct word twice, as mentioned in the paragraph above. This is in line with a comment from Mikolov in [4], where he argues, based on his experience, that Skipgram works well with a small amount of training data and represents rare words or phrases well.

From the table, we observe that the models with the lowest dimension (`dim=50`) tend to predict different words than the models with larger dimensions (`dim=150` and `dim=300`), which sometimes agree with each other. See, for example, the analogy "Love is to hate as little is to ?", where CBOW_150 and CBOW_300 both predicted "kissed", while CBOW_50 predicted "melancholy". Another example is the analogy "A man is to a woman as a king is to a ?", where SG_150 and SG_300 both predicted "childhood", while SG_50 predicted "leave". We suspect that this stems from our rather small corpus, which offers fewer "features" to learn from (e.g., fewer diverse words with different meanings in different contexts) than much larger corpora. Note that this agreement between the larger models occurs only three times in total, so we mention it as an interesting observation and nothing more.

We initially ran these methods without excluding the words that were present in the analogy. In that setting, the prediction was often one of the three words from the analogy itself. Intuitively, this makes sense: these words are used in the input and, therefore, heavily influence the result. These experiments make it clear why Mikolov et al. [1] discard the input question words (the words used in the analogy).

[4] https://groups.google.com/forum/#!searchin/word2vec-toolkit/c-bow/word2vec-toolkit/NLvYXU99cAM/E5ld8LcDxlAJ

## Miscellaneous

Let us take a closer look at the size of the training data used by Skipgram and CBOW.


### Which model has more training samples?
Let us focus on our corpus first and look at which model has more training samples. After that, we will explore why this is the case from a higher level. Note that in our analysis, we use a window size of $L=2$. The shape of the training samples for CBOW is $(27165, 4)$, which can be seen as $27165$ samples, where each sample contains $4$ (context) words. This means that, for CBOW, the size of the training data equals the total number of words in the corpus.

The shape of the training samples for Skipgram is $(94556,)$, which can be seen as $94556$ samples, each containing a single word. For Skipgram, the number of samples is bounded above by `n_samples` $\cdot$ `window_size` $\cdot 2$, where `n_samples` is the number of words in the corpus (and hence the number of CBOW samples), and it is somewhat lower in practice. The actual size is lower because the full window is not available for words at the beginning and end of a sentence. We cover this in more detail in the next few paragraphs.

Based on the information above, we can clearly conclude that Skipgram has more training samples if both CBOW and Skipgram have the same sentences as input.
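As a quick sanity check, the reported shapes can be compared against the bound `n_samples` $\cdot$ `window_size` $\cdot 2$. The snippet below only redoes the arithmetic with the numbers quoted above; the variable names are chosen here for illustration.

```python
n_samples = 27165         # CBOW samples reported above (one per word)
window_size = 2           # L
skipgram_samples = 94556  # Skipgram samples reported above

# Upper bound: every word contributes 2 * window_size pairs
upper_bound = n_samples * window_size * 2
print(upper_bound)                        # 108660

# Fraction of the bound that is actually reached (about 0.87)
print(skipgram_samples / upper_bound)

# Skipgram samples per CBOW sample (about 3.48)
print(skipgram_samples / n_samples)
```

So on our corpus, Skipgram reaches roughly 87% of the bound and has about 3.5 times as many samples as CBOW.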

### Why?
Let us consider how the data is generated for each method.

In CBOW, we have a training sample for each word we want to predict. This means that the total number of training samples is equal to the total number of words in the corpus. Moreover, each training sample consists of all the words that are contained in a window around the word we want to predict. Let $L$ denote the window size. The total number of words per training sample equals $2L$ after applying padding to make sure each training sample has $2L$ words.
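The CBOW sample generation described above can be sketched as follows. This is a simplified stand-in for the notebook's own generator: the function name `cbow_samples` is hypothetical, sentences are assumed to be lists of integer word ids, and we assume left-padding with the id `0`.

```python
def cbow_samples(sentences, L=2, pad=0):
    """Build one (context, target) sample per word, with contexts padded to 2L ids."""
    samples = []
    for sent in sentences:
        for t, target in enumerate(sent):
            left = sent[max(0, t - L):t]   # up to L words before the target
            right = sent[t + 1:t + 1 + L]  # up to L words after the target
            context = left + right
            # Left-pad so that every context has exactly 2L entries
            context = [pad] * (2 * L - len(context)) + context
            samples.append((context, target))
    return samples

# Two toy sentences given as integer word ids
sentences = [[1, 2, 3, 4, 5], [6, 7, 8]]
samples = cbow_samples(sentences)

n_words = sum(len(sent) for sent in sentences)
print(len(samples), n_words)  # both are 8: one sample per word
print(samples[0])             # ([0, 0, 2, 3], 1)
```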

In Skipgram, we want to predict the context words based on an input word. In practice, we generated the training samples by specifying a window size $L$ and iterating over all sentences. For each word $w_t$ in the sentence, we add the word pairs $(w_t, w_{t-L}), \ldots, (w_t, w_{t-1}), (w_t, w_{t+1}), \ldots, (w_t, w_{t+L})$ to the set of training samples. Note that it is not always possible to use the full window. For example, for the first word of a sentence, the $L$ preceding words do not exist, because we only consider words within the same sentence. The same problem appears for words at the end of a sentence. This is why the number of samples is somewhat less than `n_samples` $\cdot$ `window_size` $\cdot 2$.
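The pair generation can be sketched in a few lines. Again, this is a minimal illustration under assumptions: `skipgram_pairs` is a hypothetical name, sentences are lists of integer word ids, and no subsampling or negative sampling is applied.

```python
def skipgram_pairs(sentences, L=2):
    """Collect (center, context) pairs within a window of L words, per sentence."""
    pairs = []
    for sent in sentences:
        for t, center in enumerate(sent):
            # Clip the window at the sentence boundaries
            for j in range(max(0, t - L), min(len(sent), t + L + 1)):
                if j != t:
                    pairs.append((center, sent[j]))
    return pairs

# Two toy sentences given as integer word ids
sentences = [[1, 2, 3, 4, 5], [6, 7, 8]]
pairs = skipgram_pairs(sentences)

n_words = sum(len(sent) for sent in sentences)
print(len(pairs))       # 20 pairs
print(2 * 2 * n_words)  # 32, the bound n_samples * window_size * 2
```

The gap between 20 and 32 comes entirely from the clipped windows at the start and end of each sentence.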

From these descriptions of the data generation methods for both models, it is evident why Skipgram has more training samples than CBOW for an equal number of sentences with the same content. In conclusion, Skipgram has slightly less than `n_samples` $\cdot$ `window_size` $\cdot 2$ samples due to the way the samples are created. On the other hand, CBOW has `n_samples` training samples. Again, the former is larger in size (assuming `window_size` $\geq 1$).
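The size of the gap can also be written in closed form. The derivation below assumes that pairs never cross sentence boundaries and that the sentence has $n \geq L$ words; these assumptions are ours and are not checked against the notebook's generator. At offset $d$ there are $n - d$ positions with a neighbour $d$ words to the right, and equally many with a neighbour $d$ words to the left, so the number of pairs for one sentence is

$$
\sum_{d=1}^{L} 2\,(n - d) = 2nL - L(L+1).
$$

Each sentence therefore falls short of the bound $2nL$ by $L(L+1)$ pairs, which is $6$ for $L=2$. Since $2nL - L(L+1) > n$ for $L=2$ and $n \geq 3$, Skipgram still has more samples than CBOW for every sentence of at least three words.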