# L2: Word representations

In this lab you will implement the **skip-gram model with negative sampling (SGNS)** from Lecture&nbsp;2.4, and use it to train word embeddings on the text of the [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page).

⚠️ The dataset for this lab contains 18M tokens. This is very little as far as word embedding datasets are concerned – for example, the original word2vec model was pre-trained on 100B tokens. In spite of this, you will need to think about efficiency when processing the data and training your models. In particular, wherever possible you should use iterators rather than lists, and vectorize operations (using [NumPy](https://numpy.org) or [PyTorch](https://pytorch.org)) as much as possible.

## Load the data

The data for this lab comes as a bz2-compressed plain text file. It consists of 1,163,769 sentences, with one sentence per line and tokens separated by spaces. The cell below contains a wrapper class `SimpleWikiDataset` that can be used to iterate over the sentences (lines) in the text file. On the Python side of things, each sentence is represented as a list of tokens (strings).

In [1]:
import bz2

class SimpleWikiDataset():
    
    def __init__(self, max_sentences=None):
        self.max_sentences = max_sentences
    
    def __iter__(self):
        with bz2.open('simplewiki.txt.bz2', 'rt', encoding='utf-8') as sentences:
            for i, sentence in enumerate(sentences):
                if self.max_sentences and i >= self.max_sentences:
                    break
                yield sentence.split()

Using this class, we define two variants of the dataset: the full dataset and a minimal version with the first 1% of the sentences in the full dataset. The latter will be useful to test code without running it on the full dataset.

In [2]:
# Dataset with all sentences (N = 1,163,769)
full_dataset = SimpleWikiDataset()

# Minimal dataset
mini_dataset = SimpleWikiDataset(max_sentences=11638)

The next code cell defines a generator function that allows you to iterate over all tokens in a dataset:

In [3]:
def tokens(sentences):
    for sentence in sentences:
        for token in sentence:
            yield token

To illustrate how to use this function, here is code that prints the number of tokens in the full dataset:

In [4]:
print(sum(1 for t in tokens(full_dataset)))

17594885


## Problem 1: Build the vocabulary and frequency table

Your first task is to construct the embedding **vocabulary** – the set of unique words that will receive an embedding. Because you will eventually need to map words to vector dimensions, you will represent the vocabulary as a dictionary that maps words (strings) to a contiguous range of integers.

Along with the vocabulary, you will also construct the **frequency table**, that is, the table that holds the absolute frequencies (counts) in the data, for all words in your vocabulary. This will simply be an array of integers, indexed by the word ids in the vocabulary.

To construct the vocabulary and the frequency table, complete the skeleton code in the cell below:

In [5]:
import numpy as np

def make_vocab_and_counts(sentences, min_count=5):
    word_freq = {}
    for sentence in sentences:
        for word in sentence:
            if word in word_freq:
                word_freq[word] += 1
            else:
                word_freq[word] = 1
    
    filtered_words = {word: count for word, count in word_freq.items() if count >= min_count}
    
    vocab = {word: idx for idx, word in enumerate(filtered_words.keys())}
    
    counts = np.array(list(filtered_words.values()), dtype=np.int32)
    
    return vocab, counts


Your code should comply with the following specification:

**make_vocab_and_counts** (*sentences*, *min_count* = 5)

> Reads from an iterable of *sentences* (lists of string tokens) and returns a pair *vocab*, *counts* where *vocab* is a dictionary representing the vocabulary and *counts* is a 1D-[ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) with the absolute frequencies (counts) of the words in the vocabulary. The dictionary *vocab* maps words to a contiguous range of integers starting at&nbsp;0. In the *counts* array, the entry at index $i$ is the count of the word that *vocab* maps to $i$. Words that occur fewer than *min_count* times are excluded from the vocabulary.
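To make the specification concrete, here is a small sketch on a made-up three-sentence corpus. The helper name `toy_vocab_and_counts` and the toy data are hypothetical, and the sketch uses `collections.Counter` purely for brevity; it is only meant to illustrate the shape of the two return values.

```python
from collections import Counter

import numpy as np

def toy_vocab_and_counts(sentences, min_count=2):
    # Count every token in a single pass over the sentences
    freq = Counter(token for sentence in sentences for token in sentence)
    # Keep only words that reach the minimum count (insertion order is preserved)
    kept = [(word, n) for word, n in freq.items() if n >= min_count]
    vocab = {word: i for i, (word, _) in enumerate(kept)}
    counts = np.array([n for _, n in kept], dtype=np.int64)
    return vocab, counts

toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
toy_vocab, toy_counts = toy_vocab_and_counts(toy_sentences)
print(toy_vocab)   # {'the': 0, 'cat': 1, 'sat': 2}
print(toy_counts)  # [3 2 2]
```

Note that `toy_counts[toy_vocab[word]]` is the count of `word`, which is exactly the indexing convention the specification asks for.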

### 🤞 Test your code

To test your code, print the sizes of the vocabularies constructed from the two datasets, as well as the count totals. The correct vocabulary size for the minimal dataset is 3,231; for the full dataset, the correct vocabulary size is 73,339. The correct totals are 155,818 for the minimal dataset and 17,297,355 for the full dataset.

In [6]:
vocab_minimal, counts_minimal = make_vocab_and_counts(mini_dataset)
vocab_full, counts_full = make_vocab_and_counts(full_dataset)

vocab_size_minimal = len(vocab_minimal)
vocab_size_full = len(vocab_full)


vocab_minimal_all, counts_minimal_all = make_vocab_and_counts(mini_dataset,-1)
total_counts_minimal_all  = counts_minimal_all.sum()

vocab_full_all, counts_full_all = make_vocab_and_counts(full_dataset,-1)
total_counts_full_all = counts_full_all.sum()

total_counts_minimal = counts_minimal.sum()
total_counts_full = counts_full.sum()

# Print the results
print(f"Minimal Dataset: Vocabulary Size = {vocab_size_minimal}, Total Counts = {total_counts_minimal}")
print(f"Full Dataset: Vocabulary Size = {vocab_size_full}, Total Counts = {total_counts_full}")

assert vocab_size_minimal == 3231, f"Unexpected minimal dataset vocabulary size: {vocab_size_minimal}"
assert vocab_size_full == 73339, f"Unexpected full dataset vocabulary size: {vocab_size_full}"
assert total_counts_minimal == 155818, f"Unexpected total counts for minimal dataset: {total_counts_minimal}"
assert total_counts_full == 17297355, f"Unexpected total counts for full dataset: {total_counts_full}"

print("All tests passed.")

Minimal Dataset: Vocabulary Size = 3231, Total Counts = 155818
Full Dataset: Vocabulary Size = 73339, Total Counts = 17297355
All tests passed.


## Problem 2: Preprocess the data

Your next task is to preprocess the training data. This involves the following:

* Discard words that are not in the vocabulary
* Map each word to its vocabulary id
* Randomly discard words according to the subsampling strategy covered in Lecture&nbsp;2.4
* Discard sentences that have become empty

As a reminder, the subsampling strategy involves discarding tokens $w$ with probability

$$
P(w) = \max (0, 1-\sqrt{tN/\#(w)})
$$

where $\#(w)$ is the count of $w$, $N$ is the total number of counts, and $t$ is the chosen threshold (default value: 0.001).
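As a worked example of the formula, the following sketch computes the discard probability for a made-up four-word vocabulary (the counts are an assumption, not lab data). It evaluates the formula for all words at once with NumPy.

```python
import numpy as np

# Made-up counts for a four-word vocabulary (assumption, not from the lab data)
toy_counts = np.array([9000, 900, 90, 10])
t = 0.001
N = toy_counts.sum()  # 10000

# Discard probability P(w) = max(0, 1 - sqrt(t * N / #(w))), for all words at once
p_discard = np.maximum(0.0, 1.0 - np.sqrt(t * N / toy_counts))
print(p_discard.round(3))
```

With these numbers, the most frequent word is discarded about 97% of the time, while the rarest word (whose relative frequency equals the threshold) is never discarded.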

The cell below contains skeleton code for a generator function `preprocess`:

In [7]:
import numpy as np
import random

def preprocess(vocab, counts, sentences, threshold=0.001):
    N = np.sum(counts)  # Total count of all words
    
    for sentence in sentences:
        preprocessed_sentence = []
        for word in sentence:
            if word in vocab:
                word_id = vocab[word]
                word_count = counts[word_id]
                # Discard probability from the subsampling formula
                P_w = max(0, 1 - np.sqrt(threshold * N / word_count))
                # Keep the token with probability 1 - P_w
                if random.random() > P_w:
                    preprocessed_sentence.append(word_id)
        
        if preprocessed_sentence:
            yield preprocessed_sentence


Extend this skeleton code into a function that implements the preprocessing. Your code should comply with the following specification:

**preprocess** (*vocab*, *counts*, *sentences*, *threshold* = 0.001)

> Reads from an iterable of *sentences* (lists of string tokens) and yields the preprocessed sentences as non-empty lists of word ids (integers). Words not in *vocab* are discarded. The remaining words are randomly discarded according to the subsampling strategy with the given *threshold*. In the non-empty sentences, each token is replaced by its id in the vocabulary.

**⚠️ Please observe** that your function should *yield* the preprocessed sentences, not return a list with all of them. That is, we ask you to write a *generator function*. If you have not worked with generators and iterators before, now is a good time to read up on them. [More information about generators](https://wiki.python.org/moin/Generators)
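If generators are new to you, the following minimal sketch may help. The function `non_empty` is a hypothetical example, unrelated to the lab code apart from the pattern it shows: values are produced lazily, one `yield` at a time.

```python
def non_empty(sentences):
    # A generator function: the body runs lazily, one `yield` at a time
    for sentence in sentences:
        if sentence:
            yield sentence

lazy = non_empty([["a", "b"], [], ["c"]])
print(next(lazy))   # ['a', 'b']
print(list(lazy))   # [['c']]  (the remaining items)
print(list(lazy))   # []       (the generator is now exhausted)
```

The last line shows a common pitfall: a generator can be consumed only once. If you need to traverse the preprocessed sentences several times, either call the generator function again or materialise the result with `list(...)`.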

### 🤞 Test your code

Test your code by comparing the total number of tokens in the preprocessed version of each dataset with the corresponding number for the original data. The former should be ca. 59% of the latter for the minimal dataset, and ca. 69% for the full dataset. The exact percentage will vary slightly because of the randomness in the sampling. You may want to repeat your computation several times.

In [8]:
import numpy as np
import random

def compute_total_tokens(sentences):
    return sum(len(sentence) for sentence in sentences)

total_tokens_minimal_original = compute_total_tokens(mini_dataset)
total_tokens_full_original = compute_total_tokens(full_dataset)

preprocessed_sentences_minimal = list(preprocess(vocab_minimal, counts_minimal, mini_dataset))
preprocessed_sentences_full = list(preprocess(vocab_full, counts_full, full_dataset))


total_tokens_minimal_preprocessed = compute_total_tokens(preprocessed_sentences_minimal)
total_tokens_full_preprocessed = compute_total_tokens(preprocessed_sentences_full)


percentage_minimal = (total_tokens_minimal_preprocessed / total_tokens_minimal_original) * 100
percentage_full = (total_tokens_full_preprocessed / total_tokens_full_original) * 100

print(f"Minimal Dataset: {percentage_minimal:.2f}% tokens retained after preprocessing.")
print(f"Full Dataset: {percentage_full:.2f}% tokens retained after preprocessing.")



Minimal Dataset: 59.24% tokens retained after preprocessing.
Full Dataset: 69.46% tokens retained after preprocessing.


## Problem 3: Generate the training examples

Your next task is to translate the preprocessed sentences into training examples for the skip-gram model: both *positive examples* (target word–context word pairs actually observed in the data) and *negative examples* (pairs randomly sampled from a noise distribution).

**⚠️ We expect that solving this problem will take you the longest time in this lab.**

### General strategy

The general plan for solving this problem is to implement a generator function that traverses the preprocessed sentences, at each position of the text samples a window, and then extracts all positive examples from it. For each positive example, the function also generates $k$ negative examples, where $k$ is a hyperparameter. Finally, all examples (positive and negative) are combined into the tensor representation described below.
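The window-sampling step of this plan can be sketched in plain Python as follows. The helper `positive_pairs` is hypothetical, and the sketch assumes that the window size is drawn uniformly from $\{1, \dots, \textit{window}\}$; it leaves out negative sampling and batching.

```python
import random

def positive_pairs(sentence, window=5, rng=random):
    # Yield (target, context) pairs from one sentence of word ids
    for i, target in enumerate(sentence):
        w = rng.randint(1, window)  # window size, uniform in {1, ..., window}
        lo, hi = max(0, i - w), min(len(sentence), i + w + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, sentence[j]

toy_sentence = [10, 11, 12, 13]
pairs = list(positive_pairs(toy_sentence, window=1))
print(pairs)  # [(10, 11), (11, 10), (11, 12), (12, 11), (12, 13), (13, 12)]
```

With `window=1` the output is deterministic, which makes the sketch easy to check by hand; with larger windows the number of pairs varies from run to run.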

### Representation

How should you represent a batch of training examples? Writing $B$ for the batch size, the obvious choice would be to represent the inputs as a matrix of shape $[B, 2]$ and the output labels (positive/negative) as a vector of length $B$. This representation would be quite wasteful on the input side, however, as each target word (index) from a positive example would have to be repeated in all negative samples. For example ($k=3$):
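As a minimal sketch of this naive layout, with made-up word ids (7 for the target word, 3 for the observed context word, and 40 to 42 for the negative samples), one positive example and its $k=3$ negative samples could be stored like this:

```python
import numpy as np

# Made-up word ids (assumption): target 7, observed context 3, negative samples 40, 41, 42
naive_inputs = np.array([
    [7, 3],    # positive example
    [7, 40],   # negative sample 1
    [7, 41],   # negative sample 2
    [7, 42],   # negative sample 3
])
naive_labels = np.array([1, 0, 0, 0])

print(naive_inputs.shape)  # (4, 2)
```

The target id 7 is stored four times, once per row, which is the redundancy described above.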

Here you will use a different representation: First, instead of a single input batch, there will be a *pair* of input batches – a vector for the target words and a matrix for the context words. If the target word vector has length $B$, the context word matrix has shape $[B, 1+k]$. The $i$th element of the target word vector is the target word for *all* context words in the $i$th row of the context word matrix: the first column of that row comes from a positive example, the remaining columns come from the $k$ negative samples. Accordingly, the batch with the output labels will be a matrix of the same shape as the context word matrix, with its first column set to&nbsp;1 and its remaining columns set to&nbsp;0. Corresponding to the example above:
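A minimal sketch of this compact layout in PyTorch, with made-up word ids and $k=3$, might look as follows. The last two lines expand the pair of batches back into explicit (target, context) rows, only to show that no information is lost.

```python
import torch

k = 3
# Made-up word ids (assumption): two positive examples with k = 3 negative samples each
targets = torch.tensor([7, 8])                # shape [B]
contexts = torch.tensor([[3, 40, 41, 42],     # row 0: positive context 3, then negatives
                         [5, 43, 44, 45]])    # row 1: positive context 5, then negatives

# Expand to explicit (target, context) pairs: shape [B, 1 + k, 2], then [B * (1 + k), 2]
pairs = torch.stack([targets.unsqueeze(1).expand_as(contexts), contexts], dim=-1)
flat_pairs = pairs.reshape(-1, 2)

print(targets.shape, contexts.shape, flat_pairs.shape)
```

Each target id is stored once in `targets` but appears $1+k$ times in `flat_pairs`.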

For the present problem, you will only be concerned with the two input batches; the output batch will be constructed in the training procedure. In fact, for a fixed batch size $B$, that batch is always exactly the same, so you will only have to build it once.

### Negative sampling

Recall from Lecture&nbsp;2.4 that the probability that a word $c$ is selected as the context word in a negative sample is proportional to its exponentiated count $\#(c)^\alpha$, where $\alpha$ is a hyperparameter (default value: 0.75).

To implement negative sampling from this distribution, you can follow a standard recipe: Start by pre-computing an array containing the *cumulative sums* of the exponentiated counts. Then, generate a random cumulative count $n$, and find that index in the pre-computed array at which $n$ should be inserted to keep the array sorted. That index identifies the sampled context word.

All operations in this recipe can be implemented efficiently in PyTorch; the relevant functions are [`torch.cumsum`](https://pytorch.org/docs/stable/generated/torch.cumsum.html) and [`torch.searchsorted`](https://pytorch.org/docs/stable/generated/torch.searchsorted.html). For optimal efficiency, you should sample all $B \times k$ negative examples in a batch at once.
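The recipe can be sketched as follows. The helper `sample_negatives` and the toy counts are hypothetical. The sketch keeps the cumulative sums unnormalised and scales the random numbers instead, which is one of several equivalent ways to set this up.

```python
import torch

def sample_negatives(counts, batch_size, k, alpha=0.75, generator=None):
    # Cumulative sums of the exponentiated counts (unnormalised)
    weights = torch.as_tensor(counts, dtype=torch.float64) ** alpha
    cumulative = torch.cumsum(weights, dim=0)
    # One random "cumulative count" in [0, total) per negative sample
    n = torch.rand(batch_size, k, dtype=torch.float64, generator=generator) * cumulative[-1]
    # Index of the first cumulative sum that is strictly greater than n
    ids = torch.searchsorted(cumulative, n, right=True)
    # Guard against floating-point edge cases at the upper end
    return ids.clamp(max=len(cumulative) - 1)

toy_counts = [1000, 100, 10, 1]
negatives = sample_negatives(toy_counts, batch_size=4, k=3)
print(negatives.shape)  # torch.Size([4, 3])
```

All $B \times k$ word ids are drawn in a single call, so the cost of the Python-level loop is paid once per batch rather than once per example.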

Here is skeleton code for this problem:

In [23]:
import torch
import numpy as np

def training_examples(vocab, counts, sentences, window=5, num_ns=5, batch_size=1<<19, ns_exponent=0.75):
    # `vocab` is unused here: the caller is expected to pass sentences that
    # have already been preprocessed into lists of word ids
    freqs = np.array(counts)**ns_exponent
    norm_freqs = freqs / freqs.sum()
    cumsum_freqs = torch.cumsum(torch.tensor(norm_freqs, dtype=torch.float), dim=0)
    
    # Initialize batch accumulators
    target_batch = torch.zeros(batch_size, dtype=torch.long)
    context_batch = torch.zeros((batch_size, 1 + num_ns), dtype=torch.long)
    batch_index = 0
    
    for sentence in sentences:
        sentence_length = len(sentence)
        for target_index, target_word_id in enumerate(sentence):
            # Randomly choose the window size
            dynamic_window = np.random.randint(1, window+1)
            context_start = max(0, target_index - dynamic_window)
            context_end = min(sentence_length, target_index + dynamic_window + 1)
            context_indices = [i for i in range(context_start, context_end) if i != target_index]
            
            for context_index in context_indices:
                if batch_index == batch_size:
                    yield target_batch.clone(), context_batch.clone()
                    target_batch.zero_()
                    context_batch.zero_()
                    batch_index = 0  # Reset for next batch
                    
                # Assign target word ID
                target_batch[batch_index] = target_word_id
                
                
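                # Clamp guards against an out-of-range index when rounding leaves cumsum_freqs[-1] below 1
                negative_samples = torch.searchsorted(cumsum_freqs, torch.rand(num_ns)).clamp(max=len(counts) - 1).tolist()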
                context_words = [sentence[context_index]] + negative_samples
                context_batch[batch_index] = torch.tensor(context_words, dtype=torch.long)
                batch_index += 1

    if batch_index > 0:
        yield target_batch[:batch_index], context_batch[:batch_index]


Your code should comply with the following specification:

**training_examples** (*vocab*, *counts*, *sentences*, *window* = 5, *num_ns* = 5, *batch_size* = 524,288, *ns_exponent*=0.75)

> Reads from an iterable of *sentences* (lists of string tokens), preprocesses them using the function implemented in Problem&nbsp;2, and then yields pairs of input batches for gradient-based training, represented as described above. Each batch contains *batch_size* positive examples. The parameter *window* specifies the maximal distance between a target word and a context word in a positive example; the actual window size around any given target word is sampled uniformly at random. The parameter *num_ns* specifies the number of negative samples per positive sample. The parameter *ns_exponent* specifies the exponent in the negative sampling (called $\alpha$ above).

### 🤞 Test your code

To test your code, compare the total number of positive samples (across all batches) to the total number of tokens in the (un-preprocessed) minimal dataset. The ratio between these two values should be ca. 2.64. If you can spare the time, you can make the same comparison on the full dataset; here, the expected ratio is 3.25. As before, the numbers may vary slightly because of randomness, so you may want to run the comparison more than once.

In [24]:
import torch
import numpy as np

vocab = vocab_minimal
counts = counts_minimal
sentences = preprocessed_sentences_minimal

total_positive_samples = 0
total_tokens = total_counts_minimal_all

for target_batch, context_batch in training_examples(vocab, counts, sentences):
    total_positive_samples += len(target_batch)

# Calculate the ratio between total positive samples and total tokens
ratio = total_positive_samples / total_tokens

print(ratio)



2.626771653543307


In [11]:
total_counts_minimal_all

170815

In [79]:
import torch
import numpy as np

vocab = vocab_full
counts = counts_full
sentences = preprocessed_sentences_full

total_positive_samples = 0
total_tokens = total_counts_full_all

for target_batch, context_batch in training_examples(vocab, counts, sentences):
    total_positive_samples += len(target_batch)

ratio = total_positive_samples / total_tokens

print(ratio)



3.254216381635913


## Problem 4: Implement the model

Now it is time to implement the skip-gram model as such. The cell below contains skeleton code for this. As you will recall from Lecture&nbsp;2.4, the core of the implementation is formed by two embedding layers: one for the target word representations, and one for the context word representations. Your task is to implement the missing `forward()` method.

In [62]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGNSModel(nn.Module):
    
    def __init__(self, vocab, embedding_dim):
        super(SGNSModel, self).__init__()
        self.vocab = vocab
        vocab_size = len(vocab)
        self.w = nn.Embedding(vocab_size, embedding_dim)
        self.c = nn.Embedding(vocab_size, embedding_dim)
    
    def forward(self, w, c):
        w_embed = self.w(w)  # Shape: [B, embedding_dim]
        c_embed = self.c(c)  # Shape: [B, k+1, embedding_dim]

        # Batched dot products via bmm: [B, 1, D] x [B, D, k+1] -> [B, 1, k+1] -> [B, k+1]
        dot_products = torch.bmm(w_embed.unsqueeze(1), c_embed.transpose(1, 2)).squeeze(1)
        
        return dot_products


Your implementation of the `forward()` method should comply with the following specification:

**forward** (*self*, *w*, *c*)

> The input to this method is a tensor *w* with target words of shape $[B]$ and a tensor *c* with context words of shape $[B, 1+k]$, where $B$ is the batch size and $k$ is the number of negative samples. The two tensors are structured as explained for Problem&nbsp;3. The output of the method is a tensor $D$ of shape $[B, k+1]$ where entry $D_{ij}$ is the dot product between the embedding vector for the $i$th target word and the embedding vector for the context word in row $i$, column $j$.

**💡 Hint:** To compute a dot product $x^\top y$, you can first compute the Hadamard product $z = x \odot y$ and then sum up the elements of $z$.
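A minimal sketch of the hint on random stand-in embeddings (the sizes are arbitrary assumptions): broadcasting lets one target vector be multiplied element-wise with all $1+k$ context vectors in its row. For comparison, the same values are computed with a batched matrix product.

```python
import torch

B, k, dim = 4, 2, 8
w_embed = torch.randn(B, dim)          # target embeddings, shape [B, dim]
c_embed = torch.randn(B, 1 + k, dim)   # context embeddings, shape [B, 1 + k, dim]

# Broadcast the target vector over the 1 + k contexts, multiply element-wise, sum over dim
hadamard = (w_embed.unsqueeze(1) * c_embed).sum(dim=-1)       # shape [B, 1 + k]

# The same values via a batched matrix product, for comparison
bmm = torch.bmm(c_embed, w_embed.unsqueeze(2)).squeeze(2)     # shape [B, 1 + k]

print(hadamard.shape, torch.allclose(hadamard, bmm, atol=1e-5))
```

Both variants should agree up to floating-point error; which one is faster may depend on the hardware and tensor sizes.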

### 🤞 Test your code

Test your code by creating an instance of the model, and check that `forward` returns the expected result on random input tensors *w* and *c*. To help you, the following function will return a random example from the first 100 examples produced by `training_examples`.

In [14]:
import numpy as np

def random_example(vocab, counts, sentences):
    skip = np.random.randint(100)
    for i, example in enumerate(training_examples(vocab, counts, sentences, num_ns=1, batch_size=5)):
        if i >= skip:
            break
    return example

In [57]:
model = SGNSModel(vocab_minimal,embedding_dim = 100 )

mini_dataset_indexed = [[vocab_minimal[word] for word in sentence if word in vocab_minimal] for sentence in mini_dataset]

w, c = random_example(vocab_minimal, counts_minimal, mini_dataset_indexed)

# `training_examples` already yields LongTensors, so no conversion is needed
w_tensor, c_tensor = w, c

output = model(w_tensor, c_tensor)

expected_shape = (w_tensor.size(0), c_tensor.size(1))
assert output.shape == expected_shape, f"Output shape {output.shape} does not match expected shape {expected_shape}."

print("Test passed! Output shape is correct.")
print("Output:", output)


Test passed! Output shape is correct.
Output: tensor([[  5.2916, -11.1369],
        [ -0.8718,  -6.2550],
        [-15.3995, -11.1504],
        [ 15.7195,   3.4277],
        [  4.0296,  -1.5676]], grad_fn=<SqueezeBackward1>)




## Problem 5: Train the model

Once you have a working model, it is time to train it. The training loop for the skip-gram model will be very similar to the prototypical training loop that you already know from previous notebooks, with two things to note:

First, instead of categorical cross entropy, you will use binary cross entropy. Just like the standard implementation of the softmax classifier, the skip-gram model does not include a final non-linearity, so you should use [`binary_cross_entropy_with_logits()`](https://pytorch.org/docs/1.9.1/generated/torch.nn.functional.binary_cross_entropy_with_logits.html).

The second thing to note is that you will have to create the tensor with the output labels, as explained already in Problem&nbsp;3. This should be a matrix of size $[B, 1+k]$ whose first column contains $1$s and whose remaining columns contain $0$s.
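The two points can be sketched together as follows. The all-zero `logits` tensor is only a stand-in for the model output, chosen because the resulting loss is easy to verify by hand.

```python
import torch
import torch.nn.functional as F

B, k = 4, 3
logits = torch.zeros(B, 1 + k)   # stand-in for the model output (all dot products 0)

labels = torch.zeros(B, 1 + k)
labels[:, 0] = 1                 # first column: positive examples

loss = F.binary_cross_entropy_with_logits(logits, labels)
print(loss.item())  # approx. 0.6931 = log(2), since sigmoid(0) = 0.5 for every entry
```

A loss close to $\log 2$ therefore corresponds to a model that cannot yet tell positive from negative examples.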

Here is skeleton code for the training loop, including default values for the most important hyperparameters:

In [58]:
import torch.nn.functional as F
import torch.optim as optim

def train(dataset, embedding_dim=50, window=5, num_ns=5, batch_size=1<<20, n_epochs=1, lr=0.1):

    vocab, counts = make_vocab_and_counts(dataset)
    model = SGNSModel(vocab, embedding_dim)
    preprocessed_sentences = list(preprocess(vocab, counts, dataset))
    
    # Initialize the optimizer
    optimizer = optim.Adam(model.parameters(), lr=lr)
    # Training loop
    for epoch in range(n_epochs):
        total_loss = 0
        # Pass the hyperparameters on; otherwise `training_examples` silently uses its own defaults
        batches = training_examples(vocab, counts, preprocessed_sentences,
                                    window=window, num_ns=num_ns, batch_size=batch_size)
        for target, context in batches:
            # Create output labels tensor (the batches are already LongTensors)
            labels = torch.zeros(context.size(0), context.size(1))
            labels[:, 0] = 1  # First column for positive examples
            
            # Forward pass
            predictions = model(target, context)
            
            # Compute loss
            loss = F.binary_cross_entropy_with_logits(predictions, labels)
            print(loss.item())
            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        print(f"Epoch [{epoch+1}/{n_epochs}], Loss: {total_loss:.4f}")
    
    return model


To show you how `train` is meant to be used, the code in the next cell trains a model on the minimal dataset.

In [63]:
model = train(mini_dataset, n_epochs=10)

2.904158353805542
Epoch [1/10], Loss: 2.9042
2.479555130004883
Epoch [2/10], Loss: 2.4796
2.105211019515991
Epoch [3/10], Loss: 2.1052
1.7880841493606567
Epoch [4/10], Loss: 1.7881
1.5040004253387451
Epoch [5/10], Loss: 1.5040
1.2471671104431152
Epoch [6/10], Loss: 1.2472
1.016473650932312
Epoch [7/10], Loss: 1.0165
0.8292652368545532
Epoch [8/10], Loss: 0.8293
0.7023128271102905
Epoch [9/10], Loss: 0.7023
0.6384559273719788
Epoch [10/10], Loss: 0.6385


### 🤞 Test your code

Test your implementation of the training loop by training a model on the minimal dataset. This should only take a few seconds. You will not get useful word vectors, but you will be able to see whether your code runs without errors.

Once you have passed this test, you can train a model on the full dataset. Print the loss to check that the model is actually learning; if the loss is not decreasing, try to find the problem before wasting time (and energy) on useless training.

Training on the full dataset will take some time – on a CPU, you should expect 10–40 minutes per epoch, depending on hardware. To give you some guidance: The total number of positive examples is approximately 58M, so with the default batch size of $2^{20}$ (about 1M) each batch contains roughly 2% of these examples, which gives about 55 batches per epoch. To speed things up, you can train using a GPU; our reference implementation runs in less than 2 minutes per epoch on [Colab](http://colab.research.google.com).
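A minimal sketch of the usual PyTorch device pattern, using a small stand-in embedding layer rather than the full model (the sizes are arbitrary assumptions): the parameters and every batch must live on the same device.

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-in for one embedding layer of the model (sizes are arbitrary)
embedding = nn.Embedding(100, 16).to(device)

# Every batch must be moved to the same device as the parameters
target = torch.randint(0, 100, (8,)).to(device)
vectors = embedding(target)

print(vectors.device.type == device.type)
# Move tensors back to the CPU before converting them to NumPy
print(embedding.weight.detach().cpu().numpy().shape)  # (100, 16)
```

In a training loop, the label tensor needs the same treatment, for example by creating it with `torch.zeros(..., device=device)`.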

In [50]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(torch.cuda.is_available())

True


In [76]:
# TODO: Train your model on the full dataset here
model = train(full_dataset, n_epochs=2)

2.880861282348633
2.567476987838745
2.328270196914673
2.0846574306488037
1.8627952337265015
1.6731767654418945
1.495949387550354
1.3428343534469604
1.1858158111572266
1.0549097061157227
0.9506063461303711
0.8579502105712891
0.779617965221405
0.736412763595581
0.6878771781921387
0.6618800759315491
0.6341748833656311
0.6050359010696411
0.5972283482551575
0.5803269743919373
0.5648996233940125
0.5540962815284729
0.5443975329399109
0.5361320376396179
0.5296789407730103
0.5369572639465332
0.5205166935920715
0.5224615335464478
0.5150646567344666
0.5097680687904358
0.5010051727294922
0.4969541132450104
0.4989548623561859
0.48888882994651794
0.4892235994338989
0.48369288444519043
0.4751012325286865
0.488967627286911
0.4717646539211273
0.4746742248535156
0.47321367263793945
0.48081982135772705
0.47351399064064026
0.46241891384124756
0.46162691712379456
0.4587744176387787
0.48168084025382996
0.45764294266700745
0.453471302986145
0.46322011947631836
0.4540613889694214
0.44270145893096924
0.4378797

## Problem 6: Analyse the embeddings (reflection)

Now that you have a trained model, you will probably be curious to see what it has learned. You can inspect your embeddings using the [Embedding Projector](http://projector.tensorflow.org). To that end, click on the ‘Load’ button, which will open up a dialogue with instructions for how to upload embeddings from your computer.

You will need to upload two tab-separated files. To create them, you can use the following code:

In [77]:
def save_model(model):
    # Extract the embedding vectors as a NumPy array
    # (`.cpu()` is a no-op on the CPU and required if the model was trained on a GPU)
    embeddings = model.w.weight.detach().cpu().numpy()
    
    # Create the word–vector pairs
    items = sorted((i, w) for w, i in model.vocab.items())
    items = [(w, e) for (i, w), e in zip(items, embeddings)]
    
    # Write the embeddings and the word labels to files
    with open('vectors.tsv', 'wt', encoding='utf-8') as fp1, open('metadata.tsv', 'wt', encoding='utf-8') as fp2:
        for w, e in items:
            print('\t'.join('{:.5f}'.format(x) for x in e), file=fp1)
            print(w, file=fp2)

In [78]:
save_model(model)

Take some time to explore the embedding space. In particular, inspect the local neighbourhoods of words that you are curious about, say the 10 closest neighbours. Document your exploration in a short reflection piece (ca. 150&nbsp;words). Respond to the following prompts:

* Which words did you try? Which results did you get? Did you do anything else than inspecting local neighbourhoods?
* Based on what you know about word embeddings, did you expect your results? How do you explain them?
* What did you learn? How, exactly, did you learn it? Why does this learning matter?
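If you prefer to inspect neighbourhoods in code rather than in the browser, the following sketch ranks words by cosine similarity. The helper `nearest_neighbours` and the toy vectors are hypothetical; cosine similarity is one common choice of measure, not the only one.

```python
import numpy as np

def nearest_neighbours(word, vocab, embeddings, n=10):
    # Normalise the rows so that dot products equal cosine similarities
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[vocab[word]]
    id_to_word = {i: w for w, i in vocab.items()}
    # Sort by decreasing similarity and skip the query word itself
    order = [int(i) for i in np.argsort(-sims) if int(i) != vocab[word]]
    return [(id_to_word[i], float(sims[i])) for i in order[:n]]

# Toy data (assumption): three 2-dimensional vectors
toy_vocab = {"a": 0, "b": 1, "c": 2}
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
neighbours = nearest_neighbours("a", toy_vocab, toy_embeddings, n=2)
print(neighbours)
```

For a trained model, the same function could be called with `model.vocab` and `model.w.weight.detach().cpu().numpy()` as the last two arguments.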

**We tried "Stockholm" first. Here are the results:**

- **Helsinki**: 0.262
- **Sweden**: 0.330
- **London**: 0.342
- **Oslo**: 0.347
- **Brussels**: 0.365
- **Gothenburg**: 0.373
- **December**: 0.377
- **Copenhagen**: 0.378
- **At**: 0.396
- **Suburb**: 0.398

**Then we tried football:**

- **Footballer**: 0.224
- **Basketball**: 0.239
- **Soccer**: 0.247
- **Club**: 0.249
- **Player**: 0.270
- **Goalkeeper**: 0.275
- **Tennis**: 0.333
- **Japanese**: 0.337
- **Retired**: 0.340
- **Handball**: 0.341

**We then used the 'Isolate 11 points' option to get a clearer view.** After that, we projected the data into 2D with PCA. Specifically, we set the X axis to 'Component #2' and found that Gothenburg and Sweden point in the same direction as Stockholm.

**Yes, we expected the results we observed.** After training, the word embeddings capture contextual relationships between words. As we saw in the first question, words with similar meanings are close to each other in the embedding space. This is because, during training, the model learns through the embeddings which words occur in similar contexts. In the visualization, words that share contexts are therefore close together.

**What did you learn? How, exactly, did you learn it? Why does this learning matter?**

We learned about the skip-gram model in general, and we learned its details by programming it, watching the lectures, and looking up information online. This matters because we can now use the model to embed words and even develop our own models. Moreover, last week we used existing pre-trained embedding layers, whereas this week we trained the embeddings ourselves. We believe this experience is important for a continuous learning process in NLP.




👍 Well done!