# Chapter 11: Dense Vector Representations
Training embeddings: A simple implementation of skipgrams with negative sampling

Adapted from _Distributed Representations of Words and Phrases and their Compositionality_, Sect. 2.2, by Mikolov et al. 2013.

Programs from the book: [_Python for Natural Language Processing_](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

## Modules

In [1]:
import regex as re
import os
import numpy as np
from tqdm import tqdm
from collections import Counter
import math
import random
import torch
import torch.nn as nn

## Parameters

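We set the embedding size (`embedding_dim`), the number of context words on each side of the input word (`w_size`), the number of negative samples per positive pair (`K_NEG`), the subsampling threshold (`t`), and the exponent of the power transform (`power`).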

In [2]:
embedding_dim = 50
w_size = 2
c_size = w_size * 2 + 1
K_NEG = 5
t = 1e-3
power = 0.75
DOWNSAMPLING = False

## Corpus Files

We read the files and we store the corpus in a string

In [3]:
PATH = '../datasets/'

In [4]:
CORPUS = 'HOMER'  # 'DICKENS'

In [5]:
if CORPUS == 'DICKENS':
    folder = PATH + 'dickens/'
elif CORPUS == 'HOMER':
    folder = PATH + 'classics/'

In [6]:
def get_files(dir, suffix):
    """
    Returns all the files in a folder ending with suffix
    :param dir:
    :param suffix:
    :return: the list of file names
    """
    files = []
    for file in os.listdir(dir):
        if file.endswith(suffix):
            files.append(file)
    return files

In [7]:
if CORPUS == 'DICKENS':
    files = get_files(folder, 'txt')
elif CORPUS == 'HOMER':
    files = ['iliad.txt', 'odyssey.txt']
files

['iliad.txt', 'odyssey.txt']

In [8]:
files = [folder + file for file in files]
files

['../datasets/classics/iliad.txt', '../datasets/classics/odyssey.txt']

In [9]:
text = ''
for file in files:
    with open(file, encoding='utf8') as f:
        text += ' ' + f.read().strip()

In [10]:
text[:100]

' BOOK I\n\nSing, O goddess, the anger of Achilles son of Peleus, that brought\ncountless ills upon the '

## Processing the Corpus

### Tokenizing

We convert the text to lowercase and extract the words as sequences of letters.

In [11]:
text = text.lower()
words = re.findall(r'\p{L}+', text)
words[:5]

['book', 'i', 'sing', 'o', 'goddess']

In [12]:
vocab = sorted(list(set(words)))
vocab[:10]

['a',
 'abantes',
 'abarbarea',
 'abas',
 'abate',
 'abated',
 'abetting',
 'abhorred',
 'abians',
 'abide']

In [13]:
vocab_size = len(vocab)
vocab_size

9768

In [14]:
idx2word = dict(enumerate(vocab))
word2idx = {v: k for k, v in idx2word.items()}
# word2idx

In [15]:
words_idx = [word2idx[word] for word in words]
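
To make the indexing concrete, here is a minimal sketch on a single verse. It uses the standard `re` module with `[^\W\d_]+` as a rough stand-in for `\p{L}+` (an assumption, so that the example runs without the `regex` package). The `toy_` names are hypothetical and not part of the notebook.

```python
import re

toy_text = 'Sing, O goddess, the anger of Achilles'.lower()
# [^\W\d_]+ approximates \p{L}+ with the standard library
toy_words = re.findall(r'[^\W\d_]+', toy_text)

# Sorted vocabulary and the two lookup tables, as in the cells above
toy_vocab = sorted(set(toy_words))
toy_idx2word = dict(enumerate(toy_vocab))
toy_word2idx = {w: i for i, w in toy_idx2word.items()}

# The text as a sequence of indices
toy_words_idx = [toy_word2idx[w] for w in toy_words]

print(toy_words)
print(toy_words_idx)
```

Because the vocabulary is sorted, the indices follow the alphabetical order of the words, so `achilles` gets index 0 here.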

### Downsampling

We can downsample the frequent words. We first count the words, then we randomly discard some words from the text, depending on their frequency: frequent words are often discarded, rare words never. We will have to count the words again after sampling.

In [16]:
counts = Counter(words)
word_cnt = sum(counts.values())
word_cnt

271506

In [17]:
counts['the'], counts['he'], counts['penelope']

(15794, 4728, 104)

In [18]:
dist = {k: v/word_cnt for k, v in counts.items()}

In [19]:
dist['the'], dist['he'], dist['penelope']

(0.05817182677362563, 0.017413979801551346, 0.00038304862507642555)

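The discard probability, following § 2.3 of the paper:
$$
P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}, \quad f(w_i) = \frac{C(w_i)}{\sum_j C(w_j)}
$$
where $C(w_i)$ is the count of $w_i$ and $t$ is a threshold. The paper suggests $t \approx 10^{-5}$ for large corpora. Here we use $t = 10^{-3}$, as set in the parameters, and we clip negative values to 0.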

In [20]:
discard_probs = dict(counts)
for key in discard_probs:
    discard_probs[key] = max(0, 1 - math.sqrt(t/(counts[key]/word_cnt)))

In [21]:
discard_probs['the'], discard_probs['he'], discard_probs.get('penelope')

(0.8688876357073579, 0.7603645958887684, 0)
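
The values above can be checked by hand. The sketch below recomputes the discard probability $1 - \sqrt{t / f(w)}$ from the counts reported earlier, with $t = 10^{-3}$ as in the Parameters section. The function name `discard_prob` is hypothetical.

```python
import math

def discard_prob(count, total, t=1e-3):
    """Discard probability 1 - sqrt(t / f), clipped at 0, with f = count / total."""
    freq = count / total
    return max(0.0, 1.0 - math.sqrt(t / freq))

total = 271506  # corpus size reported above
for word, count in [('the', 15794), ('he', 4728), ('penelope', 104)]:
    print(word, round(discard_prob(count, total), 4))
```

A word is never discarded when its relative frequency is at most $t$. With this corpus size, that is the case for all the words occurring 271 times or fewer, such as _penelope_.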

In [22]:
subsampled_word_seq = []
for word in words:
    if discard_probs[word] < np.random.random():
        subsampled_word_seq += [word]

In [23]:
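if DOWNSAMPLING:
    # Replace the word sequence and its indices with the subsampled ones
    words = subsampled_word_seq
    words_idx = [word2idx[word] for word in words]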

### Recounting the words after the discard operation

In [24]:
counts = Counter(words)
word_cnt = sum(counts.values())
word_cnt

271506

In [25]:
counts['the'], counts['he'], counts['penelope']

(15794, 4728, 104)

In [26]:
dist = {k: v/word_cnt for k, v in counts.items()}

In [27]:
dist['the'], dist['he'], dist['penelope']

(0.05817182677362563, 0.017413979801551346, 0.00038304862507642555)

### Power transform

We apply a power transform to a list of counts and we return power transformed probabilities:
$$
\frac{\text{cnt}(w)^\text{power}}{\sum_i \text{cnt}(w_i)^\text{power}}
$$

In [28]:
def power_transform(dist, power):
    dist_pow = {k: math.pow(v, power)
                for k, v in dist.items()}
    total = sum(dist_pow.values())
    dist_pow = {k: v/total
                for k, v in dist_pow.items()}
    return dist_pow

In [29]:
dist_pow = power_transform(counts, power)

In [30]:
dist_pow['the'], dist_pow['he'], dist_pow.get('penelope')

(0.02018120223430666, 0.008167441549206619, 0.0004665013862973331)
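
To see the effect of the exponent, the sketch below applies the same transform to three hypothetical counts and compares the result with the plain relative frequencies (exponent 1). The function `power_probs` and the counts are illustrative assumptions.

```python
import math

def power_probs(counts, power):
    """Normalize the counts raised to `power` into a probability distribution."""
    powered = {w: math.pow(c, power) for w, c in counts.items()}
    total = sum(powered.values())
    return {w: v / total for w, v in powered.items()}

toy_counts = {'the': 900, 'ship': 90, 'penelope': 10}  # hypothetical counts
raw = power_probs(toy_counts, 1.0)      # plain relative frequencies
smooth = power_probs(toy_counts, 0.75)  # exponent used in the notebook

for w in toy_counts:
    print(f'{w}: {raw[w]:.3f} -> {smooth[w]:.3f}')
```

The transform moves probability mass from the frequent words to the rare ones. The same effect is visible in the corpus: the probability of _the_ drops from about 0.058 to 0.020, while that of _penelope_ rises from about 0.00038 to 0.00047.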

### Negative sampling
For each positive pair, a word and a context word, we draw $k$ words randomly to form negative pairs.

We build the index and probability lists for the random choice function

In [31]:
dist_pow_idx = {word2idx[k]: v for k, v in dist_pow.items()}

`random.choices` needs the index and the probabilities

In [32]:
draw_idx, probs = zip(*dist_pow_idx.items())

Given the words in the context, we draw $k$ times as many words.

In [33]:
random.choices(draw_idx, weights=probs, k=K_NEG * 2 * w_size)

[5735,
 8289,
 8548,
 8548,
 3744,
 2562,
 867,
 2183,
 5285,
 371,
 7839,
 81,
 2125,
 2957,
 3711,
 5506,
 5335,
 8741,
 2448,
 5028]
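
As a sanity check of the sampling step, here is a minimal sketch with a hypothetical three-word distribution. It draws the negatives for one input word, then verifies that the empirical frequencies of a large sample approach the weights. The seed and the `toy_` values are assumptions chosen for the example.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed, only to make the sketch repeatable

toy_idx = [0, 1, 2]          # hypothetical word indices
toy_probs = [0.7, 0.2, 0.1]  # hypothetical sampling probabilities
k_neg, w_size = 5, 2

# One draw for a single input word: k_neg negatives per context position
negatives = random.choices(toy_idx, weights=toy_probs, k=k_neg * 2 * w_size)
print(len(negatives))

# Over many draws, the empirical frequencies approach toy_probs
n_draws = 100_000
sample = Counter(random.choices(toy_idx, weights=toy_probs, k=n_draws))
print({i: sample[i] / n_draws for i in toy_idx})
```

Note that `random.choices` samples with replacement and knows nothing about the input word, so a negative draw may occasionally coincide with the input word or with one of its true context words. The notebook does not filter these cases.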

## The pairs

For all the words, we form positive and negative pairs. To form the positive pairs, we extract the context words of a word from its neighbors in the word sequence. To form the negative ones, we draw the context words at random from the power-transformed distribution.

In [34]:
X_i = []
X_c = []
y = []
for idx, widx in tqdm(enumerate(words_idx[w_size:-w_size], w_size)):
    # We create the start and end indices as in range(start, end)
    start_idx = idx - w_size
    end_idx = idx + w_size + 1
    X_i += [words_idx[idx]] * (K_NEG + 1) * 2 * w_size
    X_c += [words_idx[c_idx] for c_idx in
            [*range(start_idx, idx), *range(idx + 1, end_idx)]]
    X_c += random.choices(draw_idx, weights=probs,
                          k=K_NEG * 2 * w_size)
    # X_c += list(np.random.choice(draw_idx, size=K_NEG * 2 * w_size, p=probs))
    y += [1] * w_size * 2 + [0] * w_size * 2 * K_NEG

271502it [00:34, 7918.45it/s]


We build two inputs: The left input is the input word and the right one is a context word.

In [35]:
y[:10]

[1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

In [36]:
X_i[:10]

[7663, 7663, 7663, 7663, 7663, 7663, 7663, 7663, 7663, 7663]

In [37]:
X_c[:10]

[1043, 4355, 5691, 3697, 3242, 8566, 8741, 4916, 6386, 9179]

In [38]:
y = torch.unsqueeze(torch.FloatTensor(y), dim=1)
X_i = torch.LongTensor(X_i)
X_c = torch.LongTensor(X_c)

In [39]:
X = torch.hstack((torch.unsqueeze(X_i, dim=0).T,
                 torch.unsqueeze(X_c, dim=0).T))

In [40]:
X.size()

torch.Size([6516048, 2])
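
The pair construction of this section can be replayed on a toy sequence. The sketch below is a simplified restatement of the loop above using plain lists. The function `make_pairs` and the `toy_` values are hypothetical.

```python
import random

def make_pairs(seq, w_size, k_neg, draw_idx, probs):
    """Build the input, context, and label lists for a sequence of word indices."""
    X_i, X_c, y = [], [], []
    for idx in range(w_size, len(seq) - w_size):
        # Positive context: w_size words on each side of the input word
        context = seq[idx - w_size:idx] + seq[idx + 1:idx + w_size + 1]
        # Negative context: k_neg random words per positive pair
        negatives = random.choices(draw_idx, weights=probs,
                                   k=k_neg * len(context))
        X_i += [seq[idx]] * (len(context) + len(negatives))
        X_c += context + negatives
        y += [1] * len(context) + [0] * len(negatives)
    return X_i, X_c, y

toy_seq = [4, 0, 3, 1, 2, 0]           # hypothetical word indices
toy_draw_idx = [0, 1, 2, 3, 4]
toy_probs = [0.3, 0.2, 0.2, 0.2, 0.1]  # hypothetical sampling probabilities

X_i, X_c, y = make_pairs(toy_seq, w_size=2, k_neg=5,
                         draw_idx=toy_draw_idx, probs=toy_probs)
print(X_c[:4])  # the four true context words of the first input word
print(len(y))   # 2 input words * (5 + 1) * 2 * 2 rows
```

Each input word contributes $(K + 1) \times 2 \times w$ rows, that is 24 rows with the notebook's parameters. With the 271,502 input words processed by the loop, this gives $271502 \times 24 = 6516048$ rows, which matches the size of `X`.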

## The Architecture

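The model has two embedding tables: one for the input words and one for the context words. For each pair, it returns the dot product of the input embedding and the context embedding.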

In [41]:
class Skipgram(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_embedding = nn.Embedding(vocab_size,
                                        embedding_dim)
        self.o_embedding = nn.Embedding(vocab_size,
                                        embedding_dim)

    def forward(self, X):
        # X[:, 0] holds the input word indices, X[:, 1] the context word indices
        i_embs = self.i_embedding(X[:, 0])
        o_embs = self.o_embedding(X[:, 1])
        # Row-wise dot product of the input and context embeddings
        x = (i_embs * o_embs).sum(dim=-1, keepdim=True)
        return x

In [42]:
model = Skipgram()
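
In equation form, and as a reading of the `forward` method rather than a quotation from the paper, let $\mathbf{v}_w$ be the row of `i_embedding` for the input word $w$ and $\mathbf{v}'_c$ the row of `o_embedding` for the context word $c$. The model returns the score
$$
s(w, c) = \mathbf{v}_w \cdot \mathbf{v}'_c = \sum_{j=1}^{d} v_{w,j}\, v'_{c,j},
$$
where $d$ is `embedding_dim`. The score is a logit: the probability that $(w, c)$ is a true pair is obtained with the logistic function,
$$
P(y = 1 \mid w, c) = \sigma(s(w, c)) = \frac{1}{1 + e^{-s(w, c)}}.
$$
The sigmoid is not applied in `forward`. It is applied in the loss function.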

In [43]:
class NegSamplingLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, y_pred, y):
        p = y * torch.log(torch.sigmoid(y_pred))
        n = (1.0 - y) * torch.log(torch.sigmoid(-y_pred))
        return -(p + n).mean(dim=0)

In [44]:
loss_fn = NegSamplingLoss()
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
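
Since $\sigma(-x) = 1 - \sigma(x)$, the loss above is the binary cross-entropy computed on the dot-product scores. The sketch below checks this identity in plain Python on a few hypothetical scores. The function names are assumptions made for the example and the code is not the notebook's PyTorch implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_sampling_loss(scores, labels):
    """Mean of -[y log sigma(s) + (1 - y) log sigma(-s)] over the pairs."""
    terms = [y * math.log(sigmoid(s)) + (1 - y) * math.log(sigmoid(-s))
             for s, y in zip(scores, labels)]
    return -sum(terms) / len(terms)

def binary_cross_entropy(scores, labels):
    """Binary cross-entropy applied to the probabilities sigma(s)."""
    terms = [y * math.log(sigmoid(s)) + (1 - y) * math.log(1 - sigmoid(s))
             for s, y in zip(scores, labels)]
    return -sum(terms) / len(terms)

# Hypothetical dot-product scores: one positive pair and five negative pairs
scores = [2.0, -1.5, 0.3, -0.7, -2.2, 0.1]
labels = [1, 0, 0, 0, 0, 0]

print(neg_sampling_loss(scores, labels))
print(binary_cross_entropy(scores, labels))
```

The two values agree up to rounding. This suggests that PyTorch's `nn.BCEWithLogitsLoss` could serve as an alternative to the custom class. That substitution is not part of the original notebook.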

## Cosine similarity

A few test words

In [45]:
if CORPUS == 'HOMER':
    test_words = ['he', 'she', 'ulysses', 'penelope', 'achaeans', 'trojans',
                  'achilles', 'sea', 'helen', 'ship', 'her', 'fight']
elif CORPUS == 'DICKENS':
    test_words = ['he', 'she', 'her', 'sea', 'ship',
                  'fight', 'table', 'london', 'monday']

In [46]:
def most_sim_vecs(u, E, N=10):
    cos = nn.CosineSimilarity()
    cos_sim = cos(u.unsqueeze(dim=0), E)
    sorted_vectors = sorted(range(len(cos_sim)),
                            key=lambda k: -cos_sim[k])
    return sorted_vectors[1:N + 1]

In [47]:
def sim_test_words(test_words, word2idx, model, N=10):
    most_sim_words = {}
    with torch.no_grad():
        E = model.state_dict()['i_embedding.weight']
        for w in test_words:
            most_sim_words[w] = most_sim_vecs(E[word2idx[w]], E, N)
            most_sim_words[w] = list(map(idx2word.get, most_sim_words[w]))
            print(w, most_sim_words[w])
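
The ranking logic can be illustrated without PyTorch. The sketch below ranks the rows of a small hypothetical embedding matrix by cosine similarity to a query row and skips the first hit, as `most_sim_vecs` does. The two-dimensional vectors and the names `cosine` and `most_similar` are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def most_similar(query_idx, E, n=2):
    """Rank the rows of E by similarity to row query_idx, skipping the top hit."""
    ranked = sorted(range(len(E)), key=lambda k: -cosine(E[query_idx], E[k]))
    return ranked[1:n + 1]

# Hypothetical 2-dimensional embeddings
toy_words = ['he', 'she', 'sea', 'ship']
toy_E = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.3, 0.9]]

print([toy_words[k] for k in most_similar(0, toy_E)])
```

Skipping the first element assumes that the query word is its own nearest neighbor. This holds unless another row points in exactly the same direction.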

## Training the Model

In [48]:
BATCH_SIZE = 512

In [49]:
from torch.utils.data import TensorDataset, DataLoader
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
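
With this batch size, the number of updates per epoch follows from the size of `X`. The short computation below assumes the default `DataLoader` behavior, which keeps the last incomplete batch.

```python
import math

n_rows = 6516048  # number of rows of X reported by X.size()
batch_size = 512  # BATCH_SIZE in the notebook

# Number of batches per epoch and size of the last, incomplete batch
n_batches = math.ceil(n_rows / batch_size)
last_batch = n_rows - (n_batches - 1) * batch_size
print(n_batches, last_batch)
```

Each epoch thus performs 12,727 optimizer steps, the last one on a batch of 336 pairs.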

In [50]:
model.state_dict()

OrderedDict([('i_embedding.weight',
              tensor([[-9.6653e-04, -4.3254e-01, -6.0842e-01,  ..., -4.2213e-01,
                        7.6791e-02,  7.7908e-01],
                      [-6.0923e-01, -5.3053e-01, -1.1234e+00,  ..., -2.7969e-02,
                       -9.9242e-01, -8.3003e-02],
                      [ 1.6799e-01,  6.7851e-01, -6.3538e-01,  ...,  7.6077e-01,
                        1.0847e+00,  3.9061e-01],
                      ...,
                      [-1.9245e+00,  1.2706e-01,  4.5101e-01,  ...,  1.2122e+00,
                       -1.2003e+00,  6.0470e-01],
                      [ 9.2151e-01,  5.6377e-01, -2.0107e+00,  ...,  1.9570e+00,
                       -1.1256e+00, -5.0541e-01],
                      [-1.5164e-01,  6.4384e-01, -5.4709e-01,  ..., -6.4668e-01,
                       -1.1798e+00,  5.2661e-01]])),
             ('o_embedding.weight',
              tensor([[-0.5835,  0.2553,  0.0725,  ..., -0.1127,  0.1957,  0.8125],
                      [-0.70

In [51]:
model.state_dict()['i_embedding.weight']

tensor([[-9.6653e-04, -4.3254e-01, -6.0842e-01,  ..., -4.2213e-01,
          7.6791e-02,  7.7908e-01],
        [-6.0923e-01, -5.3053e-01, -1.1234e+00,  ..., -2.7969e-02,
         -9.9242e-01, -8.3003e-02],
        [ 1.6799e-01,  6.7851e-01, -6.3538e-01,  ...,  7.6077e-01,
          1.0847e+00,  3.9061e-01],
        ...,
        [-1.9245e+00,  1.2706e-01,  4.5101e-01,  ...,  1.2122e+00,
         -1.2003e+00,  6.0470e-01],
        [ 9.2151e-01,  5.6377e-01, -2.0107e+00,  ...,  1.9570e+00,
         -1.1256e+00, -5.0541e-01],
        [-1.5164e-01,  6.4384e-01, -5.4709e-01,  ..., -6.4668e-01,
         -1.1798e+00,  5.2661e-01]])

In [52]:
for epoch in tqdm(range(5)):
    train_loss = 0
    train_acc = 0
    batch_cnt = 0
    for X_batch, y_batch in dataloader:
        y_batch_pred = model(X_batch)
        loss = loss_fn(y_batch_pred, y_batch)
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print()
    sim_test_words(test_words, word2idx, model)

  0%|          | 0/5 [00:00<?, ?it/s]


he ['him', 'they', 'it', 'i', 'them', 'to', 'for', 'that', 'you', 'as']
she ['they', 'he', 'i', 'you', 'this', 'them', 'who', 'had', 'me', 'him']
ulysses ['it', 'him', 'and', 'to', 'of', 'is', 'he', 'who', 'for', 'them']
penelope ['giants', 'spoils', 'gates', 'against', 'chuckle', 'astynous', 'trojans', 'day', 'meant', 'did']
achaeans ['all', 'his', 'in', 'and', 'the', 'as', 'them', 'on', 'to', 'a']
trojans ['it', 'about', 'was', 'ulysses', 'on', 'that', 'are', 'is', 'one', 'them']
achilles ['but', 'man', 'them', 'up', 'he', 'said', 'sea', 'with', 'had', 'him']
sea ['with', 'ulysses', 'achilles', 'up', 'about', 'to', 'by', 'in', 'of', 'on']
helen ['into', 'estates', 'maia', 'immortals', 'reach', 'handed', 'marvellously', 'irritated', 'gear', 'lulled']
ship ['hands', 'all', 'he', 'it', 'him', 'them', 'i', 'have', 'me', 'at']
her ['was', 'me', 'he', 'him', 'will', 'it', 'them', 'i', 'had', 'as']


 20%|██        | 1/5 [00:35<02:23, 35.78s/it]

fight ['by', 'man', 'one', 'all', 'made', 'did', 'from', 'when', 'are', 'as']

he ['they', 'she', 'this', 'i', 'ulysses', 'him', 'it', 'when', 'as', 'me']
she ['he', 'they', 'then', 'this', 'when', 'him', 'ulysses', 'had', 'i', 'for']
ulysses ['he', 'him', 'now', 'it', 'this', 'when', 'is', 'then', 'for', 'was']
penelope ['spoils', 'gates', 'trojans', 'did', 'against', 'ranks', 'house', 'meant', 'day', 'agamemnon']
achaeans ['them', 'men', 'but', 'all', 'and', 'it', 'him', 'trojans', 'were', 'on']
trojans ['two', 'achaeans', 'while', 'was', 'them', 'men', 'horses', 'way', 'for', 'were']
achilles ['now', 'ulysses', 'them', 'however', 'he', 'then', 'she', 'and', 'armour', 'is']
sea ['into', 'hand', 'on', 'fell', 'upon', 'out', 'body', 'up', 'seat', 'spear']
helen ['immortals', 'reach', 'pray', 'attend', 'along', 'days', 'threw', 'gear', 'cast', 'whatever']
ship ['hands', 'them', 'fire', 'however', 'were', 'fell', 'people', 'all', 'ulysses', 'him']
her ['their', 'and', 'when', 'a', 'my', 

 40%|████      | 2/5 [01:11<01:47, 35.69s/it]

fight ['made', 'but', 'men', 'was', 'about', 'even', 'all', 'so', 'for', 'do']

he ['they', 'she', 'it', 'him', 'ulysses', 'had', 'i', 'this', 'we', 'so']
she ['he', 'they', 'then', 'him', 'ulysses', 'this', 'had', 'when', 'it', 'i']
ulysses ['hector', 'he', 'she', 'now', 'this', 'i', 'achilles', 'they', 'it', 'but']
penelope ['spoils', 'headlong', 'agamemnon', 'say', 'hector', 'bird', 'meant', 'faithful', 'think', 'rule']
achaeans ['trojans', 'men', 'people', 'first', 'danaans', 'them', 'idomeneus', 'gods', 'suitors', 'wretch']
trojans ['achaeans', 'while', 'day', 'men', 'fight', 'them', 'two', 'first', 'suitors', 'among']
achilles ['ulysses', 'hector', 'them', 'he', 'she', 'now', 'then', 'this', 'they', 'menelaus']
sea ['fell', 'into', 'seat', 'up', 'out', 'skin', 'set', 'homeward', 'thick', 'suitors']
helen ['place', 'immortals', 'goes', 'days', 'along', 'pray', 'always', 'attend', 'gear', 'whatever']
ship ['hands', 'battlements', 'thrown', 'armour', 'top', 'body', 'fire', 'sword', 

 60%|██████    | 3/5 [01:46<01:11, 35.58s/it]

fight ['seeing', 'make', 'take', 'trojans', 'fetch', 'keep', 'ought', 'consider', 'offered', 'fought']

he ['she', 'they', 'it', 'i', 'him', 'ulysses', 'as', 'so', 'we', 'this']
she ['he', 'they', 'then', 'ulysses', 'him', 'this', 'i', 'had', 'when', 'it']
ulysses ['hector', 'achilles', 'telemachus', 'she', 'now', 'then', 'he', 'i', 'this', 'menelaus']
penelope ['agamemnon', 'fraud', 'herself', 'think', 'say', 'doubt', 'nobly', 'headlong', 'spoils', 'bitterest']
achaeans ['trojans', 'danaans', 'argives', 'men', 'people', 'gods', 'idomeneus', 'first', 'chromius', 'suitors']
trojans ['achaeans', 'ointments', 'suitors', 'tripods', 'watch', 'lycians', 'headlong', 'argives', 'danaans', 'first']
achilles ['ulysses', 'hector', 'he', 'menelaus', 'flitted', 'then', 'she', 'them', 'minerva', 'i']
sea ['into', 'out', 'suitors', 'thick', 'worker', 'fell', 'down', 'seat', 'breastplate', 'heifer']
helen ['grain', 'gear', 'always', 'sitting', 'place', 'goes', 'days', 'bidden', 'sarpedon', 'estates']


 80%|████████  | 4/5 [02:21<00:35, 35.40s/it]

fight ['vine', 'seeing', 'take', 'conversation', 'consider', 'fetch', 'break', 'fought', 'make', 'ought']

he ['she', 'they', 'it', 'i', 'ulysses', 'him', 'we', 'so', 'as', 'had']
she ['he', 'they', 'then', 'ulysses', 'i', 'him', 'this', 'had', 'it', 'telemachus']
ulysses ['hector', 'achilles', 'he', 'she', 'now', 'telemachus', 'i', 'then', 'menelaus', 'this']
penelope ['nobly', 'doubt', 'agamemnon', 'fraud', 'overlaid', 'think', 'herself', 'escorting', 'sir', 'bitterest']
achaeans ['trojans', 'argives', 'danaans', 'chromius', 'suitors', 'men', 'first', 'people', 'amyntor', 'lycians']
trojans ['achaeans', 'ointments', 'argives', 'tripods', 'danaans', 'suitors', 'lycians', 'frenzied', 'chromius', 'astride']
achilles ['ulysses', 'hector', 'menelaus', 'he', 'minerva', 'flitted', 'she', 'then', 'agamemnon', 'nestor']
sea ['suitors', 'heifer', 'place', 'out', 'fire', 'wool', 'thick', 'waters', 'worker', 'down']
helen ['estates', 'grain', 'player', 'visiting', 'gear', 'sarpedon', 'sitting', 

100%|██████████| 5/5 [02:57<00:00, 35.45s/it]

fight ['vine', 'seeing', 'conversation', 'sandy', 'take', 'consider', 'prosperity', 'fetch', 'blind', 'boldness']





In [53]:
sim_test_words(test_words, word2idx, model, N=3)

he ['she', 'they', 'it']
she ['he', 'they', 'then']
ulysses ['hector', 'achilles', 'he']
penelope ['nobly', 'doubt', 'agamemnon']
achaeans ['trojans', 'argives', 'danaans']
trojans ['achaeans', 'ointments', 'argives']
achilles ['ulysses', 'hector', 'menelaus']
sea ['suitors', 'heifer', 'place']
helen ['estates', 'grain', 'player']
ship ['hands', 'snowflakes', 'instance']
her ['his', 'their', 'my']
fight ['vine', 'seeing', 'conversation']


In [54]:
import pandas as pd

df = pd.DataFrame(
    model.i_embedding.weight.detach().numpy(),
    index=[idx2word[i] for i in range(len(idx2word))])

In [55]:
df

word,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
a,-0.115307,-0.173987,0.287323,-0.292108,-0.174936,-0.101785,0.122769,0.135889,-0.049494,0.132621,...,-0.152037,0.497077,0.095745,-0.000599,-0.054082,-0.067820,0.207127,0.282712,-0.051499,0.288629
abantes,-0.949044,-0.803107,-0.874356,-0.083416,-0.274116,-1.982658,-1.345640,1.685233,0.994246,0.965259,...,-0.020426,0.056160,-1.201164,-0.310857,-1.049361,-0.630436,0.580878,-0.120673,-0.946586,-0.416576
abarbarea,-0.406129,0.064634,-0.376594,1.093006,-0.130595,0.751816,0.303328,0.753493,-0.683100,-1.023630,...,0.613051,0.277999,-1.532778,0.947653,0.485238,-0.544482,1.796147,0.927751,0.748793,0.152093
abas,-2.293785,0.802299,-1.640728,0.446294,-0.488436,-1.244138,-1.170374,0.171965,0.446263,-0.963141,...,0.998260,0.072511,0.029066,0.025902,-1.360026,1.025785,-0.781029,0.761685,1.028968,-0.158237
abate,-1.220544,-0.684756,-1.301203,0.104218,-0.773782,-0.719762,0.651743,1.552368,0.812724,0.936741,...,0.040870,0.936374,0.459256,-1.575856,0.658909,-1.624894,0.603778,0.439352,1.096513,-1.439350
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zeal,-1.491527,0.276204,0.348701,0.129279,-0.724967,-0.096656,-0.356217,0.975433,-0.450713,1.449306,...,-0.719412,0.136814,-1.677482,-0.358625,0.603940,0.583222,-0.390639,0.007743,0.023471,1.018648
zelea,0.930837,-0.456653,0.437797,0.368714,-0.374955,-0.344187,0.183678,-0.533435,-0.190855,-0.824678,...,-2.426399,0.027964,0.089947,-0.942544,-0.121506,0.212807,-0.240124,-1.747478,-1.056274,-1.270342
zephyrus,-2.425524,-0.574800,0.911303,-2.049945,-1.059410,0.014786,-0.582743,1.620918,-0.000823,0.571010,...,0.980937,-0.364519,-0.605205,0.368861,-1.030033,0.685074,0.179813,0.829338,-1.951077,0.229063
zethus,0.539042,-0.009175,-1.669050,-0.854871,0.413806,-0.128843,-0.210729,0.148031,0.838526,-0.469147,...,-0.890101,0.320231,-0.070076,0.540252,-0.496157,-1.924736,-0.240213,2.060474,-1.802748,-0.765669
