# Homework 1: Word2vec + Negative Sampling

This task can be formulated very simply. Follow this [paper](https://arxiv.org/pdf/1411.2738.pdf) and implement word2vec as a two-layer neural network with matrices $W$ and $W'$. One matrix projects words into a low-dimensional 'hidden' space, and the other projects them back into the high-dimensional vocabulary space.

![word2vec](https://i.stack.imgur.com/6eVXZ.jpg)

It is **highly recommended** to read this [paper](https://arxiv.org/pdf/1411.2738.pdf) before starting.

Example of visualization in tensorboard:
https://projector.tensorflow.org

Example of 2D visualization:

![2dword2vec](https://www.tensorflow.org/images/tsne.png)

### Homework task: Use Negative Sampling

## Theory

There are two variants of word2vec: Skipgram and CBOW.

### Skipgram

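Skipgram predicts an outside (context) word $o$ from the center word $c$. We have two embedding matrices, $u$ and $v$: $u_w$ is the vector of word $w$ when it acts as an outside word, and $v_w$ is its vector when it acts as the center word.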



![](https://i.ibb.co/xgT4k8b/2020-10-02-10-10-21.png)

$P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}$.
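
As a rough illustration of the formula above, the sketch below computes $P(\cdot \mid c)$ for a toy vocabulary. The matrices `U` and `V`, the sizes, and the helper name `outside_word_probs` are assumptions made up for this example; they are not part of the homework template.

```python
import torch

torch.manual_seed(0)
vocab_size, emb_dim = 6, 4            # toy sizes, chosen only for illustration
U = torch.randn(vocab_size, emb_dim)  # row w is the outside vector u_w
V = torch.randn(vocab_size, emb_dim)  # row c is the center vector v_c

def outside_word_probs(c):
    """Return P(. | c): a softmax over the scores u_w^T v_c for every word w."""
    scores = U @ V[c]                 # shape (vocab_size,)
    return torch.softmax(scores, dim=0)

probs = outside_word_probs(2)
print(probs, probs.sum())             # the probabilities sum to 1
```

Note that the denominator touches every row of `U`, which is what makes the full softmax expensive for a realistic vocabulary.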

More formally, we need to maximize the likelihood:
$$
L(\theta)=\prod_{t=1}^{T} \prod_{-m \leq j \leq m, j \neq 0} P\left(w_{t+j} \mid w_{t}, \theta\right)
$$

$$
L_{\log}(\theta) = \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \log P\left(w_{t+j} \mid w_{t}, \theta\right) = \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \log \frac{\exp \left(u_{t+j}^{T} v_{t}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{t}\right)} = \\ = \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \left( u_{t+j}^{T} v_{t} - \log \sum_{w \in V} \exp \left(u_{w}^{T} v_{t}\right) \right)
$$

$$
loss = -L_{\log}
$$
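
The double sum over positions $t$ and offsets $j$ can be sketched in code as follows. This is a minimal illustration, not the batcher required by the homework: `skipgram_pairs`, `skipgram_loss`, and the toy data are hypothetical names, and offsets that fall outside the text are simply skipped (an assumption, since the formula does not say how to treat the edges).

```python
import torch
import torch.nn.functional as F

def skipgram_pairs(token_ids, m):
    """List (center, outside) index pairs for every position t and offset j != 0, |j| <= m."""
    pairs = []
    for t, center in enumerate(token_ids):
        for j in range(-m, m + 1):
            # Assumption: offsets that leave the text are skipped.
            if j != 0 and 0 <= t + j < len(token_ids):
                pairs.append((center, token_ids[t + j]))
    return pairs

def skipgram_loss(U, V, pairs):
    """loss = -L_log: summed negative log-softmax probability of each outside word."""
    centers = torch.tensor([c for c, _ in pairs])
    outsides = torch.tensor([o for _, o in pairs])
    scores = V[centers] @ U.T                  # (n_pairs, vocab_size): u_w^T v_t for all w
    log_probs = F.log_softmax(scores, dim=1)   # u_o^T v_t - log sum_w exp(u_w^T v_t)
    return -log_probs[torch.arange(len(pairs)), outsides].sum()

torch.manual_seed(0)
toy_ids = [0, 1, 2, 1, 3]                      # a toy "text" of word indices
U = torch.randn(4, 3)                          # outside vectors
V = torch.randn(4, 3)                          # center vectors
pairs = skipgram_pairs(toy_ids, m=1)
loss = skipgram_loss(U, V, pairs)
print(len(pairs), loss.item())
```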

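Let's compute the derivative! We differentiate a single term of the sum, with outside word $o$ and the center word at position $t$, with respect to $v_t$.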

**Reminder**

$$\frac{\partial x^T y}{\partial y} = x$$



$$
\frac{\partial L_{\log}(\theta)}{\partial v_t} = u_o - \dfrac{1}{\sum_w\exp(u_w^T v_t)}\cdot\sum_x \exp(u_x^T v_t) u_x = u_o - \sum_x \frac{\exp(u_x^T v_t)}{\sum_w \exp(u_w^T v_t)} u_x = \\ = u_o - \sum_x P(x \mid w_t) u_x
$$
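
One way to sanity-check this result is to compare it with the gradient that autograd computes. The sketch below does this for a single (outside, center) pair; the toy sizes, the index `o`, and the variable names are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)
vocab_size, emb_dim = 5, 3                      # toy sizes
U = torch.randn(vocab_size, emb_dim, dtype=torch.float64)             # outside vectors u_w
v_t = torch.randn(emb_dim, dtype=torch.float64, requires_grad=True)   # center vector
o = 1                                           # index of the observed outside word

# Single log-likelihood term: u_o^T v_t - log sum_w exp(u_w^T v_t)
scores = U @ v_t
log_lik = scores[o] - torch.logsumexp(scores, dim=0)
log_lik.backward()                              # autograd gradient lands in v_t.grad

# Closed form: u_o minus the probability-weighted average of all outside vectors
probs = torch.softmax(scores.detach(), dim=0)
analytic = U[o] - probs @ U

print(torch.allclose(v_t.grad, analytic))
```

The gradient vanishes when the expected outside vector under the model equals the observed one, which is the usual softmax behavior.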

### CBOW

![](https://lena-voita.github.io/resources/lectures/word_emb/w2v/cbow_skip-min.png)

## Practice

### Variant 1

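```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, voc_size, emb_dim):
        super().__init__()  # required before registering submodules
        # Note: in this code `u` holds the center-word vectors and `v` the
        # context-word vectors (the roles are swapped relative to the Theory section).
        self.u = nn.Embedding(voc_size, emb_dim)
        self.v = nn.Embedding(voc_size, emb_dim)

w2v = Model(...)

def step(word, context):
    # `word` and every `c_word` are 0-dim LongTensors holding vocabulary indices.
    for c_word in context:
        center = w2v.u(word)
        # loss = -(score of the true pair) + log(sum of exp(score) over the vocabulary)
        loss = -center.dot(w2v.v(c_word))
        cum_exp = 0
        for i in range(w2v.v.num_embeddings):
            # The softmax denominator runs over every word, including c_word.
            cum_exp += center.dot(w2v.v.weight[i]).exp()
        loss += torch.log(cum_exp)
        loss.backward()
        ...
```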

### Variant 2

![](https://i.ibb.co/qydjBbv/2020-10-02-12-16-33.png)

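```python
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, voc_size, emb_dim):
        super().__init__()  # required before registering submodules
        self.u = nn.Embedding(voc_size, emb_dim)           # input (center-word) embeddings
        self.v = nn.Linear(emb_dim, voc_size, bias=False)  # output projection to vocabulary scores

    def forward(self, x):
        # Returns unnormalized scores (logits) over the vocabulary.
        return self.v(self.u(x))

w2v = Model(...)
criterion = nn.CrossEntropyLoss()  # applies log-softmax internally

def step(word, context):
    # `word` and every `c_word` are LongTensors of shape (1,) holding vocabulary indices.
    for c_word in context:
        preds = w2v(word)          # shape (1, voc_size)
        loss = criterion(preds, c_word)
        loss.backward()
        ...
```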


## Homework #1

### Negative sampling

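Instead of updating all context vectors at every step, we update only the vector of the true context word and the vectors of $K$ randomly sampled "negative" words (typically 5-10).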

$$
loss = -\log \sigma\left(u_{context}^{T} v_{center}\right)-\sum_{w \in\left\{w_{i_{1}}, \ldots, w_{i_{K}}\right\}} \log \sigma\left(-u_{w}^{T} v_{center}\right)
$$
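
The loss above can be written almost term by term with `F.logsigmoid`. The sketch below handles a single (center, context) pair with $K$ negatives stacked as rows of a matrix; the function name `negative_sampling_loss` and the toy tensors are assumptions for illustration, not the required homework interface.

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(u_context, v_center, u_negatives):
    """
    u_context:   (emb_dim,)    vector of the true context word
    v_center:    (emb_dim,)    vector of the center word
    u_negatives: (K, emb_dim)  vectors of the K sampled negative words
    """
    positive = F.logsigmoid(u_context @ v_center)              # log sigma(u_context^T v_center)
    negative = F.logsigmoid(-(u_negatives @ v_center)).sum()   # sum_w log sigma(-u_w^T v_center)
    return -(positive + negative)

torch.manual_seed(0)
emb_dim, K = 4, 5                      # toy sizes
u_context = torch.randn(emb_dim)
v_center = torch.randn(emb_dim)
u_negatives = torch.randn(K, emb_dim)
loss = negative_sampling_loss(u_context, v_center, u_negatives)
print(loss.item())
```

Only $K + 1$ context vectors and one center vector receive gradients here, instead of the whole vocabulary as in the full softmax.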

**How can we sample negative words?**

According to their unigram probability raised to the power $3/4$ and renormalized: $p_{sample} (w) \propto p_{word} (w) ^{3/4}$
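
A minimal sketch of this sampling scheme is shown below. The helper `negative_sampling_distribution` and the toy sentence are hypothetical; the weights $p_{word}(w)^{3/4}$ are divided by their sum so that they form a valid distribution, and `torch.multinomial` then draws word indices from it.

```python
import collections
import torch

def negative_sampling_distribution(tokens, power=0.75):
    """Return the vocabulary and sampling probabilities proportional to p_word(w) ** power."""
    counts = collections.Counter(tokens)
    vocab = sorted(counts)
    freqs = torch.tensor([counts[w] for w in vocab], dtype=torch.float64)
    weights = (freqs / freqs.sum()) ** power   # unnormalized p_word^(3/4)
    return vocab, weights / weights.sum()      # renormalize so the probabilities sum to 1

tokens = "the cat sat on the mat the end".split()   # toy corpus
vocab, p_sample = negative_sampling_distribution(tokens)

torch.manual_seed(0)
negative_ids = torch.multinomial(p_sample, num_samples=5, replacement=True)
print([vocab[i] for i in negative_ids.tolist()])
```

Compared with the raw unigram distribution, the $3/4$ power lowers the probability of very frequent words and raises it for rare ones: in the toy corpus, "the" has unigram probability $3/8$, but its sampling probability is smaller.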

Information $= -\log(P)$


### Not all of the words are equally important

### Distance between center and context

![](https://lena-voita.github.io/resources/lectures/word_emb/research/w2v_position-min.png)


In [2]:
import numpy as np
import pandas as pd
import collections
import itertools
import nltk
import re
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from nltk.corpus import stopwords
from itertools import zip_longest
from string import ascii_lowercase
import snowballstemmer

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
!wget http://mattmahoney.net/dc/text8.zip
!unzip /content/text8.zip -d text8/

--2021-03-12 00:39:59--  http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net (mattmahoney.net)... 67.195.197.24
Connecting to mattmahoney.net (mattmahoney.net)|67.195.197.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.zip’


2021-03-12 00:40:42 (712 KB/s) - ‘text8.zip’ saved [31344016/31344016]

Archive:  /content/text8.zip
  inflating: text8/text8             


In [5]:
with open('/content/text8/text8', 'r') as txt_file:
    text_8 = txt_file.read()

In [6]:
print(len(text_8))
text_8[0:500]

100000000


' anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philoso'

In [11]:
class SkipGramBatcher(object):
    def __init__(self, text, window_size = 3, batch_size = 2048, vocab_size=60000):
        pass
        # your code goes here

In [12]:
class SkipGramNegativeSampling(nn.Module):
    def __init__(self, vocab_size, embedding_size):
        pass
        # your code goes here

In [None]:
criterion = # your code goes here
optimizer = # your code goes here
scheduler = # your code goes here

In [None]:
epochs = # your code goes here
train_losses = # your code goes here
verbose_at_batch = # your code goes here

In [None]:
def train_model(model, loss_function, optimizer, scheduler, train_losses, verbose_at_batch, epochs):
    pass
    # your code goes here