# Assignment 1.4: Negative sampling (15 points)

You may have noticed that word2vec is really slow to train, especially with large vocabularies (> 50 000 words), because the full softmax normalizes over every word in the vocabulary at each step.
Negative sampling addresses this.

The task is to implement word2vec with negative sampling.

This approach was discussed in the Stanford lecture. The main idea is captured by the formula:

$$ L = \log\sigma(u^T_o \cdot u_c) + \sum^k_{i=1} \mathbb{E}_{j \sim P(w)}[\log\sigma(-u^T_j \cdot u_c)]$$

where $\sigma$ is the sigmoid function, $u_c$ is the center word vector, $u_o$ is the vector of a context ("outside") word from the same window, and $u_j$ is the vector of the word with index $j$.

The first term measures the similarity between a positive pair (two words from the same window).

The second term accounts for the negative samples. $k$ is a hyperparameter, the number of negatives to sample, and $\mathbb{E}_{j \sim P(w)}$ means that the index $j$ is drawn from the unigram distribution $P(w)$.
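
Drawing negatives from a unigram distribution can be sketched as follows. This is a minimal illustration in plain Python, not the code used by `Batcher` or `Word2Vec`: the helper `build_unigram_sampler`, the toy token list, and the `power` parameter are all assumptions made for the example. With `power=1.0` it samples from the plain unigram distribution described above; the second paper linked below reports that raising the counts to the 3/4 power worked better in their experiments.

```python
import random
from collections import Counter


def build_unigram_sampler(tokens, power=1.0, seed=0):
    """Return a function that draws k words with probability proportional to count**power.

    Hypothetical helper for illustration only; not part of the assignment code.
    """
    counts = Counter(tokens)
    vocab = sorted(counts)                      # fixed order for reproducibility
    weights = [counts[w] ** power for w in vocab]
    rng = random.Random(seed)

    def sample(k):
        # random.choices samples with replacement according to the weights
        return rng.choices(vocab, weights=weights, k=k)

    return sample


tokens = "the cat sat on the mat the end".split()
sample_negatives = build_unigram_sampler(tokens)
negatives = sample_negatives(5)  # k = 5 negative words
```

In a real training loop one would typically also skip negatives that coincide with the positive context word; that detail is omitted here.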

Thus, only the similarity between the positive pair and $k$ sampled negatives has to be computed, rather than the similarity to every word in the vocabulary.
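
To make the formula concrete, here is a minimal sketch of the objective for a single (center, context) pair. It assumes the expectation is approximated by one drawn sample per negative, which is how the $k$ negatives are usually used in practice. The function names and the toy vectors are hypothetical, and plain Python lists are used so that the example has no dependencies; an actual implementation would work on batched tensors.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_objective(u_c, u_o, negative_vectors):
    """L = log sigma(u_o . u_c) + sum_j log sigma(-u_j . u_c) over the sampled negatives."""
    positive_term = log_sigmoid(dot(u_o, u_c))
    negative_term = sum(log_sigmoid(-dot(u_j, u_c)) for u_j in negative_vectors)
    return positive_term + negative_term


# Toy 2-dimensional vectors, chosen only for illustration
u_c = [1.0, 0.0]                           # center word
u_o = [1.0, 0.0]                           # context word from the same window
negative_vectors = [[-1.0, 0.0], [0.0, 1.0]]  # k = 2 sampled negatives

objective = negative_sampling_objective(u_c, u_o, negative_vectors)
loss = -objective  # L is maximized, so a minimizer works with -L
```

The cost of one evaluation is proportional to $k + 1$ dot products and does not depend on the vocabulary size.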

Useful links:
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

The data is downloaded in the notebook task2_preprocessing.ipynb

In [6]:
from sklearn.manifold import TSNE

from batcher import Batcher
from w2v_model import Word2Vec

In [None]:
batcher = Batcher('text8.txt', min_count=15)

In [None]:
model = Word2Vec(batcher, 100, 'cbow_neg_sampling', device='cpu')

In [None]:
model.fit(epochs=1, batch_size=512, num_negative_samples=50, lr=1)

Loss function plot

![Loss plot](imgs/loss_negative_sampling.png)

The loss value at the rightmost point of the plot, together with the model's training time

![Fit time](imgs/fit_time_negative_sampling.png)

In [8]:
import joblib

In [9]:
model = joblib.load('data/w2v_reuters_sampling_10k_iters.pkl')

PCA of the resulting vectors

![PCA](imgs/pca_negative_sampling.png)

t-SNE projection onto a two-dimensional space

![TSNE](imgs/tsne_negative_sampling.png)