# Assignment 1.3: Naive word2vec (40 points)

The task is simple to state: following this [paper](https://arxiv.org/pdf/1411.2738.pdf),
implement word2vec as a two-layer neural network with weight matrices $W$ and $W'$.
$W$ projects words into a low-dimensional 'hidden' space, and $W'$ projects them back into the high-dimensional vocabulary space.

![word2vec](https://i.stack.imgur.com/6eVXZ.jpg)
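The two projections can be sketched in a few lines of NumPy. This is only an illustration of the architecture, not the required implementation: the toy sizes, the random initialisation, and the `forward` helper are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 10, 4   # toy sizes, chosen only for illustration

# W projects a one-hot word into the hidden space; W_prime projects back.
W = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))
W_prime = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))


def forward(word_index: int) -> np.ndarray:
    """Return a probability distribution over the vocabulary for one input word."""
    hidden = W[word_index]             # equivalent to one_hot @ W
    scores = hidden @ W_prime          # one score per vocabulary word
    scores = scores - scores.max()     # stabilise the softmax numerically
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


probs = forward(3)
```

Selecting row `word_index` of `W` gives the same result as multiplying a one-hot vector by `W`, which is why the rows of `W` are usually taken as the word vectors.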

You can use TensorFlow/PyTorch and code from your previous task.

## Results of this task: (30 points)
 * trained word vectors (state how long training took)
 * a loss plot (so we can see that training has converged)
 * a function that maps a token to its word vector
 * clear visualizations (PCA, t-SNE); you can use TensorBoard and explore your vectors in 3D
 (don't forget to add screenshots to the task)
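For the PCA view, a projection to 2D can be computed directly with an SVD. The sketch below assumes the trained vectors are available as a `(vocab_size, embedding_dim)` NumPy array; the `pca_project` helper and the random stand-in matrix are hypothetical.

```python
import numpy as np


def pca_project(vectors: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project row vectors onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Random stand-in for a matrix of trained word vectors.
embeddings = np.random.default_rng(0).normal(size=(50, 100))
points_2d = pca_project(embeddings)
```

The resulting `points_2d` rows can be passed to any scatter-plot routine, with each point labelled by its token.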

## Extra questions: (10 points)
 * Intrinsic evaluation: you can find datasets [here](http://download.tensorflow.org/data/questions-words.txt)
 * Extrinsic evaluation: you can use [these](https://medium.com/@dataturks/rare-text-classification-open-datasets-9d340c8c508e)

You can also use any other datasets for quantitative evaluation.
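The linked `questions-words.txt` file holds analogy questions: section headers start with `:` and every other line has four words `a b c d`. A minimal sketch of scoring them with the vector-offset method follows. The `analogy_accuracy` helper and the dict-like `vectors` argument are assumptions, and tokens are lowercased because the corpus in this notebook is lowercased.

```python
import numpy as np


def analogy_accuracy(lines, vectors):
    """Score 'a b c d' analogy lines with the vector-offset method.

    `vectors` maps a lowercase token to a 1-D numpy array (assumed interface).
    Questions containing out-of-vocabulary words are skipped.
    """
    words = list(vectors)
    matrix = np.stack([vectors[w] for w in words])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(words)}

    correct = total = 0
    for line in lines:
        tokens = line.lower().split()
        if line.startswith(":") or len(tokens) != 4:
            continue  # section header or malformed line
        if any(t not in index for t in tokens):
            continue
        a, b, c, d = (index[t] for t in tokens)
        query = matrix[b] - matrix[a] + matrix[c]
        scores = matrix @ query
        scores[[a, b, c]] = -np.inf  # the question words are not valid answers
        correct += int(np.argmax(scores) == d)
        total += 1
    return correct / total if total else 0.0


# Hand-made toy vectors, used only to exercise the function.
toy_vectors = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king": np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([-1.0, -1.0, 0.0]),
}
accuracy = analogy_accuracy([": family", "man woman king queen"], toy_vectors)
```

With a small vocabulary such as the one built here, many questions will be skipped, so it is worth reporting how many were actually scored alongside the accuracy.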

Again, it is **highly recommended** to read this [paper](https://arxiv.org/pdf/1411.2738.pdf).

Example of a visualization in TensorBoard:
https://projector.tensorflow.org

Example of a 2D visualization:

![2dword2vec](https://www.tensorflow.org/images/tsne.png)

**The model code and the class for using it are in the file w2v_model**

## Training the model without negative sampling

In [7]:
import re  # used by save_reuters below

import nltk
import joblib
from pprint import pprint

from batcher import Batcher
from w2v_model import Word2Vec
from nltk.corpus import reuters

nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /home/alexey/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /home/alexey/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [8]:
START_TOKEN = 'START'
END_TOKEN = 'END'


def read_corpus(category="crude"):
    """Read files from the specified Reuters category.

    Params:
        category (string): category name
    Return:
        list of strings, one lowercased document per file, wrapped in
        START_TOKEN and END_TOKEN
    """
    files = reuters.fileids(category)
    return [START_TOKEN + ' ' + reuters.raw(f).lower() + ' ' + END_TOKEN for f in files]

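def save_reuters():
    """Write the corpus to reuters.txt, one document per line."""
    data = read_corpus()
    # Collapse whitespace runs (including newlines) so each document fits on one line.
    data = [re.sub(r'\s+', ' ', text) for text in data]
    with open('reuters.txt', 'w') as f:
        f.write('\n'.join(data))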
        
save_reuters()

In [9]:
batcher = Batcher('reuters.txt', min_count=15)

Making a dictionary of words: 578it [00:01, 496.44it/s]
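The internals of `Batcher` are not shown in this notebook. As a rough illustration of what a `min_count` threshold typically does, the sketch below keeps only tokens that occur at least `min_count` times; the `build_vocab` helper and whitespace tokenisation are assumptions, not the actual `Batcher` logic.

```python
from collections import Counter


def build_vocab(texts, min_count):
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(token for text in texts for token in text.split())
    kept = sorted(t for t, c in counts.items() if c >= min_count)
    return {token: i for i, token in enumerate(kept)}


vocab = build_vocab(["oil prices rose", "oil prices fell", "oil output"], min_count=2)
```

A higher threshold shrinks the vocabulary, which makes a full softmax cheaper but drops rare words from the model.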


In [17]:
model = Word2Vec(batcher, 100, 'cbow', device='cpu')

In [18]:
model.fit(epochs=2, batch_size=128, num_negative_samples=0, lr=0.1)

loss: 6.6875: : 1008it [01:29, 11.31it/s]
loss: 6.5612: : 1008it [01:47,  9.37it/s]
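With no negative samples, the natural objective is cross-entropy over a full softmax across the vocabulary. The sketch below shows one CBOW update of that kind in NumPy. It is an assumption about what such a step looks like, not the code in `w2v_model`; `cbow_step` and the toy sizes are hypothetical.

```python
import numpy as np


def cbow_step(W, W_prime, context_ids, target_id, lr):
    """One SGD step of CBOW with a full softmax; returns the loss before the update."""
    hidden = W[context_ids].mean(axis=0)       # average the context vectors
    scores = hidden @ W_prime
    scores -= scores.max()                     # stabilise the softmax numerically
    probs = np.exp(scores) / np.exp(scores).sum()
    loss = -np.log(probs[target_id])

    grad_scores = probs.copy()
    grad_scores[target_id] -= 1.0              # gradient of the loss w.r.t. the scores
    grad_hidden = W_prime @ grad_scores        # computed before W_prime is modified
    W_prime -= lr * np.outer(hidden, grad_scores)
    # Each context row receives an equal share of the hidden-layer gradient.
    np.subtract.at(W, context_ids, lr * grad_hidden / len(context_ids))
    return loss


rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(10, 4))
W_prime = rng.normal(scale=0.1, size=(4, 10))
losses = [cbow_step(W, W_prime, [1, 2, 4, 5], 3, lr=0.1) for _ in range(200)]
```

A useful sanity check: with small random weights the softmax is close to uniform, so the initial loss should be near $\ln V$ for a vocabulary of size $V$.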


Loss curve

![](imgs/loss_w2v.png)

Two-dimensional t-SNE projection

![](imgs/tsne_w2v.png)

Visualization of the nearest neighbors of the word 'pay'

![](imgs/pay_tsne.png)

![](imgs/pay_similar.png)
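Nearest neighbors like these are usually ranked by cosine similarity. A minimal sketch, assuming the vocabulary is a list of tokens aligned with the rows of an embedding matrix (the `most_similar` helper and the toy data are hypothetical):

```python
import numpy as np


def most_similar(word, words, matrix, top_k=5):
    """Return the top_k words closest to `word` by cosine similarity."""
    normed = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = normed[words.index(word)]
    sims = normed @ query
    order = np.argsort(-sims)                  # highest similarity first
    return [(words[i], float(sims[i])) for i in order if words[i] != word][:top_k]


words = ["pay", "payment", "oil"]
matrix = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
neighbours = most_similar("pay", words, matrix, top_k=2)
```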

To retrieve word vectors, the ```Word2Vec``` class implements the ```__getitem__``` method
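The actual method lives in `w2v_model` and is not reproduced here. A minimal sketch of the same idea follows, using a hypothetical `EmbeddingLookup` class with an assumed token-to-index dict; the error message mirrors the one shown in the output further down.

```python
import numpy as np


class EmbeddingLookup:
    """Hypothetical stand-in showing token lookup via __getitem__."""

    def __init__(self, token_to_index, embeddings):
        self.token_to_index = token_to_index
        self.embeddings = embeddings

    def __getitem__(self, token):
        if token not in self.token_to_index:
            raise ValueError(f"Word {token} not in model vocabulary")
        return self.embeddings[self.token_to_index[token]]


lookup = EmbeddingLookup({"apple": 0, "oil": 1}, np.eye(2))
apple_vector = lookup["apple"]
```

This sketch returns a 1-D row, whereas the notebook output below has an extra leading dimension.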

In [5]:
model['apple']

array([[-0.19411553, -0.4543066 , -0.30762923, -0.15537794,  1.2287014 ,
         0.02199285,  0.5052601 ,  1.1425176 , -1.4699855 ,  0.05500521,
        -0.50839436,  1.5444202 , -0.84225476, -0.6410309 ,  0.6089825 ,
         0.31013072,  0.5426019 , -0.6780794 ,  0.27120188,  0.30824423,
        -0.8558358 , -0.05787476,  0.6689864 , -0.93997914, -0.71960145,
         0.3158765 , -1.3750368 , -1.5036819 , -0.20110574,  0.50951546,
        -2.3031626 ,  0.23426062, -0.20502484, -0.24831456,  0.07429146,
        -0.13465947, -0.6839557 ,  0.45570114, -0.69297814, -0.6911782 ,
        -0.312758  , -0.21947247,  0.08345015,  0.00360702,  0.45657602,
        -1.0257844 , -0.02398044, -0.660159  , -0.259278  ,  0.07834294,
         0.4120891 ,  0.04391201,  0.31088114,  0.28587025,  0.47354126,
        -1.6509819 ,  0.31241482,  0.02018441,  0.27388293,  0.5192699 ,
        -0.29151487,  0.1390649 ,  0.04169053,  2.23582   , -0.70483965,
         0.42826295, -0.71624374,  1.136356  ,  0.4

In [6]:
model['abcabc']

ValueError: Word abcabc not in model vocabulary