# Naive word2vec

This task can be formulated very simply. Follow this [paper](https://arxiv.org/pdf/1411.2738.pdf) and implement word2vec like a two-layer neural network with matrices $W$ and $W'$. One matrix projects words to low-dimensional 'hidden' space and the other - back to high-dimensional vocabulary space.

![word2vec](https://i.stack.imgur.com/6eVXZ.jpg)

You can use TensorFlow/PyTorch (numpy too, if you love to calculate gradients on your own and want some extra points, but don't forget to numerically check your gradients) and code from your previous task. Again: you don't have to implement negative sampling (you may reduce your vocabulary size for faster computation).

**Results of this task**:
 * trained word vectors (mention somewhere, how long it took to train)
 * plotted loss (so we can see that it has converged)
 * function to map token to corresponding word vector
 * beautiful visualizations (PCE, T-SNE), you can use TensorBoard and play with your vectors in 3D (don't forget to add screenshots to the task)
 * qualitative evaluations of word vectors: nearest neighbors, word analogies

**Extra:**
 * quantitative evaluation:
   * for intrinsic evaluation you can find datasets [here](https://aclweb.org/aclwiki/Analogy_(State_of_the_art))
   * for extrincis evaluation you can use [these](https://medium.com/@dataturks/rare-text-classification-open-datasets-9d340c8c508e)

Also, you can find any other datasets for quantitative evaluation. If you chose to do this, please use the same datasets across tasks 3, 4, 5 and 6.

Again. It is **highly recommended** to read this [paper](https://arxiv.org/pdf/1411.2738.pdf)

Example of visualization in tensorboard:
https://projector.tensorflow.org

Example of 2D visualisation:

![2dword2vec](https://www.tensorflow.org/images/tsne.png)

If you struggle with something, ask your neighbor. If it is not obvious for you, probably someone else is looking for the answer too. And in contrast, if you see that you can help someone - do it! Good luck!

#### Imports

In [3]:
from skipgram import SkipGram, SkipGramBatcher
import torch
import gc
import datetime
import pickle
import numpy as np
import pandas as pd

#### Constants

In [2]:
# select whether to train model during this run (or just load it from saved file)
TRAIN = True

In [40]:
VOCAB_SIZE = 20000
BATCH_SIZE = 5000
EMBEDDINGS_DIM = 150
EPOCH_NUM = 2
WINDOW_SIZE = 2
LOGS_PERIOD = 10
np.random.seed(42)

#### select device

In [36]:
USE_GPU = True

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

print('using device:', device)

using device: cuda


#### Load corpus into batcher

In [37]:
text = []
with open('./data/text8', 'r') as text8:
    text = text8.read().split()

# text = ['first', 'used', 'against', 'early', 'working', 'radicals', 'including', 'class', 'other']
batcher = SkipGramBatcher(corpus=text, vocab_size=VOCAB_SIZE,
                          batch_size=BATCH_SIZE, window_size=WINDOW_SIZE,
                          drop_stop_words=True)
# free memory
text = []
gc.collect()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0

#### Create and train model

In [53]:
if TRAIN:
    loss_history = []
    corpus_size = len(batcher.corpus_indexes)

    model = SkipGram(VOCAB_SIZE, EMBEDDINGS_DIM)
    model.to(device)
    loss_fun = torch.nn.NLLLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

In [54]:
if TRAIN:
    learning_started = datetime.datetime.now()

    cumulative_loss = 0
    for epoch in range(EPOCH_NUM):
        for i, (context, target) in enumerate(batcher):
            tensor_context = torch.from_numpy(context).type(torch.cuda.LongTensor)
            tensor_target = torch.from_numpy(target).type(torch.cuda.LongTensor)
            tensor_context.to(device)
            tensor_target.to(device)

            model.zero_grad()

            log_probs = model(tensor_context)
            loss = loss_fun(log_probs, tensor_target)
            loss.backward()
            optimizer.step()
            cumulative_loss += loss

            if i % LOGS_PERIOD == 0:
                print(f'Cumulative loss on {(i * BATCH_SIZE / corpus_size) * 100:.1f}%:' + \
                      f'{(cumulative_loss / BATCH_SIZE) :.7f}')
                loss_history.append(loss.data)
                cumulative_loss = 0
        
        
        # after every epoch we save:
                                    # the model
                                    # loss history
        learning_ended = datetime.datetime.now()
        learning_time = (learning_ended - learning_started).total_seconds()
        learning_ended = learning_ended.strftime("%H-%M %d-%m-%Y")

        torch.save(model, f'./models/skipgram(epochs_completed-{epoch})(epoch_num-{EPOCH_NUM})' + \
                   f'(vocab-{VOCAB_SIZE})(batch-{BATCH_SIZE})' + \
                   f'(emb-{EMBEDDINGS_DIM})(wind-{WINDOW_SIZE})(consumed-{learning_time})'+ \
                   f'(finished-{learning_ended}).pytorchmodel')

        with open(f'./data/loss/loss_history(epochs_completed-{epoch})(epoch_num-{EPOCH_NUM})' + \
                  f'(vocab-{VOCAB_SIZE})(batch-{BATCH_SIZE})' + \
                  f'(emb-{EMBEDDINGS_DIM})(wind-{WINDOW_SIZE})(consumed-{learning_time})'+ \
                  f'(finished-{learning_ended}).pickle', 'wb') as f:
            pickle.dump(loss_history, f)

    with open(f'./data/batcher/batcher(vocab-{VOCAB_SIZE})(batch-{BATCH_SIZE})' + \
           f'(wind-{WINDOW_SIZE})'+ \
           f'(learning_finished-{learning_ended}).pickle', 'wb') as f:
        pickle.dump(batcher, f)

Cumulative loss on 0.0%:0.0017351
Cumulative loss on 0.5%:0.0171895
Cumulative loss on 0.9%:0.0170728
Cumulative loss on 1.4%:0.0170380
Cumulative loss on 1.8%:0.0169965
Cumulative loss on 2.3%:0.0169600
Cumulative loss on 2.8%:0.0169549
Cumulative loss on 3.2%:0.0169336
Cumulative loss on 3.7%:0.0169338
Cumulative loss on 4.1%:0.0169180
Cumulative loss on 4.6%:0.0169036
Cumulative loss on 5.1%:0.0168893
Cumulative loss on 5.5%:0.0168745
Cumulative loss on 6.0%:0.0168678
Cumulative loss on 6.4%:0.0168598
Cumulative loss on 6.9%:0.0168733
Cumulative loss on 7.3%:0.0168465
Cumulative loss on 7.8%:0.0168389
Cumulative loss on 8.3%:0.0168537
Cumulative loss on 8.7%:0.0168273
Cumulative loss on 9.2%:0.0168165
Cumulative loss on 9.6%:0.0168305
Cumulative loss on 10.1%:0.0168073
Cumulative loss on 10.6%:0.0168074
Cumulative loss on 11.0%:0.0167964
Cumulative loss on 11.5%:0.0168042
Cumulative loss on 11.9%:0.0167806
Cumulative loss on 12.4%:0.0167742
Cumulative loss on 12.9%:0.0168031
Cumulat

Cumulative loss on 8.3%:0.0160787
Cumulative loss on 8.7%:0.0161089
Cumulative loss on 9.2%:0.0160895
Cumulative loss on 9.6%:0.0160778
Cumulative loss on 10.1%:0.0160720
Cumulative loss on 10.6%:0.0160886
Cumulative loss on 11.0%:0.0160949
Cumulative loss on 11.5%:0.0160940
Cumulative loss on 11.9%:0.0160778
Cumulative loss on 12.4%:0.0160817
Cumulative loss on 12.9%:0.0161068
Cumulative loss on 13.3%:0.0160893
Cumulative loss on 13.8%:0.0160856
Cumulative loss on 14.2%:0.0160657
Cumulative loss on 14.7%:0.0161064
Cumulative loss on 15.2%:0.0160872
Cumulative loss on 15.6%:0.0160680
Cumulative loss on 16.1%:0.0160889
Cumulative loss on 16.5%:0.0160732
Cumulative loss on 17.0%:0.0160944
Cumulative loss on 17.4%:0.0160902
Cumulative loss on 17.9%:0.0160795
Cumulative loss on 18.4%:0.0160966
Cumulative loss on 18.8%:0.0160789
Cumulative loss on 19.3%:0.0160882
Cumulative loss on 19.7%:0.0160910
Cumulative loss on 20.2%:0.0160643
Cumulative loss on 20.7%:0.0160721
Cumulative loss on 21.1%

FileNotFoundError: [Errno 2] No such file or directory: './data/batcher/batcher(vocab-20000)(batch-5000)(wind-2)(learning_finished-14-42 04-03-2019).pickle'

#### Saving model

In [None]:
if TRAIN:
    learning_ended = datetime.datetime.now()
    learning_time = (learning_ended - learning_started).total_seconds()
    learning_ended = learning_ended.strftime("%H-%M %d-%m-%Y")
    
    torch.save(model, f'./models/skipgram(epoch_num-{EPOCH_NUM})(vocab-{VOCAB_SIZE})(batch-{BATCH_SIZE})' + \
               f'(emb-{EMBEDDINGS_DIM})(wind-{WINDOW_SIZE})(consumed-{learning_time})'+ \
               f'(finished-{learning_ended}).pytorchmodel')
    
    with open(f'./data/loss/loss_history(epoch_num-{EPOCH_NUM})(vocab-{VOCAB_SIZE})(batch-{BATCH_SIZE})' + \
           f'(emb-{EMBEDDINGS_DIM})(wind-{WINDOW_SIZE})(consumed-{learning_time})'+ \
           f'(finished-{learning_ended}).pickle', 'wb') as f:
        pickle.dump(loss_history, f)
        
    with open(f'./data/batcher/batcher(vocab-{VOCAB_SIZE})(batch-{BATCH_SIZE})' + \
           f'(wind-{WINDOW_SIZE})'+ \
           f'(learning_finished-{learning_ended}).pickle', 'wb')

### Model evaluation

 * trained word vectors (mention somewhere, how long it took to train)
 * plotted loss (so we can see that it has converged)
 * function to map token to corresponding word vector
 * beautiful visualizations (PCE, T-SNE), you can use TensorBoard and play with your vectors in 3D (don't forget to add screenshots to the task)
 * qualitative evaluations of word vectors: nearest neighbors, word analogies

It took 105 minutes to train model

#### Plotting loss

In [None]:
from utils import plot_moving_average

with open(f'./data/loss/loss_history(epoch_num-2)(vocab-5000)(batch-50)' + \
          '(emb-100)(wind-2)(consumed-3083.799288)(finished-16-33 03-03-2019).pickle', 'rb') as f:
    loss_history = pickle.load(f)
# transform from 1x1 tensor to np array
loss_history = np.asarray([entry.data.numpy().item() for entry in loss_history], dtype=np.float32)

In [None]:
plot_moving_average(pd.Series(loss_history), 128, plot_actual=False)

#### Function to map token (and word) to corresponding word vector

In [None]:
from utils import EmbeddingsEval
trained_model = torch.load('./models/skipgram(epoch_num-2)(vocab-5000)(batch-50)' + \
                           '(emb-100)(wind-2)(consumed-3083.799288)(finished-16-33 03-03-2019).pytorchmodel')
trained_model

In [None]:
intristic_matrix =  (trained_model.embedding_layer.weight.data.numpy() +
                     trained_model.linear_layer.weight.data.numpy()) / 2


emb_eval = EmbeddingsEval(intristic_matrix, words_to_tokens=batcher.words_to_tokens,
                          tokens_to_words=batcher.tokens_to_words)

In [None]:
emb_eval.tokens_to_embeddings([1, 2, 3])

#### Beautiful visualizations (PCA)

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

Take the most popular words

In [None]:
num_words = 200
words = batcher.tokens_to_words(np.arange(0, num_words))
embeddings = emb_eval.tokens_to_embeddings(np.arange(0, num_words))

In [None]:
pca = PCA(n_components=2)
points2d = pca.fit_transform(embeddings)


fig, ax = plt.subplots(figsize=(16, 16))
ax.scatter(points2d[:, 0], points2d[:, 1])

for i, word in enumerate(words):
    ax.annotate(word, (points2d[i, 0] + 0.01, points2d[i, 1] + 0.01), fontsize='small')

#### qualitative evaluations of word vectors: nearest neighbors, word analogies

In [None]:
for token_list in emb_eval.tokens_to_neighbours(batcher.words_to_tokens(['paris', 'france', 'king'])):
    print(batcher.tokens_to_words(token_list))

In [None]:
similar = emb_eval.most_similar(positive=['old', 'buy'], negative=['sell'])