# Assignment 1.2: Word2vec preprocessing (20 points)

Preprocessing is not the most exciting part of NLP, but it is still one of the most important. Your task is to preprocess raw text (you can use your own, or [this one](http://mattmahoney.net/dc/text8.zip)). For this task, text preprocessing mostly consists of:

1. cleaning (mostly, if your dataset is from social media or parsed from the internet)
1. tokenization
1. building the vocabulary and choosing its size. Use only high-frequency words; replace all other words with UNK or handle them in your own way. You can use `collections.Counter` for that.
1. assigning each token a number (numericalization). In other words, make `word2index` and `index2word` objects.
1. data structuring and batching: make a generator of X and y matrices for word2vec (explained in more detail below)
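Steps 3 and 4 can be sketched as follows. This is a minimal illustration rather than a reference solution: the helper names `build_vocab` and `numericalize`, the `UNK` constant, and the toy sentence are assumptions made for the example.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder token for rare words

def build_vocab(tokens, vocab_size):
    """Keep the `vocab_size` most frequent tokens and append UNK last."""
    counts = Counter(tokens)
    index2word = [word for word, _ in counts.most_common(vocab_size)]
    index2word.append(UNK)
    word2index = {word: i for i, word in enumerate(index2word)}
    return word2index, index2word

def numericalize(tokens, word2index):
    """Map every token to its index; out-of-vocabulary tokens map to UNK."""
    unk_id = word2index[UNK]
    return [word2index.get(tok, unk_id) for tok in tokens]

tokens = "the cat sat on the mat the cat slept".split()
word2index, index2word = build_vocab(tokens, vocab_size=2)
ids = numericalize(tokens, word2index)

print(index2word)                    # ['the', 'cat', '<UNK>']
print([index2word[i] for i in ids])  # rare words come back as '<UNK>'
```

Storing `index2word` as a list and `word2index` as a dict keeps both directions of the mapping a constant-time lookup.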

**ATTN!:** If you use your own data, please attach a download link.

Your goal is to make a **Batcher** class that returns two numpy tensors with word indices. It should be possible to use it for word2vec training. You can implement a batcher for the Skip-Gram or the CBOW architecture; the picture below can help you remember the difference.

![text](https://raw.githubusercontent.com/deepmipt/deep-nlp-seminars/651804899d05b96fc72b9474404fab330365ca09/seminar_02/pics/architecture.png)

There are several ways to do it right. Shapes could be `x_batch.shape = (batch_size, 2*window_size)`, `y_batch.shape = (batch_size,)` for CBOW or `(batch_size,)`, `(batch_size, 2*window_size)` for Skip-Gram. You should **not** do negative sampling here.

The batchers should be adequately parametrized: CBOW(window_size, ...), SkipGram(window_size, ...). You should implement only one batcher in this task, and it's up to you which one to choose.

Useful links:
1. [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

You can write the code in this notebook or in a separate file. It can be reused for the next task. Your result should demonstrate that your batch has a proper structure (right shapes) and content (words should come from one context, not be random indices). To show that, translate the indices back to words and print them, producing something like this:

```
text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']

window_size = 2

# CBOW:
indices_to_words(x_batch) = \
        [['first', 'used', 'early', 'working'],
        ['used', 'against', 'working', 'class'],
        ['against', 'early', 'class', 'radicals'],
        ['early', 'working', 'radicals', 'including']]

indices_to_words(labels_batch) = ['against', 'early', 'working', 'class']
```
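One way to reproduce the CBOW example above is sketched below. It is an illustration under assumptions: the function name `cbow_pairs` and the alphabetical toy vocabulary are hypothetical, and positions without a full window on both sides are simply skipped.

```python
import numpy as np

def cbow_pairs(ids, window_size):
    """Return (contexts, centers) for every position that has a full window."""
    contexts, centers = [], []
    for i in range(window_size, len(ids) - window_size):
        # window_size words to the left plus window_size words to the right
        contexts.append(ids[i - window_size:i] + ids[i + 1:i + window_size + 1])
        centers.append(ids[i])
    return np.array(contexts), np.array(centers)

text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']
index2word = sorted(set(text))                      # toy vocabulary
word2index = {w: i for i, w in enumerate(index2word)}
ids = [word2index[w] for w in text]

x_batch, labels_batch = cbow_pairs(ids, window_size=2)
print(x_batch.shape, labels_batch.shape)            # (4, 4) (4,)
print([[index2word[i] for i in row] for row in x_batch])
print([index2word[i] for i in labels_batch])
```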

In [0]:
import torch
import numpy as np
import os
import pandas as pd
from pprint import pprint
from collections import Counter, deque
import re
import string
import requests
import random

START_TOKEN = '<START>'
END_TOKEN = '<END>'
UNK_TOKEN = '<UNK>'

np.random.seed(17)
random.seed(17)

In [3]:
# Download the corpus
if not os.path.isfile('text8'):
    with open('data.zip', 'wb') as f:
        r = requests.get('http://mattmahoney.net/dc/text8.zip')
        f.write(r.content)
    !unzip 'data.zip' 

with open('text8') as f:
    corpus = f.read().lower().split()

Archive:  data.zip
  inflating: text8                   


In [4]:
print(corpus[:10])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']


As shown below, with a minimum frequency of 5, processing the corpus takes a little over 30 minutes (I still hope this can be sped up somehow, and I would be glad if the reviewer knows a way and shares it :) ). If the notebook is rerun, it is better to use the second cell below, which selects the 10k most frequent words in the corpus (processing there is very fast).

In [11]:
%%time

freq_dict = Counter(corpus)
vocabulary = [word for word, freq in freq_dict.items() if freq >= 5]
vocabulary.append(UNK_TOKEN)

data = []

for word in corpus:
  if word in vocabulary:
    data.append(word)
  else:
    data.append(UNK_TOKEN)

word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}

['anarchism', 'originated', 'as', 'a', 'term']
CPU times: user 31min 8s, sys: 463 ms, total: 31min 8s
Wall time: 31min 10s
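The long run time most likely comes from the test `word in vocabulary`: `vocabulary` is a list, so every test scans it linearly, and the test runs once per corpus token. A minimal sketch of the same filtering with a `set`, whose membership test takes constant time on average, is given below. The function name `replace_rare` and the toy corpus are assumptions made for the example.

```python
from collections import Counter

UNK_TOKEN = "<UNK>"

def replace_rare(corpus, min_freq):
    """Replace words seen fewer than `min_freq` times with UNK_TOKEN."""
    freq = Counter(corpus)
    vocabulary = [word for word, count in freq.items() if count >= min_freq]
    vocabulary.append(UNK_TOKEN)
    known = set(vocabulary)  # hash-based lookup instead of a linear list scan
    data = [word if word in known else UNK_TOKEN for word in corpus]
    return vocabulary, data

toy_corpus = "a a a b b c".split()
vocabulary, data = replace_rare(toy_corpus, min_freq=2)
print(vocabulary)  # ['a', 'b', '<UNK>']
print(data)        # ['a', 'a', 'a', 'b', 'b', '<UNK>']
```

The same effect can be had by building `word2idx` first and testing `word in word2idx`, since dict lookups are hash-based as well.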


In [18]:
%%time

VOCAB_SIZE = 10000
freq_dict = Counter(corpus)
no_rare_dict = freq_dict.most_common(VOCAB_SIZE)
min_freq = no_rare_dict[-1][1]
vocabulary = [x[0] for x in no_rare_dict]
vocabulary.append(UNK_TOKEN)

data = []
for i, word in enumerate(corpus):
  if freq_dict[word] > min_freq:
    data.append(word)
  else:
    data.append(UNK_TOKEN)

word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}

CPU times: user 10.2 s, sys: 82 ms, total: 10.3 s
Wall time: 10.3 s
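One detail of the threshold test `freq_dict[word] > min_freq` is worth checking: a word whose count equals `min_freq` is in `vocabulary` but is still replaced by `<UNK>`. The toy example below (hypothetical data, not the text8 corpus) compares the threshold test with a direct lookup in `word2idx`.

```python
from collections import Counter

UNK_TOKEN = "<UNK>"
VOCAB_SIZE = 2
toy_corpus = "a a a b b c c d".split()

freq_dict = Counter(toy_corpus)
top = freq_dict.most_common(VOCAB_SIZE)  # [('a', 3), ('b', 2)]
min_freq = top[-1][1]                    # 2
vocabulary = [word for word, _ in top] + [UNK_TOKEN]
word2idx = {w: i for i, w in enumerate(vocabulary)}

# Threshold test: 'b' is in the vocabulary but its count is not > min_freq.
by_threshold = [w if freq_dict[w] > min_freq else UNK_TOKEN for w in toy_corpus]
# Lookup test: keeps exactly the words that are in the vocabulary.
by_lookup = [w if w in word2idx else UNK_TOKEN for w in toy_corpus]

print(by_threshold)
print(by_lookup)
```

Both variants produce data that `word2idx` can encode; they differ only in whether words tied at the cut-off frequency keep their own index.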


In [0]:
class Batcher(object):
  """Skip-Gram batcher: returns center-word indices and their context indices."""

  def __init__(self, batch_size, window_size, data):
    self.batch_size = batch_size
    self.window_size = window_size
    self.data = data
    self.idx = 0  # current corpus position, kept on the instance instead of a global

  def __iter__(self):
    return self

  def __next__(self):
    window_size = self.window_size
    data = self.data
    x_batch, labels_batch = [], []

    for _ in range(self.batch_size):
      idx = self.idx
      # Positions without a full window on both sides are skipped, so a batch
      # that touches the corpus edge has fewer than batch_size rows.
      if window_size <= idx <= len(data) - 1 - window_size:
        x_batch.append(word2idx[data[idx]])
        context = data[idx-window_size:idx] + data[idx+1:idx+window_size+1]
        labels_batch.append([word2idx[word] for word in context])
      self.idx = (idx + 1) % len(data)

    # x_batch: (n,) center words; labels_batch: (n, 2*window_size) context words
    return np.array(x_batch), np.array(labels_batch)

In [0]:
batch_size = 10
window_size = 2
batcher = Batcher(batch_size, window_size, data)
build_batch = iter(batcher)

In [21]:
print('data:', [di for di in data[:16]])

batch, labels = next(build_batch)
print('\nWindow_size = {}:'.format(window_size))
print('\nbatch: {}'.format([idx2word[idx] for idx in batch]))
print()
print('labels: {}'.format([[idx2word[idx] for idx in context] for context in labels]))

data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', '<UNK>', 'including', 'the']

Window_size = 2:

batch: ['as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']

labels: [['anarchism', 'originated', 'a', 'term'], ['originated', 'as', 'term', 'of'], ['as', 'a', 'of', 'abuse'], ['a', 'term', 'abuse', 'first'], ['term', 'of', 'first', 'used'], ['of', 'abuse', 'used', 'against'], ['abuse', 'first', 'against', 'early'], ['first', 'used', 'early', 'working']]
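The batch printed above has 8 rows although `batch_size = 10`, because the first two corpus positions have no full left window and are skipped. If fixed-size numpy batches are needed for training, one option is to iterate only over valid center positions. The generator below is a sketch of that idea, not the notebook's `Batcher`; the name `skipgram_batches` and the toy ids are assumptions.

```python
import numpy as np

def skipgram_batches(ids, batch_size, window_size):
    """Yield (centers, contexts) forever; every batch has exactly batch_size rows."""
    ids = np.asarray(ids)
    # Relative positions of context words, e.g. [-2, -1, 1, 2] for window_size=2.
    offsets = np.concatenate([np.arange(-window_size, 0),
                              np.arange(1, window_size + 1)])
    first = window_size                      # first position with a full window
    n_valid = len(ids) - 2 * window_size     # number of valid center positions
    start = 0
    while True:
        # Wrap around inside the range of valid centers only.
        centers = first + (start + np.arange(batch_size)) % n_valid
        contexts = ids[centers[:, None] + offsets[None, :]]
        yield ids[centers], contexts
        start = (start + batch_size) % n_valid

toy_ids = np.arange(100, 116)  # stand-in for a numericalized corpus of 16 tokens
x_batch, y_batch = next(skipgram_batches(toy_ids, batch_size=10, window_size=2))
print(x_batch.shape, y_batch.shape)  # (10,) (10, 4)
print(x_batch[0], y_batch[0])        # 102 [100 101 103 104]
```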


In [22]:
# As a sanity check, print a small piece of the corpus
pprint(" ".join(word for word in data[:100]))

('anarchism originated as a term of abuse first used against early working '
 'class <UNK> including the <UNK> of the english revolution and the <UNK> '
 '<UNK> of the french revolution whilst the term is still used in a <UNK> way '
 'to describe any act that used violent means to destroy the organization of '
 'society it has also been taken up as a positive label by self defined '
 'anarchists the word anarchism is derived from the greek without <UNK> ruler '
 'chief king anarchism as a political philosophy is the belief that rulers are '
 'unnecessary and should be abolished although there are differing')
