# Word2vec preprocessing

Preprocessing is not the most exciting part of NLP, but it is still one of the most important. Your task is to preprocess raw text (you can use your own, or [this one](http://mattmahoney.net/dc/text8.zip)). For this task, text preprocessing mostly consists of:

1. cleaning (mostly needed if your dataset comes from social media or was parsed from the internet)
1. tokenization
1. building the vocabulary and choosing its size. Use only high-frequency words and replace all other words with UNK, or handle them in your own manner. You can use `collections.Counter` for that.
1. assigning each token a number (numericalization). In other words, make word2index and index2word objects.
1. data structuring and batching: make a generator of X and y matrices for word2vec (explained in more detail below)
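
Steps 2 to 4 above can be sketched roughly as follows. This is only a minimal illustration, not the required solution: the helper names `build_vocab` and `numericalize`, the regex-based cleaning, and the toy sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def build_vocab(tokens, vocab_size, unk="<UNK>"):
    """Keep the (vocab_size - 1) most frequent tokens; the last index is reserved for unk."""
    most_common = Counter(tokens).most_common(vocab_size - 1)
    index2word = [word for word, _ in most_common] + [unk]
    word2index = {word: idx for idx, word in enumerate(index2word)}
    return word2index, index2word


def numericalize(tokens, word2index, unk="<UNK>"):
    """Replace every token with its index; out-of-vocabulary tokens map to unk."""
    unk_idx = word2index[unk]
    return [word2index.get(tok, unk_idx) for tok in tokens]


# Hypothetical toy input; naive cleaning: lowercase, keep letters only, split on whitespace
raw = "The cat sat on the mat. The dog sat too!"
tokens = re.sub(r"[^a-z\s]", " ", raw.lower()).split()

word2index, index2word = build_vocab(tokens, vocab_size=4)
ids = numericalize(tokens, word2index)
print(index2word)
print(ids)
```

Reserving one slot for the unknown token keeps the total number of indices equal to `vocab_size`, which is convenient when the embedding matrix is sized later.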

**ATTN!:** If you use your own data, please attach a download link.

Your goal is to make a SkipGramBatcher class which returns two numpy tensors with word indices. It should be possible to use it for word2vec training. You can implement a batcher for the Skip-Gram or the CBOW architecture; the picture below can help you remember the difference.

![text](https://raw.githubusercontent.com/deepmipt/deep-nlp-seminars/651804899d05b96fc72b9474404fab330365ca09/seminar_02/pics/architecture.png)

There are several ways to do it right. Shapes could be `x_batch.shape = (batch_size, 2*window_size)`, `y_batch.shape = (batch_size,)` for CBOW or `(batch_size,)`, `(batch_size,)` for Skip-Gram. You should **not** do negative sampling here.

The batchers should be adequately parametrized: CBOW(window_size, ...), SkipGram(window_size, ...). You should implement only one batcher in this task, and it's up to you which one to choose.
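
For comparison with Skip-Gram, a CBOW batcher with the shapes mentioned above could look roughly like this. It is a sketch under stated assumptions: the function name `cbow_batches` is hypothetical, the input is assumed to be an already numericalized corpus, and centers without a full window on both sides are simply skipped (one of several reasonable choices).

```python
import numpy as np


def cbow_batches(token_ids, window_size, batch_size):
    """Yield (x_batch, y_batch): x_batch has shape (n, 2 * window_size), y_batch has shape (n,)."""
    token_ids = np.asarray(token_ids)
    # Only positions with a full window on both sides are used as centers
    centers = np.arange(window_size, len(token_ids) - window_size)
    # Relative offsets of context words: -window..-1 and 1..window
    offsets = np.concatenate((np.arange(-window_size, 0),
                              np.arange(1, window_size + 1)))
    for start in range(0, len(centers), batch_size):
        batch_centers = centers[start:start + batch_size]
        # Broadcasting (n, 1) + (2 * window_size,) gives an (n, 2 * window_size) index matrix
        x_batch = token_ids[batch_centers[:, None] + offsets]
        y_batch = token_ids[batch_centers]
        yield x_batch, y_batch


# Toy corpus where each token id equals its position, so the windows are easy to read
token_ids = list(range(8))
x_batch, y_batch = next(cbow_batches(token_ids, window_size=2, batch_size=4))
print(x_batch.shape, y_batch.shape)
print(x_batch)
print(y_batch)
```

The last batch may be smaller than `batch_size` when the number of usable centers is not a multiple of it.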

Useful links:
1. [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

You can write the code in this notebook or in a separate file. It can be reused for the next task. The result of your work should demonstrate that your batch has a proper structure (right shapes) and content (words should come from one context, not be random indices). To show that, translate the indices back to words and print them to get something like this:

```
text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']

window_size = 2

# CBOW:
indices_to_words(x_batch) = \
        [['first', 'used', 'early', 'working'],
         ['used', 'against', 'working', 'class'],
         ['against', 'early', 'class', 'radicals'],
         ['early', 'working', 'radicals', 'including']]

indices_to_words(labels_batch) = ['against', 'early', 'working', 'class']

# Skip-Gram

indices_to_words(x_batch) = ['against', 'early', 'working', 'class']

indices_to_words(labels_batch) = ['used', 'working', 'early', 'radicals']

```
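
The `indices_to_words` helper used in the expected output is not defined anywhere in the task, so here is one possible sketch of it. The signature (passing `index2word` explicitly) and the toy index values are assumptions; the helper handles both the 2-D CBOW input and the 1-D label or Skip-Gram batches.

```python
import numpy as np


def indices_to_words(batch, index2word):
    """Map an integer array of any shape to nested lists of words."""
    batch = np.asarray(batch)
    if batch.ndim == 0:
        # A single index: look the word up directly
        return index2word[int(batch)]
    # Otherwise recurse over the first axis
    return [indices_to_words(row, index2word) for row in batch]


# Hypothetical mapping: every word of the example text gets its position as index
index2word = ['first', 'used', 'against', 'early',
              'working', 'class', 'radicals', 'including']
x_batch = np.array([[0, 1, 3, 4], [1, 2, 4, 5]])  # CBOW contexts
labels_batch = np.array([2, 3])                   # CBOW centers

print(indices_to_words(x_batch, index2word))
print(indices_to_words(labels_batch, index2word))
```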

If you struggle with something, ask your neighbor. If something is not obvious to you, someone else is probably looking for the answer too. Conversely, if you see that you can help someone, do it! Good luck!

In [1]:
from collections import Counter
import gc
import numpy as np
from nltk.corpus import stopwords
import nltk


class SkipGramBatcher:
    def __init__(self, corpus, vocab_size, window_size=3,
                 batch_size=128, drop_stop_words=True,
                 shuffle_batch=True, unk_text='<UNK>'):
        self.window_size = window_size
        # reserve one vocabulary slot for the unknown-word token
        self.vocab_size = vocab_size - 1
        self.batch_size = batch_size
        self.unk_text = unk_text
        self.shuffle_batch = shuffle_batch

        # drop stop words from corpus if it's needed
        if drop_stop_words:
            nltk.download('stopwords')
            stop_words = set(stopwords.words('english'))
            corpus = [word for word in corpus if word not in stop_words]

        # Count all word occurrences and select vocab_size most common
        self._counted_words = Counter(corpus).most_common(self.vocab_size)
        # create mappings using dict comprehension
        self._token_to_word = {idx: word for idx, (word, count) in enumerate(self._counted_words)}
        self._word_to_token = {word: idx for idx, (word, count) in enumerate(self._counted_words)}

        # append '<UNK>' token to dictionaries
        last_token = len(self._token_to_word)
        self._token_to_word[last_token] = self.unk_text
        self._word_to_token[self.unk_text] = last_token

        # transform corpus from strings to tokens, to reduce memory usage
        tokenized = self.words_to_tokens(corpus, error_on_unk=False)
        self._corpus_tokens = np.asarray(tokenized, dtype=np.int32)

        # sequence of corpus positions; shuffled in __iter__ when shuffle_batch is set
        self._batch_shuffled_sequence = np.arange(len(self._corpus_tokens))

        # clean memory
        corpus = []
        gc.collect()

    def words_to_tokens(self, words, error_on_unk=True):
        """Function to transform iterable of words into list of tokens"""

        unk_index = self._word_to_token[self.unk_text]
        idxes = [self._word_to_token.get(word, unk_index) for word in words]
        if error_on_unk and unk_index in idxes:
            raise IndexError("Some words are not present in the dictionary")
        return idxes

    def tokens_to_words(self, tokens):
        """Function to transform iterable of tokens into list of words"""

        words = [self._token_to_word[token] for token in tokens]
        return words

    def _get_random_positive_sample(self, center_pos):
        """Internal function to get a random sample within the selected window_size"""

        left_window = np.arange(max(0, center_pos - self.window_size),
                                center_pos)
        right_window = np.arange(center_pos + 1,
                                 min(center_pos + self.window_size + 1, len(self._corpus_tokens)))
        window = np.concatenate((left_window, right_window))
        position = np.random.choice(window)
        return self._corpus_tokens[position]

    def __iter__(self):
        if self.shuffle_batch:
            np.random.shuffle(self._batch_shuffled_sequence)
        self.batch_start_pos = 0
        return self

    def __next__(self):
        if self.batch_start_pos >= len(self._corpus_tokens):
            raise StopIteration
        else:
            # get a list of shuffled numbers
            batch_position_in_corpus = self._batch_shuffled_sequence[np.arange(
                self.batch_start_pos,
                min(self.batch_start_pos + self.batch_size, len(self._batch_shuffled_sequence))
            )]
            center_words_batch = np.asarray(self._corpus_tokens[batch_position_in_corpus])
            # draw a word from window of a selected word
            context_words_batch = np.asarray([self._get_random_positive_sample(selected_word_position)
                                  for selected_word_position in batch_position_in_corpus]).flatten()
            self.batch_start_pos += self.batch_size
            return center_words_batch, context_words_batch

In [2]:
# text = []
# with open('./data/text8', 'r') as text8:
#     text = text8.read().split()

text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'class', 'other']

In [14]:
batcher = SkipGramBatcher(text, vocab_size=8, window_size=2, batch_size=3,
                          drop_stop_words=False, shuffle_batch=False)

* 7 stands for `<UNK>` here

In [15]:
for center_batch, context_batch in batcher:
    print(center_batch, context_batch)

[1 2 3] [3 3 1]
[4 5 0] [3 0 5]
[6 7 0] [0 7 7]
[7] [7]
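
One way to confirm that the printed batches have proper content is to check that every sampled context index really lies within the window of its center. The sketch below assumes the unshuffled setting used above, so the center batches concatenated in order reproduce the numericalized corpus; the helper name `contexts_in_window` is hypothetical, and the numbers are copied from the printed output.

```python
def contexts_in_window(corpus_ids, positions, contexts, window_size):
    """Return True if each context index occurs within window_size of its center position."""
    for pos, ctx in zip(positions, contexts):
        left = corpus_ids[max(0, pos - window_size):pos]
        right = corpus_ids[pos + 1:pos + window_size + 1]
        if ctx not in left + right:
            return False
    return True


# Center batches from the output above, concatenated (shuffle_batch=False)
corpus_ids = [1, 2, 3, 4, 5, 0, 6, 7, 0, 7]
# Context batches from the output above, concatenated in the same order
printed_contexts = [3, 3, 1, 3, 0, 5, 0, 7, 7, 7]

ok = contexts_in_window(corpus_ids, range(len(corpus_ids)),
                        printed_contexts, window_size=2)
print(ok)
```

With `shuffle_batch=True` the same check would need the shuffled positions of each batch rather than `range(len(corpus_ids))`.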
