## Solving tasks from the Distributional Semantics section of the NLP course on Stepik

### 1. Generation of examples for training Word2Vec Skip Gram Negative Sampling

We are training Word2Vec Skip Gram Negative Sampling with a window of a given width. For example, a window of size 5 means that words no more than 2 positions to the left or right of the center word are treated as positive examples. The center word itself is not counted as a context word.
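To make the window rule concrete, here is a minimal sketch that lists which positions count as context for a given center position. The helper name `context_positions` is hypothetical and is not part of the task; it only assumes that the window is cut off at the edges of the text.

```python
def context_positions(center, text_len, window_size):
    """Hypothetical helper: indices of the context tokens around `center`."""
    half = window_size // 2
    left = max(0, center - half)               # do not run past the start of the text
    right = min(text_len, center + half + 1)   # do not run past the end of the text
    # The center position itself is excluded from the context.
    return [i for i in range(left, right) if i != center]


print(context_positions(3, 7, 5))  # [1, 2, 4, 5]
print(context_positions(0, 7, 5))  # [1, 2]
```

At the edges of the text the window is simply shorter, so the first and last tokens get fewer positive examples.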

Write a function that generates training examples from the text. Every training example must be a 3-element tuple $(CenterWord, CtxWord, Label)$, where $CenterWord \in \mathbb{N}$ is the identifier of the token in the middle of the window, $CtxWord \in \mathbb{N}$ is the identifier of an adjacent token, and $Label \in \{0, 1\}$ is $1$ if $CtxWord$ is a positive example and $0$ if it is a negative example.

The function should return a list of training examples.

The argument ns_rate sets the number of negative examples to generate for each positive example. When sampling negative words, it is usually not checked whether the word appears in the window. Thus, some negative examples may in fact coincide with positive ones.
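The sampling step can be sketched as follows. This assumes negatives are drawn uniformly from the vocabulary, which is a simplification chosen for this task and not the only possible scheme; the helper name `sample_negatives` and its `rng` parameter are hypothetical.

```python
import random


def sample_negatives(center, vocab_size, ns_rate, rng=random):
    """Hypothetical helper: draw `ns_rate` negative examples for one positive pair.

    Assumption: words are sampled uniformly from 0..vocab_size - 1, and no check
    is made against the actual context window.
    """
    return [(center, rng.randrange(vocab_size), 0) for _ in range(ns_rate)]


rng = random.Random(0)  # seeded only to make the demo repeatable
print(sample_negatives(center=3, vocab_size=6, ns_rate=2, rng=rng))
```

Because nothing is checked against the window, a sampled word can equal a true context word, so the same pair may appear with both labels.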

The input text is already tokenized, and the tokens have been replaced with their identifiers.

Tests are generated randomly, constraints:

 - len(text) < 20
 - window_size <= 11, odd
 - vocab_size < 100
 - ns_rate < 3

Words have identifiers 0..vocab_size - 1 (as returned by np.random.randint).

Note that -3 // 2 != -(3 // 2): floor division rounds toward negative infinity, so -3 // 2 is -2, while -(3 // 2) is -1.
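The difference matters when computing the left offset of the window. A short sketch, where the function name `left_offset` is hypothetical:

```python
def left_offset(window_size):
    """Hypothetical helper: offset of the leftmost context word from the center."""
    # Halve first, then negate. Negating first would floor in the wrong direction.
    return -(window_size // 2)


print(-3 // 2)         # -2
print(-(3 // 2))       # -1
print(left_offset(3))  # -1
print(left_offset(5))  # -2
```

Writing `-window_size // 2` instead would give -2 for a window of size 3, which reaches one position too far to the left.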

In [1]:
import sys
import ast
import numpy as np
import random

In [2]:
def parse_array(s):
    return np.array(ast.literal_eval(s))

def read_array():
    return parse_array(sys.stdin.readline())

def write_array(arr):
    print(repr(arr.tolist()))

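def get_window(text, window_size):
    """Yield (center, context) for every position in text."""
    half = window_size // 2
    for current, center in enumerate(text):
        # Clamp the left edge so a negative index does not wrap around to the end.
        start = max(0, current - half)
        # Slicing past the end of text is safe, so the right edge needs no clamp.
        context = list(text[start:current]) + list(text[current + 1:current + 1 + half])
        yield center, context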
        
def generate_w2v_sgns_samples(text, window_size, vocab_size, ns_rate):
    """
    text - list of integer numbers - ids of tokens in text
    window_size - odd integer - width of window
    vocab_size - positive integer - number of tokens in vocabulary
    ns_rate - positive integer - number of negative tokens to sample per one positive sample

    returns list of training samples (CenterWord, CtxWord, Label)
    """
    res = []

    for center, context_values in get_window(text, window_size):
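        for context in context_values:
            # Positive example: a real (center, context) pair from the window.
            res.append([center, context, 1])
            # Negatives are drawn uniformly from the vocabulary, without checking
            # whether the sampled word also occurs in the window.
            for _ in range(ns_rate):
                res.append([center, random.randrange(vocab_size), 0])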

    return res

In [3]:
text = [1, 0, 1, 0, 0, 5, 0, 3, 5, 5, 3, 0, 5, 0, 5, 2, 0, 1, 3]
window_size = 4  # even here; 4 // 2 == 2, so the context is the same as for window_size = 5
vocab_size = 6
ns_rate = 1

result = generate_w2v_sgns_samples(text, window_size, vocab_size, ns_rate)

write_array(np.array(result))

[[1, 0, 1], [1, 2, 0], [1, 1, 1], [1, 2, 0], [0, 1, 1], [0, 4, 0], [0, 1, 1], [0, 4, 0], [0, 0, 1], [0, 5, 0], [1, 1, 1], [1, 4, 0], [1, 0, 1], [1, 5, 0], [1, 0, 1], [1, 3, 0], [1, 0, 1], [1, 2, 0], [0, 0, 1], [0, 3, 0], [0, 1, 1], [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 5, 1], [0, 2, 0], [0, 1, 1], [0, 1, 0], [0, 0, 1], [0, 4, 0], [0, 5, 1], [0, 4, 0], [0, 0, 1], [0, 5, 0], [5, 0, 1], [5, 3, 0], [5, 0, 1], [5, 3, 0], [5, 0, 1], [5, 1, 0], [5, 3, 1], [5, 5, 0], [0, 0, 1], [0, 5, 0], [0, 5, 1], [0, 4, 0], [0, 3, 1], [0, 3, 0], [0, 5, 1], [0, 0, 0], [3, 5, 1], [3, 2, 0], [3, 0, 1], [3, 3, 0], [3, 5, 1], [3, 0, 0], [3, 5, 1], [3, 0, 0], [5, 0, 1], [5, 0, 0], [5, 3, 1], [5, 1, 0], [5, 5, 1], [5, 4, 0], [5, 3, 1], [5, 3, 0], [5, 3, 1], [5, 1, 0], [5, 5, 1], [5, 5, 0], [5, 3, 1], [5, 0, 0], [5, 0, 1], [5, 5, 0], [3, 5, 1], [3, 0, 0], [3, 5, 1], [3, 2, 0], [3, 0, 1], [3, 2, 0], [3, 5, 1], [3, 3, 0], [0, 5, 1], [0, 2, 0], [0, 3, 1], [0, 2, 0], [0, 5, 1], [0, 2, 0], [0, 0, 1], [0, 4, 0], [5, 3, 1]