# Word2Vec model

Write a program that trains a Word2Vec model. Do not use print() calls in your code, otherwise the test procedure will fail. The message "Wrong Answer" indicates that the answer format is incorrect (print() in the code, words missing from the dictionary, etc.). The message "Embeddings are not good enough" means you are on the right track and should focus on improving the model. In this version of the assignment the checks on the embeddings are easier.

You may think of the input string as being pre-processed with the following function:

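In [2]:
import re
import string

def clean(inp: str) -> str:
    # Replace every ASCII punctuation character with a space.
    inp = inp.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # Lowercase and collapse each run of whitespace into a single space.
    inp = re.sub(r'\s+', ' ', inp.lower())
    return inp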

For example, given the input "Your string!", the output is "your string " (note the trailing space left by the replaced "!").
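As a quick illustration of the pre-processing described above, the snippet below repeats `clean` so it runs on its own and then tokenizes the result. The sample sentence is made up for this sketch; it is not part of the assignment data.

```python
import re
import string

def clean(inp: str) -> str:
    # Same function as in the assignment text, repeated so the snippet is self-contained.
    inp = inp.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    return re.sub(r'\s+', ' ', inp.lower())

cleaned = clean("Hello,   World!! It's a-b.")
# split() without arguments discards the trailing space and any empty tokens.
tokens = cleaned.split()
```

Note that the apostrophe and the hyphen are punctuation too, so "It's" and "a-b" each become two tokens.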

Input: data (string) - cleaned documents without punctuation in one line
Output: w2v_dict (dict: key (string) - a word from vocabulary, value (numpy array) - the word's embedding)

Time limit: 25 seconds
Memory limit: 128 MB
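Since "Wrong Answer" is reported for format problems, it may help to check the returned dictionary locally before submitting. The helper below is only a sketch: `check_w2v_dict` is a hypothetical name, and the requirement that every embedding is a 1-D array of the same length is an assumption, because the grader's exact rules are not given here.

```python
import numpy as np

def check_w2v_dict(data: str, w2v_dict: dict) -> bool:
    """Hypothetical local self-check; NOT the grader's actual test."""
    vocab = set(data.split())
    # Every word of the input must have an entry ("missing words" cause Wrong Answer).
    if not vocab <= set(w2v_dict):
        return False
    values = list(w2v_dict.values())
    # The statement asks for numpy arrays as values.
    if not all(isinstance(v, np.ndarray) for v in values):
        return False
    # Assumption: embeddings are 1-D vectors that all share one dimension.
    shapes = {v.shape for v in values}
    return len(shapes) <= 1 and all(len(s) == 1 for s in shapes)
```

The function returns a boolean instead of printing, so it can stay in a notebook without violating the no-print() rule.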

## word2vec lite

In [3]:

import numpy as np

def train(data: str):
    """
    return: w2v_dict: dict
            - key: string (word)
            - value: np.array (embedding)
    """
    words = data.split()
    vocab = set(words)
    # Placeholder: each word maps to its vocabulary index (a 0-d array), not a trained embedding.
    word2idx = {w: np.array(idx) for (idx, w) in enumerate(vocab)}
    return word2idx

corpus = 'he is a king she is a queen he is a man she is a woman warsaw is poland capital berlin is germany capital paris is france capital'
loli4 = train(corpus)
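The stub above only returns indices. One way it could be turned into an actual model is skip-gram with negative sampling, sketched below in plain NumPy. This is an illustration under stated assumptions, not the reference solution: the name `train_sgns`, all hyperparameter defaults, and the uniform choice of negatives (which may occasionally coincide with the true context word) are assumptions made for brevity.

```python
import numpy as np

def train_sgns(data: str, dim: int = 16, window: int = 2, k: int = 5,
               epochs: int = 50, lr: float = 0.05, seed: int = 0) -> dict:
    """Hypothetical skip-gram trainer with negative sampling (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    words = data.split()
    vocab = sorted(set(words))                 # sorted so the index order is reproducible
    word2idx = {w: i for i, w in enumerate(vocab)}
    ids = [word2idx[w] for w in words]

    # All (target, context) index pairs that fall inside the window.
    pairs = np.array([(ids[i], ids[j])
                      for i in range(len(ids))
                      for j in range(max(0, i - window), min(len(ids), i + window + 1))
                      if i != j])

    n_vocab = len(vocab)
    w_in = rng.normal(0.0, 0.1, (n_vocab, dim))    # target-word embeddings
    w_out = np.zeros((n_vocab, dim))               # context-word embeddings
    labels = np.zeros(k + 1)
    labels[0] = 1.0                                # first slot is the true context word

    for _ in range(epochs):
        for t, c in pairs[rng.permutation(len(pairs))]:
            # One positive context followed by k uniformly drawn negatives (assumption).
            ctx = np.concatenate(([c], rng.integers(0, n_vocab, size=k)))
            v = w_in[t].copy()
            g = 1.0 / (1.0 + np.exp(-(w_out[ctx] @ v))) - labels   # sigmoid(score) - label
            w_in[t] -= lr * (g @ w_out[ctx])
            np.add.at(w_out, ctx, -lr * g[:, None] * v)  # add.at handles repeated indices
    return {w: w_in[i].copy() for w, i in word2idx.items()}
```

Calling `train_sgns(corpus)` returns one 16-dimensional vector per distinct word. Whether the defaults fit the 25-second limit on the real test data is untested and would need to be tuned.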

## word2vec hard

In [4]:
# my solution


text = '''As the eight strange beings applauded, one of them even cupping a hand over her lipsticked mouth to cheer, Joel tried to grasp what was happening. The nine of them sat in a fire rimmed cavern around a conference table shaped from warm volcanic rock. A chandelier of human bones dangled from the cavern’s ceiling, and it rattled around at random like wind chimes. A massive goat-man with reddish-black skin and wicked horns on his head towered above the seven others, who flanked him to either side. They looked like pure stereotype. A fat slob with sixteen chins, a used car saleman looking guy with gold and silver jewelry all over him, a sultry dominatrix in skin tight leather. On the other side a disheveled looking college drop out, a pretty boy staring in a mirror, a bald, muscular dude who looked like someone’s pissed off step-dad and a sour faced woman glancing jealously around the room. Just where the hell was he? Joel concentrated on his last memory. He remembered highlighting pages as his private jet, “The Holy Gust,” flew over the sapphire waters of the Bahamas. He had been reviewing his sermon for Sunday – dotting the I’s and crossing the crosses, a little god humor there, praise him – and the pilot’s voice had crackled over the intercom about turbulence. Kimberly, his personal assistant, had taken his plow out of her mouth and put on her seat belt. The plane had shook and then'''.lower()

words = text.split()
vocab = set(words)
word2idx = {w: idx for (idx, w) in enumerate(vocab)}
idx2word = {idx: w for (idx, w) in enumerate(vocab)}

from types import SimpleNamespace
import random
random.seed(42)

def generate_negative_samples(target_index, index_range, k):
    '''
    Pair the target word with `k` words drawn at random from `index_range`.

    index_range: range of word positions to sample from
    '''
    random_index = random.sample(index_range, k)  # k distinct positions

    return SimpleNamespace(
        target=word2idx[words[target_index]],
        context=[word2idx[words[index]] for index in random_index],
        label=0
    )

def text_to_train(words, context_window=2, k=6):
    '''
    Make training data from words.

    For each target word, emit one positive record per context word in the
    window and one negative record (holding `k` sampled words) per window offset.
    '''
    pos = []
    neg = []
    context_range = range(-context_window, context_window+1)
    for current_index in range(context_window, len(words) - context_window ) :
        #Positive Samples
        for relative_index in context_range:
            if current_index + relative_index != current_index:
                pos.append(SimpleNamespace(
                    target=word2idx[words[current_index]],
                    context=word2idx[words[current_index+relative_index]],
                    label=1
                ))
        #Negative Samples
        for _ in context_range:
            lhs_index_range = None
            rhs_index_range = None
            # Use the left side only if it holds ample words to sample from (more than 2*k).
            if (current_index - context_window - 2*k) > 0:
                lhs_index_range = range(0, current_index - context_window)

            # Likewise for the right side; start one past the window so context words are excluded.
            if (current_index + context_window + 2*k) < len(words):
                rhs_index_range = range(current_index + context_window + 1, len(words))

            # Pick a side at random when both are available.
            if lhs_index_range and rhs_index_range:
                index_range = random.choice([lhs_index_range, rhs_index_range])
            elif lhs_index_range:
                index_range = lhs_index_range
            else:
                index_range = rhs_index_range

            if index_range is None:
                continue  # text too short to sample negatives on either side

            neg.append(
                    generate_negative_samples(
                        current_index,
                        index_range=index_range,
                        k=k
                    )
                )
    return pos, neg

pos_data, neg_data = text_to_train(words)

print(pos_data[0])
print(neg_data[0])
# def train(data: str):
#     """
#     return: w2v_dict: dict
#             - key: string (word)
#             - value: np.array (embedding)
#     """

#     return {}

namespace(target=160, context=3, label=1)
namespace(target=160, context=[125, 34, 147, 76, 75, 146], label=0)
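Records shaped like the ones printed above (`target`, `context`, `label`) could feed a logistic-loss update such as the following. This is a sketch of one possible training step, not the notebook's own continuation: `sigmoid`, `sample_loss`, `sgd_step`, the two tables `W_in` and `W_out`, and the learning rate are all names and values assumed for illustration.

```python
from types import SimpleNamespace
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_loss(sample, W_in, W_out):
    """-log sigmoid(s) for a positive record, -sum log sigmoid(-s) for a negative one."""
    ctx = np.atleast_1d(sample.context)       # accepts a single index or a list
    scores = W_out[ctx] @ W_in[sample.target]
    sign = 1.0 if sample.label == 1 else -1.0
    return -np.sum(np.log(sigmoid(sign * scores)))

def sgd_step(sample, W_in, W_out, lr=0.1):
    """Update both embedding tables in place for one record."""
    ctx = np.atleast_1d(sample.context)
    v = W_in[sample.target].copy()
    g = sigmoid(W_out[ctx] @ v) - sample.label    # d(loss)/d(score) per context word
    W_in[sample.target] -= lr * (g @ W_out[ctx])
    np.add.at(W_out, ctx, -lr * g[:, None] * v)   # add.at accumulates repeated indices

# Toy tables and records (made-up sizes and indices, for illustration only).
rng = np.random.default_rng(0)
W_in = rng.normal(0.0, 0.1, (5, 8))
W_out = rng.normal(0.0, 0.1, (5, 8))
pos = SimpleNamespace(target=0, context=1, label=1)
neg = SimpleNamespace(target=0, context=[2, 3, 4], label=0)
sgd_step(pos, W_in, W_out)
sgd_step(neg, W_in, W_out)
```

Looping `sgd_step` over shuffled `pos_data` and `neg_data` for a few epochs, then returning the rows of `W_in` keyed by `idx2word`, would give a dictionary in the required format.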
