### Assigment 4

**Submission deadlines**:

* get at least 4 points by Tuesday, 12.05.2022
* remaining points: last lab session before or on Tuesday, 19.05.2022

**Points:** Aim to get 12 out of 15+ possible points

All needed data files are on Drive: <https://drive.google.com/drive/folders/1HaMbhzaBxxNa_z_QJXSDCbv5VddmhVVZ?usp=sharing> (or will be soon :) )

## Task 1 (5 points)

Implement simplified word2vec with negative sampling from scratch (using pure numpy). Assume that in the training data objects and contexts are given explicitly, one pair per line, and objects are on the left. The result of the training should be object vectors. Please, write them to a file using *natural* text format, ie

<pre>
word1 x1_1 x1_2 ... x1_N 
word2 x2_1 x2_2 ... x2_N
...
wordK xK_1 xK_2 ... xk_N
</pre>

Use the loss from Slide 3 in Lecture NLP.2, compute the gradient manually. You can use some gradient clipping, or regularisation. 

**Remark**: the data is specially prepared to make the learning process easier. 
Present vectors using the code below. In this task we define success as 'obtaining a result which looks definitely not random'


In [7]:
from gensim.models import KeyedVectors
import numpy as np
import re
from tqdm.auto import tqdm
from numba import njit

In [20]:
data_file = open("./data/task1_objects_contexts_polish.txt")

data_lines = data_file.readlines()
# data_lines = [re.sub("[ ].+_", " ", line) for line in data_file.readlines()]

full_data = [line.rstrip().split(" ") for line in data_lines]
full_data = np.array(full_data)

print(full_data[:10])

# data = data[:5000]
# print(data.shape)

# print(np.unique(data).shape)


[['nagromadzenie' 'G2_następstwo']
 ['temat' 'G2_skarbnica']
 ['zaspokojenie' 'G1_pragnienie']
 ['dudkiewicz' 'SUBJ_pokonać']
 ['odpis' 'AND_wyciąg']
 ['entuzjazm' 'AND_znajomość']
 ['zakład' 'G1_alpinizm']
 ['ręka' 'przeciwny']
 ['odroczenie' 'G1_realizacja']
 ['rysunek' 'AND_górnik']]


In [21]:
data = full_data[:100000]
unique_words = np.unique(data)
print(len(unique_words))

56613


In [22]:
word_to_index = {word: index for index, word in enumerate(unique_words)}
index_to_word = unique_words


def word_to_one_hot_vector(word):
    result = np.zeros(len(word_to_index))
    result[word_to_index[word]] = 1

    return result

In [11]:
class NegativeSampler:
    def __init__(self, words):
        sorted_words = np.sort(words)
        self.words, counts = np.unique(sorted_words, return_counts=True)
        self.probabilites = counts / counts.sum()
        self.probabilites = self.probabilites ** 0.75 / (self.probabilites ** 0.75).sum()

    def get(self, size=None):
        return np.random.choice(a = self.words, size=size, p = self.probabilites)

In [12]:
@njit
def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

In [13]:
class Network:
    def __init__(self, vec_size, vocabulary_size):
        self.vec_size = vec_size
        self.vocabulary_size = vocabulary_size

        self.W_word = np.random.rand(vocabulary_size, vec_size) * 10 - 5
        self.W_context = np.random.rand(vocabulary_size, vec_size) * 10 - 5

        self.W_word = torch.from_numpy(self.W_word).float().to(device="cuda")
        self.W_context = torch.from_numpy(self.W_context).float().to(device="cuda")

    def forward(self, x):
        v = x @ self.W_word

        z = v @ self.W_context.T

        return v, z


$$
\begin{aligned}
\frac{\partial}{\partial v_x} \log{\sigma(v_c \cdot v_x)}
&= \frac{1}{\sigma(v_c \cdot v_x)} \cdot \frac{\partial}{\partial v_x} \sigma(v_c \cdot v_x) \\
&= \frac{1}{\sigma(v_c \cdot v_x)} \cdot \sigma(v_c \cdot v_x)(1-\sigma(v_c \cdot v_x)) \cdot \frac{\partial}{\partial v_x} (v_c \cdot v_x) \\
&= (1-\sigma(v_c \cdot v_x)) \cdot v_c
\end{aligned}
$$

In [14]:
def log_sigmoid_derivative(vc, vx):    
    return (1 - sigmoid(torch.dot(vc, vx))) * vc


In [24]:
vec_size = 50
k = 5
learning_rate = 1
num_epochs = 5

sampler = NegativeSampler(data[:,1])
net = Network(vec_size=vec_size, vocabulary_size=len(unique_words))


for epoch in range(num_epochs):
    for i in tqdm(range(len(data))):
        word, context = data[i]
        word_index = word_to_index[word]
        context_index = word_to_index[word]

        u_c = net.W_context[context_index]
        v_w = net.W_word[word_index]

        random_contexts_index = sampler.get_index(k)

        # random_contexts_index =  [word_to_index[rc] for rc in random_contexts]
        
        u_rs = net.W_context[random_contexts_index]



        # objective =  -torch.log(sigmoid(torch.dot(u_c, v_w))) - torch.log(sigmoid(torch.dot(-u_rs, v_w))).sum()

        W_word_grad = torch.zeros_like(net.W_word)
        W_context_grad = torch.zeros_like(net.W_context)

        W_word_grad[word_index] += -log_sigmoid_derivative(u_c, v_w) - sum([log_sigmoid_derivative(-ur, v_w) for ur in u_rs])
        W_context_grad[context_index] += -log_sigmoid_derivative(v_w, u_c)

        for rci, u_r in zip(random_contexts_index, u_rs):
            W_context_grad[rci] -= log_sigmoid_derivative(-v_w, u_r)

        net.W_word -= learning_rate * W_word_grad
        net.W_context -= learning_rate * W_context_grad

100%|██████████| 100000/100000 [29:02<00:00, 57.38it/s]
100%|██████████| 100000/100000 [28:25<00:00, 58.64it/s]
100%|██████████| 100000/100000 [29:09<00:00, 57.16it/s]
100%|██████████| 100000/100000 [29:01<00:00, 57.43it/s]
100%|██████████| 100000/100000 [27:36<00:00, 60.35it/s]


In [25]:
f = open('task1_w2v_vectors.txt', "w")

word_count = 0

for i, word in enumerate(unique_words):
    if re.search("[_]", word) == None:
        word_count += 1


f.write(str(word_count) + " " + str(vec_size) + "\n")

for i, word in enumerate(unique_words):
    if re.search("[_]", word) == None:
        # print(i, word, net.W_word[i])
        # print(np.array2string(net.W_word[i] , precision=5, separator=" ")[1:-1])
        line = word + " " + " ".join([str(n)[:7] for n in net.W_word[i]]) + "\n"
        f.write(line)

f.close()

In [26]:

task1_wv = KeyedVectors.load_word2vec_format('task1_w2v_vectors.txt', binary=False)

example_english_words = ['dog', 'dragon', 'love', 'bicycle', 'marathon', 'logic', 'butterfly']  # replace, or add your own examples
example_polish_words = ['pies', 'smok', 'miłość', 'rower', 'maraton', 'logika', 'motyl']

example_words = example_polish_words

for w0 in example_words:
    print ('WORD:', w0)
    for w, v in task1_wv.most_similar(w0):
        print ('   ', w, v)
    print ()

WORD: pies
    niepodważalny 0.6319450736045837
    underground 0.545934796333313
    półfinalistka 0.5200155377388
    madrygał 0.5080426931381226
    popularność 0.4967035949230194
    jarzmo 0.4937137961387634
    pięciokrotny 0.4878771901130676
    nadtlenek 0.4850349724292755
    rozsypywanie 0.46927425265312195
    genetyzm 0.46138328313827515

WORD: smok
    rod 0.607578694820404
    znieważanie 0.5199453830718994
    szofer 0.4913886785507202
    ślub 0.47276371717453003
    gildia 0.4708929657936096
    dociskanie 0.4677221179008484
    zły 0.46679234504699707
    barak 0.45267611742019653
    teoretyk 0.44642752408981323
    biodro 0.43907320499420166

WORD: miłość
    stłumienie 0.5130264163017273
    rock 0.5128613114356995
    beret 0.48549214005470276
    lechia 0.4771725535392761
    przeszkolenie 0.47533541917800903
    zdyskontowanie 0.46027833223342896
    gekon 0.45985713601112366
    tresura 0.4564157724380493
    rozkład 0.45604395866394043
    rowerowy 0.448988974

## Task 2 (4 points)

Your task is to train the embeddings for Simple Wikipedia titles, using gensim library. As the example below shows, training is really simple:

```python
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
```
*sentences* can be a list of list of tokens, you can also use *gensim.models.word2vec.LineSentence(source)* to create restartable iterator from file. At first, use [this file] containing such pairs of titles, that one article links to another.

We say that two titles are *related* if they both contain a word (or a word bigram) which is not very popular (it occurs only in several titles). Make this definition more precise, and create the corpora which contains pairs of related titles. Make a mixture of the original corpora, and the new one, then train title vectors again.

Compare these two approaches using similar code to the code from Task 1.

In [59]:
# The cell for your presentation
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

# Task 3 (4 points)

Suppose that we have two languages: Upper and Lower. This is an example Upper sentence:

<pre>
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
</pre>

And this is its translation into Lower:

<pre>
the quick brown fox jumps over the lazy dog
</pre>

You have two corpora for these languages (with different sentences). Your task is to train word embedings for both languages together, so as to make embeddings of the words which are its translations as close as possible. But unfortunately, you have the budget which allows you to prepare the translation only for 1000 words (we call it D, you have to deside which words you want to be in D)

Prepare the corpora wich contains three kind of sentences:
* Upper corpus sentences
* Lower corpus sentences
* sentences derived from Upper/Lower corpus, modified using D

There are many possible ways of doing this, for instance this one (ROT13.COM: hfr rirel fragrapr sebz obgu pbecben gjvpr: jvgubhg nal zbqvsvpngvbaf, naq jvgu rirel jbeqf sebz Q ercynprq ol vgf genafyngvba)

We define the score for an Upper WORD as  $\frac{1}{p}$, where $p$ is a position of its translation in the list of **Lower** words most similar to WORD. For instance, when most similar words to DOG are:

<pre>
WOLF, CAT, WOLVES, LION, gopher, dog
</pre>

then the score for the word DOG is 0.5. Compute the average score separately for words from D, and for words out of D (hint: if the computation takes to much time do it for a random sample).


# Task 4 (4 points)

In this task you are asked to do two things:
1. compare the embeddings computed on small corpus (like Brown Corpus , see: <https://en.wikipedia.org/wiki/Brown_Corpus>) with the ones coming from Google News Corpus
2. Try to use other resourses like WordNet to enrich to corpus, and obtain better embeddings

You can use the following code snippets:

```python
# printing tokenized Brown Corpora
from nltk.corpus import brown
for s in brown.sents():
    print(*s)
    
#iterating over all synsets in WordNet
from nltk.corpus import wordnet as wn

for synset_type in 'avrns': # n == noun, v == verb, ...
    for synset in list(wn.all_synsets(synset_type)))[:10]:
        print (synset.definition())
        print (synset.examples())
        print ([lem.name() for lem in synset.lemmas()])
        print (synset.hyperonims()) # nodes 1 level up in ontology
        
# loading model and compute cosine similarity between words

model = Word2Vec.load('models/w2v.wordnet5.model') 
print (model.wv.similarity('dog', 'cat'))
```

Embeddings will be tested using WordSim-353 dataset, the code showing the quality is in the cell below. Prepare the following corpora:
1. Tokenized Brown Corpora
2. Definitions and examples from Princeton WordNet
3. (1) and (2) together
4. (3) enriched with pseudosentences containing (a subset) of WordNet knowledge (such as 'tiger is a carnivore')

Train 4 Word2Vec models, and raport Spearman correletion between similarities based on your vectors, and similarities based on human judgements.



In [None]:
# Code for computing correlation between W2V similarity, and human judgements

import gensim.downloader
from scipy.stats import spearmanr

gn = gensim.downloader.load('word2vec-google-news-300')

for similarity_type in ['relatedness', 'similarity']:
    ws353 = []
    for x in open(f'wordsim_{similarity_type}_goldstandard.txt'): 
        a,b,val = x.split()
        val = float(val)
        ws353.append( (a,b,val))
    # spearmanr returns 2 vallues: correlation and pval. pval should be close to zero
    print (similarity_type + ':', spearmanr(vals, ys)) 