### Assignment 4

**Submission deadlines**:

* get at least 4 points by Tuesday, 12.05.2022
* remaining points: last lab session before or on Tuesday, 19.05.2022

**Points:** Aim to get 12 out of 15+ possible points

All needed data files are on Drive: <https://drive.google.com/drive/folders/1HaMbhzaBxxNa_z_QJXSDCbv5VddmhVVZ?usp=sharing> (or will be soon :) )

## Task 1 (5 points)

Implement a simplified word2vec with negative sampling from scratch (using pure numpy). Assume that in the training data objects and contexts are given explicitly, one pair per line, with the object on the left. The result of training should be the object vectors. Please write them to a file using the *natural* text format, i.e.

<pre>
word1 x1_1 x1_2 ... x1_N 
word2 x2_1 x2_2 ... x2_N
...
wordK xK_1 xK_2 ... xK_N
</pre>

Use the loss from Slide 3 in Lecture NLP.2 and compute the gradient manually. You may use gradient clipping or regularisation.
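A manually derived gradient is easy to get wrong, so it can help to compare it with a finite-difference estimate before training. The sketch below assumes the slide uses the standard negative-sampling loss, $-\log\sigma(w \cdot c) - \sum_n \log\sigma(-n \cdot c)$ (the slide itself is not reproduced here); the function names `sgns_loss`, `sgns_grads` and `numeric_grad` are made up for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c, negs):
    """Assumed loss for one (object, context) pair with negative objects negs."""
    return -np.log(sigmoid(w @ c)) - np.sum(np.log(sigmoid(-(negs @ c))))

def sgns_grads(w, c, negs):
    """Analytic gradients of sgns_loss with respect to w, c and negs."""
    pos = sigmoid(w @ c)
    neg = sigmoid(negs @ c)
    grad_w = -(1 - pos) * c
    grad_c = -(1 - pos) * w + neg @ negs
    grad_negs = np.outer(neg, c)
    return grad_w, grad_c, grad_negs

def numeric_grad(f, x, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        delta = np.zeros_like(x)
        delta.flat[i] = eps
        grad.flat[i] = (f(x + delta) - f(x - delta)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
w, c, negs = rng.normal(size=4), rng.normal(size=4), rng.normal(size=(3, 4))
grad_w, grad_c, grad_negs = sgns_grads(w, c, negs)
max_err = max(
    np.abs(grad_w - numeric_grad(lambda x: sgns_loss(x, c, negs), w)).max(),
    np.abs(grad_c - numeric_grad(lambda x: sgns_loss(w, x, negs), c)).max(),
    np.abs(grad_negs - numeric_grad(lambda x: sgns_loss(w, c, x), negs)).max(),
)
print('largest difference between analytic and numeric gradient:', max_err)
```

If the printed difference is not tiny, the analytic formula (or the loss it is supposed to match) most likely has a sign or indexing mistake.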

**Remark**: the data is specially prepared to make the learning process easier. 
Present vectors using the code below. In this task we define success as 'obtaining a result which looks definitely not random'


In [4]:
import pandas as pd
import numpy as np
from collections import defaultdict
from tqdm.auto import tqdm

# https://www.jasonosajima.com/ns.html

In [30]:
def sigmoid(x):
  return 1 / (1 + np.exp(-x))

In [144]:
class Word2Vec:


    def __init__(self, data, dim, lr, k):

        self._data = data
        self._vocab = np.unique(data[['object', 'context']].values)
        self._objects = np.unique(data[['object']].values)
        self._vocab_mapping = {word: idx for idx, word in enumerate(self._vocab)}
        self._vocab_count = None
        self._vocab_prob = None
        self._positives = None
        self._calculate_sampling_prob()
        self._prepare_positives()

        self._lr = lr
        self._dim = dim
        self._k = k
        self._W = np.random.rand(len(self._vocab), dim)

    def _calculate_sampling_prob(self):

        self._vocab_count = defaultdict(int)

        for word in self._data['object']:
            self._vocab_count[word] += 1

        norm = sum([freq**(3/4) for freq in self._vocab_count.values()])
        self._vocab_prob = {word: (freq** (3 / 4) / norm) for word, freq in self._vocab_count.items()}

    def _prepare_positives(self):

        self._positives = defaultdict(set)
        for _, row in tqdm(self._data.iterrows()):
            self._positives[row.context] |= {row.object}

    # def word_to_one_hot(self, word):
    #
    #     one_hot = np.zeros(len(self._vocab_mapping))
    #     one_hot[self._vocab_mapping[word]] = 1
    #     return one_hot

    def get_embedding(self, word):

        return self._W[self._vocab_mapping[word], :]

        # return self.word_to_one_hot(word) @ self._W

    def update_embedding(self, word, gradient):
        # idx = self.word_to_one_hot(word).astype(bool)
        # self._W[idx] -= self._lr * gradient
        self._W[self._vocab_mapping[word], :] -= self._lr * gradient

    def forward(self, word):

        return sigmoid(self._W[self._vocab_mapping[word], :])

    def get_negative_sample(self, word, k):

        negatives = []
        for _ in range(k):
            negative = np.random.choice(list(self._vocab_prob.keys()),
                         p=list(self._vocab_prob.values()))

            while negative in self._positives[word]:
                negative = np.random.choice(list(self._vocab_prob.keys()),
                             p=list(self._vocab_prob.values()))
            negatives.append(negative)

        return negatives

    # def step(self, word, context):
    # 
    #     loss = -np.log(sigmoid(self.get_embedding(word) @ self.get_embedding(context)))
    #     negatives = self.get_negative_sample(word, self._k)
    #     for negative in negatives:
    #         loss -= np.log(sigmoid(-self.get_embedding(negative) @ self.get_embedding(context)))
    # 
    #     derivative_object = -self.get_embedding(context)*(1 - sigmoid(self.get_embedding(word) @ self.get_embedding(context)))
    #     derivative_context = -self.get_embedding(word)*(1 - sigmoid(self.get_embedding(word) @ self.get_embedding(context)))
    #     for negative in negatives:
    #         derivative_context += self.get_embedding(negative)*(1 - sigmoid(-self.get_embedding(negative) @ self.get_embedding(context)))
    # 
    #     self.update_embedding(word, derivative_object)
    #     self.update_embedding(context, derivative_context)
    # 
    #     return loss
    


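    def step(self, word, context, negatives=None):
        # Draw the negatives here when the caller does not pass them in.
        if negatives is None:
            negatives = self.get_negative_sample(context, self._k)

        word_vec = self.get_embedding(word).copy()
        context_vec = self.get_embedding(context).copy()
        # A dict cannot be indexed with a list, so map the negatives one by one.
        negative_vecs = self._W[[self._vocab_mapping[n] for n in negatives], :]

        positive_score = sigmoid(word_vec @ context_vec)
        negative_scores = sigmoid(negative_vecs @ context_vec)

        # loss = -log s(w.c) - sum_n log s(-n.c)
        loss = -np.log(positive_score) - np.sum(np.log(sigmoid(-negative_vecs @ context_vec)))

        derivative_object = -context_vec * (1 - positive_score)
        # 1 - sigmoid(-x) == sigmoid(x), so each negative contributes score * vector.
        derivative_context = -word_vec * (1 - positive_score) + negative_scores @ negative_vecs

        self.update_embedding(word, derivative_object)
        self.update_embedding(context, derivative_context)
        # The loss also depends on the negative vectors, so update them too.
        for negative, score in zip(negatives, negative_scores):
            self.update_embedding(negative, score * context_vec)

        return loss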

    def save_embeddings(self, path):
        with open(path, 'w') as f:
            f.write(f'{str(len(self._objects))} {str(self._dim)} \n')
            for word in self._objects:
                embedding = self.get_embedding(word)
                f.write(f'{word} {" ".join(list(map(str, list(embedding))))}  \n')

In [148]:
data = pd.read_csv('task1_objects_contexts_polish.txt', sep=' ', names=['object', 'context'])
data.context = data.context.map(lambda x: x if len(x.split("_")) == 1 else x.split("_")[1])

In [149]:
w = Word2Vec(data, 20, 0.003, 5)

In [None]:
for epoch in range(20):
    losses = []
    for _, (word, context) in tqdm(data.iterrows()):
        loss = w.step(word, context)
        losses.append(loss)
    print(np.mean(np.array(losses)))

25.005294288435874

24.570116761745922

24.185943783834592

23.79314292653527

In [None]:
w.save_embeddings('task1_w2v_vectors.txt')

In [None]:
from gensim.models import KeyedVectors
task1_wv = KeyedVectors.load_word2vec_format('task1_w2v_vectors.txt', binary=False)

example_english_words = ['dog', 'dragon', 'love', 'bicycle', 'marathon', 'logic', 'butterfly']  # replace, or add your own examples
example_polish_words = ['pies', 'smok', 'miłość', 'rower', 'maraton', 'logika', 'motyl']

example_words = example_polish_words

for w0 in example_words:
    print('WORD:', w0)
    # use fresh names so the trained model `w` is not overwritten
    for similar, score in task1_wv.most_similar(w0):
        print('   ', similar, score)
    print()

## Task 2 (4 points)

Your task is to train embeddings for Simple Wikipedia titles, using the gensim library. As the example below shows, training is really simple:

```python
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
```
*sentences* can be a list of lists of tokens; you can also use *gensim.models.word2vec.LineSentence(source)* to create a restartable iterator over a file. First, use [this file], which contains pairs of titles such that one article links to the other.

We say that two titles are *related* if they both contain a word (or a word bigram) that is not very popular (it occurs in only a few titles). Make this definition more precise, and create a corpus that contains pairs of related titles. Mix the original corpus with the new one, then train the title vectors again.
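One possible way to make "related" precise is sketched below: a word or bigram counts as "not very popular" when it appears in at least 2 and at most `max_titles` titles. The threshold, the helper names `title_features` and `related_pairs`, and the assumption that titles are whitespace-separated strings are all choices made for this illustration, not part of the task statement.

```python
from collections import defaultdict
from itertools import combinations

def title_features(title):
    """Lowercased words plus word bigrams of a title."""
    words = title.lower().split()
    bigrams = [' '.join(pair) for pair in zip(words, words[1:])]
    return set(words) | set(bigrams)

def related_pairs(titles, max_titles=5):
    """Pairs of titles sharing a feature that occurs in 2..max_titles titles."""
    index = defaultdict(set)
    for title in titles:
        for feature in title_features(title):
            index[feature].add(title)
    pairs = set()
    for group in index.values():
        if 2 <= len(group) <= max_titles:
            pairs.update(combinations(sorted(group), 2))
    return sorted(pairs)

titles = ['Brown bear', 'Polar bear', 'Brown University', 'Polar coordinates']
pairs = related_pairs(titles, max_titles=2)
for left, right in pairs:
    print(left, '|', right)
```

Each resulting pair can then be written as one line of the new corpus, in the same format as the link-based pairs, before mixing the two files.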

Compare these two approaches using code similar to that from Task 1.

In [59]:
# The cell for your presentation

## Task 3 (4 points)

Suppose that we have two languages: Upper and Lower. This is an example Upper sentence:

<pre>
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
</pre>

And this is its translation into Lower:

<pre>
the quick brown fox jumps over the lazy dog
</pre>

You have two corpora for these languages (with different sentences). Your task is to train word embeddings for both languages together, so that the embeddings of words that are translations of each other are as close as possible. Unfortunately, your budget allows you to prepare translations for only 1000 words (we call this set D; you have to decide which words to put in D).
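One plausible (but by no means prescribed) way to spend the budget is to translate the most frequent words, since they occur in the most sentences. The sketch below assumes the corpus is a list of token lists; `choose_dictionary` is a hypothetical helper, and `str.lower` merely stands in for the paid translation step.

```python
from collections import Counter

def choose_dictionary(sentences, size=1000):
    """Map the `size` most frequent Upper tokens to their Lower translations."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: word.lower() for word, _ in counts.most_common(size)}

# Toy Upper corpus, made up for this example.
upper_corpus = [
    ['THE', 'DOG', 'BARKS'],
    ['THE', 'CAT', 'SLEEPS'],
    ['THE', 'DOG', 'SLEEPS'],
    ['A', 'DOG', 'RUNS'],
]
D = choose_dictionary(upper_corpus, size=2)
print(D)
```

Other selection rules (for example skipping very frequent function words) may work better; comparing a few of them with the score defined in this task is a reasonable experiment.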

Prepare a corpus which contains three kinds of sentences:
* Upper corpus sentences
* Lower corpus sentences
* sentences derived from Upper/Lower corpus, modified using D

There are many possible ways of doing this, for instance this one (ROT13.COM: hfr rirel fragrapr sebz obgu pbecben gjvpr: jvgubhg nal zbqvsvpngvbaf, naq jvgu rirel jbeqf sebz Q ercynprq ol vgf genafyngvba)

We define the score for an Upper WORD as $\frac{1}{p}$, where $p$ is the position of its translation in the list of **Lower** words most similar to WORD. For instance, when the words most similar to DOG are:

<pre>
WOLF, CAT, WOLVES, LION, gopher, dog
</pre>

then the score for the word DOG is 0.5. Compute the average score separately for words from D and for words outside D (hint: if the computation takes too much time, do it for a random sample).
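The score can be computed directly from a ranked neighbour list, as in the sketch below. It assumes that Lower words can be recognised with `str.islower()` and that a translation missing from the list scores 0; neither choice is fixed by the task, and the helper names are made up for this illustration.

```python
def translation_score(upper_word, ranked_neighbours):
    """1/p, where p is the rank of the translation among the Lower neighbours."""
    lower_only = [w for w in ranked_neighbours if w.islower()]
    target = upper_word.lower()
    if target not in lower_only:
        return 0.0  # assumption: a translation that is not listed scores zero
    return 1.0 / (lower_only.index(target) + 1)

def average_score(words, neighbours_of):
    """Mean score over `words`; neighbours_of(word) returns a ranked list."""
    return sum(translation_score(w, neighbours_of(w)) for w in words) / len(words)

# Toy neighbour lists; the first one is the DOG example from the text.
toy_neighbours = {
    'DOG': ['WOLF', 'CAT', 'WOLVES', 'LION', 'gopher', 'dog'],
    'CAT': ['cat', 'DOG', 'kitten'],
}
dog_score = translation_score('DOG', toy_neighbours['DOG'])
mean_score = average_score(['DOG', 'CAT'], toy_neighbours.get)
print(dog_score, mean_score)
```

With gensim vectors, the ranked list could come from `most_similar(word, topn=...)` with a `topn` large enough that Lower words are likely to appear in it.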


## Task 4 (4 points)

In this task you are asked to do two things:
1. Compare the embeddings computed on a small corpus (like the Brown Corpus, see: <https://en.wikipedia.org/wiki/Brown_Corpus>) with the ones coming from the Google News Corpus.
2. Try to use other resources, like WordNet, to enrich the corpus and obtain better embeddings.

You can use the following code snippets:

```python
# printing the tokenized Brown Corpus
from nltk.corpus import brown
for s in brown.sents():
    print(*s)

# iterating over all synsets in WordNet
from nltk.corpus import wordnet as wn

for synset_type in 'avrns': # n == noun, v == verb, ...
    for synset in list(wn.all_synsets(synset_type))[:10]:
        print(synset.definition())
        print(synset.examples())
        print([lem.name() for lem in synset.lemmas()])
        print(synset.hypernyms()) # nodes 1 level up in ontology

# loading a model and computing cosine similarity between words
from gensim.models import Word2Vec

model = Word2Vec.load('models/w2v.wordnet5.model')
print(model.wv.similarity('dog', 'cat'))
```

Embeddings will be tested using the WordSim-353 dataset; the code showing the quality is in the cell below. Prepare the following corpora:
1. Tokenized Brown Corpus
2. Definitions and examples from Princeton WordNet
3. (1) and (2) together
4. (3) enriched with pseudosentences containing (a subset of) WordNet knowledge (such as 'tiger is a carnivore')
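The pseudosentences in item 4 can be generated from hypernym relations. The sketch below works on a toy dictionary so that it runs without WordNet data; in practice the pairs could be collected with `synset.lemmas()` and `synset.hypernyms()` from nltk. The helper name `pseudosentences` and the fixed "is a" template are assumptions made for this example.

```python
def pseudosentences(hypernyms_of):
    """Turn {lemma: [hypernym lemmas]} into tokenized 'X is a Y' sentences."""
    sentences = []
    for lemma, hypernyms in hypernyms_of.items():
        for hypernym in hypernyms:
            # WordNet lemma names join multiword expressions with underscores.
            sentences.append(lemma.split('_') + ['is', 'a'] + hypernym.split('_'))
    return sentences

# Hypothetical toy input, not read from WordNet.
toy_hypernyms = {'tiger': ['big_cat', 'carnivore']}
sentences = pseudosentences(toy_hypernyms)
for sentence in sentences:
    print(' '.join(sentence))
```

The resulting token lists have the same shape as the Brown Corpus sentences, so they can simply be appended to the training corpus.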

Train 4 Word2Vec models, and report the Spearman correlation between similarities based on your vectors and similarities based on human judgements.



In [None]:
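# Code for computing correlation between W2V similarity, and human judgements

import gensim.downloader
from scipy.stats import spearmanr

gn = gensim.downloader.load('word2vec-google-news-300')

for similarity_type in ['relatedness', 'similarity']:
    ws353 = []
    with open(f'wordsim_{similarity_type}_goldstandard.txt') as f:
        for x in f:
            a, b, val = x.split()
            ws353.append((a, b, float(val)))
    # keep only the pairs whose words are both in the model's vocabulary
    known = [(a, b, val) for a, b, val in ws353 if a in gn and b in gn]
    vals = [val for _, _, val in known]              # human judgements
    ys = [gn.similarity(a, b) for a, b, _ in known]  # model similarities
    # spearmanr returns 2 values: correlation and pval. pval should be close to zero
    print(similarity_type + ':', spearmanr(vals, ys))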