### Assigment 4

**Submission deadlines**:

* get at least 4 points by Tuesday, 12.05.2022
* remaining points: last lab session before or on Tuesday, 19.05.2022

**Points:** Aim to get 12 out of 15+ possible points

All needed data files are on Drive: <https://drive.google.com/drive/folders/1HaMbhzaBxxNa_z_QJXSDCbv5VddmhVVZ?usp=sharing> (or will be soon :) )

## Task 1 (5 points)

Implement simplified word2vec with negative sampling from scratch (using pure numpy). Assume that in the training data objects and contexts are given explicitly, one pair per line, and objects are on the left. The result of the training should be object vectors. Please, write them to a file using *natural* text format, ie

<pre>
word1 x1_1 x1_2 ... x1_N 
word2 x2_1 x2_2 ... x2_N
...
wordK xK_1 xK_2 ... xk_N
</pre>

Use the loss from Slide 3 in Lecture NLP.2, compute the gradient manually. You can use some gradient clipping, or regularisation. 

**Remark**: the data is specially prepared to make the learning process easier. 
Present vectors using the code below. In this task we define success as 'obtaining a result which looks definitely not random'


In [1]:
from gensim.models import KeyedVectors
import numpy as np
import re
from tqdm.auto import tqdm
from numba import njit
import torch

In [21]:
data_file = open("./data/task1_objects_contexts_polish.txt")

data_lines = data_file.readlines()
# data_lines = [re.sub("[ ].+_", " ", line) for line in data_file.readlines()]

full_data = [line.rstrip().split(" ") for line in data_lines]
full_data = np.array(full_data)

print(full_data[:10])

# data = data[:5000]
# print(data.shape)

# print(np.unique(data).shape)


[['nagromadzenie' 'G2_następstwo']
 ['temat' 'G2_skarbnica']
 ['zaspokojenie' 'G1_pragnienie']
 ['dudkiewicz' 'SUBJ_pokonać']
 ['odpis' 'AND_wyciąg']
 ['entuzjazm' 'AND_znajomość']
 ['zakład' 'G1_alpinizm']
 ['ręka' 'przeciwny']
 ['odroczenie' 'G1_realizacja']
 ['rysunek' 'AND_górnik']]


In [22]:
data = full_data
unique_words = np.unique(data)
print(len(unique_words))

184246


In [23]:
word_to_index = {word: index for index, word in enumerate(unique_words)}
index_to_word = unique_words


def word_to_one_hot_vector(word):
    result = np.zeros(len(word_to_index))
    result[word_to_index[word]] = 1

    return result

In [38]:
class NegativeSampler:
    def __init__(self, words):
        sorted_words = np.sort(words)
        self.words, counts = np.unique(sorted_words, return_counts=True)
        self.indices = np.arange(len(self.words))
        self.probabilites = counts / counts.sum()
        self.probabilites = self.probabilites ** 0.75 / (self.probabilites ** 0.75).sum()

    def get(self, size=None):
        return np.random.choice(a = self.words, size=size, p = self.probabilites)

    def get_index(self, size=None):
        return np.random.choice(a = self.indices, size=size, p = self.probabilites)

In [39]:
def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

In [40]:
class Network:
    def __init__(self, vec_size, vocabulary_size):
        self.vec_size = vec_size
        self.vocabulary_size = vocabulary_size

        self.W_word = np.random.rand(vocabulary_size, vec_size) * 10 - 5
        self.W_context = np.random.rand(vocabulary_size, vec_size) * 10 - 5

        self.W_word = torch.from_numpy(self.W_word).float().to(device="cuda")
        self.W_context = torch.from_numpy(self.W_context).float().to(device="cuda")

    def forward(self, x):
        v = x @ self.W_word

        z = v @ self.W_context.T

        return v, z


$$
\begin{aligned}
\frac{\partial}{\partial v_x} \log{\sigma(v_c \cdot v_x)}
&= \frac{1}{\sigma(v_c \cdot v_x)} \cdot \frac{\partial}{\partial v_x} \sigma(v_c \cdot v_x) \\
&= \frac{1}{\sigma(v_c \cdot v_x)} \cdot \sigma(v_c \cdot v_x)(1-\sigma(v_c \cdot v_x)) \cdot \frac{\partial}{\partial v_x} (v_c \cdot v_x) \\
&= (1-\sigma(v_c \cdot v_x)) \cdot v_c
\end{aligned}
$$

In [41]:
def log_sigmoid_derivative(vc, vx):    
    return (1 - sigmoid(torch.dot(vc, vx))) * vc


In [42]:
vec_size = 50
k = 5
learning_rate = 0.003
num_epochs = 5

sampler = NegativeSampler(data[:,1])
net = Network(vec_size=vec_size, vocabulary_size=len(unique_words))


for epoch in range(num_epochs):
    for i in tqdm(range(len(data))):
        word, context = data[i]
        word_index = word_to_index[word]
        context_index = word_to_index[word]

        u_c = net.W_context[context_index]
        v_w = net.W_word[word_index]

        random_contexts_index = sampler.get_index(k)
        
        u_rs = net.W_context[random_contexts_index]

        W_word_grad = torch.zeros_like(net.W_word)
        W_context_grad = torch.zeros_like(net.W_context)

        W_word_grad[word_index] += -log_sigmoid_derivative(u_c, v_w) - sum([log_sigmoid_derivative(-ur, v_w) for ur in u_rs])
        W_context_grad[context_index] += -log_sigmoid_derivative(v_w, u_c)

        for rci, u_r in zip(random_contexts_index, u_rs):
            W_context_grad[rci] -= log_sigmoid_derivative(-v_w, u_r)

        net.W_word -= learning_rate * W_word_grad
        net.W_context -= learning_rate * W_context_grad

  0%|          | 17699/5525116 [01:00<5:16:14, 290.25it/s]


KeyboardInterrupt: 

In [25]:
f = open('task1_w2v_vectors.txt', "w")

word_count = 0

for i, word in enumerate(unique_words):
    if re.search("[_]", word) == None:
        word_count += 1


f.write(str(word_count) + " " + str(vec_size) + "\n")

for i, word in enumerate(unique_words):
    if re.search("[_]", word) == None:
        # print(i, word, net.W_word[i])
        # print(np.array2string(net.W_word[i] , precision=5, separator=" ")[1:-1])
        line = word + " " + " ".join([str(n)[:7] for n in net.W_word[i]]) + "\n"
        f.write(line)

f.close()

In [26]:

task1_wv = KeyedVectors.load_word2vec_format('task1_w2v_vectors.txt', binary=False)

example_english_words = ['dog', 'dragon', 'love', 'bicycle', 'marathon', 'logic', 'butterfly']  # replace, or add your own examples
example_polish_words = ['pies', 'smok', 'miłość', 'rower', 'maraton', 'logika', 'motyl']

example_words = example_polish_words

for w0 in example_words:
    print ('WORD:', w0)
    for w, v in task1_wv.most_similar(w0):
        print ('   ', w, v)
    print ()

WORD: pies
    niepodważalny 0.6319450736045837
    underground 0.545934796333313
    półfinalistka 0.5200155377388
    madrygał 0.5080426931381226
    popularność 0.4967035949230194
    jarzmo 0.4937137961387634
    pięciokrotny 0.4878771901130676
    nadtlenek 0.4850349724292755
    rozsypywanie 0.46927425265312195
    genetyzm 0.46138328313827515

WORD: smok
    rod 0.607578694820404
    znieważanie 0.5199453830718994
    szofer 0.4913886785507202
    ślub 0.47276371717453003
    gildia 0.4708929657936096
    dociskanie 0.4677221179008484
    zły 0.46679234504699707
    barak 0.45267611742019653
    teoretyk 0.44642752408981323
    biodro 0.43907320499420166

WORD: miłość
    stłumienie 0.5130264163017273
    rock 0.5128613114356995
    beret 0.48549214005470276
    lechia 0.4771725535392761
    przeszkolenie 0.47533541917800903
    zdyskontowanie 0.46027833223342896
    gekon 0.45985713601112366
    tresura 0.4564157724380493
    rozkład 0.45604395866394043
    rowerowy 0.448988974

## Task 2 (4 points)

Your task is to train the embeddings for Simple Wikipedia titles, using gensim library. As the example below shows, training is really simple:

```python
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
```
*sentences* can be a list of list of tokens, you can also use *gensim.models.word2vec.LineSentence(source)* to create restartable iterator from file. At first, use [this file] containing such pairs of titles, that one article links to another.

We say that two titles are *related* if they both contain a word (or a word bigram) which is not very popular (it occurs only in several titles). Make this definition more precise, and create the corpora which contains pairs of related titles. Make a mixture of the original corpora, and the new one, then train title vectors again.

Compare these two approaches using similar code to the code from Task 1.

In [2]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec, word2vec
# from gensim.utils import simple_preprocess
import gensim.utils as utils
import re
from collections import Counter
from itertools import permutations


In [5]:
# The cell for your presentation
wiki_links = word2vec.LineSentence("./data/simple.wiki.links.txt")
# model = Word2Vec(sentences=wiki_links, vector_size=100, window=2, min_count=1, workers=16)
# model.save("wiki_v1.model")

In [5]:

model = Word2Vec.load("wiki_v1.model")

example_words = ["adolf_hitler", "white", "black", "dog", "67th_academy_awards",]

for w0 in example_words:
    print ('WORD:', w0)
    for w, v in model.wv.most_similar(w0):
        print ('   ', w, v)
    print ()

WORD: adolf_hitler
    joseph_stalin 0.9806583523750305
    pope_john_paul_ii 0.9689143896102905
    benito_mussolini 0.9687402248382568
    napoleon 0.9680567383766174
    josip_broz_tito 0.9678819179534912
    berlin_wall 0.9678562879562378
    nazi_party 0.9673212766647339
    fascism 0.9652743339538574
    paul_von_hindenburg 0.964741051197052
    armistice 0.9642995595932007

WORD: white
    yellow 0.9943046569824219
    ear 0.9888944029808044
    fur 0.9888084530830383
    plastic 0.9881877899169922
    brown 0.9872660040855408
    parasite 0.9870258569717407
    rabbit 0.9870242476463318
    ox 0.9865660667419434
    cow 0.9858281016349792
    ceramic 0.9856430888175964

WORD: black
    face 0.9959166049957275
    gorilla 0.9953488707542419
    chimpanzee 0.9950746893882751
    cow 0.9947810769081116
    symptom 0.994511067867279
    force 0.9944021105766296
    ground 0.994140088558197
    system 0.9939526915550232
    brass 0.9938763380050659
    colour 0.9938370585441589

WOR

In [6]:

wiki_links_file = open("./data/simple.wiki.links.txt", "r")

splited_links = re.split(r"[_ \n(),:]", wiki_links_file.read())

splited_links = list(filter(None, splited_links))

word_count = Counter(splited_links)

rare_words = [word for (word, count) in word_count.items() if count <= 3 and count >= 2]


In [7]:
wiki_titles = set([title for link in wiki_links for title in link])

In [8]:
titles_with_word = {}

for title in wiki_titles:
    words_in_title = re.split(r"[_ \n(),:]", title)
    for word in words_in_title:
        if word not in titles_with_word:
            titles_with_word[word] = [title]
        else:
            titles_with_word[word].append(title)

In [11]:
titles_with_rare_word = { word: titles for word, titles in titles_with_word.items() if len(titles) <= 3 and len(titles) >= 2 }

In [15]:
print(list(titles_with_rare_word.values())[:20])

[['wikt:anhydrous', 'anhydrous'], ['shah_qotb_ol_din', 'shah_qotb_ol_din_heydar'], ['feline_zoonosis', 'zoonosis'], ['file:exit_1,_shuanglian_station_20201017.jpeg', 'file:exit_2,_shuanglian_station_20210701.jpeg', 'shuanglian_metro_station'], ['thornton_cleveleys', 'cleveleys', 'blackpool_north_and_cleveleys_(uk_parliament_constituency)'], ["lauri_ingman's_first_cabinet", "lauri_ingman's_second_cabinet"], [':category:discoveries_by_wilhelm_dieckvoß', ':de:wilhelm_dieckvoß', 'wilhelm_dieckvoß'], [':wikt:equip', 'wikt:equip', ':ca:plantilla:equip_irc'], ['lobsang_tenzin', 'lobsang_sangay'], ['gpu', 'mali_(gpu)'], ['felician_of_foligno', 'primus_and_felician', 'felician_college'], ['erinn_bartlett', 'erinn_hayes'], ['grat_coalition', 'grat'], ['user:peterdownunder/cyberbullying', 'user_talk:peterdownunder/cyberbullying'], ['veyrières,_cantal', 'veyrières,_corrèze'], ['special:contributions/2804:18:1082:690b:140b:77ca:e9c0:905b', 'user_talk:2804:18:1082:690b:140b:77ca:e9c0:905b'], ['speci

In [24]:

related_titles = [pair for rel_titles in titles_with_rare_word.values() for pair in permutations(rel_titles) ]

In [29]:
wiki_links_and_related = list(wiki_links) + related_titles

In [31]:
# The cell for your presentation
model_related = Word2Vec(sentences=wiki_links_and_related, vector_size=100, window=2, min_count=1, workers=16)
model_related.save("wiki_v2.model")

In [39]:
# for i in range(0, 4):
#     print(f"{names[i] : <10}{marks[i] : ^10}{div[i] : ^10}{id[i] : >5}")

for w0 in example_words:
    print ('WORD:', w0)
    for (w1, v1), (w2, v2) in zip(model.wv.most_similar(w0), model_related.wv.most_similar(w0)):
        print(f"   {w1 : <30}{v1 : <30}{w2 : <30}{v2 : <5}")
    print ()

WORD: adolf_hitler
   joseph_stalin                 0.9806583523750305            joseph_stalin                 0.9785109162330627
   pope_john_paul_ii             0.9689143896102905            pope_john_paul_ii             0.9718990921974182
   benito_mussolini              0.9687402248382568            benito_mussolini              0.9714155793190002
   napoleon                      0.9680567383766174            european_commission           0.9713098406791687
   josip_broz_tito               0.9678819179534912            irish_republican_army         0.9681921005249023
   berlin_wall                   0.9678562879562378            new_year                      0.9675443172454834
   nazi_party                    0.9673212766647339            weimar_republic               0.9669830799102783
   fascism                       0.9652743339538574            louis_xiv_of_france           0.9668290615081787
   paul_von_hindenburg           0.964741051197052             ho_chi_minh           

# Task 3 (4 points)

Suppose that we have two languages: Upper and Lower. This is an example Upper sentence:

<pre>
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
</pre>

And this is its translation into Lower:

<pre>
the quick brown fox jumps over the lazy dog
</pre>

You have two corpora for these languages (with different sentences). Your task is to train word embedings for both languages together, so as to make embeddings of the words which are its translations as close as possible. But unfortunately, you have the budget which allows you to prepare the translation only for 1000 words (we call it D, you have to deside which words you want to be in D)

Prepare the corpora wich contains three kind of sentences:
* Upper corpus sentences
* Lower corpus sentences
* sentences derived from Upper/Lower corpus, modified using D

There are many possible ways of doing this, for instance this one (ROT13.COM: hfr rirel fragrapr sebz obgu pbecben gjvpr: jvgubhg nal zbqvsvpngvbaf, naq jvgu rirel jbeqf sebz Q ercynprq ol vgf genafyngvba)

We define the score for an Upper WORD as  $\frac{1}{p}$, where $p$ is a position of its translation in the list of **Lower** words most similar to WORD. For instance, when most similar words to DOG are:

<pre>
WOLF, CAT, WOLVES, LION, gopher, dog
</pre>

then the score for the word DOG is 0.5. Compute the average score separately for words from D, and for words out of D (hint: if the computation takes to much time do it for a random sample).


In [3]:
def file_to_corpus(path):
    f = open(path, "r")
    return [list(utils.simple_tokenize(line)) for line in f.readlines()]

def file_to_word_count(path):
    f = open(path, "r")
    tokens = list(utils.simple_tokenize(f.read()))
    return Counter(tokens)

In [4]:
lower_corpus = file_to_corpus("./data/task3_polish_lower.txt")
upper_corpus = file_to_corpus("./data/task3_polish_upper.txt")
lower_words_count = file_to_word_count("./data/task3_polish_lower.txt")
upper_words_count = file_to_word_count("./data/task3_polish_upper.txt")

In [5]:
lower_words_sorted = sorted(lower_words_count.items(), key=lambda x: x[1], reverse=True)
upper_words_sorted = sorted(upper_words_count.items(), key=lambda x: x[1], reverse=True)

In [49]:
all_words = set(list(lower_words_count.keys()) + list(map(lambda w: w.lower(), upper_words_count.keys())))
all_words_count = {word: lower_words_count.get(word, 0) + upper_words_count.get(word.upper(), 0) for word in all_words}
translated_words = sorted(all_words_count.keys(), key=lambda w: all_words_count[w], reverse=True)[:1000]
non_translated_words = sorted(all_words_count.keys(), key=lambda w: all_words_count[w], reverse=True)[1000:]

In [7]:
translations = {w: w.upper() for w in translated_words}
translations.update({w.upper(): w for w in translated_words})

In [11]:
def translate_sentence(sentence, translations):
    return [translations[word] if word in translations.keys() else word for word in sentence]

def translate_corpus(corpus, translations):
    return [translate_sentence(sentence, translations) for sentence in corpus]

def translate_sentence_2(sentence, translations):
    if sentence == []:
        return [[]]
    
    head = sentence[0]
    tail = sentence[1:]

    translated_tail = translate_sentence_2(tail, translations)

    result = list(map(lambda suffix: [head] + suffix, translated_tail))

    if head in translations:
        result += list(map(lambda suffix: [translations[head]] + suffix, translated_tail))
        
    return result

def translate_corpus_2(corpus, translations):
    return [ts for sentence in corpus for ts in translate_sentence_2(sentence, translations)]

def translate_sentence_3(sentence, translations):
    result = []
    for i, word in enumerate(sentence):
        if word in translations:
            result.append(sentence[:i] + [translations[word]] + sentence[i+1:])
    return result

def translate_corpus_3(corpus, translations):
    return [ts for sentence in corpus for ts in translate_sentence_3(sentence, translations)]

        

test_corpus = [["A"], ["A", "B"], ["A", "B", "C"]]
test_translations = {"A": "a", "B": "b"}

# translate_corpus_2(test_corpus, test_translations)
translate_sentence_3(["A", "C", "B", "D"], test_translations)

[['a', 'C', 'B', 'D'], ['A', 'C', 'b', 'D']]

In [13]:
train_corpus = upper_corpus + lower_corpus + translate_corpus_3(upper_corpus, translations) + translate_corpus_3(lower_corpus, translations)

In [18]:
model_task_3 = Word2Vec(sentences=train_corpus, vector_size=100, window=5, min_count=1, workers=16)

In [19]:
model_task_3.save("task_3.model")

In [52]:
def word_score(model, word):
    translate = lambda x: x.lower()
    case_pred = lambda x: x[0].islower()

    if word.islower():
        translate = lambda x: x.upper()
        case_pred = lambda x: x[0].isupper()
    
    try:
        rank = model.wv.rank(word, translate(word))
    except Exception:
        return 0
    most_similar = model.wv.most_similar(word, topn=rank)

    return 1/len(list(filter(case_pred, most_similar)))

In [54]:
# model_task_3 = Word2Vec.load("task_3.model")

print("Average scores:")
print(f"Words from D: {np.mean([word_score(model_task_3, word) for word in translations.keys()])}")
words_to_test = np.random.choice(non_translated_words, size=2000, replace=False)
print(f"Words outside D: {np.mean([word_score(model_task_3, word) for word in words_to_test])}")

Average scores:
Words from D: 0.99975
Words outside D: 0.3557967452567467


# Task 4 (4 points)

In this task you are asked to do two things:
1. compare the embeddings computed on small corpus (like Brown Corpus , see: <https://en.wikipedia.org/wiki/Brown_Corpus>) with the ones coming from Google News Corpus
2. Try to use other resourses like WordNet to enrich to corpus, and obtain better embeddings

You can use the following code snippets:

```python
# printing tokenized Brown Corpora
from nltk.corpus import brown
for s in brown.sents():
    print(*s)
    
#iterating over all synsets in WordNet
from nltk.corpus import wordnet as wn

for synset_type in 'avrns': # n == noun, v == verb, ...
    for synset in list(wn.all_synsets(synset_type)))[:10]:
        print (synset.definition())
        print (synset.examples())
        print ([lem.name() for lem in synset.lemmas()])
        print (synset.hyperonims()) # nodes 1 level up in ontology
        
# loading model and compute cosine similarity between words

model = Word2Vec.load('models/w2v.wordnet5.model') 
print (model.wv.similarity('dog', 'cat'))
```

Embeddings will be tested using WordSim-353 dataset, the code showing the quality is in the cell below. Prepare the following corpora:
1. Tokenized Brown Corpora
2. Definitions and examples from Princeton WordNet
3. (1) and (2) together
4. (3) enriched with pseudosentences containing (a subset) of WordNet knowledge (such as 'tiger is a carnivore')

Train 4 Word2Vec models, and raport Spearman correletion between similarities based on your vectors, and similarities based on human judgements.



In [None]:
# Code for computing correlation between W2V similarity, and human judgements

import gensim.downloader
from scipy.stats import spearmanr

gn = gensim.downloader.load('word2vec-google-news-300')

for similarity_type in ['relatedness', 'similarity']:
    ws353 = []
    for x in open(f'wordsim_{similarity_type}_goldstandard.txt'): 
        a,b,val = x.split()
        val = float(val)
        ws353.append( (a,b,val))
    # spearmanr returns 2 vallues: correlation and pval. pval should be close to zero
    print (similarity_type + ':', spearmanr(vals, ys)) 