### Assigment 4

**Submission deadlines**:

* get at least 4 points by Tuesday, 12.05.2022
* remaining points: last lab session before or on Tuesday, 19.05.2022

**Points:** Aim to get 12 out of 15+ possible points

All needed data files are on Drive: <https://drive.google.com/drive/folders/1HaMbhzaBxxNa_z_QJXSDCbv5VddmhVVZ?usp=sharing> (or will be soon :) )

## Task 1 (5 points)

Implement simplified word2vec with negative sampling from scratch (using pure numpy). Assume that in the training data objects and contexts are given explicitly, one pair per line, and objects are on the left. The result of the training should be object vectors. Please, write them to a file using *natural* text format, ie

<pre>
word1 x1_1 x1_2 ... x1_N 
word2 x2_1 x2_2 ... x2_N
...
wordK xK_1 xK_2 ... xk_N
</pre>

Use the loss from Slide 3 in Lecture NLP.2, compute the gradient manually. You can use some gradient clipping, or regularisation. 

**Remark**: the data is specially prepared to make the learning process easier. 
Present vectors using the code below. In this task we define success as 'obtaining a result which looks definitely not random'


In [2]:
from collections import defaultdict

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tqdm.auto import tqdm

In [2]:
def sigmoid(x):
  return 1 / (1 + np.exp(-x))

In [3]:
class Word2Vec:

    def __init__(self, data, dim, lr):

        self._data = data
        self._object_encoder = LabelEncoder()
        self._data['object'] = self._object_encoder.fit_transform(data['object'])
        self._context_encoder = LabelEncoder()
        self._data['context'] = self._context_encoder.fit_transform(data['context'])
        self._objects = np.unique(data[['object']].values)
        self._contexts = np.unique(data[['context']].values)
        self._vocab_count = None
        self._vocab_prob = None
        self._positives = None
        self._calculate_sampling_prob()
        self._prepare_positives()

        self._lr = lr
        self._dim = dim
        self._Wo = np.random.rand(len(self._objects), dim)
        self._Wc = np.random.rand(len(self._contexts), dim)

    def _calculate_sampling_prob(self):

        self._vocab_count = defaultdict(int)

        for word in self._data['object']:
            self._vocab_count[word] += 1

        norm = sum([freq**(3/4) for freq in self._vocab_count.values()])
        self._vocab_prob = {word: (freq** (3 / 4) / norm) for word, freq in self._vocab_count.items()}

    def _prepare_positives(self):

        self._positives = defaultdict(set)
        for _, row in tqdm(self._data.iterrows()):
            self._positives[row.context] |= {row.object}

    def update_object_embedding(self, word_idx, gradient):

        self._Wo[word_idx, :] -= self._lr * gradient

    def update_context_embedding(self, word_idx, gradient):

        self._Wc[word_idx, :] -= self._lr * gradient

    def get_negatives_sample(self, k):

        negatives = np.random.choice(list(self._vocab_prob.keys()), size=(len(self._data), k), p=list(self._vocab_prob.values()))

        return negatives

    def step(self, idx, negatives):
        word_idx, context_idx = self._data.iloc[idx]
        negatives = list(set(negatives) - set(self._positives[context_idx]))

        loss = -np.log(sigmoid(self._Wo[word_idx, :] @ self._Wc[context_idx, :])) - np.sum(np.log(sigmoid(-self._Wo[negatives, :] @ self._Wc[context_idx, :])))
        derivative_object = -self._Wc[context_idx, :]*(1 - sigmoid(self._Wo[word_idx, :] @ self._Wc[context_idx, :]))
        derivative_context = -self._Wo[word_idx, :]*(1 - sigmoid(self._Wo[word_idx, :] @ self._Wc[context_idx, :])) \
                             + np.sum(self._Wo[negatives, :].T * (1 - sigmoid(-self._Wc[context_idx, :].reshape(1, -1) @ self._Wo[negatives, :].T)), axis=1)

        self.update_object_embedding(word_idx, derivative_object)
        self.update_context_embedding(context_idx, derivative_context)

        return loss

    def save_embeddings(self, path):
        with open(path, 'w') as f:
            f.write(f'{str(len(self._objects))} {str(self._dim)} \n')
            for word in self._objects:
                embedding = self._Wo[word]
                f.write(f'{self._object_encoder.inverse_transform([word])[0]} {" ".join(list(map(str, list(embedding))))}  \n')

In [4]:
data = pd.read_csv('task1_objects_contexts_polish.txt', sep=' ', names=['object', 'context'])

In [5]:
w = Word2Vec(data, 20, 0.01)

0it [00:00, ?it/s]

In [6]:
for epoch in range(20):
    negatives = w.get_negatives_sample(5)
    print(f"EPOCH {epoch}")
    losses = []
    for i in tqdm(range(len(negatives))):
        loss = w.step(i, negatives[i])
        losses.append(loss)
    print(np.mean(np.array(losses)))

EPOCH 0


  0%|          | 0/5525116 [00:00<?, ?it/s]

6.07505048998839
EPOCH 1


  0%|          | 0/5525116 [00:00<?, ?it/s]

4.205103797833125
EPOCH 2


  0%|          | 0/5525116 [00:00<?, ?it/s]

4.208122511127881
EPOCH 3


  0%|          | 0/5525116 [00:00<?, ?it/s]

4.15037952256908
EPOCH 4


  0%|          | 0/5525116 [00:00<?, ?it/s]

4.102408981545655
EPOCH 5


  0%|          | 0/5525116 [00:00<?, ?it/s]

4.0664201420036274
EPOCH 6


  0%|          | 0/5525116 [00:00<?, ?it/s]

4.037111534128647
EPOCH 7


  0%|          | 0/5525116 [00:00<?, ?it/s]

4.015244475996828
EPOCH 8


  0%|          | 0/5525116 [00:00<?, ?it/s]

3.99764462159789
EPOCH 9


  0%|          | 0/5525116 [00:00<?, ?it/s]

3.9853046416755697
EPOCH 10


  0%|          | 0/5525116 [00:00<?, ?it/s]

3.976390358595205
EPOCH 11


  0%|          | 0/5525116 [00:00<?, ?it/s]

3.9686737359713704
EPOCH 12


  0%|          | 0/5525116 [00:00<?, ?it/s]

3.964321558658121
EPOCH 13


  0%|          | 0/5525116 [00:00<?, ?it/s]

3.962011747205926
EPOCH 14


  0%|          | 0/5525116 [00:00<?, ?it/s]

3.960365536200353
EPOCH 15


  0%|          | 0/5525116 [00:00<?, ?it/s]

3.9600800851822706
EPOCH 16


  0%|          | 0/5525116 [00:00<?, ?it/s]

3.9605015391904077
EPOCH 17


  0%|          | 0/5525116 [00:00<?, ?it/s]

3.961706487544347
EPOCH 18


  0%|          | 0/5525116 [00:00<?, ?it/s]

3.963314780543266
EPOCH 19


  0%|          | 0/5525116 [00:00<?, ?it/s]

3.9665752325498493


In [7]:
w.save_embeddings('task1_w2v_vectors.txt')

In [11]:
from gensim.models import KeyedVectors
task1_wv = KeyedVectors.load_word2vec_format('task1_w2v_vectors.txt', binary=False)

example_english_words = ['dog', 'dragon', 'love', 'bicycle', 'marathon', 'logic', 'butterfly']  # replace, or add your own examples
example_polish_words = ['pies', 'smok', 'miłość', 'rower', 'maraton', 'logika', 'motyl']

example_words = example_polish_words

for w0 in example_words:
    print ('WORD:', w0)
    for w, v in task1_wv.most_similar(w0):
        print ('   ', w, v)
    print ()

WORD: pies
    koń 0.9517124891281128
    kot 0.9100876450538635
    krowa 0.9069613814353943
    zwierzę 0.8956605195999146
    baba 0.8795665502548218
    dziewczyna 0.8694310784339905
    mężczyzna 0.8668525815010071
    ptak 0.8655155301094055
    chłopiec 0.8645903468132019
    facet 0.8518747687339783

WORD: smok
    potwór 0.8994752764701843
    niedźwiedź 0.896720826625824
    dziewczę 0.896420419216156
    kot 0.8909311890602112
    maluch 0.8871726989746094
    panienka 0.8804367780685425
    dzieciak 0.877396285533905
    mucha 0.8752036690711975
    wiedźma 0.8732087016105652
    tygrys 0.869754433631897

WORD: miłość
    wiara 0.8628854155540466
    duch 0.7667196393013
    prawda 0.766288161277771
    bóg 0.760909914970398
    wyobraźnia 0.7606624364852905
    młodość 0.7587139010429382
    radość 0.7577516436576843
    zło 0.7552950382232666
    uczucie 0.7470588088035583
    przyjaźń 0.7453605532646179

WORD: rower
    motocykl 0.9155734777450562
    wózek 0.90378886461

## Task 2 (4 points)

Your task is to train the embeddings for Simple Wikipedia titles, using gensim library. As the example below shows, training is really simple:

```python
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
```
*sentences* can be a list of list of tokens, you can also use *gensim.models.word2vec.LineSentence(source)* to create restartable iterator from file. At first, use [this file] containing such pairs of titles, that one article links to another.

We say that two titles are *related* if they both contain a word (or a word bigram) which is not very popular (it occurs only in several titles). Make this definition more precise, and create the corpora which contains pairs of related titles. Make a mixture of the original corpora, and the new one, then train title vectors again.

Compare these two approaches using similar code to the code from Task 1.

In [1]:
# The cell for your presentation
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from tqdm.notebook import tqdm
import itertools
import shutil
import pandas as pd

tqdm.pandas()

In [3]:
wiki = pd.read_csv('simple_wiki_links.txt', sep=' ', names=['object', 'context'])

In [4]:
titles = list(set(map(str, list(wiki.object) + list(wiki.context))))

In [5]:
titles_series = pd.Series(titles)

In [6]:
word_counts = titles_series.apply(lambda row: str(row).split('_')).explode().value_counts()

In [7]:
rare_words = set(word_counts[(word_counts <= 20) & (word_counts > 1)].index)
rare_words = set([w for w in rare_words if len(w) > 3])

In [8]:
len(rare_words)

193022

In [9]:
index = defaultdict(list)
for idx, title in enumerate(titles):
    for rare_word in list(set(title.split('_')) & rare_words):
        index[rare_word].append(idx)

In [10]:
pairs = set()
for titles_list in tqdm(index.values()):
    current_pairs = itertools.combinations(titles_list, 2)
    for pair in current_pairs:
        pairs.add(pair)

  0%|          | 0/193022 [00:00<?, ?it/s]

In [11]:
with open('task2_rare_words_corpus.txt', 'w') as f:
    for title_1, title_2 in pairs:
        f.write(f'{titles[title_1]} {titles[title_2]} \n')

In [12]:
with open('task2_merged_corpus.txt','wb') as wfd:
    for f in ['task2_rare_words_corpus.txt','simple_wiki_links.txt']:
        with open(f,'rb') as fd:
            shutil.copyfileobj(fd, wfd)

In [13]:
raw_data = LineSentence('simple_wiki_links.txt')
model_1 = Word2Vec(sentences=raw_data, vector_size=50, epochs=20, window=2, min_count=1, workers=8)

In [14]:
corpus_data = LineSentence('task2_merged_corpus.txt')
model_2 = Word2Vec(sentences=corpus_data, vector_size=50, epochs=20, window=2, min_count=1, workers=8)

In [15]:
rare_data = LineSentence('task2_rare_words_corpus.txt')
model_3 = Word2Vec(sentences=corpus_data, vector_size=50, epochs=20, window=2, min_count=1, workers=8)

In [19]:
model_1.save('task2_model1')
model_2.save('task2_model2')
model_3.save('task2_model3')

In [18]:
example_titles = ["hyperinflation",
                  "kevin_spacey",
                  "animal",
                  "catholicism",
                  "nonprofit_organization",
                  "critters_(film)",
                  "radosław_wojtaszek",
                  'spotted_sandpiper',
                  'celts_druid']
models = [model_1, model_2, model_3]
for word in example_titles:
    print(f'WORD: {word}')
    for i, model in enumerate(models):
        print(f"MODEL {i}: ")
        try:
            similar = model.wv.most_similar(word, topn=5)
        except Exception:
            similar = []
        for similar_word, score in similar:
            print(similar_word)
    print()

WORD: hyperinflation
MODEL 0: 
world_war_i_reparations
mv_rachel_corrie
kristallnacht
diktat
kurt_schumacher
MODEL 1: 
allied_powers_of_world_war_i
herbert_backe
putsch
planned_economy
attac_(organization)
MODEL 2: 
reichsbürger_movement
world_war_i_reparations
may_coup
nsdap_25_points_manifesto
asylum_seeker

WORD: kevin_spacey
MODEL 0: 
holly_hunter
oliver_stone
matthew_broderick
jodie_foster
ray_liotta
MODEL 1: 
holly_hunter
jack_nicholson
oliver_stone
jodie_foster
ed_harris
MODEL 2: 
oliver_stone
holly_hunter
chicago_film_critics_association_awards_1990
jodie_foster
ray_liotta

WORD: animal
MODEL 0: 
insect
species
mammal
chordate
arthropod
MODEL 1: 
species
insect
mammal
chordate
arthropod
MODEL 2: 
species
insect
mammal
arthropod
teeth

WORD: catholicism
MODEL 0: 
protestantism
protestant
roman_catholicism
evangelicalism
christian_denomination
MODEL 1: 
lutheranism
protestantism
protestant
christian_denomination
evangelicalism
MODEL 2: 
protestantism
protestant
lutheranism
evange

In [17]:
rare_words

{'brickley',
 'olekminsky',
 'venediktov',
 'mcilrath',
 'marlène',
 'huna',
 'beekmantown,',
 'fitzhugh',
 'khana',
 'conrado',
 'verfeil,',
 'torbjörn',
 'križevci,',
 'file:h&amp;c',
 'cockroach',
 'burce',
 'crittall',
 'osmers',
 'hurford',
 'ozon,',
 "slash's",
 'category:k-pop',
 '1501',
 'verreaux',
 'vendt',
 'template:airport',
 'hickel',
 'takahito',
 'claudel',
 'taulé',
 'ojibwa',
 'slobozia',
 'garlands',
 'category:1437',
 'yari',
 'diatomic',
 'file:1950',
 'percent/doc',
 'stringfield',
 'image:eshtaol',
 'zhabotinsky',
 "l'aigle",
 'staring',
 '(horseracing)',
 'morang',
 'hallelujah,',
 'passat',
 'daladier',
 'troupe',
 '639-3',
 'file:tudor',
 'lieberknecht',
 'calais,',
 'sideline',
 'lacquerware',
 'romblon',
 'redoubt',
 '1933)',
 'consilience',
 'multiethnic',
 'juillé,',
 'buckenham',
 'downloads',
 'aketi',
 'ponto',
 'taref',
 'cuellar',
 'ustaad',
 'zizzo',
 'wp:irc',
 'template:cooper',
 'zorica',
 'lenox,',
 'template:second',
 'yenice,',
 'krishianis',
 

# Task 3 (4 points)

Suppose that we have two languages: Upper and Lower. This is an example Upper sentence:

<pre>
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
</pre>

And this is its translation into Lower:

<pre>
the quick brown fox jumps over the lazy dog
</pre>

You have two corpora for these languages (with different sentences). Your task is to train word embedings for both languages together, so as to make embeddings of the words which are its translations as close as possible. But unfortunately, you have the budget which allows you to prepare the translation only for 1000 words (we call it D, you have to deside which words you want to be in D)

Prepare the corpora wich contains three kind of sentences:
* Upper corpus sentences
* Lower corpus sentences
* sentences derived from Upper/Lower corpus, modified using D

There are many possible ways of doing this, for instance this one (ROT13.COM: hfr rirel fragrapr sebz obgu pbecben gjvpr: jvgubhg nal zbqvsvpngvbaf, naq jvgu rirel jbeqf sebz Q ercynprq ol vgf genafyngvba)

We define the score for an Upper WORD as  $\frac{1}{p}$, where $p$ is a position of its translation in the list of **Lower** words most similar to WORD. For instance, when most similar words to DOG are:

<pre>
WOLF, CAT, WOLVES, LION, gopher, dog
</pre>

then the score for the word DOG is 0.5. Compute the average score separately for words from D, and for words out of D (hint: if the computation takes to much time do it for a random sample).


In [3]:
from sklearn.cluster import KMeans

with open('task3_polish_lower.txt', 'r') as f:
    lower = f.readlines()
with open('task3_polish_upper.txt', 'r') as f:
    upper = f.readlines()

In [4]:
def preprocess(corpus):
    processed_corpus = []
    for sentence in corpus:
        processed_sentence = " ".join([w for w in sentence.strip().split(" ") if w.isalpha()])
        processed_corpus.append(processed_sentence)
    return processed_corpus

In [5]:
lower = preprocess(lower)
upper = preprocess(upper)

In [6]:
lower_tokens = [w.split(" ") for w in lower]
upper_tokens = [w.split(" ") for w in upper]

In [7]:
model_lower = Word2Vec(sentences=lower_tokens, vector_size=50, epochs=20, window=5, min_count=1, workers=4)

In [8]:
lower_vectors = model_lower.wv.vectors
lower_clusters = KMeans(n_clusters=100, random_state=0).fit_predict(lower_vectors)
lower_df = pd.DataFrame({"words": model_lower.wv.index_to_key, "cluster": lower_clusters})
lower_counts = pd.Series(np.concatenate(lower_tokens)).value_counts()
lower_df = lower_df.merge(lower_counts.rename("counts"), left_on="words", right_index=True)

In [9]:
representatives = lower_df.groupby(['cluster']).apply(lambda group: group.sort_values('counts', ascending=False).iloc[:10]).words

In [10]:
representatives

cluster      
0        652            ziemia
         796       miejscowość
         835               las
         871      powierzchnia
         898           gatunek
                     ...      
99       787        zwolnienie
         856         dochodowy
         1017            zakup
         1024            zwrot
         1116    opodatkowanie
Name: words, Length: 994, dtype: object

In [11]:
D_lower = set(representatives)
D_upper = set(pd.Series(map(lambda x: x.upper(), representatives)))

In [12]:
corpus = []
for sentence in lower:
    corpus.append(sentence)
    sentence_set = set(sentence.split(" "))
    in_D = sentence_set & D_lower
    for to_translate in in_D:
        translated = []
        for w in sentence.split(" "):
            if w == to_translate:
                translated.append(w.upper())
            else:
                translated.append(w)
        translated = " ".join(translated)
        if translated != sentence:
            corpus.append(translated)

for sentence in upper:
    corpus.append(sentence)
    sentence_set = set(sentence.split(" "))
    in_D = sentence_set & D_upper
    for to_translate in in_D:
        translated = []
        for w in sentence.split(" "):
            if w == to_translate:
                translated.append(w.lower())
            else:
                translated.append(w)
        translated = " ".join(translated)
        if translated != sentence:
            corpus.append(translated)

In [15]:
with open('translated_corpus.txt', 'w') as f:
    f.write('\n'.join(corpus))

In [20]:
corpus_data = LineSentence('translated_corpus.txt')
model_corpus = Word2Vec(sentences=corpus_data , vector_size=50, epochs=20, window=5, min_count=1, workers=4)

In [24]:
model_corpus.save('task3_model')

In [38]:
def evaluate_lower(word, model):

    similar = model.wv.most_similar(word, topn=1000)
    i = 1
    for w, s in similar:
        if word == w.lower():
            return 1 / i
        if w.isupper():
            i += 1
    return 0

def evaluate_upper(word, model):

    similar = model.wv.most_similar(word, topn=1000)
    i = 1
    for w, s in similar:
        if word == w.upper():
            return 1 / i
        if w.islower():
            i += 1
    return 0

In [39]:
scores = []
for word in D_lower:
    if word:
        scores.append(evaluate_lower(word, model_corpus))
print(np.array(scores).mean())

0.9931466776851585


In [43]:
scores = []
for word in D_upper:
    if word:
        scores.append(evaluate_upper(word, model_corpus))
print(np.array(scores).mean())

0.9956719483568076


In [57]:
words_sample = lower_df.words
for word in words_sample:
    if word:
        scores.append(evaluate_lower(word, model_corpus))
print(np.array(scores).mean())

0.4878803285718434


In [58]:
words_sample = list(set(np.concatenate(upper_tokens)))
for word in words_sample:
    if word:
        scores.append(evaluate_upper(word, model_corpus))
print(np.array(scores).mean())

0.46002020090028484


# Task 4 (4 points)

In this task you are asked to do two things:
1. compare the embeddings computed on small corpus (like Brown Corpus , see: <https://en.wikipedia.org/wiki/Brown_Corpus>) with the ones coming from Google News Corpus
2. Try to use other resourses like WordNet to enrich to corpus, and obtain better embeddings

You can use the following code snippets:

```python
# printing tokenized Brown Corpora
from nltk.corpus import brown
for s in brown.sents():
    print(*s)
    
#iterating over all synsets in WordNet
from nltk.corpus import wordnet as wn

for synset_type in 'avrns': # n == noun, v == verb, ...
    for synset in list(wn.all_synsets(synset_type)))[:10]:
        print (synset.definition())
        print (synset.examples())
        print ([lem.name() for lem in synset.lemmas()])
        print (synset.hyperonims()) # nodes 1 level up in ontology
        
# loading model and compute cosine similarity between words

model = Word2Vec.load('models/w2v.wordnet5.model') 
print (model.wv.similarity('dog', 'cat'))
```

Embeddings will be tested using WordSim-353 dataset, the code showing the quality is in the cell below. Prepare the following corpora:
1. Tokenized Brown Corpora
2. Definitions and examples from Princeton WordNet
3. (1) and (2) together
4. (3) enriched with pseudosentences containing (a subset) of WordNet knowledge (such as 'tiger is a carnivore')

Train 4 Word2Vec models, and raport Spearman correletion between similarities based on your vectors, and similarities based on human judgements.



In [10]:
# Code for computing correlation between W2V similarity, and human judgements

import gensim.downloader
from scipy.stats import spearmanr

gn = gensim.downloader.load('word2vec-google-news-300')

for similarity_type in ['relatedness', 'similarity']:
    ws353 = []
    for x in open(f'wordsim_{similarity_type}_goldstandard.txt'): 
        a,b,val = x.split()
        val = float(val)
        ws353.append( (a,b,val))
    # spearmanr returns 2 vallues: correlation and pval. pval should be close to zero
    print (similarity_type + ':', spearmanr(vals, ys)) 



FileNotFoundError: [Errno 2] No such file or directory: 'wordsim_relatedness_goldstandard.txt'