### Assignment 4

**Submission deadlines**:

* get at least 4 points by Tuesday, 12.05.2022
* remaining points: last lab session before or on Tuesday, 19.05.2022

**Points:** Aim to get 12 out of 15+ possible points

All needed data files are on Drive: <https://drive.google.com/drive/folders/1HaMbhzaBxxNa_z_QJXSDCbv5VddmhVVZ?usp=sharing> (or will be soon :) )

## Task 1 (5 points)

Implement a simplified word2vec with negative sampling from scratch (using pure numpy). Assume that objects and contexts are given explicitly in the training data, one pair per line, with the object on the left. The result of the training should be object vectors. Please write them to a file using a *natural* text format, i.e.

<pre>
word1 x1_1 x1_2 ... x1_N 
word2 x2_1 x2_2 ... x2_N
...
wordK xK_1 xK_2 ... xK_N
</pre>
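
A minimal sketch of writing and reading this text format, assuming the vectors are kept in a plain dict from word to numpy array. The helper names `save_vectors` and `load_vectors` are illustrative and not part of the assignment. Note that gensim's `load_word2vec_format` additionally expects a first line with the vocabulary size and the dimension, which the export cell later in this notebook writes.

```python
import io
import numpy as np

def save_vectors(vectors, out_file):
    """Write one 'word x_1 ... x_N' line per word."""
    for word, vec in vectors.items():
        values = ' '.join(f'{x:.6f}' for x in vec)
        out_file.write(f'{word} {values}\n')

def load_vectors(in_file):
    """Read the same format back into a dict of numpy arrays."""
    vectors = {}
    for line in in_file:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

# Toy vectors, made up for the round trip; an in-memory buffer stands in for a file.
toy = {'pies': np.array([0.1, -0.2, 0.3]), 'kot': np.array([0.0, 0.5, -1.0])}
buffer = io.StringIO()
save_vectors(toy, buffer)
buffer.seek(0)
restored = load_vectors(buffer)
```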

Use the loss from Slide 3 in Lecture NLP.2 and compute the gradient manually. You may use gradient clipping or regularisation.
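
A manually derived gradient is easy to get wrong, so it may help to compare it with finite differences on a tiny random example. The sketch below assumes the standard negative-sampling loss for one (object, context) pair with k negative objects, $-\log\sigma(o \cdot c) - \sum_i \log\sigma(-n_i \cdot c)$; the loss on the lecture slide may differ in details. The names `sgns_loss`, `sgns_grads` and `numeric_grad` are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(obj, ctx, negs):
    # Assumed loss: -log s(obj . ctx) - sum_i log s(-negs_i . ctx)
    return -np.log(sigmoid(obj @ ctx)) - np.sum(np.log(sigmoid(-(negs @ ctx))))

def sgns_grads(obj, ctx, negs):
    pos = 1.0 - sigmoid(obj @ ctx)   # scalar weight of the positive pair
    neg = sigmoid(negs @ ctx)        # shape (k,), equals 1 - s(-negs_i . ctx)
    grad_obj = -pos * ctx
    grad_ctx = -pos * obj + neg @ negs
    grad_negs = np.outer(neg, ctx)
    return grad_obj, grad_ctx, grad_negs

def numeric_grad(loss, x, eps=1e-6):
    # Central finite differences; perturbs x in place and restores it.
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        plus = loss()
        x[idx] = old - eps
        minus = loss()
        x[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
obj, ctx, negs = rng.normal(size=4), rng.normal(size=4), rng.normal(size=(3, 4))
loss = lambda: sgns_loss(obj, ctx, negs)
analytic = sgns_grads(obj, ctx, negs)
numeric = tuple(numeric_grad(loss, x) for x in (obj, ctx, negs))
max_error = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
```

If `max_error` is not close to zero, either the loss or one of the gradient formulas has a sign or indexing mistake.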

**Remark**: the data is specially prepared to make the learning process easier.
Present the vectors using the code below. In this task we define success as 'obtaining a result which definitely does not look random'.



In [2]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from tqdm.auto import tqdm

In [2]:
tqdm.pandas()

In [3]:
data_pdf = pd.read_csv('task1_objects_contexts_polish.txt', sep=' ', names=['object', 'context'])

In [4]:
object_encoder = LabelEncoder()
data_pdf['object'] = object_encoder.fit_transform(data_pdf['object'])

context_encoder = LabelEncoder()
data_pdf['context'] = context_encoder.fit_transform(data_pdf['context'])

In [5]:
class Sampler:
    def __init__(self, data_pdf):
        self.positive_samples = data_pdf.groupby('context').apply(lambda pdf: set(pdf['object'].unique()))
        self.object_distribution = np.power(data_pdf['object'].value_counts(), 3/4)
        self.all_samples = set(data_pdf['object'].unique())

    def sample(self, context, k):
        negative_samples = list(self.all_samples - self.positive_samples[context])
        negative_distribution = self.object_distribution[negative_samples]
        negative_distribution /= np.sum(negative_distribution)
        return np.random.choice(negative_distribution.index, size=k, replace=False, p=negative_distribution.values)

In [5]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [6]:
def object_grad(object_embedding, context_embedding):
    return -(1-sigmoid(np.dot(object_embedding, context_embedding)))*context_embedding

def context_grad(object_embedding, context_embedding, negative_embeddings):
    return (-(1-sigmoid(object_embedding@context_embedding))*object_embedding
            + np.sum((1-sigmoid(-context_embedding.reshape(1, -1) @ negative_embeddings.T))*negative_embeddings.T, axis=1))

In [8]:
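def loss_fun(object_embedding, context_embedding, negative_embeddings):
    # Negative-sampling loss: the negative term needs the log as well,
    # so that it matches the gradients in object_grad and context_grad.
    return (-np.log(sigmoid(object_embedding @ context_embedding))
            - np.sum(np.log(sigmoid(-context_embedding.reshape(1, -1) @ negative_embeddings.T))))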

In [16]:
class Word2Vec:
    def __init__(self, lr: float, k: int, embeddings_size: int):
        self.lr = lr
        self.k = k
        self.object_embeddings = np.random.rand(len(data_pdf['object'].unique()), embeddings_size)
        self.context_embeddings = np.random.rand(len(data_pdf['context'].unique()), embeddings_size)

    def step(self, object, context, samples):
        negative_objects = np.array(samples)
        self.object_embeddings[object] -= self.lr*object_grad(self.object_embeddings[object], self.context_embeddings[context])
        self.context_embeddings[context] -= self.lr*context_grad(self.object_embeddings[object], self.context_embeddings[context], self.object_embeddings[negative_objects])
        return loss_fun(self.object_embeddings[object], self.context_embeddings[context], self.object_embeddings[negative_objects])

    def run_epoch(self, data, log_freq, distribution):
        negative_samples = np.random.choice(distribution.index, p=distribution.values, size=(len(data), self.k))
        iter = 0
        loss = 0
        for object, context in tqdm(data):
            loss += self.step(object, context, negative_samples[iter])
            iter += 1
            if log_freq is not None and iter % log_freq == 0:
                print('loss: ', loss/iter)

    def fit(self, data_pdf, num_epochs, log_freq = 1000000):
        distribution = np.power(data_pdf['object'].value_counts(), 3/4)
        distribution /= sum(distribution)
        for epoch in range(num_epochs):
            print(epoch)
            self.run_epoch(data_pdf.values, log_freq, distribution)

In [27]:
word2vec = Word2Vec(0.01, 5, 100)
word2vec.fit(data_pdf, 20)

0


  0%|          | 0/5525116 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [23]:
embeddings = word2vec.object_embeddings
embeddings_pdf = pd.DataFrame(embeddings)
embeddings_pdf.index = object_encoder.inverse_transform(embeddings_pdf.index)

In [25]:
with open('task1_w2v_vectors.txt', 'w') as out_file:
    out_file.write(f'{len(embeddings_pdf)} {len(embeddings_pdf.columns)}\n')
    embeddings_pdf.to_csv(out_file, header=False, index=True, sep=' ')

In [26]:
from gensim.models import KeyedVectors
task1_wv = KeyedVectors.load_word2vec_format('task1_w2v_vectors.txt', binary=False)

example_english_words = ['dog', 'dragon', 'love', 'bicycle', 'marathon', 'logic', 'butterfly']  # replace, or add your own examples
example_polish_words = ['pies', 'smok', 'miłość', 'rower', 'maraton', 'logika', 'motyl']

example_words = example_polish_words

for w in example_words:
    print ('WORD:', w)
    for w, v in task1_wv.most_similar(w):
        print ('   ', w, v)
    print ()

WORD: pies
    piaśnik 0.6282005906105042
    humoreska 0.6244557499885559
    marnotrawca 0.6106693148612976
    piśmienność 0.610500156879425
    konkieta 0.6069074869155884
    przodownica 0.6068059802055359
    herling 0.605897068977356
    pleciuga 0.6058119535446167
    wyścigówka 0.6022748947143555
    pokemon 0.5997951626777649

WORD: smok
    smużka 0.7996348738670349
    rozniesienie 0.7966349720954895
    nieporównywalność 0.7896068096160889
    brzegówka 0.7860226035118103
    brzeźnica 0.7849063277244568
    kardiochirurgia 0.7835286855697632
    kleń 0.7822569608688354
    zrywność 0.7816804647445679
    kanelura 0.7801304459571838
    sandor 0.7800078988075256

WORD: miłość
    przeszczep 0.4740453362464905
    włościaństwo 0.46320226788520813
    byt 0.45898422598838806
    kabul 0.4547436833381653
    zrównoważanie 0.4523521363735199
    belka 0.4435618221759796
    wyminięcie 0.44016098976135254
    owca 0.440153568983078
    kulka 0.4397164583206177
    poderwanie 0.

## Task 2 (4 points)

Your task is to train embeddings for Simple Wikipedia titles using the gensim library. As the example below shows, training is really simple:

```python
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
```
*sentences* can be a list of lists of tokens; you can also use *gensim.models.word2vec.LineSentence(source)* to create a restartable iterator from a file. To begin with, use [this file], which contains pairs of titles such that one article links to the other.
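
Training makes several passes over the corpus (one to build the vocabulary, then one per epoch), which is why the iterator has to be restartable: a plain generator would be exhausted after the first pass. The sketch below illustrates the idea with a hypothetical `RestartableLines` class; it is a stand-in written for this example and not gensim's `LineSentence` implementation.

```python
import os
import tempfile

class RestartableLines:
    """Iterable that reopens the file on every pass (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as in_file:
            for line in in_file:
                tokens = line.split()
                if tokens:
                    yield tokens

# Toy link file: each line is a pair of titles, one linking to the other.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('mickey_mouse minnie_mouse\nflower leaf\n')
    toy_path = tmp.name

sentences = RestartableLines(toy_path)
first_pass = list(sentences)
second_pass = list(sentences)  # a plain generator would be empty here
os.remove(toy_path)
```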

We say that two titles are *related* if they both contain a word (or a word bigram) which is not very popular (it occurs in only a few titles). Make this definition more precise, and create a corpus which contains pairs of related titles. Mix the original corpus with the new one, then train the title vectors again.
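
One way to make the notion of *related* titles precise is sketched below, under the assumption that a word or bigram counts as "not very popular" when it occurs in at least 2 and at most 30 titles (the same bounds the solution cells in this notebook use for single words). The function names and the toy titles are illustrative.

```python
from collections import Counter, defaultdict
from itertools import combinations

def title_features(title):
    """Words and word bigrams of a title whose words are joined by '_'."""
    words = title.split('_')
    bigrams = [f'{a}_{b}' for a, b in zip(words, words[1:])]
    return set(words) | set(bigrams)

def related_pairs(titles, min_titles=2, max_titles=30):
    """Pairs of titles sharing a feature found in [min_titles, max_titles] titles."""
    counts = Counter(f for t in titles for f in title_features(t))
    groups = defaultdict(list)
    for title in titles:
        for feature in title_features(title):
            if min_titles <= counts[feature] <= max_titles:
                groups[feature].append(title)
    pairs = set()  # a set, so titles sharing several features are paired once
    for group in groups.values():
        pairs.update(combinations(sorted(group), 2))
    return sorted(pairs)

toy_titles = ['tandem_bicycle', 'bicycle_pump', 'sulfur_dioxide',
              'carbon_dioxide', 'flower']
toy_pairs = related_pairs(toy_titles)
```

The upper bound matters in practice: a feature shared by n titles produces n(n-1)/2 pairs, so very frequent words would dominate the corpus.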

Compare these two approaches using code similar to that from Task 1.

In [3]:
import pandas as pd
import numpy as np

from itertools import combinations
from collections import defaultdict
from gensim.models import Word2Vec

In [3]:
wiki_links_pdf = pd.read_csv('task2_simple.wiki.links.txt', sep=' ', names=['object', 'context'])
# The cell for your presentation

In [4]:
wiki_titles = set(np.hstack((wiki_links_pdf['object'], wiki_links_pdf['context'])))

In [6]:
word_counts = pd.Series(list(wiki_titles)).apply(lambda title: str(title).split('_')).explode().value_counts()

In [10]:
rare_words = word_counts[(word_counts > 1) & (word_counts <= 30)].index

In [11]:
articles = defaultdict(list)
wiki_titles = list(wiki_titles)
rare_words = set(rare_words)

for it, title in tqdm(enumerate(wiki_titles)):
    for word in list(set(str(title).split('_')) & rare_words):
        articles[word].append(it)

0it [00:00, ?it/s]

In [13]:
with open('task2_related_wiki_links.txt', 'w') as out_file:
    for vals in tqdm(articles.values()):
        for pair in combinations(vals, 2):
            out_file.write(f'{wiki_titles[pair[0]]} {wiki_titles[pair[1]]}\n')

  0%|          | 0/204926 [00:00<?, ?it/s]

In [17]:
example_titles = [
    'capital_city',
    'flower',
    'mickey_mouse',
    'finance',
    'japan_national_under-23_football_team',
    'john_the_apostle',
    'boss_(gaming)',
    'the_wealth_of_nations',
    'sulfur_dioxide',
    'tandem_bicycle',
]

In [27]:
model = Word2Vec(corpus_file='task2_simple.wiki.links.txt', vector_size=100, min_count=1, workers=4, epochs=20)

In [28]:
task2_wv = model.wv
task2_wv.save('task2_simple_wiki_links_w2v_vectors.txt')
example_words = example_titles

for w in example_words:
    print ('WORD:', w)
    for w, v in task2_wv.most_similar(w):
        print ('   ', w, v)
    print ()

WORD: capital_city
    population_density 0.739251434803009
    province 0.7069327235221863
    plain 0.701899528503418
    oceanic_climate 0.6916329264640808
    köppen_climate_classification 0.6897584199905396
    list_of_capital_cities_by_altitude 0.6814950704574585
    category:departments_of_paraguay 0.6782993674278259
    above_sea_level 0.6735855340957642
    category:departments_of_the_republic_of_the_congo 0.6711230874061584
    hill 0.669350266456604

WORD: flower
    asteraceae 0.8791819214820862
    leaf 0.8778598308563232
    berry 0.8747152090072632
    vine 0.8712254166603088
    seed 0.8700944781303406
    perennial 0.8616876602172852
    tail 0.8591261506080627
    fruit 0.856979489326477
    sweet_(taste) 0.8559210300445557
    lime 0.8551871180534363

WORD: mickey_mouse
    minnie_mouse 0.9010185599327087
    goofy 0.8845751285552979
    bugs_bunny 0.8688761591911316
    donald_duck 0.8686318397521973
    daffy_duck 0.857617974281311
    pluto_(disney) 0.854737699031

In [24]:
model = Word2Vec(corpus_file='task2_related_wiki_links.txt', vector_size=100, min_count=1, workers=4, epochs=20)

In [26]:
task2_wv = model.wv
task2_wv.save('task2_related_wiki_links_w2v_vectors.txt')
example_words = example_titles

for w in example_words:
    print ('WORD:', w)
    try:
        for w, v in task2_wv.most_similar(w):
            print ('   ', w, v)
    except KeyError:
        print(f'{w} not found in the training set')
    print ()

WORD: capital_city
capital_city not found in the training set

WORD: flower
flower not found in the training set

WORD: mickey_mouse
mickey_mouse not found in the training set

WORD: finance
finance not found in the training set

WORD: japan_national_under-23_football_team
japan_national_under-23_football_team not found in the training set

WORD: john_the_apostle
john_the_apostle not found in the training set

WORD: boss_(gaming)
    first-person_(gaming) 0.974440336227417
    item_(gaming) 0.97352534532547
    match_(gaming) 0.9721232056617737
    ken_williams_(gaming) 0.971487283706665
    deathmatch_(gaming) 0.9703531861305237
    spam_(gaming) 0.966529130935669
    kana_kitahara 0.9655419588088989
    stalling_(gaming) 0.9654951691627502
    first_person_(gaming) 0.9649636149406433
    toa_alta,_puerto_rico 0.9643204212188721

WORD: the_wealth_of_nations
    wealth_management 0.9905750751495361
    morgan_stanley_wealth_management 0.9884107112884521
    list_of_countries_by_distrib

In [25]:
import shutil
with open('task2_combined_wiki_links.txt', 'wb') as out_file:
    for file in ['task2_related_wiki_links.txt', 'task2_simple.wiki.links.txt' ]:
        with open(file, 'rb') as in_file:
            shutil.copyfileobj(in_file, out_file)

In [29]:
model = Word2Vec(corpus_file='task2_combined_wiki_links.txt', vector_size=100, min_count=1, workers=4, epochs=20)

In [41]:
task2_wv = model.wv
task2_wv.save('task2_combined_wiki_links_w2v_vectors.txt')
example_words = example_titles

for w in example_words:
    print('WORD:', w)
    try:
        for w, v in task2_wv.most_similar(w):
            print('   ', w, v)
    except KeyError:
        print(f'{w} not found in the training set')
    print()


NameError: name 'example_titles' is not defined

## Task 3 (4 points)

Suppose that we have two languages: Upper and Lower. This is an example Upper sentence:

<pre>
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
</pre>

And this is its translation into Lower:

<pre>
the quick brown fox jumps over the lazy dog
</pre>

You have two corpora for these languages (with different sentences). Your task is to train word embeddings for both languages together, so that the embeddings of words which are translations of each other are as close as possible. Unfortunately, your budget allows you to prepare translations for only 1000 words (we call this set D; you have to decide which words you want to be in D).

Prepare a corpus which contains three kinds of sentences:
* Upper corpus sentences
* Lower corpus sentences
* sentences derived from Upper/Lower corpus, modified using D

There are many possible ways of doing this, for instance this one (ROT13.COM: hfr rirel fragrapr sebz obgu pbecben gjvpr: jvgubhg nal zbqvsvpngvbaf, naq jvgu rirel jbeqf sebz Q ercynprq ol vgf genafyngvba)
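
One possible construction (not necessarily the hinted one) is sketched below. It assumes that D is simply the most frequent Lower words and that translating a word means switching its case; both are assumptions made for this example, and the toy corpora are made up. The dictionary size is set to 1 only to keep the example readable, whereas the real budget is 1000.

```python
from collections import Counter

def choose_dictionary(sentences, size):
    """Pick the `size` most frequent words as the dictionary D."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word for word, _ in counts.most_common(size)}

def swap_words(sentence, dictionary, translate):
    """Replace every word that is in the dictionary by its translation."""
    return ' '.join(translate(w) if w in dictionary else w
                    for w in sentence.split())

lower_corpus = ['the quick brown fox', 'the lazy dog sleeps']
upper_corpus = ['THE DOG JUMPS', 'A FOX SLEEPS']

D = choose_dictionary(lower_corpus, size=1)
D_upper = {w.upper() for w in D}

# Sentences derived from both corpora, modified using D.
mixed = ([swap_words(s, D, str.upper) for s in lower_corpus]
         + [swap_words(s, D_upper, str.lower) for s in upper_corpus])
training_corpus = lower_corpus + upper_corpus + mixed
```

The mixed sentences place a translated word in the context of the other language, which is what ties the two embedding spaces together.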

We define the score for an Upper WORD as $\frac{1}{p}$, where $p$ is the position of its translation in the list of **Lower** words most similar to WORD. For instance, when the words most similar to DOG are:

<pre>
WOLF, CAT, WOLVES, LION, gopher, dog
</pre>

then the score for the word DOG is 0.5. Compute the average score separately for words from D and for words outside D (hint: if the computation takes too much time, do it for a random sample).
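
The score can be sketched as a reciprocal rank over the Lower words only. The helper `translation_score` is illustrative; it assumes that the case of a word identifies its language and that a translation missing from the list scores 0.

```python
def translation_score(word, ranked_neighbours):
    """Return 1/p, where p is the 1-based rank of word.lower() among Lower words."""
    lower_only = [w for w in ranked_neighbours if w.islower()]
    translation = word.lower()
    if translation not in lower_only:
        return 0.0  # assumption: a missing translation scores zero
    return 1.0 / (lower_only.index(translation) + 1)

# The example from the task statement: 'gopher' is the first Lower word,
# 'dog' is the second, so p = 2.
neighbours = ['WOLF', 'CAT', 'WOLVES', 'LION', 'gopher', 'dog']
dog_score = translation_score('DOG', neighbours)
```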


In [43]:
import pandas as pd
import numpy as np
import shutil

from sklearn.cluster import KMeans
from collections import defaultdict
from gensim.models import Word2Vec
from tqdm.auto import tqdm

In [2]:
with open('task3_polish_lower.txt', 'r') as in_file:
    lower_sentences = in_file.read()

In [3]:
lower_sentences = lower_sentences.split('\n')

In [4]:
lower_word_counts = pd.Series(lower_sentences).apply(lambda sentence: str(sentence).split(' ')).explode().value_counts()

In [5]:
model = Word2Vec(corpus_file='task3_polish_lower.txt', vector_size=100, min_count=1, workers=4, epochs=20)

In [6]:
task3_wv = model.wv

example_words = ['pies', 'smok', 'miłość', 'rower', 'maraton', 'logika', 'motyl']

for w in example_words:
    print ('WORD:', w)
    for w, v in task3_wv.most_similar(w):
        print ('   ', w, v)
    print ()


WORD: pies
    kot_kota 0.6890381574630737
    kot 0.68598473072052
    koń 0.6662598848342896
    zwierzyć_zwierzę 0.6409239172935486
    dziewczyna 0.6219024658203125
    ubranie 0.6122429966926575
    dzieciak 0.6119948625564575
    chłopak 0.6075925230979919
    chłopiec 0.6048930883407593
    pokarm_pokarmić 0.6029188632965088

WORD: smok
    słoń 0.6976077556610107
    demon 0.6848234534263611
    byk 0.6831806302070618
    bogini 0.6762536764144897
    kontur 0.6749581098556519
    szatan 0.6743872165679932
    płomień 0.6717569231987
    potwór 0.6615457534790039
    anioł 0.661408007144928
    posąg 0.6563248038291931

WORD: miłość
    bóg 0.753804087638855
    dobroć 0.7141699194908142
    dusza 0.706268310546875
    przyjaźnić_przyjaźń 0.6991156935691833
    wiara 0.6935740113258362
    zbawienie 0.6916703581809998
    mądrość 0.6714519262313843
    uczucie 0.6710386872291565
    zmartwychwstanie_zmartwychwstać 0.6666459441184998
    życzliwość 0.6620675325393677

WORD: rowe

In [7]:
lower_kmeans = KMeans(n_clusters=100)
preds = lower_kmeans.fit_predict(task3_wv.vectors)

In [8]:
preds_pdf = pd.DataFrame(preds, columns=['cluster'])

In [9]:
preds_pdf.index = task3_wv.index_to_key

In [10]:
preds_pdf = preds_pdf.join(lower_word_counts.rename('count'))

In [11]:
cluster_representatives_pdf = preds_pdf.groupby('cluster').apply(lambda pdf: pdf.sort_values('count', ascending=False).iloc[:10])

In [12]:
cluster_representatives_pdf

| cluster (index) | word (index) | cluster | count |
|---|---|---|---|
| 0 | śmierć | 0 | 1360 |
| 0 | św. | 0 | 1119 |
| 0 | ojciec | 0 | 1038 |
| 0 | matka | 0 | 1012 |
| 0 | krzyż | 0 | 969 |
| ... | ... | ... | ... |
| 99 | niemcy_niemiec | 99 | 1348 |
| 99 | zjednoczony | 99 | 1217 |
| 99 | iii | 99 | 1159 |
| 99 | zachodni | 99 | 1150 |


In [13]:
representatives = [idx[1] for idx in cluster_representatives_pdf.index]

In [29]:
representatives = list(filter(lambda word: word.isalnum() and not word.isnumeric(), representatives))

In [31]:
repr_sentences = defaultdict(list)
representatives = set(representatives)

In [36]:
with open('task3_representatives.txt', 'w') as out_file:
    out_file.write(' '.join(representatives))

In [32]:
for sentence in lower_sentences:
    sentence_words = set(sentence.split(' '))
    for word in list(sentence_words & representatives):
        repr_sentences[word].append(sentence)


In [35]:
with open('task3_lower_to_upper.txt', 'w') as out_file:
    for representative, sentences in tqdm(repr_sentences.items(), total=len(repr_sentences)):
        for sentence in sentences:
            sentence_words = sentence.split(' ')
            for idx, word in enumerate(sentence_words):
                if word == representative:
                    sentence_words[idx] = word.upper()
            out_file.write(' '.join(sentence_words) + '\n')


  0%|          | 0/836 [00:00<?, ?it/s]

In [37]:
with open('task3_polish_upper.txt', 'r') as in_file:
    upper_sentences = in_file.read()
upper_sentences = upper_sentences.split('\n')


In [39]:
representatives_upper = set([representative.upper() for representative in list(representatives)])
repr_sentences = defaultdict(list)

for sentence in upper_sentences:
    sentence_words = set(sentence.split(' '))
    for word in list(sentence_words & representatives_upper):
        repr_sentences[word].append(sentence)

In [40]:
with open('task3_upper_to_lower.txt', 'w') as out_file:
    for representative, sentences in tqdm(repr_sentences.items(), total=len(repr_sentences)):
        for sentence in sentences:
            sentence_words = sentence.split(' ')
            for idx, word in enumerate(sentence_words):
                if word == representative:
                    sentence_words[idx] = word.lower()
            out_file.write(' '.join(sentence_words) + '\n')


  0%|          | 0/836 [00:00<?, ?it/s]

In [44]:
with open('task3_corpus.txt', 'w') as out_file:
    for file in ['task3_polish_lower.txt', 'task3_polish_upper.txt', 'task3_lower_to_upper.txt', 'task3_upper_to_lower.txt']:
        with open(file, 'r') as in_file:
            shutil.copyfileobj(in_file, out_file)

In [93]:
model = Word2Vec(corpus_file='task3_corpus.txt', vector_size=100, min_count=1, workers=4, epochs=20)

In [94]:
model.save('task3.model')

In [53]:
upper_words = list(pd.Series(upper_sentences).apply(lambda sentence: str(sentence).split(' ')).explode().unique())

In [67]:
upper_words = np.array([word for word in upper_words if len(word) > 0])

In [63]:
representatives = list(representatives)

In [91]:
def evaluate(word: str, model: Word2Vec):
    distances = model.wv.distances(word, other_words=list(upper_words))
    try:
        position = np.argwhere(upper_words[np.argsort(distances)] == word.upper())[0][0]
    except:
        print(word)
        return 0
    return 1 / (position+1)

In [95]:
scores = []
for word in tqdm(representatives):
    scores.append(evaluate(word, model))

print(np.mean(scores))


  0%|          | 0/836 [00:00<?, ?it/s]

0.8958234206104013


In [82]:
lower_words = lower_word_counts.index

In [83]:
lower_words = np.array([word for word in lower_words if len(word) > 0])

In [96]:
scores = []
for word in tqdm(np.random.choice(lower_words, 5000, replace=False)):
    scores.append(evaluate(word, model))

print(np.mean(scores))


  0%|          | 0/5000 [00:00<?, ?it/s]

without
leonida
andriej
satellite
geschichte
puerta
plamisty
sagem
notecią
okolo
attack
a5
vel
tkacz
come
being
maurice
survey
does
powiedzial
risk
change
triple
cid
since
carla
duza
highway
antoni_antonim
erwin
0.1675091176190198


In [None]:

scores = []
for word in D:
    if word.islower():
        scores.append(evaluate(word, model_corpus))
print(np.array(scores).mean())


## Task 4 (4 points)

In this task you are asked to do two things:
1. Compare the embeddings computed on a small corpus (like the Brown Corpus, see: <https://en.wikipedia.org/wiki/Brown_Corpus>) with the ones coming from the Google News Corpus.
2. Try to use other resources like WordNet to enrich the corpus and obtain better embeddings.

You can use the following code snippets:

```python
# printing the tokenized Brown Corpus
from nltk.corpus import brown
for s in brown.sents():
    print(*s)

# iterating over all synsets in WordNet
from nltk.corpus import wordnet as wn

for synset_type in 'avrns':  # n == noun, v == verb, ...
    for synset in list(wn.all_synsets(synset_type))[:10]:
        print(synset.definition())
        print(synset.examples())
        print([lem.name() for lem in synset.lemmas()])
        print(synset.hypernyms())  # nodes 1 level up in the ontology

# loading a model and computing cosine similarity between words
from gensim.models import Word2Vec

model = Word2Vec.load('models/w2v.wordnet5.model')
print(model.wv.similarity('dog', 'cat'))
```

Embeddings will be tested using the WordSim-353 dataset; the code showing the quality is in the cell below. Prepare the following corpora:
1. Tokenized Brown Corpus
2. Definitions and examples from Princeton WordNet
3. (1) and (2) together
4. (3) enriched with pseudosentences containing (a subset of) WordNet knowledge (such as 'tiger is a carnivore')
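
Pseudosentences for item 4 can be sketched as below. To keep the example self-contained, a tiny hand-written hypernym map stands in for WordNet; in the real task these relations would come from the WordNet hypernym links. The template "X is a Y" follows the example in the list, and all names are illustrative.

```python
def article(noun):
    """Pick 'a' or 'an' with a simple first-letter heuristic."""
    return 'an' if noun[0] in 'aeiou' else 'a'

def pseudosentences(hypernyms):
    """Turn word -> list of hypernyms into tokenized 'word is a hypernym' sentences."""
    sentences = []
    for word, parents in hypernyms.items():
        for parent in parents:
            sentences.append([word, 'is', article(parent), parent])
    return sentences

# Hypothetical toy relations standing in for WordNet knowledge.
toy_hypernyms = {
    'tiger': ['carnivore'],
    'carnivore': ['animal'],
    'oak': ['tree'],
}
toy_sentences = pseudosentences(toy_hypernyms)
```

The token lists have the same shape as tokenized corpus sentences, so they can be appended to the training corpus directly.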

Train 4 Word2Vec models, and report the Spearman correlation between similarities based on your vectors and similarities based on human judgements.
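
For intuition, the Spearman correlation is the Pearson correlation of the ranks, so only the ordering of the similarities matters, not their scale. The numpy-only sketch below shows this; the helper names are illustrative and the two score lists are made up for the example. In practice `scipy.stats.spearmanr` does the same job.

```python
import numpy as np

def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind='stable')
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    return float(np.corrcoef(average_ranks(xs), average_ranks(ys))[0, 1])

# Made-up numbers: human scores on a 0-10 scale, model scores as cosines.
human = [7.35, 10.0, 1.3, 5.0]
model = [0.76, 0.99, 0.05, 0.40]
rho = spearman(human, model)  # both lists order the pairs identically
```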



In [None]:
# Code for computing correlation between W2V similarity, and human judgements

import gensim.downloader
from scipy.stats import spearmanr

gn = gensim.downloader.load('word2vec-google-news-300')

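for similarity_type in ['relatedness', 'similarity']:
    ws353 = []
    with open(f'wordsim_{similarity_type}_goldstandard.txt') as in_file:
        for x in in_file:
            a, b, val = x.split()
            ws353.append((a, b, float(val)))
    # Assumed meaning of vals and ys: human judgements and model similarities.
    # Pairs with an out-of-vocabulary word are skipped; replace gn with your own vectors.
    known = [(a, b, val) for a, b, val in ws353 if a in gn and b in gn]
    vals = [val for _, _, val in known]
    ys = [gn.similarity(a, b) for a, b, _ in known]
    # spearmanr returns 2 values: correlation and pval. pval should be close to zero
    print(similarity_type + ':', spearmanr(vals, ys))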