### Assignment 4

**Submission deadlines**:

* get at least 4 points by Tuesday, 12.05.2022
* remaining points: last lab session before or on Tuesday, 19.05.2022

**Points:** Aim to get 12 out of 15+ possible points

All needed data files are on Drive: <https://drive.google.com/drive/folders/1HaMbhzaBxxNa_z_QJXSDCbv5VddmhVVZ?usp=sharing> (or will be soon :) )

## Task 1 (5 points)

Implement a simplified word2vec with negative sampling from scratch (using pure numpy). Assume that in the training data objects and contexts are given explicitly, one pair per line, with the object on the left. The result of the training should be object vectors. Write them to a file in a *natural* text format, i.e.

<pre>
word1 x1_1 x1_2 ... x1_N 
word2 x2_1 x2_2 ... x2_N
...
wordK xK_1 xK_2 ... xK_N
</pre>
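A minimal sketch of writing vectors in this format, assuming the vocabulary is a list of words aligned with the rows of a vector matrix. The helper name `save_vectors`, the file name, and the 6-digit precision are hypothetical choices.

```python
import os
import tempfile

def save_vectors(path, words, vectors):
    """Write one 'word x_1 ... x_N' line per object (hypothetical helper)."""
    with open(path, 'w', encoding='utf-8') as out_file:
        for word, vector in zip(words, vectors):
            values = ' '.join(f'{x:.6f}' for x in vector)
            out_file.write(f'{word} {values}\n')

# Toy data; `vectors` may equally be a 2-D numpy array.
words = ['pies', 'kot']
vectors = [[0.1, -0.2, 0.3], [0.0, 0.5, -1.0]]

path = os.path.join(tempfile.gettempdir(), 'toy_vectors.txt')
save_vectors(path, words, vectors)

with open(path, encoding='utf-8') as in_file:
    lines = in_file.read().splitlines()
print(lines)
```

Note that the word2vec text format read by gensim's `load_word2vec_format` additionally expects a first line with the number of words and the vector size.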

Use the loss from Slide 3 in Lecture NLP.2 and compute the gradient manually. You may use gradient clipping or regularisation.
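A numerical check helps catch sign errors in hand-derived gradients. The sketch below assumes the loss is the standard negative-sampling objective $-\log\sigma(o\cdot c) - \sum_n \log\sigma(-c\cdot n)$ (the slide is not reproduced here, so treat this form as an assumption) and compares an analytic gradient with central finite differences. All function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_loss(o, c, negatives):
    # Assumed loss: -log s(o.c) - sum_n log s(-c.n)
    return -np.log(sigmoid(o @ c)) - np.sum(np.log(sigmoid(-(negatives @ c))))

def grad_c(o, c, negatives):
    # Analytic gradient of the assumed loss with respect to c
    positive = -(1.0 - sigmoid(o @ c)) * o
    negative = (sigmoid(negatives @ c)[:, None] * negatives).sum(axis=0)
    return positive + negative

def numeric_grad(f, x, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
o, c = rng.normal(size=4), rng.normal(size=4)
negatives = rng.normal(size=(3, 4))

analytic = grad_c(o, c, negatives)
numeric = numeric_grad(lambda v: ns_loss(o, v, negatives), c)
max_error = np.max(np.abs(analytic - numeric))
print(max_error)
```

If the two gradients disagree by much more than the finite-difference error, the derivation (or its implementation) most likely contains a mistake.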

**Remark**: the data is specially prepared to make the learning process easier. 
Present the vectors using the code below. In this task we define success as 'obtaining a result which looks definitely not random'.



In [96]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from tqdm.auto import tqdm

In [97]:
tqdm.pandas()

In [98]:
data_pdf = pd.read_csv('task1_objects_contexts_polish.txt', sep=' ', names=['object', 'context'])

In [99]:
object_encoder = LabelEncoder()
data_pdf['object'] = object_encoder.fit_transform(data_pdf['object'])

context_encoder = LabelEncoder()
data_pdf['context'] = context_encoder.fit_transform(data_pdf['context'])

In [100]:
class Sampler:
    def __init__(self, data_pdf):
        self.positive_samples = data_pdf.groupby('context').apply(lambda pdf: set(pdf['object'].unique()))
        self.object_distribution = np.power(data_pdf['object'].value_counts(), 3/4)
        self.all_samples = set(data_pdf['object'].unique())

    def sample(self, context, k):
        negative_samples = list(self.all_samples - self.positive_samples[context])
        negative_distribution = self.object_distribution[negative_samples]
        negative_distribution /= np.sum(negative_distribution)
        return np.random.choice(negative_distribution.index, size=k, replace=False, p=negative_distribution.values)
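The `Sampler` above draws negatives from object counts raised to the power 3/4. The toy sketch below, on made-up counts, shows the effect of that exponent: it flattens the distribution, so rare objects are sampled more often than their raw frequency suggests. Unlike `Sampler.sample`, it draws with replacement, which is an assumption made to keep the example short.

```python
import random
from collections import Counter

# Made-up object counts, for illustration only
counts = Counter({'pies': 81, 'kot': 16, 'smok': 1})
total = sum(counts.values())

raw = {w: n / total for w, n in counts.items()}
smoothed_weights = {w: n ** 0.75 for w, n in counts.items()}
norm = sum(smoothed_weights.values())
smoothed = {w: x / norm for w, x in smoothed_weights.items()}

for w in counts:
    print(f'{w}: raw={raw[w]:.3f} smoothed={smoothed[w]:.3f}')

# Drawing k negatives (with replacement) from the smoothed distribution
random.seed(0)
negatives = random.choices(list(smoothed), weights=list(smoothed.values()), k=5)
print(negatives)
```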

In [101]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [102]:
def object_grad(object_embedding, context_embedding):
    # Gradient of -log(sigmoid(o.c)) with respect to o: -(1 - sigmoid(o.c)) * c
    return -(1-sigmoid(np.dot(object_embedding, context_embedding)))*context_embedding

def context_grad(object_embedding, context_embedding, negative_embeddings):
    # Positive pair:    -(1 - sigmoid(o.c)) * o
    # Negative samples: sum_n (1 - sigmoid(-c.n)) * n
    return (-(1-sigmoid(object_embedding@context_embedding))*object_embedding
            + np.sum((1-sigmoid(-context_embedding.reshape(1, -1) @ negative_embeddings.T))*negative_embeddings.T, axis=1))

In [103]:
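def loss_fun(object_embedding, context_embedding, negative_embeddings):
    # Value logged during training. The negative-sample term sums
    # sigmoid(-c.n) without a log, so this is not the exact negative-sampling
    # loss -log(sigmoid(o.c)) - sum_n log(sigmoid(-c.n)) that the gradient
    # functions differentiate, and it can drop below zero (as in the logs below).
    return (-np.log(sigmoid(object_embedding @ context_embedding))
            - np.sum(sigmoid(-context_embedding.reshape(1, -1) @ negative_embeddings.T)))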

In [104]:
class Word2Vec:
    def __init__(self, lr: float, k: int, embeddings_size: int):
        self.lr = lr
        self.k = k
        self.object_embeddings = np.random.rand(len(data_pdf['object'].unique()), embeddings_size)
        self.context_embeddings = np.random.rand(len(data_pdf['context'].unique()), embeddings_size)

    def step(self, object, context, samples):
        negative_objects = np.array(samples)
        self.object_embeddings[object] -= self.lr*object_grad(self.object_embeddings[object], self.context_embeddings[context])
        self.context_embeddings[context] -= self.lr*context_grad(self.object_embeddings[object], self.context_embeddings[context], self.object_embeddings[negative_objects])
        return loss_fun(self.object_embeddings[object], self.context_embeddings[context], self.object_embeddings[negative_objects])

    def run_epoch(self, data, log_freq, distribution):
        negative_samples = np.random.choice(distribution.index, p=distribution.values, size=(len(data), self.k))
        iter = 0
        loss = 0
        for object, context in tqdm(data):
            loss += self.step(object, context, negative_samples[iter])
            iter += 1
            if log_freq is not None and iter % log_freq == 0:
                print('loss: ', loss/iter)

    def fit(self, data_pdf, num_epochs, log_freq = 1000000):
        distribution = np.power(data_pdf['object'].value_counts(), 3/4)
        distribution /= sum(distribution)
        for epoch in range(num_epochs):
            print(epoch)
            self.run_epoch(data_pdf.values, log_freq, distribution)

In [105]:
word2vec = Word2Vec(0.01, 5, 100)
word2vec.fit(data_pdf, 20)

0


  0%|          | 0/5525116 [00:00<?, ?it/s]

loss:  -0.976483554272937
loss:  -1.3400992543096206
loss:  -1.5321450409319193
loss:  -1.658324154624081
loss:  -1.7505253612235852
1


loss:  -2.24171407203101
loss:  -2.270901648276874
loss:  -2.2904629613822967
loss:  -2.3048811849044286
loss:  -2.315929483039553
2


loss:  -2.3927689334846476
loss:  -2.4005408104554
loss:  -2.403134796950425
loss:  -2.4043327410947035
loss:  -2.404318121539505
3


loss:  -2.4091351594575796
loss:  -2.4091184871845384
loss:  -2.40660682144957
loss:  -2.403637415284451
loss:  -2.4004625914182163
4


loss:  -2.38529273200978
loss:  -2.384298887248675
loss:  -2.3816417914161003
loss:  -2.379505673691447
loss:  -2.376775936987627
5


loss:  -2.365745142543159
loss:  -2.365264209055516
loss:  -2.362773496095009
loss:  -2.3605348128264025
loss:  -2.3581710022408773
6


loss:  -2.351082214021668
loss:  -2.350176746681713
loss:  -2.3482948966123516
loss:  -2.347176325622098
loss:  -2.3455854880115394
7


loss:  -2.341554906687945
loss:  -2.341195241428283
loss:  -2.3401067604251096
loss:  -2.339340450009273
loss:  -2.3385643535662677
8


loss:  -2.3378953289502684
loss:  -2.337793450528365
loss:  -2.337329117158638
loss:  -2.3373217386999716
loss:  -2.3368739804082472
9


loss:  -2.3382983613315256
loss:  -2.3388867824362958
loss:  -2.3384812736516216
loss:  -2.3383699275041407
loss:  -2.338142029682852
10


loss:  -2.342033898324023
loss:  -2.3426512288061803
loss:  -2.342438706678854
loss:  -2.342877583124321
loss:  -2.3430062163415735
11


loss:  -2.3476531092380117
loss:  -2.3484678671646053
loss:  -2.3486585503966606
loss:  -2.349268881553964
loss:  -2.3496627214495716
12


loss:  -2.3566118121149886
loss:  -2.3572747522531126
loss:  -2.3576562897153446
loss:  -2.3580801144250216
loss:  -2.358673227115826
13


loss:  -2.365580979058521
loss:  -2.3662356200597117
loss:  -2.367095737430774
loss:  -2.3678132062664488
loss:  -2.36843804281139
14


loss:  -2.376241827862562
loss:  -2.377153089262103
loss:  -2.377807766103135
loss:  -2.3786442734349085
loss:  -2.379277821616192
15


loss:  -2.3863342654820494
loss:  -2.387491426237623
loss:  -2.388458916139939
loss:  -2.3894602838052874
loss:  -2.390311416760087
16


loss:  -2.3975970855434103
loss:  -2.398661025792496
loss:  -2.3995538291099727
loss:  -2.400741819020991
loss:  -2.4016317781935426
17


loss:  -2.410046844944847
loss:  -2.4109064001713154
loss:  -2.4117589725494937
loss:  -2.412471475428429
loss:  -2.413321914258733
18


loss:  -2.4214263697881115
loss:  -2.421966135179359
loss:  -2.4233579476737837
loss:  -2.424311271507467
loss:  -2.425219977958236
19


loss:  -2.432611052232172
loss:  -2.433632762885281
loss:  -2.4350643908244107
loss:  -2.4361109515654737
loss:  -2.437166534018982


In [106]:
embeddings = word2vec.object_embeddings
embeddings_pdf = pd.DataFrame(embeddings)
embeddings_pdf.index = object_encoder.inverse_transform(embeddings_pdf.index)

In [107]:
with open('task1_w2v_vectors.txt', 'w') as out_file:
    out_file.write(f'{len(embeddings_pdf)} {len(embeddings_pdf.columns)}\n')
    embeddings_pdf.to_csv(out_file, header=False, index=True, sep=' ')

In [108]:
from gensim.models import KeyedVectors
task1_wv = KeyedVectors.load_word2vec_format('task1_w2v_vectors.txt', binary=False)

example_english_words = ['dog', 'dragon', 'love', 'bicycle', 'marathon', 'logic', 'butterfly']  # replace, or add your own examples
example_polish_words = ['pies', 'smok', 'miłość', 'rower', 'maraton', 'logika', 'motyl']

example_words = example_polish_words

for w in example_words:
    print ('WORD:', w)
    for w, v in task1_wv.most_similar(w):
        print ('   ', w, v)
    print ()

WORD: pies
    kot 0.8010485768318176
    koń 0.7975372076034546
    zwierzę 0.7651735544204712
    chłopiec 0.7597785592079163
    dziewczyna 0.7183960676193237
    mężczyzna 0.7060914635658264
    chłopak 0.700786828994751
    dziewczynka 0.6936959028244019
    ptak 0.6886632442474365
    kobieta 0.6841621398925781

WORD: smok
    niedźwiedź 0.6620179414749146
    bocian 0.6596109867095947
    bestia 0.6152178049087524
    dzieciak 0.6127355098724365
    ptak 0.6104136109352112
    potwór 0.6066800355911255
    tygrys 0.6056229472160339
    słoń 0.6001629829406738
    wilk 0.5967732667922974
    wąż 0.5922409296035767

WORD: miłość
    wiara 0.7513876557350159
    duch 0.6542425751686096
    uczucie 0.6329718232154846
    śmierć 0.627207338809967
    natura 0.6237786412239075
    młodość 0.6211560964584351
    wolność 0.6175273656845093
    radość 0.6165459752082825
    miłosierdzie 0.6136339902877808
    wyobraźnia 0.6124131083488464

WORD: rower
    auto 0.6595624089241028
    wóze

## Task 2 (4 points)

Your task is to train embeddings for Simple Wikipedia titles using the gensim library. As the example below shows, training is really simple:

```python
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
```
*sentences* can be a list of lists of tokens; you can also use *gensim.models.word2vec.LineSentence(source)* to create a restartable iterator from a file. First, use [this file], which contains pairs of titles such that one article links to the other.
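Word2Vec makes several passes over the data, so a plain generator (which is exhausted after one pass) is not enough: the corpus has to be restartable. A rough pure-Python sketch of that idea follows. The class name `PairCorpus` is hypothetical and only mimics the behaviour that `LineSentence` provides.

```python
import os
import tempfile

class PairCorpus:
    """Restartable iterable over whitespace-tokenized lines of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so iteration can be repeated.
        with open(self.path, encoding='utf-8') as in_file:
            for line in in_file:
                tokens = line.split()
                if tokens:
                    yield tokens

# Toy file with two linked-title pairs (made-up data)
path = os.path.join(tempfile.gettempdir(), 'toy_links.txt')
with open(path, 'w', encoding='utf-8') as out_file:
    out_file.write('mickey_mouse donald_duck\nflower leaf\n')

corpus = PairCorpus(path)
first_pass = list(corpus)
second_pass = list(corpus)
print(first_pass)
```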

We say that two titles are *related* if they both contain a word (or a word bigram) which is not very popular (it occurs in only a few titles). Make this definition more precise, and create a corpus which contains pairs of related titles. Mix the original corpus with the new one, then train the title vectors again.

Compare these two approaches using code similar to that from Task 1.
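One way to make the definition precise, sketched here on made-up titles: call a word *rare* if it occurs in at least 2 and at most `max_titles` titles, and call two titles related if they share a rare word. The threshold and the helper name `related_pairs` are assumptions, not part of the assignment.

```python
from collections import defaultdict
from itertools import combinations

def related_pairs(titles, max_titles=3):
    """Return sorted pairs of titles that share a rare word."""
    titles_by_word = defaultdict(set)
    for title in titles:
        for word in set(title.split('_')):
            titles_by_word[word].add(title)

    pairs = set()
    for word, group in titles_by_word.items():
        # A word in a single title relates nothing; a very common one is noise.
        if 2 <= len(group) <= max_titles:
            pairs.update(combinations(sorted(group), 2))
    return sorted(pairs)

titles = [
    'tandem_bicycle', 'bicycle_helmet',
    'list_of_rivers', 'list_of_lakes', 'list_of_cities', 'list_of_birds',
]
pairs = related_pairs(titles)
print(pairs)
```

Here `bicycle` links the first two titles, while `list` and `of` occur in four titles each and are therefore ignored.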

In [7]:
import pandas as pd
import numpy as np

from itertools import combinations
from collections import defaultdict
from gensim.models import Word2Vec
from tqdm.auto import tqdm

In [2]:
wiki_links_pdf = pd.read_csv('task2_simple.wiki.links.txt', sep=' ', names=['object', 'context'])
# The cell for your presentation

In [3]:
wiki_titles = set(np.hstack((wiki_links_pdf['object'], wiki_links_pdf['context'])))

In [4]:
word_counts = pd.Series(list(wiki_titles)).apply(lambda title: str(title).split('_')).explode().value_counts()

In [5]:
rare_words = word_counts[(word_counts > 1) & (word_counts <= 30)].index

In [11]:
articles = defaultdict(list)
wiki_titles = list(wiki_titles)
rare_words = set(rare_words)

for it, title in tqdm(enumerate(wiki_titles)):
    for word in list(set(str(title).split('_')) & rare_words):
        articles[word].append(it)

In [13]:
with open('task2_related_wiki_links.txt', 'w') as out_file:
    for vals in tqdm(articles.values()):
        for pair in combinations(vals, 2):
            out_file.write(f'{wiki_titles[pair[0]]} {wiki_titles[pair[1]]}\n')

  0%|          | 0/204926 [00:00<?, ?it/s]

In [9]:
example_titles = [
    'capital_city',
    'flower',
    'mickey_mouse',
    'finance',
    'japan_national_under-23_football_team',
    'john_the_apostle',
    'boss_(gaming)',
    'the_wealth_of_nations',
    'sulfur_dioxide',
    'tandem_bicycle',
]

In [27]:
model = Word2Vec(corpus_file='task2_simple.wiki.links.txt', vector_size=100, min_count=1, workers=4, epochs=20)

In [28]:
task2_wv = model.wv
task2_wv.save('task2_simple_wiki_links_w2v_vectors.txt')
example_words = example_titles

for w in example_words:
    print ('WORD:', w)
    for w, v in task2_wv.most_similar(w):
        print ('   ', w, v)
    print ()

WORD: capital_city
    population_density 0.739251434803009
    province 0.7069327235221863
    plain 0.701899528503418
    oceanic_climate 0.6916329264640808
    köppen_climate_classification 0.6897584199905396
    list_of_capital_cities_by_altitude 0.6814950704574585
    category:departments_of_paraguay 0.6782993674278259
    above_sea_level 0.6735855340957642
    category:departments_of_the_republic_of_the_congo 0.6711230874061584
    hill 0.669350266456604

WORD: flower
    asteraceae 0.8791819214820862
    leaf 0.8778598308563232
    berry 0.8747152090072632
    vine 0.8712254166603088
    seed 0.8700944781303406
    perennial 0.8616876602172852
    tail 0.8591261506080627
    fruit 0.856979489326477
    sweet_(taste) 0.8559210300445557
    lime 0.8551871180534363

WORD: mickey_mouse
    minnie_mouse 0.9010185599327087
    goofy 0.8845751285552979
    bugs_bunny 0.8688761591911316
    donald_duck 0.8686318397521973
    daffy_duck 0.857617974281311
    pluto_(disney) 0.854737699031

In [24]:
model = Word2Vec(corpus_file='task2_related_wiki_links.txt', vector_size=100, min_count=1, workers=4, epochs=20)

In [26]:
task2_wv = model.wv
task2_wv.save('task2_related_wiki_links_w2v_vectors.txt')
example_words = example_titles

for w in example_words:
    print ('WORD:', w)
    try:
        for w, v in task2_wv.most_similar(w):
            print ('   ', w, v)
    except KeyError:
        print(f'{w} not found in the training set')
    print ()

WORD: capital_city
capital_city not found in the training set

WORD: flower
flower not found in the training set

WORD: mickey_mouse
mickey_mouse not found in the training set

WORD: finance
finance not found in the training set

WORD: japan_national_under-23_football_team
japan_national_under-23_football_team not found in the training set

WORD: john_the_apostle
john_the_apostle not found in the training set

WORD: boss_(gaming)
    first-person_(gaming) 0.974440336227417
    item_(gaming) 0.97352534532547
    match_(gaming) 0.9721232056617737
    ken_williams_(gaming) 0.971487283706665
    deathmatch_(gaming) 0.9703531861305237
    spam_(gaming) 0.966529130935669
    kana_kitahara 0.9655419588088989
    stalling_(gaming) 0.9654951691627502
    first_person_(gaming) 0.9649636149406433
    toa_alta,_puerto_rico 0.9643204212188721

WORD: the_wealth_of_nations
    wealth_management 0.9905750751495361
    morgan_stanley_wealth_management 0.9884107112884521
    list_of_countries_by_distrib

In [25]:
import shutil
with open('task2_combined_wiki_links.txt', 'wb') as out_file:
    for file in ['task2_related_wiki_links.txt', 'task2_simple.wiki.links.txt' ]:
        with open(file, 'rb') as in_file:
            shutil.copyfileobj(in_file, out_file)

In [15]:
model = Word2Vec(corpus_file='task2_combined_wiki_links.txt', vector_size=100, min_count=1, workers=4, epochs=20)

In [16]:
task2_wv = model.wv
task2_wv.save('task2_combined_wiki_links_w2v_vectors.txt')
example_words = example_titles

for w in example_words:
    print('WORD:', w)
    try:
        for w, v in task2_wv.most_similar(w):
            print('   ', w, v)
    except KeyError:
        print(f'{w} not found in the training set')
    print()


WORD: capital_city
    list_of_national_capitals 0.7547511458396912
    population_density 0.7497968077659607
    list_of_capital_cities_by_altitude 0.745806872844696
    province 0.727108359336853
    köppen_climate_classification 0.7214742302894592
    communes_of_burundi 0.7195709943771362
    port 0.712222158908844
    hill 0.7095301151275635
    category:regions_of_burkina_faso 0.7092888355255127
    coast 0.7067225575447083

WORD: flower
    berry 0.9138245582580566
    evergreen 0.91241455078125
    asteraceae 0.9113523364067078
    leaf 0.9072189927101135
    flowers 0.8990909457206726
    sweet_(taste) 0.8957048654556274
    buckwheat 0.8947560787200928
    fruit 0.8932700157165527
    orchidaceae 0.8918755054473877
    shrub 0.8892049193382263

WORD: mickey_mouse
    donald_duck 0.9323548078536987
    pluto_(disney) 0.9035681486129761
    american_pekin_duck 0.8992806673049927
    money_bin 0.8984423279762268
    goofy 0.8974982500076294
    daisy_duck 0.8972910642623901
    

## Task 3 (4 points)

Suppose that we have two languages: Upper and Lower. This is an example Upper sentence:

<pre>
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
</pre>

And this is its translation into Lower:

<pre>
the quick brown fox jumps over the lazy dog
</pre>

You have two corpora for these languages (with different sentences). Your task is to train word embeddings for both languages together, so that the embeddings of words which are translations of each other are as close as possible. Unfortunately, your budget allows you to prepare translations for only 1000 words (we call this set D; you have to decide which words to put in D).

Prepare a corpus which contains three kinds of sentences:
* Upper corpus sentences
* Lower corpus sentences
* sentences derived from Upper/Lower corpus, modified using D

There are many possible ways of doing this, for instance this one (ROT13.COM: hfr rirel fragrapr sebz obgu pbecben gjvpr: jvgubhg nal zbqvsvpngvbaf, naq jvgu rirel jbeqf sebz Q ercynprq ol vgf genafyngvba)

We define the score for an Upper WORD as $\frac{1}{p}$, where $p$ is the position of its translation in the list of **Lower** words most similar to WORD. For instance, when the words most similar to DOG are:

<pre>
WOLF, CAT, WOLVES, LION, gopher, dog
</pre>

then the score for the word DOG is 0.5, because its translation dog is the second **Lower** word in the list (after gopher). Compute the average score separately for words from D and for words outside D (hint: if the computation takes too much time, do it for a random sample).
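The score is a reciprocal rank restricted to Lower words. A small sketch, assuming that Lower words are exactly the lowercase tokens and that the translation of an Upper word is its lowercase form (as in the example above). The helper `score` and the choice of 0 for a missing translation are hypothetical.

```python
def score(word, most_similar):
    """1/p, where p is the rank of word's translation among Lower neighbours."""
    lower_neighbours = [w for w in most_similar if w.islower()]
    translation = word.lower()
    if translation not in lower_neighbours:
        return 0.0  # assumption: a missing translation scores zero
    return 1.0 / (lower_neighbours.index(translation) + 1)

neighbours = ['WOLF', 'CAT', 'WOLVES', 'LION', 'gopher', 'dog']
dog_score = score('DOG', neighbours)
print(dog_score)  # 0.5
```

Tokens without cased characters (for example numbers) are neither Upper nor Lower under `str.islower()`, so a real corpus needs an explicit rule for them.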


In [25]:
import pandas as pd
import numpy as np
import shutil

from sklearn.cluster import KMeans
from collections import defaultdict
from gensim.models import Word2Vec
from tqdm.auto import tqdm
from string import punctuation

In [34]:
with open('task3_polish_lower.txt', 'r') as in_file:
    lower_sentences = in_file.read()

lower_sentences = lower_sentences.split('\n')

In [30]:
with open('task3_polish_lower_clean.txt', 'w') as out_file:
    for sentence in tqdm(lower_sentences):
        clean_words = []
        for word in sentence.split(' '):
            if word.isnumeric():
                clean_words.append('<num>')
            elif word not in punctuation:
                clean_words.append(word)
        out_file.write(' '.join(clean_words) + '\n')

  0%|          | 0/500001 [00:00<?, ?it/s]

In [33]:
with open('task3_polish_upper.txt', 'r') as in_file:
    upper_sentences = in_file.read()

upper_sentences = upper_sentences.split('\n')

with open('task3_polish_upper_clean.txt', 'w') as out_file:
    for sentence in tqdm(upper_sentences):
        clean_words = []
        for word in sentence.split(' '):
            if word.isnumeric():
                clean_words.append('<num>')
            elif word not in punctuation:
                clean_words.append(word)
        out_file.write(' '.join(clean_words) + '\n')

  0%|          | 0/500001 [00:00<?, ?it/s]

In [35]:
with open('task3_polish_lower_clean.txt', 'r') as in_file:
    lower_sentences = in_file.read()

lower_sentences = lower_sentences.split('\n')

In [37]:
lower_word_counts = pd.Series(lower_sentences).apply(lambda sentence: str(sentence).split(' ')).explode().value_counts()

In [38]:
model = Word2Vec(corpus_file='task3_polish_lower_clean.txt', vector_size=100, min_count=1, workers=4, epochs=20)

In [39]:
task3_wv = model.wv

example_words = ['pies', 'smok', 'miłość', 'rower', 'maraton', 'logika', 'motyl']

for w in example_words:
    print ('WORD:', w)
    for w, v in task3_wv.most_similar(w):
        print ('   ', w, v)
    print ()


WORD: pies
    kot_kota 0.7046511173248291
    kot 0.700019896030426
    koń 0.6674674153327942
    dzieciak 0.6413710713386536
    zwierzyć_zwierzę 0.6334779262542725
    ubranie 0.6181987524032593
    zwierzę 0.6084510684013367
    chłopiec 0.6027811169624329
    dziewczyna 0.6027806401252747
    szczeniak 0.5959733128547668

WORD: smok
    szpon 0.6890965700149536
    kontur 0.6756772398948669
    błękitny 0.6708981394767761
    kieł 0.6688380241394043
    słoń 0.668688952922821
    pięść 0.6665152311325073
    wilk 0.665794312953949
    młodzieniec 0.6567079424858093
    tan 0.6550998091697693
    płomień 0.6548458337783813

WORD: miłość
    bóg 0.7502952218055725
    dobroć 0.7225184440612793
    dusza 0.7164352536201477
    zbawienie 0.692441999912262
    życzliwość 0.6892829537391663
    przyjaźnić_przyjaźń 0.6856911182403564
    uczucie 0.6825954914093018
    miłosierdzie 0.6780387163162231
    modlitwa 0.6756280660629272
    zmartwychwstanie_zmartwychwstać 0.6751652956008911



In [40]:
lower_kmeans = KMeans(n_clusters=100)
preds = lower_kmeans.fit_predict(task3_wv.vectors)

preds_pdf = pd.DataFrame(preds, columns=['cluster'])
preds_pdf.index = task3_wv.index_to_key
preds_pdf = preds_pdf.join(lower_word_counts.rename('count'))

In [44]:
cluster_representatives_pdf = preds_pdf.groupby('cluster').apply(lambda pdf: pdf.sort_values('count', ascending=False).iloc[:10])

In [56]:
for n, (cluster, word) in enumerate(cluster_representatives_pdf.index.values):
    if n == 100:
        break
    print(cluster, word)

0 niemal
0 stosunkowo
0 nieuzyskania
0 najprawdopodobniej
0 twardy
0 d
0 przeważnie
0 rzadziej
0 ukryty
0 równoległy
1 częsty_część
1 część
1 głównie
1 zielony
1 górny
1 leśny
1 dolny
1 górski
1 położenie
1 zasiąg_zasięg
2 i
2 postać
2 jednakże
2 trzeci_trzeć
2 powstanie_powstać
2 ewentualnie
2 tzn.
2 znalezienie
2 ujęcie
2 nawiązanie
3 dotyczyć
3 wynikać
3 wymagać
3 stanowić
3 oznaczać
3 przewidywać
3 pozwolić
3 stan_stanowić_stanowy
3 spowodować
3 nastąpić
4 przyjęcie
4 uzyskanie
4 zakończenie
4 wejście
4 wykonanie
4 podjęcie
4 rozpatrzenie
4 ustalenie
4 wydanie
4 przeprowadzenie
5 <num>
5 od
5 pierwszy
5 dwa
5 jeden_jedny
5 drugi
5 kolejny
5 ostatni
5 trzy
5 kilka
6 projekt
6 pytanie
6 wniosek
6 uwaga
6 głos
6 stanowisko
6 odpowiedź
6 propozycja
6 opinia
6 sprawozdanie
7 plus
7 in
7 kalendarz
7 agent
7 e
7 podsumowanie
7 pl.
7 podopieczny
7 klucz
7 sprzedający
8 liczba
8 poziom
8 cena
8 stosunek
8 wzrost
8 stopień
8 wpływ
8 wartość
8 jakość
8 ilość
9 działanie
9 działalność
9 wybór


In [68]:
representatives = [idx[1] for idx in cluster_representatives_pdf.index]

In [69]:
with open('task3_representatives.txt', 'w') as out_file:
    out_file.write(' '.join(representatives))

In [70]:
repr_sentences = defaultdict(list)
representatives = set(representatives)

In [71]:
for sentence in lower_sentences:
    sentence_words = set(sentence.split(' '))
    for word in list(sentence_words & representatives):
        repr_sentences[word].append(sentence)


In [72]:
with open('task3_lower_to_upper.txt', 'w') as out_file:
    for representative, sentences in tqdm(repr_sentences.items(), total=len(repr_sentences)):
        for sentence in sentences:
            sentence_words = sentence.split(' ')
            for idx, word in enumerate(sentence_words):
                if word == representative:
                    sentence_words[idx] = word.upper()
            out_file.write(' '.join(sentence_words) + '\n')

  0%|          | 0/996 [00:00<?, ?it/s]

In [73]:
with open('task3_polish_upper_clean.txt', 'r') as in_file:
    upper_sentences = in_file.read()
upper_sentences = upper_sentences.split('\n')


In [74]:
representatives_upper = set([representative.upper() for representative in list(representatives)])
repr_sentences = defaultdict(list)

for sentence in upper_sentences:
    sentence_words = set(sentence.split(' '))
    for word in list(sentence_words & representatives_upper):
        repr_sentences[word].append(sentence)

In [75]:
with open('task3_upper_to_lower.txt', 'w') as out_file:
    for representative, sentences in tqdm(repr_sentences.items(), total=len(repr_sentences)):
        for sentence in sentences:
            sentence_words = sentence.split(' ')
            for idx, word in enumerate(sentence_words):
                if word == representative:
                    sentence_words[idx] = word.lower()
            out_file.write(' '.join(sentence_words) + '\n')


  0%|          | 0/995 [00:00<?, ?it/s]

In [77]:
with open('task3_corpus.txt', 'w') as out_file:
    for file in ['task3_polish_lower_clean.txt', 'task3_polish_upper_clean.txt', 'task3_lower_to_upper.txt', 'task3_upper_to_lower.txt']:
        with open(file, 'r') as in_file:
            shutil.copyfileobj(in_file, out_file)

In [78]:
model = Word2Vec(corpus_file='task3_corpus.txt', vector_size=100, min_count=1, workers=4, epochs=20)

In [79]:
model.save('task3.model')

In [81]:
upper_words = list(pd.Series(upper_sentences).apply(lambda sentence: str(sentence).split(' ')).explode().unique())

In [82]:
upper_words = np.array([word for word in upper_words if len(word) > 0])

In [83]:
representatives = list(representatives)

In [84]:
def evaluate(word: str, model: Word2Vec):
    distances = model.wv.distances(word, other_words=list(upper_words))
    try:
        position = np.argwhere(upper_words[np.argsort(distances)] == word.upper())[0][0]
    except IndexError:
        # word.upper() does not occur among the Upper words: report it and score 0
        print(word)
        return 0
    return 1 / (position+1)

In [88]:
def most_similar(word: str, model: Word2Vec, n: int = 10):
    distances = model.wv.distances(word, other_words=list(upper_words))
    return upper_words[np.argsort(distances)][:n]


In [85]:
scores = []
for word in tqdm(representatives):
    scores.append(evaluate(word, model))

print(np.mean(scores))


  0%|          | 0/996 [00:00<?, ?it/s]

<num>
0.8154163699959579


In [87]:
pd.DataFrame(np.array([representatives, scores]).T)

| | 0 | 1 |
|---|---|---|
| 0 | ustalenie | 0.16666666666666666 |
| 1 | pozostać_pozostały | 0.027777777777777776 |
| 2 | państwowy | 0.125 |
| 3 | mieć | 1.0 |
| 4 | do | 0.00847457627118644 |
| ... | ... | ... |
| 991 | liczyć | 1.0 |
| 992 | połączony | 1.0 |
| 993 | stan_stanowić_stanowy | 1.0 |
| 994 | nadal | 0.25 |


In [89]:
most_similar('ustalenie', model)

array(['POTĘŻNIEJSZY', '4.1', 'DOMESDAY', '0-1', 'DOCHOWAĆ', 'USTALENIE',
       '15,5', 'WIADOMY', 'OCHŁODZENIE', 'FRANCESCO'], dtype='<U50')

In [92]:
most_similar('państwowy', model)

array(['ODMAWIAJĄC', 'LESS', 'FILED', 'WYŁOŻENIE', 'UNDER', 'TŁOKOWY',
       'DZ.U.', 'PAŃSTWOWY', 'OŁAWA', 'PRZEWRACAĆ'], dtype='<U50')

In [93]:
lower_words = lower_word_counts.index

In [94]:
lower_words = np.array([word for word in lower_words if len(word) > 0])

In [95]:
scores = []
for word in tqdm(np.random.choice(lower_words, 5000, replace=False)):
    scores.append(evaluate(word, model))

print(np.mean(scores))


  0%|          | 0/5000 [00:00<?, ?it/s]

widzialem
environment
sagem
erwin
gigabyte
tkacz
before
property
maurice
carla
basidiomycota
giuseppe
powiedzial
attack
material
czegos
never
vel
tank
0.10130278736314723


## Task 4 (4 points)

In this task you are asked to do two things:
1. Compare the embeddings computed on a small corpus (like the Brown Corpus, see: <https://en.wikipedia.org/wiki/Brown_Corpus>) with the ones coming from the Google News Corpus.
2. Try to use other resources like WordNet to enrich the corpus and obtain better embeddings.

You can use the following code snippets:

```python
# printing the tokenized Brown Corpus
from nltk.corpus import brown
for s in brown.sents():
    print(*s)

# iterating over all synsets in WordNet
from nltk.corpus import wordnet as wn

for synset_type in 'avrns': # n == noun, v == verb, ...
    for synset in list(wn.all_synsets(synset_type))[:10]:
        print (synset.definition())
        print (synset.examples())
        print ([lem.name() for lem in synset.lemmas()])
        print (synset.hypernyms()) # nodes 1 level up in ontology

# loading a model and computing cosine similarity between words
from gensim.models import Word2Vec

model = Word2Vec.load('models/w2v.wordnet5.model') 
print (model.wv.similarity('dog', 'cat'))
```

Embeddings will be tested using the WordSim-353 dataset; the code that measures their quality is in the cell below. Prepare the following corpora:
1. The tokenized Brown Corpus
2. Definitions and examples from Princeton WordNet
3. (1) and (2) together
4. (3) enriched with pseudosentences containing (a subset) of WordNet knowledge (such as 'tiger is a carnivore')

Train 4 Word2Vec models, and report the Spearman correlation between the similarities based on your vectors and the similarities based on human judgements.
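Pseudo-sentences for corpus (4) can be generated from hypernym relations. The sketch below uses a hand-written toy dictionary instead of WordNet so that it runs without NLTK data; in practice the pairs would come from `synset.lemmas()` and `synset.hypernyms()`. The function name and the sentence template are assumptions.

```python
def pseudo_sentences(hypernyms):
    """Turn {word: [hypernym, ...]} into tokenized 'X is a Y' sentences."""
    sentences = []
    for word, parents in sorted(hypernyms.items()):
        for parent in parents:
            article = 'an' if parent[0] in 'aeiou' else 'a'
            sentences.append([word, 'is', article, parent])
    return sentences

# Toy relations, written by hand for illustration
toy_hypernyms = {
    'tiger': ['carnivore'],
    'dog': ['animal', 'pet'],
}
sentences = pseudo_sentences(toy_hypernyms)
for s in sentences:
    print(*s)
```

The resulting token lists can be appended to the tokenized Brown and WordNet sentences before training.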



In [None]:
# Code for computing correlation between W2V similarity, and human judgements

import gensim.downloader
from scipy.stats import spearmanr

gn = gensim.downloader.load('word2vec-google-news-300')

for similarity_type in ['relatedness', 'similarity']:
    ws353 = []
    for x in open(f'wordsim_{similarity_type}_goldstandard.txt'): 
        a,b,val = x.split()
        val = float(val)
        ws353.append( (a,b,val))
    # Assumption: the vectors being scored are `gn`; substitute your own KeyedVectors.
    # Pairs containing an out-of-vocabulary word are skipped.
    vals, ys = [], []
    for a, b, val in ws353:
        if a in gn and b in gn:
            vals.append(val)
            ys.append(gn.similarity(a, b))
    # spearmanr returns 2 values: correlation and pval. pval should be close to zero
    print (similarity_type + ':', spearmanr(vals, ys)) 