<a href="https://colab.research.google.com/github/Dimildizio/DS_course/blob/main/Neural_networks/NLP/Embeddings/Simple_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Embeddings

The task is to rank StackOverflow questions based on their semantic representations.

* $X$ - the set of objects
* $X^l = \{x_1, x_2, ..., x_l\}$ - the training dataset
* $i \prec j$ - an order relation on pairs of indices $i$, $j$ of objects in $X^l$



### Task:
Construct a ranking function $a$ : $X \rightarrow \mathbb{R}$ so that
$$i \prec j \Rightarrow a(x_i) < a(x_j)$$
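
As a minimal sketch of what this condition asks for, the snippet below checks whether a scoring function respects a given set of ordered index pairs. The helper `respects_order` and the toy data are hypothetical and serve only as an illustration; they are not used elsewhere in the notebook.

```python
from typing import Callable, Iterable, Sequence, Tuple


def respects_order(a: Callable[[str], float],
                   objects: Sequence[str],
                   ordered_pairs: Iterable[Tuple[int, int]]) -> bool:
    """Return True if a(x_i) < a(x_j) holds for every pair (i, j) with i before j."""
    return all(a(objects[i]) < a(objects[j]) for i, j in ordered_pairs)


# Hypothetical toy data: the "score" is simply the string length.
toy_objects = ["a", "bbb", "cc"]
toy_pairs = [(0, 2), (2, 1)]  # x_0 precedes x_2, and x_2 precedes x_1
print(respects_order(len, toy_objects, toy_pairs))
```

Any function `a` that passes this check for all pairs in the training set is a valid ranking function in the sense of the task.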

## Embeddings

### Download the pretrained embeddings

In [1]:
!wget https://zenodo.org/record/1199620/files/SO_vectors_200.bin

--2024-02-24 18:34:20--  https://zenodo.org/record/1199620/files/SO_vectors_200.bin
Resolving zenodo.org (zenodo.org)... 188.184.98.238, 188.185.79.172, 188.184.103.159, ...
Connecting to zenodo.org (zenodo.org)|188.184.98.238|:443... connected.
HTTP request sent, awaiting response... 301 MOVED PERMANENTLY
Location: /records/1199620/files/SO_vectors_200.bin [following]
--2024-02-24 18:34:21--  https://zenodo.org/records/1199620/files/SO_vectors_200.bin
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 1453905423 (1.4G) [application/octet-stream]
Saving to: ‘SO_vectors_200.bin’


2024-02-24 18:35:16 (25.5 MB/s) - ‘SO_vectors_200.bin’ saved [1453905423/1453905423]



### Imports

In [None]:
!pip install tokenizers

In [2]:
import gc
import nltk
import numpy as np
import pandas as pd
import re
import string

from copy import deepcopy
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.notebook import tqdm
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import GPT2Tokenizer
from typing import List

In [3]:
nltk.download('stopwords')
stops = set(stopwords.words('english')).union(set(string.punctuation))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Create embeddings

In [4]:
wv_embeddings = KeyedVectors.load_word2vec_format("SO_vectors_200.bin", binary=True)

### Examples

In [5]:
word = 'dog'
if word in wv_embeddings:
    print(wv_embeddings[word].dtype, wv_embeddings[word].shape)
print(f"Num of words: {len(wv_embeddings.index_to_key)}")

float32 (200,)
Num of words: 1787145


### Question 1

Is 'cat' in top5 most similar words to 'dog'? If yes, which position?

In [6]:
print("'dog' and 'cat' similarity:")
print('\t', wv_embeddings.similarity('dog', 'cat'))
print('\t', wv_embeddings.n_similarity(['dog'], ['cat']))


'dog' and 'cat' similarity:
	 0.6852341
	 0.6852341


In [61]:
def check_occur(req: str, base: str, n: int = 10, flag: bool = True) -> List[tuple]:
  """Checks whether one word is among the top-N neighbours of another.

  :param req: word to look for
  :param base: word whose most similar words are retrieved
  :param n: top N words to check
  :param flag: intended to select the similarity function (currently unused)
  :return: list of (word, similarity) pairs most similar to `base`
  """

  # most_similar_cosmul or similar_by_word could be used here as well
  result = wv_embeddings.most_similar(base, topn=n)
  for i, (word, perc) in enumerate(result):
    if req == word:
      print(f'{req} is {int(perc*100)}% similar to {base} at position {i}')
      return result
  print(f'{req} is not similar to {base}')
  return result

In [62]:
words = ('cat', 'cats', 'dog', 'dogs')
for requested in words:
  requested = requested.lower()
  for based in words:
    based = based.lower()
    if requested != based:
      result = check_occur(requested, based)
  print()

cat is not similar to cats
cat is not similar to dog
cat is not similar to dogs

cats is not similar to cat
cats is 76% similar to dog at position 3
cats is 90% similar to dogs at position 0

dog is 68% similar to cat at position 1
dog is 76% similar to cats at position 2
dog is 78% similar to dogs at position 3

dogs is not similar to cat
dogs is 90% similar to cats at position 0
dogs is 78% similar to dog at position 1



#### Q1Answer:

We checked several similarity functions, `similar_by_word`, `most_similar_cosmul` and `most_similar`, and got identical rankings (only the similarity scores differ). *(Here the words are ranked starting from 1, unlike positions in the similarity list, which start from 0.)*

**"cat" is not in the top 5 words similar to "dog"** (it is not even in the top 10). However, "cats" comes fourth among the words similar to "dog" and first among the words similar to "dogs". Also, "dog" is ranked second among the words similar to "cat", third for "cats" and fourth for "dogs". "dogs" is ranked first for "cats" and second for "dog".


## Vector representations

In [9]:
class MyTokenizer:
    def __init__(self):
        self.tokenizer = WordPunctTokenizer()

    def tokenize(self, text):
        return self.tokenizer.tokenize(text.lower())

In [10]:
tokenizer = MyTokenizer()

In [11]:
def question_to_vec(question, embeddings=wv_embeddings, tokenizer=tokenizer, dim=200):
    """
    Embeds a question as the average of its word vectors

        :param question: str
        :param embeddings: word embeddings (token -> vector)
        :param tokenizer: object with a tokenize(str) method
        :param dim: size of each embedding vector

        return: vector representation of the question
    """
    tokens = tokenizer.tokenize(question.lower())
    vecs = []
    for token in tokens:
        if token in embeddings:
            vecs.append(embeddings[token])

    # No known tokens: fall back to a zero vector
    if not vecs:
        return np.zeros(dim, dtype='float32')
    avg_vector = np.mean(vecs, axis=0)
    return avg_vector
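
To make the averaging step concrete, here is a minimal self-contained sketch on a made-up 3-dimensional vocabulary. The dictionary `toy_embeddings` and the helper `average_vector` are hypothetical stand-ins for `wv_embeddings` and `question_to_vec`; the numbers are invented for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, used only for illustration.
toy_embeddings = {
    "neural": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "networks": np.array([3.0, 2.0, 0.0], dtype="float32"),
}


def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim, dtype="float32")
    return np.mean(vecs, axis=0)


# "i" and "love" are out of vocabulary here, so only two vectors are averaged.
print(average_vector(["i", "love", "neural", "networks"], toy_embeddings, dim=3))
# No known tokens at all: the zero-vector fallback is used.
print(average_vector(["unknown"], toy_embeddings, dim=3))
```

Note that out-of-vocabulary tokens are silently skipped, so two questions that differ only in unknown words get the same vector.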

### Question 2:

What is the third component (index 2) of the vector representation of `"I love neural networks"` (rounded to 2 decimal places)?

In [12]:
q2 = 'I love neural networks'
q2_embeds = question_to_vec(q2)
third_component = str(round(q2_embeds[2], 2))
print(f"Third component of '{q2}' is {third_component}")

Third component of 'I love neural networks' is -1.29


#### Q2 Answer:

Third component of `'I love neural networks'` is `-1.29`

## Text similarity

### Explanation

*We assume* that the cosine distance between duplicates is smaller than between randomly chosen sentences.



For each of $N$ questions, we'll **generate** $R$ **random negative examples** and **include the actual duplicates** as well. We'll rank $R + 1$ examples for each question using our model and **look at the position of the duplicate**. Ideally, we want the duplicate to be ranked first.
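
The evaluation protocol can be sketched on toy data as follows. The 2-dimensional vectors and the helpers `cosine` and `duplicate_rank` are hypothetical and only illustrate the idea of ranking $R + 1$ candidates and reading off the position of the duplicate.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def duplicate_rank(question_vec, candidate_vecs, dup_index=0):
    """1-based position of the duplicate after sorting by similarity, best first."""
    sims = [cosine(question_vec, c) for c in candidate_vecs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order.index(dup_index) + 1


# Hypothetical vectors: the duplicate (index 0) plus R = 2 random negatives.
question = np.array([1.0, 0.0])
candidates = [np.array([0.8, 0.6]),   # duplicate
              np.array([1.0, 0.1]),   # negative that happens to be closer
              np.array([0.0, 1.0])]   # unrelated negative
print(duplicate_rank(question, candidates))
```

Here one negative is closer to the question than the duplicate, so the duplicate ends up in second place rather than first.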

#### Hits@K
The first metric will be the share of correct hits for a given $K$:

$$ \text{Hits@K} = \frac{1}{N}\sum_{i=1}^N \, [rank\_q_i^{'} \le K],$$
* $[x < 0 ] \equiv
 \begin{cases}
   1, &x < 0\\
   0, &x \geq 0
 \end{cases}$ - the indicator function
* $q_i$ - the $i$-th question
* $q_i^{'}$ - its duplicate
* $rank\_q_i^{'}$ - the position of the duplicate in the ranked list of candidates for $q_i$


#### DCG@K

Another metric will be a simplified DCG metric, which considers the order of elements in the list by multiplying the **relevance of an element** by a **weight equal to the inverse logarithm of its position**:

$$ \text{DCG@K} = \frac{1}{N} \sum_{i=1}^N\frac{1}{\log_2(1+rank\_q_i^{'})}\cdot[rank\_q_i^{'} \le K],$$




With this metric, the model is penalized when the correct answer sits lower in the ranked list (a larger rank value).
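
A small worked example may help to compare the two metrics. The ranks below are made up for illustration (three questions, $K = 5$); the computation follows the two formulas above directly.

```python
import numpy as np

ranks = [1, 3, 9]  # hypothetical duplicate ranks for N = 3 questions
K = 5

# Hits@K: share of questions whose duplicate is within the top K.
hits_at_k = sum(rank <= K for rank in ranks) / len(ranks)

# DCG@K: the same hits, each weighted by 1 / log2(1 + rank).
dcg_at_k = sum(1 / np.log2(1 + rank) for rank in ranks if rank <= K) / len(ranks)

print(round(hits_at_k, 3))  # two of the three duplicates are in the top 5
print(round(dcg_at_k, 3))   # (1/log2(2) + 1/log2(4)) / 3
```

The duplicate at rank 3 counts as a full hit for Hits@5 but contributes only $1/\log_2 4 = 0.5$ to DCG@5, which is why DCG@K never exceeds Hits@K.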

### Question 3:

What is the maximum of `Hits@47 - DCG@1`?

Both metrics lie between 0 and 1:

**Hits@47** = 1 means all duplicates are in the top 47.

**DCG@1** = 1 means all duplicates are ranked first; **DCG@1** = 0 means none of them is.

*To maximize the difference we need Hits@47 = 1 (all duplicates within the top 47) and DCG@1 = 0 (none of the duplicates ranked first), i.e. every duplicate is ranked between 2 and 47:*
$$1-0=1$$

So **the answer is 1**, although I consider such a situation theoretical.

#### Q3 answer:

Max Hits@47 - DCG@1 = 1
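
The answer can be sanity-checked numerically. The helper `hits_minus_dcg` and the rank lists below are hypothetical; the sketch only evaluates the two formulas for a few extreme cases.

```python
import numpy as np


def hits_minus_dcg(ranks, k_hits=47, k_dcg=1):
    """Compute Hits@k_hits - DCG@k_dcg for a list of duplicate ranks."""
    n = len(ranks)
    hits = sum(r <= k_hits for r in ranks) / n
    dcg = sum(1 / np.log2(1 + r) for r in ranks if r <= k_dcg) / n
    return hits - dcg


print(hits_minus_dcg([2, 10, 47]))  # all in the top 47, none ranked first
print(hits_minus_dcg([1, 1, 1]))    # all ranked first: 1 - 1
print(hits_minus_dcg([48, 100]))    # none in the top 47: 0 - 0
```

Each question contributes 1 to the difference only when its duplicate is ranked between 2 and 47, and 0 otherwise, so the difference cannot exceed 1.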

### Question 4:
Find `DCG@10` if $rank\_q_i^{'} = 9$ (round to 1 decimal place)

Since we don't need a list here, we can do it in one line of code.

In [13]:
dcg = lambda rank, k: 1 / np.log2(1 + rank) if rank <= k else 0
dcg_result = round(dcg(9, 10), 1)
print(f"DCG@10 if rank q'_i is 9: {dcg_result}")

DCG@10 if rank q'_i is 9: 0.3


#### Q4 answer:

DCG@10 for $rank\_q_i^{'} = 9$: 0.3

### HITS\_COUNT and DCG\_SCORE

Each function takes two arguments: $dup\_ranks$ and $k$.

$dup\_ranks$ is a list of the duplicates' ranks (their positions in the ranked lists).

In [14]:
def calc_dups_ranks(candidates_ranking: List, copy_answers: List) -> List:
    """
    Calculate the ranks of duplicates in the candidates ranking.
    For each index in a length of copy_answers list:
      1. Find the index of the copy answer in the corresponding candidates ranking sublist
      2. Append the rank to the dup_ranks list

    :param candidates_ranking: (list of lists) List of ranked candidate answers for each question.
    :param copy_answers: (list) List of duplicate answers.
    :returns: List of ranks of duplicates in the candidates ranking.
    """
    dup_ranks = []
    for i in range(len(copy_answers)):
        rank = candidates_ranking[i].index(copy_answers[i]) + 1
        dup_ranks.append(rank)
    return dup_ranks

In [15]:
def hits_count(dup_ranks: List, k: int) -> float:
    """
    Compute Hits@K metric.

    :param dup_ranks: (list) List of ranks of duplicates in the ranked list for each question.
    :param k: (int) Number of top-ranked items to consider.
    :returns: float Hits@K score.
    """
    hits = sum(1 for rank in dup_ranks if rank <= k) / len(dup_ranks)
    return hits

In [16]:
def dcg_score(dup_ranks: List, k: int) -> float:
    """
    Compute DCG@K metric.


    :param dup_ranks: (list) List of ranks of duplicates in the ranked list for each question.
    :param k: (int) Number of top-ranked items to consider.
    :returns: float DCG@K score.
    """
    dcg = sum(1 / np.log2(1 + rank) for rank in dup_ranks if rank <= k) / len(dup_ranks)
    return dcg

#### Test it

In [17]:
copy_answers = ["How does the catch keyword determine the type of exception that was thrown",]

candidates_ranking = [["How Can I Make These Links Rotate in PHP",
                       "How does the catch keyword determine the type of exception that was thrown",
                       "NSLog array description not memory address",
                       "PECL_HTTP not recognised php ubuntu"],]

dup_ranks = calc_dups_ranks(candidates_ranking, copy_answers)

print('HIT:', [hits_count(dup_ranks, k) for k in range(1, 5)])
print('DCG:', [round(dcg_score(dup_ranks, k), 5) for k in range(1, 5)])

HIT: [0.0, 1.0, 1.0, 1.0]
DCG: [0.0, 0.63093, 0.63093, 0.63093]


In [18]:
# correct_answers - metrics for different k's
correct_answers = pd.DataFrame([[0, 1, 1, 1], [0, 1 / (np.log2(3)), 1 / (np.log2(3)), 1 / (np.log2(3))]],
                               index=['HITS', 'DCG'], columns=range(1,5))
correct_answers

|      | 1 | 2       | 3       | 4       |
|------|---|---------|---------|---------|
| HITS | 0 | 1.0     | 1.0     | 1.0     |
| DCG  | 0 | 0.63093 | 0.63093 | 0.63093 |


## Data

[archive link](https://drive.google.com/file/d/1QqT4D0EoqJTy7v9VrNCYD-m964XZFR7_/edit)

`train.tsv` - training dataset <br> Each row is `\t`-separated: **< question >, < similar question >**


`validation.tsv` - validation dataset <br> Each row is `\t`-separated: **< question >, < similar question >, < negative example 1 >, < negative example 2 >, ...**
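
A minimal sketch of how one row of `validation.tsv` breaks down under the format described above. The helper `split_validation_row` and the sample row are hypothetical; the notebook itself keeps each row as a flat list.

```python
def split_validation_row(line: str):
    """Split one tab-separated row into (question, duplicate, negatives)."""
    question, duplicate, *negatives = line.rstrip("\n").split("\t")
    return question, duplicate, negatives


row = "q\tdup of q\tneg 1\tneg 2\n"  # hypothetical row, not taken from the dataset
question, duplicate, negatives = split_validation_row(row)
print(question, "|", duplicate, "|", negatives)
```

Because the duplicate always comes right after the question, it has index 0 among the candidates, which is what the evaluation code relies on later.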

### Download and unzip dataset

In [19]:
file_id = '1QqT4D0EoqJTy7v9VrNCYD-m964XZFR7_'
!gdown $file_id
!unzip stackoverflow_similar_questions.zip

Downloading...
From (original): https://drive.google.com/uc?id=1QqT4D0EoqJTy7v9VrNCYD-m964XZFR7_
From (redirected): https://drive.google.com/uc?id=1QqT4D0EoqJTy7v9VrNCYD-m964XZFR7_&confirm=t&uuid=369a48d3-d869-48e2-928b-16e95ea6266b
To: /content/stackoverflow_similar_questions.zip
100% 131M/131M [00:00<00:00, 189MB/s]
Archive:  stackoverflow_similar_questions.zip
   creating: data/
  inflating: data/.DS_Store          
   creating: __MACOSX/
   creating: __MACOSX/data/
  inflating: __MACOSX/data/._.DS_Store  
  inflating: data/train.tsv          
  inflating: data/validation.tsv     


#### Func to read tsv

In [20]:
def read_corpus(filename: str) -> List:
    r"""Reads the file and returns its rows split by \t

    :param filename: (str) path to a tab-separated file
    :returns: List of rows, each a list of questions
    """
    data = []
    # The context manager closes the file once reading is done
    with open(filename, encoding='utf-8') as f:
        for line in f:
            row = line.strip().split('\t')
            data.append(row)
    return data

#### Load the validation dataset

In [21]:
validation_data = read_corpus('./data/validation.tsv')

In [22]:
print('Lines number:', len(validation_data))
print('First few rows:')
for i in range(5):
    print(f'\t{i + 1} {len(validation_data[i])}')

Lines number: 3760
First few rows:
	1 1001
	2 1001
	3 1001
	4 1001
	5 1001


## No-ML approach

Implement a ranking function based on cosine similarity. The function should go through the list of candidates and return a list of pairs (original position, candidate) sorted from most to least similar. The index in the resulting list is the candidate's rank. Example: `[(2,c), (0,a), (1,b)]`, where `(2,c)` is the top result, `2` is the original position and `c` is the candidate.

In [50]:
def rank_candidates(question, candidates, embeddings, tokenizer, dim=200, bpe=False) -> List:
    """
        question: string
        candidates: string array - candidates [a, b, c]
        result: pairs (initial position, candidate) [(2, c), (0, a), (1, b)]
    """
    # Pick the vectorizer that matches the tokenizer type
    q2v_func = question_to_vec_bpe if bpe else question_to_vec
    sims = []
    q_avg_vector = q2v_func(question, embeddings=embeddings, tokenizer=tokenizer, dim=dim).reshape(1, -1)
    for candidate in candidates:
        c_avg_vector = q2v_func(candidate, embeddings=embeddings, tokenizer=tokenizer, dim=dim).reshape(1, -1)
        sim = cosine_similarity(q_avg_vector, c_avg_vector).flatten()[0]
        sims.append(sim)

    # Indices of candidates sorted by similarity, highest first
    ranks = np.argsort(sims)[::-1]
    result = [(rank, candidates[rank]) for rank in ranks]
    return result
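
The function above computes one cosine similarity per candidate. As a possible alternative, and only as a sketch under the assumption that all candidate vectors are already stacked into one matrix, the same ranking can be obtained in a single vectorized step with plain NumPy. The helper `rank_by_cosine` and the toy vectors are hypothetical.

```python
import numpy as np


def rank_by_cosine(question_vec, candidate_matrix):
    """Return candidate indices sorted by cosine similarity, best first."""
    q_norm = np.linalg.norm(question_vec)
    c_norms = np.linalg.norm(candidate_matrix, axis=1)
    denom = q_norm * c_norms
    dots = candidate_matrix @ question_vec
    # Zero vectors (e.g. no known tokens) get similarity 0 instead of NaN.
    sims = np.divide(dots, denom, out=np.zeros_like(dots, dtype=float), where=denom > 0)
    # A stable sort on the negated scores keeps the original order for ties.
    return np.argsort(-sims, kind="stable").tolist()


question_vec = np.array([1.0, 1.0])
candidate_matrix = np.array([[0.0, 2.0],   # similarity of about 0.707
                             [3.0, 3.0],   # similarity 1.0
                             [0.0, 0.0]])  # zero vector, similarity 0.0
print(rank_by_cosine(question_vec, candidate_matrix))
```

With about a thousand candidates per question, one matrix product avoids a thousand separate similarity calls; tie handling here differs slightly from reversing an ascending `argsort`.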

#### Let's test it given N=2

In [24]:
questions = ['converting string to list', 'Sending array via Ajax fails']

candidates = [['Convert Google results object (pure js) to Python object',
               'C# create cookie from string and send it',
               'How to use jQuery AJAX for an outside domain?'],

              ['Getting all list items of an unordered list in PHP',
               'WPF- How to update the changes in list item of a list',
               'select2 not displaying search results']]

In [25]:
for question, q_candidates in zip(questions, candidates):
        ranks = rank_candidates(question, q_candidates, wv_embeddings, tokenizer)
        print(ranks)
        print()

[(1, 'C# create cookie from string and send it'), (0, 'Convert Google results object (pure js) to Python object'), (2, 'How to use jQuery AJAX for an outside domain?')]

[(0, 'Getting all list items of an unordered list in PHP'), (2, 'select2 not displaying search results'), (1, 'WPF- How to update the changes in list item of a list')]



#### Results
```
results = [[(1, 'C# create cookie from string and send it'),
            (0, 'Convert Google results object (pure js) to Python object'),
            (2, 'How to use jQuery AJAX for an outside domain?')],
           [(*, 'Getting all list items of an unordered list in PHP'), #hidden
            (*, 'select2 not displaying search results'), #hidden
            (*, 'WPF- How to update the changes in list item of a list')]] #hidden

```




For `experiment 1` the correct index order is `1, 0, 2`

### Question 5:

What is the resulting order of initial indices for `experiment 2`?
(format: `102` for `experiment 1`, no punctuation or spaces)

#### Q5 answer:

For `experiment 2` the correct answer is `021` (0, 2, 1), as the ranking printed by the test cell shows.

#### Evaluate the quality of the approach

In [51]:
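def check_dcg_hit(embs, data, dim=200, tokenizer=tokenizer, bpe=False):
  """Prints DCG@K and Hits@K for several K on the first rows of `data`."""
  wv_ranking = []
  for i, line in enumerate(tqdm(data)):
    # The first element is the question, the second its duplicate, the rest are negatives
    q, *ex = line
    ranks = rank_candidates(q, ex, embs, tokenizer, dim=dim, bpe=bpe)
    # The duplicate has original index 0 among the candidates; store its 1-based rank
    wv_ranking.append([r[0] for r in ranks].index(0) + 1)
    # Stop after rows 0..1000 (1001 questions)
    if i == 1000:
      break
  for k in tqdm([1, 5, 10, 100, 500, 1000]):
    print("DCG@%4d: %.3f | Hits@%4d: %.3f" % (k, dcg_score(wv_ranking, k), k, hits_count(wv_ranking, k)))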

In [None]:
bpe_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [29]:
check_dcg_hit(wv_embeddings, validation_data, tokenizer=bpe_tokenizer)

  0%|          | 0/3760 [00:00<?, ?it/s]

  0%|          | 0/6 [00:00<?, ?it/s]

DCG@   1: 0.189 | Hits@   1: 0.189
DCG@   5: 0.236 | Hits@   5: 0.280
DCG@  10: 0.249 | Hits@  10: 0.320
DCG@ 100: 0.276 | Hits@ 100: 0.453
DCG@ 500: 0.302 | Hits@ 500: 0.663
DCG@1000: 0.337 | Hits@1000: 1.000


In [54]:
check_dcg_hit(wv_embeddings, validation_data, tokenizer=MyTokenizer())

  0%|          | 0/3760 [00:00<?, ?it/s]

  0%|          | 0/6 [00:00<?, ?it/s]

DCG@   1: 0.410 | Hits@   1: 0.410
DCG@   5: 0.499 | Hits@   5: 0.579
DCG@  10: 0.520 | Hits@  10: 0.644
DCG@ 100: 0.567 | Hits@ 100: 0.875
DCG@ 500: 0.580 | Hits@ 500: 0.973
DCG@1000: 0.583 | Hits@1000: 1.000


In [None]:
gc.collect()

162

## Embeddings on similar texts corpora

In [31]:
train_data = read_corpus('./data/train.tsv')

In [32]:
train_data[0]

['converting string to list',
 'Convert Google results object (pure js) to Python object']

Combine questions into pairs and train them using gensim Word2Vec.
Choose the window size. Elaborate on your decision.

In [46]:
def preprocessing(words, tokenizer=tokenizer):
  w_list = []
  for w in tokenizer.tokenize(words.lower()):
    if w not in stops:
      w_list.append(w)
  return w_list

In [57]:
words = [preprocessing(question, MyTokenizer()) for pair in train_data for question in pair]

In [35]:
gc.collect()

0

In [58]:
w_frequency = 5
window = 5
size = 300

In [59]:
embeddings_trained = Word2Vec(words,                  # Data for the model to train on
                              vector_size=size,       # The embedding vector size
                              min_count=w_frequency,  # Keep words that occurred at least 5 times
                              window=window,          # Max distance between current and predicted word
                              sg=1).wv                # 1 = skip-gram, 0 = CBOW

In [60]:
check_dcg_hit(embeddings_trained, validation_data, dim=size, tokenizer=MyTokenizer())

  0%|          | 0/3760 [00:00<?, ?it/s]

  0%|          | 0/6 [00:00<?, ?it/s]

DCG@   1: 0.491 | Hits@   1: 0.491
DCG@   5: 0.581 | Hits@   5: 0.659
DCG@  10: 0.598 | Hits@  10: 0.712
DCG@ 100: 0.635 | Hits@ 100: 0.893
DCG@ 500: 0.646 | Hits@ 500: 0.976
DCG@1000: 0.649 | Hits@1000: 1.000


#### Try training BPE Tokenizer

In [None]:
def train_bpe_tokenizer(train_data):
    # Initialize a tokenizer with BPE model
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # Initialize a trainer for BPE
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

    texts = [pair[0] for pair in train_data] + [pair[1] for pair in train_data]

    # Train the tokenizer
    tokenizer.train_from_iterator(texts, trainer)

    # Save the tokenizer
    tokenizer.save("custom_bpe.json")

In [41]:
train_bpe_tokenizer(train_data)

In [42]:
tokenizer = Tokenizer.from_file("custom_bpe.json")
tokenizer.tokenize = tokenizer.encode # monkey patch

In [52]:
def question_to_vec_bpe(question, embeddings, tokenizer, dim=200) -> np.ndarray:
    """
    Embeds a sentence into vector representations using a BPE tokenizer.

    :param question: str, the question to be vectorized.
    :param embeddings: dict, a dictionary of word embeddings.
    :param tokenizer: a tokenizer that has been trained or loaded, capable of encoding the input question.
    :param dim: int, the dimensionality of the word embeddings.

    :return: np.ndarray, the average embedding vector of the question.
    """
    # Encode the question using the tokenizer, obtaining tokens
    encoding = tokenizer.encode(question.lower())
    vecs = []
    # Iterate over the encoded tokens
    for token_id in encoding.ids:
        # Convert token ID back to the token string representation
        token = tokenizer.id_to_token(token_id)
        if token in embeddings:
            vecs.append(embeddings[token])
    if not vecs:
        return np.zeros(dim, dtype='float32')
    avg_vector = np.mean(vecs, axis=0)
    return avg_vector


In [53]:
check_dcg_hit(embeddings_trained, validation_data, dim=size, tokenizer=tokenizer, bpe=True)

  0%|          | 0/3760 [00:00<?, ?it/s]

  0%|          | 0/6 [00:00<?, ?it/s]

DCG@   1: 0.301 | Hits@   1: 0.301
DCG@   5: 0.371 | Hits@   5: 0.432
DCG@  10: 0.394 | Hits@  10: 0.501
DCG@ 100: 0.432 | Hits@ 100: 0.687
DCG@ 500: 0.451 | Hits@ 500: 0.841
DCG@1000: 0.468 | Hits@1000: 1.000


In [56]:
check_dcg_hit(wv_embeddings, validation_data, tokenizer=tokenizer, bpe=True)

  0%|          | 0/3760 [00:00<?, ?it/s]

  0%|          | 0/6 [00:00<?, ?it/s]

DCG@   1: 0.434 | Hits@   1: 0.434
DCG@   5: 0.514 | Hits@   5: 0.586
DCG@  10: 0.535 | Hits@  10: 0.652
DCG@ 100: 0.579 | Hits@ 100: 0.866
DCG@ 500: 0.593 | Hits@ 500: 0.971
DCG@1000: 0.596 | Hits@1000: 1.000


Let's max it out

In [None]:
max_embeds = Word2Vec(words,
                      vector_size=400,
                      min_count=12,
                      window=10,
                      sg=1).wv

In [None]:
check_dcg_hit(max_embeds, validation_data, dim=400)

  0%|          | 0/3760 [00:00<?, ?it/s]

  0%|          | 0/6 [00:00<?, ?it/s]

DCG@   1: 0.509 | Hits@   1: 0.509
DCG@   5: 0.592 | Hits@   5: 0.662
DCG@  10: 0.610 | Hits@  10: 0.718
DCG@ 100: 0.648 | Hits@ 100: 0.898
DCG@ 500: 0.658 | Hits@ 500: 0.977
DCG@1000: 0.660 | Hits@1000: 1.000


#### Tokenizer=WordPunctTokenizer

We tried BPE tokenizers, both trained on our data and pretrained, and also a regex option (deleted since not interesting). The best one turned out to be the word-and-punctuation tokenizer, WordPunctTokenizer, with an additional stopword and punctuation filter.

#### vector_size=300

It is common practice to set the embedding size to 200-300. There has been plenty of research in this field, so we'll just follow the guidelines.

#### min_count=5

This parameter means that a word must appear at least 5 times in the dataset to be considered for training. Thus we filter out rare words.
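
The effect of such a threshold can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not gensim's internal code; the helper `filter_rare` and the toy sentences are hypothetical.

```python
from collections import Counter


def filter_rare(sentences, min_count):
    """Drop tokens that occur fewer than min_count times across all sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [[t for t in sentence if counts[t] >= min_count] for sentence in sentences]


toy_sentences = [["python", "list", "sort"], ["python", "list"], ["python", "segfault"]]
# "sort" and "segfault" occur only once, so they are removed with min_count=2.
print(filter_rare(toy_sentences, min_count=2))
```

Words removed this way get no vector at all, so at inference time they are skipped like any other out-of-vocabulary token.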

#### window=5

This parameter specifies the maximum distance between the current and the predicted word: `5 4 3 2 1 current 1 2 3 4 5`. In other words, it defines which words count as co-occurring. A larger value might slow down the computation, and words that are more than 5 positions apart are less likely to be related.

(Except for complex and compound sentences like:


"The **cat**, which was lazily resting on the green branch of an ancient and enormously overgrown tree, suddenly jumped down to the ground and started **chasing** after the colourful fluttering butterfly, disappearing into the garden shed."

Here *cat* and *started chasing* are related, but they are more than 20 words apart.)

However, we won't consider such cases.
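
To see which word pairs a given window produces, here is a simplified sketch that enumerates (center, context) pairs. The helper `context_pairs` is hypothetical, and the real training loop may sample or subsample pairs differently; the snippet only illustrates the notion of a window.

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


tokens = ["convert", "string", "list", "python"]
pairs_w1 = list(context_pairs(tokens, window=1))
pairs_w3 = list(context_pairs(tokens, window=3))
print(pairs_w1)
print(len(pairs_w1), len(pairs_w3))
```

A larger window produces more training pairs per sentence, which is where the extra computation comes from. Since StackOverflow titles are short, a window of 5 already covers most of a title.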



## Conclusion

So altogether the quality is not that good. Skip-gram Word2Vec (`sg=1`) is a bit better than CBOW, but in general the difference is not that drastic.

1. As the results above show, Word2Vec embeddings trained on this corpus, combined with WordPunctTokenizer plus stopword and punctuation removal, give the strongest result (DCG@1 of 0.491, versus 0.410 for the pretrained vectors with the same tokenizer).
However, it is worth noting that there is still a lot to improve, even in tokenization and dimensions.

2. Normalization/lemmatization won't really help; quite the contrary, it worsens the result due to information loss and overgeneralization.

3. Skip-gram with a window of 5 appeared to be the best trade-off in the experiments: a window of 5 words lets the model capture a reasonably large amount of local context around each word, and skip-gram normally performs a bit better, though slower, than CBOW. However, it's worth noting that maximizing the window could give a slightly better result at the price of computation speed. Tuning negative sampling (gensim's `negative` parameter) would be a good idea.

4. The chosen approach isn't the best given the complexity of the task: you can't solve it perfectly with simple cosine similarity of embeddings. There is much more to look into: more complex context (after all, Word2Vec doesn't remember the global context of the question), the specificity of the questions, and words with different meanings.

5. Word2Vec (or related embedding methods such as GloVe or FastText, with additions like n-grams and negative sampling) is a proper first step for the task; later we could use BERT, RoBERTa, GPT-2 or ruGPT. Transformers aside, we could use RNNs such as LSTM or Bi-LSTM for this task.