<a href="https://colab.research.google.com/github/Dimildizio/DS_course/blob/main/Neural_networks/NLP/Embeddings/Simple_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Embeddings

The task is to rank StackOverflow questions based on their semantic representations.

* $X$ - the set of objects
* $X^l = \{x_1, x_2, ..., x_l\}$ - the training set
* $i \prec j$ - an order relation on pairs of indices $i$, $j$ of objects in $X^l$



### Task:
Construct a ranking function $a$ : $X \rightarrow \mathbb{R}$ such that
$$i \prec j \Rightarrow a(x_i) < a(x_j)$$
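
As a minimal sketch of this condition, the snippet below checks whether a scoring function respects a set of ordered index pairs. The helper `respects_order`, the list `pairs` and the toy objects are hypothetical and are not part of the notebook.

```python
from typing import Callable, Iterable, Sequence, Tuple


def respects_order(a: Callable[[float], float],
                   xs: Sequence[float],
                   pairs: Iterable[Tuple[int, int]]) -> bool:
    """Return True if a(x_i) < a(x_j) for every pair (i, j) where i precedes j."""
    return all(a(xs[i]) < a(xs[j]) for i, j in pairs)


# Toy data (hypothetical): three objects with the order 0 before 1, 1 before 2.
xs = [0.2, 0.5, 0.9]
pairs = [(0, 1), (1, 2)]

print(respects_order(lambda x: 2 * x, xs, pairs))  # an increasing function keeps the order
print(respects_order(lambda x: -x, xs, pairs))     # a decreasing function violates it
```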

## Embeddings

### Download the pretrained embeddings

In [None]:
%%capture
!wget https://zenodo.org/record/1199620/files/SO_vectors_200.bin?download=1

### Imports

In [135]:
import numpy as np
import pandas as pd
import re


from gensim.models.keyedvectors import KeyedVectors
from nltk.tokenize import WordPunctTokenizer
from typing import List

### Create embeddings

In [3]:
wv_embeddings = KeyedVectors.load_word2vec_format("SO_vectors_200.bin?download=1", binary=True)

### Examples

In [12]:
word = 'dog'
if word in wv_embeddings:
    print(wv_embeddings[word].dtype, wv_embeddings[word].shape)
print(f"Num of words: {len(wv_embeddings.index_to_key)}")

float32 (200,)
Num of words: 1787145


### Question 1

Is 'cat' among the top 5 words most similar to 'dog'? If so, at which position?

In [64]:
print("'dog' and 'cat' similarity:")
print('\t', wv_embeddings.similarity('dog', 'cat'))
print('\t', wv_embeddings.n_similarity(['dog'], ['cat']))


'dog' and 'cat' similarity:
	 0.6852341
	 0.6852341
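
The score printed above is a cosine similarity: gensim's `similarity` returns the cosine between the two word vectors. A minimal sketch of that computation follows, using made-up 3-dimensional vectors in place of the real 200-dimensional embeddings. The helper `cosine_similarity` and the toy vectors are illustrative assumptions, not the notebook's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical), standing in for real word embeddings.
dog = [1.0, 2.0, 0.0]
cat = [2.0, 1.0, 0.0]
car = [0.0, 0.0, 3.0]

print(cosine_similarity(dog, cat))  # 4 / 5, approximately 0.8
print(cosine_similarity(dog, car))  # orthogonal vectors give 0.0
```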


In [68]:
def check_occur(req: str, base: str, n: int = 5, flag: bool = True) -> List[tuple]:
  """Checks whether one word is among the top-N words most similar to another

  :param req: word to look for in the similarity list
  :param base: word whose most similar words are retrieved
  :param n: number of top similar words to check
  :param flag: intended to select the similarity function; currently unused
  :return: list of (word, similarity) tuples for the top-N words most similar to base
  """

  # most_similar_cosmul or similar_by_word could be used here as well
  result = wv_embeddings.most_similar(base, topn=n)
  for i, (word, perc) in enumerate(result):
    if req == word:
      # i is the 0-based position in the similarity list
      print(f'{req} is {int(perc*100)}% similar to {base} at position {i}')
      return result
  print(f'{req} is not similar to {base}')
  return result

In [72]:
words = ('cat', 'cats', 'dog', 'dogs')
for requested in words:
  requested = requested.lower()
  for based in words:
    based = based.lower()
    if requested != based:
      result = check_occur(requested, based)
  print()

cat is not similar to cats
cat is not similar to dog
cat is not similar to dogs

cats is not similar to cat
cats is 76% similar to dog at position 3
cats is 90% similar to dogs at position 0

dog is 68% similar to cat at position 1
dog is 76% similar to cats at position 2
dog is 78% similar to dogs at position 3

dogs is not similar to cat
dogs is 90% similar to cats at position 0
dogs is 78% similar to dog at position 1



#### Q1Answer:

We checked several similarity functions, `similar_by_word`, `most_similar_cosmul` and `most_similar`, and got the same rankings (with different similarity scores). *(Below, words are ranked starting from 1, unlike the positions in the similarity list, which start from 0.)*

**"cat" is not among the top 5 words most similar to "dog"**. However, "cats" comes fourth for "dog" and first for "dogs". Conversely, "dog" is ranked second among the words most similar to "cat", third for "cats" and fourth for "dogs", while "dogs" is ranked first for "cats" and second for "dog".


## Vector representations

In [75]:
class MyTokenizer:
    def __init__(self):
        self.tokenizer = WordPunctTokenizer()

    def tokenize(self, text):
        return self.tokenizer.tokenize(text)

In [78]:
tokenizer = MyTokenizer()

In [87]:
def question_to_vec(question, embeddings=wv_embeddings, tokenizer=tokenizer, dim=200):
    """
    Embeds a question as the mean of its known word vectors

        :param question: str
        :param embeddings: word embeddings
        :param tokenizer: object with a tokenize(text) method
        :param dim: size of each word vector

        return: vector representation of the question (zeros if no token is known)
    """
    tokens = tokenizer.tokenize(question)
    # keep only the tokens that have an embedding
    vecs = [embeddings[token] for token in tokens if token in embeddings]

    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
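
To make the averaging concrete, here is a small stand-alone sketch of the same idea on a hypothetical 2-dimensional vocabulary. The names `toy_embeddings` and `toy_question_to_vec` are invented for illustration, the regular expression is only a rough stand-in for `WordPunctTokenizer`, and plain Python lists replace NumPy arrays so the snippet runs without the real embeddings.

```python
import re

# Hypothetical 2-dimensional embeddings for three words.
toy_embeddings = {
    "neural": [1.0, 0.0],
    "networks": [0.0, 1.0],
    "love": [1.0, 1.0],
}


def toy_question_to_vec(question, embeddings, dim=2):
    """Average the vectors of known tokens; return zeros if none are known."""
    tokens = re.findall(r"\w+|[^\w\s]+", question)  # words and punctuation runs
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    # component-wise mean over the known word vectors
    return [sum(component) / len(vecs) for component in zip(*vecs)]


# "I" has no embedding here, so only three tokens are averaged.
print(toy_question_to_vec("I love neural networks", toy_embeddings))
# No known tokens at all: the zero vector is returned.
print(toy_question_to_vec("???", toy_embeddings))
```

Note that unknown tokens are skipped rather than counted as zeros, so they do not pull the average towards the origin.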

### Question 2:

What is the third component (index 2) of the vector representation of `"I love neural networks"` (rounded to 2 decimal places)?

In [100]:
q2 = 'I love neural networks'
q2_embeds = question_to_vec(q2)
third_component = str(round(q2_embeds[2], 2))
print(f"Third component of '{q2}' is {third_component}")

Third component of 'I love neural networks' is -1.29


#### Q2 Answer:

Third component of `'I love neural networks'` is `-1.29`

## Text similarity

### Explanation

*We assume* that the cosine distance between duplicates is smaller than between randomly chosen sentences.



For each of $N$ questions, we'll **generate** $R$ **random negative examples** and **include the actual duplicates** as well. We'll rank $R + 1$ examples for each question using our model and **look at the position of the duplicate**. Ideally, we want the duplicate to be ranked first.
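
The ranking step described above can be sketched as follows, assuming the candidates are sorted by cosine similarity to the question vector. The helper `rank_of_duplicate` and all toy vectors are hypothetical; in the notebook the vectors would come from `question_to_vec`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_of_duplicate(query, candidates, dup_index):
    """Sort candidates by similarity to the query (most similar first)
    and return the 1-based position of the duplicate."""
    order = sorted(range(len(candidates)),
                   key=lambda i: cosine(query, candidates[i]),
                   reverse=True)
    return order.index(dup_index) + 1


# Toy setup (hypothetical): R = 3 negatives plus the duplicate at index 1.
candidates = [
    [0.0, 1.0, 0.0],   # negative
    [0.9, 0.1, 0.0],   # duplicate
    [0.5, 0.5, 0.0],   # negative
    [-1.0, 0.0, 0.0],  # negative
]

print(rank_of_duplicate([1.0, 0.0, 0.0], candidates, dup_index=1))  # duplicate ranked 1st
print(rank_of_duplicate([0.0, 1.0, 0.0], candidates, dup_index=1))  # duplicate ranked 3rd
```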

#### Hits@K
The first metric is the fraction of questions whose duplicate is ranked within the top $K$:

$$ \text{Hits@K} = \frac{1}{N}\sum_{i=1}^N \, [rank\_q_i^{'} \le K],$$
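* $[x < 0] \equiv \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases}$ - the indicator function
* $q_i$ - the $i$-th question
* $q_i^{'}$ - its duplicate
* $rank\_q_i^{'}$ - the position of the duplicate in the ranked list of candidates for $q_i$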
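
A small worked example of Hits@K, assuming a hypothetical list of duplicate ranks for $N = 4$ questions (the helper name `hits_at_k` is invented for this sketch):

```python
def hits_at_k(ranks, k):
    """Fraction of questions whose duplicate is ranked within the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


# Hypothetical duplicate ranks for N = 4 questions.
ranks = [1, 3, 10, 2]

print(hits_at_k(ranks, 1))   # only rank 1 counts: 1/4 = 0.25
print(hits_at_k(ranks, 5))   # ranks 1, 3 and 2 count: 3/4 = 0.75
print(hits_at_k(ranks, 10))  # every duplicate is within the top 10: 1.0
```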


#### DCG@K

Another metric will be a simplified DCG metric, which considers the order of elements in the list by multiplying the **relevance of an element** by a **weight equal to the inverse logarithm of its position**:

$$ \text{DCG@K} = \frac{1}{N} \sum_{i=1}^N\frac{1}{\log_2(1+rank\_q_i^{'})}\cdot[rank\_q_i^{'} \le K],$$




With this metric, the model is penalized when the duplicate appears lower in the ranked list (at a larger rank number); duplicates ranked below position $K$ contribute nothing.
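
A small worked example of DCG@K, assuming a hypothetical list of duplicate ranks for $N = 4$ questions (the helper name `dcg_at_k` is invented for this sketch):

```python
import math


def dcg_at_k(ranks, k):
    """Mean of 1 / log2(1 + rank) over all questions, counting only ranks <= k."""
    return sum(1 / math.log2(1 + rank) for rank in ranks if rank <= k) / len(ranks)


# Hypothetical duplicate ranks for N = 4 questions.
ranks = [1, 3, 7, 20]

# rank 1 -> 1/log2(2) = 1, rank 3 -> 1/log2(4) = 0.5, rank 7 -> 1/log2(8) = 1/3
print(dcg_at_k(ranks, 1))   # 1 / 4 = 0.25
print(dcg_at_k(ranks, 5))   # (1 + 0.5) / 4 = 0.375
print(dcg_at_k(ranks, 10))  # (1 + 0.5 + 1/3) / 4, about 0.458
```

Unlike Hits@K, a duplicate at rank 3 contributes only 0.5 instead of 1, so the score rewards placing duplicates near the top.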

### Question 3:

What is the maximum value of `Hits@47 - DCG@1`?

If the duplicate is always ranked within the top 47 positions, then Hits@47 = 1: the model always places the duplicate among the 47 candidates most similar to the input question.

If the duplicate is always ranked first, then DCG@1 = 1: no candidate is ever ranked above the duplicate.

So, to generalize:

**Hits@47** = 1: means all duplicates are in top47

**DCG@1** = 1: means all duplicates are ranked first

*To maximize the difference, we want Hits@47 = 1 (all duplicates within the top 47) and DCG@1 = 0 (no duplicate ranked first), which happens when every duplicate is ranked somewhere between positions 2 and 47:*
$$1-0=1$$

So **the answer is 1**, although I consider such a situation theoretical.

#### Q3 answer:

Max Hits@47 - DCG@1 = 1
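
As a sanity check of this answer, the sketch below evaluates the difference for a case where every duplicate sits at rank 2, and then brute-forces all rank combinations for a toy setting of two questions and ranks 1 to 50. The helpers `hits_at_k` and `dcg_at_k` and the toy sizes are assumptions made for this illustration.

```python
import math
from itertools import product


def hits_at_k(ranks, k):
    """Fraction of questions whose duplicate is ranked within the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


def dcg_at_k(ranks, k):
    """Mean of 1 / log2(1 + rank) over all questions, counting only ranks <= k."""
    return sum(1 / math.log2(1 + rank) for rank in ranks if rank <= k) / len(ranks)


# Every duplicate at rank 2: inside the top 47, but never first.
ranks = [2, 2, 2]
print(hits_at_k(ranks, 47) - dcg_at_k(ranks, 1))  # 1.0 - 0.0 = 1.0

# Brute force over all rank pairs for 2 questions with ranks 1..50 (toy sizes).
best = max(hits_at_k(r, 47) - dcg_at_k(r, 1)
           for r in product(range(1, 51), repeat=2))
print(best)  # 1.0
```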

### Question 4:
Find `DCG@10` if $rank\_q_i^{'} = 9$ (rounded to 1 decimal place).

Since we don't need a list here, we can do it in one line of code.

In [110]:
dcg = lambda rank, k: 1 / np.log2(1 + rank) if rank <= k else 0
dcg_result = round(dcg(9, 10), 1)
print(f"DCG@10 if rank q'_i is 9: {dcg_result}")

DCG@10 if rank q'_i is 9: 0.3


#### Q4 answer:

DCG@10 for $rank\_q_i^{'} = 9$ is 0.3
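
Spelling out the arithmetic for this single question ($N = 1$), as a quick check of the rounded value:

$$ \text{DCG@10} = \frac{1}{\log_2(1 + 9)} \cdot [9 \le 10] = \frac{1}{\log_2 10} \approx \frac{1}{3.3219} \approx 0.301 \approx 0.3 $$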

### HITS\_COUNT and DCG\_SCORE

Each function takes two arguments: $dup\_ranks$ and $k$.

$dup\_ranks$ is a list of the duplicates' ranks (their positions in the ranked lists).

In [126]:
def calc_dups_ranks(candidates_ranking: List, copy_answers: List) -> List:
    """
    Calculate the ranks of duplicates in the candidates ranking.
    For each index in a length of copy_answers list:
      1. Find the index of the copy answer in the corresponding candidates ranking sublist
      2. Append the rank to the dup_ranks list

    :param candidates_ranking: (list of lists) List of ranked candidate answers for each question.
    :param copy_answers: (list) List of duplicate answers.
    :returns: List of ranks of duplicates in the candidates ranking.
    """
    dup_ranks = []
    for i in range(len(copy_answers)):
        rank = candidates_ranking[i].index(copy_answers[i]) + 1
        dup_ranks.append(rank)
    return dup_ranks

In [122]:
def hits_count(dup_ranks: List, k: int) -> float:
    """
    Compute Hits@K metric.

    :param dup_ranks: (list) List of ranks of duplicates in the ranked list for each question.
    :param k: (int) Number of top-ranked items to consider.
    :returns: float Hits@K score.
    """
    hits = sum(1 for rank in dup_ranks if rank <= k) / len(dup_ranks)
    return hits

In [123]:
def dcg_score(dup_ranks: List, k: int) -> float:
    """
    Compute DCG@K metric.


    :param dup_ranks: (list) List of ranks of duplicates in the ranked list for each question.
    :param k: (int) Number of top-ranked items to consider.
    :returns: float DCG@K score.
    """
    dcg = sum(1 / np.log2(1 + rank) for rank in dup_ranks if rank <= k) / len(dup_ranks)
    return dcg

#### Test it

In [130]:
copy_answers = ["How does the catch keyword determine the type of exception that was thrown",]

candidates_ranking = [["How Can I Make These Links Rotate in PHP",
                       "How does the catch keyword determine the type of exception that was thrown",
                       "NSLog array description not memory address",
                       "PECL_HTTP not recognised php ubuntu"],]

dup_ranks = calc_dups_ranks(candidates_ranking, copy_answers)

print('HIT:', [hits_count(dup_ranks, k) for k in range(1, 5)])
print('DCG:', [round(dcg_score(dup_ranks, k), 5) for k in range(1, 5)])

HIT: [0.0, 1.0, 1.0, 1.0]
DCG: [0.0, 0.63093, 0.63093, 0.63093]


In [136]:
# correct_answers - metrics for different k's
correct_answers = pd.DataFrame([[0, 1, 1, 1], [0, 1 / (np.log2(3)), 1 / (np.log2(3)), 1 / (np.log2(3))]],
                               index=['HITS', 'DCG'], columns=range(1,5))
correct_answers

| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| HITS | 0 | 1.0 | 1.0 | 1.0 |
| DCG | 0 | 0.63093 | 0.63093 | 0.63093 |
