<a href="https://colab.research.google.com/github/Dimildizio/DS_course/blob/main/Neural_networks/NLP/Embeddings/Simple_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Embeddings

The task is to rank StackOverflow questions based on their semantic representations.

* $X$ - the set of objects
* $X^l = \{x_1, x_2, ..., x_l\}$ - the training set
* $i \prec j$ - an order relation on pairs of indices $i$, $j$ of objects in $X^l$



### Task:
Construct a ranking function $a$ : $X \rightarrow \mathbb{R}$ such that
$$i \prec j \Rightarrow a(x_i) < a(x_j)$$
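
A quick way to test this condition on a small sample is to check every ordered pair directly. The sketch below is illustrative only: `respects_order`, the toy objects, their scores, and the pair list are hypothetical and are not part of the original notebook.

```python
from typing import Callable, Iterable, Sequence, Tuple


def respects_order(
    a: Callable[[str], float],
    objects: Sequence[str],
    ordered_pairs: Iterable[Tuple[int, int]],
) -> bool:
    """Return True if a(x_i) < a(x_j) for every index pair (i, j) with i before j."""
    return all(a(objects[i]) < a(objects[j]) for i, j in ordered_pairs)


# Toy data, made up for illustration: three objects with the known order 0, 1, 2.
objects = ["x0", "x1", "x2"]
ordered_pairs = [(0, 1), (1, 2), (0, 2)]
toy_scores = {"x0": 0.1, "x1": 0.4, "x2": 0.9}

# The dictionary lookup plays the role of the ranking function a.
print(respects_order(toy_scores.get, objects, ordered_pairs))
```

Any scoring function that violates even one pair makes the check fail, which is a handy sanity test before computing ranking metrics.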

## Embeddings

### Download the pretrained embeddings

In [None]:
%%capture
!wget https://zenodo.org/record/1199620/files/SO_vectors_200.bin?download=1

### Imports

In [2]:
import numpy as np

from gensim.models.keyedvectors import KeyedVectors

### Create embeddings

In [3]:
wv_embeddings = KeyedVectors.load_word2vec_format("SO_vectors_200.bin?download=1", binary=True)

### Examples

In [12]:
word = 'dog'
if word in wv_embeddings:
    print(wv_embeddings[word].dtype, wv_embeddings[word].shape)
print(f"Num of words: {len(wv_embeddings.index_to_key)}")

float32 (200,)
Num of words: 1787145


### Question 1

Is 'cat' among the top 5 words most similar to 'dog'? If so, at which position?

In [64]:
print("'dog' and 'cat' similarity:")
print('\t', wv_embeddings.similarity('dog', 'cat'))
print('\t', wv_embeddings.n_similarity(['dog'], ['cat']))


'dog' and 'cat' similarity:
	 0.6852341
	 0.6852341
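
Both calls print the same number because, as far as we understand gensim's behaviour, `similarity` is the cosine similarity of the two word vectors, while `n_similarity` is the cosine similarity of the mean vectors of the two word lists; with one word per list the two coincide. The sketch below reproduces that computation with plain NumPy on made-up toy vectors. `cosine_similarity` and `mean_set_similarity` are hypothetical helper names, not gensim functions.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def mean_set_similarity(us, vs) -> float:
    """Cosine similarity between the mean vectors of two sets of vectors."""
    return cosine_similarity(np.mean(us, axis=0), np.mean(vs, axis=0))


# Toy vectors (made up), using the same dtype as the embeddings above.
u = np.array([1.0, 2.0, 0.0], dtype=np.float32)
v = np.array([2.0, 1.0, 0.0], dtype=np.float32)

print(cosine_similarity(u, v))
print(mean_set_similarity([u], [v]))  # single-element sets: same value as above
```

With the real model, passing `wv_embeddings['dog']` and `wv_embeddings['cat']` to `cosine_similarity` should give a value close to the one printed above.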


In [68]:
def check_occur(req: str, base: str, n: int = 5, flag: bool = True) -> list:
  """Report whether `req` is among the top-`n` words most similar to `base`.

  :param req: word to look for in the neighbour list
  :param base: word whose nearest neighbours are retrieved
  :param n: number of top neighbours to inspect
  :param flag: intended to select the similarity function; unused in this version
  :return: list of (word, similarity) tuples returned by `most_similar`
  """

  # most_similar_cosmul or similar_by_word could be used here instead
  result = wv_embeddings.most_similar(base, topn=n)
  for i, (word, perc) in enumerate(result):
    if req == word:
      # i is the 0-based position of `req` in the neighbour list of `base`
      print(f'{req} is {int(perc*100)}% similar to {base} at position {i}')
      return result
  print(f'{req} is not similar to {base}')
  return result

In [72]:
words = ('cat', 'cats', 'dog', 'dogs')
for requested in words:
  requested = requested.lower()
  for based in words:
    based = based.lower()
    if requested != based:
      result = check_occur(requested, based)
  print()

cat is not similar to cats
cat is not similar to dog
cat is not similar to dogs

cats is not similar to cat
cats is 76% similar to dog at position 3
cats is 90% similar to dogs at position 0

dog is 68% similar to cat at position 1
dog is 76% similar to cats at position 2
dog is 78% similar to dogs at position 3

dogs is not similar to cat
dogs is 90% similar to cats at position 0
dogs is 78% similar to dog at position 1



#### Answer:

We checked several similarity functions, **similar_by_word**, **most_similar_cosmul** and **most_similar**, and got the same ranking from each; only the similarity percentages differed. *(The ranks below are counted from 1, whereas the positions printed above are counted from 0.)*

**"cat" is not among the top 5 words most similar to "dog"**. However, "cats" ranks fourth among the words most similar to "dog" and first among those most similar to "dogs". In the other direction, "dog" ranks second among the words most similar to "cat" and third among those most similar to "cats", and "dogs" ranks first among the words most similar to "cats".
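
To avoid mixing 0-based list positions with 1-based ranks, the conversion can be wrapped in a small helper. This is a minimal sketch: `rank_of` is a hypothetical function, and the neighbour list is made-up toy data in the same `(word, similarity)` format that `most_similar` returns.

```python
from typing import List, Optional, Tuple


def rank_of(word: str, neighbours: List[Tuple[str, float]]) -> Optional[int]:
    """Return the 1-based rank of `word` in a neighbour list, or None if absent."""
    for position, (candidate, _score) in enumerate(neighbours):
        if candidate == word:
            return position + 1  # convert 0-based position to 1-based rank
    return None


# Toy neighbour list (made up), shaped like the output of most_similar.
toy_neighbours = [("alpha", 0.9), ("beta", 0.8), ("gamma", 0.7)]

print(rank_of("beta", toy_neighbours))   # 2
print(rank_of("delta", toy_neighbours))  # None
```

With the real model, `rank_of('cat', wv_embeddings.most_similar('dog', topn=5))` would answer Question 1 directly.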
