# Embedding

## Why Embedding?
As we know, machines can't handle text; they can only handle numbers. But how do we convert a word into numbers?

The most naive approach would be to take a list of all the words in your text and assign a number to each of them. It will work, but you can imagine that some problems will appear:
* How do you handle unknown words?
* Suppose your text contains `doctor`, `nurse`, and `candy`. `doctor` and `nurse` are strongly related, but `candy` is not. How can we make the machine understand that? With our naive technique, `doctor` could be assigned the number `5` and `nurse` the number `98767`.

Of course, a lot of people have already spent time on these problems. The solution that came out of it is called "embedding".

## What are embeddings?

An embedding is a **VECTOR** which represents a word or a document.

A vector will be attributed to each token. Each vector will contain multiple dimensions (usually tens or hundreds of dimensions).

```
[...] associate with each word in the vocabulary a distributed word feature vector [...] The feature vector represents different aspects of the word: each word is associated with a point in a vector space. The number of features [...] is much smaller than the size of the vocabulary.
```
- [A Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf), 2003.

Long story short, embeddings convert words into vectors in a way that allows the machine to measure the similarity between them.

Each embedding library has its own way of placing words in the vector space. As a simplification, you can think of each dimension as a broad category, with each word getting a score for each category.

To take a simple example, the word `mother` could be scored like this:

|        | female | family | human | animal|
|--------|--------|---------|-------|-------|
| mother | 0.9    | 0.9     | 0.7   | 0.1   |

**Explanations:** Mother has a strong similarity with female, family and human but it has a low similarity with animal.

**Disclaimer:** These numbers and categories are totally arbitrary and are only here to show an example.

Here is another example with more complete data:
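
As a rough sketch of how such scores let a machine compare words, the snippet below treats each row of scores as a vector and compares vectors with cosine similarity (covered in the "Word similarity" section further down). Only the `mother` row comes from the table; the `father` and `dog` rows are invented for this example.

```python
import math

# Dimensions, in order: female, family, human, animal
toy_embeddings = {
    "mother": [0.9, 0.9, 0.7, 0.1],   # scores from the table above
    "father": [0.1, 0.9, 0.7, 0.1],   # made up for this example
    "dog":    [0.5, 0.3, 0.0, 0.9],   # made up for this example
}

def cosine(u, v):
    """Cosine of the angle between two equal-length lists of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

for other in ("father", "dog"):
    score = cosine(toy_embeddings["mother"], toy_embeddings[other])
    print(f"mother vs {other}: {score:.2f}")
```

With these toy numbers, `mother` scores higher against `father` than against `dog`, which is the kind of relationship the naive integer IDs could not express.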

![embedding](https://miro.medium.com/max/2598/1*sAJdxEsDjsPMioHyzlN3_A.png)

## Should I do it by hand?

You could, but if some people already did the job for you and spent a lot of time to optimize it, why not use it?

## What to use?

There are a lot of libraries out there for embeddings. Which one is the best? Once again, *it depends*. The results will change depending on the text you are using, the information you want to extract, the model you use, and so on.

Choosing the "best" embedding model will be part of the hyper-optimization that you can do at the end of a project.

If you want to understand embeddings in more depth, [follow this link](http://jalammar.github.io/illustrated-word2vec/) or watch this [video](https://www.youtube.com/watch?v=gQddtTdmG_8).

Here are some popular options:

* [Gensim](https://pypi.org/project/gensim/)
* [Word2Vec](https://www.tensorflow.org/tutorials/text/word2vec)

This next bit of code loads a model for practice. If you get an error, it may be due to the Python version (3.12.3 works); make sure you create a venv for that.

In [49]:
import os
import gensim.downloader as api
from gensim.models import KeyedVectors
import numpy as np

# Name of the pretrained model and the local path where it is cached
model_name = "glove-wiki-gigaword-300"
model_path = os.path.join("data", model_name + ".kv")

# Load the model from disk if it exists, otherwise download and save it
if os.path.exists(model_path):
    print("Loading model from local file...")
    model = KeyedVectors.load(model_path)
else:
    print("Downloading model...")
    model = api.load(model_name)
    os.makedirs("data", exist_ok=True)
    model.save(model_path)
    print("Model downloaded and saved.")

In [50]:
import os
import gensim.downloader as api
from gensim.models import KeyedVectors
import math
import numpy as np
model_path = "glove-wiki-gigaword-300.kv"
model = KeyedVectors.load("data/"+model_path)

## Practice time!

Enough reading, let's practice a bit on this sentence:

In [51]:
sentence = "I love learning"

What do the word vectors look like? What is their size? What is their [magnitude](https://numpy.org/doc/2.1/reference/generated/numpy.linalg.norm.html)?

In [52]:
# You can get the vectors by using the model like a dictionary
# Split the sentence into words
tokens = sentence.lower().split()
tokens

['i', 'love', 'learning']

In [53]:
# Get the vectors (as from a dictionary)
words = [w for w in tokens if w in model]
vectors = {word: model[word] for word in tokens if word in model}
vectors

{'i': array([-1.3292e-01,  1.6985e-01, -1.4360e-01, -8.8722e-02,  7.9510e-02,
        -1.4212e-01, -2.4209e-02, -2.6291e-01, -7.4814e-02, -2.3600e+00,
         3.4830e-01, -9.1722e-02, -5.3906e-02,  3.0418e-01, -1.3286e-01,
         5.0341e-03, -1.5056e-01,  2.3562e-03,  6.8321e-02,  3.4246e-01,
         3.9891e-01,  5.8813e-01,  6.0618e-02, -1.9871e-01, -4.0465e-01,
        -1.0706e-01, -5.9312e-03, -6.4842e-01,  1.9080e-01, -1.7630e-01,
         9.2407e-02,  3.8685e-01, -3.1085e-01, -3.2574e-01, -1.6823e+00,
         2.5336e-01, -2.4647e-01, -1.0874e-01,  7.6402e-03,  3.3880e-01,
        -5.9736e-02, -8.5940e-01, -8.0964e-02, -2.2981e-01,  1.7709e-01,
         8.2094e-02,  7.4416e-01,  3.6873e-01,  1.3740e-01,  2.9408e-01,
         1.0647e-01, -1.3246e-01,  1.2134e-01, -1.4273e-01, -5.3270e-01,
         6.4936e-01,  4.9657e-01,  3.0029e-01,  6.7226e-01,  1.8005e-01,
         8.8050e-01,  3.8144e-02, -8.7140e-02,  7.6400e-01, -1.2107e-01,
        -4.2809e-01, -1.2588e-01,  8.8377e-04,

In [54]:
# Example: truncated view of a vector (first 10 elements)
vectors["love"][:10]

array([-0.45205 , -0.33122 , -0.063607,  0.028325, -0.21372 ,  0.16839 ,
       -0.017186,  0.047309, -0.052355, -0.98706 ], dtype=float32)

In [55]:
# Dimensionality of a single vector (v(word) ∈ ℝ³⁰⁰)
model.vector_size

300

In [56]:
# Shape of the stacked vectors (for the whole sentence)
X = np.vstack([vectors[w] for w in vectors])
X.shape

(3, 300)

In [57]:
# Magnitude (length) of the vectors
# Magnitude = L2 norm:
for w, v in vectors.items():
    print(w, np.linalg.norm(v))

i 6.9177027
love 6.1360564
learning 5.955531
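
The magnitudes above are well above 1, so these vectors are not unit-length. As a small sketch of what normalizing a vector means, the snippet below divides a vector by its L2 norm. The 3-dimensional vector is made up and merely stands in for a real 300-dimensional embedding.

```python
import numpy as np

# A made-up 3-dimensional vector standing in for a real embedding
v = np.array([3.0, 4.0, 12.0], dtype=np.float32)

magnitude = np.linalg.norm(v)   # L2 norm: sqrt(3^2 + 4^2 + 12^2) = 13
unit_v = v / magnitude          # same direction, length 1

print("magnitude:", magnitude)
print("magnitude after normalization:", np.linalg.norm(unit_v))
```

Normalizing keeps the direction of the vector and discards its length, which is why the dot product of two unit vectors equals their cosine similarity.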


In [58]:
# Precompute the L2 norm of every vector in the model (the vectors themselves are not modified)
model.fill_norms()
model

<gensim.models.keyedvectors.KeyedVectors at 0x26f1d3b1ba0>

Summary:
- "i"        → [300 float32] → ‖v‖ ≈ 6.9
- "love"     → [300 float32] → ‖v‖ ≈ 6.1
- "learning" → [300 float32] → ‖v‖ ≈ 6.0

## Maths on text

Since the words are embedded into vectors, we can now apply mathematical methods to them.

### Average vector

For example we could build the average vector for a text by using NumPy! This is a straightforward way to build one single representation for a text.

- Apply a gensim model on the text
- Get all word vectors into a list
- Compute and display the average vector of the list
- Get its representation using the gensim `most_similar` method

In [59]:
text = "I want to be a famous data scientist"
tokens = text.lower().split()
tokens

['i', 'want', 'to', 'be', 'a', 'famous', 'data', 'scientist']

In [60]:
vectors = {word: model[word] for word in tokens if word in model}
X = np.vstack([vectors[w] for w in vectors])
X.shape

(8, 300)

In [61]:
# The average vector, first with an explicit loop over the word vectors...
sum_vector = np.zeros(model.vector_size, dtype=np.float32)
for v in vectors.values():
    sum_vector += v
mean_vector = sum_vector / len(vectors)

# ...then with NumPy directly (same result up to floating-point rounding)
mean_vector = np.mean(list(vectors.values()), axis=0)

In [62]:
words = [w for w in tokens if w in model]

vectors = [model[w] for w in words]
mean_vector = np.mean(vectors, axis=0)

similar = model.similar_by_vector(mean_vector, topn=10)

print("Words:", words)
print("Mean vector shape:", mean_vector.shape)
print("Most similar words:")
for w, s in similar:
    print(f"{w}: {s:.3f}")

Words: ['i', 'want', 'to', 'be', 'a', 'famous', 'data', 'scientist']
Mean vector shape: (300,)
Most similar words:
so: 0.783
not: 0.772
you: 0.767
n't: 0.765
this: 0.764
what: 0.764
could: 0.762
if: 0.753
be: 0.753
want: 0.750


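This does not work very well in practice: the nearest neighbors are mostly frequent function words (`so`, `not`, `you`) and say little about being a famous data scientist. Still, let us explore further.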

### Word similarity

We can also compare two words by using similarity or distance measures (e.g. [cosine similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html), [euclidean distance](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html)...). These measures compare word embeddings in the vector space.

Identify the fundamental difference between these two metrics when it comes to assessing similarity between vectors.
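
One way to see the difference is to compare vectors that point in the same direction but have different lengths. The sketch below uses three made-up 3-dimensional vectors; the helper names `cosine_sim` and `euclid` are chosen for this example only.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = 10 * u                      # same direction as u, ten times longer
w = np.array([3.0, 2.0, 1.0])   # same length as u, different direction

def cosine_sim(a, b):
    """Cosine of the angle between a and b (ignores vector length)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclid(a, b):
    """Straight-line distance between a and b (depends on vector length)."""
    return np.linalg.norm(a - b)

print("u vs v: cosine =", round(cosine_sim(u, v), 3), " euclidean =", round(euclid(u, v), 3))
print("u vs w: cosine =", round(cosine_sim(u, w), 3), " euclidean =", round(euclid(u, w), 3))
```

Cosine similarity treats `u` and `v` as identical because it only looks at the angle, while the euclidean distance considers `w` much closer to `u` because it also takes the magnitudes into account.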

#### Let's practice!

- Compute the cosine similarity and the euclidean distance between these 4 words, and visualize each as a table with matplotlib and/or seaborn
- Assess which words are the most similar and the most dissimilar

In [63]:
words = ["computer","keyboard","water","ocean"]
vectors = {w: model[w] for w in words if w in model}
vectors.keys()

dict_keys(['computer', 'keyboard', 'water', 'ocean'])

Cosine similarity: cos(u,v) = (u⋅v) / (∥u∥⋅∥v∥)

In [64]:
def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

In [None]:
# Cosine similarity matrix
import pandas as pd
cosine_matrix = pd.DataFrame(
    index=words,
    columns=words,
    data=[
        [cosine_similarity(vectors[w1], vectors[w2]) for w2 in words]
        for w1 in words
    ]
)
print(cosine_matrix.round(3))

          computer  keyboard  water  ocean
computer     1.000     0.441  0.124  0.143
keyboard     0.441     1.000  0.054  0.075
water        0.124     0.054  1.000  0.477
ocean        0.143     0.075  0.477  1.000


In [None]:
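# Description:
## range: [-1, 1]
## closer in meaning → higher value
## cosine compares direction only, which usually suits semantic similarity better

# Conclusions:
## computer ↔ keyboard: close (technology)
## water ↔ ocean: close (nature)
## keyboard ↔ water: the most dissimilar pair (0.054)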


Euclidean distance:
d(u,v)=∥u−v∥

In [66]:
def euclidean_distance(u, v):
    return np.linalg.norm(u - v)

In [None]:
# Euclidean distance matrix
euclidean_matrix = pd.DataFrame(
    index=words,
    columns=words,
    data=[
        [euclidean_distance(vectors[w1], vectors[w2]) for w2 in words]
        for w1 in words
    ]
)
print(euclidean_matrix.round(3))

          computer  keyboard  water  ocean
computer     0.000     7.536  9.546  9.045
keyboard     7.536     0.000  9.845  9.316
water        9.546     9.845  0.000  7.107
ocean        9.045     9.316  7.107  0.000


In [None]:
# Description:
## smaller distance → closer in meaning
## range: [0, +∞)
## euclidean distance is sensitive to the scale of the vectors

# Conclusions:
## computer ↔ keyboard: close (technology)
## water ↔ ocean: close (nature)
## keyboard ↔ water: the most distant pair (9.845)
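
The exercise also asks for a visualization. Below is one possible sketch using matplotlib's `imshow`; the values are copied by hand from the cosine similarity table printed above so that the snippet runs without the model. The variable names, colormap and figure size are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

words = ["computer", "keyboard", "water", "ocean"]

# Values copied from the cosine similarity table printed earlier
cosine_values = np.array([
    [1.000, 0.441, 0.124, 0.143],
    [0.441, 1.000, 0.054, 0.075],
    [0.124, 0.054, 1.000, 0.477],
    [0.143, 0.075, 0.477, 1.000],
])

fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(cosine_values, cmap="viridis", vmin=0, vmax=1)

# Label both axes with the words
ax.set_xticks(range(len(words)))
ax.set_xticklabels(words, rotation=45, ha="right")
ax.set_yticks(range(len(words)))
ax.set_yticklabels(words)

# Write the value inside each cell
for i in range(len(words)):
    for j in range(len(words)):
        ax.text(j, i, f"{cosine_values[i, j]:.2f}", ha="center", va="center", color="white")

fig.colorbar(image, ax=ax, label="cosine similarity")
ax.set_title("Cosine similarity between word vectors")
fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show() here.
```

The same code works for the euclidean matrix if you swap in its values and drop the `vmin`/`vmax` limits, since distances are not bounded by 1.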

## Combining things together

This next bit of code uses the gensim library to allow you to perform arithmetic operations on vectors. Things you may want to try:

Silly additions:
 - man + hair

Checking for some more abstractions:
 - hair - woman + man
 - mice - home + city
 - children - child + goose
 - paris - france + belgium
 - triceratops - deer + wolf

Bonus points if you can make a function which handles any combination of additions and subtractions on word vectors.

In [68]:
equals=model.most_similar(positive=['king', 'woman'], negative=['man'])[0][0]
print(f"'king' - 'man' + 'woman' = '{equals}'")

#Your code here

'king' - 'man' + 'woman' = 'queen'
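
To build some intuition for why this arithmetic can work, here is a toy sketch with hand-made 2-dimensional vectors whose dimensions stand for "royalty" and "femaleness". All values, the `toy` dictionary and the `nearest` helper are invented for illustration; real embeddings are learned and their dimensions are not this interpretable.

```python
import numpy as np

# Hand-made 2-dimensional "embeddings": (royalty, femaleness). Purely illustrative.
toy = {
    "king":   np.array([0.9, 0.1]),
    "queen":  np.array([0.9, 0.9]),
    "man":    np.array([0.1, 0.1]),
    "woman":  np.array([0.1, 0.9]),
    "prince": np.array([0.8, 0.1]),
}

def nearest(vector, exclude=()):
    """Return the toy word with the highest cosine similarity to `vector`."""
    best_word, best_score = None, -np.inf
    for word, candidate in toy.items():
        if word in exclude:
            continue
        score = np.dot(vector, candidate) / (
            np.linalg.norm(vector) * np.linalg.norm(candidate)
        )
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Remove the "male" part of king and add the "female" part of woman
target = toy["king"] - toy["man"] + toy["woman"]
print(nearest(target, exclude=("king", "man", "woman")))
```

The input words are excluded from the search, which mirrors what gensim's `most_similar` does with the words passed as `positive` and `negative`.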


In [72]:
def word_vector_expression(expr, model):
    """Evaluate a word-vector expression built from words, `+` and `-`.

    Example: word_vector_expression("king - man + woman", model)

    Args:
        expr (str): expression whose words and operators are separated by spaces
        model (KeyedVectors): loaded word-vector model
    Returns:
        np.ndarray: the resulting vector
    """
    tokens = expr.lower().split()

    if not tokens:
        raise ValueError("The expression is empty")

    # Start from the zero vector so that a leading "-" is handled as well
    result_vector = np.zeros(model.vector_size, dtype=np.float32)
    current_op = "+"  # operation applied to the next word

    for token in tokens:
        if token in ("+", "-"):
            current_op = token
        else:
            if token not in model:
                raise ValueError(f"The word '{token}' is not in the model")

            vec = model[token]
            if current_op == "+":
                result_vector += vec
            else:
                result_vector -= vec
    return result_vector

In [77]:
# Check:
expr_1 = "hair - woman + man"
expr_2 = "mice - home + city"
expr_3 = "children - child + goose"
expr_4 = "paris - france + belgium"
expr_5 = "triceratops - deer + wolf"

vec_1 = word_vector_expression(expr_1, model)
vec_2 = word_vector_expression(expr_2, model)
vec_3 = word_vector_expression(expr_3, model)
vec_4 = word_vector_expression(expr_4, model)
vec_5 = word_vector_expression(expr_5, model)

# Display the first result (hair - woman + man)
vec_1

array([ 4.25400138e-02, -1.17890000e-01, -7.89985061e-04, -2.15880007e-01,
       -1.63714901e-01,  3.54744017e-01, -3.25149782e-02,  6.18579268e-01,
        7.81599998e-01, -1.43030000e+00,  1.69982016e-01,  2.52020001e-01,
       -5.42320251e-01,  4.44274992e-01, -1.30129993e-01, -5.55970073e-02,
        5.18780053e-01, -4.24700975e-03, -4.23150033e-01,  2.13943005e-01,
       -2.04540968e-01,  4.02459979e-01, -7.86240026e-02,  6.78250015e-01,
       -3.14300001e-01,  3.14165980e-01, -2.19610333e-02, -5.48509955e-01,
       -1.20950937e-02,  4.59619999e-01,  4.30344999e-01,  2.86617011e-01,
       -3.48199964e-01,  1.76498994e-01, -7.07549989e-01,  3.28399912e-02,
        6.03106022e-01, -2.17429996e-01, -7.45769978e-01, -1.54792011e-01,
       -5.42690039e-01, -4.08749968e-01, -1.25423998e-01, -8.10353041e-01,
        4.85696018e-01, -3.14729989e-01, -3.00839961e-01, -7.99279988e-01,
        4.47989970e-01, -3.70359987e-01, -4.39957976e-01, -4.36357975e-01,
       -3.71615022e-01,  

In [76]:
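# Look at the nearest words for king - man + woman
# (unlike most_similar, similar_by_vector does not exclude the input words, so 'king' comes first)
vec = word_vector_expression("king - man + woman", model)
similar = model.similar_by_vector(vec, topn=5)
print(similar)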

[('king', 0.8065858483314514), ('queen', 0.689616322517395), ('monarch', 0.5575491189956665), ('throne', 0.5565375089645386), ('princess', 0.5518684387207031)]


When you play with these examples (or others), you quickly notice both the powerful levels of abstraction and the gaping limitations.

## More resources
* [Why do we use word embeddings in NLP?](https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2)
* [More details on what word embeddings are exactly?](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/)