# Embedding

## Why Embedding?
As we know, machines can't handle text; they can only handle numbers. But how do we convert a word to numbers?

The most naive approach would be to take a list of all the words in your text and assign a number to each of them. It would work, but you can imagine that some problems will appear:
* How do you handle unknown words?
* Suppose your text contains `doctor`, `nurse`, and `candy`. `doctor` and `nurse` are strongly related, but `candy` is not. How can we make the machine understand that? With our naive technique, `doctor` could be assigned the number `5` and `nurse` the number `98767`.
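
To make those two problems concrete, here is a minimal sketch of the naive approach. The example text and the variable names are made up for illustration; the point is only that the IDs depend on the order of first appearance and carry no meaning.

```python
# Naive approach: give every distinct word an arbitrary integer ID.
# The text below is a made-up example.
text = "the doctor called the nurse about the candy"

vocabulary = {}
for word in text.split():
    if word not in vocabulary:
        # The ID is just the order in which the word first appears
        vocabulary[word] = len(vocabulary)

print(vocabulary)

# Problem 1: a word that was never seen has no number at all
unknown_id = vocabulary.get("surgeon")
print(unknown_id)

# Problem 2: the gap between IDs says nothing about meaning
doctor_nurse_gap = abs(vocabulary["doctor"] - vocabulary["nurse"])
nurse_candy_gap = abs(vocabulary["nurse"] - vocabulary["candy"])
print(doctor_nurse_gap, nurse_candy_gap)
```

Here `doctor`/`nurse` and `nurse`/`candy` end up equally far apart, even though only the first pair is related.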

Of course, a lot of people have already spent time on those problems. The solution that came out of it is "embedding".

## What is an embedding?

An embedding is a **VECTOR** which represents a word or a document.

A vector is assigned to each token, and each vector has multiple dimensions (usually tens or hundreds).

```
[...] associate with each word in the vocabulary a distributed word feature vector [...] The feature vector represents different aspects of the word: each word is associated with a point in a vector space. The number of features [...] is much smaller than the size of the vocabulary.
```
- [A Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf), 2003.

Long story short, embeddings convert words into vectors in a way that allows the machine to understand the similarity between them.

Each embedding library has its own way of representing words. To simplify, you can picture each dimension of the vector as a broad category, with each word getting a score for each category.

To take a simple example, the word `mother` could be scored like this:

|        | female | family | human | animal |
|--------|--------|--------|-------|--------|
| mother | 0.9    | 0.9    | 0.7   | 0.1    |

**Explanation:** `mother` has a strong similarity with female, family and human, but a low similarity with animal.

**Disclaimer:** Those numbers and categories are totally arbitrary and are only here to show an example.
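
As a rough sketch of why such scores are useful, the snippet below compares toy vectors built on the same four categories. The scores for `father` and `dog` are invented here in the same arbitrary spirit as the table, and `cosine_similarity` is a small helper written for this example, not a library function.

```python
import math

# Hypothetical scores on the axes (female, family, human, animal).
# Only the `mother` row comes from the table; the others are made up.
toy_embeddings = {
    "mother": [0.9, 0.9, 0.7, 0.1],
    "father": [0.1, 0.9, 0.7, 0.1],
    "dog": [0.5, 0.3, 0.0, 0.9],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


mother_father = cosine_similarity(toy_embeddings["mother"], toy_embeddings["father"])
mother_dog = cosine_similarity(toy_embeddings["mother"], toy_embeddings["dog"])

# With these toy scores, `mother` is closer to `father` than to `dog`
print(mother_father > mother_dog)
```

Real embeddings work the same way, except that the dimensions are learned from data and do not have readable names like "family" or "animal".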

Here is another example with more complete data:

![embedding](https://miro.medium.com/max/2598/1*sAJdxEsDjsPMioHyzlN3_A.png)

## Should I do it by hand?

You could, but if other people have already done the job for you and spent a lot of time optimizing it, why not use their work?

## What to use?

There are a lot of libraries out there for embeddings. Which one is the best? Once again, *it depends*. The results will change depending on the text you are using, the information you want to extract, the model you use, and so on.

Choosing the "best" embedding model is part of the tuning you can do at the end of a project, much like hyperparameter optimization.

If you want to understand embeddings in more depth, [follow this link](http://jalammar.github.io/illustrated-word2vec/).

Here are some of the best libraries of the moment:

* [Flair](https://github.com/flairNLP/flair) (Humboldt University of Berlin)
* [fastText](https://fasttext.cc/) (Facebook)
* [GloVe](https://github.com/stanfordnlp/GloVe) (Stanford)

And the oldest way of doing it (but still good):
* [Word2Vec](https://www.tensorflow.org/tutorials/text/word2vec)

## Practice time!

Enough reading, let's practice a bit. Can you use spaCy to embed this sentence?
Read the [spaCy embedding documentation](https://spacy.io/usage/vectors-similarity).

```python
import spacy

# Requires the model: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")
sentence = "I love learning"

# Embed `sentence` with spaCy
tokens = nlp(sentence)

# Collect the vector of each token
tokens_vectored_list = [token.vector for token in tokens]
print(tokens_vectored_list)
```

```
[array([ -1.8607  ,   0.15804 ,  -4.1425  ,  -8.6359  , -16.955   ,
         1.157   ,  -1.588   ,   5.6609  , -12.03    ,  16.417   ,
         4.1907  ,   5.5122  ,  -0.11932 ,  -6.06    ,   3.8957  ,
        -7.8212  ,   3.6736  , -14.824   ,  -7.6638  ,   2.5344  ,
         7.9893  ,   3.6785  ,   4.3296  , -11.338   ,  -3.5506  ,
        -5.899   ,   1.0998  ,   3.4515  ,  -5.4191  ,   1.8356  ,
        -2.902   ,  -7.9294  ,  -1.1269  ,   8.4124  ,   5.1416  ,
        -3.1489  ,  -4.2061  ,  -1.459   ,   7.8313  ,   0.27859 ,
        -4.3832  ,   8.0756  ,  -0.94784 ,  -6.1214  ,   8.2792  ,
         5.0529  ,  -8.3611  ,  -6.0743  ,  -0.53773 ,   2.7538  ,
         3.8162  ,  -4.1612  ,   0.7591  ,  -2.8374  ,  -6.4851  ,
        -3.3435  ,   3.2703  ,   2.759   ,   2.6645  ,   4.0013  ,
        13.381   ,  -5.2907  ,  -3.133   ,   4.5374  , -11.899   ,
        -6.716   ,  -0.041939,  -2.0879  ,   3.0101  ,  10.3     ,
         2.6835  ,   2.7265  ,   8.3018  ,  -4.4563  ,  14.43
```

What is the shape of each word's vector?

```python
# Print each token with its vector and the shape of that vector
for token in tokens:
    print(f"Word: {token}, Vector: {token.vector}, Shape: {token.vector.shape}")
```

```
Word: I, Vector: [ -1.8607     0.15804   -4.1425    -8.6359   -16.955      1.157
  -1.588      5.6609   -12.03      16.417      4.1907     5.5122
  -0.11932   -6.06       3.8957    -7.8212     3.6736   -14.824
  -7.6638     2.5344     7.9893     3.6785     4.3296   -11.338
  -3.5506    -5.899      1.0998     3.4515    -5.4191     1.8356
  -2.902     -7.9294    -1.1269     8.4124     5.1416    -3.1489
  -4.2061    -1.459      7.8313     0.27859   -4.3832     8.0756
  -0.94784   -6.1214     8.2792     5.0529    -8.3611    -6.0743
  -0.53773    2.7538     3.8162    -4.1612     0.7591    -2.8374
  -6.4851    -3.3435     3.2703     2.759      2.6645     4.0013
  13.381     -5.2907    -3.133      4.5374   -11.899     -6.716
  -0.041939  -2.0879     3.0101    10.3        2.6835     2.7265
   8.3018    -4.4563    14.43       3.9642    -4.8287    -5.648
  -7.2597   -11.475     -2.6171     0.3325    14.454     -5.155
   0.93722   -2.6187    -1.783      3.8711     1.4681    -6.705
  -4.0953    -0
```

Now try with Flair and GloVe (you will find out how [here](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_3_WORD_EMBEDDING.md)).

```python
from flair.embeddings import WordEmbeddings
from flair.data import Sentence

# Embed with Flair
text = "I love learning"

# init embedding
glove_embedding = WordEmbeddings('glove')

# create sentence.
sentence = Sentence(text)

# embed a sentence using glove.
glove_embedding.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)
```

```
Token[0]: "I"
tensor([-0.0465,  0.6197,  0.5665, -0.4658, -1.1890,  0.4460,  0.0660,  0.3191,
         0.1468, -0.2212,  0.7924,  0.2991,  0.1607,  0.0253,  0.1868, -0.3100,
        -0.2811,  0.6051, -1.0654,  0.5248,  0.0642,  1.0358, -0.4078, -0.3801,
         0.3080,  0.5996, -0.2699, -0.7603,  0.9422, -0.4692, -0.1828,  0.9065,
         0.7967,  0.2482,  0.2571,  0.6232, -0.4477,  0.6536,  0.7690, -0.5123,
        -0.4433, -0.2187,  0.3837, -1.1483, -0.9440, -0.1506,  0.3001, -0.5781,
         0.2017, -1.6591, -0.0792,  0.0264,  0.2205,  0.9971, -0.5754, -2.7266,
         0.3145,  0.7052,  1.4381,  0.9913,  0.1398,  1.3474, -1.1753,  0.0040,
         1.0298,  0.0646,  0.9089,  0.8287, -0.4700, -0.1058,  0.5916, -0.4221,
         0.5733, -0.5411,  0.1077,  0.3978, -0.0487,  0.0646, -0.6144, -0.2860,
         0.5067, -0.4976, -0.8157,  0.1641, -1.9630, -0.2669, -0.3759, -0.9585,
        -0.8584, -0.7158, -0.3234, -0.4312,  0.4139,  0.2837, -0.7093,  0.1500,
        -0.2154, -0.3762, 
```

What is the shape of each word's vector?

```python
# Print the shape of each token's embedding
for token in sentence:
    print(f"Word: {token}, shape of word vector: {token.embedding.shape}")
```

```
Word: Token[0]: "I", shape of word vector: torch.Size([100])
Word: Token[1]: "love", shape of word vector: torch.Size([100])
Word: Token[2]: "learning", shape of word vector: torch.Size([100])
```


Your text is now embedded, so your model will be able to work with it. Yeah!

## Maths on text

Since the words are embedded into vectors, we can now apply mathematical methods to them.

### Average vector

For example we could build the average vector for a text by using NumPy! This is a straightforward way to build one single representation for a text.

```python
import spacy
import numpy as np

text = "I want to be a famous data scientist"

# Apply a spacy model on the text
nlp = spacy.load("en_core_web_lg")
tokens = nlp(text)

# Get all word vectors into a list
vector_list = [token.vector for token in tokens]

# Compute and display the average vector of the text
average_text_vector = np.mean(vector_list, axis=0)
print(f"Average Vector: {average_text_vector}")
```

```
Average Vector: [-1.39259255e+00  4.43511915e+00 -2.87938738e+00 -1.46659994e+00
 -2.49136314e-01  3.17635000e-01  3.21812510e-01  5.20731592e+00
 -1.49731112e+00  2.29289985e+00  8.17859936e+00  2.53242493e+00
 -3.80331492e+00 -1.94080308e-01  2.12294507e+00  5.62517643e-01
  3.66180110e+00 -3.78243327e+00 -5.52819538e+00 -2.37863755e+00
  3.47129107e+00 -9.34405386e-01 -2.44463801e+00 -2.58583379e+00
 -2.49281645e+00 -3.74811745e+00 -2.65185761e+00 -2.05591232e-01
 -3.03205299e+00  9.44486260e-02  1.37055504e+00 -8.00651193e-01
 -2.86574006e+00  1.37557483e+00  8.98500085e-02 -1.28024876e+00
 -7.56257534e-01 -1.58593631e+00  5.12072420e+00  1.32494879e+00
 -2.21358752e+00  2.15078497e+00  1.60805118e+00 -3.99998128e-02
 -1.08419502e+00  9.56671178e-01  2.91092491e+00 -4.15709496e+00
 -1.64297867e+00  1.94995987e+00 -2.10612342e-02  2.99287438e-02
  2.17419982e-01 -6.81680346e+00 -2.71650863e+00  2.75750458e-03
 -1.69070005e-01  1.74406245e-01  1.23772871e+00 -9.59155202e-01
  3.64199
```
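
To see what `np.mean(vector_list, axis=0)` computes, here is a small library-free sketch on made-up vectors. The three 4-dimensional vectors are hypothetical stand-ins for real word vectors; the averaging is done one dimension at a time, so the result has the same number of dimensions as each input.

```python
# Three hypothetical 4-dimensional word vectors (made-up values)
word_vectors = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [5.0, 6.0, 7.0, 8.0],
]

# zip(*word_vectors) groups the values dimension by dimension,
# so each group holds one coordinate from every word vector
average_vector = [sum(dimension) / len(dimension) for dimension in zip(*word_vectors)]

print(average_vector)  # [3.0, 4.0, 5.0, 6.0]
```

Note that an average gives every word the same weight and ignores word order, which is why it is only a simple baseline for representing a whole text.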

### Word similarity

We can also measure how similar two words are by using distance measures (e.g. [cosine distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html), [euclidean distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html)). These measures compute the distance between word embeddings in the vector space: the smaller the distance, the more similar the words.
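
For reference, the two distances linked above can be written as follows for two embeddings $u$ and $v$ with $n$ dimensions each. The symbols $u$, $v$, $n$ and the names $d_{\text{cos}}$ and $d_{\text{euc}}$ are notation chosen here for illustration.

$$
d_{\text{cos}}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
d_{\text{euc}}(u, v) = \lVert u - v \rVert = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}
$$

Cosine distance only depends on the angle between the two vectors: it is 0 when they point in the same direction, around 1 when they are unrelated (orthogonal), and at most 2 when they point in opposite directions. Euclidean distance also grows with the length of the vectors, so the two measures do not always rank pairs of words in the same order.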

#### Let's practice!

```python
# Import the required libraries
import spacy
from scipy.spatial import distance

# Load the model once and reuse it for every pair
nlp = spacy.load("en_core_web_lg")

# Try several pairs of words to compare the results
word_pairs = [("computer", "keyboard"), ("football", "tennis"), ("eat", "ball")]

for word1, word2 in word_pairs:
    # Get the vector for both words through your favorite model
    # (for a single word, the Doc vector is that word's vector)
    vector_1 = nlp(word1).vector
    vector_2 = nlp(word2).vector

    # Compute the cosine and the euclidean distance between both words
    cosine_dist = distance.cosine(vector_1, vector_2)
    euclidean_dist = distance.euclidean(vector_1, vector_2)
    print(cosine_dist, euclidean_dist)
```

```
0.4687861204147339 39.738563537597656
0.46193528175354004 44.84297561645508
1.0110153406858444 100.74956512451172
```


## Stack embeddings

The previous embeddings are good, but if you want something even better, you can "stack" several embeddings: the vectors produced by each of them are concatenated into one bigger vector per token. It often gives better results, but it also requires more computing power.

[Here is a super clear and understandable guide](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_3_WORD_EMBEDDING.md) to get it done (by the Flair team).
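
Conceptually, stacking boils down to concatenation. The sketch below uses short made-up vectors (the names and sizes are hypothetical, real embeddings are much longer) to show that the stacked vector is as long as the sum of its parts.

```python
# Hypothetical vectors for the same token, produced by three different embeddings
glove_like = [0.1, 0.2, 0.3]
forward_like = [0.4, 0.5]
backward_like = [0.6, 0.7]

# Stacking = placing the vectors end to end
stacked = glove_like + forward_like + backward_like

print(stacked)
print(len(stacked))  # 3 + 2 + 2 = 7 dimensions
```

This is also why stacked embeddings cost more: every downstream computation has to process a longer vector for each token.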


```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# create a StackedEmbedding object that combines glove and forward/backward flair embeddings
stacked_embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])

sentence = Sentence('The grass is green.')

# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)
```

```
Sentence[5]: "The grass is green."
Token[0]: "The"
tensor([-0.0382, -0.2449,  0.7281,  ..., -0.0065, -0.0053,  0.0090])
Token[1]: "grass"
tensor([-0.8135,  0.9404, -0.2405,  ...,  0.0354, -0.0255, -0.0143])
Token[2]: "is"
tensor([-5.4264e-01,  4.1476e-01,  1.0322e+00,  ..., -5.3690e-04,
        -9.6750e-03, -2.7541e-02])
Token[3]: "green"
tensor([-0.6791,  0.3491, -0.2398,  ..., -0.0007, -0.1333,  0.0161])
Token[4]: "."
tensor([-0.3398,  0.2094,  0.4635,  ...,  0.0005, -0.0177,  0.0032])
```


## More resources
* [Why do we use word embeddings in NLP?](https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2)
* [More details on what word embeddings are exactly?](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/)