# Overview

In this notebook, let's get familiar with some lower-level tools that were mentioned in the previous notebooks. This notebook can serve as an appendix to those notebooks, which cover higher-level techniques. We will follow the timeline of the evolution of embeddings. All the images come from the sources listed in the Credit section at the bottom.


# Bag of Words

Note: This approach is quite basic, and it doesn't take into account the semantic meaning of the words.

It is a basic approach to converting texts into vectors. The first step to get a bag of words vector is to split the text into tokens and then reduce the words to their base forms. For example, "running" will be transformed into "run". This process is called `stemming`. We use `NLTK` for it.

In [1]:
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

weather_of_melbourne="Enjoy a beautiful mostly sunny day in Melbourne! It's currently a pleasant 18°C with a light southerly breeze, but make sure to slip on some sunscreen as the UV index climbs to a very high 10 later today. Expect a high of 21°C, perfect for outdoor activities, but bundle up a bit for the 13°C low tonight. With only a very slight chance of rain, it's a fantastic day to get out and explore the city!"

# tokenization - splitting text into words
words=word_tokenize(weather_of_melbourne)
print(words)

print(100*'=')
# stemming - reducing each word to its stem
stemmer=SnowballStemmer(language='english')
stemmed_words=list(map(lambda x: stemmer.stem(x), words))
print(stemmed_words)

['Enjoy', 'a', 'beautiful', 'mostly', 'sunny', 'day', 'in', 'Melbourne', '!', 'It', "'s", 'currently', 'a', 'pleasant', '18°C', 'with', 'a', 'light', 'southerly', 'breeze', ',', 'but', 'make', 'sure', 'to', 'slip', 'on', 'some', 'sunscreen', 'as', 'the', 'UV', 'index', 'climbs', 'to', 'a', 'very', 'high', '10', 'later', 'today', '.', 'Expect', 'a', 'high', 'of', '21°C', ',', 'perfect', 'for', 'outdoor', 'activities', ',', 'but', 'bundle', 'up', 'a', 'bit', 'for', 'the', '13°C', 'low', 'tonight', '.', 'With', 'only', 'a', 'very', 'slight', 'chance', 'of', 'rain', ',', 'it', "'s", 'a', 'fantastic', 'day', 'to', 'get', 'out', 'and', 'explore', 'the', 'city', '!']
['enjoy', 'a', 'beauti', 'most', 'sunni', 'day', 'in', 'melbourn', '!', 'it', "'s", 'current', 'a', 'pleasant', '18°c', 'with', 'a', 'light', 'souther', 'breez', ',', 'but', 'make', 'sure', 'to', 'slip', 'on', 'some', 'sunscreen', 'as', 'the', 'uv', 'index', 'climb', 'to', 'a', 'veri', 'high', '10', 'later', 'today', '.', 'expect

## Calculating the frequencies of words

> Note: In a real-world case, we need a vocabulary that covers all the words in the corpus to create a vector.

We calculate the word frequencies to create a vector.

In [2]:
import collections

bag_of_words=collections.Counter(stemmed_words)
print(bag_of_words)

Counter({'a': 8, ',': 4, 'to': 3, 'the': 3, 'day': 2, '!': 2, 'it': 2, "'s": 2, 'with': 2, 'but': 2, 'veri': 2, 'high': 2, '.': 2, 'of': 2, 'for': 2, 'enjoy': 1, 'beauti': 1, 'most': 1, 'sunni': 1, 'in': 1, 'melbourn': 1, 'current': 1, 'pleasant': 1, '18°c': 1, 'light': 1, 'souther': 1, 'breez': 1, 'make': 1, 'sure': 1, 'slip': 1, 'on': 1, 'some': 1, 'sunscreen': 1, 'as': 1, 'uv': 1, 'index': 1, 'climb': 1, '10': 1, 'later': 1, 'today': 1, 'expect': 1, '21°c': 1, 'perfect': 1, 'outdoor': 1, 'activ': 1, 'bundl': 1, 'up': 1, 'bit': 1, '13°c': 1, 'low': 1, 'tonight': 1, 'onli': 1, 'slight': 1, 'chanc': 1, 'rain': 1, 'fantast': 1, 'get': 1, 'out': 1, 'and': 1, 'explor': 1, 'citi': 1})
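As the note at the start of this subsection says, a real vector needs a shared vocabulary. The following is a minimal sketch of that step, assuming two tiny, already stemmed documents that are made up for illustration; the helper name `to_vector` is also an assumption and not part of the notebook.

```python
import collections

# Hypothetical, already tokenized and stemmed documents (illustration only)
docs = [
    ["sunni", "day", "in", "melbourn"],
    ["rain", "day", "in", "sydney", "rain"],
]

# The vocabulary is the sorted set of all tokens seen in the corpus
vocabulary = sorted({token for doc in docs for token in doc})

def to_vector(tokens, vocabulary):
    """Count each vocabulary word in `tokens`; words that are absent get 0."""
    counts = collections.Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Every document is mapped to a vector of the same length, one position per vocabulary word, which is what makes the vectors comparable.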


# TF-IDF

It is a slightly improved version of the bag of words approach. It stands for **Term Frequency-Inverse Document Frequency**. It is the product of two metrics.

$$\text{TF-IDF}(t,d,D)=TF(t,d)\times IDF(t,D)$$

**Term Frequency** shows the frequency of the word in the document. The most common way to calculate it is to divide the raw count of the term ($n_t$) in this document (as in the bag of words) by the total number of terms ($N_d$) in the document. However, there are many other approaches, such as the raw count alone, boolean "frequencies", and different approaches to normalisation. See more on [Wikipedia](https://en.wikipedia.org/wiki/Tf–idf).

$$TF(t,d)=\frac{n_t}{N_d}$$


**Inverse Document Frequency** denotes how much information the word provides. For example, the words 'a' or 'that' don't give you any additional information about the document's topic. In contrast, words like `ChatGPT` or `bioinformatics` can help you define the domain (but not for this sentence). It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the word. The closer IDF is to 0, the more common the word is and the less information it provides.

* $|D|$: total number of documents in the corpus $D$
* $df_t$: number of documents containing the term $t$

$$IDF(t,D)=\log\left(\frac{|D|}{df_t}\right)$$
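The TF and IDF formulas can be combined in a few lines of plain Python. This is a minimal sketch on a made-up three-document corpus; the corpus and the helper names `tf`, `idf`, and `tf_idf` are assumptions for illustration. It uses the natural logarithm and no smoothing, whereas libraries such as scikit-learn apply additional smoothing by default, so their numbers will differ.

```python
import math

# Hypothetical tokenized corpus (illustration only)
corpus = [
    ["the", "sun", "is", "bright"],
    ["the", "rain", "is", "cold"],
    ["the", "sun", "and", "the", "rain"],
]

def tf(term, document):
    """Raw count of the term divided by the number of terms in the document."""
    return document.count(term) / len(document)

def idf(term, corpus):
    """Log of (total documents / documents containing the term).

    Assumes the term occurs in at least one document of the corpus.
    """
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, document) * idf(term, corpus)

print(tf_idf("the", corpus[0], corpus))     # "the" appears in every document
print(tf_idf("bright", corpus[0], corpus))  # "bright" appears in one document only
```

The word that occurs in every document gets a weight of zero, while the rarer word gets a positive weight.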


As we can see, common words will have low weights, while rare words that occur in the document multiple times will have higher weights. This strategy gives somewhat better results, but it still can't capture semantic meaning. Moreover, it produces very sparse vectors. The length of each vector is equal to the vocabulary size (the number of unique words in the corpus). There are about 470k unique words in English, so we will have huge vectors. Since a sentence won't have more than 50 unique words, about 99.99% of the values in the vectors will be 0, not encoding any information. Looking at this, scientists started to think about dense vector representations.
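The share of zero entries can be checked with a quick calculation. This sketch only uses the approximate figures quoted in the paragraph above (about 470k vocabulary entries and at most 50 unique words per sentence); the variable names are assumptions.

```python
vocabulary_size = 470_000  # approximate number of unique English words
unique_words = 50          # upper bound on non-zero entries for one sentence

# Every vocabulary position that is not one of the sentence's words stays at 0
zero_share = 1 - unique_words / vocabulary_size
print(f"{zero_share:.3%}")  # 99.989%
```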

# Word2Vec

It is one of the most famous approaches to dense representation. There are two different word2vec approaches mentioned in the paper:
* Continuous Bag of Words - predicting the word based on the surrounding words
* Skip-gram - the opposite task, where we predict the context based on the word

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/932/699/788/269/102/small/cf5b60614ad0d60b.webp)


The high-level idea of dense vector representation is to train two models: an encoder and a decoder. For example, in the case of skip-gram, we might pass the word "christmas" to the encoder. Then, the encoder will produce a vector that we pass to the decoder, expecting to get the words "merry", "to" and "you". This model started to take into account the meaning of the words since it is trained on the context of the words. However, it ignores morphology (information we can get from the word parts, for example, that "less" means the lack of something). This drawback was addressed later by looking at subword skip-grams in fastText.
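To make the skip-gram setup more concrete, here is a minimal sketch of how (word, context) training pairs could be built from a sentence. It does not train a model. The function name `skip_gram_pairs` and the window size of 2 are assumptions chosen for this illustration, not values taken from the paper.

```python
def skip_gram_pairs(tokens, window=2):
    """Return (center, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["merry", "christmas", "to", "you"]
pairs = skip_gram_pairs(tokens, window=2)

# Context words the model is expected to predict for "christmas"
print([context for center, context in pairs if center == "christmas"])
# ['merry', 'to', 'you']
```

For Continuous Bag of Words the same pairs are used in the opposite direction: the context words are the input and the center word is the target.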

# Transformers and Sentence Embeddings

See the previous notebooks for details.

## Transformers

* [Encoder of Transformers](https://www.kaggle.com/code/aisuko/encoder-in-transformers-architecture)
* [Decoder of Transformers](https://www.kaggle.com/code/aisuko/decoder-in-transformers-architecture)


## Sentence Embeddings

* [Computing sentence embeddings with multiple GPUs](https://www.kaggle.com/code/aisuko/computing-embeddings-with-multi-gpus)
* [Computing sentence embeddings with streaming](https://www.kaggle.com/code/aisuko/computing-embeddings-streaming)
* [Computing sentence embeddings with Transformers](https://www.kaggle.com/code/aisuko/sentence-embeddings-with-transformers)

# Distance between vectors

Embeddings are actually vectors. So, if we want to understand how close two sentences are to each other, we can calculate the distance between their vectors. A smaller distance corresponds to a closer semantic meaning.

Different metrics can be used to measure the distance between two vectors:

* Euclidean distance (L2)
* Manhattan distance (L1)
* Dot product
* Cosine similarity

Let's discuss them. As a simple example, we will be using two 2D vectors.

In [3]:
vector1=[1,4]
vector2=[2,2]

## Euclidean distance (L2)

The most standard way to define the distance between two points (or vectors) is the Euclidean distance or L2 norm. This metric is the most commonly used in day-to-day life, for example, when we are talking about the distance between two towns. Here is a visual representation and the formula for the L2 distance.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/933/848/504/549/053/original/1045050091437874.webp)

We can calculate this metric using vanilla Python or leveraging the numpy function.

In [4]:
import numpy as np

print(sum(list(map(lambda x,y:(x-y)**2, vector1, vector2)))**0.5)

print(np.linalg.norm((np.array(vector1)-np.array(vector2)), ord=2))

2.23606797749979
2.23606797749979


## Manhattan distance (L1)

The other commonly used distance is the L1 norm or Manhattan distance. This distance is named after the island of Manhattan (New York). The island has a grid layout of streets, so the shortest route between two points in Manhattan is the L1 distance, since you need to follow the grid.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/933/886/084/041/180/original/504b3c9fcda74345.webp)

We can also implement it from scratch or use the numpy function.

In [5]:
print(sum(list(map(lambda x,y:abs(x-y), vector1, vector2))))
      
print(np.linalg.norm((np.array(vector1)-np.array(vector2)), ord=1))

3
3.0


## Dot product

> Note: This metric is a bit tricky to interpret. On the one hand, it shows you whether the vectors are pointing in the same direction. On the other hand, the results depend heavily on the magnitudes of the vectors.

Another way to look at the distance between vectors is to calculate a dot or scalar product. Here is the formula, and we can easily implement it.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/933/930/157/590/756/original/3208368343fcbecb.webp)

As mentioned above, the results depend heavily on the magnitudes of the vectors. For example, suppose we calculate the dot products between two pairs of vectors:

* (1,1) vs (1,1)
* (1,1) vs (10,10)

In both cases, vectors are **collinear**, but the dot product is ten times bigger in the second case: 2 vs 20.
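The two dot products mentioned above can be verified directly. This is a small sketch in plain Python; the helper name `dot` is an assumption for illustration.

```python
def dot(u, v):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

print(dot([1, 1], [1, 1]))    # 2
print(dot([1, 1], [10, 10]))  # 20
```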

In [6]:
print(sum(list(map(lambda x,y:x*y, vector1, vector2))))

print(np.dot(vector1, vector2))

10
10


## Cosine similarity

Quite often, cosine similarity is used. Cosine similarity is the dot product normalised by the vectors' magnitudes (or norms).

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/933/983/599/203/267/original/a4aa2175b2a0a759.webp)

We can either calculate everything ourselves (as previously) or use the function from sklearn.

In [7]:
from sklearn.metrics.pairwise import cosine_similarity

dot_product=sum(list(map(lambda x,y:x*y, vector1, vector2)))
norm_vector1=sum(list(map(lambda x:x**2, vector1)))**0.5
norm_vector2=sum(list(map(lambda x:x**2, vector2)))**0.5

print(dot_product/norm_vector1/norm_vector2)

cosine_similarity(
    np.array(vector1).reshape(1,-1),
    np.array(vector2).reshape(1,-1))[0][0]

0.8574929257125442


0.8574929257125441

The function `cosine_similarity` expects 2D arrays. That's why we need to reshape the numpy arrays. Cosine similarity is equal to the cosine of the angle between two vectors. The smaller the angle between the vectors, the higher the metric value.
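The 2D input means that each row is treated as one vector and the result is a matrix of pairwise similarities. The sketch below imitates that behaviour in plain Python so the structure of the result is visible. The helper names `cosine` and `pairwise_cosine` are assumptions, and the third vector `[10, 10]` is an extra example that is not used elsewhere in the notebook.

```python
import math

def cosine(u, v):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def pairwise_cosine(rows):
    """Matrix whose entry [i][j] is the cosine similarity of rows i and j."""
    return [[cosine(u, v) for v in rows] for u in rows]

vectors = [[1, 4], [2, 2], [10, 10]]  # the third row is an extra example vector
matrix = pairwise_cosine(vectors)
for row in matrix:
    print([round(value, 4) for value in row])
```

Note that `[2, 2]` and `[10, 10]` point in the same direction, so their similarity is 1 regardless of their different magnitudes.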

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/934/012/642/244/598/original/40ec654df0850059.webp)

We can even calculate the exact angle between our vectors in degrees. We get a result of around 30 degrees, which looks pretty reasonable.

In [8]:
import math

math.degrees(math.acos(0.8575))

30.962968709327708

# What metric to use?

We have discussed different ways to calculate the distance between two vectors, and we can use any of them to compare the embeddings we have. However, for NLP tasks, the best practice is usually to use cosine similarity. Here are some reasons behind it:

* Cosine similarity is between -1 and 1, while L1 and L2 are unbounded, so it's easier to interpret.
* From a practical perspective, it is more efficient to calculate dot products than square roots for Euclidean distance.
* Cosine similarity is less affected by the curse of dimensionality.

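Note: Many functions can normalize the embeddings for you, such as `SentenceTransformer.encode()` with its `normalize_embeddings` parameter. Once the vectors have unit length, the dot product is equal to cosine similarity. This can help us reduce the effect of the curse of dimensionality.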
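The effect of normalization can be sketched in plain Python: after scaling both vectors to unit length, their dot product equals the cosine similarity of the original vectors. The helper names `normalize` and `dot` are assumptions for illustration, and this sketch does not call any embedding library.

```python
import math

def dot(u, v):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

vector1 = [1, 4]
vector2 = [2, 2]

# Cosine similarity of the original vectors
cosine = dot(vector1, vector2) / (
    math.sqrt(dot(vector1, vector1)) * math.sqrt(dot(vector2, vector2))
)

# Plain dot product of the normalized vectors
dot_of_normalized = dot(normalize(vector1), normalize(vector2))

print(math.isclose(cosine, dot_of_normalized))  # True
```

This is why normalized embeddings can be compared with a plain dot product, without recomputing the norms for every pair.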

# Credit

* https://medium.com/towards-data-science/text-embeddings-comprehensive-guide-afd97fce8fb5
* https://arxiv.org/pdf/1301.3781.pdf