# Overview

In this notebook, let's get familiar with some of the lower-level tools mentioned in the previous notebooks. It can serve as an appendix to those notebooks, which cover higher-level techniques. We will follow the timeline of the evolution of embeddings. All the images come from the sources listed in the Credit section at the bottom.


# Bag of Words

Note: This approach is quite basic, and it doesn't take into account the semantic meaning of the words.

Bag of words is a basic approach to converting texts into vectors. The first step to get a bag of words vector is to split the text into tokens and then reduce the words to their base forms. For example, "running" is transformed into "run". This process is called `stemming`. We use `NLTK` for it.

In [1]:
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

weather_of_melbourne="Enjoy a beautiful mostly sunny day in Melbourne! It's currently a pleasant 18°C with a light southerly breeze, but make sure to slip on some sunscreen as the UV index climbs to a very high 10 later today. Expect a high of 21°C, perfect for outdoor activities, but bundle up a bit for the 13°C low tonight. With only a very slight chance of rain, it's a fantastic day to get out and explore the city!"

# tokenization - splitting text into words
words=word_tokenize(weather_of_melbourne)
print(words)

print(100*'=')
# stemming - reducing each word to its base form
stemmer=SnowballStemmer(language='english')
stemmed_words=[stemmer.stem(word) for word in words]
print(stemmed_words)

['Enjoy', 'a', 'beautiful', 'mostly', 'sunny', 'day', 'in', 'Melbourne', '!', 'It', "'s", 'currently', 'a', 'pleasant', '18°C', 'with', 'a', 'light', 'southerly', 'breeze', ',', 'but', 'make', 'sure', 'to', 'slip', 'on', 'some', 'sunscreen', 'as', 'the', 'UV', 'index', 'climbs', 'to', 'a', 'very', 'high', '10', 'later', 'today', '.', 'Expect', 'a', 'high', 'of', '21°C', ',', 'perfect', 'for', 'outdoor', 'activities', ',', 'but', 'bundle', 'up', 'a', 'bit', 'for', 'the', '13°C', 'low', 'tonight', '.', 'With', 'only', 'a', 'very', 'slight', 'chance', 'of', 'rain', ',', 'it', "'s", 'a', 'fantastic', 'day', 'to', 'get', 'out', 'and', 'explore', 'the', 'city', '!']
['enjoy', 'a', 'beauti', 'most', 'sunni', 'day', 'in', 'melbourn', '!', 'it', "'s", 'current', 'a', 'pleasant', '18°c', 'with', 'a', 'light', 'souther', 'breez', ',', 'but', 'make', 'sure', 'to', 'slip', 'on', 'some', 'sunscreen', 'as', 'the', 'uv', 'index', 'climb', 'to', 'a', 'veri', 'high', '10', 'later', 'today', '.', 'expect

## Calculating the frequencies of words

> Note: In a real-world case, we need a vocabulary that covers all the words in the corpus in order to create a vector.

Next, we count how often each stemmed word occurs to create a vector.

In [2]:
import collections

bag_of_words=collections.Counter(stemmed_words)
print(bag_of_words)

Counter({'a': 8, ',': 4, 'to': 3, 'the': 3, 'day': 2, '!': 2, 'it': 2, "'s": 2, 'with': 2, 'but': 2, 'veri': 2, 'high': 2, '.': 2, 'of': 2, 'for': 2, 'enjoy': 1, 'beauti': 1, 'most': 1, 'sunni': 1, 'in': 1, 'melbourn': 1, 'current': 1, 'pleasant': 1, '18°c': 1, 'light': 1, 'souther': 1, 'breez': 1, 'make': 1, 'sure': 1, 'slip': 1, 'on': 1, 'some': 1, 'sunscreen': 1, 'as': 1, 'uv': 1, 'index': 1, 'climb': 1, '10': 1, 'later': 1, 'today': 1, 'expect': 1, '21°c': 1, 'perfect': 1, 'outdoor': 1, 'activ': 1, 'bundl': 1, 'up': 1, 'bit': 1, '13°c': 1, 'low': 1, 'tonight': 1, 'onli': 1, 'slight': 1, 'chanc': 1, 'rain': 1, 'fantast': 1, 'get': 1, 'out': 1, 'and': 1, 'explor': 1, 'citi': 1})
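The `Counter` above only holds the words of one text. To compare several texts, each one has to be mapped onto the same fixed vocabulary, as the note in this section points out. Below is a minimal sketch of that step. The two tiny documents and the helper name `bow_vector` are made up for illustration, and the tokens are assumed to be stemmed already.

```python
from collections import Counter

def bow_vector(tokens, vocabulary):
    """Return the term counts of `tokens`, ordered like `vocabulary`."""
    counts = Counter(tokens)
    # A Counter returns 0 for missing keys, so absent terms become 0.
    # Tokens that are not in the vocabulary are simply ignored.
    return [counts[term] for term in vocabulary]

# Hypothetical mini corpus of already stemmed tokens
doc_a = ["sunni", "day", "in", "melbourn"]
doc_b = ["rain", "day", "in", "sydney", "rain"]

# The vocabulary is the sorted set of all tokens seen in the corpus
vocabulary = sorted(set(doc_a) | set(doc_b))

print(vocabulary)                     # ['day', 'in', 'melbourn', 'rain', 'sunni', 'sydney']
print(bow_vector(doc_a, vocabulary))  # [1, 1, 1, 0, 1, 0]
print(bow_vector(doc_b, vocabulary))  # [1, 1, 0, 2, 0, 1]
```

Every document now gets a vector of the same length, one position per vocabulary entry, which is what makes the vectors comparable.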


# TF-IDF

TF-IDF is a slightly improved version of the bag of words approach. It stands for **Term Frequency-Inverse Document Frequency** and is the product of two metrics.

$$\text{TF-IDF}(t,d,D)=TF(t,d)\times IDF(t,D)$$

**Term Frequency** shows how often the word occurs in the document. The most common way to calculate it is to divide the raw count of the term ($n_t$) in the document (as in the bag of words) by the total number of terms (words) in the document, written as $d$ in the denominator of the formula. However, there are many other approaches, such as the raw count, boolean "frequencies", and different kinds of normalisation. See more on [Wikipedia](https://en.wikipedia.org/wiki/Tf–idf).

$$TF(t,d)=\frac{n_t}{d}$$


**Inverse Document Frequency** denotes how much information the word provides. For example, the words 'a' or 'that' don't give you any additional information about the document's topic. In contrast, words like `ChatGPT` or `bioinformatics` can help you define the domain (but not for this sentence). It's calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the word. The closer IDF is to 0, the more common the word is and the less information it provides.

* $|D|$: total number of documents in the corpus $D$
* $|\{d \in D : t \in d\}|$: number of documents containing the term $t$

$$IDF(t,D)=\log\left(\frac{|D|}{|\{d \in D : t \in d\}|}\right)$$
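The two metrics can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-document corpus is made up, the natural logarithm is used (the text does not fix the base), and `idf` assumes the term occurs in at least one document. The function names `tf`, `idf` and `tf_idf` are hypothetical helpers, not a library API.

```python
import math
from collections import Counter

def tf(term, doc):
    """Raw count of `term` divided by the number of terms in `doc`."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing `term`).

    Assumes `term` occurs in at least one document of `corpus`.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Hypothetical corpus of tokenised documents
corpus = [
    ["a", "sunni", "day"],
    ["a", "rain", "day"],
    ["a", "rain", "rain", "night"],
]

for term, doc in [("a", corpus[0]), ("sunni", corpus[0]), ("rain", corpus[2])]:
    print(term, round(tf_idf(term, doc, corpus), 4))
```

The word "a" appears in every document, so its IDF is log(1) = 0 and its TF-IDF weight vanishes, while "sunni", which appears in only one document, gets the largest weight of the three.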


As a result, common words get low weights, while rare words that occur multiple times in the document get higher weights. This strategy gives somewhat better results, but it still can't capture semantic meaning. Moreover, it produces very sparse vectors. The length of each vector is equal to the size of the vocabulary, and there are about 470k unique words in English, so the vectors are huge. Since a sentence rarely has more than 50 unique words, roughly 99.99% of the values in a vector will be 0 and encode no information. Looking at this, scientists started to think about dense vector representations.
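The sparsity estimate is simple arithmetic and can be checked directly. The two numbers below are the rough figures from the paragraph above (about 470k words in the vocabulary, at most 50 unique words in a sentence), not measured values.

```python
vocabulary_size = 470_000        # approximate number of unique English words
unique_words_in_sentence = 50    # assumed upper bound for a single sentence

# Fraction of vector positions that stay at zero
zero_fraction = 1 - unique_words_in_sentence / vocabulary_size
print(f"{zero_fraction:.4%}")  # 99.9894%
```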

# Word2Vec

Word2Vec is one of the best-known approaches to dense representation. The paper describes two different word2vec approaches:
* Continuous Bag of Words - predicting the word based on the surrounding words
* Skip-gram - the opposite task, where we predict the context based on the word

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/932/699/788/269/102/small/cf5b60614ad0d60b.webp)


The high-level idea of dense vector representation is to train two models: an encoder and a decoder. For example, in the case of skip-gram, we might pass the word "christmas" to the encoder. The encoder then produces a vector that we pass to the decoder, expecting to get the words "merry", "to" and "you". This model started to take into account the meaning of the words, since it is trained on the context of the words. However, it ignores morphology (information we can get from word parts, for example, that "less" means the lack of something). This drawback was addressed later by fastText, which looks at subword (character n-gram) skip-grams.
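To make the skip-gram task concrete, the sketch below only generates the (word, context word) training pairs that such a model would learn from. It does not train an encoder or decoder. The helper name `skipgram_pairs` and the window size of 2 are assumptions chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["merry", "christmas", "to", "you"]

for center, context in skipgram_pairs(sentence, window=2):
    if center == "christmas":
        print(center, "->", context)
# christmas -> merry
# christmas -> to
# christmas -> you
```

For Continuous Bag of Words the direction is reversed: the context words form the input and the center word is the prediction target.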

# Transformers and Sentence Embeddings

See the previous notebooks for details.

## Transformers

* [Encoder of Transformers](https://www.kaggle.com/code/aisuko/encoder-in-transformers-architecture)
* [Decoder of Transformers](https://www.kaggle.com/code/aisuko/decoder-in-transformers-architecture)


## Sentence Embeddings

* [Computing sentence embeddings with multiple GPUs](https://www.kaggle.com/code/aisuko/computing-embeddings-with-multi-gpus)
* [Computing sentence embeddings with streaming](https://www.kaggle.com/code/aisuko/computing-embeddings-streaming)
* [Computing sentence embeddings with Transformers](https://www.kaggle.com/code/aisuko/sentence-embeddings-with-transformers)

# Calculating embeddings

# Credit

* https://medium.com/towards-data-science/text-embeddings-comprehensive-guide-afd97fce8fb5
* https://arxiv.org/pdf/1301.3781.pdf