# Embeddings

An embedding is a numerical representation of data.
Machine learning and deep learning models operate on numbers, so any information we feed them must first be encoded numerically.
This is straightforward when the features are already numeric, as in the house price problem, where we use the *number* of rooms, the lot *area*, etc.
However, when working with natural language, we have to convert letters, words, sentences, or entire documents into numbers.
In this notebook, we will look at some existing embedding methods for representing natural language.

### Bag of Words

This is the most basic structure we can use to represent text.
The basic idea is to represent a sentence by a vector where each position corresponds to a word in the vocabulary and its value is the number of times that word occurs in the sentence.
Consider the following example:

In [2]:
# Sentences.
sentences = [
    'I love embeddings',
    'I do not like embeddings',
    'Love is like trash',
    'I am like you and I hate'
]

To use it properly, we first preprocess the text.
In our example, I will just turn all text into lower case, but preprocessing can include other techniques, such as removing stop-words.

In [3]:
# Preprocess.
lower_sentences = [s.lower() for s in sentences]
lower_sentences

['i love embeddings',
 'i do not like embeddings',
 'love is like trash',
 'i am like you and i hate']

Then, we create a set with all unique words from our sentences. This will be our **vocabulary**. Note that a Python set has no guaranteed order, so the word order shown below may differ between runs.

In [5]:
word_list = list()

for sentence in lower_sentences:
    for word in sentence.split(): # Break sentence into individual words.
        word_list.append(word)
        
word_list = list(set(word_list)) # Turn into a set to remove duplicates. Then back to a list.
word_list

['do',
 'love',
 'am',
 'i',
 'you',
 'hate',
 'embeddings',
 'like',
 'and',
 'trash',
 'is',
 'not']

Now, we represent each sentence as a vector with one position per vocabulary word, holding the number of times that word occurs in the sentence.

In [6]:
# Representing our sentences.
bow_sentences = []

for sentence in lower_sentences:
    bow_sent = [0 for i in word_list]
    
    for word in sentence.split():
        index = word_list.index(word)
        bow_sent[index] = bow_sent[index] + 1
    print("Sentence:", sentence)
    print("BOW Representation:", bow_sent)

    bow_sentences.append(bow_sent)
print("Word List:", word_list)

Sentence: i love embeddings
BOW Representation: [0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
Sentence: i do not like embeddings
BOW Representation: [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1]
Sentence: love is like trash
BOW Representation: [0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
Sentence: i am like you and i hate
BOW Representation: [0, 0, 1, 2, 1, 1, 0, 1, 1, 0, 0, 0]
Word List: ['do', 'love', 'am', 'i', 'you', 'hate', 'embeddings', 'like', 'and', 'trash', 'is', 'not']


##### Drawbacks

Although simple and easy to use, bag-of-words representations are of little use when we want to capture the semantic meaning of a sentence, or even just its word order.
For instance, imagine that we train a Sentiment Analysis model that classifies sentences as *Positive* or *Negative* according to the sentiment they express.
If we train the model on Bag-of-Words vectors, words such as *love* and *like* can end up with a strong weight towards the positive sentiment, while *hate* and *dislike* are weighted as negative.
Considering our small set of sentences, such a model would be misled by "*I do not like embeddings*", which contains the word *like* but is negative.
Something similar can happen with "*love is like trash*".
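The loss of word order can be seen directly in a minimal sketch. The two sentences and the `bag_of_words` helper below are hypothetical and only serve as an illustration: the sentences use the same words in a different order, so they mean different things but receive the same vector.

```python
from collections import Counter


def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


# Two hypothetical sentences with the same words but opposite meanings.
first = 'i love embeddings not trash'
second = 'i love trash not embeddings'

# Sorting makes the vocabulary order reproducible across runs.
vocabulary = sorted(set(first.split()) | set(second.split()))

bow_first = bag_of_words(first, vocabulary)
bow_second = bag_of_words(second, vocabulary)

print("Vocabulary:", vocabulary)
print("First: ", bow_first)
print("Second:", bow_second)
print("Identical vectors:", bow_first == bow_second)
```

A model that only sees these vectors has no way to tell the two sentences apart.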

### TF-IDF

**TF-IDF** stands for *Term Frequency - Inverse Document Frequency* and represents a text by weighting the frequency of each term in the text against how common that term is across the corpus.
With this representation, each term gets a score that reflects how important it is for a document in the corpus.
For example, consider the word "*the*". It appears in practically every English text, usually many times.
Therefore, it carries little information about the text it is in, *i.e.*, it does not help to differentiate that text from the others.
On the other hand, the word "*embedding*" carries a lot of information, as it is uncommon and refers to specific scenarios.

Consider the following example, which reuses the sentences from the bag-of-words example.
Here, instead of a set of sentences, we treat each sentence as a single document.
Only the terminology changes; the data stays the same.
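The weighting computed by the code in this section can be written as follows. The symbols are introduced here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the number of words in $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$. This is one common variant (natural logarithm, no smoothing); other variants exist, so the values may differ from those produced by a library.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

For example, for the term *love* in the document "*i love embeddings*" we have $f_{t,d} = 1$, $|d| = 3$, $N = 4$ and $\mathrm{df}(t) = 2$, which gives $\frac{1}{3} \ln\frac{4}{2} \approx 0.231$.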

In [7]:
import math

tf_idf_sentences = list()
n_docs = len(lower_sentences)

# First, let's calculate the document frequency of the terms.
df = [0 for i in word_list]

for s in lower_sentences:
    # Given a document.
    for t in word_list:
        # For each term, check if it occurs in the document.
        for w in s.split():
            # Run through the document words.
            if t == w:
                # If the word is in the document.
                ind = word_list.index(w)
                df[ind] = df[ind] + 1 # Update the counter in the term index.
                break
    
# Then, we can calculate the tf (term frequency) of each term and obtain the final result.
for s in lower_sentences:
    tf_idf_sent = [0 for i in word_list]
    words = s.split()
    n_words = len(words)

    for t in word_list:
        # For each term in our vocabulary.
        tf = 0

        for w in words:
            # Find the term among the words in the document.
            if t == w:
                tf = tf + 1 # Add to the counter.

        tf = tf/n_words # Calculate term frequency over document number of words.
        ind = word_list.index(t)
        # TF-IDF is the term frequency multiplied by the inverse document
        # frequency (IDF) of the term, i.e., the logarithm of the number of
        # documents in the corpus divided by the number of documents containing it.
        tf_idf_sent[ind] = tf * math.log(n_docs/df[ind])
    
    print("Sentence:", s)
    print("TF-IDF Representation:", tf_idf_sent)
    tf_idf_sentences.append(tf_idf_sent)
print(word_list)

Sentence: i love embeddings
TF-IDF Representation: [0.0, 0.23104906018664842, 0.0, 0.09589402415059362, 0.0, 0.0, 0.23104906018664842, 0.0, 0.0, 0.0, 0.0, 0.0]
Sentence: i do not like embeddings
TF-IDF Representation: [0.2772588722239781, 0.0, 0.0, 0.05753641449035617, 0.0, 0.0, 0.13862943611198905, 0.05753641449035617, 0.0, 0.0, 0.0, 0.2772588722239781]
Sentence: love is like trash
TF-IDF Representation: [0.0, 0.17328679513998632, 0.0, 0.0, 0.0, 0.0, 0.0, 0.07192051811294521, 0.0, 0.34657359027997264, 0.34657359027997264, 0.0]
Sentence: i am like you and i hate
TF-IDF Representation: [0.0, 0.0, 0.19804205158855578, 0.08219487784336595, 0.19804205158855578, 0.19804205158855578, 0.0, 0.04109743892168297, 0.19804205158855578, 0.0, 0.0, 0.0]
['do', 'love', 'am', 'i', 'you', 'hate', 'embeddings', 'like', 'and', 'trash', 'is', 'not']


As we can see, common terms, such as "*i*", obtain lower values than less common ones, such as "*trash*". The resulting values serve as weights that learning methods can use when classifying such texts.
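The same computation can be written more compactly. The following is a minimal sketch, not a replacement for a library implementation: the `tf_idf` function is a hypothetical helper, it sorts the vocabulary so that the positions are reproducible, and it assumes that no document is empty.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(word for words in tokenized for word in set(words))

    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([
            (counts[term] / len(words)) * math.log(n_docs / df[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


docs = [
    'I love embeddings',
    'I do not like embeddings',
    'Love is like trash',
    'I am like you and I hate',
]
vocabulary, vectors = tf_idf(docs)

# Show only the non-zero weights of the first document.
for term, weight in zip(vocabulary, vectors[0]):
    if weight > 0:
        print(f"{term}: {weight:.4f}")
```

Because the vocabulary is sorted, the positions differ from the ones printed above, but the weight of each term in each document is computed with the same formula.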

### Word2Vec

The use of embeddings changed drastically once neural networks were used to learn the representations. One of the first widely adopted approaches was Word2Vec. The main idea was to find a representation that encapsulates the context in which words are used in a text. To do so, [in their paper](https://arxiv.org/pdf/1301.3781.pdf) Mikolov et al. describe the CBOW and Skip-gram models, two simple neural networks that learn the representation of a target word from its surrounding words.
For instance, given the sentence "*the cat on the mat*" and a context of four words (two on each side of the target), the two models handle the word "on" as follows:

- CBOW uses the surrounding words to predict the target word;
    - Input: "the", "cat", "the", "mat"
    - Output: "on"
- Skip-gram does the opposite and uses the target word to predict its surrounding words.
    - Input: "on"
    - Output: "the", "cat", "the", "mat"

In both cases, the word vectors are the weights the network learns while solving this prediction task.
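The training examples that each model sees for this sentence can be sketched in plain Python. This only builds the input/output pairs and does not train anything; `context_window` and `half_window` are hypothetical names chosen for this illustration.

```python
def context_window(tokens, position, half_window):
    """Words up to `half_window` positions to the left and right of the target."""
    left = tokens[max(0, position - half_window):position]
    right = tokens[position + 1:position + 1 + half_window]
    return left + right


tokens = 'the cat on the mat'.split()
half_window = 2  # two words on each side, i.e., four context words for "on"

cbow_examples = []      # (context words, target word)
skipgram_examples = []  # (target word, one context word)

for position, target in enumerate(tokens):
    context = context_window(tokens, position, half_window)
    cbow_examples.append((context, target))
    skipgram_examples.extend((target, word) for word in context)

# CBOW example for the target "on".
print(cbow_examples[2])
# Skip-gram examples for the target "on".
print([pair for pair in skipgram_examples if pair[0] == 'on'])
```

CBOW gets one example per target word, while Skip-gram gets one example per (target, context word) pair.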

<center>
    <img src='https://www.researchgate.net/profile/Wang-Ling-16/publication/281812760/figure/fig1/AS:613966665486361@1523392468791/Illustration-of-the-Skip-gram-and-Continuous-Bag-of-Word-CBOW-models.png' width=600/>
</center>

The main property of this representation is that the vector of a word captures its semantic meaning based on the contexts in which it occurs. Because all words are represented in the same vector space (all vectors have the same size), we can measure the similarity between words by the distance between their vectors. It is even possible to perform arithmetic on them, such as:

V(king) - V(man) + V(woman) ≈ V(queen)

In this example, we modify the *king* vector by subtracting *man* and adding *woman*. The result is not exactly the *queen* vector, but *queen* is expected to be the closest word vector to it.

<center>
    <img src='https://1.bp.blogspot.com/-VhOFQH--Izo/XVfXQ2xWOUI/AAAAAAAANw8/n6VCsT6z_OMWdbmx3O2snLeJOJiJcT4LwCLcBGAs/s1600/w2v_001.png' width=600/>
</center>
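The vector arithmetic can be illustrated with a toy example. The two-dimensional vectors below are made up by hand for this sketch (the first axis loosely stands for "royalty", the second for "gender"); real Word2Vec vectors are learned from text and are far less tidy.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of the same size."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical hand-made vectors, not the output of a trained model.
toy_vectors = {
    'king':  [0.9, 0.9],
    'queen': [0.9, 0.1],
    'man':   [0.1, 0.9],
    'woman': [0.1, 0.1],
    'apple': [-0.5, 0.3],
}

# V(king) - V(man) + V(woman), computed element by element.
result = [
    k - m + w
    for k, m, w in zip(toy_vectors['king'], toy_vectors['man'], toy_vectors['woman'])
]

# Rank every word by its cosine similarity to the resulting vector.
ranked = sorted(
    ((word, cosine(result, vector)) for word, vector in toy_vectors.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)
```

In this hand-made space the closest word to the result is *queen*. With vectors learned from real data, the outcome depends heavily on the size and quality of the corpus.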

Here, we will use an existing library to perform Word2Vec. 

In [8]:
# Generate word vectors using Word2Vec.

# Import all necessary modules.
import warnings

import numpy as np
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize

warnings.filterwarnings(action='ignore')

# Note: sent_tokenize and word_tokenize need NLTK's tokenizer data,
# which may have to be downloaded once with nltk.download().

# Read the 'alice.txt' file; the with-block closes the file afterwards.
with open("supporting_texts/alice.txt", "r") as sample:
    text = sample.read()

# Replace newline characters with spaces.
text = text.replace("\n", " ")

# Split the text into sentences and each sentence into lower-cased words.
data = [
    [word.lower() for word in word_tokenize(sentence)]
    for sentence in sent_tokenize(text)
]

# Create the CBOW model (sg=0 is the default).
model1 = Word2Vec(data, min_count=1, vector_size=50, window=5, seed=42)

# Create the Skip-gram model (sg=1).
model2 = Word2Vec(data, min_count=1, vector_size=50, window=5, sg=1, seed=42)



In [8]:
print(model2.wv['king'])
print(model2.wv['alice'])
print(model2.wv.most_similar('alice', topn=10))
print(model2.wv.most_similar('king', topn=10))

[ 0.13175842  0.07318828  0.23259278  0.11238079  0.08772991 -0.1094278
  0.01447148  0.21727201 -0.3275798   0.0608438   0.08709006  0.34696513
 -0.13289163  0.44468698  0.17632636 -0.0565002   0.06107061 -0.4055396
 -0.4012241  -0.05726532 -0.39340606 -0.32697043 -0.08423092  0.19419779
  0.6429302  -0.12702861 -0.08809735 -0.16599075 -0.34790352 -0.33239952
 -0.09282667  0.05073008 -0.26013282  0.25128275 -0.42947346  0.28955492
 -0.1756014  -0.7376331  -0.00924839  0.25418404  0.03287344  0.23274258
  0.5267652  -0.10837778  0.0453444  -0.02558453  0.3589811   0.2514291
  0.18430313 -0.11240748]
[ 0.1984804   0.16906266  0.2556089   0.10391572  0.02816753 -0.05490195
  0.01585247  0.22863606 -0.34660432  0.09359525  0.19915769  0.41965812
 -0.19819796  0.4896722   0.3313081  -0.04547765  0.17936224 -0.4153227
 -0.4469249  -0.11902079 -0.42439133 -0.43051377 -0.18797913  0.2214064
  0.75050586 -0.21559669 -0.09494667 -0.25087526 -0.43159768 -0.4114146
 -0.14018968  0.13352662 -0.293

In [9]:
vking = model2.wv['king']
vman = model2.wv['man']
vwoman = model2.wv['woman']
vqueen = vking - vman + vwoman
print(model2.wv.most_similar(vqueen))

[('king', 0.9757580161094666), ('said', 0.9702053666114807), ('alice', 0.9692055583000183), ('hatter', 0.9616694450378418), ('”', 0.9567456841468811), ('gryphon', 0.9541082978248596), ('thought', 0.9510892033576965), ('he', 0.9500472545623779), ('mock', 0.9488303661346436), ('herself', 0.9461054801940918)]


Since we are using a small dataset here, subtracting *man* from *king* and adding *woman* did not work as expected: the word closest to the resulting vector is *king* itself, and *queen* does not appear among the ten most similar words shown above.

### Sent2Vec

Inspired by Word2Vec, [Pagliardini et al.](https://aclanthology.org/N18-1049.pdf) use a similar method to generate representations for entire sentences and documents.
In this case, the context of a word is no longer a fixed window but the whole sentence it appears in.
A sentence representation is obtained by averaging the vectors of the words (and word n-grams) in the input sentence.
The model is trained, similarly to CBOW, to predict a word from the rest of its sentence.
This method (as well as Word2Vec) is considered unsupervised, since it does not need annotated data for training.
Instead, the training signal comes from the text itself.
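The averaging step can be sketched as follows. The three-dimensional word vectors are hypothetical values picked for this illustration, and the sketch covers only single words; it is not the authors' implementation and involves no training.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(components) / len(vectors) for components in zip(*vectors)]


# Hypothetical word vectors; a real model learns them from text.
word_vectors = {
    'i':          [0.25, 0.0, 0.5],
    'love':       [1.0, 0.5, 0.0],
    'embeddings': [0.25, 1.0, 1.0],
}

sentence = 'i love embeddings'
sentence_vector = average_vectors([word_vectors[word] for word in sentence.split()])
print(sentence_vector)
```

The sentence vector has the same size as the word vectors, regardless of how many words the sentence contains.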

<center>
    <img src='https://miro.medium.com/max/1400/1*RyWXrpAxzzO_zzZgtMN1mQ.png' width=600>
</center>

To use a sent2vec model as described in the paper, we can follow the authors' instructions in the paper [repository](https://github.com/epfml/sent2vec).
Since it relies on downloading large models, I will skip it in this notebook.
But I encourage you to try it yourself.

### BERT

[BERT](https://arxiv.org/pdf/1810.04805.pdf) stands for **B**idirectional **E**ncoder **R**epresentations from **T**ransformers.
As its name says, it is based on an earlier model called the [Transformer](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).
A Transformer consists of an Encoder and a Decoder, and was originally proposed for Machine Translation.
In general terms, the Encoder creates a representation of the input and the Decoder converts it into the expected output.

<center>
    <img src='https://production-media.paperswithcode.com/methods/new_ModalNet-21.jpg' width=300>
</center>
Transformer architecture.

BERT uses the encoder part of the Transformer to generate rich representations of the input. For a downstream task, a classifier head is placed on top of it and the whole model is trained like a standard classifier.
The encoder uses self-attention layers, which let each term focus on the parts of the input that matter most for it.
This makes each representation depend on its context.
For example, Word2Vec generates a single representation per word.
However, we know that words can have multiple meanings.
In the Word2Vec representation, all these meanings are combined in a single vector.
We say, then, that Word2Vec is **context-independent**.
On the other hand, BERT is **context-dependent**, *i.e.*, it generates different representations of the same word depending on the context it is in.

<center>
    <img src='https://www.researchgate.net/publication/349546860/figure/fig2/AS:994573320994818@1614136166736/The-Transformer-based-BERT-base-architecture-with-twelve-encoder-blocks.ppm' width=600>
</center>
BERT architecture example for a classification task.

Consider the following two sentences:
1. We went to the *river* **bank**.
2. I need to go to the **bank** to *make a deposit*.

In bold, **bank** is the word with a different meaning in each of the two sentences.
In italics is the context that allows us to understand the meaning of *bank*.
While Word2Vec will produce a single vector for *bank* in both sentences, BERT generates different ones.
The reason is that a trained Word2Vec model only needs the word itself to look up its representation, while BERT always needs the entire sentence as input to generate the representation of a single word.
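The difference can be illustrated with a heavily simplified self-attention step. This is a toy sketch and not BERT: there are no learned projections, layers, or positional information, and the two-dimensional vectors are made up. It only shows how mixing in the surrounding words gives the same word a different vector in each sentence.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def contextualize(tokens, static_vectors):
    """One simplified self-attention step over the tokens of a sentence."""
    inputs = [static_vectors[token] for token in tokens]
    scale = math.sqrt(len(inputs[0]))
    outputs = []
    for query in inputs:
        # Score every token of the sentence against the current one.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in inputs]
        weights = softmax(scores)
        # The new vector is a weighted average of all input vectors.
        outputs.append([
            sum(weight * vector[i] for weight, vector in zip(weights, inputs))
            for i in range(len(query))
        ])
    return outputs


# Hypothetical static vectors: one fixed vector per word, as in Word2Vec.
static_vectors = {
    'river':   [1.0, 0.0],
    'deposit': [0.0, 1.0],
    'bank':    [0.5, 0.5],
}

bank_near_river = contextualize(['river', 'bank'], static_vectors)[1]
bank_near_deposit = contextualize(['bank', 'deposit'], static_vectors)[0]

print("Static vector:      ", static_vectors['bank'])
print("bank near 'river':  ", bank_near_river)
print("bank near 'deposit':", bank_near_deposit)
```

The static vector of *bank* is the same in both cases, while the contextualized vector moves towards *river* in one sentence and towards *deposit* in the other.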

To improve your understanding of this subject, please read the following [blog post](https://medium.com/swlh/differences-between-word2vec-and-bert-c08a3326b5d1).

To use BERT, there are multiple Python libraries that allow us to load pre-trained models and use them for classification tasks, but also for embedding generation.
Since this involves Machine Learning frameworks, such as PyTorch and TensorFlow, and the Transformers library, I will cover them in the next notebook, which focuses on frameworks for Machine Learning.
For now, you can follow the steps on the [TensorFlow page](https://www.tensorflow.org/hub/tutorials/bert_experts) to use existing pre-trained BERT models on sentences.