Demystifying embeddings
=======================

Embeddings are one of the core concepts in NLP. They are a way to represent text in a dense, low-dimensional vector space. Why low-dimensional? We will see later in this post.

## Encoding vs Embeddings:
Machines only understand bits and bytes, that is, data expressed in the binary number system (base 2) using 0 and 1. All forms of data are ultimately stored as sequences of 0s and 1s. Multiple encoding schemes (UTF-8, ISO-8859, etc.) exist to represent characters in binary format. Whereas an encoding represents textual data in a form that computers can store and process, an embedding is a dense vector representation of data (words, sentences, images).
Embeddings aim to capture semantic similarities and relationships in a lower-dimensional space. They enable models to work with the context and meaning of words or phrases beyond simple character representation.
Both encodings and embeddings play crucial roles in text processing and machine learning. Encodings ensure that text is represented in a binary format for storage and transmission, while embeddings let models leverage the semantic relationships between words or sentences. Understanding the differences and interplay between these two concepts is essential for working effectively with text data in NLP and other domains.
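
To make the "encoding" half concrete, here is a minimal sketch that turns a string into UTF-8 bytes and shows each byte as 8 bits. The sample string `"café"` is only an illustrative choice; it shows that a single character can take more than one byte.

```python
# Encode a string to bytes with UTF-8 and show each byte as 8 bits.
text = "café"                       # 4 characters; "é" is not ASCII
encoded = text.encode("utf-8")      # bytes object
bits = [format(byte, "08b") for byte in encoded]

print(list(encoded))                # [99, 97, 102, 195, 169] -> 5 bytes
print(bits)

# Decoding with the same scheme recovers the original text exactly.
decoded = encoded.decode("utf-8")
print(decoded == text)              # True
```

The encoding is lossless and reversible, but the byte values say nothing about what the word means. That gap is what embeddings try to fill.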

In [1]:
import numpy as np
text = "hello"

# Encode the text: print the ASCII code of each character
for ch in text:
    print(f"ASCII value of {ch}::: ", ord(ch))


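# Embeddings can be pre-trained or learned while the model is trained.
# In practice, embeddings have hundreds of dimensions.
# For illustration only: a random 2-D vector per character.
# Note: a fresh vector is drawn on every call, so the two "l"s differ;
# a real (static) embedding table maps the same token to the same vector.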

for ch in text:
    print(f"Embedding of {ch}::: ", np.random.rand(2).round(3))

ASCII value of h:::  104
ASCII value of e:::  101
ASCII value of l:::  108
ASCII value of l:::  108
ASCII value of o:::  111
Embedding of h:::  [0.3   0.559]
Embedding of e:::  [0.716 0.37 ]
Embedding of l:::  [0.135 0.508]
Embedding of l:::  [0.5   0.067]
Embedding of o:::  [0.959 0.731]


Note that embedding dimensions are abstract: they have no human-interpretable meaning on their own. The dimensions encode complex interactions learned by the model.

For example, the embedding for the word "king" might be an n-dimensional vector like [0.25, -0.34, 0.91, ...]. Each number in this vector contributes to the overall meaning of "king" but does not correspond to a single, easily interpretable feature.
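
One common way to picture a static embedding is as a lookup table: a matrix with one row per token. The sketch below assumes a hypothetical three-word vocabulary and random values standing in for learned weights; `embed` is an illustrative helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the run is repeatable

vocab = {"king": 0, "queen": 1, "apple": 2}   # hypothetical toy vocabulary
embedding_dim = 4                              # real models use hundreds
# Random values stand in for weights that a model would normally learn.
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def embed(word):
    """Return the table row assigned to `word`."""
    return embedding_table[vocab[word]]

king_a = embed("king")
king_b = embed("king")
print(king_a.shape)                      # (4,)
print(np.array_equal(king_a, king_b))    # True: same word, same vector
```

No single column of `embedding_table` means "royalty" or "fruit". In a trained table, meaning shows up in how whole rows relate to each other.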

#### Naive Integer Embedding

A naive way to represent text tokens is simple integer encoding: assign a unique integer to each word (or token) in your vocabulary. It’s straightforward but lacks the nuance of more sophisticated methods like word embeddings.

In [2]:
text = "a quick brown fox jumps over the lazy dog"

# tokenize the text
tokenized_text = text.split()
print(tokenized_text)
print("")

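# create a dictionary mapping each unique word to an integer ID
# (simple integer values stand in for embeddings here)

# dict.fromkeys drops repeated words while keeping first-seen order,
# so IDs stay contiguous even if a word occurs more than once
vocab = {word: i for i, word in enumerate(dict.fromkeys(tokenized_text))}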
print("Embeddings of words in the text")
for word in tokenized_text:
    print(f"Embedding of {word}::: ", vocab[word])

# print encoded text
print("")
encoded_text = [vocab[word] for word in tokenized_text]
print("Encoded text::: ", encoded_text)

['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Embeddings of words in the text
Embedding of a:::  0
Embedding of quick:::  1
Embedding of brown:::  2
Embedding of fox:::  3
Embedding of jumps:::  4
Embedding of over:::  5
Embedding of the:::  6
Embedding of lazy:::  7
Embedding of dog:::  8

Encoded text:::  [0, 1, 2, 3, 4, 5, 6, 7, 8]


#### Limitations of Integer encoding:

* The numbers are arbitrary and do not represent any relationship between words. The semantics are lost with this encoding scheme.
* Out-of-vocabulary words (words that did not appear in the training data) have no integer representation.
* The naive integer representation lacks the dense, distributed representation that embeddings provide, which typically hurts the performance of machine learning models.
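
The first two limitations can be seen in a few lines. This is a sketch only: `UNK_ID` and `encode` are hypothetical names, and reserving one extra ID for unknown words is a common workaround rather than something the scheme above provides.

```python
text = "a quick brown fox jumps over the lazy dog"
tokens = text.split()
vocab = {word: i for i, word in enumerate(dict.fromkeys(tokens))}

# 1) Integer distance is an accident of word order, not of meaning.
dist_fox_dog = abs(vocab["fox"] - vocab["dog"])       # 5
dist_fox_jumps = abs(vocab["fox"] - vocab["jumps"])   # 1
print(dist_fox_dog, dist_fox_jumps)

# 2) An unseen word has no ID at all.
print("cat" in vocab)                                 # False

# Hypothetical workaround: map every unknown word to one reserved ID.
UNK_ID = len(vocab)

def encode(word):
    return vocab.get(word, UNK_ID)

print(encode("fox"), encode("cat"))                   # 3 9
```

By these numbers "fox" is closer to "jumps" than to "dog", and every unknown word collapses to the same ID, so the model learns nothing specific about any of them.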


#### One-Hot Encoding
Instead of assigning a unique integer, you can create a sparse one-hot encoded vector for each word: a vector as long as the vocabulary, with a 1 at the word's index and 0 everywhere else.

In [6]:
# sort the tokens and create a vocabulary
# generate one hot encoding for each token

sorted_tokens = sorted(set(tokenized_text))
vocab = {word: i for i, word in enumerate(sorted_tokens)}
embeddings = np.eye(len(vocab)) # one hot encoding of the tokens (vocab_size x vocab_size)

for word in sorted_tokens:
    print(f"Embedding of {word:<10}::: ", embeddings[vocab[word]])

Embedding of a         :::  [1. 0. 0. 0. 0. 0. 0. 0. 0.]
Embedding of brown     :::  [0. 1. 0. 0. 0. 0. 0. 0. 0.]
Embedding of dog       :::  [0. 0. 1. 0. 0. 0. 0. 0. 0.]
Embedding of fox       :::  [0. 0. 0. 1. 0. 0. 0. 0. 0.]
Embedding of jumps     :::  [0. 0. 0. 0. 1. 0. 0. 0. 0.]
Embedding of lazy      :::  [0. 0. 0. 0. 0. 1. 0. 0. 0.]
Embedding of over      :::  [0. 0. 0. 0. 0. 0. 1. 0. 0.]
Embedding of quick     :::  [0. 0. 0. 0. 0. 0. 0. 1. 0.]
Embedding of the       :::  [0. 0. 0. 0. 0. 0. 0. 0. 1.]


#### Limitations of One-Hot encoding:

* The dimensionality of each vector equals the vocabulary size, so the encoding matrix grows quadratically with the vocabulary. The resulting sparse vectors are memory-inefficient and computationally expensive.
* No semantic information: even though "fox" and "dog" are semantically similar (both are animals), they have completely different vectors. This prevents models from understanding relationships like synonyms, antonyms, or words that often appear together.
* One-hot encoding is based on a fixed vocabulary. If you encounter a word that wasn’t in your vocabulary during training (an out-of-vocabulary word), there is no way to represent it. This limits the generalization of models in real-world scenarios.
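
A short sketch of the first two points. The `cosine` helper is defined here for illustration, and the 50,000-word vocabulary is a hypothetical figure chosen only to show the scale.

```python
import numpy as np

tokens = sorted(set("a quick brown fox jumps over the lazy dog".split()))
vocab = {word: i for i, word in enumerate(tokens)}
one_hot = np.eye(len(vocab))

def cosine(u, v):
    """Cosine similarity of two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Any two different words are orthogonal, related or not.
sim_fox_dog = cosine(one_hot[vocab["fox"]], one_hot[vocab["dog"]])
sim_fox_the = cosine(one_hot[vocab["fox"]], one_hot[vocab["the"]])
sim_fox_fox = cosine(one_hot[vocab["fox"]], one_hot[vocab["fox"]])
print(sim_fox_dog, sim_fox_the, sim_fox_fox)   # 0.0 0.0 1.0

# Memory for a dense float64 one-hot matrix with a hypothetical vocabulary.
vocab_size = 50_000
dense_bytes = vocab_size * vocab_size * 8
print(dense_bytes / 1e9, "GB")                 # 20.0 GB
```

The similarity is 0 for every pair of distinct words, so the representation cannot tell a related pair from an unrelated one.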


#### Word Embeddings
Word embeddings address most of the limitations of one-hot encoding:
* Low Dimensionality: Word embeddings reduce high-dimensional one-hot vectors to lower-dimensional dense vectors (e.g., 100 to 300 dimensions).
* Semantic Meaning: Embeddings capture semantic relationships between words, placing similar words (e.g., "fox" and "dog") closer in the embedding space.
* Contextual Information: Embedding models like BERT and GPT can capture context, meaning that the same word can have different embeddings depending on its context.
* Better Generalization: Word embeddings generalize better because they capture word similarity and relationships in their vector representations.
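
The link between one-hot vectors and dense embeddings can be sketched directly: multiplying a one-hot vector by an embedding matrix selects one row of that matrix. The matrix `E`, its sizes, and the random values below are assumptions for illustration; in a real model the values are learned.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocab_size, embedding_dim = 9, 3        # hypothetical sizes
E = rng.normal(size=(vocab_size, embedding_dim))   # stand-in for learned weights

word_id = 3                             # e.g. index of "fox" in a sorted vocab
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

via_matmul = one_hot @ E                # 9-dim sparse vector -> 3-dim dense vector
via_lookup = E[word_id]                 # same result without the multiplication

print(via_matmul.shape)                       # (3,)
print(np.allclose(via_matmul, via_lookup))    # True
```

This is why embedding layers are usually implemented as a row lookup. The large sparse vector never has to be materialized.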

In [4]:
text = "a quick brown fox jumps over the lazy dog"

import torch
from transformers import BertTokenizer, BertModel

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Ensure the model is in evaluation mode
model.eval()

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")

# Get token IDs and attention mask
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# Get the embeddings from the model
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)

# The outputs contain the last hidden states
last_hidden_states = outputs.last_hidden_state

# Convert token IDs back to tokens
input_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

# Get embeddings for all tokens
token_embeddings = last_hidden_states[0]

embed_dict = {}

# Keep the embeddings for "fox" and "dog" only

for token, embedding in zip(input_tokens, token_embeddings):
    if token in ["fox", "dog"]:
        embed_dict[token] = embedding

# shape of the embeddings
embed_dict["fox"].shape

torch.Size([768])

In [5]:
# calculate the cosine similarity between fox and dog
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(embed_dict["fox"].reshape(1, -1), embed_dict["dog"].reshape(1, -1))
print("Similarity between fox and dog::: ", similarity)

Similarity between fox and dog:::  [[0.5633921]]
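
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. It ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). The sketch below computes it by hand with NumPy; `cosine_similarity_1d` is an illustrative helper, not the scikit-learn function used above.

```python
import numpy as np

def cosine_similarity_1d(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||) for two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine_similarity_1d([1, 2, 3], [2, 4, 6])   # close to 1
orthogonal = cosine_similarity_1d([1, 0], [0, 1])             # 0
opposite = cosine_similarity_1d([1, 2], [-1, -2])             # close to -1

print(round(same_direction, 6), orthogonal, round(opposite, 6))
```

Because BERT embeddings are contextual, the fox/dog score above is specific to this sentence. The same two words in a different sentence would likely give a different value.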
