### Transformer-based embedding model

<img src="transformer_embedding.png" width="300" height="340">


### Inputs of the embedding model

- The model's inputs are mapped to input embeddings.
- Each input token has a corresponding embedding that is learned during training.
- The tokenizer first converts the text into a sequence of numerical IDs, one per token it recognizes in the input. This mapping is a contract between the tokenizer and the model: the model interprets an ID correctly only if it stands for the same token it stood for during training. As a result, a tokenizer cannot simply be swapped for another one after training. The simplest way to split the input is to treat each character, or each whole word, as a separate token.

### Tokenization
1. Character-based tokenization
The most basic approach is to let the model work directly at the character or byte level.
- Small vocabulary
- Ambiguity: a single character carries little meaning on its own

2. Word-based tokenization
Alternatively, we could use whole words, since they carry meaning on their own.
- Huge vocabulary
- Unseen words cannot be represented
#### Difference between them

- Word-Based Tokenization:
  - Training: Easier to train with shorter sequences and manageable vocabulary sizes.
  - Advantages: Simple, intuitive, and computationally efficient for smaller datasets.
  - Disadvantages: Suffers from out-of-vocabulary (OOV) issues and has difficulty handling morphological variations.

- Character-Based Tokenization:
  - Training: Requires handling longer sequences, making training more computationally intensive.
  - Advantages: No OOV problems and compact vocabulary size.
  - Disadvantages: Longer sequences, slower training, and loss of semantic information at the token level.

-  **Character-based tokenization** requires more time to train compared to **word-based tokenization**. This is because character-based tokenization generates longer sequences, requiring the model to process more tokens for the same text. Even though it has a smaller vocabulary, the increased sequence length and the need to learn relationships between individual characters make training slower.

### Practical Example of Word-Based Tokenization:
Suppose we have the sentence:  
**"The quick brown fox jumps over the lazy dog."**

In **word-based tokenization**, this sentence would be split into individual word tokens:
- Tokens: `["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]`
  
Each of these words would be treated as a single token, and during training, the model would learn an embedding for each word. If the model encounters an unfamiliar word (e.g., "quicksilver"), it may struggle because it's not in the vocabulary, leading to an OOV (Out of Vocabulary) issue.
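
The OOV problem can be made concrete with a minimal sketch. It assumes a vocabulary built from the example sentence alone and a hypothetical `[UNK]` entry with ID 0; the `encode` helper is illustrative and not part of any library.

```python
import re

# Build a word-level vocabulary from the example sentence only.
sentence = "The quick brown fox jumps over the lazy dog."
words = re.findall(r"\w+", sentence)

# Reserve ID 0 for a hypothetical unknown-word token.
vocab = {"[UNK]": 0}
for word in words:
    vocab.setdefault(word, len(vocab))

def encode(text):
    """Map each word to its ID, falling back to [UNK] for unseen words."""
    return [vocab.get(word, vocab["[UNK]"]) for word in re.findall(r"\w+", text)]

print(encode("The quick fox"))        # every word is in the vocabulary
print(encode("The quicksilver fox"))  # "quicksilver" collapses to the [UNK] ID
```

Every unseen word ends up with the same ID, so the model cannot tell "quicksilver" apart from any other unknown word.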

---

### Practical Example of Character-Based Tokenization:
Taking the same sentence:  
**"The quick brown fox jumps over the lazy dog."**

In **character-based tokenization**, the sentence would be split into individual characters:
- Tokens: `["T", "h", "e", " ", "q", "u", "i", "c", "k", " ", "b", "r", "o", "w", "n", " ", "f", "o", "x", " ", "j", "u", "m", "p", "s", " ", "o", "v", "e", "r", " ", "t", "h", "e", " ", "l", "a", "z", "y", " ", "d", "o", "g", "."]`

Here, each character (including spaces and punctuation) is a token, so the sequence becomes much longer. However, the model can handle any word, even something unfamiliar like "quicksilver," by breaking it into characters, avoiding the OOV issue. 

---

### Time to Train:
- **Word-based tokenization** would be faster to train since there are fewer tokens per sentence (9 tokens in this example).
- **Character-based tokenization** takes more time because the sentence results in 44 tokens (35 letters, 8 spaces, and the period), requiring the model to process more tokens per input.
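
The difference in sequence length can be checked directly. This is a small sketch that assumes words are split with a simple regular expression (dropping punctuation) and that every character, including spaces and the period, counts as a character token.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog."

word_tokens = re.findall(r"\w+", sentence)  # words only, punctuation dropped
char_tokens = list(sentence)                # every character, spaces and "." included

print("word-level sequence length:", len(word_tokens))
print("character-level sequence length:", len(char_tokens))
```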

--- 
### Subword tokenization
Words having the same root form should be similar in terms of meaning.
**Subword tokenization** is a method that splits words into smaller, meaningful subword units, ensuring that words with similar roots or forms are tokenized in a way that maintains their semantic relationships. For example, words like "run," "running," and "runner" would be tokenized into subword units that share common components, allowing the model to generalize their meanings more effectively.

Here’s a more detailed breakdown:

### Why Subword Tokenization is Important:
- **Words with the same root** (like "happy," "happily," "happiness") should be represented in a way that their shared root ("happi") captures the similarity in meaning. By breaking these words into subword units, the model can understand their connection.
- **Morphological variants** of words (like verb tenses, plural forms, or derived nouns) can be handled more effectively than in word-based tokenization, where these forms would be treated as separate tokens.

### Subword Tokenization Approaches:
1. **NLP-Based Techniques (Stemming or Lemmatization)**:
   - Stemming and lemmatization are traditional **NLP techniques** that reduce words to their base or root forms.
     - **Stemming** strips suffixes, producing a crude base (e.g., "running" → "run").
     - **Lemmatization** converts a word to its dictionary form (e.g., "better" → "good").
   - While these techniques make words with similar roots look alike, they lack the flexibility and statistical learning capabilities of modern tokenization approaches.

2. **Trainable Approaches Based on Statistics**:
   - Subword tokenization methods like **Byte Pair Encoding (BPE)** and **WordPiece** are trainable and based on the statistical frequency of subword units in a corpus.
   - These algorithms start by splitting text into individual characters, then iteratively merge pairs of characters (or subword units): BPE merges the most frequent pair, while WordPiece chooses the pair that most improves the likelihood of the training corpus. The result is a vocabulary that captures both common words and frequent subword patterns.
   - For example:
     - "happily" might be split into `["happi", "ly"]`, and "happiness" into `["happi", "ness"]`.
     - This ensures that the common subword "happi" is learned across all words sharing the same root, allowing the model to associate these variations with similar meanings.
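
The merge loop described above can be sketched in a few lines. This is a toy BPE trainer under stated assumptions: the corpus and its word frequencies are made up, ties are broken by first occurrence, and real implementations add details (end-of-word markers, byte-level fallback) that are left out here.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency, each word split into characters.
word_freqs = {"happily": 5, "happiness": 4, "happy": 6}
corpus = {tuple(word): freq for word, freq in word_freqs.items()}

merges = []
for _ in range(4):  # the number of merges is a tunable hyperparameter
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent pair, first one wins ties
    corpus = merge_pair(corpus, best)
    merges.append(best)

print(merges)
print(list(corpus))
```

With this toy corpus the shared prefix is merged step by step until `happi` becomes a single symbol, which is the behaviour the example above relies on.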

### Advantages of Trainable Subword Tokenization:
- **Shared Meaning for Morphologically Related Words**: Words with similar root forms are tokenized in a way that captures their related meanings.
- **Efficient Vocabulary**: By learning the most common subwords statistically, the vocabulary can capture both frequent words and meaningful subword units, reducing the total vocabulary size.
- **Handling Rare and Unseen Words**: Rare words are split into smaller subwords, ensuring that even unfamiliar words can be processed by breaking them into known units.

### Example:
For the words "runner," "running," and "run," subword tokenization might generate tokens like:
- `["run", "ner"]`
- `["run", "ning"]`
- `["run"]`

The model can then understand that all these tokens share the same root subword "run," which aids in learning their related meanings.
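
Once a subword vocabulary exists, splitting a word is typically a greedy longest-match search. The sketch below assumes a hand-written three-entry vocabulary and the `##` continuation marker used by BERT-style WordPiece tokenizers; `wordpiece_tokenize` is an illustrative helper, not a library function.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical hand-written vocabulary, not taken from any real model.
toy_vocab = {"run", "##ner", "##ning"}

for word in ["run", "runner", "running", "walk"]:
    print(word, wordpiece_tokenize(word, toy_vocab))
```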

### Summary:
- **Subword tokenization** allows words with similar root forms to be tokenized similarly, helping the model understand their semantic relationships.
- **NLP-based techniques** like stemming and lemmatization focus on reducing words to their base forms but don't offer the flexibility of modern, trainable methods.
- **Trainable methods** like BPE and WordPiece use statistical patterns to learn subword units, enabling better handling of word variations, rare words, and vocabulary efficiency.

This approach allows NLP models to generalize better, handle unseen words more effectively, and capture semantic similarities between related words.



---

### **Process: Input of Embedding Model**

1. **Text Input**:
   - You start with a sentence, for example:  
     **Text**: *"Transformers are amazing models for NLP."*

2. **Tokenizer**:
   - The sentence is split into tokens, which can be words or subwords.
   - For instance, the sentence could be tokenized as:  
     **Tokens**: `['transform', '##ers', 'are', 'amazing', 'models', 'for', 'nlp', '.']`
   - This is an example of **subword tokenization**, where larger words like "Transformers" are split into smaller units: `"transform"` and `"##ers"`.

3. **Sequence of IDs**:
   - Each token is then mapped to a unique numerical identifier (ID) using the tokenizer's vocabulary, which is fixed when the tokenizer is built and must be the same one the model was trained with.
   - For example, the tokens might be mapped to the following IDs:  
     **Token IDs**: `[19081, 2015, 2024, 6429, 3597, 2005, 17953, 1012]`

4. **Embedding Layer**:
   - These token IDs are passed into the model’s embedding layer, where each ID is mapped to a **dense vector** (embedding). These vectors represent the meaning of the tokens in a form that can be understood by the model.
   - For instance, the ID `19081` (representing the token "transform") is mapped to a vector, say, of dimension 768. Each token has its corresponding vector in this step.
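
The four steps can be sketched end to end without any library. Everything here is an assumption made for illustration: the toy vocabulary and its IDs, the embedding dimension of 4, and the randomly initialised table (in a real model the rows are learned parameters).

```python
import random

# Hypothetical toy vocabulary; real tokenizers ship tens of thousands of entries.
vocab = {"[UNK]": 0, "transform": 1, "##ers": 2, "are": 3, "amazing": 4,
         "models": 5, "for": 6, "nlp": 7, ".": 8}
embedding_dim = 4  # tiny on purpose

# Embedding table: one row of floats per token ID.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in range(len(vocab))]

# Steps 2-4: tokens -> IDs -> one dense vector per token.
tokens = ["transform", "##ers", "are", "amazing", "models", "for", "nlp", "."]
token_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
vectors = [embedding_table[token_id] for token_id in token_ids]

print(token_ids)
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

The embedding layer is just a row lookup: the ID selects a row of the table, and nothing else about the sentence influences the result at this stage.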

---

### **Practical Example Explanation**:

- **Text**: "Transformers are amazing models for NLP."
  
- **Tokens** (after tokenization):  
  `['transform', '##ers', 'are', 'amazing', 'models', 'for', 'nlp', '.']`
  
- **Token IDs** (corresponding ID numbers for each token):  
  `[19081, 2015, 2024, 6429, 3597, 2005, 17953, 1012]`
  
- **Embedding Layer**:  
  The sequence of token IDs is passed through the embedding layer, and each ID gets mapped to its corresponding dense vector (embedding). For example:
  - `"transform"` (ID: 19081) → vector: `[0.23, -0.12, 0.54, ..., 0.92]`
  - `"##ers"` (ID: 2015) → vector: `[0.35, 0.18, -0.25, ..., 0.14]`
  - (and so on for each token)

---

### **Workflow Overview (Arrow-based Steps)**:

1. **Text Input** → 
   - (*Input sentence: "Transformers are amazing models for NLP."*)

2. **Tokenizer** → 
   - (*Splits the sentence into tokens: `['transform', '##ers', 'are', 'amazing', 'models', 'for', 'nlp', '.']`*)

3. **Token IDs Generation** → 
   - (*Each token is mapped to its numerical ID: `[19081, 2015, 2024, 6429, 3597, 2005, 17953, 1012]`*)

4. **Embedding Layer** → 
   - (*Each token ID is converted into a dense embedding vector*).

---

### **Summary**:
- **Text Input**: Raw text is provided to the embedding model.
- **Tokenizer**: The text is split into tokens (words/subwords).
- **Token IDs**: Each token is mapped to a numerical ID.
- **Embedding Layer**: The IDs are converted into dense embeddings, which represent the input text in a form that the model can process. 


In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
#  Importing SentenceTransformer Model
from sentence_transformers import SentenceTransformer

# Loading a pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

paraphrase-MiniLM-L6-v2: A lightweight transformer model based on Microsoft’s MiniLM architecture, which balances performance and efficiency. It is often used for tasks like paraphrase detection.

## Tokenizing the Text

- **Tokenization:** The process of converting the input text into smaller units, such as words or subwords, that the transformer model can work with.

In [3]:
tokenized_data = model.tokenize(['Walker walked a long walk'])
tokenized_data

{'input_ids': tensor([[ 101, 5232, 2939, 1037, 2146, 3328,  102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

-- tensor([[ 101, 5232, 2939, 1037, 2146, 3328,  102]])
- **Explanation:** This tensor represents the token IDs corresponding to the words or subwords in the sentence.
- The numbers are the IDs of the tokens in the model's vocabulary.
- 101 and 102 are special tokens added by the model:
    - 101: [CLS] token, which is added at the beginning of the input. It's used as the starting marker for the sentence.
    - 102: [SEP] token, added at the end, marking the separation or end of the sentence.
- The numbers between 101 and 102 represent the tokenized words (the tokenizer lowercases the input, so "Walker" becomes "walker"):
    - 5232: Token ID for "walker"
    - 2939: Token ID for "walked"
    - 1037: Token ID for "a"
    - 2146: Token ID for "long"
    - 3328: Token ID for "walk"

-- tensor([[0, 0, 0, 0, 0, 0, 0]])
- Explanation: This tensor indicates the "segment" each token belongs to.
- In this case, all values are 0 because the input consists of a single sentence. When models like BERT are used for tasks involving pairs of sentences, this field helps distinguish between sentence 1 (values = 0) and sentence 2 (values = 1).
- Since this input is a single sentence, all the tokens belong to the same segment.

-- tensor([[1, 1, 1, 1, 1, 1, 1]])
- Explanation: This tensor is a mask that tells the model which tokens to attend to.
- A value of 1 indicates that the token should be attended to (i.e., it's a valid token in the input), while a value of 0 would indicate that the token should be ignored (e.g., padding tokens).
- Since there are no padding tokens in this case, all values are 1, meaning all tokens are valid and should be attended to during model processing.

-- Summary of the tokenized data:
- input_ids: Numeric representation of each token in the sentence.
- token_type_ids: Indicates the sentence segment for each token (all 0 here since it's a single sentence).
- attention_mask: Specifies which tokens are real tokens to be attended to (all 1 here).
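
When sentences of different lengths are batched, the shorter ones are padded and the attention mask records which positions are real. A minimal sketch of that step, assuming right-padding with ID 0; `pad_batch` is a hypothetical helper, and the second sequence is made up from IDs that appear in the example above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        padding = max_len - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        attention_mask.append([1] * len(ids) + [0] * padding)
    return input_ids, attention_mask

batch = [
    [101, 5232, 2939, 1037, 2146, 3328, 102],  # "walker walked a long walk"
    [101, 1037, 3328, 102],                    # hypothetical shorter input: "a walk"
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```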

In [4]:
# took another example
# Sentences we want to encode. Example:
sentence_1 = ['This framework generates embeddings for each input sentence']

In [5]:
tokenized_data_1 = model.tokenize(sentence_1)
tokenized_data_1

{'input_ids': tensor([[  101,  2023,  7705, 19421,  7861,  8270,  4667,  2015,  2005,  2169,
           7953,  6251,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [6]:
sentences = ["This is an example sentence", "Each sentence is converted"]
tokenized_data_2 = model.tokenize(sentences)
tokenized_data_2

{'input_ids': tensor([[ 101, 2023, 2003, 2019, 2742, 6251,  102],
         [ 101, 2169, 6251, 2003, 4991,  102,    0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 0]])}

In [7]:
# convert the token IDs (numeric representation) back into their corresponding tokens (words or subwords).
model.tokenizer.convert_ids_to_tokens(tokenized_data['input_ids'][0])

['[CLS]', 'walker', 'walked', 'a', 'long', 'walk', '[SEP]']

In [8]:
# A SentenceTransformer is a stack of modules. Tokens are the input
# of the first one, so we can ignore the rest.
first_module = model._first_module()
first_module.auto_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 384, padding_idx=0)
    (position_embeddings): Embedding(512, 384)
    (token_type_embeddings): Embedding(2, 384)
    (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-5): 6 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)


- **Tokens** are input to the first module: When you input a sentence and tokenize it, the token IDs are first passed into this first module (first_module) for processing. The tokens will go through various layers (like self-attention and feed-forward neural networks) within this first module.

- **AutoModel:** This is the core transformer model responsible for handling token embeddings and other internal computations like attention mechanisms. The term auto_model refers to a general-purpose, pre-trained transformer model (from the transformers library), like BERT, that is loaded based on the model type used (MiniLM in this case).

Example interpretation:
`first_module.auto_model` is the underlying transformer (a `BertModel` here), so this attribute gives access to the internals that operate on the tokenized inputs.

To explore the layers or components within this module, inspect attributes such as `first_module.auto_model.encoder` (the encoder layers) or `first_module.auto_model.embeddings` (how embeddings are generated for tokens).

Usage:
- For visualization: you might want to visualize what happens to the tokens as they pass through this first module (for example, by extracting attention weights).
- For debugging: accessing specific parts of the model can be helpful for debugging or fine-tuning specific components.

## Input token embeddings


In [9]:
embeddings = first_module.auto_model.embeddings
embeddings

BertEmbeddings(
  (word_embeddings): Embedding(30522, 384, padding_idx=0)
  (position_embeddings): Embedding(512, 384)
  (token_type_embeddings): Embedding(2, 384)
  (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

When text is passed through the BertEmbeddings layer:

- Word embeddings are looked up for each token.
- Position embeddings are added to provide information about the position of each token in the sequence.
- Token type embeddings are added to distinguish between different parts of the input (e.g., question and answer).
- The sum of the three embeddings is normalized using LayerNorm.
- Dropout is applied to the embeddings during training to prevent overfitting.
- The final output is one combined embedding vector for each token in the input sequence.
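
The combination step can be sketched in plain Python. The table sizes, the dimension of 8, and the random values are assumptions for illustration; the learned scale and shift of LayerNorm and the training-time dropout are deliberately left out.

```python
import math
import random

def layer_norm(vector, eps=1e-12):
    """Normalise one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def random_table(rows, dim, rng):
    """Stand-in for a learned embedding table."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(rows)]

rng = random.Random(0)
dim = 8
word_table = random_table(10, dim, rng)      # hypothetical 10-token vocabulary
position_table = random_table(16, dim, rng)  # up to 16 positions
type_table = random_table(2, dim, rng)       # segment 0 / segment 1

def embed(input_ids, token_type_ids):
    """Sum word, position and token-type vectors, then normalise each sum."""
    outputs = []
    for position, (token_id, type_id) in enumerate(zip(input_ids, token_type_ids)):
        summed = [w + p + t for w, p, t in zip(word_table[token_id],
                                               position_table[position],
                                               type_table[type_id])]
        outputs.append(layer_norm(summed))
    return outputs

embedded = embed([1, 5, 5, 2], [0, 0, 0, 0])
print(len(embedded), len(embedded[0]))
# Token ID 5 appears at positions 1 and 2; the two results differ because a
# different position vector is added each time.
print(embedded[1] == embedded[2])
```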

In [12]:
import torch
import plotly.express as px

# Define two example sentences
first_sentence = "vector search optimization"
second_sentence = "we learn to apply vector search optimization"

# Set the device to MPS (Apple Silicon GPU) if available, otherwise CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")  # Use MPS for Apple, or fall back to CPU

with torch.no_grad():  # Disable gradient calculations (saves memory and computation for inference)
    
    # Tokenize the first and second sentences, converting them into token IDs
    first_tokens = model.tokenize([first_sentence])
    second_tokens = model.tokenize([second_sentence])

    # Get the word embeddings for the first sentence tokens
    first_embeddings = embeddings.word_embeddings(
        first_tokens["input_ids"].to(device)  # Move token IDs to the selected device (CPU/MPS)
    )
    
    # Get the word embeddings for the second sentence tokens
    second_embeddings = embeddings.word_embeddings(
        second_tokens["input_ids"].to(device)  # Move token IDs to the selected device (CPU/MPS)
    )

# Print the shapes of the embedding tensors for both sentences
first_embeddings.shape, second_embeddings.shape


(torch.Size([1, 5, 384]), torch.Size([1, 9, 384]))

- Device Setup: Checks if MPS is available (for Apple devices), otherwise defaults to CPU.
- No Gradient Calculations: The torch.no_grad() context is used to avoid unnecessary gradient computations since we are only doing inference.
- Tokenization: Converts the input sentences into token IDs, which are fed into the model.
- Embeddings: Retrieves word embeddings for the tokenized inputs, which are moved to the appropriate device.
- Shape Output: Returns the shape of the embedding tensors for each sentence, showing the number of tokens and the size of the embedding vectors.

In [13]:
from sentence_transformers import util

# Compute cosine similarity between the embeddings of the two sentences
distances = util.cos_sim(
    first_embeddings.squeeze(),  # Remove extra dimensions from the first sentence embeddings
    second_embeddings.squeeze()  # Remove extra dimensions from the second sentence embeddings
).cpu().numpy()  # Move the tensor to the CPU and convert it to a NumPy array for easy manipulation


In [14]:
# visualization with plotly

px.imshow(
    distances,  # Cosine similarity matrix between tokens from the two sentences
    x=model.tokenizer.convert_ids_to_tokens(
        second_tokens["input_ids"][0]  # Convert token IDs of second sentence to tokens (words)
    ),
    y=model.tokenizer.convert_ids_to_tokens(
        first_tokens["input_ids"][0]  # Convert token IDs of first sentence to tokens (words)
    ),
    text_auto=True,  # Automatically display similarity values on the heatmap
)


- This code calculates the cosine similarity between the word embeddings of the two sentences, then visualizes the similarity matrix as a heatmap with the tokens of both sentences labeled along the axes. Brighter cells indicate higher similarity between tokens.
- Input token embeddings do not change, no matter the context or order of the words. The same token always gets the same vector.
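
The context-free property can be reproduced with a plain lookup table. This sketch assumes random 16-dimensional vectors standing in for learned word embeddings and a hand-written `cosine` helper.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

rng = random.Random(0)
tokens = ["we", "learn", "to", "apply", "vector", "search", "optimization"]
# Hypothetical random table standing in for learned word embeddings.
table = {token: [rng.gauss(0, 1) for _ in range(16)] for token in tokens}

first = ["vector", "search", "optimization"]
second = ["we", "learn", "to", "apply", "vector", "search", "optimization"]

# Rows follow the first sentence, columns follow the second one.
similarity = [[cosine(table[a], table[b]) for b in second] for a in first]

# Shared tokens are read from the same table row, so their similarity is 1
# regardless of position or neighbouring words.
for a, row in zip(first, similarity):
    print(a, [round(value, 2) for value in row])
```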

### Model

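- An embedding model starts with a sequence of token embeddings.
- The order of the tokens is modelled by positional encodings. The original Transformer uses fixed sinusoidal functions; BERT-style models such as this one learn a position embedding table instead (the `position_embeddings` layer printed above).
- The input is processed by N stacked modules of the network.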
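
For the sinusoidal variant, a minimal sketch of the fixed encoding from the original Transformer paper follows. The function name and the small table size are illustrative choices.

```python
import math

def sinusoidal_positions(num_positions, dim):
    """Return a num_positions x dim table of fixed sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Dimensions 2k and 2k+1 share one frequency: sin on even, cos on odd.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

encodings = sinusoidal_positions(num_positions=4, dim=6)
for pos, row in enumerate(encodings):
    print(pos, [round(value, 3) for value in row])
```

Each position gets a distinct vector, which is added to the token embedding so that the same token at different positions produces different inputs to the network.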

### Visualizing the input embeddings

How does this mechanism capture the meaning of tokens? Input token embeddings are context-free, and they are also parameters of the transformer model. The model learns them during the training phase so that they represent the meaning of each token as well as possible. We can access them as a matrix.

In [15]:
token_embeddings = (
    first_module.auto_model
    .embeddings
    .word_embeddings
    .weight
    .detach()
    .cpu()  # make sure the tensor is on the CPU before converting to NumPy
    .numpy()
)

token_embeddings.shape

(30522, 384)

Since the token embeddings are context-free, we can map each vector to its corresponding token. Our matrix has 30522 rows, which equals the size of the vocabulary. We now get the vocabulary from the tokenizer and sort it by token ID, so that the position of each token matches its row in the matrix.

In [16]:
import random

# Retrieve the tokenizer's vocabulary as a dictionary {token: token_id}
vocabulary = first_module.tokenizer.get_vocab()

# Sort the vocabulary based on the token IDs (values in the dictionary)
sorted_vocabulary = sorted(
    vocabulary.items(),  # Get (token, token_id) pairs
    key=lambda x: x[1],  # Sort by the token ID (second element of the tuple)
)

# Extract the sorted tokens from the sorted (token, token_id) pairs
sorted_tokens = [token for token, _ in sorted_vocabulary]

# Draw 100 random tokens from the sorted list (random.choices samples with replacement, so repeats are possible)
random.choices(sorted_tokens, k=100)

['dakota',
 'countdown',
 '##ages',
 'integer',
 'circa',
 '##uous',
 '##lism',
 'ra',
 'sweetheart',
 '##xes',
 '##jure',
 'backseat',
 'slowed',
 'rosary',
 'promoted',
 'mammals',
 'kathryn',
 'proceeding',
 '##nsed',
 'walter',
 'marin',
 'tha',
 '##rant',
 'unlock',
 'doll',
 '##media',
 'untitled',
 '##anum',
 'perhaps',
 'mentally',
 'emi',
 'constraints',
 'helicopter',
 'henrietta',
 '##zhou',
 '##lub',
 '##州',
 'future',
 'sung',
 '[unused55]',
 'arabian',
 'dragoons',
 'connected',
 'stink',
 'guiana',
 '23rd',
 'pair',
 'obtained',
 'reyes',
 'lucia',
 'zones',
 'admissions',
 'osborn',
 'sutherland',
 '##พ',
 'infirmary',
 'ि',
 'spear',
 'tesla',
 'yates',
 'negotiating',
 'adjutant',
 'besieged',
 'hovered',
 'indicate',
 '##cture',
 'rest',
 'certified',
 'used',
 'ambassador',
 '[unused146]',
 'finer',
 'opposes',
 'buffet',
 'brewster',
 'inauguration',
 'machines',
 'task',
 '##ise',
 'bottled',
 'amara',
 '##rued',
 'likeness',
 'cock',
 'henrik',
 'loses',
 'consum

In [17]:
from sklearn.manifold import TSNE

# Initialize t-SNE with 2D output and cosine distance as the metric for distance calculation
tsne = TSNE(n_components=2, metric="cosine", random_state=42)

# Apply t-SNE to reduce the dimensionality of the token embeddings from high-dimensional space to 2D
tsne_embeddings_2d = tsne.fit_transform(token_embeddings)

# Check the shape of the resulting 2D embeddings
tsne_embeddings_2d.shape


(30522, 2)

This code takes the high-dimensional token embeddings and reduces them to 2D using t-SNE with cosine distance as the metric. The result (tsne_embeddings_2d) can be used to visualize the embeddings in a 2D plot, making it easier to see how the tokens are related in the lower-dimensional space. The shape is (number_of_tokens, 2) after the transformation.

### Groups of tokens in our vocabulary

- 1st group:
    Technical tokens specific to the model (wrapped in square brackets, e.g. `[CLS]`)
- 2nd group:
    Subword tokens (prefixed with ##)
- 3rd group:
    Prefixes and words starting with anything except ##

In [18]:
token_colors = []
for token in sorted_tokens:
    if token[0] == "[" and token[-1] == "]":
        token_colors.append("red")
    elif token.startswith("##"):
        token_colors.append("blue")
    else:
        token_colors.append("green")
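
The same three-way split can be used to count how many tokens fall into each group. This is a small sketch on a handful of tokens taken from the random sample printed earlier; `token_group` is a hypothetical helper that mirrors the colouring rule.

```python
from collections import Counter

def token_group(token):
    """Classify a token into one of the three groups described above."""
    if token.startswith("[") and token.endswith("]"):
        return "technical"
    if token.startswith("##"):
        return "subword"
    return "word"

sample = ["dakota", "##ages", "[unused55]", "integer", "##uous", "[unused146]", "perhaps"]
group_counts = Counter(token_group(token) for token in sample)
print(group_counts)
```

Running the same counter over `sorted_tokens` would give the group sizes for the full vocabulary.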

In [19]:
import plotly.graph_objs as go

scatter = go.Scattergl(
    x=tsne_embeddings_2d[:, 0], 
    y=tsne_embeddings_2d[:, 1],
    text=sorted_tokens,
    marker=dict(color=token_colors, size=3),
    mode="markers",
    name="Token embeddings",
)

fig = go.FigureWidget(
    data=[scatter],
    layout=dict(
        width=600,
        height=900,
        margin=dict(l=0, r=0),
    )
)

fig.show()

## Output token embeddings

In [20]:
output_embedding = model.encode(["walker walked a long walk"])
output_embedding.shape

(1, 384)

In [21]:
output_token_embeddings = model.encode(
    ["walker walked a long walk"],  # The input sentence as a list
    output_value="token_embeddings"  # Specify that we want token-level embeddings (one for each token)
)

# Check the shape of the token embeddings for the first sentence
output_token_embeddings[0].shape


torch.Size([7, 384])
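
The model summary printed earlier lists a Pooling module with `pooling_mode_mean_tokens` set to True, which suggests that the single (1, 384) sentence vector is the average of the 7 token vectors. A minimal sketch of masked mean pooling, with made-up 3-dimensional vectors and a hypothetical `mean_pool` helper:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    return [t / max(count, 1) for t in total]

# Hypothetical token vectors; the last row stands for a padding position.
token_embeddings = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
attention_mask = [1, 1, 0]
sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)
```

The mask keeps padding positions out of the average, so batching a sentence with longer ones does not change its embedding.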

In [22]:
first_sentence = "vector search optimization"
second_sentence = "we learn to apply vector search optimization"

# Ensure no gradients are calculated, to save memory and computation.
with torch.no_grad():
    # Tokenize the sentences
    first_tokens = model.tokenize([first_sentence])
    second_tokens = model.tokenize([second_sentence])
    
    # Encode the sentences into token-level embeddings
    first_embeddings = model.encode(
        [first_sentence], 
        output_value="token_embeddings"  # Return embeddings for each token
    )
    
    second_embeddings = model.encode(
        [second_sentence], 
        output_value="token_embeddings"  # Return embeddings for each token
    )

    # Compute cosine similarity between the token embeddings of both sentences
    distances = util.cos_sim(
        first_embeddings[0],  # Token embeddings of the first sentence
        second_embeddings[0]   # Token embeddings of the second sentence
    )


The distances matrix will contain cosine similarity scores between each token from first_sentence and each token from second_sentence. Higher values in this matrix indicate more similar tokens.
The shape of the distances matrix will be [num_tokens_in_first_sentence, num_tokens_in_second_sentence].

In [24]:
px.imshow(
    distances.cpu().numpy(),  # Move the tensor to CPU and convert to a NumPy array
    x=model.tokenizer.convert_ids_to_tokens(
        second_tokens["input_ids"][0]
    ),  # Get the token labels for the x-axis from the second sentence
    y=model.tokenizer.convert_ids_to_tokens(
        first_tokens["input_ids"][0]
    ),  # Get the token labels for the y-axis from the first sentence
    text_auto=True,  # Automatically display the values of cosine similarities as text in the cells
)


First sentence: "vector search optimization"
Second sentence: "we learn to apply vector search optimization"
The heatmap shows how each token in the first sentence (y-axis) compares to each token in the second sentence (x-axis), based on the output token embeddings generated by the model. You should see high similarity scores for tokens like "vector", "search", and "optimization" across both sentences, while other tokens may show lower similarity. Unlike the input embeddings, these output embeddings depend on context, so the same token in the two sentences no longer has to score exactly 1.