# This notebook only covers embedding. No vector storage is done here.

# Embeddings

## What Are Embeddings?

Embeddings are numerical representations of text that capture semantic meaning.
They transform words, sentences, or documents into high-dimensional vectors where semantically similar content is mapped to nearby points in vector space.
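
"Nearby" is usually measured with cosine similarity. The sketch below illustrates the idea with hand-picked 3-dimensional vectors; these are hypothetical stand-ins chosen for illustration, not output from any model, and `cosine_similarity` is a helper defined here.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only.
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
invoice = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(cat, kitten))   # close to 1: the vectors point the same way
print(cosine_similarity(cat, invoice))  # close to 0: the vectors are nearly orthogonal
```

Real embeddings behave the same way, only in hundreds of dimensions instead of three.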

In Retrieval-Augmented Generation (RAG) systems, embeddings enable:
- Semantic search beyond keyword matching
- Retrieval of contextually relevant documents
- Efficient vector similarity search


## Why Sentence Transformers?
- Pre-trained models optimized for semantic similarity.
- Generate dense vector representations.
- Enable semantic search beyond keyword matching.

In [1]:
# Import the class used to load pretrained sentence-embedding models.
from sentence_transformers import SentenceTransformer

In [2]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # Load the pretrained all-MiniLM-L6-v2 embedding model.

## Model Configuration
The all-MiniLM-L6-v2 model used in this notebook has the following configuration:

- Architecture: BertModel.

- Max Sequence Length: up to 256 tokens per input; longer inputs are truncated.

- Precision: torch.float32 for its numerical calculations.
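
Because each input is limited to 256 tokens, text past that limit is truncated by the model rather than embedded, so long documents are usually split into chunks first. The following is a minimal sketch of one way to do that. It assumes whitespace-separated words are an acceptable rough proxy for tokens (the real tokenizer produces word pieces, so a window of N words can exceed N tokens); the function name `chunk_words` and its defaults are hypothetical.

```python
from typing import List


def chunk_words(text: str, max_words: int = 150, overlap: int = 20) -> List[str]:
    """Split text into overlapping windows of at most `max_words` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("Require max_words > 0 and 0 <= overlap < max_words.")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(400))
chunks = chunk_words(sample)
print(len(chunks), [len(c.split()) for c in chunks])
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks; the word budget is deliberately set well below 256 to leave headroom for word-piece tokenization.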

In [3]:
print(embedding_model)  # Show the model's modules and their configuration.

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
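
The printed pipeline has three stages: a Transformer that produces one vector per token, a Pooling layer with `pooling_mode_mean_tokens` set to True, and a Normalize layer. The NumPy sketch below illustrates what the last two stages compute conceptually. It is not the library's code: `mean_pool_and_normalize` is a hypothetical helper, and random numbers stand in for real token embeddings.

```python
import numpy as np


def mean_pool_and_normalize(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the non-padding token vectors, then scale each result to unit length.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # padding tokens contribute 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    pooled = summed / counts                                         # mean over real tokens
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)                      # L2 normalization


# Random stand-ins for token embeddings: 2 sentences, 5 token slots, 384 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 384)).astype(np.float32)
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])  # first sentence has 2 padding slots
sentence_vectors = mean_pool_and_normalize(tokens, mask)
print(sentence_vectors.shape)
```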


In [4]:
print(embedding_model.device)  # Device the model is currently on (CPU here).
print(embedding_model.cpu())  # Moves the model to the CPU and returns it, so the module summary is printed.
print(embedding_model.transformers_model)  # The underlying Hugging Face model (a BertModel).

cpu
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 384, padding_idx=0)
    (position_embeddings): Embedding(512, 384)
    (token_type_embeddings): Embedding(2, 384)
    (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-5): 6 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (ke

In [5]:
print(embedding_model.backend)  # Inference backend in use (PyTorch here).
print(embedding_model.dtype)  # Data type of the model's parameters.

torch
torch.float32


In [6]:
embedding_model.get_sentence_embedding_dimension()  # Size of each output embedding vector.

384
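
The dimension and the float32 precision together determine how much space raw vectors take. A back-of-the-envelope sketch follows; `raw_vector_bytes` is a hypothetical helper, and the estimate ignores any index or metadata overhead a vector store would add.

```python
EMBEDDING_DIM = 384      # value reported by get_sentence_embedding_dimension()
BYTES_PER_FLOAT32 = 4    # a float32 occupies 4 bytes


def raw_vector_bytes(num_vectors: int, dim: int = EMBEDDING_DIM) -> int:
    """Bytes needed for `num_vectors` float32 vectors, excluding index and metadata overhead."""
    return num_vectors * dim * BYTES_PER_FLOAT32


print(raw_vector_bytes(1))                  # 1536 bytes per vector
print(raw_vector_bytes(1_000_000) / 2**30)  # roughly 1.43 GiB for one million vectors
```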

In [7]:
# NumPy for array handling; typing helpers for the type hints used below.
import numpy as np
from typing import Union, List

## Key Functionalities:
- Model Initialization: Handles the loading of the SentenceTransformer model and confirms its dimensions.

- Input Validation: Ensures that only non-empty strings or lists of strings are processed.

- Batch Processing: Encodes multiple texts efficiently using a default batch_size of 32.

- Normalization: Automatically normalizes the output vectors, which is critical for performing similarity searches later.
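
To make the batch-processing point concrete, the sketch below shows what splitting a list of texts into batches of 32 looks like. It is a conceptual illustration only: `iter_batches` is a hypothetical helper, and the library's own batching inside `encode` may differ in detail.

```python
from typing import Iterator, List


def iter_batches(texts: List[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` with at most `batch_size` items each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive.")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


example_texts = [f"sentence {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(example_texts)]
print(batch_sizes)  # [32, 32, 6]
```

Larger batches generally improve throughput at the cost of more memory per forward pass.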

In [8]:
# EmbeddingManager loads a SentenceTransformer model and generates embeddings for input texts.
class EmbeddingManager:
    def __init__(self, embedding_model_name: str = "all-MiniLM-L6-v2") -> None:  # Default model is all-MiniLM-L6-v2.
        self.embedding_model_name = embedding_model_name
        self.embedding_model = None
        self._load_model()  # Load the model as soon as the manager is created.

    def _load_model(self) -> None:
        """Load the model, wrapping any failure in a RuntimeError."""
        try:
            print(f"Loading Embedding Model : {self.embedding_model_name} .")
            self.embedding_model = SentenceTransformer(self.embedding_model_name)
            print(f"Embedding dimensions of the model is : {self.embedding_model.get_sentence_embedding_dimension()} dimensions .")
        except Exception as e:
            raise RuntimeError(f"Error loading {self.embedding_model_name} model") from e  # Keep the original error as the cause.

    def generate_embeddings(self, texts: Union[str, List[str]], batch_size: int = 32) -> np.ndarray:
        """Embed one string or a list of strings; always returns a 2-D array."""
        if self.embedding_model is None:
            raise ValueError("Model not found .")  # The model was never loaded.
        if isinstance(texts, str):
            if not texts.strip():
                raise ValueError("Text cannot be empty .")  # Empty or whitespace-only string.
        elif isinstance(texts, list):
            if len(texts) == 0:
                raise ValueError("Texts list is empty .")  # Empty list.
            if not all(isinstance(t, str) and t.strip() for t in texts):
                raise ValueError("All items must be a non-empty string .")  # Non-string or blank item in the list.
        else:
            raise TypeError("Invalid type must be string or list of strings .")  # Neither a string nor a list.

        # texts -> a string or a list of strings.
        # show_progress_bar -> display a progress bar while encoding.
        # normalize_embeddings -> scale each vector to unit length, so the dot product equals cosine similarity.
        # batch_size -> number of texts encoded per forward pass.
        # convert_to_numpy -> return a NumPy array rather than torch tensors.
        embeddings = self.embedding_model.encode(texts, show_progress_bar=True, normalize_embeddings=True, batch_size=batch_size, convert_to_numpy=True)
        if embeddings.ndim == 1:  # A single input string yields a 1-D vector; reshape it to (1, dim).
            embeddings = embeddings.reshape(1, -1)
        print(f"Generated with embedding of dimensions : {embeddings.shape[0]} x {embeddings.shape[1]}.")  # Rows x embedding dimension.
        return embeddings

In [9]:
embedding_manager = EmbeddingManager()  # Instantiating the manager loads the model.

Loading Embedding Model : all-MiniLM-L6-v2 .
Embedding dimensions of the model is : 384 dimensions .


In [10]:
# Create a loader for the sample text file that will be embedded.
from langchain_community.document_loaders import TextLoader
sample_texts = TextLoader(file_path="../data/text/file1.txt", encoding="utf-8")

In [11]:
texts_loader = sample_texts.load()  # Returns a list of Document objects (one for this file).
texts_loader[0].page_content  # Raw text of the first Document.

'\nAbstract\n\nLarge pre-trained language models have been shown to store factual knowledge in their parameters and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems.\n\nPre-trained models with a differentiable access mechanism to explicit non-parametric memory have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation.\n\nWe introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia,

### Important Note: Embeddings and Metadata

In a RAG pipeline, embeddings are stored together with metadata such as the original text chunk and source information.
A vector cannot be converted back into text, so this metadata is required to reconstruct context during retrieval.
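
One possible shape for such a record is sketched below. The class name `EmbeddedChunk` and its field names are hypothetical; real vector stores define their own schemas, and the zero vector is only a placeholder for model output.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

import numpy as np


@dataclass
class EmbeddedChunk:
    """One stored item: the vector plus what is needed to show a hit to the user."""
    text: str                      # original chunk text
    source: str                    # where the chunk came from, e.g. a file path
    embedding: np.ndarray          # shape (dim,)
    extra: Dict[str, Any] = field(default_factory=dict)  # any further metadata


# Placeholder vector; in practice this would come from the embedding model.
record = EmbeddedChunk(
    text="example chunk text",
    source="../data/text/file1.txt",
    embedding=np.zeros(384, dtype=np.float32),
    extra={"chunk_index": 0},
)
print(record.source, record.embedding.shape)
```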

In [12]:
print(embedding_manager.generate_embeddings(texts_loader[0].page_content))  # Embed the file's text as a single input.

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Generated with embedding of dimensions : 1 x 384.
[[-6.82110786e-02 -5.49745858e-02 -9.45889484e-03  9.14839506e-02
   9.89885628e-03  7.40969852e-02 -2.27620099e-02 -9.08262655e-03
   3.26191857e-02 -2.97626834e-02  1.55247143e-02 -1.60280149e-02
   6.23321123e-02  3.82024376e-03 -7.78601039e-03  4.69363816e-02
   9.62595418e-02  4.51717749e-02 -8.07017609e-02 -5.44300564e-02
   6.30562901e-02  7.01797977e-02  6.04657233e-02 -2.80444659e-02
   9.29219648e-03 -3.76222432e-02 -3.70796956e-02 -4.02639173e-02
   9.27506760e-02  7.15795846e-04  5.56802228e-02  3.79125066e-02
  -9.68193635e-02  6.64731264e-02 -4.32489440e-02  9.20629352e-02
  -9.50567052e-02  3.45189720e-02  3.56559642e-02 -7.04297870e-02
  -1.94121583e-03 -5.44193201e-04 -3.14035825e-02  4.64817025e-02
   7.70938247e-02  1.53455166e-02 -1.82156768e-02 -3.22249811e-03
  -1.52310487e-02  2.01163068e-02 -1.10819988e-01  1.80859249e-02
   2.59972103e-02  2.42232122e-02  5.33686392e-03  4.54786345e-02
   9.83385555e-03  3.58048

### Important Note: Embeddings are NOT retrieval results

After embeddings are generated:
1. Embeddings are stored in a vector database.
2. User queries are converted into embeddings.
3. Similarity search is performed in the vector store.

This notebook stops before step 1: it only generates the embeddings.
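
For orientation, steps 2 and 3 can be illustrated with a brute-force NumPy search; a vector database performs the same comparison with an index built for scale. In this sketch random unit vectors stand in for stored document embeddings, and `top_k` and `normalize` are hypothetical helpers. It assumes all vectors are L2-normalized, so the dot product equals cosine similarity.

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k corpus rows most similar to the query.

    Assumes the query and every corpus row are already L2-normalized.
    """
    scores = corpus @ query           # one cosine similarity per stored vector
    order = np.argsort(-scores)[:k]   # indices of the highest scores first
    return order, scores[order]


# Random stand-ins for stored document embeddings (not real model output).
rng = np.random.default_rng(42)
corpus = normalize(rng.normal(size=(100, 384)))
# A query built as a lightly perturbed copy of document 7.
query = normalize(corpus[7] + 0.01 * rng.normal(size=384))

indices, scores = top_k(query, corpus)
print(indices[0], scores[0])
```

Brute-force search compares the query with every stored vector, which is fine for small collections but becomes slow as the collection grows.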

### What comes next
- Vector store ingestion
- Retrieval (RAG)


## Next step: Storing the embeddings in a vector store.
