# Index
- [Vector Search: Understanding the True Meaning of Queries](#vector-search-understanding-the-true-meaning-of-queries)

- [Encoding Data with Embeddings](#encoding-data-with-embeddings)
    - [The Challenge of Text Representation](#the-challenge-of-text-representation)
    - [Basic Vectorization: One-Hot Encoding](#basic-vectorization-one-hot-encoding)
    - [Text Embeddings (Word Embeddings): Dense Representations](#text-embeddings-word-embeddings-dense-representations)
    - [Multimodal Embeddings](#multimodal-embeddings)

- [Generating Embeddings for Variable-Length Data](#generating-embeddings-for-variable-length-data)
    - [Embeddings for Sentences / Variable-Length Text Sequences](#1-embeddings-for-sentences--variable-length-text-sequences)
    - [Embeddings for Images / Variable-Resolution Data](#2-embeddings-for-images--variable-resolution-data)

- [Focus on Sentence Embedding using Transformer Models](#focus-on-sentence-embedding-using-tranformer-models)
    - [Why Vanilla BERT Fails at Sentence Embeddings](#why-vanilla-bert-fails-at-sentence-embeddings)
    - [Sentence-BERT's Solution: Siamese and Triplet Networks](#sentence-berts-solution-siamese-and-triplet-networks)
    - [The Key Difference: Fine-tuning for Similarity](#the-key-difference-fine-tuning-for-similarity)
    - [Why this is a Game Changer](#why-this-is-a-game-changer)

- [Indexing and Search](#indexing-and-search)
    - [Measuring Vector Distance (Similarity Metrics)](#1-measuring-vector-distance-similarity-metrics)
    - [Fast and Scalable Vector Search](#2-fast-and-scalable-vector-search)
    - [Vertex AI Vector Search (formerly Matching Engine)](#vertex-ai-vector-search-formerly-matching-engine)

- [The Problem: AI Hallucination 😵‍💫](#the-problem-ai-hallucination-)
    - [What Causes AI Hallucinations?](#what-causes-ai-hallucinations)
    - [Traditional Solutions and Their Limitations](#traditional-solutions-and-their-limitations)
    - [The RAG Solution: An Open-Book Exam for AI 📖](#the-rag-solution-an-open-book-exam-for-ai-)
    - [How RAG Works with Vector Search](#how-rag-works-with-vector-search)

- [The Challenge: Beyond Semantic Search 🤔](#the-challenge-beyond-semantic-search-)
    - [What is Hybrid Search? 🤝](#what-is-hybrid-search-)
    - [How Hybrid Search Works ⚙️](#how-hybrid-search-works-)
    - [Implementation with Vertex AI Vector Search 🛠️](#implementation-with-vertex-ai-vector-search-)
    - [Example Result](#example-result)

- [TF-IDF (Term Frequency-Inverse Document Frequency)](#tf-idf-term-frequency-inverse-document-frequency)

# Vector Search: Understanding the True Meaning of Queries

## What is Vector Search?

Vector Search is a technology that enables search systems to understand the **semantic meaning of queries**, rather than just matching keywords. It focuses on **semantic similarity**, allowing systems to find results that are conceptually related, even if exact terms aren't used.

- Traditional **keyword search** excels at matching explicit terms but lacks the ability to understand context or underlying meaning. For example, it might find "summer tops" but miss related items like "swimming suits" or fail to infer intent, such as "attire for a beach party."

- Vector Search (with its focus on semantic search) provides a crucial solution by transforming data into **meaningful embeddings**, enabling a deeper understanding of user intent. 

## Benefits of Vector Search

1.  **Semantic Understanding:**
      * Finds results similar in meaning to a query, even without exact keyword matches.
      * Highly effective for natural language queries where precise or technical language may not be used.
2.  **Multimodal Search Capabilities:**
      * Can be applied to various data types, including text, images, and audio.
      * Enables applications where users can search using multiple data types (e.g., image search, voice search).
3.  **Personalization and Recommendation:**
      * Leverages context understanding to personalize search results and recommendations.
      * Helps users discover more relevant and interesting information.
4.  **Generative AI Integration:**
      * A critical component in generative AI applications for fast and efficient information retrieval.
      * Becoming a foundational element in AI and machine learning services.

Vector Search is transforming how people engage with information, leading to more relevant, efficient, and personalized search experiences as data volumes and user expectations grow.

## How Vector Search Works

Vector Search involves a three-step process:

![](docs/images/vector-search-steps.png)

1.  **Encode Data into Vectors:** AI models, known as **Embedding Models**, convert various data types (text, images, audio) into numerical representations called **vectors**. These vectors capture the semantic meaning of the data.
2.  **Create an Index:** An index is built from these vectors to enable fast and scalable search across billions of items.
3.  **Search the Vector Space:** When a query is made, it is also encoded into a vector. This query vector is then used to efficiently search the indexed vector space for other vectors (data items) that are semantically similar.
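
The three steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production design: `embed` is a hypothetical stand-in that looks up hand-assigned 3-dimensional vectors (a real system would call an embedding model), and the "index" is a plain dictionary searched by brute force.

```python
import math

# Hypothetical, hand-assigned vectors standing in for a real embedding model.
TOY_EMBEDDINGS = {
    "summer tops": [0.9, 0.1, 0.0],
    "swimming suits": [0.8, 0.3, 0.1],
    "winter coats": [0.0, 0.2, 0.9],
    "attire for a beach party": [0.85, 0.25, 0.05],
}

def embed(text):
    """Step 1: encode text into a vector (toy lookup, an assumption)."""
    return TOY_EMBEDDINGS[text]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_index(items):
    """Step 2: store one vector per item."""
    return {item: embed(item) for item in items}

def search(index, query, top_k=2):
    """Step 3: encode the query and rank stored items by similarity."""
    q = embed(query)
    scored = [(cosine_similarity(q, vec), item) for item, vec in index.items()]
    scored.sort(reverse=True)
    return [item for _, item in scored[:top_k]]

index = build_index(["summer tops", "swimming suits", "winter coats"])
results = search(index, "attire for a beach party")
print(results)  # ['swimming suits', 'summer tops']
```

The query shares no keyword with either result; the ranking comes only from vector proximity. Real indexes replace the brute-force loop with approximate nearest-neighbor structures.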

### Detailed View: Development vs. Serving

  * **At Development Time (Building the System):**
      * Generate embeddings for all data.
      * Build and deploy the vector index.
  * **At Serving Time (Responding to Queries):**
      * Encode the user's query into a vector.
      * Search the vector space using the query vector.
      * Serve the most relevant results.
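
The split between the two phases can be sketched as a small class. Everything here is illustrative: `ToyVectorService`, `toy_embed`, and the 2-dimensional vectors are hypothetical names and values. The point is that documents are embedded once at build time, while serving a query costs a single embedding call plus vector comparisons.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

class ToyVectorService:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # hypothetical embedding function
        self.ids = []
        self.vectors = []

    def build(self, documents):
        """Development time: embed every document once and store unit vectors."""
        for doc_id, text in documents.items():
            self.ids.append(doc_id)
            self.vectors.append(normalize(self.embed_fn(text)))

    def query(self, text, top_k=1):
        """Serving time: embed only the query, then compare with stored vectors."""
        q = normalize(self.embed_fn(text))
        scores = [sum(a * b for a, b in zip(q, v)) for v in self.vectors]
        ranked = sorted(zip(scores, self.ids), reverse=True)
        return [doc_id for _, doc_id in ranked[:top_k]]

# Toy embedding with a call counter (hand-assigned vectors, an assumption).
calls = {"count": 0}
VECTORS = {
    "doc about dogs": [1.0, 0.1],
    "doc about cars": [0.1, 1.0],
    "puppy care": [0.9, 0.2],
}

def toy_embed(text):
    calls["count"] += 1
    return VECTORS[text]

service = ToyVectorService(toy_embed)
service.build({"d1": "doc about dogs", "d2": "doc about cars"})
calls_after_build = calls["count"]                    # one call per document
top = service.query("puppy care")
calls_for_query = calls["count"] - calls_after_build  # one call for the query
print(top, calls_after_build, calls_for_query)        # ['d1'] 2 1
```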

## Core Challenges

To implement Vector Search, two major challenges must be addressed:

1.  **How to Encode Data:** Converting diverse, multimodal data into representations that accurately capture semantic meanings. (Answer: **Embeddings**)
2.  **How to Index and Search Data:** Building an efficient search space that enables fast and scalable lookups. (Answer: **Vector Search Indexing**)

-----

## Further Reading & Resources

To deepen your understanding of Vector Search, Embeddings, and related technologies, explore these resources:

### General Concepts & Explanations:

  * **Google Cloud Blog - What is Vector Search?**
      * A good starting point for understanding the fundamentals and applications.
      * [Link to a Google Cloud "What is Vector Search" article](https://www.google.com/search?q=https://cloud.google.com/learn/what-is-vector-search) (You can search for the most recent official one)
  * **Pinecone Blog - What is Vector Search?**
      * Pinecone is a dedicated vector database company, and their blog often has excellent, in-depth explanations.
      * [Link to a Pinecone "What is Vector Search" article](https://www.google.com/search?q=https://www.pinecone.io/learn/vector-search/)

### Vector Databases & Open-Source Frameworks:

These are specialized databases and libraries designed to store and query vectors efficiently.

  * **Pinecone:**
      * A popular managed vector database service.
      * [Website: pinecone.io](https://www.pinecone.io/)
  * **Weaviate:**
      * An open-source vector database that also includes built-in search capabilities.
      * [Website: weaviate.io](https://weaviate.io/)
      * [GitHub: Weaviate](https://github.com/weaviate/weaviate)
  * **Qdrant:**
      * Another open-source vector similarity search engine.
      * [Website: qdrant.tech](https://qdrant.tech/)
      * [GitHub: Qdrant](https://github.com/qdrant/qdrant)
  * **Faiss (Facebook AI Similarity Search):**
      * A library for efficient similarity search and clustering of dense vectors. It's not a database, but a powerful library often used as a backend for vector search systems.
      * [GitHub: Faiss](https://github.com/facebookresearch/faiss)
  * **Chroma:**
      * An open-source embedding database for building AI applications.
      * [Website: trychroma.com](https://www.trychroma.com/)
      * [GitHub: Chroma](https://github.com/chroma-core/chroma)
  * **Elasticsearch (with Vector Search capabilities):**
      * While primarily a full-text search engine, recent versions of Elasticsearch (and OpenSearch) have added native vector search capabilities.
      * [Elasticsearch Vector Search documentation](https://www.elastic.co/what-is/vector-search)

### Embeddings & Models:

  * **Hugging Face Hub (Models):**
      * A vast repository of pre-trained models, including many for generating embeddings (e.g., Sentence-BERT and other sentence-transformers models).
      * [Website: huggingface.co/models](https://huggingface.co/models)
  * **OpenAI Embeddings:**
      * OpenAI offers powerful embedding models accessible via API, widely used for various semantic tasks.
      * [OpenAI Embeddings Documentation](https://platform.openai.com/docs/guides/embeddings)
  * **Google's Universal Sentence Encoder:**
      * A model for encoding text into high-dimensional vectors.
      * [TensorFlow Hub: Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4)

---
---

# Encoding Data with Embeddings

## Introduction to Embeddings

Embeddings are a fundamental technology for **encoding data into a representation that captures semantic meaning**. This process addresses the challenge of transforming complex data (like text, images, audio, video, or code) into a numerical format that retains its underlying sense.

While embeddings can be generated for multimodal data, this documentation focuses on **text embeddings** as a primary example.

## The Challenge of Text Representation

The core problem is: **How do you represent text numerically while retaining its meaning?** This can be broken down into two sub-problems:

1.  **Semantic Relationship:** How do you convert text into numbers that reflect semantic relationships between words, indicating their similarity in meaning?
2.  **Machine Learning Input:** How do you transform text into numbers that can be efficiently processed by machine learning models, typically requiring relatively dense vectors to avoid overfitting?

## Basic Vectorization: One-Hot Encoding

One of the most intuitive, but basic, techniques for text representation is **one-hot encoding**.

1.  **Tokenization & Preprocessing:** A sentence is divided into smaller units (tokens, often words) and preprocessed (e.g., stemming to get root words).
2.  **Vocabulary:** A vocabulary of all unique words in the dataset is created. This can easily contain tens of thousands of words.
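3.  **Vector Creation:** Each word is represented by a vector as long as the vocabulary, with a 1 at the position of that word's index and 0 everywhere else.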

While intuitive and easy to implement, it has the following **disadvantages**:

1.  **No Semantic Relationship:** One-hot encoding does not convey any relationship between words. "Dog" and "cat" are as "far apart" as "dog" and "apple" in this representation, even though "dog" and "cat" are semantically much closer.
2.  **High-Dimensional and Sparse:** The resulting vectors are very high-dimensional (equal to the vocabulary size) and extremely sparse (mostly zeros).
      * A vocabulary of 10,000 words leads to 10,000-dimensional vectors, with 99.99% zeros.
      * This **sparse embedding** can lead to computational inefficiency and model overfitting in machine learning tasks.
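
A minimal sketch of one-hot encoding makes both disadvantages concrete. It assumes whitespace tokenization and lowercasing only (no stemming), and the function names are illustrative.

```python
def build_vocabulary(sentences):
    """Map each unique lowercase token to an integer index."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

vocab = build_vocabulary(["the dog chased the cat", "the apple fell"])
dog, cat, apple = (one_hot(w, vocab) for w in ("dog", "cat", "apple"))

print(len(vocab))       # 6 unique tokens
print(dot(dog, cat))    # 0: no overlap, so no notion of similarity
print(dot(dog, apple))  # 0: exactly as unrelated as "dog" and "cat"

# Sparsity for a 10,000-word vocabulary: all but one entry are zero.
zero_fraction = (10_000 - 1) / 10_000
print(f"{zero_fraction:.2%}")  # 99.99%
```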

## Text Embeddings (Word Embeddings): Dense Representations

To overcome the limitations of sparse representations, **text embeddings** (also known as word embeddings or dense embeddings) were developed. They represent each word as a point in a multi-dimensional vector space, positioned so that words with similar meanings lie close together.

### Key Characteristics:

  * **Semantic Similarity:** The distance between words in this vector space indicates their semantic similarity.
      * *Example:* "Queen" and "king" are close to each other, but "queen" and "apple" are far apart.
  * **Analogies:** Embeddings can capture analogies between words.
      * *Example:* The vector difference between "king" and "man" is similar to the difference between "queen" and "woman." This allows for operations like: `vector("king") - vector("man") + vector("woman") ≈ vector("queen")`. The result is approximate: "queen" is the nearest word vector to the computed point, not an exact match.
  * **Low-Dimensional & Dense:** Unlike one-hot encoding, text embeddings use significantly fewer dimensions (e.g., tens to thousands, instead of tens of thousands). Each dimension typically contains a non-zero value, making them **dense embeddings**.
  * **Distributed Representation:** The meaning of a word is "distributed" across these dimensions: each dimension captures some learned feature or aspect of the word, and a single dimension is rarely interpretable on its own.
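
The similarity and analogy properties can be illustrated with a toy example. The 3-dimensional vectors below are invented by hand for this sketch (loosely "royalty", "gender", "food" axes); learned embeddings have far more dimensions and no such tidy axes.

```python
import math

# Hand-crafted toy vectors (an assumption, not output from a trained model).
EMBEDDINGS = {
    "king":  [0.95, 0.90, 0.0],
    "queen": [0.90, 0.10, 0.0],
    "man":   [0.10, 0.85, 0.0],
    "woman": [0.10, 0.10, 0.0],
    "apple": [0.00, 0.50, 1.0],
}

def nearest(target, exclude=()):
    """Return the word whose vector is closest (Euclidean) to the target point."""
    candidates = (w for w in EMBEDDINGS if w not in exclude)
    return min(candidates, key=lambda w: math.dist(EMBEDDINGS[w], target))

# vector("king") - vector("man") + vector("woman")
analogy = [
    k - m + w
    for k, m, w in zip(EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])
]
answer = nearest(analogy, exclude={"king", "man", "woman"})
print(answer)  # queen

queen_to_king = math.dist(EMBEDDINGS["queen"], EMBEDDINGS["king"])
queen_to_apple = math.dist(EMBEDDINGS["queen"], EMBEDDINGS["apple"])
print(queen_to_king < queen_to_apple)  # True
```

The input words are excluded from the nearest-neighbor lookup, which is the usual convention when evaluating analogies.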

### How are Text Embeddings Developed?

Instead of manually assigning values, text embeddings are **learned by neural networks**. These networks are trained on vast amounts of text data to identify patterns and relationships between words. Popular algorithms and models used for this include:

  * **Word2Vec** (by Google)
  * **GloVe** (Global Vectors for Word Representation, by Stanford)
  * **FastText** (by Facebook/Meta)

Training these models from scratch requires substantial computational resources. Fortunately, **pre-trained embedding models** are widely available via APIs, simplifying their use.

### Using Embedding APIs:

Using embedding APIs is typically straightforward:

1.  **Specify Model:** Choose a pre-trained embedding model (e.g., `text-embedding-gecko` or `multimodal-embedding` from Google).
2.  **Define Input:** Provide the text you want to embed.
3.  **Get Embeddings:** Call the API to receive the dense vector representation of your text.

### Multimodal Embeddings

The same approach used for text can be applied to other media types, such as images. This allows for **multimodal embeddings**, where different types of data (e.g., the text "dog" and an image of a dog) are represented in the same vector space, located near each other due to their semantic similarity.

Embeddings are a powerful concept that allows computers to "understand" and process the semantic meaning of various data types, forming a critical component of modern AI and machine learning applications.

-----

## Further Reading & Resources

To dive deeper into Embeddings, text representation, and the models used to create them:

### Core Concepts & Algorithms:

  * **Word2Vec Explained:**
      * A classic paper and many tutorials exist. Search for "Word2Vec tutorial" or "Gensim Word2Vec" for practical examples.
      * [Original Paper: Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)
  * **GloVe: Global Vectors for Word Representation:**
      * Another foundational method for learning word embeddings.
      * [Original Paper: GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)
  * **FastText:**
      * An extension to Word2Vec that considers subword information, useful for rare words.
      * [FastText Website](https://fasttext.cc/)
  * **Sparse vs. Dense Embeddings:**
      * Understanding the differences is crucial for performance and quality in search systems.
      * [Article: Sparse vs. Dense Vectors for Search](https://www.google.com/search?q=https://www.pinecone.io/learn/sparse-dense-vectors-for-search/)

### Practical Implementation & APIs:

  * **OpenAI Embeddings:**
      * Learn how to use OpenAI's powerful text embedding models via their API.
      * [OpenAI Embeddings Documentation](https://platform.openai.com/docs/guides/embeddings)
  * **Google Cloud Embeddings:**
      * Explore Google's offerings for generating text and multimodal embeddings.
      * [Google Cloud - Generate text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings)
  * **Hugging Face Transformers Library:**
      * A comprehensive library for various NLP tasks, including using and fine-tuning embedding models.
      * [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
  * **Sentence-BERT (SBERT):**
      * A popular method for generating sentence-level embeddings, often used with vector search.
      * [Sentence-BERT GitHub](https://github.com/UKPLab/sentence-transformers)

### Natural Language Processing (NLP) Courses:

  * **Natural Language Processing on Google Cloud (Coursera/Google Cloud Skills Boost):**
      * This course often includes modules on text representation, embeddings, and practical applications. Search for the most up-to-date version.
      * [Search Google Cloud Skills Boost for "Natural Language Processing on Google Cloud"](https://www.google.com/search?q=Natural+Language+Processing+on+Google+Cloud+skills+boost)

---
---
---

# Generating Embeddings for Variable-Length Data

The challenge of variable input size is central to processing sentences, images, audio, or any data where the "length" or "resolution" isn't fixed. Architectures with fixed-size input and output layers, such as a basic autoencoder, are not directly suitable for these tasks. Instead, more sophisticated architectures are used.

## 1\. Embeddings for Sentences / Variable-Length Text Sequences

For sentences and longer text sequences, the primary goal is to get a single, fixed-size vector representation for the entire sequence that captures its overall meaning. Autoencoders are less common for this specific task because they are designed for reconstruction, whereas sentence embedding often focuses on meaning representation directly.

Here's how it's typically done:

### a) Recurrent Neural Networks (RNNs) and their Variants (LSTM, GRU)

  * **Concept:** RNNs are specifically designed to process sequential data. They maintain a "hidden state" that acts as a memory, updating it as each word in the sequence is processed.
  * **How it Works:**
    1.  Each word in the sentence is first converted into a **word embedding** (e.g., using Word2Vec, GloVe, or a pre-trained embedding layer).
    2.  These word embeddings are fed sequentially into the RNN.
    3.  The final hidden state of the RNN (or a pooled/aggregated version of all hidden states) can be used as the **sentence embedding**. This final state effectively summarizes the information learned from the entire sequence into a fixed-size vector.
  * **Variable Length Handling:** RNNs naturally handle variable-length sequences because they process one word at a time, updating their internal state, regardless of how many words are in the sequence. The output (the sentence embedding) always has the same fixed dimension.
  * **Drawbacks:** Traditional RNNs struggle with long-range dependencies (the "vanishing/exploding gradient" problem). LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) are improvements that mitigate this.
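
A minimal sketch of the recurrent idea, assuming a plain (Elman) RNN cell with random, untrained weights and hypothetical stand-in word vectors. It only demonstrates the shape behaviour: sequences of different lengths produce a final hidden state of the same size.

```python
import math
import random

EMBED_DIM, HIDDEN_DIM = 4, 3
rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_x = random_matrix(HIDDEN_DIM, EMBED_DIM)   # input-to-hidden weights (untrained)
W_h = random_matrix(HIDDEN_DIM, HIDDEN_DIM)  # hidden-to-hidden weights (untrained)

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def rnn_sentence_embedding(word_vectors):
    """Update the hidden state once per word and return the final state."""
    hidden = [0.0] * HIDDEN_DIM
    for x in word_vectors:
        pre = [a + b for a, b in zip(matvec(W_x, x), matvec(W_h, hidden))]
        hidden = [math.tanh(p) for p in pre]
    return hidden

def toy_word_vectors(sentence):
    """Stand-in word embeddings (assumption): a deterministic random vector per word."""
    vectors = []
    for word in sentence.lower().split():
        word_rng = random.Random(word)
        vectors.append([word_rng.uniform(-1, 1) for _ in range(EMBED_DIM)])
    return vectors

short_emb = rnn_sentence_embedding(toy_word_vectors("dogs bark"))
long_emb = rnn_sentence_embedding(
    toy_word_vectors("the quick brown fox jumps over the lazy dog")
)
print(len(short_emb), len(long_emb))  # 3 3
```

With untrained weights the vectors carry no meaning; in practice the cell (usually an LSTM or GRU) is trained end to end on a downstream objective.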

### b) Convolutional Neural Networks (CNNs) for Text

  * **Concept:** While famously used for images, CNNs can also be applied to text. They use filters (kernels) to identify local patterns (like n-grams) within a sequence.
  * **How it Works:**
    1.  Word embeddings are stacked to form a "matrix" representing the sentence.
    2.  Convolutional filters slide over this matrix, detecting patterns (e.g., phrases).
    3.  A pooling layer (e.g., max-pooling) is then applied across the results of these filters to select the most salient features.
    4.  The output of the pooling layer, or further layers, can then form the fixed-size sentence embedding.
  * **Variable Length Handling:** The pooling step (especially global max-pooling) is key here. It reduces the variable-length output of the convolutions to a fixed-size vector, regardless of the input sentence length.
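
The convolution-then-pool pipeline can be sketched without any framework. The filters and word vectors below are small hand-picked values (assumptions for illustration); each filter spans two consecutive words, and global max-pooling keeps one value per filter.

```python
def conv1d_max_pool(word_vectors, filters):
    """Slide each filter over every window of words, apply ReLU, then max-pool.

    word_vectors: list of L vectors of dimension D
    filters: list of F filters, each a list of `width` vectors of dimension D
    Returns F values, independent of the sentence length L.
    """
    pooled = []
    for filt in filters:
        width = len(filt)
        activations = []
        for start in range(len(word_vectors) - width + 1):
            window = word_vectors[start:start + width]
            score = sum(
                w * x
                for f_row, x_row in zip(filt, window)
                for w, x in zip(f_row, x_row)
            )
            activations.append(max(0.0, score))  # ReLU
        pooled.append(max(activations))  # global max-pooling over positions
    return pooled

filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[-1.0, 1.0], [1.0, -1.0]],
]
sentence_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                          # 3 words
sentence_b = [[0.2, 0.1], [0.4, 0.3], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]  # 5 words

emb_a = conv1d_max_pool(sentence_a, filters)
emb_b = conv1d_max_pool(sentence_b, filters)
print(emb_a)                   # [2.0, 1.5, 1.0]
print(len(emb_a), len(emb_b))  # 3 3
```

The sketch assumes every sentence has at least as many words as the filter width; real implementations pad short inputs.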

### c) Transformer Models (e.g., BERT, Sentence-BERT)

  * **Concept:** Transformers are the state-of-the-art for many NLP tasks, heavily relying on the "attention mechanism." They process all words in a sequence simultaneously, but the attention mechanism allows them to weigh the importance of different words relative to each other.
  * **How it Works:**
    1.  Input words are first converted into embeddings.
    2.  These embeddings, along with positional encodings (to retain word order), are passed through multiple "transformer blocks."
    3.  The attention mechanism learns to focus on different parts of the input sentence when generating representations for each word.
    4.  To get a sentence embedding, typically:
          * The embedding of a special `[CLS]` (classification) token at the beginning of the sequence is used (as in BERT).
          * Or, the embeddings of all tokens are pooled (e.g., mean-pooled or max-pooled) to create a single vector (as in Sentence-BERT).
  * **Variable Length Handling:** Transformers use **padding** to make all input sequences the same length for batch processing. However, a "mask" is used to tell the attention mechanism to ignore the padded tokens. The final output (e.g., `[CLS]` token embedding or pooled embeddings) is always a fixed-size vector, irrespective of the original sequence length.
  * **Why they're powerful:** They excel at capturing long-range dependencies and complex semantic relationships due to their parallel processing and sophisticated attention.

  * **Note: Vanilla BERT does not work well.** If you take a pre-trained BERT model and generate sentence embeddings with the methods described above, the embeddings will be poor. These models require a pooling layer and supervised fine-tuning to be *specialised* for sentence embedding. See [Focus on Sentence Embedding using Transformer Models](#focus-on-sentence-embedding-using-tranformer-models) for more info.
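
The two pooling strategies and the role of the padding mask can be sketched as follows. The token vectors are made-up numbers standing in for encoder outputs; the padded positions deliberately hold non-zero values to show that the mask, not the padding content, keeps them out of the average.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    return [t / count for t in total]

def cls_pool(token_embeddings):
    """Use the vector at position 0, where BERT places the [CLS] token."""
    return token_embeddings[0]

# Toy encoder outputs (assumption): 2 sequences padded to length 4, hidden size 2.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]],  # 2 real tokens + 2 pads
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [6.0, 2.0]],  # 4 real tokens
]
masks = [[1, 1, 0, 0], [1, 1, 1, 1]]

sentence_embeddings = [mean_pool(t, m) for t, m in zip(batch, masks)]
print(sentence_embeddings)  # [[2.0, 3.0], [3.0, 2.0]]
print(cls_pool(batch[0]))   # [1.0, 2.0]
```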

### Relevant Links for Sentence Embeddings:

  * **Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:**
      * A seminal paper and library for high-quality sentence embeddings.
      * [Paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)
      * [GitHub: sentence-transformers](https://github.com/UKPLab/sentence-transformers)
  * **What are embeddings? (Google Developers):**
      * Good general overview of embeddings, including for text.
      * [Link: What are embeddings?](https://www.google.com/search?q=https://developers.google.com/machine-learning/glossary/embedding)
  * **The Illustrated Transformer:**
      * A fantastic visual explanation of how transformer models work.
      * [Link: The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)

## 2\. Embeddings for Images / Variable-Resolution Data

Images (and other variable-resolution data like medical scans or videos) also present the challenge of non-uniform input sizes.

### a) Convolutional Neural Networks (CNNs)

  * **Concept:** CNNs are the workhorse of modern computer vision. They learn hierarchical features from images through convolutional layers that detect patterns (edges, textures, shapes) at various scales.
  * **How it Works:**
    1.  An image is fed into a stack of convolutional layers, interleaved with pooling layers (e.g., max-pooling).
    2.  Convolutional layers apply filters across local regions of the image, producing feature maps.
    3.  Pooling layers downsample these feature maps, reducing their spatial dimensions and making the network more robust to small variations in the input.
    4.  After several convolutional and pooling layers, the output is typically flattened into a 1D vector.
    5.  This flattened vector is then fed into one or more fully connected (dense) layers. The output of one of these dense layers (often before the final classification layer) serves as the **image embedding**.
  * **Variable Resolution Handling:**
      * **Resizing/Cropping:** Often, images are **resized** or **cropped** to a fixed input dimension (e.g., 224x224 pixels for ImageNet-trained models) *before* being fed into the CNN. This is the simplest and most common approach, though it can lose information or introduce distortion.
      * **Global Pooling:** For truly variable input sizes, the final pooling layer can be a **global average pooling** or **global max pooling** layer. This layer takes the entire feature map from the preceding convolutional layer and reduces it to a single value per feature channel, resulting in a fixed-size vector regardless of the input image's original dimensions (after convolutions).
      * **Adaptive Pooling:** Some architectures use adaptive pooling layers that can automatically adjust their kernel size to produce a fixed-size output, regardless of the input feature map size.
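
Global average pooling is easy to show in isolation. In this sketch the feature maps are synthetic (each channel filled with a constant, an assumption made so the result is obvious); the point is that the output length equals the channel count whatever the spatial size.

```python
def global_average_pool(feature_maps):
    """Reduce each channel's H x W feature map to its mean value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]

def constant_maps(channels, height, width):
    """Toy feature maps (assumption): channel c is filled with the value c + 1."""
    return [[[float(c + 1)] * width for _ in range(height)] for c in range(channels)]

small = global_average_pool(constant_maps(3, 7, 7))    # from a low-resolution input
large = global_average_pool(constant_maps(3, 14, 10))  # from a larger, non-square input
print(small)  # [1.0, 2.0, 3.0]
print(large)  # [1.0, 2.0, 3.0]
```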

### b) Vision Transformers (ViT)

  * **Concept:** Adapting the successful Transformer architecture from NLP to computer vision.
  * **How it Works:**
    1.  An image is divided into a grid of fixed-size **patches** (e.g., 16x16 pixels).
    2.  Each patch is then flattened into a 1D vector and linearly projected into a specific dimension.
    3.  Positional embeddings are added to these patch embeddings (to retain spatial information).
    4.  This sequence of patch embeddings is then fed into a standard Transformer encoder, similar to how text tokens are handled.
    5.  The output embedding of a special `[CLS]` token (similar to BERT) or the pooled embeddings of all patches can serve as the **image embedding**.
  * **Variable Resolution Handling:**
      * Similar to text transformers, the input image is effectively converted into a sequence of fixed-size patches. While the *number* of patches can vary with image resolution, the *size* of each patch is fixed.
      * Models are often trained on fixed-resolution images initially. To handle different resolutions, images are either resized or the positional embeddings are interpolated. The output embedding itself is always fixed-size.
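
The patch-splitting step can be sketched on a single-channel image represented as a list of rows. `patchify` and `blank_image` are illustrative helpers, and the sketch assumes image sides are exact multiples of the patch size; the linear projection and positional embeddings are left out.

```python
def patchify(image, patch_size):
    """Split an H x W grayscale image (list of rows) into flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

def blank_image(height, width):
    return [[0.0] * width for _ in range(height)]

small = patchify(blank_image(224, 224), 16)
large = patchify(blank_image(384, 384), 16)
print(len(small), len(small[0]))  # 196 256
print(len(large), len(large[0]))  # 576 256
```

The number of patches (the sequence length) grows with resolution while each flattened patch keeps the same length, which is why only the positional embeddings need adjusting for a new resolution. For an RGB image each patch would hold 16 x 16 x 3 = 768 values.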

### Relevant Links for Image Embeddings:

  * **ImageNet and CNNs:**
      * Understanding ImageNet is key to understanding pre-trained CNNs for image embeddings.
      * [Article: A Brief Introduction to ImageNet](https://www.google.com/search?q=https://towardsdatascience.com/a-brief-introduction-to-imagenet-c-a7c37f39b69b)
  * **Introduction to CNNs for Image Recognition (Keras):**
      * A good practical overview.
      * [Keras API reference: Conv2D layer](https://keras.io/api/layers/convolution_layers/convolution2d/)
  * **An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Vision Transformers):**
      * The original paper for ViT.
      * [Paper: An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929)
  * **CLIP (Contrastive Language-Image Pre-training):**
      * A model that learns highly effective multimodal embeddings by training on image-text pairs, allowing you to get embeddings for both images and text in the same space.
      * [OpenAI Blog: CLIP](https://openai.com/research/clip)

-----

In summary, the key to handling variable-length or variable-resolution inputs for embeddings lies in using architectures like **RNNs, CNNs, and Transformers** that incorporate mechanisms (like recurrent states, pooling layers, or attention with padding/patching) to distill the variable input into a fixed-size, meaningful vector representation.

---
---
--- 

# Focus on Sentence Embedding using Tranformer Models

## Why Vanilla BERT Fails at Sentence Embeddings

**ref**: [Paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084).

A straightforward application of pre-trained BERT for sentence embeddings (e.g., using the `[CLS]` token's final hidden state or averaging all token embeddings) yields surprisingly **poor quality embeddings for semantic similarity tasks**.

Here's why:

1.  **BERT's Original Training Objective:**
    * **Masked Language Modeling (MLM):** BERT is trained to predict masked words in a sentence. This forces it to understand local context and word relationships *within a single sentence*.
    * **Next Sentence Prediction (NSP):** BERT is also trained to predict if two sentences follow each other. This helps it understand sentence-level relationships, but it's a binary classification task, not a continuous similarity task.
    * **No Explicit Similarity Objective:** Critically, BERT's training **does not explicitly teach it to produce sentence embeddings that are close in vector space for semantically similar sentences.** It wasn't optimized for this. Its embeddings are good for contextualizing individual words for downstream tasks (like classification or Q&A), but not for direct vector comparisons of whole sentences.

2.  **The `[CLS]` Token's Role:**
    * In vanilla BERT, the `[CLS]` token's final hidden state is primarily designed for **sequence-level classification tasks**. When you fine-tune BERT for sentiment analysis or spam detection, you typically add a linear layer on top of the `[CLS]` token's embedding to make the classification. During this fine-tuning, the `[CLS]` token learns to condense information relevant to that *specific classification task*, not necessarily general semantic meaning for similarity comparisons.
    * Without specific fine-tuning for similarity, the `[CLS]` token's embedding from the base BERT model is essentially a "general purpose" representation that hasn't been optimized to cluster similar sentences together in vector space.

3.  **Averaging Token Embeddings:**
    * Averaging all token embeddings (after passing them through BERT) is the other common shortcut, but it still falls short: the SBERT paper reports that averaged BERT embeddings are often worse than simply averaging GloVe word vectors. Each token embedding is contextually rich, but the model was never trained so that the average of one sentence lands near the average of a similar sentence. It's like averaging all the words in a paragraph; you get a vague idea, but not a concise summary designed for comparison.

4.  **Computational Cost for Similarity Search:**
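    * Vanilla BERT scores sentence similarity as a cross-encoder: both sentences are fed into the model together, so it produces no standalone sentence embedding that could be computed once and reused. Finding the most similar sentence to a query therefore requires one BERT pass *per candidate*, for every query. For a database of millions of sentences, that means millions of BERT passes per query, which is computationally prohibitive. The SBERT paper reports that finding the most similar pair among 10,000 sentences this way takes about 50 million inference computations (roughly 65 hours), versus about 5 seconds with SBERT embeddings.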
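
The arithmetic behind this cost is short enough to check directly. When two sentences must go through the model together, comparing every pair among `n` sentences takes `n(n-1)/2` passes; when each sentence is embedded on its own, `n` passes suffice. The function names below are illustrative.

```python
import math

def cross_encoder_passes(num_sentences):
    """Model passes needed to score every unordered sentence pair jointly."""
    return math.comb(num_sentences, 2)

def bi_encoder_passes(num_sentences):
    """Model passes needed when each sentence is embedded once on its own."""
    return num_sentences

print(cross_encoder_passes(10_000))  # 49995000
print(bi_encoder_passes(10_000))     # 10000
```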

---

## Sentence-BERT's Solution: Siamese and Triplet Networks

SBERT's main innovation is to **fine-tune BERT (or other pre-trained transformer models) specifically for semantic similarity tasks** using a **siamese** or **triplet network architecture**.

### The Core Idea: Siamese/Triplet Architectures

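Rather than training separate models, SBERT passes each input sentence through the *same* BERT network: the two (siamese) or three (triplet) branches share a single set of weights, so every sentence is encoded in exactly the same way.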

1.  **Siamese Network:**
    * Takes **two input sentences** (e.g., a pair `(sentence_A, sentence_B)`).
    * Each sentence is passed through an **identical BERT model**.
    * After the BERT output, a **pooling operation** is applied to generate a fixed-size sentence embedding for each sentence (e.g., mean-pooling over all token embeddings is often used, or the `[CLS]` token's embedding, but *after* fine-tuning).
    * These two sentence embeddings are then used to calculate a **similarity score** (e.g., cosine similarity, Euclidean distance) or are fed into a classification head to predict if they are similar/dissimilar.
    * **Training Objective:** The network is trained with a loss function that encourages:
        * **Similar sentence pairs** to have embeddings that are **close together** in vector space.
        * **Dissimilar sentence pairs** to have embeddings that are **far apart**.

2.  **Triplet Network:**
    * Takes **three input sentences**: an **anchor** sentence, a **positive** sentence (semantically similar to the anchor), and a **negative** sentence (semantically dissimilar to the anchor).
    * Each sentence is passed through an **identical BERT model**.
    * **Training Objective:** The network is trained using a **triplet loss** function. This loss aims to ensure that the distance between the anchor and the positive embedding is *smaller* than the distance between the anchor and the negative embedding by at least some margin. This directly optimizes for creating an embedding space where similar items are clustered and dissimilar items are separated.
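
The triplet loss described above can be written in a few lines. This sketch assumes Euclidean distance and a margin of 1, and the 2-dimensional "embeddings" are made-up points chosen so the distances are easy to verify.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin), Euclidean d."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]                 # distance 1 from the anchor
well_separated_negative = [3.0, 4.0]  # distance 5 from the anchor
too_close_negative = [0.0, 1.5]       # distance 1.5 from the anchor

print(triplet_loss(anchor, positive, well_separated_negative))  # 0.0
print(triplet_loss(anchor, positive, too_close_negative))       # 0.5
```

A loss of zero means the negative is already farther from the anchor than the positive by at least the margin, so that triplet contributes no gradient; only violating triplets move the embeddings.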

### What SBERT Does (Specifically in the Paper):

The SBERT paper explores various fine-tuning objectives, including:

  * **Classification Objective:** Both sentences are passed through BERT and pooled into embeddings `u` and `v`. The paper concatenates `u`, `v`, and the element-wise difference `|u - v|`, then feeds the result to a softmax classifier that predicts the relationship between the sentences (e.g., entailment, contradiction, neutral). This forces the embeddings to carry the information needed to distinguish those relationships.
  * **Regression Objective:** Taking two sentences, passing them through BERT, generating embeddings, computing cosine similarity between them, and then training to predict a target similarity score (e.g., from a dataset like STS Benchmark).
  * **Triplet Objective:** As described above, using anchor, positive, and negative examples to explicitly push similar sentences closer and dissimilar sentences farther apart.
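
As a rough sketch of the first two objectives, the snippet below builds the concatenated feature vector a classifier would consume and computes a squared-error regression loss against a gold similarity score. The function names and the two-dimensional vectors are illustrative assumptions; the classifier itself is omitted.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classification_features(u, v):
    """Concatenate u, v and |u - v|; a softmax classifier would take this as input."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]

def regression_loss(u, v, target_score):
    """Squared error between the cosine similarity and the gold similarity score."""
    return (cosine_similarity(u, v) - target_score) ** 2

u, v = [1.0, 0.0], [0.0, 1.0]
features = classification_features(u, v)         # length is 3 * embedding dimension
loss = regression_loss(u, v, target_score=0.0)   # orthogonal vectors, gold score 0
```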

### The Key Difference: Fine-tuning for Similarity

The critical takeaway is that SBERT **fine-tunes** BERT (or RoBERTa, XLNet, etc.) specifically for the task of generating *semantically meaningful sentence embeddings for similarity comparisons*.

During this fine-tuning, the weights of the BERT model are adjusted such that the pooling output (the sentence embedding) now accurately reflects semantic similarity. The model learns to map similar sentences to nearby points in the vector space.

### Why this is a Game Changer:

1.  **Superior Embeddings:** The embeddings generated by SBERT (after fine-tuning) are significantly better for semantic similarity tasks than those from vanilla BERT.
2.  **Efficiency:** Once SBERT is fine-tuned, you can pre-compute and store the embeddings for all sentences in your database. When a new query comes in, you pass it *once* through the SBERT model to get its embedding, and then use highly efficient vector similarity search (like cosine similarity) to find the most relevant sentences in your pre-computed database. This is orders of magnitude faster than passing every candidate through BERT for each query.
3.  **Versatility:** SBERT models can be used out-of-the-box for various tasks like semantic search, clustering, and paraphrase detection without further task-specific fine-tuning.

---

In essence, SBERT takes a powerful general-purpose language model (BERT) and adapts it through a clever training architecture (siamese/triplet networks) to become an equally powerful, and highly efficient, tool for generating high-quality **sentence embeddings specifically optimized for semantic similarity**.

---

# Indexing and Search

Following the generation of embeddings (covered previously), the next critical steps in Vector Search are **indexing** and **searching** these vector spaces efficiently. This involves addressing two primary challenges:

1.  How to measure the distance (similarity) between vectors.
2.  How to search vectors in a fast and scalable way.

## 1\. Measuring Vector Distance (Similarity Metrics)

In a multi-dimensional vector space, we need quantitative methods to determine how "close" or "similar" two vectors are. The choice of metric depends on the embedding model and specific use case.

Here are five widely used metrics:

  * **L0 "Norm" (Hamming Distance for binary vectors, or count of non-zero elements):**
      * **Definition:** For a vector, it is the number of non-zero elements. For comparing two vectors, it can be conceptualized as the number of positions at which the corresponding elements are different (often related to Hamming distance for binary vectors).
      * **Interpretation:** Primarily measures **sparsity** or **difference in presence/absence** of features.
      * **Note:** While commonly referred to as the L0 "norm," it technically doesn't satisfy all the properties of a mathematical norm. Homogeneity (`||cx|| = |c| ||x||`) fails: scaling `x` by any non-zero `c` leaves the count of non-zero elements unchanged, so `||cx||` equals `||x||` rather than `|c| ||x||`. It's more accurately described as a **pseudo-norm** or a count. It is **not commonly used for continuous vector similarity in dense embeddings** (like those from neural networks) but is highly relevant in fields like sparse coding, feature selection, and comparing binary/categorical data.
  * **Manhattan Distance (L1 Distance):**
      * **Definition:** Sum of the absolute differences between the corresponding coordinates of two points. `L1(x, y) = Σ |xi - yi|`
      * **Interpretation:** Measures distance in a grid-like pattern, moving only along axes.
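  * **Euclidean Distance (L2 Distance):**
      * **Definition:** Square root of the sum of the squared differences between the corresponding coordinates of two points. `L2(x, y) = √(Σ (xi - yi)²)`
      * **Interpretation:** Measures the shortest straight-line distance between two points. The *squared* form (without the square root) is a different quantity, but it is often used in practice because it preserves the nearest-neighbor ranking and is cheaper to compute.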
  * **Cosine Distance (Cosine Similarity):**
      * **Definition:** Measures the angle between two vectors. Cosine similarity is 1 for identical directions (angle 0), 0 for orthogonal directions (angle 90 degrees), and -1 for opposite directions (angle 180 degrees). Cosine *distance* is usually defined as `1 - Cosine Similarity`.
      * **Interpretation:** Focuses on the similarity of **direction**, regardless of magnitude.
      * **Note:** Cosine similarity equals the dot product of the two vectors after each has been normalized to unit length.
  * **Dot Product Distance (Inner Product Distance):**
      * **Definition:** Based on the projection of one vector onto another (sum of the products of corresponding components). `DotProduct(x, y) = Σ (xi * yi)`
      * **Interpretation:** Considers similarity in terms of both **direction and magnitude**. Unlike the distances above, a *larger* value means the vectors are more similar.
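
The five metrics above are small enough to write out directly. The following is a minimal pure-Python sketch on two hand-picked vectors; the function names are my own, and production code would typically use a vectorized library instead.

```python
import math

def l0_distance(x, y):
    """Number of positions where the two vectors differ (a Hamming-style count)."""
    return sum(1 for a, b in zip(x, y) if a != b)

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dot_product(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    return dot_product(x, y) / (math.sqrt(dot_product(x, x)) * math.sqrt(dot_product(y, y)))

x = [3.0, 4.0, 0.0]
y = [0.0, 4.0, 3.0]

print(l0_distance(x, y))        # 2
print(manhattan(x, y))          # 6.0
print(euclidean(x, y))          # sqrt(18), about 4.243
print(dot_product(x, y))        # 16.0
print(cosine_similarity(x, y))  # 0.64

# Scaling y changes the dot product but leaves the cosine similarity unchanged.
scaled = [10.0 * v for v in y]
print(dot_product(x, scaled))        # 160.0
print(cosine_similarity(x, scaled))  # 0.64
```

The last two lines show the practical difference between the final two metrics: cosine similarity ignores magnitude, while the dot product rewards it.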


## 2\. Fast and Scalable Vector Search

Once distances can be measured, the challenge shifts to finding similar vectors efficiently within potentially vast vector spaces (millions to billions of embeddings).

### Search Algorithms:

1.  **Brute Force Algorithm:**

      * **Process:**
          1.  Calculate distances from the query vector to *all* other vectors in the space.
          2.  Sort all distances.
          3.  Find the top `k` nearest vectors.
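      * **Complexity:** `O(N*d)` for the distance computations, where `N` is the number of vectors and `d` is the number of dimensions, plus the cost of selecting the top `k` (`O(N log N)` if all distances are fully sorted).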
      * **Drawback:** Impractical and computationally bottlenecked for large datasets (`N` in the millions/billions).

2.  **Approximate Nearest Neighbor (ANN) Algorithms:**

      * **Concept:** Accelerate search by trading a small amount of accuracy for significant speed improvements. They avoid exhaustive search by intelligently pruning the search space.
      * **How it works (general idea):** Divides the search space into smaller partitions, indexes them using data structures (like trees or hashes), and then searches only the most relevant partitions.
      * **Example (TreeAh - shallow tree and asymmetric hashing):** A production-ready algorithm that uses tree structures for indexing.
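
For reference, the brute-force baseline fits in a few lines. This sketch assumes Euclidean distance and uses `heapq.nsmallest` to keep only the `k` best candidates rather than sorting every distance; the sample vectors are made up.

```python
import heapq
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def brute_force_knn(query, vectors, k):
    """Exact search: score every stored vector, then keep the k closest.
    Returns (distance, index) pairs ordered from nearest to farthest."""
    scored = ((euclidean(query, vec), idx) for idx, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
neighbors = brute_force_knn([1.0, 1.0], vectors, k=2)
nearest_ids = [idx for _, idx in neighbors]  # [1, 3]
```

Every query still touches all `N` vectors, which is exactly the cost that ANN methods try to avoid.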

### ScaNN: Scalable Approximate Nearest Neighbor

In 2020, Google Research introduced **ScaNN** (Scalable Approximate Nearest Neighbor), a leading ANN algorithm powering services like Google Search and YouTube's recommendation system. 

ScaNN achieves fast and scalable vector search by combining:

1.  **Reduced Search Space (Space Pruning):**
      * **Multilevel Tree Search:** The vector space is divided into hierarchical partitions. A search tree represents this structure, with nodes as centroids of partitions.
      * **Pruning:** During a query, the tree is traversed (from root to branches to leaves), and irrelevant partitions are pruned, focusing the search on the most relevant sub-partitions.
2.  **Compressed Vector Size (Data Quantization):**
      * **Technique:** Compresses data points to save space and reduce indexing time (e.g., reducing a 9-dimensional vector from 9 floats to 12 bits).
3.  **Increased Ranking Efficiency (Business Logic Integration):**
      * **Filtering:** Incorporates business logic to filter results based on specific criteria (e.g., "resorts in the United States," "red dresses") *before* or *during* similarity ranking, restricting the search to a relevant subset of the dataset.
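
To make the first two ideas concrete, here is a toy sketch of partition pruning combined with simple scalar quantization. It is not ScaNN's actual algorithm: the single-level partitioning, the hand-picked centroids, the 8-bit codes, and all function names are simplifying assumptions for illustration only.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def build_partitions(vectors, centroids):
    """Assign each vector index to its nearest centroid (one level of a search tree)."""
    partitions = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        partitions[nearest].append(idx)
    return partitions

def quantize(vec, lo=-1.0, hi=1.0, levels=256):
    """Scalar quantization: map each float in [lo, hi] to an integer code 0..levels-1."""
    step = (hi - lo) / (levels - 1)
    return [round((min(max(x, lo), hi) - lo) / step) for x in vec]

def dequantize(codes, lo=-1.0, hi=1.0, levels=256):
    step = (hi - lo) / (levels - 1)
    return [lo + c * step for c in codes]

def ann_search(query, codes, centroids, partitions, k, n_probe=1):
    """Score only the vectors in the n_probe partitions closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:n_probe]
    candidates = [idx for c in probe for idx in partitions[c]]
    candidates.sort(key=lambda idx: sq_dist(query, dequantize(codes[idx])))
    return candidates[:k]

vectors = [[0.9, 0.8], [0.5, 0.9], [-0.9, -0.7], [-0.8, -0.9], [0.7, 0.7]]
centroids = [[0.8, 0.8], [-0.8, -0.8]]  # fixed by hand here; normally learned, e.g. by k-means
partitions = build_partitions(vectors, centroids)  # {0: [0, 1, 4], 1: [2, 3]}
codes = [quantize(v) for v in vectors]             # compact integer codes stored in place of floats
result = ann_search([0.75, 0.75], codes, centroids, partitions, k=2)  # [4, 0]
```

With `n_probe=1` the query never looks at vectors 2 and 3, which is where the speedup comes from. The price is that a true neighbor sitting just across a partition boundary can be missed, hence "approximate".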

## Vertex AI Vector Search (formerly Matching Engine)

Vertex AI Vector Search is a fully managed similarity vector search service provided by Google Cloud.

  * **Foundation:** Utilizes an advanced version of the ScaNN algorithm.
  * **Benefits:** Offers fast searching, low latencies, and scalability to billions of vectors, often at a lower cost compared to similar services.

-----

## Further Reading & Resources

### Approximate Nearest Neighbor (ANN) Algorithms & Vector Databases:

  * **ScaNN: Efficient Vector Similarity Search (Google AI Blog):**
      * The official announcement and explanation of ScaNN.
      * [Link: ScaNN: Efficient Vector Similarity Search](https://ai.googleblog.com/2020/07/scann-efficient-vector-similarity-search.html)
  * **The Missing Piece: An Introduction to Approximate Nearest Neighbor (ANN) Search (Pinecone Blog):**
      * A good overview of ANN concepts.
      * [Link: Introduction to ANN Search](https://www.pinecone.io/learn/approximate-nearest-neighbor/)
  * **Faiss (Facebook AI Similarity Search):**
      * An open-source library for efficient similarity search. Essential for understanding ANN implementations.
      * [GitHub: Faiss](https://github.com/facebookresearch/faiss)
  * **HNSW (Hierarchical Navigable Small Worlds):**
      * A popular graph-based ANN algorithm widely used in vector databases.
      * [Paper: Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs](https://arxiv.org/abs/1603.09320)
  * **Vertex AI Vector Search Documentation:**
      * Official Google Cloud documentation for their managed vector search service.
      * [Link: Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/matching-engine/overview) (Look for the most current link for "Vector Search" if "Matching Engine" is outdated)

-----

## The Problem: AI Hallucination 😵‍💫

A significant challenge with AI models like chatbots is **hallucination**, a situation where the AI confidently delivers a completely inaccurate response. This occurs because Large Language Models (LLMs) have an understanding limited to their training data, which can become outdated or lack specific organizational knowledge. This "grounding problem" undermines user trust in AI systems.

---

### What Causes AI Hallucinations?

LLMs are prone to hallucination for several key reasons:

* **Limited Knowledge:** Their understanding is confined to their training data. They lack awareness of your company's internal data, specific industry knowledge, or real-time information.
* **Inability to Verify:** They cannot check the accuracy of their own training data.
* **Lack of Context:** They often assume user prompts are factually correct and are unable to ask for clarifying information.

---

### Traditional Solutions and Their Limitations

Several methods have been used to combat hallucinations, but each has drawbacks:

* **Fine-tuning:** This involves retraining an LLM with new, specific data. While effective, it is often costly and requires extensive data and computational resources.
* **Human Review:** Having humans verify AI responses increases accuracy but is expensive, time-consuming, and not always scalable enough to catch every error.
* **Prompt Engineering:** Carefully crafting prompts can help steer the AI toward more accurate answers, but its effectiveness is limited, especially at scale.

---

## The RAG Solution: An Open-Book Exam for AI 📖

A more effective solution is **Retrieval-Augmented Generation (RAG)**. RAG is an architecture that combines the strengths of retrieval technology (like Vector Search) and generative AI models (like LLMs).

* **Retrieval Models (Vector Search):** Excellent at finding specific, factual information from a large set of documents.
* **Generative Models (LLMs):** Excellent at generating coherent, fluent, and creative text.

RAG bridges the gap between these two. It effectively gives the LLM an **"open-book exam,"** allowing it to look up information from an external, up-to-date knowledge base *before* generating an answer. This grounds the AI's response in verifiable facts, reducing the likelihood of hallucination.

---

### How RAG Works with Vector Search

**Vector Search** is the key technology that powers the retrieval function in a RAG system. The process works as follows:

1.  **Encode and Index:** New, trustworthy information (e.g., company policies, product docs, real-time alerts) is encoded into vector embeddings and stored in a vector database for efficient searching.
2.  **Query:** A user's question is also converted into a vector embedding.
3.  **Search and Retrieve:** The system uses Vector Search to find the most semantically similar and relevant documents from the vector database based on the user's query embedding.
4.  **Augment and Generate:** The original question, along with the retrieved factual information, is passed to the LLM.
5.  **Grounded Response:** The LLM then generates a final answer that incorporates the fresh, verified information, resulting in a more reliable and trustworthy response.


![](docs/images/rag-pipeline.png)


This creates a **grounded agent**—an AI that can perform fact-checks against a trusted source of information.
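
The five steps can be traced end to end with a deliberately simplified sketch. Everything here is a stand-in: `embed` is a bag-of-words counter rather than a real embedding model, `generate` is a hypothetical placeholder that echoes its prompt instead of calling an LLM, and the two documents are invented.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, index, k=1):
    """Return the k documents whose embeddings are most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(question, passages):
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {question}"

def generate(prompt):
    """Hypothetical placeholder for an LLM call; it simply returns the prompt."""
    return prompt

documents = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
]
index = [(doc, embed(doc)) for doc in documents]     # 1. encode and index
question = "How many days do I have for refunds?"    # 2. query
passages = retrieve(question, index)                 # 3. search and retrieve
answer = generate(build_prompt(question, passages))  # 4-5. augment and generate
```

The key design point is that the LLM only ever sees the question together with retrieved passages, so the knowledge base can be updated without retraining the model.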

---

## The Next Step: Hybrid Search

While semantic search is powerful for understanding context, it can struggle to retrieve specific, exact terms (like a new product SKU) that weren't in the embedding model's training data. To address this, the next evolution is **hybrid search**, which combines the contextual understanding of semantic search with the precision of traditional keyword search to improve retrieval quality.

## The Challenge: Beyond Semantic Search 🤔

While **semantic search** is excellent at understanding the meaning and context of words, it can struggle with **out-of-domain information**—data the embedding model hasn't been trained on, such as a brand-new product name or a specific barcode. This is where **Hybrid Search** comes in.

---

## What is Hybrid Search? 🤝

**Hybrid Search** combines the strengths of two search methods to achieve a more comprehensive and precise search experience:

* **Semantic Search:** Handles nuanced, contextual queries by understanding meaning.
* **Keyword Search:** Accurately captures specific, literal terms, especially those that are out-of-domain.

By merging these two, you get the best of both worlds. A well-known example is **Google Search**, which integrated semantic search with its existing keyword algorithms to significantly improve search quality.

### The Old Way vs. The New Way

Previously, building a hybrid search engine was a difficult task, requiring the maintenance of two separate engines and a complex process to merge and re-rank their results. Modern platforms like **Vertex AI Vector Search** have simplified this process, allowing for the creation of a single, powerful search system.

---

## How Hybrid Search Works ⚙️

Hybrid search follows the familiar `encode -> index -> search` process, but it runs two parallel tracks that are later combined.

### 1. The Keyword Search Track (Token-based)

This track focuses on matching exact words or tokens.

* **Encoding (Creating Sparse Embeddings):**
    * Text is broken into tokens (words or sub-words).
    * Instead of simple one-hot encoding, this method often uses a weighting algorithm like **TF-IDF (Term Frequency-Inverse Document Frequency)**.
    * TF-IDF assesses a word's importance within a document relative to a whole collection of documents, emphasizing significant terms.
    * The result is a high-dimensional vector with mostly zero values, known as a **sparse embedding**.

* **Indexing & Searching:**
    * A vector space is created to organize these sparse embeddings. Texts with similar keyword distributions are placed near each other, enabling efficient keyword matching.

### 2. The Semantic Search Track

This track runs in parallel and focuses on meaning.

* **Encoding (Creating Dense Embeddings):**
    * As covered previously, an embedding model (like those available through the Vertex AI Embeddings API) converts text into a **dense embedding**: a vector with far fewer dimensions than a sparse embedding, in which most values are non-zero and carry meaning.

### 3. Combining and Re-ranking the Results

This is the final, crucial step where the results from both tracks are merged.

* **Reciprocal Rank Fusion (RRF):**
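    * RRF merges the two ranked lists using only each item's rank position, not its raw score, so the sparse and dense scores never need to share a scale. Each item's fused score is the sum of `1 / (k + rank)` over the lists in which it appears, where `k` is a smoothing constant.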
    * It elevates items that rank highly in *any* of the individual lists.
    * An item that ranks very high in just one list (e.g., a perfect keyword match) or ranks consistently well across both lists will be prioritized in the final results.
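
A minimal sketch of the standard RRF formulation follows. The constant `k=60` is the value commonly used in the literature, and the item IDs are invented; a managed service may weight the two lists differently, so treat this as an illustration rather than a description of any particular product's internals.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Score each item as the sum over lists of 1 / (k + rank), with ranks starting at 1,
    and return the items ordered from highest to lowest fused score."""
    scores = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)

keyword_hits = ["sku-123", "doc-a", "doc-b"]   # from the sparse (keyword) track
semantic_hits = ["doc-a", "doc-c", "sku-123"]  # from the dense (semantic) track
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
# ['doc-a', 'sku-123', 'doc-c', 'doc-b']
```

Here `doc-a` wins because it ranks near the top of both lists, `sku-123` follows on the strength of its first-place keyword match, and items found by only one track trail behind.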

---

## Implementation with Vertex AI Vector Search 🛠️

Modern APIs abstract away much of the complexity, making implementation straightforward.

1.  **Generate Embeddings:**
    * **Sparse Embeddings:** Use a vectorizer library (like `scikit-learn`'s TF-IDF vectorizer) to convert your text data into sparse embeddings for keyword search.
    * **Dense Embeddings:** Use a service like the Vertex AI Embeddings API to generate dense embeddings for semantic search.

2.  **Store and Index:**
    * Combine both the dense and sparse embeddings for each data point into a single record (e.g., in a JSON file).
    * Use this file to create a single hybrid vector index in Vertex AI Vector Search.

3.  **Query the Index:**
    * When performing a search, create a **hybrid query object** that contains both the sparse embedding (for keywords) and the dense embedding (for semantics) of your search query.
    * The system executes the query, leveraging both embedding types and using RRF to fuse the results.
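
Step 2 can be sketched as follows. The field names `embedding` and `sparse_embedding` with parallel `dimensions` and `values` lists reflect my understanding of the Vertex AI input format and should be checked against the current documentation; the helper names and all numbers are made up for illustration.

```python
import json

def to_sparse(weights_by_dim):
    """Keep only the non-zero entries, as parallel `dimensions` and `values` lists."""
    dims = sorted(d for d, w in weights_by_dim.items() if w != 0.0)
    return {"dimensions": dims, "values": [weights_by_dim[d] for d in dims]}

def make_record(item_id, dense, sparse_weights):
    """Bundle one item's dense and sparse embeddings into a single record."""
    return {
        "id": item_id,
        "embedding": dense,                             # dense vector (semantic track)
        "sparse_embedding": to_sparse(sparse_weights),  # keyword track
    }

record = make_record(
    "item-1",
    dense=[0.12, -0.03, 0.88],                     # made-up dense embedding
    sparse_weights={4: 0.41, 17: 0.0, 230: 0.73},  # made-up TF-IDF weights by vocabulary index
)
line = json.dumps(record)  # one JSON object per line of the index input file
```

Storing only the non-zero dimensions is what keeps sparse embeddings practical despite vocabulary-sized vectors.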

### Example Result

A hybrid search for "kids sunglasses" might return:

* **Top Result:** "Google Blue Kids Sunglasses" (high similarity for both dense and sparse embeddings).
* **Middle Result:** "Google White Classic Youth Tee" (lower rank because it doesn't contain the keyword "kids," but "youth tee" is semantically similar enough to be included).

This demonstrates how hybrid search enables the rapid finding of similar items based on both literal keywords and conceptual meaning.

## TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a numerical statistic reflecting how important a word is to a document in a collection or corpus. It's widely used in information retrieval and text mining.

It increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

### Core Components:

TF-IDF is calculated by multiplying two main components:

1.  **Term Frequency (TF):** Measures how frequently a term (word) appears in a document. There are several ways to define TF:
      * **Raw Count:** The absolute count of the term in the document. `TF(t, d) = count(t in d)`
      * **Normalized Frequency:** Raw count divided by the total number of terms in the document. This prevents bias towards longer documents. `TF(t, d) = count(t in d) / total_terms_in_d`
      * **Log Normalization:** `TF(t, d) = 1 + log(count(t in d))` (if count > 0)
      * **Interpretation:** Higher TF means the term is more relevant to that specific document.

2.  **Inverse Document Frequency (IDF):** Measures how important a term is across the entire corpus. It's designed to down-weight common words (like "the," "is," "a") that appear in many documents and up-weight rare words that are more distinctive.
      * **Definition:** `IDF(t, D) = log(N / df(t))`
          * `N`: Total number of documents in the corpus.
          * `df(t)`: Number of documents in the corpus that contain the term `t` (document frequency).
          * `log`: Usually base 10 or natural log (`ln`). A `+1` is often added to the denominator to prevent division by zero if a term doesn't appear in any document.
      * **Interpretation:** Higher IDF means the term is rarer across the corpus, making it potentially more informative.

### TF-IDF Calculation:

The final TF-IDF score for a term `t` in a document `d` within a corpus `D` is:

`TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)`

#### How it Works:

  * If a term appears frequently in a document (high TF) but rarely across the entire corpus (high IDF), it gets a **high TF-IDF score**, indicating it's very relevant to that specific document.
  * If a term appears frequently across the entire corpus (low IDF), even if it's frequent in a document (high TF), its TF-IDF score will be lower, reducing its perceived importance.
  * If a term is rare across the corpus (high IDF), even a few occurrences in a document (modest TF) can yield a significant score.

### Output Representation:

  * When applied to a document, TF-IDF transforms the document into a **sparse vector**. Each dimension of the vector corresponds to a unique term in the vocabulary, and its value is the TF-IDF score for that term in that document.
  * Since most words in the vocabulary will not appear in a given document, most values in the vector will be zero, hence "sparse."
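
The calculation and the resulting sparse vectors can be worked through by hand on a tiny corpus. This sketch assumes the normalized-frequency TF and the plain `ln(N / df)` IDF from the definitions above, with whitespace tokenization; library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return (vocabulary, sparse vectors) with TF = count / doc length and IDF = ln(N / df).
    Each sparse vector is a dict {vocabulary index: score} holding only non-zero scores."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    index = {term: i for i, term in enumerate(vocabulary)}
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        vec = {}
        for term, count in Counter(tokens).items():
            score = (count / len(tokens)) * math.log(n_docs / df[term])
            if score > 0.0:
                vec[index[term]] = score
        vectors.append(vec)
    return vocabulary, vectors

docs = ["the cat sat", "the dog sat", "the cat purred"]
vocab, vectors = tf_idf_vectors(docs)
# vocab is ['cat', 'dog', 'purred', 'sat', 'the']
# "the" appears in every document, so its IDF is ln(3/3) = 0 and it drops out of every vector.
# "dog" appears in one document only: its score there is (1/3) * ln(3), about 0.366.
```

In the first document only `cat` and `sat` survive, each with score `(1/3) * ln(3/2)`, which illustrates both the down-weighting of common words and the sparsity of the output.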

### Use Cases:

  * **Information Retrieval:** Ranking documents by relevance to a query.
  * **Text Summarization:** Identifying key terms in a document.
  * **Document Classification:** As features for machine learning models.
  * **Keyword Extraction:** Finding the most important keywords in a text.

### Limitations:

  * **Semantic Blindness:** TF-IDF treats each word as independent. It doesn't understand synonyms, polysemy, or the semantic relationship between words.
  * **Lack of Context:** It only considers word presence and frequency, not the word order or surrounding context within a sentence.
  * **Vocabulary Size:** Performance can degrade with very large vocabularies, leading to extremely high-dimensional, sparse vectors.

### Further Reading:

  * **TF-IDF (Wikipedia):** The standard reference for the algorithm.
      * [Link: Tf-idf - Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
  * **Scikit-learn TfidfVectorizer:** Practical implementation details and usage.
      * [Link: sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
  * **Introduction to Information Retrieval (Manning, Raghavan, Schütze):** Chapter 6 covers TF-IDF in depth.
      * [Link: Introduction to Information Retrieval - Chapter 6](https://nlp.stanford.edu/IR-book/html/htmledition/the-vector-space-model-for-scoring-1.html)