# Semantic Search

Semantic search retrieves information based on the meaning and context of a query rather than its exact wording. It typically relies on vector embeddings, which lets it return relevant results even when the query and the document share no keywords.

## 1. Embeddings

### 1.1 What are Embeddings?

Embeddings are N-dimensional vector representations derived from text data. They capture the semantic meaning of the text, enabling similarity comparisons using distance metrics. These representations allow for efficient search and retrieval of relevant information based on textual relationships.
 

![](../obsidian/Excalidraw/Embeddings.excalidraw.svg)
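
To make the idea concrete, here is a minimal sketch with hand-written 3-dimensional vectors. The vectors and the `euclidean_distance` helper are illustrative assumptions, not the output of a real embedding model, which would typically produce hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "train": [0.0, 0.1, 0.9],
}


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d_cat_kitten = euclidean_distance(toy_embeddings["cat"], toy_embeddings["kitten"])
d_cat_train = euclidean_distance(toy_embeddings["cat"], toy_embeddings["train"])

# Related words sit closer together than unrelated ones.
print(f"cat <-> kitten: {d_cat_kitten:.3f}")
print(f"cat <-> train:  {d_cat_train:.3f}")
```

A smaller distance means the two texts are treated as more similar in meaning.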


### 1.2 Creating Embeddings with Ollama  

Just like chat models, embedding models can be easily downloaded and hosted on our own hardware.  

The following code snippet demonstrates how to generate embeddings using an Ollama embedding model.


In [None]:
from ollama import Client

# Download the embedding model
MODEL = "all-minilm:33m"
client = Client(host="http://localhost:11434")
client.pull(MODEL)

In [None]:
# Generate embeddings for a given text
result = client.embed(
    model=MODEL,
    input="Hello world",
)

print(f"{result.embeddings[0]}")

### 1.3 Calculating Similarities

To determine how similar two sentences are, we compare the distance between their embeddings. A common approach is to use **cosine similarity**, which measures the cosine of the angle between two vectors:


$\text{Cosine Similarity} = \cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$


(An example implementation in Python is provided below.)

Visualizing the **similarity matrix** can offer deeper insights into the relationships between sentences. This allows us to identify the most semantically relevant sentences based on their contextual meaning.
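
As a quick worked example of the formula for a single pair of vectors, the sketch below uses only the standard library. The `cosine` helper and the two vectors are chosen for illustration so that the arithmetic is easy to follow by hand.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 2.0]  # norm = 3
b = [2.0, 1.0, 2.0]  # norm = 3

# dot product = 1*2 + 2*1 + 2*2 = 8, so cos(theta) = 8 / (3 * 3)
similarity = cosine(a, b)
print(round(similarity, 4))  # 0.8889

# Scaling a vector does not change the angle, so the similarity is unchanged.
scaled = cosine([10 * x for x in a], b)
print(round(scaled, 4))  # 0.8889
```

Because only the angle matters, cosine similarity ignores vector length, which is one reason it is popular for comparing embeddings.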

In [None]:
import numpy as np


# Example function to calculate cosine similarity between multiple embeddings
def cosine_similarity(embeddings: np.ndarray) -> np.ndarray:
    """
    Calculate the cosine similarity between all pairs of embedding vectors in a 2D array.

    Parameters:
        embeddings (numpy.ndarray): 2D array where each row is an embedding.

    Returns:
        numpy.ndarray: 2D array of cosine similarities. The element at [i, j] is the cosine similarity between the ith and jth embeddings.
    """
    # Calculate the Gram matrix (dot product of each pair of embeddings)
    gram = np.dot(embeddings, embeddings.T)

    # Calculate the norms of each embedding
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Avoid division by zero by ensuring norms are at least a small epsilon
    epsilon = np.finfo(float).eps
    norms += epsilon

    # Compute the cosine similarity
    similarity = gram / (norms * norms.T)

    return similarity

#### 1.3.1 Visualizing Sentence Similarities

**a) Heatmaps**

One effective way to analyze sentence similarities is through **heatmaps**, which provide a clear visual representation of the similarity matrix.

In [None]:
from utils import bulk_embed, plot_similarity_heatmap


example_sentences = [
    "The cat sat on the windowsill, watching the birds outside.",
    "A feline perched by the window, observing the chirping sparrows.",
    "The dog barked loudly when the mailman arrived.",
    "A postal worker delivered letters while a nearby canine growled.",
    "The sun sets in the evening, painting the sky orange and red.",
    "At dusk, the horizon glows with vibrant shades of crimson and gold.",
    "She enjoys reading mystery novels late at night.",
    "At night, she immerses herself in thrilling detective stories.",
    "The train arrived at the station five minutes late.",
    "Passengers waited patiently as the delayed locomotive approached.",
]

embeddings = bulk_embed(MODEL, example_sentences, client)


similarities = cosine_similarity(embeddings)

plot_similarity_heatmap(similarities, texts=example_sentences, limit=20)
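
Besides looking at the heatmap, the similarity matrix can be read programmatically. The sketch below finds the closest partner of each entry while skipping the diagonal (every sentence is trivially identical to itself). The matrix values, the labels, and the `best_matches` helper are made up for illustration and are not model output.

```python
# Hypothetical 4x4 similarity matrix (symmetric, ones on the diagonal).
labels = ["cat", "feline", "dog", "train"]
similarity_matrix = [
    [1.00, 0.82, 0.41, 0.05],
    [0.82, 1.00, 0.38, 0.07],
    [0.41, 0.38, 1.00, 0.10],
    [0.05, 0.07, 0.10, 1.00],
]


def best_matches(matrix: list[list[float]]) -> list[int]:
    """For each row, return the column index of the highest off-diagonal value."""
    result = []
    for i, row in enumerate(matrix):
        candidates = [(value, j) for j, value in enumerate(row) if j != i]
        result.append(max(candidates)[1])
    return result


matches = best_matches(similarity_matrix)
for i, j in enumerate(matches):
    print(f"{labels[i]} -> {labels[j]} ({similarity_matrix[i][j]:.2f})")
```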

**b) 3D Plots**

Another way to inspect sentence similarities is a **3D plot** of the embeddings, in which semantically related sentences should appear close to one another.

In [None]:
from utils import plot_embeddings

plot_embeddings(embeddings=embeddings, texts=example_sentences)

### 1.4 Your Task: Multilingual Similarity Analysis  

Your task is to **visualize the similarities between sentences with similar semantic content but in different languages**. This will help assess whether our model effectively captures multilingual semantics.  

Once you have completed the analysis, repeat the process using the `granite-embedding:278m` model to compare its performance in handling multilingual embeddings.

In [None]:
# Your code goes here

## 2. Vector Stores and Search

A **vector store** is a specialized database for managing and searching high-dimensional embeddings. Unlike traditional keyword-based search, vector stores enable **semantic search**, retrieving relevant information based on meaning rather than exact matches. Large stores typically rely on **approximate nearest neighbor (ANN) search** to find similar entries quickly. The small flat index used below instead compares the query against every stored vector, which is exact but only practical for modest collection sizes.

To integrate vector stores into our system, we will again use **LangChain**, which provides seamless tools for storing and retrieving embeddings for effective semantic search.
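
Before using the real libraries, it can help to see what a flat (brute-force) index does conceptually: it stores every vector and scans all of them for each query. The sketch below mimics that in plain Python. `FlatIndex` is a hypothetical name for illustration and not part of FAISS or LangChain, and the 2-dimensional vectors are invented.

```python
class FlatIndex:
    """Hypothetical brute-force index: stores vectors and scans all of them."""

    def __init__(self) -> None:
        self.vectors: list[list[float]] = []
        self.payloads: list[str] = []

    def add(self, vector: list[float], payload: str) -> None:
        self.vectors.append(vector)
        self.payloads.append(payload)

    def search(self, query: list[float], k: int = 3) -> list[tuple[float, str]]:
        """Return the k payloads with the smallest squared L2 distance."""
        scored = [
            (sum((q - v) ** 2 for q, v in zip(query, vector)), payload)
            for vector, payload in zip(self.vectors, self.payloads)
        ]
        return sorted(scored)[:k]


index = FlatIndex()
index.add([0.9, 0.1], "cat on the windowsill")
index.add([0.8, 0.3], "feline by the window")
index.add([0.1, 0.9], "delayed train")

hits = index.search([1.0, 0.0], k=2)
for distance, payload in hits:
    print(f"{distance:.2f}  {payload}")
```

The scan costs time proportional to the number of stored vectors, which is the cost that ANN indexes trade a little accuracy to avoid.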



In [None]:
import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_ollama import OllamaEmbeddings
from langchain_core.documents import Document


def create_vector_store(model: str = "granite-embedding:278m"):
    # Wrap our Ollama model
    embedding_provider = OllamaEmbeddings(model=model)

    # Embed a dummy query to find the embedding dimension,
    # then create a flat index that performs exact L2 search
    index = faiss.IndexFlatL2(len(embedding_provider.embed_query("hello world")))

    # Store the documents in memory for now
    vector_store = FAISS(
        embedding_function=embedding_provider,
        index=index,
        docstore=InMemoryDocstore(),
        index_to_docstore_id={},
    )
    return vector_store


vector_store = create_vector_store()
vector_store.distance_strategy

### 2.1 Using the Vector Store  

To store text in the vector database, we wrap it in a `Document` object, which includes `page_content` for the text itself and optional `metadata` for additional context.

In [None]:
documents = []
for i, sentence in enumerate(example_sentences):
    document = Document(page_content=sentence, metadata={"document_id": i})
    documents.append(document)

ids = vector_store.add_documents(documents=documents)

### 2.2 Querying the Vector Store  

To retrieve similar documents, we can use the `similarity_search_with_score` function, which returns the `k` closest documents together with a score. With the L2 index created above, the score is a distance, so lower values indicate closer matches.

In [None]:
results = vector_store.similarity_search_with_score(query="cat", k=3)
# The score is an L2 distance: lower means more similar
for doc, score in results:
    print(f"* [L2={score:.3f}] {doc.page_content} [{doc.metadata}]")
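
If a cosine similarity is easier to interpret than a distance, the two can be converted into each other under some assumptions. The sketch below assumes that the score is a squared L2 distance (which is what a flat L2 FAISS index reports) and that the embeddings have unit length. Both assumptions should be checked for the model and store in use. The helper names are hypothetical.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2_to_cosine(squared_distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    For unit vectors: ||a - b||^2 = 2 - 2 * cos(theta).
    """
    return 1.0 - squared_distance / 2.0


a = normalize([3.0, 4.0])  # [0.6, 0.8]
b = normalize([4.0, 3.0])  # [0.8, 0.6]

squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))
converted = squared_l2_to_cosine(squared_distance)

print(f"squared L2: {squared_distance:.2f}")
print(f"cosine (direct):    {cosine:.2f}")
print(f"cosine (converted): {converted:.2f}")
```

If the embeddings are not normalized, the identity does not hold and the distances should be read as distances only.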

## 3. Indexing a Codebase  

<img src="https://python.langchain.com/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png" alt="Langchain Pipeline" style="width:800px;">  

Now, let's work with a real codebase and index it. As an example, we'll use the [TEI-Client](https://github.com/LLukas22/tei-client) repository, which is small and easy to understand.  

The code below clones the repository into the [`repo`](./repo/) folder. If you don’t have `git` installed, you can manually download the code from [GitHub](https://github.com/LLukas22/tei-client) and place it in the [`repo`](./repo/) directory.


In [None]:
from git import Repo
from pathlib import Path

repo_path = Path("repo").resolve()
repo_url = "https://github.com/LLukas22/tei-client.git"

# Clone only if the target folder is missing or empty, so the cell can be re-run safely
if not repo_path.exists() or not any(repo_path.iterdir()):
    try:
        Repo.clone_from(repo_url, repo_path)
    except Exception as e:
        print(f"Failed to clone repository: {e}")

### 3.1 Loading Code Files  

LangChain provides utility functions to efficiently locate and load files as `Document` objects.  

The following code demonstrates how to accomplish this:

In [None]:
from langchain_community.document_loaders.directory import DirectoryLoader
from langchain_community.document_loaders.text import TextLoader

loader = DirectoryLoader(repo_path, glob="**/*.py", loader_cls=TextLoader)
docs = loader.load()

for doc in docs:
    print("_" * 8)
    print(f"Source: {doc.metadata['source']}")
    print(f"Characters: {len(doc.page_content)}")

### 3.2 Challenges in Embedding Code  

When adding code files to the vector store, we may encounter errors due to their large size. This happens because our embedding model has a token limit and cannot process documents exceeding that limit in a single pass.  

To resolve this, we need to split our `Document` objects into smaller, manageable chunks before embedding them. This ensures that each chunk stays within the model's token constraints while preserving the overall structure and meaning of the code.

In [None]:
from ollama import ResponseError

# This will fail if the document exceeds the maximum context length
vector_store = create_vector_store()
try:
    ids = vector_store.add_documents(docs)
except ResponseError as e:
    if e.status_code == 500:
        print("Document exceeded the maximum context length")
    else:
        print(f"Failed to add document: {e}")
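
The idea behind chunking can be sketched without any library: cut the text into fixed-size windows that overlap slightly, so that content at a boundary appears in both neighbouring chunks. `split_with_overlap` is a hypothetical helper for illustration, not LangChain's splitter. It counts characters, which only approximates the model's token limit, and it ignores code structure.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


source = "def embed(text):\n    return client.embed(text)\n" * 3
chunks = split_with_overlap(source, chunk_size=40, overlap=10)

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk)} characters")
```

A structure-aware splitter improves on this by preferring to cut at natural boundaries such as function or class definitions instead of at arbitrary character positions.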

### 3.3 Your Task: Properly Splitting and Indexing Code Files  

To effectively store and search code files, we need to:  

1. **Load the code files** using LangChain’s [`GenericLoader` and `LanguageParser`](https://python.langchain.com/docs/integrations/document_loaders/source_code/). These tools help extract structured information from source code files, making them easier to process.  

2. **Split the files into manageable chunks** using [`RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/integrations/document_loaders/source_code/#splitting). This ensures that each piece remains within the embedding model's token limit while maintaining logical code segments for meaningful retrieval.  

By following this approach, we can efficiently index the codebase while preserving its readability and searchability.

In [None]:
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

# Your code goes here


Now we should be able to add the documents to our vector store without problems.

In [None]:
vector_store = create_vector_store()
try:
    ids = vector_store.add_documents(docs)
    print(f"Added {len(ids)} documents!")
except ResponseError as e:
    if e.status_code == 500:
        print("Document exceeded the maximum context length")
    else:
        print(f"Failed to add document: {e}")

### 3.4 Searching the Codebase  

With our codebase successfully indexed, we can now perform searches using the vector store. Let's test it with a few queries:  

- **"Where is the `embed` function implemented?"**  
- **"How is reranking handled?"**  
- **"How can I create a client?"**  

These queries should return the most relevant code snippets, making it easy to locate specific implementations within the codebase.

In [None]:
results = vector_store.similarity_search_with_score(
    query="Where is the `embed` function implemented?", k=5
)
# The score is an L2 distance: lower means more similar
for doc, score in results:
    print("-" * 10)
    print(f"* [L2={score:.3f}]")
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")