## **Semantic Search System Using Document Embeddings and Vector Databases**

#### **Introduction:**
Efficiently searching and retrieving relevant information from large document collections is crucial for businesses and organizations. Traditional keyword-based search systems often fall short when queries are complex or when users are unsure of the exact terms to search for. This is where **semantic search** comes into play.

Semantic search leverages **natural language processing (NLP)** and **machine learning** to understand the meaning behind words and phrases, enabling more accurate and context-aware retrieval of information. In this lecture, we will explore how to build a semantic search system using **document embeddings** and **vector databases**.

#### **Key Concepts Covered:**
1. **Document Embeddings**: Representing text as high-dimensional vectors that capture semantic meaning.
2. **Vector Databases**: Storing and querying embeddings efficiently for fast retrieval.
3. **Visualizing Embeddings**: Using dimensionality reduction techniques like t-SNE to visualize high-dimensional data in 2D.
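
To make the idea of "vectors that capture semantic meaning" concrete, here is a minimal sketch of how two embeddings are usually compared: with cosine similarity. The three-dimensional vectors and their names are made up for illustration; real embeddings have hundreds or thousands of dimensions and come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumed values, not produced by a real model)
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
office_hours = [0.0, 0.2, 0.9]

related = cosine_similarity(refund_policy, money_back)
unrelated = cosine_similarity(refund_policy, office_hours)
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

Texts with similar meaning should end up with a higher similarity score than unrelated texts, even when they share no keywords.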


In [None]:
# Import libraries (loaders live in langchain_community, splitters in langchain_text_splitters)
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go


import os
import glob
from dotenv import load_dotenv
import gradio as gr

#### **Configuring the Model and Database**


In [None]:
MODEL = "gpt-4o-mini"
db_name = "vector_db"

#### **Loading Environment Variables**


In [None]:
load_dotenv(override=True)
openai_api_key = os.getenv('OPENAI_API_KEY')
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY not found in .env file")
os.environ['OPENAI_API_KEY'] = openai_api_key

#### **Loading and Preprocessing Documents**

Let's load the Markdown documents from the subfolders of the `Palazon Global Database` directory and tag each one with its folder name as the document type.

In [None]:
folders = glob.glob("../Palazon Global Database/*")
text_loader_kwargs = {'encoding': 'utf-8'}
documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

#### **Splitting Documents into Chunks**


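Let's split the documents into smaller chunks of roughly 1,000 characters each, with an overlap of up to 200 characters so that context is preserved across chunk boundaries. Note that `CharacterTextSplitter` splits on a separator (a blank line by default), so a chunk can exceed the target size when a single paragraph is longer than 1,000 characters.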

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)
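
To illustrate what chunk size and overlap mean, here is a simplified sketch of a fixed-width sliding window. It is not how `CharacterTextSplitter` works internally (that class splits on a separator and then merges pieces); the function name `split_with_overlap` is hypothetical and exists only for this illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


sample_text = "abcdefghijklmnopqrstuvwxy"  # 25 characters
sample_chunks = split_with_overlap(sample_text, chunk_size=10, chunk_overlap=3)
for piece in sample_chunks:
    print(piece)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.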

#### **Analyzing Document Types**

Let's identify and print the unique document types found in the dataset, to provide insights into the data structure.

In [None]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(sorted(doc_types))}")
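
Counting chunks per type often says more about the data than the type names alone. The sketch below uses a small hand-written list of metadata dictionaries as a stand-in for `chunk.metadata`; the values are assumptions for illustration, not the real dataset.

```python
from collections import Counter

# Hypothetical stand-in for [chunk.metadata for chunk in chunks]
chunk_metadata = [
    {"doc_type": "products"},
    {"doc_type": "products"},
    {"doc_type": "employees"},
    {"doc_type": "contracts"},
    {"doc_type": "products"},
]

# Tally how many chunks belong to each document type
type_counts = Counter(meta["doc_type"] for meta in chunk_metadata)
for doc_type, count in type_counts.most_common():
    print(f"{doc_type}: {count} chunks")
```

A heavily skewed distribution is worth knowing about before indexing, since the dominant type will also dominate the visualization later on.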

#### **Generating Embeddings using OpenAI**

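Here we instantiate the OpenAI embedding model. It converts each text chunk into a high-dimensional vector that captures semantic meaning. No embeddings are computed in this cell; that happens when the chunks are indexed in the next step.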

In [None]:
embeddings = OpenAIEmbeddings()

### **Next step: Indexing**  

Indexing in a **RAG (Retrieval-Augmented Generation) system** involves converting documents into **vector embeddings** and storing them in a **vector database** for efficient retrieval. Here’s what actually happens:  

1. **Text Embedding:**  
   - The embedding model (here OpenAI's, via `OpenAIEmbeddings`; an open-source alternative would be `sentence-transformers/all-MiniLM-L6-v2`) converts each document (or chunk) into a high-dimensional numerical vector.  
   - This transformation captures the semantic meaning of the text.  

2. **Storing in the Vector Database:**  
   - The generated embeddings are stored in a **vector store** (e.g., **ChromaDB**) alongside metadata like document type.  
   - These embeddings act as **index entries**, allowing the system to quickly find semantically similar texts.  
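
The two steps above can be sketched with a toy in-memory index. This is a simplified illustration, not Chroma's API: the `ToyVectorIndex` class, its methods, and the three-dimensional vectors are all hypothetical, and a real vector database typically uses an approximate nearest-neighbor index rather than the brute-force scan shown here.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class ToyVectorIndex:
    """Hypothetical in-memory stand-in for a vector store."""

    def __init__(self):
        self.entries = []

    def add(self, vector, text, metadata):
        # Step 2: store the embedding next to its text and metadata
        self.entries.append({"vector": vector, "text": text, "metadata": metadata})

    def query(self, vector, k=2):
        # Rank every stored entry by similarity to the query vector (brute force)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(vector, entry["vector"]),
            reverse=True,
        )
        return ranked[:k]


# Step 1 is faked here: these vectors are assumed, not produced by a model
index = ToyVectorIndex()
index.add([0.9, 0.1, 0.0], "Refund policy for products", {"doc_type": "products"})
index.add([0.1, 0.9, 0.1], "Employee onboarding guide", {"doc_type": "employees"})
index.add([0.0, 0.2, 0.9], "Contract renewal terms", {"doc_type": "contracts"})

query_vector = [0.8, 0.2, 0.1]  # pretend embedding of a refund-related question
top_hits = index.query(query_vector, k=2)
for hit in top_hits:
    print(hit["metadata"]["doc_type"], "-", hit["text"])
```

At query time the user's question is embedded with the same model, and the store returns the entries whose vectors are closest to it.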

In [None]:
# Managing the Chroma Vector Store

if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

# Indexing
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} chunks")

The Chroma vector store is initialized. If an existing store is found, it is deleted to start fresh. The new store is populated with document chunks and their embeddings.

---

#### **Analyzing Embedding Dimensions**

Let's retrieve a sample embedding from the collection and print its dimensionality.

In [None]:
collection = vectorstore._collection
sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"The vectors have {dimensions:,} dimensions")

#### **Visualizing the Vector Store**


In [None]:
result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
doc_types = [metadata['doc_type'] for metadata in result['metadatas']]
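# Map each document type to a color; unknown types fall back to gray instead of raising an error
color_map = {'products': 'blue', 'employees': 'green', 'contracts': 'red', 'company': 'orange'}
colors = [color_map.get(t, 'gray') for t in doc_types]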

The code above retrieves the embeddings, the chunk texts, and the metadata from the collection for visualization. Each document type is assigned its own color so that the types are easy to tell apart in the plot.

---

#### **Reducing Dimensionality with t-SNE**

In [None]:
tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

### **What is t-SNE?**
**t-SNE (t-Distributed Stochastic Neighbor Embedding)** is a dimensionality reduction technique used to visualize high-dimensional data in 2D or 3D. It preserves the local structure of the data, meaning points that are close in high-dimensional space remain close in the reduced space.

### **How It Works**
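- t-SNE converts pairwise distances into similarity probabilities in both the high-dimensional and the low-dimensional space, then moves the low-dimensional points to minimize the Kullback-Leibler divergence between the two distributions.
- It emphasizes local relationships, which makes it well suited to visualizing clusters of semantically similar documents. Distances between clusters and the apparent size of clusters are less reliable and should not be over-interpreted.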
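
The objective described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it uses one fixed Gaussian bandwidth `sigma` instead of the per-point bandwidths that real t-SNE derives from the perplexity, and it only evaluates the cost of a given layout rather than optimizing it. All function names and data points are assumptions made for this example.

```python
import numpy as np


def pairwise_sq_dists(points):
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)


def high_dim_affinities(X, sigma=1.0):
    # Gaussian similarities in the original space, normalized over all pairs
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()


def low_dim_affinities(Y):
    # Heavy-tailed Student-t similarities in the 2D layout
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()


def kl_divergence(P, Q, eps=1e-12):
    mask = P > 0  # terms with P == 0 contribute nothing
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))


# Two well-separated clusters in 3D (toy data)
X = np.array([[0, 0, 0], [0.5, 0, 0], [0, 0.5, 0],
              [10, 10, 10], [10.5, 10, 10], [10, 10.5, 10]], dtype=float)

# A 2D layout that keeps neighbors together, and one that mixes the clusters
good_layout = np.array([[0, 0], [0.5, 0], [0, 0.5],
                        [5, 5], [5.5, 5], [5, 5.5]], dtype=float)
bad_layout = good_layout[[0, 3, 1, 4, 2, 5]]

P = high_dim_affinities(X)
kl_good = kl_divergence(P, low_dim_affinities(good_layout))
kl_bad = kl_divergence(P, low_dim_affinities(bad_layout))
print(f"KL for neighbor-preserving layout: {kl_good:.3f}")
print(f"KL for mixed-up layout:            {kl_bad:.3f}")
```

The layout that keeps high-dimensional neighbors close together gets the lower cost, which is the quantity t-SNE drives down by gradient descent.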

### **Why Is It Necessary?**
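Embeddings with hundreds or thousands of dimensions cannot be plotted directly, so some form of dimensionality reduction is needed before we can look at them. t-SNE is one common choice (PCA and UMAP are alternatives). It bridges the gap between abstract numerical representations and human-readable visuals, which makes it easier to debug and analyze the semantic search system.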

#### **Creating the 2D Scatter Plot**


In [None]:
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])
fig.update_layout(
    title='2D Chroma Vector Store Visualization',
    xaxis_title='x',
    yaxis_title='y',
    width=800,
    height=600,
    margin=dict(r=20, b=10, l=10, t=40)
)
fig.show()

---

### **Conclusion**
This lecture introduced the concept of **semantic search** and demonstrated how to build a semantic search system using **document embeddings** and **vector databases**. By the end of this session, students should understand how to:
1. Load and preprocess documents.
2. Generate embeddings and store them in a vector database.
3. Visualize embeddings to gain insights into the data.

*This foundation can later be extended to include a generative component, transforming the system into a full **Retrieval-Augmented Generation (RAG)** pipeline.*