## **Semantic Search System Using Document Embeddings and Vector Databases**

#### **Introduction:**
Searching and retrieving relevant information from large document collections is a core need for many businesses and organizations. Keyword-based search systems often fall short on complex queries or when users do not know the exact terms to search for. This is where **semantic search** comes into play.

Semantic search leverages **natural language processing (NLP)** and **machine learning** to understand the meaning behind words and phrases, enabling more accurate and context-aware retrieval of information. In this lecture, we will explore how to build a semantic search system using **document embeddings** and **vector databases**.

#### **Key Concepts Covered:**
1. **Document Embeddings**: Representing text as high-dimensional vectors that capture semantic meaning.
2. **Vector Databases**: Storing and querying embeddings efficiently for fast retrieval.
3. **Visualizing Embeddings**: Using dimensionality reduction techniques like t-SNE to visualize high-dimensional data in 2D.


In [1]:
# importing libraries
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.llms import HuggingFaceHub
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

import os
import glob
from dotenv import load_dotenv
import gradio as gr

#### **Configuring the Database**


In [2]:
db_name = "vector_db"

#### **Loading and Preprocessing Documents**

Let's load the Markdown documents from the subfolders of the `Palazon Global Database` directory and tag each one with the name of the subfolder it came from.

In [3]:
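# Each subfolder of the knowledge base is treated as one document type
folders = glob.glob("../Palazon Global Database/*")
text_loader_kwargs = {'encoding': 'utf-8'}
documents = []
for folder in folders:
    doc_type = os.path.basename(folder)  # the folder name becomes the doc_type label
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    folder_docs = loader.load()
    # Tag every document with its type so it can be inspected or filtered later
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)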

#### **Splitting Documents into Chunks**


Let's split the documents into smaller chunks of about 1000 characters each, with an overlap of 200 characters so that context is preserved across chunk boundaries.

In [4]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)
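
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking written in plain Python. It is a simplification, not how `CharacterTextSplitter` is implemented: the helper `chunk_text` is hypothetical and cuts at fixed character offsets, whereas the LangChain splitter first splits on a separator and then merges the pieces.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Slice text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: 2500 characters of repeating digits (illustrative only)
sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

The last 200 characters of one chunk are repeated at the start of the next, which is what keeps a sentence that straddles a boundary retrievable from either side.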

#### **Analyzing Document Types**

Let's identify and print the unique document types found in the dataset, to provide insights into the data structure.

In [5]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(doc_types)}")

Document types found: products, company, contracts, employees
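
Beyond listing the types, it can be useful to see how many chunks each type contributes. The sketch below shows one way to do that with `collections.Counter`. The `sample_metadata` list is a made-up stand-in for `chunk.metadata`; in the notebook you would iterate over `chunks` instead. Sorting the result also gives a stable order, which a plain `set` does not guarantee.

```python
from collections import Counter

# Hypothetical stand-in for [chunk.metadata for chunk in chunks]
sample_metadata = [
    {"doc_type": "products"},
    {"doc_type": "employees"},
    {"doc_type": "products"},
    {"doc_type": "company"},
    {"doc_type": "contracts"},
    {"doc_type": "employees"},
    {"doc_type": "products"},
]

# Count how many chunks carry each doc_type label
counts = Counter(meta["doc_type"] for meta in sample_metadata)

# Sorted output keeps the listing stable between runs
for doc_type, n in sorted(counts.items()):
    print(f"{doc_type}: {n} chunks")
```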


#### **Generating Embeddings using HuggingFace**

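We use a HuggingFace sentence-transformers model to convert each text chunk into a high-dimensional vector that captures its semantic meaning. The cells below install the integration package and load the model; the vectors themselves are computed when the chunks are indexed.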

In [6]:
! pip install -U langchain-huggingface




#### **Loading the Embedding Model**

In [7]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

#### **Next step: Indexing**  

Indexing in a **RAG (Retrieval-Augmented Generation) system** involves converting documents into **vector embeddings** and storing them in a **vector database** for efficient retrieval. Here’s what actually happens:  

1. **Text Embedding:**  
   - The embedding model (**e.g., `sentence-transformers/all-MiniLM-L6-v2`**) converts each document (or chunk) into a high-dimensional numerical vector.  
   - This transformation captures the semantic meaning of the text.  

2. **Storing in the Vector Database:**  
   - The generated embeddings are stored in a **vector store** (e.g., **ChromaDB**) alongside metadata like document type.  
   - These embeddings act as **index entries**, allowing the system to quickly find semantically similar texts.  
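
The retrieval idea behind these two steps can be sketched with NumPy alone. This is a toy illustration under stated assumptions: the 3-dimensional vectors are invented, the helper `cosine_top_k` is hypothetical, and it scans every vector by brute force. A real vector database typically uses an index rather than a full scan and may use a different distance metric.

```python
import numpy as np


def cosine_top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 2):
    """Return (index, score) pairs for the k rows most similar to query_vec."""
    # Normalise to unit length so the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores)[:k]  # highest scores first
    return [(int(i), float(scores[i])) for i in top]


# Invented 3-dimensional "embeddings" for three chunks
doc_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
])
query = np.array([1.0, 0.0, 0.0])

results = cosine_top_k(query, doc_vectors, k=2)
for idx, score in results:
    print(f"chunk {idx}: similarity {score:.3f}")
```

Chunks 0 and 2 point in nearly the same direction as the query, so they rank above chunk 1, which is orthogonal to it.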

In [20]:
# Managing the Chroma Vector Store

if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

# Indexing
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")

Vectorstore created with 62 documents


The cell above initializes the Chroma vector store. If a store already exists at `db_name`, its collection is deleted so that indexing starts fresh. The new store is then populated with the document chunks and their embeddings.

---

#### **Analyzing Embedding Dimensions**

Let's retrieve a sample embedding and print its dimensionality.

In [21]:
collection = vectorstore._collection
sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"The vectors have {dimensions:,} dimensions")

The vectors have 384 dimensions
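
Vectors with 384 dimensions cannot be plotted directly, which is where t-SNE comes in. The sketch below reduces a set of vectors to 2D with scikit-learn's `TSNE`. The input is synthetic (three random clusters, generated here only for illustration); in the notebook the vectors would presumably come from something like `collection.get(include=["embeddings", "metadatas"])`. The `perplexity` value is an assumption and must stay below the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for real embeddings: 3 clusters of 20 vectors, 384 dimensions each
centers = rng.normal(size=(3, 384))
vectors = np.vstack([c + 0.1 * rng.normal(size=(20, 384)) for c in centers])
labels = np.repeat(["a", "b", "c"], 20)  # hypothetical cluster labels

# Project from 384 dimensions down to 2 for plotting
tsne = TSNE(n_components=2, perplexity=15, random_state=42)
reduced = tsne.fit_transform(vectors)

print(reduced.shape)
```

The two columns of `reduced` can then be passed as `x` and `y` to a Plotly `go.Scatter` trace with `mode="markers"`, colored by document type, to see whether chunks of the same type cluster together.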


In [22]:
# Load the embeddings model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Load environment variables (including HF_TOKEN) from a .env file
load_dotenv()

# Get the Hugging Face token
HF_TOKEN = os.getenv("HF_TOKEN")

# Use an LLM hosted on the Hugging Face Hub
llm = HuggingFaceHub(repo_id="Qwen/Qwen2.5-Coder-32B-Instruct", huggingfacehub_api_token=HF_TOKEN, model_kwargs={"temperature": 0.7})

#llm = HuggingFaceHub(repo_id="HuggingFaceH4/zephyr-7b-alpha", huggingfacehub_api_token=HF_TOKEN, model_kwargs={"temperature": 0.7})  

# Set up memory for conversation
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Use the retriever from your vectorstore
retriever = vectorstore.as_retriever()

# Create the conversational retrieval chain
conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm, retriever=retriever, memory=memory
)
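
Conceptually, the chain retrieves the chunks most relevant to the question and places them into a prompt for the LLM. The sketch below illustrates only that "stuffing" step. The opening sentence of the template is taken from the output printed later in this notebook; the `Question:` / `Helpful Answer:` suffix and the helper `build_prompt` are assumptions made for illustration, and the example chunks are invented. The real chain may also rewrite follow-up questions using the chat history before retrieving.

```python
# Template: first sentence as printed in this notebook's output, suffix assumed
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Join the retrieved chunks and insert them, with the question, into the template."""
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Invented chunks standing in for the retriever's results
example_chunks = [
    "Chunk one: placeholder text about the company.",
    "Chunk two: placeholder text about a product.",
]
prompt = build_prompt("What does the company do?", example_chunks)
print(prompt)
```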

#### **Rough Test Material**

In [24]:
query = "Can you describe Palazon Global in a few sentences"
result = conversation_chain.invoke({"question":query})
print(result["answer"])

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

# Chukwuemeka James

## Summary  
- **Date of Birth**: April 10, 1980  
- **Job Title**: Co-Founder & Chief Executive Officer (CEO)  
- **Location**: Lagos, Nigeria  

## Palaozon Global Career Progression  
- **2018 - Present**: Co-Founder & CEO  
  Chukwuemeka James co-founded Palaozon Global with a vision to revolutionize renewable energy accessibility in Africa. Under his leadership, the company has grown into a premier provider of solar power solutions.  

- **2014 - 2018**: Director of Energy Solutions at GreenTech Africa  
  Before launching Palaozon Global, Chukwuemeka led large-scale solar projects, helping businesses transition to renewable energy sources.  

- **2010 - 2014**: Senior Engineer at EnergyWise Consulting  
  Chukwuemeka worked as a renewable energy consultant, advising companies on cost-effective sola

Note that the printed answer begins with the prompt template and the retrieved context rather than a short reply to the question, and the output shown here is cut off mid-word.

#### **Let's Use Gradio**

In [None]:
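# Wrap the chain in a function with the (message, history) signature Gradio expects.
# history is unused here because the chain keeps its own conversation memory.
def chat(message, history):
    result = conversation_chain.invoke({"question": message})
    return result["answer"]

# Launch the chat interface and open it in the browser
view = gr.ChatInterface(chat, type="messages").launch(inbrowser=True)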