<a href="https://colab.research.google.com/github/Rohit-Munda/GenAIWorkshop/blob/main/vectordb_chunking_indexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 Vector Database Workshop: Introduction to ChromaDB & Indexing

Welcome! In this hands-on session, we will:

*   Set up a vector database
*   Chunk documents and store the chunks in the vector database
*   Run a similarity search against the vector store

In [1]:
!pip install -q langchain_community langchain-text-splitters chromadb sentence-transformers


## 📄 Step 1: Create and Load a Sample Document

In [2]:

from langchain_community.document_loaders import TextLoader

# Create a sample text file
sample_text = (
    "Artificial Intelligence (AI) is transforming industries across the globe. "
    "From healthcare to finance, AI applications are driving innovation. "
    "Large Language Models (LLMs) are at the core of this revolution, enabling machines to understand and generate human-like language. "
    "In this session, we explore how to prepare documents for LLMs using LangChain."
)

with open("sample_doc.txt", "w") as f:
    f.write(sample_text)

# Load the document
loader = TextLoader("sample_doc.txt")
documents = loader.load()

print("Loaded document:")
print(documents[0].page_content)

Loaded document:
Artificial Intelligence (AI) is transforming industries across the globe. From healthcare to finance, AI applications are driving innovation. Large Language Models (LLMs) are at the core of this revolution, enabling machines to understand and generate human-like language. In this session, we explore how to prepare documents for LLMs using LangChain.


## 🔁 Step 2: Recursive Splitting with `RecursiveCharacterTextSplitter`

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=15)
recursive_chunks = recursive_splitter.split_documents(documents)

print(f"Number of chunks: {len(recursive_chunks)}")
for i, chunk in enumerate(recursive_chunks):
    print(f"Chunk {i+1}: {chunk.page_content}")

Number of chunks: 11
Chunk 1: Artificial Intelligence (AI) is transforming
Chunk 2: transforming industries across the globe. From
Chunk 3: globe. From healthcare to finance, AI
Chunk 4: to finance, AI applications are driving
Chunk 5: are driving innovation. Large Language Models
Chunk 6: Models (LLMs) are at the core of this revolution,
Chunk 7: revolution, enabling machines to understand and
Chunk 8: understand and generate human-like language. In
Chunk 9: language. In this session, we explore how to
Chunk 10: explore how to prepare documents for LLMs using
Chunk 11: for LLMs using LangChain.
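
The overlap visible in the chunks above comes from two settings, `chunk_size` and `chunk_overlap`. As a rough illustration of how overlap works, here is a simplified character-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as newlines and spaces, which is why the chunks above end on word boundaries. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.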


In [4]:
recursive_chunks

[Document(metadata={'source': 'sample_doc.txt'}, page_content='Artificial Intelligence (AI) is transforming'),
 Document(metadata={'source': 'sample_doc.txt'}, page_content='transforming industries across the globe. From'),
 Document(metadata={'source': 'sample_doc.txt'}, page_content='globe. From healthcare to finance, AI'),
 Document(metadata={'source': 'sample_doc.txt'}, page_content='to finance, AI applications are driving'),
 Document(metadata={'source': 'sample_doc.txt'}, page_content='are driving innovation. Large Language Models'),
 Document(metadata={'source': 'sample_doc.txt'}, page_content='Models (LLMs) are at the core of this revolution,'),
 Document(metadata={'source': 'sample_doc.txt'}, page_content='revolution, enabling machines to understand and'),
 Document(metadata={'source': 'sample_doc.txt'}, page_content='understand and generate human-like language. In'),
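 Document(metadata={'source': 'sample_doc.txt'}, page_content='language. In this session, we explore how to'),
 Document(metadata={'source': 'sample_doc.txt'}, page_content='explore how to prepare documents for LLMs using'),
 Document(metadata={'source': 'sample_doc.txt'}, page_content='for LLMs using LangChain.')]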

## 📥 Step 3: Initialize the Chroma Database and Create a Collection

In [None]:
import chromadb
from chromadb.config import Settings

# Create an in-memory client with anonymized telemetry disabled
chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))

# Create a new collection for your text chunks
collection = chroma_client.create_collection(name="text_chunks_demo")

## 📄 Step 4: Generate Embeddings

In [10]:
from sentence_transformers import SentenceTransformer

# Load embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Extract the text of each chunk from recursive_chunks
chunk_texts = [chunk.page_content for chunk in recursive_chunks]

# Generate embeddings and convert the NumPy array to plain lists for Chroma
embeddings = embedding_model.encode(chunk_texts).tolist()
print(f"Created {len(embeddings)} embeddings.")

Created 11 embeddings.
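
Each embedding is a list of floats (384 of them for `all-MiniLM-L6-v2`), and texts with similar meaning should end up with vectors that point in similar directions. A minimal sketch of that idea, using cosine similarity on made-up 3-dimensional vectors: the vector values and variable names below are hypothetical stand-ins, not real model output.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values)
vec_llm = [0.9, 0.1, 0.0]
vec_language_model = [0.8, 0.2, 0.1]
vec_finance = [0.0, 0.2, 0.9]

# Vectors pointing in similar directions score close to 1
print(cosine_similarity(vec_llm, vec_language_model))
# Vectors pointing in different directions score close to 0
print(cosine_similarity(vec_llm, vec_finance))
```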


## 📝 Step 5: Add Chunks and Embeddings to Chroma

In [11]:
# Add each chunk with its embedding to the Chroma collection
# We'll use the chunk index as the ID
ids = [f"chunk_{i}" for i in range(len(chunk_texts))]

collection.add(
    documents=chunk_texts,
    embeddings=embeddings,
    ids=ids
)

print("Chunks and their embeddings have been added to Chroma!")

Chunks and their embeddings have been added to Chroma!


## 🔍 Step 6: Similarity Search in Chroma

In [12]:
# Define a query
query = "What are Large Language Models?"

# Embed the query
query_embedding = embedding_model.encode([query]).tolist()

# Search in Chroma
results = collection.query(
    query_embeddings=query_embedding,
    n_results=3  # Show top 3 most similar chunks
)

# Chroma returns distances (squared L2 by default), so lower values mean closer matches
print("Most similar chunks to your query:")
for doc, distance in zip(results['documents'][0], results['distances'][0]):
    print(f"Score: {distance:.4f} | Chunk: {doc}")

Most similar chunks to your query:
Score: 0.7126 | Chunk: are driving innovation. Large Language Models
Score: 1.0036 | Chunk: understand and generate human-like language. In
Score: 1.1861 | Chunk: language. In this session, we explore how to
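
The values printed above are distances, so the best match is the one with the smallest number. Conceptually, a query compares the query vector against every stored vector and keeps the closest few. The sketch below does this with an exhaustive scan over toy 2-dimensional vectors, assuming squared L2 distance. Chroma itself relies on an approximate HNSW index rather than a full scan, and the helper names `squared_l2` and `top_k` are hypothetical.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: list[float], ids: list[str], vectors: list[list[float]], k: int):
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = [(squared_l2(query, vec), doc_id) for doc_id, vec in zip(ids, vectors)]
    scored.sort()  # ascending distance: smaller means more similar
    return [(doc_id, dist) for dist, doc_id in scored[:k]]


# Toy data standing in for stored chunk embeddings (hypothetical values)
toy_ids = ["chunk_0", "chunk_1", "chunk_2"]
toy_vectors = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
toy_query = [1.0, 0.0]

for doc_id, dist in top_k(toy_query, toy_ids, toy_vectors, k=2):
    print(f"{doc_id}: {dist:.4f}")
```

An exhaustive scan like this is exact but grows linearly with the number of stored vectors, which is the cost an index is meant to avoid.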


## ✅ Summary
You learned how to split text, create embeddings, store them in a vector database (Chroma), and perform similarity search.

This is the foundation for search and retrieval in modern AI applications!