# Vector Database Setup and Exploration

This notebook explores ChromaDB for our LLM long-term memory project.

## Goals:
- Install and set up ChromaDB
- Create a simple collection
- Add some sample data
- Test basic similarity search
- Understand how embeddings work

## What we're building:
A memory system that stores conversations and retrieves similar past interactions.
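To make the idea concrete before bringing in ChromaDB, here is a minimal sketch of a "store and recall by similarity" memory. It is an illustration only: `toy_embed`, `cosine_similarity`, and `recall` are hypothetical helpers, and a bag-of-words count vector stands in for a real embedding. ChromaDB uses a neural embedding model instead, which captures meaning rather than exact word overlap.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(query, memory, top_k=1):
    """Return the top_k stored texts most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(
        memory,
        key=lambda text: cosine_similarity(q, toy_embed(text)),
        reverse=True,
    )
    return ranked[:top_k]


# A tiny hypothetical memory of past interactions.
memory = [
    "user wants to learn python programming",
    "discussion about web scraping with beautifulsoup",
]

print(recall("i want to learn programming", memory))
```

The first stored text shares three words with the query and the second shares none, so the first one is recalled. A real embedding model would also match texts that share meaning but no words.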

In [6]:
# Import the libraries used throughout this notebook.
import chromadb
import os
from datetime import datetime

# Confirm the imports worked and record the environment.
print("Libraries imported successfully!")
print(f"ChromaDB version: {chromadb.__version__}")
print(f"Current time: {datetime.now()}")


Libraries imported successfully!
ChromaDB version: 1.0.20
Current time: 2025-08-30 12:59:21.019075


In [7]:
# Create an in-memory ChromaDB client. Its data is not persisted and is lost when the kernel restarts.
client = chromadb.Client()

# Try to get existing collection or create new one
try:
    collection = client.get_collection(name="conversation_memory")
    print("Found existing collection!")
except Exception:  # get_collection raises when the collection does not exist
    collection = client.create_collection(
        name="conversation_memory",
        metadata={"description": "Stores conversation history so the LLM can keep long-term context across conversations."}
    )
    print("Created new collection!")

print(f"Collection name: {collection.name}")
print(f"Collection count: {collection.count()}")
print("Ready to store memories!")

Created new collection!
Collection name: conversation_memory
Collection count: 0
Ready to store memories!


In [None]:
# Add some sample conversation summaries to use for testing while building the agent.
sample_conversations = [
    "User asked about machine learning basics and showed interest in neural networks",
    "User wants to learn Python programming and mentioned they are a beginner",
    "Discussion about building a web scraping project using BeautifulSoup",
    "User asked for help with data visualization using matplotlib and pandas",
    "Conversation about setting up a virtual environment for Python projects"
]



# Store each conversation under a simple sequential ID (conv_1, conv_2, ...).
for i, conversation in enumerate(sample_conversations):
    collection.add(
        documents=[conversation],
        ids=[f"conv_{i+1}"]
    )

print(f"Added {len(sample_conversations)} conversations to memory!")
print(f"Total conversations in collection: {collection.count()}")


# The first add() call can take a few minutes because ChromaDB downloads its
# default embedding model (all-MiniLM-L6-v2). The model converts each text into
# a 384-dimensional vector, which works well for semantic similarity search.



/home/dakshchoudhary/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [06:22<00:00, 217kiB/s] 


Added 5 conversations to memory!
Total conversations in collection: 5
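The loop above calls `collection.add` once per conversation. Since `add` accepts lists for `documents` and `ids`, the same data could be sent in a single call. The sketch below only prepares the arguments; `build_batch` is a hypothetical helper, and the actual `collection.add(**batch)` call is left as a comment so the snippet runs without a database.

```python
def build_batch(conversations, prefix="conv", start=1):
    """Build keyword arguments for a single collection.add() call."""
    ids = [f"{prefix}_{i}" for i in range(start, start + len(conversations))]
    return {"documents": list(conversations), "ids": ids}


# Two of the sample conversations from the cell above.
batch = build_batch([
    "User wants to learn Python programming and mentioned they are a beginner",
    "Discussion about building a web scraping project using BeautifulSoup",
])

print(batch["ids"])

# With a real collection, one call would store the whole batch:
# collection.add(**batch)
```

Batching keeps the ID scheme in one place and avoids a round trip per document, which matters more as the memory grows.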


In [14]:
# Test the similarity search.
query = "I want to learn programming"  # Sample query for testing


# Search for similar conversations
results = collection.query(
    query_texts=[query],
    n_results=3  # Return the 3 most similar conversations
)


print(f"Query: '{query}'")
print("\nMost similar past conversations:")
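# Note: 1 - distance is only a rough score. With ChromaDB's default squared-L2
# distance it can go negative; a higher value still means a closer match.
for i, (doc, distance) in enumerate(zip(results['documents'][0], results['distances'][0])):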
    print(f"{i+1}. {doc}")
    print(f"   Similarity score: {1-distance:.3f}\n")


# The output below lists the closest matches for the query.


Query: 'I want to learn programming'

Most similar past conversations:
1. User wants to learn Python programming and mentioned they are a beginner
   Similarity score: 0.243

2. User asked about machine learning basics and showed interest in neural networks
   Similarity score: -0.437

3. Conversation about setting up a virtual environment for Python projects
   Similarity score: -0.546
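The negative scores above look odd for a "similarity". A likely explanation, stated here as an assumption rather than something the notebook verifies: the collection uses ChromaDB's default squared-L2 distance, and the embedding vectors are unit length. Under those two assumptions the squared distance `d` and the cosine similarity are related by `d = 2 - 2 * cos`, so `cos = 1 - d / 2`. The sketch below checks that identity on toy vectors; all function names are hypothetical.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_from_squared_l2(distance):
    """Convert squared-L2 distance to cosine similarity.

    Only valid when both vectors have unit length.
    """
    return 1.0 - distance / 2.0


# Two unit-length toy vectors (not real embeddings).
a = [1.0, 0.0]
b = [0.6, 0.8]

d = squared_l2(a, b)
print(f"squared L2 distance:  {d:.3f}")
print(f"cosine (direct):      {cosine(a, b):.3f}")
print(f"cosine (from dist.):  {cosine_from_squared_l2(d):.3f}")
```

If the assumptions hold for this collection, the top match above (score 0.243, so distance 0.757) would correspond to a cosine similarity of roughly 0.62, which is a more intuitive number to report.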



In [17]:
# Test another query to see how the search behaves.
query2 = "help me with data analysis and charts"

# Run the similarity search again.
results2 = collection.query(
    query_texts=[query2],
    n_results=3
)


# Print the matches.
print(f"Query: '{query2}'")
print("\nMost similar past conversations:")
for i, (doc, distance) in enumerate(zip(results2['documents'][0], results2['distances'][0])):
    print(f"{i+1}. {doc}")
    print(f"   Similarity score: {1-distance:.3f}\n")

Query: 'help me with data analysis and charts'

Most similar past conversations:
1. User asked for help with data visualization using matplotlib and pandas
   Similarity score: 0.164

2. User asked about machine learning basics and showed interest in neural networks
   Similarity score: -0.592

3. Discussion about building a web scraping project using BeautifulSoup
   Similarity score: -0.671



In [3]:
# Now add some more realistic conversations.
import json

# Detailed conversations, each with its own metadata.
detailed_conversations = [
    {
        "conversation": "User: I'm new to Python and want to build my first web scraper. Agent: Great! Let's start with requests and BeautifulSoup libraries.",
        "timestamp": "2024-08-28 10:30:00",
        "topic": "web scraping",
        "user_level": "beginner",
        "technologies": ["python", "requests", "beautifulsoup"]
    },
    {
        "conversation": "User: Can you help me understand neural networks? Agent: Sure! Neural networks are inspired by how the brain works...",
        "timestamp": "2024-08-28 14:15:00", 
        "topic": "machine learning",
        "user_level": "beginner",
        "technologies": ["neural networks", "AI", "deep learning"]
    }
]
# Print a sample of the prepared metadata.
print("Prepared detailed conversation data with metadata!")
print(f"Sample metadata: {detailed_conversations[0]['technologies']}")

Prepared detailed conversation data with metadata!
Sample metadata: ['python', 'requests', 'beautifulsoup']
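The next cell flattens each record by hand before storing it, serializing the `technologies` list to a JSON string. The sketch below wraps that step in a reusable function, on the assumption that metadata values should be simple scalars. `to_chroma_metadata` and the extra `timestamp_epoch` field are hypothetical additions, not part of the original notebook; the numeric field is included because numbers are easier to compare than timestamp strings, and it assumes the timestamps are in UTC, which the source does not state.

```python
import json
from datetime import datetime, timezone


def to_chroma_metadata(conv):
    """Flatten one conversation dict into scalar metadata values."""
    parsed = datetime.strptime(conv["timestamp"], "%Y-%m-%d %H:%M:%S")
    return {
        "timestamp": conv["timestamp"],
        # Assumption: timestamps are UTC; the source gives no timezone.
        "timestamp_epoch": int(parsed.replace(tzinfo=timezone.utc).timestamp()),
        "topic": conv["topic"],
        "user_level": conv["user_level"],
        # Lists are serialized to a JSON string, as in the notebook.
        "technologies": json.dumps(conv["technologies"]),
    }


example = {
    "conversation": "User: I'm new to Python and want to build my first web scraper.",
    "timestamp": "2024-08-28 10:30:00",
    "topic": "web scraping",
    "user_level": "beginner",
    "technologies": ["python", "requests", "beautifulsoup"],
}

meta = to_chroma_metadata(example)
print(meta)
```

The returned dictionary could be passed directly as one entry of the `metadatas` list in `collection.add`.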


In [10]:
# Store the detailed conversations in ChromaDB together with their metadata.
for i, conv_data in enumerate(detailed_conversations):
    collection.add(
        documents=[conv_data["conversation"]],
        metadatas=[{
            "timestamp": conv_data["timestamp"],
            "topic": conv_data["topic"],
            "user_level": conv_data["user_level"],
            "technologies": json.dumps(conv_data["technologies"])  # Convert list to JSON string
        }],
        ids=[f"detailed_conv_{i+1}"]
    )

print(f"Added {len(detailed_conversations)} detailed conversations with metadata!")
print(f"Total conversations in collection: {collection.count()}")

# Inspect how the stored metadata looks.
sample_result = collection.get(ids=["detailed_conv_1"], include=["metadatas"])
print(f"\nSample metadata: {sample_result['metadatas'][0]}")


Added 2 detailed conversations with metadata!
Total conversations in collection: 2

Sample metadata: {'technologies': '["python", "requests", "beautifulsoup"]', 'user_level': 'beginner', 'timestamp': '2024-08-28 10:30:00', 'topic': 'web scraping'}
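As the output shows, `technologies` comes back as a JSON string rather than a list. A minimal sketch of the reverse step follows; `decode_metadata` is a hypothetical helper, and the `stored` dictionary is copied from the output above so the snippet runs without a database.

```python
import json


def decode_metadata(meta, json_fields=("technologies",)):
    """Return a copy of the metadata with JSON-encoded fields parsed back."""
    decoded = dict(meta)
    for field in json_fields:
        if isinstance(decoded.get(field), str):
            decoded[field] = json.loads(decoded[field])
    return decoded


# Metadata as returned by collection.get() in the cell above.
stored = {
    "technologies": '["python", "requests", "beautifulsoup"]',
    "user_level": "beginner",
    "timestamp": "2024-08-28 10:30:00",
    "topic": "web scraping",
}

restored = decode_metadata(stored)
print(restored["technologies"])
```

Scalar fields such as `user_level` need no decoding and can also be used to narrow a search, for example by passing `where={"user_level": "beginner"}` to `collection.query`.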
