# ChromaDB Setup and Collection Creation

## Overview
This section demonstrates how to initialize a ChromaDB client and create your first vector database collection. ChromaDB is a vector database: it stores embeddings (high-dimensional vectors) alongside the original documents and retrieves them by similarity.

## Key Components

### Client Initialization
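The process begins by creating a ChromaDB client instance with `chromadb.Client()`. The client is the main interface for creating collections and querying them. Called with no arguments, it runs in memory, so the stored data is lost when the kernel or process exits.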

### Collection Creation
A collection is created with two parameters:
- **Name**: A unique identifier for the collection ("my_documents")
- **Metadata**: A dictionary of additional information about the collection. Here it holds a single `description` key with a human-readable summary of what the collection contains

### Verification
After creation, the code prints two values to confirm the collection exists:
- The collection name, which should match the name passed in
- The document count, which should be zero for a new collection

## Purpose
This setup establishes the foundation for storing and retrieving document embeddings. The collection acts as a container where you can store documents along with their vector representations, enabling semantic search capabilities.

## Benefits
- **Organized Storage**: Collections help organize different types of documents
- **Metadata Support**: Additional information can be stored alongside documents
- **Scalability**: ChromaDB manages vector storage and indexing, so application code does not have to

In [9]:
# Required Libraries
! pip install chromadb sentence-transformers

In [1]:
import chromadb
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize ChromaDB client
client = chromadb.Client()

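# Get the collection if it already exists, otherwise create it
# (think of a collection as a table). Unlike create_collection,
# this does not raise when the cell is re-run in the same session.
collection = client.get_or_create_collection(
    name="my_documents",
    metadata={"description": "My first vector database"}
)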

print(f"Created collection: {collection.name}")
print(f"Collection count: {collection.count()}")

Created collection: my_documents
Collection count: 0


# Document Processing and Embedding Generation

## Overview
This section covers the process of converting text documents into numerical vectors (embeddings) and storing them in the ChromaDB collection. This transformation enables semantic similarity search.

## Sample Document Collection
The process starts with a set of short sample sentences, most of them on technology topics:
- Programming languages and development
- Artificial intelligence and machine learning
- Data science and analytics
- Cloud computing infrastructure
- Natural language processing

## Embedding Model
A pre-trained sentence transformer model is used to generate embeddings:
- **Model**: all-MiniLM-L6-v2
- **Purpose**: Converts each text into a 384-dimensional numerical vector
- **Advantage**: Captures the semantic meaning of text, so similar meanings produce nearby vectors

## Embedding Generation Process
The transformation process involves:
1. **Loading the Model**: Initialize the sentence transformer
2. **Text Processing**: Encode all documents in one call, producing one vector per document
3. **Dimension Verification**: Print the number of embeddings and their dimension to confirm they are as expected
4. **Progress Tracking**: Print a status message before encoding starts
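The dimension check in step 3 can also be made explicit rather than read off a printed message. The following is a minimal sketch in plain Python; `embedding_dimension` is a hypothetical helper, not part of ChromaDB or sentence-transformers, and the toy 3-dimensional vectors stand in for real model output.

```python
def embedding_dimension(embeddings):
    """Return the shared vector length, or raise if the vectors disagree.

    Hypothetical helper for illustration; accepts a list of lists or
    anything else that supports len() and iteration over rows.
    """
    if len(embeddings) == 0:
        raise ValueError("no embeddings to check")
    lengths = {len(vector) for vector in embeddings}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(lengths)}")
    return lengths.pop()


# Toy vectors; real all-MiniLM-L6-v2 output would have 384 entries per row.
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(embedding_dimension(toy_embeddings))
```

Running a check like this before storing vectors surfaces a mismatched or truncated embedding early, instead of at insert or query time.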

## Database Storage
Documents are stored with multiple components:
- **Document Text**: Original text content
- **Embeddings**: Numerical vector representations
- **Unique IDs**: Identifiers for each document
- **Metadata**: Additional information including source and index
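These four components are passed as parallel lists: position `i` in each list describes the same document. A small pre-flight check can catch misaligned lists or repeated IDs before anything is written. The sketch below is plain Python; `validate_batch` is a hypothetical helper introduced for illustration, and the sample values are made up.

```python
def validate_batch(documents, embeddings, ids, metadatas):
    """Check that the parallel lists line up and that IDs are unique.

    Hypothetical helper; returns the batch size when everything is consistent.
    """
    sizes = {
        "documents": len(documents),
        "embeddings": len(embeddings),
        "ids": len(ids),
        "metadatas": len(metadatas),
    }
    if len(set(sizes.values())) != 1:
        raise ValueError(f"list lengths differ: {sizes}")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique within a batch")
    return sizes["ids"]


# Illustrative two-document batch with toy 2-dimensional embeddings.
docs = ["first document", "second document"]
vectors = [[0.0, 1.0], [1.0, 0.0]]
doc_ids = [f"doc_{i}" for i in range(len(docs))]
metas = [{"source": "sample", "index": i} for i in range(len(docs))]
print(validate_batch(docs, vectors, doc_ids, metas))
```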

## Results
The process concludes by confirming:
- Total number of embeddings generated
- Vector dimensions for each embedding
- Successful storage in the database
- Final document count in the collection

In [6]:
# Sample documents for our database
documents = [
    "The quick brown fox jumps over the lazy dog",
    "Artificial intelligence is transforming technology",
    "Python is a popular programming language",
    "Machine learning models require large datasets",
    "Vector databases enable fast similarity search",
    "Natural language processing analyzes text data",
    "Deep learning uses neural networks",
    "Data science combines statistics and programming",
    "Cloud computing provides scalable infrastructure",
    "Software development involves writing code",
]

# Load sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
print("Generating embeddings...")
embeddings = model.encode(documents)
print(f"Generated {len(embeddings)} embeddings of dimension {embeddings.shape[1]}")

# Create IDs for our documents
ids = [f"doc_{i}" for i in range(len(documents))]

# Add documents to ChromaDB
collection.add(
    documents=documents,
    embeddings=embeddings.tolist(),
    ids=ids,
    metadatas=[{"source": "sample", "index": i} for i in range(len(documents))]
)

print(f"Added {collection.count()} documents to the database")

Generating embeddings...
Generated 40 embeddings of dimension 384
Added 40 documents to the database


# Vector Search and Similarity Querying

## Overview
This section demonstrates how to perform semantic search using vector similarity. The system can find documents that are conceptually similar to a query, even if they don't share exact keywords.

## Query Processing
The search process begins with a natural language query about artificial intelligence and machine learning. This query undergoes the same transformation process as the stored documents.

## Embedding Generation for Query
The query text is converted into a numerical vector using the same sentence transformer model that produced the document embeddings. Query and document vectors must come from the same model, because distances between vectors from different models are not meaningful.

## Similarity Search Execution
The database performs a similarity search with specific parameters:
- **Query Vector**: The numerical representation of the search query
- **Result Limit**: Number of most similar documents to return (top 3)
- **Included Data**: Specifies what information to return (documents, distances, metadata)
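The returned object is a dictionary of nested lists: each included field holds one inner list per query vector, which is why the notebook code indexes `[0]` for its single query. The sketch below shows how that shape can be unpacked. `iter_hits` is a hypothetical helper, and `sample_results` is a hand-written stand-in that only mirrors the shape, not a real query response.

```python
def iter_hits(results, query_index=0):
    """Yield (rank, distance, document) for one query in a results dict.

    Hypothetical helper; assumes 'documents' and 'distances' were included.
    """
    documents = results["documents"][query_index]
    distances = results["distances"][query_index]
    for rank, (doc, distance) in enumerate(zip(documents, distances), start=1):
        yield rank, distance, doc


# Hand-written stand-in: one outer entry because there is one query.
sample_results = {
    "documents": [[
        "Artificial intelligence is transforming technology",
        "Deep learning uses neural networks",
    ]],
    "distances": [[0.937, 1.125]],
}

for rank, distance, doc in iter_hits(sample_results):
    print(f"{rank}. Distance: {distance:.3f}  Document: {doc}")
```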

## Search Algorithm
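Conceptually, the search:
- Compares the query embedding with the stored document embeddings
- Calculates a distance score for each candidate
- Ranks documents by ascending distance
- Returns the closest matches

In practice, ChromaDB answers queries from an approximate nearest-neighbor (HNSW) index rather than scanning every vector, so results on large collections can differ slightly from an exhaustive search.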
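The ranking logic can be sketched as an exhaustive search in plain Python. This is an illustration of the idea only, not ChromaDB's implementation: it assumes squared Euclidean (L2) distance as the metric, and `squared_l2` and `top_k` are hypothetical names operating on toy 2-dimensional vectors.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vector, vectors, k=3):
    """Return the k closest vectors as (distance, index) pairs, nearest first."""
    scored = [(squared_l2(query_vector, v), i) for i, v in enumerate(vectors)]
    scored.sort()  # ascending distance: lower means more similar
    return scored[:k]


# Toy stored vectors and a query that coincides with the first one.
stored = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
query_vector = [1.0, 0.0]

for distance, index in top_k(query_vector, stored, k=2):
    print(f"index={index} distance={distance:.2f}")
```

An exhaustive scan like this costs one distance computation per stored vector for every query, which is the cost that an index is meant to avoid on larger collections.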

## Results Presentation
The search results include:
- **Distance Scores**: Numerical values indicating similarity (lower = more similar)
- **Original Documents**: The actual text content of matching documents
- **Ranking**: Documents ordered by relevance to the query
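Raw distance values are easier to read once they are related to cosine similarity. Assuming the collection uses squared L2 distance (understood to be ChromaDB's default, but worth confirming in the collection's configuration) and that the embeddings have unit length, the two are linked by `distance = 2 - 2 * cosine`. The sketch below checks that identity on toy vectors; all function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_from_squared_l2(distance):
    """Convert squared L2 distance to cosine similarity.

    Assumption: both vectors have unit length; otherwise this is not valid.
    """
    return 1.0 - distance / 2.0


a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])
# The identity holds for unit vectors, up to floating-point error.
assert math.isclose(cosine_from_squared_l2(squared_l2(a, b)), cosine_similarity(a, b))

# Under the same assumptions, a reported distance maps to a cosine similarity.
print(f"{cosine_from_squared_l2(0.937):.2f}")
```

Under these assumptions, distances range from 0 (identical direction) to 4 (opposite direction), and a distance of 2 corresponds to unrelated (orthogonal) vectors.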

## Semantic Understanding
The search demonstrates semantic understanding by finding relevant documents that discuss:
- Artificial intelligence concepts
- Machine learning topics
- Related technology themes

This approach goes beyond keyword matching: the query uses the abbreviation "AI", yet the top result spells out "Artificial intelligence" and shares no keyword with the query.

In [8]:
# Search query
query = "What is AI and machine learning?"

# Generate embedding for query
query_embedding = model.encode([query])

# Search the database
results = collection.query(
    query_embeddings=query_embedding.tolist(),
    n_results=3,
    include=['documents', 'distances', 'metadatas']
)

print(f"\nQuery: {query}")
print("\nTop 3 similar documents:")
for i, (doc, distance) in enumerate(zip(results['documents'][0], results['distances'][0])):
    print(f"{i+1}. Distance: {distance:.3f}")
    print(f"   Document: {doc}")
    print()


Query: What is AI and machine learning?

Top 3 similar documents:
1. Distance: 0.937
   Document: Artificial intelligence is transforming technology

2. Distance: 1.125
   Document: Deep learning uses neural networks

3. Distance: 1.145
   Document: Computer vision enables machines to understand visual information

