# ChromaDB Sample Tutorial
# ------------------------
# This notebook demonstrates how to use **ChromaDB** with a persistent client:
# creating a collection, adding documents, and querying them.


# First, install ChromaDB if you don't already have it:
# !pip install chromadb

In [None]:
import chromadb

# ## Step 1: Create a Persistent Client
# A persistent client saves your data to disk (not just in memory), so it survives
# restarts and can be reused later.

In [None]:
client = chromadb.PersistentClient(path="./chroma_db_store")

# ## Step 2: Create (or Get) a Collection
# A collection is like a 'table' where your documents, embeddings (vectors), and metadata are stored.
# `get_or_create_collection` fetches the collection if it already exists
# and creates a new one otherwise.

In [None]:
collection = client.get_or_create_collection(name="sample_collection")
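
# The get-or-create behavior described above can be pictured with a toy registry.
# This is only a sketch of the semantics: `ToyClient` and its internal dict are
# hypothetical stand-ins and are not how Chroma is implemented.

```python
# Toy stand-in for a collection registry; NOT Chroma's implementation.
class ToyClient:
    def __init__(self):
        self._collections = {}  # maps collection name -> collection (a plain dict here)

    def get_or_create_collection(self, name):
        # Create the collection only if this name has not been seen before.
        if name not in self._collections:
            self._collections[name] = {"name": name, "items": {}}
        return self._collections[name]


toy_client = ToyClient()
first = toy_client.get_or_create_collection("sample_collection")
second = toy_client.get_or_create_collection("sample_collection")

# The second call fetched the existing collection instead of creating a new one.
print(first is second)  # True
```

# Because the call is idempotent, re-running the cell does not create a duplicate collection.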

# ## Step 3: Add Documents to the Collection
# - `ids` must be a **list of unique strings** (each ID identifies one document).
# - `documents` should be a **list of strings**, one per ID, in the same order.
#
# When you add documents, Chroma automatically converts them into **embeddings (vectors)**.
# You can use the default embedding model or configure a different embedding function.

In [None]:
collection.add(
    ids=["doc1", "doc2", "doc3"],  # unique identifiers for each document
    documents=[
        "Dogs are wonderful pets.",
        "Cats are independent animals.",
        "I love driving fast cars.",
    ],
)
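
# The two requirements on `ids` and `documents` can be checked before calling `add()`.
# The helper below is a minimal sketch; `check_add_inputs` is a hypothetical name
# and is not part of the ChromaDB API.

```python
from collections import Counter


def check_add_inputs(ids, documents):
    """Hypothetical pre-flight check: return a list of problems (empty if none)."""
    problems = []
    # Each document needs exactly one ID.
    if len(ids) != len(documents):
        problems.append(f"{len(ids)} ids but {len(documents)} documents")
    # IDs are expected to be strings.
    if not all(isinstance(i, str) for i in ids):
        problems.append("every id must be a string")
    # IDs must not repeat within the batch.
    duplicates = [i for i, count in Counter(ids).items() if count > 1]
    if duplicates:
        problems.append(f"duplicate ids: {duplicates}")
    return problems


print(check_add_inputs(["doc1", "doc2", "doc3"], ["a", "b", "c"]))  # []
print(check_add_inputs(["doc1", "doc1"], ["a"]))
```

# An empty list means the batch satisfies both requirements listed above.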

# ## Step 4: Querying the Collection
# Now, let’s perform a similarity search.
# - The `query_texts` parameter accepts a list of query strings.
# - `n_results` specifies **how many closest (most similar) documents** should be returned.
#
# Under the hood, Chroma embeds the query text and runs a nearest-neighbor search
# over the stored vectors, comparing them with a distance metric
# (squared Euclidean distance by default, or cosine distance or inner product,
# depending on the collection's configuration).
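
# To make the three measures named above concrete, here is a plain-Python sketch that
# computes them on tiny hand-made vectors. The vectors and function names are
# illustrative assumptions; real embeddings have hundreds of dimensions, and Chroma
# computes these inside its index rather than with code like this.

```python
import math


def euclidean_distance(a, b):
    # Straight-line distance between two points; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    # Sum of element-wise products; larger means more similar.
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Inner product of the vectors after scaling each to unit length.
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # points the same way as the query, but is longer
orthogonal = [0.0, 1.0]      # points in an unrelated direction

for name, vector in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    print(
        name,
        euclidean_distance(query, vector),
        inner_product(query, vector),
        cosine_similarity(query, vector),
    )
```

# Note that cosine similarity ignores vector length, while Euclidean distance and
# inner product do not, so the choice of metric can change which neighbor ranks first.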

In [None]:
results = collection.query(
    query_texts=["Tell me about pets"],
    n_results=2,  # returns top 2 most similar documents
)

print("Query Results:")
print(results)
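
# The printed dictionary is easier to read once unpacked. Each field such as `ids`,
# `documents`, and `distances` holds one inner list per query string. The sketch below
# uses a hand-written dictionary of that shape so it runs on its own; the distance
# values in it are placeholders, not real output, and `show_matches` is a hypothetical helper.

```python
def show_matches(query_results, query_index=0):
    """Hypothetical helper: list (id, document, distance) rows for one query."""
    rows = list(
        zip(
            query_results["ids"][query_index],
            query_results["documents"][query_index],
            query_results["distances"][query_index],
        )
    )
    for doc_id, text, distance in rows:
        print(f"{doc_id}: {text} (distance={distance})")
    return rows


# Illustrative stand-in shaped like a query result; the distances are made up.
example_results = {
    "ids": [["doc1", "doc2"]],
    "documents": [["Dogs are wonderful pets.", "Cats are independent animals."]],
    "distances": [[0.9, 1.1]],
}

rows = show_matches(example_results)
```

# In the notebook itself, calling `show_matches(results)` would apply the same unpacking
# to the real query output.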

# ## Explanation of `n_results`:
# - If `n_results=1`, you get only the closest (most similar) document.
# - If `n_results=3`, you get the top 3 most relevant documents ranked by similarity.
#
# Example: Searching "Tell me about pets" with `n_results=2` will most likely return
# the "Dogs" and "Cats" documents, ranked ahead of the one about cars.
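
# The effect of `n_results` can be sketched with a brute-force ranking over toy vectors.
# The 2-D vectors below are invented for illustration (they are not real embeddings),
# and `top_n` is a hypothetical helper; Chroma uses an index instead of scanning every vector.

```python
import math

# Made-up 2-D "embeddings": the first axis loosely means "pets", the second "cars".
toy_vectors = {
    "doc1": [0.9, 0.1],  # Dogs are wonderful pets.
    "doc2": [0.8, 0.3],  # Cats are independent animals.
    "doc3": [0.1, 0.9],  # I love driving fast cars.
}


def top_n(query_vector, vectors, n_results):
    # Sort IDs by Euclidean distance to the query (closest first), keep the first n.
    ranked = sorted(vectors, key=lambda doc_id: math.dist(query_vector, vectors[doc_id]))
    return ranked[:n_results]


query_vector = [1.0, 0.2]  # stand-in for the embedding of "Tell me about pets"

print(top_n(query_vector, toy_vectors, 1))  # ['doc1']
print(top_n(query_vector, toy_vectors, 2))  # ['doc1', 'doc2']
print(top_n(query_vector, toy_vectors, 3))  # ['doc1', 'doc2', 'doc3']
```

# Raising `n_results` only extends the ranked list; it never changes the order of the
# documents already returned.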


# ## Summary:
# - PersistentClient ensures data is stored on disk.
# - get_or_create_collection fetches an existing collection or creates it if it is missing.
# - add() inserts documents with unique IDs.
# - query() performs similarity search using embeddings.
# - n_results controls how many similar documents are returned.