# Chroma DB

* **ChromaDB**: A lightweight, open-source vector database that stores and searches embeddings. It offers fast similarity search and easy integration with popular embedding models.

Chroma is an AI-native, open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.

https://python.langchain.com/v0.2/docs/integrations/vectorstores/

- Environment Setup: Configure API keys and enable LangSmith tracking for the project as needed, then import the classes used in the cells below.

In [1]:
from langchain_chroma import Chroma

from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

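- Text Processing: Loads a text document and splits it into smaller chunks for efficient processing. Here `chunk_size=500` caps each chunk at 500 characters, and `chunk_overlap=0` means adjacent chunks share no text.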

In [None]:
# Load the text file and split it into chunks of up to 500 characters (no overlap)

data = TextLoader("speech.txt").load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness 
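
To make the two splitter parameters concrete, here is a minimal sketch of fixed-width chunking in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break at separators such as paragraphs and spaces); `split_with_overlap` is a hypothetical helper that only illustrates what `chunk_size` and `chunk_overlap` mean.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width chunks; each chunk repeats the last
    `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "abcdefghij"
no_overlap = split_with_overlap(sample, chunk_size=4, chunk_overlap=0)
with_overlap = split_with_overlap(sample, chunk_size=4, chunk_overlap=2)

print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With an overlap of 0, as in the cell above, a sentence that straddles a chunk boundary is cut in two. A small overlap keeps some shared context on both sides at the cost of storing more chunks.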

- ChromaDB Creation: Creates a new Chroma vector database from document splits using Ollama embeddings.

In [5]:
# Initialize ChromaDB with documents using the Ollama embeddings model

embedding = OllamaEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

vectordb


<langchain_chroma.vectorstores.Chroma at 0x1776b885a90>

- **Embedding**: A numeric vector that represents a piece of text so that semantically similar texts lie close together in vector space.
* **OllamaEmbeddings()**: A class that converts text into vector representations (embeddings) using a model served by Ollama. In `langchain_community` the default model is `llama2`, a general-purpose LLaMA 2 model rather than one trained specifically for embeddings.

* **Chroma.from_documents()**: A method that takes text documents as input, creates embeddings using the specified embedding model, and stores these embeddings in a Chroma vector database.

* **Model Configuration**: You can select a different model by passing its name through the `model` argument when initializing `OllamaEmbeddings`. The default `llama2` tag corresponds to the 7B-parameter model.
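
As a rough illustration of what "text to vector" means, the sketch below builds a toy embedding by counting words from a tiny fixed vocabulary and normalizing the result. This is an assumption-laden stand-in: real embedding models such as those served by Ollama produce dense, learned vectors with thousands of dimensions, and `VOCAB` and `toy_embed` are hypothetical names used only here.

```python
import math
import re
from collections import Counter

VOCAB = ["war", "peace", "democracy", "liberty"]  # hypothetical toy vocabulary


def toy_embed(text: str) -> list[float]:
    """Map text to a unit-length vector of vocabulary word counts."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    vec = [float(counts[word]) for word in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [x / norm for x in vec] if norm else vec


vec = toy_embed("Peace, and liberty!")
print(vec)
```

Every text maps to a vector of the same length, which is what lets a vector database compare any query against any stored chunk.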

- Query Execution: Performs a semantic similarity search in the vector database to find the chunks most relevant to a question about why the United States should enter the war.

In [None]:
# Search Chroma for chunks similar to the query and show the top result

query = "What does the speaker believe is the main reason the United States should enter the war?"
docs = vectordb.similarity_search(query)
docs[0].page_content

'To such a task we can dedicate our lives and our fortunes, everything that we are and everything that we have, with the pride of those who know that the day has come when America is privileged to spend her blood and her might for the principles that gave her birth and happiness and the peace which she has treasured. God helping her, she can do no other.'

- **Similarity Search**: a search technique that finds documents or items most similar to a given query based on their vector representations (embeddings). Instead of exact keyword matching, it looks for semantic meaning and contextual similarity, making it more effective for natural language queries.
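
The idea can be sketched without any database: embed the query, compare it with every stored vector, and keep the closest matches. The example below uses cosine similarity over hand-written three-dimensional vectors; the document IDs, the vectors, and the `top_k` helper are all made up for illustration, and a real store uses an approximate index rather than this exhaustive scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k stored vectors most similar to the query."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical pre-computed embeddings for three chunks.
store = {
    "doc_war": [0.9, 0.1, 0.0],
    "doc_peace": [0.1, 0.9, 0.0],
    "doc_trade": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.0]

best = top_k(query_vec, store)
print(best)  # ['doc_war', 'doc_peace']
```

In LangChain, `similarity_search` accepts a `k` argument in the same spirit and returns four documents by default.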

- Save and Load: Saves the vector database to a directory on disk and loads it back when needed.

In [None]:
# Save database to disk

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory="./chroma_db",
)


⚠️ It looks like you upgraded from a version below 0.6 and could benefit from vacuuming your database. Run chromadb utils vacuum --help for more information.


* **persist_directory**:
   - A parameter that specifies where to save the Chroma database on disk (e.g., "./chroma_db"), allowing embeddings and texts to be stored for future use.

* **Chroma.from_documents Parameters**:
  - `documents=splits`: The split text chunks to index
  - `embedding=embedding`: The embedding model used to create vector representations
  - `persist_directory="./chroma_db"`: The directory in which the embeddings and texts are stored

In [8]:
# Load from disk

db2 = Chroma(persist_directory="./chroma_db", embedding_function=embedding)
docs = db2.similarity_search(query)
print(docs[0].page_content)

To such a task we can dedicate our lives and our fortunes, everything that we are and everything that we have, with the pride of those who know that the day has come when America is privileged to spend her blood and her might for the principles that gave her birth and happiness and the peace which she has treasured. God helping her, she can do no other.
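
The save-then-reload round trip can be pictured with a toy store that writes its vectors to a directory and reads them back. This is only a conceptual sketch: Chroma uses its own on-disk format, and the JSON file name and the `save_store` and `load_store` helpers below are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_store(store: dict[str, list[float]], directory: Path) -> Path:
    """Write the vectors to <directory>/vectors.json, creating the directory."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "vectors.json"
    path.write_text(json.dumps(store), encoding="utf-8")
    return path


def load_store(directory: Path) -> dict[str, list[float]]:
    """Read back the vectors written by save_store."""
    return json.loads((directory / "vectors.json").read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    persist_dir = Path(tmp) / "toy_db"
    original = {"chunk-0": [0.1, 0.2], "chunk-1": [0.3, 0.4]}
    save_store(original, persist_dir)
    reloaded = load_store(persist_dir)

print(reloaded == original)  # True
```

The reloaded store holds vectors only, so queries must be embedded with the same model that produced them. That is why the load cell above passes `embedding_function=embedding` again.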


In [None]:
# Similarity search with score on vectordb

docs = vectordb.similarity_search_with_score(query)
docs

[(Document(metadata={'source': 'speech.txt'}, page_content='To such a task we can dedicate our lives and our fortunes, everything that we are and everything that we have, with the pride of those who know that the day has come when America is privileged to spend her blood and her might for the principles that gave her birth and happiness and the peace which she has treasured. God helping her, she can do no other.'),
  15442.443581513424),
 (Document(metadata={'source': 'speech.txt'}, page_content='To such a task we can dedicate our lives and our fortunes, everything that we are and everything that we have, with the pride of those who know that the day has come when America is privileged to spend her blood and her might for the principles that gave her birth and happiness and the peace which she has treasured. God helping her, she can do no other.'),
  15462.308901624307),
 (Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace mus

In [14]:
# Similarity search with score on db2 (loaded from disk)

docs2 = db2.similarity_search_with_score(query)
docs2

[(Document(metadata={'source': 'speech.txt'}, page_content='To such a task we can dedicate our lives and our fortunes, everything that we are and everything that we have, with the pride of those who know that the day has come when America is privileged to spend her blood and her might for the principles that gave her birth and happiness and the peace which she has treasured. God helping her, she can do no other.'),
  15442.443581513424),
 (Document(metadata={'source': 'speech.txt'}, page_content='To such a task we can dedicate our lives and our fortunes, everything that we are and everything that we have, with the pride of those who know that the day has come when America is privileged to spend her blood and her might for the principles that gave her birth and happiness and the peace which she has treasured. God helping her, she can do no other.'),
  15462.308901624307),
 (Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace mus
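
The number paired with each document is a distance rather than a similarity, so a lower score means a closer match, which is consistent with the ascending order in the outputs above. The sketch below assumes a squared Euclidean (L2) metric, which to my knowledge is Chroma's default; the stored vectors and the `search_with_score` helper are made up for illustration.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors (0.0 means identical)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_with_score(
    query_vec: list[float], store: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Return (id, distance) pairs for the k nearest vectors, closest first."""
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical stored embeddings.
store = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [3.0, 4.0]}
results = search_with_score([1.0, 1.0], store)

print(results)  # [('a', 1.0), ('b', 2.0)]
```

Because the magnitude of a distance depends on the embedding model and its dimensionality, scores are best compared within a single query rather than against a fixed threshold.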

* **Retriever Option**: `as_retriever()` wraps the vector database in a retriever object that can be queried directly. Its `invoke()` method embeds the query and runs the similarity search in one step, returning the matching `Document` objects, whose text is available through `page_content`.

- Vector Database to Retriever: Converts the database into a search interface for easy querying.

In [None]:
# Convert to retriever and search for similar content

retriever = vectordb.as_retriever()
retriever.invoke(query)[0].page_content

'To such a task we can dedicate our lives and our fortunes, everything that we are and everything that we have, with the pride of those who know that the day has come when America is privileged to spend her blood and her might for the principles that gave her birth and happiness and the peace which she has treasured. God helping her, she can do no other.'
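
Conceptually, a retriever is a thin wrapper that hides the embed-then-search steps behind a single `invoke()` call. The sketch below mimics that shape with a toy keyword-count embedding; `ToyRetriever`, `keyword_embed`, and `VOCAB` are hypothetical names, and unlike LangChain's retriever this version returns plain strings rather than `Document` objects.

```python
import math
import re
from dataclasses import dataclass
from typing import Callable

VOCAB = ("war", "peace", "trade")  # hypothetical toy vocabulary


def keyword_embed(text: str) -> list[float]:
    """Toy embedding: count how often each vocabulary word appears."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ToyRetriever:
    embed: Callable[[str], list[float]]
    texts: list[str]
    k: int = 1

    def invoke(self, query: str) -> list[str]:
        """Embed the query, rank stored texts by similarity, return the top k."""
        query_vec = self.embed(query)
        ranked = sorted(
            self.texts,
            key=lambda text: cosine(query_vec, self.embed(text)),
            reverse=True,
        )
        return ranked[: self.k]


texts = [
    "Peace must be planted on liberty",
    "The war demands sacrifice",
    "Trade routes reopened",
]
toy_retriever = ToyRetriever(embed=keyword_embed, texts=texts)
top = toy_retriever.invoke("Why enter the war?")
print(top[0])  # The war demands sacrifice
```

Giving every retriever the same `invoke()` interface is what lets it be dropped into a larger chain without the caller knowing which vector store sits behind it.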
