#### Chroma DB
Chroma is an AI-native, open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.

https://python.langchain.com/v0.2/docs/integrations/vectorstores/

# **ChromaDB with LangChain and Hugging Face Embeddings**  

This notebook demonstrates how to use **ChromaDB**, an open-source vector database, with **LangChain** and **Hugging Face embeddings** to store and retrieve text based on semantic similarity.  

## **1. Installing Dependencies**  
The required libraries for ChromaDB, LangChain, and Hugging Face embeddings are installed to handle document processing, vector storage, and retrieval.  


In [6]:

!pip install langchain-community langchain-core langchain -q
!pip install langchain-chroma -q
!pip install langchain-huggingface -q


In [7]:
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [11]:
# Load the source document (splitting happens in the next step)
loader = TextLoader("speech.txt")
documents = loader.load()


## **2. Loading and Splitting Documents**  
A text file (`speech.txt`) is loaded into the notebook. Since raw text can be lengthy, it is split into chunks of up to 500 characters, with a 50-character overlap between neighbouring chunks, using `RecursiveCharacterTextSplitter`. Each chunk is then embedded and retrieved on its own, so a search returns a focused passage rather than the whole file.  
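
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplified stand-in, not the library's algorithm: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, so its chunks are usually shorter than the limit. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical 1200-character sample, used only to show the window arithmetic
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.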



In [12]:
# Split the text into chunks of up to 500 characters with a 50-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

## **3. Embedding the Text**  
A pre-trained Hugging Face sentence-transformer model (`all-MiniLM-L6-v2`) converts each text chunk into a numerical vector called an embedding. Chunks with similar meaning map to nearby vectors, so text can be stored and retrieved by meaning rather than by keyword matching alone.  

## **4. Storing Data in ChromaDB**  
The generated embeddings are stored in ChromaDB, allowing fast similarity searches. ChromaDB acts as a vector database, organizing and indexing the embeddings for efficient retrieval.  


In [15]:
# Initialize Hugging Face Embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create and populate ChromaDB
vectorstore = Chroma.from_documents(texts, embedding_model)
vectorstore

<langchain_chroma.vectorstores.Chroma at 0x7b0465b4cd90>

## **5. Querying the Database**  
A search query is embedded with the same model and compared against the stored vectors. Instead of simple keyword matching, ChromaDB returns the chunks whose embeddings are closest to the query's embedding, so results are ranked by semantic similarity.  
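
The ranking idea can be sketched without any library. The example below is an illustration under stated assumptions: the three-dimensional vectors are made up by hand (a real embedding model produces much longer vectors), the helper names `cosine_similarity` and `top_k` are hypothetical, and the brute-force scan stands in for the approximate nearest-neighbour index a vector database would normally use.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k stored vectors most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hand-made toy vectors (assumption: not produced by any real model)
toy_store = [
    ("chunk about war", [0.9, 0.1, 0.0]),
    ("chunk about health", [0.1, 0.9, 0.1]),
    ("chunk about economy", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]

results = top_k(query_vec, toy_store, k=2)
print(results)
```

Note that a nearest-neighbour search always returns its k closest chunks, even when none of them is actually about the query, so it is worth inspecting the results and their scores.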



In [17]:
# Query the vector store
query = "What does the speaker believe is the main reason the United States should enter the war?"
docs = vectorstore.similarity_search(query)
docs[0].page_content

"Veterans' and Families' Mental Health\nThis course discusses mental health in veterans and the . . .\nWhat Sparked this Trend of Deinstitutionalization?\nAlthough this trend began in the early 20th century, it largely came into focus during the period of the civil rights movement. According to the American Medical Association Journal of Ethics, many believe that the movement derived based off these three elements:"

In [20]:
# Save to disk
vectorstore = Chroma.from_documents(documents=texts, embedding=embedding_model, persist_directory="./chroma_db")


## **6. Persisting and Reloading the Database**  
The database is saved to disk, enabling reuse without needing to reprocess the text. When needed, ChromaDB can be reloaded, and searches can be performed without recomputing embeddings.  



In [22]:
# Load from disk
db2 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_model)
docs = db2.similarity_search(query)
print(docs[0].page_content)

Veterans' and Families' Mental Health
This course discusses mental health in veterans and the . . .
What Sparked this Trend of Deinstitutionalization?
Although this trend began in the early 20th century, it largely came into focus during the period of the civil rights movement. According to the American Medical Association Journal of Ethics, many believe that the movement derived based off these three elements:


## **7. Similarity Search with Scores**  
`similarity_search_with_score` returns each result together with a score that indicates its relevance. The score is a distance rather than a similarity: the lower the score, the closer the document is to the query.  
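
As a rough guide to reading such scores, the sketch below computes a squared Euclidean (L2) distance by hand. It assumes the collection uses a squared-L2 distance space, which is Chroma's default as far as I know, and that the embeddings are unit-length; under those assumptions the score equals `2 - 2 * cosine_similarity` and lies between 0 and 4. The vectors and the helper name `squared_l2` are hypothetical.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance: 0.0 for identical vectors, larger means further apart."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hand-made unit-length vectors (assumption: real embeddings have many more dimensions)
query_vec = [1.0, 0.0]
close_doc = [0.8, 0.6]   # cosine similarity with the query: 0.8
far_doc = [-0.6, 0.8]    # cosine similarity with the query: -0.6

close_score = squared_l2(query_vec, close_doc)
far_score = squared_l2(query_vec, far_doc)
print(close_score, far_score)  # the closer document gets the lower score

# For unit vectors, squared L2 distance and cosine similarity carry the same ranking
close_from_cosine = 2 - 2 * 0.8
far_from_cosine = 2 - 2 * (-0.6)
```

Read this way, a score above 1.5 would correspond to a cosine similarity below 0.25, which suggests a weak match even for the top-ranked result.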



In [24]:
# Similarity search with score
docs = vectorstore.similarity_search_with_score(query)
docs

[(Document(id='b3292cd1-0b49-49b3-9448-7583c5c6c685', metadata={'source': 'speech.txt'}, page_content="Veterans' and Families' Mental Health\nThis course discusses mental health in veterans and the . . .\nWhat Sparked this Trend of Deinstitutionalization?\nAlthough this trend began in the early 20th century, it largely came into focus during the period of the civil rights movement. According to the American Medical Association Journal of Ethics, many believe that the movement derived based off these three elements:"),
  1.615055526276708),
 (Document(id='620273a3-330d-4922-abb3-270302edcfd0', metadata={'source': 'speech.txt'}, page_content='Government Aid\nOne of the incentives offered to patients leaving these mental health facilities was that the government would provide them with a type of welfare program, aiding in their recovery and helping them financially with daily situations.'),
  1.7150608289384517),
 (Document(id='d538ce32-113b-4e7c-92f0-0c29e432712e', metadata={'source': 's

## **8. Using a Retriever**  
A retriever is created from the ChromaDB vector store. It exposes a single `invoke(query)` call that returns the most relevant chunks, which makes the database easy to plug into other AI applications such as chatbots or search engines.  
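
The value of the retriever interface is that callers pass a plain string and get documents back, without touching embeddings or distances themselves. The sketch below imitates that shape in plain Python. Everything in it is hypothetical: `ToyRetriever` is not a LangChain class, and `toy_embed` merely counts two hand-picked keywords, whereas a real embedding model is learned and captures meaning.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyRetriever:
    """Hides embedding and ranking behind a single invoke(query) call."""
    embed: Callable[[str], list[float]]
    store: list[tuple[str, list[float]]]
    k: int = 1

    def invoke(self, query: str) -> list[str]:
        query_vec = self.embed(query)

        def distance(vec: list[float]) -> float:
            # Squared Euclidean distance between the query and a stored vector
            return sum((a - b) ** 2 for a, b in zip(query_vec, vec))

        ranked = sorted(self.store, key=lambda item: distance(item[1]))
        return [text for text, _ in ranked[: self.k]]


def toy_embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: counts of two hand-picked keywords."""
    lowered = text.lower()
    return [float(lowered.count("war")), float(lowered.count("health"))]


sample_texts = ["The war began in 1914.", "Mental health services expanded."]
toy_retriever = ToyRetriever(
    embed=toy_embed,
    store=[(t, toy_embed(t)) for t in sample_texts],
    k=1,
)
hits = toy_retriever.invoke("Why enter the war?")
print(hits)
```

Because the caller only depends on `invoke`, the store behind it can be swapped (in-memory, on disk, or a remote service) without changing the application code that consumes the results.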


In [26]:
# Retriever option
retriever = vectorstore.as_retriever()
retriever.invoke(query)[0].page_content

"Veterans' and Families' Mental Health\nThis course discusses mental health in veterans and the . . .\nWhat Sparked this Trend of Deinstitutionalization?\nAlthough this trend began in the early 20th century, it largely came into focus during the period of the civil rights movement. According to the American Medical Association Journal of Ethics, many believe that the movement derived based off these three elements:"
