# Chroma DB

* Chroma is an AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.


### 1. Building a sample vector DB

In [1]:
"""
1. TextLoader: loads your text documents.
2. Chroma: vector store used for similarity search.
3. OllamaEmbeddings: generates vector embeddings using the llama3.2:1b model.
4. RecursiveCharacterTextSplitter: splits long text documents into smaller chunks.
"""

from langchain_chroma import Chroma  # langchain_chroma-0.2.2
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

### 2. Load and split documents

* We load a text file (speech.txt) and then split it into smaller chunks using the RecursiveCharacterTextSplitter.

In [4]:
# loading
loader = TextLoader("speech.txt")
data = loader.load()
data

[Document(metadata={'source': 'speech.txt'}, page_content="or the process of speaking to a group of people, see Public speaking. For other uses, see Speech (disambiguation).\nDuration: 15 seconds.0:15Subtitles available.CC\nSpeech production visualized by real-time MRI\nPart of a series on\nLinguistics\nOutlineHistoryIndex\nGeneral linguistics\nApplied linguistics\nTheoretical frameworks\nTopics\n Portal\nvte\nSpeech is the use of the human voice as a medium for language. Spoken language combines vowel and consonant sounds to form units of meaning like words, which belong to a language's lexicon. There are many different intentional speech acts, such as informing, declaring, asking, persuading, directing; acts may vary in various aspects like enunciation, intonation, loudness, and tempo to convey meaning. Individuals may also unintentionally communicate aspects of their social position through speech, such as sex, age, place of origin, physiological and mental condition, education, and

In [5]:
# splitting
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)
splits

[Document(metadata={'source': 'speech.txt'}, page_content='or the process of speaking to a group of people, see Public speaking. For other uses, see Speech (disambiguation).\nDuration: 15 seconds.0:15Subtitles available.CC\nSpeech production visualized by real-time MRI\nPart of a series on\nLinguistics\nOutlineHistoryIndex\nGeneral linguistics\nApplied linguistics\nTheoretical frameworks\nTopics\n Portal\nvte'),
 Document(metadata={'source': 'speech.txt'}, page_content="Speech is the use of the human voice as a medium for language. Spoken language combines vowel and consonant sounds to form units of meaning like words, which belong to a language's lexicon. There are many different intentional speech acts, such as informing, declaring, asking, persuading, directing; acts may vary in various aspects like enunciation, intonation, loudness, and tempo to convey meaning. Individuals may also unintentionally communicate aspects of their social position through"),
 Document(metadata={'source':
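
To see what `chunk_size` and `chunk_overlap` control, here is a minimal sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks before falling back to characters); `split_fixed` is a hypothetical helper that simply slides a fixed-width character window over the text.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration only, not a LangChain API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))                   # ['abcd', 'efgh', 'ij']
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With `chunk_overlap=0`, as in the cell above, no text is repeated between chunks. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when a relevant sentence straddles a chunk boundary.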

### 3. Generating Embeddings (Using the OllamaEmbeddings Model) and Chroma DB
* This creates a vector representation (embedding) of each text chunk using the llama3.2:1b model and stores the vectors in a Chroma collection.

In [12]:
embeddings = OllamaEmbeddings(
    model="llama3.2:1b"  # Or the specific model name you installed
)
vectordb = Chroma.from_documents(documents=splits, embedding=embeddings)
vectordb

<langchain_chroma.vectorstores.Chroma at 0x12fbbb3d0>

### 4a. Querying the Database (Natural Language Search)
* We perform a semantic search to find the text chunks most relevant to the query.


In [14]:
query = "who is Anthony ekle"
docs = vectordb.similarity_search(query)
docs[0].page_content

'Anthony Ekle is a PhD candidate in Computer Science at Tennessee Tech University,\n specializing in graph-based machine learning, anomaly detection in dynamic graphs,'
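
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea with a brute-force scan over hand-made 3-dimensional vectors. Everything in it is an assumption for illustration: the toy vectors, the `top_k` helper, and the choice of squared Euclidean distance. A real vector store uses the embeddings from the model and typically an approximate index instead of a full scan.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, store, k=4):
    """Return the k (text, distance) pairs closest to query_vec.

    Hypothetical helper: scans every stored vector (brute force).
    """
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy "embeddings": made-up numbers, not produced by any model.
store = [
    ("chunk about speech acts", [0.9, 0.1, 0.0]),
    ("chunk about linguistics", [0.7, 0.3, 0.1]),
    ("chunk about a PhD candidate", [0.0, 0.2, 0.9]),
]
query_vec = [0.1, 0.1, 0.8]  # pretend this is the embedded query

for text, dist in top_k(query_vec, store, k=2):
    print(f"{dist:.2f}  {text}")
```

The chunk whose vector is nearest to the query vector comes back first, which is why `docs[0]` above holds the best match.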

### 5. How to save the vectorDB to disk
* Saving: passing `persist_directory` to `Chroma.from_documents()` writes the collection to disk for future use. Constructing `Chroma` with the same `persist_directory` loads it back.


In [15]:
# saving to disk
vectordb = Chroma.from_documents(documents=splits, embedding=embeddings, persist_directory="./chroma_db")

In [16]:
# load from disk
db2 = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
docs = db2.similarity_search(query)
print(docs[0].page_content)

Anthony Ekle is a PhD candidate in Computer Science at Tennessee Tech University,
 specializing in graph-based machine learning, anomaly detection in dynamic graphs,


In [17]:
# getting similarity search scores (distances: a lower score means a closer match)
docs = vectordb.similarity_search_with_score(query)
docs

[(Document(id='2883b192-53ca-4b7b-8e3b-457674559aff', metadata={'source': 'speech.txt'}, page_content='Anthony Ekle is a PhD candidate in Computer Science at Tennessee Tech University,\n specializing in graph-based machine learning, anomaly detection in dynamic graphs,'),
  11694.35555801066),
 (Document(id='dbe6e64d-7c9c-4a1a-9dd4-b45888dabfd9', metadata={'source': 'speech.txt'}, page_content='While normally used to facilitate communication with others, people may also use speech without the intent to communicate. Speech may nevertheless express emotions or desires; people talk to themselves sometimes in acts that are a development of what some psychologists (e.g., Lev Vygotsky) have maintained is the use of silent speech in an interior monologue to vivify and organize cognition, sometimes in the momentary adoption of a dual persona as self addressing self as though addressing'),
  11695.179162068562),
 (Document(id='a324831b-fcd0-45d3-90d5-28a7fe3783c6', metadata={'source': 'speech.t
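
The scores above are distances, so results are ordered from smallest to largest. Assuming the collection uses Chroma's default squared-L2 metric (an assumption, since a collection can be configured with a different one), the absolute size of a score depends on how long the embedding vectors are, not only on how well the texts match. The sketch below uses made-up 2-dimensional vectors to show this: the raw squared distance is large, while the same computation on unit-length vectors stays between 0 and 4 and equals `2 - 2 * cosine_similarity`.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors with large norms, standing in for unnormalized embeddings.
a = [30.0, 40.0]
b = [50.0, 10.0]

raw = squared_l2(a, b)                         # depends on vector length
unit = squared_l2(normalize(a), normalize(b))  # depends only on direction

print(f"raw squared L2:        {raw:.1f}")
print(f"unit-length squared L2: {unit:.4f}")
print(f"2 - 2 * cosine:         {2 - 2 * cosine(a, b):.4f}")
```

For this reason, scores are best read relative to each other within one query rather than as absolute quality measures.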

### 6. Retriever options

In [18]:
retriever = vectordb.as_retriever()
retriever.invoke(query)[0].page_content

'Anthony Ekle is a PhD candidate in Computer Science at Tennessee Tech University,\n specializing in graph-based machine learning, anomaly detection in dynamic graphs,'