<div align="center">

![ChromaDB](https://user-images.githubusercontent.com/891664/227103090-6624bf7d-9524-4e05-9d2c-c28d5d451481.png)
</div>

**[ChromaDB](https://www.trychroma.com/)**, like Pinecone, is designed for vector storage and retrieval. It is open-source and supports queries that combine vectors with metadata, which makes it a viable choice for many vector-based applications.

<br>

| Aspect                 | Pinecone (Managed)                           | Chroma (Open-source)                        |
|-------------------------|-----------------------------------------------|---------------------------------------------|
| **Speed**              | Fast similarity search in real-time           | Good performance with flexible queries       |
| **Scalability**        | Effortless scaling without infrastructure work| Requires manual setup for scaling            |
| **Indexing**           | Automatic indexing, minimal dev effort        | Customizable indexing, more control          |
| **Ecosystem**          | Easy-to-use Python SDK                       | Open-source, free, and community-driven      |
| **Flexibility**        | Limited advanced queries                      | Supports complex queries (vectors + metadata)|
| **Cost**               | Paid, can be expensive at scale               | Free, but infra cost and setup complexity    |



To use **ChromaDB** for semantic vector storage and search, we’ll need:

* `langchain`: for chaining and embedding integration  
* `chromadb`: the official Chroma client  
* Either:
  * `openai` + `tiktoken`: for **OpenAI embeddings**  
  * `langchain-google-genai`: for **Gemini embeddings**  


### Steps

1. **Install and Set Up Chroma**  
   Run Chroma locally or on your cloud platform. You can use the `chromadb` package to manage collections directly, without needing a separate managed service.  

2. **Integrate Chroma Client**  
   Use the Chroma client in your application (via Python SDK or REST API) to connect and define collections for storing embeddings.  

3. **Generate Embeddings**  
   Use an embedding model (from OpenAI or Gemini) to convert text into numerical vector representations.  

4. **Index Vectors**  
   Insert the generated embeddings into a Chroma collection, along with any metadata (such as original text, tags, or attributes).  

5. **Query Vectors**  
   Convert a new query into an embedding and use Chroma’s query functions to retrieve the most similar vectors, filtered or combined with metadata as needed.  
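
The indexing and querying steps above can be sketched without any external service. The following is a toy, in-memory illustration only: `toy_embed`, `ToyCollection`, and the sample texts are hypothetical names invented for this sketch and are not part of Chroma's API. A real embedding model and a Chroma collection take their place later in this notebook.

```python
import math

def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: counts vocabulary words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class ToyCollection:
    """In-memory illustration of a vector collection (not Chroma's API)."""

    def __init__(self):
        self.records = []

    def add(self, doc_id, embedding, metadata):
        # Step 4: store the vector together with its metadata
        self.records.append({"id": doc_id, "embedding": embedding, "metadata": metadata})

    def query(self, query_embedding, k=2, where=None):
        # Step 5: optionally filter on metadata, then rank by similarity
        candidates = [
            r for r in self.records
            if not where or all(r["metadata"].get(key) == val for key, val in where.items())
        ]
        ranked = sorted(
            candidates,
            key=lambda r: cosine(query_embedding, r["embedding"]),
            reverse=True,
        )
        return [r["id"] for r in ranked[:k]]

VOCAB = ["ai", "funding", "chip", "startup"]
docs = {
    "a1": ("AI startup raises funding", {"topic": "finance"}),
    "a2": ("new AI chip announced", {"topic": "hardware"}),
    "a3": ("funding slows", {"topic": "finance"}),
}

collection = ToyCollection()
for doc_id, (text, meta) in docs.items():
    collection.add(doc_id, toy_embed(text, VOCAB), meta)  # Steps 3 and 4

query_vec = toy_embed("AI funding", VOCAB)
top_ids = collection.query(query_vec, k=2)
hardware_ids = collection.query(query_vec, k=2, where={"topic": "hardware"})
print(top_ids, hardware_ids)
```

The metadata filter narrows the candidate set before ranking, which is the same idea as combining a vector query with a metadata condition.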


In [None]:
# Install necessary libraries
!pip install langchain langchain-community chromadb langchain-chroma python-dotenv -q
!pip install google-ai-generativelanguage langchain-google-genai -q

Let’s download some news articles from [Dropbox](https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip) and use them as our dataset for vector database search and retrieval.

In [None]:
!wget -q https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip
!unzip -q new_articles.zip -d new_articles

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()

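# Fail early with a clear message if the key is missing from the environment or .env file
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your environment or .env file")
os.environ["GOOGLE_API_KEY"] = api_key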

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load Embeddings and LLM Models
embeddings = GoogleGenerativeAIEmbeddings(model="text-embedding-004")
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")

In [None]:
# Load all the *.txt files at once using DirectoryLoader
loader = DirectoryLoader("new_articles", glob="./*.txt", loader_cls=TextLoader)
documents = loader.load()

In [None]:
print(documents[0].page_content)

As brands incorporate generative AI into their creative workflows to generate new content associated with the company, they need to tread carefully to be sure that the new material adheres to the company’s style and brand guidelines.

Nova is an early-stage startup building a suite of generative AI tools designed to protect brand integrity, and today, the company is announcing two new products to help brands police AI-generated content: BrandGuard and BrandGPT.

With BrandGuard, you ingest your company’s brand guidelines and style guide, and with a series of models Nova has created, it can check the content against those rules to make sure it’s in compliance, while BrandGPT lets you ask questions about the brand’s content rules in ChatGPT style.

Rob May, founder and CEO at the company, who previously founded Backupify, a cloud backup startup that was acquired by Datto back in 2014, recognized that companies wanted to start taking advantage of generative AI technology to create content

Let’s move on to **splitting the texts we loaded**. This is useful because we **don’t need to send the entire long text to the API**, which:

* **Saves tokens**: reduces costs when using embedding or LLM models.
* **Improves relevance**: smaller chunks make semantic search and retrieval more accurate.
* **Speeds up processing**: shorter texts are faster to embed and query.

By splitting our documents into manageable chunks, we ensure that **each piece of text retains enough context** for meaningful embeddings without exceeding model limits.
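
To make the idea of chunk size and overlap concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification under stated assumptions: `split_with_overlap` is a hypothetical helper written for this illustration, and it cuts purely by character count, whereas LangChain's recursive splitter prefers to break on separators such as paragraph and line boundaries.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears with some of its surrounding context in the next chunk.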


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Feed textual data from documents to text-splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=128,
    separators=["\n\n", "\n", " ", ""]
)

texts = text_splitter.split_documents(documents)

In [None]:
texts[0].page_content

'As brands incorporate generative AI into their creative workflows to generate new content associated with the company, they need to tread carefully to be sure that the new material adheres to the company’s style and brand guidelines.\n\nNova is an early-stage startup building a suite of generative AI tools designed to protect brand integrity, and today, the company is announcing two new products to help brands police AI-generated content: BrandGuard and BrandGPT.\n\nWith BrandGuard, you ingest your company’s brand guidelines and style guide, and with a series of models Nova has created, it can check the content against those rules to make sure it’s in compliance, while BrandGPT lets you ask questions about the brand’s content rules in ChatGPT style.'

In [None]:
texts[1].page_content

'Rob May, founder and CEO at the company, who previously founded Backupify, a cloud backup startup that was acquired by Datto back in 2014, recognized that companies wanted to start taking advantage of generative AI technology to create content faster, but they still worried about maintaining brand integrity, so he came up with the idea of building a guard rail system to protect the brand from generative AI mishaps.\n\n“We heard from multiple CMOs who were worried about ‘how do I know this AI-generated content is on brand?’ So we built this architecture that we’re launching called BrandGuard, which is a really interesting series of models, along with BrandGPT, which acts as an interface on top of the models,” May told TechCrunch.'

In [None]:
print(len(texts))

224


### Creating a Chroma Vector Database

Unlike Pinecone, which runs as a managed cloud service, **Chroma** can use **local storage**, which means all your vectors and metadata are stored directly on your machine or chosen infrastructure.

This provides full control over data, flexibility, and cost savings, but requires you to manage the environment yourself.

> ChromaDB needs a `persist_directory` to save indices and vectors for **long-term storage**, so that your embeddings, metadata, and collection state are retained across sessions and can be reloaded later without rebuilding the database from scratch.

In [None]:
vector_db = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,         # Gemini -> text-embedding-004
    persist_directory="chromadb"
)

In [None]:
# Directory File structure
# !sudo apt install tree -q
!tree chromadb/

chromadb/
├── 555d182e-b0f7-457d-acd8-841a6b3b2e47
│   ├── data_level0.bin
│   ├── header.bin
│   ├── length.bin
│   └── link_lists.bin
└── chroma.sqlite3

1 directory, 5 files


| File                     | Description                                                                 |
|--------------------------------------|-----------------------------------------------------------------------------|
| `data_level0.bin`                     | Stores the **vector embeddings** in binary format (float32 arrays)         |
| `header.bin`                          | Contains **metadata about vector blocks**, number of vectors, dimensions   |
| `length.bin`                          | Keeps **length information** for the index's per-element link lists        |
| `link_lists.bin`                      | Stores **graph connectivity / ANN links** for fast similarity search       |
| `chroma.sqlite3`                      | SQLite DB storing **collection metadata, vector IDs, document metadata**   |

* **Binary files (`.bin`)**: heavy numeric/vector data + indexing for fast retrieval
* **SQLite file (`.sqlite3`)**: lightweight metadata and collection management


In [None]:
vector_db

<langchain_community.vectorstores.chroma.Chroma at 0x7b9f3e3cb440>

> Now let's set `vector_db` to `None` and reload it from the persisted directory.

In [None]:
vector_db = None
vector_db

In [None]:
from langchain_chroma import Chroma

vector_db = Chroma(
    persist_directory="chromadb",
    embedding_function=embeddings
)

vector_db

<langchain_chroma.vectorstores.Chroma at 0x7b9f3cf342c0>

### Creating a Retriever


| Aspect               | Similarity Search                                                                                                                                   | Maximal Marginal Relevance (MMR) Retrieval                                                                         |
| -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **Goal**             | Retrieve vectors that are **most similar** to the query                                                                                             | Retrieve vectors that are **both relevant and diverse**                                                            |
| **How it works**     | Computes **cosine similarity (or other distance metric)** between the query embedding and all stored embeddings, then returns top-k closest matches | Balances **relevance** (similarity to query) and **novelty** (dissimilarity to already selected items) iteratively |
| **Output**           | Top-k results that are closest in embedding space                                                                                                   | Top-k results that cover multiple aspects or subtopics, avoiding redundancy                                        |
| **Best for**         | Simple retrieval tasks, FAQs, or single-topic queries                                                                                               | Summarization, RAG pipelines, or multi-aspect query results where diversity matters                                |
| **Complexity**       | Low — just compute similarities and sort                                                                                                            | Higher — iterative selection with relevance-diversity trade-off                                                    |
| **Example use case** | Searching for similar news articles based on content                                                                                                | Selecting a diverse set of news articles covering different subtopics for a query                                  |


#### Key Intuition

* **Similarity search** = “Which items are most like my query?”
* **MMR retrieval** = “Which items are most like my query **without repeating the same information**?”
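
The difference between the two strategies can be sketched in a few lines of plain Python. This is an illustrative implementation of the usual greedy MMR formulation, not the code LangChain or Chroma runs internally. The vectors are made-up toy data, and `lambda_mult` is used here as the name of the relevance-versus-diversity weight (1.0 means pure relevance, 0.0 means pure diversity).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_search(query_vec, doc_vecs, k=2):
    """Return indices of the k vectors most similar to the query."""
    order = sorted(range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return order[:k]

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: repeatedly pick the item with the best relevance-minus-redundancy score."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already selected
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
vectors = [
    [1.0, 0.10],   # index 0: very close to the query
    [1.0, 0.12],   # index 1: near-duplicate of index 0
    [0.7, -0.7],   # index 2: less similar, but covers a different direction
]

similarity_top = similarity_search(query, vectors, k=2)
mmr_top = mmr(query, vectors, k=2, lambda_mult=0.5)
print(similarity_top, mmr_top)
```

Plain similarity search returns both near-duplicates, while MMR keeps the best match and then prefers the vector that adds something new.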


In [None]:
retriever = vector_db.as_retriever(
    search_type="similarity", # Default
    search_kwargs={"k": 3}    # Default: 4
)

relevant_docs = retriever.invoke(input="How much money did Microsoft raise?")
relevant_docs

[Document(id='f3381b10-b121-4b58-af66-25df78451ac3', metadata={'source': 'new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt'}, page_content='April 28, 2023\n\nVC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source.\n\nApril 25, 2023\n\nCalled ChatGPT Business, OpenAI describes the forthcoming offering as “for professionals who need more control over their data as well as enterprises seeking to manage their end users.”'),
 Document(id='81aba3

In [None]:
relevant_docs[0].page_content

'April 28, 2023\n\nVC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source.\n\nApril 25, 2023\n\nCalled ChatGPT Business, OpenAI describes the forthcoming offering as “for professionals who need more control over their data as well as enterprises seeking to manage their end users.”'

In [None]:
mmr_retriever = vector_db.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3}    # Default: 4
)

mmr_relevant_docs = mmr_retriever.invoke(input="How much money did Microsoft raise?")
mmr_relevant_docs

[Document(id='f3381b10-b121-4b58-af66-25df78451ac3', metadata={'source': 'new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt'}, page_content='April 28, 2023\n\nVC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source.\n\nApril 25, 2023\n\nCalled ChatGPT Business, OpenAI describes the forthcoming offering as “for professionals who need more control over their data as well as enterprises seeking to manage their end users.”'),
 Document(id='81aba3

> Let's compare the last result returned by each of the two retrievers.

In [None]:
relevant_docs[-1].page_content

'Get your TechCrunch fix IRL. Join us at Disrupt 2023 in San Francisco this September to immerse yourself in all things startup. From headline interviews to intimate roundtables to a jam-packed startup expo floor, there’s something for everyone at Disrupt. Save up to $800 when you buy your pass now through May 15, and save 15% on top of that with promo code WIR. Learn more.'

In [None]:
mmr_relevant_docs[-1].page_content

'Even simple systems have the ability to surprise you. Especially when you have simple systems when a large number of them are interacting with each other. I’ve found myself not necessarily recognizing when these emergent properties will come, but I will say that whenever something gets monetized, you should anticipate there will be emergent properties and possibly unexpected behavior, all driven by greed.\n\nLet me ask you about some some other stuff you’re working on. I’m always happy when I see cutting-edge tech being applied to people who need it, people with disabilities, people who like just have not been addressed by the current use cases of tech. Are you still working in the accessibility community?'

### Refine the Response Using an LLM

In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Setting up a system prompt
system_prompt = (
    "You are Luma, a friendly and knowledgeable AI assistant. "
    "Answer questions using only the provided context. "
    "If the answer is not in the context, respond with 'I don’t know.' "
    "Keep answers concise, clear, and easy to understand. "
    "Context: {context}"
)

# Chat Prompt Template
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

retriever = vector_db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 2, "score_threshold": 0.5}
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, question_answer_chain)
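
Conceptually, the chain above does two things before calling the LLM: it keeps only retrieved chunks whose relevance score clears the threshold, and it "stuffs" their text into the `{context}` slot of the system prompt. The sketch below imitates that flow in plain Python under stated assumptions: `Doc`, `filter_by_threshold`, and `stuff_documents` are hypothetical helpers, the scores are made up, and the double-newline separator between chunks is an assumption rather than something shown in this notebook.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str

SYSTEM_TEMPLATE = (
    "Answer questions using only the provided context. "
    "Context: {context}"
)

def filter_by_threshold(scored_docs, k, score_threshold):
    """Keep at most k documents whose score is at least score_threshold."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:k]]

def stuff_documents(docs, question):
    """Join all chunk texts into one context string and build the chat messages."""
    context = "\n\n".join(doc.page_content for doc in docs)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]

# Made-up relevance scores in [0, 1] for illustration only
scored = [
    (Doc("StarCoder is a code generation model."), 0.82),
    (Doc("Disrupt 2023 takes place in San Francisco."), 0.31),
    (Doc("StarCoder was trained on 512 Nvidia V100 GPUs."), 0.74),
]

context_docs = filter_by_threshold(scored, k=2, score_threshold=0.5)
messages = stuff_documents(context_docs, "What is StarCoder?")
for role, content in messages:
    print(role, "->", content)
```

If no chunk clears the threshold, the context is empty, which is the case the "I don’t know." instruction in the system prompt is meant to cover.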

In [None]:
response = chain.invoke({"input": "Can you give summary of new Hugging Face Code completion Model?"})
print(response)

{'input': 'Can you give summary of new Hugging Face Code completion Model?', 'context': [Document(id='7c51b641-bee9-4dd4-a664-08ef897be7c5', metadata={'source': 'new_articles/05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt'}, page_content='“One thing we learned from releases such as Stable Diffusion last year is the creativity and capability of the open-source community,” von Werra told TechCrunch in an email interview. “Within weeks of the release the community had built dozens of variants of the model as well as custom applications. Releasing a powerful code generation model allows anybody to fine-tune and adapt it to their own use-cases and will enable countless downstream applications.”\n\nBuilding a model\n\nStarCoder is a part of Hugging Face’s and ServiceNow’s over-600-person BigCode project, launched late last year, which aims to develop “state-of-the-art” AI systems for code in an “open and responsible” way. Hugging Face supplied an in-house compute 

In [None]:
response["context"][0].page_content

'“One thing we learned from releases such as Stable Diffusion last year is the creativity and capability of the open-source community,” von Werra told TechCrunch in an email interview. “Within weeks of the release the community had built dozens of variants of the model as well as custom applications. Releasing a powerful code generation model allows anybody to fine-tune and adapt it to their own use-cases and will enable countless downstream applications.”\n\nBuilding a model\n\nStarCoder is a part of Hugging Face’s and ServiceNow’s over-600-person BigCode project, launched late last year, which aims to develop “state-of-the-art” AI systems for code in an “open and responsible” way. Hugging Face supplied an in-house compute cluster of 512 Nvidia V100 GPUs to train the StarCoder model.'

In [None]:
response["answer"]

"Hugging Face, as part of the BigCode project with ServiceNow, has released StarCoder, a powerful open-source code generation model. It was trained using 512 Nvidia V100 GPUs. StarCoder's code repositories, training framework, dataset-filtering methods, code evaluation suite, and research analysis notebooks are available on GitHub. While it may not have as many features as GitHub Copilot at launch, its open-source nature allows the community to fine-tune, adapt, and improve it. However, the model may produce inaccurate, offensive, misleading content, PII, and malicious code."

In [None]:
response = chain.invoke({"input": "How much money did OpenAI raise?"})
response["answer"]

'OpenAI raised just over $300 million from VC firms including Sequoia Capital, Andreessen Horowitz, Thrive, K2 Global, and Founders Fund. This is separate from a Microsoft investment of around $10 billion.'

> You can also save the ChromaDB directory as a zip file and reuse it later by unzipping and reloading it.
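
One way to do this is with Python's standard `shutil` module. The sketch below is a minimal example run on a throwaway directory that contains a placeholder file rather than a real database. The helper names and the `chromadb_backup` archive name are hypothetical; in practice the source directory would be the `chromadb` folder created earlier.

```python
import shutil
import tempfile
from pathlib import Path

def archive_directory(source_dir, archive_base):
    """Zip the contents of source_dir; returns the path of the created .zip file."""
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(source_dir))

def restore_directory(archive_path, target_dir):
    """Unpack a zip archive into target_dir (created if it does not exist)."""
    shutil.unpack_archive(str(archive_path), str(target_dir), "zip")

# Demonstration on a temporary directory with a placeholder file
workspace = Path(tempfile.mkdtemp())
source = workspace / "chromadb"
source.mkdir()
(source / "chroma.sqlite3").write_bytes(b"placeholder")

archive_path = archive_directory(source, workspace / "chromadb_backup")
restored = workspace / "restored"
restore_directory(archive_path, restored)

restored_files = sorted(p.name for p in restored.iterdir())
print(archive_path, restored_files)
```

After restoring, the directory can be passed as `persist_directory` when constructing `Chroma`, in the same way the persisted database was reloaded earlier.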