<a href="https://colab.research.google.com/github/AryaJeet1364/VectorDBs/blob/main/ChromaDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ChromaDB Hands-on -- Intro to Vector Databases

In this project, I built a local, document-based question answering (QA) system with LangChain, ChromaDB, and free Hugging Face models, with no paid API keys required. The notebook loads raw text files, splits them into overlapping chunks, embeds each chunk with a sentence-transformers model (all-MiniLM-L6-v2), stores and retrieves the chunks with ChromaDB, and passes the retrieved chunks to a local LLM (flan-t5-base) to answer natural-language questions. Together these steps form a small retrieval-augmented generation (RAG) pipeline that can be rerun from scratch and never sends the documents to an external API.

## Imports

In [1]:
!pip install -q langchain langchain-community chromadb sentence-transformers tiktoken

In [2]:
!pip show chromadb

Name: chromadb
Version: 1.0.12
Summary: Chroma.
Home-page: https://github.com/chroma-core/chroma
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: /usr/local/lib/python3.11/dist-packages
Requires: bcrypt, build, fastapi, grpcio, httpx, importlib-resources, jsonschema, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, opentelemetry-sdk, orjson, overrides, posthog, pydantic, pypika, pyyaml, rich, tenacity, tokenizers, tqdm, typer, typing-extensions, uvicorn
Required-by: 


## Loading Data

Sample data: a zip archive of plain-text news articles.

In [3]:
!wget -q https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip

In [4]:
!unzip -q new_articles.zip -d new_articles

Loading Text Files

In [5]:
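# The loaders live in langchain-community (installed above); the langchain.* path is deprecated
from langchain_community.document_loaders import DirectoryLoader, TextLoader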

loader = DirectoryLoader("new_articles", glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
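
To make the loading step less of a black box, here is a minimal standard-library sketch of roughly what it does: read every `*.txt` file in a directory and keep the text alongside its source path. The function name `load_text_files` and the dict layout are assumptions for illustration, not LangChain's implementation (LangChain returns `Document` objects with `page_content` and `metadata`).

```python
from pathlib import Path
import tempfile

def load_text_files(directory, pattern="*.txt"):
    """Read every file matching `pattern` into a dict holding its text and source path."""
    docs = []
    for path in sorted(Path(directory).glob(pattern)):
        docs.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return docs

# Demo on a throwaway directory so the sketch runs without the downloaded articles.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first article", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second article", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored by the *.txt pattern", encoding="utf-8")
    demo_docs = load_text_files(tmp)

demo_names = [Path(d["metadata"]["source"]).name for d in demo_docs]
print(len(demo_docs), demo_names)
```

The `source` entry in the metadata is what the QA step later prints as the origin of each answer.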

## Creating Vector DB

Splitting Text

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = splitter.split_documents(documents)
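
The effect of `chunk_size=1000` and `chunk_overlap=200` can be illustrated with a simplified sliding-window splitter. This is only a sketch: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this hypothetical `chunk_text` helper does not attempt.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split `text` into windows of at most `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring windows share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))  # 2500 characters of filler
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.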

Embedding Model -- Free

In [7]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
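
An embedding model maps each chunk to a vector so that texts with similar meaning end up close together. The sketch below shows one common way to measure that closeness, cosine similarity, on made-up 3-dimensional vectors. The vectors and variable names are hypothetical; real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", invented for illustration only.
funding_query = [0.9, 0.1, 0.0]
funding_chunk = [0.8, 0.2, 0.1]
weather_chunk = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(funding_query, funding_chunk)
sim_unrelated = cosine_similarity(funding_query, weather_chunk)
print(sim_related > sim_unrelated)  # True
```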


Vector Store: ChromaDB

In [8]:
from langchain_community.vectorstores import Chroma

persist_directory = "db"

vectordb = Chroma.from_documents(
    documents=texts,
    embedding=embedding,
    persist_directory=persist_directory
)

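# Chroma 0.4+ writes to persist_directory automatically, so no explicit persist() call is needed
vectordb = None  # Drop the reference; the next cell reloads the store from disk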


## Making a Retriever

In [9]:
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

retriever = vectordb.as_retriever(search_kwargs={"k": 2})
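
Conceptually, a retriever with `k=2` embeds the query and returns the two stored chunks whose vectors are nearest to it. The brute-force sketch below illustrates that idea with toy 2-dimensional vectors. The helper names, the data, and the choice of squared Euclidean distance are assumptions for illustration; a real vector store typically uses an approximate nearest-neighbour index instead of scanning every vector.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, store, k=2):
    """Return the k (distance, text) pairs in `store` closest to `query_vec`."""
    scored = ((squared_l2(query_vec, vec), text) for text, vec in store)
    return heapq.nsmallest(k, scored)

# Hypothetical store of (chunk text, embedding) pairs.
toy_store = [
    ("chunk about funding rounds", [0.9, 0.1]),
    ("chunk about chatbots", [0.7, 0.4]),
    ("chunk about weather", [0.0, 1.0]),
]

results = top_k([1.0, 0.0], toy_store, k=2)
print([text for _, text in results])
# ['chunk about funding rounds', 'chunk about chatbots']
```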



## Making a Chain

LLM

In [10]:
# Load the model locally (no API key needed)
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline

# Load a free & public model like flan-t5-base
local_llm_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", max_length=256)

llm = HuggingFacePipeline(pipeline=local_llm_pipeline)

Device set to use cpu


Retrieval QA

In [11]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,  # from ChromaDB setup
    return_source_documents=True
)

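# Ask a question
query = "How much money did Microsoft raise?"
response = qa_chain.invoke({"query": query})  # dict with "result" and "source_documents"

# Print the answer, then the files the retrieved chunks came from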
def process_llm_response(llm_response):
    print("Answer:", llm_response['result'])
    print("\nSources:")
    for source in llm_response["source_documents"]:
        print("-", source.metadata['source'])

process_llm_response(response)



Answer: VC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source. April 25, 2023 Called ChatGPT Business, OpenAI describes the forthcoming offering as “for professionals who need more control over their data as well as enterprises seeking to manage their end users.”

Sources:
- new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt
- new_articles/05-04-microsoft-doubles-down-on-ai-with-new-bing-features.txt
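
The `chain_type="stuff"` setting means all retrieved chunks are pasted ("stuffed") into a single prompt together with the question. A rough sketch of that assembly step follows. The prompt wording, the helper name `build_stuff_prompt`, and the placeholder chunks are assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder strings standing in for the k=2 chunks returned by the retriever.
retrieved_chunks = [
    "First retrieved chunk text.",
    "Second retrieved chunk text.",
]

prompt = build_stuff_prompt("How much money did Microsoft raise?", retrieved_chunks)
print(prompt)
```

Because every chunk goes into one prompt, `k` and `chunk_size` together determine how long the prompt gets, which matters for a small model with a limited input length.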


Quick Test

In [12]:
llm.invoke("What is the capital of France?")



'london'

## Deleting the DB

In [13]:
vectordb.delete_collection()  # Remove the collection and its stored embeddings; no persist() call is needed

!rm -rf db/