In [None]:
"""
This script demonstrates how to build a local document retrieval system using open-source tools.
It uses LangChain with Chroma as the in-memory vector store for storing and querying embeddings.
Documents are defined manually and represent short knowledge entries.
Embeddings are generated locally with the all-MiniLM-L6-v2 sentence-transformers model from Hugging Face, which runs on CPU.
No API key is required. The model weights are downloaded from the Hugging Face Hub on first use and cached, after which embedding runs locally.
Chroma stores the embedded vectors and allows fast similarity-based search.
The retriever fetches top-k documents relevant to a natural language query.
The script prints results from both retriever-style access and raw similarity search.
This setup is useful for question answering, semantic search, or chatbot context retrieval.
It can be extended to handle large-scale document collections or integrated with LLMs.
"""
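
To make the similarity-search idea in the description concrete, here is a minimal pure-Python sketch. It is not how Chroma or MiniLM work internally: `toy_embed` is a hypothetical stand-in that uses word counts instead of a learned embedding, and `top_k` is a hypothetical brute-force scan rather than an indexed search. It only illustrates the embed, score, and rank pattern that the rest of the notebook delegates to the libraries.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, texts, k=2):
    """Score every text against the query and keep the k highest-scoring ones."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(t)), t) for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


texts = [
    "LangChain helps developers build LLM applications easily.",
    "Chroma is a vector database optimized for LLM-based search.",
    "Embeddings convert text into high-dimensional vectors.",
    "Open-source models provide local embedding generation.",
]

hits = top_k("What is Chroma used for?", texts, k=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

A word-count vector only matches on shared words, so it cannot recognize paraphrases. A learned embedding model such as MiniLM is used in the notebook because it places semantically related sentences close together even when they share few words.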

In [3]:
!pip install langchain langchain-community chromadb sentence-transformers



In [4]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.documents import Document

In [5]:
# Step 1: Your source documents
documents = [
    Document(page_content="LangChain helps developers build LLM applications easily."),
    Document(page_content="Chroma is a vector database optimized for LLM-based search."),
    Document(page_content="Embeddings convert text into high-dimensional vectors."),
    Document(page_content="Open-source models provide local embedding generation."),
]

In [6]:
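# Step 2: Initialize local Hugging Face embedding model (runs on CPU)
# Note: recent LangChain releases deprecate this class in langchain_community
# in favor of the separate langchain-huggingface package.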
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")


In [7]:
# Step 3: Create Chroma vector store in memory
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embedding_model,
    collection_name="my_collection"
)

In [8]:
# Step 4: Convert vectorstore into a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

In [9]:
# Step 5: Define a query and retrieve documents
query = "What is Chroma used for?"

In [10]:
results = retriever.invoke(query)

In [11]:
for i, doc in enumerate(results):
    print(f"\n--- Retriever Result {i+1} ---")
    print(doc.page_content)


--- Retriever Result 1 ---
Chroma is a vector database optimized for LLM-based search.

--- Retriever Result 2 ---
LangChain helps developers build LLM applications easily.


In [12]:
# Optional: Direct similarity search
results = vectorstore.similarity_search(query, k=2)

In [13]:
for i, doc in enumerate(results):
    print(f"\n--- Similarity Search Result {i+1} ---")
    print(doc.page_content)


--- Similarity Search Result 1 ---
Chroma is a vector database optimized for LLM-based search.

--- Similarity Search Result 2 ---
LangChain helps developers build LLM applications easily.
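
The two result listings above are identical because a retriever created with default settings forwards the query to the vector store's similarity search. The sketch below shows that wrapper pattern in plain Python. `ToyStore` and `ToyRetriever` are hypothetical classes written for illustration, ranking by shared words rather than embeddings; they are a simplification and not LangChain's actual implementation.

```python
class ToyStore:
    """Hypothetical store that ranks texts by the number of words shared with the query."""

    def __init__(self, texts):
        self.texts = list(texts)

    def similarity_search(self, query, k=4):
        q_words = set(query.lower().rstrip("?").split())

        def overlap(text):
            return len(q_words & set(text.lower().rstrip(".").split()))

        # sorted() is stable, so ties keep their original order.
        return sorted(self.texts, key=overlap, reverse=True)[:k]

    def as_retriever(self, search_kwargs=None):
        return ToyRetriever(self, search_kwargs or {})


class ToyRetriever:
    """Thin wrapper: stores the search settings and delegates to the store."""

    def __init__(self, store, search_kwargs):
        self.store = store
        self.search_kwargs = search_kwargs

    def invoke(self, query):
        return self.store.similarity_search(query, **self.search_kwargs)


store = ToyStore([
    "LangChain helps developers build LLM applications easily.",
    "Chroma is a vector database optimized for LLM-based search.",
])
query = "What is Chroma used for?"

via_retriever = store.as_retriever(search_kwargs={"k": 2}).invoke(query)
via_search = store.similarity_search(query, k=2)
print(via_retriever == via_search)
```

The retriever form is mainly useful when the search settings should be fixed once and the object passed to other components, while the direct call is convenient for ad hoc queries.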
