# Demo: Retrieval-Augmented Generation (RAG) on a Local Machine

- RAG improves freshness, accuracy, and traceability by fetching relevant evidence first and then generating an answer that points back to its sources.

## 1. Set Up Environment

Install the packages needed to build the RAG pipeline:

- `langchain`: Core framework for building LLM apps (chains, prompts, runnables). You use it for text splitters, message types, and composing the RAG flow.

- `langchain_community`: Community-maintained integrations that were split out of langchain. Includes loaders (e.g., PyPDFLoader) and many third-party connectors you call in your code.

- `langchain-ollama`: LangChain’s native driver for Ollama. Gives you ChatOllama (talk to local LLMs like llama3.2) and OllamaEmbeddings (create embeddings locally).

- `sentence-transformers`: Embedding models & utilities (Hugging Face) used for semantic search/reranking. Even if you embed via Ollama, this is handy for alternatives or upgrades (e.g., cross-encoders).

- `chromadb`: The actual vector database engine. Stores embeddings and supports fast similarity search (kNN) for retrieval in your RAG pipeline.

- `langchain_chroma`: LangChain <-> Chroma adapter. Provides the Chroma vector store class you import and use from LangChain to talk to chromadb.

- `pypdf`: PDF parsing. Lets you read pages/text from PDFs so you can chunk them and feed them into embeddings/vector DB.

In [9]:
%pip install -U langchain langchain_community langchain-ollama sentence-transformers chromadb langchain_chroma pypdf
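A quick way to confirm the install worked is to check that each package can be located by Python. The sketch below is only a convenience check; the helper name `find_missing` is hypothetical, and the module names are the import names that correspond to the pip packages listed above.

```python
import importlib.util

# Import names for the packages installed above
# (pip names containing "-" are imported with "_").
REQUIRED_MODULES = [
    "langchain",
    "langchain_community",
    "langchain_ollama",
    "sentence_transformers",
    "chromadb",
    "langchain_chroma",
    "pypdf",
]

def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED_MODULES)
print("Missing packages:", missing or "none")
```

If the list is not empty, re-run the install cell and restart the kernel before continuing.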




## 2. Load the PDF File

- Ingest the document source, split it into small chunks, create vector embeddings, and store them in a vector database (e.g., Chroma). Keep metadata (title, URL, page) so answers can cite their sources later.

In [10]:
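import os
import warnings

# PyPDFLoader ships in langchain_community; the splitter ships in langchain_text_splitters
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')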

In [11]:
# File path for the document
file_path = r"D:\Users\60175\Sharing_Demo\RAG\data\TheLittlePrince.pdf" #Replace with your file path

In [12]:
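# Load the PDF into Document objects that carry source and page metadata.
# load_and_split() also runs LangChain's default text splitter over long pages.
loader = PyPDFLoader(file_path)
pages = loader.load_and_split()
len(pages)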

Ignoring wrong pointing object 9 0 (offset 0)
Ignoring wrong pointing object 1024 0 (offset 0)
Ignoring wrong pointing object 1026 0 (offset 0)


64

In [None]:
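# Split pages into overlapping chunks (sizes are measured in characters)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # upper bound on the length of each chunk
    chunk_overlap=100,  # up to this many characters are repeated between neighbouring chunks
)
chunks = text_splitter.split_documents(pages)

len(chunks)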

134
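The effect of `chunk_size` and `chunk_overlap` can be pictured as a sliding window over the text. The sketch below is a simplified stand-in, not the splitter used above: `RecursiveCharacterTextSplitter` additionally tries to break on paragraph, line, and word boundaries, which this hypothetical `chunk_text` helper ignores.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each piece starts with the last three characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.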

## 3. Embeddings

### Workflow Overview:
- Step 1: Generate embeddings using a pre-trained model (e.g., nomic-embed-text).
- Step 2: Store the embeddings in ChromaDB for efficient retrieval and similarity calculations.
- Step 3: Use the stored embeddings to perform searches, matching, or context-based retrieval.
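The three steps can be illustrated without any model or database. The sketch below uses made-up 3-dimensional vectors in place of real embeddings (real models return hundreds of dimensions) and ranks them by cosine similarity; the names `cosine_similarity`, `top_k`, and `store` are hypothetical, and Chroma's own distance metric and index may differ from this.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """store maps chunk id -> embedding; return the k most similar chunk ids."""
    scored = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]

# Steps 1 and 2: toy "embeddings" kept in a dict (hypothetical values)
store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}

# Step 3: search with a query vector
print(top_k([1.0, 0.05, 0.0], store, k=2))
```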

In [39]:
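# Embedding model served by a local Ollama instance
# (pull it first with: ollama pull nomic-embed-text)
# langchain_ollama also exports an OllamaEmbeddings class; the community class
# is kept here because this cell relies on its show_progress option.
from langchain_community.embeddings import OllamaEmbeddings
import os

# Both settings can be overridden through environment variables
OLLAMA_BASE = os.getenv("OLLAMA_BASE", "http://localhost:11434")
TEXT_EMBEDDING_MODEL = os.getenv('TEXT_EMBEDDING_MODEL', 'nomic-embed-text')

embedding = OllamaEmbeddings(model=TEXT_EMBEDDING_MODEL, base_url=OLLAMA_BASE, show_progress=True)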


## 4. Build and Store in Vector Store (ChromaDB)

ChromaDB is an open-source vector database that stores embeddings together with their source text and metadata, and returns the stored items closest to a query embedding. Here it holds the chunk embeddings produced in step 3 and serves them back at question time.

### Key Features of ChromaDB:
- Scalability: Indexes embeddings so that similarity search does not have to compare the query against every stored vector.
- Speed: Returns the top-k most similar chunks quickly enough for interactive use on a corpus of this size.
- Integration: Accepts embeddings from any model (here, one served by Ollama) and plugs into LangChain through the `langchain_chroma` adapter.

In [40]:
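from langchain_chroma import Chroma  # LangChain <-> Chroma adapter installed in step 1

# Create (and persist) a Chroma vector store from the chunks.
# Note: re-running this cell against the same directory adds the chunks again,
# which can surface as duplicate hits at retrieval time.
PERSIST_DIR = "./chroma_db"
db = Chroma.from_documents(chunks, embedding, persist_directory=PERSIST_DIR)
print("ChromaDB created with document embeddings at:", PERSIST_DIR)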

OllamaEmbeddings: 100%|██████████| 134/134 [00:03<00:00, 39.06it/s]


ChromaDB created with document embeddings at: ./chroma_db
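If the indexing cell is run more than once against the same persist directory, the same chunks may end up stored several times. One common guard is to give every chunk a deterministic ID derived from its content, so a repeated chunk maps to the same key. The sketch below shows only the ID scheme with plain tuples; `chunk_id` and `unique_by_id` are hypothetical helpers, and passing such IDs to the vector store (for example through an `ids` argument) is an assumption to verify against the `langchain_chroma` documentation.

```python
import hashlib

def chunk_id(source, page, text):
    """Stable ID: the same source, page, and text always hash to the same value."""
    key = f"{source}|{page}|{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]

def unique_by_id(records):
    """records: iterable of (source, page, text). Keep the first occurrence of each ID."""
    seen = {}
    for source, page, text in records:
        seen.setdefault(chunk_id(source, page, text), (source, page, text))
    return seen

# Hypothetical records: the first two are identical, as after a repeated indexing run
records = [
    ("TheLittlePrince.pdf", 63, "Saint-Exupery also wrote poetic novels."),
    ("TheLittlePrince.pdf", 63, "Saint-Exupery also wrote poetic novels."),
    ("TheLittlePrince.pdf", 10, "Draw me a sheep."),
]
deduplicated = unique_by_id(records)
print(len(records), "records ->", len(deduplicated), "unique chunks")
```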


## 5. Retrieving Documents

- When a user asks a question, transform the query if needed (rewrite/expand), then retrieve the top-k relevant chunks via semantic (or hybrid) search. 
- Optionally rerank the candidates.
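The optional rerank step can be sketched with a deliberately simple scorer. The example below drops exact duplicates and then orders candidates by the share of question words they contain. This word-overlap score is a toy stand-in chosen for illustration, not a cross-encoder, and the candidate strings are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rerank(question, candidates, top_n=3):
    """Remove exact duplicates, then sort by overlap with the question's tokens."""
    q_tokens = tokenize(question)

    def score(text):
        if not q_tokens:
            return 0.0
        return len(q_tokens & tokenize(text)) / len(q_tokens)

    unique = list(dict.fromkeys(candidates))  # keeps first occurrence, preserves order
    return sorted(unique, key=score, reverse=True)[:top_n]

candidates = [
    "The fox asked to be tamed.",
    "Saint-Exupery is the author of the book.",
    "The fox asked to be tamed.",
]
reranked = rerank("Who is the author?", candidates, top_n=2)
print(reranked)
```

In practice a learned reranker, such as a cross-encoder from `sentence-transformers`, would replace the `score` function while the surrounding flow stays the same.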

In [41]:
# Simple retrieval preview
user_question = "Who is the author?" 
retrieved_docs = db.similarity_search(user_question, k=10) # k is the number of documents to retrieve

OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00, 36.94it/s]


In [72]:
# Display top results
for i, doc in enumerate(retrieved_docs[:3]): # Display top 3 results
    print(f"Document {i+1}:\n{doc.page_content[:1000]}") # Display content

Document 1:
aviation buffs who helped authorities identify the debris, told the 
Associated Press. "We never even imagined this." Castellano said some 
Saint-Exupery fans resisted the effort to identify the wreck, preferring to 
keep the mystery alive. "In the end, I think everyone is satisfied," he said. 
"We didn't find a body, so the myth surrounding his disappearance will 
live on." Saint-Exupery also wrote poetic novels based on his flying 
adventures, such as "Wind, Sand, and Stars" and "Night Flight." A new 
opera based on "The Little Prince" opened in Houston, Texas, last year.
Document 2:
aviation buffs who helped authorities identify the debris, told the 
Associated Press. "We never even imagined this." Castellano said some 
Saint-Exupery fans resisted the effort to identify the wreck, preferring to 
keep the mystery alive. "In the end, I think everyone is satisfied," he said. 
"We didn't find a body, so the myth surrounding his disappearance will 
live on." Saint-Exupery als

In [43]:
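import os

# `doc` and `i` are left over from the loop in the previous cell,
# so this prints the metadata of the last result displayed there.
page = doc.metadata.get("page")
page = page + 1 if isinstance(page, int) else page  # PyPDFLoader page numbers are 0-based
src = os.path.basename(doc.metadata.get("source", ""))
preview = doc.page_content.strip().replace("\n", " ")[:300]

print(f"Document {i+1} | p{page} | {src}\n{preview}\n")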

Document 3 | p64 | TheLittlePrince.pdf
aviation buffs who helped authorities identify the debris, told the  Associated Press. "We never even imagined this." Castellano said some  Saint-Exupery fans resisted the effort to identify the wreck, preferring to  keep the mystery alive. "In the end, I think everyone is satisfied," he said.  "We 



## 6. Preparing Content for GenAI

In [44]:
# Assemble the retrieved chunks into one context string
# (content only; no source or page markers are added here).
def _get_document_prompt(docs):
    prompt = "\n"
    for doc in docs:
        prompt += "\nContent:\n"
        prompt += doc.page_content + "\n\n"
    return prompt

# Generate a formatted context from the retrieved documents
formatted_context = _get_document_prompt(retrieved_docs)
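Since the prompt in step 7 asks the model to cite a file name and page numbers, the context can carry those markers explicitly. The sketch below is one possible variant, not the formatter used in this notebook: `Doc` is a small stand-in for a LangChain document so the example runs on its own, `format_context_with_sources` is a hypothetical name, and the sample text is made up. It assumes, as the retrieval preview does, that the `page` metadata is 0-based.

```python
import os
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_context_with_sources(docs):
    """Prefix every chunk with a numbered source/page marker the model can cite."""
    blocks = []
    for n, doc in enumerate(docs, start=1):
        source = os.path.basename(doc.metadata.get("source", "")) or "unknown"
        page = doc.metadata.get("page")
        page_label = page + 1 if isinstance(page, int) else "?"  # 0-based -> 1-based
        blocks.append(
            f"[{n}] Source: {source}, Page: {page_label}\n{doc.page_content.strip()}"
        )
    return "\n\n".join(blocks)

sample_docs = [Doc("He was a pilot.", {"source": "data/TheLittlePrince.pdf", "page": 63})]
context_with_sources = format_context_with_sources(sample_docs)
print(context_with_sources)
```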

## 7. Prompting model

- Build the model prompt with the user question + the retrieved chunks + instructions like “answer only from the context” and include source identifiers (so the model can cite).

In [58]:
# Write the prompt for the RAG model

prompt_example = f"""
## SYSTEM
You are a knowledgeable and factual assistant for the book "The Little Prince".
Answer **only** using the provided CONTEXT. If the answer cannot be found in the context,
reply exactly: "The provided context does not contain this information."

## USER QUESTION
{user_question}

## CONTEXT
{formatted_context}

## REQUIREMENTS
- Be concise and clear (Markdown).
- Include **Source** with file name and page numbers used.
- No speculation.

## RESPONSE FORMAT
'''
# [Brief Title of the Answer]
[Answer in simple, clear text.]

**Source**:
• [Book Title], Page(s): [...]
'''
"""
print("Prompt constructed.")

Prompt constructed.
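An f-string is evaluated once, at the moment the cell runs, so changing `user_question` afterwards does not change the prompt. Wrapping the template in a function makes it easy to rebuild the prompt for each new question. The sketch below uses a shortened version of the template above; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the sample context is made up.

```python
PROMPT_TEMPLATE = """## SYSTEM
Answer only using the provided CONTEXT. If the answer cannot be found in the context,
reply exactly: "The provided context does not contain this information."

## USER QUESTION
{question}

## CONTEXT
{context}
"""

def build_prompt(question, context):
    """Fill the template with the current question and retrieved context."""
    return PROMPT_TEMPLATE.format(question=question.strip(), context=context.strip())

first = build_prompt("Who is the author?", "Content:\nSome retrieved text.")
second = build_prompt("Where does the fox live?", "Content:\nSome retrieved text.")
print(second)
```

Each call produces a fresh prompt, so a loop over several questions only needs to retrieve new context and call `build_prompt` again.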


## 8. Tune parameters (LLM via Ollama)

In [59]:
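# Set up the Ollama chat model and its sampling parameters

from langchain_ollama import ChatOllama  # provided by the langchain-ollama package from step 1

GEN_MODEL = os.getenv("GEN_MODEL", "llama3.2")  # Replace with your model name
# GEN_MODEL = os.getenv("GEN_MODEL", "qwen2.5")  # Alternative model

llm = ChatOllama(
    model=GEN_MODEL,
    base_url=OLLAMA_BASE,
    temperature=0.2,   # lower values make the output more deterministic
    top_p=0.9,         # nucleus sampling: keep the top 90% of probability mass
    num_predict=512,   # maximum number of tokens to generate
)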
print("LLM ready:", GEN_MODEL)


LLM ready: llama3.2
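To build intuition for `temperature` and `top_p`, the sketch below applies both to three made-up token scores. It illustrates the general idea only: the logits are hypothetical, the function names are chosen for this example, and the sampler inside Ollama may combine these parameters differently.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.2):
    probs = softmax_with_temperature(logits, t)
    survivors = sorted(top_p_filter(probs, 0.9))
    print(f"temperature={t}: probs={[round(p, 3) for p in probs]}, kept tokens={survivors}")
```

With the lower temperature almost all probability moves to the highest-scoring token, which is why small values suit factual, grounded answers.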


## 9. Generate

- Ask the LLM to produce an answer grounded in those retrieved snippets. 
- This combines parametric memory (what’s in the model) with non-parametric memory (your external knowledge base).

In [60]:
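# Generate the answer. prompt_example was already formatted in step 7 with the
# earlier question ("Who is the author?"), so this reassignment only changes the
# question recorded in step 10, not the text sent to the model.
user_question = "Please tell me who is the author of the book?"
answer = llm.invoke(prompt_example)
print(answer)  # prints the full message object; use answer.content for the text alone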

content='## Who is the Author of "The Little Prince"?\nAntoine de Saint-Exupéry.\n\n**Source**: \n• Wreck Proves To Be Saint-Exupery\'s P-38!, Page 64.' additional_kwargs={} response_metadata={'model': 'llama3.2', 'created_at': '2025-10-16T21:45:00.9834516Z', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': 1560022500, 'load_duration': 42964400, 'prompt_eval_count': 1831, 'prompt_eval_duration': 790278200, 'eval_count': 47, 'eval_duration': 724802100} id='run--c64da30a-a3c9-4615-829b-83258337d897-0'
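The printed object bundles the answer text (`content`) with run metadata. When the model follows the response format from step 7, the text can be split into the answer body and the cited source. The sketch below works on a copy of the content string shown above; `split_answer` is a hypothetical helper and assumes the `**Source**:` marker is present when a source was given.

```python
def split_answer(text):
    """Split a response in the step 7 format into its body and its source line(s)."""
    body, marker, source = text.partition("**Source**:")
    return {
        "body": body.strip(),
        "source": source.strip() if marker else None,  # None when no marker was found
    }

# Content string copied from the output above
sample_content = (
    '## Who is the Author of "The Little Prince"?\n'
    "Antoine de Saint-Exupéry.\n\n"
    "**Source**: \n"
    "• Wreck Proves To Be Saint-Exupery's P-38!, Page 64."
)
parts = split_answer(sample_content)
print(parts["body"])
print(parts["source"])
```

Keeping the source separate makes it straightforward to check the citation against the retrieved chunks' metadata.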


## 10. Format the Response into a DataFrame

In [71]:
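import pandas as pd

# Collect the question, the answer text, and the model name in a one-row DataFrame
df = pd.DataFrame([
    {"Question": user_question, "Answer": answer.content, "Model": GEN_MODEL},
])

for idx, row in df.iterrows():
    print(f"row {idx + 1}:")
    print(f"User Question: {row['Question']}")
    print(f"Answer: {row['Answer']}")
    print(f"LLM Model: {row['Model']}")
    print("-" * 50)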

row 1:
User Question: Please tell me who is the author of the book?
Answer: ## Who is the Author of "The Little Prince"?
Antoine de Saint-Exupéry.

**Source**: 
• Wreck Proves To Be Saint-Exupery's P-38!, Page 64.
LLM Model: llama3.2
--------------------------------------------------
