# Retrieval-Augmented Generation (RAG) with Vector Databases  
### A Practical, End-to-End Walkthrough using ChromaDB

Large Language Models (LLMs) are powerful reasoning engines, but they have a fundamental limitation:
they **do not have access to private, proprietary, or up-to-date data**.

By default, an LLM:
- Relies only on its training data (with a fixed knowledge cutoff)
- Cannot access internal documents or databases
- May produce generic or outdated answers for factual questions

**Retrieval-Augmented Generation (RAG)** addresses this limitation by grounding LLM responses in external data at inference time.

---

## What is RAG?

RAG is an architecture that combines:
1. **Information Retrieval** — finding relevant documents
2. **Language Generation** — reasoning over retrieved content

Instead of asking the model to “know everything”, we:
- Retrieve the most relevant documents for a query
- Inject those documents into the prompt
- Instruct the model to answer *using only the retrieved context*

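This tends to make answers:
- More accurate, because they draw on supplied documents rather than model memory
- More explainable, because the supporting passages are known
- Auditable, which matters for enterprise use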

---

## Where is RAG Useful?

RAG is particularly effective for:
- Financial reports and earnings analysis
- Internal knowledge bases
- Policy and compliance documents
- Research papers and technical documentation
- Any domain where **correctness matters more than creativity**

---

## Why Vector Databases?

To retrieve relevant documents efficiently, we need to compare texts by **semantic similarity**, not just by shared keywords.

Vector databases (like **ChromaDB**) enable this by:
- Storing documents as numerical embeddings
- Performing fast similarity search in high-dimensional space
- Scaling retrieval beyond simple keyword matching
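
To make "semantic similarity" concrete, the following minimal sketch compares vectors by cosine similarity, one common measure for embeddings. The three-dimensional vectors and their labels are invented for illustration; real embeddings have many more dimensions and come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", invented for illustration only.
toy_embeddings = {
    "capital ratio": [0.9, 0.1, 0.0],
    "CET1 buffer": [0.8, 0.2, 0.1],
    "technology spending": [0.0, 0.2, 0.9],
}

query_vec = toy_embeddings["capital ratio"]
for text, vec in toy_embeddings.items():
    print(f"{text!r}: {cosine_similarity(query_vec, vec):.3f}")
```

In this toy setup, "CET1 buffer" scores close to the query even though the two phrases share no words, which is the behaviour keyword matching cannot provide.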

In this notebook, we use **ChromaDB** as a lightweight vector database to demonstrate this workflow end-to-end.


In [10]:
import numpy as np
import os

import chromadb
from gensim.models import Word2Vec

from dotenv import load_dotenv
load_dotenv()

True

## Step 1: Prepare the Document Corpus

We start by defining a small set of documents derived from JPMorgan Chase’s Q1 2023 earnings performance.

Each document represents a **retrieval unit (chunk)** that the system can later search over.
In real systems, these would typically be paragraphs or sections extracted from PDFs, filings, or internal reports.
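
As a rough illustration of how longer source text might be split into retrieval units, the sketch below produces overlapping word windows. The function name `chunk_words` and its `chunk_size` and `overlap` parameters are hypothetical and not part of this notebook's pipeline; real systems often split on sentences, sections, or tokens instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into windows of `chunk_size` words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears intact in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Placeholder text (w0 ... w9) so the windowing is easy to follow.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```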


In [11]:
documents = [
    "In Q1 2023, JPMorgan Chase reported net income of $12.6 billion, or $4.10 earnings per share, reflecting strong profitability.",
    "Total net revenue for JPMorgan Chase in the first quarter of 2023 was $38.3 billion, up significantly year-over-year.",
    "Net interest income was approximately $20.7 billion in Q1 2023, up roughly 49% driven by higher interest rates.",
    "JPMorgan’s first-quarter 2023 return on equity was around 18%, and return on tangible common equity was about 23%.",
    "Capital levels remained well above regulatory minimums in Q1 2023, with a common equity tier 1 (CET1) ratio near 13.8%.",
    "Credit quality trends during the quarter remained stable with delinquency and charge-off rates largely unchanged.",
    "Noninterest expenses for Q1 2023 included higher investments in technology and staffing to support long-term growth.",
    "Segment performance in Q1 2023 showed resilience across consumer, commercial, and investment banking operations."
]


## Steps 2 and 3: Create Embeddings and Store Them in a Vector Database (ChromaDB)

Once embeddings are created, they are stored in a vector database.

ChromaDB allows us to:
- Bring our own embeddings
- Persist vectors alongside raw text
- Perform efficient similarity search at query time

For simplicity, this demonstration trains a small Word2Vec model on the corpus itself and represents each document as the average of its word vectors.

In [12]:

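# Whitespace tokenization keeps punctuation attached to words (e.g. "2023,").
tokenized_docs = [doc.lower().split() for doc in documents]

# Train a small Word2Vec model on the corpus itself (demonstration only).
w2v_model = Word2Vec(
    sentences=tokenized_docs,
    vector_size=100,
    window=5,
    min_count=1,  # keep every word; the corpus is tiny
    workers=2,
    epochs=200
)

def embed_text(text):
    """Embed text as the mean of its in-vocabulary word vectors."""
    words = text.lower().split()
    vectors = [w2v_model.wv[word] for word in words if word in w2v_model.wv]
    if not vectors:
        # No known words: return a zero vector instead of NaN from np.mean([]).
        return np.zeros(w2v_model.wv.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)

# One row per document.
doc_embeddings = np.vstack([embed_text(doc) for doc in documents])
doc_embeddings.shape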


(8, 100)
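
One limitation of splitting on whitespace is that punctuation stays attached to words, so a query token such as `2023?` does not match the vocabulary entries learned from the documents and is silently dropped. The sketch below shows one possible alternative tokenizer; the name `tokenize` and the regular expression are assumptions made for illustration, and swapping it into the pipeline would change the embeddings and therefore the retrieval results shown in this notebook.

```python
import re

# Runs of letters/digits, optionally joined by "." or "'" (keeps "13.8" and "jpmorgan's").
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[.'][a-z0-9]+)*")


def tokenize(text):
    """Lowercase, normalise curly apostrophes, and drop surrounding punctuation."""
    normalised = text.lower().replace("\u2019", "'")
    return TOKEN_PATTERN.findall(normalised)


sentence = "In Q1 2023, JPMorgan\u2019s CET1 ratio was near 13.8%."
print("whitespace:", sentence.lower().split())
print("regex:     ", tokenize(sentence))
```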

In [21]:
client = chromadb.Client()

# Reuse the collection if it already exists, otherwise create it.
collection = client.get_or_create_collection(name="rag_demo")

# upsert (rather than add) keeps re-runs of this cell from colliding on existing ids.
collection.upsert(
    documents=documents,
    ids=[str(i) for i in range(len(documents))],
    embeddings=doc_embeddings.tolist()
)

collection.count()


8


## Step 4: Retrieve Relevant Documents for a User Query

Given a user query:
- The query is embedded using the same embedding model
- The vector database returns the most similar documents
- These documents form the factual context for generation

At this point, we have **retrieval without generation**, which is already valuable on its own.

In [29]:
query = "How strong is JPMorgan's capital position in Q1 2023?"

query_embedding = embed_text(query)

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3
)

retrieved_docs = results["documents"][0]
retrieved_docs


['JPMorgan’s first-quarter 2023 return on equity was around 18%, and return on tangible common equity was about 23%.',
 'In Q1 2023, JPMorgan Chase reported net income of $12.6 billion, or $4.10 earnings per share, reflecting strong profitability.',
 'Capital levels remained well above regulatory minimums in Q1 2023, with a common equity tier 1 (CET1) ratio near 13.8%.']
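
Conceptually, the query step amounts to a nearest-neighbour search over the stored vectors. The sketch below performs that search exhaustively in plain Python, using squared Euclidean distance as one possible metric. The helper names and the two-dimensional vectors are hypothetical; vector databases typically replace this full scan with an approximate index so that search stays fast on large collections.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, distance) pairs for the k stored vectors closest to query_vec."""
    scored = ((squared_l2(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs))
    return [(idx, dist) for dist, idx in heapq.nsmallest(k, scored)]


# Toy 2-dimensional vectors, invented for illustration.
stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
for idx, dist in top_k([1.0, 1.0], stored, k=2):
    print(idx, round(dist, 3))
```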

## Step 5: LLM Generation Without RAG (Baseline)

Before using retrieval, we first observe how the LLM responds **without any external context**.

This serves as a baseline and highlights:
- Generic language
- Potentially outdated or vague answers
- Lack of traceability


#### OpenAI Client Setup


In [30]:
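from dotenv import load_dotenv
load_dotenv()

import os
from openai import OpenAI

# Note: this rebinds `client`, which previously held the ChromaDB client.
# The `collection` object created earlier remains usable.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# The model name is read from the environment; fail early if it is missing.
MODEL_NAME = os.getenv("OPENAI_MODEL_NAME")
if not MODEL_NAME:
    raise RuntimeError("Set OPENAI_MODEL_NAME in the environment or .env file.")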

In [31]:

prompt_no_rag = f"""
Answer the following question:

Question:
{query}
"""

response_no_rag = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": prompt_no_rag}],
    temperature=0.0
)

print(response_no_rag.choices[0].message.content)


As of Q1 2023, JPMorgan Chase reported a strong capital position, with a Common Equity Tier 1 (CET1) capital ratio of 13.0%. This ratio reflects the bank's solid capital base relative to its risk-weighted assets, indicating a robust ability to absorb losses and support ongoing operations. Additionally, the bank maintained a well-capitalized status, which is crucial for regulatory compliance and financial stability. Overall, JPMorgan's capital position in Q1 2023 was considered strong, supporting its growth and resilience in the financial sector.


## Step 6: LLM Response With RAG

Now we inject the retrieved documents into the prompt and explicitly instruct the model to answer **only using this context**.

This is the core RAG step:
- Retrieval + Generation
- Same model, different behavior
- Accuracy driven by documents, not model memory

In [33]:

context = "\n".join(retrieved_docs)

prompt_rag = f"""
Answer the question using ONLY the context below.
If the answer is not present, say "Not found in documents."

Context:
{context}

Question:
{query}
"""

response_rag = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": prompt_rag}],
    temperature=0.0
)

print(response_rag.choices[0].message.content)


JPMorgan's capital position in Q1 2023 is strong, with capital levels remaining well above regulatory minimums and a common equity tier 1 (CET1) ratio near 13.8%.
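
A small variation on this prompt can make answers easier to trace back to their sources: number each retrieved passage and ask the model to cite the numbers it used. The helper below is a sketch of that idea. The function name `build_rag_prompt` and the citation instruction are assumptions, not part of the notebook, and whether a given model follows the instruction reliably would need to be checked.

```python
def build_rag_prompt(question, passages):
    """Assemble a grounded prompt with numbered passages for citation."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using ONLY the context below.\n"
        "Cite the passage numbers you relied on, for example [2].\n"
        'If the answer is not present, say "Not found in documents."\n\n'
        f"Context:\n{numbered}\n\n"
        f"Question:\n{question}\n"
    )


# Placeholder passages; in the notebook these would be `retrieved_docs`.
example_prompt = build_rag_prompt(
    "How strong is the capital position?",
    ["First retrieved passage.", "Second retrieved passage."],
)
print(example_prompt)
```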


## Conclusion: Why RAG Works

This notebook demonstrates a key insight:

> **The quality of factual answers depends more on retrieval than on model size or recency.**

Without RAG:
- The LLM relies on pretrained knowledge
- Answers can be generic, outdated, or sometimes inaccurate (in this example, the baseline reported a CET1 ratio of 13.0%, while the source document states a ratio near 13.8%)
- There is no clear link between answer and source

With RAG:
- The LLM reasons over retrieved, domain-specific documents
- Answers are precise, grounded, and explainable
- The system becomes suitable for enterprise and analytical use cases

Importantly, both responses use the **same language model**.
The improvement comes entirely from **retrieval-augmented context**.

---

## Key Takeaway

RAG transforms an LLM from a general conversational model into a **domain-aware reasoning system**.
This makes it a foundational architecture for real-world applications where correctness matters.


## Next Steps and Possible Improvements

This notebook presents a minimal but complete RAG pipeline.
In production systems, several enhancements are commonly applied:

1. **Stronger Embedding Models**
   - Sentence transformers or domain-specific embeddings
   - Better semantic recall and robustness

2. **Hybrid Retrieval**
   - Combine vector similarity with keyword or BM25 search
   - Improves precision for numbers, entities, and exact terms

3. **Chunking and Overlap Strategies**
   - Smarter document splitting improves retrieval quality
   - Especially important for long reports and PDFs

4. **Reranking Models**
   - Use a secondary model to rerank retrieved documents
   - Improves final context relevance

5. **Graph-based RAG (GraphRAG)**
   - Model relationships between entities and documents
   - Enables multi-hop reasoning and structured retrieval

6. **Evaluation and Monitoring**
   - Measure retrieval quality (Recall@K)
   - Track faithfulness and hallucination rates
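
As a minimal sketch of the retrieval metric mentioned in the last item, Recall@K is the fraction of relevant documents that appear among the top K results. The function name `recall_at_k` is hypothetical. The ids follow this notebook's `str(i)` scheme and the ranking mirrors the Step 4 output, but the relevance label (document `"4"`, the CET1 passage) is an assumption made here for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents found in the first k retrieved ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)


# Ranking as in Step 4: return-on-equity, net income, then the capital passage.
retrieved = ["3", "0", "4"]
relevant = {"4"}  # assumed relevance judgment for the capital question

for k in (1, 3):
    print(f"Recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
```

Under that assumed label, the relevant passage is only reached at K=3, which shows how the metric exposes ranking weaknesses that the final answer can hide.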

These techniques extend the same core idea demonstrated here:  
**ground generation in reliable, retrievable data**.
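
As one illustration of the hybrid retrieval idea listed above, rankings from a vector search and a keyword search can be merged with reciprocal rank fusion, where each document scores the sum of `1 / (k + rank)` over the lists it appears in. This is a sketch under stated assumptions: the function name is hypothetical, the keyword ranking is invented, and `k=60` is a commonly used constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked id lists into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_ranking = ["3", "0", "4"]   # order from the Step 4 vector search
keyword_ranking = ["4", "3", "1"]  # hypothetical keyword/BM25 order

print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever are still kept as candidates.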
