# Naive RAG with RAGAS Evaluation

## Educational Notebook - Bitcoin Whitepaper

This notebook demonstrates:
1. **Document Ingestion**: Load and chunk PDF documents
2. **Vector Store**: Create embeddings with local Nomic model via TEI
3. **RAG Pipeline**: Query using Claudex (Claude) or Ollama
4. **RAGAS Evaluation**: Measure quality metrics with Quality Gates

### Infrastructure (100% Local)
- **TEI Server**: `http://localhost:8080` - Nomic embeddings
- **Claudex**: `http://localhost:8081` - Claude CLI wrapper ([GitHub](https://github.com/Leeaandrob/claudex))
- **Ollama**: Fallback with qwen2.5:3b

---

## Setup

In [None]:
# Configuration - 100% Local Infrastructure
import requests

# Local infrastructure endpoints
TEI_URL = "http://localhost:8080"
CLAUDEX_URL = "http://localhost:8081/v1"
OLLAMA_MODEL = "qwen2.5:3b"

# Choose LLM backend
USE_CLAUDEX = True  # Set to False to use Ollama instead

# Verify TEI is running
try:
    response = requests.get(f"{TEI_URL}/health", timeout=5)
    response.raise_for_status()  # treat a non-2xx status as unhealthy
    print(f"✅ TEI Server: {TEI_URL} - Healthy")
except requests.RequestException:
    print(f"❌ TEI Server not available at {TEI_URL}")
    print("   Run: docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 --model-id nomic-ai/nomic-embed-text-v1.5")

# Verify Claudex or Ollama
if USE_CLAUDEX:
    try:
        # Reachability check only: the status code is not inspected
        response = requests.get(f"{CLAUDEX_URL.replace('/v1', '')}/health", timeout=5)
        print(f"✅ Claudex Server: {CLAUDEX_URL} - Healthy")
    except requests.RequestException:
        print("⚠️ Claudex not available, will fall back to Ollama")
        USE_CLAUDEX = False
else:
    print(f"ℹ️ Using Ollama with model: {OLLAMA_MODEL}")

## Step 1: Document Ingestion

We'll load the Bitcoin whitepaper and split it into chunks.

**Key parameters:**
- `chunk_size`: Target size for each chunk (in characters)
- `chunk_overlap`: Overlap between consecutive chunks
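
To make the two parameters concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is an illustration only: `chunk_text` is a hypothetical helper, and `DocumentProcessor` may split differently (for example, on paragraph or sentence boundaries).

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


# Toy input so the overlap is easy to see (hypothetical data)
demo_chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for piece in demo_chunks:
    print(piece)
```

Each printed window repeats the last four characters of the previous one. A larger overlap lowers the chance that a sentence is cut in half at a chunk boundary, at the cost of more chunks to embed and store.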

In [None]:
from src.document_loader import DocumentProcessor, analyze_chunks

# Initialize processor
processor = DocumentProcessor(
    chunk_size=500,
    chunk_overlap=100,
)

# Load and chunk the PDF
chunks = processor.process("bitcoin_paper.pdf")

# Analyze the chunks
stats = analyze_chunks(chunks)
print("\nChunk Statistics:")
for key, value in stats.items():
    print(f"  {key}: {value}")

In [None]:
# Explore a sample chunk
print("Sample Chunk (index 0):")
print("-" * 40)
print(chunks[0].page_content)
print("-" * 40)
print(f"Metadata: {chunks[0].metadata}")

## Step 2: Vector Store Creation

Convert chunks to embeddings using **local Nomic model via TEI** and store in FAISS.

**How it works:**
1. Each chunk is sent to TEI server for embedding (Nomic model)
2. FAISS indexes these vectors for efficient similarity search
3. Queries are also embedded and compared to find similar chunks

**Local Setup**: No API keys needed - all processing happens locally!
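
The three steps above can be sketched without TEI or FAISS. The toy below uses hand-written 3-dimensional vectors in place of Nomic embeddings and a brute-force scan in place of a FAISS index. `l2_distance` and `top_k` are hypothetical helpers, and the use of L2 distance (lower means more similar) is an assumption about how the index is configured.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, index, k):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up vectors standing in for real chunk embeddings
toy_index = [
    ("proof-of-work chunk", [0.9, 0.1, 0.0]),
    ("timestamp server chunk", [0.1, 0.9, 0.0]),
    ("privacy chunk", [0.0, 0.1, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # stands in for the embedded question

hits = top_k(toy_query, toy_index, k=2)
for text, distance in hits:
    print(f"{distance:.3f}  {text}")
```

If the real store also reports a distance rather than a similarity, a lower score in the retrieval test means a closer match.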

In [None]:
from src.vector_store import VectorStoreManager

# Initialize vector store with local TEI embeddings
vector_manager = VectorStoreManager(
    use_local=True,
    tei_url=TEI_URL,
)

# Create index from chunks
vector_manager.create_from_documents(chunks)

print(f"✅ Vector store created with {len(chunks)} chunks using Nomic embeddings")

In [None]:
# Test retrieval with similarity scores
query = "What is proof of work?"

results = vector_manager.similarity_search_with_score(query, k=3)

print(f"Query: '{query}'")
print("\nTop 3 Results:")
for i, (doc, score) in enumerate(results, 1):
    print(f"\n[{i}] Score: {score:.4f} | Page: {doc.metadata.get('page', '?')}")
    print(f"Content: {doc.page_content[:200]}...")

## Step 3: RAG Pipeline

Combine retrieval with LLM generation using **Claudex** (Claude) or **Ollama**.

**Pipeline:**
1. User asks a question
2. Retrieve k most similar chunks from FAISS
3. Construct prompt with question + context
4. Generate answer using LLM (Claudex or Ollama)

**LLM Options:**
- **Claudex**: Claude via OpenAI-compatible API (recommended)
- **Ollama**: Local qwen2.5:3b (fallback)
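
Step 3 of the pipeline, prompt construction, can be sketched as plain string formatting. The template wording and the `build_prompt` helper are assumptions for illustration; the prompt actually used by `NaiveRAG` is not shown in this notebook.

```python
def build_prompt(question, contexts):
    """Combine the retrieved chunks and the question into one prompt string."""
    numbered = "\n\n".join(
        f"[{i}] {ctx.strip()}" for i, ctx in enumerate(contexts, 1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks; in the notebook these would come from the vector store
demo_prompt = build_prompt("What is proof of work?", ["Chunk A", "Chunk B"])
print(demo_prompt)
```

Numbering the chunks makes it easier to trace a generated answer back to the context that supports it.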

In [None]:
from src.rag_pipeline import NaiveRAG

# Initialize RAG pipeline with the selected LLM backend (Claudex or Ollama)
rag = NaiveRAG(
    vector_store_manager=vector_manager,
    use_local_llm=not USE_CLAUDEX,
    use_claudex=USE_CLAUDEX,
    claudex_url=CLAUDEX_URL,
    ollama_model=OLLAMA_MODEL,
    temperature=0.0,
    k=4,
)

if USE_CLAUDEX:
    print(f"✅ RAG initialized with Claudex at {CLAUDEX_URL}")
else:
    print(f"✅ RAG initialized with Ollama ({OLLAMA_MODEL})")

In [None]:
# Test with a question
question = "What is Bitcoin and how does it work?"

result = rag.query(question)

print(f"Question: {question}")
print("\n" + "=" * 50)
print("\nResponse:")
print(result["response"])
print("\n" + "=" * 50)
print(f"\nRetrieved {len(result['retrieved_contexts'])} context chunks")

In [None]:
# View retrieved contexts
print("Retrieved Contexts:")
for i, ctx in enumerate(result["retrieved_contexts"], 1):
    print(f"\n[Context {i}]")
    print(ctx[:300] + "..." if len(ctx) > 300 else ctx)

## Step 4: RAGAS Evaluation with Quality Gates

Evaluate the RAG pipeline using RAGAS metrics:

| Metric | What It Measures | Quality Gate |
|--------|------------------|--------------|
| **Faithfulness** | Is the answer grounded in context? | ≥ 0.7 |
| **Answer Relevancy** | Is the answer relevant to the question? | ≥ 0.8 |

**Why Quality Gates Matter:**
- Without metrics, you don't know what you're putting into production
- A small model (qwen2.5:3b) averaged 0.691
- A large model (Claude via Claudex) averaged 0.906, about 31% higher
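
The 31% figure is the relative gain of the larger model's average over the smaller one's, and a quality gate is a per-metric threshold comparison. A minimal sketch of both follows. `relative_gain` and `check_gates` are hypothetical helpers, and the per-metric scores passed to `check_gates` are made-up inputs, not results from this notebook.

```python
def relative_gain(baseline, improved):
    """Fractional improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


def check_gates(scores, thresholds):
    """Map each metric to True if its score meets the threshold."""
    return {
        metric: scores.get(metric, 0.0) >= minimum
        for metric, minimum in thresholds.items()
    }


# Averages quoted above: 0.691 (qwen2.5:3b) vs 0.906 (Claude via Claudex)
gain = relative_gain(0.691, 0.906)
print(f"Relative gain: {gain:.1%}")

# Illustrative scores only (not measured values)
gates = check_gates(
    {"faithfulness": 0.75, "answer_relevancy": 0.78},
    {"faithfulness": 0.7, "answer_relevancy": 0.8},
)
print(gates)
```

In this made-up case faithfulness passes and answer relevancy fails, so the pipeline as a whole would not clear the gate.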

In [None]:
from src.evaluator import RAGASEvaluator, create_test_questions_bitcoin

# Create test questions
questions, references = create_test_questions_bitcoin()

print(f"Created {len(questions)} test questions:")
for i, q in enumerate(questions, 1):
    print(f"  {i}. {q}")

In [None]:
# Process all questions through RAG
results = rag.batch_query(questions, references)

print(f"\nProcessed {len(results)} questions")

In [None]:
# Initialize evaluator with the same LLM backend as the RAG pipeline
evaluator = RAGASEvaluator(
    metrics=["faithfulness", "answer_relevancy"],
    use_local=not USE_CLAUDEX,
    use_claudex=USE_CLAUDEX,
    claudex_url=CLAUDEX_URL,
    ollama_model=OLLAMA_MODEL,
    tei_url=TEI_URL,
)

# Run evaluation
print("Running RAGAS evaluation...")
evaluation = evaluator.evaluate(results)
print("✅ Evaluation complete!")

In [None]:
# View overall scores with Quality Gate validation
evaluator.print_report(evaluation)

# Quality Gate Check
FAITHFULNESS_THRESHOLD = 0.7
RELEVANCY_THRESHOLD = 0.8

scores = evaluation["scores"]
faithfulness_pass = scores.get("faithfulness", 0) >= FAITHFULNESS_THRESHOLD
relevancy_pass = scores.get("answer_relevancy", 0) >= RELEVANCY_THRESHOLD

print("\n" + "=" * 50)
print("QUALITY GATE VALIDATION")
print("=" * 50)
print(f"Faithfulness: {scores.get('faithfulness', 0):.3f} {'✅ PASS' if faithfulness_pass else '❌ FAIL'} (threshold: {FAITHFULNESS_THRESHOLD})")
print(f"Relevancy:    {scores.get('answer_relevancy', 0):.3f} {'✅ PASS' if relevancy_pass else '❌ FAIL'} (threshold: {RELEVANCY_THRESHOLD})")
print("=" * 50)
if faithfulness_pass and relevancy_pass:
    print("🎉 ALL QUALITY GATES PASSED - Production Ready!")
else:
    print("⚠️ QUALITY GATES FAILED - Needs Improvement")

In [None]:
# Detailed results DataFrame
df = evaluation["dataframe"]
df[["user_input", "faithfulness", "answer_relevancy"]]

## Step 5: Analysis & Experimentation

Let's analyze the results and identify areas for improvement.

In [None]:
# Find questions with lowest scores
print("Questions with LOWEST faithfulness:")
for _, row in df.nsmallest(3, "faithfulness").iterrows():
    print(f"  Score: {row['faithfulness']:.2f} | {row['user_input'][:60]}")

print("\nQuestions with LOWEST answer_relevancy:")
for _, row in df.nsmallest(3, "answer_relevancy").iterrows():
    print(f"  Score: {row['answer_relevancy']:.2f} | {row['user_input'][:60]}")

In [None]:
# Visualize scores distribution
import matplotlib.pyplot as plt

metrics = ["faithfulness", "answer_relevancy"]
thresholds = {"faithfulness": 0.7, "answer_relevancy": 0.8}

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

for i, metric in enumerate(metrics):
    axes[i].hist(df[metric].dropna(), bins=10, edgecolor='black', alpha=0.7, color='steelblue')
    axes[i].set_title(f"{metric.replace('_', ' ').title()}")
    axes[i].set_xlabel("Score")
    axes[i].set_ylabel("Count")
    axes[i].axvline(df[metric].mean(), color='blue', linestyle='--', linewidth=2, label=f'Mean: {df[metric].mean():.2f}')
    axes[i].axvline(thresholds[metric], color='red', linestyle='-', linewidth=2, label=f'Threshold: {thresholds[metric]}')
    axes[i].legend()
    axes[i].set_xlim(0, 1)

plt.tight_layout()
plt.suptitle("RAGAS Score Distribution with Quality Gate Thresholds", y=1.02, fontsize=14)
plt.show()

## Experiments

Try different configurations and compare results.

In [None]:
# Experiment: Different chunk sizes
chunk_sizes = [300, 500, 800]
experiment_results = {}

for chunk_size in chunk_sizes:
    print(f"\nTesting chunk_size={chunk_size}")

    # Use experiment-local names so the main `processor` and `chunks` are not overwritten
    exp_processor = DocumentProcessor(chunk_size=chunk_size, chunk_overlap=100)
    exp_chunks = exp_processor.process("bitcoin_paper.pdf")

    # Create new vector store with local embeddings
    vm = VectorStoreManager(use_local=True, tei_url=TEI_URL)
    vm.create_from_documents(exp_chunks)

    # Create new RAG with the same LLM settings as the main pipeline
    test_rag = NaiveRAG(
        vm,
        k=4,
        use_local_llm=not USE_CLAUDEX,
        use_claudex=USE_CLAUDEX,
        claudex_url=CLAUDEX_URL,
        ollama_model=OLLAMA_MODEL,
        temperature=0.0,
    )

    # Run on sample questions (just 3 for speed)
    test_results = test_rag.batch_query(questions[:3], references[:3])

    # Evaluate
    eval_result = evaluator.evaluate(test_results)
    experiment_results[chunk_size] = eval_result["scores"]

# Compare results
print("\n" + "=" * 50)
print("Experiment Results:")
for chunk_size, scores in experiment_results.items():
    print(f"\nChunk Size: {chunk_size}")
    for metric, score in scores.items():
        print(f"  {metric}: {score:.3f}")

## Save Results

In [None]:
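# Save evaluation results
from pathlib import Path

Path("outputs").mkdir(parents=True, exist_ok=True)  # make sure the target folder exists
evaluator.save_results(evaluation, "outputs/evaluation_results.csv")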

# Save vector store for later use
vector_manager.save("data/faiss_index")

print("Results saved!")