# RAGnarok-AI Quickstart


### Local-first RAG evaluation. No API keys. No data leakage. Just metrics that matter.

---

**RAGnarok-AI** is a local-first evaluation framework for RAG pipelines: no API keys, no data leaving your network, no complex setup.

Evaluate retrieval quality with deterministic metrics (Precision, Recall, MRR, NDCG), assess generation quality with LLM-as-Judge (Prometheus 2), and track costs across providers, all running entirely on your machine.

This notebook walks through the complete workflow, from basic evaluation to pipeline comparison.




1. **Setup** - Installation and imports
2. **Simple RAG Pipeline** - A mock pipeline to evaluate
3. **Test Set** - Queries paired with their expected relevant documents
4. **Basic Evaluation** - Retrieval metrics, with rich HTML display in Jupyter
5. **Cost Tracking** - Track token usage and costs
6. **LLM-as-Judge** - Multi-criteria evaluation with Prometheus 2
7. **Pipeline Comparison** - Compare multiple configurations
8. **Export Results** - JSON output for CI/CD

## Prerequisites

- Python 3.10+
- [Ollama](https://ollama.ai/) running locally
- RAGnarok-AI installed: `pip install ragnarok-ai[ollama]`
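
Before running the notebook, it can help to confirm the first two prerequisites from Python. The sketch below is not part of RAGnarok-AI. It assumes Ollama is listening on its default address (`http://localhost:11434`) and queries its `/api/tags` endpoint, which lists locally pulled models. The helper names `python_ok` and `ollama_models` are made up for this example.

```python
import json
import sys
import urllib.error
import urllib.request
from typing import Optional


def python_ok(minimum: tuple = (3, 10)) -> bool:
    """True if the running interpreter meets the minimum (major, minor) version."""
    return sys.version_info[:2] >= minimum


def ollama_models(base_url: str = "http://localhost:11434",
                  timeout: float = 2.0) -> Optional[list]:
    """Return the names of locally pulled models, or None if Ollama is unreachable.

    Assumes the default Ollama address; adjust base_url if yours differs.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply
        return None
    if not isinstance(payload, dict):
        return None
    return [model.get("name", "") for model in payload.get("models", [])]
```

A return value of `None` from `ollama_models()` suggests the server is not running; an empty list means it is running but no model has been pulled yet.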

## 1. Setup

Import the core modules. RAGnarok-AI is designed with minimal dependencies, so import only what you need.

```python
# Install ragnarok-ai (uncomment if needed)
# !pip install ragnarok-ai[ollama]
```

```python
import ragnarok_ai

print(f"RAGnarok-AI v{ragnarok_ai.__version__} ready!")
```

```text
RAGnarok-AI v1.3.1 ready!
```


```python
from ragnarok_ai import evaluate, compare, LLMJudge
from ragnarok_ai.core.types import Document, Query, RAGResponse, TestSet
from ragnarok_ai.notebook import display, display_metrics, display_cost, display_comparison
```

## 2. Create a Simple RAG Pipeline

For this demo, we'll create a mock RAG pipeline. Any RAG system can be evaluated as long as it implements a `query()` method that returns a `RAGResponse`. This mock uses keyword matching; in production, you would plug in your LangChain, LlamaIndex, or custom implementation.
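
If your existing pipeline is synchronous, one way to meet this interface is a thin wrapper that exposes an async `query()`. The sketch below is an illustration, not a RAGnarok-AI adapter. `SyncPipelineAdapter` and its two callables are hypothetical. The `Document` and `RAGResponse` dataclasses are stand-ins that mirror only the fields used in this notebook, so the example runs without the library installed.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Callable


# Stand-ins for illustration only; in the notebook, import the real classes
# from ragnarok_ai.core.types instead.
@dataclass
class Document:
    id: str
    content: str


@dataclass
class RAGResponse:
    answer: str
    retrieved_docs: list = field(default_factory=list)


class SyncPipelineAdapter:
    """Hypothetical wrapper: exposes blocking retrieve/generate functions
    through the async query() method described above."""

    def __init__(self,
                 retrieve: Callable[[str], list],
                 generate: Callable[[str, list], str]):
        self._retrieve = retrieve
        self._generate = generate

    async def query(self, question: str) -> RAGResponse:
        # Run the blocking calls in a worker thread so the event loop stays free
        docs = await asyncio.to_thread(self._retrieve, question)
        answer = await asyncio.to_thread(self._generate, question, docs)
        return RAGResponse(answer=answer, retrieved_docs=docs)
```

In a notebook you would call `await adapter.query("...")` directly; in a script, wrap the call in `asyncio.run(...)`.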

```python
# Sample knowledge base
KNOWLEDGE_BASE = [
    Document(id="doc1", content="Python was created by Guido van Rossum and first released in 1991."),
    Document(id="doc2", content="Python is known for its simple, readable syntax and extensive standard library."),
    Document(id="doc3", content="Python supports multiple programming paradigms: procedural, object-oriented, and functional."),
    Document(id="doc4", content="Popular Python frameworks include Django, Flask, and FastAPI for web development."),
    Document(id="doc5", content="Python is widely used in data science, machine learning, and artificial intelligence."),
]

print(f"Knowledge base: {len(KNOWLEDGE_BASE)} documents")
```

```text
Knowledge base: 5 documents
```


```python
class SimpleRAG:
    """A simple mock RAG for demonstration."""

    def __init__(self, docs: list[Document], k: int = 3):
        self.docs = docs
        self.k = k

    async def query(self, question: str) -> RAGResponse:
        # Simple keyword matching (in practice, use embeddings).
        # split() keeps punctuation attached, so "Python?" becomes "python?".
        keywords = question.lower().split()
        scored_docs = []

        for doc in self.docs:
            # Count keywords that occur as substrings of the document text
            score = sum(1 for kw in keywords if kw in doc.content.lower())
            if score > 0:
                scored_docs.append((score, doc))

        # Sort by score (stable, so ties keep knowledge-base order) and take top k
        scored_docs.sort(key=lambda x: x[0], reverse=True)
        retrieved = [doc for _, doc in scored_docs[:self.k]]

        # Simple extractive answer: the content of the top-ranked document
        if retrieved:
            answer = retrieved[0].content
        else:
            answer = "I don't have enough information to answer this question."

        return RAGResponse(
            answer=answer,
            retrieved_docs=retrieved,
        )

# Create RAG instance
rag = SimpleRAG(KNOWLEDGE_BASE)

# Test it (top-level await works in Jupyter)
response = await rag.query("Who created Python?")
print(f"Answer: {response.answer}")
print(f"Retrieved {len(response.retrieved_docs)} documents")
```

```text
Answer: Python was created by Guido van Rossum and first released in 1991.
Retrieved 1 documents
```
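
Only one document comes back for "Who created Python?" even though every document mentions Python. The mock splits on whitespace, so the last keyword is `python?` with the question mark attached, and that string appears in no document. The sketch below contrasts the mock's tokenization with a regex-based alternative. `word_tokens` is an illustrative variant, not something the notebook or the library uses.

```python
import re


def split_tokens(question: str) -> list:
    """Tokenize the way the mock does: lowercase, then split on whitespace."""
    return question.lower().split()


def word_tokens(question: str) -> list:
    """Alternative: keep only runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", question.lower())


def keyword_score(tokens: list, text: str) -> int:
    """Count tokens that occur as substrings of the text (same rule as the mock)."""
    lowered = text.lower()
    return sum(1 for token in tokens if token in lowered)


doc2 = "Python is known for its simple, readable syntax and extensive standard library."
question = "Who created Python?"

print(split_tokens(question))                       # ['who', 'created', 'python?']
print(word_tokens(question))                        # ['who', 'created', 'python']
print(keyword_score(split_tokens(question), doc2))  # 0
print(keyword_score(word_tokens(question), doc2))   # 1
```

Because matching is by substring, short keywords such as "in" also match inside longer words like "programming". That is acceptable for a mock, and one reason real pipelines use embeddings or a proper tokenizer.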


## 3. Create a Test Set

A test set pairs queries with their expected relevant documents. This ground truth is what makes deterministic retrieval metrics possible. You can generate test sets automatically with `ragnarok generate` or curate them manually.


```python
# Create test queries
testset = TestSet(
    queries=[
        Query(
            text="Who created Python and when?",
            ground_truth_docs=["doc1"],
        ),
        Query(
            text="What are Python's main characteristics?",
            ground_truth_docs=["doc2", "doc3"],
        ),
        Query(
            text="What web frameworks are available in Python?",
            ground_truth_docs=["doc4"],
        ),
        Query(
            text="What is Python used for in data science?",
            ground_truth_docs=["doc5"],
        ),
        Query(
            text="What programming paradigms does Python support?",
            ground_truth_docs=["doc3"],
        ),
    ]
)

print(f"Test set: {len(testset)} queries")
```

```text
Test set: 5 queries
```
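
When a test set is curated by hand, a mistyped document id silently lowers recall, because that document can never be retrieved. A small sanity check along these lines can catch this early. It is a sketch over plain dictionaries, and `unknown_ground_truth` is a hypothetical helper, not a library function.

```python
def unknown_ground_truth(queries: list, known_ids: set) -> dict:
    """Map each query text to the ground-truth ids missing from the knowledge base."""
    problems = {}
    for query in queries:
        missing = [doc_id for doc_id in query["ground_truth_docs"]
                   if doc_id not in known_ids]
        if missing:
            problems[query["text"]] = missing
    return problems


known = {"doc1", "doc2", "doc3", "doc4", "doc5"}
queries = [
    {"text": "Who created Python and when?", "ground_truth_docs": ["doc1"]},
    {"text": "What web frameworks are available in Python?", "ground_truth_docs": ["doc44"]},  # typo
]

print(unknown_ground_truth(queries, known))
# {'What web frameworks are available in Python?': ['doc44']}
```

In the notebook, the known ids would be `{doc.id for doc in KNOWLEDGE_BASE}`. Reading the text and ground truth back from `Query` objects assumes they expose the same attribute names they are constructed with.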


## 4. Basic Evaluation

Run evaluation with retrieval metrics (Precision, Recall, MRR, NDCG). These core metrics are computed deterministically:

- Precision measures the share of retrieved documents that are relevant.
- Recall measures the share of relevant documents that were retrieved.
- MRR rewards finding the first relevant document early in the ranking.
- NDCG accounts for ranking quality across the result list.

```python
# Run evaluation
results = await evaluate(
    rag_pipeline=rag,
    testset=testset,
)

# View the summary
print(results.summary())
```

```text
{'num_queries': 5, 'precision': 0.26666666666666666, 'recall': 0.8, 'mrr': 0.8, 'ndcg': 0.8}
```


```python
display_metrics(results)
```

```text
precision  [█████░░░░░░░░░░░░░░░]  0.27
recall     [████████████████░░░░]  0.80
mrr        [████████████████░░░░]  0.80
ndcg       [████████████████░░░░]  0.80
```
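
To see where these numbers come from, the four metrics can be recomputed by hand. The sketch below uses common textbook definitions with binary relevance; RAGnarok-AI's own implementation may differ in details (for example, precision at a fixed k). The ranked lists in `RUNS` were worked out manually from the keyword rule in `SimpleRAG` with `k=3`, so treat them as an assumption rather than library output.

```python
import math


def precision(retrieved: list, relevant: set) -> float:
    """Share of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


def recall(retrieved: list, relevant: set) -> float:
    """Share of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg(retrieved: list, relevant: set) -> float:
    """Binary-relevance DCG divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved, start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), len(retrieved))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# (ranked doc ids from the mock, ground truth) for the five test queries
RUNS = [
    (["doc1", "doc2", "doc3"], {"doc1"}),
    ([], {"doc2", "doc3"}),               # no keyword matched any document
    (["doc4", "doc1", "doc3"], {"doc4"}),
    (["doc5", "doc2", "doc4"], {"doc5"}),
    (["doc3", "doc1", "doc2"], {"doc3"}),
]

METRICS = {"precision": precision, "recall": recall, "mrr": reciprocal_rank, "ndcg": ndcg}
summary = {
    name: round(sum(fn(ret, rel) for ret, rel in RUNS) / len(RUNS), 3)
    for name, fn in METRICS.items()
}
print(summary)
# {'precision': 0.267, 'recall': 0.8, 'mrr': 0.8, 'ndcg': 0.8}
```

Precision is low because each successful query returns three documents of which only one is relevant. Recall, MRR, and NDCG are all 0.8 because four of the five queries place their relevant document first and one query retrieves nothing.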


## 5. Evaluation with Cost Tracking

Track token usage and costs when using LLM providers. Every LLM call is recorded with its token count and cost. Local models (Ollama, vLLM) are reported at $0.00, which makes the cost advantage of local-first evaluation immediately visible.

```python
# Run evaluation with cost tracking & display
results_with_cost = await evaluate(
    rag_pipeline=rag,
    testset=testset,
    track_cost=True,
)

display_cost(results_with_cost)
```

```text
PROVIDER  TOKENS  COST
TOTAL     0       $0.0000
```
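
The total is zero here because the mock pipeline makes no LLM calls. As a rough picture of what per-provider cost tracking involves, the sketch below multiplies token counts by a per-1K-token price. It is not RAGnarok-AI's implementation: the `Usage` type, the provider names, and every price in `PRICES_PER_1K_TOKENS` are hypothetical values chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical (input, output) prices in USD per 1,000 tokens; not real price data
PRICES_PER_1K_TOKENS = {
    "ollama": (0.0, 0.0),            # local model: no per-token charge
    "hosted-example": (0.50, 1.50),  # made-up hosted provider
}


@dataclass(frozen=True)
class Usage:
    provider: str
    input_tokens: int
    output_tokens: int


def call_cost(usage: Usage) -> float:
    """Cost of one call in USD under the price table above."""
    input_price, output_price = PRICES_PER_1K_TOKENS[usage.provider]
    return (usage.input_tokens * input_price + usage.output_tokens * output_price) / 1000


def totals_by_provider(calls: list) -> dict:
    """Aggregate (total tokens, total cost) per provider."""
    tokens = defaultdict(int)
    cost = defaultdict(float)
    for usage in calls:
        tokens[usage.provider] += usage.input_tokens + usage.output_tokens
        cost[usage.provider] += call_cost(usage)
    return {provider: (tokens[provider], cost[provider]) for provider in tokens}


calls = [Usage("ollama", 1200, 300), Usage("hosted-example", 2000, 1000)]
print(totals_by_provider(calls))
# {'ollama': (1500, 0.0), 'hosted-example': (3000, 2.5)}
```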


```python
# Full dashboard with metrics & cost
display(results_with_cost)
```

```text
precision  [█████░░░░░░░░░░░░░░░]  0.27
recall     [████████████████░░░░]  0.80
mrr        [████████████████░░░░]  0.80
ndcg       [████████████████░░░░]  0.80
```


## 6. LLM-as-Judge

Use Prometheus 2 for multi-criteria evaluation (faithfulness, relevance, hallucination, completeness).

**Prerequisites:**
```bash
# Install Prometheus 2 model (~5GB)
ollama pull hf.co/RichardErkhov/prometheus-eval_-_prometheus-7b-v2.0-gguf:Q5_K_M
```

Prometheus 2 evaluates response quality across several criteria, including:

- faithfulness (the answer is grounded in the context),
- relevance (the answer addresses the question),
- hallucination (the answer contains no fabricated information).

These judgments are advisory: use them to surface potential issues, not as absolute truth.

```python
# Initialize judge (uses Prometheus 2 by default)
# Note: This requires Ollama running with Prometheus 2 model
judge = LLMJudge()

# Evaluate a single response
judgment = await judge.evaluate_all(
    context="Python was created by Guido van Rossum in 1991. It is known for its simple syntax.",
    question="Who created Python?",
    answer="Guido van Rossum created Python.",
)

print(f"Overall: {judgment.overall_verdict} ({judgment.overall_score:.2f})")
print(f"Faithfulness: {judgment.faithfulness.verdict} - {judgment.faithfulness.explanation}")
print(f"Relevance: {judgment.relevance.verdict}")
print(f"Hallucination: {judgment.hallucination.verdict}")
```

```text
Overall: PARTIAL (0.50)
Faithfulness: PARTIAL - The answer to the question is "Guido van Rossum created Python." This statement aligns perfectly with the context provided that Python was created by Guido van Rossum in 1991, thus adhering to the criterion of faithfulness. The claim made by the respondent is entirely supported by the information presented within the given context. Therefore, based on the score rubric, which demands that the answer should contain only information supported by the context, this response fulfills all requirements, making it a complete and faithful answer. So the overall score is 5. [PASS]
Relevance: PARTIAL
Hallucination: PARTIAL
```
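
In the run above, the faithfulness verdict is PARTIAL while the explanation text ends with `[PASS]`. This is a concrete reason to treat judgments as advisory. One lightweight safeguard is to compare each verdict with the tag at the end of its explanation and flag disagreements for manual review. The sketch below assumes verdicts are plain strings and that explanations may end with a bracketed tag. Only PASS and PARTIAL appear in this notebook, so FAIL is a guess at the remaining label, and both helper names are hypothetical.

```python
import re
from typing import Optional

# Assumed verdict vocabulary; FAIL does not appear in the output above
_TRAILING_TAG = re.compile(r"\[(PASS|PARTIAL|FAIL)\]\s*$")


def trailing_tag(explanation: str) -> Optional[str]:
    """Return the bracketed tag at the end of an explanation, if there is one."""
    match = _TRAILING_TAG.search(explanation)
    return match.group(1) if match else None


def verdict_mismatch(verdict: str, explanation: str) -> bool:
    """True when the explanation carries a tag that disagrees with the verdict."""
    tag = trailing_tag(explanation)
    return tag is not None and tag != verdict.upper()


explanation = "This response fulfills all requirements. So the overall score is 5. [PASS]"
print(trailing_tag(explanation))                 # PASS
print(verdict_mismatch("PARTIAL", explanation))  # True
print(verdict_mismatch("PASS", explanation))     # False
```

In the notebook, the same check could be applied to `judgment.faithfulness.verdict` and `judgment.faithfulness.explanation`.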


## 7. Pipeline Comparison

Compare multiple RAG configurations side by side to make data-driven decisions. The `*` marker highlights the best score for each metric (tied scores are all marked), making trade-offs immediately visible.

```python
# Create RAG variants with different k values
rag_k3 = SimpleRAG(KNOWLEDGE_BASE, k=3)
rag_k5 = SimpleRAG(KNOWLEDGE_BASE, k=5)

# Evaluate both
results_k3 = await evaluate(rag_k3, testset, track_cost=True)
results_k5 = await evaluate(rag_k5, testset, track_cost=True)

# Compare side-by-side
display_comparison([
    ("Baseline (k=3)", results_k3),
    ("More docs (k=5)", results_k5),
])
```

```text
metric     Baseline (k=3)  More docs (k=5)
precision  0.27 *          0.17
recall     0.80 *          0.80 *
mrr        0.80 *          0.80 *
ndcg       0.80 *          0.80 *
cost       $0.00           $0.00
latency    0.01s           0.00s
```
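
The marker logic in this table is straightforward to reproduce: for each metric, take the highest score and mark every configuration that reaches it. The sketch below does this over plain dictionaries, using the rounded values from the table. `mark_best` is a hypothetical helper and says nothing about how `display_comparison` is implemented.

```python
import math


def mark_best(table: dict, marker: str = "*") -> dict:
    """Format scores to two decimals and append a marker to the best per metric.

    Assumes higher is better; tied best scores are all marked.
    """
    marked = {}
    for metric, scores in table.items():
        best = max(scores.values())
        marked[metric] = {
            label: f"{value:.2f} {marker}" if math.isclose(value, best) else f"{value:.2f}"
            for label, value in scores.items()
        }
    return marked


table = {
    "precision": {"Baseline (k=3)": 0.27, "More docs (k=5)": 0.17},
    "recall": {"Baseline (k=3)": 0.80, "More docs (k=5)": 0.80},
}

for metric, row in mark_best(table).items():
    print(metric, row)
# precision {'Baseline (k=3)': '0.27 *', 'More docs (k=5)': '0.17'}
# recall {'Baseline (k=3)': '0.80 *', 'More docs (k=5)': '0.80 *'}
```

The comparison itself shows the expected trade-off: raising `k` from 3 to 5 retrieves more irrelevant documents, so precision drops, while recall stays at 0.80 because the relevant documents were already found at `k=3`.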


## 8. Export Results

Export results to JSON for CI/CD integration or further analysis. Use `--fail-under` in the CLI to enforce quality gates in your pipeline.


```python
# You can export to JSON
import json

summary = results_with_cost.summary()
print(json.dumps(summary, indent=2))
```

```text
{
  "num_queries": 5,
  "precision": 0.26666666666666666,
  "recall": 0.8,
  "mrr": 0.8,
  "ndcg": 0.8
}
```
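
If you prefer to gate on the exported JSON directly rather than through the CLI, a threshold check takes only a few lines. The sketch below is a hand-rolled illustration and does not reproduce the behavior of `--fail-under`, whose exact semantics are not described in this notebook. The helper names and threshold values are arbitrary.

```python
import json


def failing_metrics(summary: dict, thresholds: dict) -> dict:
    """Return {metric: (actual, minimum)} for every metric below its threshold.

    A metric missing from the summary counts as failing.
    """
    failures = {}
    for name, minimum in thresholds.items():
        actual = summary.get(name)
        if actual is None or actual < minimum:
            failures[name] = (actual, minimum)
    return failures


def gate_exit_code(summary: dict, thresholds: dict) -> int:
    """0 if every threshold is met, 1 otherwise (suitable for sys.exit in CI)."""
    return 1 if failing_metrics(summary, thresholds) else 0


# The summary printed above, parsed back from its JSON form
exported = json.loads(
    '{"num_queries": 5, "precision": 0.26666666666666666, '
    '"recall": 0.8, "mrr": 0.8, "ndcg": 0.8}'
)
thresholds = {"recall": 0.8, "precision": 0.5}  # example values only

print(failing_metrics(exported, thresholds))
# {'precision': (0.26666666666666666, 0.5)}
print(gate_exit_code(exported, thresholds))
# 1
```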


**Note:** This mock makes no LLM calls, so no usage is tracked. In production, you would see token counts here.
---

```python
# Cost summary
print(results_with_cost.cost_summary())
```

```text
No usage tracked.
```


## Next Steps

- **Real RAG**: Replace `SimpleRAG` with your actual RAG pipeline (LangChain, LlamaIndex, etc.)
- **CI/CD**: Use the CLI for automated testing: `ragnarok evaluate --config ragnarok.yaml --fail-under 0.8`
- **Adapters**: Check available adapters with `ragnarok plugins list`


## Resources

- [GitHub](https://github.com/2501Pr0ject/RAGnarok-AI)
- [Plugin Guide](https://github.com/2501Pr0ject/RAGnarok-AI/blob/main/docs/PLUGINS.md)