A simple Q&A bot for technical documentation designed to test and compare different LLM evaluation frameworks including DeepEval, LangChain Evaluation, RAGAS, and OpenAI Evals.
This project serves as a testbed for comparing how different evaluation frameworks assess the same RAG (Retrieval-Augmented Generation) system.
- Clone the repository:

  ```bash
  git clone https://github.com/LiteObject/eval-framework-sandbox.git
  cd eval-framework-sandbox
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:

  ```bash
  cp .env.example .env
  # Edit .env with your API keys (optional unless running remote evals)
  ```

- Ask a question:

  ```bash
  python -m src.main "How do you install the Python requests library?"
  ```

  The bot will print a synthesized answer and list matching documents.

- Run the unit tests:

  ```bash
  pytest
  ```
- (Optional) Try an evaluation framework:
  - Update `.env` with the relevant API keys or enable the Ollama flag for a local model (details below).
  - Install the extras: `pip install -r requirements.txt` already includes the optional libraries, or run `pip install .[eval]` after an editable install.
  - Use the runner scripts in `evaluations/` as starting points; each script writes its results into `results/`.
The core QA bot already runs fully offline using TF-IDF retrieval. If you also want LangChain's evaluators to call a local Ollama model instead of OpenAI:
- Install Ollama and pull a model, e.g. `ollama pull llama3`.
- Set the following environment variables (via `.env` or your shell):
  - `LANGCHAIN_USE_OLLAMA=true`
  - `OLLAMA_MODEL=llama3` (or any other pulled model)
  - Optionally `OLLAMA_BASE_URL=http://localhost:11434` if you're running Ollama on a non-default host/port.
- Leave `OPENAI_API_KEY` blank; the LangChain evaluator will detect the Ollama flag and use `ChatOllama`.

If `LANGCHAIN_USE_OLLAMA` is `false`, the evaluator falls back to `ChatOpenAI` and expects a valid `OPENAI_API_KEY` plus `LANGCHAIN_OPENAI_MODEL` (defaults to `gpt-3.5-turbo`).
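For reference, the backend choice boils down to a small factory over these environment variables. The sketch below is illustrative rather than the repository's actual code, and it assumes the `langchain-openai` and `langchain-community` packages are installed (the import path for `ChatOllama` varies between LangChain versions):

```python
import os


def build_grader_llm():
    """Pick the chat model used for grading, based on the .env flags above."""
    if os.getenv("LANGCHAIN_USE_OLLAMA", "").lower() == "true":
        # Local model served by Ollama; no OpenAI key needed.
        from langchain_community.chat_models import ChatOllama

        return ChatOllama(
            model=os.getenv("OLLAMA_MODEL", "llama3"),
            base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
        )
    # Fall back to OpenAI; requires OPENAI_API_KEY in the environment.
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(model=os.getenv("LANGCHAIN_OPENAI_MODEL", "gpt-3.5-turbo"))
```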
These integrations are opt-in. Install the additional dependencies with:

```bash
pip install .[eval]
```
Each runner expects the dataset built from the JSON files `data/questions.json` and `data/ground_truth.json`. The helper below mirrors what the runners use internally:
```python
from pathlib import Path

from evaluations.utils import load_dataset_from_files

dataset = load_dataset_from_files(
    Path("data/questions.json"),
    Path("data/ground_truth.json"),
)
```
- Set `DEEPEVAL_API_KEY` in `.env` if you plan to submit results to the hosted DeepEval service (local scoring works without it).
- Run the runner programmatically:

  ```python
  from evaluations.deepeval_runner import DeepEvalRunner

  runner = DeepEvalRunner()
  result = runner.evaluate(dataset)
  print(result.score, result.details)
  ```

  The report is also written to `results/deepeval_result.json`.
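Under the hood, DeepEval scores each answer with metrics such as answer relevancy. The snippet below is a standalone sketch of that kind of check rather than the runner's actual internals; the LLM-based metrics call OpenAI by default, so they need `OPENAI_API_KEY` (or a configured local judge):

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# One question from the dataset, expressed as a DeepEval test case.
test_case = LLMTestCase(
    input="How do you install the Python requests library?",
    actual_output="Run `pip install requests` in your terminal.",
    retrieval_context=["The requests library is installed with `pip install requests`."],
)

# Scores how relevant the answer is to the question (uses an LLM judge).
metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])
```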
- Choose your backend:
  - Remote OpenAI models: set `OPENAI_API_KEY` and optionally `LANGCHAIN_OPENAI_MODEL` (defaults to `gpt-3.5-turbo`).
  - Local Ollama: set `LANGCHAIN_USE_OLLAMA=true`, `OLLAMA_MODEL`, and optionally `OLLAMA_BASE_URL`; no OpenAI key required.
- Invoke the runner:

  ```python
  from evaluations.langchain_eval_runner import LangChainEvalRunner

  runner = LangChainEvalRunner()
  result = runner.evaluate(dataset)
  print(result.score, result.details)
  ```

  LangChain will call the configured chat model to grade responses and store the output at `results/langchain_result.json`.
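The grading itself builds on LangChain's evaluation module. Below is a minimal sketch of that style of LLM-as-judge check, shown with `ChatOpenAI` for brevity; the runner wires in whichever backend you configured, and its exact evaluator choice may differ:

```python
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

# "qa" grades a prediction against a reference answer with an LLM judge.
evaluator = load_evaluator("qa", llm=ChatOpenAI(model="gpt-3.5-turbo"))

graded = evaluator.evaluate_strings(
    input="How do you install the Python requests library?",
    prediction="Run `pip install requests`.",
    reference="Install it with `pip install requests`.",
)
print(graded)  # e.g. {"reasoning": ..., "value": "CORRECT", "score": 1}
```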
- Install the `ragas` extras (already included in `.[eval]`). Some metrics call an LLM; set `OPENAI_API_KEY` or configure RAGAS to use a local model before running.
- Evaluate the dataset:

  ```python
  from evaluations.ragas_runner import RagasRunner

  runner = RagasRunner()
  result = runner.evaluate(dataset)
  print(result.score, result.details)
  ```

  The raw metric results are saved to `results/ragas_result.json`.
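For comparison, a direct RAGAS call looks roughly like the sketch below. Treat it as an approximation: column names and metric imports shift between RAGAS versions, and the LLM-backed metrics need `OPENAI_API_KEY` or a configured local model:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# RAGAS expects question/answer/contexts (plus ground_truth for some metrics).
data = Dataset.from_dict({
    "question": ["How do you install the Python requests library?"],
    "answer": ["Run `pip install requests`."],
    "contexts": [["The requests library is installed with `pip install requests`."]],
    "ground_truth": ["Install it with `pip install requests`."],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)
```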
This repository only prepares the dataset and relies on OpenAI's CLI for the actual evaluation. Ensure `evals` is installed and `OPENAI_API_KEY` is set, then use `evaluations/openai_eval_runner.py` to export a dataset and follow the OpenAI Evals documentation to launch the experiments with `oaieval`.
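Exported samples typically end up as JSONL in the chat format OpenAI Evals reads. The sketch below shows one way such a file could be built; it assumes `data/questions.json` and `data/ground_truth.json` are plain lists of strings and uses a hypothetical output path, so the runner's real export may differ:

```python
import json
from pathlib import Path

questions = json.loads(Path("data/questions.json").read_text())
answers = json.loads(Path("data/ground_truth.json").read_text())

# Hypothetical output path; adjust to wherever your eval registry expects it.
with open("results/openai_eval_samples.jsonl", "w") as f:
    for question, ideal in zip(questions, answers):
        # One JSONL line per sample: a chat-style prompt plus the ideal answer.
        sample = {
            "input": [
                {"role": "system", "content": "Answer the documentation question."},
                {"role": "user", "content": question},
            ],
            "ideal": ideal,
        }
        f.write(json.dumps(sample) + "\n")
```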
- `data/`: Test questions, ground truth, and source documents
- `src/`: Core Q&A bot implementation
- `evaluations/`: Framework-specific evaluation scripts
- `results/`: Evaluation results and comparisons (gitignored except for `.gitkeep`)
- Answer Correctness
- Context Relevance
- Faithfulness
- Answer Similarity
- Response Time
- Hallucination Rate
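As a rough offline illustration of one of these metrics, answer similarity can be approximated as cosine similarity over TF-IDF vectors, in the same spirit as the bot's retrieval. This is only a sketch; each framework defines and computes its metrics in its own way, often with an LLM judge:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def answer_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between TF-IDF vectors of the two answers (0..1)."""
    vectors = TfidfVectorizer().fit_transform([generated, reference])
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])


print(answer_similarity(
    "Run `pip install requests` in your terminal.",
    "Install it with `pip install requests`.",
))
```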