Rue is a Python testing framework for AI projects. It follows pytest syntax and culture while introducing components essential for testing AI software: metrics, typed datasets, semantic predicates (LLM-as-a-Judge), and OTEL traces.
```shell
uv add rue
```

Follow pytest habits...
- Create `test_*.py` files
- Mark plain Rue tests with `@rue.test`
- Use `rue.resource` instead of `pytest.fixture`
- Put shared suite setup, shared resources, and directory-specific overrides in `conftest.py` or `confrue_*.py` files. Rue loads `conftest.py` first in each directory. See confrue files
- Add `assert` expressions within the functions
- Run `uv run rue tests run`
Rue only collects tests marked through Rue decorators, so pytest and Rue tests can live side by side in the same `test_*.py` module.
...while leveraging Rue APIs.
- Use the `with metrics()` context to turn failed assertions into quality metrics
- Use `has_facts()` and other semantic predicates to assert on natural-language output
- Access OTEL span data and assert on it with the `follows_policy()` predicate
- Parse datasets into clearly typed and validated data objects
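To make the `with metrics()` idea concrete, here is a toy stdlib emulation of converting assertion failures into data points instead of hard test failures. The `Metric` and `metrics` names mimic Rue's API, but this is a sketch of the concept, not Rue's implementation:

```python
from contextlib import contextmanager

class Metric:
    """Toy metric: records 0/1 data points and exposes their mean."""
    def __init__(self):
        self.values = []

    @property
    def mean(self):
        return sum(self.values) / len(self.values)

@contextmanager
def metrics(metric):
    # Catch a failed assert and record it as a 0/1 data point
    # rather than failing the test immediately.
    try:
        yield
        metric.values.append(1.0)
    except AssertionError:
        metric.values.append(0.0)

accuracy = Metric()
for answer in ["supported", "supported", "hallucinated"]:
    with metrics(accuracy):
        assert answer == "supported"

print(accuracy.mean)  # 2 of 3 assertions passed
```

A threshold assertion on `metric.mean` at teardown (as in the `accuracy` resource below) then decides whether the suite passes.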
```python
import rue
from rue import Case, Metric, metrics
from rue.predicates import has_unsupported_facts
from pydantic import BaseModel


@rue.resource.sut
def store_chatbot():
    # call_llm is the application entry point under test (defined elsewhere)
    return rue.SUT(call_llm)


@rue.resource.metric
def accuracy():
    metric = Metric()
    yield metric
    # Runs at teardown, after all iterations have recorded data points
    assert metric.mean > 0.8


class Refs(BaseModel):
    kb: str
    expected_tool: str | None = None


cases = [
    Case(inputs={"prompt": "When are you open?"}, references=Refs(kb="Store hours: 9 AM - 6 PM, Monday-Saturday. Closed Sundays.")),
    Case(inputs={"prompt": "Return policy?"}, references=Refs(kb="30-day returns with receipt.")),
    Case(inputs={"prompt": "How much for the Nike Air Max?"}, references=Refs(kb="Nike Air Max: $129.99", expected_tool="offer_product")),
]


@rue.test.iterate.cases(*cases)
@rue.test.iterate(3)
async def test_chatbot_no_hallucinations(
    case: Case[dict[str, str], Refs],
    store_chatbot,
    accuracy: Metric,
):
    """The agent relies on the knowledge base and calls tools for transactional questions."""
    response = store_chatbot.instance(**case.inputs)

    # Verify the answer doesn't contain any unsupported facts
    with metrics(accuracy):
        assert not await has_unsupported_facts(response, case.references.kb)

    # Verify the expected tool was called
    if expected_tool := case.references.expected_tool:
        root_spans = store_chatbot.root_spans
        tool_names = [
            s.attributes.get("llm.request.functions.0.name")
            for s in store_chatbot.llm_spans
            if s.attributes
        ]
        assert root_spans
        assert expected_tool in tool_names
```

Run it:
```shell
uv run rue tests run
```

Use a custom run UUID when you need stable correlation IDs:
```shell
uv run rue tests run --run-id 3f5f5e9a-1c2d-4b5f-9c2b-7f6d8a9b0c1d
```

Output:
```
Rue Test Runner
=================
Collected 1 test
test_example.py::test_chatbot_responds ✓
==================== 1 passed in 0.08s ====================
```
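The `has_unsupported_facts` predicate used above is LLM-judged. As a shape reference only, here is a naive stand-in that replaces the LLM judge with case-insensitive substring matching; it shows the async call signature, nothing about Rue's actual judging logic:

```python
import asyncio

async def has_unsupported_facts(response: str, kb: str) -> bool:
    """Naive stand-in: flags any sentence not echoed by the KB.

    Rue's real predicate asks an LLM judge; this toy version only
    does substring checks, to illustrate the call shape.
    """
    kb_lower = kb.lower()
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    return any(s.lower() not in kb_lower for s in sentences)

async def main():
    kb = "30-day returns with receipt."
    assert not await has_unsupported_facts("30-day returns with receipt", kb)
    assert await has_unsupported_facts("Returns accepted for 90 days", kb)

asyncio.run(main())
```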
This project is licensed under the MIT License - see the LICENSE file for details.