diff --git a/README.md b/README.md
index befcbec..cb6698a 100644
--- a/README.md
+++ b/README.md
@@ -1,744 +1,216 @@
-
-
-[**Run your first eval**](#quick-start) · [**Browse 213 models**](https://stratix.layerlens.ai) · [**Star if useful ⭐**](https://github.com/LayerLens/stratix-python)
-
-
- Vendor-neutral evals in 5 lines of Python.
-| - -### Vendor-neutral - -Stratix is not owned by a model provider. The same benchmark runs across 213 public models from 26 providers in one workspace. No labs grading their own homework. No leaderboards optimized for marketing. - - | -- -### Reproducible by default - -Every score is backed by a verifiable, persisted trace you can re-run, inspect, and cite. Same prompt, same prompt template, same scoring logic, same model version. Every time. - - | -- -### Production-ready - -Wire evals into CI. Calibrate judges to a quality goal in plain English. Score full agent traces, not just last-token outputs. Ship reliable agents faster. - - | -
| Standard (pip) | -Modern (uv) | -Authenticate | -
|---|---|---|
|
+
+
+ Stratix Python SDK
+
+ Ship AI that actually works. Evaluate 200+ models across 100+ benchmarks, trace agent behavior, build custom judges, and gate CI/CD on eval results.
+
+ Install ·
+ Quick Start ·
+ Compare ·
+ Docs ·
+ Examples ·
+ Discord
+
+---
+
+## Why Stratix?
+
+Stratix is built differently. It gives you production-grade evaluation infrastructure out of the box: rich public benchmarks, powerful custom judges, full agent trace analysis, playback, bulk evaluation, and CI/CD gates.
+
+**What makes it click:**
+
+- **200+ models and 100+ benchmarks, ready to query.** No scraping leaderboards, no CSV wrangling. `pc.models.get()` and you're looking at real evaluation data.
+- **Prompt-level comparisons.** Not just "Model A scores 82%." You get the exact prompts where Model A passes and Model B fails, with outcome filters to find the interesting divergences.
+- **A 4-generation eval ladder.** Start with heuristic checks, graduate to model-graded scoring, add deliberation panels, then build auto-optimized GEPA judges. One SDK covers the full spectrum.
+- **Agent trace evaluation.** Upload a multi-step agent trace, replay it, and judge every step. Built for the world where agents do real work.
+- **CI/CD eval gates.** `layerlens ci run --threshold 0.8` in your pipeline. Non-zero exit on regression. No custom scripts needed.
+
+## How Stratix Compares
+
+| Capability | **Stratix** | LangSmith | Langfuse | DeepEval | Phoenix (Arize) |
+| ----------------------- | ---------------------------------------------- | -------------------------- | ----------------------- | ------------------- | ---------------------- |
+| Pre-built benchmarks | 100+ benchmarks, 200+ models | No public benchmarks | No public benchmarks | ~14 metrics | Bring your own |
+| Prompt-level comparison | Native head-to-head with outcome filters | Side-by-side runs (manual) | Not built-in | Manual setup | Not built-in |
+| Custom judge builder | Auto-optimized GEPA judges with budget control | LLM-as-judge (manual) | LLM-as-judge (manual) | Basic LLM judges | LLM-as-judge templates |
+| Agent trace evaluation | Upload, replay, judge every step | Trace logging + annotation | Trace logging + scoring | Trace logging only | Trace visualization |
+| Eval generation ladder | Heuristic > model-graded > deliberation > GEPA | Single generation | Single generation | Single generation | Single generation |
+| CI/CD eval gate | `layerlens ci run` with threshold | Custom integration | Custom integration | `deepeval test` | Manual integration |
+| Evaluation Spaces | Collaborative eval environments | Hub (paid) | Not available | Not available | Not available |
+| Dataset versioning | Pin evals to versions, diff between runs | Dataset management | Not built-in | Basic support | Dataset management |
+| OpenTelemetry export | Native OTLP exporter | Not built-in | Native OTLP | Not built-in | Native (OpenInference) |
+| Pricing model | Free public data; premium for org features | Per-trace pricing | Per-event pricing | Open source + cloud | Open source + cloud |
+
+## Installation
 ```bash
-pip install layerlens
+# Recommended (includes CLI, rich output, and examples)
+pip install layerlens[cli]
 ```
-
|
-+> **Note:** During early access the package is hosted on a private index. Use: +> +> ```bash +> pip install --extra-index-url https://sdk.layerlens.ai/package layerlens[cli] +> ``` -```bash -uv pip install layerlens -``` +## Quick Start - | -+**Easiest way** — use the one-command template: ```bash -export LAYERLENS_STRATIX_API_KEY=... +stratix init my-first-eval +cd my-first-eval +python main.py ``` -Or pass `api_key=...` to the client. - - | -
| - -### Model evaluation - -Run any of 213 public models across 59 benchmarks. AIME, GPQA, ARC-AGI-2, HumanEval, Terminal-Bench, MMLU Pro, BIRD-CRITIC, more. Reasoning, coding, math, agentic, multilingual. - -[Docs →](https://layerlens.gitbook.io/stratix-python-sdk) - - | -- -### Agent trace evaluation - -Upload OpenAI-format trace files and score multi-step agent behavior. Tool use, planning quality, recovery from failures. Not just the final token. - -[Docs →](https://layerlens.gitbook.io/stratix-python-sdk) - - | -- -### Judge calibration - -Define a quality goal in plain English. Stratix calibrates an LLM-as-judge to that goal, validates against your gold examples, and reuses the judge across runs. - -[Docs →](https://layerlens.gitbook.io/stratix-python-sdk) - - | -
| - -### Custom benchmarks - -Bring your own dataset. Smart benchmark generation for adversarial cases, edge inputs, and domain-specific evals. Reuses public scoring infrastructure. - -[Docs →](https://layerlens.gitbook.io/stratix-python-sdk) - - | -- -### CI integration - -Fail the build on quality regressions, not just on red unit tests. Use `stratix ci report` in GitHub Actions, GitLab CI, CircleCI, or any Python-capable runner. - -[Sample →](./samples/cicd) - - | -- -### Reproducible runs - -Every evaluation persists model version, prompt template, judge config, and full traces. Re-run any evaluation by ID. Cite the result with confidence. - -[Docs →](https://layerlens.gitbook.io/stratix-python-sdk) - - | -
| Hand-rolled (typical) | -Stratix | -
|---|---|
|
+Or wire it up yourself in Python:
 ```python
-import openai, json, asyncio
-from datasets import load_dataset
-
-ds = load_dataset("aime-2026")["test"]
-client = openai.OpenAI()
-
-results = []
-async def score_one(item):
-    resp = await client.chat.completions.create(
-        model="gpt-5.5-20260423",
-        messages=[{"role":"user","content":item["q"]}],
-    )
-    answer = parse_answer(resp.choices[0].message.content)
-    return {"q": item["q"], "ans": answer, "expected": item["a"],
-            "correct": answer == item["a"]}
-
-# Implement: rate limiting, retries, cost tracking,
-# trace storage, judge logic, schema versioning,
-# benchmark drift detection, regression alerting.
-# Repeat per benchmark. Per model. Per release.
-```
+from layerlens import PublicClient, Stratix
-
|
-
+# Public data (models, benchmarks, evaluations)
+pc = PublicClient(api_key="your-api-key")
-```python
-from layerlens import Stratix
+models = pc.models.get(page_size=200)
+print(f"{models.total_count} models available")
-client = Stratix()  # reads LAYERLENS_STRATIX_API_KEY
-
-evaluation = client.evaluations.create(
-    model=client.models.get_by_key("openai/gpt-5.5-20260423"),
-    benchmark=client.benchmarks.get_by_key("aime2026"),
+# Compare two models head-to-head at prompt level
+comparison = pc.comparisons.compare_models(
+    benchmark_id="benchmark-id",
+    model_id_1="model-a",
+    model_id_2="model-b",
+    outcome_filter="comparison_fails",  # where model B fails
 )
-result = client.evaluations.wait_for_completion(evaluation)
-
-print(result.accuracy)
-print(f"https://stratix.layerlens.ai/evaluations/{result.id}")
-```
-
-
|
-
| - | Stratix | -Braintrust | -LangSmith | -Phoenix | -OpenAI Evals | -
|---|---|---|---|---|---|
| Public-model leaderboard | 213 | none | none | none | limited |
| Independent grading | ✅ | ✅ | ✅ | ✅ | ⚠️ vendor |
| Reproducible scores | ✅ traces persisted | ✅ | ✅ | ✅ | ✅ |
| Agent trace evaluation | ✅ | ✅ | ✅ | ✅ | ⚠️ |
| Judge calibration in SDK | ✅ | ✅ | ⚠️ | ⚠️ | ⚠️ |
| Custom benchmarks | ✅ | ✅ | ✅ | ✅ | ✅ |
| Smart benchmark generation | ✅ | via templates | via templates | manual | manual |
| 59 prebuilt benchmarks out of the box | ✅ | via templates | via templates | via Arize | small core set |
| Context manager (sync) | -Context manager (async) | -
|---|---|
| - -```python -from layerlens import Stratix +## Documentation -with Stratix() as client: - eval = client.evaluations.create(...) -# HTTP connection released -``` - - | -- -```python -import asyncio -from layerlens import AsyncStratix - -async def main(): - async with AsyncStratix() as client: - eval = await client.evaluations.create(...) - -asyncio.run(main()) -``` - - | -
| Recently shipped | -In progress | -Coming up | -Exploring | -
|---|---|---|---|
| - -- [x] 213 public models -- [x] Agent trace evaluation -- [x] Judge calibration -- [x] Smart benchmark generation -- [x] Async client -- [x] Reproducible runs - - | -- -- [ ] Deliberation panels -- [ ] Custom-model adapters (open weights) -- [ ] Cost-aware eval routing - - | -- -- [ ] Per-domain leaderboards -- [ ] Streaming eval results -- [ ] TypeScript SDK - - | -- -- [ ] Cross-model A/B harness -- [ ] Latency-quality Pareto plots -- [ ] OpenTelemetry trace ingest - - | -
+ ⭐ Star us if you found this useful! ⭐
+ It helps more developers discover Stratix.
+