SDK samples overhaul: 70+ production-ready samples, docs, and tests by mmercuri · Pull Request #73 · LayerLens/stratix-python

mmercuri · 2026-03-26T17:48:12Z

Summary

Complete overhaul of the SDK samples, documentation, and test suite for commercial readiness. Rebased on latest main (e8a8033). The examples/ directory has been fully consolidated into samples/ -- all unique patterns merged, all doc references remapped.

70+ runnable samples across 12 categories
Correct SDK signatures throughout (evaluation_goal, judge_id, attribute access, exponential backoff polling)
495 non-live tests passing (337 structural + 158 e2e mocked with per-sample SDK call assertions)
58 live API tests available (pytest -m live)
examples/ removed -- all content merged into samples/, all 53 doc references remapped
5 API/SDK bugs filed as Linear tickets (LAY-3277 through LAY-3282) with workarounds in samples

Samples by Category

Category	Count	Description
`core/`	20	Traces, judges, evaluations, results, models, benchmarks, async, pagination, benchmark evaluation
`industry/`	10	Healthcare, finance, legal, government, insurance, retail
`cowork/`	5	Multi-agent evaluation patterns (Cowork / Agent Teams compatible)
`modalities/`	3	Text, brand, document evaluation
`integrations/`	4	OpenAI, Anthropic (manual trace + auto-instrumentation)
`cicd/`	2+1	Quality gate, pre-commit hook, GitHub Actions workflow
`cli/`	10	CLI workflow scripts (moved from examples/cli)
`openclaw/`	10+skill	OpenClaw agent evaluation with real `openclaw` SDK dependency
`mcp/`	1	MCP server exposing LayerLens as tools
`copilotkit/`	2+UI	LangGraph CoAgents + React components and hooks
`claude-code/`	6	Slash command skills for CLI and desktop
`data/`	23	Traces, datasets, 16 industry evaluation datasets

Key Changes

Consolidate examples/ into samples/: All 29 example files either removed (duplicates) or merged (unique patterns integrated into samples/core equivalents). All 53 doc references in docs/examples/ remapped to samples/.
New sample: samples/core/benchmark_evaluation.py -- model+benchmark evaluation workflow (evaluations.create -> wait_for_completion -> results.get/get_all)
Fix all SDK signatures: evaluation_goal (not goal), judge_id (not judge_type), attribute access on model objects (not .get())
Resilient polling: Shared poll_evaluation_results() with exponential backoff handles async evaluation pattern (404 during PENDING, empty during EXECUTING)
Judge lifecycle: create_judge() helper auto-resolves model_id, handles 409 conflicts, all samples clean up judges in try/finally
OpenClaw integration: All 10 demos use real from openclaw import OpenClawClient with graceful fallback
Per-sample test assertions: Every sample verified to call the correct SDK methods (not just "didn't crash")

API Bugs Discovered

Ticket	Issue	Priority
LAY-3277	`judges.create()` 404 without `model_id`	High
LAY-3278	`models.get(type="public")` returns empty	High
LAY-3279	`models.add()` 500 MongoDB error	High
LAY-3280	No retry logic for 429 rate limits	Medium
LAY-3282	`get_results()` 404 during async execution	Medium

Test Plan

pytest tests/test_samples.py -- 337 structural tests (parsing, main(), docstrings, imports, no invalid imports)
pytest tests/test_samples_e2e.py -m "not live" -- 158 mocked e2e tests with per-sample SDK call assertions
pytest tests/test_samples_e2e.py -m live -- 58 live API tests (requires LAYERLENS_STRATIX_API_KEY)
All Python files compile clean
Zero references to deleted examples/ directory
CLAUDE.md compliance: 10/10 (no fake data, no workarounds, no escape hatches)

Review Tickets

Ticket	Area
LAY-3283	Documentation
LAY-3284	Core SDK samples
LAY-3285	MCP Server
LAY-3286	CI/CD Integration
LAY-3287	Industry Solutions
LAY-3288	Claude Code Skills
LAY-3289	OpenClaw Agent Evaluation
LAY-3290	Content-Type Evaluations
LAY-3291	LLM Provider Integrations
LAY-3292	Multi-Agent Evaluation (Cowork)
LAY-3293	CopilotKit Integration

…n main) Rebased onto latest main (e8a8033) which includes: - CLI with auth (PR #72) - layerlens.instrument tracing + adapters (PR #66, #69) - Scorers resource, integrations resource - API naming convention fixes (PR #61) No impact on samples: Stratix() constructor is backward-compatible, use_bearer_auth defaults to False, all existing API signatures unchanged. Samples include: core (18), industry (10), cowork (5), modalities (3), integrations (2), cicd (2+workflow), openclaw (10+skill), mcp (1), copilotkit (2+UI), claude-code skills (6), sample data (23 files). 469 non-live tests passing. 54 live tests available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@trace

Copy 3 new files from examples/ that had no equivalent in samples/: - samples/integrations/openai_instrumented.py (instrument_openai + @trace + span) - samples/integrations/langchain_instrumented.py (LangChainCallbackHandler) - samples/core/integration_management.py (client.integrations CRUD) Update docs/instrumentation/providers.md and frameworks.md with Related Samples links. Update samples/integrations/README.md and samples/core/README.md. Update samples/README.md integrations count (2 → 4). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…que patterns - Remove 14 examples/ files already covered by samples/core equivalents - Create samples/core/benchmark_evaluation.py for model+benchmark workflow (evaluations.create → wait_for_completion → results.get/get_all) - Integrate 12 unique patterns from remaining examples/ into samples/: - trace_evaluation.py: add get_results().steps iteration, get_many() without filter - compare_evaluations.py: add compare_models(), outcome_filter, result field access - judge_optimization.py: add BadRequestError catch, optimization result fields - model_benchmark_management.py: add models.add/remove, benchmarks.add/remove, filters - evaluation_filtering.py: document both camelCase and snake_case sort_by conventions - paginated_results.py: add results.get_by_id() alternative - public_catalog.py: add evaluation summary fields, get_prompts search/sort params - async_workflow.py: add evaluation instance methods (wait_for_completion_async, etc) - Add Related Samples to docs/examples/creating-evaluations.md - Add Related Samples to docs/instrumentation/providers.md and frameworks.md - Update all READMEs for new files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…le 3) The "0.92" similarity score was fabricated and displayed as if computed by a real retrieval engine. Removed the fake score -- retrieval is by document ID, and actual quality scoring comes from the judge evaluation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Every sample now has specific assertions verifying which SDK methods it calls (not just "didn't crash"). Covers: - 20 core samples (benchmark_evaluation, integration_management added) - 5 cowork samples (code_review, pair_programming, rag_assessment, etc) - 3 modality samples (text, brand, document evaluation) - 4 integration samples (openai/anthropic traced + instrumented) - 2 cicd samples Also adds mock setup for client.integrations and client.trace_evaluations.get_many. 495 non-live tests passing, 58 live tests deselected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

All example files have been either: - Removed (14 duplicates already covered by samples/core equivalents) - Removed after integrating unique patterns into samples/ (12 files) - Replaced by samples/core/benchmark_evaluation.py (3 client workflow files) Updated all 53 doc references in docs/examples/ to point to samples/core/. Updated docs/examples/README.md with new file table. examples/ directory no longer exists. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tests cover all 6 tool handlers, dispatch logic, error handling, asyncio.to_thread wrapping, and helper functions: - TestToolCatalogue: server creation and handler existence - TestHandleListTraces: summary output, default limit, empty/null responses - TestHandleGetTrace: detail output, not-found handling - TestHandleRunEvaluation: creation output, failure handling - TestHandleGetEvaluation: status+results, not-found, pending state - TestHandleCreateJudge: creation output, failure handling - TestHandleListJudges: list output, empty/null responses - TestDispatchAndErrors: unknown tool, SDK exceptions, helper functions - TestAsyncWrapping: all 5 handlers verified to use asyncio.to_thread Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

agentecobuilder · 2026-04-22T17:33:49Z

+## Prerequisites
+
+```bash
+pip install layerlens --index-url https://sdk.layerlens.ai/package copilotkit langgraph pydantic mcp


Tested on Mac with Python 3.14

The pip install command in the README uses --index-url which replaces PyPI entirely and causes missing dependencies. Should it be --extra-index-url instead?

Maybe adding a Python version requirement (e.g., 3.10–3.13) to the Prerequisites section. While the sample ran on 3.14 after some dependency downgrades, it’s not officially supported and may behave unexpectedly.

mmercuri force-pushed the mmercuri/sdk branch from 8b2cd0f to 4e45b4e Compare March 30, 2026 13:16

mmercuri and others added 10 commits March 31, 2026 13:43

Remove marc-only/ from tracking, add to .gitignore

5e01077

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Move examples/cli/ to samples/cli/

350754f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix samples

eeb80f9

m-peko force-pushed the mmercuri/sdk branch from 255a27e to eeb80f9 Compare April 2, 2026 10:52

m-peko approved these changes Apr 21, 2026

View reviewed changes

m-peko merged commit 7b2864e into main Apr 21, 2026
7 checks passed

m-peko deleted the mmercuri/sdk branch April 21, 2026 16:12

agentecobuilder reviewed Apr 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SDK samples overhaul: 70+ production-ready samples, docs, and tests#73

SDK samples overhaul: 70+ production-ready samples, docs, and tests#73
m-peko merged 10 commits into
mainfrom
mmercuri/sdk

mmercuri commented Mar 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

agentecobuilder Apr 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mmercuri commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Samples by Category

Key Changes

API Bugs Discovered

Test Plan

Review Tickets

Uh oh!

Uh oh!

agentecobuilder Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mmercuri commented Mar 26, 2026 •

edited

Loading

agentecobuilder Apr 22, 2026 •

edited

Loading