Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -19,4 +19,5 @@ codegen.log
Brewfile.lock.json

.DS_Store
.coverage
.coveragedocs/review/
marc-only/
13 changes: 13 additions & 0 deletions docs/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,19 @@
* [Judges and Traces](examples/judges-and-traces.md)
* [Public API](examples/public-api.md)

## Samples & Tutorials
* [Samples Overview](samples-guide.md)
* [Core SDK Operations](samples-guide.md#core-sdk-operations-18-samples) -- Traces, judges, evaluations, results, models, benchmarks, async
* [Industry Solutions](samples-guide.md#industry-solutions-10-samples) -- Healthcare, finance, legal, government, insurance, retail
* [Multi-Agent Evaluation](samples-guide.md#multi-agent-evaluation-5-samples) -- Cowork and Agent Teams patterns
* [Content-Type Evaluations](samples-guide.md#content-type-evaluations-3-samples) -- Text, brand, document
* [CI/CD Integration](samples-guide.md#cicd-integration-2-samples--workflow) -- Quality gates, pre-commit hooks, GitHub Actions
* [LLM Provider Integrations](samples-guide.md#llm-provider-integrations-2-samples) -- OpenAI, Anthropic
* [OpenClaw Agent Evaluation](samples-guide.md#openclaw-agent-evaluation-10-demos--skill) -- Cage match, code gate, safety audit, red-team
* [MCP Server](samples-guide.md#mcp-server-1-sample) -- LayerLens as tools for Claude and other MCP clients
* [CopilotKit Integration](samples-guide.md#copilotkit-integration-2-agents--ui-components) -- LangGraph CoAgents, React components
* [Claude Code Skills](samples-guide.md#claude-code-skills-6-skills) -- Slash commands for CLI and desktop

## Troubleshooting
* [Overview](troubleshooting/README.md)
* [Common Issues](troubleshooting/common-issues.md)
Expand Down
61 changes: 27 additions & 34 deletions docs/examples/README.md
Original file line number Diff line number Diff line change
@@ -1,41 +1,34 @@
# Examples
# Code Examples

This section provides practical code examples for common SDK use cases. All examples are available as runnable scripts in the [`examples/`](../../examples/) directory.
This section provides practical code examples for common SDK use cases. All examples are available as runnable scripts in the [`samples/`](../../samples/) directory.

## Quick Reference

| Example | Description |
| ------- | ----------- |
| [`client_simple.py`](../../examples/client_simple.py) | Minimal sync client usage |
| [`client.py`](../../examples/client.py) | Full sync evaluation workflow |
| [`async_client_simple.py`](../../examples/async_client_simple.py) | Minimal async client usage |
| [`async_client.py`](../../examples/async_client.py) | Full async evaluation workflow |
| [`async_run_evaluations.py`](../../examples/async_run_evaluations.py) | Run multiple evaluations in parallel |
| [`get_models.py`](../../examples/get_models.py) | Filter models by name, company, region, type |
| [`get_benchmarks.py`](../../examples/get_benchmarks.py) | Filter benchmarks by name and type |
| [`get_evaluation.py`](../../examples/get_evaluation.py) | Fetch an evaluation by ID |
| [`evaluation_sorting.py`](../../examples/evaluation_sorting.py) | Sort and filter evaluations |
| [`compare_evaluations.py`](../../examples/compare_evaluations.py) | Compare two models on a benchmark |
| [`paginated_results.py`](../../examples/paginated_results.py) | Paginate through evaluation results |
| [`all_results_no_pagination.py`](../../examples/all_results_no_pagination.py) | Fetch all results at once |
| [`fetch_results_async.py`](../../examples/fetch_results_async.py) | Fetch results for multiple evaluations concurrently |
| [`create_custom_model.py`](../../examples/create_custom_model.py) | Create a custom model with an OpenAI-compatible API |
| [`create_custom_benchmark.py`](../../examples/create_custom_benchmark.py) | Create a custom benchmark from a JSONL file |
| [`create_smart_benchmark.py`](../../examples/create_smart_benchmark.py) | Create an AI-generated benchmark from documents |
| [`manage_project_models_benchmarks.py`](../../examples/manage_project_models_benchmarks.py) | Add/remove models and benchmarks from a project |
| [`judges.py`](../../examples/judges.py) | Create, list, update, and delete judges |
| [`traces.py`](../../examples/traces.py) | Upload, list, get, and delete traces |
| [`trace_evaluations.py`](../../examples/trace_evaluations.py) | Run judges on traces, estimate cost, get results |
| [`async_judges_and_traces.py`](../../examples/async_judges_and_traces.py) | Async judge and trace evaluation workflow |
| [`judge_optimizations.py`](../../examples/judge_optimizations.py) | Estimate, run, and apply judge optimizations |
| [`public_models.py`](../../examples/public_models.py) | Browse, search, and filter public models |
| [`public_benchmarks.py`](../../examples/public_benchmarks.py) | Browse public benchmarks and download prompts |
| [`public_evaluations.py`](../../examples/public_evaluations.py) | Get public evaluation details and results |
| Sample | Description |
|--------|-------------|
| [`benchmark_evaluation.py`](../../samples/core/benchmark_evaluation.py) | Run a model against a benchmark, wait for completion, retrieve results |
| [`quickstart.py`](../../samples/core/quickstart.py) | Minimal end-to-end trace evaluation |
| [`async_workflow.py`](../../samples/core/async_workflow.py) | Full async evaluation workflow with concurrent operations |
| [`async_results.py`](../../samples/core/async_results.py) | Fetch results for multiple evaluations concurrently |
| [`model_benchmark_management.py`](../../samples/core/model_benchmark_management.py) | Filter models by name/company/region, add/remove from project |
| [`evaluation_filtering.py`](../../samples/core/evaluation_filtering.py) | Sort and filter evaluations by status, accuracy, date |
| [`compare_evaluations.py`](../../samples/core/compare_evaluations.py) | Compare two models on a benchmark with outcome filtering |
| [`paginated_results.py`](../../samples/core/paginated_results.py) | Paginate through results or fetch all at once |
| [`custom_model.py`](../../samples/core/custom_model.py) | Register a custom model with an OpenAI-compatible API |
| [`custom_benchmark.py`](../../samples/core/custom_benchmark.py) | Create custom and smart benchmarks from data files |
| [`create_judge.py`](../../samples/core/create_judge.py) | Create, list, update, and delete judges |
| [`basic_trace.py`](../../samples/core/basic_trace.py) | Upload, list, get, and delete traces |
| [`trace_evaluation.py`](../../samples/core/trace_evaluation.py) | Run judges on traces, estimate cost, get results with steps |
| [`judge_optimization.py`](../../samples/core/judge_optimization.py) | Estimate, run, and apply judge optimizations |
| [`public_catalog.py`](../../samples/core/public_catalog.py) | Browse public models, benchmarks, evaluations, and prompts |
| [`integration_management.py`](../../samples/core/integration_management.py) | List, inspect, and test configured integrations |

## Guides

- [Creating Evaluations](creating-evaluations.md) - Sync, async, and parallel evaluations
- [Retrieving Results](retrieving-results.md) - Paginated, bulk, and concurrent result fetching
- [Models and Benchmarks](models-and-benchmarks.md) - Filtering, custom models, custom/smart benchmarks, project management
- [Judges and Traces](judges-and-traces.md) - Judge CRUD, trace uploads, trace evaluations, and optimizations
- [Public API](public-api.md) - Public models, benchmarks, evaluations, and comparisons
- [Creating Evaluations](creating-evaluations.md) -- Sync, async, and parallel evaluations
- [Retrieving Results](retrieving-results.md) -- Paginated, bulk, and concurrent result fetching
- [Models and Benchmarks](models-and-benchmarks.md) -- Filtering, custom models, custom/smart benchmarks, project management
- [Judges and Traces](judges-and-traces.md) -- Judge CRUD, trace uploads, trace evaluations, and optimizations
- [Public API](public-api.md) -- Public models, benchmarks, evaluations, and comparisons

For the complete samples catalog including industry solutions, OpenClaw agent evaluation, CI/CD integration, and more, see the [Samples Guide](../samples-guide.md).
26 changes: 17 additions & 9 deletions docs/examples/creating-evaluations.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ Examples for creating evaluations on the Stratix platform using the LayerLens Py

### Using Synchronous Client

> Source: [`examples/client.py`](../../examples/client.py)
> Source: [`samples/core/benchmark_evaluation.py`](../../samples/core/benchmark_evaluation.py)

```python
from layerlens import Stratix
Expand Down Expand Up @@ -49,7 +49,7 @@ else:

### Minimal Sync Example

> Source: [`examples/client_simple.py`](../../examples/client_simple.py)
> Source: [`samples/core/benchmark_evaluation.py`](../../samples/core/benchmark_evaluation.py)

```python
from layerlens import Stratix
Expand All @@ -70,7 +70,7 @@ evaluation = client.evaluations.create(

### Using Async Client

> Source: [`examples/async_client_simple.py`](../../examples/async_client_simple.py)
> Source: [`samples/core/async_workflow.py`](../../samples/core/async_workflow.py)

```python
import asyncio
Expand Down Expand Up @@ -106,7 +106,7 @@ if __name__ == "__main__":

## Sorting and Filtering Evaluations

> Source: [`examples/evaluation_sorting.py`](../../examples/evaluation_sorting.py)
> Source: [`samples/core/evaluation_filtering.py`](../../samples/core/evaluation_filtering.py)

```python
import asyncio
Expand Down Expand Up @@ -163,7 +163,7 @@ if __name__ == "__main__":

## Comparing Evaluations

> Source: [`examples/compare_evaluations.py`](../../examples/compare_evaluations.py)
> Source: [`samples/core/compare_evaluations.py`](../../samples/core/compare_evaluations.py)

```python
from layerlens import PublicClient
Expand Down Expand Up @@ -200,7 +200,7 @@ comparison = client.comparisons.compare(

## Running Multiple Evaluations in Parallel

> Source: [`examples/async_run_evaluations.py`](../../examples/async_run_evaluations.py)
> Source: [`samples/core/async_results.py`](../../samples/core/async_results.py)

```python
import asyncio
Expand Down Expand Up @@ -253,7 +253,7 @@ if __name__ == "__main__":

### Paginated Results

> Source: [`examples/paginated_results.py`](../../examples/paginated_results.py)
> Source: [`samples/core/paginated_results.py`](../../samples/core/paginated_results.py)

```python
import asyncio
Expand Down Expand Up @@ -298,7 +298,7 @@ if __name__ == "__main__":

### All Results Without Pagination

> Source: [`examples/all_results_no_pagination.py`](../../examples/all_results_no_pagination.py)
> Source: [`samples/core/paginated_results.py`](../../samples/core/paginated_results.py)

```python
import asyncio
Expand Down Expand Up @@ -326,7 +326,7 @@ if __name__ == "__main__":

### Fetch Results for Multiple Evaluations Concurrently

> Source: [`examples/fetch_results_async.py`](../../examples/fetch_results_async.py)
> Source: [`samples/core/async_results.py`](../../samples/core/async_results.py)

```python
import asyncio
Expand Down Expand Up @@ -385,3 +385,11 @@ except layerlens.NotFoundError:
except layerlens.APIError as e:
print(f"API error: {e}")
```

## Related Samples

- [`samples/core/benchmark_evaluation.py`](../../samples/core/benchmark_evaluation.py) -- Full model+benchmark evaluation workflow with result pagination
- [`samples/core/run_evaluation.py`](../../samples/core/run_evaluation.py) -- Evaluation lifecycle management
- [`samples/core/trace_evaluation.py`](../../samples/core/trace_evaluation.py) -- Trace-level evaluation with judges
- [`samples/core/async_results.py`](../../samples/core/async_results.py) -- Concurrent async evaluation and result fetching
- [`samples/core/compare_evaluations.py`](../../samples/core/compare_evaluations.py) -- Side-by-side evaluation comparison
10 changes: 5 additions & 5 deletions docs/examples/judges-and-traces.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ Examples for working with judges, traces, and trace evaluations on the Stratix p

## Creating and Managing Judges

> Source: [`examples/judges.py`](../../examples/judges.py)
> Source: [`samples/core/create_judge.py`](../../samples/core/create_judge.py)

```python
import time
Expand Down Expand Up @@ -51,7 +51,7 @@ print(f"Deleted judge {deleted.id}")

## Uploading and Managing Traces

> Source: [`examples/traces.py`](../../examples/traces.py)
> Source: [`samples/core/basic_trace.py`](../../samples/core/basic_trace.py)

```python
import os
Expand Down Expand Up @@ -94,7 +94,7 @@ print(f"Deleted: {deleted}")

## Running Trace Evaluations

> Source: [`examples/trace_evaluations.py`](../../examples/trace_evaluations.py)
> Source: [`samples/core/trace_evaluation.py`](../../samples/core/trace_evaluation.py)

```python
import time
Expand Down Expand Up @@ -150,7 +150,7 @@ client.judges.delete(judge.id)

## Judge Optimizations

> Source: [`examples/judge_optimizations.py`](../../examples/judge_optimizations.py)
> Source: [`samples/core/judge_optimization.py`](../../samples/core/judge_optimization.py)

Optimization requires that the judge has at least 10 annotations (trace evaluation results). Run trace evaluations first to build up annotation data.

Expand Down Expand Up @@ -221,7 +221,7 @@ client.judges.delete(judge.id)

## Async Judges and Traces

> Source: [`examples/async_judges_and_traces.py`](../../examples/async_judges_and_traces.py)
> Source: [`samples/core/async_results.py`](../../samples/core/async_results.py)

```python
import os
Expand Down
12 changes: 6 additions & 6 deletions docs/examples/models-and-benchmarks.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ Examples for browsing, filtering, creating, and managing models and benchmarks u

## Filtering Models

> Source: [`examples/get_models.py`](../../examples/get_models.py)
> Source: [`samples/core/model_benchmark_management.py`](../../samples/core/model_benchmark_management.py)

```python
import asyncio
Expand Down Expand Up @@ -56,7 +56,7 @@ if __name__ == "__main__":

## Filtering Benchmarks

> Source: [`examples/get_benchmarks.py`](../../examples/get_benchmarks.py)
> Source: [`samples/core/model_benchmark_management.py`](../../samples/core/model_benchmark_management.py)

```python
import asyncio
Expand Down Expand Up @@ -98,7 +98,7 @@ if __name__ == "__main__":

## Creating a Custom Model

> Source: [`examples/create_custom_model.py`](../../examples/create_custom_model.py)
> Source: [`samples/core/custom_model.py`](../../samples/core/custom_model.py)

Custom models let you evaluate any model accessible via an OpenAI-compatible chat completions endpoint.

Expand Down Expand Up @@ -139,7 +139,7 @@ if __name__ == "__main__":

## Creating a Custom Benchmark

> Source: [`examples/create_custom_benchmark.py`](../../examples/create_custom_benchmark.py)
> Source: [`samples/core/custom_benchmark.py`](../../samples/core/custom_benchmark.py)

Custom benchmarks are created from JSONL files with `input` and `truth` fields.

Expand Down Expand Up @@ -197,7 +197,7 @@ Optional field: `subset` (for grouping prompts into categories).

## Creating a Smart Benchmark

> Source: [`examples/create_smart_benchmark.py`](../../examples/create_smart_benchmark.py)
> Source: [`samples/core/custom_benchmark.py`](../../samples/core/custom_benchmark.py)

Smart benchmarks use AI to automatically generate benchmark prompts from uploaded documents. Supported file types: `.txt`, `.pdf`, `.html`, `.docx`, `.csv`, `.json`, `.jsonl`, `.parquet`.

Expand Down Expand Up @@ -238,7 +238,7 @@ if __name__ == "__main__":

## Managing Project Models and Benchmarks

> Source: [`examples/manage_project_models_benchmarks.py`](../../examples/manage_project_models_benchmarks.py)
> Source: [`samples/core/model_benchmark_management.py`](../../samples/core/model_benchmark_management.py)

Add and remove public models and benchmarks from your project.

Expand Down
8 changes: 4 additions & 4 deletions docs/examples/public-api.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ public = PublicClient()

## Public Models

> Source: [`examples/public_models.py`](../../examples/public_models.py)
> Source: [`samples/core/public_catalog.py`](../../samples/core/public_catalog.py)

```python
from layerlens import PublicClient
Expand Down Expand Up @@ -79,7 +79,7 @@ if __name__ == "__main__":

## Public Benchmarks

> Source: [`examples/public_benchmarks.py`](../../examples/public_benchmarks.py)
> Source: [`samples/core/public_catalog.py`](../../samples/core/public_catalog.py)

```python
from layerlens import PublicClient
Expand Down Expand Up @@ -144,7 +144,7 @@ if __name__ == "__main__":

## Public Evaluations

> Source: [`examples/public_evaluations.py`](../../examples/public_evaluations.py)
> Source: [`samples/core/public_catalog.py`](../../samples/core/public_catalog.py)

```python
from layerlens import PublicClient
Expand Down Expand Up @@ -207,7 +207,7 @@ if __name__ == "__main__":

## Comparing Evaluations

> Source: [`examples/compare_evaluations.py`](../../examples/compare_evaluations.py)
> Source: [`samples/core/compare_evaluations.py`](../../samples/core/compare_evaluations.py)

Compare how two models perform on the same benchmark, prompt by prompt.

Expand Down
6 changes: 3 additions & 3 deletions docs/examples/retrieving-results.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ Examples for fetching evaluation results using the LayerLens Python SDK, includi

## Paginated Results

> Source: [`examples/paginated_results.py`](../../examples/paginated_results.py)
> Source: [`samples/core/paginated_results.py`](../../samples/core/paginated_results.py)

Walk through results page by page with full control over page size.

Expand Down Expand Up @@ -83,7 +83,7 @@ if __name__ == "__main__":

## All Results Without Pagination

> Source: [`examples/all_results_no_pagination.py`](../../examples/all_results_no_pagination.py)
> Source: [`samples/core/paginated_results.py`](../../samples/core/paginated_results.py)

Use `get_all()` to fetch every result in a single call. Simpler but loads everything into memory.

Expand Down Expand Up @@ -122,7 +122,7 @@ if __name__ == "__main__":

## Fetch Results for Multiple Evaluations Concurrently

> Source: [`examples/fetch_results_async.py`](../../examples/fetch_results_async.py)
> Source: [`samples/core/async_results.py`](../../samples/core/async_results.py)

Use `asyncio.gather` to load results for several evaluations in parallel.

Expand Down
Loading
Loading