Merged
21 changes: 21 additions & 0 deletions .github/workflows/test-drift.yml
@@ -0,0 +1,21 @@
name: Drift Tests
on:
  schedule:
    - cron: "0 6 * * 1" # Weekly Monday 6am UTC
  workflow_dispatch: # Manual trigger
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm test:drift
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,14 @@
# @copilotkit/llmock

## 1.3.2

### Patch Changes

- Fix missing `refusal` field on OpenAI Chat Completions responses — both the SDK and real API return `refusal: null` on non-refusal messages, but llmock was omitting it
- Live API drift detection test suite: three-layer triangulation between SDK types, real API responses, and llmock output across OpenAI (Chat + Responses), Anthropic Claude, and Google Gemini
- Weekly CI workflow for automated drift checks
- `DRIFT.md` documentation for the drift detection system

## 1.3.1

### Patch Changes
118 changes: 118 additions & 0 deletions DRIFT.md
@@ -0,0 +1,118 @@
# Live API Drift Detection

llmock produces responses shaped like real LLM APIs. Providers change their APIs over time. **Drift** means the mock no longer matches reality — your tests pass against llmock but break against the real API.

## Three-Layer Approach

Drift detection compares three independent sources to triangulate the cause of any mismatch:

| SDK types = Real API? | Real API = llmock? | Diagnosis |
| --------------------- | ------------------ | -------------------------------------------------------------------- |
| Yes | No | **llmock drift** — response builders need updating |
| No | No | **Provider changed before SDK update** — flag, wait for SDK catch-up |
| Yes | Yes | **No drift** — all clear |
| No | Yes | **SDK drift** — provider deprecated something SDK still references |

Two-way comparison (mock vs real) can't distinguish between "we need to fix llmock" and "the SDK hasn't caught up yet." Three-way comparison can.
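
The table above can be read as a function of two boolean comparisons. The following is a minimal sketch of that mapping; the type and function names are hypothetical and are not taken from llmock's source.

```typescript
// Hypothetical names for illustration; not llmock's actual API.
type Diagnosis =
  | "no drift"
  | "llmock drift"
  | "provider ahead of SDK"
  | "SDK drift";

/**
 * Map the two pairwise comparisons onto the four rows of the table.
 * sdkMatchesReal:  do the SDK types agree with the real API response?
 * realMatchesMock: does the real API response agree with llmock's output?
 */
function diagnose(sdkMatchesReal: boolean, realMatchesMock: boolean): Diagnosis {
  if (sdkMatchesReal && realMatchesMock) return "no drift";
  if (sdkMatchesReal && !realMatchesMock) return "llmock drift";
  if (!sdkMatchesReal && !realMatchesMock) return "provider ahead of SDK";
  return "SDK drift"; // SDK disagrees with the real API, but the mock matches it
}
```

With only the mock-vs-real comparison, the first argument is unknown, so the "llmock drift" and "provider ahead of SDK" rows cannot be told apart.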

## Running Drift Tests

```bash
# All providers (requires all three API keys)
OPENAI_API_KEY=sk-... ANTHROPIC_API_KEY=sk-... GOOGLE_API_KEY=... pnpm test:drift

# Single provider (others skip automatically)
OPENAI_API_KEY=sk-... pnpm test:drift

# Strict mode — warnings also fail
STRICT_DRIFT=1 OPENAI_API_KEY=sk-... pnpm test:drift
```

Environment variables (each one enables its provider's tests):

- `OPENAI_API_KEY` — OpenAI API key
- `ANTHROPIC_API_KEY` — Anthropic API key
- `GOOGLE_API_KEY` — Google AI API key

Each provider's tests skip independently if its key is not set. You can run drift tests for just one provider.
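
As a rough illustration of the per-provider gating described above, the sketch below decides which providers would run from a given environment. The `Provider` type and `enabledProviders` helper are assumptions made for this example; only the three variable names come from this document.

```typescript
// Hypothetical helper; only the environment variable names come from this document.
type Provider = "openai" | "anthropic" | "google";

const KEY_FOR_PROVIDER: Record<Provider, string> = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
  google: "GOOGLE_API_KEY",
};

/** Return the providers whose API key is present and non-empty. */
function enabledProviders(env: Record<string, string | undefined>): Provider[] {
  return (Object.keys(KEY_FOR_PROVIDER) as Provider[]).filter((provider) => {
    const key = env[KEY_FOR_PROVIDER[provider]];
    return key !== undefined && key.trim() !== "";
  });
}
```

In a Vitest suite, one way to apply this would be `describe.skipIf(!process.env.OPENAI_API_KEY)(...)`, so that a missing key skips the suite instead of failing it. Whether llmock's drift tests use that exact mechanism is not stated here.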

## Reading Results

### Severity levels

- **critical** — Test fails. llmock produces a different shape than the real API for a field that both the SDK and real API agree on. This means llmock needs an update.
- **warning** — Test passes (unless `STRICT_DRIFT=1`). The real API has a field that neither the SDK nor llmock knows about, or the SDK and real API disagree. Usually means a provider added something new.
- **info** — Always passes. Known intentional differences (usage fields are always zero, optional fields llmock omits, etc.).
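
The pass/fail rule implied by these severity levels can be sketched as a small predicate. The names below are hypothetical; the only behavior assumed is what the list above states.

```typescript
// Hypothetical names; the rule itself follows the severity list above.
type Severity = "critical" | "warning" | "info";

/**
 * Decide whether a drift test should fail.
 * - critical always fails
 * - warning fails only in strict mode (STRICT_DRIFT=1)
 * - info never fails
 */
function shouldFail(severities: Severity[], strict: boolean): boolean {
  return severities.some(
    (severity) => severity === "critical" || (strict && severity === "warning"),
  );
}
```

A caller would typically derive `strict` from the environment, for example `process.env.STRICT_DRIFT === "1"`.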

### Example report output

```
API DRIFT DETECTED: OpenAI Chat Completions (non-streaming text)

1. [critical] LLMOCK DRIFT — field in SDK + real API but missing from mock
Path: usage.completion_tokens_details
SDK: object { reasoning_tokens: number }
Real: object { reasoning_tokens: number, accepted_prediction_tokens: number }
Mock: <absent>

2. [warning] PROVIDER ADDED FIELD — in real API but not in SDK or mock
Path: system_fingerprint
SDK: <absent>
Real: string
Mock: <absent>

3. [info] MOCK EXTRA FIELD — in mock but not in real API
Path: choices[0].logprobs
SDK: null | object
Real: <absent>
Mock: null
```
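
The `SDK`, `Real`, and `Mock` lines in the report describe shapes, not values. The sketch below shows one way such a shape string could be derived from a parsed JSON value. `describeShape` is a hypothetical function written for this example, and the `array<...>` notation is an assumption, since the report above does not show how arrays are rendered.

```typescript
// Hypothetical helper; not llmock's actual shape extraction code.
function describeShape(value: unknown): string {
  if (value === undefined) return "<absent>";
  if (value === null) return "null";
  if (Array.isArray(value)) {
    // Assumed notation: describe an array by the shape of its first element.
    return value.length === 0 ? "array" : `array<${describeShape(value[0])}>`;
  }
  if (typeof value === "object") {
    const record = value as Record<string, unknown>;
    const fields = Object.keys(record)
      .map((key) => `${key}: ${describeShape(record[key])}`)
      .join(", ");
    return fields === "" ? "object {}" : `object { ${fields} }`;
  }
  return typeof value; // "string", "number", "boolean", ...
}
```

Comparing shapes rather than values is what lets the report flag a missing field while ignoring differences such as token counts or generated text.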

## Fixing Detected Drift

When a `critical` drift is detected:

1. **Identify the response builder** — the report path tells you which provider and field:
   - OpenAI Chat Completions → `src/helpers.ts` (`buildTextCompletion`, `buildToolCallCompletion`, `buildTextChunks`, `buildToolCallChunks`)
   - OpenAI Responses API → `src/responses.ts` (`buildTextResponse`, `buildToolCallResponse`, `buildTextStreamEvents`, `buildToolCallStreamEvents`)
   - Anthropic Claude → `src/messages.ts` (`buildClaudeTextResponse`, `buildClaudeToolCallResponse`, `buildClaudeTextStreamEvents`, `buildClaudeToolCallStreamEvents`)
   - Google Gemini → `src/gemini.ts` (`buildGeminiTextResponse`, `buildGeminiToolCallResponse`, `buildGeminiTextStreamChunks`, `buildGeminiToolCallStreamChunks`)

2. **Update the builder** — add or modify the field to match the real API shape.

3. **Run conformance tests** — `pnpm test` to verify existing API conformance tests still pass.

4. **Run drift tests** — `pnpm test:drift` to verify the drift is resolved.
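
The `refusal` fix noted in the 1.3.2 changelog is an example of step 2. The sketch below shows the general idea on a deliberately simplified message builder. It is not llmock's `buildTextCompletion`; the interface and function names are hypothetical and cover only a fragment of a real Chat Completions response.

```typescript
// Simplified, hypothetical builder; llmock's real builders live in src/helpers.ts.
interface AssistantMessageSketch {
  role: "assistant";
  content: string;
  refusal: string | null;
}

function buildAssistantMessageSketch(content: string): AssistantMessageSketch {
  return {
    role: "assistant",
    content,
    // Emit the key explicitly. A shape comparison treats a missing key
    // differently from a key that is present with a null value.
    refusal: null,
  };
}
```

The point of the fix is the presence of the key: `{ refusal: null }` and an object without `refusal` serialize to different JSON, and clients that check for the key would see the difference.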

## Model Deprecation

The `models.drift.ts` test scrapes model names referenced in llmock's test files, README, and fixtures, then checks each provider's model listing API to verify they still exist.

When a model is deprecated:

1. Update the model name in the affected test files and fixtures
2. Update `src/__tests__/drift/providers.ts` if the cheap test model changed
3. Run `pnpm test` and `pnpm test:drift`
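
The model check described above amounts to two steps: collect model names from text, then compare them with what the provider lists. The following is a minimal sketch under stated assumptions. The regular expression and both function names are hypothetical, and the real `models.drift.ts` may collect names differently.

```typescript
// Assumed pattern: matches ids such as "gpt-4o-mini",
// "claude-haiku-4-5-20251001", and "gemini-2.5-flash".
const MODEL_PATTERN = /\b(?:gpt|claude|gemini)-[a-z0-9](?:[a-z0-9.-]*[a-z0-9])?\b/g;

/** Collect the unique model names mentioned in a piece of text, sorted. */
function extractModelNames(text: string): string[] {
  const matches = text.match(MODEL_PATTERN) ?? [];
  return Array.from(new Set(matches)).sort();
}

/** Return referenced models that the provider no longer lists. */
function findMissingModels(referenced: string[], available: string[]): string[] {
  const known = new Set(available);
  return referenced.filter((model) => !known.has(model));
}
```

In this sketch, `available` would come from each provider's model listing endpoint, and any name returned by `findMissingModels` is a candidate for the update steps listed above.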

## Adding a New Provider

1. Add the provider's SDK as a devDependency in `package.json`
2. Add shape extraction functions to `src/__tests__/drift/sdk-shapes.ts`
3. Add raw fetch client functions to `src/__tests__/drift/providers.ts`
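4. Create `src/__tests__/drift/<provider>.drift.ts` with four test scenarios (text and tool call, each in non-streaming and streaming form, mirroring the four builders per provider listed above)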
5. Add model listing function to `providers.ts` and model check to `models.drift.ts`
6. Update the allowlist in `schema.ts` if needed

## CI Schedule

Drift tests run on a schedule:

- **Weekly**: Monday 6:00 AM UTC
- **Manual**: Trigger via GitHub Actions UI (`workflow_dispatch`)
- **NOT** on PR or push — these tests hit real APIs and cost money

See `.github/workflows/test-drift.yml`.

## Cost

~20 API calls per run using the cheapest available models (`gpt-4o-mini`, `claude-haiku-4-5-20251001`, `gemini-2.5-flash`) with 10-100 max tokens each. Under $0.01/week.
2 changes: 1 addition & 1 deletion README.md
@@ -673,7 +673,7 @@ Areas where llmock could grow, and explicit non-goals for the current scope.

 ### Testing

-- **Live API conformance**: The `api-conformance` tests validate response format structure but do not run against real LLM APIs. A subset of tests that hit actual OpenAI/Anthropic/Gemini endpoints (gated behind API keys) would catch format drift as providers evolve their APIs.
+- **Live API drift detection**: The `drift` test suite runs against real OpenAI, Anthropic, and Gemini APIs to catch response format drift. See [DRIFT.md](DRIFT.md) for details on the three-layer triangulation approach, how to run tests, and how to fix detected drift. Runs weekly in CI; requires API keys.
 - **Token counts**: Usage fields are always zero across all providers.
 - **Vision/image content**: Image content parts are not handled by any provider.

6 changes: 5 additions & 1 deletion package.json
@@ -1,6 +1,6 @@
 {
   "name": "@copilotkit/llmock",
-  "version": "1.3.1",
+  "version": "1.3.2",
   "description": "Deterministic mock LLM server for testing (OpenAI, Anthropic, Gemini)",
   "license": "MIT",
   "packageManager": "pnpm@10.28.2",
@@ -36,6 +36,7 @@
   "scripts": {
     "build": "tsdown",
     "test": "vitest run",
+    "test:drift": "vitest run --config vitest.config.drift.ts",
     "test:exports": "publint && attw --pack .",
     "lint": "eslint .",
     "format:check": "prettier --check .",
@@ -60,6 +61,9 @@
     "tsdown": "^0.12.5",
     "typescript": "^5.8.3",
     "typescript-eslint": "^8.35.1",
+    "@anthropic-ai/sdk": "^0.78.0",
+    "@google/generative-ai": "^0.24.0",
+    "openai": "^4.0.0",
     "vitest": "^3.2.1"
   }
 }