Add live API drift detection + fix missing refusal field#33

Merged: jpr5 merged 3 commits into main from feat/drift-detection on Mar 15, 2026


jpr5 (Contributor) commented Mar 15, 2026

Summary

  • Bug fix: OpenAI Chat Completions responses now include refusal: null, a field that both the SDK types and the real API return but llmock was omitting. Conformance and unit tests now assert the field.
  • New feature: Three-layer drift detection test suite that triangulates between SDK types, real API responses, and llmock output to catch response shape drift across all 4 providers (OpenAI Chat, OpenAI Responses, Anthropic Claude, Google Gemini).
  • CI: Weekly GitHub Actions workflow for automated drift checks, plus a manual trigger.
  • Docs: Added concrete Gemini base URL setup instructions to the README (previously only a comment, with no actionable env var).

Details

19 drift tests across 5 files:

  • 16 shape comparison tests (4 providers × 4 scenarios: non-streaming text, non-streaming tool, streaming text, streaming tool)
  • 3 model deprecation checks (one per API vendor: OpenAI, Anthropic, Google)

Key robustness features:

  • All provider functions fail fast on non-2xx responses with status code + body in the error message
  • All streaming tests assert events were actually received (no silent pass on zero events)
  • SSE parsers handle \r\n line endings and continuation lines (Gemini sends wrapped JSON)
  • Retry with exponential backoff on 429/500/502/503
  • ping and other transport-level SSE events classified as info, not critical
  • Known intentional differences (usage fields, system_fingerprint, etc.) allowlisted
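
To illustrate the fail-fast and retry behaviour listed above, here is a minimal sketch. It is not the code in providers.ts: the names `requestWithRetry`, `HttpResponse`, and `RetryOptions`, the default attempt count, and the base delay are all assumptions made for the example. Only the retryable status codes (429/500/502/503) and the "status code + body in the error message" rule come from this PR.

```typescript
// Minimal response surface so the sketch does not depend on a particular fetch typing.
interface HttpResponse {
  status: number;
  text(): Promise<string>;
}

// Statuses the PR description lists as retryable.
const RETRYABLE = new Set<number>([429, 500, 502, 503]);

interface RetryOptions {
  maxAttempts?: number; // hypothetical default: 4
  baseDelayMs?: number; // hypothetical default: 500
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function requestWithRetry(
  doRequest: () => Promise<HttpResponse>,
  opts: RetryOptions = {},
): Promise<string> {
  const maxAttempts = opts.maxAttempts ?? 4;
  const baseDelayMs = opts.baseDelayMs ?? 500;
  const sleep = opts.sleep ?? defaultSleep;

  for (let attempt = 1; ; attempt++) {
    const res = await doRequest();
    const body = await res.text();
    if (res.status >= 200 && res.status < 300) return body;

    const canRetry = RETRYABLE.has(res.status) && attempt < maxAttempts;
    if (!canRetry) {
      // Fail fast: surface both the status code and the body.
      throw new Error(`HTTP ${res.status}: ${body}`);
    }
    // Exponential backoff: base, 2x base, 4x base, ...
    await sleep(baseDelayMs * 2 ** (attempt - 1));
  }
}
```

With a real client, `doRequest` would typically wrap a `fetch` call. Injecting it keeps the retry policy testable without network access.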

The refusal bug was discovered by running the drift tests against the real APIs, which is exactly the kind of mismatch the suite is meant to catch.

See DRIFT.md for full documentation.

Test plan

  • pnpm test — 540/540 existing tests pass (including new refusal assertions)
  • pnpm test:drift with all 3 API keys — 19/19 pass
  • pnpm test:drift without keys — 19 tests skip gracefully
  • Prettier + ESLint clean
  • 4 rounds of code review (code-reviewer, silent-failure-hunter, code-simplifier, comment-analyzer, pr-test-analyzer, type-design-analyzer) — all clean
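
As a rough illustration of how a run without keys can skip rather than fail, the sketch below computes which keys are absent. The environment variable names and the `missingKeys` helper are assumptions for the example. The PR does not state the names the suite actually reads.

```typescript
// Hypothetical env var names; the real suite may use different ones.
const REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"] as const;

// Returns the names of keys that are unset or empty in the given environment.
function missingKeys(env: Record<string, string | undefined>): string[] {
  return REQUIRED_KEYS.filter((name) => !env[name]);
}

// In a Vitest file this could gate a whole suite, for example:
//   describe.skipIf(missingKeys(process.env).length > 0)("drift", () => { /* tests */ });
```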

🤖 Generated with Claude Code

pkg-pr-new bot commented Mar 15, 2026

npm i https://pkg.pr.new/CopilotKit/llmock/@copilotkit/llmock@33

commit: 7a961f8

jpr5 force-pushed the feat/drift-detection branch 4 times, most recently from 19a1825 to 169156f (March 15, 2026 05:32)
OpenAI now returns a `refusal` field (null for non-refusal responses)
on all Chat Completions messages. Both the SDK types and real API
include it, but llmock was omitting it — causing shape mismatches
for consumers that validate response structure.
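
For reference, the shape difference this commit addresses can be sketched as follows. The interface is a trimmed subset of a Chat Completions assistant message. `buildAssistantMessage` and `hasRefusalField` are hypothetical helpers written for this example, not llmock functions.

```typescript
// Subset of a Chat Completions assistant message; only the fields relevant here.
interface AssistantMessage {
  role: "assistant";
  content: string | null;
  refusal: string | null; // present even when the model did not refuse
}

// Hypothetical builder showing the fixed output: the key exists and is null.
function buildAssistantMessage(content: string): AssistantMessage {
  return { role: "assistant", content, refusal: null };
}

// The kind of strict check a consumer might run: the key must exist,
// and its value must be a string or null.
function hasRefusalField(message: object): boolean {
  if (!("refusal" in message)) return false;
  const value = (message as { refusal: unknown }).refusal;
  return value === null || typeof value === "string";
}
```

A message without the key fails `hasRefusalField`, which is the mismatch a structure-validating consumer would have seen before the fix.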
jpr5 force-pushed the feat/drift-detection branch from 169156f to 7a626fc (March 15, 2026 05:33)
jpr5 added 2 commits March 14, 2026 22:35
Three-layer triangulation between SDK types, real API responses, and
llmock output to detect response shape drift across OpenAI (Chat +
Responses), Anthropic Claude, and Google Gemini.
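
A minimal sketch of the triangulation idea, under stated assumptions: `shapeOf`, `compareThreeWay`, the `Finding` type, the `warning` level, and the specific severity rules are all invented for this illustration and are not the contents of schema.ts. The sketch records value types but compares key presence only.

```typescript
type Severity = "critical" | "warning" | "info";

interface Finding {
  path: string;
  severity: Severity;
  note: string;
}

// Flatten a JSON value into a map of dotted path -> type name.
function shapeOf(
  value: unknown,
  prefix = "",
  out: Map<string, string> = new Map<string, string>(),
): Map<string, string> {
  if (value === null) {
    out.set(prefix, "null");
  } else if (Array.isArray(value)) {
    out.set(prefix, "array");
    if (value.length > 0) shapeOf(value[0], `${prefix}[]`, out); // first element stands in for all
  } else if (typeof value === "object") {
    if (prefix !== "") out.set(prefix, "object");
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      shapeOf(child, prefix === "" ? key : `${prefix}.${key}`, out);
    }
  } else {
    out.set(prefix, typeof value);
  }
  return out;
}

// Compare key presence across the three layers (assumed rules, see lead-in).
function compareThreeWay(
  sdk: Map<string, string>,
  real: Map<string, string>,
  mock: Map<string, string>,
  allowlist: ReadonlySet<string> = new Set<string>(),
): Finding[] {
  const paths = new Set<string>();
  for (const layer of [sdk, real, mock]) {
    for (const path of Array.from(layer.keys())) paths.add(path);
  }
  const findings: Finding[] = [];
  for (const path of Array.from(paths).sort()) {
    if (allowlist.has(path)) continue; // known intentional difference
    const inSdk = sdk.has(path);
    const inReal = real.has(path);
    const inMock = mock.has(path);
    if (inSdk && inReal && !inMock) {
      findings.push({ path, severity: "critical", note: "mock omits a field the SDK and real API agree on" });
    } else if (inReal && !inSdk && !inMock) {
      findings.push({ path, severity: "warning", note: "real API returns a field unknown to SDK and mock" });
    } else if (inMock && !inReal) {
      findings.push({ path, severity: "info", note: "mock returns a field the real API did not send" });
    }
  }
  return findings;
}
```

Under these rules, a mock response lacking `message.refusal` while both the SDK shape and the real response include it would surface as a `critical` finding.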

- schema.ts: shape extraction, three-way comparison, severity classification
- sdk-shapes.ts: expected shapes from SDK types
- providers.ts: raw fetch clients, SSE parsing, model listing
- helpers.ts: shared test fixtures and server lifecycle
- 4 provider drift test files (16 tests) + model deprecation checks (3 tests)
- vitest.config.drift.ts: separate config with 30s timeout
- Weekly CI workflow (.github/workflows/test-drift.yml)
- DRIFT.md: full documentation
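
The SSE parsing mentioned for providers.ts can be sketched roughly as below. This is an illustration, not the PR's parser: `parseSse` and `SseEvent` are hypothetical names, and "continuation lines" is assumed to mean several `data:` lines within one event, which the SSE format joins with a newline.

```typescript
interface SseEvent {
  event: string; // "message" when the stream gives no explicit event name
  data: string;
}

function parseSse(raw: string): SseEvent[] {
  const events: SseEvent[] = [];
  let eventName = "message";
  let dataLines: string[] = [];

  // Emit the buffered event (if it has data) and reset per-event state.
  const flush = (): void => {
    if (dataLines.length > 0) {
      events.push({ event: eventName, data: dataLines.join("\n") });
    }
    eventName = "message";
    dataLines = [];
  };

  // Accept \r\n, \n, and bare \r as line terminators.
  for (const line of raw.split(/\r\n|\n|\r/)) {
    if (line === "") {
      flush(); // a blank line ends the current event
      continue;
    }
    if (line.startsWith(":")) continue; // comment line
    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value.startsWith(" ")) value = value.slice(1); // strip one leading space
    if (field === "event") eventName = value;
    else if (field === "data") dataLines.push(value);
  }
  flush(); // tolerate a stream that ends without a trailing blank line
  return events;
}
```

Because a newline between JSON tokens is insignificant whitespace, a payload wrapped across two `data:` lines at a token boundary still parses with `JSON.parse` after joining. A streaming test can then assert that the returned array is non-empty before comparing shapes.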
jpr5 force-pushed the feat/drift-detection branch from 7a626fc to 7a961f8 (March 15, 2026 05:35)
jpr5 merged commit e75918a into main on Mar 15, 2026
9 checks passed
jpr5 deleted the feat/drift-detection branch (March 15, 2026 05:46)
jpr5 added a commit that referenced this pull request Apr 3, 2026