chore: update model references to latest + add check-models skill & CI #64

Closed
jimbobbennett wants to merge 8 commits into main from chore/check-models

Conversation

@jimbobbennett

Contributor

What

Adds the check-models skill and CI gate (matching Arize-ai/docs#616), and runs it across the tutorials to bring every OpenAI/Anthropic model reference up to date.

New tooling

  • .agents/skills/check-models/: finds outdated model references and migrates them to the latest size-equivalent model (a *-mini → latest mini, never the flagship), plus the code changes each new generation needs (e.g. max_tokens → max_completion_tokens for GPT-5 raw SDK calls). scripts/models.json is the single source of truth; scan-models.mjs --refresh prints the lookup checklist when the list looks stale.
  • .github/workflows/check-models.yml — PR gate that scans only the changed lines under python/ and typescript/, comments findings, and fails on newly-introduced outdated model IDs.
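
The size-equivalent rule can be sketched as a small lookup. This is a minimal illustration only: the real layout of scripts/models.json is not shown in this PR, so the `LATEST_BY_TIER` table, the `tier_of` suffix heuristic, and the `migrate` helper are hypothetical names invented for the example. The model IDs are the ones quoted in this description.

```python
# Hypothetical policy table; the actual scripts/models.json layout may differ.
LATEST_BY_TIER = {
    "flagship": "gpt-5.5",
    "mini": "gpt-5.4-mini",
    "nano": "gpt-5.4-nano",
}


def tier_of(model_id: str) -> str:
    """Guess the size tier from the ID suffix (illustrative heuristic only)."""
    for tier in ("mini", "nano"):
        if model_id.endswith(f"-{tier}"):
            return tier
    return "flagship"


def migrate(model_id: str) -> str:
    """Return the latest model in the same size tier, never a larger one."""
    return LATEST_BY_TIER[tier_of(model_id)]
```

Under these assumptions a mini stays a mini: `migrate("gpt-4o-mini")` resolves through the "mini" tier rather than jumping to the flagship.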

Content migration

Migrated outdated model IDs across python/ and typescript/ (mostly notebooks):

  • OpenAI → gpt-5.5 / gpt-5.4-mini / gpt-5.4-nano (size-tier preserved)
  • Anthropic → claude-opus-4-8 / claude-sonnet-4-6 / claude-haiku-4-5
  • GPT-5 param changes applied to raw OpenAI SDK calls only — wrapper libraries (phoenix.evals.OpenAIModel, langchain_openai.ChatOpenAI, litellm) keep their own max_tokens/temperature kwargs.
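
For raw OpenAI SDK calls, the parameter change amounts to rewriting the keyword arguments before the request is made. The helper below is a hedged sketch, not code from this PR; `to_reasoning_kwargs` is a hypothetical name, and it only transforms a dict, so it makes no API call.

```python
def to_reasoning_kwargs(kwargs: dict) -> dict:
    """Adapt chat-completion kwargs for a GPT-5/o-series reasoning model.

    Renames max_tokens to max_completion_tokens and drops temperature.
    The input dict is left unmodified.
    """
    out = dict(kwargs)
    if "max_tokens" in out:
        # Keep an explicit max_completion_tokens if the caller already set one.
        out.setdefault("max_completion_tokens", out.pop("max_tokens"))
    out.pop("temperature", None)
    return out
```

A straight rename may not be enough on its own: a later commit in this PR raised the openai-tracing budget from 20 to 256 when switching to max_completion_tokens, so the value may need review as well.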

Every notebook source change is model-related (verified); notebooks re-serialized to nbformat indent=1 to keep diffs minimal. Comparative/historical prose and base64 image data are left untouched.

Verify

node .agents/skills/check-models/scripts/scan-models.mjs python typescript → 0 errors.

🤖 Generated with Claude Code

@review-notebook-app


Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@jimbobbennett force-pushed the chore/check-models branch 6 times, most recently from 1b569b3 to 59a23f8 (June 18, 2026 01:17)
@github-actions

github-actions Bot commented Jun 18, 2026


🤖 Model version check

⚠ 6 item(s) to review (not blocking)

Prose mentions, specialised variants (*-codex, *-chat-latest), or GPT-5/o-series code changes (max_tokens → max_completion_tokens, drop temperature).

| Location | Found | Suggested | Why |
| --- | --- | --- | --- |
| python/llm/agents/couchbase_langgraph_agentic_rag.ipynb:426 | temperature | | GPT-5/o-series: remove temperature (unsupported on reasoning models) |
| python/llm/agents/couchbase_langgraph_agentic_rag.ipynb:478 | temperature | | GPT-5/o-series: remove temperature (unsupported on reasoning models) |
| python/llm/agents/couchbase_langgraph_agentic_rag.ipynb:513 | temperature | | GPT-5/o-series: remove temperature (unsupported on reasoning models) |
| python/llm/agents/couchbase_langgraph_agentic_rag.ipynb:539 | temperature | | GPT-5/o-series: remove temperature (unsupported on reasoning models) |
| python/llm/agents/couchbase_langgraph_agentic_rag.ipynb:740 | max_tokens | | GPT-5/o-series: rename max_tokens → max_completion_tokens |
| python/llm/tracing/crewai/crewai-tracing.ipynb:182 | temperature | | GPT-5/o-series: remove temperature (unsupported on reasoning models) |

See the check-models skill. Policy date: 2026-06-18. Add check-models:ignore to a line to skip it.
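
How a per-line opt-out like this could work is sketched below. This is an assumption-laden illustration, not the scanner's code: the real tool is scan-models.mjs and reads its policy from models.json, whereas the `OUTDATED` pattern and `scan_lines` function here are hypothetical and cover only two IDs that this PR treats as outdated.

```python
import re

# Hypothetical deny-list; the real scanner reads its policy from models.json.
OUTDATED = re.compile(r"\b(gpt-4o(?:-mini)?)\b")
IGNORE_MARKER = "check-models:ignore"


def scan_lines(path: str, lines: list[str]) -> list[tuple[str, int, str]]:
    """Return (path, line_number, model_id) for each outdated reference."""
    findings = []
    for number, text in enumerate(lines, start=1):
        if IGNORE_MARKER in text:
            continue  # the author opted this line out
        for match in OUTDATED.finditer(text):
            findings.append((path, number, match.group(1)))
    return findings
```

Line numbers are 1-based so that a finding can be reported in the same `path:line` form as the table above.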

ℹ️ Platform-wrapped IDs (Bedrock [region.]anthropic.claude-…, Databricks databricks-claude-…, OpenRouter/LiteLLM provider/model) are flagged on their embedded model name — bump the version but keep the platform's ID format (e.g. Bedrock 4.x needs a us./eu./apac. inference-profile prefix). See the skill's Platform-specific IDs section.
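
Flagging a wrapped ID on its embedded model name, then bumping only that name, could look roughly like this. The pattern and both helpers are hypothetical, and the sample IDs in the usage note are illustrative strings rather than confirmed platform identifiers.

```python
import re
from typing import Optional

# Matches the embedded Claude model name inside a platform-wrapped ID.
EMBEDDED_CLAUDE = re.compile(r"claude-[a-z0-9.\-]+")


def embedded_model(wrapped_id: str) -> Optional[str]:
    """Return the embedded 'claude-...' name, or None if there is none."""
    match = EMBEDDED_CLAUDE.search(wrapped_id)
    return match.group(0) if match else None


def bump_embedded(wrapped_id: str, new_name: str) -> str:
    """Swap only the embedded model name, leaving the platform prefix intact."""
    return EMBEDDED_CLAUDE.sub(lambda _: new_name, wrapped_id, count=1)
```

For example, `bump_embedded("us.anthropic.claude-sonnet-4-5", "claude-sonnet-4-6")` keeps the `us.anthropic.` prefix. Real Bedrock IDs also carry a date and version suffix, which this simple pattern would treat as part of the name, so a production version would need to handle that separately.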

jimbobbennett added a commit that referenced this pull request Jun 18, 2026
Align all six cookbooks with the size-matched models policy:
- Eval judges / determinism-critical code -> the non-reasoning gpt-4.1 tier, keeping
  temperature=0 (reasoning gpt-5 models reject temperature):
  - jailbreak + realtime-guardrails: assistant AND judges -> gpt-4.1-mini, temperature=0
    (these are measurement notebooks — ASR / coverage / FP must be deterministic).
  - trace-level judges -> gpt-4.1-mini (size-matched to the original gpt-4o-mini judge).
  - session-level judges -> gpt-4.1 (flagship, size-matched to the original Claude
    Sonnet judge — the mini missed the goal-not-met case).
  - custom-evaluator vision judge -> gpt-4.1, temperature=0 (size-matched to original).
- App/agent code stays on reasoning gpt-5.4-mini (no temperature).
- eval-harness comparison gpt-5.4 (off-allowlist) -> flagship gpt-5.5.
- Soften the jailbreak "ASR to zero" claim to "down sharply" (the labeled suite is small).

Re-validated live: session demo now catches goal-not-met; guardrails layered =
100% coverage / 0% FP on a clean project; gpt-5.5 + gpt-4.1 vision judge confirmed.
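
The split described above (deterministic judges on the non-reasoning tier, app code on the reasoning tier) can be summarised in a few lines. This is a sketch only: the model IDs come from this commit message, while the role names and `request_kwargs` helper are hypothetical.

```python
from typing import Any

# Model IDs are taken from the commit message; the role names are hypothetical.
JUDGE_MODEL = "gpt-4.1-mini"  # non-reasoning tier, accepts temperature
APP_MODEL = "gpt-5.4-mini"    # reasoning tier, rejects temperature


def request_kwargs(role: str) -> dict[str, Any]:
    """Build chat-completion kwargs for an eval judge or for app/agent code."""
    if role == "judge":
        return {"model": JUDGE_MODEL, "temperature": 0}
    if role == "app":
        return {"model": APP_MODEL}
    raise ValueError(f"unknown role: {role!r}")
```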

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add the check-models skill (.agents/skills/check-models) and a PR gate
(.github/workflows/check-models.yml, scanning python/ and typescript/),
matching the docs repo. The skill migrates outdated OpenAI/Anthropic
model references to the latest size-equivalent model and applies the
code changes each new generation needs (e.g. max_tokens ->
max_completion_tokens for GPT-5 raw SDK calls).

Run the skill across the tutorials: migrate outdated model IDs in
python/ and typescript/ notebooks and code to GPT-5.5 / GPT-5.4-mini|nano
and Claude Opus 4.8 / Sonnet 4.6 / Haiku 4.5, preserving size tiers.
Wrapper-library params (phoenix OpenAIModel, ChatOpenAI) are left intact;
only raw OpenAI SDK calls get the GPT-5 param changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jimbobbennett and others added 6 commits June 18, 2026 13:03
Revert all notebook/code/markdown model-reference edits back to main —
the migration will be re-done deliberately after review. Keep the
check-models skill and CI gate, updated to the latest version from the
docs repo: gpt-5.4 treated as a valid flagship, specialised (non-chat)
model classification, image alt-text skipping, the decoupled staleness
ceiling, and the platform-specific-ID guidance. Scan paths kept as
`python typescript` for this repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

For every docs cookbook that references a tutorials notebook, mirror the
model migration applied in the docs into the notebook, preserving .ipynb
serialization:
- reasoning-tier cookbooks -> gpt-5.5 / gpt-5.4-mini (+ max_completion_tokens
  rename on raw OpenAI calls in Azure red-teaming & building-a-custom-evaluator)
- temperature-using cookbooks -> gpt-4.1 / gpt-4.1-mini (temperature kept)
- Claude -> claude-sonnet-4-6 / claude-opus-4-8 / claude-haiku-4-5
- realtime example -> gpt-realtime-1.5 (n/a: the notebook pins no model literal)

session_evals.ipynb is a temperature=0.0 phoenix OpenAIModel judge, so it
goes to gpt-4.1-mini (not gpt-5.4-mini) — a reasoning model rejects
temperature. All 28 touched notebooks scan clean and remain valid JSON.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gpt-5.4-mini is a reasoning model and rejects temperature, so remove the
temperature invocation param (v1 and v2) and the raw-call temperature
kwarg. The v1/v2 optimization now varies the system prompt alone — the
actual subject of the tutorial — with identical invocation params.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

arize-phoenix-evals 3.x removed the legacy OpenAIModel / llm_classify /
HallucinationEvaluator / QAEvaluator / run_evals API the notebooks used,
so migrate all 13 affected notebooks to the current API
(LLM + create_classifier + evaluate_dataframe; FaithfulnessEvaluator /
CorrectnessEvaluator for the prebuilt cases), keeping each notebook's
current model. The new API sends no temperature, so gpt-5.5 works as an
eval model. Downstream label/explanation/score consumers adapted to the
new <name>_score dict output. All eval logic live-verified.
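
Adapting a downstream consumer to the `<name>_score` output might look like the sketch below. The exact shape of that cell is an assumption here: the code treats it as a dict (or a JSON string encoding one) with optional label, score, and explanation keys, and `unpack_score` is a hypothetical helper that works on a plain row dict rather than on the library's own types.

```python
import json
from typing import Any, Optional


def unpack_score(row: dict[str, Any], name: str) -> dict[str, Optional[Any]]:
    """Flatten a '<name>_score' cell into label / score / explanation fields.

    Assumes the cell is a dict (or a JSON string encoding one) with optional
    'label', 'score' and 'explanation' keys; missing keys become None.
    """
    cell = row.get(f"{name}_score")
    if isinstance(cell, str):
        cell = json.loads(cell)
    cell = cell or {}
    return {key: cell.get(key) for key in ("label", "score", "explanation")}
```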

Also:
- openai-tracing: fix gpt-5.4-mini regression (max_tokens=20 ->
  max_completion_tokens=256; reasoning models reject max_tokens).
- building-a-custom-evaluator: vision eval kept via the OpenAI client
  (the new classifier path strips image inputs); prose GPT-5.5 ->
  GPT-5.4-mini to match code.
- Stale-text fixes: email_text_extraction headings/labels, text2sql and
  session_level comments, quickstart-evals prose, summarization variable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Port the Gemini policy + scanner support from the docs repo: google block
in models.json (pro=gemini-2.5-pro, flash=gemini-3.5-flash,
flash-lite=gemini-3.1-flash-lite, with current/deprecated lists),
classifyGoogle in the scanner, and the SKILL.md + --refresh docs. Scan
paths kept as `python typescript` for this repo. No notebook changes
here — the 5 deprecated Gemini refs the scanner now surfaces are part of
the pending broader tutorials migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…sting

Sync the two-tier --diff gate from the docs repo (introduced vs pre-existing in
touched files). Scanner identical; SKILL.md + workflow keep `python typescript`
scan paths for this repo.
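
The two tiers can be pictured as a partition of scanner findings by whether the PR added the offending line. The sketch below is hypothetical (the real gate lives in scan-models.mjs and the workflow); `classify` and its input shapes are invented for illustration, and treating only the introduced tier as blocking is an assumption based on the PR description.

```python
from typing import Iterable

Finding = tuple[str, int, str]  # (path, line_number, model_id)


def classify(
    findings: Iterable[Finding],
    added_lines: dict[str, set[int]],
) -> tuple[list[Finding], list[Finding]]:
    """Split findings into introduced (on added lines) and pre-existing.

    added_lines maps each touched path to the line numbers the PR added.
    Findings in files the PR did not touch are dropped.
    """
    introduced, pre_existing = [], []
    for path, line, model in findings:
        if line in added_lines.get(path, set()):
            introduced.append((path, line, model))
        elif path in added_lines:
            pre_existing.append((path, line, model))
    return introduced, pre_existing
```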

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jimbobbennett added a commit that referenced this pull request Jun 22, 2026
…65)

* docs(tutorials): port trace-level + jailbreak cookbooks from Phoenix to Arize AX

Phase 1 of porting Nancy-Chauhan's Phoenix tutorials to Arize AX.

- trace-level-evals.ipynb: merge the Phoenix "level-up" into the existing AX fork
  (concept-first framing, tool_path/tool_io span reconstruction, three evaluators
  — relevance + decision-path + reasoning, controlled-failure demo).
- jailbreak_and_prompt_injection_defense.ipynb: new port (attack taxonomy,
  indirect injection via a poisoned KB, GUARDRAIL/CHAIN spans, ASR + block-rate).

Targets the latest arize 8.x SDK (ArizeClient().spans.export_to_df /
update_evaluations) and arize-phoenix-evals 3.x (LLM/ClassificationEvaluator);
model hoisted to a MODEL = "gpt-5.4-mini" constant. Validated end-to-end against
a live Arize AX space.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): port realtime-guardrails + eval-harness cookbooks (Phase 2)

- designing_realtime_guardrails.ipynb: new port — layered guardrail (PII redact,
  3-way injection filter, escalate-to-judge, output check), each check a GUARDRAIL
  span, with coverage / false-positive / latency read off the traces.
- email_text_extraction_experiments_arize.ipynb: merge-update into the "build your
  own eval harness" tutorial — hand-built domain dataset, structured-output task,
  two evaluators (jaro_winkler + field_accuracy) whose rankings can disagree, and a
  fair gpt-5.4-mini vs gpt-5.4 comparison.

Uses the latest arize 8.x SDK (ArizeClient().spans / .datasets / .experiments) and
arize-phoenix-evals 3.x; model = gpt-5.4-mini. Both validated end-to-end against a
live Arize AX space.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): level up session-level evaluation cookbook (Phase 3)

Merge-update of the AI-tutor session-evals notebook to the Phoenix level-up:
concept-first framing (why turn-by-turn misses emergent failures), a simulated
student so it runs top-to-bottom with no input(), a 4th evaluator (coherence
alongside goal_completion / frustration / correctness), clean transcript
reconstruction from llm.input/output_messages, and a controlled-failure demo
where each session-only failure is caught on its own dimension.

Switched to OpenAI gpt-5.4-mini (single MODEL constant), latest arize 8.x SDK
(ArizeClient().spans), and arize-phoenix-evals 3.x. Session evals are logged via
spans.update_evaluations with session_eval.* columns attached to one span per
session. Validated end-to-end against a live Arize AX space.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): migrate custom-evaluator cookbook to latest SDK (Phase 4)

Bring the existing custom-LLM-evaluator (benchmark-dataset) notebook onto current
package versions:

- Datasets/experiments: v7 ArizeExportClient/ArizeDatasetsClient/run_experiment ->
  arize 8.x ArizeClient().spans.export_to_df / .datasets.create / .experiments.run.
- The custom judge previously used the (now-removed) phoenix.evals multimodal
  ClassificationTemplate; rewrite it as a direct gpt-5.4-mini vision call, so the
  notebook no longer depends on phoenix.evals at all. Evaluator is now a plain
  function returning arize.experiments.EvaluationResult.
- Model gpt-4o-mini/gpt-4.1 -> gpt-5.4-mini (vision-capable); drop max_tokens.
- Fix leftover "explore directly in Phoenix" wording.

Dataset/experiment path validated live against Arize AX with synthetic annotations
(the in-notebook annotation step is a manual UI action).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): reconcile model choices with check-models policy (#64)

Align all six cookbooks with the size-matched models policy:
- Eval judges / determinism-critical code -> the non-reasoning gpt-4.1 tier, keeping
  temperature=0 (reasoning gpt-5 models reject temperature):
  - jailbreak + realtime-guardrails: assistant AND judges -> gpt-4.1-mini, temperature=0
    (these are measurement notebooks — ASR / coverage / FP must be deterministic).
  - trace-level judges -> gpt-4.1-mini (size-matched to the original gpt-4o-mini judge).
  - session-level judges -> gpt-4.1 (flagship, size-matched to the original Claude
    Sonnet judge — the mini missed the goal-not-met case).
  - custom-evaluator vision judge -> gpt-4.1, temperature=0 (size-matched to original).
- App/agent code stays on reasoning gpt-5.4-mini (no temperature).
- eval-harness comparison gpt-5.4 (off-allowlist) -> flagship gpt-5.5.
- Soften the jailbreak "ASR to zero" claim to "down sharply" (the labeled suite is small).

Re-validated live: session demo now catches goal-not-met; guardrails layered =
100% coverage / 0% FP on a clean project; gpt-5.5 + gpt-4.1 vision judge confirmed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): drop model-choice rationale comments from cookbooks

Remove the inline comments explaining model selection (reasoning vs non-reasoning,
size-matched, deterministic judge, etc.) added during the policy reconciliation.
Model assignments are unchanged; comment-only edit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): surface created dataset id in eval-harness output

Addresses review feedback on the upstream Phoenix PR (Arize-ai/phoenix#13718):
print the dataset name + id after creation so it's easy to find in the Datasets tab.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): use 'Arize AX' for product references; fix session demo prose

- Naming: refer to the product as 'Arize AX' (not bare 'Arize') in tutorial prose
  across all six cookbooks; leave the company name, SDK identifiers (ArizeClient),
  repo/URLs, and credential labels untouched.
- Session demo: the incoherent case reliably trips only the correctness judge (not
  frustration) — align the narrative to the observed behaviour.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): normalize unicode encoding in custom-evaluator notebook

Encoding-only change (literal em-dash/emoji vs \u escapes); no content change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>