chore: update model references to latest + add check-models skill & CI by jimbobbennett · Pull Request #64 · Arize-ai/tutorials

jimbobbennett · 2026-06-17T22:09:37Z

What

Adds the check-models skill and CI gate (matching Arize-ai/docs#616), and runs it across the tutorials to bring every OpenAI/Anthropic model reference up to date.

New tooling

.agents/skills/check-models/ — finds outdated model references and migrates them to the latest size-equivalent model (a *-mini → latest mini, never the flagship), plus the code changes each new generation needs (e.g. max_tokens → max_completion_tokens for GPT-5 raw SDK calls). scripts/models.json is the single source of truth; scan-models.mjs --refresh prints the lookup checklist when the list looks stale.
.github/workflows/check-models.yml — PR gate that scans only the changed lines under python/ and typescript/, comments findings, and fails on newly-introduced outdated model IDs.
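
The size-equivalent rule described above (a *-mini maps to the latest mini, never the flagship) can be sketched in a few lines of Python. This is an illustration only: the real skill reads scripts/models.json, whose schema is not shown in this PR, so the `LATEST_OPENAI` table and both function names below are hypothetical. The target IDs are the ones listed under Content migration.

```python
import re

# Hypothetical tier table. The real source of truth is scripts/models.json,
# whose schema is not shown in this PR.
LATEST_OPENAI = {
    "flagship": "gpt-5.5",
    "mini": "gpt-5.4-mini",
    "nano": "gpt-5.4-nano",
}


def size_tier(model_id: str) -> str:
    """Classify an OpenAI model ID as mini, nano, or flagship."""
    # Match "-mini" / "-nano" as a whole segment so dated snapshot IDs
    # such as "gpt-4o-mini-2024-07-18" keep their tier.
    match = re.search(r"-(mini|nano)(?:-|$)", model_id)
    return match.group(1) if match else "flagship"


def latest_size_equivalent(model_id: str) -> str:
    """Return the latest model in the same size tier as model_id."""
    return LATEST_OPENAI[size_tier(model_id)]


for old in ("gpt-4o-mini", "gpt-4.1-nano", "gpt-4o"):
    print(old, "->", latest_size_equivalent(old))
```

Treating anything without a mini or nano segment as flagship is a simplification; it does not account for specialised variants such as *-codex or *-chat-latest, which the bot comment lists separately.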

Content migration

Migrated outdated model IDs across python/ and typescript/ (mostly notebooks):

OpenAI → gpt-5.5 / gpt-5.4-mini / gpt-5.4-nano (size-tier preserved)
Anthropic → claude-opus-4-8 / claude-sonnet-4-6 / claude-haiku-4-5
GPT-5 param changes applied to raw OpenAI SDK calls only — wrapper libraries (phoenix.evals.OpenAIModel, langchain_openai.ChatOpenAI, litellm) keep their own max_tokens/temperature kwargs.
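
As a rough sketch of the GPT-5 param changes for raw OpenAI SDK calls, the helper below rewrites a kwargs dict: it renames max_tokens to max_completion_tokens and drops temperature. The function name `migrate_raw_openai_kwargs` is hypothetical and not part of the skill; per the note above, it would not be applied to wrapper-library kwargs.

```python
from typing import Any


def migrate_raw_openai_kwargs(kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of raw-SDK call kwargs adjusted for GPT-5 models."""
    migrated = dict(kwargs)  # never mutate the caller's dict
    if "max_tokens" in migrated:
        # Rename the limit. If max_completion_tokens is already set,
        # setdefault keeps that explicit value and max_tokens is discarded.
        migrated.setdefault("max_completion_tokens", migrated.pop("max_tokens"))
    # Reasoning models reject temperature, so remove it if present.
    migrated.pop("temperature", None)
    return migrated


old_kwargs = {"model": "gpt-5.4-mini", "max_tokens": 256, "temperature": 0}
print(migrate_raw_openai_kwargs(old_kwargs))
```

The result can be splatted into a raw call (`client.chat.completions.create(**new_kwargs, messages=...)`), assuming the caller also supplies the messages.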

Every notebook source change is model-related (verified); notebooks re-serialized to nbformat indent=1 to keep diffs minimal. Comparative/historical prose and base64 image data are left untouched.

Verify

node .agents/skills/check-models/scripts/scan-models.mjs python typescript → 0 errors.

🤖 Generated with Claude Code

review-notebook-app · 2026-06-17T22:09:44Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

github-actions · 2026-06-18T01:17:47Z

🤖 Model version check

⚠ 6 item(s) to review (not blocking)

Prose mentions, specialised variants (*-codex, *-chat-latest), or GPT-5/o-series code changes (max_tokens → max_completion_tokens, drop temperature).

Location	Found	Suggested	Why
`python/llm/agents/couchbase_langgraph_agentic_rag.ipynb`:426	`temperature`	—	GPT-5/o-series: remove temperature (unsupported on reasoning models)
`python/llm/agents/couchbase_langgraph_agentic_rag.ipynb`:478	`temperature`	—	GPT-5/o-series: remove temperature (unsupported on reasoning models)
`python/llm/agents/couchbase_langgraph_agentic_rag.ipynb`:513	`temperature`	—	GPT-5/o-series: remove temperature (unsupported on reasoning models)
`python/llm/agents/couchbase_langgraph_agentic_rag.ipynb`:539	`temperature`	—	GPT-5/o-series: remove temperature (unsupported on reasoning models)
`python/llm/agents/couchbase_langgraph_agentic_rag.ipynb`:740	`max_tokens`	—	GPT-5/o-series: rename max_tokens → max_completion_tokens
`python/llm/tracing/crewai/crewai-tracing.ipynb`:182	`temperature`	—	GPT-5/o-series: remove temperature (unsupported on reasoning models)

See the check-models skill. Policy date: 2026-06-18. Add check-models:ignore to a line to skip it.
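
To illustrate how the review items and the check-models:ignore marker interact, here is a simplified Python sketch. The real scanner is scan-models.mjs and its logic is not shown in this PR; `review_items` and `FLAGGED_PARAMS` are hypothetical names, and this version flags the two parameters on any line rather than only in GPT-5/o-series contexts.

```python
import re

IGNORE_MARKER = "check-models:ignore"

# Parameter names from the bot comment, with the reason text it reports.
FLAGGED_PARAMS = {
    "temperature": "GPT-5/o-series: remove temperature (unsupported on reasoning models)",
    "max_tokens": "GPT-5/o-series: rename max_tokens → max_completion_tokens",
}


def review_items(lines):
    """Return (line_number, found, why) for each flagged parameter."""
    items = []
    for lineno, line in enumerate(lines, start=1):
        if IGNORE_MARKER in line:
            continue  # the author opted this line out of the check
        for param, why in FLAGGED_PARAMS.items():
            # Word boundaries keep max_completion_tokens from matching.
            if re.search(rf"\b{param}\b", line):
                items.append((lineno, param, why))
    return items


sample = [
    "resp = client.chat.completions.create(model=MODEL, temperature=0)",
    "resp = client.chat.completions.create(max_tokens=20)  # check-models:ignore",
    "resp = client.chat.completions.create(max_completion_tokens=256)",
]
for lineno, found, why in review_items(sample):
    print(f"line {lineno}: {found}: {why}")
```

Only the first sample line is reported: the second carries the ignore marker and the third already uses the renamed parameter.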

ℹ️ Platform-wrapped IDs (Bedrock [region.]anthropic.claude-…, Databricks databricks-claude-…, OpenRouter/LiteLLM provider/model) are flagged on their embedded model name — bump the version but keep the platform's ID format (e.g. Bedrock 4.x needs a us./eu./apac. inference-profile prefix). See the skill's Platform-specific IDs section.

Align all six cookbooks with the size-matched models policy:

- Eval judges / determinism-critical code -> the non-reasoning gpt-4.1 tier, keeping temperature=0 (reasoning gpt-5 models reject temperature):
  - jailbreak + realtime-guardrails: assistant AND judges -> gpt-4.1-mini, temperature=0 (these are measurement notebooks; ASR / coverage / FP must be deterministic).
  - trace-level judges -> gpt-4.1-mini (size-matched to the original gpt-4o-mini judge).
  - session-level judges -> gpt-4.1 (flagship, size-matched to the original Claude Sonnet judge; the mini missed the goal-not-met case).
  - custom-evaluator vision judge -> gpt-4.1, temperature=0 (size-matched to original).
- App/agent code stays on reasoning gpt-5.4-mini (no temperature).
- eval-harness comparison gpt-5.4 (off-allowlist) -> flagship gpt-5.5.
- Soften the jailbreak "ASR to zero" claim to "down sharply" (the labeled suite is small).

Re-validated live: session demo now catches goal-not-met; guardrails layered = 100% coverage / 0% FP on a clean project; gpt-5.5 + gpt-4.1 vision judge confirmed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add the check-models skill (.agents/skills/check-models) and a PR gate (.github/workflows/check-models.yml, scanning python/ and typescript/), matching the docs repo. The skill migrates outdated OpenAI/Anthropic model references to the latest size-equivalent model and applies the code changes each new generation needs (e.g. max_tokens -> max_completion_tokens for GPT-5 raw SDK calls).

Run the skill across the tutorials: migrate outdated model IDs in python/ and typescript/ notebooks and code to GPT-5.5 / GPT-5.4-mini|nano and Claude Opus 4.8 / Sonnet 4.6 / Haiku 4.5, preserving size tiers. Wrapper-library params (phoenix OpenAIModel, ChatOpenAI) are left intact; only raw OpenAI SDK calls get the GPT-5 param changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Revert all notebook/code/markdown model-reference edits back to main; the migration will be re-done deliberately after review.

Keep the check-models skill and CI gate, updated to the latest version from the docs repo: gpt-5.4 treated as a valid flagship, specialised (non-chat) model classification, image alt-text skipping, the decoupled staleness ceiling, and the platform-specific-ID guidance. Scan paths kept as `python typescript` for this repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

For every docs cookbook that references a tutorials notebook, mirror the model migration applied in the docs into the notebook, preserving .ipynb serialization:

- reasoning-tier cookbooks -> gpt-5.5 / gpt-5.4-mini (+ max_completion_tokens rename on raw OpenAI calls in Azure red-teaming & building-a-custom-evaluator)
- temperature-using cookbooks -> gpt-4.1 / gpt-4.1-mini (temperature kept)
- Claude -> claude-sonnet-4-6 / claude-opus-4-8 / claude-haiku-4-5
- realtime example -> gpt-realtime-1.5 (n/a: the notebook pins no model literal)

session_evals.ipynb is a temperature=0.0 phoenix OpenAIModel judge, so it goes to gpt-4.1-mini (not gpt-5.4-mini), because a reasoning model rejects temperature. All 28 touched notebooks scan clean and remain valid JSON.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gpt-5.4-mini is a reasoning model and rejects temperature, so remove the temperature invocation param (v1 and v2) and the raw-call temperature kwarg. The v1/v2 optimization now varies the system prompt alone (the actual subject of the tutorial) with identical invocation params.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

arize-phoenix-evals 3.x removed the legacy OpenAIModel / llm_classify / HallucinationEvaluator / QAEvaluator / run_evals API the notebooks used, so migrate all 13 affected notebooks to the current API (LLM + create_classifier + evaluate_dataframe; FaithfulnessEvaluator / CorrectnessEvaluator for the prebuilt cases), keeping each notebook's current model. The new API sends no temperature, so gpt-5.5 works as an eval model. Downstream label/explanation/score consumers adapted to the new <name>_score dict output. All eval logic live-verified.

Also:

- openai-tracing: fix gpt-5.4-mini regression (max_tokens=20 -> max_completion_tokens=256; reasoning models reject max_tokens).
- building-a-custom-evaluator: vision eval kept via the OpenAI client (the new classifier path strips image inputs); prose GPT-5.5 -> GPT-5.4-mini to match code.
- Stale-text fixes: email_text_extraction headings/labels, text2sql and session_level comments, quickstart-evals prose, summarization variable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Port the Gemini policy + scanner support from the docs repo: google block in models.json (pro=gemini-2.5-pro, flash=gemini-3.5-flash, flash-lite=gemini-3.1-flash-lite, with current/deprecated lists), classifyGoogle in the scanner, and the SKILL.md + --refresh docs. Scan paths kept as `python typescript` for this repo.

No notebook changes here; the 5 deprecated Gemini refs the scanner now surfaces are part of the pending broader tutorials migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…sting

Sync the two-tier --diff gate from the docs repo (introduced vs pre-existing in touched files). Scanner identical; SKILL.md + workflow keep `python typescript` scan paths for this repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…65)

* docs(tutorials): port trace-level + jailbreak cookbooks from Phoenix to Arize AX

Phase 1 of porting Nancy-Chauhan's Phoenix tutorials to Arize AX.

- trace-level-evals.ipynb: merge the Phoenix "level-up" into the existing AX fork (concept-first framing, tool_path/tool_io span reconstruction, three evaluators: relevance + decision-path + reasoning, controlled-failure demo).
- jailbreak_and_prompt_injection_defense.ipynb: new port (attack taxonomy, indirect injection via a poisoned KB, GUARDRAIL/CHAIN spans, ASR + block-rate).

Targets the latest arize 8.x SDK (ArizeClient().spans.export_to_df / update_evaluations) and arize-phoenix-evals 3.x (LLM/ClassificationEvaluator); model hoisted to a MODEL = "gpt-5.4-mini" constant. Validated end-to-end against a live Arize AX space.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): port realtime-guardrails + eval-harness cookbooks (Phase 2)

- designing_realtime_guardrails.ipynb: new port; layered guardrail (PII redact, 3-way injection filter, escalate-to-judge, output check), each check a GUARDRAIL span, with coverage / false-positive / latency read off the traces.
- email_text_extraction_experiments_arize.ipynb: merge-update into the "build your own eval harness" tutorial: hand-built domain dataset, structured-output task, two evaluators (jaro_winkler + field_accuracy) whose rankings can disagree, and a fair gpt-5.4-mini vs gpt-5.4 comparison.

Uses the latest arize 8.x SDK (ArizeClient().spans / .datasets / .experiments) and arize-phoenix-evals 3.x; model = gpt-5.4-mini. Both validated end-to-end against a live Arize AX space.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): level up session-level evaluation cookbook (Phase 3)

Merge-update of the AI-tutor session-evals notebook to the Phoenix level-up: concept-first framing (why turn-by-turn misses emergent failures), a simulated student so it runs top-to-bottom with no input(), a 4th evaluator (coherence alongside goal_completion / frustration / correctness), clean transcript reconstruction from llm.input/output_messages, and a controlled-failure demo where each session-only failure is caught on its own dimension.

Switched to OpenAI gpt-5.4-mini (single MODEL constant), latest arize 8.x SDK (ArizeClient().spans), and arize-phoenix-evals 3.x. Session evals are logged via spans.update_evaluations with session_eval.* columns attached to one span per session. Validated end-to-end against a live Arize AX space.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): migrate custom-evaluator cookbook to latest SDK (Phase 4)

Bring the existing custom-LLM-evaluator (benchmark-dataset) notebook onto current package versions:

- Datasets/experiments: v7 ArizeExportClient/ArizeDatasetsClient/run_experiment -> arize 8.x ArizeClient().spans.export_to_df / .datasets.create / .experiments.run.
- The custom judge previously used the (now-removed) phoenix.evals multimodal ClassificationTemplate; rewrite it as a direct gpt-5.4-mini vision call, so the notebook no longer depends on phoenix.evals at all. Evaluator is now a plain function returning arize.experiments.EvaluationResult.
- Model gpt-4o-mini/gpt-4.1 -> gpt-5.4-mini (vision-capable); drop max_tokens.
- Fix leftover "explore directly in Phoenix" wording.

Dataset/experiment path validated live against Arize AX with synthetic annotations (the in-notebook annotation step is a manual UI action).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): reconcile model choices with check-models policy (#64)

Align all six cookbooks with the size-matched models policy:

- Eval judges / determinism-critical code -> the non-reasoning gpt-4.1 tier, keeping temperature=0 (reasoning gpt-5 models reject temperature):
  - jailbreak + realtime-guardrails: assistant AND judges -> gpt-4.1-mini, temperature=0 (these are measurement notebooks; ASR / coverage / FP must be deterministic).
  - trace-level judges -> gpt-4.1-mini (size-matched to the original gpt-4o-mini judge).
  - session-level judges -> gpt-4.1 (flagship, size-matched to the original Claude Sonnet judge; the mini missed the goal-not-met case).
  - custom-evaluator vision judge -> gpt-4.1, temperature=0 (size-matched to original).
- App/agent code stays on reasoning gpt-5.4-mini (no temperature).
- eval-harness comparison gpt-5.4 (off-allowlist) -> flagship gpt-5.5.
- Soften the jailbreak "ASR to zero" claim to "down sharply" (the labeled suite is small).

Re-validated live: session demo now catches goal-not-met; guardrails layered = 100% coverage / 0% FP on a clean project; gpt-5.5 + gpt-4.1 vision judge confirmed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): drop model-choice rationale comments from cookbooks

Remove the inline comments explaining model selection (reasoning vs non-reasoning, size-matched, deterministic judge, etc.) added during the policy reconciliation. Model assignments are unchanged; comment-only edit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): surface created dataset id in eval-harness output

Addresses review feedback on the upstream Phoenix PR (Arize-ai/phoenix#13718): print the dataset name + id after creation so it's easy to find in the Datasets tab.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): use 'Arize AX' for product references; fix session demo prose

- Naming: refer to the product as 'Arize AX' (not bare 'Arize') in tutorial prose across all six cookbooks; leave the company name, SDK identifiers (ArizeClient), repo/URLs, and credential labels untouched.
- Session demo: the incoherent case reliably trips only the correctness judge (not frustration); align the narrative to the observed behaviour.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(tutorials): normalize unicode encoding in custom-evaluator notebook

Encoding-only change (literal em-dash/emoji vs \u escapes); no content change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jimbobbennett force-pushed the chore/check-models branch 6 times, most recently from 1b569b3 to 59a23f8 on June 18, 2026 01:17

jimbobbennett force-pushed the chore/check-models branch from 59a23f8 to 286d970 on June 18, 2026 16:38

jimbobbennett and others added 6 commits June 18, 2026 13:03

Merge remote-tracking branch 'origin/main' into chore/check-models

2764303

jimbobbennett mentioned this pull request Jun 18, 2026

docs(tutorials): port six Phoenix cookbooks to Arize AX (latest SDK) #65

Merged

jimbobbennett closed this Jun 22, 2026

jimbobbennett wants to merge 8 commits into main from chore/check-models

1 participant
