feat: migrate annotation pipeline from openadapt-ml to openadapt-evals #64
Merged
Move annotation data classes, prompts, and utilities into openadapt_evals.annotation and consolidate three separate VLM call implementations into a shared openadapt_evals.vlm module.

- New openadapt_evals/vlm.py: unified vlm_call() supporting consilium council, OpenAI, and Anthropic; extract_json() for LLM output parsing; image_bytes_from_path() helper
- New openadapt_evals/annotation.py: AnnotatedStep/AnnotatedDemo data classes, ANNOTATION_SYSTEM_PROMPT/ANNOTATION_STEP_PROMPT constants, parse_annotation_response(), validate_annotations(), format_annotated_demo()
- Updated scripts/record_waa_demos.py cmd_annotate_waa() to import from openadapt_evals instead of openadapt_ml
- Updated scripts/refine_demo.py to use shared vlm_call/extract_json, refactored message builders to prompt+images interface
- Updated scripts/convert_recording_to_demo.py to use shared vlm_call
- 16 new tests in tests/test_annotation.py, all existing tests pass

Closes #59

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ding_to_demo

- Remove unused `import os` from openadapt_evals/vlm.py
- Move `resolved_model` computation before the for-loop in convert_vlm() so it is computed once instead of redundantly inside each step's try block

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- vlm.py: add timeout=120s to OpenAI/Anthropic SDK clients to prevent indefinite hangs (old code had explicit timeouts via requests)
- vlm.py: pass system prompt separately to consilium council_query() instead of concatenating into user prompt
- refine_demo.py: explicitly pass temperature=1.0 to vlm_call() in holistic and per-step review to match old behavior (vlm_call defaults to 0.1, which would be an unintended behavioral change)
- refine_demo.py: remove dead api_key parameter from run_holistic_review, run_per_step_review, refine_recording, and main(), since vlm_call() reads API keys from environment via the SDK

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
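The temperature note in the commit above can be illustrated with a small stand-in. This is only a sketch: `fake_vlm_call` and `run_holistic_review_sketch` are hypothetical names, and the parameter list is an assumption based on the "prompt+images interface" and the `temperature` argument mentioned in the commit messages, not the actual signature in `openadapt_evals/vlm.py`.

```python
from typing import Any, Dict, List, Optional

# Default stated in the commit message; everything else here is assumed.
DEFAULT_TEMPERATURE = 0.1


def fake_vlm_call(
    prompt: str,
    images: Optional[List[bytes]] = None,
    temperature: float = DEFAULT_TEMPERATURE,
) -> Dict[str, Any]:
    """Hypothetical stand-in that returns the request it would send."""
    return {
        "prompt": prompt,
        "n_images": len(images or []),
        "temperature": temperature,
    }


def run_holistic_review_sketch(prompt: str, images: List[bytes]) -> Dict[str, Any]:
    """Pass temperature explicitly so the shared default does not apply."""
    return fake_vlm_call(prompt, images=images, temperature=1.0)
```

A caller that omits `temperature` would silently pick up 0.1, which is why the review functions pin it to 1.0 after moving to the shared helper.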
Summary
- `openadapt_evals/vlm.py`: Shared VLM call module consolidating 3 separate implementations into one `vlm_call()` function supporting consilium council, OpenAI, and Anthropic providers. Also includes `extract_json()` for robust LLM output parsing and `image_bytes_from_path()` helper.
- `openadapt_evals/annotation.py`: Migrated annotation data classes (`AnnotatedStep`, `AnnotatedDemo`), prompts (`ANNOTATION_SYSTEM_PROMPT`, `ANNOTATION_STEP_PROMPT`), and utilities (`parse_annotation_response`, `validate_annotations`, `format_annotated_demo`) from `openadapt_ml.experiments.demo_prompt.annotate`.
- `scripts/record_waa_demos.py`: `cmd_annotate_waa()` now imports from `openadapt_evals` instead of `openadapt_ml`, removing PIL and provider abstraction dependencies.
- `scripts/refine_demo.py`: Replaced local `_vlm_call()`, `_extract_json()`, `_encode_png_b64()`, `_image_content_block()` with shared module imports. Refactored message builders to return `(prompt, images)` tuples.
- `scripts/convert_recording_to_demo.py`: Replaced local `_vlm_call()`, `_vlm_call_openai()`, `_vlm_call_anthropic()`, `_encode_image()` with shared module imports.
- `tests/test_annotation.py`: 16 new tests covering data classes, JSON roundtrip, parsing, formatting, and validation.

Closes #59
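For readers unfamiliar with what "robust LLM output parsing" usually involves, here is a minimal sketch of one common approach: try the raw text, then any fenced block, then scan for the first decodable object or array. The function name `extract_json_sketch` and its behavior are assumptions for illustration; the real `extract_json()` in `openadapt_evals/vlm.py` may work differently.

```python
import json
import re
from typing import Any, Optional

# Build the fence marker programmatically to keep this example easy to embed.
FENCE = "`" * 3
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_sketch(text: str) -> Optional[Any]:
    """Return the first JSON value found in model output, or None."""
    # 1. The whole response, then the contents of any fenced blocks.
    candidates = [text.strip()]
    candidates.extend(match.strip() for match in _FENCE_RE.findall(text))
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue

    # 2. Fall back to the first position where an object or array decodes.
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(text, index)
                return value
            except json.JSONDecodeError:
                continue
    return None
```

Returning `None` rather than raising lets each caller decide whether a missing payload is a retry, a skip, or an error.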
Test plan
- `uv run pytest tests/test_annotation.py -v`: 16 new tests pass
- `uv run pytest tests/test_vlm_call.py -v`: 10 existing tests pass
- `uv run pytest tests/ -v`: 569/576 pass (7 pre-existing failures unrelated to this change)
- `from openadapt_evals.annotation import AnnotatedDemo` works
- `from openadapt_evals.vlm import vlm_call` works
- `uv run python scripts/convert_recording_to_demo.py --recordings waa_recordings --output /tmp/test_demos --mode text` (text mode, no VLM needed)

🤖 Generated with Claude Code
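As a rough illustration of the JSON roundtrip coverage in the test plan, a test of that kind might look like the sketch below. `StepSketch`, `DemoSketch`, and all of their fields are hypothetical stand-ins; the actual `AnnotatedStep` and `AnnotatedDemo` fields and method names are not shown in this PR description.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class StepSketch:
    # Hypothetical fields, not the real AnnotatedStep schema.
    index: int
    action: str
    observation: str = ""


@dataclass
class DemoSketch:
    # Hypothetical fields, not the real AnnotatedDemo schema.
    task_id: str
    steps: List[StepSketch] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the demo, including nested steps, to a JSON string."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "DemoSketch":
        """Rebuild the demo and its nested steps from a JSON string."""
        data = json.loads(raw)
        steps = [StepSketch(**step) for step in data.get("steps", [])]
        return cls(task_id=data["task_id"], steps=steps)


def test_roundtrip() -> None:
    demo = DemoSketch(
        task_id="example_task",
        steps=[StepSketch(0, "click Start"), StepSketch(1, "type text", "menu open")],
    )
    # Dataclass equality compares field by field, including nested steps.
    assert DemoSketch.from_json(demo.to_json()) == demo
```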