Conversation
Add comprehensive unified baseline adapters supporting Claude, GPT, and Gemini models across multiple evaluation tracks.

Provider Abstraction (models/providers/):
- BaseAPIProvider ABC with a common interface for all providers
- AnthropicProvider: Base64 PNG encoding, Messages API
- OpenAIProvider: Data URL format, Chat Completions API
- GoogleProvider: Native PIL Image support, GenerateContent API
- Factory functions: get_provider(), resolve_model_alias()
- Error hierarchy: ProviderError, AuthenticationError, RateLimitError

Baseline Module (baselines/):
- TrackType enum: TRACK_A (coords), TRACK_B (ReAct), TRACK_C (SoM)
- TrackConfig dataclass with factory methods for each track
- BaselineConfig with model alias resolution and registry
- PromptBuilder for track-specific system prompts and user content
- UnifiedResponseParser supporting JSON, function-call, and PyAutoGUI formats
- ElementRegistry for element_id-to-coordinate conversion

Benchmark Integration:
- UnifiedBaselineAgent wrapping UnifiedBaselineAdapter for benchmarks
- Converts BenchmarkObservation -> adapter format -> BenchmarkAction
- Supports all three tracks via --track flag

CLI Commands (baselines/cli.py):
- run: Single-model prediction with track selection
- compare: Multi-model comparison on the same task
- list-models: Show available models and providers

All 92 tests pass. Ready for model comparison experiments.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
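The ABC-plus-factory pattern described above can be sketched as follows. This is a minimal illustration, not the real implementation: the alias table, provider internals, and `complete()` signature are assumptions; only the names `BaseAPIProvider`, `AnthropicProvider`, `ProviderError`, `get_provider()`, and `resolve_model_alias()` come from the commit message.

```python
from abc import ABC, abstractmethod


class ProviderError(Exception):
    """Base class for provider failures."""


class BaseAPIProvider(ABC):
    """Common interface every vendor-specific provider implements."""

    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Send a prompt to the backing API and return the text response."""


class AnthropicProvider(BaseAPIProvider):
    def complete(self, prompt: str) -> str:
        # The real provider would call the Messages API with
        # base64-encoded PNG screenshots; stubbed here.
        return f"[anthropic] {prompt}"


# Hypothetical alias table; the real registry lives in BaselineConfig.
_ALIASES = {"claude": "claude-sonnet-4-5", "gpt": "gpt-4o"}
_PROVIDERS = {"claude-sonnet-4-5": AnthropicProvider}


def resolve_model_alias(name: str) -> str:
    return _ALIASES.get(name, name)


def get_provider(model: str) -> BaseAPIProvider:
    model = resolve_model_alias(model)
    try:
        return _PROVIDERS[model]()
    except KeyError:
        raise ProviderError(f"no provider registered for {model!r}")
```

The factory keeps callers independent of vendor SDKs: benchmark code asks for a model name and receives whichever provider handles it.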
…atibility

All dependencies (torch, transformers, pillow, peft, etc.) support Python 3.10+. The 3.12 requirement was unnecessarily restrictive and broke `pip install openadapt[all]` on Python 3.10 and 3.11.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add CI workflow that runs on pull requests and main branch pushes:
- Tests on Python 3.10 and 3.11
- Runs on Ubuntu and macOS
- Uses uv for dependency management
- Runs ruff linter and formatter
- Runs pytest suite

Matches the pattern used by openadapt-viewer and follows OpenAdapt ecosystem conventions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- cluster_id: default=0
- cluster_centroid_distance: default=0.0
- internal_similarity: default=1.0

Fixes 1/14 test failures in test_segmentation.py
Phase 1 of viewer consolidation plan.

Foundation Changes:
- Add openadapt-viewer as a local file dependency in pyproject.toml
- Create openadapt_ml/training/viewer_components.py adapter module:
  * screenshot_with_predictions() - Screenshot with human/AI overlays
  * training_metrics() - Training stats metrics grid
  * playback_controls() - Playback UI controls
  * correctness_badge() - Pass/fail badge component
  * generate_comparison_summary() - Model comparison summary
- Add tests/test_viewer_screenshots.py with component validation tests
- Add openadapt_ml/training/viewer_migration_example.py validation example

Design:
- Zero breaking changes to existing viewer.py code
- Adapter pattern wraps openadapt-viewer with ML-specific context
- Functions accept openadapt-ml data structures
- Can be incrementally adopted in future phases

Next steps (Phase 2):
- Gradually migrate viewer.py to use these adapters
- Replace inline HTML generation with component calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed 158 linting errors in openadapt_ml/benchmarks/cli.py:
- F541: Removed extraneous f-string prefixes (150 instances auto-fixed)
- E402: Moved warnings import to top of file with other imports
- F841: Removed unused variables (qemu_commands, run_name, all_ready, server_process)
- E741: Renamed ambiguous variable 'l' to 'line'
- F821: Added missing time import to cmd_vm function

Also updated README.md with documentation about openadapt-evals integration.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed across all modules:
- E402: Moved imports to top (benchmarks/__init__.py)
- E741: Renamed ambiguous variable 'l' to 'loss_entry' (trainer.py, lambda_labs.py)
- E722: Replaced bare except with specific exceptions (lambda_labs.py)
- F401: Added noqa comments for re-exported imports (ingest/)
- F811: Renamed shadowing variable in config.py
- F821: Added Episode import to TYPE_CHECKING block (grounding.py)
- F541: Removed extraneous f-string prefixes (auto-fixed 95 instances)
- F841: Removed unused variables (auto-fixed 20 instances)

All modules now pass ruff check without errors.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Applied ruff format to ensure consistent code style across all modules. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Restored and enhanced the workflow segmentation system from commit dd9a393
with new integration for openadapt-capture format.
## What's Added
### Core Segmentation Pipeline (4 stages):
1. **Stage 1 - Frame Description (VLM)**:
   - Converts screenshots + actions into semantic descriptions
   - Supports Gemini, Claude, GPT-4o backends
   - Automatic caching for efficiency
   - File: openadapt_ml/segmentation/frame_describer.py
2. **Stage 2 - Episode Extraction (LLM)**:
   - Identifies coherent workflow boundaries
   - Few-shot prompting for better quality
   - Confidence-based filtering
   - File: openadapt_ml/segmentation/segment_extractor.py
3. **Stage 3 - Deduplication (Embeddings)**:
   - Finds similar workflows across recordings
   - Agglomerative clustering with cosine similarity
   - Supports OpenAI or local HuggingFace embeddings
   - File: openadapt_ml/segmentation/deduplicator.py
4. **Stage 4 - Annotation (VLM Quality Control)**:
   - Auto-annotates episodes for training data quality
   - Detects failures, boundary issues, incompleteness
   - Human-in-the-loop review workflow
   - File: openadapt_ml/segmentation/annotator.py
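Stage 3's similarity-based deduplication can be illustrated with a dependency-free sketch: greedy threshold clustering over embedding vectors. This is a simplified stand-in for the module's agglomerative clustering; the function names and embedding format here are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def cluster_episodes(embeddings, threshold=0.85):
    """Greedy clustering: assign each embedding to the first cluster whose
    representative is within the similarity threshold, else start a new one."""
    clusters = []  # list of (representative_embedding, member_indices)
    for idx, emb in enumerate(embeddings):
        for rep, members in clusters:
            if cosine_similarity(rep, emb) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((emb, [idx]))
    return [members for _, members in clusters]
```

Each resulting cluster would map to one CanonicalEpisode; the real deduplicator uses proper agglomerative linkage rather than this first-fit pass.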
### Integration Features:
- **CaptureAdapter**: Loads recordings from openadapt-capture SQLite format
- File: openadapt_ml/segmentation/adapters/capture_adapter.py
- Automatically used when capture.db is detected
- Converts events to segmentation format
- **Unified Pipeline**: Run all stages with single API
- File: openadapt_ml/segmentation/pipeline.py
- Automatic intermediate result caching
- Resume support for interrupted runs
- **CLI Interface**: Full command-line interface for all stages
- File: openadapt_ml/segmentation/cli.py
- Commands: describe, extract, deduplicate, annotate, review, export-gold
- **Comprehensive Documentation**:
- File: openadapt_ml/segmentation/README.md
- 20+ code examples
- Complete API reference
- Integration guide
- Cost estimates and performance benchmarks
## Use Cases
1. **Training Data Curation**: Extract and filter high-quality demonstration episodes
2. **Demo Retrieval**: Build searchable libraries for demo-conditioned prompting
3. **Workflow Documentation**: Auto-generate step-by-step guides from recordings
## Data Schemas
All schemas use Pydantic for type safety (openadapt_ml/segmentation/schemas.py):
- ActionTranscript: Frame-by-frame semantic descriptions
- Episode: Coherent workflow segment with boundaries
- CanonicalEpisode: Deduplicated workflow definition
- EpisodeAnnotation: Quality assessment for training data
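A rough sketch of what the Episode schema might look like, written as a plain dataclass stand-in (the real schemas use Pydantic; all field names beyond those documented above are assumptions):

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """A coherent workflow segment with frame boundaries.

    Dataclass stand-in for the Pydantic model in schemas.py; field names
    beyond the documented ones are hypothetical.
    """

    episode_id: str
    start_frame: int
    end_frame: int
    description: str = ""
    confidence: float = 1.0

    def __post_init__(self):
        # Validation that Pydantic would express as field constraints.
        if self.end_frame < self.start_frame:
            raise ValueError("end_frame must be >= start_frame")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
```

The confidence field supports the Stage 2 confidence-based filtering described above.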
## Example Usage
```python
from openadapt_ml.segmentation import SegmentationPipeline, PipelineConfig

config = PipelineConfig(
    vlm_model="gemini-2.0-flash",
    llm_model="gpt-4o",
    similarity_threshold=0.85,
)

pipeline = SegmentationPipeline(config)
result = pipeline.run(
    recordings=["/path/to/recording1", "/path/to/recording2"],
    output_dir="workflow_library",
)

print(f"Found {result.unique_episodes} unique workflows")
```
## Next Steps
See openadapt_ml/segmentation/README.md for:
- P0: Integration tests with real openadapt-capture recordings
- P0: Visualization generator for segment boundaries
- P1: Improved prompt engineering and cost optimization
- P2: Active learning and multi-modal features
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Features added:
- Azure ML job tracking: Shows recent jobs from the last 7 days with status
- Cost tracking: Real-time uptime, hourly rate, and cost estimation
- VM activity detection: Identifies what the VM is currently doing
- Evaluation history: Past benchmark runs and success rates (--details flag)
- Enhanced UI: Structured dashboard with clear sections and icons

New utility functions in vm_monitor.py:
- fetch_azure_ml_jobs(): Fetch recent Azure ML jobs with filtering
- calculate_vm_costs(): Calculate VM costs with hourly/daily/weekly rates
- get_vm_uptime_hours(): Get VM uptime from Azure activity logs
- detect_vm_activity(): Detect current VM activity (idle, running, setup)
- get_evaluation_history(): Load past evaluation runs from the results dir

CLI enhancements:
- Added --details flag for extended information
- Improved output formatting with sections and separators
- Better error handling and status icons
- Preserved existing SSH tunnel and dashboard functionality

Documentation:
- Updated CLAUDE.md with new features and usage examples
- Added detailed docstrings to all new functions

This consolidates VM monitoring into a single enhanced command rather than creating duplicate dashboards, following the viewer consolidation strategy.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
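The uptime-based cost estimation could look roughly like this; the real signature and return shape of calculate_vm_costs() are not shown in the commit, so this is an assumed sketch:

```python
def calculate_vm_costs(uptime_hours: float, hourly_rate: float) -> dict:
    """Estimate VM spend from uptime and hourly rate.

    Hypothetical sketch of the calculate_vm_costs() helper described above;
    the actual function in vm_monitor.py may take Azure-specific inputs.
    """
    return {
        "hourly_rate": hourly_rate,
        "daily_rate": hourly_rate * 24,
        "weekly_rate": hourly_rate * 24 * 7,
        "total_cost": round(uptime_hours * hourly_rate, 2),
    }
```

Combined with get_vm_uptime_hours() from the Azure activity logs, this yields the real-time cost line in the dashboard.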
Update CaptureAdapter to work with the actual openadapt-capture database format.

Key changes:
- Use screen.frame events instead of generic event types
- Pair action events (mouse.down + mouse.up → single click)
- Map frame events to screenshots via timestamp matching
- Update event type filtering to match the openadapt-capture schema
- Improve frame-to-action association logic

This enables the segmentation pipeline to process real capture recordings from openadapt-capture instead of requiring simulated data.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
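The mouse-event pairing step can be sketched like this; the dict-based event shape is an assumption for illustration, since the real adapter reads rows from the capture.db SQLite file:

```python
def pair_click_events(events):
    """Collapse mouse.down/mouse.up pairs into single click events.

    Events other than the down/up pair pass through unchanged. This is a
    simplified sketch of the pairing logic described in the commit above.
    """
    paired = []
    pending_down = None
    for event in events:
        if event["type"] == "mouse.down":
            pending_down = event
        elif event["type"] == "mouse.up" and pending_down is not None:
            paired.append({
                "type": "click",
                "x": pending_down["x"],
                "y": pending_down["y"],
                "timestamp": pending_down["timestamp"],
            })
            pending_down = None
        else:
            paired.append(event)
    return paired
```

Pairing before frame association means each click is matched to a single screenshot rather than two half-events straddling a frame boundary.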
Enhance vm monitor command to provide complete VM usage tracking:
- Real-time VM status (size, IP, power state)
- Activity detection (idle, benchmark running, setup)
- Cost tracking (uptime hours, hourly rate, total cost)
- Azure ML jobs list (last 7 days with status)
- Evaluation history (with --details flag)
- Mock mode for testing without VM (--mock flag)
Add new API endpoints to local.py dashboard server:
- /api/benchmark/status - current job status with ETA
- /api/benchmark/costs - cost breakdown (Azure VM, API, GPU)
- /api/benchmark/metrics - performance metrics by domain
- /api/benchmark/workers - worker status and utilization
- /api/benchmark/runs - list all benchmark runs
- /api/benchmark/tasks/{run}/{task} - task execution details
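An ETA computation of the kind /api/benchmark/status might return can be sketched as follows; this is purely illustrative, as the actual handler logic in local.py is not shown here:

```python
def estimate_eta(tasks_total: int, tasks_done: int, elapsed_seconds: float):
    """Estimate remaining seconds from the average per-task time so far.

    Returns None until at least one task has finished (no rate to
    extrapolate from). Hypothetical helper, not the real local.py code.
    """
    if tasks_done <= 0:
        return None
    per_task = elapsed_seconds / tasks_done
    return per_task * (tasks_total - tasks_done)
```

A status endpoint would fold this value into its JSON response alongside the current job state.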
Update README with VM monitor section including screenshots and
usage examples.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive test plan and results for workflow segmentation pipeline:
- Test plan with 8 stages from environment setup to documentation
- Test results documenting real capture processing outcomes
- Test files for CaptureAdapter and segmentation pipeline

Add VM monitor screenshot generation scripts and documentation:
- Scripts for automated dashboard screenshot generation
- Implementation plan for VM monitor screenshot feature
- Analysis of screenshot capture approaches

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Archive OpenAdapter (incomplete pre-refactor cloud deployment POC)
- Document key takeaways and lessons learned
- Reference modern cloud infrastructure in openadapt-ml
- Add guidelines for when to archive repositories

OpenAdapter was an incomplete proof-of-concept from October 2024 with only 165 lines of code and no ecosystem usage. Cloud deployment is now production-ready in openadapt_ml/cloud/ and benchmarks/azure.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add search bar to viewer controls with Ctrl+F / Cmd+F keyboard shortcut
- Implement advanced token-based search across step indices, action types, and text
- Search filters the step list in real time with a result count display
- Clear button and Escape key support for resetting search
- Consistent UI styling with existing viewer components
- Integrates with existing step list filtering

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
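Token-based filtering of this kind can be sketched as follows. The viewer itself implements this client-side; this Python sketch only illustrates the matching rule, and the step field names are assumptions:

```python
def matches_query(step: dict, query: str) -> bool:
    """True if every whitespace-separated query token appears somewhere in
    the step's searchable text (index, action type, or text content)."""
    haystack = " ".join(
        str(step.get(key, "")) for key in ("index", "action_type", "text")
    ).lower()
    return all(token in haystack for token in query.lower().split())


def filter_steps(steps, query):
    """Return steps matching the query; an empty query matches everything."""
    if not query.strip():
        return list(steps)
    return [s for s in steps if matches_query(s, query)]
```

Requiring every token to match (AND semantics) lets queries like "click submit" narrow the step list progressively as the user types.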
This PR has merge conflicts that need to be resolved before it can be merged. The PR contains significant work:

- 15 commits with substantial changes (20,612 additions, 3,094 deletions)
- Multiple important features: viewer consolidation, workflow segmentation, VM monitoring, CI improvements
- All ruff linting issues resolved

Next steps:
1. Resolve merge conflicts with main branch
2. Ensure all tests pass after conflict resolution
3. Request review once conflicts are resolved

This is high-priority active work and should be merged soon after conflicts are addressed.
Closing in favor of PR #9, which is a cleaned and rebased version of this PR. PR #9 contains only the new features from this PR (viewer consolidation, workflow segmentation, VM monitoring) while removing commits that were already merged to main via PR #6 (unified baseline adapters, CI workflow, extensive linting fixes). This resolves all merge conflicts and provides a cleaner commit history for review.
Summary
Implements Phase 1 (Foundation) of the viewer consolidation plan. Establishes the foundation for migrating from inline HTML generation to the reusable `openadapt-viewer` component library.

Changes

- `uv add openadapt-viewer`
- `openadapt_ml/training/viewer_components.py` with ML-specific wrappers:
  - `screenshot_with_predictions()` - Screenshot display with human/AI action overlays
  - `training_metrics()` - Training statistics metrics grid
  - `playback_controls()` - Playback UI controls
  - `correctness_badge()` - Pass/fail badge component
  - `generate_comparison_summary()` - Model comparison summary
- `tests/test_viewer_screenshots.py` for component validation
- `openadapt_ml/training/viewer_migration_example.py` demonstrating usage

Design Principles

- Existing `viewer.py` code remains unchanged

Test Plan

- `uv run pytest tests/test_viewer_screenshots.py -v`
- `uv run python -m openadapt_ml.training.viewer_migration_example`

Next Steps (Phase 2)

Once Phase 1 is validated, Phase 2 will:

- Migrate `viewer.py` functions to use these adapters

Related
🤖 Generated with Claude Code