Feat/twilio followup #5
Conversation
@codex review
Codex Review: Didn't find any major issues. Breezy!
- Install Node.js 22.x in final Docker image
- Add eval-server build stage and startup script
- Configure supervisor to auto-start eval-server
- Add nginx proxy routes for eval-server API endpoints
- Update service configs with eval-server environment variables
📝 Changes:
- Modified Dockerfile.cloudrun to set AUTOMATED_MODE: true in BuildConfig.ts before building DevTools
- This ensures DevTools automatically enables evaluation mode and connects to ws://localhost:8082

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
📝 Changes:
- Added --auto-open-devtools-for-tabs flag to CHROMIUM_FLAGS
- Updated cloudrun-wrapper.sh, service.yaml, and service-secrets.yaml
- This ensures DevTools opens automatically when the browser starts, allowing immediate connection to eval-server

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
📝 Changes:
- Added python3 and python-is-python3 to runtime dependencies in Dockerfile.cloudrun
- Required for the supervisor devtools-frontend service, which runs: python3 -m http.server 8001
- This allows the custom DevTools frontend to be served properly

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
📝 Changes:
- Added --auto-open-devtools-for-tabs flag to the hardcoded command in chromium.conf
- This automatically opens DevTools when the browser starts

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
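The commits above stand up the eval-server inside the Cloud Run image and point DevTools at ws://localhost:8082. As an aside, a quick way to sanity-check that wiring from inside the container is a minimal WebSocket probe. The sketch below is illustrative only, not part of this PR, and assumes the third-party `websockets` package is available:

```python
# Illustrative smoke test, not part of this PR: check that the eval-server
# WebSocket endpoint DevTools is configured to use (ws://localhost:8082,
# per the commit messages above) is reachable from inside the container.
# Assumes the third-party `websockets` package is installed.
import asyncio

import websockets


async def probe(url: str = "ws://localhost:8082") -> bool:
    try:
        async with websockets.connect(url, open_timeout=5):
            return True
    except (OSError, asyncio.TimeoutError):
        return False


if __name__ == "__main__":
    print("eval-server reachable:", asyncio.run(probe()))
```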
For now, I can only help with PRs you've created.
Pull Request Overview
This PR introduces a comprehensive evaluation server framework in Python that provides a programmatic API for creating and managing LLM agent evaluations via WebSocket connections. The framework includes both a core Python library and an extensive Node.js reference implementation.
Key changes:
- Complete Python evaluation server library with programmatic API (usage sketched after this list)
- WebSocket-based client-agent communication with JSON-RPC protocol
- Stack-based evaluation distribution system (LIFO queue)
- Extensive example scripts and evaluation datasets
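As a rough illustration of the programmatic API and stack-based distribution described above: only EvalServer, ClientProxy, and the LIFO behaviour are named in this PR, so the import path, constructor arguments, and every method in the sketch below are assumptions made for illustration, not the library's actual surface.

```python
# Hypothetical usage sketch only. EvalServer and ClientProxy are named in
# this PR; the import path, constructor arguments, and methods below are
# assumptions made purely for illustration.
import asyncio

from eval_server import EvalServer  # assumed import path


async def main() -> None:
    server = EvalServer(host="localhost", port=8082)  # assumed signature

    @server.on_connect  # assumed decorator: called with a ClientProxy
    async def handle_client(client) -> None:
        # Stack-based (LIFO) distribution: the most recently pushed
        # evaluation would be handed to the agent first.
        client.push_evaluation({"id": "eval-001", "task": "Summarize example.com"})
        client.push_evaluation({"id": "eval-002", "task": "Extract page title"})
        result = await client.run_next()  # assumed to pop eval-002 first
        print(result)

    await server.serve_forever()  # assumed method


if __name__ == "__main__":
    asyncio.run(main())
```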
Reviewed Changes
Copilot reviewed 206 out of 224 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| eval-server/python/src/* | Core Python library implementing EvalServer, ClientProxy, and evaluation stack |
| eval-server/python/examples/* | Example scripts demonstrating basic, stack-based, and programmatic evaluation patterns |
| eval-server/python/evals/* | Browsecomp benchmark integration with automatic scoring and dataset loading |
| eval-server/nodejs/src/* | Reference Node.js implementation with CLI, HTTP wrapper, and judge system |
| eval-server/nodejs/evals/* | Comprehensive YAML-based evaluation definitions for web task agents |
Files not reviewed (1)
- eval-server/nodejs/package-lock.json: Language not supported
| """Main function for the browsecomp evaluation server.""" | ||
| return asyncio.run(async_main()) |
Copilot AI (Oct 9, 2025)
[nitpick] The main() function is unnecessary: it only wraps asyncio.run(async_main()). Consider removing this wrapper and calling asyncio.run(async_main()) directly in the `if __name__ == '__main__'` block.
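Concretely, the shape Copilot is suggesting would look something like the sketch below, where async_main stands in for the module's existing coroutine:

```python
# Suggested entry-point shape: no main() wrapper, just asyncio.run() under
# the module guard. async_main is a stand-in for the existing coroutine.
import asyncio


async def async_main() -> None:
    """Placeholder for the browsecomp server's existing async entry point."""


if __name__ == "__main__":
    asyncio.run(async_main())
```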
| "partial_results": partial_count, | ||
| "results_file": results_file if 'results_file' in locals() else None |
Copilot AI (Oct 9, 2025)
The variable 'partial_count' is referenced but not defined in this scope. It should likely be calculated from completed_evaluations similar to how it's done earlier in the code.
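A hedged sketch of the fix being suggested: compute partial_count from completed_evaluations before building the summary dict. The exact shape of completed_evaluations and its status values are not visible in this excerpt, so both are assumptions below.

```python
# Illustrative fix only: derive partial_count before it is used in the
# summary dict. The structure of completed_evaluations and the "partial"
# status value are assumptions; mirror whatever the earlier code does.
completed_evaluations = [
    {"id": "eval-001", "status": "partial"},
    {"id": "eval-002", "status": "completed"},
]

partial_count = sum(1 for e in completed_evaluations if e.get("status") == "partial")

summary = {
    "partial_results": partial_count,
    "results_file": results_file if "results_file" in locals() else None,
}
print(summary)
```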
No description provided.