
Conversation

@olesho (Contributor) commented Sep 10, 2025:

No description provided.

@olesho olesho requested a review from tysonthomas9 September 10, 2025 16:00
@tysonthomas9 (Contributor) commented:

@codex review

@tysonthomas9 tysonthomas9 requested review from Copilot and tysonthomas9 and removed request for Copilot and tysonthomas9 September 11, 2025 18:34
@chatgpt-codex-connector commented:

Codex Review: Didn't find any major issues. Breezy!


- Install Node.js 22.x in final Docker image
- Add eval-server build stage and startup script
- Configure supervisor to auto-start eval-server
- Add nginx proxy routes for eval-server API endpoints
- Update service configs with eval-server environment variables
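The nginx proxy route mentioned in that list might look roughly like the fragment below. The `/eval-server/` path and the proxy settings are assumptions, not the PR's actual config; the port matches the `ws://localhost:8082` endpoint referenced later in this thread, and the `Upgrade`/`Connection` headers are needed so WebSocket connections survive the proxy.

```
# Hypothetical nginx route for the eval-server API (path and settings assumed).
location /eval-server/ {
    proxy_pass http://localhost:8082/;
    proxy_http_version 1.1;
    # Forward WebSocket upgrade headers so ws:// connections pass through.
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}
```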
olesho and others added 4 commits October 5, 2025 14:54
📝 Changes:
- Modified Dockerfile.cloudrun to set AUTOMATED_MODE: true in BuildConfig.ts before building DevTools
- This ensures DevTools automatically enables evaluation mode and connects to ws://localhost:8082

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
📝 Changes:
- Added --auto-open-devtools-for-tabs flag to CHROMIUM_FLAGS
- Updated cloudrun-wrapper.sh, service.yaml, and service-secrets.yaml
- This ensures DevTools opens automatically when browser starts, allowing immediate connection to eval-server
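In wrapper-script form, appending the flag could look like this sketch. The variable name `CHROMIUM_FLAGS` comes from the commit message; the base flags shown are illustrative, not the PR's actual values.

```shell
# Illustrative base flags (assumed); only --auto-open-devtools-for-tabs
# is the flag this commit actually adds.
CHROMIUM_FLAGS="--no-sandbox --remote-debugging-port=9222"
# Append the new flag so DevTools opens as soon as the browser starts.
CHROMIUM_FLAGS="$CHROMIUM_FLAGS --auto-open-devtools-for-tabs"
echo "$CHROMIUM_FLAGS"
```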

📝 Changes:
- Added python3 and python-is-python3 to runtime dependencies in Dockerfile.cloudrun
- Required by the supervisor devtools-frontend service, which runs: python3 -m http.server 8001
- This will allow the custom DevTools frontend to be served properly
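A supervisor entry for that service might look like the fragment below. The `command` comes from the commit; the program name, working directory, and restart policy are assumptions.

```
; Hypothetical supervisord entry (directory and program name assumed).
[program:devtools-frontend]
command=python3 -m http.server 8001
directory=/opt/devtools-frontend
autostart=true
autorestart=true
```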

📝 Changes:
- Added --auto-open-devtools-for-tabs flag to hardcoded command in chromium.conf
- This will automatically open DevTools when browser starts

@tysonthomas9 tysonthomas9 requested review from Copilot and removed request for tysonthomas9 October 9, 2025 11:12
@tysonthomas9 (Contributor) commented:

@codex

@chatgpt-codex-connector commented:

For now, I can only help with PRs you've created.

Copilot AI left a comment:

Pull Request Overview

This PR introduces a comprehensive evaluation server framework in Python that provides a programmatic API for creating and managing LLM agent evaluations via WebSocket connections. The framework includes both a core Python library and an extensive Node.js reference implementation.

Key changes:

  • Complete Python evaluation server library with programmatic API
  • WebSocket-based client-agent communication with JSON-RPC protocol
  • Stack-based evaluation distribution system (LIFO queue)
  • Extensive example scripts and evaluation datasets

Reviewed Changes

Copilot reviewed 206 out of 224 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| eval-server/python/src/* | Core Python library implementing EvalServer, ClientProxy, and the evaluation stack |
| eval-server/python/examples/* | Example scripts demonstrating basic, stack-based, and programmatic evaluation patterns |
| eval-server/python/evals/* | Browsecomp benchmark integration with automatic scoring and dataset loading |
| eval-server/nodejs/src/* | Reference Node.js implementation with CLI, HTTP wrapper, and judge system |
| eval-server/nodejs/evals/* | Comprehensive YAML-based evaluation definitions for web task agents |

Files not reviewed (1):

  • eval-server/nodejs/package-lock.json: language not supported


Comment on lines +280 to +281:

```python
    """Main function for the browsecomp evaluation server."""
    return asyncio.run(async_main())
```
Copilot AI commented Oct 9, 2025:
[nitpick] The main() function is unnecessary: it only wraps asyncio.run(async_main()). Consider removing this wrapper and calling asyncio.run(async_main()) directly in the `if __name__ == '__main__':` block.
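The refactor being suggested would look roughly like this; `async_main` is stood in for by a stub, since the reviewer's comment only concerns the entry-point shape.

```python
import asyncio


async def async_main() -> None:
    # Stand-in for the server's real async entry point.
    print("server running")


if __name__ == "__main__":
    # Call asyncio.run() directly instead of wrapping it in a main() helper.
    asyncio.run(async_main())
```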

Comment on lines +805 to +806:

```python
    "partial_results": partial_count,
    "results_file": results_file if 'results_file' in locals() else None
```
Copilot AI commented Oct 9, 2025:
The variable `partial_count` is referenced but not defined in this scope. It should likely be calculated from `completed_evaluations`, similar to how it is done earlier in the code.
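A sketch of the fix being suggested: derive `partial_count` from `completed_evaluations` before building the summary dict. The record shape and the `status` field are assumptions; the PR's actual data model may differ.

```python
# Assumed shape of completed_evaluations (the real records may differ).
completed_evaluations = [
    {"id": "e1", "status": "completed"},
    {"id": "e2", "status": "partial"},
    {"id": "e3", "status": "partial"},
]

# Compute partial_count in this scope instead of referencing an undefined name.
partial_count = sum(
    1 for e in completed_evaluations if e.get("status") == "partial"
)

summary = {
    "partial_results": partial_count,
    # The original code falls back to None when no results file was written.
    "results_file": None,
}
print(partial_count)  # 2
```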

@olesho olesho merged commit b861033 into main Oct 10, 2025