
Conversation

@olesho (Contributor) commented Sep 10, 2025:

No description provided.

@olesho olesho requested a review from tysonthomas9 September 10, 2025 16:00
@tysonthomas9 (Contributor) commented:

@codex review

@tysonthomas9 tysonthomas9 requested review from Copilot and tysonthomas9 and removed request for Copilot and tysonthomas9 September 11, 2025 18:34
@chatgpt-codex-connector commented:

Codex Review: Didn't find any major issues. Breezy!


- Install Node.js 22.x in final Docker image
- Add eval-server build stage and startup script
- Configure supervisor to auto-start eval-server
- Add nginx proxy routes for eval-server API endpoints
- Update service configs with eval-server environment variables
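The nginx proxy route mentioned in that list might look roughly like the fragment below. The `/eval-server/` path and the proxy settings are assumptions, not the PR's actual config; the port matches the `ws://localhost:8082` endpoint referenced later in this thread, and the `Upgrade`/`Connection` headers are needed so WebSocket connections survive the proxy.

```
# Hypothetical nginx route for the eval-server API (path and settings assumed).
location /eval-server/ {
    proxy_pass http://localhost:8082/;
    proxy_http_version 1.1;
    # Forward WebSocket upgrade headers so ws:// connections pass through.
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}
```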
olesho and others added 4 commits October 5, 2025 14:54
📝 Changes:
- Modified Dockerfile.cloudrun to set AUTOMATED_MODE: true in BuildConfig.ts before building DevTools
- This ensures DevTools automatically enables evaluation mode and connects to ws://localhost:8082

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
📝 Changes:
- Added --auto-open-devtools-for-tabs flag to CHROMIUM_FLAGS
- Updated cloudrun-wrapper.sh, service.yaml, and service-secrets.yaml
- This ensures DevTools opens automatically when browser starts, allowing immediate connection to eval-server
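In wrapper-script form, appending the flag could look like this sketch. The variable name `CHROMIUM_FLAGS` comes from the commit message; the base flags shown are illustrative, not the PR's actual values.

```shell
# Illustrative base flags (assumed); only --auto-open-devtools-for-tabs
# is the flag this commit actually adds.
CHROMIUM_FLAGS="--no-sandbox --remote-debugging-port=9222"
# Append the new flag so DevTools opens as soon as the browser starts.
CHROMIUM_FLAGS="$CHROMIUM_FLAGS --auto-open-devtools-for-tabs"
echo "$CHROMIUM_FLAGS"
```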

📝 Changes:
- Added python3 and python-is-python3 to runtime dependencies in Dockerfile.cloudrun
- Required by the supervisor devtools-frontend service, which runs: python3 -m http.server 8001
- This will allow the custom DevTools frontend to be served properly
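A supervisor entry for that service might look like the fragment below. The `command` comes from the commit; the program name, working directory, and restart policy are assumptions.

```
; Hypothetical supervisord entry (directory and program name assumed).
[program:devtools-frontend]
command=python3 -m http.server 8001
directory=/opt/devtools-frontend
autostart=true
autorestart=true
```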

📝 Changes:
- Added --auto-open-devtools-for-tabs flag to hardcoded command in chromium.conf
- This will automatically open DevTools when browser starts

@tysonthomas9 tysonthomas9 requested review from Copilot and removed request for tysonthomas9 October 9, 2025 11:12
@tysonthomas9 (Contributor) commented:

@codex

@chatgpt-codex-connector commented:

For now, I can only help with PRs you've created.

Copilot AI left a comment:

Pull Request Overview

This PR introduces a comprehensive evaluation server framework in Python that provides a programmatic API for creating and managing LLM agent evaluations via WebSocket connections. The framework includes both a core Python library and an extensive Node.js reference implementation.

Key changes:

  • Complete Python evaluation server library with programmatic API
  • WebSocket-based client-agent communication with JSON-RPC protocol
  • Stack-based evaluation distribution system (LIFO queue)
  • Extensive example scripts and evaluation datasets

Reviewed Changes

Copilot reviewed 206 out of 224 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| eval-server/python/src/* | Core Python library implementing EvalServer, ClientProxy, and the evaluation stack |
| eval-server/python/examples/* | Example scripts demonstrating basic, stack-based, and programmatic evaluation patterns |
| eval-server/python/evals/* | Browsecomp benchmark integration with automatic scoring and dataset loading |
| eval-server/nodejs/src/* | Reference Node.js implementation with CLI, HTTP wrapper, and judge system |
| eval-server/nodejs/evals/* | Comprehensive YAML-based evaluation definitions for web task agents |

Files not reviewed (1):

  • eval-server/nodejs/package-lock.json: language not supported


Comment on lines +280 to +281:

```python
    """Main function for the browsecomp evaluation server."""
    return asyncio.run(async_main())
```
Copilot AI commented Oct 9, 2025:
[nitpick] The main() function is unnecessary: it only wraps asyncio.run(async_main()). Consider removing this wrapper and calling asyncio.run(async_main()) directly in the `if __name__ == '__main__':` block.
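The refactor being suggested would look roughly like this; `async_main` is stood in for by a stub, since the reviewer's comment only concerns the entry-point shape.

```python
import asyncio


async def async_main() -> None:
    # Stand-in for the server's real async entry point.
    print("server running")


if __name__ == "__main__":
    # Call asyncio.run() directly instead of wrapping it in a main() helper.
    asyncio.run(async_main())
```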

Comment on lines +805 to +806:

```python
    "partial_results": partial_count,
    "results_file": results_file if 'results_file' in locals() else None
```
Copilot AI commented Oct 9, 2025:
The variable `partial_count` is referenced but not defined in this scope. It should likely be calculated from `completed_evaluations`, similar to how it is done earlier in the code.
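A sketch of the fix being suggested: derive `partial_count` from `completed_evaluations` before building the summary dict. The record shape and the `status` field are assumptions; the PR's actual data model may differ.

```python
# Assumed shape of completed_evaluations (the real records may differ).
completed_evaluations = [
    {"id": "e1", "status": "completed"},
    {"id": "e2", "status": "partial"},
    {"id": "e3", "status": "partial"},
]

# Compute partial_count in this scope instead of referencing an undefined name.
partial_count = sum(
    1 for e in completed_evaluations if e.get("status") == "partial"
)

summary = {
    "partial_results": partial_count,
    # The original code falls back to None when no results file was written.
    "results_file": None,
}
print(partial_count)  # 2
```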

@olesho olesho merged commit b861033 into main Oct 10, 2025