@olesho olesho commented Oct 22, 2025

No description provided.

olesho and others added 22 commits October 15, 2025 15:28
- Add VisionJudge class to evals/lib/judge.py for visual evaluation with screenshots
- Add /page/content endpoint to api-server.js for retrieving page HTML/text
- Add /page/screenshot endpoint to api-server.js for capturing screenshots
- Fixes ImportError: cannot import name 'VisionJudge'

These were lost during cleanup but are required for the eval runner to work.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Created separate example config files for different model providers:
- config.example.openai.yml: OpenAI models (now default)
- config.example.openrouter-gemini.yml: OpenRouter with Gemini
- config.example.openrouter-gpt.yml: OpenRouter with GPT

Main config.yml now defaults to OpenAI provider for reliability.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

…nt into feat/manage-tabs

# Conflicts:
#	evals/config.yml
@olesho olesho requested a review from tysonthomas9 October 22, 2025 02:29
@tysonthomas9 tysonthomas9 requested a review from Copilot October 22, 2025 05:49
Copilot AI left a comment

Pull Request Overview

This PR implements a comprehensive tab management system for the browser automation framework, transitioning from an eval-server architecture to a browser-agent-server approach. The changes remove the eval-server components and introduce new browser-agent-server functionality with improved tab handling, initialization scripts, and configuration management.

Key changes:

  • Removed eval-server Python and NodeJS implementations
  • Introduced browser-agent-server configuration for both standard and Cloud Run deployments
  • Added container initialization scripts for cleanup of lock files
  • Implemented comprehensive evaluation framework with LLM-as-a-judge capabilities
  • Updated Chromium configuration to remove auto-devtools flag

Reviewed Changes

Copilot reviewed 112 out of 270 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `supervisor/services/browser-agent-server.conf` | Added new supervisor configuration for browser-agent-server service |
| `supervisor/services-cloudrun/browser-agent-server.conf` | Updated Cloud Run supervisor config to use browser-agent-server instead of eval-server |
| `supervisor/services/chromium.conf` | Removed CHROMIUM_FLAGS environment variable for auto-opening devtools |
| `scripts/init-container.sh` | New initialization script for container startup that cleans lock files |
| `scripts/test-browser-agent-server.sh` | New test script for browser-agent-server Docker build |
| `evals/run.py` | New universal evaluation runner with screenshot capture and vision judge support |
| `evals/lib/*.py` | New evaluation framework library modules (config, judge, api client, eval loader) |
| `evals/config*.yml` | New configuration files for evaluation framework with model settings |
| `evals/data/**/*.yaml` | Updated evaluation definitions with corrected field names and structure |

Files not reviewed (2)
  • browser-agent-server/nodejs/package-lock.json: Language not supported
  • eval-server/nodejs/package-lock.json: Language not supported

docker run --rm browser-agent-server-test ls -la /browser-agent-server/node_modules | head -5

echo ""
echo "✅ All checks passed! Eval-server build is working."
Copilot AI Oct 22, 2025

The message refers to 'Eval-server' but should refer to 'browser-agent-server' to match the actual component being tested.

Suggested change
- echo "✅ All checks passed! Eval-server build is working."
+ echo "✅ All checks passed! browser-agent-server build is working."
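
One way to keep a status message like this from drifting after a rename is to build both the image tag and the message from a single variable. The following is a minimal sketch, not the PR's actual script: `SERVICE_NAME`, `IMAGE_TAG`, and `success_message` are hypothetical names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): derive the image tag and the success
# message from one variable so they cannot disagree after a rename.
SERVICE_NAME="browser-agent-server"   # assumed; matches the image tag used in the test script
IMAGE_TAG="${SERVICE_NAME}-test"

success_message() {
  # $1: name of the service whose build was checked
  printf '✅ All checks passed! %s build is working.\n' "$1"
}

success_message "$SERVICE_NAME"
```

With this shape, renaming the component means changing `SERVICE_NAME` only.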

@tysonthomas9
@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 1 to 6
#!/usr/bin/env node

- // Custom eval-server startup script for Cloud Run
+ // Custom browser-agent-server startup script for Cloud Run
// Uses environment variables for port configuration

import { EvalServer } from './src/lib/EvalServer.js';

[P1] Update Cloud Run start script to use BrowserAgentServer

The startup script still imports ./src/lib/EvalServer.js and constructs EvalServer, but that module was removed in this commit when the implementation was renamed to BrowserAgentServer.js. Executing this script will throw ERR_MODULE_NOT_FOUND before the server even starts, so Cloud Run deployments will fail. Replace the import (and class name) with BrowserAgentServer to match the new file.
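
A stale import like this can be caught before deployment by checking that every relative import in the entry script points at a file that exists. The following is a rough sketch under stated assumptions: `missing_imports` is a hypothetical helper (not part of the PR), `start.js` is a placeholder file name, and the check only recognises imports written as `from './…'` with single quotes.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deploy check (not part of the PR): print every relative
# import in a Node.js entry script whose target file is missing on disk.
# Assumes imports are written as: from './path/file.js' (single quotes).
missing_imports() {
  local script="$1" dir rel
  dir="$(dirname "$script")"
  grep -oE "from '\.{1,2}/[^']+'" "$script" \
    | sed -E "s/^from '//; s/'$//" \
    | while IFS= read -r rel; do
        # Resolve the import relative to the script's own directory.
        [ -f "$dir/$rel" ] || printf '%s\n' "$rel"
      done
}

# Demo in a throwaway directory that mimics the situation in the review:
# only BrowserAgentServer.js exists, but the script still imports EvalServer.js.
demo="$(mktemp -d)"
mkdir -p "$demo/src/lib"
touch "$demo/src/lib/BrowserAgentServer.js"
printf "import { EvalServer } from './src/lib/EvalServer.js';\n" > "$demo/start.js"

missing_imports "$demo/start.js"   # prints ./src/lib/EvalServer.js
```

An empty result means every listed import resolved. The check does not cover bare package imports or dynamic `import()` calls.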

Contributor Author

Should be tackled now.

@olesho olesho merged commit 355b80a into main Oct 23, 2025