ci: switch integration tests to use eval proxy #1985

Merged
neubig merged 1 commit into main from use-eval-proxy-for-integration-tests
Feb 10, 2026

@neubig commented Feb 10, 2026

Summary

Switch integration tests from using the app proxy to the eval proxy.

Changes

  • Change LLM_BASE_URL from llm-proxy.app.all-hands.dev to llm-proxy.eval.all-hands.dev
  • Change LLM_API_KEY secret to EVAL_LLM_API_KEY
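As a rough illustration of what this change means for a test run, here is a minimal sketch of a helper that reads the proxy settings from the environment. The helper name `load_llm_settings`, the `https://` scheme, and the idea that the workflow exposes the renamed secret to the tests as the `LLM_API_KEY` environment variable are all assumptions; only the two host names come from this PR.

```python
import os

# Host names are taken from the PR description; the https:// scheme is assumed.
APP_PROXY_URL = "https://llm-proxy.app.all-hands.dev"
EVAL_PROXY_URL = "https://llm-proxy.eval.all-hands.dev"


def load_llm_settings(env=None):
    """Hypothetical helper: return the proxy URL and API key for a test run.

    Assumes the CI workflow maps the repository secret onto the
    LLM_API_KEY environment variable; the real workflow may differ.
    """
    env = os.environ if env is None else env
    # Fall back to the eval proxy when LLM_BASE_URL is not set.
    base_url = env.get("LLM_BASE_URL", EVAL_PROXY_URL)
    api_key = env.get("LLM_API_KEY")
    if not api_key:
        raise RuntimeError("LLM_API_KEY is not set; the proxy would reject requests")
    return {"base_url": base_url, "api_key": api_key}


if __name__ == "__main__":
    # A dummy key is passed explicitly so the example runs without real secrets.
    settings = load_llm_settings({"LLM_BASE_URL": EVAL_PROXY_URL, "LLM_API_KEY": "dummy"})
    print(settings["base_url"])
```

Failing early on a missing key keeps a misconfigured secret from surfacing later as a less obvious authentication error.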

Motivation

Some models in resolve_model_config.py are only available on the eval proxy (e.g., jade-spark-2862). Using the app proxy causes integration tests to fail with "Invalid model name" errors for these models.

Required Action

The EVAL_LLM_API_KEY secret needs to be configured in the repository settings with a key that has access to the eval proxy.

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---------|---------------|------------|-------------|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:db248ea-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-db248ea-python \
  ghcr.io/openhands/agent-server:db248ea-python

All tags pushed for this build

ghcr.io/openhands/agent-server:db248ea-golang-amd64
ghcr.io/openhands/agent-server:db248ea-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:db248ea-golang-arm64
ghcr.io/openhands/agent-server:db248ea-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:db248ea-java-amd64
ghcr.io/openhands/agent-server:db248ea-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:db248ea-java-arm64
ghcr.io/openhands/agent-server:db248ea-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:db248ea-python-amd64
ghcr.io/openhands/agent-server:db248ea-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:db248ea-python-arm64
ghcr.io/openhands/agent-server:db248ea-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:db248ea-golang
ghcr.io/openhands/agent-server:db248ea-java
ghcr.io/openhands/agent-server:db248ea-python

About Multi-Architecture Support

  • Each variant tag (e.g., db248ea-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., db248ea-python-amd64) are also available if needed
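The relationship between the multi-arch tag and the per-architecture tags can be sketched as a small naming function. This is only an illustration inferred from the tag list above; `image_tags` is a hypothetical helper, not part of the build tooling.

```python
REPO = "ghcr.io/openhands/agent-server"


def image_tags(short_sha, variant, arches=("amd64", "arm64")):
    """Hypothetical helper: derive tag names following the pattern in the list above.

    Returns the multi-arch manifest tag plus one tag per architecture.
    """
    manifest = f"{REPO}:{short_sha}-{variant}"
    per_arch = [f"{manifest}-{arch}" for arch in arches]
    return manifest, per_arch


if __name__ == "__main__":
    manifest, per_arch = image_tags("db248ea", "python")
    print(manifest)
    for tag in per_arch:
        print(tag)
```

For `db248ea` and `python`, this reproduces the `db248ea-python`, `db248ea-python-amd64`, and `db248ea-python-arm64` tags listed for this build.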

Change LLM_BASE_URL from llm-proxy.app.all-hands.dev to
llm-proxy.eval.all-hands.dev and use LLM_API_KEY_EVAL secret.

This enables testing models that are only available on the eval proxy.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig force-pushed the use-eval-proxy-for-integration-tests branch from 2faedde to f490f2d February 10, 2026 15:12
@github-actions

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $0.00
Models Tested: 1
Timestamp: 2026-02-10 15:15:19 UTC

📊 Summary

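| Model | Overall | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|--------------|---------|-------|------|--------|
| litellm_proxy_jade_spark_2862 | 100.0% | 7/7 | 1 | 8 | $0.00 | 194,152 |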

📋 Detailed Results

litellm_proxy_jade_spark_2862

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.00
  • Token Usage: prompt: 189,197, completion: 4,955, cache_read: 156,294, reasoning: 1,717
  • Run Suffix: litellm_proxy_jade_spark_2862_f490f2d_jade_spark_2862_run_N8_20260210_151338
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.
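The reported figures are consistent with a success rate computed over executed tests only (8 total, 1 skipped, 7 run) and a token total equal to prompt plus completion tokens. The sketch below reproduces that arithmetic; treating skipped tests as excluded and `cache_read` and `reasoning` as not added to the total is an assumption inferred from the numbers, not a statement about how the report is actually generated.

```python
def success_rate(passed: int, total: int, skipped: int) -> float:
    """Percentage of executed (non-skipped) tests that passed."""
    executed = total - skipped
    if executed <= 0:
        return 0.0
    return 100.0 * passed / executed


# Figures from the report above.
rate = success_rate(passed=7, total=8, skipped=1)
total_tokens = 189_197 + 4_955  # prompt + completion

print(f"{rate:.1f}% success, {total_tokens:,} tokens")
# prints "100.0% success, 194,152 tokens"
```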

@neubig marked this pull request as ready for review February 10, 2026 15:16
@neubig merged commit cfe52af into main Feb 10, 2026
26 of 28 checks passed
@neubig deleted the use-eval-proxy-for-integration-tests branch February 10, 2026 15:16
3 participants