For rebuttal, long context eval by Ki-Seki · Pull Request #92 · SculptAI/GIMBench

Ki-Seki · 2026-04-08T12:16:36Z

No description provided.

Copilot

Pull request overview

Adds long-context CV parsing evaluation artifacts for the KDD rebuttal, including a reproducible eval script and aggregated metrics output.

Changes:

Added aggregated_results_cv.csv containing aggregated CV-parse evaluation metrics and run metadata.
Added a shell script to download the model and run gimbench.cv.cv_parse for the long-context rebuttal setting.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
results/260408-kdd-rebuttal-cv-long-context/aggregated_results_cv.csv	Adds aggregated CV evaluation row with config/run metadata and accuracy metrics.
results/260408-kdd-rebuttal-cv-long-context/*eval.sh	Adds a runnable script to download the model and execute the CV parse evaluation.

Copilot · 2026-04-08T12:21:04Z

+set -x
+
+export API_KEY=XXX
+export API_BASE=https://openrouter.ai/api/v1


set -x will echo the full python command line, which includes --api_key $API_KEY; this can leak the API key into logs/CI output and shell history. Consider disabling xtrace for the sensitive section (e.g., set +x before invoking python and re-enable after), or avoid passing secrets via CLI flags (prefer env var expected by the client if supported).
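
A minimal sketch of the first option (pausing xtrace around the sensitive call), not the script's actual fix. `run_eval` is a hypothetical stand-in for the real `python -m gimbench.cv.cv_parse` invocation and `sk-demo-secret` is a made-up key; the point is only that the traced output never contains the key.

```shell
#!/bin/bash
# Hypothetical stand-in for the real evaluation command. It reports how many
# arguments it received and never prints them.
run_eval() { echo "run_eval called with $# args"; }

traced_section() {
    local API_KEY="$1"
    set -x
    echo "preparing"                # traced: safe to show
    { set +x; } 2>/dev/null         # stop tracing without tracing 'set +x' itself
    run_eval --api_key "$API_KEY"   # not traced: the key stays out of stderr
    set -x
    echo "finished"                 # traced again
    { set +x; } 2>/dev/null
}

# Capture only the trace (stderr) to see what a log would contain.
trace_output=$(traced_section "sk-demo-secret" 2>&1 >/dev/null)
printf '%s\n' "$trace_output"
```

The captured trace shows the two `echo` commands but not the `run_eval` line, so the key does not reach the log.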

Copilot · 2026-04-08T12:21:04Z

+export API_KEY=XXX
+export API_BASE=https://openrouter.ai/api/v1


Avoid hardcoding API_KEY in the script (even as XXX), since it encourages editing/committing secrets and makes the script non-reproducible across environments. Prefer reading API_KEY from the environment and failing fast with a clear message if it is unset (and similarly quote API_BASE).
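
One way this could look, as a sketch under assumptions: `require_api_config` is a hypothetical helper name, and keeping the OpenRouter URL as an overridable default is an illustrative choice rather than something the PR specifies.

```shell
#!/bin/bash
# Hypothetical helper: refuse to continue when API_KEY is missing, and fall
# back to the endpoint from the original script when API_BASE is not set.
require_api_config() {
    if [ -z "${API_KEY:-}" ]; then
        echo "error: API_KEY is not set; export it before running this script" >&2
        return 1
    fi
    # Default taken from the original script; override it from the environment.
    API_BASE="${API_BASE:-https://openrouter.ai/api/v1}"
    export API_BASE
}
```

A script would call `require_api_config || exit 1` before the evaluation loop, so a missing key stops the run before any model download starts.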

Copilot · 2026-04-08T12:21:04Z

@@ -0,0 +1,17 @@
+#!/bin/bash
+


This script doesn’t enable strict error handling, so failures from hf download or the evaluation run may be missed while the script continues (potentially followed by shutdown). Consider adding set -euo pipefail (and optionally IFS=$'\n\t') near the top so the run fails fast on errors/unset variables.

Suggested change

+set -euo pipefail
+IFS=$'\n\t'
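
To illustrate the difference strict mode makes, here is a small sketch that runs the same two steps in a child `bash` with and without `set -euo pipefail`. The `false` command is a stand-in for a failing `hf download`, and `run_steps` is a hypothetical helper used only for this demonstration.

```shell
#!/bin/bash
# Hypothetical helper: run a failing step followed by a second step in a
# child bash, after applying the shell options passed as $1.
run_steps() {
    bash -c "$1
false
echo 'evaluation started'"
}

# Without strict mode the failure is ignored and the next step still runs.
without_strict=$(run_steps ":")

# With strict mode the child exits at the failing step, so nothing is printed.
with_strict=$(run_steps "set -euo pipefail" || true)

echo "without strict mode: '$without_strict'"
echo "with strict mode:    '$with_strict'"
```

The second variable stays empty because the child shell stops at `false`, which is the fail-fast behavior the comment asks for.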

Copilot · 2026-04-08T12:21:05Z

+    "Sculpt-AI/2604071-rebuttal-long-context"
+)
+for model in "${local_gim_models[@]}"; do
+    python -m gimbench.cv.cv_parse --model_type vllm-offline --model_name $model --output_type cfg --api_key $API_KEY --base_url $API_BASE


Unquoted variable expansions here can cause word-splitting/globbing and make the command brittle. Quote $model, $API_KEY, and $API_BASE when passing them as arguments.

Suggested change

-    python -m gimbench.cv.cv_parse --model_type vllm-offline --model_name $model --output_type cfg --api_key $API_KEY --base_url $API_BASE
+    python -m gimbench.cv.cv_parse --model_type vllm-offline --model_name "$model" --output_type cfg --api_key "$API_KEY" --base_url "$API_BASE"
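
The effect of the missing quotes can be shown with a short sketch. `count_args` is a hypothetical helper that prints how many arguments it received, and the model value with spaces is invented for illustration; the real model name in this PR contains no spaces, so the unquoted form happens to work there.

```shell
#!/bin/bash
# Hypothetical helper: print the number of arguments received.
count_args() { echo "$#"; }

# Invented value containing spaces, to make word splitting visible.
model="org/name with space"

unquoted=$(count_args --model_name $model)     # splits into three words
quoted=$(count_args --model_name "$model")     # stays a single argument

echo "unquoted: $unquoted arguments"   # prints "unquoted: 4 arguments"
echo "quoted:   $quoted arguments"     # prints "quoted:   2 arguments"
```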

Copilot · 2026-04-08T12:21:05Z

+    python -m gimbench.cv.cv_parse --model_type vllm-offline --model_name $model --output_type cfg --api_key $API_KEY --base_url $API_BASE
+done
+
+shutdown -h +3


Unconditionally scheduling a shutdown at the end is risky when this script is run outside the intended environment (e.g., local dev/CI), and it will run even if earlier steps partially fail unless strict mode is enabled. Consider gating this behind an explicit env flag (e.g., AUTO_SHUTDOWN=1) or removing it from the committed script.
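
A sketch of the opt-in gate, assuming the `AUTO_SHUTDOWN=1` flag named in the comment. `maybe_shutdown` and the `SHUTDOWN_CMD` override are hypothetical additions that allow a dry run without touching the machine.

```shell
#!/bin/bash
# Hypothetical helper: schedule the shutdown only when AUTO_SHUTDOWN=1.
# SHUTDOWN_CMD is an assumed override for dry runs (e.g. SHUTDOWN_CMD=echo).
maybe_shutdown() {
    if [ "${AUTO_SHUTDOWN:-0}" = "1" ]; then
        "${SHUTDOWN_CMD:-shutdown}" -h +3
    else
        echo "AUTO_SHUTDOWN is not 1; leaving the machine running"
    fi
}
```

Calling `maybe_shutdown` as the last line of the script keeps the original behavior on the evaluation host (with `AUTO_SHUTDOWN=1` exported) while making the script harmless elsewhere.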

For rebuttal, long context eval

3645f65

Copilot AI review requested due to automatic review settings April 8, 2026 12:16

Copilot started reviewing on behalf of Ki-Seki April 8, 2026 12:18

Copilot AI reviewed Apr 8, 2026

Ki-Seki merged commit 6e301f9 into feat/eval-results Apr 9, 2026
5 of 6 checks passed

Ki-Seki deleted the rebuttal/long-context-eval branch April 9, 2026 06:21
