For rebuttal, long context eval #92

Merged
Ki-Seki merged 1 commit into feat/eval-results from rebuttal/long-context-eval on Apr 9, 2026

Conversation

Ki-Seki commented on Apr 8, 2026

No description provided.

Copilot AI review requested due to automatic review settings April 8, 2026 12:16

Copilot AI left a comment

Pull request overview

Adds long-context CV parsing evaluation artifacts for the KDD rebuttal, including a reproducible eval script and aggregated metrics output.

Changes:

  • Added aggregated_results_cv.csv containing aggregated CV-parse evaluation metrics and run metadata.
  • Added a shell script to download the model and run gimbench.cv.cv_parse for the long-context rebuttal setting.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File | Description
results/260408-kdd-rebuttal-cv-long-context/aggregated_results_cv.csv | Adds aggregated CV evaluation row with config/run metadata and accuracy metrics.
results/260408-kdd-rebuttal-cv-long-context/*eval.sh | Adds a runnable script to download the model and execute the CV parse evaluation.

Comment on lines +3 to +6
set -x

export API_KEY=XXX
export API_BASE=https://openrouter.ai/api/v1

Copilot AI Apr 8, 2026

set -x echoes the fully expanded python command line, including the value passed to --api_key, so the API key can leak into terminal and CI logs. Consider disabling xtrace around the sensitive section (e.g., set +x before invoking python, then set -x again afterward), or avoid passing secrets via CLI flags, which are also visible in process listings (prefer an environment variable read by the client, if supported).
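
The xtrace concern can be sketched as follows. This is a minimal illustration, not the script from the PR: `fake_eval` and the placeholder key are hypothetical stand-ins for the real `python -m gimbench.cv.cv_parse` call and its secret. It shows tracing switched off only around the sensitive command and re-enabled afterward.

```shell
#!/bin/bash
# Hypothetical stand-in for the real evaluation command.
fake_eval() { echo "evaluation ran with $# arguments"; }

run_demo() {
  local api_key="placeholder-secret"   # illustrative value only, not a real key
  set -x
  echo "non-sensitive setup step"      # traced to stderr
  { set +x; } 2>/dev/null              # stop tracing and hide the 'set +x' trace line itself
  fake_eval --api_key "$api_key"       # not traced, so the key never reaches stderr
  set -x
  echo "tracing resumed"               # traced again
  { set +x; } 2>/dev/null
}

run_demo
```

The `{ set +x; } 2>/dev/null` form is used so that the command disabling the trace does not itself appear in the log.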

Copilot uses AI. Check for mistakes.
Comment on lines +5 to +6
export API_KEY=XXX
export API_BASE=https://openrouter.ai/api/v1

Copilot AI Apr 8, 2026

Avoid hardcoding API_KEY in the script (even as XXX), since it encourages editing/committing secrets and makes the script non-reproducible across environments. Prefer reading API_KEY from the environment and failing fast with a clear message if it is unset (and similarly quote API_BASE).
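
One way to read the key from the environment and fail fast is sketched below. The helper name `require_env` is hypothetical and not part of the PR; the default for `API_BASE` is the URL from the script under review.

```shell
#!/bin/bash
# Hypothetical helper: returns non-zero, with a message on stderr,
# if the named variable is unset or empty.
require_env() {
  local name="$1"
  if [ -z "${!name:-}" ]; then
    echo "error: $name must be set in the environment" >&2
    return 1
  fi
}

# In a real script this would be: require_env API_KEY || exit 1
# A terser alternative is the "${API_KEY:?message}" expansion, which makes
# a non-interactive shell exit when the variable is unset or empty.
if ( unset API_KEY; require_env API_KEY ) 2>/dev/null; then
  echo "API_KEY present"
else
  echo "missing API_KEY detected"
fi

# Keep the endpoint overridable, falling back to the value from the PR.
API_BASE="${API_BASE:-https://openrouter.ai/api/v1}"
```

With this shape, the script carries no key value at all, so there is nothing to edit or accidentally commit.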

@@ -0,0 +1,17 @@
#!/bin/bash

Copilot AI Apr 8, 2026

This script doesn’t enable strict error handling, so failures from hf download or the evaluation run may be missed while the script continues (potentially followed by shutdown). Consider adding set -euo pipefail (and optionally IFS=$'\n\t') near the top so the run fails fast on errors/unset variables.

Suggested change
+ set -euo pipefail
+ IFS=$'\n\t'
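
What each strict-mode flag changes can be checked with a small experiment. The sketch below runs each case in a child `bash -c` process so that a failure does not end the demonstration itself; the variable name `GIM_DEMO_UNSET_VAR` is made up for illustration and is assumed to be unset.

```shell
#!/bin/bash
# pipefail: without it, a pipeline reports only the status of its last command.
status_default=0
bash -c 'false | true' || status_default=$?
status_pipefail=0
bash -c 'set -o pipefail; false | true' || status_pipefail=$?

# nounset (-u): expanding an unset variable becomes an error.
status_nounset=0
bash -c 'set -u; echo "$GIM_DEMO_UNSET_VAR"' 2>/dev/null || status_nounset=$?

# errexit (-e): the child stops at the first failing command,
# so the echo after 'false' never runs and nothing is captured.
errexit_output=$(bash -c 'set -e; false; echo "unreachable"' || true)

echo "default=$status_default pipefail=$status_pipefail nounset=$status_nounset"
echo "errexit output: '$errexit_output'"
```

In the script under review, the practical effect would be that a failed `hf download` or evaluation run stops the script before the later steps execute.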

"Sculpt-AI/2604071-rebuttal-long-context"
)
for model in "${local_gim_models[@]}"; do
python -m gimbench.cv.cv_parse --model_type vllm-offline --model_name $model --output_type cfg --api_key $API_KEY --base_url $API_BASE

Copilot AI Apr 8, 2026

Unquoted variable expansions here can cause word-splitting/globbing and make the command brittle. Quote $model, $API_KEY, and $API_BASE when passing them as arguments.

Suggested change
- python -m gimbench.cv.cv_parse --model_type vllm-offline --model_name $model --output_type cfg --api_key $API_KEY --base_url $API_BASE
+ python -m gimbench.cv.cv_parse --model_type vllm-offline --model_name "$model" --output_type cfg --api_key "$API_KEY" --base_url "$API_BASE"
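
The word-splitting behavior behind this suggestion can be seen in a minimal sketch. `count_args` and the sample value are hypothetical, and the default `IFS` (space, tab, newline) is set explicitly so the result does not depend on the caller's settings.

```shell
#!/bin/bash
IFS=$' \t\n'   # default field separators, set explicitly for the demonstration

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

value="two words"                    # illustrative value containing a space

unquoted=$(count_args $value)        # the shell splits the value into 2 arguments
quoted=$(count_args "$value")        # the value is passed as 1 argument

echo "unquoted=$unquoted quoted=$quoted"
```

A model name or URL without spaces happens to survive unquoted, which is why the original line works today and would only break on an unusual value.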

python -m gimbench.cv.cv_parse --model_type vllm-offline --model_name $model --output_type cfg --api_key $API_KEY --base_url $API_BASE
done

shutdown -h +3

Copilot AI Apr 8, 2026

Unconditionally scheduling a shutdown at the end is risky when this script is run outside the intended environment (e.g., local dev/CI), and it will run even if earlier steps partially fail unless strict mode is enabled. Consider gating this behind an explicit env flag (e.g., AUTO_SHUTDOWN=1) or removing it from the committed script.
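
Gating the shutdown behind a flag might look like the sketch below. The function name `maybe_shutdown` is hypothetical, and `AUTO_SHUTDOWN` is the flag name floated in the comment above, not an existing setting. For safety, the example call uses `echo` as a dry run instead of executing `shutdown`.

```shell
#!/bin/bash
# Hypothetical helper: runs the given command only when AUTO_SHUTDOWN=1.
maybe_shutdown() {
  if [ "${AUTO_SHUTDOWN:-0}" = "1" ]; then
    "$@"   # the command supplied by the caller, e.g. shutdown -h +3
  else
    echo "AUTO_SHUTDOWN is not set to 1; skipping: $*" >&2
  fi
}

# Dry run: 'echo' stands in for actually executing the command.
maybe_shutdown echo shutdown -h +3
```

On the intended machine the call would be `AUTO_SHUTDOWN=1 ./script.sh` with `maybe_shutdown shutdown -h +3` as the last line, where `script.sh` is a placeholder name; anywhere else the default leaves the host running.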

Ki-Seki merged commit 6e301f9 into feat/eval-results on Apr 9, 2026
5 of 6 checks passed
Ki-Seki deleted the rebuttal/long-context-eval branch on April 9, 2026, 06:21