For rebuttal, long context eval#92
Conversation
There was a problem hiding this comment.
Pull request overview
Adds long-context CV parsing evaluation artifacts for the KDD rebuttal, including a reproducible eval script and aggregated metrics output.
Changes:
- Added
aggregated_results_cv.csvcontaining aggregated CV-parse evaluation metrics and run metadata. - Added a shell script to download the model and run
gimbench.cv.cv_parsefor the long-context rebuttal setting.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| results/260408-kdd-rebuttal-cv-long-context/aggregated_results_cv.csv | Adds aggregated CV evaluation row with config/run metadata and accuracy metrics. |
| results/260408-kdd-rebuttal-cv-long-context/*eval.sh | Adds a runnable script to download the model and execute the CV parse evaluation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| set -x | ||
|
|
||
| export API_KEY=XXX | ||
| export API_BASE=https://openrouter.ai/api/v1 |
There was a problem hiding this comment.
set -x will echo the full python command line, which includes --api_key $API_KEY; this can leak the API key into logs/CI output and shell history. Consider disabling xtrace for the sensitive section (e.g., set +x before invoking python and re-enable after), or avoid passing secrets via CLI flags (prefer env var expected by the client if supported).
| export API_KEY=XXX | ||
| export API_BASE=https://openrouter.ai/api/v1 |
There was a problem hiding this comment.
Avoid hardcoding API_KEY in the script (even as XXX), since it encourages editing/committing secrets and makes the script non-reproducible across environments. Prefer reading API_KEY from the environment and failing fast with a clear message if it is unset (and similarly quote API_BASE).
| @@ -0,0 +1,17 @@ | |||
| #!/bin/bash | |||
|
|
|||
There was a problem hiding this comment.
This script doesn’t enable strict error handling, so failures from hf download or the evaluation run may be missed while the script continues (potentially followed by shutdown). Consider adding set -euo pipefail (and optionally IFS=$'\n\t') near the top so the run fails fast on errors/unset variables.
| set -euo pipefail | |
| IFS=$'\n\t' |
| "Sculpt-AI/2604071-rebuttal-long-context" | ||
| ) | ||
| for model in "${local_gim_models[@]}"; do | ||
| python -m gimbench.cv.cv_parse --model_type vllm-offline --model_name $model --output_type cfg --api_key $API_KEY --base_url $API_BASE |
There was a problem hiding this comment.
Unquoted variable expansions here can cause word-splitting/globbing and make the command brittle. Quote $model, $API_KEY, and $API_BASE when passing them as arguments.
| python -m gimbench.cv.cv_parse --model_type vllm-offline --model_name $model --output_type cfg --api_key $API_KEY --base_url $API_BASE | |
| python -m gimbench.cv.cv_parse --model_type vllm-offline --model_name "$model" --output_type cfg --api_key "$API_KEY" --base_url "$API_BASE" |
| python -m gimbench.cv.cv_parse --model_type vllm-offline --model_name $model --output_type cfg --api_key $API_KEY --base_url $API_BASE | ||
| done | ||
|
|
||
| shutdown -h +3 |
There was a problem hiding this comment.
Unconditionally scheduling a shutdown at the end is risky when this script is run outside the intended environment (e.g., local dev/CI), and it will run even if earlier steps partially fail unless strict mode is enabled. Consider gating this behind an explicit env flag (e.g., AUTO_SHUTDOWN=1) or removing it from the committed script.
No description provided.