For rebuttal, AR baseline by Ki-Seki · Pull Request #95 · SculptAI/GIMBench

Ki-Seki · 2026-04-09T05:49:35Z

No description provided.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds AR-baseline evaluation artifacts for a KDD rebuttal by committing aggregated metrics and the script used to run the evaluations.

Changes:

Added aggregated PPL results CSV for AR baseline runs (cfg/json output types).
Added aggregated regex-match results CSV for AR baseline runs (cfg/json output types).
Added a bash script to download the model and run the evaluation commands.

Reviewed changes

Copilot reviewed 3 out of 7 changed files in this pull request and generated 4 comments.

File	Description
results/260409-kdd-rebuttal-ar-baseline/aggregated_results_ppl.csv	Stores aggregated PPL metrics and run metadata for AR baseline.
results/260409-kdd-rebuttal-ar-baseline/aggregated_results_match.csv	Stores aggregated regex-match metrics and run metadata for AR baseline.
results/260409-kdd-rebuttal-ar-baseline/*eval.sh	Script to reproduce the AR baseline evaluations and generate the result artifacts.

Copilot · 2026-04-09T05:52:58Z

@@ -0,0 +1,3 @@
+filename,eval_env,evaluator_type,args,start_time,end_time,elapsed_minutes,total,evaluates,errors,avg_lwnp,avg_rwnp,avg_wnp,avg_inp,avg_query_tags,avg_result_tags,avg_infilling_ratio,avg_query_len,avg_response_len
+Sculpt-AI_2604081-ar-baseline_Sculpt-AI_GIM-SFT_260409-110733.json,"{'exec_command': '/root/autodl-tmp/GIMBench/.venv/bin/python /root/autodl-tmp/GIMBench/src/gimbench/ppl/gim_sft.py --model_type vllm-offline --model_name Sculpt-AI/2604081-ar-baseline --use_gim_prompt --output_type cfg --ref_model_device cpu --first_n 100', 'gimbench_version': '0.4.0', 'gimbench_file': '/root/autodl-tmp/GIMBench/src/gimbench/__init__.py', 'gimkit_version': '0.1.1', 'gimkit_file': '/root/autodl-tmp/GIMBench/.venv/lib/python3.13/site-packages/gimkit/__init__.py', 'git_repo': '/root/autodl-tmp/GIMBench', 'git_branch': 'kdd/rebuttal-ar', 'git_commit_id': '1378edc39a301feac83897e71952e2d4dfb99599'}",ppl,"{'use_gim_prompt': True, 'output_type': 'cfg', 'model_type': 'vllm-offline', 'model_name': 'Sculpt-AI/2604081-ar-baseline', 'api_key': '', 'base_url': 'http://localhost:8000/v1', 'max_model_len': 8192, 'temperature': 0.0, 'top_p': 1.0, 'presence_penalty': 1.0, 'max_tokens': 8192, 'seed': 16, 'first_n': 100, 'num_proc': 1, 'output_dir': 'results', 'ref_model_name': 'google/gemma-3-270m', 'ref_model_device': 'cpu', 'norm_ppl_alpha': 0.2, 'ppl_window_k': 16, 'golden_truth_only': False, 'no_gimkit': False, 'reason_budget': 0, 'auto_budget': False, 'auto_budget_prompt': ""I'll show you a couple of questions. Decide how many reasoning steps are needed to answer each accurately.\n\nConsider a plausible reasoning workflow first (you may use reasoning, reflection, trial and error, and parallel thinking by applying different approaches, plus a quick verification if needed). Then output a step budget (where each step is an atomic reasoning action taking 3–5 sentences) that allows for granular, step-by-step derivation without skipping logic, ensuring a robust and high-confidence conclusion;leave extra headroom for cross-checking and possible revision on multi-hop or tricky questions.\n\n## Question: {question}\n\nDo not be anchored by the examples above. 
Scale your step budget linearly with the difficulty. For complex problems, you are encouraged to assign a high budget (20, or more) to ensure there is enough room for step-by-step derivation and verification.\n\n"", 'reason_step_desc': 'A distinct, verified reasoning step building on the previous one. Write 2–3 substantial sentences (60–80 words each) to ensure depth.', 'counter_tokenizer': 'Qwen/Qwen3-4B-Instruct-2507', 'use_outlines': False, 'judge_model_name': 'google/gemini-3-flash-preview', 'dataset': {'path': 'Sculpt-AI/GIM-SFT', 'name': ['gsm8k_reasoning', 'hk_o1aw', 'lima', 'o1_journey', 'process_bench', 'uhgeval', 'cnn_daily_mail', 'magpie_reasoning', 'kaist_cot', 'numina_math'], 'split': 'train', 'max_per_subset': 200}}",2026-04-09T11:07:33.395861,2026-04-09T11:11:42.903986,4.15846875,100,100,0,0.38244086591381604,0.3812410179826515,0.3814919037286061,0.8328623369719991,2.85,0.0,1.0,813.33,1026.06


The eval_env and args columns embed Python dict-repr strings (single quotes, True/False literals) along with heavily escaped multi-line text and absolute paths. This is brittle for downstream consumers: non-Python tooling can't parse the cells as JSON, and CSV parsers can struggle with nested quotes and newlines. Consider storing these fields as valid JSON (double-quoted keys and strings) with consistent CSV escaping, or moving the verbose metadata to sidecar .json files and keeping only a stable reference (e.g., git_commit_id, exec_command, config_hash) in the CSV.
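
One possible way to act on this suggestion, as a minimal sketch rather than the project's actual tooling: parse each dict-repr cell with ast.literal_eval and re-serialize it with json.dumps, letting the csv module handle quoting. The function names (repr_to_json, config_hash, convert_csv) are hypothetical, the 12-character SHA-256 prefix is an arbitrary choice, and the sketch assumes every eval_env and args cell is a valid Python literal.

```python
import ast
import csv
import hashlib
import io
import json


def repr_to_json(cell: str) -> str:
    """Parse a Python dict-repr string and re-serialize it as JSON."""
    # ast.literal_eval accepts only Python literals (dict, list, str, bool, ...),
    # so it does not execute arbitrary code the way eval() would.
    value = ast.literal_eval(cell)
    # sort_keys gives a stable key order, which keeps config_hash deterministic.
    return json.dumps(value, ensure_ascii=False, sort_keys=True)


def config_hash(cell_json: str) -> str:
    """Hypothetical helper: short stable identifier for a JSON-encoded config."""
    return hashlib.sha256(cell_json.encode("utf-8")).hexdigest()[:12]


def convert_csv(src_text: str, columns=("eval_env", "args")) -> str:
    """Return CSV text with the given columns rewritten as JSON strings."""
    reader = csv.DictReader(io.StringIO(src_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in columns:
            row[col] = repr_to_json(row[col])
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    # Build a tiny stand-in row; real rows carry many more columns.
    buf = io.StringIO(newline="")
    w = csv.writer(buf)
    w.writerow(["filename", "eval_env", "args"])
    w.writerow(["run.json",
                repr({"git_branch": "kdd/rebuttal-ar"}),
                repr({"seed": 16, "use_gim_prompt": True})])
    converted = convert_csv(buf.getvalue())
    row = next(csv.DictReader(io.StringIO(converted, newline="")))
    print(json.loads(row["args"]), config_hash(row["args"]))
```

After conversion, any JSON-aware consumer can read the cells directly, and the hash could serve as the stable reference if the full config is moved to a sidecar file.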

upload eval results

8ca346f

Copilot AI review requested due to automatic review settings April 9, 2026 05:49

upload

caa387d

Copilot AI reviewed Apr 9, 2026

Ki-Seki merged commit 0dac739 into feat/eval-results Apr 9, 2026
1 of 2 checks passed

Ki-Seki deleted the kdd/rebuttal-ar branch April 9, 2026 06:21

Ki-Seki merged 2 commits into feat/eval-results from kdd/rebuttal-ar
