For rebuttal, AR baseline #95

Merged
Ki-Seki merged 2 commits into feat/eval-results from kdd/rebuttal-ar
Apr 9, 2026

Conversation

Member

@Ki-Seki Ki-Seki commented Apr 9, 2026

No description provided.

Copilot AI review requested due to automatic review settings April 9, 2026 05:49
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds AR-baseline evaluation artifacts for a KDD rebuttal by committing aggregated metrics and the script used to run the evaluations.

Changes:

  • Added aggregated PPL results CSV for AR baseline runs (cfg/json output types).
  • Added aggregated regex-match results CSV for AR baseline runs (cfg/json output types).
  • Added a bash script to download the model and run the evaluation commands.

Reviewed changes

Copilot reviewed 3 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| results/260409-kdd-rebuttal-ar-baseline/aggregated_results_ppl.csv | Stores aggregated PPL metrics and run metadata for AR baseline. |
| results/260409-kdd-rebuttal-ar-baseline/aggregated_results_match.csv | Stores aggregated regex-match metrics and run metadata for AR baseline. |
| results/260409-kdd-rebuttal-ar-baseline/*eval.sh | Script to reproduce the AR baseline evaluations and generate the result artifacts. |

Comment thread results/260409-kdd-rebuttal-ar-baseline/*eval.sh
Comment thread results/260409-kdd-rebuttal-ar-baseline/*eval.sh
Comment thread results/260409-kdd-rebuttal-ar-baseline/*eval.sh
@@ -0,0 +1,3 @@
filename,eval_env,evaluator_type,args,start_time,end_time,elapsed_minutes,total,evaluates,errors,avg_lwnp,avg_rwnp,avg_wnp,avg_inp,avg_query_tags,avg_result_tags,avg_infilling_ratio,avg_query_len,avg_response_len
Sculpt-AI_2604081-ar-baseline_Sculpt-AI_GIM-SFT_260409-110733.json,"{'exec_command': '/root/autodl-tmp/GIMBench/.venv/bin/python /root/autodl-tmp/GIMBench/src/gimbench/ppl/gim_sft.py --model_type vllm-offline --model_name Sculpt-AI/2604081-ar-baseline --use_gim_prompt --output_type cfg --ref_model_device cpu --first_n 100', 'gimbench_version': '0.4.0', 'gimbench_file': '/root/autodl-tmp/GIMBench/src/gimbench/__init__.py', 'gimkit_version': '0.1.1', 'gimkit_file': '/root/autodl-tmp/GIMBench/.venv/lib/python3.13/site-packages/gimkit/__init__.py', 'git_repo': '/root/autodl-tmp/GIMBench', 'git_branch': 'kdd/rebuttal-ar', 'git_commit_id': '1378edc39a301feac83897e71952e2d4dfb99599'}",ppl,"{'use_gim_prompt': True, 'output_type': 'cfg', 'model_type': 'vllm-offline', 'model_name': 'Sculpt-AI/2604081-ar-baseline', 'api_key': '', 'base_url': 'http://localhost:8000/v1', 'max_model_len': 8192, 'temperature': 0.0, 'top_p': 1.0, 'presence_penalty': 1.0, 'max_tokens': 8192, 'seed': 16, 'first_n': 100, 'num_proc': 1, 'output_dir': 'results', 'ref_model_name': 'google/gemma-3-270m', 'ref_model_device': 'cpu', 'norm_ppl_alpha': 0.2, 'ppl_window_k': 16, 'golden_truth_only': False, 'no_gimkit': False, 'reason_budget': 0, 'auto_budget': False, 'auto_budget_prompt': ""I'll show you a couple of questions. Decide how many reasoning steps are needed to answer each accurately.\n\nConsider a plausible reasoning workflow first (you may use reasoning, reflection, trial and error, and parallel thinking by applying different approaches, plus a quick verification if needed). Then output a step budget (where each step is an atomic reasoning action taking 3–5 sentences) that allows for granular, step-by-step derivation without skipping logic, ensuring a robust and high-confidence conclusion;leave extra headroom for cross-checking and possible revision on multi-hop or tricky questions.\n\n## Question: {question}\n\nDo not be anchored by the examples above. 
Scale your step budget linearly with the difficulty. For complex problems, you are encouraged to assign a high budget (20, or more) to ensure there is enough room for step-by-step derivation and verification.\n\n"", 'reason_step_desc': 'A distinct, verified reasoning step building on the previous one. Write 2–3 substantial sentences (60–80 words each) to ensure depth.', 'counter_tokenizer': 'Qwen/Qwen3-4B-Instruct-2507', 'use_outlines': False, 'judge_model_name': 'google/gemini-3-flash-preview', 'dataset': {'path': 'Sculpt-AI/GIM-SFT', 'name': ['gsm8k_reasoning', 'hk_o1aw', 'lima', 'o1_journey', 'process_bench', 'uhgeval', 'cnn_daily_mail', 'magpie_reasoning', 'kaist_cot', 'numina_math'], 'split': 'train', 'max_per_subset': 200}}",2026-04-09T11:07:33.395861,2026-04-09T11:11:42.903986,4.15846875,100,100,0,0.38244086591381604,0.3812410179826515,0.3814919037286061,0.8328623369719991,2.85,0.0,1.0,813.33,1026.06
Copilot AI Apr 9, 2026

The eval_env and args columns embed Python dict-repr strings (single quotes) plus heavily-escaped multi-line text and absolute paths. This tends to be brittle for downstream consumers (e.g., non-Python tooling can’t parse it as JSON; CSV parsers can struggle with nested quotes/newlines). Consider storing these fields as valid JSON (double-quoted keys/strings) with consistent CSV escaping, or moving verbose metadata to sidecar .json files and keeping only a stable reference (e.g., git_commit_id, exec_command, config_hash) in the CSV.
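One way to act on this suggestion, sketched under assumptions: the cells in `eval_env` and `args` are valid Python literals, so `ast.literal_eval` can read them and `json.dumps` can rewrite them as JSON. The helper names (`dict_repr_to_json`, `config_hash`, `convert_rows`), the 12-character hash length, and the toy sample row are hypothetical and are not part of GIMBench.

```python
import ast
import csv
import hashlib
import io
import json


def dict_repr_to_json(cell: str) -> str:
    """Turn a Python dict-repr string into canonical, double-quoted JSON."""
    value = ast.literal_eval(cell)  # evaluates literals only, never arbitrary code
    return json.dumps(value, sort_keys=True, ensure_ascii=False)


def config_hash(json_text: str) -> str:
    """Short fingerprint of the canonical JSON (hypothetical column, 12 hex chars)."""
    return hashlib.sha256(json_text.encode("utf-8")).hexdigest()[:12]


def convert_rows(src_text: str, columns=("eval_env", "args")) -> list[dict]:
    """Re-encode the dict-repr columns as JSON and add a config_hash per row."""
    reader = csv.DictReader(io.StringIO(src_text))
    rows = []
    for row in reader:
        for col in columns:
            row[col] = dict_repr_to_json(row[col])
        row["config_hash"] = config_hash(row["args"])
        rows.append(row)
    return rows


# Toy input with the same shape as the committed CSV (assumed, heavily shortened).
sample = (
    "filename,eval_env,evaluator_type,args\n"
    "run.json,\"{'git_branch': 'kdd/rebuttal-ar'}\",ppl,"
    "\"{'output_type': 'cfg', 'first_n': 100, 'use_gim_prompt': True}\"\n"
)

rows = convert_rows(sample)
args = json.loads(rows[0]["args"])  # now readable by any JSON parser
print(args["output_type"], args["first_n"], args["use_gim_prompt"])
```

Writing the rows back out with `csv.DictWriter` would let the `csv` module handle quoting of the embedded JSON. For the sidecar variant, the JSON text could instead be written to a `.json` file next to the CSV, keeping only `config_hash` and `git_commit_id` in the table.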

@Ki-Seki Ki-Seki merged commit 0dac739 into feat/eval-results Apr 9, 2026
1 of 2 checks passed
@Ki-Seki Ki-Seki deleted the kdd/rebuttal-ar branch April 9, 2026 06:21
2 participants