[pull] main from pinchbench:main#12
Merged
pull[bot] merged 48 commits into Stars1233:main from pinchbench:main on Apr 14, 2026
Conversation
GWS tasks (using gws CLI):
- task_26: Email Triage (list, read, draft reply, write report)
- task_27: Cross-Service Workflow (email -> calendar event -> drive share)
- task_28: Task Management (read emails, extract action items, create tasks)

GitHub tasks (using gh CLI):
- task_29: GitHub Issue Triage (list issues/PRs, comment, write report)

Runner integration:
- lib_fws.py: start/stop the fws server for category=gws and category=github
- lib_agent.py: auto-start fws, fix transcript parsing, fix max_completion_tokens

Install: npm install -g @juppytt/fws (also requires the gws and gh CLIs)

Ref: #119
When a task requires fws, run openclaw agent with --local so that env vars (HTTPS_PROXY, SSL_CERT_FILE, GH_TOKEN, etc.) set by lib_fws.py propagate to the agent subprocess instead of being lost to the separate gateway process. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The benchmark runner already prepends assets/ to source paths, causing double-path errors (assets/assets/refactor/...) that made all models fail.
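The double-prefix failure described above suggests the source path should only be joined to the asset root when it is not already under it. A minimal sketch of that guard (the function and constant names here are hypothetical, not the runner's actual API):

```python
from pathlib import Path

ASSETS_DIR = Path("assets")  # hypothetical name for the runner's asset root

def resolve_asset(source: str) -> Path:
    """Prepend the assets/ root unless the path already starts with it,
    avoiding doubled prefixes like assets/assets/refactor/..."""
    p = Path(source)
    if p.parts and p.parts[0] == ASSETS_DIR.name:
        return p
    return ASSETS_DIR / p
```

With this guard, both "refactor/app.py" and "assets/refactor/app.py" resolve to the same "assets/refactor/app.py" path.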
- Rename task_35_multi_file_refactoring.md → task_multi_file_refactoring.md
- Update frontmatter id to task_multi_file_refactoring
- Add task_multi_file_refactoring to tasks/manifest.yaml
…tion tasks

Implements three new benchmark tasks:
- task_36: Dockerfile optimization (automated grading) - tests DevOps best practices like layer consolidation, slim base images, and cache cleanup
- task_37: Commit message writer (LLM judge) - evaluates ability to produce conventional commit messages from a diff
- task_38: README generation (LLM judge) - evaluates ability to generate accurate documentation from source code

Closes #141, closes #142, closes #144
…t days, finance report)

Implements four new benchmark tasks using assets/csvs/apple_stock_2014.csv:
- task_35: Stock trend analysis with monthly breakdown and streak detection (#208)
- task_36: Volatility analysis with quarterly comparison and annualized metrics (#214)
- task_37: Best/worst trading days with clustering and distribution analysis (#215)
- task_38: Comprehensive finance report with risk metrics (LLM judge only) (#162)

Closes #208, closes #214, closes #215, closes #162
Implement task_35_cve_security_triage with mock vulnerability scan data (10 CVEs across severity levels) and deployment context. The agent must produce a triage report with priority assignments and a remediation plan with a deploy-window timeline.

Hybrid grading: 40% automated (11 criteria) / 60% LLM judge (4 rubric dimensions). Automated checks validate priority accuracy, PCI scope awareness, and remediation plan structure.

Closes #150
Closes #292
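The 40/60 hybrid split above amounts to a weighted average of the two grading signals. A one-line sketch under the assumption that both scores are normalized to [0, 1] (the helper name is hypothetical):

```python
def hybrid_score(auto_score: float, judge_score: float) -> float:
    """Combine grading signals per the commit's 40% automated / 60% LLM-judge
    split. Assumes both inputs are already normalized to [0, 1]."""
    return 0.4 * auto_score + 0.6 * judge_score
```

For example, passing all 11 automated criteria but scoring zero on the judge rubric would yield 0.4.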
Add 4 new benchmark tasks with corresponding assets:
- task_35: CI/CD Pipeline Debug (automated, devops)
- task_36: Test Generation (hybrid, developer)
- task_37: K8s/IaC Debugging (automated, devops)
- task_38: Test Maintenance / Selector Fix (hybrid, developer)

Closes #139, #140, #149, #155
Closes #285, #286, #291
OpenClaw and Claude Code serialize tool call parameters under the 'arguments' key, while Cursor and Windsurf use 'params'. The grader only checked 'params', so read_config always scored 0 for OpenClaw/Claude Code agents even when they correctly read config.json.

Fix: fall back to 'arguments' when 'params' is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
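The fallback described in this fix can be sketched as a small accessor (the helper name is hypothetical; only the 'params' and 'arguments' key names come from the commit):

```python
def get_tool_args(call: dict) -> dict:
    """Read tool-call parameters regardless of agent serialization.
    Cursor/Windsurf put them under 'params'; OpenClaw/Claude Code
    use 'arguments'. Falls back to the latter when the former is absent."""
    args = call.get("params")
    if args is None:
        args = call.get("arguments")
    return args or {}
```

Graders that route every parameter lookup through one accessor like this avoid scoring 0 whenever a new agent picks a different serialization key.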
…enclaw_comprehension
When agents number their answers (e.g. "1. 5705", "2. 2999"), the
extract_number() helper matched the list prefix digit ("1", "2") instead
of the actual answer value, causing correct responses to score 0.
Fix: strip leading list markers (e.g. "1. ", "2) ") before scanning for
numbers.
Discovered during kimi-k2p5 benchmarking — agent formatted all 8 answers
as a numbered list, resulting in total_skills, filtered_skills,
top_category, and second_category all scoring incorrectly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
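The fix above can be sketched as a prefix-stripping pass before the numeric scan (a minimal reconstruction; the real extract_number() helper may differ in details beyond the behavior the commit describes):

```python
import re

# Matches numbered-list markers like "1. " or "2) " at the start of a line.
LIST_PREFIX = re.compile(r"^\s*\d+[.)]\s+")

def extract_number(line: str):
    """Return the first number in a line, ignoring a leading list marker
    so "1. 5705" yields 5705 rather than the list index 1."""
    line = LIST_PREFIX.sub("", line)
    m = re.search(r"-?\d+(?:\.\d+)?", line)
    return float(m.group()) if m else None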
OpenClaw and Claude Code serialize tool call parameters under 'arguments' rather than 'params'. The read_config check was always returning 0 for these agents even when the file was correctly read.

Fix: fall back to 'arguments' when 'params' is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lver

When a task's agent continued writing async responses after its transcript was archived, the new session file landed in sessions.json and was picked up by the NEXT task's transcript resolver (strategy 1b). This caused the next task to score 0%: it inherited the previous task's tail output and its actual workspace was never evaluated.

Root cause: _find_transcript_path_from_sessions_store had no started_at guard. The glob fallback (strategy 2) already filters by mtime, but strategy 1b ran first and returned the stale path unconditionally.

Fix: pass started_at into _find_transcript_path_from_sessions_store and skip any candidate whose mtime predates the task start (with 5s tolerance).

Observed in the wild: task_24_polymarket_briefing wrote polymarket_briefing.md to a new async session (2276707f) after its transcript was archived. task_25_access_log_anomaly's resolver found 2276707f via sessions.json and scored 0/5 automated checks because anomaly_report.json was never written. Re-running task_25 in isolation scored 100%.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
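The mtime guard the fix describes can be sketched as a small predicate applied to each sessions.json candidate (the function name is hypothetical; the 5-second tolerance is from the commit):

```python
import os

TOLERANCE_S = 5  # allow small clock/filesystem skew around the task start

def candidate_is_fresh(path: str, started_at: float) -> bool:
    """Reject a session file whose mtime predates the task's start time,
    so a stale session from the previous task cannot be resolved."""
    try:
        return os.path.getmtime(path) >= started_at - TOLERANCE_S
    except OSError:
        return False  # missing/unreadable candidates are treated as stale
```

Strategy 1b would call this before returning a path, mirroring the mtime filter that the glob fallback (strategy 2) already applies.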
Add GWS-powered tasks using fws mock server
Remove numeric prefixes from task files added in #172:
- task_26_gws_email_triage → task_gws_email_triage
- task_27_gws_cross_service → task_gws_cross_service
- task_28_gws_task_management → task_gws_task_management
- task_29_gh_issue_triage → task_gh_issue_triage

Also updates the id field in each task and adds them to manifest.yaml.
fix: rename GWS/GH tasks to manifest convention
Add task_35: PDF to Calendar Import
Remove numeric prefix from task file added in #296:
- task_35_pdf_to_calendar → task_pdf_to_calendar

Updates the id field and adds to manifest.yaml.
fix: rename pdf_to_calendar task to manifest convention
Add CVE/Security Triage benchmark task
Add multi-file refactoring task with mock data
chore: bump version to 2.0.0-rc1
Add Dockerfile optimization, commit message writer, and README generation tasks
Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>
Add CI/CD debug, test generation, K8s debug, and selector fix benchmark tasks
Add Apple stock 2014 CSV analysis tasks
Add an orange tasks badge and update the prose count (was outdated at 23, now 53). A new GitHub Actions workflow runs on pushes to main that touch task files, counts task_*.md, and commits the updated README.
fix(transcript): filter stale sessions by mtime to prevent cross-task contamination
fix(grader): support 'arguments' key for tool calls in task_workflow
…action fix(grader): strip list prefixes before extracting numbers in task_openclaw_comprehension
feat: add CI workflow to keep task count in README up to date
The manifest header claims a CI check verifies task files match the manifest, but this wasn't actually wired up. Now it is.

Runs scripts/lint_manifest.py, which checks:
- Every manifest entry has a corresponding .md file
- Every task_*.md file is listed in the manifest
- No duplicate entries
- Frontmatter id matches filename
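The first three of those checks can be sketched as a set comparison between manifest ids and on-disk filenames (a simplified reconstruction; the real scripts/lint_manifest.py also validates frontmatter ids and parses the YAML manifest itself):

```python
from pathlib import Path

def lint_manifest(tasks_dir: Path, manifest_ids: list[str]) -> list[str]:
    """Cross-check manifest entries against task_*.md files and report
    mismatches: missing files, unlisted files, and duplicate entries."""
    errors = []
    files = {p.stem for p in tasks_dir.glob("task_*.md")}
    listed = set(manifest_ids)
    if len(manifest_ids) != len(listed):
        errors.append("duplicate manifest entries")
    for missing in sorted(listed - files):
        errors.append(f"manifest entry has no file: {missing}")
    for unlisted in sorted(files - listed):
        errors.append(f"file not in manifest: {unlisted}")
    return errors
```

A CI job would run this and fail the build when the returned error list is non-empty.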
ci: add manifest lint check
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)