[pull] main from pinchbench:main#12

Merged
pull[bot] merged 48 commits into Stars1233:main from pinchbench:main
Apr 14, 2026
@pull pull Bot commented Apr 14, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

juppytt and others added 30 commits April 8, 2026 17:05
GWS tasks (using gws CLI):
- task_26: Email Triage (list, read, draft reply, write report)
- task_27: Cross-Service Workflow (email -> calendar event -> drive share)
- task_28: Task Management (read emails, extract action items, create tasks)

GitHub tasks (using gh CLI):
- task_29: GitHub Issue Triage (list issues/PRs, comment, write report)

Runner integration:
- lib_fws.py: start/stop fws server for category=gws and category=github
- lib_agent.py: auto-start fws, fix transcript parsing, fix max_completion_tokens

Install: npm install -g @juppytt/fws (also requires gws and gh CLIs)

Ref: #119
When a task requires fws, run openclaw agent with --local so that
env vars (HTTPS_PROXY, SSL_CERT_FILE, GH_TOKEN, etc.) set by
lib_fws.py propagate to the agent subprocess instead of being
lost to the separate gateway process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement benchmark task for renaming a function across 4 Python files,
testing cross-file refactoring of definitions, imports, call sites,
string literals, comments, and docstrings.
The benchmark runner already prepends assets/ to source paths, causing
double-path errors (assets/assets/refactor/...) that made all models fail.
- Rename task_35_multi_file_refactoring.md → task_multi_file_refactoring.md
- Update frontmatter id to task_multi_file_refactoring
- Add task_multi_file_refactoring to tasks/manifest.yaml
…tion tasks

Implements three new benchmark tasks:
- task_36: Dockerfile optimization (automated grading) - tests DevOps
  best practices like layer consolidation, slim base images, and cache
  cleanup
- task_37: Commit message writer (LLM judge) - evaluates ability to
  produce conventional commit messages from a diff
- task_38: README generation (LLM judge) - evaluates ability to generate
  accurate documentation from source code

Closes #141, closes #142, closes #144
…t days, finance report)

Implements four new benchmark tasks using assets/csvs/apple_stock_2014.csv:

- task_35: Stock trend analysis with monthly breakdown and streak detection (#208)
- task_36: Volatility analysis with quarterly comparison and annualized metrics (#214)
- task_37: Best/worst trading days with clustering and distribution analysis (#215)
- task_38: Comprehensive finance report with risk metrics (LLM judge only) (#162)

Closes #208, closes #214, closes #215, closes #162
Implement task_35_cve_security_triage with mock vulnerability scan data
(10 CVEs across severity levels) and deployment context. The agent must
produce a triage report with priority assignments and a remediation plan
with deploy window timeline.

Hybrid grading: 40% automated (11 criteria) / 60% LLM judge (4 rubric
dimensions). Automated checks validate priority accuracy, PCI scope
awareness, and remediation plan structure.
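
The 40/60 split above can be illustrated with a small weighted-average sketch. This is not the benchmark's grader: the function name `hybrid_score` is hypothetical, and it assumes each automated criterion counts equally and each rubric dimension is already normalized to the range 0 to 1.

```python
from typing import Sequence


def hybrid_score(
    automated_passed: int,
    automated_total: int,
    judge_scores: Sequence[float],
    automated_weight: float = 0.4,
) -> float:
    """Blend an automated pass rate with a mean LLM-judge score.

    Assumptions (not confirmed by the source): criteria are equally
    weighted, and every judge score lies in [0, 1].
    """
    if automated_total <= 0 or not judge_scores:
        raise ValueError("need at least one criterion and one judge score")
    automated = automated_passed / automated_total
    judge = sum(judge_scores) / len(judge_scores)
    # The judge weight is whatever remains after the automated weight (60% here).
    return automated_weight * automated + (1.0 - automated_weight) * judge
```

Under these assumptions, a run that passes all 11 automated criteria but scores 0 on every rubric dimension lands at 0.4 overall.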

Closes #150
Closes #292
Add 4 new benchmark tasks with corresponding assets:
- task_35: CI/CD Pipeline Debug (automated, devops)
- task_36: Test Generation (hybrid, developer)
- task_37: K8s/IaC Debugging (automated, devops)
- task_38: Test Maintenance / Selector Fix (hybrid, developer)

Closes #139, #140, #149, #155
Closes #285, #286, #291
OpenClaw and Claude Code serialize tool call parameters under the
'arguments' key, while Cursor and Windsurf use 'params'. The grader
only checked 'params', so read_config always scored 0 for OpenClaw/
Claude Code agents even when they correctly read config.json.

Fix: fall back to 'arguments' when 'params' is absent.
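
A minimal sketch of the fallback described above. The helper names and the `path` parameter key are hypothetical, since the grader's real structure isn't shown here; only the `params` and `arguments` keys come from the commit message.

```python
from typing import Any, Iterable, Mapping


def tool_call_params(tool_call: Mapping[str, Any]) -> dict:
    """Return a tool call's parameters, whichever key the agent used."""
    params = tool_call.get("params")
    if params is None:
        # Some agents serialize parameters under 'arguments' instead.
        params = tool_call.get("arguments")
    return dict(params) if isinstance(params, Mapping) else {}


def read_config_called(tool_calls: Iterable[Mapping[str, Any]]) -> bool:
    """True if any call targets config.json ('path' is an assumed key name)."""
    for call in tool_calls:
        path = str(tool_call_params(call).get("path", ""))
        if path.endswith("config.json"):
            return True
    return False
```

Checking `params` first keeps the existing behavior unchanged for agents that already use that key.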

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…enclaw_comprehension

When agents number their answers (e.g. "1. 5705", "2. 2999"), the
extract_number() helper matched the list prefix digit ("1", "2") instead
of the actual answer value, causing correct responses to score 0.

Fix: strip leading list markers (e.g. "1. ", "2) ") before scanning for
numbers.
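
One way the prefix stripping could look, as a sketch only. `extract_answer_number` is a hypothetical stand-in, not the real `extract_number()` helper, whose signature and regexes are not shown in this PR.

```python
import re
from typing import Optional

# A leading list marker: digits, then '.' or ')', then at least one space.
_LIST_MARKER = re.compile(r"^\s*\d+[.)]\s+")
# A plain integer or decimal number, optionally negative.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer_number(line: str) -> Optional[float]:
    """Return the first number on the line, ignoring a leading list marker."""
    stripped = _LIST_MARKER.sub("", line, count=1)
    match = _NUMBER.search(stripped)
    return float(match.group()) if match else None
```

Because the marker pattern requires whitespace after the `.` or `)`, a bare decimal such as `3.14` is not mistaken for a list prefix.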

Discovered during kimi-k2p5 benchmarking — agent formatted all 8 answers
as a numbered list, resulting in total_skills, filtered_skills,
top_category, and second_category all scoring incorrectly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OpenClaw and Claude Code serialize tool call parameters under 'arguments'
rather than 'params'. The read_config check was always returning 0 for
these agents even when the file was correctly read.

Fix: fall back to 'arguments' when 'params' is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lver

When a task's agent continued writing async responses after its transcript
was archived, the new session file landed in sessions.json and was picked
up by the NEXT task's transcript resolver (strategy 1b). This caused the
next task to score 0% — it inherited the previous task's tail output and
its actual workspace was never evaluated.

Root cause: _find_transcript_path_from_sessions_store had no started_at
guard. The glob fallback (strategy 2) already filters by mtime, but
strategy 1b ran first and returned the stale path unconditionally.

Fix: pass started_at into _find_transcript_path_from_sessions_store and
skip any candidate whose mtime predates the task start (with 5s tolerance).
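
The mtime guard can be sketched as follows. This is not the body of `_find_transcript_path_from_sessions_store`; the function name `pick_fresh_transcript` is hypothetical, and choosing the newest surviving candidate is an assumption the commit message does not confirm.

```python
from pathlib import Path
from typing import Iterable, Optional

MTIME_TOLERANCE_S = 5.0  # tolerance stated in the commit message


def pick_fresh_transcript(
    candidates: Iterable[Path], started_at: float
) -> Optional[Path]:
    """Return a transcript modified no earlier than task start minus tolerance."""
    cutoff = started_at - MTIME_TOLERANCE_S
    fresh = []
    for path in candidates:
        try:
            mtime = path.stat().st_mtime
        except OSError:
            continue  # missing or unreadable file: not a usable candidate
        if mtime >= cutoff:
            fresh.append((mtime, path))
    if not fresh:
        return None
    # Assumption: prefer the most recently modified of the remaining files.
    return max(fresh, key=lambda item: item[0])[1]
```

Returning `None` when every candidate is stale lets the caller continue to its next resolution strategy instead of scoring another task's output.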

Observed in the wild: task_24_polymarket_briefing wrote polymarket_briefing.md
to a new async session (2276707f) after its transcript was archived.
task_25_access_log_anomaly's resolver found 2276707f via sessions.json and
scored 0/5 automated checks because anomaly_report.json was never written.
Re-running task_25 in isolation scored 100%.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add GWS-powered tasks using fws mock server
Remove numeric prefixes from task files added in #172:
- task_26_gws_email_triage → task_gws_email_triage
- task_27_gws_cross_service → task_gws_cross_service
- task_28_gws_task_management → task_gws_task_management
- task_29_gh_issue_triage → task_gh_issue_triage

Also updates the id field in each task and adds them to manifest.yaml.
fix: rename GWS/GH tasks to manifest convention
Remove numeric prefix from task file added in #296:
- task_35_pdf_to_calendar → task_pdf_to_calendar

Updates the id field and adds to manifest.yaml.
fix: rename pdf_to_calendar task to manifest convention
Resolves manifest.yaml conflict by including all tasks:
- GWS tasks from #308
- PDF to calendar task from #296 (renamed to manifest convention)
- CVE security triage task from this branch

Lint passes with 41 tasks.
Add CVE/Security Triage benchmark task
olearycrew and others added 18 commits April 14, 2026 08:46
Add multi-file refactoring task with mock data
Add Dockerfile optimization, commit message writer, and README generation tasks
Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>
Add CI/CD debug, test generation, K8s debug, and selector fix benchmark tasks
Add Apple stock 2014 CSV analysis tasks
Add an orange tasks badge and update the prose count (was outdated at 23,
now 53). A new GitHub Actions workflow runs on pushes to main that touch
task files, counts task_*.md, and commits the updated README.
fix(transcript): filter stale sessions by mtime to prevent cross-task contamination
fix(grader): support 'arguments' key for tool calls in task_workflow
…action

fix(grader): strip list prefixes before extracting numbers in task_openclaw_comprehension
feat: add CI workflow to keep task count in README up to date
The manifest header claims a CI check verifies task files match the manifest,
but this wasn't actually wired up. Now it is.

Runs scripts/lint_manifest.py which checks:
- Every manifest entry has a corresponding .md file
- Every task_*.md file is listed in the manifest
- No duplicate entries
- Frontmatter id matches filename
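
The four checks listed above could be sketched like this. It is not the contents of `scripts/lint_manifest.py`: the function signature is hypothetical, and it assumes manifest entries are task ids equal to the filename without its `.md` suffix. Parsing the manifest and frontmatter is left to the caller.

```python
from collections import Counter
from typing import Dict, List


def lint_manifest(manifest_ids: List[str], task_files: Dict[str, str]) -> List[str]:
    """Return lint errors; an empty list means the manifest is consistent.

    task_files maps a filename such as 'task_x.md' to its frontmatter id.
    """
    errors: List[str] = []
    # Only task_*.md files participate in the checks.
    stems = {
        name[: -len(".md")]: frontmatter_id
        for name, frontmatter_id in task_files.items()
        if name.startswith("task_") and name.endswith(".md")
    }
    for task_id, count in Counter(manifest_ids).items():
        if count > 1:
            errors.append(f"duplicate manifest entry: {task_id}")
    for task_id in sorted(set(manifest_ids) - set(stems)):
        errors.append(f"manifest entry without file: {task_id}.md")
    for stem in sorted(set(stems) - set(manifest_ids)):
        errors.append(f"file not listed in manifest: {stem}.md")
    for stem, frontmatter_id in sorted(stems.items()):
        if frontmatter_id != stem:
            errors.append(f"frontmatter id mismatch in {stem}.md: {frontmatter_id}")
    return errors
```

A CI step would typically fail the build whenever the returned list is non-empty.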
@pull pull Bot locked and limited conversation to collaborators Apr 14, 2026
@pull pull Bot added the ⤵️ pull label Apr 14, 2026
@pull pull Bot merged commit eab0944 into Stars1233:main Apr 14, 2026
4 participants