[pull] main from pinchbench:main by pull[bot] · Pull Request #12 · Stars1233/skill

pull · 2026-04-14T17:00:32Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

GWS tasks (using gws CLI):
- task_26: Email Triage (list, read, draft reply, write report)
- task_27: Cross-Service Workflow (email -> calendar event -> drive share)
- task_28: Task Management (read emails, extract action items, create tasks)

GitHub tasks (using gh CLI):
- task_29: GitHub Issue Triage (list issues/PRs, comment, write report)

Runner integration:
- lib_fws.py: start/stop fws server for category=gws and category=github
- lib_agent.py: auto-start fws, fix transcript parsing, fix max_completion_tokens

Install: npm install -g @juppytt/fws (also requires gws and gh CLIs)

Ref: #119

When a task requires fws, run openclaw agent with --local so that env vars (HTTPS_PROXY, SSL_CERT_FILE, GH_TOKEN, etc.) set by lib_fws.py propagate to the agent subprocess instead of being lost to the separate gateway process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
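The propagation described above relies on ordinary process inheritance: a child process sees the environment it is launched with, while a process started separately does not. A minimal sketch of that mechanism follows. The helper name `run_with_env` and the proxy value are hypothetical, and this is not the code in lib_fws.py or lib_agent.py.

```python
import os
import subprocess
import sys


def run_with_env(cmd: list[str], extra_env: dict[str, str]) -> str:
    """Run cmd as a direct child with extra_env layered over the current environment.

    Hypothetical helper: it only illustrates that variables reach a direct
    subprocess, which is the property the commit message depends on.
    """
    env = {**os.environ, **extra_env}
    result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def child_sees(var: str, extra_env: dict[str, str]) -> str:
    """Ask a child Python process to print one environment variable."""
    probe = f"import os; print(os.environ.get({var!r}, ''))"
    return run_with_env([sys.executable, "-c", probe], extra_env)
```

A process that was already running before these variables were set (such as a separate gateway) would not observe them, which is why the agent has to be launched as a direct child.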

Implement benchmark task for renaming a function across 4 Python files, testing cross-file refactoring of definitions, imports, call sites, string literals, comments, and docstrings.

The benchmark runner already prepends assets/ to source paths, causing double-path errors (assets/assets/refactor/...) that made all models fail.
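The double-path failure can be reproduced in a few lines. This is a sketch under assumptions: `resolve_source` stands in for whatever the runner does internally, and `example.py` is a made-up file name (the real paths are elided in the commit message).

```python
from pathlib import PurePosixPath

ASSETS_DIR = PurePosixPath("assets")  # root the runner is said to prepend


def resolve_source(source: str) -> PurePosixPath:
    """Stand-in for the runner: always prepend assets/ to a task's source path."""
    return ASSETS_DIR / source


# A source path that already carries the prefix gets it twice.
broken = resolve_source("assets/refactor/example.py")
# Dropping the prefix in the task file yields the intended location.
fixed = resolve_source("refactor/example.py")
```

Here `broken` is `assets/assets/refactor/example.py` and `fixed` is `assets/refactor/example.py`, which is why the fix removes the prefix from the task's workspace_files entries rather than changing the runner.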

- Rename task_35_multi_file_refactoring.md → task_multi_file_refactoring.md
- Update frontmatter id to task_multi_file_refactoring
- Add task_multi_file_refactoring to tasks/manifest.yaml

…tion tasks

Implements three new benchmark tasks:
- task_36: Dockerfile optimization (automated grading) - tests DevOps best practices like layer consolidation, slim base images, and cache cleanup
- task_37: Commit message writer (LLM judge) - evaluates ability to produce conventional commit messages from a diff
- task_38: README generation (LLM judge) - evaluates ability to generate accurate documentation from source code

Closes #141, closes #142, closes #144

…t days, finance report)

Implements four new benchmark tasks using assets/csvs/apple_stock_2014.csv:
- task_35: Stock trend analysis with monthly breakdown and streak detection (#208)
- task_36: Volatility analysis with quarterly comparison and annualized metrics (#214)
- task_37: Best/worst trading days with clustering and distribution analysis (#215)
- task_38: Comprehensive finance report with risk metrics (LLM judge only) (#162)

Closes #208, closes #214, closes #215, closes #162

Implement task_35_cve_security_triage with mock vulnerability scan data (10 CVEs across severity levels) and deployment context. The agent must produce a triage report with priority assignments and a remediation plan with deploy window timeline.

Hybrid grading: 40% automated (11 criteria) / 60% LLM judge (4 rubric dimensions). Automated checks validate priority accuracy, PCI scope awareness, and remediation plan structure.

Closes #150
Closes #292

Add 4 new benchmark tasks with corresponding assets:
- task_35: CI/CD Pipeline Debug (automated, devops)
- task_36: Test Generation (hybrid, developer)
- task_37: K8s/IaC Debugging (automated, devops)
- task_38: Test Maintenance / Selector Fix (hybrid, developer)

Closes #139, #140, #149, #155
Closes #285, #286, #291

OpenClaw and Claude Code serialize tool call parameters under the 'arguments' key, while Cursor and Windsurf use 'params'. The grader only checked 'params', so read_config always scored 0 for OpenClaw/Claude Code agents even when they correctly read config.json.

Fix: fall back to 'arguments' when 'params' is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
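The fallback can be sketched as below. The function name `tool_call_params` and the shape of the tool-call dict are assumptions for illustration; only the two key names come from the commit message.

```python
from typing import Any


def tool_call_params(tool_call: dict[str, Any]) -> dict[str, Any]:
    """Return a tool call's parameters regardless of which key the agent used.

    Prefers 'params' and falls back to 'arguments' when 'params' is absent.
    Hypothetical helper, not the grader's actual code.
    """
    params = tool_call.get("params")
    if params is None:
        params = tool_call.get("arguments")
    return params or {}


# Example transcripts entries (shapes are illustrative):
cursor_style = {"name": "read_file", "params": {"path": "config.json"}}
openclaw_style = {"name": "read_file", "arguments": {"path": "config.json"}}
```

Both example calls yield `{"path": "config.json"}`, so a read_config check built on this helper would treat the two serializations the same way.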

…enclaw_comprehension

When agents number their answers (e.g. "1. 5705", "2. 2999"), the extract_number() helper matched the list prefix digit ("1", "2") instead of the actual answer value, causing correct responses to score 0.

Fix: strip leading list markers (e.g. "1. ", "2) ") before scanning for numbers.

Discovered during kimi-k2p5 benchmarking: the agent formatted all 8 answers as a numbered list, resulting in total_skills, filtered_skills, top_category, and second_category all scoring incorrectly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
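A minimal sketch of the prefix-stripping idea follows. The real extract_number() body is not shown in this PR, so `extract_answer_number` and both regular expressions are assumptions chosen to match the examples in the commit message.

```python
import re
from typing import Optional

# Leading list marker such as "1. " or "2) "; requires trailing whitespace
# so that a decimal like "3.14" is not mistaken for a marker.
_LIST_MARKER = re.compile(r"^\s*\d+[.)]\s+")
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer_number(line: str) -> Optional[float]:
    """Return the first number on the line after dropping a leading list marker."""
    stripped = _LIST_MARKER.sub("", line, count=1)
    match = _NUMBER.search(stripped)
    return float(match.group()) if match else None
```

With this sketch, `"1. 5705"` yields 5705.0 rather than 1.0, while an unnumbered answer such as `"5705"` or `"3.14"` passes through unchanged.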

OpenClaw and Claude Code serialize tool call parameters under 'arguments' rather than 'params'. The read_config check was always returning 0 for these agents even when the file was correctly read.

Fix: fall back to 'arguments' when 'params' is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…lver

When a task's agent continued writing async responses after its transcript was archived, the new session file landed in sessions.json and was picked up by the NEXT task's transcript resolver (strategy 1b). This caused the next task to score 0%: it inherited the previous task's tail output, and its actual workspace was never evaluated.

Root cause: _find_transcript_path_from_sessions_store had no started_at guard. The glob fallback (strategy 2) already filters by mtime, but strategy 1b ran first and returned the stale path unconditionally.

Fix: pass started_at into _find_transcript_path_from_sessions_store and skip any candidate whose mtime predates the task start (with 5s tolerance).

Observed in the wild: task_24_polymarket_briefing wrote polymarket_briefing.md to a new async session (2276707f) after its transcript was archived. task_25_access_log_anomaly's resolver found 2276707f via sessions.json and scored 0/5 automated checks because anomaly_report.json was never written. Re-running task_25 in isolation scored 100%.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
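The mtime guard described in the fix can be sketched as follows. `pick_fresh_transcript` is a hypothetical stand-in, not the body of _find_transcript_path_from_sessions_store; it assumes started_at is a Unix timestamp and that candidates arrive in the resolver's preferred order.

```python
import os
from typing import Iterable, Optional

STALE_TOLERANCE_S = 5.0  # tolerance quoted in the commit message


def pick_fresh_transcript(candidates: Iterable[str], started_at: float) -> Optional[str]:
    """Return the first candidate whose mtime is not older than the task start.

    Candidates modified more than STALE_TOLERANCE_S before started_at are
    treated as leftovers from an earlier task and skipped.
    """
    for path in candidates:
        try:
            mtime = os.path.getmtime(path)
        except OSError:
            continue  # file vanished or is unreadable; try the next one
        if mtime >= started_at - STALE_TOLERANCE_S:
            return path
    return None
```

Returning None rather than a stale path lets a later strategy (such as the glob fallback mentioned above) take over instead of silently grading the wrong session.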

Add GWS-powered tasks using fws mock server

Remove numeric prefixes from task files added in #172:
- task_26_gws_email_triage → task_gws_email_triage
- task_27_gws_cross_service → task_gws_cross_service
- task_28_gws_task_management → task_gws_task_management
- task_29_gh_issue_triage → task_gh_issue_triage

Also updates the id field in each task and adds them to manifest.yaml.

fix: rename GWS/GH tasks to manifest convention

Add task_35: PDF to Calendar Import

Remove numeric prefix from task file added in #296:
- task_35_pdf_to_calendar → task_pdf_to_calendar

Updates the id field and adds to manifest.yaml.

fix: rename pdf_to_calendar task to manifest convention

Resolves manifest.yaml conflict by including all tasks:
- GWS tasks from #308
- PDF to calendar task from #296 (renamed to manifest convention)
- CVE security triage task from this branch

Lint passes with 41 tasks.

Add CVE/Security Triage benchmark task

Add multi-file refactoring task with mock data

chore: bump version to 2.0.0-rc1

Add Dockerfile optimization, commit message writer, and README generation tasks

Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>

Add CI/CD debug, test generation, K8s debug, and selector fix benchmark tasks

Add Apple stock 2014 CSV analysis tasks

Add an orange tasks badge and update the prose count (was outdated at 23, now 53). A new GitHub Actions workflow runs on pushes to main that touch task files, counts task_*.md, and commits the updated README.
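The counting step of that workflow amounts to a glob over the task directory. The sketch below assumes task files sit directly in one directory and match task_*.md; the function name is hypothetical, and the README rewrite and commit steps are not shown.

```python
from pathlib import Path


def count_task_files(tasks_dir: Path) -> int:
    """Count files named task_*.md directly inside tasks_dir (no recursion)."""
    return sum(1 for p in tasks_dir.glob("task_*.md") if p.is_file())
```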

fix(transcript): filter stale sessions by mtime to prevent cross-task contamination

fix(grader): support 'arguments' key for tool calls in task_workflow

…action fix(grader): strip list prefixes before extracting numbers in task_openclaw_comprehension

feat: add CI workflow to keep task count in README up to date

The manifest header claims a CI check verifies task files match the manifest, but this wasn't actually wired up. Now it is.

Runs scripts/lint_manifest.py which checks:
- Every manifest entry has a corresponding .md file
- Every task_*.md file is listed in the manifest
- No duplicate entries
- Frontmatter id matches filename
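The four checks can be sketched as below. This is not the contents of scripts/lint_manifest.py: it assumes manifest entries are task ids equal to file stems, that the caller has already parsed manifest.yaml into a list, and that frontmatter carries an unquoted `id:` line.

```python
import re
from collections import Counter
from pathlib import Path

_ID_LINE = re.compile(r"^id:\s*(\S+)\s*$", re.MULTILINE)


def lint_manifest(manifest_ids: list[str], tasks_dir: Path) -> list[str]:
    """Return human-readable errors; an empty list means the manifest is consistent."""
    errors: list[str] = []
    on_disk = {p.stem: p for p in tasks_dir.glob("task_*.md")}

    # No duplicate entries.
    for task_id, count in Counter(manifest_ids).items():
        if count > 1:
            errors.append(f"duplicate manifest entry: {task_id}")

    # Every manifest entry has a corresponding .md file.
    for task_id in dict.fromkeys(manifest_ids):
        if task_id not in on_disk:
            errors.append(f"missing file for manifest entry: {task_id}")

    listed = set(manifest_ids)
    for stem, path in sorted(on_disk.items()):
        # Every task_*.md file is listed in the manifest.
        if stem not in listed:
            errors.append(f"file not in manifest: {path.name}")
        # Frontmatter id matches filename.
        match = _ID_LINE.search(path.read_text(encoding="utf-8"))
        if match is None or match.group(1) != stem:
            errors.append(f"frontmatter id mismatch in {path.name}")
    return errors
```

Collecting all errors before failing, rather than stopping at the first one, keeps a single CI run useful when several tasks are out of sync.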

ci: add manifest lint check

juppytt and others added 30 commits April 8, 2026 17:05

Add task_35: PDF to Calendar Import

da78dae

Fix: use generic MLK Elementary PDF instead of real school calendar

b799062

Add MLK Elementary generic school calendar PDF fixture

88a212c

Fix: reference PDF from local assets/ instead of private URL

cee0166

Fix PDF: regenerate with extractable text (raw PDF content stream, 23 Tj operators)

543fff5

Fix: rename asset to school-calendar.pdf so prompt filename matches workspace file

18dd9d9

Fix: regenerate PDF via Chrome print - renders correctly in GitHub preview

bc7c3ae

Add multi-file refactoring task with mock data (#151, #293)

b51bc9c

Implement benchmark task for renaming a function across 4 Python files, testing cross-file refactoring of definitions, imports, call sites, string literals, comments, and docstrings.

fix: remove assets/ prefix from workspace_files source paths

b08cbf6

The benchmark runner already prepends assets/ to source paths, causing double-path errors (assets/assets/refactor/...) that made all models fail.

Update multi-file refactoring task for manifest pattern

4713ec7

- Rename task_35_multi_file_refactoring.md → task_multi_file_refactoring.md
- Update frontmatter id to task_multi_file_refactoring
- Add task_multi_file_refactoring to tasks/manifest.yaml

Merge pull request #172 from juppytt/fws-gws-tasks

46a22fa

Add GWS-powered tasks using fws mock server

Merge pull request #308 from pinchbench/fix/manifest-gws-tasks

adce1fb

fix: rename GWS/GH tasks to manifest convention

Merge pull request #296 from chad-kiloclaw/task/pdf-to-calendar

3fb96cd

Add task_35: PDF to Calendar Import

fix: rename pdf_to_calendar task to manifest convention

b788fbc

Remove numeric prefix from task file added in #296:
- task_35_pdf_to_calendar → task_pdf_to_calendar

Updates the id field and adds to manifest.yaml.

Merge pull request #309 from pinchbench/fix/manifest-pdf-to-calendar

cf734b5

fix: rename pdf_to_calendar task to manifest convention

Merge main into joyous-passive and fix conflicts

c58f7fb

Resolves manifest.yaml conflict by including all tasks:
- GWS tasks from #308
- PDF to calendar task from #296 (renamed to manifest convention)
- CVE security triage task from this branch

Lint passes with 41 tasks.

Merge branch 'main' into joyous-passive

ef8581d

Merge pull request #298 from pinchbench/joyous-passive

8811641

Add CVE/Security Triage benchmark task

chore: bump version to 2.0.0-rc1

89ccd2d

olearycrew and others added 18 commits April 14, 2026 08:46

Merge branch 'main' into radial-crater

00594c3

Merge pull request #299 from pinchbench/radial-crater

40619d3

Add multi-file refactoring task with mock data

chore: bump version to 2.0.0-rc1

c32de31

Merge pull request #310 from pinchbench/chore/bump-version-2.0.0-rc1

00325ce

chore: bump version to 2.0.0-rc1

Merge branch 'main' into hurricane-flood

f2ea189

Merge pull request #300 from pinchbench/hurricane-flood

afdbe84

Add Dockerfile optimization, commit message writer, and README generation tasks

Update tasks/task_cicd_pipeline_debug.md

7ad3c35

Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>

Merge branch 'main' into thrilling-coffee

986be9c

Merge pull request #301 from pinchbench/thrilling-coffee

955a9bd

Add CI/CD debug, test generation, K8s debug, and selector fix benchmark tasks

Merge branch 'main' into jasper-crow

8e2c263

Merge pull request #302 from pinchbench/jasper-crow

d668075

Add Apple stock 2014 CSV analysis tasks

feat: add CI workflow to keep task count in README up to date

348bfdd

Add an orange tasks badge and update the prose count (was outdated at 23, now 53). A new GitHub Actions workflow runs on pushes to main that touch task files, counts task_*.md, and commits the updated README.

Merge pull request #305 from mgoulart/fix/stale-session-transcript-race

07b014c

fix(transcript): filter stale sessions by mtime to prevent cross-task contamination

Merge pull request #306 from mgoulart/fix/grader-workflow-arguments-key

816b771

fix(grader): support 'arguments' key for tool calls in task_workflow

Merge pull request #307 from mgoulart/fix/grader-openclaw-number-extraction

0322e84

fix(grader): strip list prefixes before extracting numbers in task_openclaw_comprehension

Merge pull request #311 from pinchbench/serious-hoof

41dbf7c

feat: add CI workflow to keep task count in README up to date

Merge pull request #328 from pinchbench/ci/add-manifest-lint

eab0944

ci: add manifest lint check

pull Bot locked and limited conversation to collaborators Apr 14, 2026

pull Bot added the ⤵️ pull label Apr 14, 2026

pull Bot merged commit eab0944 into Stars1233:main Apr 14, 2026

pull[bot] merged 48 commits into Stars1233:main from pinchbench:main


4 participants
