ci: raise agent audit turn limit and preserve logs #571
The Friday test-health audit hit the 30-turn cap on its first-ever run (2026-04-24) and the agent log was discarded with the self-hosted runner. Heavier recipes need more room, and the next failure should be diagnosable.

- Raise --max-turns from 30 to 50
- Switch --output-format from text to stream-json so events are emitted during the run instead of only at process exit; prefix with stdbuf -oL -eL to line-buffer the pipe
- Upload /tmp/claude-audit-log.txt and /tmp/audit-<suite>.md as an artifact (if: always(), 14-day retention) using the upload-artifact SHA already pinned in build-notebooks.yml

Signed-off-by: Andre Manoel <amanoel@nvidia.com>
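The streaming change in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the workflow's actual step: `run_with_live_log` is a hypothetical helper, the log path is a stand-in, and the example call substitutes a `printf` command for the real `claude` invocation, whose full argument list is not shown in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the workflow): run a command with
# line-buffered stdout/stderr and mirror its output into a log file
# while the command is still running.
run_with_live_log() {
  local log_file="$1"
  shift
  # stdbuf -oL -eL requests line buffering for the child's stdout/stderr,
  # so tee receives each line as it is written instead of at process exit.
  stdbuf -oL -eL "$@" 2>&1 | tee "$log_file"
  # Return the command's own exit status rather than tee's.
  return "${PIPESTATUS[0]}"
}

# Stand-in for the agent: emits two NDJSON-style lines.
run_with_live_log /tmp/example-audit-log.txt \
  printf '%s\n' '{"type":"system"}' '{"type":"result"}'
```

Note that `stdbuf` adjusts C stdio buffering, so whether it changes anything for a given program depends on how that program writes its output. Returning `${PIPESTATUS[0]}` keeps a failing command visible to the caller even though `tee` itself succeeds.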
actions/upload-artifact@v4+ rejects duplicate names within a workflow, and re-running a failed run reuses the same github.run_id. Append github.run_attempt so re-runs upload successfully instead of failing at the exact moment the artifact is most useful. Found by Codex review of #571.

Signed-off-by: Andre Manoel <amanoel@nvidia.com>
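A small sketch of why adding the attempt number avoids the collision. The exact artifact name used by the workflow is not shown here beyond the `claude-audit-log-` prefix, so the name layout and the `artifact_name` helper below are assumptions; `GITHUB_RUN_ID` and `GITHUB_RUN_ATTEMPT` are the environment variables GitHub Actions sets for `github.run_id` and `github.run_attempt`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build an artifact name that is unique per run attempt.
# The "<prefix>-<suite>-<run_id>-<run_attempt>" layout is an assumption.
artifact_name() {
  local suite="$1" run_id="$2" run_attempt="$3"
  printf 'claude-audit-log-%s-%s-%s\n' "$suite" "$run_id" "$run_attempt"
}

# The defaults only make the sketch runnable outside a workflow.
artifact_name "test-health" "${GITHUB_RUN_ID:-24880704245}" "${GITHUB_RUN_ATTEMPT:-1}"
```

A re-run keeps the same run ID but increments the attempt, so the second upload gets a different name instead of being rejected as a duplicate.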
Raise the bar for persisting the full verbose stream-json event log: we only need it when we're actually debugging a failure, and the audit report itself still lands in the step summary on success. Shrinks the window where tool inputs, read file contents, or other verbose-stream detail could end up in a 14-day artifact. Addresses the minor privacy finding from Codex review of #571.

Signed-off-by: Andre Manoel <amanoel@nvidia.com>
With --output-format stream-json the previous tail -100 of the agent log emitted raw NDJSON into the GH Actions UI summary, which is unreadable. The audit report itself (/tmp/audit-<suite>.md) already carries the human-readable payload, and the full event stream is available as an on-failure artifact, so the raw tail was redundant and worse than nothing for the summary surface.

Also rewords the fallback message to point at the artifact when no report lands (typically a failure).

Signed-off-by: Andre Manoel <amanoel@nvidia.com>
Greptile Summary
Bumps the

| Filename | Overview |
|---|---|
| .github/workflows/agentic-ci-daily.yml | Turn limit raised to 50, output switched to stream-json with stdbuf line-buffering, and a failure-gated artifact upload step added — all changes are mechanically correct and well-scoped |
Sequence Diagram
sequenceDiagram
participant GHA as GitHub Actions
participant Claude as claude CLI (stream-json)
participant Log as /tmp/claude-audit-log.txt
participant Report as /tmp/audit-{suite}.md
participant Artifact as Artifact Store
GHA->>Claude: stdbuf -oL -eL claude --max-turns 50 --output-format stream-json
Claude-->>Log: NDJSON events (via tee, line-buffered)
Claude-->>Report: audit-{suite}.md (written by agent tool calls)
Claude-->>GHA: exit code (0 = success, non-zero = failure)
alt Step succeeds
GHA->>GHA: Update runner memory (last_run stamped)
GHA->>GHA: Skip Upload agent log (if: failure() not met)
GHA->>GHA: Write job summary (report from Report)
else Step fails
GHA->>GHA: Update runner memory (last_run not stamped)
GHA->>Artifact: Upload claude-audit-log.txt + audit-{suite}.md (14-day retention)
GHA->>GHA: Write job summary (fallback message)
end
Reviews (2): Last reviewed commit: "Merge branch 'main' into andreatgretel/f..."
Review: PR #571 (ci: raise agent audit turn limit and preserve logs)
Summary
Workflow-only change to
The diff is small (+15/-14) and the changes are internally consistent: each edit is load-bearing for the failure-diagnosis goal stated in the PR.
Findings
Correctness
Potential issues / risks
Conventions
Test coverage
Security
Verdict
Approve-equivalent (no blocking issues). The change is minimal, well-scoped, and directly addresses the observed failure mode with evidence from a real dispatch run. Recommend merging; the notes above (success-path log retention, turn-headroom monitoring) are worth a follow-up issue but don't need to block this PR.
| tail -100 /tmp/claude-audit-log.txt >> "$GITHUB_STEP_SUMMARY" | ||
| echo '```' >> "$GITHUB_STEP_SUMMARY" | ||
| echo "</details>" >> "$GITHUB_STEP_SUMMARY" | ||
| echo "No report generated. See the \`claude-audit-log-*\` artifact on failures for the full event stream." >> "$GITHUB_STEP_SUMMARY" |
Misleading fallback message on success-with-no-report
The new fallback message tells users to check the `claude-audit-log-*` artifact, but the artifact upload is gated on `if: failure()`. If the audit step exits successfully yet no `/tmp/audit-${SUITE}.md` is produced (e.g. the agent ran to completion without writing the report), the summary points to an artifact that was never uploaded.
| echo "No report generated. See the \`claude-audit-log-*\` artifact on failures for the full event stream." >> "$GITHUB_STEP_SUMMARY" | |
| echo "No report generated." >> "$GITHUB_STEP_SUMMARY" |
Alternatively, the message could be conditioned on the step outcome, but reverting to the neutral original text avoids the false reference entirely.
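The alternative mentioned here, conditioning the message on the step outcome, might look roughly like this. It is a sketch under assumptions, not the workflow's code: `write_audit_summary` and its arguments are hypothetical, and in a real workflow the outcome would have to be passed in from something like `${{ steps.<id>.outcome }}` via an environment variable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the job summary, and mention the artifact only
# when the audit step failed (the only case in which it is uploaded).
write_audit_summary() {
  local report="$1" outcome="$2" summary="$3"
  if [ -s "$report" ]; then
    # A non-empty report exists: it is the human-readable payload.
    cat "$report" >> "$summary"
  elif [ "$outcome" = "failure" ]; then
    echo "No report generated. See the \`claude-audit-log-*\` artifact for the full event stream." >> "$summary"
  else
    # Success without a report: no artifact was uploaded, so do not point at one.
    echo "No report generated." >> "$summary"
  fi
}

# Stand-in values; inside a workflow the last argument would be "$GITHUB_STEP_SUMMARY".
summary_file="$(mktemp)"
write_audit_summary /tmp/nonexistent-audit-report.md success "$summary_file"
cat "$summary_file"
```

This keeps the pointer to the artifact for the failure case while falling back to the neutral text otherwise, at the cost of threading the step outcome into the summary step.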
Path: .github/workflows/agentic-ci-daily.yml
Line: 230
📋 Summary
The Friday `test-health` suite hit the 30-turn cap on its first-ever run (failure run 24880704245) and left no retrievable trace: `/tmp/claude-audit-log.txt` lives only on the self-hosted runner and the step summary came back empty. This bumps the turn budget so heavier recipes have room to finish and preserves the agent log as a workflow artifact on failures so future regressions are diagnosable.

Validated end-to-end via dispatch on this branch (run 24895909677): suite completed naturally in 34 turns (confirming the old 30-turn cap was genuinely too tight, not a loop) in ~5m27s.
🔗 Related Issue
N/A — follow-up on the agentic-ci-daily workflow introduced in #543.
🔄 Changes
- Raise `--max-turns` from 30 to 50 in the `Run audit recipe` step
- Switch `--output-format` from `text` to `stream-json` so agent events are emitted during the run instead of only at process exit; prefix the invocation with `stdbuf -oL -eL` to line-buffer the pipe
- Add an `Upload agent log` step: uploads `/tmp/claude-audit-log.txt` and `/tmp/audit-<suite>.md` as a workflow artifact on failure (`if: failure()`, 14-day retention), with a name keyed by `run_id` and `run_attempt` so re-runs do not collide. Reuses the `actions/upload-artifact@v7.0.1` SHA already pinned in `build-notebooks.yml`
- Remove the log-tail `<details>` block from the `Write job summary` step: the tail would now emit unreadable NDJSON, and the audit report itself (already in the summary) carries the human-readable payload

🧪 Testing
`make test`: N/A, workflow-only change

✅ Checklist
🔍 Attention Areas
`.github/workflows/agentic-ci-daily.yml`: the recipe is unchanged; the turn bump alone is what unblocks `test-health`. The artifact upload is gated on `if: failure()` (the log only materializes when we actually need to debug), and the name includes `run_attempt` so re-runs do not hit the upload-artifact unique-name constraint.