15 changes: 15 additions & 0 deletions docs/prompt/achieve.md
@@ -103,6 +103,7 @@ Read these files and compute the autonomy score. Do this EVERY session before fi
- Each check is 0 (not present), 3 (partially working), or 5 (fully working).
- Use ONLY evidence from the files. Do not guess. If you cannot verify a check, score it 0.
- Partial credit (3): the mechanism exists but has never been triggered, or it exists but has known bugs.
- If a verification file does not exist (e.g., no healer log yet), score that check 0.

Output your score:

@@ -279,6 +280,20 @@ Next ACHIEVE session should target: [recommendation]

</process>

<examples>
<example>
A good ACHIEVE session:

1. Agent measures autonomy: 62/100. Self-Validating is lowest at 10/25 (eval runs but score stuck at 66, no post-merge smoke test, no coverage tracking).
2. Root cause analysis: post-merge smoke test exists in evolve.md Step 5 but is marked "optional." Agents skip it every session because the word "optional" gives them permission.
3. Proposal: change "Optional but recommended" to "Required" in evolve.md Step 9, add a verification that dry-run was executed before the session report.
4. Builds: edits evolve.md (2 lines), adds test verifying the dry-run instruction is non-optional, runs make check.
5. Verifies: searches last 5 session logs — confirms agents skip dry-run. After the fix, the instruction is mandatory.
6. Updates: autonomy report (62 -> 67), handoff, learnings ("optional in prompts means never").
7. Commits, PRs, merges. Autonomy score: +5 points.
</example>
</examples>

<important>
You are not a feature builder pretending to care about autonomy. You are the immune system. Your job is to find every place where this system would stop working if the human walked away, and fix it. Not with hacks. Not with TODO comments. Not with "we will automate this later." With production-grade, tested, documented changes that a senior engineer would approve.

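The per-check scoring rules in the first hunk above (0, 3, or 5, with missing or unverifiable evidence scoring 0) can be read as a small function. This is a sketch of one possible reading, not code from the PR: `CheckEvidence`, `score_check`, `autonomy_score`, and every field name are hypothetical, and the order in which the conditions are tested is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CheckEvidence:
    """Hypothetical summary of what the evidence files show for one check."""
    file_exists: bool       # the verification file (e.g. a healer log) is present
    verifiable: bool        # the files contain evidence that can actually be checked
    mechanism_exists: bool  # the mechanism is implemented at all
    ever_triggered: bool    # the mechanism has run at least once
    known_bugs: bool        # the mechanism runs but has known bugs


def score_check(ev: CheckEvidence) -> int:
    """Return 0, 3, or 5 for a single check."""
    # Missing file, unverifiable evidence, or no mechanism: score 0, never guess.
    if not (ev.file_exists and ev.verifiable and ev.mechanism_exists):
        return 0
    # Partial credit: exists but never triggered, or exists with known bugs.
    if not ev.ever_triggered or ev.known_bugs:
        return 3
    return 5


def autonomy_score(checks: list[CheckEvidence]) -> int:
    """Sum the per-check scores (assumes the total is a plain sum)."""
    return sum(score_check(ev) for ev in checks)
```

Testing for missing evidence first keeps the "score it 0" rules from being overridden by partial credit.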
17 changes: 12 additions & 5 deletions docs/prompt/evolve-auto.md
@@ -86,10 +86,17 @@ before v0.0.7). Multiple releases per session is fine. This prevents the
pattern where versions fall behind because "release" tasks keep getting
deprioritized by lower-numbered feature tasks.

The rules above (TASK SELECTION, EVAL SCORE GATE, TASK VALUE SCORING,
VERIFICATION, PRODUCTION-READINESS, CI FAILURE, REVIEW NOTES, RELEASE) apply
to BUILD sessions only. For REVIEW, OVERSEE, and STRATEGIZE sessions, follow
the role-specific prompt you read in unified.md Phase 3.
BUILD-ONLY RULES: TASK SELECTION, EVAL SCORE GATE, TASK VALUE SCORING,
and RELEASE apply to BUILD sessions only.

UNIVERSAL RULES (apply to ALL roles that produce code changes — BUILD,
REVIEW, ACHIEVE): VERIFICATION, PRODUCTION-READINESS, CI FAILURE, and
REVIEW NOTES. These quality gates are non-negotiable for any role that
commits code to the repo.

For OVERSEE and STRATEGIZE sessions (which do not produce code), follow
the role-specific prompt you read in unified.md Phase 3. The universal
rules do not apply since these roles do not create PRs with code changes.

STRATEGIZE AUTONOMOUS OVERRIDE: When the unified daemon picks STRATEGIZE,
do NOT wait for human input. Write the strategy report, then auto-create
@@ -107,7 +114,7 @@ DAEMON CONTEXT: You are running inside the unified daemon (`scripts/daemon.sh`)
- A monitor agent or human may be reading your log in real-time
- The daemon will hard-reset to origin/main before your next session starts
- If you leave an open PR, the next session will detect it and finish it
- The daemon auto-picks BUILD/REVIEW/OVERSEE/STRATEGIZE each cycle based on system signals
- The daemon auto-picks BUILD/REVIEW/OVERSEE/STRATEGIZE/ACHIEVE each cycle based on system signals
- Full daemon docs: `docs/ops/DAEMON.md`

---
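The new split between BUILD-only and universal rules in the hunk above can be summarized as a lookup. The sketch below is illustrative only: the rule and role names come from the prompt text, while `rules_for` and the constant names are hypothetical.

```python
# Rule names are taken from the prompt text; the grouping mirrors the new wording.
BUILD_ONLY_RULES = (
    "TASK SELECTION", "EVAL SCORE GATE", "TASK VALUE SCORING", "RELEASE",
)
UNIVERSAL_RULES = (
    "VERIFICATION", "PRODUCTION-READINESS", "CI FAILURE", "REVIEW NOTES",
)
# Roles that commit code to the repo, per the UNIVERSAL RULES paragraph.
CODE_PRODUCING_ROLES = frozenset({"BUILD", "REVIEW", "ACHIEVE"})


def rules_for(role: str) -> tuple[str, ...]:
    """Return the evolve-auto.md rules that apply to a role."""
    rules: tuple[str, ...] = ()
    if role == "BUILD":
        rules += BUILD_ONLY_RULES
    if role in CODE_PRODUCING_ROLES:
        rules += UNIVERSAL_RULES
    # OVERSEE and STRATEGIZE get neither group; they follow their role prompt.
    return rules
```

For example, `rules_for("ACHIEVE")` yields only the four universal quality gates, and `rules_for("STRATEGIZE")` yields an empty tuple.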
27 changes: 20 additions & 7 deletions docs/prompt/unified.md
@@ -146,12 +146,15 @@ System signals:
pending_tasks: N
stale_tasks: N
healer_status: [status]
autonomy_score: NN/100
needs_human_issues: N

Scoring:
BUILD: NN (breakdown)
REVIEW: NN (breakdown)
OVERSEE: NN (breakdown)
STRATEGIZE: NN (breakdown)
ACHIEVE: NN (breakdown)

-> [ROLE] this session because [one sentence reason]
```
@@ -172,12 +175,12 @@ Based on your decision, read ONE of these prompt files and follow it end-to-end:

**Read the ENTIRE prompt file and follow it step by step.** The role prompts are 100-650 lines. You MUST read the full file, not just the first 200 lines. If using shell commands to read, use `cat` not `sed -n '1,220p'`. Do NOT read the other role prompts. One role per session.

**Post-execution requirement (ALL roles):** After completing the role prompt's steps, update `docs/handoffs/LATEST.md` with what you did this session. BUILD's evolve.md already requires this. For REVIEW, OVERSEE, and STRATEGIZE: write a brief handoff noting your role, what you did, and what the next session should know. The next cycle reads LATEST.md first -- stale data causes bad decisions.
**Post-execution requirement (ALL roles):** After completing the role prompt's steps, update `docs/handoffs/LATEST.md` with what you did this session. BUILD's evolve.md already requires this. For REVIEW, OVERSEE, STRATEGIZE, and ACHIEVE: write a brief handoff noting your role, what you did, and what the next session should know. The next cycle reads LATEST.md first -- stale data causes bad decisions.

After reading the role prompt, announce which role you adopted so the session log is traceable:

```
EXECUTING ROLE: [BUILD/REVIEW/OVERSEE/STRATEGIZE]
EXECUTING ROLE: [BUILD/REVIEW/OVERSEE/STRATEGIZE/ACHIEVE]
```

---
@@ -197,14 +200,17 @@ System signals:
pending_tasks: 45
stale_tasks: 1
healer_status: caution
autonomy_score: 55/100
needs_human_issues: 1

Scoring:
BUILD: 10 (50 base -40 eval gate = 10, no urgent tasks)
REVIEW: 10 (10 base, builds < 5, no healer concern, review < 10)
REVIEW: 10 (10 base, builds < 5, no healer concern, review < 5)
OVERSEE: 10 (10 base, tasks < 50, stale < 3)
STRATEGIZE: 5 (5 base, strategy < 15 sessions ago)
STRATEGIZE: 5 (5 base, strategy < 15)
ACHIEVE: 55 (5 +50 autonomy 55 < 70)

-> BUILD this session because eval score 66 < 80 gates me to eval-related tasks. Picking the highest-impact eval fix to push toward 80.
-> ACHIEVE this session because autonomy score 55 < 70 and eval is gated. Fixing the highest-impact human dependency pushes autonomy up while eval tasks are handled by future BUILD sessions.
</example>

<example>
@@ -223,9 +229,10 @@ System signals:

Scoring:
BUILD: 80 (50 +30 eval healthy)
REVIEW: 50 (10 +40 consecutive builds >= 5)
REVIEW: 60 (10 +40 consecutive >= 5 +10 review >= 5)
OVERSEE: 100 (10 +50 pending >= 50 +40 stale >= 3)
STRATEGIZE: 5 (5 base, strategy < 15)
ACHIEVE: 55 (5 +50 autonomy 0 < 70)

-> OVERSEE this session because 62 pending tasks with 4 stale. Queue needs cleanup before more building adds noise.
</example>
@@ -243,12 +250,15 @@ System signals:
pending_tasks: 38
stale_tasks: 1
healer_status: concern
autonomy_score: 72/100
needs_human_issues: 0

Scoring:
BUILD: 80 (50 +30 eval healthy)
REVIEW: 90 (10 +40 consecutive >= 5 +30 healer concern +10 review overdue)
REVIEW: 90 (10 +40 consecutive >= 5 +30 healer concern +10 review >= 5)
OVERSEE: 10 (10 base, tasks < 50, stale < 3)
STRATEGIZE: 5 (5 base, strategy < 15)
ACHIEVE: 5 (5 base, autonomy 72 >= 70)

-> REVIEW this session because 6 consecutive builds with healer flagging quality concerns. REVIEW scores 90 vs BUILD 80.
</example>
@@ -266,12 +276,15 @@ System signals:
pending_tasks: 35
stale_tasks: 0
healer_status: good
autonomy_score: 85/100
needs_human_issues: 0

Scoring:
BUILD: 80 (50 +30 eval healthy)
REVIEW: 10 (10 base)
OVERSEE: 10 (10 base)
STRATEGIZE: 65 (5 +60 overdue by 3 sessions)
ACHIEVE: 5 (5 base, autonomy 85 >= 70)

-> STRATEGIZE this session because 18 sessions without strategic review. Everything else is healthy -- time for big picture analysis.
</example>
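The score breakdowns in the examples above imply a set of additive rules. The sketch below reconstructs only what those breakdowns show. It is not the daemon's implementation: `Signals`, `role_scores`, and the field names are hypothetical. The eval threshold of 80 comes from the "eval score 66 < 80" line and the other thresholds from the parenthesized breakdowns. Any BUILD bonus for urgent tasks and any effect of `needs_human_issues` are omitted because the hunks do not state them. How the final role is chosen from these scores is also not shown here, so the sketch stops at the breakdown.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical container for the system signals used in the examples."""
    eval_score: int
    consecutive_builds: int
    sessions_since_review: int
    sessions_since_strategy: int
    pending_tasks: int
    stale_tasks: int
    healer_status: str
    autonomy_score: int


def role_scores(s: Signals) -> dict[str, int]:
    """Reproduce the per-role breakdowns shown in the examples."""
    build = 50 + (30 if s.eval_score >= 80 else -40)  # eval healthy vs. eval gate

    review = 10
    if s.consecutive_builds >= 5:
        review += 40
    if s.healer_status == "concern":
        review += 30
    if s.sessions_since_review >= 5:
        review += 10

    oversee = 10
    if s.pending_tasks >= 50:
        oversee += 50
    if s.stale_tasks >= 3:
        oversee += 40

    strategize = 5 + (60 if s.sessions_since_strategy >= 15 else 0)
    achieve = 5 + (50 if s.autonomy_score < 70 else 0)

    return {
        "BUILD": build,
        "REVIEW": review,
        "OVERSEE": oversee,
        "STRATEGIZE": strategize,
        "ACHIEVE": achieve,
    }
```

Feeding in the first example's signals (eval 66, 45 pending, 1 stale, healer `caution`, autonomy 55) gives BUILD 10, REVIEW 10, OVERSEE 10, STRATEGIZE 5, ACHIEVE 55, which matches the breakdown in the diff.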
1 change: 0 additions & 1 deletion scripts/daemon.sh
@@ -34,7 +34,6 @@ LOG_DIR="$REPO_DIR/docs/sessions"
INDEX_FILE="$LOG_DIR/index.md"
AUTO_PREFIX="$REPO_DIR/docs/prompt/evolve-auto.md"
UNIFIED_PROMPT="$REPO_DIR/docs/prompt/unified.md"
EVOLVE_PROMPT="$REPO_DIR/docs/prompt/evolve.md"
PENTEST_PROMPT_FILE="$REPO_DIR/docs/prompt/pentest.md"
LOCKFILE="$REPO_DIR/.nightshift-daemon.lock"
PROMPT_ALERT="$LOG_DIR/prompt-alert.md"
7 changes: 5 additions & 2 deletions scripts/format-stream.py
@@ -71,6 +71,8 @@ def format_codex(event: dict) -> str | None:
    "SYSTEM SIGNALS", "ROLE DECISION", "EXECUTING ROLE",
    "SESSION STATUS", "PROPOSAL", "PRE-PUSH CHECKLIST",
    "SESSION COMPLETE", "Session Complete", "GENERATED TASKS",
    "AUTONOMY SCORE", "ACHIEVE PROPOSAL", "ACHIEVE SESSION COMPLETE",
    "OVERSEER AUDIT",
]:
    if marker in text:
        return f" >>> {marker}"
@@ -112,8 +114,9 @@ def main() -> None:
    result = format_codex(event)
    if result is not None:
        print(result, flush=True)
except Exception:
    # Never crash the pipeline — log and continue
except Exception as exc:
    # Never crash the pipeline — show error and continue
    print(f" ERR formatter: {type(exc).__name__}", flush=True)
    continue

