Recusive · fazxes · Apr 6, 2026 · Apr 6, 2026
diff --git a/docs/changelog/v0.0.8.md b/docs/changelog/v0.0.8.md
@@ -14,10 +14,13 @@ Closing the self-maintaining gap: auto-release, auto-changelog, evaluation CLI,
 
 ## Fixed
 - **[fix]** Shell scripts in `scripts/` now use ASCII-only section dividers and restart/status text, removing box-drawing and em-dash characters that violated repo conventions and rendered inconsistently across terminals/filesystems. (task #0038)
+- **[meta]** Corrected the authoritative Step 0 evaluation command in `docs/prompt/evolve.md` so fresh-clone Phractal evaluations pass `--repo-dir /tmp/nightshift-eval` from the Nightshift repo root instead of accidentally targeting the Nightshift checkout. (task `#0117`)
 
 ## Removed
 
 ## Internal
+- **[test]** Added regression coverage for the Step 0 evaluation command contract in `docs/prompt/evolve.md` and for `nightshift.evaluation.run_test_shift()` passing `--repo-dir` explicitly to the CLI.
+- **[meta]** Recorded `docs/evaluations/0009.md`; the corrected default evaluation command now starts cleanly and writes artifacts under the Phractal clone without any rerun overrides, while the existing shift-log, verification, and cleanup gaps remain.
 - **[meta]** Added `scripts/list-tasks.sh` plus `make tasks` so sessions can print the active queue in priority order and surface malformed task files instead of silently skipping them. (task `#0057`)
 - **[test]** Added 4 regression tests covering sorted task summaries, empty-queue output, malformed-task reporting, and the `make tasks` target. Test suite is now 927 passing.
 - **[meta]** Recorded another real Phractal evaluation in `docs/evaluations/0008.md`; the default Claude run still fails when `CLAUDECODE` is inherited, and the rerun reproduced the same shift-log, verification, and cleanup gaps already tracked by `#0097`-`#0102`.

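The changelog entries above describe a "prompt contract": the Step 0 command in `docs/prompt/evolve.md` must pass `--repo-dir /tmp/nightshift-eval`. The following is a minimal sketch of how such a check could look, not the actual tests added to `tests/test_nightshift.py`. The helper names `find_eval_commands` and `passes_repo_dir` are hypothetical; only the command flags are taken from the evaluation's recorded command line.

```python
import shlex

# Value taken from the recorded Step 0 command; the helpers below are hypothetical.
EXPECTED_REPO_DIR = "/tmp/nightshift-eval"


def find_eval_commands(prompt_text: str) -> list[list[str]]:
    """Return tokenized lines that invoke `-m nightshift test`."""
    commands = []
    for line in prompt_text.splitlines():
        stripped = line.strip().strip("`")  # tolerate inline-code backticks
        if "-m nightshift test" not in stripped:
            continue
        commands.append(shlex.split(stripped))
    return commands


def passes_repo_dir(tokens: list[str], expected: str = EXPECTED_REPO_DIR) -> bool:
    """True when `--repo-dir <expected>` appears as an adjacent flag/value pair."""
    return any(
        flag == "--repo-dir" and value == expected
        for flag, value in zip(tokens, tokens[1:])
    )
```

A regression test built this way would read the prompt file, assert that at least one evaluation command is found, and assert that every one of them satisfies `passes_repo_dir`, so that dropping the flag from the prompt fails the suite instead of silently retargeting the run.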
diff --git a/docs/evaluations/0009.md b/docs/evaluations/0009.md
@@ -0,0 +1,36 @@
+# Evaluation #0009
+
+**Date**: 2026-04-05
+**Target**: Phractal
+**Agent**: claude
+**Cycles**: 2
+**After task**: #0057 (Task queue summary command)
+
+## Scores
+
+| Dimension | Score | Notes |
+|-----------|-------|-------|
+| Startup | 8/10 | The corrected default command started cleanly from the Nightshift repo root, created the worktree under `/tmp/nightshift-eval`, and completed both cycles without any env/config overrides. Baseline verification still skipped because `verify_command` remained null. |
+| Discovery | 8/10 | The run found a concrete security issue in `apps/api/app/middleware/fast_auth.py`: the auth cache key hashed only a token prefix/suffix, creating collision risk across distinct JWTs. |
+| Fix quality | 6/10 | The proposed fix was tightly scoped and the verification command demonstrated the collision before/after behavior, but the work never landed as accepted output because both cycles were rejected. |
+| Shift log | 3/10 | The session now clearly targeted the Phractal clone, but the human-facing `docs/Nightshift/2026-04-05.md` file stayed on the template text while the rejected-cycle metadata still points at `Docs/Nightshift/2026-04-05.md`. |
+| State file | 8/10 | `docs/Nightshift/2026-04-05.state.json` is valid JSON and preserves both rejected cycles, the nested fix details, and the halt reason. Top-level counters still stay at zero because rejected cycles do not roll findings upward. |
+| Verification | 2/10 | Baseline verification was skipped again because `verify_command` stayed null, and post-cycle verification still failed on the existing rejected-cycle checks before any repo verification could pass. |
+| Guard rails | 8/10 | The run stayed inside the cloned worktree, respected file-count limits, avoided blocked paths, and rejected invalid cycles instead of silently accepting them. |
+| Clean state | 4/10 | The clone was left with untracked `Docs/Nightshift/` artifacts after the rejected run, but unlike the previous reruns there was no leftover temporary `.nightshift.json` override. |
+| Breadth | 5/10 | The work stayed concentrated in one backend middleware file plus the mistaken `Docs/` path, so the run still explored only a narrow slice of the repo. |
+| Usefulness | 6/10 | The runner log and nested state data are actionable, and the corrected command now proves the evaluation is scoring Phractal rather than Nightshift. The day-team-facing shift log is still a template, which limits reviewability. |
+| **Total** | **58/100** | |
+
+## Tasks Created
+
+- None new. Existing pending tasks still cover every low-scoring dimension reproduced here: `#0098`, `#0099`, `#0100`, `#0101`, and `#0102`.
+
+## Raw Evidence
+
+- Default run: `PYTHONPATH=<nightshift-repo> python3 -m nightshift test --agent claude --cycles 2 --cycle-minutes 5 --repo-dir /tmp/nightshift-eval` from the Nightshift repo root completed a scorable run without any rerun or overrides.
+- Targeting proof: the run created `/tmp/nightshift-eval/docs/Nightshift/2026-04-05.md`, `/tmp/nightshift-eval/docs/Nightshift/2026-04-05.state.json`, and `/tmp/nightshift-eval/docs/Nightshift/2026-04-05.runner.log`, confirming the evaluation targeted the cloned repo instead of the Nightshift checkout.
+- Shift log: `/tmp/nightshift-eval/docs/Nightshift/2026-04-05.md` stayed on the template text while rejected-cycle verification data recorded `Docs/Nightshift/2026-04-05.md` as the touched path.
+- State file: `/tmp/nightshift-eval/docs/Nightshift/2026-04-05.state.json` recorded both cycles as `rejected`, stored the middleware security fix details, and halted with `Failed verification threshold reached.`
+- Runner log: `/tmp/nightshift-eval/docs/Nightshift/2026-04-05.runner.log` captured fix commit `8819fa0`, shift-log commit `0c961b8`, and the final JSON reporting `apps/api/app/middleware/fast_auth.py` plus `docs/Nightshift/2026-04-05.md` as touched files.
+- Clean-state check: `git status --short` in `/tmp/nightshift-eval` showed only `?? Docs/Nightshift/` after the run.
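
The Discovery row above reports that the auth cache key hashed only a token prefix/suffix. The evaluation does not show the code in `apps/api/app/middleware/fast_auth.py`, so the sketch below is an assumed reconstruction of that class of bug, using contrived tokens and hypothetical function names (`weak_cache_key`, `full_cache_key`), together with one common remedy: hashing the whole token.

```python
import hashlib


def weak_cache_key(token: str, edge: int = 8) -> str:
    """Assumed shape of the bug: hash only the first and last `edge` characters."""
    return hashlib.sha256((token[:edge] + token[-edge:]).encode("utf-8")).hexdigest()


def full_cache_key(token: str) -> str:
    """Hash the entire token, so any differing character changes the key."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


# Contrived tokens that share an 8-character prefix and suffix but differ in the middle.
token_a = "eyJhbGci" + ".payload-for-alice." + "sigAAAA1"
token_b = "eyJhbGci" + ".payload-for-bob..." + "sigAAAA1"

weak_collides = weak_cache_key(token_a) == weak_cache_key(token_b)
full_collides = full_cache_key(token_a) == full_cache_key(token_b)
```

With these inputs the prefix/suffix key is identical for both tokens while the full-token key is not, which is the before/after behavior the Fix quality row says the verification command demonstrated.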
diff --git a/docs/handoffs/0053.md b/docs/handoffs/0053.md
@@ -0,0 +1,64 @@
+# Handoff #0053
+**Date**: 2026-04-05
+**Version**: v0.0.8 in progress
+**Session duration**: ~1h
+
+## What I Built
+- **Task #0117** (Step 0 evaluation targeting fix): corrected `docs/prompt/evolve.md` so the prescribed fresh-clone evaluation command now passes `--repo-dir /tmp/nightshift-eval` from the Nightshift repo root instead of silently targeting the Nightshift checkout.
+- **Regression hardening**: added 2 tests covering the literal Step 0 prompt contract and `nightshift.evaluation.run_test_shift()` passing `--repo-dir` explicitly.
+- **Evaluation bookkeeping + fragile-path validation**: wrote `docs/evaluations/0009.md` from a fresh-clone default Phractal run that now targets the cloned repo correctly, and generated follow-up task `#0120` after finding that `scripts/list-tasks.sh` is still not directly executable.
+- Files: `docs/prompt/evolve.md`, `tests/test_nightshift.py`, `docs/evaluations/0009.md`, `docs/tasks/0117.md`, `docs/tasks/0120.md`, `docs/tasks/.next-id`, `docs/changelog/v0.0.8.md`, `docs/vision-tracker/TRACKER.md`, `docs/learnings/2026-04-05-prompt-contracts-need-tests.md`, `docs/learnings/INDEX.md`, `docs/healer/log.md`
+- Tests: +2 new, 929 total passing
+
+## Decisions Made
+- **Treated the prompt diff as real, not malicious.** `nightshift/evaluation.py` already passed `--repo-dir`, so the problem was control-doc drift in `docs/prompt/evolve.md`; I aligned the prompt and added tests rather than changing the evaluator.
+- **Did not rerun the evaluation with overrides.** The corrected default command produced a scorable fresh-clone Phractal run on its own, so per the evaluation guide I kept the evidence from that run and skipped the second-clone override path.
+
+## Known Issues
+- Tasks `#0012`, `#0029`, and `#0032` still require integration/external resources.
+- Task `#0103` remains blocked on design; concrete follow-ups are `#0104` and `#0105`.
+- Malformed task frontmatter still weakens queue trust (`#0045` is still malformed; `#0058` and `#0064` remain the repair path).
+- Session-index fidelity is still poor: `docs/sessions/index.md` is effectively empty, and `cost_analysis('docs/sessions')` still classifies 20 of 31 sessions as `task_type=unknown`; task `#0095` remains the fix path.
+- Real Phractal evaluations still reproduce the same Loop 1 gap cluster: case-insensitive shift-log verification, missing verify-command wiring, dirty rejected-run cleanup, and rejected-run reporting/scoring gaps (`#0098`-`#0102`).
+- `scripts/list-tasks.sh` still fails on direct invocation with `permission denied`; task `#0120` now tracks the follow-up.
+- `nightshift/profiler.py` still manually constructs `NightshiftConfig` (`#0082`).
+- Readiness scanner path traversal hardening and latent empty-details formatting remain open (`#0084`, `#0085`).
+
+## Learnings Applied
+- "Default eval run before overrides" (`docs/learnings/2026-04-05-evaluation-default-run-before-overrides.md`)
+  Affects my approach: I ran the corrected Step 0 command first against a fresh Phractal clone, kept that evidence because it was already scorable, and did not spend the session on an unnecessary override rerun.
+
+## Current State
+- Loop 1: 99% — the latest evaluation now truthfully targets Phractal without overrides, but the same shift-log/verification/cleanup/reporting failures still keep real-repo runs below 60/100.
+- Loop 2: 100% — unchanged and complete.
+- Self-Maintaining: 68% — unchanged percentage; prompt/evaluation drift is tighter, but automation/queue observability gaps remain.
+- Meta-Prompt: 78% — unchanged percentage; the authoritative Step 0 prompt is now aligned with the evaluator and covered by regression tests.
+- Overall: 92% — unchanged because this session fixed control-doc drift rather than moving a tracker bucket.
+- Version: v0.0.8 — still in progress; urgent README accuracy work (`#0118`) and the outstanding self-maintaining/evaluation backlog still remain.
+
+## Tracker delta: 92% -> 92%
+
+## Evaluate
+Run evaluation against Phractal for the changes merged this session.
+
+Generated tasks:
+  Vision alignment: [last 5 target: loop1=1, loop2=0, self-maintaining=1, meta-prompt=1, none=2]
+  - `#0120`: Make scripts/list-tasks.sh directly executable or stop advertising direct invocation (dimension: repo health, vision: self-maintaining, priority: low)
+
+## Tasks I Did NOT Pick and Why
+- `#0012`, `#0029`: skipped because they remain blocked on integration/environment.
+- `#0032`: skipped because it is tagged `environment: integration`.
+- `#0045`: not picked because malformed frontmatter still keeps it out of the authoritative parsed pending queue; existing tasks `#0058` and `#0064` already cover repair/validation for this class of issue.
+- `#0103`: skipped because it is already blocked on design.
+- `#0058`, `#0060`, `#0063`, `#0064`, `#0066`, `#0067`, `#0069`, `#0071`, `#0072`, `#0073`, `#0074`, `#0076`, `#0077`, `#0078`, `#0079`, `#0081`, `#0082`, `#0084`, `#0085`, `#0088`, `#0089`, `#0090`, `#0091`, `#0092`, `#0093`, `#0094`, `#0095`, `#0097`, `#0098`, `#0099`, `#0100`, `#0101`, `#0102`, `#0104`, `#0105`, `#0106`, `#0107`, `#0108`, `#0109`, `#0110`, `#0111`, `#0112`, `#0113`, `#0114`, `#0115`, `#0116`, `#0118`, `#0119`: not picked because `#0117` was the lowest-numbered eligible urgent internal task in the authoritative queue.
+
+## Next Session Should
+Tasks: `#0118`, `#0119`
+Fallback: continue the authoritative queue with `#0063` once the urgent README/re-verification items are cleared, or reality-check `#0097` first if another evaluation shows startup behavior has changed again.
+
+## Where to Look
+- `docs/tasks/0118.md` — next authoritative urgent task
+- `docs/prompt/evolve.md` — corrected Step 0 evaluation command
+- `tests/test_nightshift.py` — regression coverage for the prompt contract and `run_test_shift()`
+- `docs/evaluations/0009.md` — fresh-clone evidence that the corrected default evaluation now targets Phractal
+- `docs/tasks/0120.md` — follow-up from fragile automation-path validation
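
The handoff lists case-insensitive shift-log verification among the open Loop 1 gaps, and the evaluation recorded `Docs/Nightshift/2026-04-05.md` where `docs/Nightshift/2026-04-05.md` was expected. The sketch below shows one way such case-only path mismatches could be surfaced. It is an illustration under assumptions, not Nightshift's verification code, and `case_only_mismatches` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def case_only_mismatches(
    expected: list[str], touched: list[str]
) -> list[tuple[str, str]]:
    """Pair each touched path with an expected path that differs only by letter case."""
    by_folded = {PurePosixPath(p).as_posix().casefold(): p for p in expected}
    mismatches = []
    for path in touched:
        match = by_folded.get(PurePosixPath(path).as_posix().casefold())
        if match is not None and match != path:
            mismatches.append((match, path))
    return mismatches


# Paths taken from the evaluation's raw evidence.
found = case_only_mismatches(
    ["docs/Nightshift/2026-04-05.md"],
    ["Docs/Nightshift/2026-04-05.md", "apps/api/app/middleware/fast_auth.py"],
)
```

Reporting these pairs explicitly, rather than treating the two spellings as the same file, keeps the check meaningful on case-sensitive filesystems, where `Docs/` and `docs/` are distinct directories.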