15 changes: 6 additions & 9 deletions .recursive/ops/DAEMON.md
@@ -55,8 +55,8 @@ bash .recursive/engine/daemon.sh claude 60 10
Arguments are:

1. agent name
-2. pause between sessions in seconds
-3. max sessions (`0` means loop forever)
+2. duration in hours (default: 8)
+3. max sessions (`0` means unlimited, use duration limit)
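
As a rough illustration of how arguments 2 and 3 might interact, the sketch below models a stop condition in Python. The function `should_continue` and its parameter names are hypothetical; the real logic lives in `daemon.sh` (a bash script) and may differ in detail. The sketch assumes the daemon stops at whichever limit is reached first.

```python
def should_continue(elapsed_hours: float, sessions_done: int,
                    duration_hours: float = 8, max_sessions: int = 0) -> bool:
    """Hypothetical model of the daemon's stop condition (not the real daemon.sh code)."""
    if elapsed_hours >= duration_hours:
        return False  # duration limit reached
    if max_sessions > 0 and sessions_done >= max_sessions:
        return False  # session cap reached; a cap of 0 disables this check
    return True
```

Under these assumptions, `daemon.sh claude 60 10` would read as "run for up to 60 hours or 10 sessions", and a third argument of `0` leaves only the duration limit in effect.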

### tmux

@@ -98,7 +98,7 @@ Valid values are `build`, `review`, `oversee`, `strategize`, `achieve`, `securit
## How Role Selection Works

At the start of every cycle, `.recursive/engine/daemon.sh` calls
-[.recursive/engine/pick-role.py](/Users/no9labs/Developer/.recursive/Nightshift/.recursive/engine/pick-role.py).
+`.recursive/engine/pick-role.py`.
That scorer reads the live system state and prints one winner.

Primary inputs:
@@ -111,8 +111,7 @@ Primary inputs:
- the latest report in `.recursive/autonomy/`
- open GitHub issues labeled `needs-human`

-The exact math belongs in
-[.recursive/ops/ROLE-SCORING.md](/Users/no9labs/Developer/.recursive/Nightshift/.recursive/ops/ROLE-SCORING.md),
+The exact math belongs in `.recursive/ops/ROLE-SCORING.md`,
not in this file. Read that file when debugging "why did the daemon pick this
role?" behavior.

@@ -205,8 +204,7 @@ These are the authoritative runtime artifacts:
| Path | Purpose |
|------|---------|
| `.recursive/sessions/index.md` | Unified session history across all roles |
-| `.recursive/sessions/*.log` | Stream-json session logs |
-| `.recursive/sessions/*-pentest.log` | Pentest preflight logs |
+| `.recursive/sessions/raw/*.log` | Stream-json session logs |
| `.recursive/sessions/costs.json` | Cost ledger used by budget checks |
| `.recursive/handoffs/LATEST.md` | Short-term memory for the next cycle |
| `.recursive/evaluations/*.md` | Real-repo evaluation reports |
@@ -321,8 +319,7 @@ resets the repo, and injects that alert into the next cycle.
The circuit breaker stops the daemon after three failed cycles. Inspect:

- `.recursive/sessions/index.md`
-- the latest session log
-- the latest pentest log
+- the latest session log in `.recursive/sessions/raw/`
- `.recursive/handoffs/LATEST.md`

### Budget stop
22 changes: 20 additions & 2 deletions .recursive/ops/OPERATIONS.md
@@ -454,7 +454,7 @@ The Python package that IS Nightshift. The overnight hardening runner.

### Dependency flow (nightshift package)
```
-core/errors → settings/eval_targets → core/types → core/constants → core/shell → raven/summary → raven/coordination → infra/module_map → owl/readiness → owl/scoring → core/state → settings/config → infra/multi → raven/e2e → raven/profiler → infra/worktree → owl/cycle → raven/planner → raven/subagent → raven/decomposer → raven/integrator → raven/feature → cli
+core/errors → core/types → core/constants → core/shell → raven/summary → raven/coordination → infra/module_map → owl/readiness → owl/scoring → core/state → settings/config → settings/eval_targets → owl/eval_runner → infra/multi → raven/e2e → raven/profiler → infra/worktree → owl/cycle → raven/planner → raven/subagent → raven/decomposer → raven/integrator → raven/feature → infra/release → cli
```
No circular imports. Each module only imports from modules to its left. `multi.py` receives the `run_nightshift` callable from `cli.py` via dependency injection to avoid circular deps.
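
The "only imports from modules to its left" rule can be checked mechanically. The following is a minimal sketch, assuming the flow is available as an ordered list and that an import map has already been collected by some other means; `find_order_violations`, `ORDER`, and `IMPORTS` are hypothetical names, and the import map is invented for the example.

```python
def find_order_violations(order, imports):
    """Return (module, dependency) pairs where the dependency is not to the module's left."""
    position = {name: i for i, name in enumerate(order)}
    violations = []
    for module, deps in imports.items():
        for dep in deps:
            unknown = dep not in position or module not in position
            if unknown or position[dep] >= position[module]:
                violations.append((module, dep))
    return violations


# Illustrative subset of the flow; the import map is made up for this example.
ORDER = ["core/errors", "core/types", "core/constants", "core/shell", "cli"]
IMPORTS = {
    "core/types": ["core/errors"],
    "core/shell": ["core/constants", "core/types"],
    "core/errors": ["core/shell"],  # deliberate violation: imports from its right
}

print(find_order_violations(ORDER, IMPORTS))
```

With this sample data the only reported pair is `("core/errors", "core/shell")`. Modules missing from the order are also reported, which would flag a new module that was never slotted into the chain.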

@@ -472,7 +472,7 @@ Note: `cleanup.py`, `compact.py`, `costs.py`, `evaluation.py`, and `config.py` (
## System 7: Tests (`nightshift/tests/`)

### What it is
-915 pytest tests covering every pure function, config, state, CLI, and integration.
+1156 pytest tests covering every pure function, config, state, CLI, and integration.

### Files
| File | Purpose |
@@ -797,6 +797,24 @@ What defines each version. Use this to know when a release is ready.
- [x] Wave integrator module (`nightshift/integrator.py`)
- [x] `nightshift build` CLI command (`nightshift/feature.py` -- build/status/resume)

+### v0.0.7 — Security Hardening (released 2026-04-05)
+- [x] Prompt injection protection for target repos
+- [x] Prompt self-modification guard across all daemon scripts
+- [x] Cost tracking and budget ceiling for daemon sessions
+- [x] Daemon log rotation and orphan branch pruning
+- [x] Automated handoff compaction in daemon
+- [x] Configurable model/effort/thinking per agent
+
+### v0.0.8 — Self-Maintaining (in progress)
+- [x] Auth-error circuit breaker bypass with notify_human
+- [x] Auto-release module (`nightshift/infra/release.py`)
+- [x] Eval runner CLI (`nightshift/owl/eval_runner.py`)
+- [x] Session index writer rewrite (single-line rows, delegation-aware counters)
+- [x] Worktree cleanup rewrite with self-removal guard
+- [x] Eval staleness signal in dashboard
+- [x] Delegation-aware sessions-since counters (signals.py + pick-role.py)
+- [ ] Wire E2E eval into daemon loop automatically
+
### v1.0.0 — Production
- [ ] Loop 1 runs reliably overnight on real repos
- [ ] Loop 2 can build a simple feature end-to-end
9 changes: 7 additions & 2 deletions .recursive/ops/ROLE-SCORING.md
@@ -60,6 +60,8 @@ used when the source file is missing or unreadable.
| `tracker_moved` | Session index -- any `%` in recent status cells | false |
| `recent_security_sessions` | Session index + archived pentest tasks (dual-signal) | 0 |
| `friction_entries` | `.recursive/friction/log.md` -- count `## YYYY-MM-DD` headers | 0 |
+| `pentest_framework_tasks` | `.recursive/tasks/` -- pending tasks with `source: pentest` AND `target: recursive` | 0 |
+| `sessions_since_eval` | `.recursive/evaluations/` vs session index -- sessions since latest eval file was written (dashboard-only, not used in scoring) | 0 |
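
To make one of the table rows concrete, here is a minimal sketch of how a signal such as `friction_entries` could be derived by counting `## YYYY-MM-DD` headers, falling back to the default of 0 when the file is missing or unreadable. The function names and the regular expression are assumptions for illustration and are not taken from `signals.py`.

```python
import re

# Matches a level-2 markdown header that starts with an ISO date, e.g. "## 2026-04-09".
DATE_HEADER = re.compile(r"^## \d{4}-\d{2}-\d{2}\b", re.MULTILINE)


def count_friction_entries(log_text: str) -> int:
    """Count dated level-2 headers in the friction log text."""
    return len(DATE_HEADER.findall(log_text))


def read_friction_entries(path: str, default: int = 0) -> int:
    """Read the log at `path`; return `default` if it is missing or unreadable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return count_friction_entries(handle.read())
    except OSError:
        return default
```

Keeping the counting logic separate from file access makes the fallback behavior easy to test without touching the real `.recursive/friction/log.md`.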

**Eval file validation**: `read_latest_eval_score()` validates the file before reading the score.
A file must have a `**Date**:` line and at least 3 scored dimension rows (`N/10` format) outside
@@ -160,9 +162,12 @@ friction_entries >= 5: +50 (lots of friction accumulated)
friction_entries >= 3
AND sessions_since_evolve >= 5: +30 (moderate friction, hasn't evolved recently)
sessions_since_evolve >= 20: +20 (overdue regardless of friction count)
+pentest_framework_tasks >= 1: +40 (confirmed security vuln in .recursive/ -- security urgency)

-Hard cap: capped at 5 if sessions_since_evolve < 5 (don't re-run too frequently)
-Hard cap: capped at 5 if friction_entries == 0 (no friction = nothing to evolve)
+Hard cap: capped at 5 if sessions_since_evolve < 5 AND pentest_framework_tasks == 0
+(don't re-run too frequently unless pentest tasks pending)
+Hard cap: capped at 5 if friction_entries == 0 AND pentest_framework_tasks == 0
+(no friction and no pentest tasks = nothing to evolve)
```
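
The updated EVOLVE rules can be read as the following Python sketch. It assumes the bonuses are additive and that scoring starts from a base of 0; both are assumptions, because this hunk shows only part of the scoring block. `evolve_score` is a hypothetical name, not a function confirmed to exist in `pick-role.py`.

```python
def evolve_score(friction_entries: int, sessions_since_evolve: int,
                 pentest_framework_tasks: int, base: int = 0) -> int:
    """Sketch of the EVOLVE score; additive bonuses and base=0 are assumptions."""
    score = base
    if friction_entries >= 5:
        score += 50  # lots of friction accumulated
    if friction_entries >= 3 and sessions_since_evolve >= 5:
        score += 30  # moderate friction, hasn't evolved recently
    if sessions_since_evolve >= 20:
        score += 20  # overdue regardless of friction count
    if pentest_framework_tasks >= 1:
        score += 40  # pending pentest task against .recursive/

    # Both hard caps are lifted when a pentest framework task is pending.
    no_pentest = pentest_framework_tasks == 0
    if sessions_since_evolve < 5 and no_pentest:
        score = min(score, 5)
    if friction_entries == 0 and no_pentest:
        score = min(score, 5)
    return score
```

Under these assumptions, the effect of the change is visible by comparing `evolve_score(5, 2, 0)` with `evolve_score(5, 2, 1)`: the first is capped at 5 because evolve ran recently, while the second escapes the cap because a pentest task is pending.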

### AUDIT -- framework quality review
111 changes: 111 additions & 0 deletions .recursive/reviews/2026-04-09-audit-126.md
@@ -0,0 +1,111 @@
# Framework Audit -- Session #126

**Date**: 2026-04-09
**Trigger**: 18 sessions since last audit (session #107)
**Auditor**: audit-agent

---

## Quality Audit

Files audited: 12
- `.recursive/ops/OPERATIONS.md`
- `.recursive/ops/DAEMON.md`
- `.recursive/ops/ROLE-SCORING.md`
- `.recursive/engine/daemon.sh`
- `.recursive/engine/lib-agent.sh`
- `.recursive/engine/signals.py`
- `.recursive/engine/dashboard.py`
- `.recursive/engine/pick-role.py`
- `.recursive/agents/brain.md`
- `.recursive/sessions/index.md`
- `.recursive/architecture/MODULE_MAP.md`
- `CLAUDE.md`

Issues found: 10
Issues fixed in this PR: 8
Tasks created for remaining issues: 2 (#0249, #0250)

### Issues Fixed

1. **OPERATIONS.md: Stale test count** -- Updated "915 pytest tests" to "1156 pytest tests". The count grew from 915 to 1156 across sessions #107-#125 with 241 new tests added.

2. **OPERATIONS.md: Missing version milestones** -- Added v0.0.7 (Security Hardening, released 2026-04-05) and v0.0.8 (Self-Maintaining, in progress) milestone entries. The doc stopped at v0.0.6 despite both versions having changelog files.

3. **DAEMON.md: Wrong argument description** -- Arg 2 was described as "pause between sessions in seconds", but daemon.sh uses it as `duration_hours` (default 8). The tmux example `daemon.sh claude 60` would therefore mean 60 hours, not a 60-second pause. Changed the description to "duration in hours (default: 8)" and clarified in arg 3 that `0` means unlimited.

4. **DAEMON.md: Hardcoded absolute paths** -- Lines 101 and 115 had absolute links `/Users/no9labs/Developer/.recursive/Nightshift/...` with `.recursive` misplaced in the path. Replaced with relative paths.

5. **DAEMON.md: Stale pentest log references** -- Referenced `.recursive/sessions/*-pentest.log` (v1 era artifact) in the Logs table and circuit breaker recovery section. These files do not exist in the v2 architecture. Replaced with `.recursive/sessions/raw/*.log`.

6. **ROLE-SCORING.md: Missing signals** -- The signals table was missing `pentest_framework_tasks` (added session #109, used to boost evolve +40) and `sessions_since_eval` (added session #124, used for eval staleness alert). Both signals are actively used in pick-role.py and dashboard.py. Added both to the signals table and documented the `pentest_framework_tasks` boost in the EVOLVE scoring section.

7. **sessions/index.md: Corrupted role field** -- Session 20260409-020609 had role `.*'"$LOG_FILE"2>/d` (shell injection artifact from a regex-extraction bug in daemon.sh's role extractor). Corrected to `brain`.

8. **CLAUDE.md + OPERATIONS.md: Divergent dependency flows** -- CLAUDE.md was missing `raven.summary`, `raven.coordination`, `raven.e2e`, `raven.profiler` modules and had wrong ordering of `settings.config`/`settings.eval_targets`. Neither file included `owl.eval_runner` (added session #118). Synchronized both files to a consistent flow that includes all current modules.
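
A corrupted role field like the one in item 7 could be caught with a simple check on each session index row. The sketch below assumes the role is the third column, as in the rows changed by this PR, and that the set of known roles is the one listed here; `KNOWN_ROLES`, `parse_index_row`, and `role_is_valid` are hypothetical names, and the role set is an assumption rather than the daemon's authoritative list.

```python
# Assumed role set for illustration; the authoritative list lives in the daemon docs.
KNOWN_ROLES = {"brain", "build", "review", "oversee", "strategize",
               "achieve", "security", "evolve", "audit"}


def parse_index_row(line: str) -> list[str]:
    """Split one markdown table row into stripped cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def role_is_valid(line: str) -> bool:
    """True if the row has a role column and its value is a known role."""
    cells = parse_index_row(line)
    return len(cells) > 2 and cells[2] in KNOWN_ROLES


good = "| 2026-04-09 02:25 | 20260409-020609 | brain | 0 | 18m | $2.2003 | success [PROMPT MODIFIED] | - | - |"
bad = r"| 2026-04-09 02:25 | 20260409-020609 | .*'\"$LOG_FILE\"2>/d | 0 | 18m | $2.2003 | success [PROMPT MODIFIED] | - | - |"

print(role_is_valid(good), role_is_valid(bad))
```

The corrected row passes and the corrupted row fails, so a check of this kind in the index writer would reject the bad value before it reaches `index.md`.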

### Tasks Created

- **#0249**: Regenerate MODULE_MAP.md -- stale since session #0001, shows only 3 modules, actual package has 20+. Requires `make` or the CLI command which touches nightshift/ (build zone).

- **#0250**: Fix DAEMON.md cycle lifecycle description -- shows `git checkout main` and `git clean -fd` which don't appear in daemon.sh. Framework zone (evolve agent).

---

## Pattern Analysis

Sessions analyzed: 19 (sessions #107-#125)
Commitment hit rate: 19/19 = **100%** (perfect streak)
Cost trend: **stable** (~$1.5-2.2 USD/session, with outliers for complex parallel sessions)

### Decision Patterns

**Role distribution (last 19 sessions)**:
- build: 9 delegations
- evolve: 12 delegations (many sessions ran both build+evolve in parallel)
- oversee: 1 delegation (#0122)
- strategize: 1 delegation (#0123)
- security: 2 delegations (#0109, #0110)
- audit: 1 delegation (#0107, this session)
- review: 0 delegations

**Observations**:
- Build+evolve parallel pattern is the dominant strategy (used in 8 of 19 sessions). It's efficient and produces good throughput.
- Security was run twice in rapid succession (#109, #110) -- the pentest->evolve fix cycle worked well.
- Review role has not been delegated in 19 sessions. The code-reviewer/safety-reviewer sub-agents are used per-PR but the standalone review role (file-by-file quality review) has been skipped.
- Advisory overrides are common but justified (5 of 19 sessions).

**Override quality**: All overrides were justified with clear rationale. No habitual overrides observed.

### Commitment Quality

All 19 commitments were MET. Specific observations:
- Predictions are consistently calibrated -- specific and measurable
- Eval score predictions tend to be conservative (>= threshold) rather than point estimates
- Test count predictions consistently underestimate actual new tests (e.g., predicted 3+, got 25)

### Cost Analysis

Session costs range from $0.39 to $2.60 USD. Two sessions stood out:
- Session #0110 ($2.38): One of the most expensive -- parallel build+evolve with complex security work
- Session #0114 ($2.60): Most expensive overall -- release module with 3 fix cycles

Cost trend is stable. No drift upward or downward.

### Optimization Opportunities

1. **Review role gap**: 19 sessions without a standalone review. The review role does file-by-file deep quality checks that per-PR reviewers don't do. Consider triggering when consecutive_builds >= 10 (currently 5, but the brain often runs build+evolve in parallel, which doesn't change this counter).

2. **Eval regression tracking**: Eval dropped from 86 to 83 between #0016 and #0017. The count-only payload issue in state file is task #0247. Track whether fixing this brings eval back above 86.

3. **Queue not shrinking**: Queue stabilized at 69 after oversee in #0122 but hasn't continued to decrease. With 2 new tasks created per session on average, the queue will grow unless oversee runs more frequently.

4. **MODULE_MAP.md rot**: The module map hasn't been regenerated since session #0001 and shows only 3 modules. Dashboard and brain signals that reference the map get no useful data. Task #0249 addresses this.

---

## Verification

- `make check` passes (1156 tests)
- No framework files in nightshift/ touched
- All 10 issues identified; 8 fixed directly; 2 tasked
2 changes: 1 addition & 1 deletion .recursive/sessions/index.md
@@ -85,5 +85,5 @@
| 2026-04-08 22:41 | 20260408-183951 | brain | 0 | - | - | success | #0177 eval rerun (53->86) + #0203 ROLE-SCORING v2 | [#198](https://github.com/Recusive/Nightshift/pull/198), [#197](https://github.com/Recusive/Nightshift/pull/197) |
| 2026-04-09 01:44 | 20260409-011757 | brain | 0 | 26m | $2.158 | success | - | - |
| 2026-04-09 02:06 | 20260409-014441 | brain | 0 | 21m | $2.0223 | success [PROMPT MODIFIED] [ORIGIN MODIFIED] | - | - |
-| 2026-04-09 02:25 | 20260409-020609 | .*'\"$LOG_FILE\"2>/d | 0 | 18m | $2.2003 | success [PROMPT MODIFIED] | - | - |
+| 2026-04-09 02:25 | 20260409-020609 | brain | 0 | 18m | $2.2003 | success [PROMPT MODIFIED] | - | - |
| 2026-04-09 02:42 | 20260409-022508 | brain | 0 | 17m | $1.3038 | success [PROMPT MODIFIED] | - | - |
2 changes: 1 addition & 1 deletion .recursive/tasks/.next-id
@@ -1 +1 @@
-249
+251
22 changes: 22 additions & 0 deletions .recursive/tasks/0249.md
@@ -0,0 +1,22 @@
---
status: pending
priority: normal
target: recursive
source: audit
created: 2026-04-09
completed:
---

# Regenerate MODULE_MAP.md (stale since session #0001)

The `.recursive/architecture/MODULE_MAP.md` was last generated in session #0001 and shows only 3 top-level modules. It is severely stale -- the package has grown to include `core/`, `settings/`, `owl/`, `raven/`, and `infra/` subpackages with 20+ modules. The stale map gives future sessions incorrect orientation data.

## Acceptance Criteria
- [ ] Run `python3 -m nightshift module-map --write` from the repo root
- [ ] Verify the new MODULE_MAP.md shows all subpackages and modules
- [ ] Verify the dependency order matches the flow in CLAUDE.md
- [ ] Commit the updated MODULE_MAP.md
- [ ] PR passes code-reviewer

## Notes
This is a build-zone task (touches nightshift/). The command auto-generates the file -- no manual editing needed.
35 changes: 35 additions & 0 deletions .recursive/tasks/0250.md
@@ -0,0 +1,35 @@
---
status: pending
priority: normal
target: recursive
source: audit
created: 2026-04-09
completed:
---

# Fix DAEMON.md cycle lifecycle description (git commands inaccurate)

The DAEMON.md "Cycle Lifecycle" section shows these commands:

```
git fetch origin
git checkout main
git reset --hard origin/main
git clean -fd
```

But the actual `daemon.sh` only runs:

```
git -C "$REPO_DIR" fetch origin main --quiet
git -C "$REPO_DIR" reset --hard origin/main --quiet
```

No `git checkout main` and no `git clean -fd`. The doc is misleading agents who read it to understand the reset behavior.

## Acceptance Criteria
- [ ] Update DAEMON.md "1. Reset and housekeeping" section to match actual daemon.sh reset commands
- [ ] PR passes docs-reviewer

## Notes
Framework-zone task (touches `.recursive/ops/DAEMON.md`). Delegate to evolve agent.
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -167,7 +167,7 @@ These are enforced by CI. Non-negotiable.
- **One concern per module.** If you're adding >50 lines of new logic to an existing file, it belongs in its own module. cycle.py handles cycle logic -- not scoring. cli.py handles CLI -- not business logic.
- **No hardcoded data in logic files.** Regex patterns, score maps, category weights, thresholds -- these go in `core/constants.py` or a dedicated `*_patterns.py`. Logic files import them.
- **New module checklist:** create the `.py` file in the appropriate subpackage (`core/`, `settings/`, `owl/`, `raven/`, `infra/`), add to the subpackage's `__init__.py` re-exports, add to `nightshift/scripts/install.sh` PACKAGE_FILES, add to this file's structure tree.
-- **Follow the dependency flow:** `core.errors -> core.types -> core.constants -> core.shell -> core.state -> settings.config -> settings.eval_targets -> owl.cycle -> owl.scoring -> owl.readiness -> raven.planner -> raven.decomposer -> raven.subagent -> raven.integrator -> raven.feature -> infra.worktree -> infra.module_map -> infra.multi -> infra.release -> cli`. New modules slot into this chain. No circular imports. (`infra/multi.py` uses a late import of `run_nightshift` from `cli.py` to avoid circular deps.)
+- **Follow the dependency flow:** `core.errors -> core.types -> core.constants -> core.shell -> raven.summary -> raven.coordination -> infra.module_map -> owl.readiness -> owl.scoring -> core.state -> settings.config -> settings.eval_targets -> owl.eval_runner -> infra.multi -> raven.e2e -> raven.profiler -> infra.worktree -> owl.cycle -> raven.planner -> raven.subagent -> raven.decomposer -> raven.integrator -> raven.feature -> infra.release -> cli`. New modules slot into this chain. No circular imports. (`infra/multi.py` uses a late import of `run_nightshift` from `cli.py` to avoid circular deps.)
- **Functions over inline code.** If a block of code does one thing and is >10 lines, extract it into a named function. The function name documents the intent.
- **Config over magic numbers.** If a value might change (thresholds, limits, timeouts), put it in `DEFAULT_CONFIG` and `core/types.py`, not inline.
