digest: every run fails with 22021 — byte-slice title truncation splits a multi-byte UTF-8 rune (RunDigest)

## Summary

The digest scheduler fails **on every run** with `SQLSTATE 22021` (`invalid byte sequence for encoding "UTF8"`). `RunDigest()` truncates each topic-map title with a **byte** slice, `title[:67] + "..."`, which splits a multi-byte UTF-8 rune whenever one straddles the cut. In the observed case the rune's lead byte sits at byte index 66 (the 67th byte), as happens with an em-dash `—` or ellipsis `…` at that position. The slice then ends with a dangling lead byte (`0xe2` for the `—`/`…` family), and appending `"..."` yields `0xe2 0x2e 0x2e 0x2e`, which is not valid UTF-8. PostgreSQL rejects the topic-map `UPSERT` and the digest job errors. Because the job re-runs on the scheduler's event/debounce loop while the offending title is still present, it fails **every time**, so the topic-map index stays wedged.

Net effect: the topic map (`category=index`, `title=topic-map-{scope}`) is **never (re)built** once any block has a title that trips the cut. Core store/query/retrieval are on a separate path and are unaffected — ctx otherwise works normally.

Verified by live reproduction on the running daemon (real data) **and** a standalone, infra-free Go reproduction (below).

## Environment

| | |
|---|---|
| Repo checkout | `GottZ/ctx` @ `root`, commit `756f0db` = tag **`v1.6.6`** (latest release) |
| PostgreSQL | 18 — image `pgvector-timescaledb:pg18`, server encoding `UTF8` |
| Driver | pgx v5 |
| Host | MacBook Pro, Apple M2 Pro, 16 GB |
| OS | macOS 26.5 (build 25F71), arm64 |
| Container runtime | Docker API v29.4.0 |
| Go (build) | 1.26.3 |

Deterministic — not timing- or load-dependent. It triggers whenever a block whose title hits the cut is present at digest time.

## Symptom (daemon logs, verified)

Every digest tick:

```
{"level":"INFO","msg":"scheduler: running digest","scope":"private"}
{"level":"ERROR","msg":"scheduler: digest error","error":"digest: upsert topic map: store: upsert block: ERROR: invalid byte sequence for encoding \"UTF8\": 0xe2 0x2e 0x2e (SQLSTATE 22021)"}
```

The error bytes are the tell: `0xe2` is the **lead byte of a 3-byte UTF-8 sequence** (`—` = `E2 80 94`, `…` = `E2 80 A6`), and `0x2e 0x2e` are the first two `.` of the appended `"..."`. The value sent to Postgres is therefore `<valid bytes>0xe2` + `"..."`: a lone `0xe2` followed by `0x2e`, which is not a valid continuation byte (`0x80–0xBF`), so the sequence is invalid.
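As a quick check of that reading, the minimal sketch below decodes the three bytes reported in the log next to a well-formed em-dash. The helper name `firstInvalid` is hypothetical and not part of ctx; only the standard `unicode/utf8` package is used.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid returns the offset of the first byte that does not start a
// well-formed UTF-8 sequence, or -1 if b is valid UTF-8.
// (Hypothetical helper for illustration only.)
func firstInvalid(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		// DecodeRune signals a malformed sequence with (RuneError, 1).
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	emDash := []byte{0xe2, 0x80, 0x94}  // complete 3-byte sequence
	fromLog := []byte{0xe2, 0x2e, 0x2e} // bytes quoted in the 22021 error
	fmt.Println(firstInvalid(emDash), firstInvalid(fromLog)) // prints: -1 0
}
```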

## Root cause (verified)

**File:** `go/internal/digest/digest.go` — `func RunDigest(...)`, per-block title truncation (L67–71 @ `v1.6.6`):

```go
// Title truncation: max 70 chars.
title := b.Title
if len(title) > 70 {
    title = title[:67] + "..."
}
```

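`len(title)` counts **bytes** and `title[:67]` slices **bytes**. When byte 66 begins a multi-byte rune (bytes 66–68 for a 3-byte char), `title[:67]` keeps only that rune's first byte and drops the rest, so the string ends in a dangling lead byte; the following `+ "..."` cannot repair it. The result is invalid UTF-8. More generally, the cut splits a rune whenever byte 67 is a continuation byte: a 3-byte rune beginning at byte 65, for example, loses only its last byte and is equally invalid, though the bytes quoted in the error would differ.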

That value flows into `store.UpsertBlock(...)` → the topic-map `INSERT … ON CONFLICT … UPDATE`, where PostgreSQL (UTF8) validates the text and rejects it with `22021`. `RunDigest` is the **only** truncation site whose output is written to Postgres, which is why it is the only one that surfaces as a crash (see *Related*).

## Reproduction

**A — standalone, no infra (deterministic):**

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

func main() {
	// 66 ASCII bytes, then an em-dash (E2 80 94), then tail to exceed the 70-byte guard.
	title := strings.Repeat("a", 66) + "—long tail well past seventy bytes"
	out := title[:67] + "..."            // exactly what RunDigest does today
	fmt.Println(utf8.ValidString(out))   // => false
	fmt.Printf("% x\n", out[64:])        // => 61 61 e2 2e 2e 2e  (lone e2 lead byte, then the dots)
}
```

`utf8.ValidString(out)` is `false` → Postgres rejects this exact string with `22021`.

**B — live (real path):** store any block whose title is longer than 70 bytes (so it enters the truncation branch) and has a multi-byte char straddling the cut at byte offset 67 (titles containing `—` / `…` are common), then let the digest scheduler run (or trigger a digest). The `UPSERT` fails as above on every run.

## Fix

Make the truncation **rune-aware** so it can never split a multi-byte char — add `"unicode/utf8"` to the imports and:

```diff
-			// Title truncation: max 70 chars.
+			// Title truncation: max 70 runes (rune-aware — a byte slice can split a
+			// multi-byte char, leaving invalid UTF-8 that fails the upsert: 22021).
 			title := b.Title
-			if len(title) > 70 {
-				title = title[:67] + "..."
+			if utf8.RuneCountInString(title) > 70 {
+				title = string([]rune(title)[:67]) + "..."
 			}
```

This also corrects a pre-existing semantic mismatch: the old guard counted bytes, so the "max 70 chars" comment was already wrong for any non-ASCII title; the fix makes the threshold count runes as the comment intends. The topic map is display/index text, so a rune-count cap is the right semantics.

Verified locally: with this change `digest` rebuilds the full topic map (`success`, "digest: topic map updated") and the `22021` errors stop. Happy to open a PR if useful.

## Related (same byte-slice idiom; **not** verified to crash)

The same "truncate a string with a byte slice" pattern appears below. None currently writes its truncated output to Postgres (they feed LLM prompts or a debug log), so none triggers `22021` today — but each can emit invalid UTF-8 on a multi-byte boundary and would become a latent bug if its output is ever persisted or sent to a strict consumer:

- `go/internal/llm/synthesize.go:203` — `content[:MaxBlockChars] + "[... truncated]"` (LLM source prompt)
- `go/internal/ingest/extract.go:69` — `content[:MaxExtractionContent] + "..."` (LLM extraction prompt)
- `go/internal/ingest/extract.go:119` — `after[:200]` (prompt-injection heuristic scan)
- `go/internal/events/listener.go:32` — `payload[:200] + "..."` (`slog.Debug` only)

For contrast, these byte slices are **safe** (ASCII-only operands, cannot split a rune): `digest.go:64` `idPrefix[:8]`, `commands.go:599` `id[:12]` (block IDs). And `synthesize.go:261` `llmSources[:N]` is a slice of structs, not a string — unrelated.

A shared `truncateRunes(s string, n int) string` helper applied across the string sites would fix all of them consistently and prevent regressions.

## Impact

- Topic-map digest is **wedged** for any scope containing a title that hits the cut: the index block is never updated and every tick logs an error.
- Easy to hit unintentionally — em-dashes / ellipses in titles are common.
- **Unaffected:** core store / query / retrieval (separate path). The failed `UPSERT` is a single rejected statement — no partial or corrupt write.

