Summary
The digest scheduler fails on every run with SQLSTATE 22021 (invalid byte sequence for encoding "UTF8"). RunDigest() truncates each topic-map title with a byte slice, title[:67] + "...", which cuts through a multi-byte UTF-8 rune whenever one straddles the boundary after byte index 66. In the observed case the 67th byte (index 66) is the lead byte of a 3-byte character such as an em-dash (—) or ellipsis (…), so the slice ends with a dangling lead byte (0xe2 for the —/… family) and appending "..." produces the byte sequence 0xe2 0x2e 0x2e, which is not valid UTF-8. PostgreSQL rejects the topic-map UPSERT and the digest job errors. Because the job re-runs on the scheduler's event/debounce loop, it fails every time and the topic-map index stays wedged.
Net effect: the topic map (category=index, title=topic-map-{scope}) is never (re)built once any block has a title that trips the cut. Core store/query/retrieval are on a separate path and are unaffected — ctx otherwise works normally.
Verified by live reproduction on the running daemon (real data) and a standalone, infra-free Go reproduction (below).
Environment
| Component | Details |
| --- | --- |
| Repo checkout | GottZ/ctx @ root, commit 756f0db = tag v1.6.6 (latest release) |
| PostgreSQL | 18, image pgvector-timescaledb:pg18, server encoding UTF8 |
| Driver | pgx v5 |
| Host | MacBook Pro, Apple M2 Pro, 16 GB |
| OS | macOS 26.5 (build 25F71), arm64 |
| Container runtime | Docker API v29.4.0 |
| Go (build) | 1.26.3 |
Deterministic — not timing- or load-dependent. It triggers whenever a block whose title hits the cut is present at digest time.
Symptom (daemon logs, verified)
Every digest tick:
{"level":"INFO","msg":"scheduler: running digest","scope":"private"}
{"level":"ERROR","msg":"scheduler: digest error","error":"digest: upsert topic map: store: upsert block: ERROR: invalid byte sequence for encoding \"UTF8\": 0xe2 0x2e 0x2e (SQLSTATE 22021)"}
The error bytes are the tell: 0xe2 is the lead byte of a 3-byte UTF-8 sequence (— = E2 80 94, … = E2 80 A6), and 0x2e 0x2e are the first two . of the appended "...". So the stored value is <valid bytes>0xe2 + "..." → a lone 0xe2 followed by 0x2e (not a valid continuation byte 0x80–0xBF) → invalid.
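The same reading of the error bytes can be checked with the standard unicode/utf8 package. This is a small sketch for illustration only; isContinuation is a hypothetical helper, not something in ctx.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isContinuation reports whether b has the form 10xxxxxx (0x80 to 0xBF),
// i.e. whether it can follow a lead byte inside a multi-byte sequence.
// Hypothetical helper for illustration; it is not part of ctx.
func isContinuation(b byte) bool {
	return !utf8.RuneStart(b)
}

func main() {
	// The three bytes quoted in the PostgreSQL error message.
	bad := []byte{0xe2, 0x2e, 0x2e}

	fmt.Println(isContinuation(bad[1])) // false: '.' cannot continue 0xe2

	r, size := utf8.DecodeRune(bad)
	fmt.Println(r == utf8.RuneError, size) // true 1: the lead byte decodes as an error

	fmt.Println(utf8.Valid(bad)) // false

	// The intact em-dash, for contrast.
	emDash := []byte{0xe2, 0x80, 0x94}
	fmt.Println(isContinuation(emDash[1]), utf8.Valid(emDash)) // true true
}
```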
Root cause (verified)
File: go/internal/digest/digest.go — func RunDigest(...), per-block title truncation (L67–71 @ v1.6.6):
// Title truncation: max 70 chars.
title := b.Title
if len(title) > 70 {
	title = title[:67] + "..."
}
len(title) counts bytes and title[:67] slices bytes. When byte 66 begins a multi-byte rune (bytes 66–68 for a 3-byte char), title[:67] keeps only that rune's first byte and drops the rest, so the string ends in a dangling lead byte; the following + "..." cannot repair it. The result is invalid UTF-8.
That value flows into store.UpsertBlock(...) → the topic-map INSERT … ON CONFLICT … UPDATE, where PostgreSQL (UTF8) validates the text and rejects it with 22021. RunDigest is the only truncation site whose output is written to Postgres, which is why it is the only one that surfaces as a crash (see Related).
Reproduction
A — standalone, no infra (deterministic):
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

func main() {
	// 66 ASCII bytes, then an em-dash (E2 80 94), then tail to exceed the 70-byte guard.
	title := strings.Repeat("a", 66) + "—long tail well past seventy bytes"
	out := title[:67] + "..."          // exactly what RunDigest does today
	fmt.Println(utf8.ValidString(out)) // => false
	fmt.Printf("% x\n", out[64:])      // => 61 61 e2 2e 2e 2e (lone e2 lead byte, then the dots)
}
utf8.ValidString(out) is false → Postgres rejects this exact string with 22021.
B — live (real path): store any block whose title is long enough to enter the truncation branch and has a multi-byte char straddling byte 67 (titles containing — / … are common), then let the digest scheduler run (or trigger a digest). The UPSERT fails as above on every run.
Fix
Make the truncation rune-aware so it can never split a multi-byte char — add "unicode/utf8" to the imports and:
- // Title truncation: max 70 chars.
+ // Title truncation: max 70 runes (rune-aware — a byte slice can split a
+ // multi-byte char, leaving invalid UTF-8 that fails the upsert: 22021).
title := b.Title
- if len(title) > 70 {
- title = title[:67] + "..."
+ if utf8.RuneCountInString(title) > 70 {
+ title = string([]rune(title)[:67]) + "..."
}
This also corrects a pre-existing semantic mismatch: the old guard counted bytes, so the "max 70 chars" comment was already wrong for any non-ASCII title; the fix makes the threshold count runes as the comment intends. The topic map is display/index text, so a rune-count cap is the right semantics.
Verified locally: with this change digest rebuilds the full topic map (success, "digest: topic map updated") and the 22021 errors stop. Happy to open a PR if useful.
Related (same byte-slice idiom; not verified to crash)
The same "truncate a string with a byte slice" pattern appears below. None currently writes its truncated output to Postgres (they feed LLM prompts or a debug log), so none triggers 22021 today — but each can emit invalid UTF-8 on a multi-byte boundary and would become a latent bug if its output is ever persisted or sent to a strict consumer:
- go/internal/llm/synthesize.go:203: content[:MaxBlockChars] + "[... truncated]" (LLM source prompt)
- go/internal/ingest/extract.go:69: content[:MaxExtractionContent] + "..." (LLM extraction prompt)
- go/internal/ingest/extract.go:119: after[:200] (prompt-injection heuristic scan)
- go/internal/events/listener.go:32: payload[:200] + "..." (slog.Debug only)
For contrast, these byte slices are safe (ASCII-only operands, cannot split a rune): digest.go:64 idPrefix[:8], commands.go:599 id[:12] (block IDs). And synthesize.go:261 llmSources[:N] is a slice of structs, not a string — unrelated.
A shared truncateRunes(s string, n int) string helper applied across the string sites would fix all of them consistently and prevent regressions.
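One possible shape for that helper, as a sketch. Only the name and signature come from the proposal above; the semantics (return at most the first n runes and leave any suffix to the caller, since the call sites append different markers) are an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// truncateRunes returns at most the first n runes of s and never cuts
// inside a multi-byte rune. The caller appends any "..." marker.
// Sketch only: these semantics are assumed, not taken from ctx.
func truncateRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	count := 0
	// Ranging over a string yields the byte index of each rune start,
	// so s[:i] always ends on a rune boundary.
	for i := range s {
		if count == n {
			return s[:i]
		}
		count++
	}
	return s // s has n runes or fewer
}

func main() {
	// Same input as reproduction A; "\u2014" is the em-dash.
	title := strings.Repeat("a", 66) + "\u2014long tail well past seventy bytes"

	out := title
	if utf8.RuneCountInString(title) > 70 {
		out = truncateRunes(title, 67) + "..."
	}
	fmt.Println(utf8.ValidString(out), utf8.RuneCountInString(out)) // true 70
}
```

Walking the string with range avoids the []rune allocation of the inline fix, which may matter for the larger content[:MaxBlockChars] style call sites.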
Impact
- Topic-map digest is wedged for any scope containing a title that hits the cut: the index block is never updated and every tick logs an error.
- Easy to hit unintentionally — em-dashes / ellipses in titles are common.
- Unaffected: core store / query / retrieval (separate path). The failed UPSERT is a single rejected statement, so there is no partial or corrupt write.