chore: merge upstream/main into native-e2e-expansion, resolve conflicts #4

Merged

JerrettDavis merged 88 commits into native-e2e-expansion from
copilot/update-pr-branch-from-main on Apr 28, 2026
Conversation

Copilot AI commented Apr 28, 2026

Upstream chopratejas/headroom main had diverged significantly (Rust rewrite, SmartCrusher port, diff_compressor retirement), leaving the native-e2e-expansion PR with merge conflicts.

Conflict resolutions

  • .github/workflows/ci.yml — Kept pytest-cov in the macOS native wrapper pip install; the downstream pytest --cov invocation and Codecov upload step require it (upstream dropped it incidentally).
  • tests/test_transforms/test_smart_crusher_bugs.py — Accepted upstream's removal of the TestHelperCoverage Python test class; the underlying Python helpers (_percentile_linear, _detect_sequential_pattern, etc.) were deleted with the Python SmartCrusher implementation in Stage 3c.1b — Rust parity fixtures and bug{1–4}_* tests in crates/headroom-core/ cover the same invariants.

chopratejas and others added 30 commits April 24, 2026 13:39
Bootstrap the Rust port of Headroom. Additive only — no existing Python
code modified. Ships the workshop, not the widgets.

Layout
  Cargo.toml (workspace) + rust-toolchain.toml
  crates/headroom-core    — transform library, stub only
  crates/headroom-proxy   — axum binary, /healthz only
  crates/headroom-py      — PyO3 cdylib, exposes headroom._core.hello()
  crates/headroom-parity  — Rust-vs-Python oracle harness + parity-run CLI

Tooling
  Makefile: test, test-parity, bench, build-proxy, build-wheel, fmt, lint
  .github/workflows/rust.yml: test, wheels (linux/mac), audit, parity-nightly
  deny.toml for cargo-deny

Parity corpus
  tests/parity/recorder.py + scripts/record_fixtures.py
  125 recorded fixtures across 5 leaf transforms (ccr, tokenizer,
  log_compressor, diff_compressor, cache_aligner)

Docs
  RUST_DEV.md — developer setup and workspace reference
  docs/spec/022-rust-migration.md — migration plan and stage breakdown

.gitignore: whitelist scripts/record_fixtures.py; ignore target/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds out crates/headroom-proxy from a /healthz stub into a transparent
reverse proxy: catch-all router that forwards every method/path/query to
--upstream verbatim, streaming both request and response bodies through
reqwest without buffering. Adds clap-based config (CLI + env), thiserror
error type with sane upstream-status mapping, JSON tracing-subscriber
logging, and graceful shutdown. The library surface (build_app, AppState,
Config) is reused by the integration tests.
Implements RFC 7230 6.1 hop-by-hop filtering on both request and response
sides (Connection, Keep-Alive, Proxy-Authenticate, Proxy-Authorization,
TE, Trailers, Transfer-Encoding, Upgrade), plus the additional headers
listed inside any incoming Connection: header. Injects X-Forwarded-For
(appending to existing value if any), X-Forwarded-Proto, X-Forwarded-Host,
and X-Request-Id. The proxy module wires these in for both HTTP and WS.
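
A minimal sketch of the request-side strip, assuming the http crate's
HeaderMap (the real helper applies the same logic on the response side):

    use http::{header::HeaderName, HeaderMap};

    const HOP_BY_HOP: &[&str] = &[
        "connection", "keep-alive", "proxy-authenticate",
        "proxy-authorization", "te", "trailers", "transfer-encoding",
        "upgrade",
    ];

    fn strip_hop_by_hop(headers: &mut HeaderMap) {
        // RFC 7230 6.1: also strip whatever the Connection header names.
        let listed: Vec<HeaderName> = headers
            .get_all("connection")
            .iter()
            .filter_map(|v| v.to_str().ok())
            .flat_map(|v| v.split(','))
            .filter_map(|name| name.trim().parse::<HeaderName>().ok())
            .collect();
        for name in HOP_BY_HOP {
            headers.remove(*name);
        }
        for name in listed {
            headers.remove(name);
        }
    }
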
When the catch-all sees an Upgrade: websocket request, it hands the request to the ws
module: axum upgrades the client side, tokio-tungstenite connects to the
upstream (rewriting http->ws / https->wss while preserving path + query),
and two pumps shovel messages until either side closes. Forwarded headers
exclude what tungstenite manages (Host, Upgrade, Connection, Sec-*) but
preserve Authorization, Sec-WebSocket-Protocol, etc. Supports text,
binary, ping, pong, and close frames in both directions.
/healthz returns 200 unconditionally (own health). /healthz/upstream
proxies a GET to the upstream's /healthz and returns 200 when reachable
+ 2xx, 503 otherwise. Both endpoints are intercepted in axum and never
forwarded; documented in RUST_DEV.md as reserved paths.
15 integration tests across six suites that spin up the proxy on an
ephemeral port pointed at a per-test mock upstream:

- integration_http: all 7 methods round-trip with body, status passthrough
  for 404/500/502, query strings preserved, 1MB POST streams through.
- integration_sse: a 10-event in-process hyper SSE upstream emits at 50ms
  cadence; chunks reach the client with max gap < 500ms (loose CI bound)
  and a client disconnect propagates to the upstream within 2s.
- integration_ws: 5 text + 5 binary messages echo through a tungstenite
  upstream byte-equal; client-initiated close propagates.
- integration_headers: hop-by-hop strip both directions, X-Forwarded-*
  injection, X-Forwarded-For appends to existing value, multi-valued
  response headers preserved.
- integration_body: 5MB POST round-trips byte-equal; streaming response
  yields first byte before the upstream finishes sending.
- integration_health: own /healthz always 200; /healthz/upstream is 200
  when upstream healthy and 503 when down.

The Sec-WebSocket-Protocol forwarding is exercised implicitly by the WS
tests via tungstenite handshake. The harness lives at tests/common/mod.rs
and is shared by every integration suite.
…rs (chopratejas#31)

Adds `HEADROOM_QDRANT_URL`, `_HOST`, `_PORT`, `_API_KEY`, `_HTTPS`,
`_PREFER_GRPC`, `_GRPC_PORT` support across the memory stack:

- `headroom/memory/qdrant_env.py`: shared resolver helper with
  explicit-arg > env > default precedence (URL wins over host/port;
  booleans parsed via standard truthy set).
- `memory/easy.py`, `backends/{mem0,direct_mem0}.py`,
  `proxy/memory_handler.py`: call the resolver so
  `Memory(backend="qdrant-neo4j")`, `Mem0Config`, and the proxy's
  `MemoryConfig` all honor the same env keys.
- `proxy/models.py` + `proxy/server.py`: `ProxyConfig` picks up the
  same keys so hosted Qdrant (e.g. Qdrant Cloud) works without code
  changes.
- `cli/proxy.py`: adds `--memory-qdrant-{url,host,port,api-key}`
  flags that override the env when present.
- `tests/test_memory/test_qdrant_env.py`: unit coverage for
  precedence, URL-vs-host/port, boolean parsing, and unset defaults.
- `CHANGELOG.md`: documented under [Unreleased] / Added.

Explicit constructor arguments still win; unset env keeps the existing
localhost:6333 defaults, so this is backwards-compatible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
maturin>=1.5 requires -m to point to Cargo.toml, not pyproject.toml.
Fixes wheel build job failure in CI (all three matrix targets).
Also switches to manifest-path: action param for cleaner workflow syntax.
Applies same fix to Makefile build-wheel and develop targets.
Bug 1 (HIGH) health.rs: Url::join('healthz') used relative resolution,
stripping non-trailing-slash base paths. Fixed with set_path('/healthz').
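
A quick demonstration of the difference, using the url crate:

    use url::Url;

    fn main() {
        let base = Url::parse("http://upstream:8080/api/v1").unwrap();
        // join() does RFC 3986 relative resolution: the last segment of a
        // non-trailing-slash base path is replaced, not appended to.
        assert_eq!(
            base.join("healthz").unwrap().as_str(),
            "http://upstream:8080/api/healthz"
        );
        // set_path() is deterministic regardless of the base path shape.
        let mut fixed = base.clone();
        fixed.set_path("/healthz");
        assert_eq!(fixed.as_str(), "http://upstream:8080/healthz");
    }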

Bug 2 (HIGH) main.rs: graceful_shutdown_timeout was configured and logged
but never enforced. Now sleeps for the configured duration after signal
before axum exits, giving in-flight LLM streams time to drain.

Bug 3 (MEDIUM) websocket.rs: WS pump half-close could hang forever if
close() on one side failed. Replaced tokio::join! on async blocks with
spawned tasks + CancellationToken so either direction cancels the other.
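
The shape of the fix, sketched with pump bodies elided (tokio +
tokio-util; names are illustrative):

    use tokio_util::sync::CancellationToken;

    async fn run_pumps() {
        let token = CancellationToken::new();
        let (t1, t2) = (token.clone(), token.clone());

        let client_to_upstream = tokio::spawn(async move {
            tokio::select! {
                _ = t1.cancelled() => {}
                _ = async { /* pump client -> upstream */ } => {}
            }
            t1.cancel(); // this side is done: cancel the other pump
        });
        let upstream_to_client = tokio::spawn(async move {
            tokio::select! {
                _ = t2.cancelled() => {}
                _ = async { /* pump upstream -> client */ } => {}
            }
            t2.cancel();
        });
        let _ = tokio::join!(client_to_upstream, upstream_to_client);
    }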

Bug 4 (MEDIUM) proxy.rs/websocket.rs: URL path-join logic was copy-pasted
verbatim in two places. Extracted to join_upstream_path() helper; websocket
now calls it instead of duplicating the 15-line block.

Bug 5 (MEDIUM) proxy.rs: mid-stream upstream errors were silently swallowed
by Body::from_stream. Added a .map() wrapper that logs before re-raising.

Bug 6 (LOW) websocket.rs: WS session log was missing the request path,
making it hard to correlate logs with client sessions. Added path field.

Bug 7 (LOW) websocket.rs: scheme match arm 'ws'|'wss' borrowed joined
immutably while set_scheme needed a mutable borrow. Fixed by using literal
'ws' (set_scheme on an already-ws URL is a no-op for the ws case).
- Replace full-sha image tags with type=sha,format=short (7-char) so the
  primary package versions list stops accumulating long sha-only entries.
- Route cosign signatures into a sibling GHCR package via
  COSIGN_REPOSITORY=<image>-signatures, so the main image's package
  version list stays clean. GHCR does not yet implement the OCI 1.1
  Distribution Referrers API (community discussion #163029, June 2025),
  so legacy signature mode is used here -- OCI 1.1 mode would force the
  signature manifest's subject into the same repo as the image and
  override COSIGN_REPOSITORY. Verifiers must export the same
  COSIGN_REPOSITORY value when running 'cosign verify'.
- Add a promote-latest job that runs after the variant matrix and
  re-pushes the :latest tag pointing at the root image with a unique
  index annotation. This forces a fresh manifest digest, generating a
  new GHCR package version with current timestamp so :latest sits at
  the top of the version listing instead of whichever variant happened
  to finish last.
When the docker workflow is triggered directly by release.published
(rather than via workflow_call from the Release parent), inputs.enable_ref_tags
is null, which produced an empty enable= attribute that the metadata-action
rejected. Default to true on non-release triggers and skip ref/pr tags
on release events where they don't apply anyway.
CodeQL alert chopratejas#61 (CWE-275, actions/missing-workflow-permissions):
add explicit `permissions: contents: read` to the rust workflow root.
Defaults the GITHUB_TOKEN to read-only across all jobs, so even if the
repo policy changes, this workflow stays at least-privilege. No job in
this workflow needs write — wheels/audit/parity all read-only.

Add real end-to-end test suite at tests/e2e_real.rs gated behind
HEADROOM_E2E=1. Spawns the actual Python Headroom proxy as a subprocess,
runs the Rust proxy in-process in front of it, and exercises:
  - health endpoints across the full chain
  - Anthropic non-streaming (real API call)
  - Anthropic streaming SSE (real API call) with chunk-level validation
  - OpenAI non-streaming (real API call)
  - X-Request-Id generation and pass-through

Adds tokio-process feature for Command/Child usage. Loads .env at the
repo root for API keys (does not log values). Tests skip cleanly when
HEADROOM_E2E is unset, so cargo test stays fast.
Adds 0.10.7-ab46594 (root) and 0.10.7-<variant>-<sha> (variants) so
images can be referenced by an exact version+commit pair without
relying on the moving variant or :latest tags.
Previously, any package registered under the headroom.proxy_extension
entry-point group auto-loaded at proxy startup. A user pip-installing a
plugin (or pulling one in transitively) would get its middleware running
in front of all their LLM traffic with zero opt-in or visibility — the
same mechanism that masked the Shield Enterprise streaming bug.

Change: install_all() now takes an explicit enabled set (or reads
HEADROOM_PROXY_EXTENSIONS). Discovery still runs to enumerate what's
available, but only the names the operator opted into are actually installed.
The literal '*' is a wildcard for trusted environments.

  CLI:  headroom proxy --proxy-extension shield_enterprise
        headroom proxy --proxy-extension shield_enterprise,mypkg
        headroom proxy --proxy-extension '*'
  Env:  HEADROOM_PROXY_EXTENSIONS=shield_enterprise

The startup banner now shows discovered + enabled extensions:
  Extensions:   discovered=shield_enterprise (opt-in: --proxy-extension ...)
  Extensions:   ENABLED shield_enterprise (available: shield_enterprise)
  Extensions:   ENABLED (wildcard) shield_enterprise

Names that were requested but not found are logged as warnings.

Adds proxy_extensions: list[str] | None to ProxyConfig. Plumbs it
through CLI -> ProxyConfig -> install_all(enabled=...).

This is a behavior change for users who relied on auto-loading.
Existing Shield/extension users must add --proxy-extension or set
HEADROOM_PROXY_EXTENSIONS to keep their middleware running.
Cargo.lock: pick up tokio-util added in the WS half-close fix.
RUST_DEV.md: document how to run headroom-proxy in passthrough mode
(listen + upstream flags, e2e test gate, env vars).
cargo fmt --check failed in CI: import order in proxy.rs (cfg(test)
attributes before/after non-attr imports) and a few line-wrapping
nits in e2e_real.rs. Ran cargo fmt --all to fix.

maturin-action@v1 does not have a 'manifest-path' input — the action
warned 'Unexpected input(s) manifest-path' and proceeded to invoke
maturin from the repo root, which sees the workspace Cargo.toml with
no [package] section and bails. Move -m crates/headroom-py/Cargo.toml
back inside the 'args' string.
# Conflicts:
#	headroom/proxy/server.py
GitHub Actions deprecated the macos-13 runner label. The validate-workflows
actionlint step in CI fails because macos-13 is no longer in the available
labels list. macos-15-intel is the current x86_64 macOS runner.

(A bump from macos-14 to macos-15 for arm64 proved unnecessary; macos-14
is still valid and we keep it for cache warmth.)
phase-0: rust workspace scaffolding + parity harness
ci(docker): clean up image tags, signatures, and Latest indicator
fix(memory): resolve Qdrant connection from HEADROOM_QDRANT_* env vars (chopratejas#31)
Stage 2 of the Rust port: a `headroom_core::tokenizer` module mirroring the
Python `headroom.tokenizers` surface, with three backends behind a single
`Tokenizer` trait.

Backends, in dispatch order:

1. HuggingFace (`HfTokenizer`) — pure-Rust `tokenizers` crate loading any
   public `tokenizer.json`. Covers the gap between OpenAI (tiktoken) and the
   Anthropic/Gemini estimator: Cohere `command-*`, Llama-3.x, Mistral, Qwen,
   BERT, T5, etc. Construct from bytes or a file path; register against a
   model-name prefix via `register_hf` for automatic dispatch. No `hf-hub`
   auto-download yet — keeps networking, auth, and `~/.cache/huggingface` out
   of core. Longest-prefix wins; lookups are RwLock-protected.
2. Tiktoken (`TiktokenCounter`) — `tiktoken-rs` 0.11 BPE for OpenAI / o-series
   families. Byte-identical to Python `tiktoken` for ordinary text. Lazy
   shared `Arc<CoreBPE>` per encoding (o200k_base, cl100k_base, p50k_base,
   r50k_base).
3. Estimation (`EstimatingCounter`) — `chars / cpt` last-resort fallback.
   Matches Python's `max(1, int(len(text) / cpt + 0.5))` round-half-up
   formula (a self-review caught and fixed an earlier `ceil`-based version
   that diverged in the middle of the range, e.g. 5 chars at 4.0 cpt).
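
A sketch of the fallback formula (Rust chars() standing in for
Python's len() on text):

    // max(1, int(len(text) / cpt + 0.5)) — round-half-up, not ceil.
    fn estimate_tokens(text: &str, chars_per_token: f64) -> usize {
        let est = (text.chars().count() as f64 / chars_per_token + 0.5) as usize;
        est.max(1)
    }

    // 5 chars at 4.0 cpt: 5/4 + 0.5 = 1.75 -> 1 token.
    // A ceil-based version returns 2 — the divergence described above.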

Tests: 43 unit tests + 5 proptests; parity 40/40 byte-equal.
Bench: criterion baseline on small/medium/large inputs.
Workspace MSRV bumped 1.78 → 1.80 for `LazyLock`/`OnceLock`.

No proxy wiring. Library-only; production behavior unchanged.
The `qdrant-env-vars` change in d3c37d7 (PR chopratejas#266) added `qdrant_url` and
`qdrant_api_key` keys to the kwargs that `MemoryHandler` passes into
`DirectMem0Adapter.__init__`. The corresponding assertion in
`test_ensure_initialized_fast_paths_and_qdrant_variants` was missed in that
PR and has been failing on `main` ever since. Surfacing here because it
fails on every PR's CI; not caused by the Rust tokenizer work this branch
adds.

The two new keys are both `None` when the corresponding `HEADROOM_QDRANT_*`
env vars are unset, which is the case in this test.
Stage 2.1: closes the loop on the HuggingFace tokenizer story. Stage 2
shipped `HfTokenizer::from_bytes`/`from_file`, which required callers to
manage their own tokenizer.json files. This adds the third constructor:

    let t = HfTokenizer::from_pretrained("CohereForAI/c4ai-command-r-v01")?;
    register_hf("command-", t);

`from_pretrained` is a thin wrapper around the `hf-hub` crate's blocking
`ureq` API. First call downloads `tokenizer.json` to `~/.cache/huggingface/
hub` (or `$HF_HOME` if set); subsequent calls reuse the on-disk cache. Uses
the `main` revision; gated repos (Llama, Mistral) require `HF_TOKEN` in env
or `~/.cache/huggingface/token`.

Also adds `try_register_hf(prefix, repo)` as the obvious one-liner for
proxy startup code:

    let _ = try_register_hf("command-", "CohereForAI/c4ai-command-r-v01");
    let _ = try_register_hf("mistral-", "mistralai/Mistral-7B-v0.1");

Each call is independent — a download failure for one model (e.g. gated
without a token) does not affect others.

`HfTokenizerError` gains a new `Hub` variant so callers can distinguish
"couldn't fetch" from "fetched but malformed" — relevant when deciding
whether to retry, surface to the user, or fall back to the estimator.

Why blocking, not async: `from_pretrained` is called once at startup. A
sync API works from `main()`, from a `OnceLock` initializer, or from
`tokio::task::spawn_blocking` if a tokio caller needs it later. The async
hf-hub backend would force callers to await at startup, which doesn't fit
the `register_hf` registry pattern.

Why rustls, not native-tls: keeps the binary statically linkable for AWS
deploys (no system OpenSSL dependency).

Tests: a network-dependent integration test (`#[ignore]`d in CI; hits HF
for `gpt2`, ~1.4 MB) verifies the real download + load + count path. A
non-network negative test verifies that an invalid repo name surfaces as
`HfTokenizerError::Hub`, not a panic. 44 unit tests + 5 proptests +
1 doctest pass; parity stays 40/40 byte-equal.
…-hub

rust(stage 2.1): HfTokenizer::from_pretrained via hf-hub
Stage 3a: first real transform port. Faithful Rust port of
`headroom.transforms.diff_compressor` with byte-equal parity against all
20 recorded fixtures.

# Algorithm (matching Python)

1. Hand-rolled unified-diff parser (state machine over `diff --git`,
   `index`, `--- a/`, `+++ b/`, `@@`, mode/binary/rename markers, +/- /
   space lines, "other" lines like `\ No newline at end of file`).
2. File cap (`max_files=20`): when fired, sort by total changes (most
   first) and keep top N.
3. Per-file hunk cap (`max_hunks_per_file=10`): keep first + last + top
   relevance-scored middle, then resort by hunk-header start line to
   restore appearance order.
4. Relevance scoring: change-density base + user-query word overlap
   + priority patterns (ERROR / IMPORTANCE / SECURITY regexes —
   matches `error_detection.PRIORITY_PATTERNS_DIFF`).
5. Per-hunk context trim: keep `max_context_lines=2` lines either side
   of each `+`/`-` line.
6. CCR cache_key: `md5(original)[:24]` (matches
   `compression_store.CompressionStore.store`). Emitted only when
   compression saved >20% of lines.

Parity result: `[diff_compressor ] total=20 matched=20 skipped=0 diffed=0`.
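
The cache_key derivation from step 6, sketched with the md-5 crate:

    use md5::{Digest, Md5};

    // Lowercase MD5 hex of the original text, truncated to 24 chars —
    // matching Python's md5(original)[:24].
    fn ccr_cache_key(original: &str) -> String {
        let digest = Md5::digest(original.as_bytes());
        let hex: String = digest.iter().map(|b| format!("{b:02x}")).collect();
        hex[..24].to_string()
    }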

# Information preservation hardening

Three pass-through paths inherited from Python that we keep deliberately
(changing them would lose information):
- Below `min_lines_for_ccr` (50): return input unchanged.
- No diff sections parsed: return input unchanged.
- Below 20% compression savings: emit compressed output but no CCR
  marker (the original is the cheaper representation anyway).

Plus a parity-bound subtlety: `compressed_line_count` is captured BEFORE
the CCR retrieval marker is appended, both for the marker text
(`compressed to N`) and the result field. The output string therefore
ends up with one more line than the field reports — by design, matching
Python exactly. An off-by-one bug from recounting after appending the
CCR marker was caught and pinned by a synthetic 8-file diff test.

# Observability — the Rust escape hatch

Python's `DiffCompressionResult` has thin observability: input/output
line counts, additions/deletions, hunks_kept/removed, files_affected,
cache_key. The Rust port adds a sidecar `DiffCompressorStats` struct
with metrics Python doesn't emit:

- `files_dropped: Vec<String>` — names (old → new path) of files
  silently discarded by the `max_files` cap. Python loses these.
- `hunks_dropped_per_file: BTreeMap<String, usize>` — per-file hunk
  drops, stable iteration via `BTreeMap`.
- `context_lines_input` / `context_lines_kept` / `context_lines_trimmed`
  — directly proxies info loss from the context trim.
- `largest_hunk_kept_lines` / `largest_hunk_dropped_lines` — outlier
  detection (a single huge dropped hunk is much worse than many small ones).
- `parse_warnings: Vec<String>` — surfaces malformed input rather than
  dropping silently.
- `processing_duration_us` — latency budget.
- `cache_key_emitted` + `ccr_skipped_reason: Option<String>` — explicit
  signal for "we chose not to emit CCR and this is why".

A `tracing::info!(target: "diff_compressor", ...)` event is emitted on
every call, carrying these fields for OTel scraping in prod. The
sidecar struct is returned alongside via `compress_with_stats`; the
parity-only `compress` API discards it.

# Module layout

- `crates/headroom-core/src/transforms/mod.rs` — namespace, doc comment
  with the guiding principle ("information preservation > aggressive
  compression") so future ports inherit the philosophy.
- `crates/headroom-core/src/transforms/diff_compressor.rs` — full port
  (parser, scorer, hunk selector, context trimmer, formatter, CCR layer,
  stats, tracing).

# Dependencies added to headroom-core

- `md-5 = "0.10"` — for the CCR cache_key (matches Python MD5[:24]).
- `regex = "1"` — was a transitive dep via tokenizers; now a direct
  dependency for the hunk-header parser and priority patterns.

# Tests

6 unit tests covering pass-through paths, MD5 hex truncation, the
Python `split("\n")` line-count semantics, sidecar stats emission,
and a synthetic 8-file diff that locks the byte-equal behavior found
in the parity fixtures.
chopratejas and others added 23 commits April 27, 2026 13:01
…e-arg-list-too-long

ci(docker): fix Argument list too long when signing bake outputs
…ilder

Stage 3c.2 PR1 — the public extension surface that lets Enterprise
crates plug richer components into SmartCrusher without forking. Three
traits, one builder, behavior-equivalent on every parity fixture.

The three traits:

- Scorer (re-exported from `crate::relevance::RelevanceScorer`).
  Already a trait; OSS HybridScorer (BM25 + fastembed). Enterprise
  point: per-tenant Loop-trained scorer.

- Constraint (new in `traits.rs`). `must_keep(items, item_strings)
  -> Vec<usize>` — indices the allocator must keep regardless of
  saliency. OSS defaults: `KeepErrorsConstraint`,
  `KeepStructuralOutliersConstraint` — thin wrappers around the
  existing `detect_error_items_for_preservation` and
  `detect_structural_outliers` functions. Enterprise point:
  BusinessRuleConstraint, RegulatoryConstraint::HIPAA, and so on.

- Observer (new in `traits.rs`). `on_event(&CrushEvent)` fires once
  per top-level `crush()` call with strategy + sizes + elapsed_ns.
  OSS default: TracingObserver — writes to the `tracing` crate at
  debug, zero-cost when filtered out. Enterprise point:
  AuditObserver, MetricsObserver, LoopTrainingObserver.

The builder (`builder.rs`):

`SmartCrusherBuilder::new(config)` starts EMPTY (no scorer, no
constraints, no observers — explicit composition; "no silent
fallbacks" applied to the API surface). Methods stack:
with_scorer, add_constraint, add_default_oss_constraints (appends
KeepErrors + KeepStructuralOutliers), add_observer,
with_default_oss_setup (HybridScorer + default constraints +
TracingObserver in one call).

`SmartCrusher::new(config)` is preserved as the OSS default factory
(equivalent to `SmartCrusher::builder(config).with_default_oss_setup()
.build()`). Every existing caller (proxy, content_router,
integrations, evals) continues to work unchanged.
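
A hedged usage sketch of the builder surface described above (the
constructor arguments for the scorer and observer are assumptions):

    // Explicit composition — starts empty, methods stack.
    let crusher = SmartCrusher::builder(config)
        .with_scorer(Box::new(HybridScorer::default()))
        .add_constraint(Box::new(KeepErrorsConstraint))
        .add_constraint(Box::new(KeepStructuralOutliersConstraint))
        .add_observer(Box::new(TracingObserver))
        .build();
    // Equivalent OSS preset:
    // SmartCrusher::builder(config).with_default_oss_setup().build()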

Internal refactor:

`SmartCrusherPlanner` now holds `&[Box<dyn Constraint>]` and
iterates the configured constraints via a new
`apply_constraints(items, item_strings, keep)` method. Replaces four
hardcoded `detect_structural_outliers` +
`detect_error_items_for_preservation` call sites in the four plan
methods. With the OSS default constraint stack the must-keep set is
byte-identical to pre-PR1 — verified by all 17 parity fixtures.

`SmartCrusher` gained two fields: `constraints: Vec<Box<dyn
Constraint>>` and `observers: Vec<Box<dyn Observer>>`. New
`from_parts` constructor (#[doc(hidden)]) is the builder's exit
point.

What did NOT change in this PR:

- The internal planning algorithm (lossless tabular, saliency
  scoring, structured markers — those are PR 2/3/4).
- The string/number/object/mixed-array crusher paths in
  `crushers.rs` and the `prioritize_indices` helper in
  `orchestration.rs` — they still call the detection functions
  directly. Path B from the design doc: dict-array path is the
  primary value plugin point; lifting the leaf compressors can come
  later if customers ask.

Tests:

15 new tests across `traits.rs`, `constraints.rs`, `observer.rs`,
`builder.rs`. Coverage: each constraint trait method called and
pinned (errors flagged, structural outliers detected, item_strings
cache parity, empty-array safety); builder empty-build path,
default-OSS-stack append, add_constraint order preservation,
with_default_oss_setup yields expected counts, observer fires
end-to-end on a real crush; TracingObserver name stable, on_event
doesn't panic.

Verification:
- cargo test --workspace: 403 passed (was 388, +15 new), 0 failed.
- parity: 17/17 byte-equal for smart_crusher.
- make ci-precheck: green.

Stage 3c.2 PR sequence:
- PR 1 (this commit): three traits + builder.
- PR 2 (next): improvement A — TabularCompactor.
- PR 3: improvement B — saliency scoring + structured allocator.
- PR 4: improvement C — structured marker formatter.
- PR 5: ENT-A — `headroom-enterprise` scaffold.
Stage 3c.2 PR2. Adds an opt-in compaction stage that runs BEFORE the
existing lossy pipeline. When configured, it tries to losslessly
re-shape arrays of objects into a recursive Compaction IR and renders
that to bytes via a pluggable Formatter trait. When not configured
(default OSS), behavior is byte-equal with the pre-PR2 path — all 17
SmartCrusher parity fixtures stay green.

# What lands

- Recursive Compaction IR (`compaction/ir.rs`): Table / Buckets /
  OpaqueRef / Untouched. CellValue can hold a nested Compaction so
  multi-level cases (stringified-JSON inside cells, heterogeneous
  arrays bucketed by discriminator, opaque blobs CCR-substituted)
  share one tree shape.

- Cell classifier (`compaction/classifier.rs`): per-cell decision —
  Scalar / JsonObject / JsonArray / StringifiedJson(parsed) /
  Opaque(kind). Conservative: in doubt, return Scalar.

- TabularCompactor (`compaction/compactor.rs`): array → IR. Handles
  uniform-nested flattening into dotted columns ("meta.region",
  "meta.tier"), stringified-JSON parsing + recursion, opaque-blob
  CCR-substitution (12-char SHA-256 prefix), and heterogeneous
  bucketing by discriminator. Falls through to a sparse Table when
  no clean discriminator exists, so we always do better than the
  lossy path for object arrays.

- Formatter trait (`compaction/formatter.rs`) + two impls:
  - JsonFormatter: structured JSON for debugging / programmatic use.
  - CsvSchemaFormatter: [N]{col:type,col:type} declaration + CSV
    rows. Steals TOON's row-count-and-shape declaration without
    adopting TOON's bespoke escaping. CSV is the format LLMs are
    strongest at — every model has seen millions of examples in
    training. >30% smaller than raw JSON serialization on tabular
    fixtures (a rendering is sketched after this list).

- Wiring (`crusher.rs`, `builder.rs`): SmartCrusher gains an optional
  compaction stage. Builder methods with_compaction(stage) and
  with_default_compaction() opt in. CrushArrayResult gets two new
  fields (compacted, compaction_kind) populated only when the stage
  runs. strategy_info becomes compaction kind when compaction won.
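
For intuition, a hypothetical CsvSchemaFormatter rendering of a
three-row object array (column names from the dotted-flattening
example above; the exact type labels are assumptions):

    [3]{id:int,name:str,meta.region:str}
    1,alice,us-east
    2,bob,eu-west
    3,carol,us-east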

# Why this design

- Three-trait extension surface preserved. PR1 added Constraint /
  Observer / Scorer; PR2 adds Formatter as the fourth pluggable
  seam. Enterprise plug-ins land cleanly without forking core.

- Empty default builder rule held. SmartCrusherBuilder::new() still
  produces a no-compaction crusher. with_default_compaction() is
  the explicit OSS preset. No silent fallbacks.

- Recursive IR was the unlock. A flat table-of-scalars IR would have
  collapsed the moment a cell held nested JSON. Making
  CellValue::Nested hold another Compaction made stringified-JSON
  parsing + heterogeneous bucketing + opaque substitution all share
  one renderer pass.

- CCR substitution for opaque cells. Strings classified as
  base64/HTML/long-opaque become structured markers keyed by 12-char
  SHA-256 prefix. The full bytes round-trip via the CCR store (PyO3
  bridge owns actual storage; this PR emits the marker and computes
  the hash).

# Tests

- 60 new unit tests across IR / classifier / compactor / formatter /
  wiring (448 total in headroom-core, was 388).
- 17/17 SmartCrusher parity fixtures byte-equal — default-config
  path completely unchanged.
- 21/21 Python parity tests pass via PyO3 bridge.
- make ci-precheck green: ruff, mypy, cargo fmt/clippy/test
  (1.95.0), commitlint.

# Deferred to follow-up PRs

- ToonFormatter (small; ship after eval harness compares formats)
- Diff/code detection in cells → routes to DiffCompressor /
  CodeCompressor (coupled to ContentRouter Phase 4)
- Budget-aware row dropping (Constraint-respecting) when rendered
  size exceeds budget
- Format A/B eval harness
- ContentRouter unification (Phase 4)

Modules: crates/headroom-core/src/transforms/smart_crusher/compaction/*, builder.rs, crusher.rs, mod.rs
…r1-traits

feat(rust): SmartCrusher extension surface — Constraint, Observer, Builder
…r2-tabular-compactor

feat(rust): SmartCrusher PR2 — lossless-first tabular compaction
…estoration

Stage 3c.2 PR4. Restores Python's CCR-Dropped semantics on the lossy
path (the cornerstone reversibility guarantee that the port had
silently dropped) and flips the OSS default to lossless-first with a
configurable savings threshold.

# The user-visible behavior

Default `SmartCrusher::new()` now runs:

  1. Try lossless compaction.
  2. If savings >= `lossless_min_savings_ratio` (default 0.30), ship
     it — `compacted` populated, `ccr_hash = None`, nothing dropped.
  3. Otherwise fall through to the lossy path — drop rows AND
     populate `ccr_hash` so the runtime can cache the full original
     for tool-call retrieval.

**No data is ever lost.** "Lossy" means "compressed view inline; full
payload retrievable via CCR cache" — same semantics as Python's
SmartCrusher with CCR enabled. The runtime (PyO3 bridge / proxy
server) owns the cache; this crate computes the hash and emits a
marker so the prompt knows where to look.

# What changed

- `SmartCrusherConfig.lossless_min_savings_ratio: f64` (default 0.30).
  Single configurable knob — Enterprise overrides as needed. Below
  the threshold, lossless declines and lossy + CCR runs.

- `SmartCrusher::new(cfg)` flips to include the compaction stage by
  default. `SmartCrusher::without_compaction(cfg)` is the explicit
  opt-out for callers / fixtures that depend on pre-PR4 behavior.

- `crush_array` rewritten:
  - Lossless-first dispatch with savings-ratio gate
  - Lossy path now hashes the full original (12-char SHA-256 prefix)
    and emits a CCR-Dropped marker in `dropped_summary` whenever
    rows are dropped
  - `ccr_hash` field populated whenever rows were dropped
  - `process_value` substitutes the compacted string into the JSON
    tree when lossless wins, so `crush()` output reflects the win

- PyO3 bridge: `SmartCrusher.without_compaction()` static method;
  `SmartCrusherConfig` exposes the new `lossless_min_savings_ratio`
  field; Python `SmartCrusher` wrapper accepts `with_compaction=True`
  (default) and routes to the right Rust constructor.

- Parity harness: legacy 17 fixtures use `without_compaction()` so
  byte-equal coverage of the lossy path is preserved.

# Tests

- Rust: 281/281 smart_crusher unit tests pass (was 277). Six new
  tests cover: lossless wins above threshold, lossy falls through
  below threshold, CCR hash deterministic + input-dependent, lossy
  without compaction emits CCR, passthrough paths don't emit CCR,
  without_compaction yields no compacted field.
- Python parity: 21/21 (legacy fixtures via without_compaction).
- Python lossless default smoke: 3/3 new tests in
  test_smart_crusher_lossless_default.py.
- Python retention: 21/21 (updated to opt into the lossy path
  explicitly since their semantics target row-level retention).
- make ci-precheck green.

Modules:
  crates/headroom-core/src/transforms/smart_crusher/{config,crusher}.rs
  crates/headroom-parity/src/lib.rs
  crates/headroom-py/src/lib.rs
  headroom/transforms/smart_crusher.py
  tests/test_quality_retention.py
  tests/test_transforms/test_smart_crusher_{lossless_default,rust_parity}.py
…rom main)

Stage 3c.2 PR3a-redux. The original PR3a (chopratejas#286) was merged on GitHub
on 2026-04-27 but its content never reached main — its parent merge
(chopratejas#285 PR2) was squash-merged, which drops commits stacked on top of
the source branch. The walker module disappeared along with the
intermediate commits.

This PR re-lands `walker.rs` and updates the `compaction/mod.rs`
re-exports — same content as commit 4db4f46 from the lost branch.
No new functionality, no behavior change for existing crusher paths.

# What the walker does (recap)

Recursive descent over any JSON value:

  match value {
    Object(m) => recurse into each field
    Array(xs) => recurse into items, then try TabularCompactor on the array
    String(s) => parse-as-JSON-and-recurse / CCR-substitute / leave
    scalar    => unchanged
  }

Compactable spots become inline strings holding the rendered bytes.
The wrapping JSON structure is preserved.

# Why chore, not feat

This is a re-land of previously-merged-but-lost code. No new feature
shipped — fixing a regression caused by the stacked-PR squash-merge
accident. Using `chore` keeps the version from bumping for what is
effectively a parity restoration.

# Tests

- 13 walker unit tests pass.
- 462/462 headroom-core lib tests overall.
- 17/17 SmartCrusher parity fixtures byte-equal (untouched).
- 185/185 Python tests pass.
- make ci-precheck green.

Module: crates/headroom-core/src/transforms/smart_crusher/compaction/walker.rs
PR4 flipped the OSS default to lossless-first. The MCP server and
LangChain eval tests assert wire-format and row-level retention
properties that belong to the lossy path; the lossless path
substitutes a CSV+schema STRING in place of arrays, which is great
for LLM prompts but wire-incompatible with consumers that iterate
the JSON.

Pin both call sites to the lossy + CCR-Dropped path via
`with_compaction=False`. Same retention semantics as Python's
pre-PR4 SmartCrusher behavior — full payload still cached via CCR
for tool retrieval; nothing is lost.

Modules:
- headroom/integrations/mcp/server.py — runtime MCP wrapper
- tests/test_integrations/langchain/test_evals.py — eval fixture

CI run that surfaced these: actions/runs/25025161868
Two more tests broke after PR4's lossless-first default flip — same
root cause as the langchain/MCP fixes (chopratejas#287's first patch):

- tests/test_proxy_ccr.py — TestEndToEndTOINIntegration asserts
  CCR-cache state after compression. Lossless wins on the test
  fixture and skips CCR entirely (nothing dropped). Pin to lossy
  via with_compaction=False so the cache assertion holds.

- tests/test_text_compressors.py — TestSmartCrusherTextIntegration
  asserts JSON-array shape round-trip. Lossless substitutes a
  CSV+schema string. Pin to lossy + JSON shape via
  with_compaction=False. Lossless coverage exists separately in
  test_smart_crusher_lossless_default.py.

Same pattern, same fix. CI run that surfaced these:
actions/runs/25025876328
…r4-lossless-first-default

feat(rust): SmartCrusher PR4 — lossless-first default + CCR-Dropped restoration
…r3a-redux-walker

chore(rust): re-land DocumentCompactor walker (squash-merge lost it from main)
…paque strings

Stage 3c.2 PR5. Closes the gap between the public `crush()` API and
the standalone `DocumentCompactor` walker. Augments
`SmartCrusher::process_value`'s String branch to mirror the walker's
two String cases:

  1. Stringified-JSON containers: parse, recurse via process_value,
     re-emit. The wrapping field stays a string but its contents are
     processed end-to-end. Special-cases the lossless-compaction
     path (when recursion returns Value::String) to avoid double-
     JSON-encoding.

  2. Opaque blobs (long base64 / HTML / long-text strings):
     substitute with `<<ccr:HASH,KIND,SIZE>>` markers — same format
     as walker.rs and PR4's lossy CCR-Dropped markers, so downstream
     consumers can pattern-match regardless of which path emitted.

Typed as chore so the package version doesn't bump.

# Why this matters

After PR4, calling `SmartCrusher::new()` and then `crush(json_blob)`
gets lossless-first compaction on top-level arrays. But for tool
outputs that wrap a JSON-encoded payload INSIDE a string field, or
contain opaque blobs (base64-encoded files, HTML chunks), the public
API was a no-op. The walker handled these but only via its own
entry point; users calling `crush()` never hit it.

PR5 brings walker semantics into the public path. Same vision the
user described early in this stage:

> JSON within JSON, opaque payloads, multi-layered

now works for `crush()` callers without any extra setup.

# Implementation (~80 lines)

- `process_string` method on SmartCrusher: dispatches to JSON-recurse
  / CCR-substitute / passthrough.
- `try_parse_json_container_str`: cheap parse-only-containers helper.
- `ccr_marker_for_string` + `opaque_kind_label` + `humanize_bytes`:
  marker formatting matching walker.rs byte-for-byte.

Reuses every PR2 primitive — no new traits, no new IR, no new
abstractions.
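
A hedged reconstruction of the marker helper (the 12-char SHA-256
prefix and the <<ccr:HASH,KIND,SIZE>> shape come from the text above;
the kind label and size rendering are assumptions):

    use sha2::{Digest, Sha256};

    fn ccr_marker_for_string(s: &str, kind: &str) -> String {
        let digest = Sha256::digest(s.as_bytes());
        let hex: String = digest.iter().map(|b| format!("{b:02x}")).collect();
        format!("<<ccr:{},{},{}B>>", &hex[..12], kind, s.len())
    }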

# Tests (6 new)

- Short string passthrough (no false positives)
- Stringified-JSON array recurses (50 items inside a string field)
- Opaque base64 blob → CCR marker substitution
- Top-level plain text passthrough (crush(plain_text) unchanged)
- Short JSON-looking strings unchanged (no false-positive opaque)
- Helper parses only containers, not bare scalars

303/303 smart_crusher tests pass (was 297 → +6 PR5).
185/185 Python tests. make ci-precheck green.

Module: crates/headroom-core/src/transforms/smart_crusher/crusher.rs
CcrStore trait + InMemoryCcrStore (1000 entries, 5-min TTL, FIFO
eviction, idempotent re-store) live at the crate root. SmartCrusher's
lossy crush_array path now actually stashes the full original [items]
canonical-JSON into the configured store keyed by the same ccr_hash it
embeds in the prompt marker -- closing the no-data-loss contract that
was previously hash-only.

PyO3 surface:
- crusher.crush_array_json(items_json) -> dict with ccr_hash + kept items
- crusher.ccr_get(hash) -> Optional[str] for retrieval
- crusher.ccr_len() -> int for telemetry

Python shim passes both through. Default constructors enable the store
(matches Python's CCR-enabled default); without_compaction() also gets
it because CCR is a contract, not an opt-in extra.
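
The contract, sketched (signatures assumed, not the crate's literal
trait):

    pub trait CcrStore: Send + Sync {
        /// Idempotent: re-storing an existing hash is a no-op.
        fn put(&self, hash: &str, canonical_json: &str);
        /// None once evicted (FIFO past capacity) or past the TTL.
        fn get(&self, hash: &str) -> Option<String>;
        fn len(&self) -> usize;
    }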

Tests proving compress -> store -> retrieve -> reconstruct:
- 7 unit tests in ccr.rs (put/get/eviction/expiry)
- 9 Rust integration tests (crates/headroom-core/tests/ccr_roundtrip.rs)
- 10 Python tests including 4 explicit before/after element-equality
  assertions through both the native PyO3 surface and the Python shim

Plugin manifest versions auto-bumped by the sync-plugin-versions
pre-commit hook (unrelated to CCR but co-resident in the working tree).
….toml

The action was set to @stable, which installs whatever the latest
stable is (1.95.0 right now). Then maturin invokes cargo, which reads
rust-toolchain.toml and re-resolves to "1.95.0 + clippy + rustfmt".
rustup treats stable and 1.95.0 as distinct toolchain identities and
refuses the second install with:

  failed to install component 'clippy-preview-x86_64-unknown-linux-gnu',
  detected conflict: 'bin/cargo-clippy'

This was intermittent across the matrix (only test (3.10) tripped on
the most recent run; others got lucky on cache state). Pinning the
action ref to 1.95.0 makes both sides ask for the exact same toolchain
identity, so the second install is a no-op and the conflict can't fire.

Bump procedure stays the same: when rust-toolchain.toml's channel
changes, update these refs in lock-step.

Plugin manifests auto-bumped 0.11.0 -> 0.13.2 by sync-plugin-versions
hook (unrelated to the workflow fix).
…r5-walker-integration

chore(rust): walker semantics in process_value — stringified-JSON + opaque strings
…r7-ccr-store

chore(rust): SmartCrusher CCR storage layer + roundtrip verification
Closes four gaps in the Rust SmartCrusher pipeline that, together,
wire CCR storage end-to-end so the LLM can actually retrieve dropped
data:

1. CCR-Dropped marker is now injected into process_value's lossy-path
   output as a sentinel object {"_ccr_dropped": "<<ccr:HASH N_rows_offloaded>>"}
   appended to the kept-items array. Previously the store held the
   original but no pointer reached the prompt -- the retrieval contract
   was data-on-server, no-way-to-ask. Sentinel-as-object preserves the
   array-of-dicts shape so downstream iteration with x.get(...) keeps
   working.

2. Walker / process_value drift removed. process_value gains a
   Value::String arm that handles stringified-JSON containers (parse,
   recurse, re-encode) and opaque blobs (CCR marker + store) -- same
   semantics walker.rs has always had, now reachable from the main
   crush() pipeline.

3. Opaque-string CCR now stores originals. DocumentCompactor gains an
   Option<Arc<dyn CcrStore>> field; emit_opaque_ccr_marker calls
   store.put when one is configured. Same hash regardless of store
   presence -- runtime contract is stable across configurations.
   Same wiring is shared between walker.rs and process_value via the
   extracted helper.

4. PyO3 surface adds SmartCrusher.compact_document_json(doc_json) ->
   compacted-json string. Routes through the crusher's existing CCR
   store, so ccr_get resolves both row-drop and opaque-string hashes.

Tests:
- 5 new Rust integration tests in ccr_roundtrip.rs (marker visibility,
  nested-array marker, opaque-string roundtrip, stringified-JSON
  recursion, walker-with-store)
- 4 new Python tests covering the marker visible-to-LLM contract via
  both the native PyO3 surface and the Python shim
- 5 legacy parity fixtures re-recorded (dict_array_*, duplicate_dicts_40)
  -- their lossy outputs now carry the sentinel; Rust + Python both
  match the new bytes (parity-run smart_crusher: 17/17)
The PR8 marker injection appends a sentinel object
{"_ccr_dropped": "<<ccr:HASH N_rows_offloaded>>"} to the kept-items
array on the lossy path so the LLM sees the retrieval pointer in the
prompt. Tests that iterate compressed arrays via subscript access
(e["level"], r["status"], i["labels"], m["text"]) hit KeyError on
the sentinel because it doesn't share the record schema.

Same root cause as the test_quality_retention fixes in PR8 -- these
integration tests were left out of that pass.

Ship a public helper headroom.transforms.smart_crusher.strip_ccr_sentinels
so tests can use it cleanly: `for e in strip_ccr_sentinels(entries):`
and production callers iterating compressed output get a single
canonical filter instead of inlining the _ccr_dropped check.

The 7 previously-failing tests in PR chopratejas#292 CI now pass:
  - langchain test_100_percent_errors_preserved_logs
  - langchain test_errors_preserved_with_many_errors
  - langchain test_search_results_with_query_term
  - mcp test_all_log_errors_preserved
  - mcp test_slack_significant_compression_with_content
  - mcp test_database_error_status_preserved
  - mcp test_github_bugs_partial_preservation

753 tests across the integration + transforms + retention suites pass
locally. Plugin manifests auto-bumped 0.13.3 -> 0.13.4 by the
sync-plugin-versions hook (unrelated to this fix).
…r8-marker-injection-walker-unify

chore(rust): SmartCrusher CCR marker injection + walker unification
…, single-serialize CCR write

Three orthogonal hot-path fixes targeting concurrent-request throughput.
Each is independently bench-measured below; the proxy hot path benefits
from all three at once.

== 1. PyO3 GIL release on heavy compute ==

PyO3 methods (crush, smart_crush_content, crush_array_json,
compact_document_json, compress, compress_with_stats) used to hold the
GIL across the entire Rust call. Result: a 100ms compress() blocked
EVERY other Python thread for 100ms — multi-worker uvicorn deployments
serialized through SmartCrusher.

Wrap each compute call in `py.allow_threads(|| ...)`. Inputs (`&str`
from Python) are copied to owned `String` first because PyO3 ties them
to the GIL hold. PyDict construction stays on the GIL side.

Measured: 4 Python threads each running 20 crushes:
  before (GIL held): ~3.3s wall    (serialized — equivalent to 4×0.83s)
  after (allow_threads): 826ms wall (4.01x speedup, perfect parallel)
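
The pattern, in a self-contained sketch (heavy_compress is a stand-in
for the real compute):

    use pyo3::prelude::*;

    fn heavy_compress(input: &str) -> String {
        input.to_owned() // placeholder for the CPU-bound work
    }

    #[pyfunction]
    fn compress(py: Python<'_>, text: &str) -> PyResult<String> {
        let owned = text.to_owned(); // &str borrows are tied to the GIL hold
        Ok(py.allow_threads(move || heavy_compress(&owned)))
    }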

== 2. CcrStore: Mutex<HashMap> -> DashMap-backed sharded ==

Single Mutex was the dominant bottleneck under multi-worker load — every
put/get serialized through one lock. Replace with DashMap (sharded
concurrent map, lock-free reads within a shard) plus a separate
small Mutex<VecDeque> for FIFO insertion-order eviction. Reads of
distinct keys never contend; writes only contend during the brief
order-queue push or capacity-sweep.

A/B bench (200 mixed put/get ops × N threads, in benches/ccr_store.rs):
  Threads | DashMap   Legacy Mutex  Speedup
  -------------------------------------------
       1  |   63 µs        71 µs       1.13x
       2  |   98 µs       194 µs       2.0x
       4  |  178 µs       707 µs       4.0x
       8  |  342 µs      1267 µs       3.7x

Legacy degrades ~linearly with thread count; DashMap stays near-flat
per-thread. Real multi-worker scaling.
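
The store's shape, sketched (TTL bookkeeping elided; field and method
names are illustrative):

    use std::collections::VecDeque;
    use std::sync::Mutex;
    use dashmap::DashMap;

    struct ShardedCcrStore {
        map: DashMap<String, String>,   // sharded; reads take no global lock
        order: Mutex<VecDeque<String>>, // FIFO insertion order for eviction
        capacity: usize,
    }

    impl ShardedCcrStore {
        fn put(&self, hash: String, payload: String) {
            if self.map.insert(hash.clone(), payload).is_none() {
                let mut order = self.order.lock().unwrap();
                order.push_back(hash);
                while order.len() > self.capacity {
                    if let Some(oldest) = order.pop_front() {
                        self.map.remove(&oldest); // FIFO eviction
                    }
                }
            }
        }

        fn get(&self, hash: &str) -> Option<String> {
            self.map.get(hash).map(|entry| entry.value().clone())
        }
    }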

== 3. Single-serialize the lossy CCR payload ==

The lossy `crush_array` path used to serialize the full array TWICE:
once in `hash_array_for_ccr` (allocates `Value::Array(items.to_vec())`,
deep-clones every Value subtree, then serializes), and a second time
in the store-write site. For a 50-item dict array that's ~MB of
allocator pressure per crushed array.

Introduce `canonical_array_json` (serializes `&[Value]` directly — same
bytes as `Value::Array(items.to_vec())` but no wrapper allocation +
no tree clone), call it ONCE per lossy path, then both hash and store
from those same bytes. Hash-format stable — all 17 parity fixtures
match byte-for-byte.
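
The essence of the fix (serde_json serializes a &[Value] slice as an
array, byte-identical to Value::Array of the same items, with no
wrapper allocation or subtree clone):

    use serde_json::Value;

    fn canonical_array_json(items: &[Value]) -> String {
        // Value maps have string keys, so serialization cannot fail.
        serde_json::to_string(items).expect("Value serialization is infallible")
    }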

== Tests ==

- 8 ccr.rs unit tests including a new concurrent-stress test (8 threads
  × 200 puts/gets, every key readable afterwards)
- 14 ccr_roundtrip integration tests stay green
- parity-run smart_crusher: 17/17 fixtures match
- 479 lib + 14 integration + 185 Python tests all pass
- New benches/ccr_store.rs runs the A/B and is committed for regression
  visibility

== Dependencies added ==

- dashmap v6  (mature, widely-used in tokio/linkerd ecosystem)
…il-dashmap-single-serialize

perf(rust): tier-1 multi-worker wins — GIL release, sharded CCR store, single-serialize CCR write
Co-authored-by: JerrettDavis <2610199+JerrettDavis@users.noreply.github.com>
JerrettDavis marked this pull request as ready for review April 28, 2026 17:01
JerrettDavis merged commit f6ddbcf into native-e2e-expansion Apr 28, 2026
JerrettDavis deleted the copilot/update-pr-branch-from-main branch April 28, 2026 17:01
JerrettDavis pushed a commit that referenced this pull request Apr 30, 2026
`headroom/transforms/tag_protector.py` was a regex-driven scan-and-
replace loop that ran on every kompress call from ContentRouter
(`content_router.py:1089`). The Python implementation had five real
bugs we now fix in the port — the most consequential being a
`str.replace(.., .., 1)` first-occurrence-replace bug that silently
collapsed two identical custom-tag blocks in the same input to a
single placeholder + a stray duplicate of the second block.

# Bug fixes (each pinned by a `fixed_in_3e4` test)

* **#1: O(n²) on nested custom tags.** Python's `while changed` loop
  restarted a full regex scan after every replacement. Rust walks
  once in linear time on input length.
* **#2: First-occurrence replace bug.** `result.replace(orig, ph, 1)`
  replaces the FIRST textual match, not the matched offset. Two
  identical custom-tag blocks collapsed to one placeholder + a stray
  duplicate of the second block. The Rust walker stitches output by
  offset so distinct blocks always get distinct placeholders.
* **#3: Silent 50-iteration cap.** Python had a hard `max_iterations
  = 50` safety limit that quietly truncated tag protection on deeply
  nested input. The Rust walker is bounded by input length only.
* **#4: Self-closing pass duplicate-replace risk.** Python ran a
  second loop with the same `replace_first` bug for self-closers.
  Rust handles self-closers in the same single pass.
* **#5: Placeholder collision.** If the input contained a literal
  `{{HEADROOM_TAG_…}}` substring, Python silently let the collision
  break restoration. Rust salts the prefix and reports it in stats.

# Architecture

Two-phase walker:
* Phase 1 (`identify_spans`): linear scan over input bytes, hand-
  rolled tag-open / tag-close lexer (no regex). Maintains a stack of
  open custom tags; on a matching close, collapses the inner span
  into a single `Span { start, end, Block }`. Self-closing custom
  tags become `Span { ..., SelfClosing }` immediately. Marker-only
  mode (`compress_tagged_content=true`) emits Open/CloseMarker spans
  instead. Orphan opens stay un-protected (matches Python behavior).
  Orphan closes are emitted verbatim and counted in stats.
* Phase 2 (`emit_output`): walks `text` once, splicing placeholders
  for span ranges and copying everything else verbatim. Offset-based,
  never `str.replace`.
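
Phase 2's splice, sketched (Span fields per Phase 1; placeholder text
assumed). Because output is stitched by offset, two identical blocks
can never collapse onto one placeholder:

    struct Span { start: usize, end: usize, placeholder: String }

    fn emit_output(text: &str, spans: &[Span]) -> String {
        let mut out = String::with_capacity(text.len());
        let mut cursor = 0;
        for sp in spans { // sorted, non-overlapping
            out.push_str(&text[cursor..sp.start]);
            out.push_str(&sp.placeholder);
            cursor = sp.end;
        }
        out.push_str(&text[cursor..]);
        out
    }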

PyO3 surface: `protect_tags`, `restore_tags`, `is_html_tag`,
`known_html_tag_names`. The Python shim retires the regex internals
and re-exports `KNOWN_HTML_TAGS` (rebuilt from the Rust list) +
`_is_html_tag` for backwards compat with `content_router.py` and the
existing test surface.

# Test plan

* 25 Rust unit tests including 4 `fixed_in_3e4_*` bug-fix tests
* 27 Python tests (23 existing + 4 new `fixed_in_3e4` parity tests)
* 5 integration tests in `test_tag_protection_integration.py` pass
* `make ci-precheck` clean