feat(rust): self-lift provekit-canonicalizer (substrate dogfoods itself) #38

Merged
TSavo merged 1 commit into main from feat/self-lift-canonicalizer
May 2, 2026

Conversation

@TSavo
Owner

@TSavo TSavo commented May 2, 2026

Sir's framing

The framework writes lifters and uses them on test fixtures; this is the first time it lifts itself. The canonicalizer is the right starting point because every CID in the system hashes over JCS-canonical bytes the canonicalizer produces. If the canonicalizer is wrong, everything is wrong; it is the substrate's first invariant.

What the Rust lifter actually does

provekit-lift (crate implementations/rust/provekit-lift/) walks a Rust workspace, parses every .rs file with syn, dispatches each parsed file to nine adapter crates, collects ContractDecl values, mints each as a signed contract memento via provekit_claim_envelope::mint_contract, and bundles them into a single .proof envelope addressed by its BLAKE3-512 CID.

Annotation sources the lifter reads:

  • proptest! { #[test] fn ... } blocks (adapter proptest)
  • #[contracts::requires] / #[contracts::ensures] attrs (contracts)
  • kani::proof / prusti / creusot / flux / verus attrs (one adapter each)
  • #[quickcheck] (quickcheck)
  • Plain #[test] and #[tokio::test] functions whose bodies contain assert!/assert_eq!/assert_ne!/assert_matches! (rust-tests Layer 0)
  • for x in lo..hi { assert!(...) } bounded loops, helper-function inlining, multi-assertion characterization (rust-tests Layer 2)
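The walk-and-dispatch flow described above can be pictured with a small sketch. Everything here is hypothetical and for illustration only: the `Adapter` trait, `Outcome`, `Tally`, and the toy string matching stand in for the real adapter crates, which operate on `syn` syntax trees and may be shaped quite differently. The one thing the sketch is meant to show is how per-adapter "seen / lifted / skipped" counters fall out of offering each candidate to each adapter.

```rust
// Hypothetical types for illustration; not the real provekit-lift API.

/// What an adapter decided about one candidate it recognised.
#[derive(Debug, PartialEq)]
enum Outcome {
    Lifted(String),        // informal rendering of the lifted contract
    Skipped(&'static str), // structured skip reason
}

/// Per-adapter counters, mirroring the "seen / lifted / skipped" summary.
#[derive(Debug, Default, PartialEq)]
struct Tally {
    seen: usize,
    lifted: usize,
    skipped: usize,
}

trait Adapter {
    fn name(&self) -> &'static str;
    /// `None` means "not my annotation"; the candidate is not counted as seen.
    fn try_lift(&self, candidate: &str) -> Option<Outcome>;
}

/// Toy adapter: recognises `assert_eq!(l, r)` and lifts only identifier pairs.
struct ToyTests;

fn is_ident(s: &str) -> bool {
    !s.is_empty()
        && !s.starts_with(|c: char| c.is_ascii_digit())
        && s.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

impl Adapter for ToyTests {
    fn name(&self) -> &'static str {
        "toy-tests"
    }

    fn try_lift(&self, candidate: &str) -> Option<Outcome> {
        let inner = candidate.strip_prefix("assert_eq!(")?.strip_suffix(')')?;
        match inner.split_once(", ") {
            Some((l, r)) if is_ident(l) && is_ident(r) => {
                Some(Outcome::Lifted(format!("{l} = {r}")))
            }
            _ => Some(Outcome::Skipped("expression shape outside whitelist")),
        }
    }
}

/// Offer every candidate to every adapter; collect contracts plus tallies.
fn dispatch(
    adapters: &[&dyn Adapter],
    candidates: &[&str],
) -> (Vec<String>, Vec<(&'static str, Tally)>) {
    let mut contracts = Vec::new();
    let mut tallies = Vec::new();
    for adapter in adapters {
        let mut tally = Tally::default();
        for candidate in candidates {
            match adapter.try_lift(candidate) {
                Some(Outcome::Lifted(contract)) => {
                    tally.seen += 1;
                    tally.lifted += 1;
                    contracts.push(contract);
                }
                Some(Outcome::Skipped(_)) => {
                    tally.seen += 1;
                    tally.skipped += 1;
                }
                None => {}
            }
        }
        tallies.push((adapter.name(), tally));
    }
    (contracts, tallies)
}

fn main() {
    let candidates = ["assert_eq!(a, b)", "assert_eq!(encode_jcs(&v), \"{}\")", "let x = 1;"];
    let adapters: [&dyn Adapter; 1] = [&ToyTests];
    let (contracts, tallies) = dispatch(&adapters, &candidates);
    for (name, t) in &tallies {
        println!("adapter `{name}`: seen {}, lifted {}, skipped {}", t.seen, t.lifted, t.skipped);
    }
    println!("{} contract(s): {:?}", contracts.len(), contracts);
}
```

In this toy model `seen` is always `lifted + skipped`, which is the same relationship the real run reports (82 = 13 + 69 for rust-tests).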

Outputs: a <cid>.proof file (deterministic CBOR) under the requested target dir. CLI flags:

  • provekit-lift --workspace <dir> --target-dir <out> (direct invocation)
  • cargo provekit-lift ... (Cargo subcommand form)
  • provekit-lift --rpc (NDJSON-on-stdio plugin mode)

The lifter has no --config flag; it picks up every .rs file under --workspace.

How I ran it

Exact command:

implementations/rust/target/release/provekit-lift \
  --workspace implementations/rust/provekit-canonicalizer \
  --target-dir .provekit/self-lifts/canonicalizer

Output (verbatim):

provekit-lift: scanned 9 .rs files
  adapter `proptest`: seen 0, lifted 0, skipped 0
  adapter `contracts`: seen 0, lifted 0, skipped 0
  adapter `kani`: seen 0, lifted 0, skipped 0
  adapter `prusti`: seen 0, lifted 0, skipped 0
  adapter `creusot`: seen 0, lifted 0, skipped 0
  adapter `flux`: seen 0, lifted 0, skipped 0
  adapter `quickcheck`: seen 0, lifted 0, skipped 0
  adapter `verus`: seen 0, lifted 0, skipped 0
  adapter `rust-tests`: seen 82, lifted 13, skipped 69
  adapter `rust-tests-layer2`: seen 11, lifted 0, skipped 11
provekit-lift: wrote .provekit/self-lifts/canonicalizer/blake3-512:79ef1067621f7a3bb3ad31acd157be9ec9e087eb6aa585acd9ea6c9d0965adb02e7f1037a3146dcbc3f6a4c284796d6eadee7e0265d3eeab75791b79628632de.proof (13 members)

Determinism: confirmed across two runs into separate temp dirs (default seed [0x42; 32]); both produced the same CID byte for byte.
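Since the proof file is named after its CID, a determinism check across two runs reduces to comparing two file names. The sketch below is an illustration, not part of the PR: the helper names are hypothetical, and it assumes the hex part is lowercase, which matches the CID printed above but is not stated as a rule anywhere in this description.

```rust
// Hypothetical helpers; names and the lowercase-hex rule are assumptions.

const PREFIX: &str = "blake3-512:"; // 11 characters
const HEX_LEN: usize = 128; // 512 bits rendered as hex

/// True when `cid` is the prefix followed by exactly 128 lowercase hex digits.
fn is_well_formed_cid(cid: &str) -> bool {
    match cid.strip_prefix(PREFIX) {
        Some(hex) => {
            hex.len() == HEX_LEN
                && hex.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
        }
        None => false,
    }
}

/// Recover the CID from a `<cid>.proof` file name, if it is well formed.
fn cid_from_proof_name(file_name: &str) -> Option<&str> {
    let cid = file_name.strip_suffix(".proof")?;
    is_well_formed_cid(cid).then_some(cid)
}

/// Two runs agree when both file names carry the same well-formed CID.
fn same_cid(run_a: &str, run_b: &str) -> bool {
    match (cid_from_proof_name(run_a), cid_from_proof_name(run_b)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    }
}

fn main() {
    // Synthetic CID for the demo; substitute the file names from two real runs.
    let name = format!("{PREFIX}{}.proof", "0".repeat(HEX_LEN));
    println!("well formed: {}", cid_from_proof_name(&name).is_some());
    println!("runs agree:  {}", same_cid(&name, &name));
}
```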

What got lifted (13 contract mementos)

Each name is <test_fn>::<assert_index>. Atom shape is a single binary comparison whose operands are identifiers, integer/string literals, or single-arg ctor calls.

| Contract | Asserts (informally) |
| --- | --- |
| prefix_constant_matches_spec::0 | BLAKE3_512_PREFIX = PREFIX (the spec literal "blake3-512:") |
| cid_string_form_for_empty_is_well_known::0 | the pinned BLAKE3-512 empty-input vector |
| cid_regex_compliance::0 | count = 128 (hex character count) |
| deterministic_across_calls::0 | a = b (identity of two same-input calls) |
| cid_distinguishes_byte_strings_from_text::0 | blake3_512_hex(s) = blake3_512_of(b) (str/bytes parity) |
| mixed_ascii_and_unicode_preserved::0 | encoded = "x \u{2265} 0" (UTF-8 round-trip witness) |
| unicode_in_object_key_and_value::0 | encoded = '{"name":"\u{2265}"}' |
| non_ascii_strings_round_trip_byte_faithful::0, ::2 | per-symbol round-trip witnesses for \u{2265} and \u{65E5}\u{672C}\u{8A9E} |
| realistic_envelope_shape_round_trips::1, ::2, ::3 | envelope-shape determinism witnesses |
| single_bit_flip_changes_most_output_bits::0 | identifier equality from the avalanche regression test |

All 13 survive provekit verify clean (zero callsites, zero load errors; this is a pure-contract bundle, no bridges).

What couldn't be lifted (69 + 11 = 80 skips, honest gap-finding)

rust-tests adapter (69 skips):

  • 53 hits: expression shape outside the v0 whitelist. The canonicalizer's JCS tests are uniformly shaped as assert_eq!(encode_jcs(&Value::object([...])), "<canonical>"). The lhs is a method call; the rhs's operand is a multi-arg constructor. Both fall outside the "identifier, literal, single-arg ctor" whitelist.
  • 14 hits: assert! body must be a binary comparison. The CID-shape tests use assert!(matches!(...)) and assert!(h.starts_with(PREFIX)) heavily.
  • 2 hits: byte-string literals (b"") are not in the v0 literal set, so assert_eq!(blake3_512_hex(s), blake3_512_of(b"")) skips.
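The three skip reasons above can be read as a small decision procedure. The following is a minimal sketch under stated assumptions: the `Expr` mini-AST, the `Verdict` names, and the order in which the checks run are all hypothetical, and a "single-arg ctor" is modelled loosely as any one-argument call whose argument is an identifier or literal. The real adapter works on `syn` expressions and may draw these lines differently.

```rust
// Hypothetical mini-AST; the real lifter inspects `syn` expressions.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Ident(&'static str),
    Int(i64),
    Str(&'static str),
    ByteStr(&'static [u8]),
    Call(&'static str, Vec<Expr>), // callee name and arguments
    Eq(Box<Expr>, Box<Expr>),      // binary comparison
    Opaque,                        // anything else, e.g. matches!(..)
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Lift,
    SkipNotBinary,
    SkipByteString,
    SkipShape,
}

fn eq(l: Expr, r: Expr) -> Expr {
    Expr::Eq(Box::new(l), Box::new(r))
}

/// Identifier, integer literal, or string literal.
fn is_leaf(e: &Expr) -> bool {
    matches!(e, Expr::Ident(_) | Expr::Int(_) | Expr::Str(_))
}

/// A leaf, or a one-argument call whose argument is a leaf (assumed reading
/// of "single-arg ctor").
fn is_operand(e: &Expr) -> bool {
    match e {
        Expr::Call(_, args) => args.len() == 1 && is_leaf(&args[0]),
        other => is_leaf(other),
    }
}

fn has_byte_str(e: &Expr) -> bool {
    match e {
        Expr::ByteStr(_) => true,
        Expr::Call(_, args) => args.iter().any(has_byte_str),
        Expr::Eq(l, r) => has_byte_str(l) || has_byte_str(r),
        _ => false,
    }
}

/// Classify the body of one assertion.
fn classify(assertion: &Expr) -> Verdict {
    let Expr::Eq(lhs, rhs) = assertion else {
        return Verdict::SkipNotBinary;
    };
    if has_byte_str(lhs) || has_byte_str(rhs) {
        return Verdict::SkipByteString;
    }
    if is_operand(lhs) && is_operand(rhs) {
        Verdict::Lift
    } else {
        Verdict::SkipShape
    }
}

fn main() {
    let jcs_test = eq(
        Expr::Call("encode_jcs", vec![Expr::Call("Value::object", vec![Expr::Opaque])]),
        Expr::Str("{}"),
    );
    println!("a == b          -> {:?}", classify(&eq(Expr::Ident("a"), Expr::Ident("b"))));
    println!("encode_jcs(..)  -> {:?}", classify(&jcs_test));
    println!("matches!(..)    -> {:?}", classify(&Expr::Opaque));
}
```

Widening `is_operand` is where a larger whitelist would land in this model; the other two checks stay as they are.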

rust-tests-layer2 adapter (11 skips):

  • 3 hits: bounded-loop pattern with a multi-statement body. for input in [...] { let h = ...; let hex = ...; assert_eq!(...) } exceeds the v0 single-stmt cap.
  • 8 hits: characterization-conjunction with zero of N atoms liftable (every assert in the body is an encode_jcs(&v) == "..." shape, none of which Layer 0 can lift; Layer 2 then releases all eight tests back to Layer 0, which skips them).
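To make the single-statement cap concrete, here is one plausible reading of the bounded-loop pattern, written as a sketch. It assumes the lifter unrolls `for x in lo..hi { assert!(...) }` into one atom per value; the description above does not say whether Layer 2 unrolls or emits a quantified form, so treat that as a guess. `Stmt`, `LoopLift`, and the `{x}` placeholder convention are hypothetical.

```rust
// Hypothetical model of a loop body; not the real Layer 2 data structures.
#[derive(Debug, PartialEq)]
enum Stmt {
    Let(&'static str),    // a `let` binding, identified by the bound name
    Assert(&'static str), // assertion template; "{x}" marks the loop variable
}

#[derive(Debug, PartialEq)]
enum LoopLift {
    Atoms(Vec<String>),
    Skip(&'static str),
}

/// Unroll `for var in lo..hi { body }` when the body is a single assertion.
fn lift_bounded_loop(var: &str, lo: i64, hi: i64, body: &[Stmt]) -> LoopLift {
    // Builds the literal text "{x}" for var = "x".
    let placeholder = format!("{{{var}}}");
    match body {
        [Stmt::Assert(template)] => LoopLift::Atoms(
            (lo..hi)
                .map(|v| template.replace(&placeholder, &v.to_string()))
                .collect(),
        ),
        [] => LoopLift::Skip("empty body"),
        [_] => LoopLift::Skip("body is not an assertion"),
        _ => LoopLift::Skip("multi-statement body exceeds the single-statement cap"),
    }
}

fn main() {
    // Single-statement body: unrolls into three atoms.
    println!("{:?}", lift_bounded_loop("x", 0, 3, &[Stmt::Assert("{x} < 10")]));

    // The three-line shape (input -> let -> assert) from the canonicalizer tests.
    let body = [Stmt::Let("h"), Stmt::Let("hex"), Stmt::Assert("hex == expected")];
    println!("{:?}", lift_bounded_loop("input", 0, 2, &body));
}
```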

Spec coverage against protocol/specs/2026-04-30-canonicalization-grammar.md:

| Spec property | Lifter reach |
| --- | --- |
| RFC 8785 §3.2.3 sorted object keys | NONE (every test's lhs is a method call) |
| No whitespace | NONE |
| \u00XX escapes for U+0000..U+001F | NONE |
| Integer plain decimal | NONE |
| BLAKE3-512 length 11 + 128 | PARTIAL (count=128, empty-vector pin) |
| Self-identifying prefix blake3-512: | PARTIAL (constant matches spec) |
| Determinism (same input -> same output) | YES |
| Distinct inputs distinct hashes | NO (b"" not whitelisted) |
| UTF-8 byte-faithful for non-ASCII | PARTIAL (per-symbol witnesses only) |

Full per-skip-reason taxonomy with file paths is in .provekit/self-lifts/canonicalizer/lift-report.txt.
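For orientation, the four properties the lifter cannot yet reach (sorted keys, no whitespace, control-character escapes, plain-decimal integers) are small enough to show in a toy encoder. This is a sketch written for this description, not the provekit-canonicalizer implementation: `Value` and `encode_sketch` are hypothetical, only integers, strings, and objects are covered, and the escape handling follows my reading of RFC 8785 (short forms for \b, \t, \n, \f, \r, lowercase \u00xx for the remaining control characters).

```rust
use std::fmt::Write;

// Hypothetical value type; the real crate's `Value` may look different.
#[derive(Debug, Clone)]
enum Value {
    Int(i64),
    Str(String),
    Object(Vec<(String, Value)>),
}

/// Write a JSON string literal, escaping only what must be escaped.
fn write_string(out: &mut String, s: &str) {
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\u{08}' => out.push_str("\\b"),
            '\t' => out.push_str("\\t"),
            '\n' => out.push_str("\\n"),
            '\u{0C}' => out.push_str("\\f"),
            '\r' => out.push_str("\\r"),
            // Remaining control characters become \u00xx.
            c if (c as u32) < 0x20 => {
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            // Everything else, including non-ASCII, is emitted as-is (UTF-8).
            c => out.push(c),
        }
    }
    out.push('"');
}

fn encode_into(v: &Value, out: &mut String) {
    match v {
        // Plain decimal, no exponent or fraction.
        Value::Int(n) => out.push_str(&n.to_string()),
        Value::Str(s) => write_string(out, s),
        Value::Object(members) => {
            // Sort keys by UTF-16 code units, as RFC 8785 requires.
            let mut sorted: Vec<&(String, Value)> = members.iter().collect();
            sorted.sort_by(|a, b| a.0.encode_utf16().cmp(b.0.encode_utf16()));
            out.push('{');
            for (i, (key, val)) in sorted.into_iter().enumerate() {
                if i > 0 {
                    out.push(','); // no whitespace between members
                }
                write_string(out, key);
                out.push(':');
                encode_into(val, out);
            }
            out.push('}');
        }
    }
}

/// Toy canonical encoding; duplicate keys and floats are out of scope.
fn encode_sketch(v: &Value) -> String {
    let mut out = String::new();
    encode_into(v, &mut out);
    out
}

fn main() {
    let v = Value::Object(vec![
        ("b".to_string(), Value::Int(1)),
        ("a".to_string(), Value::Str("\u{2265}".to_string())),
    ]);
    println!("{}", encode_sketch(&v));
}
```

Every assertion that pins one of these properties has a call on one side and a string literal on the other, which is exactly the shape the v0 whitelist rejects.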

Make target

make self-lift-canonicalizer

builds provekit-lift (release), wipes any prior .proof under .provekit/self-lifts/canonicalizer/, re-runs the lift, and writes the same CID. Idempotent. NOT wired into make conformance or make ci; this is an experiment, not core conformance, and the task explicitly cautioned against pinning a self-lift CID into the gate while the lifter is still v0.

Cross-impl leverage

The same skip patterns will surface on the TS, Go, and C++ canonicalizer test files when their respective lifters point at them, because the test idiom is the same across kits: expect(encodeJcs(...)).toEqual("..."), assert.Equal(t, EncodeJcs(...), "..."), EXPECT_EQ(encode_jcs(...), "..."). The shape gap is in the lifter's expression whitelist, not in any one language's idioms. Concrete next moves once the v0 whitelist grows (single-arg method-call lhs, multi-arg ctor rhs):

  • TS: provekit-lift-vitest against implementations/typescript/src/canonicalizer/
  • Go: provekit-lift-go-tests against implementations/go/.../canonicalizer/
  • C++: future C++ lifter against implementations/cpp/provekit/canonicalizer/

Each will recover roughly the same shape of mementos this run produced and skip roughly the same shape of tests with the same structured warnings.

Don't-touch list (followed)

  • No changes to provekit-canonicalizer/ source. The point was to surface what the lifter reaches as-is.
  • No changes to provekit-self-contracts (it ships a pinned CID under make conformance; conflating the experiment with core conformance would have wrecked it).
  • No changes to other kits.
  • No changes to protocol/specs/ text.
  • No changes to .github/workflows/.

Test plan

  • make self-lift-canonicalizer succeeds and prints the same CID across two runs
  • provekit verify <proof> loads the lifted catalog clean (0 errors, 0 violations)
  • git status clean after the make target re-runs (idempotent)
  • Reviewer sanity-check: open .provekit/self-lifts/canonicalizer/lift-report.txt and skim the skip taxonomy

The framework writes lifters and uses them on test fixtures; this is
the first time it lifts itself. `provekit-lift` runs against the
`provekit-canonicalizer` crate and emits a durable `.proof` plus an
honest skip taxonomy at `.provekit/self-lifts/canonicalizer/`.

Outcome (rust-tests adapter on canonicalizer source):

  files_scanned: 9
  rust-tests       seen=82  lifted=13  skipped=69
  rust-tests-l2    seen=11  lifted=0   skipped=11
  proptest/contracts/kani/prusti/creusot/flux/quickcheck/verus: zero
    (the crate uses no external annotation library)

13 contract mementos lifted: prefix_constant, deterministic_across_calls,
empty-input vector pin, count=128, Unicode round-trip witnesses, etc.
All survive `provekit verify` clean.

Skip causes (honest gap-finding):
  53 hits  expression shape outside v0 whitelist (method calls,
            constructors, multi-arg calls)
  14 hits  assert! body not a binary comparison
   2 hits  byte-string literals (b"") not in the v0 literal set

Spec coverage (canonicalization-grammar.md): partial. The lifter
reaches counter and identifier equalities, the prefix constant, and
a handful of literal pins. The byte-faithful claims (sorted keys, no
whitespace, escape sequences, integer formatting) all skip because
the lhs of every such test is encode_jcs(&v), a method call. That is the
operationally enforced bulk of the canonicalizer's invariants and it
is currently unreachable by the auto-lifter; surfacing the gap is
the point.

Idempotent: default seed [0x42; 32] produces a byte-deterministic
catalog CID (verified across two runs into separate temp dirs).

  make self-lift-canonicalizer

regenerates the .proof at the same CID. NOT wired into make
conformance because this is an experiment, not core conformance.

Cross-impl leverage: the same skip patterns will hit the TS, Go, C++
canonicalizers when their respective lifters point at their
canonicalizer crates. The shape gap is in the lifter's expression
whitelist, not in any one language's idioms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 2, 2026 17:55
@TSavo TSavo merged commit 279fee0 into main May 2, 2026
5 of 6 checks passed

Copilot AI left a comment

Pull request overview

This PR adds a self-lift experiment for the Rust canonicalizer crate, checking in the resulting lifted .proof artifact and a human-readable report so the repository can track what provekit-lift currently reaches on first-party canonicalizer tests. It extends the top-level build orchestration with a dedicated make target and updates ignore rules so the experiment artifact can live in source control.

Changes:

  • Adds make self-lift-canonicalizer and help text for running the Rust lifter against implementations/rust/provekit-canonicalizer.
  • Checks in a generated self-lift .proof bundle plus a companion lift-report.txt summarizing lifted and skipped assertions.
  • Updates .gitignore to allow versioning self-lift .proof artifacts under .provekit/self-lifts/.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| Makefile | Adds help text and a new self-lift experiment target for the Rust canonicalizer crate. |
| .provekit/self-lifts/canonicalizer/lift-report.txt | Adds a checked-in textual report describing lifted contracts, skip taxonomy, and spec coverage. |
| .provekit/self-lifts/canonicalizer/blake3-512:79ef1067621f7a3bb3ad31acd157be9ec9e087eb6aa585acd9ea6c9d0965adb02e7f1037a3146dcbc3f6a4c284796d6eadee7e0265d3eeab75791b79628632de.proof | Adds the generated proof envelope produced by the self-lift run. |
| .gitignore | Unignores self-lift .proof artifacts and documents that they are intended to be committed. |

Comment thread Makefile
Comment on lines +264 to +271

    @out=$$($(PROVEKIT_LIFT) \
    --workspace implementations/rust/provekit-canonicalizer \
    --target-dir $(SELF_LIFT_DIR) --quiet); \
    echo " cid: $$out"; \
    test -f $(SELF_LIFT_DIR)/$$out.proof || \
    (echo "FAIL: lifter did not write $(SELF_LIFT_DIR)/$$out.proof" && exit 1); \
    echo " proof: $(SELF_LIFT_DIR)/$$out.proof"
    @echo " report: $(SELF_LIFT_DIR)/lift-report.txt"

    (object-key Unicode preservation witness)

    non_ascii_strings_round_trip_byte_faithful::0
    i = "≥"

Comment on lines +76 to +80

    jcs = ?
    realistic_envelope_shape_round_trips::2
    cid = ?
    realistic_envelope_shape_round_trips::3
    cid_again = ?

Comment on lines +84 to +85

    a = b
    (avalanche regression test, lifted as identifier-equality)

Comment on lines +187 to +189

    bodies. Three of the canonicalizer's parameterized loops use a
    three-line shape (input -> let -> assert) and skip cleanly with
    a structured warning.

Comment on lines +18 to +27

    proptest seen=0 lifted=0 skipped=0
    contracts seen=0 lifted=0 skipped=0
    kani seen=0 lifted=0 skipped=0
    prusti seen=0 lifted=0 skipped=0
    creusot seen=0 lifted=0 skipped=0
    flux seen=0 lifted=0 skipped=0
    quickcheck seen=0 lifted=0 skipped=0
    verus seen=0 lifted=0 skipped=0
    rust-tests seen=82 lifted=13 skipped=69
    rust-tests-l2 seen=11 lifted=0 skipped=11

TSavo added a commit that referenced this pull request May 23, 2026
…from lift IR

Proves the substrate's lift IR carries enough data to emit a complete
java compilation unit (interface, constants, @boundary stubs) WITHOUT
hand-writing the wrapper at integration time.

Reads provekit lift --library-bindings output, extracts bind-lift-entry
signatures, maps rust types via a small source-aliases table (the
substrate-honest version walks the kit's catalog), emits:
  - package + imports
  - class header + static fields/constants
  - AdapterLifter interface
  - @boundary primitive stubs with byte-identical signatures to source

The substrate-honest version of this script is the java realize plugin
emitting the same when invoked over the whole IR (via the
provekit.plugin.assemble RPC method — pending task #38). This Python
script is the proof of concept: the IR carries the data; only the
emission needs threading through cmd_lower or a new assemble RPC.

What this retires: the hand-written CrossPlatform.java wrapper in the
demo's test harness — the LAST hand-written java code in the
rust → java → rust cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TSavo added a commit that referenced this pull request May 23, 2026
…es hand-written boundary code

The python wrapper-emitter is now retired. cmd_lower --target java
auto-emits the full java compilation unit when boundary entries are
present in the lift IR:
  package com.provekit.crossplatform;
  imports (Jackson, runtime carriers, java.nio.file.Path)
  public final class CrossPlatform {
    static final ObjectMapper MAPPER;
    static String PLUGIN_VERSION; ... (constants)
    public interface AdapterLifter { name(); surface(); lift(...); }
    // @boundary primitives auto-emitted from rust @boundary declarations:
    public static <Type> json_parse(String s) { throw UnsupportedOp; }
    ... (9 primitives, signatures derived via map_rust_type_to_java)
    // @sugar functions follow (each as a static method inside the class):
    public static JsonNode ok_response(...) { ... }
    ...
  }

CHANGES:
- NamedTermDocument gains boundary_entries: Vec<Json> field (defaults
  empty, populated from bind-lift-entry records in the IR).
- named_term_document_from_ir_document collects bind-lift-entry items
  that aren't also @sugar (i.e. true @boundary functions).
- lower_named_document detects target=="java" + non-empty boundary_entries
  and emits emit_java_module_preamble() before, '}' after.
- map_rust_type_to_java handles &str, Value, [u8;N], Result<T,E>, etc.
- strip_transported_class_wrapper peels per-function 'final class XxxTransported'
  wrappers so the @sugar methods sit directly inside CrossPlatform.

EMPIRICAL: provekit lower --target java now emits a single compilation
unit with package + imports + CrossPlatform class + 9 @boundary stubs +
all @sugar methods. The python wrapper-emitter script is retired (kept
as proof-of-concept demonstration).

Substrate-honest note: this lives in rust cmd_lower because the
provekit.plugin.assemble RPC method-shape is still pending (#38). The
ideal home is the java realize plugin (it owns the target's file
layout). Moving this code there is a future refactor; the current shape
is functional and demonstrates the substrate emits its own wrapper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>