Skip to content

introduce base64 encoding and decoding for all nodes#299

Merged
apoelstra merged 11 commits intomasterfrom
2025-07/base64
Jul 29, 2025
Merged

introduce base64 encoding and decoding for all nodes#299
apoelstra merged 11 commits intomasterfrom
2025-07/base64

Conversation

@apoelstra
Copy link
Copy Markdown
Collaborator

This also refactors our bit encoding code to no longer be specific to CommitNode and RedeemNode; it introduces some string display methods and from_str` methods which are able to parse hex witness data alongside base64 programs.

...and use the dependency exposed by rust-bitcoin, rather than defining our
own, so that we don't have two versions that we need to keep in sync.
…ing node

In NamedNode<Construct> and NamedNode<Commit> we had set the sharing ID to
be the name of the node. The idea here was that if two nodes had different
names in the human-readable encoding, even if they were identical, that
they would not be shared since they were "conceptually different".

This idea was at best confusing, and at worst dangerous. "Sharing" is a
property of the bit encoding, not the human-readable encoding, and
pretending otherwise is just wrong.

I'm not actually sure what I intended by making the sharing ID be the name.
Notably, in the bit encoding we didn't actually use the sharing ID! We
override it in NamedCommitNode::encode, where we first convert the named
commit node to a normal commit node before encoding it. So AFAICT this
confused notion of sharing ID was never actually used. Some evidence for
this is that this commit doesn't break any unit tests..

However, in the next commit I will make encoding generic, and I won't be
able to use the "convert to a commitnode" hack anymore.

Anyway this is all moot since the human-readable encoding has been almost
completely superceded by SimplicityHL.
Right now we have encoding methods on RedeemNode and CommitNode. They are
not generic because RedeemNode has witness data while CommitNode does not.

This is a problem because it means that we cannot use the bit encoding to
implement generic fmt::Display impls. Instead we iterate and implement
fmt::Display in terms of a dump of each of the nodes in the program, with
newlines and everything, which is not really useful as a Display impl.

It's also an API annoyance; it means that we have to re-implement the
encoding method on every single node type individualy; there is no way
to encode "just the program" for redeem nodes, there are no encoding
functions for ConstructNode or NamedConstructNode even though it's
totally possible to encode these by calling encode_program directly, etc.
(I think it was a semi-deliberate design choice not to foreground the
encoding of ConstructNodes. But this is dumb. It's no less confusing
to encode ConstructNodes than to encode CommitNodes; in both cases it
is impossible to share "correctly" but there is still a natural way to
do so.)

While we are at it, stop taking a BitWriter and just take a generic Write.
There is no circumstance where the user should be bit-encoding a program
into the middle of another bitstream, since we split the witness data off
from the program data. (In fact, arguably there is no circumstance where
a user needs to work with bitwriters *at all*.) We weren't even using it
nicely as a BitWriter, since we'd always call `flush` after writing. Just
acknowledge that we will always write a whole number of bytes by taking
an io::Write.

Also while we're at it, return the count of program and witness bits
separately when encoding to a generic writer. We were returning the sum,
which since the split between witness and program, is a basically useless
quantity. (Not that we were using the return value of this function
anywhere..)

f deal with the "sharing id" of namedcommitnode being the name
@apoelstra
Copy link
Copy Markdown
Collaborator Author

cc @canndrew this is the last PR before we can start cutting releases

Generally speaking users don't want the "full decode" representation of
arbitrary nodes except for debugging, and in that case they probably want
a more structured representation. It's certainly not what you'd expect
for the string representation.

Simplicity programs are canonically encoded in base 64, although there is
not a canonical encoding for witness data. I decided to use hex since this
makes programs and witnesses visually distinct and also it makes the
witness data convenient for pasting into json blobs for witness scripts.

The next commit will add some unit tests. Then we will add string parsing
and some round-trip fuzz tests.
These tests have been an albatross around the neck of this library for
too long. They are the reason we have the `decode_natural` function even
though it is called exactly once, from its thin wrapper `read_natural`
that has a more sensible API.
Several changes here:

* Split out the natural-number parsing errors into `DecodeNaturalError`
* Elaborate `BadIndex` to give the actual and maximum number read
* Make `EarlyEndOfStreamError` non-exhaustive to prevent direct construction,
  and store it directly in DecodeNaturalError rather than replacing it with
  a dummy variant.
* Drop the `TooManyNodes` error variant. This only existed to prevent
  allocating huge vecs. Just cap the value passed to `Vec::with_capacity`
  instead.
* Drop the `EmptyProgram` error variant which was impossible to hit.
* Inline `decode_natural` into `read_natural`; make this generic over the
  integer types to eliminate casts to/from u32; expand the existing unit
  test to cover excess values.
Essentially, DecodeError is used when constructing a program from the
bit encoding and FinalizeError is used when constructing a program using
the library.

Along the way, tighten up some error returns for functions that return
e.g. crate::Error when they actually only ever yield type errors.

We drop the Error::Policy variant which was never used. The old Error
enum now has a single variant, which annoyingly is used by the
Haskell-generated code so we can't remove it without roundtripping
through upstream. Will try to do this before the next release but it's
not a huge priority.

In the next commit we'll add a ParseError which wraps DecodeError and
adds some string-encoding-related variants.
It bugs me to have two versions of hex-conservative.

Our lockfile retains hex-conservative 0.1, because our fuzztests depend
on an old released version of this crate.
We cannot use the std from_str trait because these all construct an Arc<Self>
rather than a Self. And we cannot implement this generically on Node because
the rules for decoding are different for each node type (we have progressively
tighter correctness requirements, and with RedeemNode we need extra witness
data).
Copy link
Copy Markdown
Collaborator Author

@apoelstra apoelstra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On 96839b7 successfully ran local tests

@canndrew
Copy link
Copy Markdown
Collaborator

ACK 96839b7

@apoelstra apoelstra merged commit 082e104 into master Jul 29, 2025
39 checks passed
@apoelstra apoelstra deleted the 2025-07/base64 branch July 29, 2025 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants