Skip to content

Conversation

apoelstra
Copy link
Collaborator

@apoelstra apoelstra commented Aug 31, 2025

I recently learned about the GhostCell crate, and its associated paper from ICFP 2021. There are a couple of similar projects from that era, notably qcell which predates the published version of ghostcell by a bit, but none of these are used much, as evidenced by the "Dependents" tab on crates.io.

I learned about them from a blog post about "generativity" which has a pattern that makes GhostCells a little more ergonomic-looking at the cost of using macros. (I don't think this is worthwhile and I didn't use them).

But the macro syntax isn't really the point. The key quote from the blog post is this

The fundamental purpose of the generativity pattern is not necessarily to improve performance, but to statically require that individual values come from or refer to the same source. The performance benefits are a symptom of this requirement.

Anyway, GhostCells are extremely powerful and I don't know why they are not more popular. In effect, they allow synchronizing a single zero-sized token, and gating access to an arbitrarily-sized set of otherwise-disconnected objects on them. So if you have a &mut Token you can get mutable access to all of your objects.

This is exactly the situation we have with our type inference context and our union bound algorithm: we have a Context object which has a Mutex controlling access to a slab of allocated bounds, and then we have a whole pile of UbElements which are connected in a forest structure that the Rust compiler doesn't really understand, but where we morally have exclusive access exactly when we have exclusive access to the slab. GhostCell lets us tell the compiler this.

This PR

  • Eliminates all the mutexes in the type checker except the one in the Context object that holds the slab
  • Fixes multiple bugs related to inconsistent contexts, including one in the policy module which made it impossible to actually use the asm fragment functionality
  • Adds an invariant lifetime 'brand which acts as a tag on all Types, ConstructNodes and WitnessNodes which prevent them being combined if they come from different type inference contexts
  • Requires the user to do node construction inside a closure which is passed to Context::with_context, rather than constructing a context object which then sorta leaks into the ether by being contained in an Arc carried by each node

The latter item sounds like a big ergonomic hit, but in practice it's not a big deal because it doesn't affect CommitNode or RedeemNode.

The use of a lifetime rather than a marker type means that while every type context has a "distinct type" for Type, Arrow, ConstructNode, etc., the distinction is erased after typechecking. So compilation speed and program size will not be adversely affected. But conversely, the compiler error messages are pretty unclear when you use this incorrectly.

@apoelstra apoelstra force-pushed the 2025-08/ghostcell branch 2 times, most recently from 2ddbf3f to 0a2341b Compare August 31, 2025 22:47
@uncomputable
Copy link
Collaborator

This looks very promising. Are we ready to say that the type inference code is deadlock free, at this point?

@apoelstra
Copy link
Collaborator Author

@uncomputable yep!

We are going to change a bunch of stuff about type inference, so add a benchmark
for decoding a fixed program that's available. This is not the best benchmark,
but it's one that's available.

On my system the current results is: 70,972.00 ns/iter (+/- 662.39)
This one is coauthored by Ralf Jung and has a paper (which I read) and a
formal proof (which I did not). It would be hard to find a more trustworthy
dependency.

Also, we have a clear need for this -- currently we have Mutexes all over
our type checker code which are both slow and hard to reason about.
This impl was never used and will be impossible to have once we put a
GhostToken into the context.
The Context object contains an Arc, which is probably the right design
decision since these are supposed to be "handles" to a type inference
context which can be cheaply cloned everywhere. Without this there
would be a ton of Arc::new and Arc::clone noise throughout the codebase.

But we want to make this structure more elaborate, which requires putting
more stuff in the Arc, which requires a new struct. So do that in its
own commit, to try to reduce the amount of noise in the next one.
Right now our types::Error enum has a couple variants that include bare
`Type`s in them which are intended to represent invalid types. It isn't
good hygeine to do this, because these Types carry with them information
about the type context (which is supposed to be an ephemeral thing
created during parsing and destroyed before parsing completes).

It also limits our ability to provide useful representation of incomplete
types in error messages, because the bare Bound type does not contain any
data about why a bound was invalid. The user is also forced to be careful
with these invalid objects, especially in the case of an occurs-check
failure, when any attempt to iterate on the type will infinite-loop.

This commit does the bare minimum to define a new type for use in error
messages. In a followup we probably want to clean things up a bit -- at
the very least we should try to construct some sort of rich "occurs check
failure" information. We may also want to define an IncompleteArrow,
which would make the code in human_encoding a bit tidier.

The attentive reviewer may notice calls to `lock.drop()` inserted before
calls to Incomplete::from_bound_ref. Without these calls, the code deadlocks.
There is no compiler help or anything to keep track of this. This kind of
bullshit is why I'm so excited to use GhostCell, where the compiler (and
borrowck) will do all the work for us.

Finally, in the next commit when we add a 'brand lifetime marker to Type,
we will be forced to avoid saving this type in our Error enum, because
then Type will be unable to outlive its type context (and errors must
outlive the context, since the context is ephemeral and the error is
something we return to the user).
All this commit does is patch a 'brand lifetime (the name comes from the
GhostCell paper) into `types::Context`. In the next commit this lifetime
will be associated with a GhostToken but for now it's unused.

However, putting an invariant lifetime onto `types::Context` means that
we need an invariant lifetime on `Type`, which means one on `ConstructNode`
and the `*Constructible` traits, and on any functions which create these
things.

In fact, this is exactly what we want! All these types and functions are
ones that should be tied to a single type-inference context, and by use
of GhostCells we will get compile-time assurance that users are not trying
to mix types from one program with types from another. As a second benefit,
this should get rid of the mutex mess in the typechecker and union-bound
code.

IOW this is an extremely large commit, but it doesn't do anything at all
except add lifetime annotations, and they're all good.
@apoelstra apoelstra force-pushed the 2025-08/ghostcell branch 2 times, most recently from 7fe6588 to 3999983 Compare September 1, 2025 19:45
In the new GhostCell paradigm it will be impossible for a Type (and
therefore a ConstructNode) to outlive its context. This means that
all the methods which currently return a disembodied ConstructNode
need to accept a context as a parameter.

In effect, this puts "direct constructors" and parsers/decoders of
ConstructNodes on even footing: people doing direct construction of
programs have always needed to create a context and keep track of it,
since the constructors iden(), unit(), witness() and jets all need
one. The other constructors simply copy the context Arc from the
objects.

This design had a couple goals:

* Minimizing pain for users by minimizing the number of places they had
  to pass a context object
* Minimizing errors by using "the correct context object" automatically
  by storing an Arc and cloning it behind the scenes
* Allowing users of decode_expression and ConstructNode::from_str to not
  think about contexts at all.

However, I observe that the "minimizing pain" goal is not that valuable
since this only applies to ConstructNode (since CommitNode and RedeemNode
have complete types in them, and these are much more commonly used than
ConstructNode). Furthermore:

* The current logic actually makes it impossible to decode two expressions
  and then combine them, since they'll have different associated type
  contexts. (I guess nobody has tried this?? They would get a panic.)
* The "minimizing errors" goal we can achieve by GhostCell, at compile
  time and much more thoroughly, and in a way that "feels Rustic" in
  that it levers the existing borrowchecker.
Very short change, because this function is never called or tested anywhere.
Same as before, quick change, no tests..
Same justification as before. If you are constructing a ConstructNode (or
a WitnessNode, which is basically the same thing), you should keep explicit
track of your context objects.

Also, prior to this commit, if you were converting named nodes to witness
nodes for the purpose of combining them, you could not do this because
to_witness_node was making a fresh type context for each conversion.
Turns out the asm fragment is basically unusable. Oops. The next commit
will fix this, at the cost of some ugly API changes.
This is a bit annoying for users of the Satisfier trait: they need to
construct a type inference and keep it around for the satisfaction logic
to use. If the satisfier supports assembly this is necessary.

If it doesn't support assembly, which is probably the more common case,
then there's no need for this ... but because of lack of specialization
and a general lack of expressivity in Rust's type system, they gotta do
it anyway.
Prior to this commit we were using pointer values and assertions to
check that users were using type contexts consistently (i.e. not
using different contexts for the same program). This worked reasonably
well, but was (a) slow, (b) revealed errors only at runtime, and (c)
provided almost no help in guiding the user to using the correct
context in the correct places.

Instead, we can use a ghost_cell::GhostToken as an identity, which is
a special type which can be used to assign a unique invariant "brand"
lifetime to every type context. Then everything that holds a context
reference, i.e. every type bound and ConstructNode, inherits this
brand, and the compiler will ensure that you don't mix brands *and*
that you don't fail to mix brands. And it will do it all using the
borrowchecker, meaning that your code won't compile if you're wrong,
but the markers get erased prior to codegen so that the compilation
time and code size aren't meaningfully increased.

Furthermore, the use of invariant lifetimes ensures that we catch
mistakes in both directions: if the user uses two contexts where they
should've used one, that's an error (which our existing assertions
will probably catch); *and* if the user uses one context where they
should've used two, that's probably also an error.

The tradeoff here is that you can no longer construct context objects
directly and then pass them around. You need to construct them using
Context::with_context, which provides a fresh context object to a
closure, and the context won't outlive the closure. (This is how we
catch "if you use one context where you should've used two" errors;
to make this mistake and still have your code compile, you need to
put two program constructions into the same closure, which hopefully
will be visibly wrong.)

So far, based on my updating of our unit tests and all our code, this
limitation doesn't seem to be a big deal. It does require a bunch of
reindentation but it hasn't made the code any harder to write or
structure -- and where it did force restructuring (e.g. in the Satisfy
trait in the previous commit) it was because the original code was
actually wrong.

Another tradeoff is that when you get stuff wrong the compiler errors
are pretty rough. But I think we'll have to live with this, and
hopefully future compilers will learn to understand GhostCell and
GhostToken constructions better.

Definitely want to review this using
    git show --color-moved-ws=ignore-all-space
because the bulk of the diff is just me re-indenting unit tests to
wrap them in Context::with_context.
In the next couple commits we will replace the mutexes in the union-bound
algorithm with GhostCells. Happily, we can reuse the existing GhostToken
in types::Context as an access-control token.

However, to do this we need to refactor a couple algorithms so that we
can fish the token recursively through unify() and bind(), which naively
would require the union-bound algorithm to know about our ContextInner
type. To avoid this abstraction violation, we introduce the generic
WithContextToken wrapper, so the union-bound algorithm can work with "a
token and some arbitrary data" without knowing or understanding anything
about the arbitrary data.
Now that we have the dedicated ContextInner type, we don't need a
dedicated LockedContext type. It's also conceptually simpler to have
a "Context, which mutex-wraps a ContextInner, which implements all
the real methods" rather than having this auxiliary "locked context"
type which isn't directly related to Context.

(And ofc there is WithGhostToken in the middle, but the reader of
the code can hopefully mostly ignore this.)

This gets rid of a lot of `.inner`s which is nice.
This eliminates all the locking code from union_bound.rs, turning any
lifetime mistake into compiler errors rather than deadlocks :). It also
lets us simplify a fair bit of code, removing all the drops, all the
lock().unwrap()s, and one shallow_clone.

There are some borrow()s and borrow_mut()s, but if these compile then
they are no-ops, so the reader can basically ignore them.

The previous benchmark was: 70,972.00 ns/iter (+/- 662.39)
            The new one is: 69,947.54 ns/iter (+/- 286.68)

which is a ~1.5% improvement, if these numbers are to be believed.
@apoelstra
Copy link
Collaborator Author

Went over this in detail in person with @canndrew. He will ACK when he gets to his computer, I think.

@apoelstra
Copy link
Collaborator Author

One followup item is that Incomplete::from_bound_ref takes a Context but it should take a LockedContext (or rather, a ContextInner). This is a private method and requires a locked context, so better to require the caller to have already obtained the lock than to have lots of .lock sites. I think.

@canndrew
Copy link
Collaborator

canndrew commented Sep 1, 2025

ACK 060455a

Copy link
Collaborator Author

@apoelstra apoelstra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On 060455a successfully ran local tests

@apoelstra apoelstra merged commit 6f245ec into BlockstreamResearch:master Sep 1, 2025
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants