
fix: ntx builder adheres to note limit and store apply_block race condition fixed#1508

Merged
Mirko-von-Leipzig merged 9 commits into main from mirko/fix/ntx-builder
Jan 15, 2026

Conversation

Mirko-von-Leipzig (Collaborator) commented Jan 13, 2026

Caution

This PR targets main

Fixes several issues:

  • Race condition in the store's apply_block
    • A request cancellation (e.g. due to a timeout from the gRPC client) halts the process at some arbitrary location.
    • Solved by moving the process to a spawned task so it isn't cancelled along with the gRPC method.
  • Ntx builder not adhering to the protocol note limit in the checker
  • Ntx builder not marking notes from errored txs as failed
  • Improved telemetry

bobbinth (Contributor) left a comment


Looks good! Thank you! I left a few comments/questions inline.

Comment on lines +147 to 150
// Always mark notes as failed. They can get retried eventually.
state.notes_failed(candidate, notes.as_slice(), block_num);

state.candidate_failed(candidate);
bobbinth (Contributor) commented:

Probably not for this PR, but there seems to be some inconsistency in docs.

Specifically, doc comments for State::candidate_failed() say "All notes in the candidate will be marked as failed" - but I'm not sure that's true (otherwise, we probably wouldn't need to call State::notes_failed() right before State::candidate_failed() - right?).

Also, I'm curious what the rationale was for skipping certain error types before. Was the idea that if, for example, proving failed, then it is likely not the notes' fault and therefore we shouldn't penalize these notes as "failed"? If so, I would probably expand the comment to indicate that regardless of the source of the failure, we mark notes as failed so that they can be removed from the pending note set in case of unexpected failures (which is not ideal, and may need to be revisited later).

Mirko-von-Leipzig (Collaborator, Author) replied:

Probably not for this PR, but there seems to be some inconsistency in docs.

Yes, though on next none of this code exists because it's all been refactored into the actor model. This is still running the centralized version, so I'm... less concerned.

Also, I'm curious what the rationale was for skipping certain error types before. Was the idea that if, for example, proving failed, then it is likely not the notes' fault and therefore we shouldn't penalize these notes as "failed"?

I think it was a conservative implementation, where we only marked notes that actually failed the check, on the assumption that all other errors were caused by external factors.

Comment on lines +73 to +79
// We perform the apply_block work in a separate task. This prevents the caller cancelling
// the request and thereby cancelling the task at an arbitrary point of execution.
//
// Normally this shouldn't be a problem, however our apply_block isn't quite ACID compliant
// so things get a bit messy. This is more a temporary hack-around to minimize this risk.
let this = self.clone();
tokio::spawn(
bobbinth (Contributor) commented:

My understanding is that this makes it so that apply_block() always completes, even if the gRPC request is canceled/times out - right? And the goal of this is to prevent the database getting stuck in the locked state - correct? If so, I'd maybe mention the last part more explicitly in the comment.

Also, AFAICT, this doesn't prevent the block producer from "de-syncing" from the store. Basically, what we could have is:

  1. Block producer builds a block and sends it to the store.
  2. apply_block() takes too long.
  3. Block producer's gRPC request times out.
  4. The task doesn't get canceled and the block gets inserted into the DB.

At this point, the store will be at block $n$ while the block producer will be at block $n-1$. If we set reasonable timeouts, this should be extremely rare, but the block producer should be able to recover from this. I'm assuming that's not the case yet - and if so, let's create an issue for this.

Mirko-von-Leipzig (Collaborator, Author) replied:

My understanding is that this makes it so that apply_block() always completes, even if the gRPC request is canceled/times out - right?

Yes, though it's important to note that this only removes the problem for gRPC request cancellation. It doesn't prevent bugs within the task itself from placing us in a weird state. For example, it's still possible that the in-memory data is updated but the database commit fails, because we don't have ACID guarantees wrapping both. Guaranteeing this requires a bit of a rethink.

And the goal of this is to prevent the database getting stuck in the locked state - correct?

Not quite; this will likely just make the error different. The database "locked" error was a result of multiple block submission requests running concurrently. Request x times out and its task is cancelled, but the database portion keeps running since it hasn't hit an await point yet. Request y is then submitted, and gets rejected since x is still holding the db write lock.

In other words, the error is a result of gRPC timeouts being exceeded, and we can't really change that beyond making the error message more palatable, e.g. "block already in-progress". I'd probably prefer to think this through more so we can incorporate the ACID guarantees as well, somehow.

Also, AFAICT, this doesn't prevent the block producer from "de-syncing" from the store.
...
I'm assuming that's not the case yet - and if so, let's create an issue for this.

Correct; and this is simply an artifact of our decentralized component design here. #1513.

@Mirko-von-Leipzig Mirko-von-Leipzig merged commit 50c9ed1 into main Jan 15, 2026
7 checks passed
@Mirko-von-Leipzig Mirko-von-Leipzig deleted the mirko/fix/ntx-builder branch January 15, 2026 06:04