fix: ntx builder adheres to note limit and store apply_block race condition fixed #1508
Mirko-von-Leipzig merged 9 commits into main
Conversation
bobbinth
left a comment
Looks good! Thank you! I left a few comments/questions inline.
```rust
// Always mark notes as failed. They can get retried eventually.
state.notes_failed(candidate, notes.as_slice(), block_num);

state.candidate_failed(candidate);
```
Probably not for this PR, but there seems to be some inconsistency in docs.
Specifically, doc comments for State::candidate_failed() say "All notes in the candidate will be marked as failed" - but I'm not sure that's true (otherwise, we probably wouldn't need to call State::notes_failed() right before State::candidate_failed() - right?).
Also, I'm curious what the rationale was for skipping certain error types before. Was the idea that if, for example, proving failed, then it is likely not the notes' fault, and therefore we shouldn't penalize those notes as "failed"? If so, I would probably expand the comment to indicate that regardless of the source of the failure, we mark notes as failed so that they can be removed from the pending note set in case of unexpected failures (which is not ideal, and may need to be revisited later).
Probably not for this PR, but there seems to be some inconsistency in docs.
Yes, though on next none of this code exists because it's all been refactored into the actor model. This branch is still running the centralized version, so I'm less concerned.
Also, I'm curious what the rationale was for skipping certain error types before. Was the idea that if, for example, proving failed, then it is likely not the notes' fault, and therefore we shouldn't penalize those notes as "failed"?
I think it was a conservative implementation where we only marked notes that actually failed the check, on the assumption that all other errors were caused by external factors.
```rust
// We perform the apply_block work in a separate task. This prevents the caller cancelling
// the request and thereby cancelling the task at an arbitrary point of execution.
//
// Normally this shouldn't be a problem, however our apply_block isn't quite ACID compliant
// so things get a bit messy. This is more a temporary hack-around to minimize this risk.
let this = self.clone();
tokio::spawn(
```
My understanding is that this makes it so that apply_block() always completes, even if the gRPC request is canceled/times out - right? And the goal of this is to prevent the database getting stuck in the locked state - correct? If so, I'd maybe mention the last part more explicitly in the comment.
Also, AFAICT, this doesn't prevent the block producer from "de-syncing" from the store. Basically, what we could have is:

- Block producer builds a block and sends it to the store.
- apply_block() takes too long.
- Block producer's gRPC request times out.
- The task doesn't get canceled and the block gets inserted into the DB.

At this point, the store will be at block
My understanding is that this makes it so that apply_block() always completes, even if the gRPC request is canceled/times out - right?
Yes, though it's important to note that this only removes the problem for gRPC request cancellation. It doesn't prevent bugs within the task itself from placing us in a weird state. For example, it's still possible that the in-memory data is updated but the database commit fails, because we don't have ACID guarantees wrapping both. Guaranteeing this requires a bit of a rethink.
And the goal of this is to prevent the database getting stuck in the locked state - correct?
Not quite - this will likely just make the error different. The database "locked" error was a result of having multiple block submission requests running concurrently: request x times out and its task is cancelled, but the database portion keeps running since it hasn't hit an await point yet. Request y is then submitted and gets rejected because x is still holding the db write lock.
In other words, the error is a result of gRPC timeouts being exceeded, and we can't really change that beyond making the error message more palatable, e.g. "block already in-progress". I'd probably prefer to think this through more and incorporate the ACID guarantees as well, somehow.
Also, AFAICT, this doesn't prevent the block producer from "de-syncing" from the store.
...
I'm assuming that's not the case yet - and if so, let's create an issue for this.
Correct; and this is simply an artifact of our decentralized component design here. #1513.
Caution: This PR targets main.

Fixes several issues:

- apply_block