
persist: remove production usages of futures_executor::block_on #12301

Merged: 2 commits merged into MaterializeInc:main from persist_block_on on May 10, 2022

Conversation

@danhhz (Contributor) commented May 6, 2022

This can cause deadlocks when used with tokio Runtimes.

Touches MaterializeInc/database-issues#3543
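
A minimal sketch of the failure mode (illustrative, not code from this PR): futures_executor::block_on synchronously parks the thread it runs on, so on a current-thread tokio Runtime the inner future can never be driven to completion.

    // Illustrative only: block_on parks the runtime's single worker thread,
    // so the spawned task that would complete the channel never runs and
    // this program hangs.
    #[tokio::main(flavor = "current_thread")]
    async fn main() {
        let (tx, rx) = tokio::sync::oneshot::channel::<()>();
        tokio::spawn(async move {
            let _ = tx.send(());
        });
        // Deadlock: this synchronously blocks the only worker thread, so
        // the spawned task above is never scheduled and rx never resolves.
        futures_executor::block_on(async { rx.await.unwrap() });
    }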

Motivation

  • This PR fixes a recognized bug.

Testing

  • This PR has adequate test coverage / QA involvement has been duly considered.

Manually verified that stressing the concurrency test works with a tokio::sync::Mutex in MemConsensus.
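
For background, the relevant property (a sketch of the general pattern, not MemConsensus itself): tokio::sync::Mutex has an async lock(), so waiting on the lock yields the worker thread back to the runtime instead of parking it the way std::sync::Mutex would.

    use std::sync::Arc;
    use tokio::sync::Mutex;

    // Sketch of the general pattern (not MemConsensus itself): lock() is
    // async, so contention yields the tokio worker thread rather than
    // blocking it.
    async fn append(log: Arc<Mutex<Vec<u64>>>, v: u64) {
        let mut log = log.lock().await;
        log.push(v);
    }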

Release notes

This PR includes the following user-facing behavior changes:

  • N/A

@danhhz danhhz requested a review from aljoscha May 6, 2022 18:15
@danhhz (PR author) commented May 6, 2022

Manually verified that stressing the concurrency test works with a tokio::sync::Mutex in MemConsensus.

@aljoscha (Contributor) left a comment

Nice! I had one nit and one comment.

-            || format!("data-generator-{}", idx),
-            move || {
+        let data_generator_handle =
+            mz_ore::task::spawn(|| format!("data-generator-{}", idx), async move {
@aljoscha:

I think Ruchir was using a blocking task on purpose here, to ensure that the data generator got its own CPU thread that it can use as it wants. Not sure that reasoning is sound, though, and it's definitely nicer when we can use async code in the block.

@danhhz (PR author):

Hmm, then maybe it should get its own Runtime or even just a Thread? The old impl was busy-waiting, which also has its downsides.

@danhhz (PR author):

I thought about this more, and neither of those seems to be the "tokio way" (which I'm gradually learning to just accept and embrace). I think the problem we might hit here in practice is that the data generation call doesn't have any yield points. So probably just that bit needs to be moved into a "blocking" call.

@aljoscha:

Meaning, move everything after this one async call into a spawn_blocking? The whole generator code only had this one await, on the barrier. Side note: I don't know how useful having the barriers for synchronization is here.

@danhhz (PR author):

It only had one await call, but it should have had two :). One for the barrier, and one for the sleep that I've now added in this PR (this was the busy-waiting "bug" I mentioned above).
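
The fix has roughly this shape (an assumed sketch, not the PR's exact code): the spin is replaced with an async sleep, so the worker thread is yielded back to the runtime until the next batch is due.

    use tokio::time::{sleep_until, Instant};

    // Assumed shape of the fix, not the PR's exact code: instead of
    // busy-waiting until the next batch is due, yield the worker thread
    // back to the runtime. Returns immediately if the deadline has passed.
    async fn wait_until_next_batch(next_batch_due: Instant) {
        sleep_until(next_batch_due).await;
    }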

@danhhz (PR author):

Went ahead and pushed just the data generation call into a spawn_blocking.
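
That change has roughly this shape (a hedged sketch; DataGenerator and gen_batch are hypothetical stand-ins, not the benchmark's real names): only the CPU-bound, non-yielding call moves onto tokio's blocking thread pool, while the surrounding task stays async.

    use tokio::task::spawn_blocking;

    // Hypothetical stand-ins for the benchmark's generator types.
    struct DataGenerator;
    struct Batch;

    impl DataGenerator {
        fn gen_batch(&mut self) -> Batch {
            Batch // the CPU-bound work with no yield points lives here
        }
    }

    // Only the non-yielding generation call runs on the blocking pool;
    // the surrounding task stays async and can still await barriers and
    // sleeps.
    async fn generate_one(mut gen: DataGenerator) -> (DataGenerator, Batch) {
        spawn_blocking(move || {
            let batch = gen.gen_batch();
            (gen, batch)
        })
        .await
        .expect("data generation task panicked")
    }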

@@ -343,17 +347,13 @@ where
pub struct ReadHandle<K, V, T, D>
where
    T: Timestamp + Lattice + Codec64,
    // TODO: Only the T bound should exist, the rest are a temporary artifact of
@aljoscha:

🙌

{
    pub(crate) reader_id: ReaderId,
    pub(crate) machine: Machine<K, V, T, D>,
    pub(crate) blob: Arc<dyn BlobMulti + Send + Sync>,

    pub(crate) since: Antichain<T>,
    pub(crate) politely_expired: bool,
@aljoscha:

Maybe explicitly_expired? I see what you mean, but I don't know if others will.

Keep in mind: this is a nit from a non-native English speaker... 😅

@danhhz (PR author):

done!

@danhhz (PR author) commented May 10, 2022

@aljoscha did you want to take a look at the open_loop changes I made in response to your comment, or is this ready to go?

@aljoscha:

I did take a look, and the changes seem good!

Commits (2):

  • This can cause deadlocks when used with tokio Runtimes. Touches #12231
  • It was being used in the open loop benchmark because the data generator task busy-waited. Remove the busy waiting and the block_on. Touches #12231
@danhhz (PR author) commented May 10, 2022

TFTR!

@danhhz danhhz enabled auto-merge May 10, 2022 16:55
@danhhz danhhz merged commit dac1b04 into MaterializeInc:main May 10, 2022
@danhhz danhhz deleted the persist_block_on branch May 10, 2022 17:33