Make long running CPU operations async friendly #122

Open
dan-da opened this issue Mar 26, 2024 · 9 comments · May be fixed by #159

@dan-da (Collaborator) commented Mar 26, 2024

We should not let CPU-intensive functions block the tokio executor for long, as this harms concurrency and can stall other tasks: p2p connections, RPC calls, mining, etc. [1]

The task here will be to identify non-async functions in neptune-core that perform lengthy operations, and then wrap them with tokio's spawn_blocking() or possibly block_in_place().

"lengthy" is of course a vague term. As a starting point I would consider 10 ms lengthy, 100 ms very lengthy, and 1+ seconds as falling over dead. [2]

Known candidates:

Add others above as we identify them.

Note: At this time, I'm really most interested in dealing with the very worst offenders, such as proving, which I'm told may take minute(s). I think the others are low priority and can be dealt with as we notice them through testing. So I'm not proposing with this issue that we spend time now on profiling and optimization.

[1] As an alternative, or in addition, we might consider creating an OS thread for the RPC server, mining, and possibly each p2p connection, which are presently all tokio tasks in tokio's threadpool. async/await uses cooperative multi-tasking, while OS threads use preemptive multi-tasking.

[2] Places where triton-vm scripts are run:

$ find src -type f | xargs grep "\.run(" | grep -v peer_loop | grep -v currency.rs | grep -v time_lock.rs | grep -v integrity.rs | grep -v lib.rs | grep -v "//"
src/models/consensus/mod.rs:                            .run(claim.input.clone().into(), raw_witness.clone().into())
src/models/consensus/mod.rs:                            .run(claim.input.clone().into(), raw_witness.clone().into())
src/models/consensus/mod.rs:                TimeLock.run(&public_input.individual_tokens, non_determinism)
src/models/consensus/mod.rs:                NativeCurrency.run(&public_input.individual_tokens, non_determinism)
src/models/blockchain/transaction/mod.rs:            match lock_script.program.run(
src/models/blockchain/transaction/mod.rs:                .run(public_input.into(), NonDeterminism::new(secret_input))

[3] I'm not sure there is any clear definition of what constitutes "blocking". This quote seems insightful:

I'm not defining what "blocking" means, I'm just pointing out that the standard library specifies that anything implementing Future (this includes async code) should not block. As far as I can tell, they don't define what blocking means either. You could ask them for clarification.

Practically though, a function is blocking if it hogs one bottleneck resource (cpu/io/the network/a lock/etc) for long enough that other code that has a different bottleneck resource could make significant progress in the meantime. As you point out this is not a black and white definition, it's a bit subjective.

I found an interesting discussion about blocking async code here written in response to this. Apparently async_std in 2019 had a prototype executor able to detect lengthy blocking code and start a new thread for it automatically (but it was never merged). And there are various comments about what constitutes blocking code or not.

@dan-da (Collaborator, Author) commented Mar 30, 2024

This article is very relevant and highly recommended reading:
https://ryhl.io/blog/async-what-is-blocking/

In particular it addresses:

  • when to use spawn_blocking vs starting an OS thread vs rayon
  • how long is too long to block in async code (10 - 100 microseconds, rule of thumb)
  • how many threads in tokio blocking threadpool (up to 500)
  • when/how to use rayon (CPU intensive work)
  • how to keep async code from blocking when invoking a rayon thread (and, by extension, a regular OS thread); see the sketch below
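
For the last bullet, the pattern described in the article looks roughly like this (a sketch assuming the tokio and rayon crates; the function itself is illustrative):

    use tokio::sync::oneshot;

    // Offload CPU-bound work to rayon's threadpool and await the result over a
    // oneshot channel, so no tokio worker thread is ever blocked.
    async fn sum_of_squares(nums: Vec<u64>) -> u64 {
        let (tx, rx) = oneshot::channel();
        rayon::spawn(move || {
            use rayon::prelude::*;
            let sum: u64 = nums.par_iter().map(|n| n * n).sum();
            // The receiver may have been dropped; ignore the error in that case.
            let _ = tx.send(sum);
        });
        rx.await.expect("rayon task dropped the sender without sending")
    }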

@dan-da (Collaborator, Author) commented Apr 2, 2024

A Reddit discussion from today is also very relevant:
https://old.reddit.com/r/rust/comments/1btggxp/number_of_tokio_threads_vs_number_of_rayon/

@Sword-Smith (Member) commented Apr 8, 2024

I read the "What is blocking" article.

I'll try to add some context that I think is relevant for neptune-core:

A regular user that is mining will spawn 2 threads for the mining operation, 1 for the main loop, 1 to 2 for RPC calls (dashboard + CLI), and N for peer threads, where N is the number of connected peers which defaults to 10. This is 15 tokio threads, so on a regular desktop, there will be enough CPUs available (the number that htop or lscpu | grep -E '^Thread|^Core|^Socket|^CPU\(' reports) for this program alone. In a case where other programs run concurrently on the same machine, neptune-core can of course not use all system resources, so maybe only 8 CPUs will be available for neptune-core. Importantly, neptune-core does not function like a webserver, where a new thread is spawned to serve each incoming request. There is a predictable max number of tokio threads that will be spawned. 15 for regular users, and maybe 25 for users which use the RPC CLI very heavily.

That's a quantitative consideration of the number of tokio threads. How about bottlenecks? I see two computational situations that are specific to neptune-core, and that should lie at the center of our considerations:

  1. Updating the state with a new block
  2. Creating proofs through Triton VM
  1. When the state is updated with a new block, a write-lock is held over GlobalState, preventing most if not all CLI operations of interest from executing. Even if we were to go back to having multiple locks over GlobalState for more fine-grained control, we would still need to ensure atomic state updates upon receiving a new block, meaning that (pretty much) all locks would need to be taken before a state update could be applied. So we need to accept the fact that most read access is blocked as long as a state update is being performed. If we want to optimize anything in this regard, it should be to reduce the amount of time that a state update takes; hence feat: log duration of adding new block #130.

Practically though, a function is blocking if it hogs one bottleneck resource (cpu/io/the network/a lock/etc) for long enough that other code that has a different bottleneck resource could make significant progress in the meantime.

Going with the above definition of blocking, it seems to me that the state-updating will always be blocking (as it holds a write-lock over GlobalState), and that no amount of await will change that. Please correct me if you disagree.

  2. Of equally high importance is the generation of Triton VM proofs. These proofs are how consensus is defined, and they're thus needed both for issuing a new transaction and for creating a new block. Currently, generating a proof for a transaction takes tens of seconds, and for a block tens of minutes (and we don't even have all the logic for blocks defined yet). When a user wants to create a transaction, a lock needs to be held over GlobalState (probably a write lock). The lock must be held over GlobalState since all the cryptographic data used to construct the transaction must be obtained from a consistent view of GlobalState. If e.g. the mutator set accumulator and the removal records come from two different versions of GlobalState, say the accumulator at height $n$ and the removal records at height $n+1$, the generated transaction will be invalid. I just generated a 3-input, 1-output transaction on a powerful Core i9 laptop. It took 20 seconds, and the proving part of the transaction generation used all my 16 CPUs for those 20 seconds. Triton VM uses Rayon multi-threading internally; without Rayon the proof generation would have taken more than 120 seconds.

These two operations, consensus and transaction generation, are the core and purpose of any blockchain client, and everything else that's going on must be subject to their whims. I'd happily sacrifice the ability to perform CLI calls and mine while proving takes place if that means that the prover will be faster. I think we can assume that only one proof is being generated at a time, as the transaction creation logic holds a write-lock over GlobalState.

None of this is to say that I disagree with any of your above considerations. But going forward, it's crucial to me that we agree on priorities.

@dan-da (Collaborator, Author) commented Apr 9, 2024

edit: I wrote the below and then was thinking more about the general problem of lengthy atomic updates, and it occurs to me there may be some solutions available that enable lock-free readers or possibly even concurrent writers. I am thinking of something like arc-swap or mvcc. I will explore in a follow-up comment.


Thanks for the writeup.

Perhaps I can summarize the thrust of your comments as: new-block-updating and proving are slow and require the global write-lock, so everything else will be blocked anyway, and thus we shouldn't waste any time on speeding up things that aren't even executing until these tasks finish. Is that a fair summary?

Anyway, I think that is a valid point, worth consideration. Please consider the bulk of this post an analysis/consideration of that point.

edit: As of now, I'm much more worried about proving than new-block-updating, but that could change if we see it get very slow, as it affects more use-cases. My below comments are mainly about proving, but some apply equally to block-updates.

First I want to clear up some terminology and tokio behavior so we can be on the same page talking about these things going forward.

You state we spawn 2 mining threads, and so on. In point of fact, we spawn tokio tasks, NOT threads. Tokio tasks execute on tokio's worker threadpool and multiple tasks share each tokio worker thread. Tokio normally has one worker thread per CPU. So on my older laptop with 4 cores, there are only 4 tokio worker threads. All tokio tasks execute on these 4 threads and are scheduled cooperatively. They cooperate each time they call await. Here is a very short summary that explains it better than I can.
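
For reference, the worker-thread count is a property of the runtime itself, not of how many tasks we spawn. A minimal sketch using the standard tokio builder (the explicit count is only illustrative; omitting it gives the one-per-CPU default):

    // tokio's multi-threaded runtime defaults to one worker thread per CPU
    // core; all spawned tasks are multiplexed onto these threads cooperatively.
    fn build_runtime() -> tokio::runtime::Runtime {
        tokio::runtime::Builder::new_multi_thread()
            .worker_threads(4) // illustrative override; omit to use the default
            .enable_all()
            .build()
            .expect("failed to build tokio runtime")
    }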

If we ignore proving for the moment and consider all other operations, then I hope we will agree we want those to be as concurrent as possible and not block each other any more than necessary for atomicity. You give an example of a "regular user", but I'm not sure who that is. Is it a solo hodler? Is it a miner? Is it a small merchant? Is it a huge merchant? Is it a block explorer? Is it automated trading software? Is it a web-app? Is it an exchange? When I consider bitcoin-core, I alone have used it for several different applications, and usually those were automated. Some were read-only and never sent any Tx. Typically a web-app has both online and batch activities occurring via RPC calls, and the frequency of online calls will increase with the number of website visitors. So while neptune-core is not a web-server, it is almost certain to be used as a backend oracle/gateway for one. Further, some users/miners are likely to want to be connected to more than the default number of peers, and can do so with a simple config change. Perhaps 100+. So that's a lot more tasks cooperating, all sharing the same 4 (or 8+) tokio worker threads. I provide these as a sketch of the types of use-cases I expect when I think about concurrency.

I think it is worthwhile to recognize that some use-cases never involve mining a single block or sending a single transaction. Consider a block explorer. And others involve sending transactions only very infrequently, e.g. a merchant. Such use-cases are dominated by tasks that do not call tx.prove().

The other day I performed a count of RPC calls we presently offer, divided up by whether they acquire write-lock, read-lock, or neither.

For the record, here they are:

read-lock: 18.
get_balance, own_instance_id, block_height, tip_digest, latest_tip_digests, peer_info, all_sanctioned_peers, amount_leq_synced_balance, synced_balance, wallet_status, tip_header, header, own_receiving_address, mempool_tx_count, mempool_size, history, dashboard_overview_data, list_own_coins.

write-lock: 4.
clear_all_standings, clear_standings_by_ip, send, prune_abandoned_monitored_utxos.

no-lock: 7.
network, own_listened_address_for_peers, validate_address, validate_amount, shutdown, pause_miner, restart_miner.

Now let's talk about proving.

afaik, proving presently occurs in two situations:

  1. the send RPC is called, which entails proving a user-Tx. (slow, minutes)
  2. a new block is mined, which entails proving a block-Tx. (very slow, tens of minutes)

Ok, so proving a block-tx is limited to miners, and to a single task and OS thread. And only when a miner actually finds a block, which is a happy event. It is true it will block all tasks that acquire the read or write lock for a very long time when it occurs, perhaps most importantly responding to P2P messages! But beyond that I think we don't need to worry about it too much because miners are dedicating machines to that task and it doesn't affect anyone else.

I am more concerned about (1) which affects anyone who sends a transaction.

That is absolutely problematic as it is blocking all the read-lock and write-lock RPCs. It is also blocking some peer messages from being responded to. I'm not certain which of them, or how many. That would perhaps be a useful exercise to quantify, as I did for the RPCs. So anything that can bring that proving time down is a win. Full agreement!

This issue #122 is partially about adding spawn_blocking() around prove(), so let's consider the case where we do versus the case where we do not. I will use my 4-core laptop as an example.

With no spawn_blocking(), one of the 4 tokio worker threads is blocked while prove() is executing. This leaves 3 remaining tokio threads to process tasks. Let's say the send RPC gets called again. It tries to acquire the write-lock and it must wait on a tokio lock. That's ok: await is called on the tokio lock and the tokio executor on that thread continues processing tasks. Most other interesting RPC or p2p events that come in also require the read or write lock, so the same thing happens with those and we still have the 3 threads available for processing further events. Thus, we can at least respond to no-lock requests, perhaps the most important being shutdown.

The case where we do call spawn_blocking() is similar. In this case the prove() call executes on tokio's blocking threadpool and all 4 worker threads remain available for processing tasks. So the only difference is that we have 4 threads instead of 3 available for processing any tasks that do not require the read-lock or write-lock. It's a win, but perhaps not a meaningful one. Which I believe is your point. ;-)

For completeness, let's say that one day we decide to run with tokio's single-thread executor instead. Perhaps just for giggles, I dunno. In this case, the version that calls spawn_blocking() would still be able to process a shutdown call, for example, while the non-spawn_blocking version would not respond to the RPC client, because the thread is stuck executing prove().

Okay, so I acknowledge your point that proving is an important gating issue/problem, and that other efforts to improve concurrency in the async realm may not bring much relative benefit, at present.

Even so, I think that #133 should be merged, for these reasons:

  1. I still believe the code to be an improvement, if only a small one. The changes don't hurt anything. No drawbacks that I'm aware of.

  2. The code is already written, so it's a sunk cost. This code was written in parallel with make storage async, attempt 3. #124 and so was just waiting for it to be merged.

  3. It contains a small change for pausing/resuming the miner and running the mining loop in non-async fashion without acquiring the global lock, which will be helpful for, ahem, "future improvements".

  4. The changes to Transaction may potentially help tasks when prove() is not being called at all.

  5. If proving should ever become fast, or we find a way to decouple from the write-lock, then the async blocking issues become more important.

  6. I've been viewing make long-running CPU tasks run on tokio's blocking threadpool. #133 as kind of finishing up the overall task of making neptune-core async-friendly. It takes aim at some low-hanging fruit and as such did not require any time for profiling and analysis. I was already planning this to be my final PR in the async-concurrency realm for the time being, so if that's the concern, no need to worry. Plenty of other things to do.

So those are my thoughts. I'm not too hung up on #133 -- if it doesn't get approved for merge, so be it. 😄

@dan-da (Collaborator, Author) commented Apr 9, 2024

Ok, let's brainstorm a bit.

regarding tx-proving inside the send RPC: Does the prove() have to be inside a write-lock? What if we gathered all the inputs with a read-lock, released it, and then called prove()? If global state has not changed in the meantime, it should still be safe to write, correct? So we grab a write-lock and quickly finish up generating the Tx. And if the global state has been modified, e.g. a new block came in, then whoops, go back and try it again.
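
A minimal sketch of that optimistic scheme (all types and names below are illustrative stand-ins, not the real neptune-core API):

    use std::sync::Arc;
    use tokio::sync::RwLock;

    // Illustrative stand-ins for the real global state and transaction types.
    struct State { tip_height: u64, tx_inputs: Vec<u64> }
    struct Tx { based_on_tip: u64 }

    // Snapshot inputs under a read lock, prove with no lock held, then re-check
    // the tip under a write lock and retry if a new block arrived meanwhile.
    async fn send_optimistic(state: Arc<RwLock<State>>) -> Tx {
        loop {
            let (inputs, tip_seen) = {
                let s = state.read().await;
                (s.tx_inputs.clone(), s.tip_height)
            }; // read lock released here

            // Slow proving runs on the blocking pool, with no lock held.
            let tx = tokio::task::spawn_blocking(move || {
                let _work: u64 = inputs.iter().sum(); // stand-in for prove()
                Tx { based_on_tip: tip_seen }
            })
            .await
            .expect("proving task panicked");

            let s = state.write().await;
            if s.tip_height == tip_seen {
                return tx; // state unchanged: safe to finish up the Tx
            }
            // A new block came in: whoops, go back and try it again.
        }
    }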

related: Please remind me why creating a Tx needs to modify the global state at all. In bitcoin et al., that's not the case: a Tx can be generated entirely offline and then broadcast by anyone. If we could do that, then tx-proving could operate on its own OS thread and not block anybody.


If we think about the general problem of having shared state that takes a long time to update, one starts to consider working on multiple versions of the state. If the writer can be writing to a new state version while the readers see the old version, then the readers do not need to lock at all. In Rust, the arc-swap crate provides this functionality. Going further, many databases, such as PostgreSQL, allow multiple writers to write concurrently, each on their own version and their own view of the state, using multi-version concurrency control (MVCC). MVCC libraries and database bindings are available for Rust.
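
A tiny sketch of the arc-swap approach (assuming the arc_swap crate; the snapshot type is illustrative):

    use std::sync::Arc;
    use arc_swap::ArcSwap;

    // Illustrative snapshot type; in practice this would be a view of GlobalState.
    struct Snapshot { block_height: u64 }

    fn demo() {
        let shared = ArcSwap::from_pointee(Snapshot { block_height: 56 });

        // Readers get a lock-free handle to the current version.
        let old = shared.load_full();

        // A writer prepares the next version off to the side, then swaps it in.
        shared.store(Arc::new(Snapshot { block_height: 57 }));

        // Readers that already loaded still see the old version; new loads see 57.
        assert_eq!(old.block_height, 56);
        assert_eq!(shared.load().block_height, 57);
    }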

So in theory, it seems like it should be possible to solve this problem in an elegant fashion. A practical difficulty stands in the way, however, because our GlobalState represents the MutatorSet db, the wallet db, and the blockchain files. So for this to work well, I believe we would need to use a persistent datastore that supports MVCC. That would of course be a big change. However, sometimes practical realities dictate that change is necessary, and this could be one of those times.

That's it for now. I'm not intending an exhaustive writeup here. Just thinking outside the box a little.

@Sword-Smith (Member)

Ok, let's brainstorm a bit.

regarding tx-proving inside the send RPC: Does the prove() have to be inside a write-lock? What if we gathered all the inputs with a read-lock, released it, and then called prove()? If global state has not changed in the meantime, it should still be safe to write, correct? So we grab a write-lock and quickly finish up generating the Tx. And if the global state has been modified, e.g. a new block came in, then whoops, go back and try it again.

related: Please remind me why creating a Tx needs to modify the global state at all. In bitcoin et al., that's not the case: a Tx can be generated entirely offline and then broadcast by anyone. If we could do that, then tx-proving could operate on its own OS thread and not block anybody.

You're right that we might not need a write lock during proving. My thoughts were that the wallet should mark the UTXOs used as input to the transaction somehow, such that if a later transaction were made before the 1st one is mined, the UTXOs would not be used twice. Note that this marking of spent-but-not-yet-mined UTXOs is not being done yet.

This marking would change the global state, but it doesn't have to be done with any atomicity, so I don't think a write lock is needed during proving, even after we add this feature of preventing the reuse of UTXOs.

You should be aware though, that the validity of a transaction is always proved in relation to a specific block. The accumulator scheme (the mutator set) changes for each block, and it is what allows us to prove membership of the UTXO set (all unspent, historical transaction outputs) without having to know all historical transactions.

@Sword-Smith (Member) commented Apr 10, 2024

Perhaps I can summarize the thrust of your comments as: new-block-updating and proving are slow and require the global write-lock, so everything else will be blocked anyway, and thus we shouldn't waste any time on speeding up things that aren't even executing until these tasks finish. Is that a fair summary?

That's a part of it. A perhaps bigger problem is our novelty budget [1][2]. The novelty of the Neptune blockchain comes from its cryptography (new hash function, new accumulator scheme, new STARK engine with associated VM, and tons of assembler for this VM). The people working with this part of the stack need to be allowed to dedicate all their brain cycles there, so the rest of the codebase should be kept as simple as possible.

Footnotes

  1. Associated Hacker News discussion: https://news.ycombinator.com/item?id=28395328

  2. From the above link, I also came across this blogpost.

@dan-da (Collaborator, Author) commented Apr 15, 2024

You're right that we might not need a write lock during proving.

Cool, so I did a little experiment and made create_transaction() take &self instead of &mut self. This also means that the send RPC only needs to acquire a read lock. There were several sub-functions of create_transaction() that also took &mut self, but in the end it was only because of the call to GlobalState::add_change(), which has this:

        // Add change UTXO to pool of expected incoming UTXOs
        let receiver_preimage = own_spending_key_for_change.privacy_preimage;
        let _change_addition_record = self
            .wallet_state
            .expected_utxos
            .add_expected_utxo(
                change_utxo.clone(),
                change_sender_randomness,
                receiver_preimage,
                UtxoNotifier::Myself,
            )
            .expect("Adding change UTXO to UTXO notification pool must succeed");

I commented this code out, and everything builds just fine with &self and no write lock. Hooray!

This code is modifying wallet_state, so it truly does need &mut self. But....

Is it really necessary to add this UTXO at time of Tx creation? Inside create_transaction() we can't even say with any certainty that this Tx will ever be broadcast. In other words, I see a separation of concerns between create_transaction() and send(). So I would think it makes sense to call add_expected_utxo() later in send() RPC after create_transaction() completes. Do you see any problem with this?

Alternatively, we could move the write-lock acquisition deep inside where this mutation actually occurs, and it would no longer be wrapping the call to prove(). That seems a lot uglier to me though.

You should be aware though, that the validity of a transaction is always proved in relation to a specific block. The accumulator scheme (the mutator set) changes for each block, and it is what allows us to prove membership of the UTXO set (all unspent, historical transaction outputs) without having to know all historical transactions.

To a specific block or to the previous block? In other words, let's say we start creating a transaction when block with height 56 is the tip, and by the time we are done with proving and creating the Tx based on 56, now block 57 is the tip. In this case, will our Tx still be valid, or we need to recreate it based on block 57?

If the Tx based on 56 will still validate, then it seems like the change I propose above should be fine. And logically I think that must be the case, else it is already possible for the Tx to be invalid for other nodes, as there are potential time delays.

But I'm a little out of my depth here, so will toss it back to you for your thoughts @Sword-Smith.

@Sword-Smith (Member) commented Apr 15, 2024

Is it really necessary to add this UTXO at time of Tx creation? Inside create_transaction() we can't even say with any certainty that this Tx will ever be broadcast. In other words, I see a separation of concerns between create_transaction() and send(). So I would think it makes sense to call add_expected_utxo() later in send() RPC after create_transaction() completes. Do you see any problem with this?

I don't really see a problem with that, other than that the client could be shut down between the broadcast of a transaction and when the expected incoming UTXO is added. So I would suggest flipping the order: let's store the expected incoming UTXO before the transaction is broadcast, but the write lock can still be held for much less time than it is now. Would that work?

Maybe that's what you're referring to here?

Alternatively, we could move the write-lock acquisition deep inside where this mutation actually occurs, and it would no longer be wrapping the call to prove(). That seems a lot uglier to me though.

Yeah, we might have to do something like that in order to get the ordering correct and prevent the potential loss of the change UTXO.
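
A minimal sketch of that ordering (all names are illustrative stand-ins, not the actual neptune-core API): prove with no lock held, record the expected change UTXO under a brief write lock, and only then broadcast.

    use std::sync::Arc;
    use tokio::sync::RwLock;

    // Illustrative stand-ins for wallet state and a fully proven transaction.
    struct WalletState { expected_utxos: Vec<u64> }
    struct Tx { change_utxo: u64 }

    async fn broadcast(_tx: &Tx) { /* stand-in for peer broadcast */ }

    // The transaction is already created and proven (no write lock was needed).
    async fn finish_send(wallet: Arc<RwLock<WalletState>>, tx: Tx) {
        {
            // Held only long enough to record the expected change UTXO.
            let mut w = wallet.write().await;
            w.expected_utxos.push(tx.change_utxo);
        }
        // Broadcast afterwards, so a crash in between cannot lose the change UTXO.
        broadcast(&tx).await;
    }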

In this case, will our Tx still be valid, or we need to recreate it based on block 57?

The transaction would either need to be recreated, or its proof would have to be updated. We haven't written the logic for the latter yet.

dan-da added a commit to dan-da/neptune-core that referenced this issue Jun 7, 2024
Addresses Neptune-Crypto#122

Enables caller of create_transaction() to choose whether the Change
UTXO will use OnChain or OffChain notification.

Note that caller could already do this since the last commit by
manually creating the Change UTXO so that sum(inputs) == sum(outputs).
Therefore, this change only makes it a bit easier, and is more
consistent in that the caller can easily control the notification
behavior of all outputs.

The default behavior is still OffChain notification and doc-comments
and example usage recommend this for normal usage.

Also, this fine level of control is not available in the RPC send and
send_to_many endpoints.  Only in the rust API.

Changes:
 * add change_notify_method arg to create_transaction()
 * update tests to use extra arg.
 * rename UtxoReceiverData --> UtxoReceiver
 * rename TransactionData --> TransactionDetails
 * fix: backwards logic in UtxoRecipient::auto()
 * improve doc-comment for create_transaction()
 * add ChangeNotifyMethod enum
 * add some missing doc-comments
dan-da linked a pull request Jun 7, 2024 that will close this issue