Conversation
I'm currently running the benchmarks to update the results in the crate's README. The previous results were wrong since we weren't actually creating public accounts.
bobbinth
left a comment
Looks good! Thank you! I left one small comment inline. Also, I believe we are still planning to update the benchmark results, right?
let (id, vault, storage, code, nonce) = faucet.clone().into_parts();
let updated_faucet = Account::from_parts(id, vault, storage, code, nonce + ONE);
faucet.increment_nonce(ONE).unwrap();
Update: check_nullifiers_by_prefix was panicking in every run. The issue is that we were syncing the 100k accounts in one call, but the DB is capped at 1k items per query.
Was this during data seeding or during sync tests? For sync tests, we should probably not query for more than a handful of accounts.
This was during sync tests. We were requesting a sync for all the accounts at once.
ATM I changed it to do a request every 1k elements. Should I change it to do just one request, with at most 1k accounts for sync state and 1k IDs for
I'm trying to think of a "typical request". I think this would involve a handful of accounts (maybe 2 - 3) and a small number of note tags (let's say 10 - 15). So, each request should be roughly like this. For
What does "every 1k elements" mean here?
Ok, awesome. I was talking about this with @igamigo and he said basically the same thing. I will be reducing the number of accounts and notes.
In the case of sync_state I meant accounts, since they are translated into rows in the query. And for the
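The batching discussed above (one request per 1k elements) can be sketched with slice chunking. This is an illustration only: `AccountId` is stubbed as `u64` and `batch_ids` is a hypothetical helper, not the crate's actual API; the 1k cap mirrors the DB query limit mentioned earlier.

```rust
// Sketch: split a large list of account IDs into DB-sized chunks,
// issuing one request per chunk instead of one giant request.
// `AccountId` is stubbed as u64 for illustration.
type AccountId = u64;

const MAX_IDS_PER_QUERY: usize = 1_000;

fn batch_ids(ids: &[AccountId]) -> Vec<&[AccountId]> {
    ids.chunks(MAX_IDS_PER_QUERY).collect()
}

fn main() {
    let ids: Vec<AccountId> = (0..100_000).collect();
    let batches = batch_ids(&ids);
    // 100k IDs split into 100 full batches of 1k each.
    assert_eq!(batches.len(), 100);
    assert!(batches.iter().all(|b| b.len() <= MAX_IDS_PER_QUERY));
    println!("{} batches", batches.len());
}
```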
bobbinth
left a comment
Looks good! Thank you! Not a full review yet - but I left some questions/comments inline.
Average request latency: 70.498732ms
P95 request latency: 7.286208ms
Why is average 10x higher than P95? Does this mean that there are some requests that take super long?
I'm checking this. I expanded the percentiles shown a bit, and these are the current results for sync state:
Average request latency: 68.031821ms
P50 request latency: 1.027125ms
P95 request latency: 1.615ms
P99 request latency: 2.838417ms
P99.9 request latency: 41.842534834s
Average notes per response: 1.323
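For context on how the average can sit so far above P95: with a heavy-tailed distribution, a handful of very slow requests dominate the mean while leaving the lower percentiles untouched. A minimal sketch using the nearest-rank percentile definition (the numbers are illustrative, not the benchmark's actual data):

```rust
// Illustration: one extreme outlier pulls the mean far above P95.
// Latencies are in milliseconds; nearest-rank percentile definition.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(sorted.len() - 1)]
}

fn main() {
    // 999 fast requests (~1 ms) plus a single 42-second outlier.
    let mut lat: Vec<f64> = vec![1.0; 999];
    lat.push(42_000.0);
    lat.sort_by(|a, b| a.partial_cmp(b).unwrap());

    let mean = lat.iter().sum::<f64>() / lat.len() as f64;
    // Mean is ~43 ms, yet P95 is still 1 ms.
    assert!(mean > percentile(&lat, 95.0));
    println!("mean={mean:.3}ms p95={}ms", percentile(&lat, 95.0));
}
```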
Ah - interesting! So, almost all requests complete in under 3ms, but there are some requests that take 40+ seconds?
Not related to this PR, but I wonder if such requests should just time out somehow. Like if we can't return within 1 second, we should just cancel the request.
Ah - interesting! So, almost all requests complete in under 3ms, but there are some requests that take 40+ seconds?
Yes! That's right. I haven't yet managed to determine which requests are taking that long.
Not related to this PR, but I wonder if such requests should just time out somehow. Like if we can't return within 1 second, we should just cancel the request.
Sounds good, will open an issue.
I would imagine/hope that it's not actually the DB request that's taking so long, but rather that the request was waiting to be allocated a database connection from the pool.
There is https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.disable_lifo_slot
tl;dr: if we have a task that does some IO/CPU work in sync mode and blocks that slot indefinitely, a re-used connection might be blocked for a very long time before ever being polled again. Tokio's scheduler has a heuristic for message-passing patterns, but it bites anyone who runs sync-heavy workloads, so disabling it is a good thing. The underlying issue is that it effectively introduces a second queue that cannot be stolen but has priority over the stealable task queue: tokio-rs/tokio#4941
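As a sketch of what disabling that heuristic looks like (this requires the tokio crate and a manually built runtime; shown for illustration, not as the node's actual runtime setup):

```rust
// Build a multi-threaded Tokio runtime with the LIFO slot disabled,
// so no task can sit in a worker's non-stealable priority slot.
fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .disable_lifo_slot()
        .build()
        .expect("failed to build runtime");

    rt.block_on(async {
        println!("runtime up without the LIFO slot");
    });
}
```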
It is possible that the first request gets stuck waiting for the nullifier tree to be loaded from the DB. For example, this is the result of only 1 sync state:
2025-07-31T21:45:03.102847Z INFO load: miden_node_store::state: crates/store/src/state.rs:1004: Loaded nullifier tree, num_of_leaves: 99960, tree_construction: 21, COMPONENT: "miden-store"
2025-07-31T21:45:24.308338Z INFO load: miden-store: crates/store/src/state.rs:97: close, time.busy: 42.6s, time.idle: 55.5ms
2025-07-31T21:45:24.309338Z INFO miden-store: crates/store/src/server/mod.rs:133: Database loaded
2025-07-31T21:45:24.310547Z INFO store.rpc.rpc/SyncState: miden_node_utils::tracing::grpc: crates/utils/src/tracing/grpc.rs:90: close, time.busy: 387µs, time.idle: 559µs rpc.service: "store.rpc.rpc", rpc.method: "SyncState"
Average request latency: 42.624749833s
The request latency matches the time it takes to load the nullifier tree.
I tried adding a request to the store status before starting to sync the state (and not counting the time it takes), and these are the results for 10k iterations:
Average request latency: 323.768µs
P50 request latency: 297.042µs
P95 request latency: 414.209µs
P99 request latency: 1.1775ms
P99.9 request latency: 2.198292ms
Average notes per response: 1.3128
Ah - interesting! I would have thought that the request would have timed out. Generally though, we should only start the benchmarks once the store has been fully loaded. Could we use the Status endpoint for this? That is, make a request to the status endpoint and only start the benchmark once we receive a response.
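The suggested warm-up (poll the Status endpoint and start benchmarking only once it answers) can be sketched as a generic readiness loop. `probe` here stands in for the actual gRPC status call, which is an assumption for illustration:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Sketch: block until a readiness probe succeeds or a deadline passes.
// `probe` stands in for a request to the store's Status endpoint.
fn wait_until_ready(mut probe: impl FnMut() -> bool, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if probe() {
            return true;
        }
        sleep(Duration::from_millis(100));
    }
    false
}

fn main() {
    // Simulated store that becomes ready on the third probe.
    let mut attempts = 0;
    let ready = wait_until_ready(
        || { attempts += 1; attempts >= 3 },
        Duration::from_secs(5),
    );
    assert!(ready);
    println!("store ready after {attempts} probes");
}
```

Only once this returns true would the benchmark's timed iterations begin, so the one-off nullifier-tree load no longer skews the latency numbers.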
bobbinth
left a comment
There was a problem hiding this comment.
Looks good! Thank you. I left one more question inline.
Also, I think we should still track down what's happening in #1113 (comment) - but this could be done in a follow-up PR.
closes #1110