
fix: stress test accounts creation #1113

Merged: bobbinth merged 16 commits into next from santiagopittella-fix-stress-test on Aug 2, 2025

Conversation

@SantiagoPittella
Collaborator

closes #1110

@SantiagoPittella
Collaborator Author

I'm currently running the benchmarks to update the results in the crate's README. The previous results were wrong since we weren't actually creating public accounts.

Contributor

@bobbinth bobbinth left a comment


Looks good! Thank you! I left one small comment inline. Also, I believe we are still planning to update the benchmark results, right?

Comment thread CHANGELOG.md Outdated
Collaborator

@TomasArrachea TomasArrachea left a comment


Thanks!


- let (id, vault, sorage, code, nonce) = faucet.clone().into_parts();
- let updated_faucet = Account::from_parts(id, vault, sorage, code, nonce + ONE);
+ faucet.increment_nonce(ONE).unwrap();
Contributor


👍

@SantiagoPittella
Collaborator Author

Update: `check_nullifiers_by_prefix` was panicking on every run. The issue is that we were syncing the 100k accounts in one call, but the DB is capped at 1k items per query.

@bobbinth
Contributor

> Update: `check_nullifiers_by_prefix` was panicking on every run. The issue is that we were syncing the 100k accounts in one call, but the DB is capped at 1k items per query.

Was this during data seeding or during sync tests? For sync tests, we should probably not query for more than a handful of accounts.

@SantiagoPittella
Collaborator Author

> Update: `check_nullifiers_by_prefix` was panicking on every run. The issue is that we were syncing the 100k accounts in one call, but the DB is capped at 1k items per query.
>
> Was this during data seeding or during sync tests? For sync tests, we should probably not query for more than a handful of accounts.

This was during sync tests. We were requesting a sync for all the accounts at once.

@SantiagoPittella
Collaborator Author

ATM I changed it to make a request every 1k elements. Should I change it to make just one request, with at most 1k accounts for sync state and 1k IDs for `get_notes_by_id`? cc @bobbinth

@bobbinth
Contributor

> with at most 1k accounts for sync state and 1k IDs for `get_notes_by_id`? cc @bobbinth

I'm trying to think of a "typical request". I think this would involve a handful of accounts (maybe 2 - 3) and a small number of note tags (let's say 10 - 15). So, each request should be roughly like this.

For `get_notes_by_id` - I'd say 1k is probably too much. I'd again go with something around 20.

> ATM I changed it to make a request every 1k elements. Should I change it to make just one request?

What does "every 1k elements" mean here?

@SantiagoPittella
Collaborator Author

> I'm trying to think of a "typical request". I think this would involve a handful of accounts (maybe 2 - 3) and a small number of note tags (let's say 10 - 15). So, each request should be roughly like this.
>
> For `get_notes_by_id` - I'd say 1k is probably too much. I'd again go with something around 20.

Ok, awesome. I was talking about this with @igamigo and he said basically the same. I will reduce the number of accounts and notes.

> What does "every 1k elements" mean here?

In the case of `sync_state` I meant accounts, since they are translated into rows in the query. And for `get_notes_by_id` I meant notes, for the same reason.
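For illustration, the batching described above amounts to splitting the collected IDs into DB-sized chunks before issuing requests. A minimal sketch (names like `MAX_ROWS_PER_QUERY` and `into_batches` are hypothetical, not the actual stress-test code):

```rust
/// Hypothetical cap mirroring the store's 1k-rows-per-query limit.
const MAX_ROWS_PER_QUERY: usize = 1_000;

/// Split account IDs into chunks that each fit in a single DB query.
fn into_batches(account_ids: &[u64]) -> Vec<Vec<u64>> {
    account_ids
        .chunks(MAX_ROWS_PER_QUERY)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    // 100k seeded accounts become 100 requests of at most 1k IDs each.
    let ids: Vec<u64> = (0..100_000).collect();
    let batches = into_batches(&ids);
    println!("{} batches, last of size {}", batches.len(), batches.last().unwrap().len());
}
```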

Contributor

@bobbinth bobbinth left a comment


Looks good! Thank you! Not a full review yet - but I left some questions/comments inline.

Comment thread bin/stress-test/README.md Outdated
Comment thread bin/stress-test/README.md Outdated
Comment thread bin/stress-test/README.md Outdated
Comment thread bin/stress-test/README.md Outdated
Comment on lines +94 to +95
Average request latency: 70.498732ms
P95 request latency: 7.286208ms
Contributor


Why is average 10x higher than P95? Does this mean that there are some requests that take super long?

Collaborator Author


I'm checking this. I expanded the percentiles shown a bit, and these are the current results for sync state:

Average request latency: 68.031821ms
P50 request latency: 1.027125ms
P95 request latency: 1.615ms
P99 request latency: 2.838417ms
P99.9 request latency: 41.842534834s
Average notes per response: 1.323
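As context for why the average can sit far above the P95, percentiles like the ones above are typically computed by taking a rank in the sorted samples, so a few huge outliers move the mean but not the lower percentiles. A minimal sketch using the nearest-rank method (illustrative only, not the stress-test's actual implementation):

```rust
use std::time::Duration;

/// Nearest-rank percentile over a set of latency samples.
/// `p` is in (0.0, 100.0]; samples need not be pre-sorted.
fn percentile(samples: &[Duration], p: f64) -> Duration {
    let mut sorted = samples.to_vec();
    sorted.sort();
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(sorted.len() - 1)]
}

fn main() {
    // 99 fast requests plus one 40s outlier: the mean explodes, P95 doesn't.
    let mut samples = vec![Duration::from_millis(1); 99];
    samples.push(Duration::from_secs(40));
    let avg = samples.iter().sum::<Duration>() / samples.len() as u32;
    println!(
        "avg={avg:?} p95={:?} p99.9={:?}",
        percentile(&samples, 95.0),
        percentile(&samples, 99.9)
    );
}
```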

Contributor


Ah - interesting! So, almost all requests complete in under 3ms, but there are some requests that take 40+ seconds?

Not related to this PR, but I wonder if such requests should just time out somehow. Like if we can't return within 1 second, we should just cancel the request.

Collaborator Author


> Ah - interesting! So, almost all requests complete in under 3ms, but there are some requests that take 40+ seconds?

Yes! That's right. I haven't yet managed to determine which requests are taking that long.

> Not related to this PR, but I wonder if such requests should just time out somehow. Like if we can't return within 1 second, we should just cancel the request.

Sounds good, will open an issue.

Collaborator


I would imagine/hope that it's not actually the DB request that's taking so long, but rather that the request was waiting to be allocated a database connection from the pool.

Contributor

@drahnr drahnr Jul 31, 2025


There is https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.disable_lifo_slot

tl;dr: if we have a task that does some I/O or CPU work in sync mode and blocks that slot indefinitely, a re-used connection might be blocked for a very long time before ever being polled again. Tokio's scheduler has a heuristic that favors message-passing patterns, but it bites whenever anyone runs sync-heavy workloads, so disabling it is a good thing. The underlying issue is that the LIFO slot effectively introduces a second queue that cannot be stolen but has priority over the stealable task queue: tokio-rs/tokio#4941
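A sketch of what disabling the LIFO slot could look like when building the runtime by hand (a hypothetical configuration fragment, not code from this PR; note that `disable_lifo_slot` is an unstable tokio API that requires compiling with `RUSTFLAGS="--cfg tokio_unstable"`):

```rust
// Hypothetical runtime setup with the LIFO slot disabled, so a woken task
// can't sit in one worker's private slot behind a sync-heavy task.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .disable_lifo_slot() // unstable: needs --cfg tokio_unstable
        .build()?;
    runtime.block_on(async {
        // server entry point would go here
    });
    Ok(())
}
```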

Collaborator Author


It is possible that the first request gets stuck waiting for the nullifier tree to be loaded from the DB. For example, this is the result of only one sync state request:

2025-07-31T21:45:03.102847Z  INFO load: miden_node_store::state: crates/store/src/state.rs:1004: Loaded nullifier tree, num_of_leaves: 99960, tree_construction: 21, COMPONENT: "miden-store"
2025-07-31T21:45:24.308338Z  INFO load: miden-store: crates/store/src/state.rs:97: close, time.busy: 42.6s, time.idle: 55.5ms
2025-07-31T21:45:24.309338Z  INFO miden-store: crates/store/src/server/mod.rs:133: Database loaded
2025-07-31T21:45:24.310547Z  INFO store.rpc.rpc/SyncState: miden_node_utils::tracing::grpc: crates/utils/src/tracing/grpc.rs:90: close, time.busy: 387µs, time.idle: 559µs rpc.service: "store.rpc.rpc", rpc.method: "SyncState"
Average request latency: 42.624749833s

The request latency matches the time it takes to load the nullifier tree.

I tried adding a request to the store status endpoint before starting to sync the state (and not counting the time it takes), and these are the results for 10k iterations:

Average request latency: 323.768µs
P50 request latency: 297.042µs
P95 request latency: 414.209µs
P99 request latency: 1.1775ms
P99.9 request latency: 2.198292ms
Average notes per response: 1.3128

Contributor


Ah - interesting! I would have thought that the request would have timed out. Generally though, we should only start the benchmarks once the store has been fully loaded. Could we use the Status endpoint for this? That is, make a request to the status endpoint and only start the benchmark once we receive a response.
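The readiness gate described above can be sketched as a simple poll-until-ready loop. Here `is_ready` stands in for the real Status RPC (the closure, names, and timings are all illustrative assumptions, not the stress-test's code):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll a readiness check until it succeeds or the deadline passes.
/// In the real benchmark, `is_ready` would issue a Status request and
/// return true once the store responds.
fn wait_until_ready<F: FnMut() -> bool>(
    mut is_ready: F,
    timeout: Duration,
    interval: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if is_ready() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}

fn main() {
    // Simulate a store that becomes ready on the third probe.
    let mut probes = 0;
    let ready = wait_until_ready(
        || {
            probes += 1;
            probes >= 3
        },
        Duration::from_secs(1),
        Duration::from_millis(1),
    );
    println!("ready={ready} probes={probes}");
}
```

Only once this returns true would the benchmark's latency clock start, so the one-off nullifier-tree load no longer pollutes the averages.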

Comment thread bin/stress-test/README.md Outdated
Comment thread bin/stress-test/src/seeding/metrics.rs
Comment thread bin/stress-test/src/store/mod.rs
Comment thread bin/stress-test/src/store/mod.rs Outdated
Contributor

@bobbinth bobbinth left a comment


Looks good! Thank you. I left one more question inline.

Also, I think we should still track down what's happening in #1113 (comment) - but this could be done in a follow-up PR.

Comment thread bin/stress-test/README.md
Contributor

@bobbinth bobbinth left a comment


All looks good! Thank you! Everything else we can do in follow-ups - i.e., in #1111 and #1124.

@bobbinth bobbinth merged commit 0043aee into next Aug 2, 2025
8 checks passed
@bobbinth bobbinth deleted the santiagopittella-fix-stress-test branch August 2, 2025 00:47


Development

Successfully merging this pull request may close these issues.

Stress-test is broken

5 participants