
Add get_delegators_for_validator #3219

Merged
5 commits merged into mainnet-staging from get_delegators_for_validator on May 9, 2024

Conversation

@vicsn (Contributor) commented Apr 16, 2024

Motivation

Adds an endpoint to get all delegators for a particular validator.

Test Plan

See the sister PR: https://github.com/AleoHQ/snarkVM/pull/2434

@vicsn requested a review from raychu86 on April 16, 2024 at 14:01
@vicsn marked this pull request as ready for review on April 17, 2024 at 16:26
@iamalwaysuncomfortable (Contributor) left a comment

The time it takes for this endpoint to turn the response into JSON should be benchmarked here, as it is potentially more costly than the actual RocksDB access.
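
For reference, a rough way to isolate the serialization cost is sketched below; the map shape and sizes are placeholders, not the endpoint's actual response type, and only the `serde_json` crate is assumed.

```rust
use std::collections::HashMap;
use std::time::Instant;

fn main() {
    // Placeholder data roughly shaped like a large delegator -> microcredits map.
    let delegators: HashMap<String, u64> =
        (0..17_000).map(|i| (format!("aleo1delegator{i}"), 10_000_000u64)).collect();

    // Time only the JSON serialization, separately from any storage access.
    let start = Instant::now();
    let json = serde_json::to_string(&delegators).expect("serialization failed");
    println!("serialized {} bytes in {:?}", json.len(), start.elapsed());
}
```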

@iamalwaysuncomfortable (Contributor) commented Apr 23, 2024

This is awaiting a benchmark. We will perform this soon.

@howardwu (Contributor) commented:

What's the status on this PR? I see failing CIs across the board and no benchmarks.

@iamalwaysuncomfortable (Contributor) left a comment

A 6-hour network test was run against this PR.

Network Configuration

10 Validators: m5.8xlarge (32 vCPU, 128 GB RAM)
30 Clients: m5.4xlarge (16 vCPU, 64 GB RAM)
17,000 Delegators bonded to a single validator

Tests

The following tests were run in serial; a minimal request-cannon sketch follows the test list.

Test 1: 1 request cannon sent 100 requests at 1 request per second at a single client
Test 2: 30 request cannons attempting to send 100 blocking HTTP requests at 1 request per second against a single client
Test 3: 50 request cannons attempting to send 100 blocking HTTP requests at 1 request per second at the load balancer distributing traffic among all client nodes equally.
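
For context, a minimal request-cannon sketch along these lines is shown below; it assumes the `tokio` and `reqwest` crates, and the target URL and endpoint path are placeholders rather than the actual snarkOS route.

```rust
use std::time::{Duration, Instant};

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Placeholder target; point this at a client node's REST endpoint.
    let url = "http://127.0.0.1:3030/mainnet/delegators/aleo1examplevalidator";
    let client = reqwest::Client::new();
    let mut latencies = Vec::new();

    // Send 100 requests at roughly 1 request per second, waiting for each response.
    let mut ticker = tokio::time::interval(Duration::from_secs(1));
    for _ in 0..100 {
        ticker.tick().await;
        let start = Instant::now();
        let body = client.get(url).send().await?.bytes().await?;
        let elapsed = start.elapsed();
        println!("{} bytes in {:?}", body.len(), elapsed);
        latencies.push(elapsed);
    }

    let mean = latencies.iter().sum::<Duration>() / latencies.len() as u32;
    println!("mean latency: {:?}", mean);
    Ok(())
}
```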

Test 1 Results:

REQUEST STATISTICS:
  requests: 100
  request-cannons: 1
  avg response payload size: 1.4 mb

TIME:
  min: 4228 ms
  max: 6597 ms
  mean: 5460 ms
  std dev: 1234 ms
Number of Errors: 0
Number of Requests: 11

NOTES: 
This test caused the node to fall slightly behind its peers in sync rate during the test.

Test 2 Results

REQUEST STATISTICS:
  requests: 3000 (via 30 cannons)
  request-cannons: 30
  avg response payload size: 1.4 mb

TIME:
  request roundtrip min: 44483 ms
  request roundtrip max: 109171 ms
  request roundtrip mean: 52374 ms
  request roundtrip std dev: 19126 ms
Number of Requests: 100
Number of Errors: 0

NOTES: 
This test caused the node to stop syncing during the requests.

It also caused response times on calls to any other endpoint to extend into the range of `50000 ms`.

After the requests stopped the node continued to sync.

Test 3 Results

REQUEST STATISTICS:
  requests: 5000
  request cannons: 50
  avg response payload size: 1.4 mb

TIME:
  min: 4109 ms
  max: 25658 ms
  mean: 13243 ms
  std dev: 3712 ms

NOTES: This test caused the block sync speed of several client nodes to fall during the test. Block sync speed recovered after the test concluded. Sustained heavy load on this endpoint would likely degrade client performance significantly.

Conclusion

When there are a lot of delegators attached to a validator, this endpoint can make nodes fall out of sync and block other tasks. This isn't entirely surprising, given that querying RocksDB for this information is a compute-heavy task and the task is spawned onto the main tokio task queue.

This endpoint needs to be throttled heavily.

Recommendations

  1. Spawn this task onto the blocking threadpool via `spawn_blocking` (Addressed by commit 1c74880); see the sketch after this list.
  2. This endpoint should return an error if the block height is too far behind, since sending outdated committee data isn't a good idea (Addressed by commit 1c74880).
  3. Limit this endpoint to serving a maximum of 5 concurrent invocations of this task (not per IP, but 5 invocations total).
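
For illustration, here is a minimal sketch of how recommendations 1 and 3 might be wired into an axum handler (snarkOS's REST layer is axum-based); the state type, handler signature, and return type are simplified placeholders rather than the actual snarkOS code, and recommendation 2 corresponds to the `SYNC_LENIENCY` check visible in the review excerpt further below.

```rust
use std::sync::Arc;

use axum::{extract::State, http::StatusCode, Json};
use tokio::sync::Semaphore;

/// Simplified application state; the real snarkOS `Rest` state differs.
#[derive(Clone)]
struct AppState {
    /// Caps the endpoint at 5 concurrent invocations in total (recommendation 3).
    delegator_permits: Arc<Semaphore>,
}

async fn get_delegators_for_validator(
    State(state): State<AppState>,
) -> Result<Json<Vec<String>>, (StatusCode, String)> {
    // Reject the request outright if all permits are in use, rather than queueing it.
    let _permit = state
        .delegator_permits
        .try_acquire()
        .map_err(|_| (StatusCode::TOO_MANY_REQUESTS, "too many concurrent requests".to_string()))?;

    // Recommendation 1: run the heavy storage scan on the blocking threadpool so it
    // cannot stall the async runtime that drives syncing and networking.
    let delegators = tokio::task::spawn_blocking(|| {
        // Placeholder for the expensive walk over all bonded delegators in RocksDB.
        vec!["aleo1exampledelegator".to_string()]
    })
    .await
    .map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, e.to_string()))?;

    Ok(Json(delegators))
}
```

The semaphore would be created once, e.g. `Arc::new(Semaphore::new(5))`, and stored in the shared router state so that all requests count against the same limit.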

@vicsn (Contributor, Author) commented May 4, 2024

Thanks for pushing on this. Looks like a valid DoS vector if someone exposes their endpoints.

I think it's likely other REST endpoints (e.g. GET /<network>/block/{height}/transactions and GET /<network>/blocks?) would cause similar performance issues. If you agree, it might be good to tackle defenses which apply across multiple REST endpoints in a separate PR so it's easier to review.

Or perhaps our recommendation should be for clients with a public endpoint to have a very low general rate limit. Infura doesn't let you query geth directly either.

@iamalwaysuncomfortable (Contributor) left a comment

Another test was run on the latest changes.

It seems like these changes take care of the previous issues by not blocking the main thread and by denying traffic if the node falls too far behind in sync.

As is, the rate-limiting defaults should prevent DoS-ing.

Details of the tests are below.

Test 1

REQUEST STATISTICS:
  requests: 100
  request-cannons: 1
  avg response payload size: 1.6 mb


TIME:
  min: 4462 ms
  max: 5760 ms
  mean: 5007 ms
  std dev: 448 ms

NOTES: No degradation in block_sync time was seen

Test 2

REQUEST STATISTICS:
  requests: 3000
  request-cannons: 30
  avg response payload size: 1.6 mb

TIME:
  min: 6389 ms
  max: 30107 ms
  mean: 17809.54 ms
  std dev: 3377 ms


NOTES: The client under attack experienced lag in block sync rate. However, after experiencing the lag in sync rate, it denied further traffic and was then able to continue syncing. This lag-and-catch-up cycle continued for as long as the load did. After the load relented, the node was able to catch back up.

Test 3

REQUEST STATISTICS:
  requests: 5000
  request-cannons: 30
  avg response payload size: 1.6 mb

TIME:
  min: 12481 ms
  max: 50167 ms
  mean: 34967 ms
  std dev: 3249 ms

NOTES: Clients stayed at tip during the entire test

    Path(validator): Path<Address<N>>,
) -> Result<ErasedJson, RestError> {
    // Do not process the request if the node is too far behind to avoid sending outdated data.
    if rest.routing.num_blocks_behind() > SYNC_LENIENCY {
@vicsn (Contributor, Author):

Nit: can you update the doc comment on SYNC_LENIENCY?

Contributor:

What would you like it to say?

@vicsn (Contributor, Author) commented May 7, 2024

Don't have a specific preference, but for example:

/// The maximum number of blocks the client can be behind its latest peer before it skips
/// processing incoming transactions, solutions and GET delegator requests.

Or:

/// The maximum number of blocks the client can be behind its latest peer before it skips
/// processing incoming transactions, solutions and heavy REST requests.
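
Either way, the resulting declaration might look like the following; the numeric value is a placeholder for illustration, not the value snarkOS actually uses.

```rust
/// The maximum number of blocks the client can be behind its latest peer before it skips
/// processing incoming transactions, solutions and heavy REST requests.
const SYNC_LENIENCY: u32 = 10; // placeholder value for illustration
```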

@howardwu merged commit d6a441b into mainnet-staging on May 9, 2024
24 checks passed
@howardwu deleted the get_delegators_for_validator branch on May 9, 2024 at 00:04