[AUD-223] Set a TTL for latest block in redis #1269
Conversation
Does this now need a fallback to web3?
response_dictionary['latest_chain_block'] = (int(latest_chain_block) if latest_chain_block else None)
If there is no fallback, we will end up with None as latest_chain_block in the case where the redis value has expired. Is that acceptable?
We should change this too, good find. I can make a wrapper around this logic, like attempt to get from redis or fall back to web3, so it can be consumed by the health check and the API helpers.
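A minimal sketch of the wrapper described above: try the redis cache first and fall back to a live web3 call when the key has expired. The function name and the injected callables are illustrative, not the actual helper added in this PR.

```python
# Hypothetical cache-or-chain wrapper; dependencies are injected as
# callables so the sketch stays self-contained.
def get_latest_chain_block(redis_get, chain_get_block_num):
    """redis_get() -> cached value (bytes/int) or None;
    chain_get_block_num() -> int from a live web3 call."""
    cached = redis_get()
    if cached is not None:
        return int(cached)
    # Cache miss (TTL expired or never set): read from chain so callers
    # never see None as latest_chain_block.
    return chain_get_block_num()
```

Both the health check and the API helpers could then call this one function instead of duplicating the fallback logic.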
looks good to me aside from hareesh's comment, good find on this. will leave approval to hareesh
@@ -126,6 +128,12 @@ def get_health(args, use_redis_cache=True):
latest_block_num = latest_block.number
latest_block_hash = latest_block.hash.hex()

if use_redis_cache:
there is already an if use_redis_cache:
above on line 115, why not put these at the bottom of that block?
# if we had attempted to use redis cache and the values weren't there, set the values now
# ex sets expiration time and nx only sets if key doesn't exist in redis
redis.set(latest_block_redis_key, latest_block_num, ex=default_indexing_interval_seconds, nx=True)
redis.set(latest_block_hash_redis_key, latest_block_hash, ex=default_indexing_interval_seconds, nx=True)
is it worth wrapping these in a try-except? wouldn't want to fail the health check due to a redis.set call failure
up to you
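A sketch of the try-except guard suggested above, assuming the goal is that a failed cache write should never fail the health check itself. `safe_set` and the injected setter are hypothetical names, not code from this PR.

```python
# Wrap a redis.set-style call so cache-write failures are swallowed
# instead of propagating into the health check response.
def safe_set(redis_set, *args, **kwargs):
    try:
        return redis_set(*args, **kwargs)
    except Exception:
        # The health check already holds the values it read from chain,
        # so losing this cache write is harmless; just skip it.
        return None
```

The same guard could wrap both the block-number and block-hash writes.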
@SidSethi brought this up, but we should probably also set this value in monitors.py since having it cached is not actually relevant to indexing. We could literally set 'latestChainBlock' there as well with a separate redis key. If you recall, we initially added this caching to avoid web3 calls in the health check, so another redundancy layer might be beneficial.
…o dm-indexing-latest-block-ttl
so there's no monitor to track latest chain block or block diff. should i add that to the monitors and plumb it everywhere? the goal for this PR was to fix the cached web3 calls, not to benefit indexing in any way.
if use_redis_cache:
# get latest blockchain state from redis cache, or fallback to chain if None |
I'm a little worried about this change causing the health check to be significantly slower some of the time (since we're manually expiring the key). Clients heavily depend on the health check being as fast as possible for selection.
Maybe we should have sep. keys for health vs. indexing?
this is actually never used in indexing. the reason we added these keys was only for health check
as for health check latency, it's possible that an uncached value could cause the time to spike, but the tradeoff to weigh here is latency vs correctness. one tradeoff i'm willing to make is to increase the TTL to 2x+1 of the indexing window, so 11 seconds. that way we should be no more than 3 blocks apart from chain in terms of accuracy, and even if indexing hasn't fired off and set the new value in redis, we'll still get an accurate block diff.
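Working out the "2x+1" TTL proposed above, under the assumed values of a 5-second indexing interval and a roughly 5-second block time:

```python
# Assumed values from the discussion: 5s indexing interval, ~5s block time.
indexing_interval_sec = 5
block_time_sec = 5

# "2x+1 of the indexing window": two full indexing cycles plus a 1s buffer.
ttl_sec = 2 * indexing_interval_sec + 1   # 11 seconds

# Worst case, the cached block number is ttl_sec stale, which at the
# assumed block rate bounds the error at roughly:
max_blocks_behind = ttl_sec // block_time_sec + 1  # ~3 blocks from chain tip
```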
also i'm down to benchmark on a prod node to see what the latency effects are
gotcha. i kinda see more value in the monitoring method that was suggested: having this done independently of the health check, behind the scenes, makes the most sense to me. we could still make the calculation happen every 5s or something. i'm ok to ship this now if you think the benefit from the correctness outweighs the potential incurred cost.
i think the benefits here are higher level. accurate health checks are nice, but the system-wide result is what i was going for. certain discovery nodes have a tendency to oscillate from 0 to 20 blocks behind and back to 0. effectively this represents about 90 seconds of falling behind and then catching up, which i think may also contribute to content nodes timing out in the middlewares. however, if the true block diff is exposed when we make requests through libs (or cnode via libs) to the discovery node, _makeRequest in the libs discovery service will be much quicker and more accurate in triggering discovery reselection and moving to another node.
yeah, while i do think it would be better totally async from the health check, this makes a lot of sense. if your hunch that this will help the middleware situation is correct, we should def ship this asap
redis.set(latest_block_redis_key, latest_block_from_chain.number)
redis.set(latest_block_hash_redis_key, latest_block_from_chain.hash.hex())
# these keys have a TTL which is the indexing interval
redis.set(latest_block_redis_key, latest_block_from_chain.number, ex=default_indexing_interval_seconds)
is there a possible race cond. between the celery task actually running at default_indexing_interval_seconds and this getting expired at default_indexing_interval_seconds?
since this is not used anywhere for indexing-related tasks i'm not super worried. indexing actually does a web3 get-latest call before each task anyway. plus this is a plain set while the health check write is setnx, so an already-set value is preferred and the health check won't override it.
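To make the interaction above concrete, here is a toy pure-Python model of redis SET with `ex`/`nx` semantics (a stand-in for the redis-py client, not real code from this PR), showing why the indexer's plain `set` always wins while the health check's `nx` write never clobbers an existing value:

```python
import time

# Toy in-memory model of redis SET with ex (TTL) and nx (set-if-absent).
class ToyRedis:
    def __init__(self):
        self._store = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ex=None, nx=False):
        now = time.monotonic()
        entry = self._store.get(key)
        alive = entry is not None and (entry[1] is None or entry[1] > now)
        if nx and alive:
            return None  # like redis: nx set is a no-op when the key exists
        expires_at = now + ex if ex is not None else None
        self._store[key] = (value, expires_at)
        return True

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and expires_at <= time.monotonic():
            del self._store[key]  # lazy expiry, like redis
            return None
        return value
```

The indexer's plain `set` replaces whatever is there; the health check's `nx` write only fills a gap left by an expired key.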
i was going to suggest we make this last exactly 1s longer or something funky like that, but based on the above i don't think it's a big deal
LGTM, I think if we want to add this to monitors we'd have to increase that frequency quite a bit from 60s so this is a better option
* Set a TTL for latest block in redis
* Move this variable definition
* Rename env var to include units with _sec
* Refactor this logic into its own function
* lint
* lint
* Cleanup
Description
This PR sets a TTL for the latest block number and hash redis objects. More details in linear, but what was happening is if our indexing task takes a long time to finish, the latest block number and hash objects in redis become outdated since no other indexing tasks come up and update them. For example, if an indexing task takes 90 seconds to complete, that's 90/5=18 blocks that the discovery node would be behind at a 5 second/block rate. However, for those 90 seconds, the block diff in health check would show 0 because no other indexing task came up and wrote updated values to the cache.
This PR expires the cache value at the indexing interval of 5 seconds, and if the health check doesn't see those redis keys, it sets them (only if they don't already exist) with the same TTL. If a new indexing task starts up, it will override the values each time.
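The flow described above can be sketched as follows. Key name, function name, and the injected dependencies are illustrative; the real code lives in `get_health`.

```python
# Hypothetical key name and interval matching the PR description.
LATEST_BLOCK_KEY = "latest_block"
DEFAULT_INDEXING_INTERVAL_SEC = 5

def health_check_latest_block(redis, get_block_num_from_chain):
    """Return the latest block number, preferring the redis cache."""
    cached = redis.get(LATEST_BLOCK_KEY)
    if cached is not None:
        return int(cached)
    # Cache miss: the TTL expired because indexing fell behind. Read the
    # chain directly, then repopulate with nx=True so a concurrently
    # running indexing task's fresher plain set is never clobbered.
    block_num = get_block_num_from_chain()
    redis.set(LATEST_BLOCK_KEY, block_num,
              ex=DEFAULT_INDEXING_INTERVAL_SEC, nx=True)
    return block_num
```

Indexing keeps writing with a plain `set` (plus the TTL), so its values always replace the health check's fallback write.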
Tests
I slowed indexing locally to 60 seconds and hit health check to set the keys in redis. Notice the "HEALTH CHECK DEBUG" messages for cache misses and observe the health_check hits in between with no logs.
❗ Reminder 💡❗:
If this PR touches a critical flow (such as Indexing, Uploads, Gateway or the Filesystem), make sure to add the
requires-special-attention
label. Add relevant labels as necessary.