Event loop lag by slot second #6929

nflaig · 2024-07-02T12:53:54Z

I've been collecting some data to investigate delay of REST API responses in addition to data we get from metrics like #6691. This is from a Holesky beacon node running in a DVT setup with ~250 connected validators.

The data was collected by logging an event whenever event loop lag exceeded 1 second, using this branch: unstable...nflaig/event-loop-delay. All data points were collected on the main thread, so event loop lag in the network thread is not considered. That lag might still cause delays on APIs that interact with the network, like getting the peer count or submitting attestations / blocks.

Using the data from event-loop-lag-detected.log, I created the following diagrams.
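For illustration, a timer-drift detector along these lines could look as follows. This is a minimal sketch and not the code from the branch above: `slotSecond`, `lagMs`, `startLagMonitor`, the 100 ms check interval, and the log line format are all assumptions, and the genesis time has to be supplied by the caller.

```typescript
// Hypothetical sketch of a main-thread event loop lag logger.
// All names here are illustrative, not taken from the Lodestar code base.
const SECONDS_PER_SLOT = 12;

/** Second within the slot (0..11) for a unix timestamp in milliseconds. */
function slotSecond(nowMs: number, genesisTimeSec: number): number {
  const sinceGenesisSec = Math.floor(nowMs / 1000) - genesisTimeSec;
  // Double modulo keeps the result in 0..11 even before genesis.
  return ((sinceGenesisSec % SECONDS_PER_SLOT) + SECONDS_PER_SLOT) % SECONDS_PER_SLOT;
}

/** How late a timer fired relative to when it was expected to fire. */
function lagMs(expectedMs: number, actualMs: number): number {
  return Math.max(0, actualMs - expectedMs);
}

/**
 * Schedules a short repeating timer and logs whenever it fires more than
 * `thresholdMs` late, which means the event loop was blocked for that long.
 * Returns a function that stops the monitor.
 */
function startLagMonitor(
  genesisTimeSec: number,
  thresholdMs = 1000,
  intervalMs = 100,
  log: (line: string) => void = console.log
): () => void {
  let expectedMs = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const nowMs = Date.now();
    const lag = lagMs(expectedMs, nowMs);
    if (lag > thresholdMs) {
      // Attribute the lag to the slot second in which the timer should have fired.
      log(`event loop lag detected slotSecond=${slotSecond(expectedMs, genesisTimeSec)} lagMs=${lag}`);
    }
    expectedMs = nowMs + intervalMs;
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Calling `startLagMonitor(genesisTimeSec)` once on the main thread would be enough to produce a log of this kind. A separate instance inside the network thread would be needed to cover the gap mentioned above.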

Event Loop Lag: Slot Seconds vs. Delay

This clearly shows the expected lag during second 8 of the slot due to state / epoch transitions. Other seconds of the slot are mostly unaffected by event loop lag, which should therefore have only a marginal effect on API latency (see % distribution below).

Percentage of Event Loop Lags per Slot Second

Lags > 1 second also occur mostly in second 8 of the slot.

Percentage of Slots with Event Loop Lag > 1 second

Looking at the percentage of slots over the last few days, the number of slots with an event loop lag is relatively low, especially for slot seconds other than 8.
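The two percentages in these diagrams can be derived from the logged events roughly as sketched below. The `LagEvent` shape is an assumption about what a parsed log line contains (the actual format of event-loop-lag-detected.log is not shown here), and both function names are hypothetical.

```typescript
// Hypothetical record parsed from one log line; the field names are assumptions.
interface LagEvent {
  slot: number;
  slotSecond: number; // 0..11
  lagMs: number;
}

/** Share (in %) of all lag events that fall into each slot second. */
function lagShareBySlotSecond(events: LagEvent[], secondsPerSlot = 12): number[] {
  const counts: number[] = [];
  for (let i = 0; i < secondsPerSlot; i++) counts.push(0);
  let total = 0;
  for (const e of events) {
    // Ignore malformed records instead of corrupting the buckets.
    if (e.slotSecond < 0 || e.slotSecond >= secondsPerSlot) continue;
    counts[e.slotSecond] += 1;
    total += 1;
  }
  return counts.map((n) => (total === 0 ? 0 : (100 * n) / total));
}

/** Percentage of observed slots that had at least one lag event. */
function percentSlotsWithLag(events: LagEvent[], totalSlots: number): number {
  const seen: Record<number, boolean> = {};
  let distinct = 0;
  for (const e of events) {
    if (!seen[e.slot]) {
      seen[e.slot] = true;
      distinct += 1;
    }
  }
  return totalSlots === 0 ? 0 : (100 * distinct) / totalSlots;
}
```

`totalSlots` is the number of slots in the observation window, which has to come from the window's start and end time since slots without lag leave no log entry.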

Conclusion

Based on this data, it seems unlikely that event loop lag has a significant impact on API latency. During slot seconds 8-9, the validator client does not send any requests, and the main tasks on the beacon node side are state and epoch transitions. Tasks like polling validator indices and getting duties happen at the beginning of the first slot of the epoch, where event loop lag is relatively low and should not cause request timeouts, even for really short timeouts like 2 seconds.

Next steps

It would be great if we could visualize similar data points in our metrics. One approach could be to look at event loop utilization (ELU) for certain slot seconds. This would also give us more data to look at when we improve state / epoch transition or block processing, as those improvements should reduce the ELU during those slot seconds, see #6820 (comment).
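As a rough sketch of the ELU idea, Node's `performance.eventLoopUtilization()` from `perf_hooks` can be sampled once per second and bucketed by slot second. Everything else below is an assumption for illustration: the bucket structure, the function names, and the choice to attribute each sample to the slot second in which its interval started. In practice the averages would be exported as a metric rather than kept in memory.

```typescript
import {performance} from "node:perf_hooks";

// Hypothetical in-memory buckets, one per slot second.
interface EluBuckets {
  sum: number[];
  count: number[];
}

function createBuckets(secondsPerSlot = 12): EluBuckets {
  const sum: number[] = [];
  const count: number[] = [];
  for (let i = 0; i < secondsPerSlot; i++) {
    sum.push(0);
    count.push(0);
  }
  return {sum, count};
}

function recordElu(b: EluBuckets, slotSecond: number, utilization: number): void {
  b.sum[slotSecond] += utilization;
  b.count[slotSecond] += 1;
}

/** Mean ELU (0..1) per slot second; 0 for seconds without samples. */
function meanElu(b: EluBuckets): number[] {
  return b.sum.map((s, i) => (b.count[i] === 0 ? 0 : s / b.count[i]));
}

/** Samples ELU roughly once per second. Returns a function that stops sampling. */
function startEluSampler(genesisTimeSec: number, buckets: EluBuckets): () => void {
  const secondsPerSlot = buckets.sum.length;
  let last = performance.eventLoopUtilization();
  let lastMs = Date.now();
  const timer = setInterval(() => {
    const current = performance.eventLoopUtilization();
    // With two readings, the call returns the utilization between them
    // (the later reading goes first).
    const delta = performance.eventLoopUtilization(current, last);
    const sinceGenesisSec = Math.floor(lastMs / 1000) - genesisTimeSec;
    const second = ((sinceGenesisSec % secondsPerSlot) + secondsPerSlot) % secondsPerSlot;
    // If the loop was blocked, the interval spans more than one slot second,
    // so this attribution is only approximate.
    recordElu(buckets, second, delta.utilization);
    last = current;
    lastMs = Date.now();
  }, 1000);
  return () => clearInterval(timer);
}
```

Unlike the > 1 second lag log, this gives a continuous value for every slot second, so smaller improvements to state / epoch transition would be visible even when they never crossed the lag threshold.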

nflaig · 2024-08-09T13:05:41Z

New data from the latest release (v1.21.0) looks quite a bit better 🎉

Compared to the previous data, event loop lag in the range of 3-4 seconds is less frequent.

The next one is interesting: while we improved the lag in second 8 of the slot, it looks like we also have far fewer lags > 1 second in other slots of the epoch.

We have ~2% fewer slots with event loop lag > 1 second. The percentage at slot second 8 went up, which is kinda strange, but the overall lag duration is lower, as shown above.

nflaig added the scope-performance label Jul 2, 2024

nflaig mentioned this issue Aug 3, 2024

Doppelganger getLiveness errors on holesky #6114

Closed

nflaig mentioned this issue Oct 10, 2024

Event loop utilization (ELU) metrics #7147

Open
