Purpose:
We have seen some of our Spark workloads where the Spark jobs themselves complete within the normal duration, but the OpenLineage Spark listener takes hours to finish processing all Spark events - which keeps the infrastructure up, resulting in SLA impacts as well as additional infrastructure cost.
While we have seen different root causes for the listener taking a long time to finish processing events, we'd like a way to reduce the blast radius. That way, even if OpenLineage does end up in a long-running scenario, we can configure a hard timeout limit - so that we can be sure the jobs don't go out of SLA, even if lineage doesn't get captured in such cases.
Proposed implementation
We can add support for a new Spark conf that users can configure, e.g. spark.openlineage.listener.timeout.seconds=120 (say, 2 minutes).
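For illustration, usage might look like the following spark-submit invocation. Note that spark.openlineage.listener.timeout.seconds is the conf key proposed here, not an existing OpenLineage setting, and the job file name is a placeholder:

```shell
# Hypothetical sketch of the proposed conf (not yet implemented in OpenLineage)
spark-submit \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.listener.timeout.seconds=120 \
  my_job.py
```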
Internally, if a timeout is configured by the user, we can submit event handling to an internal Executor and call Future.get() with a timeout (or use Guava's SimpleTimeLimiter) to ensure every Spark event handled by the OL Spark listener has a maximum processing time rather than running for hours.
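A minimal sketch of the Executor + Future.get() approach, assuming the listener would wrap each event handler in a bounded wait; the class and method names here are illustrative, not the actual OpenLineage internals:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ListenerTimeoutSketch {
    // Single worker thread so events are still processed in order.
    static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor();

    /**
     * Runs an event handler with a hard timeout. Returns true if the handler
     * finished in time, false if it was cut off (lineage dropped for that event).
     */
    static boolean processWithTimeout(Runnable eventHandler, long timeoutSeconds) {
        Future<?> future = EXECUTOR.submit(eventHandler);
        try {
            future.get(timeoutSeconds, TimeUnit.SECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow handler so it can't run for hours
            return false;
        } catch (InterruptedException | ExecutionException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // A fast handler completes; a slow handler is cut off at the limit.
        boolean fast = processWithTimeout(() -> { }, 2);
        boolean slow = processWithTimeout(() -> {
            try {
                Thread.sleep(5_000);
            } catch (InterruptedException ignored) {
                // interrupted by future.cancel(true) on timeout
            }
        }, 1);
        System.out.println(fast + " " + slow); // true false
        EXECUTOR.shutdownNow();
    }
}
```

Guava's SimpleTimeLimiter offers the same behavior behind a cleaner API (callWithTimeout), at the cost of pulling in the Guava dependency; either way the key decision is that on timeout the event is abandoned so the job can terminate.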
Relevant slack thread: https://openlineage.slack.com/archives/C01CK9T7HKR/p1705161285825369