Purpose:
We have seen some of our Spark workloads where the Spark jobs themselves complete within the normal duration, but the OpenLineage Spark listener takes hours to finish processing all Spark events - which keeps the infrastructure up, resulting in SLA impacts as well as additional infrastructure cost.
While we have seen different root causes for the listener taking a long time to finish processing events, we'd like a way to reduce the blast radius. That way, even if OpenLineage does end up in a long-running scenario, we can configure a hard timeout limit - so that we can be sure the jobs don't go out of SLA, even if lineage doesn't get captured in such cases.
Proposed implementation
We can add support for a new Spark conf that users can configure, e.g. spark.openlineage.listener.timeout.seconds=120 (say, 2 minutes).
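For illustration, usage might look like the following spark-submit invocation. Note that spark.openlineage.listener.timeout.seconds is the conf key proposed here, not an existing OpenLineage setting, and the job file name is a placeholder:

```shell
# Hypothetical sketch of the proposed conf (not yet implemented in OpenLineage)
spark-submit \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.listener.timeout.seconds=120 \
  my_job.py
```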
Internally, if a timeout is configured by the user, we can submit event handling to an internal Executor and call Future.get() with a timeout (or use Guava's SimpleTimeLimiter) to ensure every Spark event handled by the OL Spark listener has a maximum processing time rather than running for hours.
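A minimal sketch of the Executor + Future.get() approach, assuming the listener would wrap each event handler in a bounded wait; the class and method names here are illustrative, not the actual OpenLineage internals:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ListenerTimeoutSketch {
    // Single worker thread so events are still processed in order.
    static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor();

    /**
     * Runs an event handler with a hard timeout. Returns true if the handler
     * finished in time, false if it was cut off (lineage dropped for that event).
     */
    static boolean processWithTimeout(Runnable eventHandler, long timeoutSeconds) {
        Future<?> future = EXECUTOR.submit(eventHandler);
        try {
            future.get(timeoutSeconds, TimeUnit.SECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow handler so it can't run for hours
            return false;
        } catch (InterruptedException | ExecutionException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // A fast handler completes; a slow handler is cut off at the limit.
        boolean fast = processWithTimeout(() -> { }, 2);
        boolean slow = processWithTimeout(() -> {
            try {
                Thread.sleep(5_000);
            } catch (InterruptedException ignored) {
                // interrupted by future.cancel(true) on timeout
            }
        }, 1);
        System.out.println(fast + " " + slow); // true false
        EXECUTOR.shutdownNow();
    }
}
```

Guava's SimpleTimeLimiter offers the same behavior behind a cleaner API (callWithTimeout), at the cost of pulling in the Guava dependency; either way the key decision is that on timeout the event is abandoned so the job can terminate.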
Relevant slack thread: https://openlineage.slack.com/archives/C01CK9T7HKR/p1705161285825369