Spline agent affecting databricks driver performance #747
I don't know who is doing the retry, but the Agent does not. It just initializes, tries to connect to the endpoint, and fails; that's it. Something else is then running the whole job again, I guess? You can disable the connection check at the HTTP dispatcher initialization, but if the endpoint is not available when the lineage is supposed to be sent, it will still fail at that point. What version of Databricks and Spark does this run on?
In production, you definitely want to decouple your main Spark jobs from any secondary dependencies. We recommend using a resilient messaging system for this purpose. Spline Agent comes with the embedded …
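For illustration, a minimal sketch of routing lineage through Kafka instead of calling the HTTP endpoint directly, assuming the Kafka dispatcher and the `spline.lineageDispatcher.kafka.*` property names as documented for the agent (the broker address and topic name below are placeholders; verify the keys against your agent version):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: route lineage through Kafka so the Spark job is decoupled
// from the availability of the lineage-consuming HTTP endpoint.
val spark = SparkSession.builder()
  .appName("lineage-via-kafka")
  .config("spark.spline.lineageDispatcher", "kafka")
  .config("spark.spline.lineageDispatcher.kafka.topic", "spline-lineage") // placeholder topic
  .config("spark.spline.lineageDispatcher.kafka.producer.bootstrap.servers", "broker-1:9092") // placeholder broker
  .getOrCreate()
```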
To temporarily disable the Spline Agent you can simply set the property …
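A minimal sketch of what that could look like, assuming the property in question is `spline.mode` set via the Spark conf with the `spark.` prefix; double-check the exact name and values for your agent version:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: temporarily switch the agent off for this session.
val spark = SparkSession.builder()
  .config("spark.spline.mode", "DISABLED") // assumed property; "ENABLED" turns it back on
  .getOrCreate()
```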
Hi @cerveada , @wajda, we are using different DBR versions such as 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12) and 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12), and we use the Spline agent version that matches the Spark version. We also see that, even in a normal scenario (no load on the Azure Function), when the cluster starts, Spline initialization happens twice. Please see the attached logs from the time of cluster start: the "Spark Lineage tracking is ENABLED" message appears twice, once at "23/10/05 07:48:21" and again at "23/10/05 07:48:37". Any idea why it is trying to enable itself two times?
Could you try to use programmatic initialization instead of codeless? According to this guide, there were issues with codeless init: …
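For reference, programmatic initialization along the lines of the agent README looks roughly like this (a sketch; the import path and method are as documented for recent agent versions, so confirm against the version you deploy):

```scala
import org.apache.spark.sql.SparkSession
import za.co.absa.spline.harvester.SparkLineageInitializer._

val spark = SparkSession.builder().getOrCreate()

// Register the Spline listener explicitly on this session instead of relying
// on the codeless (spark.sql.queryExecutionListeners) mechanism.
spark.enableLineageTracking()
```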
The init type is codeless; it's visible from the logs. Also, from what I can see, there must have been two independent Spark sessions, or even contexts, being created. I don't know why this is happening, but it has nothing to do with Spline. The Spline agent is just a Spark listener registered via the Spark public API, that's it. The Spline agent listener doesn't contain any shared state, so if for some reason the Spark driver decides to create two instances of the same listener, there should be no impact (though we didn't test this scenario, as normally it doesn't happen and listeners are shared between sessions). In other words, I don't know why the agent is double-initialized in your setup, but it hardly creates further issues by itself; you should get lineage normally. Try to switch the dispatcher from …
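As an illustration of switching the dispatcher for troubleshooting, the sketch below sends lineage to the driver log via the console dispatcher instead of HTTP ("console" is one of the dispatcher names listed in the agent docs; verify the value for your agent version):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: log lineage to the console instead of posting it over HTTP,
// which takes the HTTP endpoint out of the picture while debugging.
val spark = SparkSession.builder()
  .config("spark.spline.lineageDispatcher", "console")
  .getOrCreate()
```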
Sorry to bother you with this again, but receiving lineage is not an issue; even with the Azure Function we are receiving lineage fine. Our only concerns were Spline initializing twice at cluster start and, when we had the function response issue, the agent going into a loop trying to connect even after the first connection attempt failed. We would appreciate it if you could check this when you get some time.
As I tried to explain above, the only reason I see for multiple Spline inits is that there are multiple Spark inits. The Spark session might be repeatedly timing out and something re-runs your Spark job. Otherwise I cannot explain it. Try enabling the DEBUG or even TRACE log level and see what's happening.
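One way to raise verbosity for just the Spline packages from a notebook is sketched below, using the log4j 1.x API (available on older DBRs directly and through the log4j 1.2 bridge on newer ones); adjust to your cluster's log4j2 configuration if needed:

```scala
import org.apache.log4j.{Level, Logger}

// Sketch: raise the log level only for the Spline agent's packages.
Logger.getLogger("za.co.absa.spline").setLevel(Level.DEBUG)
// Use Level.TRACE for even more detail.
```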
Hi @wajda, @cerveada
We are using the Spline agent with Databricks and sending lineage via HTTP requests using the HTTP dispatcher. We are using an Azure Function to collect the lineage. What we saw was that during high load (and therefore high response times) on the function, if the agent is not able to establish a connection to the gateway, it keeps retrying every 2 minutes, and during this time all operations on the cluster were hung. I am attaching the logs here for your reference. We had to remove the Spline installation and restart the cluster to get it back to normal. We are working on improving the Azure Function response time by sizing it correctly, but we would also like to know whether anything can be changed in the Spline settings to stop the retries once the gateway connection has failed. We plan to install Spline on 100 clusters and do not want to lose the business team's trust. Please help!
log4j-2023-09-12-08 (1).log
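For context, the HTTP dispatcher setup described above typically amounts to something like the sketch below (the function URL is a placeholder and the property keys are assumed from the agent docs; verify them for your agent version):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the described setup: lineage is POSTed to an Azure Function
// acting as the producer endpoint. The URL below is a placeholder.
val spark = SparkSession.builder()
  .config("spark.spline.lineageDispatcher", "http")
  .config("spark.spline.lineageDispatcher.http.producer.url", "https://<your-function-app>.azurewebsites.net/api/lineage")
  .getOrCreate()
```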