
Can't connect to spark using 1.0.2 on Databricks 11.3 #587

Closed · afgaron opened this issue Jan 28, 2023 · 14 comments
Labels: bug (Something isn't working) · duplicate (This issue or pull request already exists)
Milestone: 1.0.4 · Assignee: cerveada

afgaron commented Jan 28, 2023

I use the Spline Spark agent to capture lineage on Databricks. After updating to v1.0.2 I'm encountering an error on Databricks runtime 11.3; it works fine on 10.4. During cluster initialization, the client endpoint fails to connect to the Spark master. Trace:

23/01/28 00:37:26 INFO log: Logging initialized @13626ms to org.eclipse.jetty.util.log.Slf4jLog
23/01/28 00:37:27 INFO Server: jetty-9.4.46.v20220331; built: 2022-03-31T16:38:08.030Z; git: bc17a0369a11ecf40bb92c839b9ef0a8ac50ea18; jvm 1.8.0_345-b01
23/01/28 00:37:27 INFO Server: Started @14855ms
23/01/28 00:37:27 INFO AbstractConnector: Started ServerConnector@1382a7d8{HTTP/1.1, (http/1.1)}{10.139.64.117:40001}
23/01/28 00:37:27 INFO Utils: Successfully started service 'SparkUI' on port 40001.
23/01/28 00:37:27 INFO ContextHandler: Started o.e.j.s.ServletContextHandler@682d9f21{/,null,AVAILABLE,@Spark}
23/01/28 00:37:27 INFO SparkContext: Added JAR /mnt/driver-daemon/jars/spark-3.3-spline-agent-bundle_2.12-1.0.2.jar at spark://10.139.64.117:33993/jars/spark-3.3-spline-agent-bundle_2.12-1.0.2.jar with timestamp 1674866242789
23/01/28 00:37:27 WARN FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
23/01/28 00:37:27 INFO FairSchedulableBuilder: Created default pool: default, schedulingMode: FIFO, minShare: 0, weight: 1
23/01/28 00:37:28 INFO DatabricksEdgeConfigs: serverlessEnabled : false
23/01/28 00:37:28 INFO DatabricksEdgeConfigs: perfPackEnabled : false
23/01/28 00:37:28 INFO DatabricksEdgeConfigs: classicSqlEnabled : false
23/01/28 00:37:28 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.139.64.117:7077...
23/01/28 00:37:28 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 10.139.64.117:7077
org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:454)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
	at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anon$1.run(StandaloneAppClient.scala:108)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.$anonfun$run$1(SparkThreadLocalForwardingThreadPoolExecutor.scala:110)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:41)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:74)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured$(SparkThreadLocalForwardingThreadPoolExecutor.scala:60)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:107)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.run(SparkThreadLocalForwardingThreadPoolExecutor.scala:110)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Failed to connect to /10.139.64.117:7077
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:284)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:226)
	at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:231)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:204)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:200)
	... 11 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.139.64.117:7077
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:750)
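
For context, the agent is normally enabled codeless-init style, purely through Spark configuration. A minimal sketch of such a setup, using the logging dispatcher discussed later in this thread (the jar path and job name are placeholders; the listener class is the one the Spline agent documents):

    # Minimal codeless-init enablement of the Spline agent (sketch).
    # The jar path and job.py are placeholders.
    spark-submit \
      --jars /path/to/spark-3.3-spline-agent-bundle_2.12-1.0.2.jar \
      --conf spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener \
      --conf spark.spline.lineageDispatcher=logging \
      job.py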

wajda commented Jan 28, 2023

How is it related to Spline?

afgaron commented Jan 28, 2023

The only change in my workflow is upgrading the Spline Spark agent from 0.7.13 to 1.0.2, which makes it seem like a bug in the Spark agent.

wajda commented Jan 29, 2023

Could it be a coincidence? From the logs it looks like a connectivity issue, seemingly unrelated to Spline. Can you please double-check that it works without the Spline agent but fails with it?

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.139.64.117:7077

afgaron commented Jan 29, 2023

Yes, I've verified that the cluster connects successfully with no Spline agent loaded, and also with Spline agent v0.7.13.

cerveada commented Feb 1, 2023

I am not able to replicate this.

I used Databricks 11.3 LTS and za.co.absa.spline.agent.spark:spark-3.3-spline-agent-bundle_2.12:1.0.2

What if you start the cluster first and then install the agent?

afgaron commented Feb 2, 2023

For our use case, we're using the Spline agent bundle and a custom wrapper. We load both from jars and set the necessary config variables with an init script.

I can successfully start a cluster and then load the two libraries as jars, but that doesn't set any config variables, so the cluster doesn't seem to be using them to capture lineage. A little testing shows that the cluster starts if I include only our wrapper jar in the init script, but does not start if I include the Spline agent jar.
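
A sketch of the jar-loading half of such an init script, following common Databricks conventions (the staging path and the wrapper jar name are assumptions, not taken from the actual script):

    #!/bin/bash
    # Copy both jars into the directory Databricks adds to the cluster
    # classpath at startup. Paths and names below are hypothetical.
    STAGE_DIR=/dbfs/FileStore/jars
    cp "$STAGE_DIR/spark-3.3-spline-agent-bundle_2.12-1.0.2.jar" /databricks/jars/
    cp "$STAGE_DIR/custom-dispatcher.jar" /databricks/jars/   # hypothetical wrapper jar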

wajda commented Feb 2, 2023

Can you share your script, please? It would help a lot if we could replicate the issue on our side.

afgaron commented Feb 2, 2023

init_spline.txt
I converted it from .sh to .txt to post it here. The dispatcher is our custom wrapper.

cerveada commented Feb 3, 2023

The script sets spark.jars to include $listener_jar. Shouldn't $dispatcher_jar be included via spark.jars as well?

Could you comment out the custom $dispatcher_jar and configure Spline to use the logging dispatcher, to test whether it works without the custom dispatcher?

    "spark.spline.lineageDispatcher" = "logging"

afgaron commented Feb 3, 2023

Shouldn't the $dispatcher_jar be included via spark.jars as well?

That's an interesting point. It has worked in the past without doing so. Maybe because spark.jars has the path for $listener_jar, it can load everything in that directory? I hadn't thought about that before.

comment out the custom $dispatcher_jar

I tried that and set the dispatcher to logging, but got the same error. If I comment out the $listener_jar and leave in the $dispatcher_jar, the cluster does start.

wajda commented Feb 6, 2023

I wonder if it's caused by a classpath conflict or some other unwanted runtime anomaly due to the xbeans, service providers, and other extra classes that the newer Spline bundle jar contains compared to the 0.7.x one.
@cerveada can you check where they come from?

(screenshots: file listings comparing the contents of the 1.0.x and 0.7.x bundle jars)
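
One quick way to check is to diff the entry listings of the two bundles (a sketch; the 0.7.13 bundle file name is an assumption):

    # List and diff the contents of the two bundle jars.
    jar tf spark-3.3-spline-agent-bundle_2.12-1.0.2.jar  | sort > v1.0.2.txt
    jar tf spark-3.3-spline-agent-bundle_2.12-0.7.13.jar | sort > v0.7.13.txt
    diff v0.7.13.txt v1.0.2.txt | grep '^>'    # entries only in 1.0.2

    # Service-provider registrations that can change runtime behaviour:
    unzip -l spark-3.3-spline-agent-bundle_2.12-1.0.2.jar 'META-INF/services/*'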

cerveada added the bug label on Feb 6, 2023
cerveada added this to the 1.0.4 milestone on Feb 6, 2023
cerveada self-assigned this on Feb 6, 2023

afgaron commented Feb 7, 2023

Yes, I was able to start a cluster on 11.3 using that bundle!

wajda commented Feb 8, 2023

Awesome! Starting the 1.0.4 release then.
