
Can't connect to spark using 1.0.2 on Databricks 11.3 #587

Closed · afgaron opened this issue Jan 28, 2023 · 14 comments
Labels: bug (Something isn't working) · duplicate (This issue or pull request already exists)
Milestone: 1.0.4 · Assignee: cerveada

afgaron commented Jan 28, 2023

I use the Spline Spark agent to capture lineage on Databricks. After updating to v1.0.2 I'm encountering an error on Databricks runtime 11.3; it works fine on 10.4. During cluster initialization, the client endpoint fails to connect to the Spark master. Trace:

23/01/28 00:37:26 INFO log: Logging initialized @13626ms to org.eclipse.jetty.util.log.Slf4jLog
23/01/28 00:37:27 INFO Server: jetty-9.4.46.v20220331; built: 2022-03-31T16:38:08.030Z; git: bc17a0369a11ecf40bb92c839b9ef0a8ac50ea18; jvm 1.8.0_345-b01
23/01/28 00:37:27 INFO Server: Started @14855ms
23/01/28 00:37:27 INFO AbstractConnector: Started ServerConnector@1382a7d8{HTTP/1.1, (http/1.1)}{10.139.64.117:40001}
23/01/28 00:37:27 INFO Utils: Successfully started service 'SparkUI' on port 40001.
23/01/28 00:37:27 INFO ContextHandler: Started o.e.j.s.ServletContextHandler@682d9f21{/,null,AVAILABLE,@Spark}
23/01/28 00:37:27 INFO SparkContext: Added JAR /mnt/driver-daemon/jars/spark-3.3-spline-agent-bundle_2.12-1.0.2.jar at spark://10.139.64.117:33993/jars/spark-3.3-spline-agent-bundle_2.12-1.0.2.jar with timestamp 1674866242789
23/01/28 00:37:27 WARN FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
23/01/28 00:37:27 INFO FairSchedulableBuilder: Created default pool: default, schedulingMode: FIFO, minShare: 0, weight: 1
23/01/28 00:37:28 INFO DatabricksEdgeConfigs: serverlessEnabled : false
23/01/28 00:37:28 INFO DatabricksEdgeConfigs: perfPackEnabled : false
23/01/28 00:37:28 INFO DatabricksEdgeConfigs: classicSqlEnabled : false
23/01/28 00:37:28 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.139.64.117:7077...
23/01/28 00:37:28 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 10.139.64.117:7077
org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:454)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
	at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anon$1.run(StandaloneAppClient.scala:108)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.$anonfun$run$1(SparkThreadLocalForwardingThreadPoolExecutor.scala:110)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:41)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:74)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured$(SparkThreadLocalForwardingThreadPoolExecutor.scala:60)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:107)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.run(SparkThreadLocalForwardingThreadPoolExecutor.scala:110)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Failed to connect to /10.139.64.117:7077
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:284)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:226)
	at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:231)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:204)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:200)
	... 11 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.139.64.117:7077
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:750)
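
For context, the agent is normally enabled codeless-init style, purely through Spark configuration. A minimal sketch of such a setup, using the logging dispatcher discussed later in this thread (the jar path and job name are placeholders; the listener class is the one the Spline agent documents):

    # Minimal codeless-init enablement of the Spline agent (sketch).
    # The jar path and job.py are placeholders.
    spark-submit \
      --jars /path/to/spark-3.3-spline-agent-bundle_2.12-1.0.2.jar \
      --conf spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener \
      --conf spark.spline.lineageDispatcher=logging \
      job.py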

wajda commented Jan 28, 2023

How is it related to Spline?

afgaron commented Jan 28, 2023

The only change in my workflow is upgrading the Spline Spark agent from 0.7.13 to 1.0.2, which makes it seem like a bug in the Spark agent.

wajda commented Jan 29, 2023

Could it be a coincidence? From the logs it looks like a connectivity issue, seemingly unrelated to Spline. Can you please double-check that it works without the Spline agent but fails with it?

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.139.64.117:7077

afgaron commented Jan 29, 2023

Yes, I've verified that the cluster connects successfully with no Spline agent loaded, and also with Spline agent v0.7.13.

cerveada commented Feb 1, 2023

I am not able to replicate this.

I used Databricks 11.3 LTS and za.co.absa.spline.agent.spark:spark-3.3-spline-agent-bundle_2.12:1.0.2

What if you start the cluster first and then install the agent?

afgaron commented Feb 2, 2023

For our use case, we're using the Spline agent bundle and a custom wrapper. We load both from jars and set the necessary config variables with an init script.

I can successfully start a cluster and then load the two libraries as jars, but that doesn't set any config variables, so the cluster doesn't seem to be using them to capture lineage. A little testing shows that the cluster starts if I include only our wrapper jar in the init script, but does not start if I include the Spline agent jar.
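
A sketch of the jar-loading half of such an init script, following common Databricks conventions (the staging path and the wrapper jar name are assumptions, not taken from the actual script):

    #!/bin/bash
    # Copy both jars into the directory Databricks adds to the cluster
    # classpath at startup. Paths and names below are hypothetical.
    STAGE_DIR=/dbfs/FileStore/jars
    cp "$STAGE_DIR/spark-3.3-spline-agent-bundle_2.12-1.0.2.jar" /databricks/jars/
    cp "$STAGE_DIR/custom-dispatcher.jar" /databricks/jars/   # hypothetical wrapper jar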

wajda commented Feb 2, 2023

Can you share your script, please? It would help a lot if we could replicate the issue on our side.

afgaron commented Feb 2, 2023

init_spline.txt
I converted it from .sh to .txt to post it here. The dispatcher is our custom wrapper.

cerveada commented Feb 3, 2023

The script sets spark.jars to include $listener_jar. Shouldn't $dispatcher_jar be included via spark.jars as well?

Could you comment out the custom $dispatcher_jar and configure Spline to use the logging dispatcher, to test whether it works without the custom dispatcher?

    "spark.spline.lineageDispatcher" = "logging"

afgaron commented Feb 3, 2023

Shouldn't the $dispatcher_jar be included via spark.jars as well?

That's an interesting point. It has worked in the past without doing so. Maybe because spark.jars has the path for $listener_jar, it can load everything in that directory? I hadn't thought about that before.

comment out the custom $dispatcher_jar

I tried that and set the dispatcher to logging, but got the same error. If I comment out the $listener_jar and leave in the $dispatcher_jar, the cluster does start.

wajda commented Feb 6, 2023

I wonder if it's caused by a classpath conflict or some other unwanted runtime anomaly due to the xbeans, service providers, and other extra classes that the newer Spline bundle jar contains compared to the 0.7.x one.
@cerveada can you check where they come from?

(screenshots: file listings comparing the contents of the 1.0.x and 0.7.x bundle jars)
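
One quick way to check is to diff the entry listings of the two bundles (a sketch; the 0.7.13 bundle file name is an assumption):

    # List and diff the contents of the two bundle jars.
    jar tf spark-3.3-spline-agent-bundle_2.12-1.0.2.jar  | sort > v1.0.2.txt
    jar tf spark-3.3-spline-agent-bundle_2.12-0.7.13.jar | sort > v0.7.13.txt
    diff v0.7.13.txt v1.0.2.txt | grep '^>'    # entries only in 1.0.2

    # Service-provider registrations that can change runtime behaviour:
    unzip -l spark-3.3-spline-agent-bundle_2.12-1.0.2.jar 'META-INF/services/*'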

cerveada added the bug label on Feb 6, 2023
cerveada added this to the 1.0.4 milestone on Feb 6, 2023
cerveada self-assigned this on Feb 6, 2023

afgaron commented Feb 7, 2023

Yes, I was able to start a cluster on 11.3 using that bundle!

wajda commented Feb 8, 2023

Awesome! Starting the 1.0.4 release then.
