[SPARK-52348][CONNECT] Add support for Spark Connect handlers for pipeline commands #51057

Closed
wants to merge 14 commits

Conversation


@jonmio jonmio commented May 30, 2025

What changes were proposed in this pull request?

  • Introduces a PipelinesHandler which handles Spark Connect PipelineCommands. This follows the pattern of MLHandler, where the SparkConnectPlanner delegates any ML commands to the MLHandler (see the sketch after this list)
  • Streams PipelineEvents that are emitted during pipeline execution back to the Spark Connect client
  • Rethrows exceptions that occur during pipeline execution in the StartRun handler so that they are automatically propagated back to the Spark Connect client
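
A minimal sketch of that delegation, assuming illustrative names for the planner hook and the command-type enum (the actual dispatch in this PR may differ):

import io.grpc.stub.StreamObserver
import org.apache.spark.connect.proto

// Hypothetical planner-side dispatch: pipeline commands are handed off to a
// dedicated handler, mirroring how ML commands are routed to MLHandler.
// Assumes a sessionHolder is in scope, as it is inside the planner.
def handleCommand(
    cmd: proto.Command,
    responseObserver: StreamObserver[proto.ExecutePlanResponse]): Unit = {
  cmd.getCommandTypeCase match {
    case proto.Command.CommandTypeCase.PIPELINE_COMMAND =>
      PipelinesHandler.handlePipelinesCommand(
        cmd.getPipelineCommand, responseObserver, sessionHolder)
    case other =>
      throw new UnsupportedOperationException(s"Unsupported command: $other")
  }
}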

This PR builds off changes in a few open PRs. I have squashed those changes into a single commit at the top of this PR - 49626fb. When reviewing, please ignore that commit and review only the commits after it.

Misc changes:

  • Convert the timestamp field in the PipelineEvent proto from String to google.protobuf.Timestamp
  • Remove references to SerializedException and ErrorDetail in favor of representing errors just as Throwable

Why are the changes needed?

This change is needed to support Spark Declarative Pipelines.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

New unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@jonmio jonmio changed the title Sc pipelines [WIP] [DRAFT] Sc pipelines May 30, 2025
// The graph to attach this dataset to.
optional string dataflow_graph_id = 1;
// Parses the SQL file and registers all datasets and flows.
message DefineSqlGraphElements {
Author

I think this should live under the PipelineCommand message?

Contributor

There's a PR for this: #51044

@jonmio jonmio changed the title [WIP] [DRAFT] Sc pipelines [WIP] [DRAFT] Add support for Spark Connect handlers for pipeline commands May 30, 2025
@sryza sryza changed the title [WIP] [DRAFT] Add support for Spark Connect handlers for pipeline commands [SPARK-52348] [WIP] [DRAFT] Add support for Spark Connect handlers for pipeline commands Jun 1, 2025
@sryza sryza self-assigned this Jun 1, 2025
@HyukjinKwon HyukjinKwon changed the title [SPARK-52348] [WIP] [DRAFT] Add support for Spark Connect handlers for pipeline commands [SPARK-52348][CONNECT] [WIP] [DRAFT] Add support for Spark Connect handlers for pipeline commands Jun 1, 2025
logInfo(s"Start pipeline cmd received: $cmd")
startRun(cmd.getStartRun, responseObserver, sessionHolder)
defaultResponse
// case proto.PipelineCommand.CommandTypeCase.DEFINE_SQL_GRAPH_ELEMENTS =>
Author

Todo: add this back once dependent PR is merged

Comment on lines 176 to 181
// filePath = Option.when(dataset.getSourceCodeLocation.hasFileName)(
// dataset.getSourceCodeLocation.getFileName
// ),
// line = Option.when(dataset.getSourceCodeLocation.hasLineNumber)(
// dataset.getSourceCodeLocation.getLineNumber
// ),
Author

Todo: add this back once dependent PR is merged

@@ -31,12 +33,12 @@ import org.apache.spark.sql.pipelines.graph.QueryOrigin
*/
case class PipelineEvent(
id: String,
timestamp: String,
timestamp: Timestamp,
Author

Changing to a timestamp type instead of representing the timestamp as a string. This allows for easier formatting and better timezone support.

origin: PipelineEventOrigin,
level: EventLevel,
message: String,
details: EventDetails,
error: Option[ErrorDetail]
error: Option[Throwable]
Author

Changing this to a Throwable instead of a custom class.

Author

Also removing a bunch of tests around the custom class and error serialization since that is no longer needed


package org.apache.spark.sql.pipelines

object QueryOriginType extends Enumeration {
Author

This is included in an upstream PR and can be removed once that is merged

}

// TODO: re-enable when dependency on SQL registration is merged
ignore(
Author

Will reenable this once the SQL registration PR is merged

@jonmio jonmio changed the title [SPARK-52348][CONNECT] [WIP] [DRAFT] Add support for Spark Connect handlers for pipeline commands [SPARK-52348][CONNECT] Add support for Spark Connect handlers for pipeline commands Jun 2, 2025
def withSqlConf[T](spark: SparkSession, pairs: (String, String)*)(f: => T): T = {
val conf = spark.conf
val (keys, values) = pairs.unzip
val currentValues = keys.map(conf.getOption)
Contributor

This returns a default value if the conf is not set and the conf has a default defined for it. This means that in the finally block you are actually setting keys that were unset before. That gets a bit dicey when you use spark.conf.get(key, default) later on.
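
A sketch of a restore path that sidesteps this, modeled on Spark's own SQLHelper.withSQLConf (assumes access to the internal SQLConf; SQLConf.contains reports only explicitly-set keys, unlike getOption, which also surfaces built-in defaults):

import org.apache.spark.sql.internal.SQLConf

def withSqlConf[T](pairs: (String, String)*)(f: => T): T = {
  val conf = SQLConf.get
  val (keys, values) = pairs.unzip
  // Capture only values that were explicitly set, not built-in defaults.
  val currentValues = keys.map { key =>
    if (conf.contains(key)) Some(conf.getConfString(key)) else None
  }
  keys.zip(values).foreach { case (k, v) => conf.setConfString(k, v) }
  try f finally {
    // Restore explicit values; unset keys that were never set, so later
    // spark.conf.get(key, default) calls behave as they did before.
    keys.zip(currentValues).foreach {
      case (key, Some(value)) => conf.setConfString(key, value)
      case (key, None) => conf.unsetConf(key)
    }
  }
}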

Author

Ah, this is actually from this dependent PR: #51050. cc @SCHJonathan

* Holds the latest pipeline execution for each graph ID. This is used to manage the lifecycle of
* pipeline executions.
*/
object PipelineExecutionHolder {
Contributor

How does this tie in with Connect's lifecycle management? If a session gets killed, any pipeline execution associated with that session should also be killed.

Author

How is that handled outside of pipelines, for example with streaming queries?

Contributor

I chatted about this with @hvanhovell, and it sounds like there's a stop method in SessionHolder. He also suggested it could make sense to track the executions inside SessionHolder instead of a global object.

Author

added this in the latest commit
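
Roughly, session-scoped tracking could look like the following sketch (field and method names here are illustrative, not the actual SessionHolder API; PipelineExecutionContext stands in for whatever the execution wrapper is called):

import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

// Hypothetical session-scoped registry: executions are owned by the session,
// so tearing down the session tears down its pipeline runs too.
private val pipelineExecutions =
  new ConcurrentHashMap[String, PipelineExecutionContext]()

def cachePipelineExecution(graphId: String, ctx: PipelineExecutionContext): Unit =
  pipelineExecutions.put(graphId, ctx)

// Invoked from SessionHolder's stop path, so a killed session also stops
// any pipeline executions associated with it.
private def stopAllPipelineExecutions(): Unit = {
  pipelineExecutions.values().asScala.foreach(_.pipelineExecution.stopPipeline())
  pipelineExecutions.clear()
}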


def stopPipelineExecution(graphId: String): Unit = {
executions.compute(graphId, (_, context) => {
context.pipelineExecution.stopPipeline()
Contributor

How expensive is it to stop a pipeline? This will still block a part of the ConcurrentHashMap.

Author

I would say that it's not extremely cheap but reasonably cheap. Stop will unregister listeners that were created to monitor pipeline execution and then it will interrupt the graph execution thread.

Can we punt on addressing any perf issues around concurrent requests in a followup PR?
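
For example, a sketch that keeps the expensive stop outside the map's per-bin lock (same names as the snippet above, behavior unchanged apart from locking):

// Remove the entry first, then stop outside the ConcurrentHashMap's bin lock,
// so a slow stopPipeline() cannot block unrelated operations on the map.
def stopPipelineExecution(graphId: String): Unit = {
  val context = executions.remove(graphId)
  if (context != null) {
    context.pipelineExecution.stopPipeline()
  }
}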

jon-mio added 3 commits June 7, 2025 19:39
copied tests and protos

fix imports

save before copying event protos and event helpers

connect module building but python is not working

regenerated protos and going to rebase on sandy's python changes

regen protos

save

mostly green

nits

herman

fix
@sryza sryza left a comment (Contributor)

Thanks for making the requested changes @jon-mio. One stylistic nitpick; otherwise LGTM!

@@ -1007,12 +992,10 @@ class TriggeredGraphExecutionSuite extends ExecutionTest {
"Failed to resolve flow due to upstream failure: 'spark_catalog.test_db.table3'"
),
errorChecker = { ex =>
ex.exceptions.exists { ex =>
ex.message.contains(
ex.getMessage.contains(
Contributor

Style nitpick: should the message be indented a block back?

.getOrElse {
logInfo(
s"No default catalog was supplied. Falling back to the session catalog `spark_catalog`).")
"spark_catalog"
Contributor

Do we actually want to fall back to this catalog? Or to the one that is currently the default? The same question applies to the defaultDatabase...

Author

Updated to fall back to the current one.
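
A sketch of that fallback, assuming a SparkSession named spark is in scope (providedCatalog is an illustrative stand-in for however the user-supplied catalog arrives):

val catalog = providedCatalog.getOrElse {
  // Fall back to whatever catalog the session currently has active,
  // rather than hard-coding spark_catalog.
  val current = spark.catalog.currentCatalog()
  logInfo(s"No default catalog was supplied. Falling back to the current catalog `$current`.")
  current
}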

baseOrigin = QueryOrigin(
objectType = Option(QueryOriginType.Table.toString),
objectName = Option(tableIdentifier.unquotedString),
language = Option(Python())),
@hvanhovell hvanhovell (Contributor) Jun 9, 2025

Python? You technically don't know that. To what end do we need to record this information?

Author

Currently all the SQL code goes through defineSqlGraphElement so anything going through this path is Python. However, it's not being used right now so I'm happy to remove it

val graphElementRegistry = DataflowGraphRegistry.getDataflowGraphOrThrow(dataflowGraphId)
// We will use this variable to store the run failure event if it occurs. This will be set
// by the event callback.
var runFailureEvent = Option.empty[PipelineEvent]
Contributor

By which threads is the var accessed?

Contributor

AFAICT it should be marked volatile...

Author

Yup, marked it as volatile, since the callback can be invoked by other threads that add events to the buffer.
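
In Scala that is just the @volatile annotation on the var; a minimal sketch (the failure check is illustrative):

// The callback runs on whichever thread appends to the event buffer, so the
// write must be visible to the thread that inspects the result after the run.
@volatile var runFailureEvent = Option.empty[PipelineEvent]

val onEvent: PipelineEvent => Unit = { event =>
  if (event.error.isDefined) {
    runFailureEvent = Some(event)
  }
}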

.newBuilder()
.setTimestamp(ProtoTimestamp
.newBuilder()
.setSeconds(event.timestamp.getTime / 1000)
Contributor

I am not sure if we have to document this; initially it seemed that this was arbitrarily dropping/adding milliseconds. However, java.sql.Timestamp normalizes its time and nanos fields.

Contributor

This is neat!
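
For reference, a sketch of a conversion that carries the sub-second part explicitly (assumes the event timestamp is a java.sql.Timestamp; Math.floorDiv keeps pre-epoch instants correct, and getNanos holds the normalized fractional second):

import java.sql.Timestamp
import com.google.protobuf.{Timestamp => ProtoTimestamp}

def toProtoTimestamp(ts: Timestamp): ProtoTimestamp =
  ProtoTimestamp
    .newBuilder()
    .setSeconds(Math.floorDiv(ts.getTime, 1000L))
    .setNanos(ts.getNanos)
    .build()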

@@ -426,6 +434,58 @@ case class SessionHolder(userId: String, sessionId: String, session: SparkSessio
listenerCache.keySet().asScala.toSeq
}

/**
Contributor

In a follow-up let's just put this in a different class. It is fine for now.

@hvanhovell (Contributor)

Merging to master. Thanks!
