[SPARK-52638][SQL] Allow preserving Hive-style column order to be configurable #51342


Open · wants to merge 4 commits into master from hive_column_order

Conversation

szehon-ho (Member) commented Jul 1, 2025

What changes were proposed in this pull request?

Add a flag "spark.sql.hive.legacy.preserveHiveColumnOrder" to control whether HiveExternalCatalog persists a table schema in the Hive-style column order (partition columns at the end) or in the original user-specified column order from table creation.
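For illustration, a minimal sketch of the behavior the flag controls (table and column names are hypothetical; assumes a Hive-backed session catalog, i.e. spark.sql.catalogImplementation=hive):

```scala
// The user declares the partition column in the middle of the schema.
spark.sql("CREATE TABLE t (a INT, p STRING, b INT) USING parquet PARTITIONED BY (p)")

// With the legacy Hive-style ordering, the catalog persists the schema with
// the partition column moved to the end:
spark.table("t").schema.fieldNames  // Array(a, b, p)

// With the flag disabled, the persisted schema would instead keep the
// user-specified order: (a, p, b).
```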

Why are the changes needed?

Previously, Spark relied heavily on Hive behavior. Nowadays there are more catalogs, especially with the DSv2 API, and none of the others re-order columns so that the partition columns come last. Many applications now expect to work against any catalog, and this reordering makes that expectation hard to meet.

This was from the discussion here: #51280 (comment)

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a test to HiveDDLSuite.

Was this patch authored or co-authored using generative AI tooling?

No

szehon-ho force-pushed the hive_column_order branch from 197a129 to 6be4548 on Jul 1, 2025
pan3793 (Member) commented Jul 2, 2025

Is it safe? Do other places assume that partition columns always stay at the end?

buildConf("spark.sql.hive.preserveLegacyColumnOrder.enabled")
.internal()
.doc("When true, tables returned from HiveExternalCatalog preserve Hive-style column order " +
"where the partition columns are at the end. Otherwise, the user-specified column order " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. . O -> . O.

On the placement in SQLConf.scala, dongjoon-hyun (Member) commented:

```
@@ -5989,6 +5989,16 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

   val HIVE_PRESERVE_LEGACY_COLUMN_ORDER =
     buildConf("spark.sql.hive.preserveLegacyColumnOrder.enabled")
```

Although it's not mandatory, shall we put this in HiveUtils.scala, like the following?

```scala
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
```

dongjoon-hyun (Member) left a review:

+1, LGTM with two minor comments. Thank you, @szehon-ho.

cloud-fan (Contributor) commented:

In general, I like this direction to make the behavior more reasonable, but let's also be careful about storage features. For existing tables, will we also change their column order?

On the reorderSchema method in HiveExternalCatalog.scala:

```
@@ -818,16 +819,20 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
 // columns are not put at the end of schema. We need to reorder it when reading the schema
 // from the table properties.
 private def reorderSchema(schema: StructType, partColumnNames: Seq[String]): StructType = {
```
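For context, roughly what such a reordering does (an editor's sketch, not the exact Spark implementation):

```scala
import org.apache.spark.sql.types.StructType

// Sketch: move the named partition columns to the end of the schema,
// keeping them in the order given by partColumnNames.
def reorderSchemaSketch(schema: StructType, partColumnNames: Seq[String]): StructType = {
  val (partFields, dataFields) =
    schema.fields.partition(f => partColumnNames.contains(f.name))
  // Order the partition fields by their position in partColumnNames.
  val orderedPartFields = partColumnNames.flatMap(n => partFields.find(_.name == n))
  StructType(dataFields ++ orderedPartFields)
}
```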
cloud-fan (Contributor) commented:
This method is called when we read the table metadata back out, so I think it's risky to change it directly.

My proposal is to introduce a hidden table property that decides whether to reorder the columns. Existing tables don't have this property, so their behavior doesn't change; for newly created tables, we only add the property if the config is turned on.
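A minimal sketch of that proposal (the property name and helper are hypothetical; reorderSchemaSketch is the helper sketched above):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical read path: existing tables lack the marker property, so
// forall(...) keeps them on the legacy (reordered) path. "legacyColumnOrder"
// is a stand-in name for the hidden table property.
def restoreSchema(
    schemaFromProps: StructType,
    partColumnNames: Seq[String],
    tableProps: Map[String, String]): StructType = {
  val legacyOrder = tableProps.get("legacyColumnOrder").forall(_.toBoolean)
  if (legacyOrder) reorderSchemaSketch(schemaFromProps, partColumnNames)
  else schemaFromProps
}
```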

viirya (Member) left a review:

> None of the other catalogs re-order the columns so that the partition columns are at the end ...

Hmm, but this also only targets HiveExternalCatalog, not other catalogs. I wonder why the Hive catalog wouldn't follow Hive behavior? Shouldn't that be the normal behavior, expected by users?

I'm also worried that, as others have pointed out, changing this may not be safe if some places assume the expected order.

cloud-fan (Contributor) commented:

I think it's weird for a catalog to define its own column-order behavior. I'm fine with still keeping this behavior for Hive tables (provider == "hive").

szehon-ho (Member, Author) commented:

So, IIUC, we will have this config, spark.sql.hive.preserveLegacyColumnOrder.enabled, and if it is false we save a table property like "legacyColumnOrder"="false". Only then do we read the table back without re-ordering.

I guess this logic will live only in HiveExternalCatalog, since it is the only catalog that does column re-ordering? Other catalogs should not be affected?
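A sketch of the corresponding write path at table creation (again with the hypothetical property name):

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Hypothetical: only tables created while the config is off get the marker
// property, so existing tables are untouched.
def withColumnOrderMarker(table: CatalogTable, preserveLegacyOrder: Boolean): CatalogTable =
  if (preserveLegacyOrder) table
  else table.copy(properties = table.properties + ("legacyColumnOrder" -> "false"))
```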

szehon-ho (Member, Author) commented:

Updated as per comments.

"Otherwise, use the user-specified column order.")
.version("4.1.0")
.booleanConf
.createWithDefault(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default behavior should be respect the user-specified column order.

szehon-ho (Member, Author) replied:

I feel it is a behavior change and too risky?

szehon-ho (Member, Author) commented Jul 4, 2025

I think this may actually not be possible for now. I did more tests with ALTER TABLE, and because we no longer keep the partition columns at the end, the method Catalog::alterTableDataSchema(Schema newDataSchema) becomes ambiguous for HiveExternalCatalog.

It takes as its argument only the new data schema (without partition columns), so we no longer know where the partition columns should go; we have lost the user's intent (previously they always went at the end). We should probably focus instead on adding HiveExternalCatalog support for changing columns via Catalog::alterTable(), because then we know the user's intent.
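To make the ambiguity concrete, abbreviated signatures of the two APIs (a sketch; see ExternalCatalog.scala and #51373 for the real definitions):

```scala
import org.apache.spark.sql.types.StructType

trait SchemaAlterOps {
  // Receives only the data schema, partition columns stripped: for a table
  // declared as (a INT, p STRING, b INT) partitioned by p, this sees (a, b)
  // and cannot tell where p belongs unless it is always appended at the end.
  def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit

  // Receives the full schema, partition columns included, so the user's
  // intended column positions survive the round trip.
  def alterTableSchema(db: String, table: String, newSchema: StructType): Unit
}
```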

In any case, I did find a bug in V2SessionCatalog while doing this testing; I will fix that separately.

szehon-ho (Member, Author) commented Jul 4, 2025

This should be the way forward: #51373. I will return to this once that is in.

cloud-fan pushed a commit that referenced this pull request Jul 7, 2025
### What changes were proposed in this pull request?
Add a new ExternalCatalog and SessionCatalog API alterTableSchema that will supersede alterTableDataSchema.

### Why are the changes needed?
Because ExternalCatalog::alterTableDataSchema takes only the data schema (without partition columns), we lose the positions of the partition columns. This makes it impossible to support column orders where the partition columns are not at the end.

See #51342 for context

More generally, this is a more intuitive API than alterTableDataSchema, because the caller no longer needs to strip out partition columns. It is also not immediately obvious that "data schema" means the schema without partition columns.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests; moved the tests for alterTableDataSchema to the new API.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #51373 from szehon-ho/alter_table_schema.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>