# [SPARK-46122][SQL] Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default

### What changes were proposed in this pull request?

This PR switches `spark.sql.legacy.createHiveTableByDefault` to `false` by default, moving away from this legacy behavior as of `Apache Spark 4.0.0`. The legacy behavior remains available throughout the Apache Spark 4.x line by setting `spark.sql.legacy.createHiveTableByDefault=true`.
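For illustration, a minimal sketch of the new default behavior and the opt-out, e.g. in `spark-shell` (the table names are arbitrary, and `parquet` is assumed as the untouched value of `spark.sql.sources.default`):

```scala
// Spark 4.0 default: CREATE TABLE without USING/STORED AS creates a
// data source table backed by spark.sql.sources.default (parquet by default).
spark.sql("CREATE TABLE t1 (id INT)")

// Opt back into the legacy behavior for this session; with this set (and
// Hive support enabled), the same statement creates a Hive text table again.
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
spark.sql("CREATE TABLE t2 (id INT)")
```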

### Why are the changes needed?

Historically, this behavior change was first merged for `Apache Spark 3.0.0` under SPARK-30098, and then officially reverted during the `3.0.0 RC` period.

- 2019-12-06: apache#26736 (58be82a)
- 2019-12-06: https://lists.apache.org/thread/g90dz1og1zt4rr5h091rn1zqo50y759j
- 2020-05-16: apache#28517

For `Apache Spark 3.1.0`, we discussed it again and defined it as `Legacy` behavior via a new configuration, reusing the JIRA ID SPARK-30098.
- 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
- 2020-12-03: apache#30554

This change was proposed again twice last year, and `Apache Spark 4.0.0` is a good time to make a decision on Apache Spark's future direction:
- SPARK-42603 on 2023-02-27, as an independent idea.
- SPARK-46122 on 2023-11-27, as part of the Apache Spark 4.0.0 idea.

### Does this PR introduce _any_ user-facing change?

Yes. `CREATE TABLE` syntax without `USING` and `STORED AS` now creates a data source table instead of a Hive table by default, and the migration guide is updated accordingly.

### How was this patch tested?

Pass the CIs with the adjusted test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46207 from dongjoon-hyun/SPARK-46122.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun authored and JacobZheng0927 committed May 11, 2024
1 parent e4c556b; commit 24fa70c
Showing 4 changed files with 7 additions and 9 deletions.
#### docs/sql-migration-guide.md (1 change: 1 addition & 0 deletions)

```diff
@@ -25,6 +25,7 @@ license: |
 ## Upgrading from Spark SQL 3.5 to 4.0
 
 - Since Spark 4.0, `spark.sql.ansi.enabled` is on by default. To restore the previous behavior, set `spark.sql.ansi.enabled` to `false` or `SPARK_ANSI_SQL_MODE` to `false`.
+- Since Spark 4.0, `CREATE TABLE` syntax without `USING` and `STORED AS` will use the value of `spark.sql.sources.default` as the table provider instead of `Hive`. To restore the previous behavior, set `spark.sql.legacy.createHiveTableByDefault` to `true`.
 - Since Spark 4.0, the default behaviour when inserting elements in a map is changed to first normalize keys -0.0 to 0.0. The affected SQL functions are `create_map`, `map_from_arrays`, `map_from_entries`, and `map_concat`. To restore the previous behaviour, set `spark.sql.legacy.disableMapKeyNormalization` to `true`.
 - Since Spark 4.0, the default value of `spark.sql.maxSinglePartitionBytes` is changed from `Long.MaxValue` to `128m`. To restore the previous behavior, set `spark.sql.maxSinglePartitionBytes` to `9223372036854775807`(`Long.MaxValue`).
 - Since Spark 4.0, any read of SQL tables takes into consideration the SQL configs `spark.sql.files.ignoreCorruptFiles`/`spark.sql.files.ignoreMissingFiles` instead of the core config `spark.files.ignoreCorruptFiles`/`spark.files.ignoreMissingFiles`.
```
#### python/pyspark/sql/tests/test_readwriter.py (5 changes: 2 additions & 3 deletions)

```diff
@@ -247,10 +247,9 @@ def test_create(self):
 
     def test_create_without_provider(self):
         df = self.df
-        with self.assertRaisesRegex(
-            AnalysisException, "NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT"
-        ):
+        with self.table("test_table"):
             df.writeTo("test_table").create()
+            self.assertEqual(100, self.spark.sql("select * from test_table").count())
 
     def test_table_overwrite(self):
         df = self.df
```
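For context, `writeTo(...).create()` goes through the DataFrameWriterV2 API; with the new default it succeeds without Hive support instead of failing with `NOT_SUPPORTED_COMMAND_WITHOUT_HIVE_SUPPORT`, which is why the test now asserts on the row count. A rough Scala equivalent of the updated test (the session setup and the 100-row input are assumptions mirroring the Python fixture):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(100).toDF("id")

// Under spark.sql.legacy.createHiveTableByDefault=false this creates a
// data source table, so no Hive support is required.
df.writeTo("test_table").create()
assert(spark.sql("SELECT * FROM test_table").count() == 100)
```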
#### SQLConf.scala (2 changes: 1 addition & 1 deletion)

```diff
@@ -4457,7 +4457,7 @@ object SQLConf {
         s"instead of the value of ${DEFAULT_DATA_SOURCE_NAME.key} as the table provider.")
       .version("3.1.0")
       .booleanConf
-      .createWithDefault(true)
+      .createWithDefault(false)
 
   val LEGACY_CHAR_VARCHAR_AS_STRING =
     buildConf("spark.sql.legacy.charVarcharAsString")
```
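For reference, the changed line sits inside a standard `buildConf` declaration in `SQLConf`; a sketch of the full entry and a typical read site (the constant name, doc wording, and read site are assumptions based on common SQLConf usage, not shown in this diff):

```scala
// Sketch of the declaration the diff above modifies:
val LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT =
  buildConf("spark.sql.legacy.createHiveTableByDefault")
    .doc("When true, CREATE TABLE without USING or STORED AS uses Hive " +
      s"instead of the value of ${DEFAULT_DATA_SOURCE_NAME.key} as the table provider.")
    .version("3.1.0")
    .booleanConf
    .createWithDefault(false) // flipped from true by this commit

// Typical read site elsewhere in the analyzer/planner:
if (conf.getConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT)) {
  // treat CREATE TABLE without USING/STORED AS as a Hive table
}
```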
#### PlanResolutionSuite.scala (8 changes: 3 additions & 5 deletions)

```diff
@@ -2847,11 +2847,9 @@ class PlanResolutionSuite extends AnalysisTest {
     assert(desc.viewText.isEmpty)
     assert(desc.viewQueryColumnNames.isEmpty)
     assert(desc.storage.locationUri.isEmpty)
-    assert(desc.storage.inputFormat ==
-      Some("org.apache.hadoop.mapred.TextInputFormat"))
-    assert(desc.storage.outputFormat ==
-      Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"))
-    assert(desc.storage.serde == Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"))
+    assert(desc.storage.inputFormat.isEmpty)
+    assert(desc.storage.outputFormat.isEmpty)
+    assert(desc.storage.serde.isEmpty)
     assert(desc.storage.properties.isEmpty)
     assert(desc.properties.isEmpty)
     assert(desc.comment.isEmpty)
```
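Why the expected values changed: under the legacy default, the parsed `CREATE TABLE` statement picked up Hive's `TEXTFILE` storage descriptors, whereas it now resolves to a data source table whose storage format stays empty until the provider (`spark.sql.sources.default`) is applied. A hedged sketch of where the old expected values came from (`HiveSerDe.sourceToSerDe` is the helper Spark uses for this mapping; its use here is illustrative):

```scala
import org.apache.spark.sql.internal.HiveSerDe

// Legacy path: the TEXTFILE defaults were attached to the table's storage.
val textfile = HiveSerDe.sourceToSerDe("textfile").get
// textfile.inputFormat  == Some("org.apache.hadoop.mapred.TextInputFormat")
// textfile.outputFormat == Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")
// textfile.serde        == Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")
```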
