[SPARK-48759][SQL] Add migration doc for CREATE TABLE AS SELECT behavior change since Spark 3.4

### What changes were proposed in this pull request?

Add migration guide for `CREATE TABLE AS SELECT...` behavior change.

SPARK-41859 changed the behavior of `CREATE TABLE AS SELECT ...` from OVERWRITE to APPEND when `spark.sql.legacy.allowNonEmptyLocationInCTAS` is set to `true`:

```sql
drop table if exists test_table;
create table test_table location '/tmp/test_table' stored as parquet as select 1 as col union all select 2 as col;
drop table if exists test_table;
create table test_table location '/tmp/test_table' stored as parquet as select 3 as col union all select 4 as col;
select * from test_table;
```
This produces `{3, 4}` in Spark versions before 3.4.0 and `{1, 2, 3, 4}` in Spark 3.4.0 and later. This is a silent change in the `spark.sql.legacy.allowNonEmptyLocationInCTAS` behavior that can introduce wrong results in user applications.
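One way to guard against the silent append is to leave the legacy flag at its default of `false`, so a CTAS into a non-empty location fails fast instead of quietly appending. The sketch below assumes the usual Spark semantics that a table created with an explicit `LOCATION` is external, so `DROP TABLE` leaves the Parquet files behind; the exact error message varies by Spark version:

```sql
-- With the flag at its default (false), CTAS into a non-empty location
-- raises an error rather than overwriting or appending:
SET spark.sql.legacy.allowNonEmptyLocationInCTAS=false;

drop table if exists test_table;
-- DROP TABLE on a table with an explicit LOCATION removes only the metadata,
-- so /tmp/test_table still contains the files from the first CTAS. This
-- statement now fails fast instead of silently producing {1, 2, 3, 4}:
create table test_table location '/tmp/test_table' stored as parquet
as select 3 as col union all select 4 as col;
```

Users who intentionally reuse a location should clear its files outside of SQL before re-running the CTAS.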

### Why are the changes needed?
This documents a behavior change for `CREATE TABLE AS SELECT ...` starting in Spark 3.4.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`doc build`
### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47152 from asl3/allowNonEmptyLocationInCTAS.

Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asl3 authored and cloud-fan committed Jul 2, 2024
1 parent 9304223 commit 8a5f4e0
1 change: 1 addition & 0 deletions docs/sql-migration-guide.md
```diff
@@ -97,6 +97,7 @@ license: |
 - Since Spark 3.4, `BinaryType` is not supported in CSV datasource. In Spark 3.3 or earlier, users can write binary columns in CSV datasource, but the output content in CSV files is `Object.toString()` which is meaningless; meanwhile, if users read CSV tables with binary columns, Spark will throw an `Unsupported type: binary` exception.
 - Since Spark 3.4, bloom filter joins are enabled by default. To restore the legacy behavior, set `spark.sql.optimizer.runtime.bloomFilter.enabled` to `false`.
 - Since Spark 3.4, when schema inference on external Parquet files, INT64 timestamps with annotation `isAdjustedToUTC=false` will be inferred as TimestampNTZ type instead of Timestamp type. To restore the legacy behavior, set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`.
+- Since Spark 3.4, the behavior for `CREATE TABLE AS SELECT ...` is changed from OVERWRITE to APPEND when `spark.sql.legacy.allowNonEmptyLocationInCTAS` is set to `true`. Users are recommended to avoid CTAS with a non-empty table location.
 
 ## Upgrading from Spark SQL 3.2 to 3.3
```
