
[SPARK-51496][SQL] CaseInsensitiveStringMap comparison should ignore case #50275

Closed
wants to merge 3 commits

Conversation

huaxingao
Contributor

What changes were proposed in this pull request?

Since both commandOptions and dsOptions are CaseInsensitiveStringMap objects, I think we should convert the keys and values to lowercase before comparing them
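A minimal sketch of the normalization this proposes, using a hypothetical lowerKeys helper (the real comparison lives in V2Writes.mergeOptions; only key case is normalized here for brevity):

```
import java.util.Locale

// Hypothetical helper: lower-case the keys before comparing, since both maps
// originate from CaseInsensitiveStringMap instances.
def lowerKeys(options: Map[String, String]): Map[String, String] =
  options.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }
```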

Why are the changes needed?

In the Iceberg/Spark 4.0 integration, I got a few assertion errors:

assertion failed
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:264)
	at org.apache.spark.sql.execution.datasources.v2.V2Writes$.org$apache$spark$sql$execution$datasources$v2$V2Writes$$mergeOptions(V2Writes.scala:128)

The assertion error occurs when comparing commandOptions and dsOptions; the cases of the keys don't match.

assert(commandOptions == dsOptions)
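For illustration, with hypothetical values: Scala Map equality is case-sensitive on keys, so two option maps holding the same option under differently cased keys are not equal.

```
val commandOptions = Map("TABLE_NAME" -> "t1")  // key case as passed to .option(...)
val dsOptions      = Map("table_name" -> "t1")  // key lower-cased via CaseInsensitiveStringMap
assert(commandOptions == dsOptions)             // throws java.lang.AssertionError
```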

Does this PR introduce any user-facing change?

No

How was this patch tested?

I have verified that the iceberg tests can pass with this fix.

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Mar 14, 2025
@pan3793
Member

pan3793 commented Mar 14, 2025

Because one of commandOptions and dsOptions was canonicalized but the other wasn't? If so, would it be better to do the canonicalization on the caller side too?

@huaxingao
Contributor Author

@pan3793
When setting the option, Spark doesn't change the key to lower case. When creating a DataSourceV2Relation, the keys in dsOptions are converted to lower case.
I think there are two ways to fix the problem: the first is what I did in this PR, and the second is to convert the key to lower case in DataFrameWriter.option. The second way doesn't work, because in the Iceberg test

    df2.select("id", "data")
            .sort("data")
            .write()
            .format("org.apache.iceberg.spark.source.ManualSource")
            .option(ManualSource.TABLE_NAME, manualTableName)

ManualSource.TABLE_NAME is upper case. If calling .option(ManualSource.TABLE_NAME, manualTableName) changed the key to lower case, I would get

Missing property TABLE_NAME
java.lang.IllegalArgumentException: Missing property TABLE_NAME
	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
	at org.apache.iceberg.spark.source.ManualSource.getTable(ManualSource.java:64)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:98)
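A simplified stand-in (not the real Iceberg ManualSource) for why lower-casing the key in DataFrameWriter.option would break such a source: it looks up its property under the original, upper-case key.

```
// Hypothetical stand-in for a source that reads its property with an exact,
// upper-case key (the real ManualSource uses Guava Preconditions instead).
val TABLE_NAME = "TABLE_NAME"

def getTable(properties: java.util.Map[String, String]): Unit =
  require(properties.containsKey(TABLE_NAME), s"Missing property $TABLE_NAME")

getTable(java.util.Map.of("table_name", "t1"))  // fails: the key was lower-cased upstream
```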

@huaxingao
Contributor Author

cc @cloud-fan

@cloud-fan
Contributor

Shall we be case-preserving and not lower-case the keys when creating DataSourceV2Relation?

@huaxingao
Contributor Author

Shall we be case-preserving and not lower-case the keys when creating DataSourceV2Relation?

When creating a CaseInsensitiveStringMap, the keys are changed to lower case in the constructor. This dsOptions is then used to create the DataSourceV2Relation.
Since DataSourceV2Relation.options is a CaseInsensitiveStringMap, I feel changing the keys to lower case is the correct behavior.
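As an illustration of that constructor behavior (hypothetical values; assumes a Spark build on the classpath):

```
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Keys are stored lower-cased and look-ups ignore case; iterating the map
// therefore yields the lower-cased keys that end up in DataSourceV2Relation.
val options = new CaseInsensitiveStringMap(Map("TABLE_NAME" -> "t1").asJava)

options.get("table_name")   // "t1"
options.get("TABLE_NAME")   // "t1"
options.asScala.keySet      // Set("table_name")
```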

@@ -125,7 +125,14 @@ object V2Writes extends Rule[LogicalPlan] with PredicateHelper {
// for DataFrame API cases, same options are carried by both Command and DataSourceV2Relation
// for DataFrameV2 API cases, options are only carried by Command
// for SQL cases, options are only carried by DataSourceV2Relation
assert(commandOptions == dsOptions || commandOptions.isEmpty || dsOptions.isEmpty)

The way we turn a CaseInsensitiveStringMap into a Scala Map is r.options.asScala.toMap; I think we should change it to r.options.asCaseSensitiveMap.asScala.toMap to be case preserving.
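A small illustration of that contrast (relationOptions stands in for r.options; values are hypothetical):

```
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Converting through asScala yields the lower-cased keys, while going through
// asCaseSensitiveMap keeps the original casing.
val relationOptions = new CaseInsensitiveStringMap(Map("TABLE_NAME" -> "t1").asJava)

val lowerCased     = relationOptions.asScala.toMap                       // Map(table_name -> t1)
val casePreserving = relationOptions.asCaseSensitiveMap().asScala.toMap  // Map(TABLE_NAME -> t1)
```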

@cloud-fan
Contributor

thanks, merging to master/4.0!

@cloud-fan cloud-fan closed this in bfe63a3 Mar 20, 2025
cloud-fan pushed a commit that referenced this pull request Mar 20, 2025
[SPARK-51496][SQL] CaseInsensitiveStringMap comparison should ignore case

Closes #50275 from huaxingao/mergeOptions.

Authored-by: huaxingao <huaxin.gao11@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit bfe63a3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@huaxingao
Contributor Author

Thanks @cloud-fan
