[SPARK-48356][FOLLOW UP][SQL] Improve FOR statement's column schema inference #51053

Closed
wants to merge 4 commits into from

davidm-db
Contributor

What changes were proposed in this pull request?

This pull request changes the FOR statement to infer column schemas from the query DataFrame, so that column schemas are no longer inferred implicitly in SetVariable. This is necessary because the implicit inference leads to type mismatch errors with complex nested types, e.g. ARRAY<STRUCT<..>>.

Why are the changes needed?

Bug fix for the FOR statement.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A new unit test that specifically targets the problematic case.

Was this patch authored or co-authored using generative AI tooling?

No.
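
The description does not spell out the exact mismatch, so the following is only a sketch of one way nested types can fail to match: Spark `DataType` equality is structural and includes nullability flags, so two `ARRAY<STRUCT<..>>` types with identical fields are still unequal if `containsNull` differs. The object name, field names, and the idea that nullability is the differing part are assumptions for illustration, not details taken from this PR. The snippet assumes `spark-sql` is on the classpath.

```scala
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}

// Hypothetical illustration only; not code from this pull request.
object NestedTypeMismatchSketch {
  // Element struct shared by both array types below (field names are made up).
  private val element: StructType = StructType(Seq(
    StructField("name", StringType),
    StructField("qty", IntegerType)
  ))

  // Type as it might be declared by the query's schema (assumed).
  val declared: ArrayType = ArrayType(element, containsNull = false)

  // Type as it might be re-derived somewhere else with a different flag (assumed).
  val rederived: ArrayType = ArrayType(element, containsNull = true)

  // Equality compares the element type and the containsNull flag.
  val typesMatch: Boolean = declared == rederived
}
```

Reading the type once from the query's schema, rather than deriving it a second time, avoids having two independently built types that need to agree.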

@github-actions github-actions bot added the SQL label May 30, 2025
@davidm-db davidm-db marked this pull request as ready for review May 30, 2025 10:12
@davidm-db
Contributor Author

cc @cloud-fan @dejankrak-db @miland-db @dusantism-db please review

@@ -122,6 +122,8 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
  private lazy val nameToIndex: Map[String, Int] = SparkCollectionUtils.toMapWithIndex(fieldNames)
  private lazy val nameToIndexCaseInsensitive: CaseInsensitiveMap[Int] =
    CaseInsensitiveMap[Int](nameToIndex.toMap)
  lazy val nameToDataType: collection.immutable.Map[String, DataType] =
Contributor

StructType is a public API; we should only add new methods when we have to. It is also on the Spark Connect side, which means users would need to upgrade the client version.

Can we build this map on the caller side?

Contributor Author

Changed. Wasn't aware of it, thanks!
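
For readers following the thread, building the map on the caller side could look roughly like the sketch below. It uses only the existing public `StructType.fields`, `StructField.name`, and `StructField.dataType` members; the object name, helper name, and example schema are hypothetical and are not the code that was merged. It assumes `spark-sql` is on the classpath.

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, IntegerType, StringType, StructField, StructType}

// Hypothetical caller-side helper; not the code merged in this PR.
object CallerSideSchemaLookup {
  // Builds the name -> data type map outside StructType,
  // so the public StructType API stays unchanged.
  def nameToDataType(schema: StructType): Map[String, DataType] =
    schema.fields.map(field => field.name -> field.dataType).toMap

  // Illustrative schema with a nested ARRAY<STRUCT<...>> column (names are made up).
  val exampleSchema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("items", ArrayType(StructType(Seq(
      StructField("name", StringType),
      StructField("qty", IntegerType)
    ))))
  ))
}
```

Calling `CallerSideSchemaLookup.nameToDataType(CallerSideSchemaLookup.exampleSchema)` yields a two-entry map keyed by `id` and `items`. The lookup here is case-sensitive; a caller that needs case-insensitive resolution would have to normalize the keys itself.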

@cloud-fan
Contributor

The linter failure is unrelated, thanks, merging to master!

@cloud-fan cloud-fan closed this in 23e6274 Jun 5, 2025
cloud-fan added a commit that referenced this pull request Jun 5, 2025
…nference

### What changes were proposed in this pull request?

This pull request changes `FOR` statement to infer column schemas from the query DataFrame, and no longer implicitly infer column schema in SetVariable. This is necessary due to type mismatch errors with complex nested types, e.g. `ARRAY<STRUCT<..>>`.

### Why are the changes needed?

Bug fix for FOR statement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test that specifically targets problematic case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #51053 from davidm-db/for_schema_inference.

Lead-authored-by: David Milicevic <david.milicevic@databricks.com>
Co-authored-by: David Milicevic <163021185+davidm-db@users.noreply.github.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 23e6274)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
yhuang-db pushed a commit to yhuang-db/spark that referenced this pull request Jun 9, 2025
4 participants