[SPARK-52339][SQL] Fix comparison of `InMemoryFileIndex` instances #51043

bersprockets · 2025-05-28T20:12:02Z

What changes were proposed in this pull request?

This PR changes InMemoryFileIndex#equals to compare a non-distinct collection of root paths rather than a distinct set of root paths. Without this change, InMemoryFileIndex#equals considers the following two collections of root paths to be equal, even though they represent a different number of rows:

["/tmp/test", "/tmp/test"]
["/tmp/test", "/tmp/test", "/tmp/test"]
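The difference between the two comparison strategies can be sketched in plain Scala. This is only an illustration, not the Spark source: the object name `RootPathComparison` and its members are hypothetical, and root paths are modeled as strings, whereas Spark uses Hadoop `Path` objects.

```scala
object RootPathComparison {
  // Root paths modeled as plain strings (an assumption made for this sketch).
  val twoPaths: Seq[String] = Seq("/tmp/test", "/tmp/test")
  val threePaths: Seq[String] = Seq("/tmp/test", "/tmp/test", "/tmp/test")

  // Comparing distinct sets discards duplicates, so the two lists look equal.
  val equalAsSets: Boolean = twoPaths.toSet == threePaths.toSet

  // Comparing sorted, non-distinct sequences keeps the multiplicity of each path.
  val equalAsSortedSeqs: Boolean = twoPaths.sorted == threePaths.sorted

  def main(args: Array[String]): Unit = {
    println(s"equal as sets: $equalAsSets")
    println(s"equal as sorted sequences: $equalAsSortedSeqs")
  }
}
```

Sorting before comparing keeps the check insensitive to the order in which the paths were supplied while still distinguishing lists that repeat a path a different number of times.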

Why are the changes needed?

The bug can cause correctness issues, e.g.

// create test data
val data = Seq((1, 2), (2, 3)).toDF("a", "b")
data.write.mode("overwrite").csv("/tmp/test")

val fileList1 = List.fill(2)("/tmp/test")
val fileList2 = List.fill(3)("/tmp/test")

val df1 = spark.read.schema("a int, b int").csv(fileList1: _*)
val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)

df1.count() // correctly returns 4
df2.count() // correctly returns 6

// the following is the same as above, except df1 is persisted
val df1 = spark.read.schema("a int, b int").csv(fileList1: _*).persist
val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)

df1.count() // correctly returns 4
df2.count() // incorrectly returns 4!!

In the above example, df1 and df2 were created with a different number of paths: df1 has 2, and df2 has 3. But because the distinct set of root paths is the same (i.e., Set("/tmp/test") == Set("/tmp/test")), the two dataframes are considered equal. Thus, when df1 is persisted, df2 uses df1's cached plan.
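The cache reuse described above can be mimicked with a toy model. Everything in this sketch is hypothetical (`ToyIndex`, `cache`, `count`, `persistAndCount`); it is not Spark's cache manager or `InMemoryFileIndex`, and it only shows how a set-based `equals` lets a lookup for the three-path index hit the entry stored for the two-path index.

```scala
import scala.collection.mutable

object CacheReuseSketch {
  // Hypothetical stand-in for a file index; NOT Spark's InMemoryFileIndex.
  final class ToyIndex(val rootPaths: Seq[String]) {
    // Mimics the pre-fix behavior: equality over the distinct set of root paths.
    override def equals(other: Any): Boolean = other match {
      case that: ToyIndex => rootPaths.toSet == that.rootPaths.toSet
      case _ => false
    }
    override def hashCode(): Int = rootPaths.toSet.hashCode()
  }

  // Toy cache mapping an index to the row count computed for it.
  val cache: mutable.Map[ToyIndex, Long] = mutable.Map.empty

  // The example above writes two rows to /tmp/test.
  val rowsPerPath: Long = 2L

  // Non-persisted read: use a cached result if an "equal" index exists.
  def count(index: ToyIndex): Long =
    cache.getOrElse(index, index.rootPaths.size * rowsPerPath)

  // Persisted read: compute the count and store it in the cache.
  def persistAndCount(index: ToyIndex): Long =
    cache.getOrElseUpdate(index, index.rootPaths.size * rowsPerPath)

  val idx2 = new ToyIndex(List.fill(2)("/tmp/test"))
  val idx3 = new ToyIndex(List.fill(3)("/tmp/test"))

  val uncached3: Long = count(idx3)         // nothing cached yet: computed from 3 paths
  val cached2: Long = persistAndCount(idx2) // stores the 2-path result
  val reused3: Long = count(idx3)           // wrongly hits the 2-path cache entry
}
```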

The same bug also causes inappropriate exchange reuse.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

HyukjinKwon

This looks reasonable to me.


HyukjinKwon · 2025-06-23T00:37:11Z

cc @cloud-fan

cloud-fan · 2025-06-23T06:20:34Z

thanks, merging to master/4.0!

Closes #51043 from bersprockets/multi_path_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ccded6c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

pan3793 · 2025-06-23T07:46:21Z

@cloud-fan should the patch also go branch-3.5?

cloud-fan · 2025-06-23T08:26:32Z

We need a separate PR. @bersprockets can you open a 3.5 backport PR?


This is a back-port of #51043.

Closes #51256 from bersprockets/multi_path_issue_br35.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>

…ly when size matches

### What changes were proposed in this pull request?

A follow-up for #51043 that sorts the paths in `InMemoryFileIndex#equals` only when the sizes match.

### Why are the changes needed?

Avoid a potential performance regression.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test from #51043

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #51263 from yaooqinn/SPARK-52339.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 1cfe07c)
Signed-off-by: Kent Yao <yao@apache.org>
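The idea behind the follow-up, checking sizes before sorting, might look like the following sketch. The helper `sameRootPaths` and the object name are hypothetical and operate on plain strings; this is not the code from the follow-up commit.

```scala
object SizeCheckedEquality {
  // Hypothetical helper; Spark compares Hadoop Path objects, not strings.
  def sameRootPaths(a: Seq[String], b: Seq[String]): Boolean =
    // && short-circuits, so neither side is sorted when the sizes differ.
    a.size == b.size && a.sorted == b.sorted

  def main(args: Array[String]): Unit = {
    val two = List.fill(2)("/tmp/test")
    val three = List.fill(3)("/tmp/test")
    println(sameRootPaths(two, three))
    println(sameRootPaths(Seq("/b", "/a"), Seq("/a", "/b")))
  }
}
```

The size comparison is cheap relative to sorting, so lists of different lengths are rejected without paying the sort cost.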


github-actions bot added the SQL label May 28, 2025

HyukjinKwon approved these changes May 29, 2025


bersprockets added 4 commits June 19, 2025 06:49

Testing

26604e8

Revert "Testing"

a865c3c

This reverts commit 42c2acc.

Possible fix and test

4c1df68

Update test name

a939125

bersprockets force-pushed the multi_path_issue branch from 8f0302d to a939125 Compare June 19, 2025 13:57

cloud-fan approved these changes Jun 23, 2025


cloud-fan closed this in ccded6c Jun 23, 2025

bersprockets mentioned this pull request Jun 24, 2025

[SPARK-52339][SQL][3.5] Fix comparison of InMemoryFileIndex instances #51256

Closed

yaooqinn mentioned this pull request Jun 24, 2025

[SPARK-52339][SQL][FOLLOWUP] Sort paths in InMemoryFileIndex#equal only when size matches #51263

Closed
