
[SPARK-52334][CORE][K8S] update all files, jars, and pyFiles to reference the working directory after they are downloaded #51037


Closed
TongWei1105 wants to merge 5 commits

Conversation

@TongWei1105 (Contributor) commented May 28, 2025


What changes were proposed in this pull request?

This PR fixes a bug where submitting a Spark job with the --files option and then calling SparkContext.addFile() for a file with the same name causes Spark to throw an exception:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: File a.text was already registered with a different path (old path = /tmp/spark-6aa5129d-5bbb-464a-9e50-5b6ffe364ffb/a.text, new path = /opt/spark/work-dir/a.text)

Why are the changes needed?

  1. Submit a Spark application using spark-submit with the --files option:
    bin/spark-submit --files s3://bucket/a.text --class testDemo app.jar
  2. In the testDemo application code, call:
    sc.addFile("a.text", true)
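For illustration only, a minimal driver consistent with the steps above might look like the sketch below; only the class name testDemo and the file name a.text come from the example, everything else is assumed.

    import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

    object testDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("testDemo"))
        // a.text was already distributed via --files; registering the same name
        // again is what triggers the "already registered with a different path" check.
        sc.addFile("a.text", true)
        println(SparkFiles.get("a.text"))
        sc.stop()
      }
    }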

This works correctly in YARN mode, but throws an error in Kubernetes mode.
After SPARK-33782, in Kubernetes mode, --files, --jars, --archives, and --pyFiles are all downloaded to the driver's working directory.

However, in the code, args.files = filesLocalFiles, and filesLocalFiles refers to the temporary download path, not the working directory.
This causes a conflict when user code such as testDemo calls sc.addFile("a.text", true), resulting in an error such as:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: File a.text was already registered with a different path (old path = /tmp/spark-6aa5129d-5bbb-464a-9e50-5b6ffe364ffb/a.text, new path = /opt/spark/work-dir/a.text)
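A rough sketch of the fix idea, not the actual patch (the object and method names below are hypothetical): after the Kubernetes cluster-mode driver downloads the resources, each comma-separated resource list is rewritten so its entries point at the working-directory copies rather than the temporary download directory.

    import java.io.File

    object WorkingDirResolution {
      // Rewrite a comma-separated resource list so each entry points at the copy in
      // the driver's current working directory, where SPARK-33782 places downloaded
      // resources in Kubernetes cluster mode. Illustrative helper only.
      def rewriteToWorkingDir(commaSeparated: String): String =
        Option(commaSeparated).filter(_.nonEmpty).map { list =>
          list.split(",")
            .map(uri => new File(uri.split("/").last).getAbsolutePath)
            .mkString(",")
        }.orNull
    }

    // Illustrative usage against the variables mentioned above:
    //   args.files = WorkingDirResolution.rewriteToWorkingDir(filesLocalFiles)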

Does this PR introduce any user-facing change?

The issue described above is resolved after this PR.

How was this patch tested?

Existing UTs, plus a new unit test added in this PR.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the CORE label May 28, 2025
@TongWei1105 (Contributor, Author) commented:

Hi @dongjoon-hyun,
Could you please review this PR when you have time? Many thanks!

@TongWei1105 changed the title from "[SPARK-52334][CORE][K8S] update all files, jars, archiveFiles, and pyFiles to reference the working directory after they are downloaded" to "[SPARK-52334][CORE][K8S] update all files, jars, and pyFiles to reference the working directory after they are downloaded" on May 28, 2025
@dongjoon-hyun (Member) left a comment


Do you think you can write a test case, @TongWei1105?

@TongWei1105 (Contributor, Author) commented May 29, 2025

Do you think you can write a test case, @TongWei1105?
@dongjoon-hyun
Thanks for the suggestion! I've added a unit test; please feel free to review it when convenient. Appreciate your feedback!

@TongWei1105 requested a review from dongjoon-hyun on May 29, 2025 at 08:54
"/home/thejar.jar",
"arg1")
val appArgs = new SparkSubmitArguments(clArgs)
val _ = submit.prepareSubmitEnvironment(appArgs)
@mridulm (Contributor) commented Jun 2, 2025

Can you update this test to handle archives as well?

@TongWei1105 (Contributor, Author) replied:

Done

@mridulm (Contributor) replied:

log4j2.properties is not an archive file, so it ends up getting copied to the destination (a variant of the existing cases).
I am trying to ensure that the if (isArchive) { branch works as expected when the input actually results in unpacking the file.

@TongWei1105 (Contributor, Author) replied:

The isArchive logic does get triggered in this case — the test has been updated to cover that scenario accordingly.
Thank you for your suggestion.
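For reference, a sketch of one way a test could exercise the unpack path with a real archive; the helper below is hypothetical and just uses java.util.zip rather than any particular existing test utility.

    import java.io.{File, FileOutputStream}
    import java.nio.charset.StandardCharsets
    import java.util.zip.{ZipEntry, ZipOutputStream}

    object ArchiveFixture {
      // Build a small but real zip archive so that the isArchive branch actually
      // has something to unpack when the archive is passed via --archives.
      def createTestArchive(dir: File): File = {
        val archive = new File(dir, "test.zip")
        val zip = new ZipOutputStream(new FileOutputStream(archive))
        try {
          zip.putNextEntry(new ZipEntry("inside/a.text"))
          zip.write("hello".getBytes(StandardCharsets.UTF_8))
          zip.closeEntry()
        } finally {
          zip.close()
        }
        archive
      }
    }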

@TongWei1105 requested a review from mridulm on June 3, 2025 at 04:54
@TongWei1105 force-pushed the SPARK-52334 branch 4 times, most recently from 26b6316 to 818ff74 on June 4, 2025
@TongWei1105 (Contributor, Author) commented:
When you have a moment, could you please take another look at this PR? Thanks!
@dongjoon-hyun @mridulm

@TongWei1105 closed this on Jun 9, 2025