Add ability to configure how Spark handles dates in parquet files. #2175

Open
benedeki opened this issue Feb 17, 2023 · 3 comments · May be fixed by #2184
Labels: Conformance (Conformance Job affected), feature (New feature), priority: medium (Important but not urgent), run scripts (Helper run scripts are affected), Standardization (Standardization Job affected), under discussion (Requires consideration before a decision is made whether/how to implement)

Comments

@benedeki
Collaborator

benedeki commented Feb 17, 2023

Background

With Spark 3, new options were added that control how dates before 1900 in parquet files are handled.
The settings are:
spark.sql.parquet.datetimeRebaseModeInRead
spark.sql.parquet.datetimeRebaseModeInWrite
spark.sql.parquet.int96RebaseModeInRead
spark.sql.parquet.int96RebaseModeInWrite

Details here.
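As a rough illustration of how the four settings above map onto `spark-submit` arguments, here is a minimal Python sketch. The helper name `rebase_conf_args` is hypothetical and not part of Enceladus or Spark; the only things taken from Spark are the setting names listed above and the three mode values (`EXCEPTION`, `CORRECTED`, `LEGACY`) that these settings accept.

```python
# Mode values accepted by Spark's parquet rebase settings.
VALID_MODES = ("EXCEPTION", "CORRECTED", "LEGACY")

# The four settings named in this issue.
REBASE_SETTINGS = (
    "spark.sql.parquet.datetimeRebaseModeInRead",
    "spark.sql.parquet.datetimeRebaseModeInWrite",
    "spark.sql.parquet.int96RebaseModeInRead",
    "spark.sql.parquet.int96RebaseModeInWrite",
)


def rebase_conf_args(modes):
    """Hypothetical helper: turn {setting name: mode} into spark-submit arguments."""
    args = []
    for setting, mode in modes.items():
        if setting not in REBASE_SETTINGS:
            raise ValueError(f"unknown setting: {setting}")
        if mode.upper() not in VALID_MODES:
            raise ValueError(f"invalid mode {mode!r} for {setting}")
        # Each setting becomes one "--conf key=value" pair.
        args += ["--conf", f"{setting}={mode.upper()}"]
    return args


if __name__ == "__main__":
    print(" ".join(rebase_conf_args({
        "spark.sql.parquet.datetimeRebaseModeInRead": "LEGACY",
        "spark.sql.parquet.datetimeRebaseModeInWrite": "LEGACY",
    })))
```

Validating the mode up front would let a run script reject a typo before the Spark job is launched, rather than failing at session start.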

Feature

Allow these options to be set for Enceladus jobs.

Tasks

To discuss

  • The command line option names
  • The command line defaults
  • The write configuration names
@benedeki added the feature, Conformance, under discussion, Standardization, priority: medium, and run scripts labels on Feb 17, 2023
@benedeki changed the title from "Add ability to configure how Spakr handles dates in parquet files." to "Add ability to configure how Spark handles dates in parquet files." on Feb 23, 2023
@miroslavpojer
Collaborator

miroslavpojer commented Feb 24, 2023

This behaviour can be achieved by adding
--conf spark.sql.parquet.datetimeRebaseModeInRead=LEGACY --conf spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY
to the "spark-submit" entry in the Spark job JSON file:
"spark-submit": "spark-submit --num-executors 2 --executor-memory 2G --deploy-mode client --conf spark.sql.parquet.datetimeRebaseModeInRead=LEGACY --conf spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY",

No code changes in Enceladus are needed.
See example usage in json file.
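For anyone who wants to patch an existing job file rather than edit it by hand, the change could be scripted roughly as follows. This is only a sketch: it assumes the JSON file has a top-level "spark-submit" key holding the command string (the thread shows the key but not the surrounding structure), and the function name `add_rebase_flags` is made up for illustration.

```python
import json

# The two flags suggested in this comment.
REBASE_FLAGS = [
    "--conf spark.sql.parquet.datetimeRebaseModeInRead=LEGACY",
    "--conf spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY",
]


def add_rebase_flags(job_json: str) -> str:
    """Append the rebase flags to the "spark-submit" command if they are missing.

    Assumes "spark-submit" is a top-level key in the job JSON.
    """
    job = json.loads(job_json)
    command = job["spark-submit"]
    for flag in REBASE_FLAGS:
        # Skip flags that are already present so the patch can be re-run safely.
        if flag not in command:
            command = f"{command} {flag}"
    job["spark-submit"] = command
    return json.dumps(job, indent=2)


if __name__ == "__main__":
    original = json.dumps({
        "spark-submit": "spark-submit --num-executors 2 --executor-memory 2G --deploy-mode client"
    })
    print(add_rebase_flags(original))
```

Running it twice on the same input leaves the command unchanged the second time, since each flag is only appended when absent.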

@benedeki
Collaborator Author

benedeki commented Mar 1, 2023

> This behaviour can be achieved by adding --conf spark.sql.parquet.datetimeRebaseModeInRead=LEGACY --conf spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY to the "spark-submit" entry in the Spark job JSON file: "spark-submit": "spark-submit --num-executors 2 --executor-memory 2G --deploy-mode client --conf spark.sql.parquet.datetimeRebaseModeInRead=LEGACY --conf spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY",
>
> No code changes in Enceladus are needed. See example usage in json file.

Great find and solution. So only the helper scripts need to be enhanced.

@miroslavpojer
Collaborator

Yes
