[FEA][JSON] Support parsing single numbers in strings as dates in from_json #9664

Open
andygrove opened this issue Nov 8, 2023 · 2 comments
Labels
feature request New feature or request

Comments

@andygrove
Collaborator

andygrove commented Nov 8, 2023

Is your feature request related to a problem? Please describe.
This is an edge case that I ran into when working on date support in from_json.

When parsing a four-digit number as a date using the default dateFormat of yyyy-MM-dd, it gets parsed as the first day of that year. For example, 1980 becomes 1980-01-01. However, different logic is used for five-digit numbers, which are interpreted as the number of days since 1970-01-01, as shown in the following example (using Spark 3.1.1).

scala> val df = Seq("1980", "4000", "9999", "10000", "45678").map(x => s"""{ "a": "$x" }""").toDF("json").repartition(2)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [json: string]

scala> df.createOrReplaceTempView("t")

scala> spark.sql("SELECT json, from_json(json, 'a date') from t").show(false)
+----------------+---------------+
|json            |from_json(json)|
+----------------+---------------+
|{ "a": "1980" } |{1980-01-01}   |
|{ "a": "4000" } |{4000-01-01}   |
|{ "a": "9999" } |{9999-01-01}   |
|{ "a": "10000" }|{1997-05-19}   |
|{ "a": "45678" }|{2095-01-23}   |
+----------------+---------------+
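The rule suggested by the output above can be sketched in plain Scala with java.time. This is only an illustration inferred from the five rows shown, not Spark's actual parsing code: the object `DigitDateSketch` and its `parse` method are hypothetical names, and input lengths other than four or five digits are deliberately left unhandled because the example does not cover them.

```scala
import java.time.LocalDate

// Hypothetical helper (not Spark or spark-rapids code): reproduces the
// behavior observed in the Spark 3.1.1 output for digit-only strings.
object DigitDateSketch {
  def parse(s: String): Option[LocalDate] = {
    val allDigits = s.nonEmpty && s.forall(c => c >= '0' && c <= '9')
    if (!allDigits) None
    // Four digits: treated as a year, yielding the first day of that year.
    else if (s.length == 4) Some(LocalDate.of(s.toInt, 1, 1))
    // Five digits: treated as a count of days since 1970-01-01.
    else if (s.length == 5) Some(LocalDate.ofEpochDay(s.toLong))
    // Other lengths are not covered by the example, so return None here.
    else None
  }

  def main(args: Array[String]): Unit = {
    Seq("1980", "9999", "10000", "45678").foreach { s =>
      println(s"$s -> ${parse(s).map(_.toString).getOrElse("null")}")
    }
  }
}
```

Under these assumptions, `DigitDateSketch.parse("10000")` yields 1997-05-19 and `DigitDateSketch.parse("45678")` yields 2095-01-23, matching the last two rows of the table.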

Describe the solution you'd like
Ideally, we should match Spark's behavior, even though it is confusing.

Describe alternatives you've considered

Additional context

@andygrove andygrove added feature request New feature or request ? - Needs Triage Need team to review and classify labels Nov 8, 2023
@andygrove andygrove mentioned this issue Nov 8, 2023
58 tasks
@andygrove
Collaborator Author

This behavior may only apply to certain Spark versions. I am investigating.

@andygrove
Collaborator Author

As of Spark 3.4, five-digit numbers result in an error such as ValueError: year 99346 is out of range.

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Nov 14, 2023