
[FEA] Support reading JSON numeric values as timestamps and dates #4940

Open
Tracked by #2063
andygrove opened this issue Mar 11, 2022 · 7 comments
Labels
cudf_dependency: An issue or PR with this label depends on a new feature in cudf
feature request: New feature or request

Comments

@andygrove
Contributor

andygrove commented Mar 11, 2022

Is your feature request related to a problem? Please describe.
PR #4938 adds support for reading CSV and JSON strings as timestamps. It handles valid timestamps written in a number of formats consisting of year, month, day, hours, minutes, seconds, and so on. However, it does not support parsing numeric values (UNIX timestamps) that represent elapsed time since the epoch.

Describe the solution you'd like
Add support to GpuJsonScan for reading UNIX timestamps.

Describe alternatives you've considered
None

Additional context
See existing tests that reference this issue

@andygrove andygrove added the feature request and ? - Needs Triage labels Mar 11, 2022
@sameerz sameerz removed the ? - Needs Triage label Mar 15, 2022
@andygrove andygrove changed the title from "[FEA] Support reading JSON numeric values as timestamps" to "[FEA] Support reading CSV and JSON numeric values as timestamps" Mar 16, 2022
@andygrove andygrove self-assigned this Apr 1, 2022
@andygrove andygrove added this to the Apr 4 - Apr 15 milestone Apr 1, 2022
@res-life res-life self-assigned this Apr 19, 2022
@sameerz sameerz removed this from the Apr 18 - Apr 29 milestone Apr 29, 2022
@res-life res-life assigned res-life and unassigned res-life May 19, 2022
@sameerz sameerz added this to To do in Release 22.08 via automation May 20, 2022
@res-life
Collaborator

@andygrove @jlowe @revans2 Please help check the following:

  • Spark does not support reading CSV numeric UNIX timestamps as timestamps.
  • Spark does support reading JSON numeric UNIX timestamps as timestamps.
    But we have difficulty supporting this on the GPU.
    Spark behavior:
    string "1653014790" => null
    int 1653014790 => 2022-05-20 10:46:30
    If one column contains both of the values above, we cannot tell the original types apart when cuDF reads the column as strings, because both rows come back as the same string value.

Details:

Spark does not support reading CSV numeric UNIX timestamps as timestamps

$SPARK_HOME/bin/pyspark 
from pyspark.sql.types import *
schema = StructType([StructField("c1", TimestampType())])

spark.read.csv("/tmp/tmp.csv").show()
+----------+
|       _c0|
+----------+
|1653012031|
|       197|
+----------+

spark.read.schema(schema).csv("/tmp/tmp.csv").show()
+----+
|  c1|
+----+
|null|
|null|
+----+
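
As a side note, here is a minimal pyspark sketch (mine, not from this issue) showing how the same CSV values can still be read as timestamps on the Spark side: read the column as long and cast it, since Spark treats an integral-to-timestamp cast as seconds since the epoch. It reuses the hypothetical /tmp/tmp.csv and column name from above.

from pyspark.sql.types import StructType, StructField, LongType
from pyspark.sql.functions import col

# Read the numeric column as long, then cast; Spark interprets the cast
# of an integral value to timestamp as seconds since the epoch.
long_schema = StructType([StructField("c1", LongType())])
spark.read.schema(long_schema).csv("/tmp/tmp.csv") \
    .select(col("c1").cast("timestamp").alias("c1")) \
    .show(truncate=False)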

Spark reads JSON quoted numbers ("string ints") and ints differently

Spark code:
Spark uses the VALUE_STRING and VALUE_NUMBER_INT tokens to distinguish strings from ints:

https://github.com/apache/spark/blob/v3.3.0-rc2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L251-L266

    case TimestampType =>
      (parser: JsonParser) => parseJsonToken[java.lang.Long](parser, dataType) {
        case VALUE_STRING if parser.getTextLength >= 1 =>
          try {
            timestampFormatter.parse(parser.getText)
          } catch {
            case NonFatal(e) =>
              // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
              // compatibility.
              val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(parser.getText))
              DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
          }

        case VALUE_NUMBER_INT =>
          parser.getLongValue * 1000000L
      }
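
For reference, Spark stores timestamps internally as microseconds since the epoch, which is why the VALUE_NUMBER_INT branch multiplies the long value by 1000000L. A small sketch of mine to sanity-check the value used later in this thread (the 10:46:30 wall-clock time shown below assumes a UTC+8 session time zone, which is my assumption and is not stated in the issue):

from datetime import datetime, timezone

micros = 1653014790 * 1_000_000      # what the VALUE_NUMBER_INT branch produces
print(datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc))
# 2022-05-20 02:46:30+00:00, i.e. 2022-05-20 10:46:30 in a UTC+8 session time zone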

Different behavior for a quoted number ("string int") and an int

cat tmp.json 
{ "c1": "2020-01-01 00:00:00" }
{ "c1": 1653014790 }  // Valid int

$SPARK_HOME/bin/pyspark 
from pyspark.sql.types import *
schema = StructType([StructField("c1", TimestampType())])

spark.read.schema(schema).json("/tmp/tmp.json").show()
+-------------------+
|                 c1|
+-------------------+
|2020-01-01 00:00:00|
|2022-05-20 10:46:30| // Valid timestamp
+-------------------+


cat tmp2.json 
{ "c1": "2020-01-01 00:00:00" }
{ "c1": "1653014790" }   // Note: it's a string, not int

spark.read.schema(schema).json("/tmp/tmp2.json").show()
+-------------------+
|                 c1|
+-------------------+
|2020-01-01 00:00:00|
|               null|    // Note
+-------------------+

cuDF reads as long

json:
{ "c1": "2020-01-13" }
{ "c1": 1653014790 }
{ "c1": 0 }

result:
0             isNull = false   // IMO, this should be true (the row should be null)
1653014790    isNull = false
0             isNull = false   // cannot be distinguished from the first row

A value of 0 indicates the epoch, 1970-01-01 00:00:00.
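
To make the ambiguity concrete, here is a rough sketch of mine. It assumes cudf.read_json accepts lines=True and a per-column dtype mapping like its pandas counterpart, and uses a hypothetical tmp3.json that mixes the quoted and unquoted forms:

import cudf

# tmp3.json (hypothetical):
# { "c1": "1653014790" }   <- quoted: Spark yields null for TimestampType
# { "c1": 1653014790 }     <- unquoted: Spark yields 2022-05-20 10:46:30
gdf = cudf.read_json("/tmp/tmp3.json", lines=True, dtype={"c1": "str"})
print(gdf)
# Both rows come back as the identical string "1653014790", so the plugin can
# no longer tell whether the original JSON value was quoted (null) or unquoted
# (a valid timestamp).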

@andygrove andygrove removed their assignment May 20, 2022
@andygrove
Contributor Author

@res-life I will take a look at this later today

@revans2
Collaborator

revans2 commented May 20, 2022

@res-life
Yes, this is a known issue with JSON parsing. Spark distinguishes between quoted fields and non-quoted fields. For now we are just documenting this:

https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/docs/compatibility.md#json-floating-point

Another limitation of the GPU JSON reader is that it will parse strings containing non-string boolean or numeric values where Spark will treat them as invalid inputs and will just return null.

Although it should be moved to its own section to make it more visible.

We hope that once CUDF finishes work on the new JSON parser, we can get some help from them to indicate whether a field was quoted or not.

@res-life res-life added the cudf_dependency label May 23, 2022
@sameerz sameerz removed this from To do in Release 22.08 Jul 7, 2022
@sameerz
Collaborator

sameerz commented Jul 7, 2022

Removing from 22.08. Will put it into a release once the cudf dependencies are resolved.

@res-life
Collaborator

It's not planned for 22.12, so let me unassign myself.

@res-life res-life removed their assignment Oct 18, 2022
@revans2 revans2 mentioned this issue Oct 27, 2022
@firestarman
Collaborator

firestarman commented Oct 31, 2022

I have not been able to find a case where Spark CSV reads numeric values as timestamps, or any related config to enable that behavior.
Maybe we can remove the word "CSV" from the title, since this issue appears to be specific to the JSON reader.

@andygrove Do you have any concerns about removing "CSV"?

@firestarman
Collaborator

firestarman commented Nov 1, 2022

I am going to remove the word "CSV" from the title.
If there is any concern, feel free to add it back.

@firestarman firestarman changed the title from "[FEA] Support reading CSV and JSON numeric values as timestamps" to "[FEA] Support reading JSON numeric values as timestamps" Nov 1, 2022
@andygrove andygrove removed their assignment Dec 14, 2023
@revans2 revans2 mentioned this issue Mar 13, 2024
@revans2 revans2 changed the title from "[FEA] Support reading JSON numeric values as timestamps" to "[FEA] Support reading JSON numeric values as timestamps and dates" Mar 13, 2024