Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Finish the JSON test matrix #10491

Open
revans2 opened this issue Feb 23, 2024 · 0 comments
Open

[FEA] Finish the JSON test matrix #10491

revans2 opened this issue Feb 23, 2024 · 0 comments
Labels
test Only impacts tests

Comments

@revans2
Copy link
Collaborator

revans2 commented Feb 23, 2024

Is your feature request related to a problem? Please describe.

#10490 is adding in some beginnings of a test matrix for all JSON processing. It is a good start but it is not done. We are still missing a number of things.

Confs/Options

  • maxNestingDepth - added in 3.5.0 sets the maximum nesting depth of JSON before it is considered invalid. The default value is 1000, but in my tests with from_json It started to fail at a depth of 255. I don't know what the limit is for get_json_object. I also don't know if we want to go for the full 1000, or if we are okay with a smaller depth.
  • maxNumLen - also added in 3.5.0 this is the maximum length of a number. It looks like we go way beyond the default 1000 so we probably are okay, but it still would be good to have some kind of a test
  • maxStringLen - also added in 3.5.0. The default here is 20,000,000 (and I think that is chars not bytes) In my tests we go way beyond this but it would be good to add some tests here too.
  • local - local appears to impact date, timestamp, and decimal parsing. We have tests written and issues filed around decimal. We have not done the same for date/timestamp because there are issues with the date/timestamp format settings, which when fixed should hopefully mask this (as it really only impacts formats that we do not support) But tests that we fallback would still be good.
  • parseMode. We only support permissive, but we should have unified tests for this
  • corruptedColumnName this is for capturing the input that didn't parse properly. It only matters if a column with this name shows up in the read schema. We have some tests but we should have unified tests.
  • timeZone - we have some tests but we should look at putting them in the test matrix.
  • dateFormat - again we have some tests but they are not unified
  • timestampFormat - again we have some tests but they are not unified
  • timestampNTZFormat - again we have some tests but they are not unified
  • enableDateTimeParsingFallback. This also needs to be combined with LEGACY parsing configs testing
  • linSep/encoding. We have very little testing here, but for the most part it is ignored except for ScanJson so it is probably okay.

Input Data Types/Formats:

  • dictionary of double quoted strings
  • dictionary of single quoted strings
  • dictionary of ints
  • array of double quoted strings i.e. {"data":[...]}
  • array of single quoted strings
  • array of integer format
  • top level array of double quoted strings i.e. [...]
  • top level array of single quoted strings
  • top level array of integer format
  • garbage at the end of the input lines
  • date formatted string for different locals
  • timestamp formatted string for different locals
  • extra white space in between fields
  • leading zeros on numbers
  • repeated columns
  • missing columns
  • \r\n whitespace in strings
  • escaped chars in strings
  • deeply nested/mixed data. These should include things that would map to the various map types

Data Types (note that these only apply to from_json and ScanJson because the others don't take a full read schema):

All nested types should be tested both at the top level and as a child (data) column. If we don't support those types yet, then we should verify that we fallback to the CPU as expected.

  • date
  • timestamp
  • map<string,string>
  • map<string,int>
  • map<string,decimal(38,0)>
  • array
  • array
  • array<decimal(38,0)>
  • map<string,map<string,string>>
  • map<string,array>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

No branches or pull requests

3 participants