[FEA] Add support for reading JSON containing structs where rows are not consistent #10263

Closed
andygrove opened this issue Jan 24, 2024 · 2 comments
Labels: cudf_dependency, feature request

andygrove commented Jan 24, 2024

Is your feature request related to a problem? Please describe.

Given this input file:

{ "teacher": "Bob" }
{ "student": { "name": "Carol", "age":  21 } }
{ "teacher": "Bob", "student": { "name": "Carol", "age": 21 } }
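
To make the inconsistency concrete, here is a minimal sketch in plain Python (standard library only, not plugin code) that parses the three rows above and projects each one onto the widest schema used in the test, filling absent fields with None. The `project` helper and the dict-based `SCHEMA` are hypothetical names for illustration. The expectation that missing fields come back as null reflects how Spark's CPU JSON reader behaves when given an explicit schema.

```python
import json

# The three rows from the input file above.
NDJSON = """\
{ "teacher": "Bob" }
{ "student": { "name": "Carol", "age":  21 } }
{ "teacher": "Bob", "student": { "name": "Carol", "age": 21 } }
"""

# Hypothetical stand-in for the Spark schema: a dict means struct, None means leaf.
SCHEMA = {"teacher": None, "student": {"name": None, "age": None}}

def project(value, schema):
    """Shape `value` like `schema`, using None for any missing field."""
    if schema is None:
        # Leaf field: keep whatever value is there (possibly None).
        return value
    if not isinstance(value, dict):
        # Struct expected but missing (or not an object): null struct.
        return None
    return {name: project(value.get(name), sub) for name, sub in schema.items()}

rows = [json.loads(line) for line in NDJSON.splitlines() if line.strip()]
projected = [project(row, SCHEMA) for row in rows]
for p in projected:
    print(p)
```

Each row ends up with the same two top-level fields, which is the shape the GPU reader also needs to produce for the comparison with the CPU to pass.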

And given this test for GpuJsonScan (from PR #10245, which adds support for reading structs in GpuJsonScan):

@pytest.mark.parametrize('filename', ['nested-structs.ndjson'])
@pytest.mark.parametrize('schema', [
    StructType([StructField('teacher', StringType())]),
    StructType([
        StructField('student', StructType([
            StructField('name', StringType()),
            StructField('age', IntegerType())
        ]))
    ]),
    StructType([
        StructField('teacher', StringType()),
        StructField('student', StructType([
            StructField('name', StringType()),
            StructField('age', IntegerType())
        ]))
    ]),
])
@pytest.mark.parametrize('read_func', [read_json_df, read_json_sql])
@pytest.mark.parametrize('v1_enabled_list', ["", "json"])
def test_read_nested_struct(spark_tmp_table_factory, std_input_path, read_func, filename, schema, v1_enabled_list):
    conf = copy_and_update(_enable_all_types_conf, {'spark.sql.sources.useV1SourceList': v1_enabled_list})
    assert_gpu_and_cpu_are_equal_collect(
        read_func(std_input_path + '/' + filename,
                  schema,
                  spark_tmp_table_factory,
                  {}),
        conf=conf)

I see errors such as this:

Caused by: java.lang.AssertionError: Type conversion is not allowed from Table{columns=[ColumnVector{rows=3, type=STRING, nullCount=Optional.empty, offHeap=(ID: 22 7fc165d1cda0)}, ColumnVector{rows=3, type=STRUCT, nullCount=Optional.empty, offHeap=(ID: 23 7fc1640ea610)}], cudfTable=140468586241024, rows=3} to [StringType, StructType(StructField(name,StringType,true),StructField(age,IntegerType,true))] columns 0 to 2
	at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:674)
	at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:555)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readBatch$3(GpuTextBasedPartitionReader.scala:279)
	at scala.Option.map(Option.scala:230)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readBatch$1(GpuTextBasedPartitionReader.scala:279)

There is a similar issue with GpuJsonToStructs:

@pytest.mark.parametrize('schema', [
    'struct<teacher:string>',
    'struct<student:struct<name:string,age:int>>',
    'struct<teacher:string,student:struct<name:string,age:int>>'
])
@allow_non_gpu(*non_utc_allow)
def test_from_json_struct_of_struct(schema):
    json_string_gen = StringGen(r'{"teacher": "[A-Z]{1}[a-z]{2,5}",' \
                                r'"student": {"name": "[A-Z]{1}[a-z]{2,5}", "age": 1\d}}') \
        .with_special_pattern('', weight=50) \
        .with_special_pattern('null', weight=50) \
        .with_special_pattern('invalid_entry', weight=50)
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, json_string_gen) \
            .select(f.from_json('a', schema)),
        conf={"spark.rapids.sql.expression.JsonToStructs": True})

fails with:

Type conversion is not allowed from STRUCT(STRING,STRUCT(STRING,INT64)) to StructType(StructField(teacher,StringType,true),StructField(student,StructType(StructField(name,StringType,true),StructField(age,IntegerType,true)),true))
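
When reading a message like the one above, it can help to line the two type trees up field by field. The following is a rough sketch in plain Python, not plugin code: the nested tuples, the `first_mismatch` helper, and the `CUDF_TO_SPARK` mapping are hypothetical and cover only the types that appear in this message. It is meant as a debugging aid for spotting where the shapes diverge, not as a diagnosis of the underlying cause.

```python
# Assumed mapping from cudf type names to Spark type names (only what this message needs).
CUDF_TO_SPARK = {"STRING": "StringType", "INT32": "IntegerType", "INT64": "LongType"}

# Actual cudf shape from the error: STRUCT(STRING,STRUCT(STRING,INT64)).
actual = ("STRUCT", ["STRING", ("STRUCT", ["STRING", "INT64"])])

# Expected Spark schema as (field name, type) pairs; a nested list means struct.
expected = [("teacher", "StringType"),
            ("student", [("name", "StringType"), ("age", "IntegerType")])]

def first_mismatch(actual, expected, path=""):
    """Return (path, actual, expected) for the first differing field, or None."""
    if isinstance(expected, list):
        # A struct is expected here.
        if not (isinstance(actual, tuple) and actual[0] == "STRUCT"):
            return (path, actual, "StructType")
        children = actual[1]
        if len(children) != len(expected):
            return (path, f"{len(children)} children", f"{len(expected)} fields")
        for child, (name, sub) in zip(children, expected):
            found = first_mismatch(child, sub, f"{path}.{name}" if path else name)
            if found:
                return found
        return None
    # A leaf is expected here: compare through the assumed name mapping.
    if isinstance(actual, tuple) or CUDF_TO_SPARK.get(actual) != expected:
        return (path, actual, expected)
    return None

result = first_mismatch(actual, expected)
print(result)
```

On the shapes from this message, the first difference the sketch reports is at `student.age`, where the cudf side shows INT64 and the Spark schema asks for IntegerType (a 32-bit type).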

andygrove added the feature request and ? - Needs Triage labels Jan 24, 2024

andygrove commented

Root cause seems to be rapidsai/cudf#14864

mattahrens added the cudf_dependency label and removed the ? - Needs Triage label Jan 30, 2024
andygrove assigned revans2 and unassigned andygrove and revans2 Apr 1, 2024

revans2 commented Apr 11, 2024

I do not see these errors when I run the tests locally. These tests have been enabled for a long time too, so I suspect that #10542 fixed this, or at least worked around it enough that it is no longer a problem.

revans2 closed this as completed Apr 11, 2024