
Improve reading of Parquet files with nested schema #1619

@Allex-Nik

Description


blocked by #536

Here is the schema of a Parquet file:

{
  "name": "book",
  "type": "record",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "title", "type": "string"},
    {
      "name": "author",
      "type": {
        "type": "record",
        "name": "author",
        "fields": [
          {"name": "id", "type": "int"},
          {"name": "firstName", "type": "string"},
          {"name": "lastName", "type": "string"}
        ]
      }
    },
    {"name": "genre", "type": "string"},
    {"name": "publisher", "type": "string"}
  ]
}

The field author is nested.
The schema is parsed as an org.apache.avro.Schema, and AvroParquetWriter is used to write the Parquet file.
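
For context, a minimal sketch of how such a file can be produced with the parquet-avro API (the file names and sample values below are assumptions, not taken from the attached archive):

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Parse the schema shown above (assumed to be saved as book.avsc)
val schema = Schema.Parser().parse(File("book.avsc"))
val authorSchema = schema.getField("author").schema()

// Build one record with a nested "author" record (sample values)
val author = GenericData.Record(authorSchema).apply {
    put("id", 1)
    put("firstName", "Leo")
    put("lastName", "Tolstoy")
}
val book = GenericData.Record(schema).apply {
    put("id", 1)
    put("title", "War and Peace")
    put("author", author)
    put("genre", "Novel")
    put("publisher", "The Russian Messenger")
}

// Write the record; AvroParquetWriter preserves the nested structure in the Parquet schema
AvroParquetWriter.builder<GenericRecord>(Path("books.parquet"))
    .withSchema(schema)
    .build()
    .use { writer -> writer.write(book) }
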
When this file is read with DataFrame.readParquet(), the nested field author is represented in a DataFrame as a ValueColumn containing a map in each cell:

[screenshot: the DataFrame with the author column shown as a ValueColumn holding a map in each cell]

Fields of this kind, however, could be represented as a ColumnGroup.
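
For comparison, here is a hypothetical illustration of the requested representation, built manually from flat columns with the existing group operation (the column names and sample values are made up for this sketch):

import org.jetbrains.kotlinx.dataframe.api.*

// Flat columns, as if the nested author fields were read individually (hypothetical names)
val df = dataFrameOf("id", "title", "authorId", "firstName", "lastName", "genre", "publisher")(
    1, "War and Peace", 1, "Leo", "Tolstoy", "Novel", "The Russian Messenger",
)

// Group the author-related columns into a ColumnGroup named "author" --
// the representation this issue asks readParquet() to produce for the nested field
val grouped = df.group { "authorId" and "firstName" and "lastName" }.into("author")

println(grouped.schema())  // author is now a column group with nested columns

If readParquet() produced this structure directly, nested fields would be accessible through the usual column-group selection API instead of through maps.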

The Parquet file mentioned above and the Kotlin Notebook from the screenshot are attached below (as a zip archive, since GitHub does not accept .parquet files).

parquet_file_and_notebook.zip

    Labels

    enhancement (New feature or request)
