enable customizing list inner child element name? #84

Open
AlJohri opened this issue Oct 30, 2022 · 3 comments

AlJohri commented Oct 30, 2022

When Spark writes a Parquet file, I believe it always names the inner list field element rather than item:

message spark_schema {
  ....
  OPTIONAL group mylistcolumn (LIST) {
    REPEATED group list {
      OPTIONAL BYTE_ARRAY element (UTF8);
    }
  }
  ...
}

It appears this crate (or one of its dependencies, perhaps arrow2 itself?) always assumes that the inner field name of a list is item rather than element.

Expected: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "item", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])

Actual: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "element", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])

I'm guessing this is because of this line of code?

arrow2::datatypes::DataType::List(Box::new(<T as ArrowField>::field("item")))

  1. If this is controlled by arrow2-convert, can we perhaps customize this via an annotation on the struct member?
  2. Should the default be re-evaluated if parquet-mr / Spark uses element?
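
To make question 1 concrete, here is a minimal sketch in Rust. It assumes simplified stand-in types rather than the real arrow2 definitions: `DataType`, `Field`, and the helper `list_of` are hypothetical names used only for illustration, and nothing here is an existing arrow2-convert option. It shows what letting the caller choose the inner field name, instead of hard-coding "item", could look like.

```rust
// Simplified stand-ins for arrow2's `DataType` and `Field`.
// These are assumptions for illustration, NOT the real arrow2 API.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    is_nullable: bool,
}

/// Hypothetical helper: builds a list type whose inner field name is chosen
/// by the caller instead of being hard-coded to "item".
fn list_of(inner_name: &str, inner: DataType, is_nullable: bool) -> DataType {
    DataType::List(Box::new(Field {
        name: inner_name.to_string(),
        data_type: inner,
        is_nullable,
    }))
}
```

With these stand-ins, `list_of("item", DataType::Int32, false)` and `list_of("element", DataType::Int32, false)` compare as unequal under the derived `PartialEq`, which mirrors the Expected/Actual mismatch above.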

P.S. Likely not related, but I ran into a very similar error in this other crate as well: timvw/qv#31

Author

AlJohri commented Oct 30, 2022

  3. And since this inner element name effectively doesn't matter, perhaps there's some way to deserialize to a struct regardless of what the inner list element is called?

Author

AlJohri commented Oct 30, 2022

Following up on 2:

Reading the documentation for use_compliant_nested_type here (copied below for reference), I see that the default for Arrow is indeed item so perhaps the default is fine? I think we just need to figure out a way to support "compliant Parquet nested type" as defined here.

use_compliant_nested_type : bool, default False

Whether to write compliant Parquet nested type (lists) as defined here, defaults to False. For use_compliant_nested_type=True, this will write into a list with 3-level structure where the middle level, named list, is a repeated group with a single field named element:

<list-repetition> group <name> (LIST) {
    repeated group list {
        <element-repetition> <element-type> element;
    }
}

For use_compliant_nested_type=False, this will also write into a list with 3-level structure, where the name of the single field of the middle level list is taken from the element name for nested columns in Arrow, which defaults to item:

<list-repetition> group <name> (LIST) {
    repeated group list {
        <element-repetition> <element-type> item;
    }
}
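
The two layouts quoted above differ only in the name of the innermost field. As a rough illustration, the Rust sketch below renders both from one template. The function `list_schema` and its parameters are hypothetical and not part of any library; `compliant` stands in for `use_compliant_nested_type`, and the repetition levels are fixed to `optional` for brevity.

```rust
/// Hypothetical helper that renders the three-level Parquet LIST layout.
/// When `compliant` is true the innermost field is always named "element";
/// otherwise it takes the Arrow inner field name passed in `arrow_inner_name`.
fn list_schema(
    name: &str,
    element_type: &str,
    compliant: bool,
    arrow_inner_name: &str,
) -> String {
    let leaf = if compliant { "element" } else { arrow_inner_name };
    format!(
        "optional group {} (LIST) {{\n    repeated group list {{\n        optional {} {};\n    }}\n}}",
        name, element_type, leaf
    )
}
```

Calling `list_schema("mylistcolumn", "binary", true, "item")` yields a schema whose leaf is named `element`, while passing `false` keeps the Arrow name `item`.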

Following up on 3:

I see some precedent in the C++ implementation for ignoring the internal field name when checking for equality: apache/arrow#13851.
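
A name-agnostic check along those lines might look like the following Rust sketch. It is an assumption-laden illustration, not the arrow2 or Arrow C++ implementation: `DataType`, `Field`, `field`, and `same_type_ignoring_list_names` are hypothetical stand-ins. The comparison ignores only the inner field name of a list; element types, nullability, and struct field names still have to match.

```rust
// Simplified stand-ins for arrow2's types (assumptions, NOT the real API).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<Field>),
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    is_nullable: bool,
}

/// Hypothetical convenience constructor for a non-nullable field.
fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type, is_nullable: false }
}

/// Compares two types while ignoring the name of a list's inner field.
fn same_type_ignoring_list_names(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        // Lists: skip the inner name, compare nullability and element type.
        (DataType::List(x), DataType::List(y)) => {
            x.is_nullable == y.is_nullable
                && same_type_ignoring_list_names(&x.data_type, &y.data_type)
        }
        // Structs: field names still matter; recurse into each field's type.
        (DataType::Struct(xs), DataType::Struct(ys)) => {
            xs.len() == ys.len()
                && xs.iter().zip(ys).all(|(x, y)| {
                    x.name == y.name
                        && x.is_nullable == y.is_nullable
                        && same_type_ignoring_list_names(&x.data_type, &y.data_type)
                })
        }
        // Primitives and mismatched variants fall back to strict equality.
        _ => a == b,
    }
}
```

Under this sketch, a struct whose list column uses `item` and one whose list column uses `element` are treated as the same type, while the derived `PartialEq` still reports them as different.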

Collaborator

ncpenke commented Oct 30, 2022

@AlJohri thanks for filing this. This is a bug, since, as you observed, the check should be name-agnostic.
