Clickhouse produces non-compliant Parquet lists #49997

Closed
slotrans opened this issue May 18, 2023 · 2 comments
Labels
potential bug To be reviewed by developers and confirmed/rejected.

Comments

@slotrans

slotrans commented May 18, 2023

Describe what's wrong

When Clickhouse writes Parquet, using either FORMAT Parquet or insert into table function file('...', Parquet), and the output contains an array/list column, the resulting schema has a LIST field (correct), containing a single repeated field named list (correct), containing a single field named item (NOT correct; it must be named element).

See the Parquet documentation here https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists
Relevant excerpt:

LIST must always annotate a 3-level structure:

<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}
  • The outer-most level must be a group annotated with LIST that contains a single field named list. The repetition of this level must be either optional or required and determines whether the list is nullable.
  • The middle level, named list, must be a repeated group with a single field named element.
  • The element field encodes the list's element type and repetition. Element repetition must be required or optional.
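To make the three rules above concrete, here is a minimal sketch of a checker for them. The `Node` class and the `list_violations` and `make_list` helpers are hypothetical names invented for this illustration; they are not part of Parquet, Arrow, or Clickhouse, and the sketch covers only the strict rules quoted here, not the spec's backward-compatibility rules.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for one Parquet schema node (not a real library type)."""
    name: str
    repetition: str                    # "required", "optional" or "repeated"
    annotation: Optional[str] = None   # e.g. "LIST"
    children: List["Node"] = field(default_factory=list)


def list_violations(outer: Node) -> List[str]:
    """Return the ways `outer` deviates from the 3-level LIST rules quoted above."""
    problems = []
    if outer.annotation != "LIST":
        problems.append("outer group is not annotated with LIST")
    if outer.repetition not in ("optional", "required"):
        problems.append("outer group must be optional or required")
    if len(outer.children) != 1:
        problems.append("outer group must contain exactly one field")
        return problems
    middle = outer.children[0]
    if middle.name != "list":
        problems.append(f"middle level is named {middle.name!r}, expected 'list'")
    if middle.repetition != "repeated":
        problems.append("middle level must be repeated")
    if len(middle.children) != 1:
        problems.append("middle level must contain exactly one field")
        return problems
    element = middle.children[0]
    if element.name != "element":
        problems.append(f"innermost field is named {element.name!r}, expected 'element'")
    if element.repetition not in ("optional", "required"):
        problems.append("element must be optional or required")
    return problems


def make_list(inner_name: str) -> Node:
    """Build array_of_num -> list -> <inner_name>, the shape described in this report."""
    inner = Node(inner_name, "optional")
    middle = Node("list", "repeated", None, [inner])
    return Node("array_of_num", "required", "LIST", [middle])


print(list_violations(make_list("item")))     # reports the innermost field name
print(list_violations(make_list("element")))  # reports nothing
```

With the innermost field named item, the only rule that fails is the name check on the third level, which is the deviation this issue is about.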

See also this Arrow issue apache/arrow#29781 which mentions the use of item rather than element

This behavior is consistent across all values of output_format_parquet_version (1.0, 2.4, 2.6, 2.latest).

Does it reproduce on recent release?

Yes.
Also with 23.5.1.570

How to reproduce

  1. Create a Parquet file with an array/list
% clickhouse local -q "select 1 as num, 'foo' as word, [1, 2, 3] as array_of_num FORMAT Parquet" > bug.parquet
  2. Inspect its schema
% duckdb -s "select name, type, converted_type, repetition_type, num_children, logical_type from parquet_schema('bug.parquet')"
┌──────────────┬────────────┬────────────────┬─────────────────┬──────────────┬────────────────────────────────┐
│     name     │    type    │ converted_type │ repetition_type │ num_children │          logical_type          │
│   varchar    │  varchar   │    varchar     │     varchar     │    int64     │            varchar             │
├──────────────┼────────────┼────────────────┼─────────────────┼──────────────┼────────────────────────────────┤
│ schema       │            │                │ REQUIRED        │            3 │                                │
│ num          │ INT32      │ UINT_8         │ REQUIRED        │              │ IntType(bitWidth, isSigned=0) │
│ word         │ BYTE_ARRAY │                │ REQUIRED        │              │                                │
│ array_of_num │            │ LIST           │ REQUIRED        │            1 │ ListType()                     │
│ list         │            │                │ REPEATED        │            1 │                                │
│ item         │ INT32      │ UINT_8         │ OPTIONAL        │              │ IntType(bitWidth, isSigned=0) │
└──────────────┴────────────┴────────────────┴─────────────────┴──────────────┴────────────────────────────────┘
  3. Observe the nested structure: array_of_num -> list -> item
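The nesting is only implicit in the flat parquet_schema output: each row's num_children says how many of the following nodes belong to it. As a rough sketch, assuming the rows are listed in depth-first order (which is how Parquet stores its flattened schema), the paths can be rebuilt like this. `ROWS` and `leaf_paths` are hypothetical names for this illustration, with the values copied by hand from the table above.

```python
from typing import List, Optional, Tuple

# (name, num_children) pairs copied from the parquet_schema output above;
# None stands for a leaf column, which has no children.
ROWS: List[Tuple[str, Optional[int]]] = [
    ("schema", 3),
    ("num", None),
    ("word", None),
    ("array_of_num", 1),
    ("list", 1),
    ("item", None),
]


def leaf_paths(rows: List[Tuple[str, Optional[int]]]) -> List[str]:
    """Return the dotted path of every leaf column, leaving out the root node."""
    it = iter(rows)

    def walk(prefix: List[str]) -> List[str]:
        # Consume one node, then recurse into as many children as it declares.
        name, num_children = next(it)
        path = prefix + [name]
        if not num_children:
            return [".".join(path)]
        found: List[str] = []
        for _ in range(num_children):
            found.extend(walk(path))
        return found

    _root_name, root_children = next(it)
    paths: List[str] = []
    for _ in range(root_children or 0):
        paths.extend(walk([]))
    return paths


for path in leaf_paths(ROWS):
    print(path)
```

The last path printed ends in list.item, where a compliant writer would produce list.element.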

Expected behavior

The field nested within list should be named element.

Additional context

Parquet files that deviate from the spec in this way are not correctly understood by Google BigQuery.

@slotrans slotrans added the potential bug To be reviewed by developers and confirmed/rejected. label May 18, 2023
@evillique
Member

@al13n321 @Avogar

@al13n321
Member

al13n321 commented May 19, 2023

> See also this Arrow issue apache/arrow#29781 which mentions the use of item rather than element

Arrow is what we're using. Thanks for pointing to the relevant issue!

So, on our side this amounts to either updating the Arrow library or calling enable_compliant_nested_types(). Sent a PR for the latter: #50001
