Merge pull request #58519 from Avogar/control-arrow-dict-indexes-type
Add settings for better control of indexes type in Arrow dictionary
alexey-milovidov committed Jan 13, 2024
2 parents 211c285 + 8d7c24a commit afb50f0
Showing 13 changed files with 139 additions and 74 deletions.
2 changes: 1 addition & 1 deletion docker/test/stateless/Dockerfile
@@ -46,7 +46,7 @@ RUN apt-get update -y \
p7zip-full \
&& apt-get clean

-RUN pip3 install numpy scipy pandas Jinja2
+RUN pip3 install numpy scipy pandas Jinja2 pyarrow

RUN mkdir -p /tmp/clickhouse-odbc-tmp \
&& wget -nv -O - ${odbc_driver_url} | tar --strip-components=1 -xz -C /tmp/clickhouse-odbc-tmp \
2 changes: 2 additions & 0 deletions docs/en/interfaces/formats.md
@@ -2356,6 +2356,8 @@ $ clickhouse-client --query="SELECT * FROM {some_table} FORMAT Arrow" > {filenam
### Arrow format settings {#parquet-format-settings}

- [output_format_arrow_low_cardinality_as_dictionary](/docs/en/operations/settings/settings-formats.md/#output_format_arrow_low_cardinality_as_dictionary) - enable output ClickHouse LowCardinality type as Dictionary Arrow type. Default value - `false`.
- [output_format_arrow_use_64_bit_indexes_for_dictionary](/docs/en/operations/settings/settings-formats.md/#output_format_arrow_use_64_bit_indexes_for_dictionary) - use 64-bit integer type for Dictionary indexes. Default value - `false`.
- [output_format_arrow_use_signed_indexes_for_dictionary](/docs/en/operations/settings/settings-formats.md/#output_format_arrow_use_signed_indexes_for_dictionary) - use signed integer type for Dictionary indexes. Default value - `true`.
- [output_format_arrow_string_as_string](/docs/en/operations/settings/settings-formats.md/#output_format_arrow_string_as_string) - use Arrow String type instead of Binary for String columns. Default value - `false`.
- [input_format_arrow_case_insensitive_column_matching](/docs/en/operations/settings/settings-formats.md/#input_format_arrow_case_insensitive_column_matching) - ignore case when matching Arrow columns with ClickHouse columns. Default value - `false`.
- [input_format_arrow_allow_missing_columns](/docs/en/operations/settings/settings-formats.md/#input_format_arrow_allow_missing_columns) - allow missing columns while reading Arrow data. Default value - `false`.
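
A usage sketch for the two new output settings listed in this hunk. `clickhouse-client` accepts settings as `--name=value` command-line options, so an export that writes `LowCardinality` columns as Arrow dictionaries with a chosen index type could be assembled roughly as follows. The helper name `arrow_export_command` and the table name `events` are hypothetical, and the helper only builds the argument list; it does not run the client.

```python
import shlex


def arrow_export_command(table, signed_indexes=True, use_64_bit_indexes=False):
    """Build a clickhouse-client argv that exports `table` in Arrow format.

    Hypothetical helper for illustration; it only assembles arguments.
    """

    def flag(value):
        # Boolean settings are passed as 0/1 on the command line.
        return "1" if value else "0"

    return [
        "clickhouse-client",
        f"--query=SELECT * FROM {table} FORMAT Arrow",
        # The index settings only take effect when LowCardinality columns
        # are written as Arrow dictionaries, so that setting is enabled too.
        "--output_format_arrow_low_cardinality_as_dictionary=1",
        f"--output_format_arrow_use_signed_indexes_for_dictionary={flag(signed_indexes)}",
        f"--output_format_arrow_use_64_bit_indexes_for_dictionary={flag(use_64_bit_indexes)}",
    ]


# Example: unsigned 64-bit dictionary indexes for a hypothetical table.
command = arrow_export_command("events", signed_indexes=False, use_64_bit_indexes=True)
print(shlex.join(command))
```

The printed line is shell-quoted, so it can be pasted into a terminal with the output redirected to a file.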
22 changes: 22 additions & 0 deletions docs/en/operations/settings/settings-formats.md
@@ -1269,6 +1269,28 @@ Possible values:

Default value: `0`.

### output_format_arrow_use_signed_indexes_for_dictionary {#output_format_arrow_use_signed_indexes_for_dictionary}

Use signed integer types instead of unsigned in `DICTIONARY` type of the [Arrow](../../interfaces/formats.md/#data-format-arrow) format during [LowCardinality](../../sql-reference/data-types/lowcardinality.md) output when `output_format_arrow_low_cardinality_as_dictionary` is enabled.

Possible values:

- 0 — Unsigned integer types are used for indexes in `DICTIONARY` type.
- 1 — Signed integer types are used for indexes in `DICTIONARY` type.

Default value: `1`.

### output_format_arrow_use_64_bit_indexes_for_dictionary {#output_format_arrow_use_64_bit_indexes_for_dictionary}

Use 64-bit integer type in `DICTIONARY` type of the [Arrow](../../interfaces/formats.md/#data-format-arrow) format during [LowCardinality](../../sql-reference/data-types/lowcardinality.md) output when `output_format_arrow_low_cardinality_as_dictionary` is enabled.

Possible values:

- 0 — Type for indexes in `DICTIONARY` type is determined automatically.
- 1 — 64-bit integer type is used for indexes in `DICTIONARY` type.

Default value: `0`.

### output_format_arrow_string_as_string {#output_format_arrow_string_as_string}

Use Arrow String type instead of Binary for String columns.
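
How the two dictionary settings documented in this hunk might combine can be sketched as a small function. This is an illustration, not the ClickHouse implementation: the function name `arrow_dictionary_index_type` is hypothetical, and the rule used for the "determined automatically" case (pick the narrowest integer width that can hold the largest index) is an assumption that the documentation above does not spell out.

```python
def arrow_dictionary_index_type(dictionary_size, use_signed_indexes=True, use_64_bit_indexes=False):
    """Return an Arrow integer type name for dictionary indexes.

    Assumption: in automatic mode the narrowest width that can represent
    the largest index (dictionary_size - 1) is chosen.
    """
    if dictionary_size < 0:
        raise ValueError("dictionary_size must be non-negative")

    prefix = "int" if use_signed_indexes else "uint"
    if use_64_bit_indexes:
        # Setting value 1: always use a 64-bit index type.
        return f"{prefix}64"

    max_index = max(dictionary_size - 1, 0)
    for bits in (8, 16, 32, 64):
        # A signed type loses one bit of positive range to the sign.
        value_bits = bits - 1 if use_signed_indexes else bits
        if max_index < 2 ** value_bits:
            return f"{prefix}{bits}"
    raise OverflowError("dictionary is too large for a 64-bit index")


for size in (100, 200, 70_000):
    print(size, arrow_dictionary_index_type(size), arrow_dictionary_index_type(size, use_signed_indexes=False))
```

Under this assumption, a signed index type reaches the next width sooner than an unsigned one, which is the trade-off the signed-by-default setting accepts.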
2 changes: 2 additions & 0 deletions src/Core/Settings.h
@@ -1100,6 +1100,8 @@ class IColumn;
M(UInt64, cross_to_inner_join_rewrite, 1, "Use inner join instead of comma/cross join if there're joining expressions in the WHERE section. Values: 0 - no rewrite, 1 - apply if possible for comma/cross, 2 - force rewrite all comma joins, cross - if possible", 0) \
\
M(Bool, output_format_arrow_low_cardinality_as_dictionary, false, "Enable output LowCardinality type as Dictionary Arrow type", 0) \
M(Bool, output_format_arrow_use_signed_indexes_for_dictionary, true, "Use signed integers for dictionary indexes in Arrow format", 0) \
M(Bool, output_format_arrow_use_64_bit_indexes_for_dictionary, false, "Always use 64 bit integers for dictionary indexes in Arrow format", 0) \
M(Bool, output_format_arrow_string_as_string, false, "Use Arrow String type instead of Binary for String columns", 0) \
M(Bool, output_format_arrow_fixed_string_as_fixed_byte_array, true, "Use Arrow FIXED_SIZE_BINARY type instead of Binary for FixedString columns.", 0) \
M(ArrowCompression, output_format_arrow_compression_method, "lz4_frame", "Compression method for Arrow output format. Supported codecs: lz4_frame, zstd, none (uncompressed)", 0) \
3 changes: 2 additions & 1 deletion src/Core/SettingsChangesHistory.h
@@ -82,7 +82,8 @@ namespace SettingsChangesHistory
static std::map<ClickHouseVersion, SettingsChangesHistory::SettingsChanges> settings_changes_history =
{
{"24.1", {{"print_pretty_type_names", false, true, "Better user experience."},
-{"input_format_json_read_bools_as_strings", false, true, "Allow to read bools as strings in JSON formats by default"}}},
+{"input_format_json_read_bools_as_strings", false, true, "Allow to read bools as strings in JSON formats by default"},
+{"output_format_arrow_use_signed_indexes_for_dictionary", false, true, "Use signed indexes type for Arrow dictionaries by default as it's recommended"}}},
{"23.12", {{"allow_suspicious_ttl_expressions", true, false, "It is a new setting, and in previous versions the behavior was equivalent to allowing."},
{"input_format_parquet_allow_missing_columns", false, true, "Allow missing columns in Parquet files by default"},
{"input_format_orc_allow_missing_columns", false, true, "Allow missing columns in ORC files by default"},
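
Each history entry records a setting name, its previous default, its new default, and a reason, keyed by the version that introduced the change. A table like this can be used to reconstruct the defaults of an older release. The sketch below illustrates that idea using only the entries visible in this hunk; the function names are hypothetical, and the lookup logic is a simplified assumption rather than the ClickHouse code.

```python
# Excerpt of the history shown above: version -> [(setting, previous default, new default)].
SETTINGS_CHANGES_HISTORY = {
    "24.1": [
        ("print_pretty_type_names", False, True),
        ("input_format_json_read_bools_as_strings", False, True),
        ("output_format_arrow_use_signed_indexes_for_dictionary", False, True),
    ],
    "23.12": [
        ("allow_suspicious_ttl_expressions", True, False),
        ("input_format_parquet_allow_missing_columns", False, True),
        ("input_format_orc_allow_missing_columns", False, True),
    ],
}

# Current defaults are the "new" values (no setting appears twice in this excerpt).
CURRENT_DEFAULTS = {
    name: new
    for changes in SETTINGS_CHANGES_HISTORY.values()
    for name, _previous, new in changes
}


def parse_version(version):
    """Turn "23.12" into (23, 12) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def defaults_as_of(compatibility_version, current_defaults=CURRENT_DEFAULTS):
    """Undo every change introduced after `compatibility_version`."""
    target = parse_version(compatibility_version)
    defaults = dict(current_defaults)
    # Walk from the newest version backwards, restoring previous values.
    for version in sorted(SETTINGS_CHANGES_HISTORY, key=parse_version, reverse=True):
        if parse_version(version) <= target:
            break
        for name, previous, _new in SETTINGS_CHANGES_HISTORY[version]:
            defaults[name] = previous
    return defaults


old = defaults_as_of("23.12")
print(old["output_format_arrow_use_signed_indexes_for_dictionary"])
```

With the 24.1 entry added by this commit, asking for 23.12 behaviour yields unsigned dictionary indexes again, which is why the default change is recorded in the history.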
2 changes: 2 additions & 0 deletions src/Formats/FormatFactory.cpp
@@ -182,6 +182,8 @@ FormatSettings getFormatSettings(ContextPtr context, const Settings & settings)
format_settings.with_types_use_header = settings.input_format_with_types_use_header;
format_settings.write_statistics = settings.output_format_write_statistics;
format_settings.arrow.low_cardinality_as_dictionary = settings.output_format_arrow_low_cardinality_as_dictionary;
format_settings.arrow.use_signed_indexes_for_dictionary = settings.output_format_arrow_use_signed_indexes_for_dictionary;
format_settings.arrow.use_64_bit_indexes_for_dictionary = settings.output_format_arrow_use_64_bit_indexes_for_dictionary;
format_settings.arrow.allow_missing_columns = settings.input_format_arrow_allow_missing_columns;
format_settings.arrow.skip_columns_with_unsupported_types_in_schema_inference = settings.input_format_arrow_skip_columns_with_unsupported_types_in_schema_inference;
format_settings.arrow.skip_columns_with_unsupported_types_in_schema_inference = settings.input_format_arrow_skip_columns_with_unsupported_types_in_schema_inference;
2 changes: 2 additions & 0 deletions src/Formats/FormatSettings.h
@@ -122,6 +122,8 @@ struct FormatSettings
{
UInt64 row_group_size = 1000000;
bool low_cardinality_as_dictionary = false;
bool use_signed_indexes_for_dictionary = false;
bool use_64_bit_indexes_for_dictionary = false;
bool allow_missing_columns = false;
bool skip_columns_with_unsupported_types_in_schema_inference = false;
bool case_insensitive_column_matching = false;
11 changes: 8 additions & 3 deletions src/Processors/Formats/Impl/ArrowBlockOutputFormat.cpp
@@ -53,9 +53,14 @@ void ArrowBlockOutputFormat::consume(Chunk chunk)
ch_column_to_arrow_column = std::make_unique<CHColumnToArrowColumn>(
header,
"Arrow",
-format_settings.arrow.low_cardinality_as_dictionary,
-format_settings.arrow.output_string_as_string,
-format_settings.arrow.output_fixed_string_as_fixed_byte_array);
+CHColumnToArrowColumn::Settings
+{
+format_settings.arrow.output_string_as_string,
+format_settings.arrow.output_fixed_string_as_fixed_byte_array,
+format_settings.arrow.low_cardinality_as_dictionary,
+format_settings.arrow.use_signed_indexes_for_dictionary,
+format_settings.arrow.use_64_bit_indexes_for_dictionary
+});
}

auto chunks = std::vector<Chunk>();
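
This hunk replaces three positional boolean arguments with a single `CHColumnToArrowColumn::Settings` aggregate. A rough Python analogue of that refactoring is shown below. It is not the ClickHouse code: the class name `ArrowColumnSettings` is hypothetical, the field names are borrowed from the hunk, and the defaults are assumed from the user-facing values in `Settings.h`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrowColumnSettings:
    """Python stand-in for the C++ settings aggregate (illustrative only)."""

    output_string_as_string: bool = False
    output_fixed_string_as_fixed_byte_array: bool = True
    low_cardinality_as_dictionary: bool = False
    use_signed_indexes_for_dictionary: bool = True
    use_64_bit_indexes_for_dictionary: bool = False


def describe(settings):
    """List the names of the flags that are switched on."""
    return [name for name, enabled in vars(settings).items() if enabled]


# Named fields make each flag explicit at the call site, so adding two more
# booleans does not lengthen an easily mis-ordered positional argument list.
settings = ArrowColumnSettings(
    low_cardinality_as_dictionary=True,
    use_signed_indexes_for_dictionary=False,
)
print(describe(settings))
```

Grouping the flags also means a future option only touches the settings type and the one place that fills it in, rather than every constructor signature.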