Add _error and _raw_message virtual columns for Kafka engine #20249
@@ -98,36 +101,53 @@ Chunk IRowInputFormat::generate()
if (!isParseError(e.code()))
throw;

if (params.allow_errors_num == 0 && params.allow_errors_ratio == 0)
throw;
if (!append_error_column)
Maybe we can just set `allow_errors_ratio` to 1 and add the column if `append_error_column` is set, so that this branch executes as before.
Why not just catch the exception on the caller side?
> Why not just catch the exception on the caller side?

We should handle only the current message: collect its error, then process the next one.
It's better not to break the loop.
> Maybe we can just set `allow_errors_ratio` to 1 and add the column if `append_error_column` is set, so that this branch executes as before.

Yes, I'll fix it.
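To illustrate why a ratio of 1 would effectively disable the rethrow, here is a rough sketch. The diff only shows the `allow_errors_num == 0 && allow_errors_ratio == 0` check; the threshold rule below (rethrow only when both limits are exceeded) is an assumption, and `ErrorLimits` and `shouldRethrow` are hypothetical names, not ClickHouse code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the two settings referenced in the diff.
struct ErrorLimits
{
    std::uint64_t allow_errors_num = 0;
    double allow_errors_ratio = 0.0;
};

// Assumed policy: rethrow once the error count exceeds the absolute limit
// AND the share of broken rows exceeds the relative limit.
bool shouldRethrow(const ErrorLimits & limits, std::uint64_t num_errors, std::uint64_t total_rows)
{
    assert(total_rows > 0 && num_errors <= total_rows);
    const double ratio = static_cast<double>(num_errors) / static_cast<double>(total_rows);
    return num_errors > limits.allow_errors_num && ratio > limits.allow_errors_ratio;
}
```

Under this assumed rule the ratio can never exceed 1, so `allow_errors_ratio = 1` means the parser never rethrows, while the default limits rethrow on the first broken row.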
I don't like the current approach.
It will not handle errors from readPrefix / readSuffix, and it will work only for IRowInputFormat-based formats.
We process a single message at a time.
You can put a try/catch in the loop (I've added a comment at the place where we should catch it).
Also, that approach:
- pollutes the clean interface of IRowInputFormat
- introduces extra state into IRowInputFormat
- may affect parser performance in the non-Kafka case.
To me, the approach with a try/catch in the Kafka code sounds much cleaner.
> Also, that approach:
> - pollutes the clean interface of IRowInputFormat
> - introduces extra state into IRowInputFormat
> - may affect parser performance in the non-Kafka case.
>
> To me, the approach with a try/catch in the Kafka code sounds much cleaner.

Sure, I agree with you. I'll change the current approach.
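A minimal sketch of the caller-side try/catch discussed in this thread, assuming one message is parsed at a time. It is not the ClickHouse implementation: `ParsedRow`, `Parser`, and `readMessages` are hypothetical names, and plain strings stand in for real columns.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one consumed message and its parse result.
struct ParsedRow
{
    std::string value;        // parsed data (left at its default on failure)
    std::string raw_message;  // filled only when parsing failed
    std::string error;        // filled only when parsing failed
};

// Hypothetical parser callback: throws std::exception on malformed input.
using Parser = std::function<std::string(const std::string &)>;

// Parse each message independently. The try/catch sits on the caller side,
// inside the loop, so one broken message does not abort the whole batch
// and the parser interface itself needs no extra state.
std::vector<ParsedRow> readMessages(const std::vector<std::string> & payloads, const Parser & parse)
{
    assert(static_cast<bool>(parse));

    std::vector<ParsedRow> rows;
    rows.reserve(payloads.size());

    for (const auto & payload : payloads)
    {
        ParsedRow row;
        try
        {
            row.value = parse(payload);
        }
        catch (const std::exception & e)
        {
            // Keep the default data value and record what went wrong.
            row.value.clear();
            row.raw_message = payload;
            row.error = e.what();
        }
        rows.push_back(std::move(row));
    }
    return rows;
}
```

Because the catch wraps the whole per-message call, it would also cover errors raised outside the row loop of a format (the readPrefix / readSuffix case raised above), provided those are reached through the same call.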
column->insertDefault();
}
}
syncAfterError();
Can we skip `allowSyncAfterError` here? Are `allowSyncAfterError = false` and `append_error_column` incompatible?
@@ -189,6 +221,7 @@ Block KafkaBlockInputStream::readImpl()
}
virtual_columns[6]->insert(headers_names);
virtual_columns[7]->insert(headers_values);
virtual_columns[8]->insert(payload);
Do we always need `_raw_message`?
@@ -29,7 +29,8 @@ class ASTStorage;
M(Char, kafka_row_delimiter, '\0', "The character to be considered as a delimiter in Kafka message.", 0) \
M(String, kafka_schema, "", "Schema identifier (used by schema-based formats) for Kafka engine", 0) \
M(UInt64, kafka_skip_broken_messages, 0, "Skip at least this number of broken messages from Kafka topic per block", 0) \
M(Bool, kafka_thread_per_consumer, false, "Provide independent thread for each consumer", 0)
M(Bool, kafka_thread_per_consumer, false, "Provide independent thread for each consumer", 0) \
M(Bool, kafka_auto_append_error_column, false, "Virtual column named _error for Kafka engine to collect the parser error.", 0)
Maybe just kafka_append_error_column
It should be named kafka_append_error_column.
auto result_block = non_virtual_header.cloneWithColumns(std::move(result_columns));
if (auto_append_error_column)
{
result_columns.pop_back();
Needs clarification
@@ -119,7 +143,14 @@ Block KafkaBlockInputStream::readImpl()
auto columns = chunk.detachColumns();
for (size_t i = 0, s = columns.size(); i < s; ++i)
{
result_columns[i]->insertRangeFrom(*columns[i], 0, columns[i]->size());
if (i != error_column_index)
I suppose it may be better to handle it in a separate loop: `error_column_index` is always the last column.
Thanks for your suggestion. I'll fix it.
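A rough sketch of the suggested restructuring, assuming the error column is always the last column of the parsed chunk. `Column` and `appendChunk` are hypothetical simplifications (vectors of strings), not the real ClickHouse column types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplified column: just a list of string cells.
using Column = std::vector<std::string>;

// Append a parsed chunk to the accumulated result.
// Assumption: the last column of `chunk` is the error column, and `result`
// holds only the data columns; errors are collected separately.
void appendChunk(std::vector<Column> & result, Column & error_out, const std::vector<Column> & chunk)
{
    assert(!chunk.empty());
    assert(result.size() + 1 == chunk.size());

    const std::size_t data_columns = chunk.size() - 1;

    // Data columns: a plain loop with no per-iteration index check.
    for (std::size_t i = 0; i < data_columns; ++i)
        result[i].insert(result[i].end(), chunk[i].begin(), chunk[i].end());

    // Error column: handled once, outside the loop.
    const Column & errors = chunk.back();
    error_out.insert(error_out.end(), errors.begin(), errors.end());
}
```

Splitting the work this way removes the `i != error_column_index` branch from the hot loop and makes the special role of the last column explicit.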
@@ -155,6 +186,7 @@ Block KafkaBlockInputStream::readImpl()
auto partition = buffer->currentPartition();
auto timestamp_raw = buffer->currentTimestamp();
auto header_list = buffer->currentHeaderList();
auto payload = buffer->currentPayload();
Line 170: put `try {} catch {}` around `read_message`.
Also: several options for handling errors were actually discussed:
From my perspective, the best would be to start by defining p. 6 and implement the rest after that. For Rabbit there is `rabbitmq_deadletter_exchange`.
see the comments + need tests
Thanks for your review and suggestions. I'm working on it.
I'm creating a new PR and closing this one.
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add two virtual columns (_error and _raw_message) to Kafka Engine.
Detailed description / Documentation draft:
Use case:
Create the Kafka engine table with kafka_auto_append_error_column = 1. The new parameter's default value is 0.
The table `data` is used to store the data read from Kafka.
The table `error` is used to store information about the rows that raised exceptions.