Upload Logs that Fail to Parse as a String in a Specified Table #22065

Open
rlazimi-dev opened this issue Mar 24, 2021 · 1 comment

Use case
Allow users to specify a table into which log rows that fail to parse are uploaded as raw strings when inserting data with the Template format.

Describe the solution you'd like
My first (admittedly naive) suggestion is to accumulate the characters of each row into a string as they are parsed. If the row parses successfully, it is inserted into the target table as usual; if parsing fails, all characters up to the next format_template_rows_between_delimiter are collected into a string and inserted into a backup table that the user names in a new setting (for example: invalid_logs_table = 'logs.unstructured_log_rows').
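As a minimal sketch of what such a backup table could look like (the table name comes from the example setting above; the columns are hypothetical, not part of any existing schema):

clickhouse-client --query="
    CREATE TABLE logs.unstructured_log_rows
    (
        raw_line    String,                 -- the unparsed row, everything up to the next delimiter
        inserted_at DateTime DEFAULT now()  -- when the failed row was captured
    )
    ENGINE = MergeTree
    ORDER BY inserted_at"

A single String column is enough to hold whatever fell between two row delimiters; the timestamp just makes later cleanup and reprocessing easier.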

I assume the insert currently fails hard because the implementation does not keep the raw input in memory while parsing, which I assume is done for efficiency. If that is the case, my second suggestion is for ClickHouse to skip ahead to the next format_template_rows_between_delimiter and at least return the line numbers of the rows that failed, while still uploading the rows that parse successfully. Keeping track of a line number is negligible compared to keeping the whole line.

Ideally I would like to be able to run a command similar to this:

cat example.log | clickhouse-client --query="INSERT INTO logs.main FORMAT Template SETTINGS format_template_row = 'row.template', format_template_resultset = 'log.template', format_template_rows_between_delimiter = '\n', invalid_logs_table = 'logs.unstructured_log_rows' "
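For comparison, the closest thing that exists today is the pair of settings input_format_allow_errors_num / input_format_allow_errors_ratio, which let an insert continue past malformed rows — but they silently drop those rows instead of capturing them or reporting their line numbers. Assuming they apply to the Template input format, a tolerant upload would look like:

cat example.log | clickhouse-client --query="INSERT INTO logs.main FORMAT Template SETTINGS format_template_row = 'row.template', format_template_resultset = 'log.template', format_template_rows_between_delimiter = '\n', input_format_allow_errors_num = 100"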

Describe alternatives you've considered
The alternative is to preprocess the log file so that every row is structured correctly before sending it to ClickHouse (a sketch of this follows below). The solution I'm suggesting is preferable because it avoids a separate preprocessing pass over the data and lets ClickHouse users keep well-formed logs and logs that still need processing in the same place.
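A rough sketch of that preprocessing alternative, assuming the row template is simple enough to mirror with a single regular expression (ROW_REGEX below is a placeholder, not a real setting) and a ClickHouse version that supports the LineAsString input format:

# Rows matching the template go to the main table as before.
grep -E "$ROW_REGEX" example.log | clickhouse-client --query="INSERT INTO logs.main FORMAT Template SETTINGS format_template_row = 'row.template', format_template_resultset = 'log.template', format_template_rows_between_delimiter = '\n'"
# Everything else is stored verbatim for later processing.
grep -Ev "$ROW_REGEX" example.log | clickhouse-client --query="INSERT INTO logs.unstructured_log_rows (raw_line) FORMAT LineAsString"

This reads the input twice and is exactly the extra pass the proposal would avoid.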

UnamedRus (Contributor) commented Mar 27, 2021

Redshift has a special system table for this, STL_LOAD_ERRORS:

query  | line_number | value | raw_line | err_reason
-------+-------------+-------+----------+----------------
4      |      3      |  1201 |  1201    | Invalid digit
4      |      3      |   126 |   126    | Invalid digit
4      |      3      |       |   aaa    | Invalid digit
(3 rows)

docs
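For reference, output of that shape can be pulled from STL_LOAD_ERRORS with a query like this (the column aliasing is illustrative):

SELECT query, line_number, raw_field_value AS value, raw_line, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 3;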
