Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV unquoted nulls and default values #6055

Merged
merged 13 commits into from Aug 1, 2019

Conversation

tavplubix
Copy link
Member

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is non-significant change.

Category (leave one):

  • Improvement

Short description (up to few sentences):
Fixes #5990

Detailed description (optional):
For CSV input format:

  • consider unquoted NULL literal as \N (if setting format_csv_unquoted_null_literal_as_null=1)
  • initialize null fields with default values if data type of this field is not nullable (if setting input_format_null_as_default=1)

@alexey-milovidov
Copy link
Member

Should we rename format_csv_unquoted_null_literal_as_null
to input_format_csv_unquoted_null_literal_as_null
to have a chance to introduce a setting
output_format_csv_unquoted_null_literal_as_null
to write NULLs as NULL?


auto check_for_null = [&istr, &settings, &null_prefix_len]
{
if (checkStringByFirstCharacterAndAssertTheRest("\\N", istr))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it Ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@tavplubix
Copy link
Member Author

Should we rename format_csv_unquoted_null_literal_as_null
to input_format_csv_unquoted_null_literal_as_null
to have a chance to introduce a setting
output_format_csv_unquoted_null_literal_as_null
to write NULLs as NULL?

We can use format_csv_unquoted_null_literal_as_null for input and output (like format_csv_delimiter), but it may be inconvenient if user wants to read nulls as \N or NULL and write nulls as \N.

@alexey-milovidov
Copy link
Member

We can use format_csv_unquoted_null_literal_as_null for input and output (like format_csv_delimiter), but it may be inconvenient if user wants to read nulls as \N or NULL and write nulls as \N.

Let's split.

if (null_prefix_len < buf.count())
istr.position() = buf.position();
else if (null_prefix_len > buf.count())
throw DB::Exception("Some characters were extracted from buffer, but nested parser did not read them",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this happen if there is something like NU in a place of numeric data type?
(and buffer is split in the middle of NU)
What will be the error message for the user?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can (also it can happen if format_csv_delimiter = U or L). I've changed the error message to print some diagnostic info.

@alexey-milovidov
Copy link
Member

00411_long_accurate_number_comparison

Will be fixed after merge with master.

@@ -211,6 +211,11 @@ Possible values:

Default value: 0.

## input_format_null_as_default {#settings-input_format_null_as_default}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be input_format_csv_null_as_default?

Suppose we want to implement similar logic for TSV. But in TSV it's much more dangerous, because there is no way to unambiguate from NULL and a String with 'NULL' value. But still usable (e.g. user has only numeric fields in a table). And this gives the motivation to allow the user to enable this logic separately for CSV and TSV.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This setting is about replacing nulls with default values for non-nullable data types (#2633 and #6033), not about parsing NULL as \N (input_format_null_as_default and input_format_csv_unquoted_null_literal_as_null are independent). Should we add a separate setting *_null_as_default for each format?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It's fine :)

@@ -24,5 +24,17 @@ echo '"2016-01-01 01:02:03","1"
1502792101,"3"
99999,"4"' | $CLICKHOUSE_CLIENT --query="INSERT INTO csv FORMAT CSV";

echo '\N, \N' | $CLICKHOUSE_CLIENT --input_format_null_as_default=1 --query="INSERT INTO csv FORMAT CSV";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to make a separate test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we should have a test case with complex defaults.

@tavplubix tavplubix merged commit 6625536 into master Aug 1, 2019
@tavplubix tavplubix deleted the csv_unquoted_nulls_and_default_values branch August 20, 2019 11:16
@KochetovNicolai KochetovNicolai added the pr-improvement Pull request with some product improvements label Sep 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-improvement Pull request with some product improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

load data with CSVWithNames or TabSeparatedWithNames got error when two columns continuous null
3 participants