CSV unquoted nulls and default values #6055

tavplubix · 2019-07-18T16:13:34Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is non-significant change.

Category (leave one):

Improvement

Short description (up to few sentences):
Fixes #5990

Detailed description (optional):
For the CSV input format:

- Treat an unquoted NULL literal as \N (when the setting format_csv_unquoted_null_literal_as_null=1).
- Initialize null fields with default values if the data type of the field is not Nullable (when the setting input_format_null_as_default=1).
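The intended semantics of the two settings can be illustrated with a minimal sketch. This is not the ClickHouse implementation: `CsvNullSettings`, `readNullableString`, `readNonNullableString` and `checkCsvNullSketch` are hypothetical names invented for illustration, the field is assumed to be already split out of the row, and quote handling is simplified (no unescaping of doubled quotes). How the two settings combine for a non-nullable column is also an assumption of the sketch.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the two settings described in this PR.
struct CsvNullSettings
{
    bool unquoted_null_literal_as_null = false; // format_csv_unquoted_null_literal_as_null
    bool null_as_default = false;               // input_format_null_as_default
};

// Interpret one raw CSV field for a Nullable(String) column.
// Returns std::nullopt for NULL, otherwise the string value.
std::optional<std::string> readNullableString(const std::string & raw, const CsvNullSettings & settings)
{
    if (raw == "\\N") // the two characters backslash and N always mean NULL
        return std::nullopt;
    if (settings.unquoted_null_literal_as_null && raw == "NULL")
        return std::nullopt; // unquoted NULL literal, only when the setting is on
    if (raw.size() >= 2 && raw.front() == '"' && raw.back() == '"')
        return raw.substr(1, raw.size() - 2); // quoted "NULL" stays an ordinary string
    return raw;
}

// Interpret one raw CSV field for a non-nullable String column.
std::string readNonNullableString(
    const std::string & raw, const CsvNullSettings & settings, const std::string & column_default)
{
    std::optional<std::string> value = readNullableString(raw, settings);
    if (value)
        return *value;
    if (settings.null_as_default)
        return column_default; // null field replaced by the column default
    throw std::runtime_error("NULL value for a non-nullable column");
}

// Small self-check of the behaviour sketched above.
inline void checkCsvNullSketch()
{
    CsvNullSettings settings;
    assert(readNullableString("NULL", settings) == std::optional<std::string>("NULL"));

    settings.unquoted_null_literal_as_null = true;
    assert(!readNullableString("NULL", settings));
    assert(readNullableString("\"NULL\"", settings) == std::optional<std::string>("NULL"));

    settings.null_as_default = true;
    assert(readNonNullableString("\\N", settings, "fallback") == "fallback");
}
```

In this sketch a quoted `"NULL"` is never treated as null, which is what makes the unquoted-literal rule workable for CSV.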

alexey-milovidov · 2019-07-20T00:05:12Z

Should we rename format_csv_unquoted_null_literal_as_null
to input_format_csv_unquoted_null_literal_as_null
to have a chance to introduce a setting
output_format_csv_unquoted_null_literal_as_null
to write NULLs as NULL?

alexey-milovidov · 2019-07-20T00:07:13Z

dbms/src/DataTypes/DataTypeNullable.cpp

+
+    auto check_for_null = [&istr, &settings, &null_prefix_len]
+    {
+        if (checkStringByFirstCharacterAndAssertTheRest("\\N", istr))


tavplubix · 2019-07-21T21:11:58Z

> Should we rename format_csv_unquoted_null_literal_as_null
> to input_format_csv_unquoted_null_literal_as_null
> to have a chance to introduce a setting
> output_format_csv_unquoted_null_literal_as_null
> to write NULLs as NULL?

We can use format_csv_unquoted_null_literal_as_null for both input and output (like format_csv_delimiter), but that may be inconvenient if a user wants to read nulls as either \N or NULL and write nulls as \N.

alexey-milovidov · 2019-07-22T19:06:50Z

> We can use format_csv_unquoted_null_literal_as_null for both input and output (like format_csv_delimiter), but that may be inconvenient if a user wants to read nulls as either \N or NULL and write nulls as \N.

Let's split.

alexey-milovidov · 2019-07-22T19:12:10Z

dbms/src/DataTypes/DataTypeNullable.cpp

+            if (null_prefix_len < buf.count())
+                istr.position() = buf.position();
+            else if (null_prefix_len > buf.count())
+                throw DB::Exception("Some characters were extracted from buffer, but nested parser did not read them",


Can this happen if there is something like NU in place of a numeric value (and the buffer is split in the middle of NU)?
What will the error message be for the user?

It can (it can also happen if format_csv_delimiter is U or L). I've changed the error message to print some diagnostic info.
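To make the failure mode concrete, here is a minimal sketch of probing a field for the unquoted NULL literal. It does not reproduce the read-buffer logic in DataTypeNullable.cpp: `NullProbe`, `probeUnquotedNull`, `fieldLength`, `probeOverrunsField` and `checkProbeSketch` are hypothetical names, and the input is assumed to be a plain in-memory string rather than a buffer that can be split at an arbitrary position.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

struct NullProbe
{
    bool is_null;         // the whole field was the unquoted literal NULL
    std::size_t consumed; // characters taken from the input while probing
};

// Consume characters while they match "NULL". If the literal is not
// completed, the consumed prefix must be handed back to the nested parser.
NullProbe probeUnquotedNull(std::string_view input, char delimiter)
{
    constexpr std::string_view literal = "NULL";
    std::size_t matched = 0;
    while (matched < literal.size() && matched < input.size() && input[matched] == literal[matched])
        ++matched;

    if (matched == literal.size())
    {
        const bool at_field_end = matched == input.size()
            || input[matched] == delimiter || input[matched] == '\n' || input[matched] == '\r';
        if (at_field_end)
            return {true, matched};
    }
    return {false, matched};
}

// Length of the first field, i.e. what a nested parser is allowed to read.
std::size_t fieldLength(std::string_view input, char delimiter)
{
    std::size_t length = 0;
    while (length < input.size() && input[length] != delimiter && input[length] != '\n')
        ++length;
    return length;
}

// True when the probe took more characters than the field contains, so the
// nested parser cannot read everything that was extracted.
bool probeOverrunsField(std::string_view input, char delimiter)
{
    const NullProbe probe = probeUnquotedNull(input, delimiter);
    return !probe.is_null && probe.consumed > fieldLength(input, delimiter);
}

// Small self-check of the cases discussed in the review comment.
inline void checkProbeSketch()
{
    assert(probeUnquotedNull("NULL,1", ',').is_null);
    assert(probeUnquotedNull("NU,1", ',').consumed == 2); // prefix to hand back
    assert(!probeOverrunsField("NU,1", ','));
    assert(probeOverrunsField("NUL1U2", 'U'));            // delimiter is a letter of NULL
}
```

With a comma delimiter the probe never consumes past the end of the field, so the prefix can always be replayed to the nested parser. With `U` as the delimiter the probe reads across the field boundary, which corresponds to the "extracted from buffer, but nested parser did not read them" situation. The sketch does not attempt to decide which interpretation should win when the delimiter is itself a letter of NULL.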

alexey-milovidov · 2019-07-22T23:50:53Z

00411_long_accurate_number_comparison

Will be fixed after merge with master.

alexey-milovidov · 2019-07-22T23:54:56Z

docs/en/operations/settings/settings.md

@@ -211,6 +211,11 @@ Possible values:

 Default value: 0.

+## input_format_null_as_default {#settings-input_format_null_as_default}


Shouldn't it be input_format_csv_null_as_default?

Suppose we want to implement similar logic for TSV. In TSV it is much more dangerous, because there is no way to distinguish NULL from a String with the value 'NULL'. It is still usable, though (e.g. when the user has only numeric fields in a table), and this gives a motivation to let the user enable this logic separately for CSV and TSV.

This setting is about replacing nulls with default values for non-nullable data types (#2633 and #6033), not about parsing NULL as \N (input_format_null_as_default and input_format_csv_unquoted_null_literal_as_null are independent). Should we add a separate setting *_null_as_default for each format?

No. It's fine :)

alexey-milovidov · 2019-07-22T23:56:04Z

dbms/tests/queries/0_stateless/00301_csv.sh

@@ -24,5 +24,17 @@ echo '"2016-01-01 01:02:03","1"
 1502792101,"3"
 99999,"4"' | $CLICKHOUSE_CLIENT --query="INSERT INTO csv FORMAT CSV";

+echo '\N, \N' | $CLICKHOUSE_CLIENT --input_format_null_as_default=1 --query="INSERT INTO csv FORMAT CSV";


Better to make a separate test.

And we should have a test case with complex defaults.

tavplubix added 3 commits July 18, 2019 16:43

use default if not nullable

668959b

parse unquoted NULL

6565d5c

add tests

4c8c516

tavplubix added the can be tested label Jul 18, 2019

tavplubix and others added 5 commits July 19, 2019 16:57

improvements

8146126

update docs

6467330

Merge branch 'master' into csv_unquoted_nulls_and_default_values

349d69c

optimization

fb06a85

Merge branch 'master' into csv_unquoted_nulls_and_default_values

92a8e00

alexey-milovidov reviewed Jul 20, 2019

alexey-milovidov reviewed Jul 22, 2019

rename format_csv_unquoted_null_literal_as_null to input_format_csv_unquoted_null_literal_as_null

96d0a06

alexey-milovidov reviewed Jul 22, 2019

tavplubix added 4 commits July 23, 2019 13:44

better error message

35b5769

Merge branch 'master' into csv_unquoted_nulls_and_default_values

89a4462

better test

87c7186

Merge branch 'csv_unquoted_nulls_and_default_values' of https://github.com/yandex/ClickHouse into csv_unquoted_nulls_and_default_values

c071f69

tavplubix merged commit 6625536 into master Aug 1, 2019

tavplubix deleted the csv_unquoted_nulls_and_default_values branch August 20, 2019 11:16

filimonov mentioned this pull request Aug 23, 2019

Empty values in comma-delimited data causes parsing issues #469

Closed

KochetovNicolai added the pr-improvement Pull request with some product improvements label Sep 19, 2019

qoega mentioned this pull request Feb 13, 2020

INSERT DateTime grab delimiter #4727

Closed
