Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing error if you query less columns than source has and some settings are enabled #52591

Closed
den-crane opened this issue Jul 25, 2023 · 7 comments · Fixed by #54513
Closed

Comments

@den-crane
Copy link
Contributor

den-crane commented Jul 25, 2023

https://fiddle.clickhouse.com/c43bf6f9-91f8-4209-b2d2-2d59247625e2

create table test(d Date,a String) Engine=Memory
as select  today(), 'a' from numbers(3);

insert into function file('test.csv', 'CSVWithNames') select * from test;

SELECT * FROM file('test.csv', 'CSVWithNames', 'd Date, a String')
SETTINGS
  input_format_skip_unknown_fields = 0,
  input_format_with_names_use_header = 0
Format Pretty;  
+------------+---+
|          d | a |
+------------+---+
| 2023-07-25 | a |
+------------+---+
| 2023-07-25 | a |
+------------+---+
| 2023-07-25 | a |
+------------+---+

-- query only one column of 2
SELECT d FROM file('test.csv', 'CSVWithNames', 'd Date, a String')
SETTINGS
  input_format_skip_unknown_fields = 0,
  input_format_with_names_use_header = 0;

Received exception from server (version 23.6.2):
Code: 117. DB::Exception: Received from localhost:9000. DB::Exception: Expected end of line: (at row 1)
: 
Row 1:
Column 0,   name: d, type: Date, parsed text: "<DOUBLE QUOTE>2023-07-25<DOUBLE QUOTE>"
ERROR: There is no line feed. "<DOUBLE QUOTE>" found instead.
 It's like your file has more columns than expected.
And if your file has the right number of columns, maybe it has an unquoted string value with a comma.

https://fiddle.clickhouse.com/97f8549e-e107-4a37-be37-ed1adc9dd69e

create table test(d Date,a String) Engine=Memory
as select  today(), 'a' from numbers(3);

insert into function file('test.csv', 'CSVWithNames') select * from test;

SELECT * FROM file('test.csv', 'CSVWithNames', 'd Date, a String')
SETTINGS
  input_format_skip_unknown_fields = 0,
  input_format_with_names_use_header = 1
Format Pretty;  
+------------+---+
|          d | a |
+------------+---+
| 2023-07-25 | a |
+------------+---+
| 2023-07-25 | a |
+------------+---+
| 2023-07-25 | a |
+------------+---+

SELECT d FROM file('test.csv', 'CSVWithNames', 'd Date, a String')
SETTINGS
  input_format_skip_unknown_fields = 0,
  input_format_with_names_use_header = 1;
Received exception from server (version 23.6.2):
Code: 117. DB::Exception: Received from localhost:9000. DB::Exception: Unknown field found in format header: 'a' at position 1
Set the 'input_format_skip_unknown_fields' parameter explicitly to ignore and proceed: While executing ParallelParsingBlockInputFormat: While executing File. (INCORRECT_DATA)

https://fiddle.clickhouse.com/446198bb-54a1-47cc-a900-c957059c409b

set allow_experimental_analyzer=1;

create table test(d Date,a String) Engine=Memory
as select  today(), 'a' from numbers(3);

insert into function file('test.csv', 'CSVWithNames') select * from test;

SELECT * FROM file('test.csv', 'CSVWithNames', 'd Date, a String')
SETTINGS
  input_format_with_names_use_header = 0
Format Pretty;  
+------------+---+
|          d | a |
+------------+---+
| 2023-07-25 | a |
+------------+---+
| 2023-07-25 | a |
+------------+---+
| 2023-07-25 | a |
+------------+---+


select d from file('test.csv', 'CSVWithNames', 'd Date, a String')
SETTINGS
  input_format_with_names_use_header = 0;

Received exception from server (version 23.6.2):
Code: 117. DB::Exception: Received from localhost:9000. DB::Exception: Expected end of line: (at row 1)
: 
Row 1:
Column 0,   name: d, type: Date, parsed text: "<DOUBLE QUOTE>2023-07-25<DOUBLE QUOTE>"
ERROR: There is no line feed. "<DOUBLE QUOTE>" found instead.
 It's like your file has more columns than expected.
And if your file has the right number of columns, maybe it has an unquoted string value with a comma.

: While executing ParallelParsingBlockInputFormat: While executing File. (INCORRECT_DATA)
@den-crane den-crane changed the title Parsing error if you queries less columns than source has and some settings are enabled Parsing error if you query less columns than source has and some settings are enabled Jul 25, 2023
@den-crane
Copy link
Contributor Author

@den-crane
Copy link
Contributor Author

@Avogar

@Avogar Avogar self-assigned this Sep 8, 2023
@Avogar
Copy link
Member

Avogar commented Sep 8, 2023

The problem is in setting input_format_with_names_use_header, we don't check if it's disabled during check if format CSVWithNames supports reading subset of columns. I will fix it

@praniti6111
Copy link

@Avogar We are planning to upgrade to v23.3.7.5 and are currently facing the above issue. Are you going to provide the fix in the same version ? What are the tentative timelines for this release ?

@Avogar
Copy link
Member

Avogar commented Sep 8, 2023

The fix will be in the upstream, but It will be backported to 23.3 and will be included in the patch release for 23.3 after the fix is created and merged. So, I cannot say the timeline because tere is no fix yet, I will start working on the fix next week

@praniti6111
Copy link

Thank you for the update @Avogar.

@Avogar
Copy link
Member

Avogar commented Sep 11, 2023

Investigated it a bit more, the problem with input_format_with_names_use_header=0 and parsing errors will be fixed, so this will work:

:) SELECT d FROM file('test.csv', 'CSVWithNames', 'd Date, a String')
SETTINGS
  input_format_skip_unknown_fields = 0,
  input_format_with_names_use_header = 0;

┌──────────d─┐
│ 2023-09-11 │
│ 2023-09-11 │
│ 2023-09-11 │
└────────────┘

:) select d from file('test.csv', 'CSVWithNames', 'd Date, a String') SETTINGS
  input_format_with_names_use_header = 0, input_format_skip_unknown_fields=1;

SELECT d
FROM file('test.csv', 'CSVWithNames', 'd Date, a String')
SETTINGS
    input_format_with_names_use_header = 0,
    input_format_skip_unknown_fields = 1

Query id: caa7b66f-e370-4f4f-bb5c-345a3a4bbbb6

┌──────────d─┐
│ 2023-09-11 │
│ 2023-09-11 │
│ 2023-09-11 │
└────────────┘

But parsing error in case when input_format_with_names_use_header = 1, input_format_skip_unknown_fields=0 won't be fixed, because it will require a lot of changes in the code and in the logic. And it's not worth it, you can just enable input_format_skip_unknown_fields, I don't see real usecases when it will be disabled when input_format_with_names_use_header=1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment