Respect input_format_allow_errors_num/input_format_allow_errors_ratio during schema inference #61095

@mneedham

I made a deliberately bad CSV file:

$ cat foo.csv
name,favColour
Mark,Blue
David
Giles,Red

I try to process it:

SELECT *
FROM file('foo.csv')

Query id: 42729c66-9642-4c79-8c77-a68228aa64a4


Elapsed: 0.031 sec.

Received exception:
Code: 636. DB::Exception: The table structure cannot be extracted from a CSV format file. Error:
Code: 117. DB::Exception: Rows have different amount of values. (INCORRECT_DATA) (version 24.3.1.469 (official build)).
You can specify the structure manually: (in file/uri /Users/m

That makes sense: the structure is bad. So I set input_format_allow_errors_num, which I expected would skip the bad row, and I also told it to use the CSVWithNames format. But it still throws the same error:

SELECT *
FROM file('foo.csv', CSVWithNames)
SETTINGS input_format_allow_errors_num = 5

Query id: 167b8033-9fa4-496f-b3e5-dd05cd3f8e04


Elapsed: 0.001 sec.

Received exception:
Code: 636. DB::Exception: The table structure cannot be extracted from a CSVWithNames format file. Error:
Code: 117. DB::Exception: Rows have different amount of values. (INCORRECT_DATA) (version 24.3.1.469 (official build)).
You can specify the structure manually: (in file/uri /Users/markhneedham/projects/videos/20240305-WindowFunctions/foo.csv). (CANNOT_EXTRACT_TABLE_STRUCTURE)

We can work around that by setting input_format_max_rows_to_read_for_schema_inference=1, which makes schema inference use only 1 row, but it would be simpler if input_format_allow_errors_num and input_format_allow_errors_ratio were also respected during schema inference.
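To make the request concrete, here is a minimal Python sketch of the behaviour I have in mind. It is not ClickHouse code: `infer_columns` is a hypothetical helper, its `allow_errors_num` parameter only mimics the setting of the same name, and input_format_allow_errors_ratio is left out for brevity. The idea is that inference takes the column names from the header and skips rows with a different number of values until the error budget is used up.

```python
import csv
import io

# The same deliberately bad CSV as foo.csv above.
SAMPLE = "name,favColour\nMark,Blue\nDavid\nGiles,Red\n"


def infer_columns(text, allow_errors_num=0):
    """Hypothetical sketch: take column names from the header row and
    skip rows whose value count differs, up to allow_errors_num rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    errors = 0
    kept = []
    for row in reader:
        if not row:
            # Ignore completely empty lines.
            continue
        if len(row) != len(header):
            errors += 1
            if errors > allow_errors_num:
                # Budget exhausted: fail the way inference fails today.
                raise ValueError("Rows have different amount of values")
            continue
        kept.append(row)
    return header, kept


header, rows = infer_columns(SAMPLE, allow_errors_num=5)
print(header)  # ['name', 'favColour']
print(rows)    # [['Mark', 'Blue'], ['Giles', 'Red']]
```

With `allow_errors_num=0` the same input raises, which corresponds to the current behaviour shown above.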

cc @Avogar
