Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow imports of CSV files with empty lines at the end of the file #49315

Closed
fneves opened this issue Apr 28, 2023 · 0 comments · Fixed by #50635
Closed

Allow imports of CSV files with empty lines at the end of the file #49315

fneves opened this issue Apr 28, 2023 · 0 comments · Fixed by #50635
Assignees
Labels

Comments

@fneves
Copy link

fneves commented Apr 28, 2023

Use case

While trying to import some CSV files into CH that I found scattered around the internet I noticed we can't properly import those CSV files. I noticed these files contain empty lines at the end of the file.

This can be reproduced by running this:

-- Table Creation
CREATE TABLE "default"."new-table" (
    "Name" String,
    "Sex" Nullable(String),
    "Age" Nullable(Int64),
    "Height (in)" Nullable(Int64),
    "Weight (lbs)" Nullable(Int64)
) 
ENGINE MergeTree()
ORDER BY Name;

-- Data Insertion
INSERT INTO "new-table"
SELECT "Name","Sex","Age","Height (in)","Weight (lbs)" from url('https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv', CSV, '"Name" String,"Sex" Nullable(String),"Age" Nullable(Int64),"Height (in)" Nullable(Int64),"Weight (lbs)" Nullable(Int64)') SETTINGS date_time_input_format='best_effort' ,format_csv_delimiter=',';

We can bypass this issue by changing the insert statement with:

INSERT INTO "new-table"
SELECT "c1","c2","c3","c4","c5" from url('https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv', CSV, '"c1" String,"c2" Nullable(String),"c3" Nullable(String),"c4" Nullable(String),"c5" Nullable(String)') 
SETTINGS date_time_input_format='best_effort',
         input_format_allow_errors_num=10,
         input_format_allow_errors_ratio=0.99,
         format_csv_delimiter=',';

While this works it is a bit cumbersome when we have no idea of how many empty lines a file can have at the end.

I would like to propose another setting like input_skip_trailing_empty_lines that could skip the trailing empty lines of the file. This could work for multiple formats, from CSV, TSV to JSONEachRow. We could simply ignore the lines.

Import statement could then be:

INSERT INTO "new-table"
SELECT "c1","c2","c3","c4","c5" from url('https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv', CSV, '"c1" String,"c2" Nullable(String),"c3" Nullable(String),"c4" Nullable(String),"c5" Nullable(String)') 
SETTINGS date_time_input_format='best_effort',
                   input_skip_trailing_empty_lines=1,
                   format_csv_delimiter=',';
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants