Support an arbitrary CSV delimiter #2263

Merged
alexey-milovidov merged 10 commits into ClickHouse:master from zhkvia:arbitrary-csv-delimiter
Apr 27, 2018

@zhkvia (Contributor) commented Apr 22, 2018

Added a format_csv_delimiter setting for specifying an arbitrary CSV delimiter.
How it can be used:

  • As a client argument:
    $ clickhouse-client --format_csv_delimiter=";" --query="INSERT INTO table FORMAT CSV"
  • As a session setting:
    :) SET format_csv_delimiter=';'
    
    SET format_csv_delimiter = ';'
    
    Ok.
    
    :) SELECT * FROM table FORMAT CSV

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

@zhkvia force-pushed the arbitrary-csv-delimiter branch from 23c6017 to 8cb4539 on April 22, 2018 17:30
Comment thread dbms/src/Interpreters/SettingsCommon.h Outdated
private:
    void checkStringIsACharacter(const String & x) const {
        if (x.size() != 1)
            throw Exception(std::string("A setting's value string has to be an exactly one character long"));
Member

It's not necessary to construct std::string explicitly.

Comment thread dbms/src/Interpreters/SettingsCommon.h Outdated
private:
    void checkStringIsACharacter(const String & x) const {
        if (x.size() != 1)
            throw Exception(std::string("A setting's value string has to be an exactly one character long"));
Member

Missing ErrorCodes.

Contributor

Style. Braces {} should be in a separate new line.

Comment thread dbms/src/Interpreters/SettingsCommon.h Outdated

void set(const Field & x)
{
    String s = safeGet<const String &>(x);
Member

We can use reference here to avoid copying.
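
Taken together, the review comments on SettingsCommon.h (no explicit std::string, an error code on the exception, braces on their own lines, a reference instead of a copy) point toward something like the sketch below. It is not the code that was merged: the Exception struct, the ILLUSTRATIVE_BAD_VALUE error code, and the parseDelimiterSetting helper are hypothetical stand-ins so that the example compiles on its own, and std::string is used in place of ClickHouse's String alias.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for this sketch only. ClickHouse has its own
// Exception class and ErrorCodes namespace, which are not reproduced here.
namespace ErrorCodes
{
    constexpr int ILLUSTRATIVE_BAD_VALUE = 1;
}

struct Exception : std::runtime_error
{
    Exception(const std::string & message, int code_)
        : std::runtime_error(message), code(code_)
    {
    }

    int code;
};

// Throws unless x is exactly one character long.
// The message is passed as a string literal (no explicit std::string),
// an error code is supplied, and braces sit on their own lines.
void checkStringIsACharacter(const std::string & x)
{
    if (x.size() != 1)
        throw Exception("A setting's value string has to be exactly one character long",
                        ErrorCodes::ILLUSTRATIVE_BAD_VALUE);
}

// Hypothetical setter body: the value is taken by const reference,
// so no copy of the string is made before validation.
char parseDelimiterSetting(const std::string & s)
{
    checkStringIsACharacter(s);
    return s[0];
}
```

With these definitions, parseDelimiterSetting(";") returns ';', while parseDelimiterSetting(";;") throws an Exception carrying the illustrative error code.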

Comment thread docs/en/formats/csv.md
When formatting, rows are enclosed in double quotes. A double quote inside a string is output as two double quotes in a row. There are no other rules for escaping characters. Date and date-time are enclosed in double quotes. Numbers are output without quotes. Values are separated by a delimiter&ast;. Rows are separated using the Unix line feed (LF). Arrays are serialized in CSV as follows: first the array is serialized to a string as in TabSeparated format, and then the resulting string is output to CSV in double quotes. Tuples in CSV format are serialized as separate columns (that is, their nesting in the tuple is lost).

When parsing, all values can be parsed either with or without quotes. Both double and single quotes are supported. Rows can also be arranged without quotes. In this case, they are parsed up to a comma or line feed (CR or LF). In violation of the RFC, when parsing rows without quotes, the leading and trailing spaces and tabs are ignored. For the line feed, Unix (LF), Windows (CR LF) and Mac OS Classic (CR LF) are all supported.
&ast;By default — `,`. See a [format_csv_delimiter](/docs/en/operations/settings/settings/#format_csv_delimiter) setting for additional info.
Member

Are you sure that Markdown supports HTML entities?

Contributor Author

It seems logical since markdown is compiled into HTML. Also, mkdocs in the docs/ directory compiles it correctly.
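
The output rules quoted in the csv.md excerpt above (string values in double quotes, an embedded double quote written as two double quotes, values joined by the configured delimiter, rows ended with LF) can be illustrated with a small formatter. This is a simplified sketch covering string values only, not ClickHouse's writer; quoteCsvString and formatCsvRow are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Wraps a value in double quotes and doubles every embedded double quote.
std::string quoteCsvString(const std::string & value)
{
    std::string out = "\"";
    for (char c : value)
    {
        if (c == '"')
            out += '"';
        out += c;
    }
    out += '"';
    return out;
}

// Joins quoted string values with the given one-character delimiter
// and terminates the row with a Unix line feed.
std::string formatCsvRow(const std::vector<std::string> & values, char delimiter)
{
    std::string row;
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        if (i != 0)
            row += delimiter;
        row += quoteCsvString(values[i]);
    }
    row += '\n';
    return row;
}
```

Because every string value is quoted, a value that happens to contain the delimiter character remains unambiguous whichever delimiter is chosen.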

@alexey-milovidov
Member

Almost everything is Ok, but tests are missing.
You may use simple functional tests (look at dbms/tests/queries directory).
Both input and output should be tested.
Please add test cases where an unquoted string contains a comma while the delimiter is a semicolon, or something like this:
abc,def;hello
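
To make the requested input case concrete: with a semicolon delimiter, the line above should parse as two values, abc,def and hello, because the comma is then an ordinary character. The following is a simplified sketch of that behaviour, not ClickHouse's parser; splitCsvLine is a hypothetical helper that handles only double-quoted and unquoted values and does no whitespace trimming.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits one CSV line into values using a configurable one-character delimiter.
// Double-quoted values may contain the delimiter; "" inside quotes means a literal quote.
std::vector<std::string> splitCsvLine(const std::string & line, char delimiter)
{
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;

    for (std::size_t i = 0; i < line.size(); ++i)
    {
        const char c = line[i];
        if (in_quotes)
        {
            if (c == '"')
            {
                if (i + 1 < line.size() && line[i + 1] == '"')
                {
                    current += '"'; // escaped quote inside a quoted value
                    ++i;
                }
                else
                    in_quotes = false; // closing quote
            }
            else
                current += c;
        }
        else if (c == '"' && current.empty())
            in_quotes = true; // opening quote at the start of a value
        else if (c == delimiter)
        {
            fields.push_back(current);
            current.clear();
        }
        else
            current += c; // any other character, including a comma when the delimiter is ';'
    }

    fields.push_back(current);
    return fields;
}
```

Calling splitCsvLine("abc,def;hello", ';') yields the two values abc,def and hello, whereas the same line with ',' as the delimiter yields abc and def;hello.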

@amosbird
Collaborator

amosbird commented Apr 24, 2018

Hmm, can we still call that a CSV format without a comma? I assume a new format with some delimiter argument would be a better choice. And there are some use cases for multi-byte delimiters, such as data exported from Netezza.

@alexey-milovidov alexey-milovidov merged commit 093c054 into ClickHouse:master Apr 27, 2018
@alexey-milovidov
Member

> Hmm, can we still call that a CSV format without a comma? I assume a new format with some delimiter argument would be a better choice.

Yes, it is controversial. In fact, many "CSV readers" are configurable in this way.
We may add a format with name 'DSV', but CSV is also Ok.

> And there are some use cases for multi-byte delimiters, such as data exported from Netezza.

Do they use a multibyte delimiter by default?

@amosbird
Collaborator

amosbird commented Apr 28, 2018

> Do they use a multibyte delimiter by default?

Nope. The default is |.

@hereTac

hereTac commented Jun 22, 2018

Is --format_csv_delimiter= supported in v1.1.54385-stable?
Error:
Bad arguments: unrecognised option '--format_csv_delimiter= '

@AntonSaykovsky

> Is --format_csv_delimiter= supported in v1.1.54385-stable?
> Error:
> Bad arguments: unrecognised option '--format_csv_delimiter= '

Version 1.1.54390 has the same error. =(((
