
Import csv data double quotes escaped by backslash | CSV #10624

Closed
shivakumarss opened this issue May 2, 2020 · 9 comments
@shivakumarss commented May 2, 2020

I am trying to import an HTML snippet that is part of one of the columns in a CSV file.
The snippet contains double quotes, and they are escaped with backslashes. The CSV was created using Apache Spark.

To illustrate the issue I have created just 2 columns with minimal data.

CREATE TABLE logs.processing (ts String, text String) ENGINE = Log
cat sample.csv
"Fri, 01 May 2020 06:47:05 UTC","<html id=\"html-div\">"

When the import command is issued, the following exception is thrown:

cat sample.csv | clickhouse-client --query="INSERT INTO logs.processing FORMAT CSV"
Code: 117. DB::Exception: Expected end of line
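For context, any strict RFC-4180-style parser trips over this line in the same way. A minimal sketch using Python's stdlib csv module as a stand-in for ClickHouse's CSV parser (which follows the same doubled-quote convention); this is illustrative only, not part of the original report:

```python
import csv
import io

# The problematic line from sample.csv: the quote inside the second
# field is escaped with a backslash, which RFC 4180 does not define.
line = '"Fri, 01 May 2020 06:47:05 UTC","<html id=\\"html-div\\">"'

# An RFC-4180-style reader treats the backslash as a literal character,
# so the embedded quote ends the quoted field early and the original
# value cannot be recovered.
row = next(csv.reader(io.StringIO(line)))
print(row)

assert row[0] == 'Fri, 01 May 2020 06:47:05 UTC'
assert row[1] != '<html id="html-div">'   # second field is mangled
```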

Just to add another case here: if I change sample.csv to

"Fri, 01 May 2020 06:47:05 UTC","col2"

it works fine.

Could you please help me with this issue?

Thanks.

@shivakumarss shivakumarss added the question Question? label May 2, 2020
@shivakumarss shivakumarss changed the title Import into table html content having double quotes escaped by backslash | CSV Import csv data double quotes escaped by backslash | CSV May 2, 2020
@alexey-milovidov (Member)

CSV has different escaping rules: a double quote is escaped by another double quote, not by a backslash.

Example:

"Fri, 01 May 2020 06:47:05 UTC","<html id=""html-div"">"

@alexey-milovidov (Member)

> this csv is created using apache spark

It means that either Apache Spark does not follow the CSV specification, or that you did not configure it correctly.

@shivakumarss (Author)

I was just going through the CSV specification; there are ways to escape quotes either with doubled double quotes or with a backslash.

https://en.wikipedia.org/wiki/Comma-separated_values

> CSV is a delimited text file that uses a comma to separate values (many implementations of CSV import/export tools allow other separators to be used; for example, the use of a "Sep=^" row as the first row in the *.csv file will cause Excel to open the file expecting caret "^" to be the separator instead of comma ","). Simple CSV implementations may prohibit field values that contain a comma or other special characters such as newlines. More sophisticated CSV implementations permit them, often by requiring " (double quote) characters around values that contain reserved characters (such as commas, double quotes, or less commonly, newlines). Embedded double quote characters may then be represented by a pair of consecutive double quotes,[13] or by prefixing a double quote with an escape character such as a backslash (for example in Sybase Central).

There are frameworks that allow the escape character to be specified when importing. Is there any way in CH to say, while importing, use backslash as the escape character?

@den-crane (Contributor) commented May 2, 2020

> Is there any way in CH to say, while importing, use backslash as the escape character?

No. CH supports only CSV where a double quote is escaped by another double quote.

@shivakumarss (Author)

Thanks a lot @den-crane for the clarification, it helps.
It would really help people to have this feature in place; otherwise the source CSV needs to be modified, or while importing I need to replace the backslash escapes with doubled double quotes.
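The rewrite mentioned above (turning backslash escapes into doubled quotes before the import) can be sketched as a small preprocessing step; this is an illustrative Python sketch, not anything from the thread:

```python
import csv
import io

# A line in Spark's default output: quotes escaped with backslashes.
src = '"Fri, 01 May 2020 06:47:05 UTC","<html id=\\"html-div\\">"\n'

# Read with backslash as the escape character, then rewrite using the
# doubled-quote convention that clickhouse-client's CSV format expects.
reader = csv.reader(io.StringIO(src), escapechar='\\', doublequote=False)
out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL)
for parsed in reader:
    writer.writerow(parsed)

print(out.getvalue())
# '"Fri, 01 May 2020 06:47:05 UTC","<html id=""html-div"">"\r\n'
```

The same idea scales to a streaming filter over stdin/stdout placed in front of clickhouse-client.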

Closing this ticket now.

@kokes commented Jun 11, 2020

@shivakumarss You can tell Spark to escape quotes using a second quote instead of the non-standard backslash. That way you'll get correctly exported data, which you can then import into CH as well as other tools. I wrote about this and recommended some options for df.read and df.write. Hope it helps.
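For reference, the write-side options being alluded to look roughly like this (a hedged sketch; `quote` and `escape` are Spark's documented CSV options, while `df`, `spark`, and the paths are placeholders):

```python
# Sketch: make Spark emit RFC-4180-style CSV instead of its default
# backslash escaping. Spark's CSV writer uses '\' as `escape` by default;
# setting it to '"' writes embedded quotes as "", which ClickHouse accepts.
df.write.option("quote", '"').option("escape", '"').csv("out/")

# Symmetrically, to read such a file back into Spark:
df = spark.read.option("escape", '"').csv("in/")
```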

@alexey-milovidov (Member)

It's also easy to read CSV with backslash escaping in ClickHouse with CustomSeparated format.
It is controlled by the settings:

format_custom_escaping_rule
format_custom_field_delimiter
format_custom_row_before_delimiter
format_custom_row_after_delimiter
format_custom_row_between_delimiter
format_custom_result_before_delimiter
format_custom_result_after_delimiter

@Damanotra commented Sep 12, 2022

> It's also easy to read CSV with backslash escaping in ClickHouse with CustomSeparated format. It is controlled by the settings: format_custom_escaping_rule, format_custom_field_delimiter, format_custom_row_before_delimiter, format_custom_row_after_delimiter, format_custom_row_between_delimiter, format_custom_result_before_delimiter, format_custom_result_after_delimiter

Hi, I have the exact same issue, and I import the data using the S3 engine. Since I need the _path information, I can't do any preprocessing (or can I?).

I am already using CustomSeparated, but I can't find the right combination of parameters for it. Any clue which combination of settings would work for this issue?

List of settings I have already tried:

settings:
format_custom_field_delimiter=','
result: it looks like it fails to parse the data, but without an error (screenshot attached)

settings:
format_custom_field_delimiter=','
format_custom_escaping_rule='CSV'
result: parsing is fine, but when it finds a double quote escaped by a backslash, it returns an error:

SQL Error [27] [07000]: Code: 27. DB::ParsingException: Cannot parse input: expected ',' before: ' Qanita \" Build/QKQ1.190910.002; wv) AppleWebKit/537.36 (KHTML like Gecko) Version/4.0 Chrome/95.0.4638.74 Mobile Safari/537.36",IDN,ID.SE,ID.SE.UP,90245,qxhfz':
Row 291616:
Column 0, name: maid, type: String, parsed text: "4cf4d112-976d-44d4-afea-b4c4fc4a5a81"
Column 1, name: latitude, type: Decimal(18, 8), parsed text: "2.108108"
Column 2, name: longitude, type: Decimal(18, 8), parsed text: "99.833336"
Column 3, name: timestamp, type: UInt32, parsed text: "1636367676"
Column 4, name: id_type, type: UInt8, parsed text: "1"
Column 5, name: ipv4, type: String, parsed text: "114.122.39.152"
Column 6, name: ipv6, type: String, parsed text: ""
Column 7, name: user_agent, type: String, parsed text: "Mozilla/5.0 (Linux; Android 9; SM-A730F Build/PPR1.180610.011; wv) AppleWebKit/537.36 (KHTML like Gecko) Version/4.0 Chrome/92.0.4515.166 Mobile Safari/537.36"

Row 291617:
Column 0, name: maid, type: String, parsed text: "c9e743ac-1d0d-4c87-8f3f-f610c34d283b"
Column 1, name: latitude, type: Decimal(18, 8), parsed text: "-5.127097"
Column 2, name: longitude, type: Decimal(18, 8), parsed text: "119.492226"
Column 3, name: timestamp, type: UInt32, parsed text: "1636367445"
Column 4, name: id_type, type: UInt8, parsed text: "1"
Column 5, name: ipv4, type: String, parsed text: "114.125.187.45"
Column 6, name: ipv6, type: String, parsed text: ""
Column 7, name: user_agent, type: String, parsed text: "Mozilla/5.0 (Linux; Android 10; "
ERROR: There is no delimiter between fields: expected ",", got " Qanita "

: While executing CustomSeparatedRowInputFormat: While executing S3: (at row 291617)
. (CANNOT_PARSE_INPUT_ASSERTION_FAILED) (version 22.8.1.2097 (official build))
, server ClickHouseNode(addr=http:172.19.28.178:18123, db=default)@-1765859537

settings:
format_custom_field_delimiter set to anything other than ',' doesn't work, which is expected since the data is CSV.
format_custom_escaping_rule set to Quoted, JSON, XML, or None fails to parse with an error; Raw gives the same result as Escaped.

Please help. Thank you

@Fanvadar
I was able to get it working with a query like this:

SELECT * FROM s3(
  '...',
  'CustomSeparatedWithNames',
  'field1 DateTime, field2 String, field3 String'
) SETTINGS format_custom_escaping_rule='JSON', format_custom_field_delimiter=',';


6 participants