Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CustomSeparated/Template formats don't work properly with CSV escaping rule and arbitrary field delimiter #39620

Closed
duhy-max opened this issue Jul 26, 2022 · 5 comments · Fixed by #43332
Assignees

Comments

@duhy-max
Copy link

duhy-max commented Jul 26, 2022

版本为:22.6.3.35

问题:使用CustomSeparated 格式导入多字符字段分隔符的文本中,文本内容不能包含多分隔中的第一个字符。

测试过程:

  • 创建表
CREATE TABLE tableA
(
    `c1` text,
    `c2` text,
    `c3` text
)
ENGINE = MergeTree
ORDER BY c1
  • cat tableA.txt
name1|!name2|!name3
name41|!name42|!name43
name51|!name|52|!name53

文件的分隔符号为 ”|!“
当导入tableA.txt 数据的时候,就会在name|52位置报错。 但是”|“ 字符是我需要的内容

[root@hostname ~]# cat tableA.txt | clickhouse client --query "insert into tableA FORMAT CustomSeparated" --format_custom_row_after_delimiter=$'\x0A' --format_custom_escaping_rule='CSV' --format_custom_field_delimiter=$'\x7C\x21' --max_insert_block_size=1000000
Code: 27. DB::ParsingException: Cannot parse input: expected '|!' before: '52|!name53\n':
Row 2:
Column 0,   name: c1, type: String, parsed text: "name41"
Column 1,   name: c2, type: String, parsed text: "name42"
Column 2,   name: c3, type: String, parsed text: "name43"

Row 3:
Column 0,   name: c1, type: String, parsed text: "name51"
Column 1,   name: c2, type: String, parsed text: "name"
ERROR: There is no delimiter between fields: expected "|!", got "52|!name53<LINE FEED>"

: While executing CustomSeparatedRowInputFormat: data for INSERT was parsed from stdin: (in query: insert into tableA FORMAT CustomSeparated): (at row 3)
. (CANNOT_PARSE_INPUT_ASSERTION_FAILED)

就像报错提示的内容一样,如果字段中含有分隔符(多个)的首个字符,则解析会认为是分隔符开始,但是这不是我期望的结果。因为name|52 就是一个完整的字符串。

@den-crane
Copy link
Contributor

select version()

@duhy-max
Copy link
Author

select version()

SELECT version()

Query id: 8a53cd85-5096-47df-95e8-9e011fa88880

┌─version()─┐
│ 22.6.3.35 │

Actually , i have test many version ,but all is not work well .

@den-crane
Copy link
Contributor

den-crane commented Jul 27, 2022

shorter example

$ echo "aa._bb"|clickhouse-local --structure "a String, b String"  --input-format CustomSeparated --format_custom_escaping_rule='CSV' --format_custom_field_delimiter='._' -q 'select * from table'
aa	bb

$ echo "a_a._bb"|clickhouse-local --structure "a String, b String"  --input-format CustomSeparated --format_custom_escaping_rule='CSV' --format_custom_field_delimiter='._' -q 'select * from table'
a_a	bb

$ echo '"a.a"._"bb"'|clickhouse-local --structure "a String, b String"  --input-format CustomSeparated --format_custom_escaping_rule='CSV' --format_custom_field_delimiter='._' -q 'select * from table'
a.a	bb

$ echo "a.a._bb"|clickhouse-local --structure "a String, b String"  --input-format CustomSeparated --format_custom_escaping_rule='CSV' --format_custom_field_delimiter='._' -q 'select * from table'
Code: 27. DB::ParsingException: Cannot parse input: expected '._' before: 'a._bb\n':
Row 1:
Column 0,   name: a, type: String, parsed text: "a"
ERROR: There is no delimiter between fields: expected "._", got "a._bb<LINE FEED>"

: While executing CustomSeparatedRowInputFormat: While executing File: (at row 1)
. (CANNOT_PARSE_INPUT_ASSERTION_FAILED)

it fails if the first character . of format_custom_field_delimiter ._ is met in the text a.a and text without quotes.

@duhy-max
Copy link
Author

duhy-max commented Jul 28, 2022

”it fails if the first character . of format_custom_field_delimiter ._ is met in the text a.a and text without quotes.“ @den-crane do you known how to resolve this problem

@Avogar
Copy link
Member

Avogar commented Oct 24, 2022

Yes, there is a problem with parsing of unquoted strings in CSV format in CustomSeparated/Template formats when deilimiter is large then 1 symbol. It happels because internal implementation of parsing data types in CSV format works only with char delimiter. I will try to fix this problem.

By now you can use Regexp format to read such files.

@Avogar Avogar changed the title 使用CustomSeparated 格式导入多字符字段分隔符的文本中,文本内容不能包含多分隔中的第一个字符 CustomSeparated/Template formats don't work properly with CSV escaping rule and arbitrary field delimiter Oct 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants