COPY INTO does not load UTF8 encoded text #6716
Last updated: 2019-09-02 16:05:28 +0200
Date: 2019-06-17 21:00:57 +0200
Created attachment 619
The dictionary extracted from the TREC Washington Post collection, as indexed by Anserini (Lucene), does not load into MonetDB.
I attached an extract that should load correctly, but does not.
With help from Spinque, we found that this dictionary extract does load correctly in their modified version of an older MonetDB, but not in the most recent one that I used (as distributed in the Fedora packages).
Radboud & Spinque tried: MonetDB v11.33.3 (Apr2019) - problem occurs.
Single quote escaping
The dictionary was processed to escape single quotes (right & left) as follows:
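(The exact command is not shown in the report; as an illustration only, a preprocessing step of this kind could look like the following, assuming GNU sed in a UTF-8 locale, with dict.txt and dict_escaped.csv as placeholder file names:)
$ sed "s/['‘’]/\\\\&/g" dict.txt > dict_escaped.csv
This prefixes every straight, left, or right single quote with a backslash.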
MonetDB complains about a misread character with a message like:
In many problem cases, the error is caused by a line shortly before the data quoted in the error message; but not always.
While debugging, I relied on a very useful UTF-8 Tool, and the following analyses:
Import still fails on many different characters that should be processed correctly (?).
Still no correct CSV import after all these modifications.
Even pretty standard characters (if you consider Greek and Cyrillic standard) are problematic:
Even now, the import still fails. I tried finding a block of characters to replace, but have not found the right pattern yet.
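A quick inventory of the non-ASCII characters in the extract can help narrow down such a pattern (a sketch, assuming GNU grep with PCRE support and a UTF-8 locale; dict_escaped.csv is a placeholder name):
$ grep -oP '[^\x00-\x7F]' dict_escaped.csv | sort | uniq -c | sort -rn | head
This lists the distinct non-ASCII characters in the file, most frequent first.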
I can work around the situation using
Date: 2019-06-19 17:53:37 +0200
Can you reproduce the bug?
Date: 2019-06-21 13:19:41 +0200
After a brief investigation this is what I found:
** There is indeed a problem:
$ cat /tmp/bug-report
$ mclient -d bugdb
** The problem is probably in the COPY INTO code
Inserting each of the lines individually works fine.
sql>create table bgtbl (i bigint, t text, f int);
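For contrast, roughly the following (a sketch: the table and file path follow the transcript above, but the text value is a placeholder, not the actual attachment content):
sql>copy into bgtbl from '/tmp/bug-report';              -- fails with a misread-character error
sql>insert into bgtbl values (41561, 'placeholder', 1);  -- works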
** The problem is actually on the fourth line:
$ cat /tmp/single_line.csv
$ mclient -d bugdb
Incidentally, I discovered that formatting in mclient is broken for Unicode strings, but the strings themselves are correct.
As a first conclusion, I would say that the bug is probably in the CSV parser. It seems that the kernel handles the Unicode strings in the attachment correctly if they are inserted from mclient.
I will take a more extensive look next week probably.
Date: 2019-06-21 13:51:32 +0200
One more comment/question:
I noticed that the third line contains three bytes after the string "2015": e2 80 8e, which according to the tool you mentioned are the UTF-8 encoding of the LEFT-TO-RIGHT MARK (http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=E2+80+8E&mode=bytes).
$ hexdump -C /tmp/bug-report
These bytes are preserved in the INSERT INTO statements:
and in the database:
$ mclient -d bugdb -s "select * from bgtbl where i=41561" | hexdump -C
but they seem to produce a problem in the CSV parsing:
$ hexdump -C /tmp/single_line.csv
sql>copy into bgtbl from '/tmp/single_line.csv';
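One way to spot such invisible marks directly (a sketch, assuming GNU grep with PCRE support; LC_ALL=C makes the hex escapes match raw bytes, and -n prints the line numbers of any matches):
$ LC_ALL=C grep -nP '\xE2\x80\x8E' /tmp/single_line.csv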
I was wondering if this mark is intended to be there or not.
Date: 2019-06-21 14:08:41 +0200
Well, in the real application, we would drop that specific entry, but the source data does contain that symbol.
Date: 2019-06-24 09:24:13 +0200
Thanks to input from another user, I found the following difference when specifying a string_quote (or not):
sql>select * from dict;
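The two variants were presumably along these lines (a sketch: the file name and field delimiter are assumptions; the point is that only the second statement explicitly sets a string_quote):
sql>copy into dict from '/tmp/dict.csv' using delimiters '|','\n';      -- fails on this data
sql>copy into dict from '/tmp/dict.csv' using delimiters '|','\n','"'; -- loads correctly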
So that helps, but why?
Thanks for the input, Arjen
Date: 2019-06-24 15:10:04 +0200
If I am not mistaken, the default delimiters are '|', '\n' and '"'. I agree that this is most probably a bug. I need to look in the code to understand what the semantics are for a quote specified as '', but this provides another hint to help with debugging.
Date: 2019-06-25 21:45:11 +0200
Actually, https://www.monetdb.org/bugzilla/show_bug.cgi?id=6716#c6 is not correct. If the user does not specify a quote character, then the CSV parser should NOT use a default one. The problem is an inconsistency in how we signify that fact internally: the CSV parser expects the quote character to be NULL if the user has not specified one, while the physical plan contains the value 0x80. As far as I can tell this value works if we assume that the files we are going to process contain only ASCII characters, since 0x80 is larger than any ASCII value.
On the other hand, in UTF-8 the byte 0x80 occurs as a continuation byte in many characters: for instance, the LEFT-TO-RIGHT MARK is encoded as 0xE2 0x80 0x8E. When the CSV parser encounters the byte 0x80 it starts a quoted string that lasts until the next 0x80 byte.
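The collision is easy to see concretely; the parser's sentinel value is a perfectly valid UTF-8 continuation byte (demonstration only):
$ printf '\xe2\x80\x8e' | hexdump -C
00000000  e2 80 8e                                          |...|
00000003
The middle byte is exactly the 0x80 that the physical plan uses as its "no quote specified" marker.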
The workaround you posted in your latest message works because it sets the quote value to something that does not appear in the file.
I count 5 bytes with the value 0x80 in the attached file, and this is why the parser fails: when it encounters EOF it is still inside a "quoted" string. Even if the number of bytes with this value were even, it would still fail in most cases, unless by chance the number of "quoted" field delimiters happened to be a multiple of the delimiters per line (i.e. 2 in the above CSV). In that case (ignoring any problems that might arise due to the schema of the table) it would insert fewer lines, with garbage in the text field.
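One way to verify that count from the shell (assuming the attachment is saved as /tmp/bug-report; LC_ALL=C makes tr operate on raw bytes, and '\200' is octal for 0x80):
$ LC_ALL=C tr -dc '\200' < /tmp/bug-report | wc -c
5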
Date: 2019-06-25 22:35:34 +0200
"0x80 is larger than any ASCII value" - indeed a pre-UTF-8 solution. Nice to understand the cause, and good to have a workaround.
Date: 2019-06-26 15:05:06 +0200
For complete details, see https://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=b2b0c0606d53
Date: 2019-06-26 15:05:09 +0200
For complete details, see https://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=22733760e10a
Date: 2019-06-26 17:41:01 +0200
Great, thank you folks!