Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: COPY FROM FORMAT CSV: quotes not stripped out unless immediately following a delimiter #9075

Open
philip-stoev opened this issue Nov 15, 2021 · 1 comment
Assignees
Labels
A-sql Area: SQL planning C-bug Category: something is broken T-correctness Theme: relates to consistency and correctness of results.

Comments

@philip-stoev
Copy link
Contributor

philip-stoev commented Nov 15, 2021

What version of Materialize are you using?

From the shell:

v0.9.13-dev (0d9feb389)

What was the issue?

I do not think this is covered under our existing documented differences, and it seems a distinct issue from #9074, but a string containing quotes for the second column is ingested differently in Mz and in Postgres.

Is the issue reproducible? If so, please provide reproduction instructions.

In Mz:

create table t1 (f1 text not null, f2 text not null);
materialize=> COPY t1 FROM STDIN WITH (FORMAT CSV);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> " a "," b "
>> " a " , " b "
>> \.
COPY 2
materialize=> select f1, length(f1), f2, length(f2) from t1;
  f1  | length |   f2   | length 
------+--------+--------+--------
  a   |      3 |  b     |      3
  a   |      4 |  " b " |      6
(2 rows)

On the second row, the quotes around "b" were preserved. Contrast with Postgres:

pstoev=# COPY t1 FROM STDIN WITH (FORMAT CSV);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> " a "," b "
>> " a " , " b "
>> \.
COPY 2
pstoev=# select f1, length(f1), f2, length(f2) from t1;
  f1  | length |  f2  | length 
------+--------+------+--------
  a   |      3 |  b   |      3
  a   |      4 |   b  |      4
(2 rows)

So the two products agree on everything (spaces et al) except the quotes in the second column of the second row.

@philip-stoev philip-stoev added the C-bug Category: something is broken label Nov 15, 2021
@philip-stoev philip-stoev added A-sql Area: SQL planning T-correctness Theme: relates to consistency and correctness of results. labels Nov 15, 2021
@sploiselle
Copy link
Contributor

tysm again, @philip-stoev!

The feature this requires is trimming whitespace (or at least spaces) from the beginning of input. It seems as if the library we're using assumes that all characters after a delimiter are definitely part of the intended value.

Despite being valid in PG, Cockroach also fails on this input, which our CSV parser simply doesn't support (i.e. it never fails). Seems like their library also fails to support the trim-like operation PG does.

@philip-stoev philip-stoev changed the title sql: quotes not stripped out unless immediately following a delimiter sql: COPY FROM FORMAT CSV: quotes not stripped out unless immediately following a delimiter Dec 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql Area: SQL planning C-bug Category: something is broken T-correctness Theme: relates to consistency and correctness of results.
Projects
SQL and Coordinator
Awaiting triage
Development

No branches or pull requests

2 participants