-
Notifications
You must be signed in to change notification settings - Fork 468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a design doc for S3 Sources with Headers #7407
Add a design doc for S3 Sources with Headers #7407
Conversation
I've added specific examples of syntax and rewriting rules for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The design proposed here seems good in isolation except that I think the foundation on which you're building has some structural flaws 🙈. Let me know if you agree. I think this is the time to address them, though!
CREATE SOURCE example (id, value) | ||
FROM S3 DISCOVER OBJECTS USING USING BUCKET SCAN 'bucket' | ||
FORMAT CSV WITH HEADER; | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It occurs to me that our existing purification for CSV files is broken. Since purification doesn't output the expected number of columns, replacing the CSV file with one with a different number of columns and then restarting Materialize can result in a corrupt catalog. I think we need to purify to something like:
CREATE SOURCE example (id, value)
FROM S3 DISCOVER OBJECTS USING USING BUCKET SCAN 'bucket'
FORMAT CSV WITH 4 COLUMNS, HEADER;
And that way the shape of the source is fully independent of the data.
It'd be nice if we could infer the number of columns from the (id, value)
aliases at the top, but alas, SQL is perfectly happy if you specify less aliases than there are columns:
postgres=# create view v (a) as select 1 as b, 2 as c;
CREATE VIEW
postgres=# select * from v;
a | c
---+---
1 | 2
(1 row)
postgres=#
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also no reason we shouldn't support
CREATE SOURCE example FORMAT CSV
and purify that to
CREATE SOURCE (column1, column2) example FORMAT CSV WITH 2 COLUMNS
or however many columns there actually were on the first line of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alas, SQL is perfectly happy if you specify less aliases than there are columns
Ah, is our current behavior a bug? We give errors if the number of columns specified doesn't exactly match the actual number of columns (it doesn't matter the format, but csv is easy to test):
$ cat /opt/blah.csv
foo,bar
materialize=> create materialized source csv (a) from file '/opt/blah.csv' format csv with 2 columns;
CREATE SOURCE
materialize=> select * from csv;
ERROR: Decode error: Text: CSV error at record number 1: expected 1 columns, got 2.
materialize=> drop source csv;
DROP SOURCE
materialize=> create materialized source csv (a, b, c) from file '/opt/blah.csv' format csv with 2 columns;
CREATE SOURCE
materialize=> select * from csv;
ERROR: Decode error: Text: CSV error at record number 1: expected 3 columns, got 2.
I was relying on the idea that we were doing this intentionally, otherwise I agree that we need to specify this more concretely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> materialize=> create materialized source csv (a, b, c) from file '/opt/blah.csv' format csv with 2 columns;
CREATE SOURCE
materialize=> select * from csv;
ERROR: Decode error: Text: CSV error at record number 1: expected 3 columns, got 2.
I mean this is a bug any way you slice it, right? Shouldn't be possible to ask for a 2 column CSV source and end up with a 3 column one... :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's fair.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, my biggest concern with that syntax was allowing users to specify that column aliases must match header column names, but we can just add a REQUIRE MATCHING COLUMN NAMES
or something to the CSV with options if it gets requested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So @benesch in your proposal, we entirely ignore the names of columns in the header for all purposes? And only care about the number of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reading of the proposal is that we only if the column aliases are present and complete, otherwise we still check the headers.
Since we check the actual headers as part of purification, this only happens once per source (not source instance) and doesn't happen when we're reading from the catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Columns are named column1, column2, etc., unless there is an alias for that column name
My interpretation of this language was that we would ignore the header names if there is no alias
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright the proposal is now pretty clear about this:
- use
WITH HEADER COLUMNS (name, ..)
for verification - allow aliases for renaming
|
||
For S3 sources, it's possible to specify that the *only* objects to ingest are | ||
the ones specified in an SQS queue. For this use case, reading from the SQS | ||
queue in order to determine what the header should be would be destructive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily advocating for this, but reading from SQS isn't destructive unless you explicitly delete what you've read. You could always attempt to read one message and then terminate the visibility timeout: https://docs.amazonaws.cn/en_us/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html#changing-message-visibility-timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately there is no guaranteed message delivery order, so if you don't delete messages then there is no guarantee of progress unless you delete messages, AFAIK. Certainly there's no guarantee of termination -- we'd just continuously get new messages from SQS in a hot loop if we don't delete them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Er, yeah, that's true? But that's about the data plane. The sentence I was responding to seems to be talking about why purify can't read from SQS (it's destructive), but my point is that purify could read from SQS if it wanted to, and then put the message back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, true! Yeah that has the same category of failure mode as scanning an empty bucket, so there is at least an argument that it should be equally valid to attempt.
f72d8ac
to
9789db1
Compare
9789db1
to
c2be524
Compare
c2be524
to
10414c6
Compare
Now they are both allowed, with column names being used for validation and column aliases being used for column naming.
10414c6
to
02d2dde
Compare
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145
This is the design doc for MaterializeInc/database-issues#2218