Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a design doc for S3 Sources with Headers #7407

Merged

Conversation

quodlibetor
Copy link
Contributor

@quodlibetor quodlibetor commented Jul 13, 2021

This is the design doc for #7145

@quodlibetor
Copy link
Contributor Author

I've added specific examples of syntax and rewriting rules for FORMAT CSV WITH HEADER (WITH n COLUMNS doesn't require any rewriting.

Copy link
Member

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design proposed here seems good in isolation except that I think the foundation on which you're building has some structural flaws 🙈. Let me know if you agree. I think this is the time to address them, though!

CREATE SOURCE example (id, value)
FROM S3 DISCOVER OBJECTS USING USING BUCKET SCAN 'bucket'
FORMAT CSV WITH HEADER;
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that our existing purification for CSV files is broken. Since purification doesn't output the expected number of columns, replacing the CSV file with one with a different number of columns and then restarting Materialize can result in a corrupt catalog. I think we need to purify to something like:

CREATE SOURCE example (id, value)
FROM S3 DISCOVER OBJECTS USING USING BUCKET SCAN 'bucket'
FORMAT CSV WITH 4 COLUMNS, HEADER;

And that way the shape of the source is fully independent of the data.

It'd be nice if we could infer the number of columns from the (id, value) aliases at the top, but alas, SQL is perfectly happy if you specify less aliases than there are columns:

postgres=# create view v (a) as select 1 as b, 2 as c;
CREATE VIEW
postgres=# select * from v;
 a | c 
---+---
 1 | 2
(1 row)

postgres=# 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also no reason we shouldn't support

CREATE SOURCE example FORMAT CSV

and purify that to

CREATE SOURCE (column1, column2) example FORMAT CSV WITH 2 COLUMNS

or however many columns there actually were on the first line of the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alas, SQL is perfectly happy if you specify less aliases than there are columns

Ah, is our current behavior a bug? We give errors if the number of columns specified doesn't exactly match the actual number of columns (it doesn't matter the format, but csv is easy to test):

$ cat /opt/blah.csv
foo,bar
materialize=> create materialized source csv (a) from file '/opt/blah.csv' format csv with 2 columns;
CREATE SOURCE
materialize=> select * from csv;
ERROR:  Decode error: Text: CSV error at record number 1: expected 1 columns, got 2.
materialize=> drop source csv;
DROP SOURCE
materialize=> create materialized source csv (a, b, c) from file '/opt/blah.csv' format csv with 2 columns;
CREATE SOURCE
materialize=> select * from csv;
ERROR:  Decode error: Text: CSV error at record number 1: expected 3 columns, got 2.

I was relying on the idea that we were doing this intentionally, otherwise I agree that we need to specify this more concretely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> materialize=> create materialized source csv (a, b, c) from file '/opt/blah.csv' format csv with 2 columns;
CREATE SOURCE
materialize=> select * from csv;
ERROR:  Decode error: Text: CSV error at record number 1: expected 3 columns, got 2.

I mean this is a bug any way you slice it, right? Shouldn't be possible to ask for a 2 column CSV source and end up with a 3 column one... :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's fair.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my biggest concern with that syntax was allowing users to specify that column aliases must match header column names, but we can just add a REQUIRE MATCHING COLUMN NAMES or something to the CSV with options if it gets requested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So @benesch in your proposal, we entirely ignore the names of columns in the header for all purposes? And only care about the number of them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reading of the proposal is that we only if the column aliases are present and complete, otherwise we still check the headers.

Since we check the actual headers as part of purification, this only happens once per source (not source instance) and doesn't happen when we're reading from the catalog.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Columns are named column1, column2, etc., unless there is an alias for that column name

My interpretation of this language was that we would ignore the header names if there is no alias

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright the proposal is now pretty clear about this:

  • use WITH HEADER COLUMNS (name, ..) for verification
  • allow aliases for renaming


For S3 sources, it's possible to specify that the *only* objects to ingest are
the ones specified in an SQS queue. For this use case, reading from the SQS
queue in order to determine what the header should be would be destructive.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily advocating for this, but reading from SQS isn't destructive unless you explicitly delete what you've read. You could always attempt to read one message and then terminate the visibility timeout: https://docs.amazonaws.cn/en_us/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html#changing-message-visibility-timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately there is no guaranteed message delivery order, so if you don't delete messages then there is no guarantee of progress unless you delete messages, AFAIK. Certainly there's no guarantee of termination -- we'd just continuously get new messages from SQS in a hot loop if we don't delete them.

Copy link
Member

@benesch benesch Jul 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Er, yeah, that's true? But that's about the data plane. The sentence I was responding to seems to be talking about why purify can't read from SQS (it's destructive), but my point is that purify could read from SQS if it wanted to, and then put the message back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, true! Yeah that has the same category of failure mode as scanning an empty bucket, so there is at least an argument that it should be equally valid to attempt.

@quodlibetor quodlibetor force-pushed the design-s3-sources-with-headers branch 3 times, most recently from f72d8ac to 9789db1 Compare July 26, 2021 18:30
@quodlibetor quodlibetor force-pushed the design-s3-sources-with-headers branch from 9789db1 to c2be524 Compare July 26, 2021 19:29
@quodlibetor quodlibetor force-pushed the design-s3-sources-with-headers branch from c2be524 to 10414c6 Compare July 26, 2021 19:32
Now they are both allowed, with column names being used for validation and
column aliases being used for column naming.
@quodlibetor quodlibetor force-pushed the design-s3-sources-with-headers branch from 10414c6 to 02d2dde Compare July 26, 2021 19:34
@quodlibetor quodlibetor merged commit 4c83e6c into MaterializeInc:main Jul 26, 2021
@quodlibetor quodlibetor deleted the design-s3-sources-with-headers branch July 26, 2021 21:31
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Jul 28, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Jul 28, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Jul 28, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Jul 28, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 2, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 2, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 3, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 4, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 11, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 11, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 11, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 11, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 18, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 18, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 23, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 27, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request Aug 27, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
aljoscha pushed a commit to aljoscha/materialize that referenced this pull request Aug 31, 2021
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog interacting less with the more general aliases SQL feature -- we
still put default column names in the SQL aliases if the format is specifed as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants