Add a design doc for S3 Sources with Headers #7407

quodlibetor · 2021-07-13T19:52:15Z

This is the design doc for MaterializeInc/database-issues#2218

quodlibetor · 2021-07-14T16:30:25Z

I've added specific examples of syntax and rewriting rules for FORMAT CSV WITH HEADER (WITH n COLUMNS doesn't require any rewriting.

benesch

The design proposed here seems good in isolation except that I think the foundation on which you're building has some structural flaws 🙈. Let me know if you agree. I think this is the time to address them, though!

benesch · 2021-07-19T06:08:07Z

doc/developer/design/20210713_S3_sources_with_headers.md

+CREATE SOURCE example (id, value)
+FROM S3 DISCOVER OBJECTS USING USING BUCKET SCAN 'bucket'
+FORMAT CSV WITH HEADER;
+```


It occurs to me that our existing purification for CSV files is broken. Since purification doesn't output the expected number of columns, replacing the CSV file with one with a different number of columns and then restarting Materialize can result in a corrupt catalog. I think we need to purify to something like:

CREATE SOURCE example (id, value) FROM S3 DISCOVER OBJECTS USING USING BUCKET SCAN 'bucket' FORMAT CSV WITH 4 COLUMNS, HEADER;

And that way the shape of the source is fully independent of the data.

It'd be nice if we could infer the number of columns from the (id, value) aliases at the top, but alas, SQL is perfectly happy if you specify less aliases than there are columns:

postgres=# create view v (a) as select 1 as b, 2 as c; CREATE VIEW postgres=# select * from v; a | c ---+--- 1 | 2 (1 row) postgres=#

There's also no reason we shouldn't support

CREATE SOURCE example FORMAT CSV

and purify that to

CREATE SOURCE (column1, column2) example FORMAT CSV WITH 2 COLUMNS

or however many columns there actually were on the first line of the file.

alas, SQL is perfectly happy if you specify less aliases than there are columns

Ah, is our current behavior a bug? We give errors if the number of columns specified doesn't exactly match the actual number of columns (it doesn't matter the format, but csv is easy to test):

$ cat /opt/blah.csv foo,bar

materialize=> create materialized source csv (a) from file '/opt/blah.csv' format csv with 2 columns; CREATE SOURCE materialize=> select * from csv; ERROR: Decode error: Text: CSV error at record number 1: expected 1 columns, got 2. materialize=> drop source csv; DROP SOURCE materialize=> create materialized source csv (a, b, c) from file '/opt/blah.csv' format csv with 2 columns; CREATE SOURCE materialize=> select * from csv; ERROR: Decode error: Text: CSV error at record number 1: expected 3 columns, got 2.

I was relying on the idea that we were doing this intentionally, otherwise I agree that we need to specify this more concretely.

> materialize=> create materialized source csv (a, b, c) from file '/opt/blah.csv' format csv with 2 columns; CREATE SOURCE materialize=> select * from csv; ERROR: Decode error: Text: CSV error at record number 1: expected 3 columns, got 2.

I mean this is a bug any way you slice it, right? Shouldn't be possible to ask for a 2 column CSV source and end up with a 3 column one... :/

Yeah that's fair.

Yeah, my biggest concern with that syntax was allowing users to specify that column aliases must match header column names, but we can just add a REQUIRE MATCHING COLUMN NAMES or something to the CSV with options if it gets requested.

So @benesch in your proposal, we entirely ignore the names of columns in the header for all purposes? And only care about the number of them?

My reading of the proposal is that we only if the column aliases are present and complete, otherwise we still check the headers.

Since we check the actual headers as part of purification, this only happens once per source (not source instance) and doesn't happen when we're reading from the catalog.

Columns are named column1, column2, etc., unless there is an alias for that column name

My interpretation of this language was that we would ignore the header names if there is no alias

Alright the proposal is now pretty clear about this:

use WITH HEADER COLUMNS (name, ..) for verification

allow aliases for renaming

benesch · 2021-07-19T06:13:29Z

doc/developer/design/20210713_S3_sources_with_headers.md

+
+For S3 sources, it's possible to specify that the *only* objects to ingest are
+the ones specified in an SQS queue. For this use case, reading from the SQS
+queue in order to determine what the header should be would be destructive.


Not necessarily advocating for this, but reading from SQS isn't destructive unless you explicitly delete what you've read. You could always attempt to read one message and then terminate the visibility timeout: https://docs.amazonaws.cn/en_us/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html#changing-message-visibility-timeout.

Unfortunately there is no guaranteed message delivery order, so if you don't delete messages then there is no guarantee of progress unless you delete messages, AFAIK. Certainly there's no guarantee of termination -- we'd just continuously get new messages from SQS in a hot loop if we don't delete them.

Er, yeah, that's true? But that's about the data plane. The sentence I was responding to seems to be talking about why purify can't read from SQS (it's destructive), but my point is that purify could read from SQS if it wanted to, and then put the message back.

Oh, true! Yeah that has the same category of failure mode as scanning an empty bucket, so there is at least an argument that it should be equally valid to attempt.

doc/developer/design/20210713_S3_sources_with_headers.md

Now they are both allowed, with column names being used for validation and column aliases being used for column naming.

This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specifed as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: #7145

quodlibetor requested a review from umanwizard July 14, 2021 13:22

benesch reviewed Jul 19, 2021

View reviewed changes

quodlibetor added 2 commits July 22, 2021 12:00

Add a design doc for S3 Sources with Headers

e809236

Include examples using existing CSV purification

fe1bff2

quodlibetor force-pushed the design-s3-sources-with-headers branch 3 times, most recently from f72d8ac to 9789db1 Compare July 26, 2021 18:30

umanwizard approved these changes Jul 26, 2021

View reviewed changes

doc/developer/design/20210713_S3_sources_with_headers.md Outdated Show resolved Hide resolved

doc/developer/design/20210713_S3_sources_with_headers.md Outdated Show resolved Hide resolved

quodlibetor force-pushed the design-s3-sources-with-headers branch from 9789db1 to c2be524 Compare July 26, 2021 19:29

Add new syntax for explicit named columns

f50cf0a

quodlibetor force-pushed the design-s3-sources-with-headers branch from c2be524 to 10414c6 Compare July 26, 2021 19:32

Update to make column aliases and named columns syntax complementary

02d2dde

Now they are both allowed, with column names being used for validation and column aliases being used for column naming.

quodlibetor force-pushed the design-s3-sources-with-headers branch from 10414c6 to 02d2dde Compare July 26, 2021 19:34

quodlibetor merged commit 4c83e6c into MaterializeInc:main Jul 26, 2021

quodlibetor deleted the design-s3-sources-with-headers branch July 26, 2021 21:31

quodlibetor mentioned this pull request Jul 28, 2021

Implement new CSV WITH HEADER COLUMNS syntax #7507

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a design doc for S3 Sources with Headers #7407

Add a design doc for S3 Sources with Headers #7407

quodlibetor commented Jul 13, 2021 •

edited

Loading

quodlibetor commented Jul 14, 2021

benesch left a comment

benesch Jul 19, 2021

benesch Jul 19, 2021

quodlibetor Jul 19, 2021

benesch Jul 20, 2021

quodlibetor Jul 21, 2021

quodlibetor Jul 26, 2021

umanwizard Jul 26, 2021

quodlibetor Jul 26, 2021

umanwizard Jul 26, 2021

quodlibetor Jul 26, 2021

benesch Jul 19, 2021

quodlibetor Jul 19, 2021

benesch Jul 20, 2021 •

edited

Loading

quodlibetor Jul 21, 2021

Add a design doc for S3 Sources with Headers #7407

Add a design doc for S3 Sources with Headers #7407

Conversation

quodlibetor commented Jul 13, 2021 • edited Loading

quodlibetor commented Jul 14, 2021

benesch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

benesch Jul 20, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

quodlibetor commented Jul 13, 2021 •

edited

Loading

benesch Jul 20, 2021 •

edited

Loading