
Implement new CSV WITH HEADER COLUMNS syntax #7507

Merged
2 commits merged into MaterializeInc:main from csv-headers-syntax on Aug 27, 2021

Conversation

@quodlibetor (Contributor) commented on Jul 22, 2021:

This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog without interacting with the more general SQL aliases feature.

A follow-up PR will add more verification of re-opened files, that is, files that
already exist in the catalog and whose headers may have changed between materialized
restarts, or that are empty but are being consumed as a file WITH (TAIL) stream.

Design: #7407
Part of: #7145

@quodlibetor marked this pull request as draft on July 22, 2021 21:40
@quodlibetor force-pushed the csv-headers-syntax branch 7 times, most recently from e00b885 to 77c7b63, on July 28, 2021 19:33
@quodlibetor changed the title from "[wip] csv headers syntax" to "Implement new CSV WITH HEADER COLUMNS syntax" on Jul 28, 2021
@quodlibetor marked this pull request as ready for review on July 28, 2021 21:19
@benesch (Member) left a comment:

Seems like a good start! Can you work with @philip-stoev to get some test coverage of this change across a version upgrade? Also needs doc updates and a release note!

src/sql-parser/src/ast/defs/ddl.rs
src/sql-parser/src/parser.rs
src/dataflow-types/src/types.rs (outdated)
src/dataflow-types/src/types.rs (outdated)
src/sql/src/pure.rs (outdated)
csv_header
.split(*delimiter as char)
.map(|name| name.to_string())
.collect::<Vec<_>>(),
@benesch (Member) commented:

I know this was here before but this does not seem like a valid way to parse a CSV header row? I think you need a proper CSV parser to account for the situation where the header names are double quoted.

@quodlibetor (Author) replied:

Yeah, I'm planning on fixing that in a follow-up PR; I should have added a TODO here to make that clear.

@philip-stoev (Contributor) commented:

The rust-csv crate seems to have a simple enough API; maybe it makes sense to just add this in this PR? Unless this fix is in a hurry, it makes more sense to me to put a proper CSV parser here.

@quodlibetor (Author) replied:

Yeah, the goal was to make the code review easier, not to delay the work. I'll just add it here, since now it's having the opposite effect.
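
A minimal sketch of what that could look like with the rust-csv crate; the function name and signature here are illustrative, not the PR's actual code:

use csv::ReaderBuilder;

/// Parse a single CSV header line into column names, honoring quoting.
fn parse_header_row(csv_header: &str, delimiter: u8) -> Result<Vec<String>, csv::Error> {
    let mut reader = ReaderBuilder::new()
        .delimiter(delimiter)
        .has_headers(false) // treat the line as a plain record rather than skipping it
        .from_reader(csv_header.as_bytes());
    match reader.records().next() {
        Some(record) => Ok(record?.iter().map(|s| s.to_string()).collect()),
        None => Ok(Vec::new()),
    }
}

Unlike the plain `split`, this yields ["first,name", "age"] for the input `"first,name",age`, since the quoted delimiter is not treated as a separator.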

src/sql/src/pure.rs (outdated)
src/sql/src/pure.rs (outdated)

> CREATE SOURCE csv_upgrade_explicit
FROM FILE '${testdrive.temp-dir}/upgrade-csv-with-headers.csv'
FORMAT CSV WITH HEADER (id, value)
@philip-stoev (Contributor) commented:

Yes, you are allowed to share a csv file between the "before" and "after" portion of the upgrade test. Testdrive is run such that the two .td tests will share the same temp-dir.

So if you can add a `check-from` .td file that checks the data that was ingested, that would be much appreciated. If SHOW SOURCE does not return any variable strings, you can also add such a statement to the `check-from` .td file.

@quodlibetor (Author) replied:

Right, that makes sense. Done, I think.

pub delimiter: u8,
}

impl CsvEncoding {
pub fn has_header_rows(&self) -> bool {
A contributor commented:

Not sure if typo, should that be has_header_row?

@quodlibetor (Author) replied:

Yeah, I was thinking of this in the context of multi-object sources (e.g. S3, hypothetical future multi-file sources), but any given file does have just one header row; I'll change it.
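
A sketch of the rename in context; the `CsvColumns` shape is inferred from snippets elsewhere in this review, and the `names` field is an assumption:

pub enum CsvColumns {
    /// `WITH n COLUMNS`: a fixed column count and no header row.
    Count(usize),
    /// `WITH HEADER [(name, ...)]`: the first row is a header row.
    Header { names: Vec<String> },
}

pub struct CsvEncoding {
    pub columns: CsvColumns,
    pub delimiter: u8,
}

impl CsvEncoding {
    /// Whether the first row of any given object is a header row rather than
    /// data; singular, since each file or object carries at most one header row.
    pub fn has_header_row(&self) -> bool {
        matches!(self.columns, CsvColumns::Header { .. })
    }
}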

src/sql-parser/src/ast/defs/ddl.rs
src/sql/src/pure.rs (outdated)
@philip-stoev (Contributor) left a comment:

Thank you for the tests; I cannot think of anything else to add here.

@benesch (Member) left a comment:

The code itself LGTM (🎉), but I'm confused about the upgrade test—is there a bug in it? Comments within.


> CREATE MATERIALIZED SOURCE csv_upgrade_no_header
FROM FILE '${testdrive.temp-dir}/upgrade-csv-with-headers.csv'
FORMAT CSV WITH 2 COLUMNS
@benesch (Member) commented:

Maybe I don't understand this new upgrade framework, but I don't understand what this test is testing. Doesn't it need to create these sources in v0.8.1, or some other old version that doesn't contain the new code?

@quodlibetor (Author) replied:

Oh yeah, that makes sense. I was guarding against future breakage, not ensuring that I didn't break something from the past.

@quodlibetor force-pushed the csv-headers-syntax branch 3 times, most recently from 8c92c41 to df108af, on August 2, 2021 19:16
@quodlibetor force-pushed the csv-headers-syntax branch 5 times, most recently from 780688f to 635c861, on August 11, 2021 15:59
@quodlibetor marked this pull request as ready for review on August 11, 2021 17:36
@quodlibetor (Author) commented:

@benesch and @sploiselle: the very last commit in this PR contains the new migration code, which is the only thing that has changed since the last LGTM and now needs review.

@sploiselle (Contributor) left a comment:

Migration itself LGTM.

@benesch (Member) left a comment:

Migration LGTM.

@@ -6,6 +6,7 @@
Method | Outcome
-------|--------
**HEADER** | Materialize reads the first line of the file to determine:<br/><br/>&bull; The number of columns in the file<br/><br/>&bull; The name of each column<br/><br/>The first line of the file is not ingested as data.
**HEADER** (name_list) | All of the same behaviors as bare **HEADER** with the additional features that:<br/><br/>&bull; Header names from source objects will be validated to exactly match those specified in the name list.<br/><br/>&bull; Specifying a column list allows using CSV format with sources that have headers but may not yet exist. Primarily this is intended for S3, but it also can work with CSV sources in Kafka or other streaming systems.
A contributor commented:

Let's get rid of the line about Kafka -- we don't support CSV files in Kafka at the moment, so it's misleading. (What we do support is per-message csv-encoded lines, but we don't try to extract headers from them, AFAIK.)

Comment on lines 485 to 488
if let Format::Csv { columns, delimiter } = format {
if !matches!(columns, CsvColumns::Header { .. }) {
return Ok(());
}
A contributor commented:

If you want, you should be able to roll both of these checks into the pattern-matching statement at the beginning of the function.
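
Concretely, the two checks can collapse into one match. This sketch assumes a simplified `Format` and a hypothetical surrounding function, reusing the assumed `CsvColumns` shape:

enum CsvColumns {
    Count(usize),
    Header { names: Vec<String> },
}

enum Format {
    Csv { columns: CsvColumns, delimiter: u8 },
    Other,
}

fn purify_csv(format: &Format) -> Result<(), String> {
    // Bind `columns` only when it is the `Header` variant; otherwise return
    // early, replacing the nested `if let` + `matches!` pair.
    let (columns, delimiter) = match format {
        Format::Csv { columns: columns @ CsvColumns::Header { .. }, delimiter } => {
            (columns, delimiter)
        }
        _ => return Ok(()),
    };
    // ... continue purification using `columns` and `delimiter` ...
    let _ = (columns, delimiter);
    Ok(())
}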

@@ -598,14 +603,43 @@ pub struct ProtobufEncoding {
pub message_name: String,
}

/// Encoding in CSV format, with `n_cols` columns per row, with an optional header.
/// Encoding in CSV format
A contributor commented:

I don't think this comment is useful, as it's just a restatement of the struct name. I asked Frank for advice and he suggested: `/// Arguments necessary to define how to decode from CSV format`.

}
}

/// What we know about the CSV columns
A contributor commented:

Can we document this a bit better too? In particular, we should probably call out that it changes the behavior of the decoder (i.e., that it determines whether the first row is a header row).
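
Along those lines, the doc comments might spell out the decoder behavior per variant. A wording sketch on the assumed enum shape, not the PR's final text:

/// What we know about the CSV columns, and how that knowledge changes the
/// behavior of the decoder.
pub enum CsvColumns {
    /// `WITH n COLUMNS`: every row, including the first, is decoded as data.
    Count(usize),
    /// `WITH HEADER [(name, ...)]`: the first row of each object is a header
    /// row; it is not ingested as data, and it supplies (or is validated
    /// against) the declared column names.
    Header { names: Vec<String> },
}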

src/sql/src/plan/statement/ddl.rs Show resolved Hide resolved
src/sql/src/pure.rs Outdated Show resolved Hide resolved
@quodlibetor force-pushed the csv-headers-syntax branch 3 times, most recently from b913e5e to 0c31578, on August 18, 2021 18:32
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request on Aug 23, 2021:

`unimplemented!` has almost made it through code review in a place where
`unsupported` was intended at least once, despite 5 people performing code
review (MaterializeInc#7507). The unimplemented macro is just shorthand around
`panic!("unimplemented");`, so require that form if folks really want to panic.
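
For context on the distinction that commit enforces, a minimal sketch; the error-returning helper is illustrative, not the repo's actual `unsupported` machinery:

// `unimplemented!()` panics at runtime, taking down the process:
fn decode_unimplemented() {
    unimplemented!() // shorthand for a panic with a "not implemented" message
}

// ...whereas an unsupported input should surface as a recoverable error:
fn decode_unsupported() -> Result<(), String> {
    Err("unsupported: CSV header validation for this source type".into())
}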
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog while interacting less with the more general SQL aliases feature -- we
still put default column names in the SQL aliases if the format is specified as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
@quodlibetor merged commit 7ff21e4 into MaterializeInc:main on Aug 27, 2021
@quodlibetor deleted the csv-headers-syntax branch on August 30, 2021 13:52