
Implement new CSV WITH HEADER COLUMNS syntax #7507

Merged
2 commits merged into MaterializeInc:main from csv-headers-syntax on Aug 27, 2021

Conversation

@quodlibetor (Contributor) commented on Jul 22, 2021:

This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog without interacting with the more general SQL aliases feature.

A follow-up PR will add more verification of re-opened files, that is, files that
already exist in the catalog and whose headers may have changed between materialized
restarts, or that are empty but are being consumed as a file WITH (TAIL) stream.

Design: #7407
Part of: #7145

@quodlibetor marked this pull request as draft on July 22, 2021 21:40
@quodlibetor force-pushed the csv-headers-syntax branch 7 times, most recently from e00b885 to 77c7b63, on July 28, 2021 19:33
@quodlibetor changed the title from "[wip] csv headers syntax" to "Implement new CSV WITH HEADER COLUMNS syntax" on Jul 28, 2021
@quodlibetor marked this pull request as ready for review on July 28, 2021 21:19
@benesch (Member) left a comment:

Seems like a good start! Can you work with @philip-stoev to get some test coverage of this change across a version upgrade? Also needs doc updates and a release note!

src/sql-parser/src/ast/defs/ddl.rs
src/sql-parser/src/parser.rs
src/dataflow-types/src/types.rs (outdated)
src/dataflow-types/src/types.rs (outdated)
src/sql/src/pure.rs (outdated)
csv_header
.split(*delimiter as char)
.map(|name| name.to_string())
.collect::<Vec<_>>(),
@benesch (Member) commented:

I know this was here before but this does not seem like a valid way to parse a CSV header row? I think you need a proper CSV parser to account for the situation where the header names are double quoted.

@quodlibetor (Author) replied:

Yeah, I'm planning on fixing that in a follow-up PR; I should have added a TODO here to make that clear.

@philip-stoev (Contributor) commented:

The rust-csv crate seems to have a simple enough API; maybe it makes sense to just add this in this PR? Unless this fix is in a hurry, it makes more sense to me to put a proper CSV parser here.

@quodlibetor (Author) replied:

Yeah, the goal was to make the code review easier, not to delay the work. I'll just add it here, since now it's having the opposite effect.
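
A minimal sketch of what that could look like with the rust-csv crate; the function name and signature here are illustrative, not the PR's actual code:

use csv::ReaderBuilder;

/// Parse a single CSV header line into column names, honoring quoting.
fn parse_header_row(csv_header: &str, delimiter: u8) -> Result<Vec<String>, csv::Error> {
    let mut reader = ReaderBuilder::new()
        .delimiter(delimiter)
        .has_headers(false) // treat the line as a plain record rather than skipping it
        .from_reader(csv_header.as_bytes());
    match reader.records().next() {
        Some(record) => Ok(record?.iter().map(|s| s.to_string()).collect()),
        None => Ok(Vec::new()),
    }
}

Unlike the plain `split`, this yields ["first,name", "age"] for the input `"first,name",age`, since the quoted delimiter is not treated as a separator.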

src/sql/src/pure.rs (outdated)
src/sql/src/pure.rs (outdated)

> CREATE SOURCE csv_upgrade_explicit
FROM FILE '${testdrive.temp-dir}/upgrade-csv-with-headers.csv'
FORMAT CSV WITH HEADER (id, value)
@philip-stoev (Contributor) commented:

Yes, you are allowed to share a csv file between the "before" and "after" portion of the upgrade test. Testdrive is run such that the two .td tests will share the same temp-dir.

So if you can add a `check-from` .td file that checks the data that was ingested, that would be much appreciated. If SHOW SOURCE does not return any variable strings, you can also add such a statement to the `check-from` .td file.

@quodlibetor (Author) replied:

Right, that makes sense. Done, I think.

pub delimiter: u8,
}

impl CsvEncoding {
pub fn has_header_rows(&self) -> bool {
A contributor commented:

Not sure if typo, should that be has_header_row?

@quodlibetor (Author) replied:

Yeah, I was thinking of this in the context of multi-object sources (e.g. S3, hypothetical future multi-file sources), but any given file does have just one header row; I'll change it.
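
A sketch of the rename in context; the `CsvColumns` shape is inferred from snippets elsewhere in this review, and the `names` field is an assumption:

pub enum CsvColumns {
    /// `WITH n COLUMNS`: a fixed column count and no header row.
    Count(usize),
    /// `WITH HEADER [(name, ...)]`: the first row is a header row.
    Header { names: Vec<String> },
}

pub struct CsvEncoding {
    pub columns: CsvColumns,
    pub delimiter: u8,
}

impl CsvEncoding {
    /// Whether the first row of any given object is a header row rather than
    /// data; singular, since each file or object carries at most one header row.
    pub fn has_header_row(&self) -> bool {
        matches!(self.columns, CsvColumns::Header { .. })
    }
}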

src/sql-parser/src/ast/defs/ddl.rs
src/sql/src/pure.rs (outdated)
@philip-stoev (Contributor) left a comment:

Thank you for the tests; I cannot think of anything else to add here.

@benesch (Member) left a comment:

The code itself LGTM (🎉), but I'm confused about the upgrade test—is there a bug in it? Comments within.


> CREATE MATERIALIZED SOURCE csv_upgrade_no_header
FROM FILE '${testdrive.temp-dir}/upgrade-csv-with-headers.csv'
FORMAT CSV WITH 2 COLUMNS
@benesch (Member) commented:

Maybe I don't understand this new upgrade framework, but I don't understand what this test is testing. Doesn't it need to create these sources in v0.8.1, or some other old version that doesn't contain the new code?

@quodlibetor (Author) replied:

Oh yeah, that makes sense. I was guarding against future breakage, not ensuring that I didn't break something from the past.

@quodlibetor force-pushed the csv-headers-syntax branch 3 times, most recently from 8c92c41 to df108af, on August 2, 2021 19:16
@quodlibetor force-pushed the csv-headers-syntax branch 5 times, most recently from 780688f to 635c861, on August 11, 2021 15:59
@quodlibetor marked this pull request as ready for review on August 11, 2021 17:36
@quodlibetor (Author) commented:

@benesch and @sploiselle: the very last commit in this PR contains the new migration code, which is the only thing that has changed since the last LGTM and now needs review.

@sploiselle (Contributor) left a comment:

Migration itself LGTM.

@benesch (Member) left a comment:

Migration LGTM.

@@ -6,6 +6,7 @@
Method | Outcome
-------|--------
**HEADER** | Materialize reads the first line of the file to determine:<br/><br/>&bull; The number of columns in the file<br/><br/>&bull; The name of each column<br/><br/>The first line of the file is not ingested as data.
**HEADER** (name_list) | All of the same behaviors as bare **HEADER** with the additional features that:<br/><br/>&bull; Header names from source objects will be validated to exactly match those specified in the name list.<br/><br/>&bull; Specifying a column list allows using CSV format with sources that have headers but may not yet exist. Primarily this is intended for S3, but it also can work with CSV sources in Kafka or other streaming systems.
A contributor commented:

Let's get rid of the line about Kafka -- we don't support CSV files in Kafka at the moment, so it's misleading. (What we do support is per-message csv-encoded lines, but we don't try to extract headers from them, AFAIK.)

Comment on lines 485 to 488
if let Format::Csv { columns, delimiter } = format {
if !matches!(columns, CsvColumns::Header { .. }) {
return Ok(());
}
A contributor commented:

If you want, you should be able to roll both of these checks into the pattern-matching statement at the beginning of the function.
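
Concretely, the two checks can collapse into one match. This sketch assumes a simplified `Format` and a hypothetical surrounding function, reusing the assumed `CsvColumns` shape:

enum CsvColumns {
    Count(usize),
    Header { names: Vec<String> },
}

enum Format {
    Csv { columns: CsvColumns, delimiter: u8 },
    Other,
}

fn purify_csv(format: &Format) -> Result<(), String> {
    // Bind `columns` only when it is the `Header` variant; otherwise return
    // early, replacing the nested `if let` + `matches!` pair.
    let (columns, delimiter) = match format {
        Format::Csv { columns: columns @ CsvColumns::Header { .. }, delimiter } => {
            (columns, delimiter)
        }
        _ => return Ok(()),
    };
    // ... continue purification using `columns` and `delimiter` ...
    let _ = (columns, delimiter);
    Ok(())
}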

@@ -598,14 +603,43 @@ pub struct ProtobufEncoding {
pub message_name: String,
}

/// Encoding in CSV format, with `n_cols` columns per row, with an optional header.
/// Encoding in CSV format
A contributor commented:

I don't think this comment is useful, as it's just a restatement of the struct name. I asked Frank for advice and he suggested: `/// Arguments necessary to define how to decode from CSV format`.

}
}

/// What we know about the CSV columns
A contributor commented:

Can we document this a bit better too? In particular, we should probably call out that it changes the behavior of the decoder (i.e., that it determines whether the first row is a header row).
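
Along those lines, the doc comments might spell out the decoder behavior per variant. A wording sketch on the assumed enum shape, not the PR's final text:

/// What we know about the CSV columns, and how that knowledge changes the
/// behavior of the decoder.
pub enum CsvColumns {
    /// `WITH n COLUMNS`: every row, including the first, is decoded as data.
    Count(usize),
    /// `WITH HEADER [(name, ...)]`: the first row of each object is a header
    /// row; it is not ingested as data, and it supplies (or is validated
    /// against) the declared column names.
    Header { names: Vec<String> },
}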

src/sql/src/plan/statement/ddl.rs Show resolved Hide resolved
src/sql/src/pure.rs Outdated Show resolved Hide resolved
@quodlibetor force-pushed the csv-headers-syntax branch 3 times, most recently from b913e5e to 0c31578, on August 18, 2021 18:32
quodlibetor added a commit to quodlibetor/materialize that referenced this pull request on Aug 23, 2021:

`unimplemented!` has almost made it through code review in a place where
`unsupported` was intended at least once, despite 5 people performing code
review (MaterializeInc#7507). The unimplemented macro is just shorthand around
`panic!("unimplemented");`, so require that form if folks really want to panic.
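
For context on the distinction that commit enforces, a minimal sketch; the error-returning helper is illustrative, not the repo's actual `unsupported` machinery:

// `unimplemented!()` panics at runtime, taking down the process:
fn decode_unimplemented() {
    unimplemented!() // shorthand for a panic with a "not implemented" message
}

// ...whereas an unsupported input should surface as a recoverable error:
fn decode_unsupported() -> Result<(), String> {
    Err("unsupported: CSV header validation for this source type".into())
}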
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog while interacting less with the more general SQL aliases feature -- we
still put default column names in the SQL aliases if the format is specified as
`WITH n COLUMNS`.

Design: MaterializeInc#7407
Part of: MaterializeInc#7145
@quodlibetor merged commit 7ff21e4 into MaterializeInc:main on Aug 27, 2021
@quodlibetor deleted the csv-headers-syntax branch on August 30, 2021 13:52