Implement new CSV WITH HEADER COLUMNS syntax #7507
Conversation
Seems like a good start! Can you work with @philip-stoev to get some test coverage of this change across a version upgrade? Also needs doc updates and a release note!
src/sql/src/pure.rs (outdated):

    csv_header
        .split(*delimiter as char)
        .map(|name| name.to_string())
        .collect::<Vec<_>>(),
I know this was here before but this does not seem like a valid way to parse a CSV header row? I think you need a proper CSV parser to account for the situation where the header names are double quoted.
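To illustrate the reviewer's point, here is a minimal, self-contained sketch (not the PR's code, which would more likely use the rust-csv crate) showing how a plain `split` mangles a double-quoted header name that contains the delimiter, and how a quote-aware parse recovers the intended names:

```rust
/// Naive approach, as in the code under review: split on the raw delimiter.
fn naive_header(line: &str, delimiter: char) -> Vec<String> {
    line.split(delimiter).map(|s| s.to_string()).collect()
}

/// Toy quote-aware parser: track whether we are inside double quotes and
/// only split on the delimiter outside them ("" inside quotes is an escape).
fn quoted_header(line: &str, delimiter: char) -> Vec<String> {
    let mut fields = vec![String::new()];
    let mut in_quotes = false;
    let mut chars = line.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '"' if in_quotes && chars.peek() == Some(&'"') => {
                chars.next(); // escaped quote "" becomes a literal "
                fields.last_mut().unwrap().push('"');
            }
            '"' => in_quotes = !in_quotes,
            c if c == delimiter && !in_quotes => fields.push(String::new()),
            c => fields.last_mut().unwrap().push(c),
        }
    }
    fields
}

fn main() {
    // A header whose first column name contains the delimiter, so it is quoted.
    let line = "\"id,ver\",value";
    // The naive split produces three mangled names...
    assert_eq!(naive_header(line, ','), ["\"id", "ver\"", "value"]);
    // ...while the quote-aware parse recovers the intended two.
    assert_eq!(quoted_header(line, ','), ["id,ver", "value"]);
}
```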
Yeah, I'm planning on fixing that in a follow-up PR; I should have added a TODO here to make that clear.
The rust-csv crate seems to have a simple enough API; maybe it makes sense to just add this in this PR? Unless this fix is in a hurry, it makes more sense to me to put a proper CSV parser here.
Yeah, the goal was to make the code review easier, not to delay the work. Since it's now the opposite, I'll just add it here.
    > CREATE SOURCE csv_upgrade_explicit
      FROM FILE '${testdrive.temp-dir}/upgrade-csv-with-headers.csv'
      FORMAT CSV WITH HEADER (id, value)
Yes, you are allowed to share a CSV file between the "before" and "after" portions of the upgrade test. Testdrive is run such that the two .td tests will share the same temp-dir.
So if you can add a `check-from` .td file that checks the data that was ingested, that would be much appreciated. If `SHOW SOURCE` does not return any variable strings, you can also add such a statement to the `check-from` .td file.
right, that makes sense, done I think.
src/dataflow-types/src/types.rs (outdated):

        pub delimiter: u8,
    }

    impl CsvEncoding {
        pub fn has_header_rows(&self) -> bool {
Not sure if typo: should that be `has_header_row`?
Yeah, I was thinking of this in the context of multi-object sources (e.g. S3, hypothetical future multi-file sources), but any given file does just have one header row. I'll change it.
src/sql/src/pure.rs (outdated):

    csv_header
        .split(*delimiter as char)
        .map(|name| name.to_string())
        .collect::<Vec<_>>(),
Thank you for the tests, I cannot think of anything else to add here.
The code itself LGTM (🎉), but I'm confused about the upgrade test—is there a bug in it? Comments within.
    > CREATE MATERIALIZED SOURCE csv_upgrade_no_header
      FROM FILE '${testdrive.temp-dir}/upgrade-csv-with-headers.csv'
      FORMAT CSV WITH 2 COLUMNS
Maybe I don't understand this new upgrade framework, but I don't understand what this test is testing. Doesn't it need to create these sources in v0.8.1, or some other old version that doesn't contain the new code?
Oh, yeah, that makes sense. I was guarding against future breakage, not ensuring that I didn't break something from the past.
@benesch and @sploiselle: the very last commit in this PR contains the new migration code, which is the only thing that now needs review/has changed since the last LGTM.
migration itself LGTM
Migration LGTM.
Method | Outcome
-------|--------
**HEADER** | Materialize reads the first line of the file to determine:<br/><br/>• The number of columns in the file<br/><br/>• The name of each column<br/><br/>The first line of the file is not ingested as data.
**HEADER** (name_list) | All of the same behaviors as bare **HEADER**, with the additional features that:<br/><br/>• Header names from source objects will be validated to exactly match those specified in the name list.<br/><br/>• Specifying a column list allows using CSV format with sources that have headers but may not yet exist. Primarily this is intended for S3, but it can also work with CSV sources in Kafka or other streaming systems.
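As an illustration of the validation described in the table above, a minimal sketch of exact-match header checking might look like the following (the function name and error message are hypothetical, not Materialize's actual implementation):

```rust
/// Sketch: with `FORMAT CSV WITH HEADER (id, value)`, the header actually
/// read from the source object must exactly match the declared names,
/// in both spelling and order.
fn validate_header(declared: &[&str], actual: &[&str]) -> Result<(), String> {
    if declared != actual {
        return Err(format!(
            "header mismatch: expected {:?}, found {:?}",
            declared, actual
        ));
    }
    Ok(())
}

fn main() {
    // Matching header: validation passes.
    assert!(validate_header(&["id", "value"], &["id", "value"]).is_ok());
    // Renamed or reordered columns: validation fails.
    assert!(validate_header(&["id", "value"], &["id", "val"]).is_err());
    assert!(validate_header(&["id", "value"], &["value", "id"]).is_err());
}
```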
Let's get rid of the line about Kafka -- we don't support CSV files in Kafka at the moment, so it's misleading (what we do support is per-message csv-encoded lines, but we don't try to extract headers from them AFAIK)
src/coord/src/catalog/migrate.rs (outdated):

    if let Format::Csv { columns, delimiter } = format {
        if !matches!(columns, CsvColumns::Header { .. }) {
            return Ok(());
        }
If you want, you should be able to roll both of these checks into the pattern-matching statement at the beginning of the function.
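A sketch of what that suggestion might look like (the type and function names here are illustrative stand-ins, not the PR's actual code): both the "is it CSV?" and "does it use header columns?" checks fold into a single pattern.

```rust
// Illustrative stand-ins for the types the migration code matches on.
enum CsvColumns {
    Count(usize),
    Header { names: Vec<String> },
}

enum Format {
    Csv { columns: CsvColumns, delimiter: u8 },
    Other,
}

/// Both conditions expressed in one pattern: the format is CSV *and*
/// its columns come from a header row. No separate `matches!` guard needed.
fn needs_migration(format: &Format) -> bool {
    matches!(
        format,
        Format::Csv {
            columns: CsvColumns::Header { .. },
            ..
        }
    )
}

fn main() {
    let with_header = Format::Csv {
        columns: CsvColumns::Header { names: vec!["id".into()] },
        delimiter: b',',
    };
    assert!(needs_migration(&with_header));

    let with_count = Format::Csv {
        columns: CsvColumns::Count(2),
        delimiter: b',',
    };
    assert!(!needs_migration(&with_count));
    assert!(!needs_migration(&Format::Other));
}
```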
src/dataflow-types/src/types.rs (outdated):

    @@ -598,14 +603,43 @@ pub struct ProtobufEncoding {
        pub message_name: String,
    }

    -/// Encoding in CSV format, with `n_cols` columns per row, with an optional header.
    +/// Encoding in CSV format
I don't think this comment is useful as it's just a restatement of the struct name. I asked Frank for advice and he suggested `/// Arguments necessary to define how to decode from CSV format`.
src/dataflow-types/src/types.rs (outdated):

        }
    }

    /// What we know about the CSV columns
Can we document this a bit better too? In particular we should probably call out that it changes the behavior of the decoder (i.e., that it determines whether the first row is a header row)
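A sketch of what such documentation might look like (the doc text and variant names here are illustrative, not the PR's final code), making explicit that the variant choice drives whether the decoder treats the first row as a header:

```rust
/// What we know about the CSV columns.
///
/// This also changes the behavior of the decoder: with `Header`, the first
/// row of each object is consumed as a header and not ingested as data;
/// with `Count`, every row (including the first) is data.
enum CsvColumns {
    /// No header row: only the expected column count is known.
    Count(usize),
    /// A header row is present; these names are used (and validated)
    /// as the column names, and the header row itself is skipped.
    Header { names: Vec<String> },
}

/// The decoder-facing question the enum answers.
fn skip_first_row(columns: &CsvColumns) -> bool {
    matches!(columns, CsvColumns::Header { .. })
}

fn main() {
    let header = CsvColumns::Header {
        names: vec!["id".to_string(), "value".to_string()],
    };
    assert!(skip_first_row(&header));
    assert!(!skip_first_row(&CsvColumns::Count(2)));
}
```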
`unimplemented!` has almost made it through code review in a place that `unsupported` was intended at least once, despite 5 people performing code review (MaterializeInc#7507). The unimplemented macro is just shorthand around `panic!("unimplemented");` so require that if folks really want to panic.
This syntax allows users to provide header names for objects that do not yet exist. It additionally allows Materialize to record header columns into SQL for the catalog, interacting less with the more general aliases SQL feature -- we still put default column names in the SQL aliases if the format is specified as `WITH n COLUMNS`. Design: MaterializeInc#7407 Part of: MaterializeInc#7145
This syntax allows users to provide header names for objects that do not yet
exist. It additionally allows Materialize to record header columns into SQL for
the catalog without interacting with the more general aliases SQL feature.

A follow-up PR will add more verification for re-opened files, that is, files
that exist in the catalog and may have had their headers changed between
materialized restarts, or which are empty but we are operating on as a file
`WITH (TAIL)` stream.

Design: #7407
Part of: #7145