s3 sources: Add the ability to read a single object from S3 #5194

Closed

quodlibetor wants to merge 6 commits from the s3-sources branch

Conversation

@quodlibetor (Contributor) commented Jan 4, 2021:

Individual objects are downloaded and then split on newlines, with each resulting record sent through to the dataflow layer.

There are some open questions about how MzOffsets and related concepts should map into an S3 world -- "partition" can be mapped to S3 objects, but the number of S3 objects can grow arbitrarily large, so a naive version of that mapping won't necessarily make sense.

This PR still requires testing infrastructure, but ingesting an S3 object works correctly. It should be reviewed commit-by-commit.
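
As a rough illustration of the partition/offset question above, here is a minimal sketch in which the object key plays the role of the partition and a record's line number plays the role of the offset. All names here (S3Record, records_for_object) are hypothetical, not types from this PR:

struct S3Record {
    /// "Partition": the key of the S3 object this record came from.
    object_key: String,
    /// "Offset": the record's line number within that object.
    line_no: u64,
    /// One newline-delimited record from the object body.
    data: Vec<u8>,
}

fn records_for_object(key: &str, body: &[u8]) -> Vec<S3Record> {
    body.split(|&b| b == b'\n')
        // Skip empty slices (e.g. from a trailing newline).
        .filter(|line| !line.is_empty())
        .enumerate()
        .map(|(i, line)| S3Record {
            object_key: key.to_string(),
            line_no: i as u64,
            data: line.to_vec(),
        })
        .collect()
}

fn main() {
    let recs = records_for_object("logs/2021-01-04.log", b"first\nsecond\n");
    assert_eq!(recs.len(), 2);
    assert_eq!(recs[1].line_no, 1);
}

Because the set of objects is unbounded, the partition space in this scheme is unbounded too, which is exactly the concern raised above.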


Part of #4914



@benesch (Member) left a comment:

Exciting!

src/sql/src/plan/statement.rs (outdated; resolved)
        bucket: String,
        /// A glob-like pattern that objects must match
        objects_pattern: String,
    },
@benesch (Member):

FYI, we're trying to move away from these sorts of comments in the SQL parser. The principle is that descriptions of the semantics of S3 sources don't really belong in doc comments in the SQL parser. If anything these could describe the syntax they represent ("The value of the BUCKET clause of the S3 connector."), but in general I think those comments clutter things up more than they help. The syn crate is a nice example of describing just an AST without describing the semantics of the language.

@quodlibetor (Contributor, author):

How would you feel about:

A glob-like object specifier: `'a/**/*.json'`

This seems to match what happens in syn, in e.g. Item: https://docs.rs/syn/1.0.57/syn/enum.Item.html

@benesch (Member):

I still think that's a bit too specific since the syntax of that pattern is opaque to the SQL parser. Like I think the correct level of detail for these doc comments is something like

/// The `S3 ...` connector.
S3 {
    /// The `BUCKET 'bucket'` clause.
    bucket: String,
    /// The `OBJECTS 'pattern'` clause.
    object_patterns: String,
}

And then the semantics of these things would be fully-described in the user-facing docs, so that we focus our documentation efforts on what users will actually see. But again this is super minor so 💯 down to roll with what you have!

src/sql-parser/src/parser.rs (outdated; resolved)
}

impl ExternalSourceConnector {
    /// Metadata columns reflect how many records we have processed
@benesch (Member):

If we're going to document this method, then I think it would be great to be a bit more detailed! Perhaps something like this:

/// Returns the name and type of each additional metadata column that
/// Materialize will automatically append to the source's inherent columns.
///
/// Presently, each source type exposes precisely one metadata column that
/// corresponds to some source-specific record counter. For example, file
/// sources use a line number, while Kafka sources use a topic offset.
///
/// The columns declared here must be kept in sync with the actual source
/// implementations that produce these columns.

@quodlibetor (Contributor, author):

Thanks, this is exactly what I wished had been there when I was trying to read this.
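
As a toy illustration of the contract that doc comment describes, here is a sketch with stand-in types. The enum variants, column names, and ScalarType here are invented for illustration and are not Materialize's actual definitions:

/// Stand-in for Materialize's column type; illustrative only.
enum ScalarType {
    Int64,
}

enum ExternalSourceConnector {
    File,
    Kafka,
    S3,
}

impl ExternalSourceConnector {
    /// The single metadata column each source type appends: a
    /// source-specific record counter, as the doc comment describes.
    fn metadata_column(&self) -> (&'static str, ScalarType) {
        match self {
            ExternalSourceConnector::File => ("mz_line_no", ScalarType::Int64),
            ExternalSourceConnector::Kafka => ("mz_offset", ScalarType::Int64),
            ExternalSourceConnector::S3 => ("mz_record", ScalarType::Int64),
        }
    }
}

fn main() {
    let (name, _ty) = ExternalSourceConnector::S3.metadata_column();
    assert_eq!(name, "mz_record");
}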

    pub bucket: String,
    /// Used to filter results
    #[serde(with = "s3_serde_glob")]
    pub objects_pattern: glob::Pattern,
@benesch (Member):

What do you say to using BurntSushi's globset crate instead? It's supposed to be faster, and it comes with built-in serde support if you enable the serde1 feature!

@quodlibetor (Contributor, author):

Ah, I looked for one but could only find a crate that actually operates on the filesystem. Will switch.
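
For reference, a minimal sketch of the globset approach (the pattern string is illustrative):

use globset::Glob;

fn main() -> Result<(), globset::Error> {
    // Compile the pattern once; matching against it is then cheap.
    let matcher = Glob::new("a/**/*.json")?.compile_matcher();

    assert!(matcher.is_match("a/b/c.json"));
    assert!(!matcher.is_match("a/b/c.txt"));

    // With the crate's `serde1` feature enabled, `Glob` implements
    // Serialize/Deserialize directly, so a custom serde module like
    // `s3_serde_glob` becomes unnecessary.
    Ok(())
}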

}

#[derive(Clone, Debug, Eq, PartialEq, Serialize, Deserialize)]
pub struct AwsConnectInfo {
@benesch (Member):

I think this might as well be named AwsConnector for parity with the other types in this module.

@quodlibetor (Contributor, author):

This is slightly different from a connector, though, since it's a subset of the information required by connectors: it's not enough information by itself to create a connector, and I felt that difference was worth preserving. I agree that AwsConnectInfo is a bad name, but I couldn't think of anything better.

@benesch (Member):

I have to admit I'm unconvinced! The term "connector" was never meant to mean "complete description"; it was picked specifically to avoid the noise word "info". And indeed an ExternalSourceConnector is not actually a complete description of what you need to create an external source: there are a bunch of other fields in SourceConnector::External.

Anyway, minor, let's definitely not hold up this PR on this point!

src/aws-util/src/s3.rs (resolved)
src/dataflow-types/src/types.rs (resolved)
None => extract("region")?
    .map(|r| r.parse())
    .transpose()?
    // TODO: do we want to have a default region?
@benesch (Member):

In my past experiences with AWS, I've always regretted any attempt to automatically infer the region or use a default. It's ugly for sure, but folks who are used to AWS seem very used to the idea that an AWS connection is a verbose affair that requires all three of (access key id, secret access key, region).

@quodlibetor (Contributor, author):

Fair, removed this TODO.
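
For reference, a sketch of the resulting behavior, assuming rusoto's Region type (which implements FromStr); the helper name here is hypothetical, and the real option extraction lives in the planner:

use rusoto_core::Region;

/// Require an explicit region rather than falling back to a default, so a
/// misconfigured source fails at creation time.
fn require_region(opt: Option<&str>) -> Result<Region, String> {
    match opt {
        Some(r) => r.parse::<Region>().map_err(|e| e.to_string()),
        None => Err("S3 sources require an explicit REGION".into()),
    }
}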

// Helper utilities

/// Iterate over a `Vec<u8>`, yielding new Vecs newline-separated
struct VecLinesIter {
@benesch (Member):

Can you not use something like `slice.split(|&b| b == b'\n').map(|s| s.to_vec())`?

@quodlibetor (Contributor, author):

gah, yes.
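
One detail worth noting about the slice::split approach: a trailing newline yields an empty final slice, which callers likely want to filter out. A quick demonstration:

fn main() {
    let object: &[u8] = b"first\nsecond\n";

    // Direct translation of the suggestion above; note the empty final
    // slice produced by the trailing newline.
    let raw: Vec<Vec<u8>> = object.split(|&b| b == b'\n').map(|s| s.to_vec()).collect();
    assert_eq!(raw, vec![b"first".to_vec(), b"second".to_vec(), Vec::new()]);

    // Filtering empty slices yields one record per non-empty line.
    let records: Vec<Vec<u8>> = object
        .split(|&b| b == b'\n')
        .filter(|s| !s.is_empty())
        .map(|s| s.to_vec())
        .collect();
    assert_eq!(records.len(), 2);
}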

@quodlibetor force-pushed the s3-sources branch 4 times, most recently from b2dd11f to f69e626 (January 5, 2021, 18:06).
@quodlibetor (Contributor, author):

@benesch I believe that this is entirely up to date re: your comments.

@benesch (Member) left a comment:

Everything I've looked at looks great, but I didn't look at the important bits—I think @umanwizard is doing that in the other PR?


@quodlibetor force-pushed the s3-sources branch 2 times, most recently from c8fc7f9 to 906997b (January 8, 2021, 21:01). The pushed commits carried the following messages:
This will allow it to be shared between different services in later commits.

Previously, credentials weren't fetched until we tried to use the client, causing credential errors to show up well after the dataflow had been created. With this change, we can fail to create the source at all.
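
A sketch of that eager-fetch idea, assuming rusoto's credential chain (the function name is hypothetical):

use rusoto_credential::{ChainProvider, CredentialsError, ProvideAwsCredentials};

/// Resolve credentials up front so a misconfiguration fails when the source
/// is created, not after the dataflow is already running.
async fn validate_credentials() -> Result<(), CredentialsError> {
    let provider = ChainProvider::new();
    // Forces the chain (env vars, profile, instance metadata, ...) to
    // produce credentials now rather than on first use.
    let _creds = provider.credentials().await?;
    Ok(())
}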
@quodlibetor (Contributor, author) commented Jan 14, 2021:

I'm going to close this to centralize discussion on #5202, which includes this commit.
