New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separate storage and compute #12216
Closed
Closed
Separate storage and compute #12216
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
petrosagg
force-pushed
the
separate-compute-storage
branch
from
May 3, 2022 11:57
98bcd18
to
ed261ce
Compare
petrosagg
force-pushed
the
separate-compute-storage
branch
from
May 4, 2022 10:06
6084c8a
to
8897a41
Compare
petrosagg
added a commit
to petrosagg/materialize
that referenced
this pull request
May 4, 2022
The previous API required that the user came up with an iterator of references to values but that meant that every called that had owned values had to write unecessary boilerplate. For example if a user had a `Vec<((K, V), T, D)` ready to go they would need to first make this an iterator over `&((K, V), T, D)` and then use a map adaptor to convert that to an iterator over `((&K, &V), &T, &D)`. This patch changes the `append` APIs to accept any combination of owned and borrowed values (a total of 32 combinations) which makes calling these APIs much nicer. The usecase for this change isn't visible in this PR because it is part of the larger MaterializeInc#12216 PR, but I thought I'd break it up for easier review. Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
petrosagg
added a commit
to petrosagg/materialize
that referenced
this pull request
May 4, 2022
The previous API required that the user came up with an iterator of references to values but that meant that every called that had owned values had to write unecessary boilerplate. For example if a user had a `Vec<((K, V), T, D)>` ready to go they would need to first make this an iterator over `&((K, V), T, D)` and then use a map adaptor to convert that to an iterator over `((&K, &V), &T, &D)`. This patch changes the `append` APIs to accept any combination of owned and borrowed values (a total of 32 combinations) which makes calling these APIs much nicer. The original usecase for this change isn't visible in this PR because it is part of the larger MaterializeInc#12216 PR, but I thought I'd break it up for easier review. You can see however how the usage becomes nicer in `persist_open_loop_benchmark.rs` and also in `src/persist-client/src/lib.rs`. Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
petrosagg
added a commit
to petrosagg/materialize
that referenced
this pull request
May 4, 2022
The previous API required that the user came up with an iterator of references to values but that meant that every called that had owned values had to write unecessary boilerplate. For example if a user had a `Vec<((K, V), T, D)>` ready to go they would need to first make this an iterator over `&((K, V), T, D)` and then use a map adaptor to convert that to an iterator over `((&K, &V), &T, &D)`. This patch changes the `append` APIs to accept any combination of owned and borrowed values (a total of 32 combinations) which makes calling these APIs much nicer. The original usecase for this change isn't visible in this PR because it is part of the larger MaterializeInc#12216 PR, but I thought I'd break it up for easier review. You can see however how the usage becomes nicer in `persist_open_loop_benchmark.rs` and also in `src/persist-client/src/lib.rs`. Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
@petrosagg I understand this is the PR to end all PRs, to provide table persistence among other things. Please let me know when I can jump in and start testing it. |
petrosagg
force-pushed
the
separate-compute-storage
branch
8 times, most recently
from
May 10, 2022 12:51
f4a581a
to
eb2d4fb
Compare
danhhz
reviewed
May 11, 2022
petrosagg
force-pushed
the
separate-compute-storage
branch
10 times, most recently
from
May 16, 2022 13:33
665e3aa
to
534a49d
Compare
This comment was marked as resolved.
This comment was marked as resolved.
1 similar comment
This comment was marked as resolved.
This comment was marked as resolved.
petrosagg
force-pushed
the
separate-compute-storage
branch
from
May 25, 2022 17:51
5c3a94f
to
7112454
Compare
Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
petrosagg
force-pushed
the
separate-compute-storage
branch
from
May 25, 2022 21:56
7112454
to
b33b650
Compare
umanwizard
pushed a commit
to umanwizard/materialize-1
that referenced
this pull request
May 25, 2022
…ble data through `persist`. NOTE: The original PR is at MaterializeInc#12216 On the storage side, when a `CreateSources` command is received the corresponding ingestion dataflow is immediately instantiated and datat starts flowing into `persist`. This is in contrast with previous behaviour where `storaged` process were waiting for `RenderSources` messages to instantiate ingestion dataflows and publish the data to the TCP boundary. Both the `RenderSources` command and the TCP boundary are removed. Additionally, `storaged` processes are no longer responsible for handling table data and the associated `Append` command is removed. Instead, the storage controller directly appends the data requested to be appended by ADAPTER into the appropriate persist shard. On the compute side, the `DataflowDescription` struct has been augmented with an additional storage metadata type which carries all the information required by the storage fat client running in `computed` processes to reach out to `persist` and read the correct data. Currently this type (named `CollectionMetadata`) carries the shard ids for the final storage collection and also the persist location details. Even though all data is transited through `persist`, the shard id used is **ephemeral**. In some sense this PR only goes as far as using persist as a glorified TCP protocol. There will be follow up work that will make the storage controller durably record its choices for persist shard for a particular storage collection id and also change ADAPTER to anticipate storage collection being already there (adapter/storage controller reconciliation). Fixes MaterializeInc#12593
Merged
umanwizard
pushed a commit
that referenced
this pull request
May 25, 2022
…ble data through `persist`. NOTE: The original PR is at #12216 On the storage side, when a `CreateSources` command is received the corresponding ingestion dataflow is immediately instantiated and datat starts flowing into `persist`. This is in contrast with previous behaviour where `storaged` process were waiting for `RenderSources` messages to instantiate ingestion dataflows and publish the data to the TCP boundary. Both the `RenderSources` command and the TCP boundary are removed. Additionally, `storaged` processes are no longer responsible for handling table data and the associated `Append` command is removed. Instead, the storage controller directly appends the data requested to be appended by ADAPTER into the appropriate persist shard. On the compute side, the `DataflowDescription` struct has been augmented with an additional storage metadata type which carries all the information required by the storage fat client running in `computed` processes to reach out to `persist` and read the correct data. Currently this type (named `CollectionMetadata`) carries the shard ids for the final storage collection and also the persist location details. Even though all data is transited through `persist`, the shard id used is **ephemeral**. In some sense this PR only goes as far as using persist as a glorified TCP protocol. There will be follow up work that will make the storage controller durably record its choices for persist shard for a particular storage collection id and also change ADAPTER to anticipate storage collection being already there (adapter/storage controller reconciliation). Fixes #12593
superseded by #12715 to get over CLA assistant being stuck |
auto-merge was automatically disabled
May 25, 2022 23:07
Pull request was closed
benesch
added a commit
to benesch/materialize
that referenced
this pull request
May 29, 2022
Since MaterializeInc#12216 separated storage and compute by transiting all data via the persist library, the compute layer no longer needs to communicate directly with the storage layer. So remove the configuration parameters that pertained to the old storage-compute network protocol.
1 task
benesch
added a commit
to benesch/materialize
that referenced
this pull request
May 29, 2022
Since MaterializeInc#12216 separated storage and compute by transiting all data via the persist library, the compute layer no longer needs to communicate directly with the storage layer. So remove the configuration parameters that pertained to the old storage-compute network protocol.
benesch
added a commit
to benesch/materialize
that referenced
this pull request
May 29, 2022
Since MaterializeInc#12216 separated storage and compute by transiting all data via the persist library, the compute layer no longer needs to communicate directly with the storage layer. So remove the configuration parameters that pertained to the old storage-compute network protocol.
benesch
added a commit
that referenced
this pull request
May 29, 2022
Since #12216 separated storage and compute by transiting all data via the persist library, the compute layer no longer needs to communicate directly with the storage layer. So remove the configuration parameters that pertained to the old storage-compute network protocol.
benesch
added a commit
to benesch/materialize
that referenced
this pull request
May 31, 2022
Since MaterializeInc#12216, tables are now entirely handled by the controller. There is no longer a need to send a `CreateSourceCommand` to the `storaged` process for table sources. We should eventually adjust the types here to enforce this statically, but this quick fix will unblock MaterializeInc#12770.
1 task
benesch
added a commit
to benesch/materialize
that referenced
this pull request
Jun 1, 2022
PR MaterializeInc#12082 converted source tokens to thread-safe `Arc`s to be compatible with the TCP storage/compute boundary, but since MaterializeInc#12216 replaced the TCP boundary with persist we can go back to Rcs.
benesch
added a commit
to benesch/materialize
that referenced
this pull request
Jun 1, 2022
PR MaterializeInc#12082 converted source tokens to thread-safe `Arc`s to be compatible with the TCP storage/compute boundary, but since MaterializeInc#12216 replaced the TCP boundary with persist we can go back to Rcs.
1 task
jkosh44
pushed a commit
to jkosh44/materialize
that referenced
this pull request
Jun 2, 2022
…ble data through `persist`. NOTE: The original PR is at MaterializeInc#12216 On the storage side, when a `CreateSources` command is received the corresponding ingestion dataflow is immediately instantiated and datat starts flowing into `persist`. This is in contrast with previous behaviour where `storaged` process were waiting for `RenderSources` messages to instantiate ingestion dataflows and publish the data to the TCP boundary. Both the `RenderSources` command and the TCP boundary are removed. Additionally, `storaged` processes are no longer responsible for handling table data and the associated `Append` command is removed. Instead, the storage controller directly appends the data requested to be appended by ADAPTER into the appropriate persist shard. On the compute side, the `DataflowDescription` struct has been augmented with an additional storage metadata type which carries all the information required by the storage fat client running in `computed` processes to reach out to `persist` and read the correct data. Currently this type (named `CollectionMetadata`) carries the shard ids for the final storage collection and also the persist location details. Even though all data is transited through `persist`, the shard id used is **ephemeral**. In some sense this PR only goes as far as using persist as a glorified TCP protocol. There will be follow up work that will make the storage controller durably record its choices for persist shard for a particular storage collection id and also change ADAPTER to anticipate storage collection being already there (adapter/storage controller reconciliation). Fixes MaterializeInc#12593
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Motivation
This PR decouples STORAGE and COMPUTE by transiting all source and table data through
persist
.On the storage side, when a
CreateSources
command is received the corresponding ingestion dataflow is immediately instantiated and datat starts flowing intopersist
. This is in contrast with previous behaviour wherestoraged
process were waiting forRenderSources
messages to instantiate ingestion dataflows and publish the data to the TCP boundary. Both theRenderSources
command and the TCP boundary are removed. Additionally,storaged
processes are no longer responsible for handling table data and the associatedAppend
command is removed. Instead, the storage controller directly appends the data requested to be appended by ADAPTER into the appropriate persist shard.On the compute side, the
DataflowDescription
struct has been augmented with an additional storage metadata type which carries all the information required by the storage fat client running incomputed
processes to reach out topersist
and read the correct data. Currently this type (namedCollectionMetadata
) carries the shard ids for the final storage collection and also the persist location details.Even though all data is transited through
persist
, the shard id used is ephemeral. In some sense this PR only goes as far as using persist as a glorified TCP protocol. There will be follow up work that will make the storage controller durably record its choices for persist shard for a particular storage collection id and also change ADAPTER to anticipate storage collection being already there (adapter/storage controller reconciliation).Fixes #12593
Testing
Release notes
This PR includes the following user-facing behavior changes: