Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adapter: Change id column in mz_databases and mz_schemas #18719

Conversation

do-not-use-parker-timmerman
Copy link
Contributor

@do-not-use-parker-timmerman do-not-use-parker-timmerman commented Apr 11, 2023

Motivation

This PR adds a known-desirable feature: #18506

This PR namespaces database and schema ids, by either "user" or "system, currently everything is "user". There are two major changes:

  1. Update our stash types, DatabaseKey, SchemaKey, SchemaValue, and ItemValue, so they contain namespaces.
  2. Update all of the builtin tables and views that depend on either database id or schema id.

To add a namespace to DatabaseKey and SchemaKey I created new DatabaseNamespace and SchemaNamespace types, which are optional. This is necessary to support serializing the existing entries, and the new entries. Another approach would be to make DatabaseKey and SchemaKey enums, and then write a custom implementation of Serialize and Deserialize, but after prototyping it, that seemed trickier to maintain.

Tips for reviewer

I'm mostly unsure about the migrations of the builtin tables. I believe it should "just work" since the fingerprint of each builtin should change, but if someone could confirm that would be great!

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@do-not-use-parker-timmerman do-not-use-parker-timmerman force-pushed the migration/mz_tables branch 4 times, most recently from 7227474 to 045fb9d Compare April 12, 2023 22:22
- add namespace to DatabaseKey, SchemaKey, and ItemValue
- add stash migration
- update Builtin tables and view
- update sqllogictests
- update testdrive
@do-not-use-parker-timmerman do-not-use-parker-timmerman changed the title [DRAFT] adapter: Change id column in mz_databases and mz_schemas adapter: Change id column in mz_databases and mz_schemas Apr 12, 2023
@do-not-use-parker-timmerman do-not-use-parker-timmerman marked this pull request as ready for review April 12, 2023 23:20
Copy link
Contributor

@jkosh44 jkosh44 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a couple of questions and comments. Also since we're adding a system variant to schemas, we may want to use it for the catalog schemas.

src/stash/src/lib.rs Show resolved Hide resolved
src/sql/src/names.rs Outdated Show resolved Hide resolved
src/adapter/src/catalog/storage.rs Outdated Show resolved Hide resolved
src/adapter/src/catalog/storage.rs Show resolved Hide resolved
src/adapter/src/catalog/storage.rs Show resolved Hide resolved
src/adapter/src/catalog/storage.rs Show resolved Hide resolved
src/adapter/src/catalog/storage.rs Show resolved Hide resolved
Comment on lines 630 to 637
/// A type that specifies how we serialize and deserialize a [`SchemaId`].
///
/// Note: JSON maps require their keys to be strings, and we use [`SchemaId`] as the key
/// in some maps, so we need to serialize them as plain strings, which is what this type
/// does.
#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(transparent)]
struct SchemaIdJson(String);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this should belong in storage.rs or somewhere nearby, since this create and module don't normally care how things are stored on disk.

I have a slight preference (but only slight) to use

#[serde(into = "BTreeMap<String, _>")]
#[serde(try_from = "BTreeMap<String, _>")]

where ever this is being used as a key if that means we can move this logic out of the sql create.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only necessary for dumping the catalog to a string. In terms of location, I did try putting it in storage.rs but it kind of felt out of place there since its not really related to how we persist things to disk, since it's only used for this one function. Instead of these custom impls, something else I tried was the following:

pub fn dump(&self) -> String {
  let database_by_str: BTreeMap<String, _> = self.database_by_id.iter()
    .map(|(key, value)| (key.to_string(), value.debug_json()))
    .collect();
  serde_json::to_string(database_by_str).expect("serialization cannot fail")
}

impl Database {
    fn debug_json(&self) -> serde_json::Value {
        let schemas_by_str: BTreeMap<String, _> = self
            .schemas_by_id
            .iter()
            .map(|(key, value)| (key.to_string(), value))
            .collect();
        
        serde_json::json!({
            name: self.name,
            id: self.id,
            schemas_by_id: schemas_by_str,
            schemas_by_name: self.schemas_by_name,
            owner_id: self.owner_id,
        })
    }
}

Maybe that's a bit better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I slightly prefer your proposed approach compared to the current approach. The current approach forces us to always serialize SchemaId as a string while your proposed approach only serializes SchemaId as a string when it's required (in the dump function. I can imagine in the future if SchemaId is serialized in the stash then we would want to serialize it normally and not as a string.

This is all subjective though, so I'll leave the final decision up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use the debug_json(...) approach! Since we only need this "special" case serialization for a single method, I think putting the special behavior in that method makes the most sense

src/sql/src/names.rs Outdated Show resolved Hide resolved
// enums after this upgrade.
|txn: &mut Transaction<'_>, _now, _bootstrap_args| {
// Migrate all of our DatabaseKeys.
txn.databases.migrate(|key, value| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new struct fields and this migration can maybe be done in an easier way, if done before the big vec of migrations is run. You could write a stash function that calls iter_raw on these and attempts to marshal into the old struct. If successful, then you add a retraction to a batch and an addition with the new key. If the marshal into the old struct fails, it's already the new struct and should be skipped. The old structs would be copied and exist only within this migration function, and everything could then use the new enums elsewhere. Does that work? I think it'd simplify this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being able to have enum DatabaseKey { User(u64), System(u64) }, and a enum SchemaKey of the same, would definitely simplify things! My hesitation to adding custom migration logic like this, or custom serde impls was to prevent any complications when we later try to improve migrations, i.e. #17466. Although off the top of my head I don't think what you're proposing should complicate anything too much, so I don't really feel strongly one way or another, so I'll defer to you here on what you think is best :)

src/stash/src/lib.rs Outdated Show resolved Hide resolved
@teskje teskje mentioned this pull request Apr 17, 2023
6 tasks
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants