Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coord: restrict queries run on mz_introspection #18276

Conversation

do-not-use-parker-timmerman
Copy link
Contributor

@do-not-use-parker-timmerman do-not-use-parker-timmerman commented Mar 20, 2023

Work in Progress

Motivation

  • add new adapter::coord::introspection module
  • check when sequencing, if we're on the mz_introspection cluster, we're only referencing system items

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

- add new adapter::coord::introspection module
- check when sequencing, if we're on the mz_introspection cluster, we're only referencing system items
@do-not-use-parker-timmerman
Copy link
Contributor Author

Hey @morsapaes, could you provide some insight into how our dbt adapter works? 🙂

This PR is an attempt to restrict what queries are run against the mz_introspection cluster. Thus far the dbt tests are not passing though. For example, test_base is failing because it seems like the base test is selecting against some predefined table which ends up using the mz_introspection cluster. With this PR the test fails with the error message:

19:37:48  Database Error in model swappable (models/swappable.sql)
19:37:48  querying the following items "materialize.test16793410477264782918_test_basic.base" is not allowed from the "mz_introspection" cluster. Use `SET CLUSTER = <cluster-name>` to change your cluster and re-run the query.
19:37:48  compiled Code at target/run/base/models/swappable.sql

For the Materialize connector, it appears as though we should be falling back to the default cluster if no cluster configuration is set (comment) but I'm not sure that's happening? I'm also not familiar with dbt though, so I could be misunderstanding how it works.

@morsapaes
Copy link
Contributor

morsapaes commented Mar 21, 2023

For the Materialize connector, it appears as though we should be falling back to the default cluster if no cluster configuration is set.

This is true, but revisiting the code there are some quirks that can be confusing. Leaving a rambling at the bottom of the comment that hopefully makes things clearer (while also helping me think things through). In theory, the issue with tests shouldn't be tied to where materializations are created, but to the queries that are run to verify that these materializations return the results we expect them to return. As an example, here's what the TestSimpleMaterializationsMaterialize basic test will attempt to do:

  • Open a connection to Materialize (which defaults to mz_introspection).

  • Run dbt seed using base.csv as input. This runs this code and creates a (static) materialized view in the default cluster named base with the contents of the seed file.

  • Run base_view_sql, base_table_sql and base_materialized_var_sql (defined here) as models with different materialization configurations. This runs the equivalent to:

    CREATE { VIEW | MATERIALIZED VIEW } AS
    SELECT * FROM base;

    You can see in the logs that it assumes the correct cluster, for example:

    [2023-03-20T19:11:07Z] DEBUG    file_log:eventmgr.py:57 19:10:35.347861 [debug] [Thread-5  ]: On model.base.swappable: /* {"app": "dbt", 
    "dbt_version": "1.4.1", "profile_name": "test", "target_name": "default", "node_id": "model.base.swappable"} */
    [2023-03-20T19:11:07Z] -- Creates a materialized view, not a table, in Materialize
    [2023-03-20T19:11:07Z] create materialized view "materialize"."test16793394333206546273_test_basic"."swappable"
    [2023-03-20T19:11:07Z]
    [2023-03-20T19:11:07Z] in cluster default
    [2023-03-20T19:11:07Z]
    [2023-03-20T19:11:07Z] as (
    [2023-03-20T19:11:07Z]
    [2023-03-20T19:11:07Z]
    [2023-03-20T19:11:07Z] select * from "materialize"."test16793394333206546273_test_basic"."base"
    [2023-03-20T19:11:07Z] );
    ...
    [2023-03-20T19:11:07Z]   querying the following items "materialize.test16793394333206546273_test_basic.base" is not allowed from the "mz_introspection" cluster. Use `SET CLUSTER = <cluster-name>` to change your cluster and re-run the query.
    [2023-03-20T19:11:07Z]   compiled Code at target/run/base/models/swappable.sql

    I...wouldn't expect this to fail? 🤔 Though it's possible that I have the wrong idea of how IN CLUSTER works under the hood.

All dbt tests follow this pattern of running a seed file with expected results, and validating those against the actual results that are observed. There are things failing that make sense to me, for example:

[2023-03-20T19:11:07Z] create view "materialize"."test16793394665858303751_test_utils"."current_ts"
[2023-03-20T19:11:07Z]   as (
[2023-03-20T19:11:07Z]     select now() as current_ts_column
[2023-03-20T19:11:07Z]   );
...
[2023-03-20T19:11:07Z] INFO     file_log:eventmgr.py:59 19:11:07.093932 [info ] [Thread-113]: 1 of 1 OK created sql view model test16793394665858303751_test_utils.current_ts  [CREATE VIEW in 0.08s]
...
[2023-03-20T19:11:07Z] ______ ERROR at setup of TestCurrentTimestamp.test_current_timestamp_type ______
[2023-03-20T19:11:07Z]
[2023-03-20T19:11:07Z] self = <test_utils.TestCurrentTimestamp object at 0x7f5d92030d30>
[2023-03-20T19:11:07Z] project = <dbt.tests.fixtures.project.TestProjInfo object at 0x7f5d8a9248e0>
[2023-03-20T19:11:07Z]
[2023-03-20T19:11:07Z]     @pytest.fixture(scope="class")
[2023-03-20T19:11:07Z]     def current_timestamp(self, project):
[2023-03-20T19:11:07Z]         run_dbt(["build"])
[2023-03-20T19:11:07Z]         relation = relation_from_name(project.adapter, "current_ts")
[2023-03-20T19:11:07Z] >       result = project.run_sql(f"select current_ts_column from {relation}", fetch="one")
...
[2023-03-20T19:11:07Z] E           psycopg2.errors.InternalError_: querying the following items
"materialize.test16793394665858303751_test_utils.current_ts" is not allowed from the "mz_introspection" cluster. Use
`SET CLUSTER = <cluster-name>` to change your cluster and re-run the query.

Since these tests run in CI, we could default to the default cluster on connection, but this would require us to add conditional logic to connections.py, or some other solution that caters for testing only. I'm a bit puzzled, though!

Forcing clusters after connection

We force the default cluster defined in profiles.yml if no cluster is passed as a model configuration for materialized views (materializedview, seed and table materializations) and indexes. For sources and sinks, it's a little more complicated to explicitly force the cluster (see #17313), but this shouldn't be required because:

  • If users specify the SIZE option, the object will be created in a linked cluster;
  • If users specify the IN CLUSTER option, the object will be created in the specified cluster.

Since Materialize will throw an error if none of the options are specified, there isn't a risk that sources or sinks end up being created in mz_introspection unless users type IN CLUSTER mz_introspection. A similar logic is valid for views, since these objects aren't associated with a cluster and all the work is done in environmentd.

@do-not-use-parker-timmerman
Copy link
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Further restrict mz_introspection cluster
3 participants