Support for Distributed table engine #14

Closed
Volodin-DD opened this issue Aug 27, 2021 · 26 comments
Labels
help wanted (Extra attention is needed)

Comments

@Volodin-DD

Hi. We have an urgent need to use the Distributed engine for our ELT process. I have some work done (very unstable, but somewhat tested) in my fork of your repo. There is still a lot of work to do, but dbt run creates a distributed table on the cluster and inserts values from the select into it. In a very small test everything works as expected.

If you are already working on this feature or have plans for it (it seems good to have, in my opinion), maybe I can help.

@silentsokolov
Collaborator

I am intrigued. Distributed tables are a useful feature.

The community and I would be grateful if you could release this feature.

Tell me if you need help.

@Volodin-DD
Author

Great, so I am making a pull request. You can always PM me via Telegram (https://t.me/volodin_dd) for a less formal discussion in Russian. Thanks a lot.

silentsokolov added the help wanted label Aug 27, 2021
@Volodin-DD
Author

Link to the pull request: #15

@kjdeluna

@silentsokolov Any updates here?

@Volodin-DD
Author

Very interested too.

@silentsokolov
Collaborator

I think more specifics are needed to implement this issue, because distributed tables are part of a cluster, and "cluster management" is not part of dbt. Of course, we can use distributed tables, but creation and management... I don't know.

@Volodin-DD
Author

> I think more specifics are needed to implement this issue, because distributed tables are part of a cluster, and "cluster management" is not part of dbt. Of course, we can use distributed tables, but creation and management... I don't know.

Maybe you are right. I used my solution, and then a solution that runs the models on each shard with a distributed-like view at the end (the cluster table function), and the second one is good. Maybe there is no reason to add a distributed materialization to the adapter. Ephemeral would be more interesting.

@Volodin-DD
Author

Given the above, maybe this issue should be closed.

@kjdeluna

kjdeluna commented Feb 2, 2022

Running models on each shard is really a good solution, but on my end this is not possible. One of the solutions I've thought of is creating the replicated tables manually and then letting dbt take care of the Distributed table.

I think this would work, but when I tried it, it generated an error because a Distributed table doesn't have an ORDER BY, while ORDER BY is required in the table creation.
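
To make the mismatch concrete, a rough sketch with made-up names (not the exact DDL the adapter generates):

-- the shard-local MergeTree table needs an ORDER BY ...
CREATE TABLE my_model_local (id UInt64, event_date Date)
ENGINE = MergeTree ORDER BY id;

-- ... but the Distributed wrapper must be created without one;
-- adding an ORDER BY to this statement is what fails
CREATE TABLE my_model AS my_model_local
ENGINE = Distributed(my_cluster, currentDatabase(), my_model_local, rand());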

@silentsokolov
Collaborator

> I think this would work, but when I tried it, it generated an error because a Distributed table doesn't have an ORDER BY, while ORDER BY is required in the table creation.

This sounds like a suggestion; I'll fix it soon.

@Volodin-DD
Author

> Running models on each shard is really a good solution, but on my end this is not possible. One of the solutions I've thought of is creating the replicated tables manually and then letting dbt take care of the Distributed table.
>
> I think this would work, but when I tried it, it generated an error because a Distributed table doesn't have an ORDER BY, while ORDER BY is required in the table creation.

Maybe this is a solution: you can use table functions in dbt models and keep all the "dbt stuff". For example: SELECT * FROM cluster('cluster_name', {{ source('schema', 'table_name') }}). The main problem is that you cannot insert into such tables, only select data from the cluster.
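
As a minimal sketch (the cluster and source names are placeholders), such a read-only model could look like this:

-- models/events_all_shards.sql: read from every shard via the cluster() table function
{{ config(materialized='view') }}

SELECT *
FROM cluster('cluster_name', {{ source('schema', 'table_name') }})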

@kjdeluna

kjdeluna commented Feb 4, 2022

> Maybe this is a solution: you can use table functions in dbt models and keep all the "dbt stuff". For example: SELECT * FROM cluster('cluster_name', {{ source('schema', 'table_name') }}). The main problem is that you cannot insert into such tables, only select data from the cluster.

Can't we just use the Distributed table for both selects and inserts?

@gfunc
Contributor

gfunc commented Mar 28, 2022

I'm using a fork with support for the Distributed engine as well, repo here.

My solution for distributed tables was to create the ON CLUSTER distributed table with the model name, and in the meantime create a "real" table named {{model_name}}_local, also with the ON CLUSTER clause.

Each type of materialization (whether table or incremental) then requires the SQL to run on only one node (without the ON CLUSTER clause) and to INSERT INTO the distributed table.
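
A rough sketch of the resulting DDL (names are illustrative, not the exact macros from the fork):

-- "real" storage table, created on every node
CREATE TABLE my_model_local ON CLUSTER my_cluster
(
    id UInt64,
    value String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/my_model_local', '{replica}')
ORDER BY id;

-- distributed table carrying the model name
CREATE TABLE my_model ON CLUSTER my_cluster AS my_model_local
ENGINE = Distributed(my_cluster, currentDatabase(), my_model_local, rand());

-- the model SQL runs on a single node (no ON CLUSTER) and writes through the distributed table
INSERT INTO my_model SELECT id, value FROM my_staging_table;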

Would love to contribute once I reformat the code.

@gfunc
Contributor

gfunc commented Sep 29, 2022

I have started to merge my approach to the Distributed table engine, and I want to start a discussion early on about the handling of unique_keys.

My production environment has huge tables (200 GB+ per shard), so my approach was to use the OPTIMIZE syntax combined with the partition_by config (the ReplacingMergeTree engine's asynchronous deduplication does not suit this context). The logic is as below:

SELECT DISTINCT {{partition_by}} FROM {{intermediate_relation}};
OPTIMIZE TABLE {{model_name}}_local ON CLUSTER {{cluster_name}} PARTITION {{partition_name}} [FINAL] DEDUPLICATE BY {{unique_keys}}, {{partition_column}};

But to be honest, beyond the occasional lag of async distributed inserts (which can be resolved by the SYSTEM FLUSH DISTRIBUTED command) and the incompleteness of OPTIMIZE (behavior controlled by the replication_alter_partitions_sync setting), as of version 21.10 the performance of OPTIMIZE is not lightning fast.
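
For illustration, with made-up table, cluster and column names the statements end up looking like this:

-- flush any pending async inserts through the distributed table first
SYSTEM FLUSH DISTRIBUTED my_model;

-- make OPTIMIZE ... ON CLUSTER wait for the replicas instead of returning early
SET replication_alter_partitions_sync = 2;

-- then one OPTIMIZE per affected partition, e.g. partition 202209 of a toYYYYMM(event_date) key
OPTIMIZE TABLE my_model_local ON CLUSTER my_cluster PARTITION 202209
FINAL DEDUPLICATE BY id, event_date;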

Any suggestions or other solutions? Or has there been any performance improvement to OPTIMIZE in recent releases?

@genzgd
Contributor

genzgd commented Sep 29, 2022

Optimize is by nature an expensive operation, since in most cases it rewrites all of the data in the table. (You can also OPTIMIZE a ReplacingMergeTree table, which is essentially the same operation as your DEDUPLICATE query.)

Unfortunately there's no good answer for unique keys in ClickHouse -- this is a natural result of its sparse index / column-oriented architecture. If you can't deduplicate before insert, the available options are all less than ideal. The difficulties associated with deduplication/unique keys are one of the main trade-offs for performance.

@gfunc
Contributor

gfunc commented Sep 30, 2022

> Optimize is by nature an expensive operation, since in most cases it rewrites all of the data in the table. (You can also OPTIMIZE a ReplacingMergeTree table, which is essentially the same operation as your DEDUPLICATE query.)
>
> Unfortunately there's no good answer for unique keys in ClickHouse -- this is a natural result of its sparse index / column-oriented architecture. If you can't deduplicate before insert, the available options are all less than ideal. The difficulties associated with deduplication/unique keys are one of the main trade-offs for performance.

I agree. I tried using select * from {{model_name}} final in my dbt ETLs for ReplacingMergeTree tables, but the FINAL keyword has a drastic performance drawback, and I would have to align how deduplication is handled once people start using window functions all over the place. So I decided it is important to handle unique_keys within the dbt logic for the sake of downstream ETLs.

My use cases for unique_keys are mainly tables with incremental materialization, and the handling of this situation in the master branch at the moment seems to be a full insert into a new table followed by a swap, which is much more expensive than OPTIMIZE on the affected partitions for huge tables and basically the same for small tables.
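
A rough sketch of the insert-then-swap pattern I mean (illustrative names, simplified from the actual macros):

-- build a fresh table with the full, deduplicated contents ...
CREATE TABLE my_model__new AS my_model;
INSERT INTO my_model__new
SELECT * FROM my_model WHERE id NOT IN (SELECT id FROM my_model__batch);
INSERT INTO my_model__new SELECT * FROM my_model__batch;

-- ... then swap it in atomically (Atomic database engine) and drop the old data
EXCHANGE TABLES my_model__new AND my_model;
DROP TABLE my_model__new;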

I will go ahead and implement my logic and give it a try.

@genzgd
Contributor

genzgd commented Sep 30, 2022

I agree it's worth a try. The current approach is "okay" for smaller tables, but I'm not at all surprised that incremental materializations get unusably slow when you get to "ClickHouse scale" data. If you could limit the incremental materialization to only the partitions actually affected, it might help, but that assumes a level of intelligence in dbt that is going to take a lot of work and a lot of fine-tuning for particular tables.

@simpl1g
Contributor

simpl1g commented Jan 24, 2023

@genzgd in today's dbt + ClickHouse webinar you answered that you don't need Distributed tables when using a Replicated database, but I don't understand how that is possible. According to the documentation https://clickhouse.com/docs/en/engines/database-engines/replicated/ we still need them. And I just checked on my cluster: I can't query data from all shards without creating Distributed tables. Am I missing something?

In the dbt community this is the first question I see when somebody joins: "How can I work with a cluster in dbt?" And the standard answer is "You need to rewrite the materializations to correctly support Distributed tables." It would be great to resolve this issue.

@genzgd
Contributor

genzgd commented Jan 24, 2023

Hi @simpl1g -- I believe you are correct. I've been thinking too much in ClickHouse Cloud terms, which doesn't require sharding, but even with a Replicated database it looks like you need a Distributed table for multiple shards. We'll take another look at this issue and see where it fits on our roadmap.

@Keyeoh

Keyeoh commented Mar 29, 2023

Hi there,

Have there been any updates on this issue? In the ClickHouse integrations documentation, I have seen the following:

[screenshot from the ClickHouse integrations documentation]

Is this the recommended way to go? Does dbt create the distributed table by itself? How would it know the name of the cluster that needs to be passed as a parameter to the Distributed() engine definition?
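
For reference, the cluster name has to appear explicitly in the engine definition (sketch with placeholder names), so dbt would have to get it from somewhere in the project or profile configuration:

-- Distributed(cluster, database, table[, sharding_key]); the cluster must match a name defined in remote_servers
CREATE TABLE my_model AS my_model_local
ENGINE = Distributed(my_cluster, currentDatabase(), my_model_local, rand());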

We currently have an on-premises ClickHouse cluster with four shards and no replicas, and we would like to port our current ELT pipelines to dbt, but this is still raising a lot of doubts...

Any hint or help would be much appreciated.

@gladkikhtutu
Contributor

I also support @simpl1g's argument. In our company we have 6 nodes, and we calculated that if we run dbt 6 times, once on every node, we would get 6*6 distributed select queries to fetch the full data and 6 ON CLUSTER inserts instead of one. That is too expensive for our infrastructure.

@vrfn

vrfn commented May 12, 2023

Hi @genzgd, you commented that you'd take a look and see where this issue fits on the roadmap. Do you have any update on this? We'd love to be able to use distributed tables.

@genzgd
Contributor

genzgd commented May 12, 2023

@vrfn @simpl1g My apologies for not following up on this. The reality is that our team has had several other priorities over the past several months, and the question of distributed tables and dbt is not currently on our development roadmap. We also don't have a good test framework or infrastructure for testing distributed tables.

We would consider an updated version of the original PR here https://github.com/ClickHouse/dbt-clickhouse/pull/15/files that has been tested in a real-world situation and carries disclaimers that support is purely experimental. Ideally the feature would only be enabled with an explicit setting and be well separated from "core" dbt functionality so as not to break existing use cases.

@gladkikhtutu
Contributor

I added a PR (#163) with a Distributed table materialization based on the current code structure; the main idea is the same as in the PR above.
Hoping for your reaction.

@gladkikhtutu
Contributor

I also added a distributed incremental materialization PR: #172

@genzgd
Contributor

genzgd commented Nov 30, 2023

Experimental Distributed table materializations added in #172

genzgd closed this as completed Nov 30, 2023