Question about distributed queries #39827

Open
arunmarathe opened this issue Aug 2, 2022 · 5 comments
Labels
question Question?

Comments

@arunmarathe

I have a couple of questions, but because they are unrelated, I will file them as
separate issues so that they are easier to track in the future.

The first question is about writing distributed queries. In ClickHouse, can I write a distributed
query in which some of the tables reside on an external DBMS, say Presto or Trino?

Checking the documentation, I see the Distributed table engine:
https://clickhouse.com/docs/en/engines/table-engines/special/distributed

However, this does not seem to fit the bill, unless someone can provide concrete examples.

Can anybody suggest a way to go about doing this? What I have in mind is something like this (sketch):

SELECT ....
FROM clickhouse_table1, presto_table2, presto_table3
WHERE ...

Thanks,
Arun

@arunmarathe arunmarathe added the question Question? label Aug 2, 2022
@den-crane
Contributor

SELECT ....
FROM clickhouse_table1, table_with_hdfs_engine, table_with_hdfs_engine2
WHERE ...

https://clickhouse.com/docs/en/sql-reference/table-functions/hdfs/
https://clickhouse.com/docs/en/engines/table-engines/integrations/hdfs

@arunmarathe
Author

> SELECT ....
> FROM clickhouse_table1, table_with_hdfs_engine, table_with_hdfs_engine2
> WHERE ...
>
> https://clickhouse.com/docs/en/sql-reference/table-functions/hdfs/
> https://clickhouse.com/docs/en/engines/table-engines/integrations/hdfs

Thanks; you might have misunderstood my question.

Presto's Hive connector supports several file formats (ORC, Parquet, Avro, ...). Suppose, for concreteness, that someone has already created a Parquet table on Presto, and that the same customer has also created some ClickHouse tables and populated them with data.

How would they write a query that combines a ClickHouse table 'ch' with a Presto table 'pr'?
Should I use S3 to refer to 'pr' inside my ClickHouse query? Or do I need to export the data out of 'pr' into a ClickHouse table?
That would not be desirable, though.

Thanks,
Arun

@den-crane
Contributor

den-crane commented Aug 3, 2022

> Thanks; you might have misunderstood my question.

No, I have not.

Read https://clickhouse.com/docs/en/engines/table-engines/integrations/hdfs
and https://clickhouse.com/docs/en/engines/table-engines/integrations/hive
These engines allow ad-hoc queries over data in ORC, Parquet, Avro, ... formats stored in Hive,
with or without joins to CH tables.
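
For illustration, a minimal sketch of what the Hive engine route could look like. The metastore address, the database and table names, and every column are hypothetical placeholders; the declared columns and partition key would have to match the actual Hive table. As documented, this engine targets Hive tables stored on HDFS.

```sql
-- Hypothetical names: metastore-host, the 'default' database, the 'pr' table,
-- and the columns id, amount, day are placeholders, not taken from this thread.
CREATE TABLE pr_hive
(
    id     UInt64,
    amount Float64,
    day    String
)
ENGINE = Hive('thrift://metastore-host:9083', 'default', 'pr')
PARTITION BY day;

-- Join the Hive-backed table with a native ClickHouse table.
SELECT ch.id, pr.amount
FROM clickhouse_table1 AS ch
INNER JOIN pr_hive AS pr ON ch.id = pr.id;
```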

@arunmarathe
Author

arunmarathe commented Aug 3, 2022

>> Thanks; you might have misunderstood my question.
>
> No, I have not.
>
> Read https://clickhouse.com/docs/en/engines/table-engines/integrations/hdfs https://clickhouse.com/docs/en/engines/table-engines/integrations/hive It allows to query ad-hoc data in ORC, Parquet, Avro, ... formats stored in hive. With joining with CH tables and without.

Let me explain my scenario. I have a Presto table whose underlying storage is OBS (Huawei's object store). I access OBS through the Hive connector, so although the Hive connector is involved, HDFS is not. Recall that my original question did not mention HDFS. It said:

SELECT ....
FROM clickhouse_table1, presto_table2, presto_table3
WHERE ...

I tried the method that you gave a pointer to (thanks!) but it insists on hdfs:// in the URL. Maybe I am missing something?

When writing distributed queries, storage details usually do not (or ideally, should not) come up. A multi-part name should capture the essence of a remote table: for example, "server.database.schema.table" or, because people name these concepts inconsistently, "server.catalog.schema.table".

I believe this is not possible in ClickHouse. I think I can still use the S3 table engine though.
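
A minimal sketch of the S3 route mentioned above, using the `s3` table function rather than a table engine so that nothing has to be created first. It assumes that the OBS bucket is reachable through an S3-compatible endpoint and that the files behind the Presto table are Parquet. The endpoint URL, bucket path, credentials, columns, and join key are all hypothetical placeholders.

```sql
-- Hypothetical names throughout: the endpoint, bucket path, credentials,
-- and the columns id, name, amount are placeholders, not taken from this thread.
SELECT ch.id, ch.name, pr.amount
FROM clickhouse_table1 AS ch
INNER JOIN s3(
    'https://obs.example.com/my-bucket/warehouse/pr/*.parquet',  -- assumed S3-compatible endpoint
    'ACCESS_KEY',                                                -- placeholder credentials
    'SECRET_KEY',
    'Parquet',                                                   -- file format of the Presto table
    'id UInt64, amount Float64'                                  -- assumed column structure
) AS pr ON ch.id = pr.id;
```

This reads the files directly and bypasses Presto, so the table's location and schema have to be spelled out by hand instead of being resolved from a multi-part name.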
