Question about distributed queries #39827

arunmarathe · 2022-08-02T14:40:35Z

I have a couple of questions, but because they are unrelated to each other, I will put them in
separate issues so they are easier to track in the future.

The first question is about writing distributed queries. In ClickHouse, can I write a distributed
query in which some of the tables reside on an external DBMS, say Presto or Trino?

Checking the documentation, I see the Distributed table engine:
https://clickhouse.com/docs/en/engines/table-engines/special/distributed

However, this does not seem to fit the bill, unless someone can provide concrete examples.

Can anybody suggest a way to go about doing this? What I have in mind is something like this (sketch):

SELECT ....
FROM clickhouse_table1, presto_table2, presto_table3
WHERE ...

Thanks,
Arun

den-crane · 2022-08-02T19:30:16Z

SELECT ....
FROM clickhouse_table1, table_with_hdfs_engine, table_with_hdfs_engine2
WHERE ...

https://clickhouse.com/docs/en/sql-reference/table-functions/hdfs/
https://clickhouse.com/docs/en/engines/table-engines/integrations/hdfs

arunmarathe · 2022-08-03T15:46:15Z

> SELECT ....
> FROM clickhouse_table1, table_with_hdfs_engine, table_with_hdfs_engine2
> WHERE ...
> https://clickhouse.com/docs/en/sql-reference/table-functions/hdfs/ https://clickhouse.com/docs/en/engines/table-engines/integrations/hdfs

Thanks; you might have misunderstood my question.

Presto supports several file formats in its Hive connector (ORC, Parquet, Avro, ...). Suppose, for concreteness, that someone has already created a Parquet table on Presto, and that the same customer has also created some ClickHouse tables and populated them with data.

How does that customer write a query that combines a ClickHouse table 'ch' with a Presto table 'pr'?
Should I use S3 to refer to 'pr' inside my ClickHouse query? Or do I need to export the data out of 'pr' into a ClickHouse table?
The latter would not be desirable, though.

Thanks,
Arun

den-crane · 2022-08-03T16:54:03Z

> Thanks; you might have misunderstood my question.

No, I have not.

Read https://clickhouse.com/docs/en/engines/table-engines/integrations/hdfs
https://clickhouse.com/docs/en/engines/table-engines/integrations/hive
These allow querying ad-hoc data in ORC, Parquet, Avro, ... formats stored in Hive,
with or without joining against CH tables.
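
As a rough sketch of what the Hive engine approach could look like (it assumes the Hive table's files live on HDFS; the metastore address, database, table, and column names below are all hypothetical, not taken from this thread):

```sql
-- Hypothetical names throughout: metastore host/port, Hive database 'default',
-- Hive table 'pr', ClickHouse table `ch`, and every column listed here.
-- The column list has to match the Hive table, including its partition column.
CREATE TABLE pr_hive
(
    id     UInt64,
    amount Float64,
    day    String
)
ENGINE = Hive('thrift://metastore.example.com:9083', 'default', 'pr')
PARTITION BY day;

-- Join the Hive-backed table with a native ClickHouse table.
SELECT ch.id, ch.name, pr_hive.amount
FROM ch
INNER JOIN pr_hive ON ch.id = pr_hive.id
WHERE pr_hive.day = '2022-08-01';
```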

arunmarathe · 2022-08-03T17:14:40Z

>> Thanks; you might have misunderstood my question.
>
> No, I have not.
>
> Read https://clickhouse.com/docs/en/engines/table-engines/integrations/hdfs https://clickhouse.com/docs/en/engines/table-engines/integrations/hive It allows to query ad-hoc data in ORC, Parquet, Avro, ... formats stored in hive. With joining with CH tables and without.

Let me explain my scenario. I have a Presto table whose underlying storage is OBS (Huawei's object store). I access OBS through the Hive connector, so although the Hive connector is involved, HDFS is not. Recall that my original question did not mention HDFS. It said:

SELECT ....
FROM clickhouse_table1, presto_table2, presto_table3
WHERE ...

I tried the method that you gave a pointer to (thanks!) but it insists on hdfs:// in the URL. Maybe I am missing something?

When writing distributed queries, storage details usually do not (or ideally, should not) come up. A multi-part name should capture the essence of a remote table. For example, "server.database.schema.table" or, because people name these concepts inconsistently, "server.catalog.schema.table", etc.

I believe this is not possible in ClickHouse. I think I can still use the S3 table engine though.

den-crane · 2022-08-05T02:54:59Z

OK, then https://clickhouse.com/docs/en/engines/table-engines/integrations/s3 https://clickhouse.com/docs/en/sql-reference/table-functions/s3
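
A minimal sketch of how the s3 table function might be used for the 'ch' / 'pr' case discussed above. It assumes the object store exposes an S3-compatible endpoint and that the Presto table's Parquet files sit under a single prefix; the endpoint, bucket, path, credentials, and column names are all hypothetical:

```sql
-- Hypothetical names throughout: endpoint URL, bucket, path, access keys,
-- column names, and the ClickHouse table `ch`.
-- s3(path, access_key_id, secret_access_key, format, structure) reads the
-- files in place, so no export into a ClickHouse table is needed.
SELECT ch.id, ch.name, pr.amount
FROM ch
INNER JOIN s3(
    'https://obs.example.com/my-bucket/warehouse/pr/*.parquet',
    'ACCESS_KEY',
    'SECRET_KEY',
    'Parquet',
    'id UInt64, amount Float64'
) AS pr ON ch.id = pr.id
WHERE pr.amount > 0;
```

Note that this addresses the files directly rather than the Presto table by name, so the column structure has to be kept in sync with the Presto table definition by hand.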

arunmarathe added the question Question? label Aug 2, 2022
