Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV uploads for ClickHouse Cloud #236

Merged
merged 11 commits into from
May 17, 2024
59 changes: 31 additions & 28 deletions src/metabase/driver/clickhouse.clj
Original file line number Diff line number Diff line change
Expand Up @@ -44,33 +44,12 @@

(defmethod driver/database-supports? [:clickhouse feature] [_driver _feature _db] supported?))

(def ^:private ^{:arglists '([db])} cloud?
"Is this a cloud DB?"
(memoize/ttl
^{::memoize/args-fn (fn [[db]]
[(mdb.connection/unique-identifier) (:details db)])}
(fn [db]
(sql-jdbc.execute/do-with-connection-with-options
:clickhouse db nil
(fn [^java.sql.Connection conn]
(with-open [stmt (.prepareStatement conn "SELECT value='1' FROM system.settings WHERE name='cloud_mode'")
rset (.executeQuery stmt)]
(when (.next rset)
(.getBoolean rset 1))))))
;; cache the results for 60 minutes; TTL is here only to eventually clear out old entries/keep it from growing too
;; large
:ttl/threshold (* 60 60 1000)))

(defmethod driver/database-supports? [:clickhouse :uploads] [_driver _feature db]
(cloud? db))

(def ^:private default-connection-details
{:user "default" :password "" :dbname "default" :host "localhost" :port "8123"})

(defmethod sql-jdbc.conn/connection-details->spec :clickhouse
[_ details]
;; ensure defaults merge on top of nils
(let [details (reduce-kv (fn [m k v] (assoc m k (or v (k default-connection-details))))
(defn- connection-details->spec* [details]
(let [;; ensure defaults merge on top of nils
details (reduce-kv (fn [m k v] (assoc m k (or v (k default-connection-details))))
default-connection-details
details)
{:keys [user password dbname host port ssl use-no-proxy]} details
Expand All @@ -86,12 +65,36 @@
:ssl (boolean ssl)
:use_no_proxy (boolean use-no-proxy)
:use_server_time_zone_for_dates true
;; select_sequential_consistency is needed to guarantee that we can query data from any replica in CH Cloud
;; immediately after it is written
:select_sequential_consistency true
:product_name product-name}
(sql-jdbc.common/handle-additional-options details :separator-style :url))))

(def ^:private ^{:arglists '([db-details])} cloud?
"Is this a cloud DB?"
(memoize/ttl
(fn [db-details]
(sql-jdbc.execute/do-with-connection-with-options
:clickhouse
(connection-details->spec* db-details)
nil
(fn [^java.sql.Connection conn]
(with-open [stmt (.prepareStatement conn "SELECT value='1' FROM system.settings WHERE name='cloud_mode'")
rset (.executeQuery stmt)]
(when (.next rset)
(.getBoolean rset 1))))))
;; cache the results for 48 hours; TTL is here only to eventually clear out old entries
:ttl/threshold (* 48 60 60 1000)))

(defmethod sql-jdbc.conn/connection-details->spec :clickhouse
[_ details]
(cond-> (connection-details->spec* details)
(cloud? details)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand, select_sequential_consistency just limits which replica you query from, and doesn't affect the performance of the query on that specific replica.

This logic seems like it will break consistency for multi-replica on prem installations, and needlessly disable it on single replica ones. It seems best to just to enable it unconditionally for now?

For any multi-replica installation, enabling it for all connections seems to wipe out most of the benefits of connecting to a distributed installation, but I don't see any way around this besides reworking metabase to support separate pools for different connection flavors.

Copy link
Contributor Author

@calherries calherries May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're mistaking the behaviour of select_sequential_consistency on CH Cloud (docs here), thinking it's the same as on-premise (the link you shared). We're only enabling select_sequential_consistency on CH Cloud, where it's much less of an issue.

There's more on SharedMergeTree in this blog post.

Unlike ReplicatedMergeTree, SharedMergeTree doesn't require replicas to communicate with each other. Instead, all communication happens through shared storage and clickhouse-keeper. SharedMergeTree implements asynchronous leaderless replication and uses clickhouse-keeper for coordination and metadata storage. This means that metadata doesn’t need to be replicated as your service scales up and down. This leads to faster replication, mutation, merges and scale-up operations. SharedMergeTree allows for hundreds of replicas for each table, making it possible to dynamically scale without shards. A distributed query execution approach is used in ClickHouse Cloud to utilize more compute resources for a query.

If your use case requires consistency guarantees that each server is delivering the same query result, then you can run the SYNC REPLICA system statement, which is a much more lightweight operation with the SharedMergeTree. Instead of syncing data (or metadata with zero-copy replication) between servers, each server just needs to fetch the current version of metadata from Keeper.

And I can't imagine fetching metadata from Keeper can be a bottleneck. I also got this DM from Serge on select_sequential_consistency:

With CH Cloud, there should not be any issues with that, as it uses “shared” merge tree and this is merely a metadata sync before executing the query (this is super fast, as it is a fetch of the meta from zookeeper).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, select_sequential_consistency is correctly enabled for the Cloud type of instance only.

For on-premise clusters (in a follow-up PR), we should just set insert_quorum equal to the nodes count so that replica sync won't be necessary and will be effectively done during the CSV insert, and it won't affect other read-only operations.

;; select_sequential_consistency guarantees that we can query data from any replica in CH Cloud
;; immediately after it is written
(assoc :select_sequential_consistency true)))

(defmethod driver/database-supports? [:clickhouse :uploads] [_driver _feature db]
(cloud? (:details db)))

(defmethod driver/can-connect? :clickhouse
[driver details]
(if config/is-test?
Expand Down Expand Up @@ -203,7 +206,7 @@
db-id
{:write? true}
(fn [^java.sql.Connection conn]
(let [sql (format "insert into %s (%s)" (quote-name table-name) (str/join ", " (map quote-name column-names)))]
(let [sql (format "INSERT INTO %s (%s)" (quote-name table-name) (str/join ", " (map quote-name column-names)))]
(with-open [ps (.prepareStatement conn sql)]
(doseq [row values]
(when (seq row)
Expand Down