Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV uploads for ClickHouse Cloud #236

Merged
merged 11 commits into from
May 17, 2024
75 changes: 74 additions & 1 deletion src/metabase/driver/clickhouse.clj
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
"Driver for ClickHouse databases"
#_{:clj-kondo/ignore [:unsorted-required-namespaces]}
(:require [clojure.string :as str]
[honey.sql :as sql]
[metabase [config :as config]]
[metabase.driver :as driver]
[metabase.driver.clickhouse-introspection]
Expand All @@ -12,7 +13,11 @@
[metabase.driver.sql-jdbc [common :as sql-jdbc.common]
[connection :as sql-jdbc.conn]]
[metabase.driver.sql-jdbc.execute :as sql-jdbc.execute]
[metabase.driver.sql.query-processor :as sql.qp]
[metabase.driver.sql.util :as sql.u]
[metabase.query-processor.writeback :as qp.writeback]
[metabase.test.data.sql :as sql.tx]
[metabase.upload :as upload]
[metabase.util.log :as log]))

(set! *warn-on-reflection* true)
Expand All @@ -32,7 +37,9 @@
:test/jvm-timezone-setting false
:connection-impersonation false
:schemas true
:datetime-diff true}]
:uploads true
:datetime-diff true
:upload-with-auto-pk false}]
calherries marked this conversation as resolved.
Show resolved Hide resolved

(defmethod driver/database-supports? [:clickhouse feature] [_driver _feature _db] supported?))

Expand Down Expand Up @@ -112,6 +119,72 @@
:semantic-version {:major (.getInt rset 2)
:minor (.getInt rset 3)}})))))

(defmethod driver/upload-type->database-type :clickhouse
[_driver upload-type]
(case upload-type
::upload/varchar-255 "Nullable(String)"
::upload/text "Nullable(String)"
::upload/int "Nullable(Int64)"
::upload/float "Nullable(Float64)"
::upload/boolean "Nullable(Boolean)"
::upload/date "Nullable(Date32)"
::upload/datetime "Nullable(DateTime64(3))"
calherries marked this conversation as resolved.
Show resolved Hide resolved
;; FIXME: should be `Nullable(DateTime64(3))`
::upload/offset-datetime nil))

(defmethod driver/table-name-length-limit :clickhouse
[_driver]
;; FIXME: This is a lie because you're really limited by a filesystems' limits, because Clickhouse uses
;; filenames as table/column names. But its an approximation
206)

(defn- quote-name [s]
(let [parts (str/split (name s) #"\.")]
(str/join "." (map #(str "`" % "`") parts))))

(defn- create-table!-sql
[driver table-name column-definitions & {:keys [primary-key]}]
(str/join "\n"
[(first (sql/format {:create-table (keyword table-name)
:with-columns (mapv (fn [[name type-spec]]
(vec (cons name [[:raw type-spec]])))
column-definitions)}
:quoted true
:dialect (sql.qp/quote-style driver)))
"ENGINE = MergeTree"
(format "PRIMARY KEY (%s)" (str/join ", " (map quote-name primary-key)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can drop this and put the primary-key in the ORDER BY clause instead.

i.e. the table definition should be just:

CREATE TABLE foo
(col1 Type, ...)
ENGINE MergeTree
ORDER BY (primary-key)

Copy link
Collaborator

@slvrtrn slvrtrn May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#order_by

ClickHouse uses the sorting key as a primary key if the primary key is not defined explicitly by the PRIMARY KEY clause.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I've made that change Use order by to specify primary key

"ORDER BY ()"]))

(defmethod driver/create-table! :clickhouse
[driver db-id table-name column-definitions & {:keys [primary-key]}]
(let [sql (create-table!-sql driver table-name column-definitions :primary-key primary-key)]
(qp.writeback/execute-write-sql! db-id sql)))

(defmethod driver/insert-into! :clickhouse
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than use the default sql-jdbc implementation, I've chosen to use the approach recommended here, to allow for large uploads.

[driver db-id table-name column-names values]
(when (seq values)
(sql-jdbc.execute/do-with-connection-with-options
driver
db-id
{:write? true}
(fn [^java.sql.Connection conn]
(let [sql (format "insert into %s (%s)" (quote-name table-name) (str/join ", " (map quote-name column-names)))]
(with-open [ps (.prepareStatement conn sql)]
(doseq [row values]
(when (seq row)
(doseq [[idx v] (map-indexed (fn [x y] [(inc x) y]) row)]
(condp isa? (type v)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not a bottleneck, but making a protocol like (ps-set! v ps idx) could remove the reflection of type and the O(n) branching of condp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, how do you imagine this being implemented exactly? The tricky part is this multimethod doesn't know the type of each column without using reflection, and values can be nil. I'm imagining this:

  • for each value, call (ps-set! v ps idx)
  • px-set! looks up the method to use for the idx in a volatile map, e.g. setString etc. If the map contains a method for the idx, it uses it. Otherwise calculate the method using a similar condp expression and save it to the map under the key idx, and execute the method.

That way we'll only use reflection once for every column.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something, but I was thinking we'd just implement the protocol directly on each of these java types for v, eschewing any reflection or map look-up. I think you can just use Object for the fallback case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, anyway this seems pretty inconsequential compared to all the slow stuff we're doing on the Metabase side. I think I'd prefer to keep it simple and not change this condp for now

java.lang.String (.setString ps idx v)
java.lang.Boolean (.setBoolean ps idx v)
java.lang.Long (.setLong ps idx v)
java.lang.Double (.setFloat ps idx v)
java.math.BigInteger (.setObject ps idx v)
java.time.LocalDate (.setObject ps idx v)
java.time.LocalDateTime (.setObject ps idx v)
(.setString ps idx v)))
(.addBatch ps)))
(doall (.executeBatch ps))))))))

;;; ------------------------------------------ User Impersonation ------------------------------------------

(defmethod driver.sql/set-role-statement :clickhouse
Expand Down
2 changes: 2 additions & 0 deletions test/metabase/test/data/clickhouse.clj
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,8 @@

(defmethod sql.tx/add-fk-sql :clickhouse [& _] nil) ; TODO - fix me

(defmethod sql.tx/session-schema :clickhouse [_] "default")

(defmethod tx/supports-time-type? :clickhouse [_driver] false)

(defn rows-without-index
Expand Down