Sometimes people new to Datomic datalog write queries that perform poorly. Other times queries are generated by application logic, and it is much simpler to generate them naively and let some other part of the application decide how to improve them later. In other cases, the optimal form of a Datomic query might change over the lifetime of an application, e.g. due to real changes in the domain data being tracked or the addition of batch jobs. Slow queries can significantly impact service health and slow down development and data science feedback cycles, so query performance is a justified optimization target.
This project provides a "good enough" Datomic query improver based on two heuristics to handle these problems. The logic helps avoid basic mistakes rather than finding unique high-performance edge cases. If you are writing awesome Datomic queries, it probably won't make them more awesome and might make some of them less awesome. If you're writing or generating bad Datomic queries, it will probably make them significantly less bad, and might make a large set of them awesome.
Two heuristics are used, both described in the Datomic docs: join along and most selective clause first.
These two heuristics are applied, in this order, to any set of :where clauses. The original query's :where clauses are treated as an unordered set from which the next clause is drawn:
(1) Rank clauses by the number of new variables they introduce.
(2) Among the clauses that introduce the fewest new variables, pick the clause whose attribute has the fewest datoms.
If a query contains arguments to :in, those are used as the initial bindings. Otherwise, the first clause is chosen based on criterion (2).
Each chosen clause is appended to the result, and the variables it binds count as already bound when the remaining clauses are ranked, so the final order is determined by the bindings each clause introduces relative to the clauses before it.
No guarantees are made about ordering when exact ties occur.
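The selection loop described above can be sketched in a few lines. This is an illustrative, language-neutral sketch in Python and not this project's Clojure implementation: the tuple representation of clauses, the names `variables` and `order_clauses`, the example attributes and counts, and the choice to treat an attribute with no known count as maximally expensive are all assumptions made for the example. It also reads "chosen based on criterion (2)" as ranking the first clause by datom count only when there are no :in bindings.

```python
def variables(clause):
    """Return the logic variables (strings starting with '?') used in a clause."""
    return {t for t in clause if isinstance(t, str) and t.startswith("?")}


def order_clauses(clauses, attr_counts, bound=()):
    """Greedily order (entity, attribute, value) clauses.

    attr_counts is a hypothetical mapping of attribute -> datom count;
    bound holds the variables already bound, e.g. by :in arguments.
    """
    remaining = list(clauses)
    bound = set(bound)
    ordered = []
    while remaining:
        def datoms(clause):
            # Assumption: an attribute without a known count ranks last.
            return attr_counts.get(clause[1], float("inf"))

        def rank(clause):
            # (1) fewest new variables, then (2) fewest datoms.
            return (len(variables(clause) - bound), datoms(clause))

        # With no initial bindings, the first pick uses criterion (2) only.
        key = datoms if not bound and not ordered else rank
        best = min(remaining, key=key)
        remaining.remove(best)
        ordered.append(best)
        bound |= variables(best)  # later clauses see these as bound
    return ordered


# Hypothetical query clauses and attribute counts, with ?email bound via :in.
clauses = [
    ("?order", ":order/user", "?user"),
    ("?order", ":order/total", "?total"),
    ("?user", ":user/email", "?email"),
]
counts = {":order/user": 1_000_000, ":order/total": 1_000_000, ":user/email": 50_000}
example = order_clauses(clauses, counts, bound={"?email"})
```

In this example the :user/email clause comes first because it introduces only one new variable, then the :order/user clause joins along ?user, and the :order/total clause comes last.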
To compute (2), the number of datoms for each attribute must be known. These counts are available through the Datomic client API using db-stats (although it is not yet documented). If you are using the peer API, or do not want to depend on db-stats, an example approach to calculating the stats is in dev/statics.clj. These statistics are updated lazily using datoms and stored in the database.
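The idea of lazily computed per-attribute counts can be illustrated with a small sketch. It is written in Python purely for illustration and does not call Datomic: `LazyAttrCounts`, `datoms_for`, and `FAKE_DATOMS` are hypothetical names, the in-memory list stands in for an index scan, and the in-memory cache stands in for statistics stored in the database.

```python
class LazyAttrCounts:
    """Count datoms per attribute on first request and remember the result."""

    def __init__(self, datoms_for):
        # datoms_for is a hypothetical callable: attribute -> iterable of datoms.
        self._datoms_for = datoms_for
        self._cache = {}

    def count(self, attr):
        if attr not in self._cache:
            # Scan once; later requests for the same attribute hit the cache.
            self._cache[attr] = sum(1 for _ in self._datoms_for(attr))
        return self._cache[attr]


# Stand-in data: (entity, attribute, value) tuples instead of real datoms.
FAKE_DATOMS = [
    (1, ":user/email", "a@example.com"),
    (2, ":user/email", "b@example.com"),
    (3, ":order/user", 1),
]
calls = []  # records which attributes were actually scanned


def datoms_for(attr):
    calls.append(attr)
    return (d for d in FAKE_DATOMS if d[1] == attr)


stats = LazyAttrCounts(datoms_for)
email_count = stats.count(":user/email")
```

A real implementation would also need to decide when cached counts are stale; this sketch never refreshes them.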