Skip to content
This repository has been archived by the owner on Mar 30, 2021. It is now read-only.

Introducing Logical Optimizer

jpullokkaran edited this page Jul 14, 2016 · 7 revisions

In the upcoming release of SparklineData, we are introducing logical query optimizer. This improves OLAP query latency and reduces computational resources needed for query execution. OLAP queries, especially those generated by tools like Tableau, Microstrategy etc, can be complex and is often not optimized. Spark Catalyst framework performs some optimizations but these are limited and often does not optimize OLAP queries. Furthemore OLAP indexing technology introduces some additional invariants like non nullability of metrics and primary partition key.

As part of this change we now push Group By below Join. Often OLAP queries would include pattern that finds min/max of a table and then self joins (Cross Product) that with the table, which is then further aggregated. By Pushing secondary GB below Join we can improve the performance of join significantly.

The following query is one such sample query:

select c_name a, sum(l_quantity) b, mi c, ma d
from (select r1.c_name, r1.l_quantity, r2.mi, r2.ma
      from orderLineItemPartSupplier r1
      join (select min(l_quantity)mi, max(l_quantity) ma, count(1)
            from orderLineItemPartSupplier
            where not(l_shipdate is null) having count(1) > 0) r2
      ) r3
where not (c_name='NA')
group by c_name, mi, ma
order by a, b, c, d

The above query can be rewritten as

select a, b, mi c, ma d
from (select c_name a, sum(l_quantity) b
      from orderLineItemPartSupplier
      where not (c_name='NA')
      group by c_name) r1
      join
      (select min(l_quantity)mi, max(l_quantity) ma, count(1)
       from orderLineItemPartSupplier
       where not(l_shipdate is null) having count(1) > 0
       ) r2
order by a, b, c, d
Clone this wiki locally