Introducing Logical Optimizer

In the upcoming release of SparklineData, we are introducing a logical query optimizer. It improves OLAP query latency and reduces the computational resources needed for query execution. OLAP queries, especially those generated by tools such as Tableau and MicroStrategy, can be complex and are often not optimized. The Spark Catalyst framework performs some optimizations, but these are limited and often do not help OLAP queries. Furthermore, OLAP indexing technology introduces additional invariants, such as the non-nullability of metrics and of the primary partition key.

As part of this change, we now push Group By below Join. OLAP queries often include a pattern that computes the min/max of a table, self-joins the result with the table (a cross product), and then aggregates the joined rows further. By pushing the secondary Group By below the Join, the join operates on already-aggregated rows, which can improve join performance significantly.

The following query is an example of this pattern:

select c_name a, sum(l_quantity) b, mi c, ma d
from (select r1.c_name, r1.l_quantity, r2.mi, r2.ma
      from orderLineItemPartSupplier r1
      join (select min(l_quantity) mi, max(l_quantity) ma, count(1)
            from orderLineItemPartSupplier
            where not(l_shipdate is null) having count(1) > 0) r2
      ) r3
where not (c_name='NA')
group by c_name, mi, ma
order by a, b, c, d

The above query can be rewritten as:

select a, b, mi c, ma d
from (select c_name a, sum(l_quantity) b
      from orderLineItemPartSupplier
      where not (c_name='NA')
      group by c_name) r1
      join
      (select min(l_quantity) mi, max(l_quantity) ma, count(1)
       from orderLineItemPartSupplier
       where not(l_shipdate is null) having count(1) > 0
       ) r2
order by a, b, c, d
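
The equivalence of the two forms can be checked on a small dataset. The following is a minimal sketch, not part of SparklineData: it runs both queries through Python's built-in `sqlite3` module instead of Spark SQL, against a made-up five-row table. Two adaptations are assumptions made for portability: the join is written as an explicit `cross join`, and the `having count(1) > 0` clause is expressed as a filter on a derived table so that it also runs on older SQLite versions. The helper name `run` and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical miniature of orderLineItemPartSupplier; the rows are made up.
ROWS = [
    ("alice", 5, "1995-01-01"),
    ("alice", 7, None),
    ("bob", 3, "1995-02-01"),
    ("bob", 9, "1995-03-01"),
    ("NA", 4, "1995-04-01"),
]

# Min/max subquery shared by both forms. It yields one row, or no rows
# when no line item has a non-null l_shipdate.
R2 = """
    select mi, ma
    from (select min(l_quantity) mi, max(l_quantity) ma, count(1) cnt
          from orderLineItemPartSupplier
          where l_shipdate is not null)
    where cnt > 0
"""

# Original shape: cross join first, then filter and aggregate.
ORIGINAL = f"""
    select c_name a, sum(l_quantity) b, mi c, ma d
    from (select r1.c_name, r1.l_quantity, r2.mi, r2.ma
          from orderLineItemPartSupplier r1
          cross join ({R2}) r2) r3
    where not (c_name = 'NA')
    group by c_name, mi, ma
    order by a, b, c, d
"""

# Rewritten shape: Group By pushed below the join.
REWRITTEN = f"""
    select a, b, mi c, ma d
    from (select c_name a, sum(l_quantity) b
          from orderLineItemPartSupplier
          where not (c_name = 'NA')
          group by c_name) r1
    cross join ({R2}) r2
    order by a, b, c, d
"""


def run(sql, rows=ROWS):
    """Load rows into an in-memory table and return the query result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table orderLineItemPartSupplier "
            "(c_name text, l_quantity integer, l_shipdate text)"
        )
        conn.executemany(
            "insert into orderLineItemPartSupplier values (?, ?, ?)", rows
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(run(ORIGINAL))
    print(run(REWRITTEN))
```

In the rewritten form, the join receives one row per `c_name` rather than one row per line item, which is where the savings come from. Both forms also return no rows when the min/max subquery is empty, because the cross product with an empty input is empty.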
