Skip to content
forked from epfldata/squall

An streaming / online query processing / analytics engine based on Apache Storm

Notifications You must be signed in to change notification settings

2221758805/squall

 
 

Repository files navigation

#Squall Squall is an online query processing engine built on top of Storm. Similar to how Hive provides SQL syntax on top of Hadoop for doing batch processing, Squall executes SQL queries on top of Storm for doing online processing. Squall supports a wide class of SQL analytics ranging from simple aggregations to more advanced UDF join predicates and adaptive rebalancing of load. It is being actively developed by several contributors from the EPFL DATA lab. Squall is undergoing a continuous process of development, currently it supports the following:

  • SQL (Select-Project-Join) query processing over continuous streams of data.
  • Full fledged & full-history stateful computation essential for approximate query processing, e.g. Online Aggregation.
  • Time based Window Semantics for infinite data streams (currently in-progress).
  • Theta Joins: complex join predicates, including inequality, band, and arbitrary UDF join predicates. This gives a more comprehensive support and flexibility to data analytics. For example, Hive plans to support theta joins in response to user requests.
  • Continuous load balance and adaptation to data skew.
  • Usage: An API for arbitrary SQL query processing or a frontend query processor that parses SQL to a storm topology.
  • Throughput rates of upto Millions of tuples/second and latencies of milliseconds on a 16 machine cluster. Scalable to large cluster settings.
  • Out-of-Core Processing: Can operate efficiently under limited memory resources through efficient disk based datastructures and indexes.
  • Guarantees: At least-once or at most-once semantics. No support for exactly-once semantics yet, however it is planned for.
  • Elasticity: Scaling out according to the load.
  • DashBoard: Integrating support for real time visualizations.

Example:

Consider the following SQL query:

SELECT C_MKTSEGMENT, COUNT(O_ORDERKEY)
FROM CUSTOMER join ORDERS on C_CUSTKEY = O_CUSTKEY
GROUP BY C_MKTSEGMENT

We provide several way of running queries:

  1. The declarative interface of Squall (which is equipped with a cost-based optimizer) supports SQL directly

  2. The imperative interface of Squall supports the online distributed query plan (full code) as follows:

Component customer = new DataSourceComponent("customer", conf)
                            .add(new ProjectOperator(0, 6));
Component orders = new DataSourceComponent("orders", conf)
                            .add(new ProjectOperator(1));
Component custOrders = new EquiJoinComponent(customer, 0, orders, 0)
                            .add(new AggregateCountOperator(conf).setGroupByColumns(1));

Queries are mapped to operator trees in the spirit of the query plans of relational database systems. These are are in turn mapped to Storm workers. (There is a parallel implementation of each operator, so in general an operator is processed by multiple workers). Some operations of relational algebra, such as selections and projections, are quite simple, and assigning them to separate workers is inefficient. Rather than requiring the predecessor operator to send its output over the network to the workers implementing these simple operations, the simple operations can be integrated into the predecessor operators and postprocess the output there. This is typically also done in classical relational database systems, but in a distributed environment, the benefits are even greater. In the Squall API, query plans are built bottom-up from operators (called components or super-operators) such as data source scans and joins; these components can then be extended by postprocessing operators such as projections.

Documentation

Detailed documentation can be found on the Squall wiki.

Contributing to Squall

We'd love to have your help in making Squall better. If you're interested, please communicate with us your suggestions and get your name to the Contributors list.

License

Squall is licensed under Apache License v2.0.

About

An streaming / online query processing / analytics engine based on Apache Storm

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Java 83.4%
  • C 6.3%
  • Shell 4.1%
  • Perl 2.9%
  • Scala 1.4%
  • Ruby 1.2%
  • Other 0.7%