
Overview of Synopsis Data Engine (SDE)

This feature is available only in the Enterprise version of SnappyData.

!!! Note
    This is the beta version of the SDE feature, which is still undergoing final testing before its official release.


The SnappyData Synopsis Data Engine (SDE) offers a novel and scalable system to analyze large datasets. SDE uses statistical sampling techniques and probabilistic data structures to answer analytic queries with sub-second latency. There is no need to store or process the entire dataset. The approach trades off query accuracy for fast response time.

For instance, in exploratory analytics, a data analyst might slice and dice large datasets to understand patterns and trends, or to introduce new features. The results are often rendered in a visualization tool as bar charts, map plots, and bubble charts. SDE increases the analyst's productivity by providing a near-perfect answer that can be rendered in seconds instead of minutes (visually, it is identical to the 100% correct rendering), so the analyst can continue to slice and dice the datasets without interruption.

When SDE is accessed using a visualization tool (Apache Zeppelin), users get an almost-perfect answer to their analytical queries within a couple of seconds, while the full answer can be computed in the background. Depending on the immediate answer, users can cancel the full execution early, either because they are satisfied with the almost-perfect initial answer or because, after viewing the initial results, they are no longer interested in the final results. This can lead to dramatically higher productivity and significantly lower resource consumption in multi-tenant and concurrent workloads on shared clusters.

While in-memory analytics can be fast, it is still expensive and cumbersome to provision large clusters. Instead, SDE allows you to retain data in existing databases and disparate sources, and only caches a fraction of the data using stratified sampling and other techniques. In many cases, data explorers can use their laptops and run high-speed interactive analytics over billions of records.
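To illustrate why stratified sampling lets a small cache stand in for a large dataset, here is a minimal Python sketch. It is not SnappyData's implementation: the function `stratified_sample`, the per-stratum cap, and the `carrier`/`delay` rows are all hypothetical, and reservoir sampling is simply one common way to keep a bounded sample per group. The point is that rare groups survive in the sample, and each kept row carries a weight that says how many original rows it represents.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, cap, seed=0):
    """Keep at most `cap` rows per stratum using reservoir sampling.

    Returns a list of (row, weight) pairs, where weight is the number of
    original rows that each kept row stands for within its stratum.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)   # stratum -> kept rows
    seen = defaultdict(int)          # stratum -> rows seen so far
    for row in rows:
        stratum = key(row)
        seen[stratum] += 1
        bucket = reservoirs[stratum]
        if len(bucket) < cap:
            bucket.append(row)
        else:
            # Replace an existing row with probability cap / seen.
            j = rng.randrange(seen[stratum])
            if j < cap:
                bucket[j] = row
    return [
        (row, seen[stratum] / len(bucket))
        for stratum, bucket in reservoirs.items()
        for row in bucket
    ]

# Hypothetical data: 100,000 rows for a common carrier, 50 for a rare one.
rows = [{"carrier": "AA", "delay": i % 60} for i in range(100_000)]
rows += [{"carrier": "ZZ", "delay": 5} for _ in range(50)]

sample = stratified_sample(rows, key=lambda r: r["carrier"], cap=1000)

# Summing the weights per carrier recovers the original group sizes.
counts = defaultdict(float)
for row, weight in sample:
    counts[row["carrier"]] += weight
```

A uniform 1% sample of the same data would be expected to keep well under one row of the rare carrier, so queries grouped by carrier could miss it entirely. Capping each stratum instead keeps every group represented.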

Unlike existing optimization techniques based on OLAP cubes or in-memory extracts, which can consume a lot of resources and work only for queries known a priori, the SnappyData synopsis data structures are designed to work for any ad-hoc query.

How does it work?

The following diagram provides a simplified view of how the SDE works. The SDE is deeply integrated with the SnappyData store and its general-purpose SQL query engine. Incoming rows, which can come from static or streaming sources, are continuously sampled into one or more "sample" tables. These samples play a role similar to indexes in a database: they exist for optimization. One difference is that the "exact" table may or may not be managed by SnappyData (for instance, it may be a set of folders in S3 or Hadoop). When queries are executed, the user can optionally specify their tolerance for error through simple SQL extensions. SDE transparently goes through a sample selection process to evaluate whether the query can be satisfied within the error constraint. If so, the response is generated directly from the sample.
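The sample selection step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the SDE algorithm: the names `approx_avg` and `answer_avg` are hypothetical, the error bound is a normal-approximation confidence interval (z = 1.96 for roughly 95% confidence), and the fallback to the exact data is an assumed policy for the case where no sample meets the tolerance.

```python
import math
import random
import statistics

def approx_avg(sample, max_rel_error, z=1.96):
    """Return (estimate, rel_error, ok) for AVG over a uniform random sample."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Half-width of the confidence interval, relative to the estimate.
    rel_error = z * std_err / abs(mean) if mean != 0 else math.inf
    return mean, rel_error, rel_error <= max_rel_error

def answer_avg(exact_values, samples, max_rel_error):
    """Try samples from smallest to largest; fall back to the full data."""
    for sample in sorted(samples, key=len):
        estimate, _, ok = approx_avg(sample, max_rel_error)
        if ok:
            return estimate, "sample"
    return statistics.fmean(exact_values), "exact"

# Hypothetical data: 200,000 values centered on 100, plus two samples.
rng = random.Random(42)
exact = [rng.gauss(100.0, 20.0) for _ in range(200_000)]
samples = [rng.sample(exact, 100), rng.sample(exact, 5_000)]

loose = answer_avg(exact, samples, max_rel_error=0.10)    # a sample suffices
tight = answer_avg(exact, samples, max_rel_error=0.0001)  # needs exact data
```

With a 10% tolerance the smallest sample is enough, so the full data is never scanned. With a 0.01% tolerance neither sample qualifies and the sketch falls back to the exact values.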

SDE Architecture

Using SDE

In the current release, SDE queries work only for SUM, AVG, and COUNT aggregations, and joins are supported only with non-sample tables. The SnappyData SDE module will gradually expand the scope of queries that it can service. The overarching goal is to dramatically cut the load on current systems by diverting at least some queries to the sampling subsystem, and to increase productivity through fast response times.
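As a rough illustration of why SUM, AVG, and COUNT are natural candidates for sampling, the Python sketch below estimates all three from a uniform random sample. It assumes each row was kept independently with a known probability `fraction`. The `estimate` helper and the `(region, amount)` rows are hypothetical and do not reflect the SDE API.

```python
import random

def estimate(sample, fraction, predicate, value):
    """Estimate COUNT, SUM and AVG of value(row) over rows matching predicate.

    `sample` is assumed to be a uniform random sample that kept each row
    with probability `fraction`.
    """
    matched = [value(row) for row in sample if predicate(row)]
    count = len(matched) / fraction   # scale up by 1 / fraction
    total = sum(matched) / fraction   # scale up by 1 / fraction
    # AVG is a ratio, so the scaling factors cancel out.
    avg = sum(matched) / len(matched) if matched else None
    return {"COUNT": count, "SUM": total, "AVG": avg}

rng = random.Random(7)
# Hypothetical table of (region, amount) rows; one row in four is "west".
table = [("east" if i % 4 else "west", float(i % 100)) for i in range(400_000)]
fraction = 0.01
sample = [row for row in table if rng.random() < fraction]

approx = estimate(sample, fraction, lambda r: r[0] == "west", lambda r: r[1])
exact_count = sum(1 for r in table if r[0] == "west")
```

COUNT and SUM scale linearly with the sampling fraction and AVG needs no scaling at all. Aggregates such as MIN, MAX, or exact distinct counts do not extrapolate from a sample in this simple way.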