# SnappyData - The Spark Database

Stream, Transact, Analyze, Predict in one cluster
Latest commit e73dad5 (Jul 22, 2017) by @hbhanawat: Fixes for issues found during concurrency testing (#730)
## Changes proposed in this pull request

While doing JFR analysis of the concurrency test, we found that `PooledKryoSerializer.serialize` was taking a lot of time. On further analysis, it turned out that `WholeStageCodeGenRDD` serializes the generated code string, and that this was slow because `Output.writeString` was falling back to `Output.writeString_slow`, which happens whenever the output buffer is not big enough. To fix this, `PooledKryoSerializer` now determines the size of the code string and creates a buffer of that size up front. The problem could also have been fixed by adding an optimized writeString/readString specifically for the code body, but we decided against that to avoid the array copy that would have happened during resize, since the code body is around 16 KB even for a simple select query.
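The idea behind the fix can be sketched as follows. This is a minimal illustration, not the actual `PooledKryoSerializer` code, and `CodeStringBuffer`/`requiredBufferSize` are hypothetical names: compute a worst-case serialized size for the generated code string up front and allocate one right-sized buffer, so the serializer never has to grow the buffer (and copy ~16 KB of partially written bytes) mid-write.

```java
public class CodeStringBuffer {
    // Worst-case bytes a Kryo-style UTF-8 encoder may need for a string:
    // up to 3 bytes per char for BMP characters, plus a small allowance
    // for the length header written before the string data.
    static int requiredBufferSize(String code) {
        return 3 * code.length() + 8;
    }

    public static void main(String[] args) {
        // Allocating the buffer at this size up front means the writer
        // never hits the "buffer too small" slow path and never resizes.
        String code = "/* generated whole-stage code, ~16 KB for a simple select */";
        byte[] buffer = new byte[requiredBufferSize(code)];
        System.out.println(buffer.length);
    }
}
```

The over-allocation (3 bytes per char) trades a little memory for never paying the resize-and-copy cost on a hot serialization path.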

For the low-latency pool, we set its minShare to 2 and increased its weight to 2. This ensures that OLTP queries get some room to execute even while an OLAP query is already running, and that they get priority the next time around.
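Spark's fair scheduler expresses these settings in a pool allocation file. A sketch of what the change corresponds to is below; the pool names here are illustrative, not necessarily the ones SnappyData uses internally:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Low-latency (OLTP) pool: guaranteed a minimum share of 2 cores
       and double weight, so short queries are not starved by OLAP work. -->
  <pool name="lowlatency">
    <schedulingMode>FAIR</schedulingMode>
    <minShare>2</minShare>
    <weight>2</weight>
  </pool>
  <!-- Default (OLAP) pool: no guaranteed minimum, normal weight. -->
  <pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <minShare>0</minShare>
    <weight>1</weight>
  </pool>
</allocations>
```

With minShare, the scheduler first satisfies each pool's minimum before distributing remaining resources by weight, which is what gives OLTP queries room to run alongside a long OLAP query.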



## Patch testing

Precheckin. We will run the Hydra tests and report the results.

## Other PRs 

SnappyDataInc/spark#63
SnappyDataInc/snappy-store#247
| Path | Last commit | Date |
| --- | --- | --- |
| .github | Renaming gemfirexd* jars to snappydata-store* jars. (#283) | Jun 24, 2016 |
| cluster | Fixes for issues found during concurrency testing (#730) | Jul 22, 2017 |
| core | Fixes for issues found during concurrency testing (#730) | Jul 22, 2017 |
| docs | Doc 0.9 (#679) | Jul 3, 2017 |
| dtests | Hydra Test coverage for alterTable (#733) | Jul 22, 2017 |
| dunit | Bump up the versions. SnappyData 0.9, RowStore 1.5.5, Spark 2.0.2.5… | Jun 13, 2017 |
| examples | Changes for Apache Spark 2.1.1 merge (#695) | Jul 11, 2017 |
| gradle/wrapper | [SNAP-606] Support for "spark.snappydata" properties (#231) | May 9, 2016 |
| python/pyspark | Corrected Python objects to correctly use SparkSession APIs. (#460) | Dec 9, 2016 |
| release | updating year in copyright header templates | Jan 24, 2017 |
| spark @ 9372c00 | Fixes for issues found during concurrency testing (#730) | Jul 22, 2017 |
| spark-jobserver @ 2d176a4 | Bump up the versions. SnappyData 0.9, RowStore 1.5.5, Spark 2.0.2.5… | Jun 13, 2017 |
| store @ 9549b32 | Fixes for issues found during concurrency testing (#730) | Jul 22, 2017 |
| tests | TPCH Changes to execute test using smart connector mode | Jul 20, 2017 |
| .gitignore | Move to Spark 2.0 (#276) | Aug 17, 2016 |
| .gitmodules | Updated .gitmodules with 2.1 branch | Jul 8, 2017 |
| LICENSE | Move to Spark 2.0 (#276) | Aug 17, 2016 |
| NOTICE | Move to Spark 2.0 (#276) | Aug 17, 2016 |
| README.md | Updating the readme files with the latest release version. | Jun 13, 2017 |
| ReleaseNotes.txt | Bump up the versions. SnappyData 0.9, RowStore 1.5.5, Spark 2.0.2.5… | Jun 13, 2017 |
| build.gradle | [SNAP-1777] increasing default member-timeout for SnappyData (#704) | Jul 13, 2017 |
| codeStyleSettings.xml | moving to mavenCentral() to jcenter() which is supposed to be faster … | Oct 10, 2015 |
| gradle.properties | Jdbc cdc streaming (#622) | Jun 8, 2017 |
| gradlew | Snap 1523 (#608) | May 30, 2017 |
| gradlew.bat | Snap 1523 (#608) | May 30, 2017 |
| mkdocs.yml | Doc 0.9 (#679) | Jul 3, 2017 |
| publish-site.sh | Some basic sanity put in place to fail publishing of docs when api do… | Feb 17, 2017 |
| scalastyle-config.xml | Adding support to run scalaStyle in product build (SNAP-120) | Jan 29, 2016 |
| settings.gradle | Update spark and store links and misc fixes | Jul 9, 2017 |

README.md

SnappyData fuses Apache Spark with an in-memory database to deliver a data engine capable of processing streams, transactions and interactive analytics in a single cluster.

## The Challenge with Spark and Remote Data Sources

Apache Spark is a general-purpose parallel computational engine for analytics at scale. At its core it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, it can also be quite inefficient and expensive: analytic processing requires massive data sets to be repeatedly copied and reformatted to suit Spark, and in many cases it ultimately fails to deliver on the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, the entire table must be streamed into Spark to perform the aggregation. And caching within Spark is immutable, so cached data goes stale and yields stale insights.

## The SnappyData Approach

At SnappyData, we take a very different approach. SnappyData fuses a low-latency, highly available in-memory transactional database (GemFireXD) into Spark, with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten). All query engine operators are significantly more optimized through better vectorization and code generation. The net effect is an order-of-magnitude performance improvement over native Spark caching, and more than two orders of magnitude better performance than Spark working with external data sources.

Essentially, we turn Spark into an in-memory operational database capable of transactions, point reads and writes, working with Spark streams, and running analytic SQL queries. Put another way, it is an in-memory, scale-out hybrid database that can execute Spark code, SQL, or even objects.

If you are already using Spark, you can experience a 20x speed-up in query performance. Try out this test.

## SnappyData Architecture

(Architecture diagram)

## Getting Started

We provide multiple options to get going with SnappyData. If you are already using Spark 2.0+, the easiest option is to add SnappyData as a package dependency. You can find more information on options for running SnappyData here.
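For example, with a stock Spark 2.0+ distribution, SnappyData can be pulled in as a Spark package at shell startup. The package coordinate below follows the spark-packages naming convention for the 0.9 release; verify it against the documentation linked above:

```
# Start the Spark shell with SnappyData added as a package dependency
$SPARK_HOME/bin/spark-shell --packages "SnappyData:snappydata:0.9-s_2.11"
```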

## Downloading and Installing SnappyData

You can download and install the latest version of SnappyData from the SnappyData Release page. Refer to the documentation for installation steps.

If you would like to build SnappyData from source, refer to the documentation on building from source.

## SnappyData in 5 Minutes!

Refer to the 5-minute guide, which is intended for both first-time and experienced SnappyData users. It provides references and common examples to help you get started quickly!

## Documentation

To understand SnappyData and its features, refer to the documentation.

## Community Support

We monitor the channels listed below for comments and questions:

- Stack Overflow
- Slack
- Gitter
- Mailing List
- Reddit
- JIRA

## Link with SnappyData Distribution

**Using a Maven dependency:** SnappyData artifacts are hosted in Maven Central. You can add a Maven dependency with the following coordinates:

```
groupId: io.snappydata
artifactId: snappydata-core_2.11
version: 0.9

groupId: io.snappydata
artifactId: snappydata-cluster_2.11
version: 0.9
```
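In a Maven `pom.xml`, these coordinates translate to a standard dependency stanza (shown here for the core artifact):

```xml
<dependency>
  <groupId>io.snappydata</groupId>
  <artifactId>snappydata-core_2.11</artifactId>
  <version>0.9</version>
</dependency>
```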

**Using sbt:** If you are using sbt, add this line to your `build.sbt` for the core SnappyData artifacts:

```
libraryDependencies += "io.snappydata" % "snappydata-core_2.11" % "0.9"
```

For additions related to the SnappyData cluster, use:

```
libraryDependencies += "io.snappydata" % "snappydata-cluster_2.11" % "0.9"
```

You can find more specific SnappyData artifacts here.

## Ad Analytics using SnappyData

Here is a streams + transactions + analytics use-case example illustrating both the SQL and the Spark programming approaches in SnappyData: the Ad Analytics code example. A screencast showcases many useful features of SnappyData. The example also includes a benchmark comparing SnappyData to a hybrid in-memory database and Cassandra.

## Contributing to SnappyData

If you are interested in contributing, please visit the community page for ways in which you can help.