# Kundera with Spark

## Apache Spark

Apache Spark is a fast and general-purpose cluster computing system. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including [Spark SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html) for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

## Support

Being a JPA provider, Kundera provides support for Spark. It allows users to perform read-write operations and SQL querying over data in Cassandra and MongoDB. Along with these databases, support for the file system (CSV/JSON) and HDFS is also included.
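As a quick illustration, reads, writes, and SQL queries all go through the standard JPA API. The sketch below is a minimal example: the persistence unit name `spark_cassandra_pu`, the `Person` entity, and the `spark_person` table are all assumptions for illustration, and the persistence unit is presumed to be configured in `persistence.xml` with Kundera as the JPA provider and the spark-cassandra module on the classpath.

```java
import java.util.List;

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;
import javax.persistence.Table;

// Illustrative entity; the class and table names are assumptions.
@Entity
@Table(name = "spark_person")
class Person {
    @Id
    private String personId;
    private String personName;

    public Person() {
    }

    public Person(String personId, String personName) {
        this.personId = personId;
        this.personName = personName;
    }
}

public class KunderaSparkExample {
    public static void main(String[] args) {
        // "spark_cassandra_pu" is a hypothetical persistence unit that must be
        // defined in persistence.xml with Kundera as the JPA provider.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("spark_cassandra_pu");
        EntityManager em = emf.createEntityManager();

        // Write: a plain JPA persist, executed through Spark.
        em.persist(new Person("1", "dev"));

        // Read: a plain JPA find by primary key.
        Person person = em.find(Person.class, "1");

        // SQL querying: a native query handed down to Spark SQL.
        List<?> results = em.createNativeQuery("select * from spark_person").getResultList();

        em.close();
        emf.close();
    }
}
```

Because Kundera is a JPA provider, no Spark-specific API appears in application code; switching between the Cassandra and MongoDB modules is a matter of configuration rather than code changes.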

Why no Update and Delete:

Spark does not provide update or delete operations over data. It is a data-processing tool used for processing huge amounts of data, and most of its use cases relate to analytics. So, keeping Spark's philosophy in mind, Kundera supports only read-write operations and querying over data.

Kundera provides four modules for Spark (a sketch of their Maven coordinates follows the list):

* spark-core: This is the core module and is mandatory for using kundera-spark. It also covers the HDFS and file system (CSV/JSON) support.
* spark-cassandra: This module is designed for Cassandra. It can be used to read-write and run SQL queries over data in Cassandra.
* spark-mongodb: Similarly, this module is designed for MongoDB.
* spark-teradata: This module is designed for performing read/write operations with Teradata. Currently only read operations are supported; write support will be added soon. To use this module, add the [Teradata JDBC driver](http://downloads.teradata.com/download/connectivity/jdbc-driver) to the classpath so that Spark can read through it.
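The modules are consumed as ordinary Maven dependencies. The snippet below is a sketch only: the artifact IDs follow the module names above, but the group ID, artifact names, and version property are assumptions and should be verified against the Kundera POMs or Maven Central.

```xml
<!-- Sketch only: groupId, artifactIds, and version are assumptions;
     verify against the Kundera POMs or Maven Central. -->
<dependency>
    <groupId>com.impetus.kundera.client</groupId>
    <artifactId>spark-core</artifactId>
    <version>${kundera.version}</version>
</dependency>
<dependency>
    <!-- Add the module for your datastore, e.g. spark-cassandra or spark-mongodb. -->
    <groupId>com.impetus.kundera.client</groupId>
    <artifactId>spark-cassandra</artifactId>
    <version>${kundera.version}</version>
</dependency>
```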