---
tags:
  - Enterprise Option
  - Private Preview
displayed_sidebar: docsEnglish
---

Configuration of ScalarDB Analytics with Spark

import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';

:::warning

This version of ScalarDB Analytics with Spark was in private preview. Please use version 3.14 or later instead.

:::

There are two ways to configure ScalarDB Analytics with Spark:

  • By configuring the properties in spark.conf
  • By using the helper method that ScalarDB Analytics with Spark provides

Both approaches are conceptually equivalent, so you can choose either one based on your preference.

Configure ScalarDB Analytics with Spark by using spark.conf

Since ScalarDB Analytics with Spark is provided as a Spark custom catalog plugin, you can enable it by setting the following properties in spark.conf:

spark.sql.catalog.scalardb_catalog = com.scalar.db.analytics.spark.datasource.ScalarDbCatalog
spark.sql.catalog.scalardb_catalog.config = /<PATH_TO_YOUR_SCALARDB_PROPERTIES>/config.properties
spark.sql.catalog.scalardb_catalog.namespaces = <YOUR_NAMESPACE_NAME_1>,<YOUR_NAMESPACE_NAME_2>
spark.sql.catalog.scalardb_catalog.license.key = {"your":"license", "key":"in", "json":"format"}
spark.sql.catalog.scalardb_catalog.license.cert_path = /<PATH_TO_YOUR_LICENSE>/cert.pem

:::note

The scalardb_catalog part is a configurable catalog name. You may choose any name you prefer.

:::
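
If you prefer not to edit spark.conf directly, the same properties can also be set programmatically when the SparkSession is created. The following is a minimal sketch in Scala, assuming the same placeholder values as above; the object name YourApp is only illustrative, and each .config(...) call uses Spark's standard builder API to pass the exact key/value pairs listed in the next section.

import org.apache.spark.sql.SparkSession

object YourApp {
  def main(args: Array[String]): Unit = {
    // Sketch: each .config(...) call sets the same key/value pair you would otherwise put in spark.conf.
    val spark = SparkSession.builder
      .appName("<YOUR_APPLICATION_NAME>")
      .config("spark.sql.catalog.scalardb_catalog", "com.scalar.db.analytics.spark.datasource.ScalarDbCatalog")
      .config("spark.sql.catalog.scalardb_catalog.config", "/<PATH_TO_YOUR_SCALARDB_PROPERTIES>/config.properties")
      .config("spark.sql.catalog.scalardb_catalog.namespaces", "<YOUR_NAMESPACE_NAME_1>,<YOUR_NAMESPACE_NAME_2>")
      .config("spark.sql.catalog.scalardb_catalog.license.key", """{"your":"license", "key":"in", "json":"format"}""")
      .config("spark.sql.catalog.scalardb_catalog.license.cert_path", "/<PATH_TO_YOUR_LICENSE>/cert.pem")
      .getOrCreate()

    // The catalog is now available in this session; import the schemas (see the "Importing schemas" section) before querying.
    spark.stop()
  }
}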

Available properties

The following is a list of available properties for ScalarDB Analytics with Spark:

| Property name | Required | Description |
|---------------|----------|-------------|
| spark.sql.catalog.{catalog_name} | Yes | Must be com.scalar.db.analytics.spark.datasource.ScalarDbCatalog |
| spark.sql.catalog.{catalog_name}.config | Yes | Path to the ScalarDB configuration file |
| spark.sql.catalog.{catalog_name}.namespaces | Yes | Comma-separated list of ScalarDB namespaces to import to the Spark side |
| spark.sql.catalog.{catalog_name}.license.key | Yes | Your license key in the JSON format |
| spark.sql.catalog.{catalog_name}.license.cert_path | Either this or license.cert_pem is required | Path to your license certificate file |
| spark.sql.catalog.{catalog_name}.license.cert_pem | Either this or license.cert_path is required | Your license certificate in the PEM format |
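
Note that license.cert_path and license.cert_pem are alternatives: the former points at a certificate file, while the latter carries the certificate text itself. Because a multi-line PEM value is awkward to place in spark.conf, one option is to read the certificate in application code and set license.cert_pem programmatically (for example, through SparkSession.builder.config as sketched above). The snippet below is a fragment that only shows loading the PEM text with plain Java NIO; the path is the same placeholder used earlier.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Sketch: load the certificate text so it can be passed via the license.cert_pem property.
val certPem = new String(
  Files.readAllBytes(Paths.get("/<PATH_TO_YOUR_LICENSE>/cert.pem")),
  StandardCharsets.UTF_8)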

Importing schemas

After properly setting spark.conf, you should have a catalog in your Spark environment that contains tables connected to the underlying databases of ScalarDB. However, this catalog exposes the raw tables, which include the transaction metadata managed by ScalarDB, while you are typically interested only in the application-managed data without that metadata.

For this purpose, ScalarDB Analytics with Spark provides the SchemaImporter class, which creates views that interpret the transaction metadata and expose only the application-managed data. These views have schemas equivalent to those of the ScalarDB tables, so you can use the views as if they were ScalarDB tables. The following is an example of how to run SchemaImporter against the catalog configured above.

import com.scalar.db.analytics.spark.view.SchemaImporter;
import org.apache.spark.sql.SparkSession;

public class YourApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("<YOUR_APPLICATION_NAME>").getOrCreate();
    // Import ScalarDB table schemas from the catalog named "scalardb_catalog"
    new SchemaImporter(spark, "scalardb_catalog").run();
    spark.sql("select * from <YOUR_NAMESPACE_NAME_1>.<YOUR_TABLE_NAME>").show();
    spark.stop();
  }
}
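
After SchemaImporter has run, you can check that the views were created before querying them. The fragment below is a small sketch that reuses the spark session and namespace placeholder from the example above; SHOW TABLES is standard Spark SQL, and the three-part name in the comment just follows Spark's usual catalog.namespace.table convention for reaching the raw tables through the catalog itself.

// Sketch: list the views that SchemaImporter created for the namespace.
spark.sql("SHOW TABLES IN <YOUR_NAMESPACE_NAME_1>").show()
// The raw tables, including transaction metadata, remain reachable through the catalog,
// for example: scalardb_catalog.<YOUR_NAMESPACE_NAME_1>.<YOUR_TABLE_NAME>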

Configure ScalarDB Analytics with Spark by using the helper method

ScalarDB Analytics with Spark provides a helper method that sets up everything needed to run analytical queries, including configuring the catalog and importing the schemas. Because the helper method runs in application code, it is also useful for quick tests that require no prior configuration.

The helper method is provided for both Java and Scala. In Java, you can use ScalarDbAnalyticsInitializer to specify options that are equivalent to the properties in spark.conf, as follows:

import com.scalar.db.analytics.spark.ScalarDbAnalyticsInitializer;
import org.apache.spark.sql.SparkSession;

public class YourApp {
  public static void main(String[] args) {
    // Initialize SparkSession as usual
    SparkSession spark = SparkSession.builder().appName("<YOUR_APPLICATION_NAME>").getOrCreate();
    // Set up ScalarDB Analytics with Spark via the helper class
    ScalarDbAnalyticsInitializer
      .builder()
      .spark(spark)
      .configPath("/<PATH_TO_YOUR_SCALARDB_PROPERTIES>/config.properties")
      .namespace("<YOUR_NAMESPACE_NAME_1>")
      .namespace("<YOUR_NAMESPACE_NAME_2>")
      .licenseKey("{\"your\":\"license\", \"key\":\"in\", \"json\":\"format\"}")
      .licenseCertPath("/<PATH_TO_YOUR_LICENSE>/cert.pem")
      .build()
      .run();
    // Run arbitrary queries
    spark.sql("select * from <YOUR_NAMESPACE_NAME_1>.<YOUR_TABLE_NAME>").show();
    // Stop SparkSession
    spark.stop();
  }
}

In Scala, the setupScalarDbAnalytics method is available as an extension method of SparkSession:

import com.scalar.db.analytics.spark.implicits._
import org.apache.spark.sql.SparkSession

object YourApp {
  def main(args: Array[String]): Unit = {
    // Initialize SparkSession as usual
    val spark = SparkSession.builder.appName("<YOUR_APPLICATION_NAME>").getOrCreate()
    // Set up ScalarDB Analytics with Spark via the helper method
    spark.setupScalarDbAnalytics(
      // ScalarDB config file
      configPath = "/<PATH_TO_YOUR_SCALARDB_PROPERTIES>/config.properties",
      // Namespaces in ScalarDB to import
      namespaces = Set("<YOUR_NAMESPACE_NAME_1>", "<YOUR_NAMESPACE_NAME_2>"),
      // License information
      license = License.certPath("""{"your":"license", "key":"in", "json":"format"}""", "/<PATH_TO_YOUR_LICENSE>/cert.pem")
    )
    // Run arbitrary queries
    spark.sql("select * from <YOUR_NAMESPACE_NAME_1>.<YOUR_TABLE_NAME>").show()
    // Stop SparkSession
    spark.stop()
  }
}
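
The snippets above hard-code the license key for brevity. In a real application, you may prefer to read it from the environment instead; the fragment below is a sketch that slots into the Scala example above, where the variable name SCALARDB_LICENSE_KEY is only illustrative and everything else stays the same.

// Sketch: read the license key from an environment variable (the variable name is illustrative).
val licenseKey = sys.env.getOrElse("SCALARDB_LICENSE_KEY", sys.error("SCALARDB_LICENSE_KEY is not set"))

spark.setupScalarDbAnalytics(
  configPath = "/<PATH_TO_YOUR_SCALARDB_PROPERTIES>/config.properties",
  namespaces = Set("<YOUR_NAMESPACE_NAME_1>", "<YOUR_NAMESPACE_NAME_2>"),
  license = License.certPath(licenseKey, "/<PATH_TO_YOUR_LICENSE>/cert.pem")
)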