Netezza Connector for Apache Spark
Scala Shell Python
Latest commit 00b3d65 Jul 22, 2016 @sureshthalamati sureshthalamati ISSUE-10 Adding meta information for registering netezza dasource imp…
…lemention for DataSourceRegister, this helps in using short name/alias
Permalink
Failed to load latest commit information.
.github Add pull request template with contributor agreement Mar 5, 2016
build
dev Run unit test from dev/run-tests Nov 30, 2015
project
src
.gitattributes
.gitignore Update git configurations Mar 1, 2016
.travis.yml
LICENSE
NOTICE
README.md Update readme with project descriptive name Jun 15, 2016
build.sbt Update version to 1.0.0-SNAPSHOT Jul 22, 2016
scalastyle-config.xml Moving scalastyle config to correct location to work with lint-scala,… Jun 3, 2016

README.md

Netezza Connector for Apache Spark

A data source library to load data into Apache Spark using SQL DataFrames from IBM® Netezza® database. This data source library is implemented using Netezza external table mechanism to transfer data from the Netezza host system to the Apache Spark system optimally.

Binary download:

You can download the Netezza Connector for Apache Spark assembly jars from here:

Apache Spark Version Release # Binary Location
1.5.2+ v0.1.1 spark-netezza-assembly-0.1.1.jar

Usage

Specify Package

Spark-netezza connector is a registered package on spark-packages site, so it supports command line option --packages to resolve maven dependencies. You can use spark-shell, spark-sql, pyspark or spark-submit based on your scenario to invoke spark-netezza connector. Dependency on netezza jdbc driver needs to be provided.

For example, for spark-shell:

spark-shell --packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 --driver-class-path /path/to/nzjdbc.jar

Data Sources API

You can use spark-netezza connector via the Apache Spark Data Sources API in Scala, Java, Python or SQL, as follows:

Scala

import org.apache.spark.sql._

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

val opts = Map("url" -> "jdbc:netezza://netezzahost:5480/database",
        "user" -> "username",
        "password" -> "password",
        "dbtable" -> "tablename",
        "numPartitions" -> "4")
val df = sqlContext.read.format("com.ibm.spark.netezza")
  .options(opts).load()

Java

import org.apache.spark.sql.*;

JavaSparkContext sc = // existing SparkContext
SQLContext sqlContext = new SQLContext(sc);

Map<String, String> opts = new HashMap<>();
opts.put("url", "jdbc:netezza://netezzahost:5480/database");
opts.put("user", "username");
opts.put("password", "password");
opts.put("dbtable", "tablename");
opts.put("numPartitions", "4");
DataFrame df = sqlContext.read()
             .format("com.ibm.spark.netezza")
             .options(opts)
             .load();

Python

from pyspark.sql import SQLContext

sc = # existing SparkContext
sqlContext = SQLContext(sc)

df = sqlContext.read \
  .format('com.ibm.spark.netezza') \
  .option('url','jdbc:netezza://netezzahost:5480/database') \
  .option('user','username') \
  .option('password','password') \
  .option('dbtable','tablename') \
  .option('numPartitions','4') \
  .load()

SQL

CREATE TABLE my_table
USING com.ibm.spark.netezza
OPTIONS (
  url 'jdbc:netezza://netezzahost:5480/database',
  user 'username',
  password 'password',
  dbtable 'tablename'
);

Configuration Overview

Configuration can be passed on DataFrame using option:

Name Description
url The url to connect with JDBC driver.
user Username to connect to the database.
password Password to connect to the database.
dbtable The table in the database to load.
numPartitions Number of partitions to specify to parallely execute data movement.

Building From Source

Scala 2.10

spark-netezza build supports Scala 2.10 by default, if Scala 2.11 artifact is needed, please refer to Scala 2.11 or Version Cross Build

Building General Artifacts

To generate regular binary, in the root directory run:

build/sbt package

To generate assembly jar, in the root directory run:

build/sbt assembly

The artifacts will be generated to:

spark-netezza/target/scala-{binary.version}/

To run the tests, in the root directory run:

build/sbt test

Scala 2.11

To build against scala 2.11, use '++' option with desired version number, for example:

build/sbt ++2.11.7 assembly
build/sbt ++2.11.7 test

Version Cross Build

To produce artifacts for both scala 2.10 and 2.11:

Start SBT:

 build/sbt

Run in the SBT shell:

 + package
 + test

Using sbt-spark-package Plugin

spark-netezza connector supports sbt-spark-package plugin, to publish to local ivy repository, run:

build/sbt spPublishLocal