
Play around with spark

Pan Deng edited this page May 20, 2018 · 1 revision

Cluster mode

spark = (SparkSession
    .builder
    .master(<spark_host>)   # standalone mode on cluster
    .appName("meta_info")
    .getOrCreate())

For Jupyter Notebook with Python 3, the following environment variables are required:

from os import environ
environ['PYSPARK_PYTHON']='/home/ubuntu/anaconda3/bin/python'
environ['PYSPARK_DRIVER_PYTHON']='/home/ubuntu/anaconda3/bin/jupyter'

spark_host (the master URL of the standalone cluster) can be found on the Spark cluster web UI.

XML reader

Download the spark-xml reader:

wget http://search.maven.org/remotecontent?filepath=com/databricks/spark-xml_2.10/0.4.1/spark-xml_2.10-0.4.1.jar -O $SPARK_HOME/jars/spark-xml_2.10-0.4.1.jar

- To use the package with pyspark shell:

pyspark --packages com.databricks:spark-xml_2.10:0.4.1

- To use the package with Jupyter Notebook:

from os import environ
environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell' 

PostgreSQL interface

wget https://jdbc.postgresql.org/download/postgresql-42.2.2.jar -O $SPARK_HOME/jars/postgresql-42.2.2.jar

(or /usr/local/spark/jars)

- To use the package with pyspark shell:

pyspark --jars /usr/local/spark/jars/postgresql-42.2.2.jar

- To use the package with Jupyter Notebook:

???
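One plausible approach, by analogy with the XML package above (untested here, and an assumption rather than part of the original notes): pass the driver jar to PySpark through PYSPARK_SUBMIT_ARGS before the SparkSession is created.

```python
from os import environ

# Assumption: the jar was downloaded to /usr/local/spark/jars as above.
# Must be set before the SparkSession/SparkContext is created.
environ['PYSPARK_SUBMIT_ARGS'] = \
    '--jars /usr/local/spark/jars/postgresql-42.2.2.jar pyspark-shell'
```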

Connect to PostgreSQL via JDBC:

jdbcDF = spark.read \
    .format('jdbc') \
    .option('url', 'jdbc:postgresql://<instance_name>.<user_name>.us-east-1.rds.amazonaws.com:5432/<dbname>') \
    .option('dbtable', __credential__.table_name) \
    .option('user', __credential__.user) \
    .option('password', __credential__.password) \
    .load()

(Ref: https://aws.amazon.com/getting-started/tutorials/create-connect-postgresql-db/)
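For readability, the JDBC URL above can be assembled from its parts. The helper below is purely illustrative (the function name and parameters are not from the original; it just mirrors the placeholder endpoint layout used in the snippet):

```python
def rds_jdbc_url(instance, user, region, dbname, port=5432):
    """Build a PostgreSQL JDBC URL for an RDS endpoint of the form
    <instance>.<user>.<region>.rds.amazonaws.com, matching the
    placeholder layout in the snippet above."""
    host = '{}.{}.{}.rds.amazonaws.com'.format(instance, user, region)
    return 'jdbc:postgresql://{}:{}/{}'.format(host, port, dbname)

url = rds_jdbc_url('mydb-instance', 'alice', 'us-east-1', 'postgres')
```

The resulting string is what gets passed to `.option('url', ...)`.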

Redshift interface


Instructions:

https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver

https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/

http://grepcode.com/file/repo1.maven.org/maven2/com.amazonaws/aws-java-sdk-s3/1.10.6/com/amazonaws/services/s3/AmazonS3URI.java

All of the following packages are required:

wget http://repo1.maven.org/maven2/com/databricks/spark-redshift_2.11/3.0.0-preview1/spark-redshift_2.11-3.0.0-preview1.jar -O $SPARK_HOME/jars/spark-redshift_2.11-3.0.0-preview1.jar
wget http://repo1.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar -O $SPARK_HOME/jars/spark-avro_2.11-4.0.0.jar
wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.12.1017/RedshiftJDBC41-1.2.12.1017.jar -O $SPARK_HOME/jars/RedshiftJDBC41-1.2.12.1017.jar
wget https://github.com/ralfstx/minimal-json/releases/download/0.9.5/minimal-json-0.9.5.jar -O $SPARK_HOME/jars/minimal-json-0.9.5.jar

Note: spark-redshift_2.11-3.0.0-preview1.jar is the only version of spark-redshift that does not cause an invalid S3 endpoint URI error or java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class.
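For use from Jupyter Notebook, all of these jars would presumably need to reach PySpark together; a minimal sketch (an assumption, not from the original notes -- it only assembles the --jars argument, assuming the jars were downloaded into $SPARK_HOME/jars as above):

```python
from os import environ
from os.path import join

# Fall back to /usr/local/spark when $SPARK_HOME is not set.
spark_home = environ.get('SPARK_HOME', '/usr/local/spark')
jars = [
    'spark-redshift_2.11-3.0.0-preview1.jar',
    'spark-avro_2.11-4.0.0.jar',
    'RedshiftJDBC41-1.2.12.1017.jar',
    'minimal-json-0.9.5.jar',
]
# --jars takes a single comma-separated list of paths.
jar_paths = ','.join(join(spark_home, 'jars', j) for j in jars)
environ['PYSPARK_SUBMIT_ARGS'] = '--jars {} pyspark-shell'.format(jar_paths)
```

As with the XML package, this must run before the SparkSession is created.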