# Table of Contents
1. [Introduction](#Introduction)
2. [Instructions for Running This Test Workbook](#Instructions-for-Running-This-Test-Workbook)
3. [Test importing of `R` modules](#Test-importing-of-R-modules)
4. [Set _CONSTANTS_](#Set-CONSTANTS)
5. [Test `SparkContext` & `SQLContext`](#Test-SparkContext-&-SQLContext)

# Introduction

This workbook tests the setup necessary to run our Relational Data tutorials in __`R`__ and __`SparkR`__ (the `R` API of `Apache Spark`).

We will test your installation of the software listed below. If you are running on your personal computer, please follow the [__instructions__ here](https://github.com/ChicagoBoothAnalytics/RelationalData) to install it.

- __`R`__ v3.2.2 or later;

- __`lubridate`__ `R` package for manipulating date-time data;

- __`stringr`__ `R` package for manipulating character string data;

- __`data.table`__ `R` package;

- __`RPostgreSQL`__ `R` package for working with `PostgreSQL` databases through `R`;

- __`sqldf`__ `R` package for running `SQL` commands in `R`;

- __`Apache Spark` v1.6.0 or later__, pre-built for Hadoop 2.6 or later;

- __`ggplot2`__ `R` package for visualization.
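
Before running the full workbook, the `R`-side requirements above can be checked without attaching anything. The following is a minimal sketch, not part of the original workbook: the variable names are illustrative, and it only covers the `R` version and the `R` packages listed above (not `Apache Spark`).

```r
# R packages listed in the requirements above
required_packages <- c('lubridate', 'stringr', 'data.table',
                       'RPostgreSQL', 'sqldf', 'ggplot2')

# requireNamespace() returns TRUE / FALSE instead of raising an error
is_installed <- vapply(required_packages, requireNamespace,
                       logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

# the tutorials ask for R v3.2.2 or later
r_version_ok <- getRversion() >= '3.2.2'

cat('R version OK:', r_version_ok, '\n')
if (length(missing_packages) == 0) {
    cat('All required R packages are installed.\n')
} else {
    cat('Missing R packages:', paste(missing_packages, collapse = ', '), '\n')
}
```

Any package reported as missing can then be installed with `install.packages()` before continuing.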

# Instructions for Running This Test Workbook

If you are running on your personal Mac or Windows computer, __edit the values of the `SPARK_HOME` and `WINDOWS_JAVA_HOME` constants__ below to match the setup on your computer (`WINDOWS_JAVA_HOME` matters on Windows only).

In the navigation bar, go to __Cell__ and press __Run All__. This will run all of the commands below.

__WHAT SUCCESS LOOKS LIKE__: all workbook cells run __without error messages__ _(note: warning messages are okay)_.
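
To make the difference between a warning and an error concrete, here is a small illustrative sketch. The helper `run_check` is hypothetical and not used elsewhere in this workbook: it evaluates an expression, collects any warnings and lets evaluation continue, and reports `'error'` only when the expression actually fails.

```r
# hypothetical helper: classify an expression as 'ok' or 'error',
# recording warnings without treating them as failures
run_check <- function(expr) {
    warnings_seen <- character(0)
    status <- tryCatch({
        withCallingHandlers(
            expr,
            warning = function(w) {
                # remember the warning, then carry on with the evaluation
                warnings_seen <<- c(warnings_seen, conditionMessage(w))
                invokeRestart('muffleWarning')
            })
        'ok'
    }, error = function(e) 'error')
    list(status = status, warnings = warnings_seen)
}

# a warning alone still counts as success
str(run_check(as.integer('a')))

# an error does not
str(run_check(stop('something went wrong')))
```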

# Test importing of `R` modules

In [1]:
library(data.table)
library(lubridate)
library(RPostgreSQL)
library(sqldf)
library(stringr)


Attaching package: ‘lubridate’

The following objects are masked from ‘package:data.table’:

    hour, mday, month, quarter, wday, week, yday, year

Loading required package: DBI
Loading required package: gsubfn
Loading required package: proto
In doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
  dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
  Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
  Reason: image not found
Could not load tcltk.  Will use slower R code instead.
Loading required package: RSQLite
sqldf will default to using PostgreSQL


In [2]:
# import visualization packages
library(ggplot2)

# Set _CONSTANTS_

In [3]:
# detect system platform
SYS_NAME <- Sys.info()['sysname']

# detect if running on Amazon Web Services (AWS) Elastic MapReduce (EMR), local Mac or local Windows
# (note: any Linux system is assumed to be an AWS EMR master node)
AWS_EMR_MODE <- SYS_NAME == 'Linux'
MAC_OSX_LOCAL_MODE <- SYS_NAME == 'Darwin'
WINDOWS_LOCAL_MODE <- SYS_NAME == 'Windows'

# set Apache Spark-related constants, depending on AWS_EMR_MODE
if (AWS_EMR_MODE) {                                   # if running Spark on AWS Elastic MapReduce (EMR) YARN cluster
    SPARK_MODE <- 'yarn-client'
    SPARK_HOME <- '/usr/lib/spark'                    # default Spark installation folder on AWS EMR master node
} else {                                              # if running Spark on single machine
    SPARK_MODE <- 'local[*]'
    if (MAC_OSX_LOCAL_MODE) {                         # if running on Mac OS X
        SPARK_HOME <- '/Applications/spark-1.6.0'     # *** CHANGE TO SUIT YOUR PERSONAL COMPUTER ***
    } else if (WINDOWS_LOCAL_MODE) {    # if running on Windows
        SPARK_HOME <- 'C:/Applications/spark-1.6.0'   # *** CHANGE TO SUIT YOUR PERSONAL COMPUTER ***
    }   
}                                              

WINDOWS_JAVA_HOME <- 'C:/Program Files/Java/jre1.8.0_74'   # *** CHANGE TO SUIT YOUR (WINDOWS) COMPUTER ***
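
Since `SPARK_HOME` has to be edited by hand on a personal computer, it can help to confirm that the folder looks like a Spark installation before starting Spark. The sketch below is an assumption-laden illustration: `check_spark_home` is a hypothetical helper, and the two paths it checks are simply the ones this workbook relies on (`bin/spark-submit` and the bundled `R/lib/SparkR` package).

```r
# hypothetical helper: report which expected Spark paths exist under spark_home
check_spark_home <- function(spark_home) {
    expected <- c(
        spark_submit = file.path(spark_home, 'bin', 'spark-submit'),
        sparkr_lib   = file.path(spark_home, 'R', 'lib', 'SparkR'))
    vapply(expected, file.exists, logical(1))
}

# example usage with the Mac default from this workbook;
# replace the path with your own SPARK_HOME value
print(check_spark_home('/Applications/spark-1.6.0'))
```

If either entry is `FALSE`, the `SPARK_HOME` value most likely points to the wrong folder.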

# Test `SparkContext` & `SQLContext`

In [4]:
if (!exists('sc')) {
    
    # set / clean up environment variables for Spark
    Sys.setenv(SPARK_HOME=SPARK_HOME)
    
    if (AWS_EMR_MODE) {
        Sys.setenv(SPARKR_SUBMIT_ARGS='"--packages" "com.databricks:spark-csv_2.10:1.3.0" "sparkr-shell"')
    } else if (WINDOWS_LOCAL_MODE) {
        Sys.setenv(JAVA_HOME=WINDOWS_JAVA_HOME)
    }
    
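    # remove leftover Derby / Hive metastore files (unlink works on all platforms, unlike 'rm')
    unlink('derby.log')
    unlink('metastore_db', recursive=TRUE)
    
    cat("Removing any existing 'SPARK_CLASSPATH' environment variable: ")
    spark_classpath <- Sys.getenv('SPARK_CLASSPATH')
    if (spark_classpath == '') {
        cat('done!\n')
    } else {
        Sys.unsetenv('SPARK_CLASSPATH')   # actually remove the variable, as announced above
        cat(spark_classpath, 'done!\n')
    }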
    
    # add SparkR to library search path
    .libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
    
    # load SparkR and set up SparkContext & SQLContext / HiveContext
    library(SparkR)
    
    sc <- sparkR.init(
        master=SPARK_MODE,
        appName='Test-R-Software_SetUps',
        sparkHome=SPARK_HOME)

    if (WINDOWS_LOCAL_MODE) {
        # if running on Windows, use SQLContext because HiveContext does not work on Windows
        sqlc <- sparkRSQL.init(sc)
    } else {
        # if running on Mac or AWS EMR YARN, use more advanced HiveContext
        # ref: http://stackoverflow.com/questions/33666545/what-is-difference-between-apache-spark-sqlcontext-vs-hivecontext
        sqlc <- sparkRHive.init(sc)
    }
}

cat('SparkContext:')
sc

cat('\nSQLContext / HiveContext:')
sqlc

Removing any existing 'SPARK_CLASSPATH' environment variable: done!



Attaching package: ‘SparkR’

The following object is masked from ‘package:RSQLite’:

    summary

The following object is masked from ‘package:RPostgreSQL’:

    summary

The following objects are masked from ‘package:lubridate’:

    hour, intersect, minute, month, quarter, second, year

The following objects are masked from ‘package:data.table’:

    between, hour, last, like, month, quarter, tables, year

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from ‘package:base’:

    colnames, colnames<-, intersect, rank, rbind, sample, subset,
    summary, table, transform



Launching java with spark-submit command /Applications/spark-1.6.0/bin/spark-submit   sparkr-shell /var/folders/_r/pclm3n6x2b54y_1frhn2jh3c0000gn/T//RtmpEGyrJ9/backend_port2e3a1aaecbe7 
SparkContext:

Java ref type org.apache.spark.api.java.JavaSparkContext id 0 


SQLContext / HiveContext:

Java ref type org.apache.spark.sql.hive.HiveContext id 2 