
# SparkR tutorial Notebook
This is a notebook version of the [SparkR Documentation](http://spark.apache.org/docs/2.4.0/sparkr.html).
## Overview
SparkR is an R package that provides a lightweight frontend for using Apache Spark from R. In Spark 2.4.0, SparkR provides a distributed data frame implementation that supports operations such as selection, filtering, and aggregation (similar to R data frames and dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.

# SparkDataFrame
A SparkDataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood. SparkDataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing local R data frames.

All of the examples on this page use sample data included in R or the Spark distribution and can be run using the ./bin/sparkR shell.

## Starting Up: SparkSession
You can start SparkR from the Jupyter R kernel, and you can also connect your R program to a Spark cluster from RStudio, the R shell, Rscript, or other R IDEs. To start, make sure SPARK_HOME is set in the environment (you can check it with Sys.getenv), load the SparkR package, and call sparkR.session as shown below. The call checks for a Spark installation and, if none is found, downloads and caches Spark automatically. Alternatively, you can run install.spark manually.

In addition to calling sparkR.session, you can specify certain Spark driver properties. Normally these Application Properties and Runtime Environment settings cannot be set programmatically, because the driver JVM process has already been started by that point; in this case SparkR takes care of it for you. To set them, pass them as you would other configuration properties in the sparkConfig argument to sparkR.session().

In [33]:
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "/home/spark")
}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))

Java ref type org.apache.spark.sql.SparkSession id 1 

The following Spark driver properties can be set in sparkConfig with sparkR.session from RStudio:

| Property Name | Property group | spark-submit equivalent |
| --- | --- | --- |
| spark.master | Application Properties | --master |
| spark.yarn.keytab | Application Properties | --keytab |
| spark.yarn.principal | Application Properties | --principal |
| spark.driver.memory | Application Properties | --driver-memory |
| spark.driver.extraClassPath | Runtime Environment | --driver-class-path |
| spark.driver.extraJavaOptions | Runtime Environment | --driver-java-options |
| spark.driver.extraLibraryPath | Runtime Environment | --driver-library-path |
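
As a minimal sketch of the mechanism described above, the following passes two of the driver properties from the table through sparkConfig and then reads one of them back with sparkR.conf. It assumes that no SparkSession is running yet (driver properties only take effect when the driver JVM is launched) and that the SparkR package is on the R library path. The property values are illustrative only.

```r
library(SparkR)

# Driver properties go into sparkConfig like any other configuration property.
# The values below are illustrative assumptions, not recommendations.
sparkR.session(
  master = "local[*]",
  sparkConfig = list(
    spark.driver.memory = "2g",
    spark.driver.extraJavaOptions = "-Duser.timezone=UTC"
  )
)

# sparkR.conf returns a named list; "unset" is a fallback used only if the key is missing
driver_mem <- sparkR.conf("spark.driver.memory", "unset")
driver_mem
```

If a session already exists in the same R process, stopping it first with sparkR.session.stop() may be needed before new driver properties can apply.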

## Creating SparkDataFrames
With a SparkSession, applications can create SparkDataFrames from a local R data frame, from a Hive table, or from other data sources.

### From local data frames
The simplest way to create a SparkDataFrame is to convert a local R data frame. Specifically, we can pass the local R data frame to as.DataFrame or createDataFrame. As an example, the following creates a SparkDataFrame from the faithful dataset in R.

In [34]:
df <- as.DataFrame(faithful)

# Displays the first part of the SparkDataFrame
head(df)
##  eruptions waiting
##1     3.600      79
##2     1.800      54
##3     3.333      74

eruptions,waiting
3.6,79
1.8,54
3.333,74
2.283,62
4.533,85
2.883,55
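
The paragraph above also mentions createDataFrame. The sketch below uses it in place of as.DataFrame and then brings the rows back into a local data frame with collect. It assumes SparkR is on the R library path and that a local session is acceptable. The variable names are arbitrary.

```r
library(SparkR)
sparkR.session()

# createDataFrame accepts a local R data frame, just like as.DataFrame
sdf <- createDataFrame(faithful)

# These calls operate on the distributed SparkDataFrame
nrow(sdf)
columns(sdf)

# collect() returns the rows as an ordinary local R data.frame
local_copy <- collect(sdf)
class(local_copy)
```

Note that collect pulls every row to the driver, so it is only appropriate when the result is small enough to fit in local memory.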


### From Data Sources
SparkR supports operating on a variety of data sources through the SparkDataFrame interface. This section describes the general methods for loading and saving data using Data Sources. You can check the Spark SQL programming guide for more specific options that are available for the built-in data sources.

The general method for creating SparkDataFrames from data sources is read.df. This method takes the path of the file to load and the type of data source; the currently active SparkSession is used automatically. SparkR supports reading JSON, CSV, and Parquet files natively, and through packages available from sources like Third Party Projects you can find data source connectors for popular file formats like Avro. These packages can be added either by specifying --packages with the spark-submit or sparkR commands, or by initializing the SparkSession with the sparkPackages parameter when in an interactive R shell or RStudio.

In [35]:
sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")

Java ref type org.apache.spark.sql.SparkSession id 1 

We can see how to use data sources with an example JSON input file. Note that the file used here is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object. For more information, see the JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.

In [36]:
people <- read.df("/opt/spark/examples/src/main/resources/people.json", "json")
head(people)

age,name
,Michael
30.0,Andy
19.0,Justin


SparkR automatically infers the schema from the JSON file.

In [37]:
printSchema(people)

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)


Similarly, multiple files can be read with read.json.

In [38]:
people <- read.json(c("./examples/src/main/resources/people.json", "./examples/src/main/resources/people.json"))
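
To make the JSON Lines requirement concrete, the sketch below writes a tiny file with one complete JSON object per line and loads it with read.df. The records and the temporary path are made up for illustration. It assumes Spark runs in local mode, so the driver can read a file under R's temporary directory.

```r
library(SparkR)
sparkR.session()

# Hypothetical sample records: each element is one self-contained JSON object
json_lines <- c(
  '{"name": "Ann", "age": 41}',
  '{"name": "Raj"}'
)

# Write one object per line, which is the JSON Lines layout
path <- tempfile(fileext = ".json")
writeLines(json_lines, path)

# Load the file through the generic data source API and inspect the result
people_jl <- read.df(path, "json")
printSchema(people_jl)
head(people_jl)
```

The second record omits age, which should surface as a missing value rather than an error. For a regular multi-line JSON document, the Spark JSON source has a multiLine option (for example read.json(path, multiLine = TRUE)). It is worth checking that option against the documentation for your Spark version.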

The data sources API natively supports CSV formatted input files. For more information please refer to SparkR [read.df](http://spark.apache.org/docs/latest/api/R/read.df.html) API documentation.

In [39]:
csvPath <- "/opt/spark/examples/src/main/resources/people.csv"
df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA", delimiter = ";")

In [40]:
head(df)

name,age,job
Jorge,30,Developer
Bob,32,Developer
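
This section covers saving as well as loading. As a hedged sketch of the saving side, the example below writes a SparkDataFrame to Parquet with write.df and reads it back with read.df. The output location under R's temporary directory is an assumption chosen for illustration, and the sketch presumes local mode.

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(faithful)

# Hypothetical output location; Spark writes a directory of part files here
out_path <- file.path(tempdir(), "faithful.parquet")

# mode = "overwrite" replaces any existing output at that path
write.df(df, path = out_path, source = "parquet", mode = "overwrite")

# Read the saved data back through the same data source API
restored <- read.df(out_path, "parquet")
nrow(restored)
```

Parquet stores the schema alongside the data, so no schema inference options are needed when reading it back.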


# SparkDataFrame Operations
SparkDataFrames support a number of functions for structured data processing. Here we include some basic examples; a complete list can be found in the API docs.

## Selecting rows and columns

In [31]:
df <- as.DataFrame(faithful)

# Get basic information about the SparkDataFrame
df
## SparkDataFrame[eruptions:double, waiting:double]

# Select only the "eruptions" column
head(select(df, df$eruptions))
##  eruptions
##1     3.600
##2     1.800
##3     3.333

# You can also pass in column names as strings
head(select(df, "eruptions"))

# Filter the SparkDataFrame to only retain rows with wait times shorter than 50 mins
head(filter(df, df$waiting < 50))
##  eruptions waiting
##1     1.750      47
##2     1.750      47
##3     1.867      48

SparkDataFrame[eruptions:double, waiting:double]

eruptions
3.6
1.8
3.333
2.283
4.533
2.883


eruptions
3.6
1.8
3.333
2.283
4.533
2.883


eruptions,waiting
1.75,47
1.75,47
1.867,48
1.75,48
2.167,48
2.1,49
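
The select and filter calls above can also be combined. The sketch below filters first and then selects, and it also shows filter taking its condition as a SQL expression string instead of a Column. It assumes SparkR is on the R library path. The variable names are arbitrary.

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(faithful)

# Filter with a Column expression, then keep both columns by name
short_waits <- select(filter(df, df$waiting < 50), "eruptions", "waiting")

# The same filter written as a SQL expression string
short_waits_sql <- filter(df, "waiting < 50")

# Both forms should retain the same number of rows
nrow(short_waits)
nrow(short_waits_sql)
```

The string form can be convenient when the condition is built dynamically, while the Column form catches misspelled column names earlier.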


## Data type mapping between R and Spark

| R | Spark |
| --- | --- |
| byte | byte |
| integer | integer |
| float | float |
| double | double |
| numeric | double |
| character | string |
| string | string |
| binary | binary |
| raw | binary |
| logical | boolean |
| POSIXct | timestamp |
| POSIXlt | timestamp |
| Date | date |
| array | array |
| list | array |
| env | map |
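
One way to see this mapping in practice is to build a small local data frame with several R column types and inspect the schema Spark assigns. The data frame below is made up for illustration, and the sketch assumes SparkR is on the R library path.

```r
library(SparkR)
sparkR.session()

# Hypothetical local data with one column per R type of interest
local_df <- data.frame(
  id    = 1:3,                      # integer
  score = c(1.5, 2.5, 3.5),         # numeric
  label = c("a", "b", "c"),         # character
  flag  = c(TRUE, FALSE, TRUE),     # logical
  day   = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03")),  # Date
  stringsAsFactors = FALSE
)

sdf <- createDataFrame(local_df)

# printSchema shows the Spark type chosen for each column
printSchema(sdf)

# dtypes returns the same information as a list of (name, type) pairs
col_types <- dtypes(sdf)
col_types
```

Comparing the printed schema against the table is a quick check that a column arrives in Spark with the type you expect.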
