<h1>Background</h1>

<tt>Spark</tt> allows for parallel operations in a program to be executed on a cluster. The main abstractions are:

- A **dataset** (in Spark, a resilient distributed dataset, or RDD), which is a collection of elements partitioned across the nodes of the cluster that can be worked on in parallel.

- **Shared variables** which can be shared across tasks or between tasks and the driver program.

This notebook introduces Spark, Scala and running <tt>.scala</tt> programs on an HPC.
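As a rough illustration of the two abstractions listed above, the sketch below builds a small dataset (an RDD) and uses both kinds of shared variable that Spark provides: a broadcast variable and an accumulator. It assumes the Spark libraries are already on the classpath (setting this up is covered in the next section). The names `numbers`, `threshold` and `aboveThreshold`, and the data itself, are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[1]")
  .appName("Shared variables sketch")
  .getOrCreate()
val sc = spark.sparkContext

// A dataset (RDD) split into 2 partitions; on a cluster these would live on different nodes
val numbers = sc.parallelize(1 to 10, 2)

// Broadcast variable: a read-only value made available to every task
val threshold = sc.broadcast(5)

// Accumulator: tasks add to it, and the driver reads the total afterwards
val aboveThreshold = sc.longAccumulator("aboveThreshold")

// foreach is an action, so the tasks run here and update the accumulator
numbers.foreach(n => if (n > threshold.value) aboveThreshold.add(1L))

println(aboveThreshold.sum)
```

Tasks only read `threshold` and only add to `aboveThreshold`, and the driver reads the final total once the action has finished.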

<h1>Setting up Spark</h1>

Spark supports multiple programming languages including **Scala**, **Java**, **R** and **Python**. Throughout these tutorials, we will use **Scala**.

To run a piece of code in a Jupyter notebook, you can use the "Cell" menu or use the CTRL+Return combination when you've selected the relevant cell. SageMathCloud can only have one notebook running at a time, so to open another, you'll need to close the current one (using the menu "File" $\rightarrow$ "Close and halt" option).

To start our <tt>Spark</tt> application, we first set (and check) the version settings:

In [1]:
val sparkVersion = "2.0.1"
val scalaVersion = scala.util.Properties.versionNumberString

sparkVersion: String = "2.0.1"
scalaVersion: String = "2.11.8"

Next, add the Spark libraries to the classpath:

In [2]:
classpath.add(
    "org.apache.spark" %% "spark-yarn" % sparkVersion,
    "org.apache.spark" %% "spark-mllib" % sparkVersion
)

147 new artifact(s)


147 new artifacts in macro
147 new artifacts in runtime
147 new artifacts in compile




We need to import some Spark classes (more detail on these below):

In [3]:
import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.SparkSession

<h2>Initializing Spark</h2>

Before Spark 2.0, it was necessary to create a SparkContext object and tell Spark how to access a cluster, using:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    val conf = new SparkConf().setAppName(appName).setMaster(master)
    val sc = new SparkContext(conf)

In Spark 2.0, the two were subsumed by SparkSession:

    import org.apache.spark.sql.SparkSession
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("my-spark-app")
      .config("spark.some.config.option", "config-value")
      .getOrCreate()
      
(Note we have already carried out the necessary <tt>import</tt> in the previous cell.) The underlying SparkContext, which is created when <tt>getOrCreate()</tt> is called (if it doesn't already exist), can still be accessed using the <tt>sparkContext</tt> method of a <tt>SparkSession</tt>. Now we create an instance of a SparkSession:

In [4]:
val sparkSession = SparkSession.builder
  .master("local[1]")
  .appName("Spark examples")
  .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/02/28 20:02:20 INFO SparkContext: Running Spark version 2.0.1
17/02/28 20:02:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/28 20:02:22 INFO SecurityManager: Changing view acls to: b97eec96efcb40779e247b002e047f82
17/02/28 20:02:22 INFO SecurityManager: Changing modify acls to: b97eec96efcb40779e247b002e047f82
17/02/28 20:02:22 INFO SecurityManager: Changing view acls groups to: 
17/02/28 20:02:22 INFO SecurityManager: Changing modify acls groups to: 
17/02/28 20:02:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(b97eec96efcb40779e247b002e047f82); groups with view permissions: Set(); users  with modify permissions: Set(b97eec96efcb40779e247b002e047f82); groups with modify permissions: Set()
17/02/28 20:02:23 INFO Utils: Successfully started service 

sparkSession: SparkSession = org.apache.spark.sql.SparkSession@2cfd796a
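To make the relationship between the session and its context concrete, here is a minimal sketch. It repeats the builder call from the cell above so that it is self-contained; the names `session`, `context` and `again` are chosen for illustration only.

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder
  .master("local[1]")
  .appName("Spark examples")
  .getOrCreate()

// The underlying SparkContext is reachable from the session
val context = session.sparkContext
println(context.appName)
println(context.master)
println(context.version)

// A second getOrCreate() reuses the running session rather than starting a new one,
// so both sessions share the same SparkContext
val again = SparkSession.builder.getOrCreate()
println(again.sparkContext eq context)
```

Because of this reuse, settings such as <tt>master</tt> passed to a later builder call do not replace those of a context that is already running.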

<h2>Creating standalone programs</h2>

Since the amount of memory available to a Jupyter notebook is limited (which is why we have to close one notebook before opening another), larger-scale programs have to be run on a 'proper' machine, which means creating standalone programs. We will continue to demonstrate small-scale examples in the notebooks, and you can try out small bits of code in a Jupyter cell, but you should also try to create the corresponding <tt>.scala</tt> programs on <tt>iceberg</tt> and elsewhere. When you log into a node on <tt>iceberg</tt>, you may need to ask for more memory than the default <tt>qrsh</tt> gives you:

    ssh user@iceberg.shef.ac.uk
    qrsh -l mem=8G -l rmem=8G

<h3>Passing functions to Spark</h3>

We need to pass functions in the driver program to the cluster. One way to do this is to use static methods in a global singleton object. **Note** that applications should define a <tt>main()</tt> method. The following examples, along with the relevant imports, form self-contained programs, starting with the usual <tt>Hello World</tt>:

In [6]:
object HelloWorld {
    def main(args: Array[String]): Unit = {
      
        val sparkSession = SparkSession.builder
          .master("local")
          .appName("Hello World")
          .getOrCreate()
        
        println("Hello, world!")
    }
}

HelloWorld.main(Array())

Hello, world!


defined object HelloWorld

As a standalone program, this is available <a href="files/HelloWorld.scala">here</a>. You will need to package this using <tt>sbt</tt> (the Scala build tool), using <tt>sbt package</tt>, and then use <tt>spark-submit</tt> to run the resulting package. (More details on this in the <a href="files/README">README</a>.)
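For orientation, a build definition for <tt>sbt package</tt> might look roughly like the sketch below. This is an assumption about a typical set-up, not the contents of the README: the project name and version are hypothetical, and the Scala and Spark versions are simply the ones reported earlier in this notebook.

```scala
// build.sbt (hypothetical minimal build definition; the README may use different settings)
name := "hello-world"

version := "0.1"

// Should match the Scala version your Spark installation was built against
scalaVersion := "2.11.8"

// SparkSession lives in the spark-sql artifact; "provided" because spark-submit
// supplies the Spark jars at run time
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.1" % "provided"
```

With these (assumed) settings, <tt>sbt package</tt> would write the jar to <tt>target/scala-2.11/hello-world_2.11-0.1.jar</tt>, which could then be run with something like <tt>spark-submit --class HelloWorld --master local[1] target/scala-2.11/hello-world_2.11-0.1.jar</tt>.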

<h2>Modifying data</h2>

There are lots of things we can do with data! For example, we can remove any instances that don't match our requirements:

In [7]:
// Extract the spark context

val sc = sparkSession.sparkContext

val input = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000)).map(_.toDouble)
val result = input.filter(x => x <= 10)
println(result.collect().mkString(","))

1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0


sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@76e5eb51
input: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[1] at map at Main.scala:28
result: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[2] at filter at Main.scala:31
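Filtering is one of several transformations that can be chained together. The sketch below, which reuses the same list of numbers, combines <tt>filter</tt> and <tt>map</tt> and then uses the actions <tt>reduce</tt> and <tt>count</tt> to bring results back to the driver. The variable names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[1]")
  .appName("Transformations sketch")
  .getOrCreate()
val sc = spark.sparkContext

val values = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000)).map(_.toDouble)

// Transformations are lazy: this line only describes the computation
val squaredSmall = values.filter(_ <= 10).map(x => x * x)

// Actions trigger the computation and return a result to the driver
val total = squaredSmall.reduce(_ + _)   // sum of the squares of 1 to 10
val howMany = squaredSmall.count()       // number of values that passed the filter

println(s"total = $total, count = $howMany")
```

Nothing is computed until <tt>reduce</tt> or <tt>count</tt> is called, which is why the cell output above only reports RDD descriptions for <tt>input</tt> and <tt>result</tt>.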

<h2>Computing something!</h2>

Our next example program starts a <tt>sparkSession</tt>, reads in a text file and counts the number of lines that contain a given letter (here "a" and "b").

In [8]:
object LetterCountingApp {
    
    def main(args: Array[String]): Unit = {
        val inputFile = "files/TaleOfTwoCities.txt" // Should be some file on your system
        
        val sparkSession = SparkSession.builder
            .master("local")
            .appName("Letter counting app")
            .getOrCreate()
        
        val sc = sparkSession.sparkContext
        
        val inputData = sc.textFile(inputFile, 2).cache()
        val numAs = inputData.filter(line => line.contains("a")).count()
        val numBs = inputData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    }
}

LetterCountingApp.main(Array())

Lines with a: 6, Lines with b: 6


defined object LetterCountingApp
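The program above counts lines that contain a letter. If you want the number of occurrences of each individual letter instead, one possible approach is sketched below. To keep it self-contained it uses two made-up lines of text rather than the novel; with a real file you would start from <tt>sc.textFile(...)</tt> as above.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[1]")
  .appName("Letter occurrence sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Illustrative stand-in for the lines of a text file
val lines = sc.parallelize(Seq("a banana", "a cab"))

val letterCounts = lines
  .flatMap(line => line.toLowerCase.toList)  // one element per character
  .filter(_.isLetter)                        // drop spaces and punctuation
  .map(c => (c, 1))                          // pair each letter with a count of 1
  .reduceByKey(_ + _)                        // add up the counts per letter
  .collectAsMap()                            // small result, so safe to bring to the driver

letterCounts.toSeq.sortBy(_._1).foreach { case (c, n) => println(s"$c: $n") }
```

Calling <tt>collectAsMap()</tt> is reasonable here because there are at most a few dozen distinct letters; for results that may be large, keep them as an RDD instead.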

<h2>Statistics</h2>

[Basic statistics](https://spark.apache.org/docs/2.0.2/mllib-statistics.html) functions are implemented within Spark's MLlib. These are useful for computing column-wise summaries of a dataset, such as the mean and variance.

In [9]:
// Import for vectors
import org.apache.spark.mllib.linalg.Vectors

// Import for statistics
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val observations = sc.parallelize(
  Seq(
    Vectors.dense(1.0, 10.0, 100.0),
    Vectors.dense(2.0, 20.0, 200.0),
    Vectors.dense(3.0, 30.0, 300.0)
  )
)

// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean)  // a dense vector containing the mean value for each column
println(summary.variance)  // column-wise variance
println(summary.numNonzeros)  // number of nonzeros in each column

[2.0,20.0,200.0]
[1.0,100.0,10000.0]
[3.0,3.0,3.0]


import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
observations: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[7] at parallelize at Main.scala:25
summary: org.apache.spark.mllib.stat.MultivariateStatisticalSummary = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@3acc93cf
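Note that the variance printed above for the first column is 1.0, which is what you get when dividing by n - 1 (the sample variance) rather than by n (the population variance). The plain Scala sketch below, which needs no Spark, checks this by hand for the first column. The helper names are made up for illustration.

```scala
// Mean of a sequence of doubles
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

// Sample variance: divides the sum of squared deviations by n - 1
def sampleVariance(xs: Seq[Double]): Double = {
  val m = mean(xs)
  xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
}

// Population variance: divides the sum of squared deviations by n
def populationVariance(xs: Seq[Double]): Double = {
  val m = mean(xs)
  xs.map(x => (x - m) * (x - m)).sum / xs.size
}

// First column of the observations above
val firstColumn = Seq(1.0, 2.0, 3.0)

println(sampleVariance(firstColumn))      // matches the first entry of summary.variance
println(populationVariance(firstColumn))  // smaller, since it divides by n
```

This distinction matters when comparing results from different APIs: the <tt>variance</tt> method on an RDD of doubles returns the population variance, so it will not agree exactly with <tt>colStats</tt> on the same data.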

<h2>Exercises</h2>

There are many more examples in the Scala / Spark documentation than are shown in these notebooks - you will find it beneficial to consult these.

<h3>Exercise 1</h3>

Use the <tt>List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000)</tt> data introduced above along with the available <tt>statistics</tt> package to filter out any values that are more than 3 standard deviations away from the mean. You should be able to carry out this exercise in the cell below:

In [10]:
// Hint: standard deviation = sqrt(variance)


// Import for statistics
import org.apache.spark.mllib.stat.{Statistics}

object Three_std {
    
    def main(args: Array[String]): Unit = {
        
        val sparkSession = SparkSession.builder
            .master("local")
            .appName("Three std app")
            .getOrCreate()
        
        val sc = sparkSession.sparkContext
        
        val the_list = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000)).map(_.toDouble)

            
        // Compute the mean and the (population) standard deviation of the data.
        val mean = the_list.mean
        val variance = the_list.variance
        val std = math.sqrt(variance)
        
        val max = mean + 3*std 
        val min = mean - 3*std
        
        // Keep only the values that lie within [min, max]
        val result = the_list.filter(x => x >= min && x <= max)
        println(s"The values within 3 std of the mean are: ${result.collect().mkString(",")}")
        
        // Equivalent test using the absolute deviation from the mean
        val result_2 = the_list.filter(x => math.abs(mean - x) <= 3*std)
        println(s"The values within 3 std of the mean are: ${result_2.collect().mkString(",")}")
        
   }
}

Three_std.main(Array())


The values within 3 std of the mean are: 1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0
The values within 3 std of the mean are: 1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0


import org.apache.spark.mllib.stat.{Statistics}
defined object Three_std

<h3>Exercise 2</h3>

Now we'll use real data: the [Iris dataset](https://archive.ics.uci.edu/ml/datasets/Iris). The webpage provides a [dataset description](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names) as well as the data for download. The [iris.data](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data) file is available for the notebook to use at 

    files/iris.data

and the column data at
    
    files/sepal_length.data
    files/sepal_width.data
    files/petal_length.data
    files/petal_width.data
    
The column arrays are also provided for you in the cell below. Can you compute correlations between the various pairs of features? (You may find that a correlation function that you can use already exists.)

In [11]:
val sepalLength = sc.parallelize(Array(5.1,4.9,4.7,4.6,5.0,5.4,4.6,5.0,4.4,4.9,5.4,4.8,4.8,4.3,5.8,5.7,5.4,5.1,5.7,5.1,5.4,5.1,4.6,5.1,4.8,5.0,5.0,5.2,5.2,4.7,4.8,5.4,5.2,5.5,4.9,5.0,5.5,4.9,4.4,5.1,5.0,4.5,4.4,5.0,5.1,4.8,5.1,4.6,5.3,5.0,7.0,6.4,6.9,5.5,6.5,5.7,6.3,4.9,6.6,5.2,5.0,5.9,6.0,6.1,5.6,6.7,5.6,5.8,6.2,5.6,5.9,6.1,6.3,6.1,6.4,6.6,6.8,6.7,6.0,5.7,5.5,5.5,5.8,6.0,5.4,6.0,6.7,6.3,5.6,5.5,5.5,6.1,5.8,5.0,5.6,5.7,5.7,6.2,5.1,5.7,6.3,5.8,7.1,6.3,6.5,7.6,4.9,7.3,6.7,7.2,6.5,6.4,6.8,5.7,5.8,6.4,6.5,7.7,7.7,6.0,6.9,5.6,7.7,6.3,6.7,7.2,6.2,6.1,6.4,7.2,7.4,7.9,6.4,6.3,6.1,7.7,6.3,6.4,6.0,6.9,6.7,6.9,5.8,6.8,6.7,6.7,6.3,6.5,6.2,5.9))
val sepalWidth = sc.parallelize(Array(3.5,3.0,3.2,3.1,3.6,3.9,3.4,3.4,2.9,3.1,3.7,3.4,3.0,3.0,4.0,4.4,3.9,3.5,3.8,3.8,3.4,3.7,3.6,3.3,3.4,3.0,3.4,3.5,3.4,3.2,3.1,3.4,4.1,4.2,3.1,3.2,3.5,3.1,3.0,3.4,3.5,2.3,3.2,3.5,3.8,3.0,3.8,3.2,3.7,3.3,3.2,3.2,3.1,2.3,2.8,2.8,3.3,2.4,2.9,2.7,2.0,3.0,2.2,2.9,2.9,3.1,3.0,2.7,2.2,2.5,3.2,2.8,2.5,2.8,2.9,3.0,2.8,3.0,2.9,2.6,2.4,2.4,2.7,2.7,3.0,3.4,3.1,2.3,3.0,2.5,2.6,3.0,2.6,2.3,2.7,3.0,2.9,2.9,2.5,2.8,3.3,2.7,3.0,2.9,3.0,3.0,2.5,2.9,2.5,3.6,3.2,2.7,3.0,2.5,2.8,3.2,3.0,3.8,2.6,2.2,3.2,2.8,2.8,2.7,3.3,3.2,2.8,3.0,2.8,3.0,2.8,3.8,2.8,2.8,2.6,3.0,3.4,3.1,3.0,3.1,3.1,3.1,2.7,3.2,3.3,3.0,2.5,3.0,3.4,3.0))
val petalLength = sc.parallelize(Array(1.4,1.4,1.3,1.5,1.4,1.7,1.4,1.5,1.4,1.5,1.5,1.6,1.4,1.1,1.2,1.5,1.3,1.4,1.7,1.5,1.7,1.5,1.0,1.7,1.9,1.6,1.6,1.5,1.4,1.6,1.6,1.5,1.5,1.4,1.5,1.2,1.3,1.5,1.3,1.5,1.3,1.3,1.3,1.6,1.9,1.4,1.6,1.4,1.5,1.4,4.7,4.5,4.9,4.0,4.6,4.5,4.7,3.3,4.6,3.9,3.5,4.2,4.0,4.7,3.6,4.4,4.5,4.1,4.5,3.9,4.8,4.0,4.9,4.7,4.3,4.4,4.8,5.0,4.5,3.5,3.8,3.7,3.9,5.1,4.5,4.5,4.7,4.4,4.1,4.0,4.4,4.6,4.0,3.3,4.2,4.2,4.2,4.3,3.0,4.1,6.0,5.1,5.9,5.6,5.8,6.6,4.5,6.3,5.8,6.1,5.1,5.3,5.5,5.0,5.1,5.3,5.5,6.7,6.9,5.0,5.7,4.9,6.7,4.9,5.7,6.0,4.8,4.9,5.6,5.8,6.1,6.4,5.6,5.1,5.6,6.1,5.6,5.5,4.8,5.4,5.6,5.1,5.1,5.9,5.7,5.2,5.0,5.2,5.4,5.1))
val petalWidth = sc.parallelize(Array(0.2,0.2,0.2,0.2,0.2,0.4,0.3,0.2,0.2,0.1,0.2,0.2,0.1,0.1,0.2,0.4,0.4,0.3,0.3,0.3,0.2,0.4,0.2,0.5,0.2,0.2,0.4,0.2,0.2,0.2,0.2,0.4,0.1,0.2,0.1,0.2,0.2,0.1,0.2,0.2,0.3,0.3,0.2,0.6,0.4,0.3,0.2,0.2,0.2,0.2,1.4,1.5,1.5,1.3,1.5,1.3,1.6,1.0,1.3,1.4,1.0,1.5,1.0,1.4,1.3,1.4,1.5,1.0,1.5,1.1,1.8,1.3,1.5,1.2,1.3,1.4,1.4,1.7,1.5,1.0,1.1,1.0,1.2,1.6,1.5,1.6,1.5,1.3,1.3,1.3,1.2,1.4,1.2,1.0,1.3,1.2,1.3,1.3,1.1,1.3,2.5,1.9,2.1,1.8,2.2,2.1,1.7,1.8,1.8,2.5,2.0,1.9,2.1,2.0,2.4,2.3,1.8,2.2,2.3,1.5,2.3,2.0,2.0,1.8,2.1,1.8,1.8,1.8,2.1,1.6,1.9,2.0,2.2,1.5,1.4,2.3,2.4,1.8,1.8,2.1,2.4,2.3,1.9,2.3,2.5,2.3,1.9,2.0,2.3,1.8))
// must have the same number of partitions and cardinality



import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD



// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
// method is not specified, Pearson's method will be used by default.
val correlation_1: Double = Statistics.corr(sepalLength, sepalWidth, "pearson")
println(s"Correlation of sepal is: $correlation_1")

val correlation_2: Double = Statistics.corr(petalLength, petalWidth, "pearson")
println(s"Correlation of petal is: $correlation_2")

val correlation_3: Double = Statistics.corr(sepalLength, petalLength, "pearson")
println(s"Correlation of lengths is: $correlation_3")

val correlation_4: Double = Statistics.corr(sepalWidth, petalWidth, "pearson")
println(s"Correlation of widths is: $correlation_4")

// val data: RDD[Vector] = sc.parallelize(
//   Seq(
//     Vectors.dense(1.0, 10.0, 100.0),
//     Vectors.dense(2.0, 20.0, 200.0),
//     Vectors.dense(5.0, 33.0, 366.0))
// )  // note that each Vector is a row and not a column

// // calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method
// // If a method is not specified, Pearson's method will be used by default.
// val correlMatrix: Matrix = Statistics.corr(data, "pearson")
// println(correlMatrix.toString)

Correlation of sepal is: -0.10936924995062468
Correlation of petal is: 0.9627570970509658
Correlation of lengths is: 0.8717541573048866
Correlation of widths is: -0.35654408961379946


sepalLength: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[15] at parallelize at Main.scala:26
sepalWidth: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[16] at parallelize at Main.scala:29
petalLength: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[17] at parallelize at Main.scala:32
petalWidth: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[18] at parallelize at Main.scala:35
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
correlation_1: Double = -0.10936924995062468
correlation_2: Double = 0.9627570970509658
correlation_3: 

<h3>Exercise 3</h3>

Use the downloadable <tt>HelloWorld.scala</tt> program to create a standalone version on the HPC. Run it and check that it produces the output you expect. (Note that this does not need to be submitted as a job, but can be run interactively.)

<h3>Exercise 4</h3>

Make the <tt>LetterCountingApp</tt> a standalone program executable on the HPC. Don't forget to correctly set the <tt>inputFile</tt> path. Can you make the program take the file path as an argument? (Again, this can be run interactively.)
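As a starting point for the file-path part of this exercise, one possible approach is to read the first command-line argument and fall back to a default when none is given. The helper below is a hypothetical sketch in plain Scala (the object name `InputPath` and its members are not part of the course files). Inside <tt>main</tt> you would then use its result in place of the hard-coded <tt>inputFile</tt>.

```scala
object InputPath {
  // Default taken from the notebook cell above; adjust it for your own system
  val defaultPath = "files/TaleOfTwoCities.txt"

  // Use the first command-line argument if there is one, otherwise the default
  def resolve(args: Array[String]): String =
    args.headOption.getOrElse(defaultPath)
}

println(InputPath.resolve(Array()))
println(InputPath.resolve(Array("/some/other/file.txt")))
```

When a packaged application is launched with <tt>spark-submit</tt>, anything placed after the jar path on the command line is passed to <tt>main</tt> as <tt>args</tt>.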

<h3>Exercise 5 (optional)</h3>

The data description of the Iris dataset in Exercise 2 contains some statistics for this data. Can you reproduce these in the cell below?