# Big Data: Spark RDD

## Getting acquainted with Spark and Spark Notebook

 Never used a Notebook? 
 Find useful advice in the UI tour (in the help menu) or in the 
 [Spark Notebook documentation](https://github.com/spark-notebook/spark-notebook/blob/master/docs/exploring_notebook.md) itself.
 
 Take your time to practice using the Notebook environment: add new cells, split existing ones, switch between code and markdown, _etc._

## Scala

In the first weeks of the course, you had the chance to try out some Scala with a special Docker container. You can now return to the exercises you tried, but enter the Scala code in this notebook instead of using an editor and a standalone Scala compiler. Refer to the following resources for additional information on Scala:
* Main [Scala site](http://scala-lang.org/), [tutorial](http://docs.scala-lang.org/tutorials/scala-for-java-programmers.html) and [API documentation](http://www.scala-lang.org/api/current/index.html)


In [ ]:
// A Scala expression for you to execute:

1 to 4

res1: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4)


In [ ]:
// Empty cell for you to try out some more Scala tests here!
// Add a cell below to create more space for playing around with Scala.




_Just in case:_ some Scala background is definitely useful to get things done, but __do not get carried away__; _this_ course is about big data processing, not about functional programming!

## Spark

From now on, we consider Scala only as a __host language__ for the Spark big data platform. We access Spark from the host language through a special variable, the Spark Context, which is available in these Spark Notebooks as `sc`.

The basic data structure in Spark is the __Resilient Distributed Dataset (RDD)__, which represents a collection of items stored _in memory_ on many different computers in the data center. Just as a file in Hadoop is represented by one or more blocks in the Hadoop distributed filesystem, an RDD consists of one or more so-called _partitions_ that may reside on different worker nodes.

### Background information

The following two links give (1) an introduction to using Spark's RDDs to represent collections and (2) the complete programming guide discussing all operations you can apply to RDDs (the latter as a reference to check for more detailed information).

* http://spark.apache.org/examples.html
* http://spark.apache.org/docs/2.3.2/programming-guide.html

### My First RDD

RDDs can be created from in-memory collections or from files in the (distributed or local) file system.

Let's first initialize a new RDD from a collection of numbers created by the Scala expression `0 to 999`, using
the operation `parallelize` on the `Spark Context`
([see the documentation](http://spark.apache.org/docs/2.3.2/api/scala/index.html#org.apache.spark.SparkContext)).
The second parameter is optional; here it instructs the platform to split the data into 8 partitions.

In [ ]:
val rdd = sc.parallelize(0 to 999,8)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:60
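
To picture what the second argument does, the following plain-Scala sketch cuts the same range into 8 contiguous chunks. It runs without Spark; the equal-sized split and the names `numPartitions`, `chunkSize` and `chunks` are assumptions made for illustration, not a description of Spark's internals.

```scala
// Plain Scala, no Spark needed: cut 0..999 into 8 equally sized chunks.
// Assumption: contiguous, equal-sized chunks stand in for partitions.
val numbers = 0 to 999
val numPartitions = 8
val chunkSize = numbers.size / numPartitions   // 125 items per chunk

val chunks = numbers.grouped(chunkSize).toList

// Print the first and last element of every chunk.
chunks.zipWithIndex.foreach { case (chunk, i) =>
  println(s"chunk $i: ${chunk.head} .. ${chunk.last} (${chunk.size} items)")
}
```

On the real RDD, `rdd.getNumPartitions` reports the number of partitions, and `rdd.glom().collect()` returns the contents of each partition as a separate array.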


Evaluation in Spark is lazy: transformations only describe a computation, and execution is triggered only by actions, i.e., operations that require output to be materialized. So far, nothing has happened.
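
Lazy evaluation can also be observed without a cluster. The sketch below uses a plain Scala `Iterator`, whose `map` is lazy as well, as a stand-in for an RDD; the counter `evaluated` is a hypothetical helper that records how often the mapped function actually ran.

```scala
// Plain Scala stand-in for lazy evaluation (an Iterator, not an RDD).
var evaluated = 0  // counts how often the mapped function really runs

// "Transformation": only describes the computation, runs nothing yet.
val doubled = (0 to 999).iterator.map { x => evaluated += 1; x * 2 }
val evaluatedBefore = evaluated

// "Action": asking for concrete results forces the computation.
val firstFour = doubled.take(4).toList
val evaluatedAfter = evaluated

println(s"before: $evaluatedBefore, after: $evaluatedAfter, result: $firstFour")
```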

_Check:_ Spark UI: [stages](http://localhost:4040/stages/) is still empty.

In [ ]:
val sample = rdd.takeSample(false, 4)

sample: Array[Int] = Array(445, 53, 544, 897)


Only now, evaluation took place: see the [stages](http://localhost:4040/stages/) in the Spark UI.
Click on the links!

Try to explain for yourself: _Why would Spark have created two jobs?_

### Data

Use a shell escape to test if the Gutenberg data was correctly loaded on the docker container running the Spark Notebook.

Note: [Assignment 2](http://rubigdata.github.io/course/assignments/A2-mapreduce.html) gave detailed instructions on how to get the Project Gutenberg Shakespeare texts onto your Spark Notebook container. If your container is new, download the data again and use `docker cp 100.txt snb:/opt/docker/hadoop-2.9.2`.

In [ ]:
:sh ls hadoop-2.9.2

LICENSE.txt
NOTICE.txt
README.txt
bin
etc
include
lib
libexec
sbin
share

import sys.process._




In [ ]:
:sh ls /opt/docker/hadoop-2.9.2/100.txt

ls: cannot access /opt/docker/hadoop-2.9.2/100.txt: No such file or directory
java.lang.RuntimeException: Nonzero exit value: 2
  at scala.sys.package$.error(package.scala:27)
  at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:132)
  at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:103)
  ... 103 elided


### Counting words

We will use the Shakespeare data for the classic Big Data "Hello World" exercise, counting words.

_If you reached this part of the exercise before the first Spark lecture, feel free to continue, but definitely revisit this notebook **after** the lecture (you can clear previous output using the All Output option in the Cell menu)._

In [ ]:
val lines = sc.textFile("/opt/docker/hadoop-2.9.2/100.txt")

Can you predict what the following commands will do?
Do you recognize the MapReduce pattern on lines 2 and 3?

In [ ]:
println("Lines:\t" + lines.count + "\n" +
        "Chars:\t" + lines.map(s => s.length).
                           reduce((a, b) => a + b))

The `map` operator applies its parameter, a lambda function, to every item in the RDD.
`reduce` is also defined using a lambda function, one that combines two values into a single value.
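
A minimal sketch of the same map and reduce steps on an ordinary Scala `List`, which offers methods with the same names. The two sample lines are made up for illustration; no Spark is involved.

```scala
// Invented sample input standing in for the lines of the text file.
val sampleLines = List("To be, or not to be", "that is the question")

// map: apply the lambda to every item, yielding one length per line.
val lengths = sampleLines.map(s => s.length)

// reduce: combine the lengths pairwise until one total remains.
val totalChars = lengths.reduce((a, b) => a + b)

println(s"Lines: ${sampleLines.size}, lengths: $lengths, chars: $totalChars")
```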

_Note:_ if you never took a functional programming course, look at [this answer on Stack Overflow](http://stackoverflow.com/a/16509/2127435).

Now try to understand the following example in detail.
_Try to understand why we used `flatMap` and not `map`._

It is worth copying the cell and inspecting the output at intermediate steps (use `take()`, not `collect()`).
_After the first Spark lecture, in April, you should understand why!_

In [ ]:
val words = lines.flatMap(line => line.split(" "))
              .filter(_ != "")
              .map(word => (word,1))
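
The difference between `map` and `flatMap` shows up in the shape of the result. A small sketch on a plain Scala `List`, with invented sample lines, makes this visible without running Spark:

```scala
// Invented sample input standing in for the lines of the text file.
val sampleLines = List("to be or", "not to be")

// map keeps one result per input line: a list of word lists.
val nested = sampleLines.map(line => line.split(" ").toList)

// flatMap concatenates the per-line results into one flat list of words.
val flat = sampleLines.flatMap(line => line.split(" "))

println(nested)
println(flat)
```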

In [ ]:
val wc = words.reduceByKey(_ + _)

In [ ]:
wc.take(10)

Take a look at how the platform processes this query:

In [ ]:
wc.toDebugString

Inspect the Spark UI to see the computations in the cluster → 
[see stages](http://localhost:4040/stages/) and their constituent tasks.

### To count or not to count
OK, we can count words. Let us find out which words Shakespeare used most often!

In [ ]:
val top10 = wc.takeOrdered(10)

Ok, not quite what we wanted!
See what's wrong?

Let's fix the result ordering as follows.

In [ ]:
val top10 = wc.takeOrdered(10)(Ordering[Int].reverse.on(x=>x._2))
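
Why the explicit ordering matters can be tried on a handful of pairs in plain Scala. The counts below are invented, and `sorted` on a `List` merely stands in for `takeOrdered` on an RDD.

```scala
// Invented sample counts, not real Shakespeare numbers.
val counts = List(("the", 30), ("and", 25), ("Romeo", 3), ("a", 12))

// Default ordering on (String, Int): compares the word first, then the count.
val byWord = counts.sorted.take(2)

// Explicit ordering: compare on the count only, largest first.
val byCount = counts.sorted(Ordering[Int].reverse.on[(String, Int)](_._2)).take(2)

println(byWord)
println(byCount)
```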

You can render the collected results however you like, using the client programming language.

In [ ]:
top10.map({case(w,c) => "Word '%s' occurs %d times".format(w,c)}).foreach(println)

We can zoom in on the frequencies of specific words, which may be more interesting than stopwords!

In [ ]:
wc.filter(_._1 == "Romeo").collect

In [ ]:
wc.filter(_._1 == "Julia").collect

In [ ]:
wc.cache()

In [ ]:
wc.filter(_._1 == "Macbeth").collect

In [ ]:
wc.filter(_._1 == "Capulet").collect

There are many different ways to compute the top N results. A few follow - _try to understand how much work each alternative actually generates for the cluster._

In [ ]:
val oCounts = wc.map(x => x._2 -> x._1).sortByKey(false).map(x => x._2 -> x._1).cache()
oCounts.take(10).foreach(println)

In [ ]:
// Alternative way to achieve the same:
wc.sortBy(x => -x._2).take(10).foreach(println)

In [ ]:
// Preferred way if you really just want the top results
// Note that you do not first need to assign the ordering function to a variable - you could just pass along the Ordering.by expression instead.
val asc = (Ordering.by[(String, Int), Int](_._2))
wc.top(10)(asc).foreach(println)

In [ ]:
// Alternative formulation
val desc = (Ordering.by[(String, Int), Int](-_._2))
wc.takeOrdered(10)(desc).foreach(println)

The next section saves the results of word counting in the filesystem. 

We use a simple shell command to look into the directory that has been created.
(Alternatively, you can navigate the filesystem after issuing a `docker exec -it HASH bash` command on the machine running the notebook container.)

In [ ]:
wc.saveAsTextFile("wc")

In [ ]:
:sh ls wc

_Q: Explain why there are multiple result files._

Inspect the files from the command line in the docker container (using `docker exec`).

Clean up the directory to save headaches when later rerunning the notebook.

In [ ]:
:sh rm -rf wc

### How to count?

In [ ]:
val words = lines.flatMap(line => line.split(" "))
              .map(w => w.toLowerCase().replaceAll("(^[^a-z]+|[^a-z]+$)", ""))
              .filter(_ != "")
              .map(w => (w,1))
              .reduceByKey( _ + _ )
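
What the normalization step does to individual tokens can be checked in plain Scala. The sample tokens are invented for illustration; the expression itself is copied from the cell above.

```scala
// Invented sample tokens, as they might come out of split(" ").
val tokens = List("Macbeth,", "MACBETH.", "'tis", "o'er", "--")

// Lowercase, then strip leading and trailing characters that are not a-z.
val normalized = tokens.map(w => w.toLowerCase().replaceAll("(^[^a-z]+|[^a-z]+$)", ""))

// Tokens without any letter become empty strings and are filtered out.
val kept = normalized.filter(_ != "")

println(normalized)
println(kept)
```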

In [ ]:
words.filter(_._1 == "macbeth").collect
  .map({case (w,c) => "%s occurs %d times".format(w,c)}).foreach(println)

_Q: Why are the counts different?_