# 102 Spark basics

The goal of this lab is to get familiar with Spark programming.

- [Spark programming guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
- [RDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html)
- [PairRDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/PairRDDFunctions.html)

In [2]:
import org.apache.spark

import org.apache.spark


In [None]:
// DO NOT EXECUTE - this is needed just to avoid showing errors in the following cells
val sc = spark.SparkContext.getOrCreate()

## 102-1 Spark warm-up

Load the ```capra``` and ```divinacommedia``` datasets and try the following actions:
- Show their content (```collect```)
- Count their rows (```count```)
- Split phrases into words (```map``` or ```flatMap```; what’s the difference?)
- Check the results (remember: evaluation is lazy)
- Try the ```toDebugString``` function to check the execution plan

In [3]:
// Load the capra and divinacommedia datasets 

val rddCapra = sc.textFile("../../../../datasets/capra.txt")
val rddDivina = sc.textFile("../../../../datasets/divinacommedia.txt")

rddCapra: org.apache.spark.rdd.RDD[String] = ../../../../datasets/capra.txt MapPartitionsRDD[1] at textFile at <console>:27
rddDivina: org.apache.spark.rdd.RDD[String] = ../../../../datasets/divinacommedia.txt MapPartitionsRDD[3] at textFile at <console>:28


In [4]:
// show their content
rddCapra.collect()
//rddDivina.collect()

res1: Array[String] = Array(sopra la panca la capra campa, sotto la panca la capra crepa)


In [5]:
// count their rows
rddCapra.count()

res2: Long = 2


In [6]:
// split phrases into words
rddCapra.map(x => x.split(" ")).collect() // creates one array per line
rddCapra.flatMap(x => x.split(" ")).collect() // flattens everything into a single array

res3: Array[String] = Array(sopra, la, panca, la, capra, campa, sotto, la, panca, la, capra, crepa)
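
To make the difference concrete, here is a minimal sketch on an in-memory copy of the two capra lines. The `lines` RDD and the local-mode `SparkConf` are assumptions for illustration, not part of the lab setup. `map` yields one array per line, while `flatMap` concatenates those arrays into a single RDD of words.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses the notebook's context if one is already running
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("lab-102").setMaster("local[*]"))

// In-memory stand-in for capra.txt (the same two lines returned by collect() earlier)
val lines = sc.parallelize(Seq("sopra la panca la capra campa", "sotto la panca la capra crepa"))

val perLine = lines.map(_.split(" "))     // RDD[Array[String]]: one element per line
val words   = lines.flatMap(_.split(" ")) // RDD[String]: one element per word

println(perLine.count()) // 2
println(words.count())   // 12
```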


In [7]:
// try toDebugString to check the execution plan
rddCapra.flatMap(x => x.split(" ")).collect()
rddCapra.toDebugString // lineage of rddCapra only; the flatMap above is not part of it

res4: String =
(2) ../../../../datasets/capra.txt MapPartitionsRDD[1] at textFile at <console>:27 []
 |  ../../../../datasets/capra.txt HadoopRDD[0] at textFile at <console>:27 []
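
The plan printed above is for `rddCapra` itself, so the `flatMap` step does not appear in it. A minimal sketch, assuming an in-memory stand-in for the dataset and a local-mode context (both are illustrative assumptions): calling `toDebugString` on the transformed RDD shows the extra step. Defining `words` runs no job; the work happens only when an action such as `count` is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses the notebook's context if one is already running
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("lab-102").setMaster("local[*]"))
val lines = sc.parallelize(Seq("sopra la panca la capra campa", "sotto la panca la capra crepa"))

// Transformation only: this records the lineage but does not split anything yet
val words = lines.flatMap(_.split(" "))

// The lineage now has the flatMap step on top of the source RDD
println(words.toDebugString)

// An action triggers the actual computation
println(words.count()) // 12
```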


## 102-2 Basic Spark jobs

Implement the following jobs in Spark and test them on both the capra and divinacommedia datasets.

- **Word count**: count the number of occurrences of each word
  - Result: (sopra, 1), (la, 4), …
- **Word length count**: count the number of occurrences of words of given lengths
  - Result: (2, 4), (5, 8)
- Compute the average length of words grouped by their first letter (e.g., words that begin with "s" have an average length of 5)
  - Result: (s, 5), (l, 2), …
- Return the inverted index of words (i.e., for each word, list the numbers of the lines in which it appears)
  - Result: (sopra, (0)), (la, (0, 1)), …

Also, check how sorting works and try to sort key-value RDDs by descending values.
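
For the sorting exercise, one option is `sortBy` with `ascending = false`; another is to swap key and value and use `sortByKey`. The sketch below runs on an in-memory stand-in for capra; the `lines` and `counts` names and the local-mode context are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses the notebook's context if one is already running
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("lab-102").setMaster("local[*]"))
val lines = sc.parallelize(Seq("sopra la panca la capra campa", "sotto la panca la capra crepa"))

// (word, occurrences) pairs
val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

// Option 1: sort directly on the value, largest first
val byValueDesc = counts.sortBy(_._2, ascending = false)

// Option 2: swap to (count, word), sort by key in descending order, swap back
val viaSortByKey = counts.map(_.swap).sortByKey(ascending = false).map(_.swap)

println(byValueDesc.first())  // (la,4)
println(viaSortByKey.first()) // (la,4)
```

The order among words with equal counts is not specified by either option.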

In [8]:
// WORD COUNT: count the number of occurrences of each word
rddCapra.flatMap(x => x.split(" "))
  .map(x => (x,1))
  .reduceByKey((x,y) => x + y)
  .collect()

res5: Array[(String, Int)] = Array((campa,1), (la,4), (panca,2), (sotto,1), (crepa,1), (sopra,1), (capra,2))


In [9]:
// WORD LENGTH COUNT: count the number of occurrences of words of a given length
rddCapra.flatMap(x => x.split(" "))
  .map(x => x.length)
  .map(x => (x, 1))
  .reduceByKey((x,y) => x + y)
  .collect()

res6: Array[(Int, Int)] = Array((2,4), (5,8))


In [11]:
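// Compute the average length of words given their first letter
rddCapra.flatMap(x => x.split(" "))
  .map(x => (x.substring(0,1), (x.length, 1))) // (first letter, (length, 1))
  .reduceByKey((x,y) => (x._1 + y._1, x._2 + y._2)) // sum lengths and counts per letter
  .mapValues(x => x._1 / x._2) // integer division; use x._1.toDouble for a fractional average
  .collect()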

res8: Array[(String, Int)] = Array((p,5), (l,2), (s,5), (c,5))
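
A mean cannot be computed by averaging pairs inside `reduceByKey`: applying `(x + y) / 2` repeatedly depends on the merge order and ignores how many values each side represents. A minimal sketch of a common alternative carries a (sum, count) pair with `aggregateByKey` and divides at the end, which also gives fractional results. The sample pairs and names below are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses the notebook's context if one is already running
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("lab-102").setMaster("local[*]"))

// Hypothetical (first letter, word length) pairs; the true mean for "a" is (1 + 2 + 6) / 3 = 3.0
val lengths = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 6), ("b", 4)))

val avgByLetter = lengths
  .aggregateByKey((0, 0))(
    (acc, len) => (acc._1 + len, acc._2 + 1), // fold one length into (sum, count)
    (a, b) => (a._1 + b._1, a._2 + b._2))     // merge partial (sum, count) pairs
  .mapValues { case (sum, count) => sum.toDouble / count }

println(avgByLetter.collect().sortBy(_._1).mkString(", ")) // (a,3.0), (b,4.0)
```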


In [30]:
// Return the inverted index of words
rddCapra
  .zipWithIndex()
  .flatMap(x => x._1.split(" ").map(y => (y, x._2)))
  .distinct()
  .groupByKey()
  .collect()

res26: Array[(String, Iterable[Long])] = Array((campa,CompactBuffer(0)), (la,CompactBuffer(0, 1)), (panca,CompactBuffer(1, 0)), (sotto,CompactBuffer(1)), (crepa,CompactBuffer(1)), (sopra,CompactBuffer(0)), (capra,CompactBuffer(0, 1)))
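
`groupByKey` gives no ordering guarantee inside each group, which is why `panca` shows `CompactBuffer(1, 0)` above. If ordered line numbers are wanted, one possible approach is to sort each group after grouping. The sketch below uses an in-memory stand-in for capra; the names and the local-mode context are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses the notebook's context if one is already running
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("lab-102").setMaster("local[*]"))
val lines = sc.parallelize(Seq("sopra la panca la capra campa", "sotto la panca la capra crepa"))

val invertedIndex = lines
  .zipWithIndex()                                                 // (line, line number)
  .flatMap { case (line, n) => line.split(" ").map(w => (w, n)) } // (word, line number)
  .distinct()                                                     // drop repeats within a line
  .groupByKey()
  .mapValues(_.toList.sorted)                                     // ordered line numbers per word

val index = invertedIndex.collect().toMap
println(index("la"))    // List(0, 1)
println(index("panca")) // List(0, 1)
```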


## 102-3 Extra Spark jobs

Implement the following job.

- Co-occurrence count: count the number of co-occurrences in the text. A co-occurrence is defined as "two distinct words appearing in the same line".
  - In the first line of the *capra* dataset, co-occurrences are:
     - (sopra, la), (sopra, panca), (sopra, capra), (sopra, campa)
     - (la, sopra), (la, panca), (la, capra), (la, campa) 
     - (panca, sopra), (panca, la), (panca, capra), (panca, campa)
     - (capra, sopra), (capra, la), (capra, panca), (capra, campa)
     - (campa, sopra), (campa, la), (campa, panca), (campa, capra)

In [34]:
rddCapra.map(x => x.split(" "))
  .flatMap(w => w.combinations(2))
  .collect()

res30: Array[Array[String]] = Array(Array(sopra, la), Array(sopra, panca), Array(sopra, capra), Array(sopra, campa), Array(la, la), Array(la, panca), Array(la, capra), Array(la, campa), Array(panca, capra), Array(panca, campa), Array(capra, campa), Array(sotto, la), Array(sotto, panca), Array(sotto, capra), Array(sotto, crepa), Array(la, la), Array(la, panca), Array(la, capra), Array(la, crepa), Array(panca, capra), Array(panca, crepa), Array(capra, crepa))
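
The cell above lists unordered pairs, still includes `(la, la)`, and stops before counting. One possible way to finish the job is sketched below, under two assumptions not stated in the lab: pairs are ordered, as in the example list, and a word repeated within a line is counted once for that line. It runs on an in-memory stand-in for capra with an assumed local-mode context.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses the notebook's context if one is already running
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("lab-102").setMaster("local[*]"))
val lines = sc.parallelize(Seq("sopra la panca la capra campa", "sotto la panca la capra crepa"))

val cooccurrences = lines
  .map(_.split(" ").distinct.toSeq) // distinct words of each line
  .flatMap(ws => for (a <- ws; b <- ws if a != b) yield ((a, b), 1)) // ordered pairs of different words
  .reduceByKey(_ + _)               // number of lines in which each pair co-occurs

val result = cooccurrences.collect().toMap
println(result(("la", "panca")))       // 2
println(result(("sopra", "la")))       // 1
println(result.contains(("la", "la"))) // false
```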
