# 101 Spark basics

The goal of this lab is to get familiar with Spark programming.

- Scala
    - [Spark programming guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
    - [RDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html)
    - [PairRDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/PairRDDFunctions.html)
- Python
    - [Spark programming guide](https://spark.apache.org/docs/3.5.0/rdd-programming-guide.html)
    - [All RDD APIs](https://spark.apache.org/docs/3.5.0/api/python/reference/api/pyspark.RDD.html)

Use `Tab` for autocompletion, `Shift+Tab` for documentation.

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Local Spark") \
    .config('spark.ui.port', '4040') \
    .getOrCreate()
sc = spark.sparkContext

sc

## 101-1 Spark warm-up

Load the ```capra``` and ```divinacommedia``` datasets and try the following actions:
- Show their content (```collect```)
- Count their rows (```count```)
- Split phrases into words (```map``` or ```flatMap```; what’s the difference?)
- Check the results (remember: evaluation is lazy)
- Try the ```toDebugString``` function to check the execution plan
    - In PySpark, use ```toDebugString().decode("unicode_escape")```
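
To make the `map` versus `flatMap` question concrete, here is a minimal plain-Python analogy that needs no Spark. The `lines` list is made-up sample data, and the list comprehensions only mimic what the two transformations do on an RDD; they are not Spark calls.

```python
# Plain-Python analogy of RDD.map and RDD.flatMap; `lines` is made-up sample data.
lines = ["sopra la panca", "la capra campa"]

# map: exactly one output element per input element,
# so each line becomes one list of words (a list of lists overall).
mapped = [line.split(" ") for line in lines]

# flatMap: each input element may yield zero or more output elements,
# and all of them end up in a single flat collection of words.
flat_mapped = [word for line in lines for word in line.split(" ")]

print(mapped)
print(flat_mapped)
```

With `map` the number of elements stays equal to the number of lines, while with `flatMap` it becomes the total number of words.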

In [3]:
rddCapra = sc.textFile("../../../../datasets/capra.txt")
rddDC = sc.textFile("../../../../datasets/divinacommedia.txt")

In [4]:
rddCapra.collect()
rddDC.collect()
rddCapra.count()
rddDC.count()
rddCapraSplit = rddCapra.flatMap(lambda x: x.split(" "))
rddCapraSplit.collect()
rddDCSplit = rddDC.flatMap(lambda x: x.split(" ")).filter(lambda x: len(x) > 0)  # Remove empty strings
# flatMap returns a flat list of words, while map returns a list of lists (one list of words per line)
print(rddCapraSplit.toDebugString().decode("unicode_escape"))

(1) PythonRDD[6] at collect at /tmp/ipykernel_476/3149173467.py:6 []
 |  ../../../../datasets/capra.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
 |  ../../../../datasets/capra.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []


## 101-2 Basic Spark jobs

Implement the following jobs in Spark and test them on both the capra and divinacommedia datasets.

- **Word count**: count the number of occurrences of each word
  - Result: (sopra, 1), (la, 4), …
- **Word length count**: count the number of occurrences of words of given lengths
  - Result: (2, 4), (5, 8)
- Compute the average length of words given their first letter (e.g., words that begin with "s" have an average length of 5)
  - Result: (s, 5), (l, 2), …
- Return the inverted index of words (i.e., for each word, list the numbers of the lines in which it appears)
  - Result: (sopra, (0)), (la, (0, 1)), ...
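
For the average-length job, note that an average cannot be reduced pairwise directly, because the average of two averages is not the overall average. A common workaround is to reduce `(sum, count)` pairs and divide only at the end. The sketch below shows that idea in plain Python; the `words` list is an assumption reconstructed from the expected results above, and `combine` is a hypothetical helper playing the role of the function passed to `reduceByKey`.

```python
# Plain-Python sketch of the (sum, count) pattern; no Spark required.
# Assumption: sample words reconstructed from the expected results of the lab.
words = "sopra la panca la capra campa sotto la panca la capra crepa".split(" ")

def combine(a, b):
    """Merge two (total_length, word_count) pairs; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])

# Equivalent of map(w -> (first_letter, (len(w), 1))) followed by reduceByKey(combine).
totals = {}
for w in words:
    key, value = w[0], (len(w), 1)
    totals[key] = combine(totals[key], value) if key in totals else value

# Equivalent of mapValues: divide the total length by the count only at the end.
averages = {letter: total / count for letter, (total, count) in totals.items()}
print(averages)
```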

Also, check how sorting works and try to sort key-value RDDs by descending values.
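
As a hint for the sorting exercise, the sketch below sorts key-value pairs by descending value in plain Python. The `counts` list is sample data assumed from the expected word-count results. On an RDD, the comparable call would be `sortBy(lambda kv: kv[1], ascending=False)`, whereas `sortByKey` orders by key only, so using it for this purpose requires swapping key and value first.

```python
# Plain-Python sketch of sorting (key, value) pairs by descending value.
# Assumption: sample word counts taken from the expected results of the lab.
counts = [("sopra", 1), ("la", 4), ("panca", 2), ("capra", 2),
          ("campa", 1), ("sotto", 1), ("crepa", 1)]

# Sort on the value (second element of each pair), largest first.
by_value_desc = sorted(counts, key=lambda kv: kv[1], reverse=True)
print(by_value_desc)
```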

In [18]:
# Note: jobs without a final action (e.g., collect) are lazy and compute nothing until one is added
# Word count
rddCapraSplit.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
rddDCSplit.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
# Word length count
rddCapraSplit.map(lambda x: (len(x), 1)).reduceByKey(lambda x, y: x + y)
rddDCSplit.map(lambda x: (len(x), 1)).reduceByKey(lambda x, y: x + y)
# Average length of words given their first letter: reduce (total length, count) pairs, then divide
rddCapraSplit.map(lambda x: (x[0], (len(x), 1))).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])).map(lambda x: (x[0], x[1][0] / x[1][1]))
rddDCSplit.map(lambda x: (x[0], (len(x), 1))).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])).mapValues(lambda x: x[0] / x[1])
# Inverted index, version 1: pair each line with its number, emit (word, line number), deduplicate, group by word
rddCapra.map(lambda x: x.split(" ")).zipWithIndex().map(lambda x: (x[1], x[0])).flatMapValues(lambda words: words).map(lambda x: (x[1], x[0])).distinct().groupByKey().mapValues(list).collect()

# Inverted index, version 2: same result, emitting the (word, line number) pairs directly with flatMap
rddCapra.map(lambda x: x.split(" ")).zipWithIndex().flatMap(lambda el: [(w, el[1]) for w in el[0]]).distinct().groupByKey().mapValues(list).collect()

[('sopra', [0]),
 ('la', [0, 1]),
 ('panca', [0, 1]),
 ('capra', [0, 1]),
 ('campa', [0]),
 ('sotto', [1]),
 ('crepa', [1])]

## 101-3 Extra Spark jobs

Implement the following job.

- Co-occurrence count: count the number of co-occurrences in the text. A co-occurrence is defined as "two distinct words appearing in the same line".
  - In the first line of the *capra* dataset, co-occurrences are:
     - (sopra, la), (sopra, panca), (sopra, capra), (sopra, campa)
     - (la, sopra), (la, panca), (la, capra), (la, campa) 
     - (panca, sopra), (panca, la), (panca, capra), (panca, campa)
     - (capra, sopra), (capra, la), (capra, panca), (capra, campa)
     - (campa, sopra), (campa, la), (campa, panca), (campa, capra)
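
One possible way to approach this job is sketched below in plain Python, without Spark. It rests on two assumptions: the `lines` content is reconstructed from the expected results of the previous exercises, and each ordered pair of distinct words is counted once per line, even when a word such as "la" is repeated within that line. In Spark, the same idea could be expressed by emitting `((word1, word2), 1)` pairs with `flatMap` and summing them with `reduceByKey`.

```python
from collections import Counter
from itertools import permutations

# Assumption: dataset content reconstructed from the expected results of the lab.
lines = ["sopra la panca la capra campa", "sotto la panca la capra crepa"]

cooccurrences = Counter()
for line in lines:
    # Deduplicate the words of the line so that a pair is counted once per line.
    distinct_words = sorted(set(line.split(" ")))
    # permutations(..., 2) yields every ordered pair of two different words.
    cooccurrences.update(permutations(distinct_words, 2))

print(cooccurrences[("la", "panca")])  # pair present in both lines
print(cooccurrences[("sopra", "la")])  # pair present only in the first line
```

Counting ordered pairs matches the listing above, where both (sopra, la) and (la, sopra) appear; using `itertools.combinations` instead would count each unordered pair once.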