# 101 Spark basics

The goal of this lab is to get familiar with Spark programming.

- Scala
    - [Spark programming guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
    - [RDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html)
    - [PairRDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/PairRDDFunctions.html)
- Python
    - [Spark programming guide](https://spark.apache.org/docs/3.5.0/rdd-programming-guide.html)
    - [All RDD APIs](https://spark.apache.org/docs/3.5.0/api/python/reference/api/pyspark.RDD.html)

Use `Tab` for autocompletion, `Shift+Tab` for documentation.

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Local Spark") \
    .config('spark.ui.port', '4040') \
    .getOrCreate()
sc = spark.sparkContext

sc

## 101-1 Spark warm-up

Load the ```capra``` and ```divinacommedia``` datasets and try the following operations:
- Show their content (```collect```)
- Count their rows (```count```)
- Split phrases into words (```map``` or ```flatMap```; what’s the difference?)
- Check the results (remember: evaluation is lazy)
- Try the ```toDebugString``` function to check the execution plan
    - In PySpark, use ```toDebugString().decode("unicode_escape")```
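
The reminder about lazy evaluation can be illustrated with a plain-Python analogy. This is only a sketch and does not use Spark: Python's built-in `map` also defers work until its result is consumed, much as an RDD transformation does nothing until an action such as `collect` or `count` is called. The helper `split_and_log` and the `calls` list are hypothetical names introduced for this illustration; the two input lines are the content of the capra dataset.

```python
# Plain-Python analogy of lazy evaluation (no Spark involved).
lines = ["sopra la panca la capra campa", "sotto la panca la capra crepa"]
calls = []  # records every line that has actually been processed


def split_and_log(line):
    """Split a line into words and remember that it was processed."""
    calls.append(line)
    return line.split(" ")


lazy_words = map(split_and_log, lines)  # like a transformation: nothing runs yet
calls_before = len(calls)               # still 0 at this point

words = list(lazy_words)                # like an action: forces the computation
calls_after = len(calls)                # now 2, one call per line
```

In Spark the same idea applies to `rddCapra.map(...)`: the lambda only runs once an action asks for results.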

In [3]:
rddCapra = sc.textFile("../../../../datasets/capra.txt")
rddDC = sc.textFile("../../../../datasets/divinacommedia.txt")

In [4]:
rddCapraWords1 = rddCapra.map(lambda x : x.split(" ") )
rddCapraWords1.collect()

[['sopra', 'la', 'panca', 'la', 'capra', 'campa'],
 ['sotto', 'la', 'panca', 'la', 'capra', 'crepa']]

In [5]:
rddCapraWords1.count()

2

In [6]:
rddCapraWords2 = rddCapra.flatMap(lambda x : x.split(" ") )
rddCapraWords2.collect()

['sopra',
 'la',
 'panca',
 'la',
 'capra',
 'campa',
 'sotto',
 'la',
 'panca',
 'la',
 'capra',
 'crepa']

In [7]:
rddCapraWords2.count()

12

In [8]:
rddL = rddCapra. \
   flatMap(lambda x : x.split(" ") ). \
   map(lambda x : (x,1)). \
   reduceByKey(lambda x,y : x+y)
print(rddL.toDebugString().decode("unicode_escape"))

(1) PythonRDD[12] at RDD at PythonRDD.scala:53 []
 |  MapPartitionsRDD[11] at mapPartitions at PythonRDD.scala:160 []
 |  ShuffledRDD[10] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(1) PairwiseRDD[9] at reduceByKey at /tmp/ipykernel_364/3033110671.py:4 []
    |  PythonRDD[8] at reduceByKey at /tmp/ipykernel_364/3033110671.py:4 []
    |  ../../../../datasets/capra.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
    |  ../../../../datasets/capra.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []


## 101-2 Basic Spark jobs

Implement the following jobs in Spark and test them on both the ```capra``` and ```divinacommedia``` datasets.

- **Word count**: count the number of occurrences of each word
  - Result: (sopra, 1), (la, 4), …
- **Word length count**: count how many words there are of each length
  - Result: (2, 4), (5, 8)
- **Average word length by initial**: compute the average length of the words that share the same first letter (e.g., words that begin with "s" have an average length of 5)
  - Result: (s, 5), (l, 2), …
- **Inverted index**: for each word, list the numbers of the lines in which it appears
  - Result: (sopra, (0)), (la, (0, 1)), …

Also, check how sorting works and try to sort key-value RDDs by descending values.

In [9]:
# Word count, sorted by descending count (swap key and value, then sortByKey)
rddCapra. \
  flatMap(lambda x : x.split(" ") ). \
  map(lambda x : (x,1)). \
  reduceByKey(lambda x,y : x + y). \
  map(lambda kv: (kv[1],kv[0])). \
  sortByKey(False). \
  collect()

[(4, 'la'),
 (2, 'panca'),
 (2, 'capra'),
 (1, 'sopra'),
 (1, 'campa'),
 (1, 'sotto'),
 (1, 'crepa')]
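
The cell above sorts by value by swapping key and value before calling `sortByKey`. A possible alternative, sketched here, is `sortBy` with a key function, which keeps the pairs in `(word, count)` form. The names `by_count` and `sorted_word_count` are hypothetical helpers introduced for this example; the function expects an RDD of text lines such as `rddCapra`.

```python
def by_count(kv):
    """Key function: pick the count out of a (word, count) pair."""
    return kv[1]


def sorted_word_count(rdd):
    """Word count over an RDD of lines, sorted by descending count.

    Assumes `rdd` is an RDD of strings (e.g. the result of sc.textFile).
    """
    return (
        rdd.flatMap(lambda line: line.split(" "))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda x, y: x + y)
        .sortBy(by_count, ascending=False)  # sort on the value, not the key
    )
```

Calling `sorted_word_count(rddCapra).collect()` should return the same counts as the cell above, with each pair in `(word, count)` order; the relative order of words with equal counts is not guaranteed.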

In [10]:
# Word length count
rddCapra. \
  flatMap(lambda x : x.split(" ") ). \
  map(lambda x : (len(x),1)). \
  reduceByKey(lambda x,y : x + y). \
  collect()

[(5, 8), (2, 4)]

In [11]:
# Average word length by initial
rddCapra. \
  flatMap(lambda x : x.split(" ") ). \
  filter(lambda x : len(x)>0 ). \
  map(lambda x : (x[0:1].lower(), (1,len(x)))). \
  reduceByKey(lambda x, y : (x[0] + y[0], x[1] + y[1])). \
  mapValues(lambda v : v[1]/v[0]). \
  collect()

[('s', 5.0), ('l', 2.0), ('p', 5.0), ('c', 5.0)]

In [12]:
# Average word length by initial (alternative on the final map)
rddCapra. \
  flatMap(lambda x : x.split(" ") ). \
  filter(lambda x : len(x)>0 ). \
  map(lambda x : (x[0:1].lower(), (1,len(x)))). \
  reduceByKey(lambda x, y : (x[0] + y[0], x[1] + y[1])). \
  map(lambda kv : (kv[0], kv[1][1]/kv[1][0])). \
  collect()

[('s', 5.0), ('l', 2.0), ('p', 5.0), ('c', 5.0)]

In [13]:
# Inverted index (word-based offset)
rddCapra. \
  flatMap(lambda x : x.split(" ") ). \
  zipWithIndex(). \
  groupByKey(). \
  mapValues(lambda v: list(v)). \
  collect()

[('sopra', [0]),
 ('la', [1, 3, 7, 9]),
 ('panca', [2, 8]),
 ('capra', [4, 10]),
 ('campa', [5]),
 ('sotto', [6]),
 ('crepa', [11])]

In [14]:
# Inverted index (sentence-based offset)
rddCapra. \
  zipWithIndex(). \
  flatMap(lambda kv : [(x,kv[1]) for x in kv[0].split(" ")]). \
  distinct(). \
  groupByKey(). \
  mapValues(lambda v: list(v)). \
  collect()

[('sopra', [0]),
 ('la', [0, 1]),
 ('panca', [0, 1]),
 ('capra', [0, 1]),
 ('campa', [0]),
 ('sotto', [1]),
 ('crepa', [1])]

In [15]:
# Inverted index (sentence-based offset) alternative
rddCapra.zipWithIndex(). \
  map(lambda kv : (kv[1],kv[0])). \
  flatMapValues(lambda x : x.split(" ") ). \
  map(lambda kv : (kv[1],kv[0])). \
  distinct(). \
  groupByKey(). \
  mapValues(lambda v: list(v)). \
  collect()

[('sopra', [0]),
 ('la', [0, 1]),
 ('panca', [0, 1]),
 ('capra', [0, 1]),
 ('campa', [0]),
 ('sotto', [1]),
 ('crepa', [1])]

## 101-3 Extra Spark jobs

Implement the following job.

- **Co-occurrence count**: count how many times each co-occurrence appears in the text. A co-occurrence is defined as "two distinct words appearing in the same line".
  - In the first line of the *capra* dataset, the co-occurrences are:
     - (sopra, la), (sopra, panca), (sopra, capra), (sopra, campa)
     - (la, sopra), (la, panca), (la, capra), (la, campa)
     - (panca, sopra), (panca, la), (panca, capra), (panca, campa)
     - (capra, sopra), (capra, la), (capra, panca), (capra, campa)
     - (campa, sopra), (campa, la), (campa, panca), (campa, capra)
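
One possible way to approach this job is sketched below; it is not necessarily the intended solution. It assumes that a word repeated within a line (such as "la") counts only once for that line, which matches the 20 pairs listed above, and that pairs are ordered, so (sopra, la) and (la, sopra) are counted separately. The names `line_pairs` and `cooccurrence_counts` are hypothetical helpers introduced for this example.

```python
from itertools import permutations


def line_pairs(line):
    """Return the ordered pairs of distinct words found on one line.

    Assumption: a word repeated on the same line is considered once.
    """
    # dict.fromkeys removes duplicates while keeping first-seen order
    words = dict.fromkeys(w for w in line.split(" ") if w)
    return list(permutations(words, 2))


def cooccurrence_counts(rdd):
    """Count each co-occurrence over an RDD of lines (e.g. rddCapra)."""
    return (
        rdd.flatMap(line_pairs)              # one record per pair per line
        .map(lambda pair: (pair, 1))
        .reduceByKey(lambda x, y: x + y)     # sum occurrences across lines
    )
```

For example, `cooccurrence_counts(rddCapra).collect()` would yield records of the form `((word1, word2), count)`. Keeping the per-line logic in a plain function makes it easy to check locally before running it on the larger divinacommedia dataset.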