# 101 Spark basics

The goal of this lab is to get familiar with Spark programming.

- Scala
    - [Spark programming guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
    - [RDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html)
    - [PairRDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/PairRDDFunctions.html)
- Python
    - [Spark programming guide](https://spark.apache.org/docs/3.5.0/rdd-programming-guide.html)
    - [All RDD APIs](https://spark.apache.org/docs/3.5.0/api/python/reference/api/pyspark.RDD.html)

Use `Tab` for autocompletion, `Shift+Tab` for documentation.

In [2]:
from pyspark.sql import SparkSession

In [3]:
# Creates a local Spark session (running on your machine, not a cluster).
# .master("local"): runs Spark locally with a single worker thread;
#   use "local[*]" to use all your local cores, or "local[N]" for N threads
# .appName("Local Spark"): sets the name that appears in the Spark UI

spark = SparkSession.builder \
    .master("local") \
    .appName("Local Spark") \
    .config('spark.ui.port', '4040') \
    .getOrCreate()

# Gets the SparkContext, the low-level API that manages RDDs
sc = spark.sparkContext

sc

## Examples

In [4]:
# sc.parallelize() takes a Python collection (list, range, etc.) and converts 
# it into a Spark RDD.
data = [1, 2, 3, 4, 5, 6, 7, 8]
rdd = sc.parallelize(data, numSlices = 4) # Each partition is a chunk of data 
# that a task will process.
rdd.glom().collect()

# parallelize() is mainly for small in-memory datasets. For large datasets,
# it is better to use sc.textFile or DataFrames.
# The number of partitions controls the degree of parallelism; it is independent
# of the total number of elements.
# Use rdd.getNumPartitions() to check how many partitions an RDD has.

[[1, 2], [3, 4], [5, 6], [7, 8]]
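
The partitioning shown in this output can be mimicked in plain Python. The sketch below is only an illustration: `slice_into_partitions` is a hypothetical helper, not a Spark API, and it assumes contiguous, near-equal chunks. PySpark's actual chunking may differ when the collection does not divide evenly.

```python
def slice_into_partitions(data, num_slices):
    """Split a list into num_slices contiguous, near-equal chunks (hypothetical helper)."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


data = [1, 2, 3, 4, 5, 6, 7, 8]
partitions = slice_into_partitions(data, 4)
print(partitions)       # [[1, 2], [3, 4], [5, 6], [7, 8]]
print(len(partitions))  # 4: set by num_slices, not by the number of elements
```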

In [5]:
# wholeTextFiles keeps the entire file content together, and each element is a
# tuple (file_path, file_content)
rdd = sc.wholeTextFiles("../../../../datasets")
#rdd.collect()

## 101-1 Spark warm-up

Load the ```capra``` and ```divinacommedia``` datasets and try the following actions:

In [6]:
num_cores = sc.defaultParallelism
print("Available cores: ", num_cores)
rddCapra = sc.textFile("../../../../datasets/capra.txt", minPartitions = num_cores)
rddDC = sc.textFile("../../../../datasets/divinacommedia.txt")

Available cores:  1


In [7]:
# capra.txt is tiny and minPartitions equals the default parallelism (1 core here),
# so Spark reads the whole file as a single partition
rddCapra.getNumPartitions()

1

- Show their content (```collect```)
- Count their rows (```count```)

In [8]:
rddCapra.collect()
# rddCapra.count()

['sopra la panca la capra campa', 'sotto la panca la capra crepa']

- Split phrases into words (```map``` or ```flatMap```; what’s the difference?)
- Check the results (remember: evaluation is lazy)

In [9]:
# What is lazy evaluation in Spark?
# In Spark, transformations like map() and flatMap() are lazy. 
# Lazy means: Spark does not execute them immediately when you define them.
# Spark just builds a plan (DAG) describing how to compute the RDD.
# The actual computation is triggered only when you call an action, such as collect().
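
Python's built-in `map` is also lazy, which gives a rough way to see the idea without Spark. The sketch below is an analogy only, not how Spark is implemented: `split_line` and the `calls` list are names made up for this illustration.

```python
calls = []  # records every line the function was actually applied to


def split_line(line):
    calls.append(line)
    return line.split(" ")


lines = ["sopra la panca la capra campa", "sotto la panca la capra crepa"]

# Analogue of a transformation: map() only returns a lazy iterator
lazy_words = map(split_line, lines)
calls_before = len(calls)

# Analogue of an action: consuming the iterator triggers the computation
result = list(lazy_words)
calls_after = len(calls)

print(calls_before, calls_after)  # 0 2
```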

In [10]:
rddCapraWords1 = rddCapra.map(lambda element : element.split(" "))
rddCapraWords1.collect()
# with map each element of the new RDD is a list of words, but nested structure
# remains (list inside RDD)

[['sopra', 'la', 'panca', 'la', 'capra', 'campa'],
 ['sotto', 'la', 'panca', 'la', 'capra', 'crepa']]

In [11]:
rddCapraWords1.count()

2

In [12]:
rddCapraWords2 = rddCapra.flatMap(lambda element : element.split(" "))
rddCapraWords2.collect()
# flatMap() flattens all the lists into a single RDD of words.

['sopra',
 'la',
 'panca',
 'la',
 'capra',
 'campa',
 'sotto',
 'la',
 'panca',
 'la',
 'capra',
 'crepa']

In [13]:
rddCapraWords2.count()

12
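
The difference between the two results can be reproduced in plain Python. This is a sketch of the semantics only, assuming the two lines of the capra dataset shown earlier: `map` keeps one output element per input element, while `flatMap` behaves like `map` followed by flattening one level.

```python
from itertools import chain

lines = ["sopra la panca la capra campa", "sotto la panca la capra crepa"]


def split_words(line):
    return line.split(" ")


# map analogue: one list of words per line, nesting preserved
mapped = [split_words(line) for line in lines]

# flatMap analogue: the per-line lists are concatenated into one sequence
flat_mapped = list(chain.from_iterable(split_words(line) for line in lines))

print(len(mapped))       # 2
print(len(flat_mapped))  # 12
```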

- Try the ```toDebugString``` function to check the execution plan
    - In PySpark, use ```toDebugString().decode("unicode_escape")```

In [14]:
# Every RDD in Spark keeps track of its lineage, i.e., the chain of 
# transformations that led to it. 
# toDebugString prints a textual representation of this DAG (Directed Acyclic 
# Graph). It's useful for debugging, understanding lazy evaluation, and seeing
# how many stages and partitions Spark will use.

rddL = rddCapra. \
   flatMap(lambda x : x.split(" ") ). \
   map(lambda x : (x,1)). \
   reduceByKey(lambda x,y : x+y)
rddL.collect()
print(rddL.toDebugString().decode("unicode_escape"))

(1) PythonRDD[16] at collect at /tmp/ipykernel_249/3570472243.py:11 []
 |  MapPartitionsRDD[15] at mapPartitions at PythonRDD.scala:160 []
 |  ShuffledRDD[14] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(1) PairwiseRDD[13] at reduceByKey at /tmp/ipykernel_249/3570472243.py:10 []
    |  PythonRDD[12] at reduceByKey at /tmp/ipykernel_249/3570472243.py:10 []
    |  ../../../../datasets/capra.txt MapPartitionsRDD[5] at textFile at NativeMethodAccessorImpl.java:0 []
    |  ../../../../datasets/capra.txt HadoopRDD[4] at textFile at NativeMethodAccessorImpl.java:0 []
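
To read this plan, it helps to spot where the indentation changes. The sketch below assumes that an indented `+-` line marks a shuffle dependency (here the one introduced by `reduceByKey`), which is how this output reads. `count_shuffle_boundaries` is a hypothetical helper, not part of PySpark.

```python
plan = """\
(1) PythonRDD[16] at collect at /tmp/ipykernel_249/3570472243.py:11 []
 |  MapPartitionsRDD[15] at mapPartitions at PythonRDD.scala:160 []
 |  ShuffledRDD[14] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(1) PairwiseRDD[13] at reduceByKey at /tmp/ipykernel_249/3570472243.py:10 []
    |  PythonRDD[12] at reduceByKey at /tmp/ipykernel_249/3570472243.py:10 []
    |  ../../../../datasets/capra.txt MapPartitionsRDD[5] at textFile at NativeMethodAccessorImpl.java:0 []
    |  ../../../../datasets/capra.txt HadoopRDD[4] at textFile at NativeMethodAccessorImpl.java:0 []
"""


def count_shuffle_boundaries(debug_string):
    """Count the lines that open a new indentation level with '+-' (assumed shuffle marker)."""
    return sum(1 for line in debug_string.splitlines()
               if line.lstrip().startswith("+-"))


shuffles = count_shuffle_boundaries(plan)
stages = shuffles + 1  # each shuffle splits the job into one more stage
print(shuffles, stages)  # 1 2
```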


## 101-2 Basic Spark jobs

Implement the following jobs in Spark and test them on both the capra and divinacommedia datasets.

- **Word count**: count the number of occurrences of each word
  - Result: (sopra, 1), (la, 4), …
- **Word length count**: count the number of occurrences of words of given lengths
  - Result: (2, 4), (5, 8)
- Compute the **average length of words given their first letter** (e.g., words that begin with "s" have an average length of 5)
  - Result: (s, 5), (l, 2), …
- Return the **inverted index of words** (i.e., for each word, list the numbers of the lines in which it appears)
  - Result: (sopra, (0)), (la, (0, 1)), ...

Also, check how sorting works and try to sort key-value RDDs by descending values.

In [19]:
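# Word count, sorted by descending count:
# swap each pair to (count, word), then sortByKey(False) sorts in descending order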
rddCapra. \
    flatMap(lambda x : x.split(" ")). \
    map(lambda x : (x,1)). \
    reduceByKey(lambda x,y : x + y). \
    map(lambda kv : (kv[1], kv[0])). \
    sortByKey(False). \
    collect()

[(4, 'la'),
 (2, 'panca'),
 (2, 'capra'),
 (1, 'sopra'),
 (1, 'campa'),
 (1, 'sotto'),
 (1, 'crepa')]
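
Swapping keys and values is one way to sort by value. Another is to sort with a key function, which `RDD.sortBy(keyfunc, ascending=False)` also accepts. The sketch below shows the idea on a local list with Python's `sorted`. The `counts` list is copied from the word count result, and `by_value` is a name chosen for this illustration.

```python
counts = [("sopra", 1), ("la", 4), ("panca", 2), ("capra", 2),
          ("campa", 1), ("sotto", 1), ("crepa", 1)]


def by_value(kv):
    """Key function: order a (key, value) pair by its value."""
    return kv[1]


by_desc_value = sorted(counts, key=by_value, reverse=True)
print(by_desc_value[0])  # ('la', 4)
```

On an RDD of (word, count) pairs, the same key function could be passed as `.sortBy(by_value, ascending=False)`, which keeps the pairs in (word, count) order.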

In [20]:
# Word length count
rddCapra. \
    flatMap(lambda x : x.split(" ")). \
    map(lambda x : (len(x), 1)). \
    reduceByKey(lambda x,y : x + y). \
    collect()

[(5, 8), (2, 4)]

In [33]:
# Average length of words given their first letter
rddCapra. \
    flatMap(lambda x : x.split(" ")). \
    filter(lambda x : len(x) > 0). \
    map(lambda x : (x[0].lower(), (1, len(x)))). \
    reduceByKey(lambda x,y : (x[0] + y[0], x[1] + y[1])). \
    mapValues(lambda v : v[1]/v[0]). \
    collect()

[('s', 5.0), ('l', 2.0), ('p', 5.0), ('c', 5.0)]

In [38]:
# Average length of words given their first letter (alternative on the final map)
rddCapra. \
    flatMap(lambda x : x.split(" ")). \
    filter(lambda x : len(x) > 0). \
    map(lambda x : (x[0].lower(), (1, len(x)))). \
    reduceByKey(lambda x,y : (x[0] + y[0], x[1] + y[1])). \
    map(lambda kv : (kv[0], kv[1][1]/kv[1][0])). \
    collect()

[('s', 5.0), ('l', 2.0), ('p', 5.0), ('c', 5.0)]
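
Both versions carry a `(count, total_length)` pair through `reduceByKey` because the reduce function has to be associative and commutative. Sums are, but an average of averages is not. The plain-Python sketch below illustrates this. `reduce_by_key` is a hypothetical stand-in for the Spark operation, and it runs on the words of the capra dataset.

```python
from functools import reduce


def reduce_by_key(pairs, func):
    """Plain-Python analogue of reduceByKey: fold the values of each key with func."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


words = "sopra la panca la capra campa sotto la panca la capra crepa".split(" ")
pairs = [(w[0].lower(), (1, len(w))) for w in words if w]

# Sum counts and lengths separately, then divide once at the end
totals = reduce_by_key(pairs, lambda x, y: (x[0] + y[0], x[1] + y[1]))
averages = {letter: total / count for letter, (count, total) in totals.items()}
print(averages)  # {'s': 5.0, 'l': 2.0, 'p': 5.0, 'c': 5.0}
```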

In [75]:
# For each word, list the numbers of lines in which they appear
rddCapra. \
    map(lambda x : x.split(" ")). \
    zipWithIndex(). \
    flatMap(lambda x : [(word, x[1]) for word in x[0]]). \
    groupByKey(). \
    mapValues(lambda idxs: tuple(sorted(set(idxs)))). \
    collect()

[('sopra', (0,)),
 ('la', (0, 1)),
 ('panca', (0, 1)),
 ('capra', (0, 1)),
 ('campa', (0,)),
 ('sotto', (1,)),
 ('crepa', (1,))]

In [None]:
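# For each word, list the numbers of lines in which they appear
# (alternative: key by line number, flatten the words, then swap back)
rddCapra \
    .map(lambda x : x.split(" ")) \
    .zipWithIndex() \
    .map(lambda kv : (kv[1], kv[0])) \
    .flatMapValues(lambda v : v) \
    .map(lambda kv : (kv[1], kv[0])) \
    .groupByKey() \
    .mapValues(lambda idxs : tuple(sorted(set(idxs)))) \
    .collect()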

## 101-3 Extra Spark jobs

Implement the following job.

- **Co-occurrence count**: count the number of co-occurrences in the text. A co-occurrence is defined as "two distinct words appearing in the same line".
  - In the first line of the *capra* dataset, co-occurrences are:
     - (sopra, la), (sopra, panca), (sopra, capra), (sopra, campa)
     - (la, sopra), (la, panca), (la, capra), (la, campa) 
     - (panca, sopra), (panca, la), (panca, capra), (panca, campa)
     - (capra, sopra), (capra, la), (capra, panca), (capra, campa)
     - (campa, sopra), (campa, la), (campa, panca), (campa, capra)

In [84]:
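# Co-occurrence pairs: list the ordered pairs of distinct words in each line
# (the pairs are generated here but not yet counted)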
rddCapra \
    .map(lambda x : list(set(x.split(" ")))) \
    .map(lambda words: [(words[i], words[j])
                            for i in range(len(words))
                            for j in range(len(words))
                            if i != j]) \
    .collect()

[[('capra', 'panca'),
  ('capra', 'campa'),
  ('capra', 'la'),
  ('capra', 'sopra'),
  ('panca', 'capra'),
  ('panca', 'campa'),
  ('panca', 'la'),
  ('panca', 'sopra'),
  ('campa', 'capra'),
  ('campa', 'panca'),
  ('campa', 'la'),
  ('campa', 'sopra'),
  ('la', 'capra'),
  ('la', 'panca'),
  ('la', 'campa'),
  ('la', 'sopra'),
  ('sopra', 'capra'),
  ('sopra', 'panca'),
  ('sopra', 'campa'),
  ('sopra', 'la')],
 [('capra', 'panca'),
  ('capra', 'sotto'),
  ('capra', 'la'),
  ('capra', 'crepa'),
  ('panca', 'capra'),
  ('panca', 'sotto'),
  ('panca', 'la'),
  ('panca', 'crepa'),
  ('sotto', 'capra'),
  ('sotto', 'panca'),
  ('sotto', 'la'),
  ('sotto', 'crepa'),
  ('la', 'capra'),
  ('la', 'panca'),
  ('la', 'sotto'),
  ('la', 'crepa'),
  ('crepa', 'capra'),
  ('crepa', 'panca'),
  ('crepa', 'sotto'),
  ('crepa', 'la')]]
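
The cell above generates the pairs per line but does not count them yet. One possible way to finish the job is sketched below in plain Python so it can be checked without a cluster. `line_cooccurrences` is a hypothetical helper, and `Counter` stands in for the `flatMap`, `map` to `(pair, 1)`, and `reduceByKey` steps.

```python
from collections import Counter
from itertools import permutations


def line_cooccurrences(line):
    """Return the ordered pairs of distinct words in one line (hypothetical helper)."""
    distinct_words = sorted(set(line.split(" ")))
    return list(permutations(distinct_words, 2))


lines = ["sopra la panca la capra campa", "sotto la panca la capra crepa"]

# Local stand-in for flatMap + map to (pair, 1) + reduceByKey
cooccurrence_counts = Counter(
    pair for line in lines for pair in line_cooccurrences(line)
)

print(cooccurrence_counts[("la", "panca")])  # 2: the pair appears in both lines
print(cooccurrence_counts[("sopra", "la")])  # 1: the pair appears in the first line only
```

On the RDD, the same helper could be plugged in as `rddCapra.flatMap(line_cooccurrences).map(lambda p : (p, 1)).reduceByKey(lambda x, y : x + y)`.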