# Spark Core API by Examples

This notebook illustrates how Spark works by means of simple examples. The notebook is executed in Python through PySpark (and Jupyter).

The PySpark Documentation is available at https://spark.apache.org/docs/latest/api/python/pyspark.html

#### General imports and starting Spark

In [None]:
#This is needed to start a Spark session from the notebook
#You may adjust the memory used by the driver program based on your machine's settings
import os 
os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=3g  pyspark-shell"

from pyspark.sql import SparkSession

Spark can run in multiple modes, including:
- **Local mode**: Spark runs only on the computer that runs this notebook. In this mode, Spark can still exploit parallelism by using all available processor cores.
- **Cluster mode**: Spark runs using the resources made available by a resource manager at a cluster. The resource manager is responsible for allocating so-called *executor* instances (each with a number of CPU cores and an amount of memory) to Spark. For this class, we will unfortunately not be able to illustrate Spark running in cluster mode.

In [None]:
# -------------------------------
# Start Spark in LOCAL mode
# -------------------------------

# The following lines allow this cell to be re-executed multiple times:
# if a Spark session was already started, we stop it before starting a new one
# (there can be only one Spark context per Jupyter notebook)
try:
    spark
    print("Spark application already started. Terminating existing application and starting new one")
    spark.stop()
except NameError:
    # `spark` is not defined yet, so there is nothing to stop
    pass

# Create a new spark session (note, the * indicates to use all available CPU cores)
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("demoRDD") \
    .getOrCreate()
    
# When dealing with RDDs, we work with the SparkContext object. See https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
sc = spark.sparkContext

**Note**: after creating the Spark context in local mode, you can access the Spark web UI at http://localhost:4040

In [None]:
# Check that we have a working Spark context and print its configuration
sc.getConf().getAll()

### Word Count in Spark

The `data/books` folder contains books (in plain-text format) downloaded from Project Gutenberg (http://www.gutenberg.org/cache/epub/). In that folder you will also find a shell script to download more books (`data/books/download_more_books.sh`), should you wish to experiment with a larger collection of files.

In [None]:
# Let's list the files in data/books
!ls data/books/

In [None]:
# Load the contents of a file into an RDD. Note: when run on the cluster, this loads from HDFS (inside /user/$USER/)
# if you really want to load from HDFS, you can also put the full HDFS url, e.g.
# hdfs://public00:8020/user/<your_user_id_here>/data/books/pg20417.txt
fileName = 'data/books/pg20417.txt'
bookRDD = sc.textFile(fileName)

In [None]:
# Show 5 elements of bookRDD
bookRDD.take(5)
# When a text file is loaded, it is converted into an RDD in which each line is one element.

In [None]:
# We can operate on each element of an RDD by invoking the map() transformation.
#
# MAP : Return a new RDD by applying a function to each element of this RDD.
# 
# Specifically, we split each line into multiple values by splitting on whitespace separators.
# The result is hence an RDD where each element is itself a list.
tupleRDD = bookRDD.map(lambda line: line.split())
tupleRDD.take(3)

In [None]:
# Let's turn the bookRDD into an RDD of words - each word becomes an element in the RDD
# To do so, we use flatMap instead of map
#
# FLATMAP : Return a new RDD by first applying a function to all elements of this RDD, 
#           and then flattening the results.
# 
wordsRDD = bookRDD.flatMap(lambda line: line.split())
wordsRDD.take(5)
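
The difference between `map()` and `flatMap()` can also be seen without Spark. The following is a minimal plain-Python sketch (the three `lines` are made-up sample data, not taken from the books): the list comprehension plays the role of `map()`, and `itertools.chain.from_iterable` plays the role of the flattening step in `flatMap()`.

```python
from itertools import chain

# Made-up sample data standing in for the lines of a text file
lines = ["the quick brown fox", "jumps over", "the lazy dog"]

# map analogue: one output element per input element (here: one list per line)
mapped = [line.split() for line in lines]

# flatMap analogue: apply the same function, then flatten the per-line lists
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)       # a list of lists, one inner list per line
print(flat_mapped)  # a single flat list of words
```

The mapped result has as many elements as there are lines, whereas the flattened result has as many elements as there are words.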

In [None]:
# do a wordCount by invoking the count() action
wordsRDD.count()

In [None]:
# Alternatively, we can get the same result by using the reduce() action.
#
# REDUCE : Reduces the elements of this RDD into a single value using the specified commutative and
#          associative binary operator. Currently reduces partitions locally.
# Note: we first map the RDD into an RDD of integers
wordsRDD.map(lambda x: 1).reduce(lambda a,b: a+b)
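
Why must the operator be commutative and associative? The sketch below emulates, in plain Python, a reduce that first reduces each partition locally and then combines the partial results. The helper `partitioned_reduce` and the two ways of splitting the data are hypothetical and only serve the illustration; they are not Spark's actual implementation.

```python
from functools import reduce

def partitioned_reduce(partitions, op):
    """Reduce each non-empty partition locally, then reduce the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)

data = list(range(1, 9))  # the numbers 1..8
two_parts = [data[:4], data[4:]]
four_parts = [data[0:2], data[2:4], data[4:6], data[6:8]]

def add(a, b):
    return a + b

def sub(a, b):
    return a - b

# Addition is associative and commutative: the partitioning does not matter
sum_two = partitioned_reduce(two_parts, add)
sum_four = partitioned_reduce(four_parts, add)

# Subtraction is neither: the result depends on how the data is partitioned
diff_two = partitioned_reduce(two_parts, sub)
diff_four = partitioned_reduce(four_parts, sub)

print(sum_two, sum_four)
print(diff_two, diff_four)
```

With addition both partitionings give the same total, while with subtraction they disagree. This is why an operator passed to `reduce()` should not depend on the order or grouping of the elements.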

**Exercise**: what happens if you reduce wordsRDD without mapping it first to an RDD of integers?

In [None]:
# Your answer here: reduce wordsRDD without mapping it first to an RDD of integers.
# Can you explain why this happens?


In [None]:
# count how many unique words there are. The distinct() transformation removes duplicate elements.
wordsRDD.distinct().count()

In [None]:
# Count how many times each unique word appears in wordsRDD by invoking the countByValue() action
wordsRDD.countByValue()
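
For intuition, `count()`, `distinct().count()` and `countByValue()` have close counterparts in plain Python. The sketch below uses a made-up list of words; `collections.Counter` is only an analogue of the dictionary that `countByValue()` returns to the driver.

```python
from collections import Counter

# Made-up sample data standing in for wordsRDD
words = "to be or not to be".split()

total = len(words)             # analogue of wordsRDD.count()
n_distinct = len(set(words))   # analogue of wordsRDD.distinct().count()
counts = Counter(words)        # analogue of wordsRDD.countByValue()

print(total, n_distinct)
print(dict(counts))
```

Note that, like `Counter`, the result of `countByValue()` is held entirely in the memory of the driver program, so it is best suited to data with a manageable number of distinct values.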

### Do a wordcount on all files

In [None]:
allBooksRDD = sc.textFile('data/books/*.txt')

In [None]:
bookRDD.count()

In [None]:
allBooksRDD.count()

In [None]:
allWordsRDD = allBooksRDD.flatMap(lambda line: line.split())
allWordsRDD.countByValue()

### Other transformations: sample, distinct, filter

In [None]:
# Get a sample of roughly 10% of the words (without replacement; the fraction is an expected size, not an exact one)
sample = allWordsRDD.sample(False, 0.1)
sample.count()

In [None]:
#map to lower case, ignore duplicates
lowerSample = sample.map(lambda w: w.lower()).distinct()
lowerSample.count()

In [None]:
# filter() returns the subset of elements of an RDD that satisfy a given predicate
startsWithARDD = allWordsRDD.filter(lambda w: w.startswith('a'))
print("First three elements starting with 'a'", startsWithARDD.take(3))
print("Number of elements starting with 'a'", startsWithARDD.count())
print("Number of distinct elements starting with 'a'", startsWithARDD.distinct().count())

**Exercise**: the code above counts all words that start with the lowercase letter 'a'. Using the same filter predicate (i.e., `lambda w: w.startswith('a')`), calculate the number of words that start with either 'a' or 'A'.

In [None]:
# Your answer here


### Default aggregations on RDD of ints/RDD of doubles

In [None]:
# Several simple aggregation functions are predefined on RDDs that contain only numbers
intsRDD = sc.parallelize(range(1,10))
intsRDD.collect()

In [None]:
intsRDD.sum()

In [None]:
intsRDD.mean()

In [None]:
intsRDD.stdev()

In [None]:
intsRDD.variance()
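
Note that `stdev()` and `variance()` compute the *population* statistics (dividing by N); the RDD API also offers `sampleStdev()` and `sampleVariance()`, which divide by N-1. The following plain-Python sketch uses the standard `statistics` module on the same numbers as `intsRDD` to make the difference concrete. It is an analogue for checking the results by hand, not Spark code.

```python
import statistics

values = list(range(1, 10))  # same numbers as intsRDD: 1..9

total = sum(values)                                 # analogue of intsRDD.sum()
mean = statistics.mean(values)                      # analogue of intsRDD.mean()
population_variance = statistics.pvariance(values)  # divides by N
population_stdev = statistics.pstdev(values)        # square root of the above
sample_variance = statistics.variance(values)       # divides by N-1

print(total, mean)
print(population_variance, population_stdev)
print(sample_variance)
```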

### Set operations

In [None]:
# UNION : Return the union of this RDD and another one; duplicates are kept

one = sc.parallelize(range(1,10))
two = sc.parallelize(range(9,21))
print("One : ",one.collect())
print("Two : ",two.collect())
print("One union Two : ",one.union(two).collect())

In [None]:
# INTERSECTION : Return the intersection of this RDD and another one. 
# The output will not contain any duplicate elements, even if the input RDDs did.

one = sc.parallelize(range(1,10))
two = sc.parallelize(range(5,15))
print("One : ",one.collect())
print("Two : ",two.collect())
print("One intersection Two : ",one.intersection(two).collect())

In [None]:
# SUBTRACT : Return each element of this RDD that is not contained in the other one (set difference)
one = sc.parallelize(range(1,10))
two = sc.parallelize(range(5,15))
print("One : ",one.collect())
print("Two : ",two.collect())
print("One difference Two : ",one.subtract(two).collect())
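
The behaviour of the three operations with respect to shared elements can be mimicked with plain Python lists and sets. This is a minimal sketch of the semantics only; the ordering of the results in Spark may differ, since it depends on the partitioning.

```python
one = list(range(1, 10))   # 1..9
two = list(range(9, 21))   # 9..20, shares the element 9 with `one`

# union analogue: plain concatenation, so the shared element 9 appears twice
union = one + two

# intersection analogue: common elements, each reported once
intersection = sorted(set(one) & set(two))

# subtract analogue: elements of `one` that do not occur in `two`
two_as_set = set(two)
difference = [x for x in one if x not in two_as_set]

print(union)
print(intersection)
print(difference)
```

If a duplicate-free union is needed in Spark, `distinct()` can be applied to the result of `union()`.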

### Partitioning

In [None]:
# An RDD is partitioned transparently

# Using the glom() transformation we can inspect how the elements of an RDD are partitioned.
# glom() coalesces all elements within each partition into a list. 

# Using the repartition() function we can transform an RDD into an RDD
# that has exactly numPartitions partitions. We can increase or decrease the level of
# parallelism in this RDD. Internally, this uses a shuffle to redistribute data.

# We ask the collection to have 5 partitions
rdd = sc.parallelize(range(50), 5).map(str)
glomed = rdd.glom()
print("Initial RDD : ",rdd.collect())
print("Glomed RDD : ",glomed.collect())
print("# of partitions : ",rdd.getNumPartitions())

print("-----------------------------")
rdd = sc.parallelize(range(50), 4)        # We specify 4 partitions
print("# of partitions before ",rdd.getNumPartitions())
print("Data in partitions : ",rdd.collect())
glomed = rdd.glom()
print("Glomed RDD : ",glomed.collect())

print("-----------------------------")
repartitionedRDD = rdd.repartition(10)
print("# of partitions after ",repartitionedRDD.getNumPartitions())
print("Data in partitions : ",repartitionedRDD.collect())
glomed = repartitionedRDD.glom()
print("Glomed RDD : ",glomed.collect())
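
To build intuition for what `glom()` shows, the sketch below splits a collection into partitions in plain Python. It assumes that the elements are cut into contiguous, roughly equal slices; the helper `slice_into_partitions` is hypothetical and is not how Spark is implemented. After `repartition()`, which shuffles the data, the slices are in general no longer contiguous.

```python
def slice_into_partitions(elements, num_partitions):
    """Cut `elements` into `num_partitions` contiguous, roughly equal slices."""
    elements = list(elements)
    n = len(elements)
    # Boundary i is the index at which partition i starts
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [elements[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]

partitions = slice_into_partitions(range(50), 5)

# glom() analogue: one list per partition
glommed = partitions

# collect() analogue: all elements in a single flat list
collected = [x for part in partitions for x in part]

print(len(partitions))
print(glommed)
```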


### Pair RDDs

In [None]:
# Pair RDDs are RDDs that contain pairs of the form (key, value)
elems = [ ('a', [1,2,3,4]), ('b', [2,3,4]), ('c', [5,9,11,12]), ('a', [100,101])]
pairRDD = sc.parallelize(elems)

In [None]:
# Pair RDDs have some special operations defined on them

# mapValues: pass each value in the key-value pair RDD through a map function without changing the keys
# this also retains the original RDD’s partitioning.
print("Mapping values to a string: ", pairRDD.mapValues(lambda value: str(value)).collect())
print("Mapping values to take only first element:", pairRDD.mapValues(lambda value: value[0]).collect())
print("Mapping values to sum of elements:", pairRDD.mapValues(lambda value: sum(value)).collect())

In [None]:
# We can also group all values that have the same key. Each key then has a value that is a collection
pairRDD.groupByKey().collect()

In [None]:
# We can actually do the usual kind of things on these collections
pairRDD.groupByKey().mapValues(lambda value: len(value)).collect()

In [None]:
# The following converts the value from a ResultIterable to a normal list
pairRDD.groupByKey().mapValues(lambda value: [x for x in value]).collect()

In [None]:
# reduceByKey is a parallel reduce operation that reduces, per key, the set of all
# associated values into a single value
pairRDD.reduceByKey(lambda x,y: x+y).collect()   #Note: + on lists is list concatenation

In [None]:
# aggregateByKey is similar, but gives us more flexibility in how values are combined.
# In the following example, we start with an initial aggregate value of 0. Then,
# for each (key, val) pair in the RDD, val is folded into the aggregate using lambda x, y: x+sum(y).
# In this function, x is the aggregate value so far and y is the value of the pair (in our case, a list);
# the new aggregate is obtained by adding the sum of the list to the old aggregate value.
# Finally, multiple aggregate values for the same key (one per partition) are combined into a single one
# using lambda x, y: x+y, i.e., just by adding the partial aggregates.
pairRDD.aggregateByKey(0, lambda x, y: x+ sum(y), lambda x, y: x + y).collect()
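
The two functions passed to `aggregateByKey()` play different roles, which the following plain-Python emulation tries to make explicit. The helper `aggregate_by_key` and the split of the data into two partitions are hypothetical; the sketch also assumes an immutable zero value such as `0`.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Plain-Python emulation of aggregateByKey over a list of partitions."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            # seq_op folds one value into the per-partition aggregate of its key
            acc[key] = seq_op(acc.get(key, zero), value)
        partials.append(acc)

    result = {}
    for acc in partials:
        for key, agg in acc.items():
            # comb_op merges aggregates of the same key from different partitions
            result[key] = comb_op(result[key], agg) if key in result else agg
    return result

elems = [('a', [1, 2, 3, 4]), ('b', [2, 3, 4]), ('c', [5, 9, 11, 12]), ('a', [100, 101])]
partitions = [elems[:2], elems[2:]]  # hypothetical split into two partitions

totals = aggregate_by_key(partitions, 0, lambda x, y: x + sum(y), lambda x, y: x + y)
print(sorted(totals.items()))
```

Here the first function receives an integer aggregate and a list, whereas the second receives two integer aggregates. This is what distinguishes `aggregateByKey()` from `reduceByKey()`, where the result must have the same type as the values.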

In [None]:
# CAUTION!
# Be careful about using pairRDD methods on non-pair RDDs. The result is not always what you expect.
elems = [ [1,2,3,4], [2,3,4,5]]
rdd = sc.parallelize(elems)
print(rdd.collect())
print(rdd.mapValues(lambda x: x).collect())

# Note that each list is treated as a pair: the first element is the key, the second
# element is the value, and the rest is discarded
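
A plausible way to picture this behaviour is that `mapValues()` only ever looks at positions 0 and 1 of each element. The sketch below mimics that in plain Python; the helper `map_values` is hypothetical and written for this illustration only.

```python
def map_values(elements, f):
    """Treat each element as (key, value): keep position 0, apply f to position 1."""
    return [(kv[0], f(kv[1])) for kv in elements]

elems = [[1, 2, 3, 4], [2, 3, 4, 5]]

# Positions 2 and beyond of each list are silently ignored
result = map_values(elems, lambda x: x)
print(result)
```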

In [None]:
# Pair RDDs can also be sorted, either by key, or by some computed sort key
elems = [ ('a', [1,2,3,4]), ('b', [2,3,4]), ('c', [5,9,11,12]), ('a', [100,101])]
pairRDD = sc.parallelize(elems)
print("sorted by key:", pairRDD.sortByKey().collect())
print("sorted descendingly by length of the value:", pairRDD.sortBy(lambda x: len(x[1]), False).collect())

In [None]:
# Normal RDDs can be transformed into pair RDDs by the groupBy transformation.
# groupBy takes a function f that returns, for each element in the old RDD, its key in the new RDD.
# All elements with the same key are grouped together into a single value.
elems = ['abcd', 'abracadabra', 'hello', 'hi', 'bottom', 'top']
rdd = sc.parallelize(elems)
print("grouped by first letter:\n", str(rdd.groupBy(lambda x: x[0]).collect()))
print("grouped by first letter, values mapped to lists:\n", rdd.groupBy(lambda x: x[0]).mapValues(lambda x: [y for y in x]).collect())    
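
The grouping performed by `groupBy()` corresponds to the following plain-Python sketch, which uses a `defaultdict` to collect all elements that share a key. It is an analogue of the semantics only; in Spark the order of the groups may differ.

```python
from collections import defaultdict

elems = ['abcd', 'abracadabra', 'hello', 'hi', 'bottom', 'top']

groups = defaultdict(list)
for word in elems:
    # Key function: the first letter of the word
    groups[word[0]].append(word)

print(dict(groups))
```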

## Cleaning up

It is **vital** that you stop your Spark instance once you are done working with it. If you do not, the resources acquired by your Spark instance (i.e., the cores and memory reserved for it) will be kept indefinitely and *will hence not be available to others when using Spark in cluster mode!*

In [None]:
sc.stop()