# BD Lab - Spark practice

Apache Spark is a cluster computing framework designed for fast computation. It extends the Hadoop MapReduce model so that it can be used efficiently for more types of computation, including interactive queries and stream processing. Its main feature is in-memory cluster computing, which increases the processing speed of an application.

Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries and streaming. Because it supports all of these workloads in a single system, it also reduces the management burden of maintaining separate tools.

Apache Spark has the following features.

Speed − Spark can run an application on a Hadoop cluster up to 100 times faster than MapReduce when working in memory, and up to 10 times faster when running on disk. This is possible because it reduces the number of read/write operations to disk by storing the intermediate processing data in memory.

Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.

Advanced Analytics − Spark supports more than ‘map’ and ‘reduce’. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.

## Background Spark Context

Before we get started, we need to introduce two Spark concepts: the SparkContext object, which allows us to interact with Spark, and Spark's core data structure, the RDD.
The SparkContext is the entry point to any Spark functionality. Spark applications run as independent sets of processes, coordinated by a SparkContext in a driver program.

![Title](https://annefou.github.io/pyspark/slides/images/SparkRuntime.png)

In [0]:
#check the Spark version you are running
spark.version

Out[1]: '3.3.2'

The Spark context may be created automatically, for instance when you start pyspark from the shell (the Spark context is then called sc).

We will use the Spark context by invoking methods on the sc object to interact with Spark.

QUICK NOTE: Since Spark 2.0 the Spark session is the main entry point, and it wraps the Spark context. We will talk about this in the following labs, but using the Spark context now means you will be familiar with both.

## Background on RDD

We will start by looking at the Spark concept of the Resilient Distributed Dataset (RDD). The RDD is the fundamental data structure of Spark: an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a <b> read-only, partitioned </b> collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs. RDD is a fault-tolerant collection of elements that can be operated on in parallel.

There are two ways to create RDDs − parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop Input Format.

You can apply multiple operations to these RDDs to achieve a certain task. There are two kinds of operation −

<b>Transformation</b> − An operation that is applied to an RDD to create a new RDD. filter, groupBy and map are examples of transformations.

<b>Action</b> − An operation that is applied to an RDD and instructs Spark to perform the computation and send the result back to the driver. collect and count are examples of actions.

<b>Laziness of Spark transformations </b>
It’s important to understand that transformations are evaluated lazily, meaning computation doesn’t take place until you invoke an action. Once an action is triggered on an RDD, Spark examines the RDD’s lineage and uses that information to build a “graph of operations” that needs to be executed in order to compute the action. Think of a transformation as a sort of diagram that tells Spark which operations need to happen and in which order once an action gets executed. We will work with examples that demonstrate this concept.
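The idea can be previewed without a cluster. The sketch below is a plain-Python analogy, not Spark code: Python's built-in `map` is also lazy, so it plays the role of a transformation, and `list()` plays the role of an action. The names `record_and_double` and `calls` are made up for this illustration.

```python
# Plain-Python analogy for lazy evaluation (no Spark involved).
calls = []  # records every value the function is actually applied to


def record_and_double(x):
    calls.append(x)
    return 2 * x


# "Transformation": builds a recipe, but runs nothing yet.
lazy_result = map(record_and_double, range(5))
calls_before_action = list(calls)  # snapshot taken before any action: still empty

# "Action": forces the computation to happen.
materialised = list(lazy_result)

print(calls_before_action)  # []
print(materialised)         # [0, 2, 4, 6, 8]
```

The function is only called once the result is requested, which is the same behaviour we will observe with RDD transformations and actions.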

## Spark - Lazy evaluation

Apache Spark is written in the Scala programming language. To support Python, the Apache Spark community released PySpark. Using PySpark, you can also work with RDDs in Python.

PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes the Spark context. Many data scientists and analysts use Python because of its rich set of libraries, so the integration of Python with Spark is a boon to them.

## Lazy evaluation and caching

In this example we will take a quick look at what lazy execution means for our code and how caching works. Take a look at the function below (and run the cell). It is designed to take time to evaluate: it simply waits one second before returning its input. It represents any task we want to do that takes time.

In [0]:
import time

# Create a function called slowLoad that we will use later
def slowLoad(x):
    time.sleep(1) # sleeps for 1 second
    return x

Let's define a quick dataset to work with. First we create the data in Python, then we create an RDD from it by using the Spark context object to parallelize the data. To return the data we use the collect() action, which evaluates the RDD.

In [0]:
#create Python data (a list of the numbers 0 to 19)
dataPython = range(0,20)

#load the data into an RDD by spreading it out into our big data cluster
smallRDD = sc.parallelize(dataPython,1)

#return the data from the RDD using an 'action' called collect
smallRDD.collect()

Out[3]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Now we create a new RDD that uses our function. We use map (a transformation, which we will talk more about next), as in the MapReduce framework we started with. map allows us to apply any function to all our data. Here we are simulating a data transformation that takes a long time: for every item in our dataset, the function takes approximately 1 second to evaluate. So with 20 items in a single partition, applying the function 20 times should take about 20 seconds.

In [0]:
#Use the map transformation to apply the function we made to the data
newRDD = smallRDD.map(slowLoad)

That cell probably evaluated quicker than you expected. In my case it took 0.04 seconds, even though the function we created should take 1 second for each of the 20 items in our dataset.

What do you think is currently in the new RDD we have just created that is called newRDD?

Right now nothing (or at least no data), but as soon as we try to look at the contents (or otherwise use the data) spark will need to perform the computation to create the new RDD. This is the difference between transformations (such as map) and actions (such as collect). So let’s apply the collect action to see what is in the RDD:

In [0]:
# Use the collect action to return the contents of our data
newRDD.collect()

Out[5]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

You should see that our function was only actually run now (the cell takes time), not when we created newRDD, because only now was an action used.

This is completely different from running cells in ordinary Python.

Run the above cell **TWICE** to check that it takes the same amount of time both times (in both cases our data transformation is recalculated).

So a couple of key take-aways. We use transformations to apply functions to our data; for example, we could write a function that cleans the data and use map to apply it. The transformation gives us a new RDD that represents the data with the function applied, so in our data cleaning example we would get a new RDD of clean data. **BUT** the new RDD doesn't actually contain any data, just the instructions for how to create the cleaned data. We can apply many transformations and no new data will have been created. Only when we perform an 'action' that returns a value, such as collect, are all the transformations actually applied.

Often we will want to work with transformed data in multiple queries; for example, we might want to use our cleaned data several times without redoing the cleaning calculations each time. Alternatively, we might want to use an iterative algorithm (such as training a machine learning model) and not want to redo all previous transformations in each iteration. In these cases we can 'persist' the RDD.

In [0]:
# Tell the RDD to store the results, not just the instructions for creating the results, for future use using the cache function.
newRDD.cache()

Out[6]: PythonRDD[2] at wrapper at <command-734436448619080>:2

Run the below cell **THREE TIMES** to see the difference. The first time should take about 20 seconds again, but once the data has been calculated it is held in memory so that it can be retrieved quickly.

In [0]:
# Use the collect action to get the contents of the RDD
newRDD.collect()

Out[7]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In the above case we have persisted the RDD to memory (RAM); we can also persist to disk (or a combination of the two). See https://data-flair.training/blogs/apache-spark-rdd-persistence-caching/ for more information.
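To see why caching saves work, here is a toy plain-Python model of the behaviour we just observed. It is only a sketch under simplifying assumptions: the `LazyDataset` class and `slow_identity` function are hypothetical names invented for this illustration, the class is not how Spark is implemented, and we count function calls instead of sleeping.

```python
class LazyDataset:
    """Toy stand-in for an RDD: source data plus a list of functions to apply."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = list(steps)  # the "lineage": functions applied in order
        self._use_cache = False
        self._cached = None

    def map(self, func):
        # Transformation: return a new object with one more step; compute nothing.
        return LazyDataset(self._data, self._steps + [func])

    def cache(self):
        # Ask for the result to be kept after the first time it is computed.
        self._use_cache = True
        return self

    def collect(self):
        # Action: replay the lineage, unless a cached result is available.
        if self._cached is not None:
            return self._cached
        result = self._data
        for func in self._steps:
            result = [func(x) for x in result]
        if self._use_cache:
            self._cached = result
        return result


counter = {"calls": 0}


def slow_identity(x):
    counter["calls"] += 1  # count calls instead of sleeping for a second
    return x


dataset = LazyDataset(range(20)).map(slow_identity)

dataset.collect()
dataset.collect()
calls_without_cache = counter["calls"]  # recomputed on every collect

counter["calls"] = 0
dataset.cache()
dataset.collect()
dataset.collect()
dataset.collect()
calls_with_cache = counter["calls"]  # computed once, then reused

print(calls_without_cache, calls_with_cache)  # 40 20
```

Without the cache, every action replays all the transformations; with it, only the first action pays the cost.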

In the above example we have been working with the basic data structure of Spark: the RDD. You have seen how to create an RDD by parallelizing some existing Python data, and how to get results out with an **action**, 'collect', which returns the values from an RDD. The other important kind of operation is the **transformation**, which creates a new RDD from an existing one, usually with some change made to the data (a transformation of the data). For example, we might load our data into an RDD, then clean it by applying a transformation that gives us a new RDD with clean data. Remember, RDDs are immutable: they cannot be changed. We can't clean the data inside the RDD we loaded it into, as this would change the contents of the RDD, so we need to work with transformations. Two of the most common transformations are explained below.

First we start by creating some data (a list with four strings) and parallelizing it into our cluster to create an RDD.

In [0]:
# Create the data as a list in Python
mylist = ["This is the first line", "Now the second line", "Finally the third line", "The Hittite capital of Hattusa is located in modern-day Turkey"]

# Load the data into spark to create an RDD with the data inside
myRDD = sc.parallelize(mylist)
                     

Take a look at the cell below. Here I apply the map transformation to our RDD; the output is a new RDD which I have called 'splitListRDD'. The map transformation applies a function to every item in the RDD; here I have created a lambda function which splits each string into words. To see what is inside the new RDD we then need to use the collect action to get the data out. Remember that execution is lazy: when I applied the map transformation nothing actually happened, Spark just kept a record of "how to create" the data we want. Only when we perform an action is any calculation done.

In [0]:
#apply a function, in this case a lambda function that splits the words, with map to create a new RDD
splitListRDD = myRDD.map(lambda line: line.split(' '))

In [0]:
#return the results from the new RDD
splitListRDD.collect()

Out[10]: [['This', 'is', 'the', 'first', 'line'],
 ['Now', 'the', 'second', 'line'],
 ['Finally', 'the', 'third', 'line'],
 ['The',
  'Hittite',
  'capital',
  'of',
  'Hattusa',
  'is',
  'located',
  'in',
  'modern-day',
  'Turkey']]

So we have transformed our data, which was a list of four strings, into **a list of four lists: inside each of these lists are the words of one sentence, split apart.** It is important to see that the number of items in the dataset is the same: before we had four strings (four sentences), now we have four lists (one list of words per original sentence).

This matters because it is the difference from the other common transformation we will be using, flatMap. Like map, flatMap applies a function to all the data in our dataset; however, it then 'flattens' the results. Imagine we wanted all the words from our data, maybe to count them: our previous transformation didn't quite help, as we ended up with the words in separate lists. The flatMap transformation is the answer, as it removes this structure (it flattens the data so it is all on the same level of the list). Take a look at the example below and see the difference in the result returned: now we have all our words together in a single list. Also notice how I 'chain' the transformation and the action together in one line; this is a common way to apply several operations in sequence.

In [0]:
#apply a function, in this case a lambda function that splits each line into words, with flatMap and return the results
myRDD.flatMap(lambda x: x.split(" ")).collect()

Out[11]: ['This',
 'is',
 'the',
 'first',
 'line',
 'Now',
 'the',
 'second',
 'line',
 'Finally',
 'the',
 'third',
 'line',
 'The',
 'Hittite',
 'capital',
 'of',
 'Hattusa',
 'is',
 'located',
 'in',
 'modern-day',
 'Turkey']

It's worth noting that flatMap removes just one level of nesting. Take a look at the example below with lists inside lists: flatMap has only removed the structure at one level. Look at 8, which was inside a list inside the main data list; after the flatMap, 8 sits directly inside the data list, while more deeply nested lists such as [1, 2, 3, 4, 5] are still lists.

In [0]:
#create the data
dataTest = [[0,[1, 2, 3, 4, 5],7], [8], [9, [10,[11, 12]]]]

#load the data into an RDD
dataTestRDD = sc.parallelize(dataTest)

#take a look at the contents of the RDD
dataTestRDD.collect()

Out[12]: [[0, [1, 2, 3, 4, 5], 7], [8], [9, [10, [11, 12]]]]

In [0]:
#This line of code applies flatMap and returns the values
dataTestRDD.flatMap(lambda x: x).collect()


Out[13]: [0, [1, 2, 3, 4, 5], 7, 8, 9, [10, [11, 12]]]
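If it helps to reason about map and flatMap without Spark, the following plain-Python sketch mimics both on ordinary lists. It is an analogy only: a list comprehension stands in for map, and `itertools.chain.from_iterable` stands in for the "flatten one level" step of flatMap. The variable names are chosen just for this illustration.

```python
from itertools import chain

lines = ["This is the first line", "Now the second line"]

# map-like: one output item per input item, so we get a list of lists.
mapped = [line.split(" ") for line in lines]

# flatMap-like: apply the function, then remove exactly one level of nesting.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

# Flattening only ever removes one level, as with dataTest above.
nested = [[0, [1, 2, 3, 4, 5], 7], [8], [9, [10, [11, 12]]]]
one_level_flat = list(chain.from_iterable(nested))

print(len(mapped), len(flat_mapped))  # 2 9
print(one_level_flat)                 # [0, [1, 2, 3, 4, 5], 7, 8, 9, [10, [11, 12]]]
```

Note how the map-like result keeps the same number of items as the input, while the flatMap-like result can have more (or fewer) items than the input.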

Another popular transformation is filter, which reduces the data in the RDD to the items we are interested in. For example, in the code below I use filter with a lambda function to remove all numbers less than 0.

In [0]:
# create the data and load into an RDD
numberRDD = sc.parallelize([1,2,3,4,-1,-2,5,-5,6,0])

# filter the data using the filter transformation, this creates a new RDD with the filtered data
filteredRDD = numberRDD.filter(lambda x: x>=0)

# return the contents of the filtered data
filteredRDD.collect()

Out[14]: [1, 2, 3, 4, 5, 6, 0]

Let's look at a common mistake that people make when using a map transformation for the first time. I'm going to use a dataset with the numbers 0 to 19, and I'm going to create a function that goes through each number and doubles it, so we can have a new RDD with all the numbers doubled.

In [0]:
#create the data RDD
dataRDD = sc.parallelize(range(20))

#Define my new function (this function will not work with map) that doubles every value:
def broken_function(x):
  doubledDataList = []
  #iterate through all data
  for i in x:
    doubledDataList.append(i * 2)
  return doubledDataList

First let's test our function on some Python data.

In [0]:
# apply our function to a list of the number 0 to 19
broken_function(range(20))

Out[16]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38]

It works perfectly. Now let's use the map function to apply it to our RDD.

In [0]:
# create an RDD with the data values 0 to 19
dataRDD = sc.parallelize(range(20))

# use the map transformation to apply our function
doubledDataRDD = dataRDD.map(lambda x: broken_function(x))

The cell ran without any errors, so all looks good, right? Unfortunately not. Remember, Spark hasn't done any calculation yet. Let's see what happens when we try to get the data out.

In [0]:
doubledDataRDD.collect()

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
File <command-734436448619104>:1
----> 1 doubledDataRDD.collect()

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     46 start = time.perf_counter()
     47 try:
---> 48     res = func(*args, **kwargs)
     49     logger.log_success(
     50         module_name, class_name, function_name, time.perf_counter() - start, signature

OK, now we have an error. This is something we need to be careful of in Spark: just because the code for our transformations runs does not mean it is correct. We only find out when we try to get the results back.

*Any idea what is wrong with the function?*

The problem is that it is trying to iterate through all the data items. But `map` applies the function to each individual record in the data. So when I apply `map` it takes the first record, which is the value 0, and applies the function to it. The function then tries to iterate through the value 0 which does not work. Run the cell below which tries to apply the function to the value of 0, this is what the map transformation is doing when we use it with the RDD.

In [0]:
broken_function(0)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File <command-734436448619106>:1
----> 1 broken_function(0)

File <command-734436448619098>:8, in broken_function(x)
      6 doubledDataList = []
      7 #iterate through all data
----> 8 for i in x:
      9   doubledDataList.append(i * 2)
     10 return doubledDataList

TypeError: 'int' object is not iterable

Let's take a look at how to do this properly. We write a function that doubles an individual data item, not all the items together, and then apply it with `map`, which applies the function to each item individually.

In [0]:
# we are going to work with our RDD which contains the values of 0 to 19

# here I define the function I will use
def double_a_number(x):
  return 2 * x

# now I apply the function with map
correct_doubled_RDD = dataRDD.map(lambda x: double_a_number(x))

# check the values
correct_doubled_RDD.collect()


Out[20]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38]

It worked! Hopefully, you can see the difference between the function that didn't work (because it was written to take all of the data as an input) and the one that did (because it was written to take just one item of data as an input)
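One way to catch this kind of mistake before Spark reports it at action time is to call the function on a single record in plain Python first, since that is what map will do. The sketch below is only a suggested habit, not part of the lab's required steps; `sample_record` is a made-up name, and `broken_function` is rewritten here in a shorter but equivalent form.

```python
def double_a_number(x):
    # Written for ONE record, which is what map passes in.
    return 2 * x


def broken_function(x):
    # Written for a whole collection, so it fails on a single number.
    return [i * 2 for i in x]


sample_record = 3  # stands in for one item of the RDD

ok_result = double_a_number(sample_record)

try:
    broken_function(sample_record)
    broken_error = None
except TypeError as err:
    broken_error = str(err)

print(ok_result)     # 6
print(broken_error)  # 'int' object is not iterable
```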

#### Let's practice your understanding with an exercise

First create a python variable with a list of the numbers 0 to 49, you can name the variable whatever you like.

In [0]:
#create your variable here
myList = range(0,50)

Now load the data from your variable into an RDD (again name the new RDD whatever you like) using the *spark context* 'sc' and the parallelize function as we have done several times above.

In [0]:
#create your RDD here
myRDD = sc.parallelize(myList)


Write a function that will return the square of any number. Call this function whatever you like.

In [0]:
# write your function here
def square(x):
  return x * x

Use `map` to apply your function to the data in the RDD creating a new RDD.

In [0]:
# create your new RDD with map here
newRDD = myRDD.map(square)


Use an action to return your results and make sure they are correct, the results should look something like this.  
Out[]: [0,
 1,
 4,
 9,
 16,
 25,
 36,
 49,
 64,
 81,
 100,
 121,
 144,
 169,
 196,
 225,
 256,
 289,
 324,
 361,
 400,
 441,
 484,
 529,
 576,
 625,
 676,
 729,
 784,
 841,
 900,
 961,
 1024,
 1089,
 1156,
 1225,
 1296,
 1369,
 1444,
 1521,
 1600,
 1681,
 1764,
 1849,
 1936,
 2025,
 2116,
 2209,
 2304,
 2401]

In [0]:
# use an action here to return the results from your RDD
newRDD.collect()

Out[25]: [0,
 1,
 4,
 9,
 16,
 25,
 36,
 49,
 64,
 81,
 100,
 121,
 144,
 169,
 196,
 225,
 256,
 289,
 324,
 361,
 400,
 441,
 484,
 529,
 576,
 625,
 676,
 729,
 784,
 841,
 900,
 961,
 1024,
 1089,
 1156,
 1225,
 1296,
 1369,
 1444,
 1521,
 1600,
 1681,
 1764,
 1849,
 1936,
 2025,
 2116,
 2209,
 2304,
 2401]

Use the filter transformation to filter out results in your RDD (your RDD where all values have been squared) that are less than 1500 and use an action to return the results. The results should look something like this:  
Out[]: [1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401]

In [0]:
# create a new RDD with filtered results and return the values
newRDD.filter(lambda x: x > 1500).collect()

Out[26]: [1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401]

## Parallelization example

**NOTE:** Unfortunately, this section will not produce different results for different degrees of parallelization in the community edition of Databricks. As we are only allocated a single-node cluster, our data cannot be processed in parallel. However, I leave this exercise in the lab as it is important to see how this should work.

Let’s look at another example, a simple method to estimate the value of Pi. This method randomly selects points in a square, then checks whether those points are inside a circle within the square. The proportion of points inside the circle should be equal to the ratio of the circle's area to the square's area, from which we can calculate Pi. In the code below the points are drawn from the unit square and tested against a quarter circle of radius 1, so this ratio is Pi/4.

Watch this short video which demonstrates what we are trying to do, estimate the value of Pi through Monte-Carlo simulation: https://www.youtube.com/watch?v=ELetCV_wX_c
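Before running this on Spark, the same estimate can be tried in plain Python on a smaller sample. This is a minimal sketch, assuming points drawn uniformly from the unit square and tested against a quarter circle of radius 1 (area Pi/4), so Pi is roughly 4 times the fraction of points that land inside. The function name `estimate_pi` and the fixed seed are choices made for this illustration only.

```python
import math
import random


def estimate_pi(num_samples, seed=0):
    """Estimate Pi by throwing random darts at the unit square."""
    rng = random.Random(seed)  # seeded so that the run is repeatable
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x ** 2 + y ** 2 <= 1:  # the dart landed inside the quarter circle
            hits += 1
    return 4.0 * hits / num_samples


estimate = estimate_pi(100_000)
print("Pi is roughly:", estimate)
print("Absolute error:", abs(estimate - math.pi))
```

The Spark version below does the same counting, but expresses "keep the darts that landed inside" as a filter transformation and "count them" as an action.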

The `inside` function below returns true if a dart randomly thrown at the square in the diagram above lands inside the red circle and false otherwise. We use the proportion that land inside to estimate Pi.

In [0]:
import random

numSamples = 20000000

def inside(p):
    # The input value is not used. We pick two random numbers as the coordinates
    # of a dart thrown at a square, then check whether it lands inside the circle.
    x, y = random.random(), random.random()
    return x ** 2 + y ** 2 <= 1


Next we create the RDDs. Here we specify the level of parallelization: 1 partition for the first and 5 for the second (we specify this with the second argument to parallelize).

*Note: one important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10))*
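To make the idea of partitions concrete, here is a rough plain-Python sketch of cutting a dataset into a chosen number of contiguous slices. It is only an illustration: `split_into_partitions` is a made-up helper, and Spark's own slicing logic may differ in detail. On a real RDD, `rdd.glom().collect()` should return the elements grouped by partition if you want to inspect the actual layout.

```python
def split_into_partitions(data, num_partitions):
    """Cut a sequence into num_partitions contiguous, near-equal slices."""
    items = list(data)
    n = len(items)
    return [
        items[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


partitions = split_into_partitions(range(10), 3)
print(partitions)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Each slice can then be handed to a separate task, which is why more partitions allow more work to run at the same time on a multi-node cluster.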

In [0]:
# First we create two RDD's with a bunch of records
partition1 = sc.parallelize(range(0, numSamples), 1)
partition2 = sc.parallelize(range(0, numSamples), 5)

# Here we filter by darts that land inside a circle
insideDart1 = partition1.filter(inside)
insideDart2 = partition2.filter(inside)

Again this cell was quick to evaluate, as the calculation is only done when we decide to observe the results (or call another action). Let’s try the calculation with the data on one partition:

In [0]:
pi1 =(4.0 * insideDart1.count() / numSamples)
print("Pi is roughly: " + str(pi1))

Pi is roughly: 3.1412664


Let's try again with the RDD we split into 5 partitions. You should see the cell below evaluates much faster.

In [0]:
pi2 = 4.0 * insideDart2.count() / numSamples
print("Pi is roughly: " + str(pi2))

Pi is roughly: 3.1412468


NOTE: If these two commands took the same amount of time, it is because we are running on a trial cluster (as we are using the free community version of databricks) and it only has one node (so it cannot execute in parallel).

 We can always check the number of partitions with the following function.

In [0]:
insideDart2.getNumPartitions()

Out[31]: 5

In the Python code above we use two new RDD methods: `filter()` and `count()`. Can you tell which is a transformation and which is an action? (When did the job start executing?)

Spark works with functional programming where lambda (or anonymous) functions are very useful. If you are unfamiliar with these functions please review:

https://www.geeksforgeeks.org/python-lambda-anonymous-functions-filter-map-reduce/

It may be worth a review of the above page anyway to see examples of using lambda functions with `map` or `filter` functions. Check the example below, I use the lambda function and the map transformation to square every number in the partition1 RDD.

In [0]:
squaredRDD = partition1.map(lambda x: x*x)
squaredRDD.take(10)

Out[32]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

## Exercises

Lambda functions are a very common way to work with RDDs, especially at the ETL stage of data analysis. First, though, write the code to create a new RDD called evenNumbers that contains only the even values from the existing RDD partition1, without using a lambda function. You can do this in the same way as above: define a new function that determines whether a number is even, then apply your function to the data using filter. Look at how the example above uses filter to go from an RDD of all the samples to an RDD with just the samples that were inside the circle. For more information you can also check this explanation of the filter function in Spark:

https://backtobazics.com/big-data/spark/apache-spark-filter-example/

You can check the first 10 values with the take() function, i.e. evenNumbers.take(10).

In [0]:
#define a function that returns true if the input is an even number (otherwise false)
def even(x):
  return (x % 2) == 0

In [0]:
#Apply the function using filter to the partition1 RDD to create a new RDD
myResult = partition1.filter(even)

In [0]:
# Check your results by using the take action to take the top 10 values from the result and check they are all even
myResult.take(10)

Out[35]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

Use a lambda function to achieve the same result (Hint: the filter function can be helpful here again).

In [0]:
partition1.filter(lambda x: (x % 2) == 0).take(10)

Out[36]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

## Paired RDD

Spark paired RDDs are simply RDDs containing key-value pairs. A key-value pair (KVP) consists of two linked data items. In PySpark we use tuples to store the linked key and value, like this: (key, value). Here, the key is the identifier and the value is the data corresponding to the key. Spark operations work on RDDs containing any type of object; however, key-value pair RDDs have a few extra operations, such as:

groupByKey - groups the values of the RDD by key.

reduceByKey - performs aggregation on the grouped values corresponding to a key.

mapValues - transformation applies a function to each of the values of the pair RDD without changing the key.

We will see examples below with a dataset of sales, where the key is the country of sale and the value is the amount.

In [0]:
# Here we create the data and load into a RDD
data = [("Portugal",30.94),("Spain",61.43),("France",50.37),("Germany",67.51),("Germany",57.12),("Portugal",20.12),("Germany",76.92),("Portugal",32.53),("Spain",30.39),("Germany",21.11),("Spain",56.98),("France",64.99),("France",39.2),("Germany",27.99),("Portugal",60.59),("Portugal",60.58),("Germany",87.11),("Germany",77.25),("Germany",40.74),("Portugal",66.19),("France",44.47)]
paired_RDD = sc.parallelize(data)

The reduceByKey function works only for RDDs whose elements are key-value pairs (i.e. RDDs with tuples as data elements). It is a transformation, which means it is lazily evaluated. We pass it one associative (and commutative) function as a parameter, which is applied to the values of each key in the source RDD to create a new RDD of resulting key-value pairs.

The function accepts two arguments and returns a single element. Spark first merges the values locally on each partition using this function, and then sends the partial results across the partitions to prepare the final results.

Below we use a lambda function that takes two inputs (x and y) and sums them together. This returns the sum of sales by country in our case.

In [0]:
sumRDD = paired_RDD.reduceByKey(lambda x,y: x + y)
sumRDD.collect()

Out[38]: [('Spain', 148.79999999999998),
 ('Germany', 455.75000000000006),
 ('Portugal', 270.95),
 ('France', 199.03)]
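The two-stage merging described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Spark's implementation: `reduce_by_key_local` is a made-up helper, and the two "partitions" are just hand-picked lists holding a few of the sales records.

```python
def reduce_by_key_local(pairs, func):
    """Combine the values of each key with func, within one collection of pairs."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


def add(x, y):
    return x + y


# Two hypothetical partitions holding a few of the sales records.
partition_a = [("Spain", 61.43), ("France", 50.37), ("Spain", 30.39)]
partition_b = [("Spain", 56.98), ("France", 64.99)]

# Stage 1: combine inside each partition, so less data has to be moved.
partials = [reduce_by_key_local(p, add) for p in (partition_a, partition_b)]

# Stage 2: bring the partial results together and merge them with the same function.
merged_pairs = [item for partial in partials for item in partial.items()]
totals = reduce_by_key_local(merged_pairs, add)

for country, total in totals.items():
    print(country, round(total, 2))
```

Because the partial results are merged in whatever grouping the partitions happen to produce, the function has to give the same answer regardless of grouping and order, which is why it should be associative and commutative.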

As the name suggests, the groupByKey function in Apache Spark just groups all values that share a key. Unlike reduceByKey, it doesn’t perform any aggregation on the values. It just groups the data and returns each group in the form of an iterable. Take a look at the example below:

In [0]:
groupRDD = paired_RDD.groupByKey()

#What is the key for the first group
print(groupRDD.collect()[0][0])

#What are the values in the first group
for i in groupRDD.collect()[0][1]:
  print(i)

Spain
61.43
30.39
56.98


When we use map() with a Pair RDD, we get access to both the key & value. Often we are only interested in accessing the value (and not key). In those cases, we can use mapValues() instead of map() to apply a function. Here I use the 'sorted' function and apply it to the grouped data.

In [0]:
sortedRDD = groupRDD.mapValues(lambda x: sorted(x))

#What is the key for the first group
print(sortedRDD.collect()[0][0])

#What are the values in the first group
for i in sortedRDD.collect()[0][1]:
  print(i)


Spain
30.39
56.98
61.43


#### Exercise

Use mapValues on the paired_RDD to round values to the nearest euro in a new RDD. Python has a 'round' function which can be useful for you here.

In [0]:
# Your code here
paired_RDD.mapValues(lambda x:  round(x)).take(3)

Out[41]: [('Portugal', 31), ('Spain', 61), ('France', 50)]

Use reduce to find the highest sale.

In [0]:
# Extra explanation below
paired_RDD.reduce(lambda x,y: x if (x[1] > y[1]) else y)

Out[42]: ('Germany', 87.11)

Use reduceByKey to find the highest sale by country.

In [0]:
# Your code here
paired_RDD.reduceByKey(lambda x,y: x if (x > y) else y).collect()

Out[43]: [('Spain', 61.43), ('Germany', 87.11), ('Portugal', 66.19), ('France', 64.99)]

Do you notice the difference when fetching the results?  
If you look at the cheatsheet provided you can see one is a transformation and the other is an action.

#### Extra section about the reduce method  

Reduce, also referred to as a [fold](https://en.wikipedia.org/wiki/Fold_(higher-order_function)) operation, is an important concept in Spark that can cause confusion when you first come to functional programming environments like Spark.
Below is an example of a possible implementation of the reduce method in pure Python.  

``` python
def reduce(function, iterable):
    # Transform the list into an iterator
    it = iter(iterable)
    # Take the first value of the list
    value = next(it)
    
    # Iterate over the rest of the list, applying the function to the previous result and each element
    for element in it:
        value = function(value, element)
    return value
```

And below is the previous exercise, but in pure Python.

In [0]:
from functools import reduce

data = [
  ("Portugal",30.94),("Spain",61.43),("France",50.37),("Germany",67.51),
  ("Germany",57.12),("Portugal",20.12),("Germany",76.92),("Portugal",32.53),
  ("Spain",30.39),("Germany",21.11),("Spain",56.98),("France",64.99),
  ("France",39.2),("Germany",27.99),("Portugal",60.59),("Portugal",60.58),
  ("Germany",87.11),("Germany",77.25),("Germany",40.74),("Portugal",66.19),
  ("France",44.47)
]

def tuple_greater_than(x, y):
  if x[1] > y[1]:
    print(f"{x} or {y} -> {x}")
    return x
  else:
    print(f"{x} or {y} -> {y}")
    return y

highest = reduce(tuple_greater_than, data)
print(f"The highest is {highest}")

('Portugal', 30.94) or ('Spain', 61.43) -> ('Spain', 61.43)
('Spain', 61.43) or ('France', 50.37) -> ('Spain', 61.43)
('Spain', 61.43) or ('Germany', 67.51) -> ('Germany', 67.51)
('Germany', 67.51) or ('Germany', 57.12) -> ('Germany', 67.51)
('Germany', 67.51) or ('Portugal', 20.12) -> ('Germany', 67.51)
('Germany', 67.51) or ('Germany', 76.92) -> ('Germany', 76.92)
('Germany', 76.92) or ('Portugal', 32.53) -> ('Germany', 76.92)
('Germany', 76.92) or ('Spain', 30.39) -> ('Germany', 76.92)
('Germany', 76.92) or ('Germany', 21.11) -> ('Germany', 76.92)
('Germany', 76.92) or ('Spain', 56.98) -> ('Germany', 76.92)
('Germany', 76.92) or ('France', 64.99) -> ('Germany', 76.92)
('Germany', 76.92) or ('France', 39.2) -> ('Germany', 76.92)
('Germany', 76.92) or ('Germany', 27.99) -> ('Germany', 76.92)
('Germany', 76.92) or ('Portugal', 60.59) -> ('Germany', 76.92)
('Germany', 76.92) or ('Portugal', 60.58) -> ('Germany', 76.92)
('Germany', 76.92) or ('Germany', 87.11) -> ('Germany', 87.11)
('Ger

Examples inspired from https://realpython.com/python-reduce-function/