<a href="https://colab.research.google.com/github/jalorenzo/SparkNotebookColab/blob/master/BDF_07_Advanced_concepts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 00 - Configuration of Apache Spark on Colaboratory


### Installing Java, Spark, and Findspark


---


This code installs Apache Spark 3.3.1, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that makes it easy for Python to find Spark.

In [None]:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget  http://apache.osuosl.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz  
!tar xf spark-3.3.1-bin-hadoop3.tgz  
!rm spark-3.3.1-bin-hadoop3.tgz    
!pip install -q findspark

### Set Environment Variables
Set the locations where Spark and Java are installed.

In [1]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark/"
os.environ["DRIVE_DATA"] = "/content/gdrive/My Drive/Enseignement/2022-2023/ING3/HPDA/BigDataFrameworks/data/"

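# Replace any previous symlink with one pointing to the extracted Spark directory
!rm /content/spark
!ln -s /content/spark-3.3.1-bin-hadoop3 /content/spark
# Each `!` command runs in its own subshell, so these exports do not persist;
# the os.environ assignments above are what later cells actually see
!export SPARK_HOME=/content/spark
!export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
!echo $SPARK_HOME
!env |grep  "DRIVE_DATA"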

rm: cannot remove '/content/spark': No such file or directory
ln: failed to create symbolic link '/content/spark': No such file or directory
/content/spark/
DRIVE_DATA=/content/gdrive/My Drive/Enseignement/2022-2023/ING3/HPDA/BigDataFrameworks/data/


### Start a SparkSession
This will start a local Spark session.

In [1]:
!python -V

#import findspark
#findspark.init()

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# Example: shows the PySpark version
print("PySpark version {0}".format(sc.version))

# Example: parallelise an array and show the first 2 elements
sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)

Python 3.9.2


/usr/local/lib/python3.9/dist-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/26 13:32:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
PySpark version 3.3.1


                                                                                

[2, 3]

In [2]:
from pyspark.sql import SparkSession
# We create a SparkSession object (or we retrieve it if it is already created)
spark = SparkSession \
.builder \
.appName("My application") \
.config("spark.some.config.option", "some-value") \
.master("local[4]") \
.getOrCreate()
# We get the SparkContext
sc = spark.sparkContext

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')


---


# 07 - Advanced concepts

This notebook covers some additional concepts of Apache Spark:

- How a Spark application is executed
- Use of broadcast variables and accumulators

## Execution of a Spark application

How Spark code is executed:

  - Logical and physical plans
  - Jobs, stages and tasks

### Logical and physical plans

From the user code, Spark generates a *logical plan*:

  -  A DAG with the operations to perform
  -  It does not include information on the physical system on which it is going to be executed
  -  The *Catalyst* optimiser generates an optimised logical plan
  
From the optimised logical plan, a physical plan is created:

  - It specifies how the logical plan will be executed in the cluster
  - Different execution strategies will be generated and compared using a cost model
      - For example, how to perform a join depending on the characteristics of the data (size, partitions, etc.)

The physical plan is executed in the cluster

  - The execution is performed on RDDs


### Jobs, stages and tasks
-   As seen, a Spark program defines a DAG connecting the different RDDs
    -   *Transformations* create children RDDs from the parent RDDs

-   *Actions* translate this DAG into an execution plan by generating a **Spark job**
    -   The driver sends a *job* to compute all the RDDs involved in the action
    -   A job comprises one or more *stages*
    -   Each stage is associated with one or more RDDs from the DAG
    -  Stages represent groups of *tasks* which run in parallel
        - The stages are processed in order, launching individual tasks to compute segments of the RDDs
        - Each task runs one or more transformations on a partition
        - Tasks are executed in the cluster nodes
    - A stage ends when a *shuffle* operation is performed
        - it implies data movement among the cluster nodes


-   Pipelining: several RDDs can be computed in the same stage
    -   This is possible when the RDDs can be obtained from their parents without data movement (e.g. *select*, *filter* or *map*), or when one of the RDDs has been cached in memory or on disk
    -   Within the stage, the output of each operation is passed to the input of the next one without being written to disk

- Shuffle persistence
  -  Before a shuffle operation, data are written to local disk
  -  This allows failed tasks to be relaunched without recomputing all the previous transformations
  -  It is not performed if the data to shuffle have already been cached (using `cache` or `persist`)


-   The *Spark web interface* shows information about the stages and tasks

- The DataFrame's `explain` method shows the physical plan; the RDD's `toDebugString` method shows the RDD lineage, with indentation marking shuffle boundaries


In [3]:
from pyspark.sql import functions as F

# Example to visualize the physical plan
df1 = spark.range(2, 10000000, 2)
df2 = spark.range(2, 10000000, 4)
step1 = df1.repartition(5)
step12 = df2.repartition(6)
step2 = step1.selectExpr("id * 5 as id")
step3 = step2.join(step12, ["id"])
# F.sum avoids shadowing Python's built-in sum
step4 = step3.select(F.sum(F.col("id")))

# Run the query first so that explain() shows the final adaptive plan
step4.collect()
step4.explain()

                                                                                

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(7) HashAggregate(keys=[], functions=[sum(id#8L)])
   +- ShuffleQueryStage 4
      +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=264]
         +- *(6) HashAggregate(keys=[], functions=[partial_sum(id#8L)])
            +- *(6) Project [id#8L]
               +- *(6) SortMergeJoin [id#8L], [id#2L], Inner
                  :- *(4) Sort [id#8L ASC NULLS FIRST], false, 0
                  :  +- AQEShuffleRead coalesced
                  :     +- ShuffleQueryStage 2
                  :        +- Exchange hashpartitioning(id#8L, 200), ENSURE_REQUIREMENTS, [plan_id=135]
                  :           +- *(3) Project [(id#0L * 5) AS id#8L]
                  :              +- ShuffleQueryStage 0
                  :                 +- Exchange RoundRobinPartitioning(5), REPARTITION_BY_NUM, [plan_id=68]
                  :                    +- *(1) Range (2, 10000000, step=2, splits=8)
                  +- *

### Broadcast variables

-   By default, all variables used inside operations (not RDDs) are sent to all executors

    -   They are sent again with each operation in which they appear

-   Broadcast variables send read-only variables to the workers efficiently

    -   They are sent only once to each executor


In [4]:
from operator import add

# dicc is a broadcast variable
dicc=sc.broadcast({"a":"alpha","b":"beta","c":"gamma"})

rdd=sc.parallelize([("a", 1),("b", 3),("a", -4),("c", 0)])

# python 2
#reduced_rdd = rdd.reduceByKey(add).map(lambda (x,y): (dicc.value[x],y))

# python 3
reduced_rdd = rdd.reduceByKey(add).map(lambda x: (dicc.value[x[0]],x[1]))

print(reduced_rdd.collect())



[('alpha', -3), ('beta', 3), ('gamma', 0)]


                                                                                

### Accumulators

Accumulators aggregate values from the *worker nodes* and send the result to the *driver*

-   Useful to count events

-   Only the driver can access its value

-   Accumulators updated inside RDD transformations can give incorrect values

    -   If the RDD is recomputed, the accumulator is updated again

    -   This problem does not happen with actions

-   By default, accumulators are integers or floats
-  "Custom accumulators" can be created using [`AccumulatorParam`](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html#pyspark.AccumulatorParam)

In [7]:
from pyspark.sql import Row
from pyspark.sql.types import *
from random import randint

# Create a DataFrame from a list of Row objects
# with random integers
l = [Row(randint(1,10)) for n in range(10000)]
df = spark.createDataFrame(l)
df.show()
# Define an accumulator
neven = sc.accumulator(0)

# if the number in a row is even, we increment the accumulator
def isEven(row):
    global neven
    if row["_1"]%2 == 0:
        neven += 1

# No action has run yet, so the accumulator still holds its initial value
print(neven)
# Execute the function once per row
df.foreach(isEven)

print("Number of even values: {0}".format(neven.value))

+---+
| _1|
+---+
|  5|
|  9|
|  5|
|  9|
| 10|
|  9|
|  8|
|  9|
|  4|
| 10|
|  2|
|  4|
|  2|
|  5|
|  9|
|  8|
|  6|
|  3|
|  1|
|  9|
+---+
only showing top 20 rows

0



Number of even values: 5045


                                                                                