In [18]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.getOrCreate()

The second kind of low-level API in Spark is two types of "distributed shared variables": broadcast variables and accumulators. These are variables you can use in your user-defined functions (e.g., in a map function on an RDD or a DataFrame) that have special properties when running on a cluster. Specifically, accumulators let you add together data from all the tasks into a shared result (e.g., to implement a counter so you can see how many of your job's input records fail to parse), while broadcast variables let you cache a large read-only value on every worker node and reuse it across many Spark actions without re-sending it.


# Broadcast Variables
Broadcast variables are a way to share an immutable value efficiently around the cluster without encapsulating that variable in a function closure. The normal way to use a driver-side variable inside your tasks is simply to reference it in your function closures (e.g., in a map operation), but this can be inefficient, especially for large variables such as a lookup table or a machine learning model. The reason is that when you use a variable in a closure, it must be deserialized on the worker nodes many times (once per task). Moreover, if you use the same variable in multiple Spark actions and jobs, it will be re-sent to the workers with every job instead of once.
This is where broadcast variables come in. They are shared, immutable variables that are cached on every machine in the cluster instead of being serialized with every task. The canonical use case is to pass around a lookup table that fits in memory on the executors and use it in a function.

In [19]:
my_collection = "Spark the Definitive Guide : Big Data Processing Mde Simple".split(' ')

words = spark.sparkContext.parallelize(my_collection, 2)

Suppose you would like to supplement your words with other information that you have, which in practice could be very large. This is technically a right join if we think of it in terms of SQL.

In [20]:
supplementData = {'Spark': 1000, 'Definitive': 200, 'Big': 300, "Simple": 100}


Let us broadcast this across Spark. The value is immutable and is lazily replicated across all nodes in the cluster when we trigger an action.

In [21]:
suppBroadcast = spark.sparkContext.broadcast(supplementData)

We reference this variable via the value attribute, which returns the exact value that we had earlier. It is accessible within serialized functions without having to serialize the data itself.

In [22]:
suppBroadcast.value

{'Spark': 1000, 'Definitive': 200, 'Big': 300, 'Simple': 100}

Now we can transform our RDD using this value. In this instance, we will create a key-value pair according to the value we might have in the map. If we lack the value, we simply replace it with 0.

In [23]:
words.map(lambda word: (word, suppBroadcast.value.get(word, 0)))\
    .sortBy(lambda wordPair: wordPair[1])\
    .collect()

[('the', 0),
 ('Guide', 0),
 (':', 0),
 ('Data', 0),
 ('Processing', 0),
 ('Mde', 0),
 ('Simple', 100),
 ('Definitive', 200),
 ('Big', 300),
 ('Spark', 1000)]

# Accumulators
Accumulators are a way of updating a value inside a variety of transformations and propagating that value to the driver node in an efficient and fault-tolerant way. They provide a mutable variable that a Spark cluster can safely update on a per-row basis. You can use these for debugging purposes (say, to track the values of a certain variable per partition in order to intelligently use it over time) or to create low-level aggregations.
Accumulators are variables that are added to only through an associative and commutative operation and can therefore be efficiently supported in parallel. You can use them to implement counters and sums. For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will be applied only once, meaning that restarted tasks will not update the value. In transformations, you should be aware that each task's update can be applied more than once if tasks or job stages are re-executed.
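The requirement for an associative and commutative operation can be illustrated without Spark. The plain-Python sketch below is not how Spark merges accumulator updates internally. It only shows, on hypothetical per-task partial counts, that addition yields the same total whatever order the partial results arrive in, whereas subtraction does not.

```python
from functools import reduce
from itertools import permutations
from operator import add, sub

# Hypothetical partial counts reported by four tasks.
partials = [3, 0, 7, 2]

# Addition: every arrival order produces the same total.
add_totals = {reduce(add, order) for order in permutations(partials)}

# Subtraction: the result depends on which partial arrives first.
sub_totals = {reduce(sub, order) for order in permutations(partials)}

print(add_totals)        # {12}
print(len(sub_totals))   # 4
```

Because tasks finish in no particular order, only an operation like addition gives the driver a well-defined final value.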

In [26]:
flights = spark.read\
                .parquet('/home/kevin/Desktop/Big-Data-with-Pyspark/data/flight-data/parquet/2010-summary.parquet')
flights.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



Let us create an accumulator that will count the number of flights to or from China.

In [27]:
accChina = spark.sparkContext.accumulator(0)

In [28]:
def accChinaFunc(flight_row):
    """Add the row's flight count to accChina if either endpoint is China."""
    destination = flight_row['DEST_COUNTRY_NAME']
    origin = flight_row['ORIGIN_COUNTRY_NAME']
    if destination == 'China':
        accChina.add(flight_row['count'])
    if origin == 'China':
        accChina.add(flight_row['count'])

In [29]:
# foreach is an action, so each task's update is applied only once
flights.foreach(accChinaFunc)

In [30]:
accChina.value

953

In [31]:
spark.stop()