## PySpark Broadcast Variables
https://sparkbyexamples.com/pyspark/pyspark-broadcast-variables/ <br>

In PySpark, broadcast variables are read-only shared variables that are cached on every node in the cluster so that tasks can read them locally. They can be used from both RDD and DataFrame code. Instead of sending this data along with every task, PySpark distributes broadcast variables to the workers using efficient broadcast algorithms, which reduces communication costs.


In [35]:
"""
When to use a broadcast variable: suppose a file contains two-letter
US state codes and you want to convert each one to the full state name
(for example, CA to California, NY to New York) by looking it up in a
reference mapping. In some cases this mapping is large, and you may
have many such lookups (ZIP codes, for example).
"""


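# Assumes `spark` is an existing SparkSession (created earlier in the notebook)
states = {"NY":"New York", "CA":"California", "FL":"Florida"}
# Ship the lookup dictionary to the executors once instead of with every task
broadcastStates = spark.sparkContext.broadcast(states)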

print(broadcastStates.value)

data = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
  ]

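rdd = spark.sparkContext.parallelize(data)

def state_convert(code):
    # Look up the full state name in the broadcast dictionary
    return broadcastStates.value[code]

# Column names must be defined before they are passed to toDF()
columns = ["firstname","lastname","country","state"]

result = rdd.map(lambda x: (x[0],x[1],x[2],state_convert(x[3]))).toDF(columns)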

result.show(truncate=False)


{'NY': 'New York', 'CA': 'California', 'FL': 'Florida'}
+---------+--------+-------+----------+
|firstname|lastname|country|state     |
+---------+--------+-------+----------+
|James    |Smith   |USA    |California|
|Michael  |Rose    |USA    |New York  |
|Robert   |Williams|USA    |California|
|Maria    |Jones   |USA    |Florida   |
+---------+--------+-------+----------+
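
The lookup above raises a `KeyError` if a row carries a state code that is not in the dictionary, and the broadcast stays cached on the executors until it is released. The following is a minimal sketch of one way to handle both, not part of the original example. The `default` parameter and the `convert_rows` helper are hypothetical names introduced here, and `spark` is assumed to be an existing SparkSession.

```python
def state_convert(code, lookup, default="Unknown"):
    """Return the full state name for a two-letter code, or `default` if absent."""
    return lookup.get(code, default)


def convert_rows(spark, rows, states):
    """Map (first, last, country, code) tuples to full state names.

    Hypothetical helper: `spark` is assumed to be an existing SparkSession.
    """
    sc = spark.sparkContext
    bc_states = sc.broadcast(states)  # sent to each executor once
    try:
        return (
            sc.parallelize(rows)
            .map(lambda r: (r[0], r[1], r[2], state_convert(r[3], bc_states.value)))
            .collect()
        )
    finally:
        # Release the cached copies on the executors once the job is done
        bc_states.unpersist()
```

Passing the dictionary into `state_convert` as an argument, rather than reading a global, keeps the lookup logic testable without a cluster.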



## Difference between Accumulators and Broadcast Variables

An accumulator is also a shared variable, but the data flows in the opposite direction: tasks on the worker nodes add to it and the driver reads the result. <br>
The key difference between a broadcast variable and an accumulator is that while <br>
the broadcast variable is read-only, the accumulator can only be added to by tasks. <br>

Accumulators can be used to implement counters (as in MapReduce) or sums. <br>
Spark natively supports accumulators of numeric types, and programmers can add support for new types.
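
The contrast can be sketched in a single job that reads from a broadcast variable and writes to an accumulator. This is an illustrative sketch only: `count_unknown_codes` is a hypothetical helper, and `spark` is assumed to be an existing SparkSession.

```python
def count_unknown_codes(spark, codes, states):
    """Count how many codes are missing from the `states` lookup.

    Hypothetical helper: `spark` is assumed to be an existing SparkSession.
    """
    sc = spark.sparkContext
    bc_states = sc.broadcast(states)  # read-only on the workers
    unknown = sc.accumulator(0)       # workers may only add to it

    def check(code):
        if code not in bc_states.value:
            unknown.add(1)

    # foreach is an action, so the updates are applied when it runs
    sc.parallelize(codes).foreach(check)
    return unknown.value  # the value is read back on the driver
```

Updating the accumulator inside an action such as `foreach` is the safer choice, since updates made inside transformations may be applied more than once if a task is re-executed.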

## PySpark Accumulator with Example


In [39]:

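sc = spark.sparkContext  # assumes `spark` is an existing SparkSession

num = sc.accumulator(10)  # accumulator with an initial value of 10
print("accumulator=", num)

def f(x):
    global num
    num += x  # each task adds its element to the accumulator

rdd = sc.parallelize([20, 30, 40, 50])
rdd.foreach(f)

final = num.value  # the value is read on the driver
print("Accumulated value is -> %i" % final)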


accumulator= 10
Accumulated value is -> 150
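
The accumulator above holds a single number. Support for other types is added by subclassing `AccumulatorParam` and implementing `zero` and `addInPlace`. Below is a minimal sketch that sums equal-length lists element by element. `ListSumParam` and `sum_vectors` are hypothetical names, `spark` is assumed to be an existing SparkSession, and the `ImportError` fallback exists only so the class logic can be exercised where PySpark is not installed.

```python
try:
    from pyspark.accumulators import AccumulatorParam
except ImportError:
    # Fallback for reading or testing the class without PySpark installed
    AccumulatorParam = object


class ListSumParam(AccumulatorParam):
    """Accumulates equal-length lists of numbers by element-wise addition."""

    def zero(self, value):
        # The identity element: a list of zeros with the same length
        return [0] * len(value)

    def addInPlace(self, v1, v2):
        # Merge v2 into v1 and return the updated v1
        for i, x in enumerate(v2):
            v1[i] += x
        return v1


def sum_vectors(spark, vectors, size):
    """Hypothetical helper: element-wise sum of `vectors` using an accumulator."""
    sc = spark.sparkContext
    total = sc.accumulator([0] * size, ListSumParam())
    sc.parallelize(vectors).foreach(lambda v: total.add(v))
    return total.value
```

Calling `sum_vectors(spark, [[1, 2], [3, 4]], 2)` would run the job on the cluster. The two methods of `ListSumParam` can also be called directly to check the merge logic.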
