#### PySpark Broadcast Variables

In PySpark, broadcast variables are read-only shared variables that are cached on every node in the cluster so that tasks can read them locally. They work with both RDDs and DataFrames. Instead of sending this data along with every task, PySpark distributes broadcast variables to the workers using efficient broadcast algorithms to reduce communication costs.

#### Use case

Consider an example of when to use broadcast variables. Suppose a file contains two-letter state codes and you want to transform them to full state names (for example, CA to California, NY to New York) by looking them up in a reference mapping. In some cases this mapping can be large, and you may have many such lookups (zip codes, for example).

Instead of shipping this information over the network with each task, which adds overhead and takes time, we can use a broadcast variable to cache the lookup data on each machine; tasks then read the cached copy while executing the transformations.

#### How does PySpark Broadcast work?

Broadcast variables are used in the same way for RDDs and DataFrames.
When you run a PySpark RDD or DataFrame application that defines and uses broadcast variables, PySpark does the following.

1. PySpark breaks the job into stages at distributed shuffle boundaries; actions are executed within a stage.
2. Each stage is then broken into tasks.
3. Spark broadcasts the common (reusable) data needed by the tasks within each stage.
4. The broadcast data is cached in serialized form and deserialized before each task runs.

You should create and use broadcast variables for data that is shared across multiple stages and tasks.

Note that broadcast variables are not sent to the executors by the sc.broadcast(variable) call itself; they are sent to the executors when they are first used.

In [1]:
from pyspark import SparkContext
sc = SparkContext()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/27 15:34:35 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [12]:
states = {"NY":"New York", "CA":"California", "FL":"Florida"}
broadcast_variable = sc.broadcast(states)

In [29]:
broadcast_variable.value['CA']

'California'

In [15]:
broadcast_array = sc.broadcast([1,2,3,4])

In [16]:
broadcast_array.value

[1, 2, 3, 4]

In [17]:
rdd = sc.parallelize([('a',1),('b',2),('a',4),('b',3)])

In [22]:
def fun(val):
    if 'a' == val[0]:
        return broadcast_array.value
    else:
        return val[1]

In [23]:
rdd.map(lambda data: (data[0], fun(data))).collect()

[('a', [1, 2, 3, 4]), ('b', 2), ('a', [1, 2, 3, 4]), ('b', 3)]

In [24]:
data = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
  ]

In [25]:
rdd2 = sc.parallelize(data)
rdd2.take(3)

[('James', 'Smith', 'USA', 'CA'),
 ('Michael', 'Rose', 'USA', 'NY'),
 ('Robert', 'Williams', 'USA', 'CA')]

In [30]:
def fun2(val):
    return broadcast_variable.value[val]

In [31]:
rdd2.map(lambda data: ("{} - {}".format(data[0], data[1]), data[2], fun2(data[3]))).collect()

[('James - Smith', 'USA', 'California'),
 ('Michael - Rose', 'USA', 'New York'),
 ('Robert - Williams', 'USA', 'California'),
 ('Maria - Jones', 'USA', 'Florida')]
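
`fun2` above indexes the broadcast dictionary directly, so a state code that is missing from the mapping would raise `KeyError` inside the task. A minimal sketch of a more forgiving lookup follows. The names `lookup_state` and `add_state_name` and the `"Unknown"` default are hypothetical and not part of the original notebook, and the Spark part assumes a live `SparkContext` such as `sc` above.

```python
def lookup_state(mapping, code, default="Unknown"):
    """Return the full state name for a code, or a default if the code is unmapped."""
    return mapping.get(code, default)


def add_state_name(sc, rows, states):
    """Replace the state code (last field) of each row with its full name.

    `sc` is assumed to be a live SparkContext; `rows` is a list of tuples
    shaped like `data` above, and `states` is the code-to-name dictionary.
    """
    bc_states = sc.broadcast(states)  # sent to the executors when first used
    return (
        sc.parallelize(rows)
        .map(lambda r: r[:-1] + (lookup_state(bc_states.value, r[-1]),))
        .collect()
    )


# The lookup itself is plain Python and can be checked without a cluster.
states = {"NY": "New York", "CA": "California", "FL": "Florida"}
print(lookup_state(states, "CA"))  # California
print(lookup_state(states, "TX"))  # Unknown
```

Whether a silent default or a hard failure is preferable depends on the data: a default keeps the job running, while a `KeyError` surfaces bad input early.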

#### pyspark.Broadcast.destroy
Destroy all data and metadata related to this broadcast variable. Use this with caution; once a broadcast variable has been destroyed, it cannot be used again.

In [32]:
broadcast_variable.destroy()

In [33]:
rdd2.map(lambda data: ("{} - {}".format(data[0], data[1]), data[2], fun2(data[3]))).collect()

22/12/27 15:56:35 ERROR Utils: Exception encountered
org.apache.spark.SparkException: Attempted to use Broadcast(1) after it was destroyed (destroy at NativeMethodAccessorImpl.java:0) 
	at org.apache.spark.broadcast.Broadcast.assertValid(Broadcast.scala:144)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$writeObject$1(TorrentBroadcast.scala:223)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1470)
	at org.apache.spark.broadcast.TorrentBroadcast.writeObject(TorrentBroadcast.scala:222)
	at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1154)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
	at java.io.ObjectOutputStream.writeOrdinaryObjec

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Task not serializable
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:444)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:416)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:163)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2491)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: org.apache.spark.SparkException: Attempted to use Broadcast(1) after it was destroyed (destroy at NativeMethodAccessorImpl.java:0) 
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1477)
	at org.apache.spark.broadcast.TorrentBroadcast.writeObject(TorrentBroadcast.scala:222)
	at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1154)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at java.util.ArrayList.writeObject(ArrayList.java:768)
	at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1154)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441)
	... 24 more
Caused by: org.apache.spark.SparkException: Attempted to use Broadcast(1) after it was destroyed (destroy at NativeMethodAccessorImpl.java:0) 
	at org.apache.spark.broadcast.Broadcast.assertValid(Broadcast.scala:144)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$writeObject$1(TorrentBroadcast.scala:223)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1470)
	... 59 more


In [34]:
## will get error after destroying variable
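
One way to avoid the error above is to tie the broadcast variable's lifetime to the block of code that uses it. The sketch below is an assumption about how this could be organized, not something the original notebook does; `broadcast_scope` is a hypothetical helper name. If the data may be needed again later, `unpersist()` is the gentler alternative: it removes the cached copies on the executors, and the value is re-sent on next use.

```python
from contextlib import contextmanager


@contextmanager
def broadcast_scope(sc, value):
    """Broadcast `value`, yield the handle, and destroy it when the block exits.

    `sc` is assumed to be a live SparkContext (or any object with a
    compatible broadcast() method).
    """
    bc = sc.broadcast(value)
    try:
        yield bc
    finally:
        # After destroy() the handle is unusable, so no later job may reference it.
        bc.destroy()


# Intended usage with the objects defined earlier in this notebook:
#
#     with broadcast_scope(sc, states) as bc_states:
#         result = rdd2.map(lambda r: bc_states.value[r[3]]).collect()
```

Because RDD transformations are lazy, the action (`collect()` here) has to run inside the `with` block; an RDD built inside the block but evaluated after it would hit the same "after it was destroyed" error.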

https://sparkbyexamples.com/spark/broadcast-join-in-spark/

In [65]:
studentData = [['si1','Robin','M'],
                ['si2','Maria','F'],
                ['si3','Julie','F'],
                ['si4','Bob',  'M'],
                ['si6','William','M']]

subjectsData = [['si1','Python'],
                 ['si3','Java'],
                 ['si1','Java'],
                 ['si2','Python'],
                 ['si3','Ruby'],
                 ['si4','C++'],
                 ['si4','Python'],
                 ['si2','Java']]

studentrdd = sc.parallelize(studentData, 2)
subjectrdd = sc.parallelize(subjectsData, 2)

In [66]:
data = sc.broadcast(studentrdd.map(lambda data:(data[0], [data[1], data[2]])).collectAsMap())

In [67]:
data.value

{'si1': ['Robin', 'M'],
 'si2': ['Maria', 'F'],
 'si3': ['Julie', 'F'],
 'si4': ['Bob', 'M'],
 'si6': ['William', 'M']}

In [68]:
def fun3(val):
    return data.value[val]

fun3('si1')

['Robin', 'M']

In [69]:
subjectrdd.map(lambda data: (data[0], data[1], fun3(data[0]))).collect()

[('si1', 'Python', ['Robin', 'M']),
 ('si3', 'Java', ['Julie', 'F']),
 ('si1', 'Java', ['Robin', 'M']),
 ('si2', 'Python', ['Maria', 'F']),
 ('si3', 'Ruby', ['Julie', 'F']),
 ('si4', 'C++', ['Bob', 'M']),
 ('si4', 'Python', ['Bob', 'M']),
 ('si2', 'Java', ['Maria', 'F'])]
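
The join above works because every student id in `subjectsData` exists in the broadcast dictionary; an unknown id would raise `KeyError` in `fun3`. A sketch of an inner-join variant that drops unmatched rows with `flatMap` is shown below. The helper names `join_subject` and `broadcast_inner_join`, and the sample row `['si5', 'Go']`, are hypothetical; the Spark part assumes a live `SparkContext`.

```python
def join_subject(row, students):
    """Join one [student_id, subject] row against the student lookup.

    Returns a list with one joined tuple, or an empty list when the id is
    unknown, so that flatMap drops unmatched rows (inner-join behaviour).
    """
    student_id, subject = row
    info = students.get(student_id)
    return [(student_id, subject, info)] if info is not None else []


def broadcast_inner_join(sc, subject_rows, student_lookup):
    """Map-side inner join; `sc` is assumed to be a live SparkContext."""
    bc_students = sc.broadcast(student_lookup)
    return (
        sc.parallelize(subject_rows)
        .flatMap(lambda row: join_subject(row, bc_students.value))
        .collect()
    )


# The per-row logic is plain Python and can be checked without a cluster.
students = {"si1": ["Robin", "M"], "si2": ["Maria", "F"]}
print(join_subject(["si1", "Python"], students))  # [('si1', 'Python', ['Robin', 'M'])]
print(join_subject(["si5", "Go"], students))      # []
```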

#### PySpark Accumulator with Example

The PySpark Accumulator is a shared variable used with RDDs and DataFrames to perform sum and counter operations, similar to MapReduce counters. These variables are shared by all executors, which update them through associative and commutative operations.

In this article, I explain what a PySpark Accumulator is, how to create one, and how to use it on RDD and DataFrame with an example.

#### What is PySpark Accumulator?

Accumulators are variables that are initialized once and are write-only for tasks: tasks running on the workers can only add to them, and their updates are propagated automatically to the driver program. Only the driver program can read the accumulated result, using the value property.

#### How to create Accumulator variable in PySpark?

An Accumulator is created with the accumulator() method of the SparkContext class. Users can also create Accumulators for custom types using PySpark's AccumulatorParam class.

Some points to note:

1. sparkContext.accumulator() defines an accumulator variable.
2. The add() function adds a value to the accumulator.
3. The value property retrieves the current value of the accumulator.

We can create Accumulators in PySpark for the primitive types int and float.
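
For custom types, an `AccumulatorParam` subclass supplies a `zero` value and an `addInPlace` merge function. The following is a minimal sketch for a list-valued accumulator; the names `vector_zero`, `vector_add`, `make_vector_accumulator`, and `VectorAccumulatorParam` are hypothetical and not from the original notebook. The PySpark import is deferred into the factory function so that the two helpers can be exercised without a Spark installation.

```python
def vector_zero(value):
    """Zero element with the same length as the initial value."""
    return [0.0] * len(value)


def vector_add(v1, v2):
    """Element-wise sum of two equal-length lists, returned as a new list."""
    return [a + b for a, b in zip(v1, v2)]


def make_vector_accumulator(sc, size):
    """Create a list-valued accumulator; `sc` is assumed to be a live SparkContext."""
    # Imported here so the helpers above stay usable without PySpark installed.
    from pyspark.accumulators import AccumulatorParam

    class VectorAccumulatorParam(AccumulatorParam):
        def zero(self, value):
            return vector_zero(value)

        def addInPlace(self, v1, v2):
            return vector_add(v1, v2)

    return sc.accumulator([0.0] * size, VectorAccumulatorParam())


# The merge logic is plain Python and can be checked on its own.
print(vector_add([1.0, 2.0], [3.0, 4.0]))  # [4.0, 6.0]
```

With a running context, one would call `vec = make_vector_accumulator(sc, 2)`, add lists from tasks with `vec.add(...)` inside an action such as `foreach`, and read `vec.value` on the driver.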

In [70]:
accum = sc.accumulator(0)

In [71]:
accum.value

0

In [72]:
accum.add(2)

In [73]:
accum.value

2

In [74]:
rdd = sc.parallelize(range(0,10))
rdd.foreach(lambda data: accum.add(data))

In [75]:
accum.value

47
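
The result is 47 rather than 45 because the accumulator already held 2 from the earlier `accum.add(2)` call when the `foreach` ran. The arithmetic can be checked in plain Python:

```python
initial = 2                          # value left by the earlier accum.add(2)
added_by_tasks = sum(range(0, 10))   # 0 + 1 + ... + 9 = 45
print(initial + added_by_tasks)      # 47
```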

In [76]:
## Spark Context 

In [80]:
sc.applicationId

'application_1672135434203_0003'

In [81]:
sc.appName

'pyspark-shell'

In [82]:
sc.defaultParallelism

2

In [83]:
sc.defaultMinPartitions

2

In [86]:
rdd = sc.emptyRDD()
rdd.collect()

[]

In [87]:
sc.environment

{'PYTHONHASHSEED': '0'}

In [91]:
sc.getCheckpointDir()

In [93]:
sc.getConf()

<pyspark.conf.SparkConf at 0x7fd0519a6ad0>

In [94]:
sc.master

'yarn'

In [97]:
sc.sparkHome

In [98]:
sc.stop  # no parentheses: this only returns the bound method; sc.stop() would actually stop the context

<bound method SparkContext.stop of <SparkContext master=yarn appName=pyspark-shell>>

In [99]:
sc.setLogLevel('INFO')

In [100]:
sc.version

'3.3.1'

In [101]:
from pyspark import SparkConf
sf = SparkConf()

In [104]:
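sf.contains('master')  # the key is stored as 'spark.master', so this is False

False

In [105]:
sf.get('master')  # returns None for the same reason; use sf.get('spark.master')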

In [106]:
sf.getAll()

[('spark.eventLog.enabled', 'true'),
 ('spark.history.fs.logDirectory', 'hdfs://localhost:9000/spark-logs'),
 ('spark.history.ui.port', '18080'),
 ('spark.ui.proxyBase', '/proxy/application_1672135434203_0003'),
 ('spark.executor.extraJavaOptions',
  '-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"'),
 ('spark.app.submitTime', '1672135472013'),
 ('spark.app.name', 'pyspark-shell'),
 ('spark.driver.memory', '512m'),
 ('spark.sql.warehouse.dir', 'hdfs://localhost:9000/user/hive/warehouse'),
 ('spark.master', 'yarn'),
 ('spark.history.provider',
  'org.apache.spark.deploy.history.FsHistoryProvider'),
 ('spark.submit.pyFiles', ''),
 ('spark.yarn.isPython', 'true'),
 ('spark.submit.deployMode', 'client'),
 ('spark.history.fs.update.interval', '10s'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.eventLog.dir', 'hdfs://localhost:9000/spark-logs')]

In [108]:
sf.set('spark.history.fs.update.interval','11s')

<pyspark.conf.SparkConf at 0x7fd0510c1060>

In [109]:
sf.get('spark.history.fs.update.interval')

'11s'

In [110]:
sf.setAppName('okay')

<pyspark.conf.SparkConf at 0x7fd0510c1060>

In [111]:
sf.getAll()

[('spark.eventLog.enabled', 'true'),
 ('spark.history.fs.logDirectory', 'hdfs://localhost:9000/spark-logs'),
 ('spark.history.ui.port', '18080'),
 ('spark.ui.proxyBase', '/proxy/application_1672135434203_0003'),
 ('spark.executor.extraJavaOptions',
  '-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"'),
 ('spark.app.submitTime', '1672135472013'),
 ('spark.driver.memory', '512m'),
 ('spark.sql.warehouse.dir', 'hdfs://localhost:9000/user/hive/warehouse'),
 ('spark.master', 'yarn'),
 ('spark.history.provider',
  'org.apache.spark.deploy.history.FsHistoryProvider'),
 ('spark.history.fs.update.interval', '11s'),
 ('spark.app.name', 'okay'),
 ('spark.submit.pyFiles', ''),
 ('spark.yarn.isPython', 'true'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.eventLog.dir', 'hdfs://localhost:9000/spark-logs')]

In [112]:
sf.setAll([('spark.eventLog.enabled', 'true'),
 ('spark.history.fs.logDirectory', 'hdfs://localhost:9000/spark-logs'),
 ('spark.history.ui.port', '18080'),
 ('spark.ui.proxyBase', '/proxy/application_1672135434203_0003'),
 ('spark.executor.extraJavaOptions',
  '-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"'),
 ('spark.app.submitTime', '1672135472013'),
 ('spark.driver.memory', '512m'),
 ('spark.sql.warehouse.dir', 'hdfs://localhost:9000/user/hive/warehouse'),
 ('spark.master', 'yarn'),
 ('spark.history.provider',
  'org.apache.spark.deploy.history.FsHistoryProvider'),
 ('spark.history.fs.update.interval', '11s'),
 ('spark.app.name', 'okay'),
 ('spark.submit.pyFiles', ''),
 ('spark.yarn.isPython', 'true'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.eventLog.dir', 'hdfs://localhost:9000/spark-logs')])

<pyspark.conf.SparkConf at 0x7fd0510c1060>