### Use case
Consider a typical case for broadcast variables: a file contains two-letter state codes, and you want to transform each one into the full state name (for example, CA to California, NY to New York) by looking it up in a reference mapping. In some cases this lookup data can be large, and you may have many such lookups (zip codes, for example).

Instead of shipping this lookup data over the network with every task, which adds overhead and takes time, we can use a broadcast variable to cache it once on each machine. Tasks then read the cached copy while executing their transformations.

In [7]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySparkLearning").getOrCreate()

In [8]:
states = {"NY":"New York","CA":"California","FL":"Florida"}
data = [("James","Smith","USA","CA"),
        ("Michael","Rose","USA","NY"),
        ("Robert","Williams","USA","CA"),
        ("Maria","Jones","USA","FL")
      ]

In [11]:
broadcast_states = spark.sparkContext.broadcast(states)

In [12]:
rdd = spark.sparkContext.parallelize(data)

In [13]:
def state_convert(code):
    # Read the lookup from the broadcast copy cached on the executor
    return broadcast_states.value[code]

In [17]:
# Replace the state code (4th field) with the full state name
result = rdd.map(lambda x: (x[0], x[1], x[2], state_convert(x[3]))).collect()
print(result)

[('James', 'Smith', 'USA', 'California'), ('Michael', 'Rose', 'USA', 'New York'), ('Robert', 'Williams', 'USA', 'California'), ('Maria', 'Jones', 'USA', 'Florida')]
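
The lookup above raises a `KeyError` if a row carries a code that is missing from the mapping, and the broadcast copy stays cached on the executors after the job finishes. The following is a minimal sketch of one way to handle both, not the only approach. The names `state_name`, `convert_states`, and `bc_states`, and the `"Unknown"` default, are hypothetical and chosen for illustration; `broadcast`, `value`, and `unpersist` are the standard PySpark calls.

```python
def state_name(code, lookup, default="Unknown"):
    """Return the full state name for a code, or `default` if it is unmapped."""
    return lookup.get(code, default)


def convert_states(spark, rows, states):
    """Replace the state code (4th field) of each row using a broadcast lookup.

    `spark` is assumed to be an existing SparkSession.
    """
    sc = spark.sparkContext
    bc_states = sc.broadcast(states)  # cached once per executor
    try:
        return (
            sc.parallelize(rows)
            .map(lambda r: (r[0], r[1], r[2], state_name(r[3], bc_states.value)))
            .collect()
        )
    finally:
        # Drop the cached copies on the executors once the job is done
        bc_states.unpersist()
```

With the `spark`, `data`, and `states` objects defined in the cells above, this would be called as `convert_states(spark, data, states)`. Keeping the lookup in a plain function means it can be checked without a running Spark session.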
