In [2]:
#
# The findspark Python module locates a local Spark installation
# and makes pyspark importable. This is convenient
# for practicing Spark syntax in local mode on your own computer.
# However, the workspaces already have Spark installed, so you do not
# need to use the findspark module here.
#

import findspark
# findspark.init('spark-2.3.2-bin-hadoop2.7')


In [4]:
import pyspark

# Create a SparkContext.
# The SparkContext object has a parallelize method that takes a Python
# collection (such as a list) and distributes its elements across the
# machines in your cluster.
configure = pyspark.SparkConf().setAppName("maps_and_lazy_evaluation_example").setMaster("local")
sc = pyspark.SparkContext(conf=configure)

log_of_songs = [
        "Despacito",
        "Nice for what",
        "No tears left to cry",
        "Despacito",
        "Havana",
        "In my feelings",
        "Nice for what",
        "despacito",
        "All the stars"
]

# parallelize the log_of_songs to use with Spark
distributed_song_log = sc.parallelize(log_of_songs)

Exception: Java gateway process exited before sending its port number

In [None]:
# a function that converts a song title to lowercase
def convert_song_to_lowercase(song):
    return song.lower()

convert_song_to_lowercase("Havana")

In [None]:
# applying this function using a map step
# The map step will go through each song in the list and
# apply the convert_song_to_lowercase() function.

# General form: rdd.map(f), where f is called once for each element x
distributed_song_log.map(convert_song_to_lowercase)

The code cell ran quite quickly. This is because of lazy evaluation: map is a transformation, so Spark only records the step and does not execute it until an action needs the result.
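
The same idea can be observed without Spark. The sketch below is only a plain-Python analogy, not Spark code: Python's built-in map() also returns immediately and defers the work until something consumes the result. The `calls` list and the shortened `songs` list are illustrative additions that are not part of the notebook.

```python
# Plain-Python analogy (no Spark required): the built-in map() is also lazy.
calls = []  # illustrative helper: records every title the function processes

def convert_song_to_lowercase(song):
    calls.append(song)
    return song.lower()

songs = ["Despacito", "Havana", "despacito"]

# Building the map object returns immediately; the function has not run yet.
lazy_result = map(convert_song_to_lowercase, songs)
calls_before = len(calls)

# Consuming the iterator forces the work, much as an action does in Spark.
lowercased = list(lazy_result)
calls_after = len(calls)

print(calls_before, calls_after)
print(lowercased)
```

Here list() plays a role loosely similar to a Spark action: nothing is computed until the result is actually requested.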

"RDD" in the output refers to resilient distributed dataset. RDDs are exactly what they say they are: fault-tolerant datasets distributed across a cluster. This is how Spark stores data.

To get Spark to actually run the map step, you need to use an "action". One available action is the collect method. The collect() method takes the results from all of the nodes in the cluster and "collects" them into a single list on the driver (master) node.
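
To make the "collecting" step more concrete, here is a toy model in plain Python. It is not how Spark is implemented: ordinary lists stand in for the pieces of the dataset held by different workers, and the names `partitions`, `run_map_on_partition`, and `toy_collect` are hypothetical, chosen only for this illustration.

```python
# Toy model only: plain lists stand in for the pieces of a distributed dataset.
partitions = [
    ["Despacito", "Nice for what"],   # hypothetical piece held by worker 1
    ["Havana", "In my feelings"],     # hypothetical piece held by worker 2
]

def run_map_on_partition(partition, func):
    """Apply func to every element of a single partition."""
    return [func(item) for item in partition]

def toy_collect(partitions, func):
    """Run the map on each partition, then gather all results into one list."""
    collected = []
    for partition in partitions:
        collected.extend(run_map_on_partition(partition, func))
    return collected

result = toy_collect(partitions, str.lower)
print(result)
```

The point of the sketch is the final gathering step: each piece is processed separately, and only at the end are the results combined into one ordinary Python list.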

In [None]:
distributed_song_log.map(convert_song_to_lowercase).collect()

In [None]:
# Spark does not change the original dataset:
# map returns a new RDD and leaves the original untouched. You can see this by running collect() on the original dataset.
distributed_song_log.collect()

You do not always have to write a custom function for the map step. You can also use anonymous (lambda) functions or built-in Python methods such as str.lower.

Anonymous functions are a Python feature that supports a functional programming style.


In [None]:
distributed_song_log.map(lambda song: song.lower()).collect()
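
As a quick check that the three styles mentioned above are interchangeable, the sketch below applies each of them with Python's built-in map() on a small local list, so it runs without Spark. The shortened `songs` list and the result variable names are illustrative; with an RDD, the same three callables should be usable as the argument to map.

```python
songs = ["Despacito", "Nice for what", "Havana"]

# Style 1: a named, custom function
def convert_song_to_lowercase(song):
    return song.lower()

with_named_function = list(map(convert_song_to_lowercase, songs))

# Style 2: an anonymous (lambda) function
with_lambda = list(map(lambda song: song.lower(), songs))

# Style 3: the built-in string method, passed directly
with_builtin_method = list(map(str.lower, songs))

print(with_named_function == with_lambda == with_builtin_method)
```

Passing str.lower directly avoids wrapping it in a lambda, while a named function is easier to reuse and test when the logic grows beyond one expression.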