Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). A subsequent lab exercise covers RDDs in more detail. RDDs support two kinds of operations: actions, which return values, and transformations, which return pointers to new RDDs.
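The difference between transformations and actions can be sketched without a Spark installation. The following is a plain-Python toy model, not Spark's implementation: the `ToyRDD` class and its internals are hypothetical and exist only for illustration. It mirrors the names of the real RDD methods `map`, `filter`, `collect`, and `count`, and its `parallelize` stands in for the SparkContext method of the same name. It runs on a single machine with no distribution or fault tolerance.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not part of Spark)."""

    def __init__(self, make_iter):
        # make_iter is a zero-argument function that produces the items.
        # Nothing is computed until an action calls it.
        self._make_iter = make_iter

    @classmethod
    def parallelize(cls, items):
        # Snapshot the input so the dataset can be iterated more than once.
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._make_iter()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._make_iter() if pred(x)))

    # Actions: run the whole pipeline and return a plain Python value.
    def collect(self):
        return list(self._make_iter())

    def count(self):
        return sum(1 for _ in self._make_iter())


numbers = ToyRDD.parallelize(range(1, 6))
squares = numbers.map(lambda x: x * x)                # transformation
even_squares = squares.filter(lambda x: x % 2 == 0)   # transformation

print(isinstance(even_squares, ToyRDD))  # True: still a dataset, not a value
print(even_squares.collect())            # action -> [4, 16]
print(squares.count())                   # action -> 5
```

Each action in this sketch recomputes the pipeline from the original data, which is the kind of repeated work that caching is meant to avoid.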
This set of labs uses Skills Network (SN) Labs to provide an interactive environment for developing applications and analyzing data. Spark’s interactive shell is available in either Scala or Python. Scala runs on the Java VM and is thus a good way to use existing Java libraries.
In this lab exercise, we will set up our environment in preparation for the later labs. After completing this set of hands-on labs, you should be able to:
1. Perform basic RDD actions and transformations
2. Use caching to speed up repeated operations