
Building Spark Applications Live Lessons


This repository contains the exercises and data for the Building Spark Applications Live Lessons video series. It provides data scientists and developers with a practical introduction to the Apache Spark framework using Python, R, and SQL. Additionally, it covers best practices for developing scalable Spark applications for predictive analytics in the context of a data scientist's standard workflow.


The corresponding videos can be found on the following sites for purchase:

In addition to the videos, there are many other resources to support you in learning this new technology:

Please do not hesitate to reach out to me directly via email or over Twitter @clearspandex

If you find any errors in the code or materials, please open a GitHub issue in this repository.

Skill Level


What You Will Learn

  • How to install and set up a Spark environment locally and on a cluster
  • The differences between and the strengths of the Python, R, and SQL programming interfaces
  • How to build a machine learning model for text
  • Common data science use cases that Spark is especially well-suited to solve
  • How to tune a Spark application for performance
  • The internals of the Spark framework and its execution model
  • How to use Spark in a data science application workflow
  • The basics of the larger Spark ecosystem
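One of the topics above, building a machine learning model for text, starts with turning documents into numeric features. As a rough, pure-Python sketch of the idea (Spark's MLlib provides scalable equivalents such as HashingTF; the function below is illustrative only and is not part of the course materials):

```python
from collections import Counter

def term_frequencies(doc: str) -> dict:
    """Lower-case, tokenize on whitespace, and return relative term frequencies.

    A toy stand-in for the term-frequency feature vectors that Spark
    MLlib computes at scale across a whole corpus.
    """
    tokens = doc.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}

tf = term_frequencies("Spark makes big data simple and Spark scales")
print(tf["spark"])  # 0.25 (2 of 8 tokens)
```

In Spark, the same computation would be expressed as a transformation over an RDD or DataFrame of documents, so it parallelizes across a cluster instead of running on one string at a time.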

Who Should Take This Course

  • Practicing data scientists who already use Python or R and want to learn how to scale up their analyses with Spark.
  • Data engineers who already use Java or Scala for Spark but want to learn the Python, R, and SQL APIs and understand how Spark can be used to solve data science problems.


Prerequisites
  • Basic understanding of programming (Python is a plus).
  • Familiarity with the data science process and machine learning is a plus.


SparkR with a Notebook

  1. Install IRkernel:
install.packages(c('rzmq','repr','IRkernel','IRdisplay'), repos = c('', getOption('repos')))

  2. Set environment variables and initialize SparkR:
# Example: set this to where Spark is installed (the path below is a placeholder)
Sys.setenv(SPARK_HOME = "/path/to/spark")

# This line loads SparkR from the installed directory
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# if these two lines work, you are all set
library(SparkR)
sc <- sparkR.init(master = "local")


IPython Console Help

Q: How can I find out all the methods that are available on DataFrame?

  • In the IPython console type sales.[TAB]

  • Autocomplete will show you all the methods that are available.

  • To find more information about a specific method, say .cov, type help(sales.cov)

  • This will display the API documentation for that method.
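Tab completion and help() are not IPython magic: autocomplete reads the attribute names that the built-in dir() returns, and help() prints an object's docstring, so the same introspection works programmatically. A minimal illustration with a stand-in class (the real sales object would be a Spark DataFrame; SalesFrame here is hypothetical):

```python
class SalesFrame:
    """Stand-in for a Spark DataFrame named `sales` in this sketch."""

    def cov(self, col1, col2):
        """Compute the sample covariance of two numeric columns."""

    def count(self):
        """Return the number of rows."""

sales = SalesFrame()

# dir() is what TAB completion reads; filter out the private names.
methods = [name for name in dir(sales) if not name.startswith("_")]
print(methods)  # ['count', 'cov']

# help(sales.cov) would print the docstring defined above.
```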

Spark Documentation

Q: How can I find out more about Spark's Python API, MLlib, GraphX, Spark Streaming, deploying Spark to EC2?

  • Go to the official Spark documentation at https://spark.apache.org/docs/latest/

  • Navigate using tabs to the following areas in particular.

  • Programming Guide > Quick Start, Spark Programming Guide, Spark Streaming, DataFrames and SQL, MLlib, GraphX, SparkR.

  • Deploying > Overview, Submitting Applications, Spark Standalone, YARN, Amazon EC2.

  • More > Configuration, Monitoring, Tuning Guide.


Staying Involved
