Building Spark Applications Live Lessons
This repository contains the exercises and data for the Building Spark Applications Live Lessons video series. It provides data scientists and developers with a practical introduction to the Apache Spark framework using Python, R, and SQL. Additionally, it covers best practices for developing scalable Spark applications for predictive analytics in the context of a data scientist's standard workflow.
The corresponding videos can be found on the following sites for purchase:
In addition to the videos, there are many other resources to support you in learning this new technology:
Please do not hesitate to reach out to me directly via email at firstname.lastname@example.org or on Twitter at @clearspandex.
If you find any errors in the code or materials, please open a GitHub issue in this repository.
What You Will Learn
- How to install and set up a Spark environment locally and on a cluster
- The differences between and the strengths of the Python, R, and SQL programming interfaces
- How to build a machine learning model for text
- Common data science use cases that Spark is especially well-suited to solve
- How to tune a Spark application for performance
- The internals of the Spark framework and its execution model
- How to use Spark in a data science application workflow
- The basics of the larger Spark ecosystem
Who Should Take This Course
- Practicing data scientists who already use Python or R and want to learn how to scale up their analyses with Spark.
- Data engineers who already use Java or Scala for Spark but want to learn the Python, R, and SQL APIs and understand how Spark can be used to solve data science problems.
- A basic understanding of programming (Python a plus).
- Familiarity with the data science process and machine learning is a plus.
SparkR with a Notebook
- Install IRkernel
```r
install.packages(c('rzmq', 'repr', 'IRkernel', 'IRdisplay'),
                 repos = c('http://irkernel.github.io/', getOption('repos')))
IRkernel::installspec()
```
```r
# Example: set this to where Spark is installed
Sys.setenv(SPARK_HOME="/Users/[username]/spark")

# This line loads SparkR from the installed directory
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# If these two lines work, you are all set
library(SparkR)
sc <- sparkR.init(master="local")
```
IPython Console Help
Q: How can I find out all the methods that are available on DataFrame?
In the IPython console, type the name of a DataFrame followed by a dot and press Tab. Autocomplete will show you all the methods that are available.
To find more information about a specific method, type its name followed by IPython's `?` help operator. This will display the API documentation for that method.
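The same kind of discovery works in plain Python with the built-in `dir()` and `help()` functions. A minimal sketch, using the built-in `str` type as a stand-in for a Spark DataFrame since no running SparkContext is assumed here:

```python
# List an object's public methods, as Tab-completion would show them.
# str is used here as a stand-in for a DataFrame (no SparkContext needed).
methods = [name for name in dir(str) if not name.startswith("_")]
print(methods[:5])

# help() prints the same API documentation that IPython's `?` displays.
help(str.upper)
```

The `dir()` call is what IPython's autocomplete draws on under the hood, so the two approaches always agree on which methods exist.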
Q: How can I find out more about Spark's Python API, MLlib, GraphX, Spark Streaming, or deploying Spark to EC2?
On the Spark documentation site, use the navigation tabs to reach the following areas in particular.
- Programming Guide > Quick Start, Spark Programming Guide, Spark Streaming, DataFrames and SQL, MLlib, GraphX, SparkR
- Deploying > Overview, Submitting Applications, Spark Standalone, YARN, Amazon EC2
- More > Configuration, Monitoring, Tuning Guide
- Learning Spark: Lightning-Fast Big Data Analysis (O'Reilly)
- Advanced Analytics with Spark (O'Reilly)
- Spark Knowledge Base
- Spark Reference Applications
- Mastering Apache Spark