Supporting content (slides and exercises) for the Addison-Wesley (Pearson) video series covering best practices for developing scalable Spark applications for predictive analytics in the context of a data scientist's standard workflow.

Building Spark Applications Live Lessons

Join the chat at https://gitter.im/zipfian/building-spark-applications-live-lessons

This repository contains the exercises and data for the Building Spark Applications Live Lessons video series. It provides data scientists and developers with a practical introduction to the Apache Spark framework using Python, R, and SQL. Additionally, it covers best practices for developing scalable Spark applications for predictive analytics in the context of a data scientist's standard workflow.

Materials

The corresponding videos can be found on the following sites for purchase:

In addition to the videos there are many other resources to provide you support in learning this new technology:

You can also reach me directly by email at jondinu@gmail.com or on Twitter at @clearspandex.

If you find any errors in the code or materials, please open a GitHub issue in this repository.

Skill Level

Beginning/Intermediate

What You Will Learn

  • How to install and set up a Spark environment locally and on a cluster
  • The differences between and the strengths of the Python, R, and SQL programming interfaces
  • How to build a machine learning model for text
  • Common data science use cases that Spark is especially well-suited to solve
  • How to tune a Spark application for performance
  • The internals of the Spark framework and its execution model
  • How to use Spark in a data science application workflow
  • The basics of the larger Spark ecosystem

Who Should Take This Course

  • Practicing data scientists who already use Python or R and want to learn how to scale up their analyses with Spark.
  • Data engineers who already use Java/Scala for Spark but want to learn about the Python, R, and SQL APIs and understand how Spark can be used to solve data science problems.

Prerequisites

  • Basic understanding of programming (Python a plus).
  • Familiarity with the data science process and machine learning is a plus.

Setup

SparkR with a Notebook

  1. Install IRkernel:
install.packages(c('rzmq','repr','IRkernel','IRdisplay'), repos = c('http://irkernel.github.io/', getOption('repos')))

IRkernel::installspec()
  2. Set environment variables:
# Example: Set this to where Spark is installed
Sys.setenv(SPARK_HOME="/Users/[username]/spark")

# This line loads SparkR from the installed directory
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# If these two lines run without errors, you are all set
library(SparkR)
sc <- sparkR.init(master="local")
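If `library(SparkR)` fails, the usual cause is a `SPARK_HOME` that does not point at a Spark installation containing `R/lib`. The sketch below is a small preflight check in Python, not part of the course materials. It assumes `SPARK_HOME` is exported in your shell (`Sys.setenv` inside R only affects that R session), and the helper names `sparkr_lib_dir` and `check_spark_home` are made up for illustration.

```python
import os


def sparkr_lib_dir(spark_home):
    """Return the directory the .libPaths() call adds: <SPARK_HOME>/R/lib."""
    return os.path.join(spark_home, "R", "lib")


def check_spark_home(environ=os.environ):
    """Return a list of problems; an empty list means the paths look right."""
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    problems = []
    if not os.path.isdir(spark_home):
        problems.append("SPARK_HOME does not exist: " + spark_home)
    elif not os.path.isdir(sparkr_lib_dir(spark_home)):
        problems.append(
            "SparkR library directory not found: " + sparkr_lib_dir(spark_home)
        )
    return problems


if __name__ == "__main__":
    # Print each problem, or a single confirmation line if none were found.
    for line in check_spark_home() or ["SPARK_HOME and R/lib look fine"]:
        print(line)
```

This only checks that the directories exist. It does not verify that the SparkR package inside them matches your R or Spark version.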

Data

IPython Console Help

Q: How can I find out all the methods that are available on DataFrame?

  • In the IPython console type sales.[TAB]

  • Autocomplete will show you all the methods that are available.

  • To find more information about a specific method, say .cov, type help(sales.cov)

  • This will display the API documentation for that method.
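The same lookup can be done in code, which helps in scripts or notebooks where TAB completion is not available. This is a minimal sketch using only Python's standard introspection tools. `SalesStandIn` is a hypothetical placeholder for the `sales` DataFrame so that the example runs without a Spark session, and `public_methods` and `first_doc_line` are illustrative helper names, not part of any Spark API.

```python
import inspect


class SalesStandIn:
    """Hypothetical stand-in for the `sales` DataFrame (no Spark required)."""

    def cov(self, col1, col2):
        """Return the sample covariance of two numeric columns (placeholder)."""
        raise NotImplementedError("stand-in only")

    def count(self):
        """Return the number of rows (placeholder)."""
        raise NotImplementedError("stand-in only")


def public_methods(obj):
    """List the public callable attributes of obj, roughly what TAB completion shows."""
    return sorted(
        name
        for name in dir(obj)
        if not name.startswith("_") and callable(getattr(obj, name))
    )


def first_doc_line(obj, method_name):
    """Return the first line of a method's docstring, or '' if it has none."""
    doc = inspect.getdoc(getattr(obj, method_name)) or ""
    return doc.splitlines()[0] if doc else ""


sales = SalesStandIn()
for name in public_methods(sales):
    print(name, "-", first_doc_line(sales, name))
```

With a real DataFrame bound to `sales`, the two helpers should work unchanged. `help(sales.cov)` remains the quickest way to read the full documentation for a single method.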

Spark Documentation

Q: How can I find out more about Spark's Python API, MLlib, GraphX, Spark Streaming, and deploying Spark to EC2?

  • Go to https://spark.apache.org/docs/latest

  • Use the navigation tabs to reach the following areas in particular:

  • Programming Guide > Quick Start, Spark Programming Guide, Spark Streaming, DataFrames and SQL, MLlib, GraphX, SparkR.

  • Deploying > Overview, Submitting Applications, Spark Standalone, YARN, Amazon EC2.

  • More > Configuration, Monitoring, Tuning Guide.

Books

Staying Involved
