Mastering Big Data Analytics with PySpark [Video]

This is the code repository for Mastering Big Data Analytics with PySpark [Video], published by Packt. It contains all the supporting project files necessary to work through the video course from start to finish. Authored by: Danny Meijer

About the Video Course

PySpark helps you perform data analysis at scale; it enables you to build more scalable analyses and pipelines. This course starts by introducing you to PySpark's potential for performing effective analyses of large datasets. You'll learn how to interact with Spark from Python and connect Jupyter to Spark to provide rich data visualizations. After that, you'll delve into the various Spark components and the Spark architecture.

You'll learn to work with Apache Spark and perform ML tasks more smoothly than before. You'll gather and query data using Spark SQL to overcome the challenges involved in reading it. You'll use the DataFrame API to operate with Spark MLlib and learn about the Pipeline API. Finally, we provide tips and tricks for deploying your code and performance tuning.

By the end of this course, you will not only be able to perform efficient data analytics but will also have learned to use PySpark to analyze large datasets at scale in your organization.

What You Will Learn

Gain a solid knowledge of vital Data Analytics concepts via practical use cases
Create elegant data visualizations using Jupyter
Run, process, and analyze large datasets using PySpark
Utilize Spark SQL to easily load big data into DataFrames
Create fast and scalable Machine Learning applications using MLlib with Spark
Perform exploratory Data Analysis in a scalable way
Achieve scalable, high-throughput and fault-tolerant processing of data streams using Spark Streaming

Instructions and Navigation

Assumed Knowledge

This course will greatly appeal to data science enthusiasts, data scientists, or anyone who is familiar with Machine Learning concepts and wants to scale out their work to big data.

If you find it difficult to analyze large datasets that keep growing, then this course is the perfect guide for you!

A working knowledge of Python is assumed.

Technical Requirements

Minimum Hardware Requirements

To complete this course successfully, you will need a computer system with at least the following:

OS: Windows, Mac, or Linux
Processor: Any processor from the last few years
Memory: 2GB RAM
Storage: 300MB for the Integrated Development Environment (IDE) and 1GB for cache

Recommended Hardware Requirements

For an optimal experience with hands-on labs and other practical activities, we recommend the following configuration:

OS: Windows, Mac, or Linux
Processor: Core i5 or better (or AMD equivalent)
Memory: 8GB RAM or better
Storage: 2GB free for build caches and dependencies

Software Requirements

Operating system: Windows, Mac, or Linux
Docker

Follow the instructions below to download the data belonging to the course and to set up your interactive development environment.

Downloading Data for this Course

Once you have cloned this repository locally, simply navigate to the folder you have stored the repo in and run: python download_data.py

This will populate the data-sets folder in your repo with a number of data sets that will be used throughout the course.
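To confirm that the download actually populated the folder, a quick check like the following may help. This is a minimal sketch and not part of the course's scripts; it assumes the folder is named data-sets and sits in the repository root, as described above, and the helper name list_data_files is hypothetical.

```python
from pathlib import Path


def list_data_files(repo_root="."):
    """Return sorted relative paths of all files under <repo_root>/data-sets."""
    data_dir = Path(repo_root) / "data-sets"
    if not data_dir.is_dir():
        # The folder does not exist yet, so nothing has been downloaded.
        return []
    return sorted(
        p.relative_to(data_dir).as_posix()
        for p in data_dir.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    files = list_data_files()
    if files:
        print(f"Found {len(files)} file(s) in data-sets")
    else:
        print("data-sets is missing or empty; run python download_data.py first")
```

Run it from the repository root after the download finishes; an empty result suggests the download did not complete.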

Docker Image Bundled with the Course

About

The Docker image bundled with this course (see Dockerfile) is based on the pyspark-notebook image, distributed and maintained by Jupyter.

GitHub link
Original copyright (c) Jupyter Development Team. Distributed under the terms of the Modified BSD License.

This Course's Docker image extends the pyspark-notebook with the following additions:

enables Jupyter Lab by default
exposes correct ports for JupyterLab and SparkUI
sets numerous default settings to improve Quality of Life for the user
installs numerous add-ons (such as pyspark-stubs and blackcellmagic) using jupyter_contrib_nbextensions

Instructions for use

There are two ways to access the Docker container in this course:

Through the bundled run_me.py script (recommended)
Through the Docker CLI (only for advanced users)

Using the bundled script to run the container

The easiest way to run the container that belongs to this course is by running python run_me.py from the course's repository. This will automatically build the Docker image, set up the Docker container, download the data, and set up the necessary volume mounts.

Using Docker CLI

If you would rather start the Docker container manually, use the following instructions:

Download the data
```
python download_data.py
```

Build the image

docker build --rm -f "Dockerfile" -t mastering_pyspark_ml:latest .

Run the image

Replace /path/to/mastering_pyspark_ml/repo/ in the following command with the path to your local copy of the repository, then run it in a terminal or command prompt:

docker run -v /path/to/mastering_pyspark_ml/repo/:/home/jovyan/ --rm -d -p 8888:8888 -p 4040:4040 --name mastering_pyspark_ml mastering_pyspark_ml

Open Jupyter Lab once the Docker container is running

Navigate to http://localhost:8888/lab
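Because the container starts in detached mode, Jupyter Lab can take a few seconds to become reachable. The sketch below is one possible way to check from Python whether anything is answering at the URL above before opening a browser. It uses only the standard library; the function name jupyter_is_up is hypothetical, and an HTTP error response (such as a token prompt) is treated as "up" because it still shows the server is listening.

```python
import http.client
import urllib.error
import urllib.request


def jupyter_is_up(url="http://localhost:8888/lab", timeout=3.0):
    """Return True if an HTTP server answers at `url`, False if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a malformed reply: not ready.
        return False


if __name__ == "__main__":
    print("Jupyter Lab reachable:", jupyter_is_up())
```

If this reports False shortly after starting the container, waiting a moment and trying again is usually enough.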

To Stop the Docker Image

Once you are ready to shut down the Docker container, you can use the following command:

docker stop mastering_pyspark_ml
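The manual build, run, and stop commands from this section can also be assembled in Python, which makes it easier to substitute the absolute path of your repository into the volume mount. This is an illustrative sketch only and is not the bundled run_me.py; the function names are hypothetical, and the flags simply mirror the commands shown above.

```python
from pathlib import Path
import shlex

IMAGE = "mastering_pyspark_ml"


def build_command(tag=f"{IMAGE}:latest"):
    """argv for building the course image from the Dockerfile in the current folder."""
    return ["docker", "build", "--rm", "-f", "Dockerfile", "-t", tag, "."]


def run_command(repo_path, name=IMAGE):
    """argv for starting the container with the repo mounted at /home/jovyan/."""
    host_dir = Path(repo_path).expanduser().resolve()  # Docker needs an absolute path
    return [
        "docker", "run",
        "-v", f"{host_dir}:/home/jovyan/",
        "--rm", "-d",
        "-p", "8888:8888",  # Jupyter Lab
        "-p", "4040:4040",  # Spark UI
        "--name", name,
        name,
    ]


def stop_command(name=IMAGE):
    """argv for stopping the running container."""
    return ["docker", "stop", name]


if __name__ == "__main__":
    # Print the commands rather than executing them.
    for cmd in (build_command(), run_command("."), stop_command()):
        print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) if you prefer to execute the commands from Python instead of copying them into a terminal.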
