PacktPublishing/Mastering-Big-Data-Analytics-with-PySpark

Mastering Big Data Analytics with PySpark [Video]

This is the code repository for Mastering Big Data Analytics with PySpark [Video], published by Packt. It contains all the supporting project files necessary to work through the video course from start to finish. Authored by: Danny Meijer

About the Video Course

PySpark helps you perform data analysis at scale; it enables you to build more scalable analyses and pipelines. This course starts by introducing you to PySpark's potential for performing effective analyses of large datasets. You'll learn how to interact with Spark from Python and connect Jupyter to Spark to provide rich data visualizations. After that, you'll delve into Spark's architecture and its various components.

You'll learn to work with Apache Spark and perform ML tasks more smoothly than before. You'll gather and query data using Spark SQL to overcome the challenges involved in reading it. You'll use the DataFrame API to operate with Spark MLlib and learn about the Pipeline API. Finally, we provide tips and tricks for deploying your code and tuning performance.

By the end of this course, you will not only be able to perform efficient data analytics but will also have learned to use PySpark to analyze large datasets at scale in your organization.

What You Will Learn

  • Gain a solid knowledge of vital Data Analytics concepts via practical use cases
  • Create elegant data visualizations using Jupyter
  • Run, process, and analyze large chunks of datasets using PySpark
  • Utilize Spark SQL to easily load big data into DataFrames
  • Create fast and scalable Machine Learning applications using MLlib with Spark
  • Perform exploratory Data Analysis in a scalable way
  • Achieve scalable, high-throughput and fault-tolerant processing of data streams using Spark Streaming

Instructions and Navigation

Assumed Knowledge

This course will appeal to data science enthusiasts, data scientists, and anyone who is familiar with Machine Learning concepts and wants to scale out their work to big data.

If you find it difficult to analyze large datasets that keep growing, then this course is the perfect guide for you!

A working knowledge of Python is assumed.

Technical Requirements

Minimum Hardware Requirements

For successful completion of this course, students will require a computer system with at least the following:

  • OS: Windows, Mac, or Linux
  • Processor: Any processor from the last few years
  • Memory: 2GB RAM
  • Storage: 300MB for the Integrated Development Environment (IDE) and 1GB for cache

Recommended Hardware Requirements

For an optimal experience with hands-on labs and other practical activities, we recommend the following configuration:

  • OS: Windows, Mac, or Linux
  • Processor: Core i5 or better (or AMD equivalent)
  • Memory: 8GB RAM or better
  • Storage: 2GB free for build caches and dependencies

Software Requirements

  • Operating system: Windows, Mac, or Linux
  • Docker

Follow the instructions below to download the data belonging to the course and to set up your interactive development environment.

Downloading Data for this Course

Once you have cloned this repository locally, navigate to the folder where you stored the repo and run:

    python download_data.py

This will populate the data-sets folder in your repo with a number of data sets that will be used throughout the course.
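To confirm that the download worked, a small check along the following lines can help. This is a minimal sketch, not part of the course code: the helper name `list_data_sets` is hypothetical, and it assumes only that the data ends up in a `data-sets` folder at the repository root, as described above.

```python
from pathlib import Path
from typing import List


def list_data_sets(repo_root: str = ".") -> List[str]:
    """Return the sorted names of the entries in <repo_root>/data-sets.

    An empty list means the folder is missing or empty, which suggests
    that download_data.py has not been run yet. (Hypothetical helper.)
    """
    data_dir = Path(repo_root) / "data-sets"
    if not data_dir.is_dir():
        return []
    return sorted(entry.name for entry in data_dir.iterdir())


if __name__ == "__main__":
    found = list_data_sets()
    if found:
        print(f"Found {len(found)} entries in data-sets:")
        for name in found:
            print(f"  {name}")
    else:
        print("data-sets is missing or empty; run: python download_data.py")
```

Run it from the repository root after the download finishes. The names it prints depend on what the download script fetches.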

Docker Image Bundled with the Course

About

The Docker image bundled with this course (see Dockerfile) is based on the pyspark-notebook image, distributed and maintained by Jupyter.

Original copyright (c) Jupyter Development Team. Distributed under the terms of the Modified BSD License.

This course's Docker image extends the pyspark-notebook with the following additions:

  • enables Jupyter Lab by default
  • exposes correct ports for JupyterLab and SparkUI
  • sets numerous default settings to improve Quality of Life for the user
  • installs numerous add-ons (such as pyspark-stubs and blackcellmagic) using jupyter_contrib_nbextensions

Instructions for use

There are two ways to access the Docker container in this course:

  1. Through the bundled run_me.py script (recommended)
  2. Through the Docker CLI (only for advanced users)

Using the bundled script to run the container

The easiest way to run the container that belongs to this course is by running python run_me.py from the course's repository. This will automatically build the Docker image, set up the Docker container, download the data, and set up the necessary volume mounts.
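For readers curious about what such automation involves, the sketch below chains the same three commands that the manual Docker CLI instructions in this README use. It is an illustration only and not the actual contents of run_me.py: the function names `build_steps` and `run_steps` are hypothetical, the step order follows the manual instructions, and the default is a dry run that only prints each command.

```python
import shlex
import subprocess
import sys
from pathlib import Path
from typing import List

IMAGE = "mastering_pyspark_ml"  # image and container name used in this README


def build_steps(repo_path: str) -> List[List[str]]:
    """Return the download, build, and run commands as argument lists."""
    repo = Path(repo_path).resolve()  # Docker bind mounts need an absolute path
    return [
        [sys.executable, "download_data.py"],
        ["docker", "build", "--rm", "-f", "Dockerfile", "-t", f"{IMAGE}:latest", "."],
        ["docker", "run", "-v", f"{repo}:/home/jovyan/", "--rm", "-d",
         "-p", "8888:8888", "-p", "4040:4040", "--name", IMAGE, IMAGE],
    ]


def run_steps(steps: List[List[str]], repo_path: str, dry_run: bool = True) -> None:
    """Print each command; execute it from repo_path unless dry_run is set."""
    for step in steps:
        print(shlex.join(step))
        if not dry_run:
            # check=True stops the sequence at the first failing command
            subprocess.run(step, cwd=repo_path, check=True)


if __name__ == "__main__":
    run_steps(build_steps("."), ".", dry_run=True)
```

Passing `dry_run=False` would execute the commands, which requires Docker to be installed and running.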

Using Docker CLI

If you would rather start the Docker container manually, use the following instructions:

  1. Download the data

    python download_data.py
  2. Build the image

    docker build --rm -f "Dockerfile" -t mastering_pyspark_ml:latest .
  3. Run the image. Replace /path/to/mastering_pyspark_ml/repo/ in the following command with the absolute path to your local copy of the repository, then run it in a terminal or command prompt:

    docker run -v /path/to/mastering_pyspark_ml/repo/:/home/jovyan/ --rm -d -p 8888:8888 -p 4040:4040 --name mastering_pyspark_ml mastering_pyspark_ml
  4. Open JupyterLab once the Docker container is running: navigate to http://localhost:8888/lab
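Because the container starts in detached mode (-d), JupyterLab may need a few seconds before it responds. The following sketch polls the URL from the last step until the server answers. It is a convenience under stated assumptions, not part of the course code: the helper name `wait_for_url` and its default timeout and interval are hypothetical.

```python
import time
import urllib.error
import urllib.request


def wait_for_url(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll url until the server sends any HTTP response or timeout elapses.

    Returns True as soon as a response arrives, False if none arrives in time.
    (Hypothetical helper; the defaults are arbitrary.)
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered with an error status, so it is reachable.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused, reset, or timed out: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example usage, once the container has been started:
#     if wait_for_url("http://localhost:8888/lab"):
#         print("JupyterLab is reachable")
```

A True result only shows that something is answering on that port. It does not verify a login or that the notebook environment is fully configured.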

To Stop the Docker Container

Once you are ready to shut down the Docker container, use the following command:

docker stop mastering_pyspark_ml
