Machine Learning Pipelines With Spark - Layer Technical Guide

This repository contains code samples and materials for Layer's technical guide on "Machine Learning Pipelines with Apache Spark".

emr_files contains an end-to-end code sample for building an ML pipeline with Spark on Amazon Elastic MapReduce (EMR) and deploying the Spark pipeline on AWS SageMaker with MLeap as the execution engine.

To run the code on EMR, ensure you use the necessary configuration files to create a cluster with Spark:

boostrap_actions.sh contains the shell script for custom actions during the cluster creation process. Upload it to an S3 bucket so that you can use its URI to set up custom actions when configuring your cluster.
emr_config.json should go under the classifications for your software configuration.
If you are not sure where to use these files, follow this blog post.

The code was tested with the following configurations:

Release: emr-5.23.0
Applications: Spark v2.4.0 and Livy v0.5.0
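
As a rough illustration of where the two files and the tested versions fit, the sketch below assembles the keyword arguments for boto3's `run_job_flow` call. The bucket URI, cluster name, instance types, instance count, and IAM role names are placeholders assumed for this example, not values from this repository. The sample classification entry is also hypothetical and only stands in for whatever emr_config.json contains.

```python
import json


def build_cluster_request(bootstrap_uri, classifications):
    """Assemble keyword arguments for boto3's EMR `run_job_flow` call.

    bootstrap_uri: S3 URI of the uploaded bootstrap script.
    classifications: list parsed from emr_config.json.
    """
    return {
        "Name": "spark-ml-pipeline",  # hypothetical cluster name
        "ReleaseLabel": "emr-5.23.0",  # release the code was tested with
        "Applications": [{"Name": "Spark"}, {"Name": "Livy"}],
        "Configurations": classifications,
        "BootstrapActions": [
            {
                "Name": "custom-bootstrap",  # hypothetical action name
                "ScriptBootstrapAction": {"Path": bootstrap_uri, "Args": []},
            }
        ],
        "Instances": {
            # Instance types and count are assumptions; size them for your workload.
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default EMR role names; replace them if your account uses others.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical classification entry; in practice load it with
# json.load(open("emr_config.json")).
sample_classifications = [
    {"Classification": "spark-defaults", "Properties": {"spark.executor.memory": "4g"}}
]

request = build_cluster_request(
    "s3://my-bucket/boostrap_actions.sh",  # placeholder bucket
    sample_classifications,
)
print(json.dumps(request, indent=2))
```

With AWS credentials configured, the request could then be submitted with `boto3.client("emr").run_job_flow(**request)`. The same settings can be entered by hand in the EMR console.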

Running Code Sample On Colab

To run the code sample on Colab, check out the notebook.

Everything works except MLeap. Colab does not work well with MLeap (<=1.18.0) because MLeap requires Docker to be installed on the local machine, which Colab does not currently support and does not plan to support anytime soon.

Running Code Sample On Local Jupyter Notebook

Ensure that both PySpark and MLeap are installed correctly, then run the notebook starting from the header "If you are using Local Jupyter Notebook".
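
One quick way to confirm the installation before opening the notebook is to check that both packages can be found by the interpreter. This is a minimal sketch that assumes the packages are importable under the module names `pyspark` and `mleap`; the helper name `missing_packages` is made up for this example.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that the interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["pyspark", "mleap"])
if missing:
    print("Install before running the notebook: " + ", ".join(missing))
else:
    print("PySpark and MLeap are importable.")
```

The check only confirms that the modules can be located. It does not verify that the installed versions are compatible with each other.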

Credits

Dataset Credits:

The dataset is credited to Ronny Kohavi and Barry Becker (from this paper). It was drawn from 1994 United States Census Bureau data, and the task is to use personal details such as education level to predict whether an individual earns more or less than $50,000 per year.
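
To make the prediction target concrete, here is a minimal sketch that turns an income value into a binary label and tallies labels by education level. It assumes the income column uses the strings `>50K` and `<=50K`, as in the UCI distribution of the Adult dataset, and that the columns are named `education` and `income`. The three rows are invented for illustration and are not taken from the data folder.

```python
import csv
import io
from collections import Counter

# Invented sample rows (education, income); not real records.
SAMPLE = """education,income
Bachelors,<=50K
Masters,>50K
HS-grad,<=50K
"""


def to_label(income):
    """Map an income string to 1 (more than $50,000) or 0 (otherwise).

    Surrounding whitespace and a trailing period are ignored, since some
    copies of the dataset write the value as ">50K." in the test split.
    """
    return 1 if income.strip().rstrip(".") == ">50K" else 0


def label_counts(text):
    """Count (education, label) pairs in CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return Counter((row["education"], to_label(row["income"])) for row in reader)


counts = label_counts(SAMPLE)
for (education, label), n in sorted(counts.items()):
    print(education, label, n)
```

In the notebook the equivalent step would be done with Spark rather than the `csv` module. The plain-Python version is used here only to show the shape of the label.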

Resources

Concepts from the Colab notebook are heavily inspired by Janani Ravi's course on "Building Machine Learning Models in Spark 2".

Other References

Imbalanced Classification with the Adult Income Dataset.

Apache Spark ML documentation.
