GitHub - Shef-AI/ScalableML: COM6012 Scalable Machine Learning

COM6012 Scalable Machine Learning - University of Sheffield

In this module, we will learn how to do machine learning at large scale using Apache Spark. We will use the High Performance Computing (HPC) cluster systems of our university. You must connect through a VPN (Virtual Private Network) to access the HPC.

This edition uses PySpark 3.0.1, the latest stable release of Spark when the materials were prepared (released Sep 02, 2020), and consists of the 10 sessions listed below. You can refer to the overview slides for more information, e.g. timetable and assessment information.

Session 1: Introduction to Spark and HPC
Session 2: RDD, DataFrame, ML pipeline, & parallelization
Session 3: Scalable matrix factorisation for collaborative filtering in recommender systems
Session 4: Scalable k-means clustering and Spark configuration
Session 5: Scalable PCA for dimensionality reduction and Spark data types
Session 6: Scalable decision trees and ensemble models
Session 7: Scalable logistic regression
Session 8: Scalable generalized linear models
Session 9: Scalable neural networks
Session 10: Apache Spark in the Cloud (guest lecture by Dr Michael Smith)
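
Because the sessions above were written for PySpark 3.0.1, it can help to check which version is installed before starting the labs. The following is a minimal sketch using only the Python standard library (Python 3.8+); the helper names `installed_version` and `matches_expected` are hypothetical and not part of the course materials.

```python
from importlib import metadata

# Version named in this README; adjust if you use a different edition.
EXPECTED_PYSPARK = "3.0.1"


def installed_version(package="pyspark"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_expected(installed, expected=EXPECTED_PYSPARK):
    """Return True only when an installed version equals the expected one exactly."""
    return installed is not None and installed == expected


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pyspark is not installed in this environment")
    elif matches_expected(found):
        print(f"pyspark {found} matches the course version")
    else:
        print(f"pyspark {found} found; the labs were written for {EXPECTED_PYSPARK}")
```

An exact string comparison is used here on purpose: it is the simplest check, and a newer release may still run the labs but is not what they were tested against.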

You can also download the Spring 2020 version for preview or reference.

The materials are built with references to the following sources:

The official Apache Spark documentation. Note: the latest information is here.
The PySpark tutorial by Wenqiang Feng with PDF - Learning Apache Spark with Python Release v1.0. Also see GitHub Project Page. Note: last update in Feb 2020.
The Introduction to Apache Spark course by A. D. Joseph, University of California, Berkeley. Note: archived.
The book Learning Spark: Lightning-Fast Data Analytics, 2nd Edition, O'Reilly by Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee.
The book Spark: The Definitive Guide by Bill Chambers and Matei Zaharia. There is also a Repository for code from the book.

Many thanks to

Mike Croucher, Neil Lawrence, Will Furnass, Twin Karmakharm, and Vamsi Sai Turlapati for their input and inspiration since 2016.
Our teaching assistants and students who have contributed in many ways since 2017.

Repository contents:

Code
Data
HPC
Output
Slides
.gitattributes
.gitignore
Lab 1 - Introduction to Spark and HPC.md
Lab 2 - RDD, DataFrame, ML pipeline, and parallelization.md
Lab 3 - Scalable matrix factorisation for collaborative filtering.md
Lab 4 - Scalable k-means clustering and Spark configuration.md
Lab 5 - Scalable PCA for dimensionality reduction and Spark data types.md
Lab 6 - Scalable Decision trees.md
Lab 7 - Scalable logistic regression.md
Lab 8 - Scalable generalized linear models.md
Lab 9 - Scalable neural networks.md
README.md