COM6012 Scalable Machine Learning - University of Sheffield

Spring 2021 by Haiping Lu (sessions 1-5) and Mauricio A Álvarez (sessions 6-10)

In this module, we will learn how to do machine learning at large scale using Apache Spark. We will use the High Performance Computing (HPC) cluster systems of our university. You must use a VPN (Virtual Private Network) to connect to the HPC cluster.

This edition uses PySpark 3.0.1, the latest stable release of Spark at the time of writing (Sep 02, 2020), and consists of the 10 sessions below. Refer to the overview slides for more information, such as the timetable and assessment details.

  • Session 1: Introduction to Spark and HPC
  • Session 2: RDD, DataFrame, ML pipeline, & parallelization
  • Session 3: Scalable matrix factorisation for collaborative filtering in recommender systems
  • Session 4: Scalable k-means clustering and Spark configuration
  • Session 5: Scalable PCA for dimensionality reduction and Spark data types
  • Session 6: Scalable decision trees and ensemble models
  • Session 7: Scalable logistic regression
  • Session 8: Scalable generalized linear models
  • Session 9: Scalable neural networks
  • Session 10: Apache Spark in the Cloud (guest lecture by Dr Michael Smith)

You can also download the Spring 2020 version for preview or reference.

Acknowledgement

The materials are built with references to the following sources:

Many thanks to

  • Mike Croucher, Neil Lawrence, Will Furnass, Twin Karmakharm, and Vamsi Sai Turlapati for their input and inspiration since 2016.
  • Our teaching assistants and students who have contributed in many ways since 2017.
