Note
I am no longer teaching this course, so this material does not necessarily reflect the current curriculum.
In this course you will learn how to formulate and organize practical machine learning problems; identify and estimate appropriate machine learning models for prediction and clustering; evaluate and select among different machine learning models and algorithms; and implement machine learning models and algorithms in a programming language.
The course gives you knowledge about machine learning that is used within marketing, finance, economics, textual analysis, digital humanities and social sciences. You will encounter many different forms of data, including images and text.
The course covers a number of machine learning methods with a focus on prediction. The course deals with supervised and unsupervised machine learning as well as semi-supervised and active learning. The course includes flexible regression and classification, regularization, methods for predictive model performance evaluation, Gaussian processes, clustering algorithms and mixture models.
Mattias Villani
Professor of Statistics, Stockholm and Linköping University
Probabilistic machine learning and Bayesian methods
Frank Miller
Professor of Statistics, Stockholm University
Experimental design, active learning and optimization methods
Karl Sigfrid
PhD student in Statistics, Stockholm University
The formal course description document, with all the details about grading etc., is here.
The course will use the following book as the main course literature:
- Machine Learning - a first course for engineers and scientists (MLES) by Lindholm et al. (2021). Forthcoming at Cambridge University Press. A free PDF version is available here. The previous title of the book was 'Supervised Machine Learning'.
- Additional course material linked from this page, such as articles and tutorials.
The course schedule on TimeEdit is here: Schedule.
Material under Extra is supplementary material that will help you understand the course content.
Material under Bonus is not required course material, but may be of interest to the curious student.
Lecture 1 - Introduction, k-NN and decision trees
Reading: MLES 1-2 | Slides
Bonus: Python Jupyter notebook for linear regression | Python code for nonlinear regression
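As a taste of what Lecture 1 covers, k-NN classification can be written in a few lines of base R. This is an illustrative sketch with made-up toy data, not the course's official code:

```r
# k-nearest neighbours classification in base R (illustrative sketch).
# Classify each test point by majority vote among its k closest training points.
knn_predict <- function(X_train, y_train, X_test, k = 3) {
  apply(X_test, 1, function(x) {
    # Euclidean distances from x to every training point
    d <- sqrt(rowSums(sweep(X_train, 2, x)^2))
    # Majority class among the k nearest neighbours
    nn <- y_train[order(d)[1:k]]
    names(which.max(table(nn)))
  })
}

# Toy example: two well-separated clusters around (0,0) and (4,4)
set.seed(1)
X <- rbind(matrix(rnorm(20, 0), ncol = 2), matrix(rnorm(20, 4), ncol = 2))
y <- rep(c("a", "b"), each = 10)
knn_predict(X, y, rbind(c(0, 0), c(4, 4)), k = 3)
```

Small k gives a flexible, low-bias classifier; large k gives a smoother, more stable one, which connects to the bias-variance discussion later in the course.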
Lecture 2 - Regularized non-linear regression and classification
Reading: MLES 3 | Slides
Code: Spline regression: Notebook pdf html | Spline package demo
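The core idea of Lecture 2 — fit a non-linear function by expanding the input into basis functions and shrinking the coefficients — can be sketched in base R with a polynomial basis and a ridge penalty (simulated data; the lambda value is made up for illustration):

```r
# Ridge regression on a polynomial basis expansion (illustrative sketch).
set.seed(42)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)

# Degree-10 polynomial features: non-linear in x, but linear in the coefficients
Phi <- outer(x, 0:10, `^`)

# Closed-form ridge estimate: (Phi'Phi + lambda I)^{-1} Phi'y
lambda <- 0.01
beta <- solve(crossprod(Phi) + lambda * diag(ncol(Phi)), crossprod(Phi, y))

yhat <- Phi %*% beta  # fitted values; a larger lambda gives a smoother fit
```

The spline notebooks linked above replace the polynomial basis with spline bases, but the regularization idea is the same.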
Lecture 3 - Evaluating predictive performance and hyperparameter learning
Reading: MLES 4 | Slides
Bonus: Some slides about entropy
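The workhorse of Lecture 3 is cross-validation. A minimal base-R sketch of 5-fold cross-validation for choosing a hyperparameter (here the polynomial degree, on simulated data):

```r
# 5-fold cross-validation over polynomial degree (illustrative sketch).
set.seed(7)
x <- runif(100); y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
folds <- sample(rep(1:5, length.out = 100))  # random fold assignment

cv_mse <- sapply(1:8, function(degree) {
  mean(sapply(1:5, function(f) {
    fit <- lm(y ~ poly(x, degree), subset = folds != f)  # train on 4 folds
    pred <- predict(fit, newdata = data.frame(x = x[folds == f]))
    mean((y[folds == f] - pred)^2)  # squared error on the held-out fold
  }))
})

which.min(cv_mse)  # degree with the best estimated predictive performance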
Lecture 4 - Ensemble methods
Reading: MLES 7 | Slides
Extra: Gradient boosting visualized | Gradient boosting playground
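One of the ensemble ideas in Lecture 4, bagging, is simple enough to sketch directly: fit many trees to bootstrap resamples and average their predictions. This sketch uses the rpart package (which ships with R) and simulated data:

```r
# Bagging regression trees (illustrative sketch).
library(rpart)

set.seed(1)
n <- 200
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

B <- 50
preds <- replicate(B, {
  boot <- dat[sample(n, replace = TRUE), ]  # bootstrap resample
  fit <- rpart(y ~ x, data = boot)          # one regression tree
  predict(fit, newdata = dat)               # predict on the original data
})
bagged <- rowMeans(preds)  # ensemble prediction: average over the B trees
```

Random forests add one twist on top of this: each tree split only considers a random subset of the features, which decorrelates the trees and improves the averaged prediction.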
Lecture 5 - Learning from large-scale data
Reading: MLES 5 | Slides
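The key algorithm for large-scale learning is stochastic gradient descent: instead of computing the gradient on all data, use a random mini-batch at each step. A base-R sketch for linear regression (simulated data; learning rate and batch size are made-up illustrative values):

```r
# Mini-batch stochastic gradient descent for linear regression (sketch).
set.seed(5)
n <- 10000
X <- cbind(1, rnorm(n)); beta_true <- c(2, -3)
y <- X %*% beta_true + rnorm(n)

beta <- c(0, 0); lr <- 0.1; batch <- 32
for (step in 1:2000) {
  i <- sample(n, batch)                                   # random mini-batch
  grad <- crossprod(X[i, ], X[i, ] %*% beta - y[i]) / batch
  beta <- beta - lr * as.vector(grad)                     # gradient step
}
round(beta, 1)  # should be close to the true c(2, -3)
```

Each step touches only 32 observations, so the cost per update is independent of n — the property that makes the method scale.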
Lecture 6 - Neural networks and Deep learning
Reading: MLES 6.1-6.2 | Slides
Code: Neural net MNIST in keras
Extra: Video on Neural networks | Video on learning a neural network | keras cheat sheet
Lecture 7 - Image data and convolutional neural networks
Reading: MLES 6.3-6.4 | Slides
Code: ConvNet MNIST in keras
Extra: Filter spreadsheet
Lecture 8 - Gaussian process regression and classification
Reading: MLES 9 | Slides
Extra: GP visualization
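A good way to build intuition for Lecture 8 is to draw random functions from a GP prior. A base-R sketch with a squared-exponential kernel (the hyperparameter values are made up for illustration):

```r
# Sample paths from a Gaussian process prior (illustrative sketch).
se_kernel <- function(x1, x2, ell = 0.3, sigma_f = 1) {
  sigma_f^2 * exp(-0.5 * outer(x1, x2, "-")^2 / ell^2)
}

x <- seq(0, 1, length.out = 100)
K <- se_kernel(x, x) + 1e-8 * diag(100)  # jitter for numerical stability

# One draw from N(0, K) via the Cholesky factor
set.seed(2)
f <- t(chol(K)) %*% rnorm(100)
# plot(x, f, type = "l")  # a random smooth function from the prior
```

Shortening the length scale ell makes the sampled functions wigglier; GP regression then conditions these prior functions on the observed data.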
Lecture 9 - Unsupervised learning - mixture models and clustering
Reading: MLES 10.1-10.3 | Slides
Code: EM for univariate Gaussian mixtures | EM for multivariate Gaussian mixtures
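A compact base-R sketch of the EM iteration for a two-component univariate Gaussian mixture, on simulated data (the course notebooks linked above are the authoritative versions):

```r
# EM for a two-component univariate Gaussian mixture (illustrative sketch).
set.seed(3)
x <- c(rnorm(150, -2), rnorm(150, 3))  # simulated mixture data
mu <- c(-1, 1); sigma <- c(1, 1); pi_k <- c(0.5, 0.5)

for (iter in 1:100) {
  # E-step: responsibility of component 2 for each observation
  d1 <- pi_k[1] * dnorm(x, mu[1], sigma[1])
  d2 <- pi_k[2] * dnorm(x, mu[2], sigma[2])
  r  <- d2 / (d1 + d2)
  # M-step: responsibility-weighted parameter updates
  pi_k  <- c(mean(1 - r), mean(r))
  mu    <- c(weighted.mean(x, 1 - r), weighted.mean(x, r))
  sigma <- c(sqrt(weighted.mean((x - mu[1])^2, 1 - r)),
             sqrt(weighted.mean((x - mu[2])^2, r)))
}
round(mu, 1)  # should be close to the true means -2 and 3
```

Each iteration provably does not decrease the likelihood, which is why the simple alternation of E- and M-steps converges.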
Lecture 10 - Textual data and topic models
Reading: Multinomial-Dirichlet analysis | Topic models intro | Slides
Lecture 11 - Semi-supervised learning
Reading: MLES 10.1 | Slides
Lecture 12 - Active learning
Reading: Settles (2010), especially Sections 1, 2, 3.1, 3.5, 3.6, 7.1 | Slides
Code: Active learning - illustrating example
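The simplest active learning strategy from Settles' survey, uncertainty sampling, fits in a short base-R sketch: repeatedly query the unlabelled point the current model is least sure about (simulated data; in practice the labels of queried points come from an oracle, not from a pre-generated vector):

```r
# Uncertainty sampling for logistic regression (illustrative sketch).
set.seed(4)
x <- runif(200, -3, 3)
y <- rbinom(200, 1, plogis(2 * x))  # true labels (hidden in practice)
labelled <- sample(200, 10)         # small initial labelled pool

for (step in 1:20) {
  fit <- glm(y ~ x, family = binomial, subset = labelled)
  p <- predict(fit, newdata = data.frame(x = x), type = "response")
  pool <- setdiff(seq_along(x), labelled)
  # Most uncertain unlabelled point: predicted probability nearest 0.5
  query <- pool[which.min(abs(p[pool] - 0.5))]
  labelled <- c(labelled, query)    # "ask the oracle" for its label
}
```

The queried points end up concentrated near the decision boundary, which is where labels are most informative for this model.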
- The three computer labs are central to the course. Expect to allocate substantial time for each lab. Many of the exam questions will be computer based, so working on the labs will also help you prepare for the exam.
- R will be used as the course's programming language; see below for more info.
- The labs should be done in pairs of students.
- Each lab report should be submitted as a PDF along with the .R file with the code. Submission is done through Athena.
- There are four hours of computer time allocated to each lab. The idea is that you start working on the lab before the computer session, so that you are in a position to ask questions at the session, and then finish up the report afterwards.
Computer Lab 1 - Regularized nonlinear regression and classification.
Lab 1a: Regularized regression: R notebook | pdf version | html version
Lab 1b: Regularized classification: R notebook | pdf version | html version
Submission: Athena.
Computer Lab 2 - Neural Networks and Gaussian Processes.
Lab 2: R notebook | pdf version | html version
Submission: Athena.
Computer Lab 3 - Unsupervised, semi-supervised and active learning.
Lab 3: R notebook | pdf version | html version
Submission: Athena.
Lab assistant: Karl Sigfrid
The course examination consists of:
- Written lab reports (deadlines given in Athena)
- Computer exam
Analyzing data in R will be a big part of the course, so you need to know some R programming. The course R programming (7.5 credits) or an equivalent course is a prerequisite for this course. If you feel a little rusty on R, you can find a lot of material for studying it online, including tutorials, videos and free books. Here are some resources:
- Download R
- RStudio - probably the best software/editor for R.
- Official introduction to R
- R Cheat sheets
- The labs and exam will be done using R notebooks in RStudio.
Here are some machine learning packages in R:
- Machine learning R packages on CRAN.
- caret - a meta package for predictive ML models in R. See the Caret package vignette and a list of available models in Caret.
- keras - a package that brings Tensorflow for deep learning to R. Here is the quick start to keras.