# Machine Learning

Goals:

* Understand what is meant by the term "machine learning", and learn the rudiments of the language that goes with it.

* See how some of the methods of basic machine learning, and the underlying philosophy, relate to the statistical inference we have been studying in the rest of the course.

### Further Reading

Ivezic Chapters 6-9 cover the main topics in machine learning, both supervised and unsupervised.

> Credit: some of the material (including the graphics) in this lesson comes from Andy Mueller's `scikit-learn` tutorial from the 2015 Astro Hack Week.

## What is Machine Learning?

* The umbrella term "machine learning" describes methods for *automated data analysis*, developed by computer scientists and statisticians in response to the appearance of ever larger datasets.


* The goal of automation has led to an emphasis on non-parametric models (that adapt to dataset size and complexity), and a very uniform terminology that enables multiple models to be implemented and compared on an equal footing.


* Machine learning can be divided into two types: *supervised* and *unsupervised.* 

## Supervised Learning

* Supervised learning is also known as *predictive* learning. Given *inputs* $X$, the goal is to construct a machine that can accurately predict a set of *outputs* $y$, usually so that _decisions_ can be made. 


* The "supervision" refers to the education of the machine, via a *training set* $D$ of input-output pairs that we provide. Prediction accuracy is then tested on *validation* and *test* sets.

## Supervised Learning

* At the heart of the prediction machine is a *model* $M$ that can be *trained* to give accurate predictions.


* Supervised learning is about making predictions by characterizing ${\rm Pr}(y_k|x_k,D,M)$.
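
As a rough illustration of what "characterizing ${\rm Pr}(y_k|x_k,D,M)$" can look like in practice, here is a minimal sketch using `scikit-learn`. The synthetic data and the choice of logistic regression as the model $M$ are assumptions made for this example. Note that `predict_proba` evaluates the probabilities at the best-fit parameters rather than marginalizing over them, so it is only an approximation to the full predictive distribution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Training set D: N = 200 samples with P = 2 features, plus binary labels y.
# (Synthetic data, invented for this illustration.)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# The model M: logistic regression is an illustrative choice, not a recommendation.
model = LogisticRegression()
model.fit(X_train, y_train)

# Approximate Pr(y_k | x_k, D, M) for two new inputs x_k.
X_new = np.array([[2.0, 2.0], [-2.0, -2.0]])
proba = model.predict_proba(X_new)  # one row per input, one column per class
labels = model.predict(X_new)       # most probable class for each input
```

Each row of `proba` sums to one, and `labels` simply picks the class with the highest probability in each row.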

## Supervised Learning

* The outputs $y$ are called *response variables*: our model will generate predictions of $y$.

* The variables $y$ can be either *categorical* ("labels") or *numerical* (real numbers).

## Supervised Learning

* When the $y$ are categorical, the problem is one of *classification* ("is this an image of a `kitten`, or a `puppy`?"). 


* When the $y$ are numerical, the problem is a *regression* ("how should we interpolate between these numerical values?").
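
To make the distinction concrete, the sketch below fits one classifier and one regressor to the same inputs. The tiny dataset and the use of nearest-neighbor models are assumptions chosen for brevity, not methods prescribed by this lesson.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Six samples with a single feature each (made-up values).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Classification: the outputs are categorical labels.
y_labels = np.array(["kitten", "kitten", "kitten", "puppy", "puppy", "puppy"])
classifier = KNeighborsClassifier(n_neighbors=3).fit(X, y_labels)
predicted_label = classifier.predict([[0.5]])[0]

# Regression: the outputs are real numbers (here y = 2x).
y_values = 2.0 * X[:, 0]
regressor = KNeighborsRegressor(n_neighbors=2).fit(X, y_values)
predicted_value = regressor.predict([[2.5]])[0]  # averages the outputs at x=2 and x=3
```

The two estimators share the same `fit`/`predict` interface; only the type of $y$ differs.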

## Unsupervised Learning

* Also known as *descriptive* learning. Here the goal is "knowledge discovery": the detection of patterns in a dataset, which can then be used in supervised/model-based analyses. 


* Unsupervised learning is about *density estimation* - characterizing ${\rm Pr}(x|\theta,H)$.
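
One possible sketch of density estimation is given below. It assumes synthetic one-dimensional data and takes the hypothesis $H$ to be "a mixture of two Gaussians", with $\theta$ the mixture weights, means, and variances. Both the data and the choice of `GaussianMixture` are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Unlabeled data: 600 one-dimensional samples drawn from two populations (synthetic).
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 300)])
X = x.reshape(-1, 1)  # scikit-learn expects an N x P array; here P = 1

# H: the data come from a two-component Gaussian mixture (an assumption).
# Fitting estimates theta: the weights, means, and variances.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Evaluate the fitted density Pr(x | theta, H) on a grid.
grid = np.linspace(-6.0, 8.0, 500).reshape(-1, 1)
density = np.exp(gmm.score_samples(grid))  # score_samples returns log-densities
```

No labels $y$ appear anywhere: the model is asked only to describe how the $x$ are distributed.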

## Unsupervised Learning

* Examples of unsupervised learning activities include:

  * Clustering analysis of the $x$.
  * Dimensionality reduction: principal component analysis, independent component analysis, etc.
  

* In this chunk we will focus on supervised learning, and some applications in astronomy.
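
For reference, the two unsupervised activities listed above (clustering and dimensionality reduction) might look like the following in `scikit-learn`. The two synthetic "blobs" and the choice of k-means for the clustering step are assumptions made for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Two well-separated groups of 50 samples each, with P = 3 features (synthetic).
blob_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.3, size=(50, 3))
blob_b = rng.normal(loc=[5.0, 5.0, 0.0], scale=0.3, size=(50, 3))
X = np.vstack([blob_a, blob_b])

# Clustering: assign each sample to one of two clusters, using no labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

# Dimensionality reduction: project the 3 features onto 2 principal components.
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)
```

Here `cluster_ids` holds one integer per sample, and `pca.explained_variance_ratio_` reports how much of the variance each retained component captures.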


## Data Representations

* Each input $x$ is said to have $P$ *features* (or *attributes*), and represents a *sample* (assumed to have been drawn from a sampling distribution). Each sample input $x$ is associated with an output $y$.


* Our $N$ input *samples* are packaged into an $N \times P$ *design matrix* $X$ (with $N$ rows and $P$ columns).
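
In NumPy terms, the design matrix is just a two-dimensional array. The numbers below are made up purely to show the shapes involved.

```python
import numpy as np

# Design matrix X: N = 4 samples (rows), P = 3 features (columns).
# The values are invented for illustration.
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
    [5.8, 2.7, 5.1],
])

# One output y per sample, so y has length N.
y = np.array([0, 0, 1, 1])

N, P = X.shape
one_sample = X[1]       # a single input x: a length-P vector of features
one_feature = X[:, 2]   # a single feature across all N samples
```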

<img src="graphics/ml_data_representation.svg" width=100%>


* Typically a supervised learning model is "trained" on a subset of the data, and then its ability to make predictions about new data is "tested" on the remainder.
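
A minimal sketch of this train/test workflow is shown below, assuming synthetic data, a 75%/25% split, and a nearest-neighbor classifier. All three are illustrative choices rather than part of the lesson's own examples.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)

# Synthetic dataset: N = 100 samples, P = 2 features, binary labels.
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# Hold out 25% of the rows of X (and the matching entries of y) for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train on the training subset only.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Test: fraction of held-out samples whose labels are predicted correctly.
accuracy = model.score(X_test, y_test)
```

Because the test samples play no part in the fit, `accuracy` estimates how well the model generalizes to new data.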

<img src="graphics/ml_supervised_workflow.svg" width=100%>

<img src="graphics/ml_train_test_split_matrix.svg" width=100%>

### Questions:

Talk to your neighbor for a few minutes about the things you have just heard about machine learning. 

* So far in this course, have we been working on supervised or unsupervised learning problems?

* Have we been talking about regression or classification problems? 

* Which of our example astronomical datasets is most similar to the `boston` house-prices dataset from `scikit-learn` (note that it was removed from the library in version 1.2)? How do they differ?