# Introduction to Machine Learning

Goals:

* Understand what is meant by the term "machine learning", and learn the rudiments of the language that goes with it.

* Start to explore the `scikit-learn` ML Python package.

### Further Reading

Ivezic Chapters 6-9 cover the main topics in Machine Learning, both supervised and unsupervised.

> Credit: much of the material in this chunk is based on Andy Mueller's `scikit-learn` tutorial from the 2015 edition of "Astro Hack Week".

## What is Machine Learning?

* The umbrella term "machine learning" describes methods for *automated data analysis*, developed by computer scientists and statisticians in response to the appearance of ever larger datasets.


* The goal of automation has led to an emphasis on non-parametric models, and a very uniform terminology, enabling multiple models to be implemented and compared on an equal footing.


* Machine learning can be divided into two types: *supervised* and *unsupervised.* 

## Supervised Learning

* Supervised learning is also known as *predictive* learning. Given *inputs* $X$, the goal is to construct a machine that can accurately predict a set of *outputs* $y$, usually so that _decisions_ can be made. 


* The "supervision" refers to the education of the machine, via a *training set* $D$ of input-output pairs that we provide. Prediction accuracy is then tested on *validation* and *test* sets.

## Supervised Learning

* At the heart of the prediction machine is a *model* $M$ that can be *trained* to give accurate predictions.


* Supervised learning is about making predictions by characterizing ${\rm Pr}(y_k|x_k,D,M)$.
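
As a rough illustration of ${\rm Pr}(y_k|x_k,D,M)$, the sketch below trains a classifier and asks it for class probabilities for one new input. The choice of `LogisticRegression` as the model $M$, the `digits` dataset as the source of $D$, and the `random_state` and `max_iter` values are all assumptions made for this example, not part of the original material.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# D: a training set of input-output pairs (the digits data are an assumed example)
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# M: the model, here a logistic regression classifier (an illustrative choice)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Pr(y_k | x_k, D, M): one probability per class for a single new input x_k
x_k = X_test[:1]
proba = model.predict_proba(x_k)
print(proba.round(3))
```

Each row of `proba` holds one probability per possible label, and the entries in a row sum to one.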

## Supervised Learning

* The outputs $y$ are said to be *response variables* - predictions of $y$ will be generated by our model. 

* The variables $y$ can be either *categorical* ("labels") or *numerical* (real numbers). 

## Supervised Learning

* When the $y$ are categorical, the problem is one of *classification* ("is this an image of a `kitten`, or a `puppy`?"). 


* When the $y$ are numerical, the problem is a *regression* ("how should we interpolate between these numerical values?").
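
The distinction can be seen in the type of prediction each kind of model returns. This is a minimal sketch, assuming the `digits` and `diabetes` example datasets and two arbitrarily chosen estimators (`KNeighborsClassifier` and `LinearRegression`); it predicts on a training sample purely to show the output types.

```python
from sklearn.datasets import load_diabetes, load_digits
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Classification: the outputs y are categorical labels (digits 0-9)
digits = load_digits()
clf = KNeighborsClassifier(n_neighbors=3).fit(digits.data, digits.target)
label = clf.predict(digits.data[:1])

# Regression: the outputs y are real numbers (a disease progression measure)
diabetes = load_diabetes()
reg = LinearRegression().fit(diabetes.data, diabetes.target)
value = reg.predict(diabetes.data[:1])

print("classifier output:", label, label.dtype)
print("regressor output: ", value, value.dtype)
```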

## Unsupervised Learning

* Also known as *descriptive* learning. Here the goal is "knowledge discovery": the detection of patterns in a dataset, which can then be used in supervised/model-based analyses. 


* Unsupervised learning is about *density estimation* - characterizing ${\rm Pr}(x|\theta,H)$.

## Unsupervised Learning

* Examples of unsupervised learning activities include:

  * Clustering analysis of the $x$.
  * Dimensionality reduction: principal component analysis, independent component analysis, etc.
  

* In this chunk we will focus on supervised learning, and some applications in astronomy.
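
For reference, the two unsupervised activities listed above might look like this in `scikit-learn`. This is only a sketch: the use of the `digits` inputs, two principal components, and ten clusters are assumptions chosen for illustration, and the labels $y$ are deliberately never used.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Only the inputs x are used; the labels are ignored
X = load_digits().data

# Dimensionality reduction: project the 64 features onto 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Clustering analysis: assign each sample to one of 10 clusters
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print(X.shape, "->", X_2d.shape)
print("first ten cluster assignments:", cluster_ids[:10])
```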


## Data Representations

* Each input $x$ is said to have $P$ *features* (or *attributes*), and represents a *sample* (assumed to have been drawn from a sampling distribution). Each sample input $x$ is associated with an output $y$.


* Our $N$ input *samples* are packaged into an $N \times P$ *design matrix* $X$ (with $N$ rows and $P$ columns).
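
A small sketch of this layout in `numpy`, using made-up measurements (the numbers and the choice of $N = 4$, $P = 3$ are hypothetical): rows of $X$ are samples, columns are features, and $y$ has one entry per row.

```python
import numpy as np

# Hypothetical measurements: N = 4 samples, each with P = 3 features
X = np.array([
    [0.1, 2.3, 14.2],
    [0.4, 1.9, 13.8],
    [0.2, 2.8, 15.1],
    [0.9, 2.1, 14.7],
])

# One output per sample
y = np.array([0, 1, 1, 0])

N, P = X.shape
print("N samples:", N, " P features:", P)

one_sample = X[2]        # a single row: all P features of sample 2
one_feature = X[:, 1]    # a single column: feature 1 for all N samples
```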

<img src="../graphics/ml_data_representation.svg" width=100%>

## Dataset Split

* We *train* our machine learning models on a subset of the data, and then test them against the remainder.

<img src="../graphics/ml_supervised_workflow.svg" width=100%>

<img src="../graphics/ml_train_test_split_matrix.svg" width=100%>

## Simple Example: The Digits Dataset

* Let's take a look at one of the `SciKit-Learn` example datasets, `digits`:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()

In [None]:
digits.images.shape

In [None]:
print(digits.images[0])

In [None]:
plt.matshow(digits.images[23], cmap=plt.cm.Greys)

In [None]:
digits.data.shape

In [None]:
digits.target.shape

In [None]:
digits.target[23]

## Simple Example: The Digits Dataset

* In `SciKit-Learn`,  `data` contains the design matrix $X$, and is a `numpy` array of shape $(N, P)$


* `target` contains the response variables $y$, and is a `numpy` array of shape $(N,)$

In [None]:
print(digits.DESCR)
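
To connect the `images` and `data` attributes seen above: each row of the design matrix is one 8x8 image flattened into 64 features. A quick check, as a sketch (the variable names `flattened` and `same` are introduced here for illustration):

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten each 8x8 image into a row of P = 64 features
N = digits.images.shape[0]
flattened = digits.images.reshape(N, -1)

# Compare against the design matrix that the dataset provides
same = np.array_equal(flattened, digits.data)
print(flattened.shape, same)
```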

## Splitting the data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(digits.data, digits.target, test_size=0.25)

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape
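
The split above can be extended to include a *validation* set, used for choosing between models before the final test. The following is a sketch under several assumptions: the 25% fractions, the fixed `random_state`, the `stratify` option, and the use of `KNeighborsClassifier` with a few candidate values of `n_neighbors` are all illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()

# First hold out a test set, then carve a validation set out of the remainder
X_rest, X_test, y_rest, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25,
    random_state=42, stratify=digits.target)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)

# Use the validation set to choose a model setting
best_k, best_score = None, -1.0
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Report accuracy on the test set, which played no part in the choice
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print("chosen n_neighbors:", best_k, " test accuracy:", round(test_accuracy, 3))
```

Fixing `random_state` makes the split reproducible; without it, each run draws a different random partition.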

## Other Example Datasets

`SciKit-Learn` provides a handful of "toy" datasets for tutorial purposes, all `load`-able in the same way, including:

Name        | Description
------------|:---------------------------------------
`boston`    | Boston house prices, with 13 associated measurements (R); removed in `scikit-learn` 1.2
`iris`      | Fisher's iris classifications (based on 4 characteristics) (C)
`diabetes`  | Diabetes disease progression, with 10 baseline measurements (R)
`digits`    | Hand-written digits, 8x8 images with classifications (C)
`linnerud`  | Linnerud: 3 exercise and 3 physiological variables (R)


* "R" and "C" indicate that the problem to be solved is either a regression or a classification, respectively.
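
As a sketch of what "`load`-able in the same way" means in practice, the loop below loads three of these datasets and records the shapes of their `data` and `target` arrays. The `loaders` and `shapes` dictionaries are names introduced here for illustration.

```python
from sklearn.datasets import load_diabetes, load_iris, load_linnerud

# Each loader returns an object with the same `data` and `target` attributes
loaders = {
    "iris": load_iris,
    "diabetes": load_diabetes,
    "linnerud": load_linnerud,
}

shapes = {}
for name, loader in loaders.items():
    dataset = loader()
    shapes[name] = (dataset.data.shape, dataset.target.shape)
    print(name, "data:", dataset.data.shape, "target:", dataset.target.shape)
```

Note that `linnerud` has a two-dimensional `target`: it is a multi-output regression problem, with three response variables per sample.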

## Looking for Structure

* A model's ability to make predictions depends on there being _structure_ in the data.

* If structure is present the data are informative, and vice versa.

* Feature design takes thought; thinking is aided by _data visualization_

In [None]:
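# Note: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run.
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)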

In [None]:
# Visualizing the Boston house price data:

import corner

X = boston.data
y = boston.target

plot = np.concatenate((X, np.atleast_2d(y).T), axis=1)
labels = np.append(boston.feature_names,'MEDV')

corner.corner(plot, labels=labels);
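
One way to probe the earlier claim that predictions depend on structure in the data is to destroy that structure and see what happens. The sketch below does this by randomly shuffling the labels of the `digits` dataset; the helper `held_out_accuracy`, the classifier, and the random seeds are assumptions introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()

# Shuffling the labels breaks the relationship between inputs and outputs
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(digits.target)

def held_out_accuracy(X, y):
    """Train on 75% of the samples and return accuracy on the other 25%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    return model.score(X_test, y_test)

acc_structured = held_out_accuracy(digits.data, digits.target)
acc_shuffled = held_out_accuracy(digits.data, y_shuffled)

print("accuracy with real labels:    ", round(acc_structured, 3))
print("accuracy with shuffled labels:", round(acc_shuffled, 3))
```

With ten classes, a model that has learned nothing should be right about one time in ten, which is roughly what the shuffled case should show.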

### Questions:

Talk to your neighbor for a few minutes about the things you have just heard about machine learning. 

* In this course so far, have we been working on supervised or unsupervised learning problems?

* Have we been talking about regression or classification problems? 

* Which of our example astronomical datasets is most similar to the `boston` dataset in `SciKit-Learn`? How do they differ?