### Reminder!

After pulling down the tutorial notebook, immediately make a copy. Then do not modify the original. Do your work in the copy. This will prevent the possibility of git conflicts should the version-controlled file change at any point in the future. (The same exhortation applies to homeworks.)

# Week 9 Tutorial 

## Introduction to Machine Learning

In this notebook you will start to explore the `scikit-learn` machine learning package for Python, see how it supports a range of machine learning models with a uniform terminology and API, and focus on model evaluation by cross-validation.

> Credit: some of the material in this tutorial is based on Andy Mueller's `scikit-learn` tutorial from the 2015 edition of "Astro Hack Week". The SDSS examples are based on a tutorial by Josh Bloom. 

### Requirements

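You will need to `pip install scikit-learn` and check that you have v0.18 or higher as a result. Note that the Boston examples below rely on `load_boston`, which was removed in `scikit-learn` v1.2, so those cells need a version older than that.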

## 1. Simple Example: The Digits Dataset

* Let's take a look at one of the `SciKit-Learn` example datasets, `digits`

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()

In [None]:
digits.images.shape

In [None]:
print(digits.images[0])

In [None]:
plt.matshow(digits.images[23], cmap=plt.cm.Greys)

In [None]:
digits.data.shape

In [None]:
digits.target.shape

In [None]:
digits.target[23]


* In `SciKit-Learn`, `data` contains the design matrix $X$, and is a `numpy` array of shape $(N, P)$, where $N$ is the number of samples and $P$ is the number of features


* `target` contains the response variables $y$, and is a `numpy` array of shape $(N,)$
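
As a quick check of these conventions, the sketch below confirms that each row of `data` is just the corresponding 8x8 image in `images` flattened into 64 features. The names `X`, `y`, `N`, `P` and `flattened` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

X = digits.data      # design matrix, shape (N, P)
y = digits.target    # response vector, shape (N,)
N, P = X.shape

# Each row of X is one 8x8 image unrolled into 64 pixel features:
flattened = digits.images.reshape(N, -1)
print(X.shape, y.shape, np.array_equal(flattened, X))
```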

In [None]:
print(digits.DESCR)

### Splitting the data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(digits.data, digits.target, test_size=0.25)

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape
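
Unless a seed is supplied, `train_test_split` draws a different random split each time it is called. A possible variation, sketched here, fixes `random_state` for reproducibility and uses `stratify` so that each digit class appears in roughly the same proportion in both subsets. The seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Fix the seed (42 is arbitrary) and preserve class proportions in both subsets:
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25,
    random_state=42, stratify=digits.target)

# Fraction of each digit class in the training and test sets:
train_frac = np.bincount(y_train) / len(y_train)
test_frac = np.bincount(y_test) / len(y_test)
print(np.round(train_frac, 3))
print(np.round(test_frac, 3))
```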

### Other Example Datasets

`SciKit-Learn` provides 5 "toy" datasets for tutorial purposes, all `load`-able in the same way:

Name        | Description
------------|:---------------------------------------
`boston`    | Boston house prices, with 13 associated measurements (R)
`iris`      | Fisher's iris classifications, based on 4 characteristics (C)
`diabetes`  | Diabetes disease progression, with 10 baseline measurements (R)
`digits`    | Handwritten digits, 8x8 images with classifications (C)
`linnerud`  | Linnerud: 3 exercise and 3 physiological variables (R)


* "R" and "C" indicate that the problem to be solved is either a regression or a classification, respectively.

### Looking for Structure

* A model's ability to make predictions depends on there being _structure_ in the data

* If structure is present then the data are informative; if it is absent, they are not.

* Feature design takes thought; thinking is aided by _data visualization_

In [None]:
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)

In [None]:
# Visualizing the Boston house price data:

import corner

X = boston.data
y = boston.target

plot = np.concatenate((X, np.atleast_2d(y).T), axis=1)
labels = np.append(boston.feature_names,'MEDV')

corner.corner(plot, labels=labels);

## 2. Fitting a Straight Line with `SciKit-Learn`

For a first application, let's see the straight line fitting problem solved via the machine learning approach. In the process, we'll look at how model (prediction) accuracy is quantified, and then generalized via cross-validation.  

### Further Reading

Ivezic Sections 8.1 and 8.2 (linear regression), and Section 8.11 for cross-validation

### Linear Regression

Straight line fitting is a [linear regression](http://scikit-learn.org/stable/modules/linear_model.html) problem - and an example of predictive learning. 

A predictive model can be said to have been "fitted" to the data when an assumed _loss function_ has been minimized. A popular choice of minimized loss function is the following, corresponding to the method of "ordinary least squares":

$$ \text{min}_{w, b} \sum_i || w^\mathsf{T}x_i + b  - y_i||^2 $$

> While this loss function is derivable from statistical principles, the machine only needs it to be encoded. 
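
To make the loss function concrete, here is a minimal sketch that minimizes it directly with `numpy` on synthetic data. The slope, intercept and noise level are made-up values for illustration, and `loss` is a helper defined only for this example. Appending a column of ones to $x$ lets the intercept $b$ be solved for along with $w$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic straight-line data; true_w, true_b and the noise level are illustrative:
true_w, true_b = 2.0, -1.0
x = rng.uniform(0.0, 10.0, size=100)
y = true_w * x + true_b + rng.normal(0.0, 0.5, size=100)

def loss(w, b):
    """Sum of squared residuals for slope w and intercept b."""
    return np.sum((w * x + b - y) ** 2)

# Least-squares solution: augment x with a column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])
(w_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

print("w =", w_hat, "b =", b_hat, "loss =", loss(w_hat, b_hat))
```

Any other choice of $(w, b)$, including the true values used to generate the data, gives a loss at least as large as the fitted one.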


If we fit a straight line to a subset of the data (the training set), the accuracy of the linear model's predictions in the remainder of the data (the test set) can be checked. 

Let's fit some example data with a straight line using the `SciKit-Learn` library, and see how accurately we can make our predictions.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn import datasets, linear_model

In [None]:
# Code source: Jaques Grobler
# License: BSD 3 clause

# Load the boston dataset, and focus on just one attribute: 
# LSTAT (attribute 12)
boston = datasets.load_boston()

# Package into design matrix X and target vector y:
X = np.atleast_2d(boston.data[:,12]).T
y = np.atleast_2d(boston.target).T

# Make a training/test split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.25)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

In [None]:
plt.scatter(X_test, y_test, color='black')
plt.xlabel('LSTAT')
plt.ylabel('MEDV');

### The `Linear` Model

`scikit-learn` machine learning models have a common API:

* a `fit` method that optimizes the model's internal parameters given training data features and target values

* a `predict` method that returns model-predicted target values given test data features

* various `score`s for quantifying performance 

In [None]:
# Create linear regression model object:
model = linear_model.LinearRegression()

# Train the model using the training set:
model.fit(X_train, y_train)

# The coefficients:
print("Coefficients:", model.coef_)
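
Because every estimator exposes the same `fit`, `predict` and `score` methods, swapping in a different model needs almost no new code. The sketch below illustrates this on synthetic data (made up for the example, and named `*_toy` so as not to overwrite the notebook's variables), using `KNeighborsRegressor` purely as a second model for comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)

# Synthetic one-feature data set (illustrative values only):
X_toy = rng.uniform(0.0, 10.0, size=(200, 1))
y_toy = 3.0 * X_toy[:, 0] + 2.0 + rng.normal(0.0, 1.0, size=200)

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# The same three calls work for both estimators:
scores = {}
for estimator in (LinearRegression(), KNeighborsRegressor(n_neighbors=5)):
    estimator.fit(Xtr, ytr)                    # optimize internal parameters
    y_pred = estimator.predict(Xte)            # predict targets for new features
    name = type(estimator).__name__
    scores[name] = estimator.score(Xte, yte)   # R^2 for regressors
    print(name, y_pred.shape, round(scores[name], 3))
```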

### Scoring

* The "mean squared error" between the model predictions and the truth is a useful metric: minimizing MSE corresponds to minimizing the "empirical risk," defined as the loss function averaged over the available data samples, where the loss function is quadratic. Writing $\bar{y} = \mathcal{E} \left[ \hat{y} \right]$ for the mean prediction, the cross term averages to zero and the MSE splits into two parts:


$\;\;\;\;\;{\rm MSE} = \mathcal{E} \left[ (\hat{y} - y^{\rm true})^2 \right] = \mathcal{E} \left[ (\hat{y} - \bar{y} + \bar{y} - y^{\rm true})^2 \right] = \mathcal{E} \left[ (\hat{y} - \bar{y})^2 \right] + (\bar{y} - y^{\rm true})^2$

$\;\;\;\;\;\;\;\;\;\;\;\;\; = {\rm var}(\hat{y}) + {\rm bias}^2(\hat{y})$


* In general, different models reach different balances between the variance and bias of their predictions

* A particular choice of loss function leads to a corresponding minimized risk
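
The variance-plus-bias decomposition of the MSE can be checked numerically. In this sketch, which assumes a made-up true value and a deliberately biased, noisy predictor, the expectation $\mathcal{E}$ is approximated by an average over many simulated predictions $\hat{y}$ of a single true value.

```python
import numpy as np

rng = np.random.default_rng(2)

# One fixed true value, and many noisy predictions of it from a biased
# predictor (the offset 0.3 and scatter 0.5 are illustrative choices):
y_true = 1.0
y_hat = y_true + 0.3 + rng.normal(0.0, 0.5, size=10000)

mse = np.mean((y_hat - y_true) ** 2)
variance = np.var(y_hat)                       # mean of (y_hat - mean(y_hat))^2
bias_squared = (np.mean(y_hat) - y_true) ** 2

print("MSE          =", mse)
print("var + bias^2 =", variance + bias_squared)
```

The two printed numbers agree to floating-point precision, because the identity holds exactly for sample averages as well as for expectations.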

In [None]:
# The mean square prediction error:
print("Training data: MSE = %.2f"
      % np.mean((model.predict(X_train) - y_train) ** 2))
print("Test data: MSE = %.2f"
      % np.mean((model.predict(X_test) - y_test) ** 2))

# The "explained variance" R2 score: 1 is perfect prediction:
print('Training data: R^2 score = %.2f' % model.score(X_train, y_train))
print('Test data: R^2 score = %.2f' % model.score(X_test, y_test))

### R2 scores

* The "explained variance" $R^2$ "score" is defined as $(1 - u/v)$, where $u$ is the *residual sum of squares* $\sum (y_{\rm true} - y_{\rm pred})^2$ and $v$ is the *total sum of squares* $\sum (y_{\rm true} - \overline{y_{\rm true}})^2$.


* The best possible $R^2$ score is 1.0. A model that has mean squared error equal to the variance in the data (for example, one that always predicts the mean of $y$) gets a score of zero. Models that do systematically worse than this have negative $R^2$ scores.


* In general we expect the training score to be higher than the test score. 
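
As a sketch of this definition, $R^2$ can be computed by hand and compared with `sklearn.metrics.r2_score`. The small arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions:
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

u = np.sum((y_true - y_pred) ** 2)           # sum of squared prediction errors
v = np.sum((y_true - y_true.mean()) ** 2)    # sum of squares about the mean
r2_by_hand = 1.0 - u / v

# A model that always predicts the mean of the data scores zero:
y_mean_only = np.full_like(y_true, y_true.mean())

print(r2_by_hand, r2_score(y_true, y_pred), r2_score(y_true, y_mean_only))
```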

In [None]:
# Plot outputs:
plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, model.predict(X_test), color='blue', linewidth=3)
plt.xlabel('LSTAT')
plt.ylabel('MEDV');

### Questions:

* How is this procedure different from previous occasions we have fitted a straight line? 

* Is it "Frequentist" or "Bayesian"? Why? 

* Does it follow the likelihood principle?

### Optimizing Model Prediction Accuracy

* In supervised machine learning the usual goal is to make the most accurate predictions we can - which means neither over-fitting nor under-fitting the data 

* Above, we made one training/test split, and computed the (mean squared) prediction error. The model that minimizes the *generalized prediction error* can be found (approximately) with *cross validation*.

* In cross validation we consider multiple training/test splits, and look at the _mean score_ across all of these _"folds."_
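
To see what cross validation does under the hood, here is a minimal sketch that loops over the folds by hand on synthetic data (made up for the example) and compares the result with `cross_val_score`. It assumes unshuffled `KFold` splitting, which is what an integer `cv` gives for a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)

# Synthetic data, named *_toy to avoid clobbering the notebook's X and y:
X_toy = rng.uniform(0.0, 10.0, size=(100, 1))
y_toy = 1.5 * X_toy[:, 0] + rng.normal(0.0, 1.0, size=100)

# Train on 4/5 of the data and score on the held-out 1/5, five times over:
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_toy):
    fold_model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))

auto_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5, scoring='r2')
print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3), "mean:", round(auto_scores.mean(), 3))
```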

In [None]:
from sklearn.model_selection import cross_val_score

model = linear_model.LinearRegression()

cross_val_score(model, X, y, cv=5, scoring='r2')

### Cross Validation Fold Design

* How we design the folds matters: we want each subset of the data to be a _fair sample_ of the whole.


* In this problem, we want to select the LSTAT values randomly (rather than sequentially), and so we make a `ShuffleSplit`

In [None]:
from sklearn.model_selection import ShuffleSplit

shuffle_split = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
cross_val_score(model, X, y, cv=shuffle_split)
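
The difference between sequential and shuffled folds is easiest to see on a tiny index array. This sketch lists the test indices chosen by `KFold` (without shuffling) and by `ShuffleSplit` for 10 samples; the sample count and seed are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

samples = np.arange(10).reshape(-1, 1)   # ten samples, one feature

# Unshuffled KFold takes consecutive blocks of the data as test sets:
kfold_tests = [test for _, test in KFold(n_splits=5).split(samples)]

# ShuffleSplit draws each test set at random, independently for every split:
shuffler = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
shuffle_tests = [test for _, test in shuffler.split(samples)]

for k_test, s_test in zip(kfold_tests, shuffle_tests):
    print("KFold test:", k_test, "  ShuffleSplit test:", s_test)
```

Because the `ShuffleSplit` test sets are drawn independently, a given sample can appear in several test sets or in none, unlike with `KFold`.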

### Generalized Scoring

With our 10-fold shuffle splits, we can calculate generalized accuracy scores that could be used in a cross-validation model comparison.

In [None]:
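# scikit-learn scorers follow a "higher is better" convention, so `MSE` holds
# the negated mean squared error of each fold; flip the sign of the mean:
MSE = cross_val_score(model, X, y, cv=shuffle_split, scoring='neg_mean_squared_error')
GE, errGE = -np.mean(MSE), np.std(MSE)/np.sqrt(len(MSE))
print("Generalization error:", GE, "+/-", errGE)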

In [None]:
R2 = cross_val_score(model, X, y, cv=shuffle_split, scoring='r2')
meanR2, errR2 = np.mean(R2), np.std(R2)/np.sqrt(len(R2))
print("Generalized R2 score:", meanR2, "+/-", errR2)
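
If both metrics are wanted, `cross_validate` can evaluate them in a single pass over the folds. The sketch below uses synthetic data (made up for the example) and a small hypothetical helper, `mean_and_error`, that mirrors the standard-error calculation above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

rng = np.random.default_rng(4)

# Synthetic data, named *_toy to avoid clobbering the notebook's X and y:
X_toy = rng.uniform(0.0, 10.0, size=(200, 1))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(0.0, 1.0, size=200)

def mean_and_error(values):
    """Mean of the fold scores and the standard error on that mean."""
    values = np.asarray(values)
    return np.mean(values), np.std(values) / np.sqrt(len(values))

splitter = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
results = cross_validate(LinearRegression(), X_toy, y_toy, cv=splitter,
                         scoring=('r2', 'neg_mean_squared_error'))

# Scores come back under 'test_<metric name>'; flip the sign of the negated MSE:
toy_R2, toy_errR2 = mean_and_error(results['test_r2'])
toy_GE, toy_errGE = mean_and_error(-results['test_neg_mean_squared_error'])
print("R2 =", toy_R2, "+/-", toy_errR2)
print("GE =", toy_GE, "+/-", toy_errGE)
```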

### Model Expansion

* Let's expand our linear model to include some higher order terms (quadratic, cubic, etc.). This can be done by adding additional feature columns to the design matrix $X$. We are still just predicting $y$, but now we'll be asking for more coefficients.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=4)
XX = poly.fit_transform(X)
XX
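
To see exactly which columns are added, it helps to transform a tiny, made-up design matrix. For a single feature $x$ and `degree=3`, the expanded matrix has columns $1, x, x^2, x^3$. The names `X_small`, `XX_small` and `expander` are introduced only for this sketch.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A made-up design matrix with two samples and one feature:
X_small = np.array([[2.0],
                    [3.0]])

expander = PolynomialFeatures(degree=3)
XX_small = expander.fit_transform(X_small)

# Columns are x^0 (the constant "bias" column), x^1, x^2 and x^3:
print(XX_small)
```

Since `LinearRegression` fits its own intercept by default, the constant column is redundant here, though harmless.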

In [None]:
polymodel = linear_model.LinearRegression()

poly_split = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

R2 = cross_val_score(polymodel, XX, y, cv=poly_split, scoring='r2')

poly_meanR2, poly_errR2 = np.mean(R2), np.std(R2)/np.sqrt(len(R2))
print("Polynomial: generalized R^2 score:", np.round(poly_meanR2, 2), "+/-", np.round(poly_errR2, 2))
print("Straight line: generalized R^2 score:", np.round(meanR2, 2), "+/-", np.round(errR2, 2))

### Model Checking

As usual, it's a good idea to check the model's performance by visualizing its predictions in data space.


With `cross_val_predict`, each data point is predicted by a model trained on the folds that do not contain it, so we can plot a cross-validated prediction for every point in the data set.

In [None]:
from sklearn.model_selection import cross_val_predict

# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:

y_straightline = cross_val_predict(model, X, y, cv=10)
y_polynomial = cross_val_predict(polymodel, XX, y, cv=10)

In [None]:
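# Plot outputs, sorting by LSTAT so that the lines are drawn from left to right:
order = np.argsort(X[:, 0])
plt.scatter(X, y,  color='black', alpha=0.1)
plt.plot(X[order], y_straightline[order], color='blue', linewidth=2, alpha=0.4)
plt.plot(X[order], y_polynomial[order], color='green', linewidth=2, alpha=0.4)
plt.xlabel('LSTAT')
plt.ylabel('MEDV');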

* In this example, the polynomial degree is a control parameter that needs to be set: we can search this parameter space for the value that gives the highest average cross validation score (or lowest generalization error). 
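
One way to carry out such a search is sketched below, on synthetic data generated from a cubic (the coefficients and noise level are made up). `make_pipeline` chains the feature expansion and the regression so that both are refit within each fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)

# Synthetic cubic data, named *_toy to avoid clobbering the notebook's X and y:
X_toy = rng.uniform(-2.0, 2.0, size=(200, 1))
y_toy = X_toy[:, 0] ** 3 - X_toy[:, 0] + rng.normal(0.0, 0.3, size=200)

splitter = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

# Mean cross-validated R^2 for each candidate polynomial degree:
mean_scores = {}
for degree in range(1, 8):
    pipeline = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    fold_scores = cross_val_score(pipeline, X_toy, y_toy, cv=splitter, scoring='r2')
    mean_scores[degree] = fold_scores.mean()
    print(degree, round(mean_scores[degree], 3))

best_degree = max(mean_scores, key=mean_scores.get)
print("Best degree:", best_degree)
```

Degrees above the true one often score almost as well, so in practice it is worth comparing the differences between scores with their standard errors before preferring a more complex model.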

### Multiple Linear Regression

* The Boston dataset has 13 attributes, more than one of which might contain information about house prices in the city. Let's train a linear model on all these attributes, and see if we can improve our score.

In [None]:
# Define a linear model:
supermodel = linear_model.LinearRegression()

# Use all the data, and set up a 10-fold cross validation run:
super_split = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

# Carry out the cross-validation of the model, training, testing and reporting:
R2 = cross_val_score(supermodel, boston.data, boston.target, cv=super_split, scoring='r2')
                           
# Compute our model prediction accuracy score, for comparison with other models:
hpmeanR2, hperrR2 = np.mean(R2), np.std(R2)/np.sqrt(len(R2))
print("Hyperplane: generalized R^2 score:", np.round(hpmeanR2, 2), "+/-", np.round(hperrR2, 2))
print("Straight line: generalized R^2 score:", np.round(meanR2, 2), "+/-", np.round(errR2, 2))
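
The same comparison can be made on any regression data set. As a sketch, here it is repeated on the `diabetes` toy data set from the table above, comparing a single attribute with all ten. The helper `generalized_r2` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

diabetes = load_diabetes()
splitter = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

def generalized_r2(features):
    """Mean cross-validated R^2 and its standard error for a linear model."""
    r2 = cross_val_score(LinearRegression(), features, diabetes.target,
                         cv=splitter, scoring='r2')
    return np.mean(r2), np.std(r2) / np.sqrt(len(r2))

# One attribute only (column 2, body mass index) versus all 10 attributes:
one_mean, one_err = generalized_r2(diabetes.data[:, [2]])
all_mean, all_err = generalized_r2(diabetes.data)

print("One attribute:  R^2 =", np.round(one_mean, 2), "+/-", np.round(one_err, 2))
print("All attributes: R^2 =", np.round(all_mean, 2), "+/-", np.round(all_err, 2))
```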

### What just happened?

* We just went from a simple hypothesis (median house price `MEDV` depends on `LSTAT`) to a much more complex one (house price could depend on all of our 13 measured attributes) in one step. The data analysis is *automated*, in the sense that we simply fed our machine new inputs and it processed them.


* Using all our data, we are now better at predicting house prices - *but we have gained no new understanding of how the Boston housing market works.*