### Reminder!

After pulling down the tutorial notebook, immediately make a copy. Then do not modify the original. Do your work in the copy. This will prevent the possibility of git conflicts should the version-controlled file change at any point in the future. (The same exhortation applies to homeworks.)

# Week 9 Tutorial 

## Introduction to Machine Learning

In this notebook you will start to explore `scikit-learn`, a machine learning package for Python, and see how it supports a range of machine learning models with a uniform terminology and API, and how it emphasizes model evaluation by cross-validation.

> Credit: some of the material in this tutorial is based on Andy Mueller's `scikit-learn` tutorial from the 2015 edition of "Astro Hack Week". The SDSS examples are based on a tutorial by Josh Bloom. 

### Requirements

You will need to `pip install scikit-learn` and then check that the installed version is v0.18 or higher.

## Simple Example: The Digits Dataset

* Let's take a look at one of the `scikit-learn` example datasets, `digits`.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()

In [None]:
digits.images.shape

In [None]:
print(digits.images[0])

In [None]:
plt.matshow(digits.images[23], cmap=plt.cm.Greys)

In [None]:
digits.data.shape

In [None]:
digits.target.shape

In [None]:
digits.target[23]


* In `scikit-learn`, `data` contains the design matrix $X$, and is a `numpy` array of shape $(N, P)$, where $N$ is the number of samples and $P$ is the number of features.


* `target` contains the response variables $y$, and is a `numpy` array of shape $(N,)$.
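
As a quick check of the shapes described above, the following sketch confirms that each row of `digits.data` is one 8x8 image flattened into $P = 64$ features. The variable names `flattened` and `same` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# The design matrix X has one row per sample and one column per feature.
N, P = digits.data.shape

# Flatten each 8x8 image into a length-64 row and compare with `data`.
flattened = digits.images.reshape(len(digits.images), -1)
same = np.array_equal(flattened, digits.data)

print("N =", N, "P =", P)
print("target shape:", digits.target.shape)
print("data equals flattened images:", same)
```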

In [None]:
print(digits.DESCR)

## Splitting the data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(digits.data, digits.target, test_size=0.25)

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape
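
The split above is random, so the rows that land in each subset change from run to run. A minimal sketch of a reproducible variant, assuming you want fixed results and balanced classes, passes `random_state` and `stratify` to `train_test_split`. Both are optional arguments that the original cell does not use, and the value `0` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# random_state fixes the shuffle; stratify keeps the class proportions
# (roughly) equal in the training and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.25, random_state=0, stratify=digits.target)

# Fraction of each digit class in the two subsets.
train_fractions = np.bincount(y_train) / len(y_train)
test_fractions = np.bincount(y_test) / len(y_test)

print(X_train.shape, X_test.shape)
print(np.round(train_fractions, 3))
print(np.round(test_fractions, 3))
```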

## Other Example Datasets

`scikit-learn` provides 5 "toy" datasets for tutorial purposes, each loaded in the same way by a `load_<name>()` function:

Name       | Description
-----------|:---------------------------------------
`boston`   | Boston house prices, with 13 associated measurements (R)
`iris`     | Fisher's iris classifications, based on 4 characteristics (C)
`diabetes` | Diabetes: 10 baseline measurements vs. disease progression one year later (R)
`digits`   | Hand-written digits, 8x8 images with classifications (C)
`linnerud` | Linnerud: 3 exercise and 3 physiological variables (R)


* "R" and "C" indicate that the problem to be solved is either a regression or a classification, respectively.
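
To illustrate the shared loading interface, the sketch below loops over four of the loaders and records the shapes of `data` and `target`. The `loaders` and `shapes` dictionaries are names introduced here for illustration. `boston` is left out of the loop because `load_boston` is not available in recent `scikit-learn` releases.

```python
from sklearn.datasets import (load_diabetes, load_digits,
                              load_iris, load_linnerud)

# Map each dataset name to its loader function.
loaders = {
    "iris": load_iris,
    "diabetes": load_diabetes,
    "digits": load_digits,
    "linnerud": load_linnerud,
}

shapes = {}
for name, loader in loaders.items():
    bunch = loader()  # every loader returns an object with .data and .target
    shapes[name] = (bunch.data.shape, bunch.target.shape)
    print(name, "data:", bunch.data.shape, "target:", bunch.target.shape)
```

Note that `linnerud` has a two-dimensional `target`, since it has three response variables per sample.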

## Looking for Structure

* A model's ability to make predictions depends on there being _structure_ in the data

* If structure is present the data are informative, and vice versa.

* Feature design takes thought; thinking is aided by _data visualization_
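
Alongside visualization, one crude numerical check for structure is the linear correlation between each feature and the target. The sketch below does this for the `diabetes` dataset, chosen here only as a convenient regression example. A low correlation does not rule out nonlinear structure, so treat this as a first look rather than a test.

```python
import numpy as np
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Pearson correlation coefficient between each feature column and the target.
correlations = np.array(
    [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# List features from most to least (linearly) informative.
ranked = sorted(zip(diabetes.feature_names, correlations),
                key=lambda pair: -abs(pair[1]))
for name, r in ranked:
    print(f"{name:>4s}  r = {r:+.2f}")
```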

In [None]:
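# NOTE: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell and the next one need an older scikit-learn release.
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)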

In [None]:
# Visualizing the Boston house price data:

import corner

X = boston.data
y = boston.target

plot = np.concatenate((X, np.atleast_2d(y).T), axis=1)
labels = np.append(boston.feature_names,'MEDV')

corner.corner(plot, labels=labels);