# Machine Learning in Astronomy

Goals:

* Work through example regression and classification analyses using the `scikit-learn` package and the SDSS catalog dataset.

> Credit: much of the material in this chunk is based on Josh Bloom's SDSS example notebook, from the 2014 edition of "Astro Hack Week".

## Photometric Redshift Estimation and Quasar Classification

Modern wide-field surveys are generating very large databases of automatically measured objects, whose error properties may not be well understood.

Fast machine learning algorithms are proving to be very useful in such a regime.

Let's investigate the SDSS photometric object catalog, and look for machine learning solutions to the following two problems:

1. Estimating the redshifts of quasars from their photometry (regression)

2. Selecting quasars from a background of stars and galaxies (classification)

## Data Acquisition

From the SDSS Sky Server we've downloaded two types of photometry (aperture and Petrosian), corrected for extinction, for a number of sources with redshifts. Here's the SQL for an example query that gets us 10000 example quasars:

<font color="blue">
 <pre>SELECT *,dered_u - mag_u AS diff_u, dered_g - mag_g AS diff_g, dered_r - mag_r AS diff_g, dered_i - mag_i AS diff_i, dered_z - mag_z AS diff_z from
(SELECT top 10000
objid, ra, dec, dered_u,dered_g,dered_r,dered_i,dered_z,psfmag_u-extinction_u AS mag_u,
psfmag_g-extinction_g AS mag_g, psfmag_r-extinction_r AS mag_r, psfmag_i-extinction_i AS mag_i,psfmag_z-extinction_z AS mag_z,z AS spec_z,dered_u - dered_g AS u_g_color, 
dered_g - dered_r AS g_r_color,dered_r - dered_i AS r_i_color,dered_i - dered_z AS i_z_color,class
FROM SpecPhoto 
WHERE 
 (class = 'QSO')
 ) as sp
 </pre>
</font>

We've got 1000 stars and 1000 galaxies as well, and saved them for convenience.

In [None]:
# For pretty plotting
# !pip install --upgrade seaborn

In [None]:
from __future__ import print_function  # __future__ imports must come first in the cell
import copy
import pandas as pd
pd.set_option('display.max_columns', None)
%pylab inline
import seaborn as sns
sns.set()

## Photometric Redshift Estimation

* This is a regression problem: we want to predict the redshift (the response variable) given a number of photometric measurements (the "features")


* The SDSS spectroscopic quasar sample provides a _training set_ for the prediction of photometric redshifts


* Let's read in our SDSS quasars, and munge them into machine learning inputs

In [None]:
qsos = pd.read_csv("../examples/SDSSCatalog/data/qso10000.csv",index_col=0,usecols=["objid","dered_r","spec_z","u_g_color",\
                                                "g_r_color","r_i_color","i_z_color","diff_u",\
                                                "diff_g1","diff_i","diff_z"])

# Clean out extreme colors and bad magnitudes:
qsos = qsos[(qsos["dered_r"] > -9999) & (qsos["g_r_color"] > -10) & (qsos["g_r_color"] < 10)]


# Response variables: redshift
qso_redshifts = qsos["spec_z"]

# Features or attributes: photometric measurements
qso_features = copy.copy(qsos)
del qso_features["spec_z"]
qso_features.head()

## Visual Exploration

* Over what redshift range do we have quasar spectra?

* What structure is there in the _photometric feature space_ that might help us predict redshift?


Let's plot all the features, colored by the target redshift, to look for structure.

In [None]:
bins =  hist(qso_redshifts.values,bins=100) ; xlabel("redshift") ; ylabel("N")

In [None]:
import matplotlib as mpl
import matplotlib.cm as cm

# Truncate the color at z=2.5 just to keep some contrast.
norm = mpl.colors.Normalize(vmin=min(qso_redshifts.values), vmax=2.5)
cmap = cm.jet_r
m = cm.ScalarMappable(norm=norm, cmap=cmap)

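# Plot everything against everything else, for the first 2000 quasars
# (the color array must have the same length as the plotted subset):
rez = pd.plotting.scatter_matrix(qso_features[0:2000], alpha=0.2, figsize=[15,15],
                                 c=m.to_rgba(qso_redshifts.values[0:2000]))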

## Feature Data and Target Values

Now we have our machine learning inputs (photometric features: colors and magnitudes) and outputs (target redshift values):

In [None]:
X = qso_features.values  # Data: 9-d feature space
y = qso_redshifts.values # Target: redshifts

In [None]:
print("Design matrix shape =", X.shape)
print("Response variable vector shape =", y.shape)

### Linear Regression

Let's follow the same procedure as in the [`SciKit-Learn` Linear Regression tutorial](../../scikit-learn/Linear_Regression.ipynb):

In [None]:
# Split the data into a training set and a test set:
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation in old releases
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Instantiate a "linear model"
from sklearn import linear_model
linear = linear_model.LinearRegression()

# Fit the model, using all the attributes:
linear.fit(X_train, y_train)

# Do the prediction on the test data:
y_lr_pred = linear.predict(X_test)

# How well did we do? Compute the mean squared error (MSE):
from sklearn.metrics import mean_squared_error
mse_linear = mean_squared_error(y_test,y_lr_pred)
r2_linear = linear.score(X_test, y_test)
print("Linear regression: MSE = ",mse_linear)
print("R2 score =",r2_linear)

In [None]:
plot(y_test,y_lr_pred - y_test,'o',alpha=0.2)
title("Linear Regression Residuals - MSE = %.2f" % mse_linear)
xlabel("Spectroscopic Redshift")
ylabel("Residual")
hlines(0,min(y_test),max(y_test),color="red")

Just how bad is this? Here's the MSE from guessing the *average redshift of the training set* for all new objects:

In [None]:
# Guess the training set mean for every test object, and compute the MSE of that guess:
print("Naive MSE", np.mean((y_test - y_train.mean())**2))
print("Linear regression: MSE = ",mse_linear)
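
To see how this naive baseline relates to the R2 score printed earlier, here is a minimal sketch using made-up data rather than the SDSS catalog. It assumes the baseline mean is taken over the same sample that is being scored, in which case R2 is one minus the ratio of the model MSE to the naive MSE. The arrays `y_true` and `y_model` are invented for illustration.

```python
import numpy as np

# Synthetic "redshifts" and an imperfect model prediction (illustrative only):
rng = np.random.RandomState(42)
y_true = rng.uniform(0.0, 3.0, size=500)
y_model = y_true + rng.normal(0.0, 0.3, size=500)

# MSE of the model, and of the naive "always guess the mean" predictor:
mse_model = np.mean((y_model - y_true)**2)
mse_naive = np.mean((y_true.mean() - y_true)**2)

# R2 compares the two: 1 is perfect, 0 is no better than guessing the mean.
r2 = 1.0 - mse_model / mse_naive
print("MSE (model) =", mse_model)
print("MSE (naive) =", mse_naive)
print("R2 =", r2)
```

An R2 score close to zero therefore means the model is doing little better than the naive guess.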

In [None]:
mean_squared_error?

### *k*-Nearest Neighbor (KNN) Regression

Now let's try a different kind of model: a *non-parametric* one.

["Regression based on k-nearest neighbors. The target is predicted by local interpolation of the targets associated of the nearest neighbors in the training set."](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
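
As a rough illustration of what that description means, here is a minimal `numpy` sketch of KNN regression with uniform weights and Euclidean distance. The function `knn_predict` and the toy arrays are hypothetical, written for this example only: this is not how `scikit-learn` implements the method internally (it uses tree-based neighbor searches for speed).

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=5):
    """Predict each target as the unweighted mean of its k nearest training targets."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    # Euclidean distances between every new point and every training point:
    diffs = X_new[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    dists = np.sqrt((diffs**2).sum(axis=2))
    # Indices of the k closest training points for each new point:
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Toy 1-d example, where each target equals its feature value:
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(knn_predict(X_toy, y_toy, [[2.1]], k=3))  # averages the targets 2, 3 and 1
```

Note that nothing is "fit" here: the training data themselves are the model, and the prediction is made by looking them up.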


#### Question:

What underlying model is implied by the KNN algorithm? How many hidden parameters does it have?

In [None]:
from sklearn import neighbors
from sklearn import preprocessing

X_scaled = preprocessing.scale(X) # Many methods work better on scaled X.

KNN = neighbors.KNeighborsRegressor(5)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

KNN.fit(X_train,y_train)

In [None]:
y_knn_pred = KNN.predict(X_test)
mse_knn = mean_squared_error(y_test,y_knn_pred)
r2_knn = KNN.score(X_test, y_test)
print("MSE (KNN) =", mse_knn)
print("R2 score (KNN) =",r2_knn)
print("cf.")
print("MSE (linear regression) = ",mse_linear)
print("R2 score (linear regression) =",r2_linear)

In [None]:
plot(y_test, y_knn_pred - y_test,'o',alpha=0.2)
title("k-NN Residuals - MSE = %.2f" % mse_knn)
xlabel("Spectroscopic Redshift")
ylabel("Residual")
hlines(0,min(y_test),max(y_test),color="red")

### Tuning the KNN Model

* Let's vary the control parameters of the KNN model, to see how good we can make our predictions.

* We can see our options in the model `repr`:

> KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=5, p=2, weights='uniform')

* Let's first make a "validation curve" to investigate one parameter: the number of nearest neighbors averaged over.

In [None]:
# We'll vary the number of neighbors used:
param_name = "n_neighbors"
param_range = np.array([1,2,4,8,16,32,64])

# And we'll need a cv iterator:
from sklearn.model_selection import ShuffleSplit
shuffle_split = ShuffleSplit(n_splits=10, test_size=0.4)

# Compute our cv scores for a range of the no. of neighbors:
from sklearn.model_selection import validation_curve
training_scores, validation_scores = validation_curve(KNN, X_scaled, y,
                                                      param_name=param_name,
                                                      param_range=param_range, 
                                                      cv=shuffle_split, scoring='r2')

In [None]:
def plot_validation_curve(param_name,parameter_values, training_scores, validation_scores):
    training_scores_mean = np.mean(training_scores, axis=1)
    training_scores_std = np.std(training_scores, axis=1)
    validation_scores_mean = np.mean(validation_scores, axis=1)
    validation_scores_std = np.std(validation_scores, axis=1)

    plt.fill_between(parameter_values, training_scores_mean - training_scores_std,
                     training_scores_mean + training_scores_std, alpha=0.1, color="r")
    plt.fill_between(parameter_values, validation_scores_mean - validation_scores_std,
                     validation_scores_mean + validation_scores_std, alpha=0.1, color="g")
    plt.plot(parameter_values, training_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(parameter_values, validation_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.ylim(validation_scores_mean.min() - .1, training_scores_mean.max() + .1)
    plt.xlabel(param_name)
    plt.legend(loc="best")

In [None]:
plot_validation_curve(param_name, param_range, training_scores, validation_scores)

#### Question:

Can you explain the shapes of these two curves? Talk to your neighbor for a few minutes, and be prepared to suggest reasons for a) the rise and fall of the cross-validation score and b) the monotonic decrease in training score.

#### Model tuning with `GridSearchCV`

* Now, let's see if we can do better by varying some other KNN options as well - in a *grid search*.

In [None]:
param_grid = {'n_neighbors': np.array([1,2,4,8,16,32,64]),
                  'weights': ['uniform','distance'],
                       'p' : np.array([1,2])}

np.set_printoptions(suppress=True)
print(param_grid)

In [None]:
from sklearn.model_selection import GridSearchCV
KNN_tuned = GridSearchCV(KNN, param_grid, verbose=3)

A `GridSearchCV` object behaves just like a model, except it carries out a cross-validation while fitting:

<img src="../../scikit-learn/figures/grid_search_cross_validation.svg" width=100%>
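
To make that concrete, here is a rough sketch of the loop that a grid search carries out, written by hand on a small synthetic regression problem. The data (`X_demo`, `y_demo`) and the reduced parameter grid are invented for illustration, and the real `GridSearchCV` does more bookkeeping than this.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 1-d regression problem (illustrative only):
rng = np.random.RandomState(0)
X_demo = rng.uniform(0.0, 10.0, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0.0, 0.2, size=300)

best_score, best_params = -np.inf, None
for n_neighbors in [1, 4, 16, 64]:
    for weights in ["uniform", "distance"]:
        model = KNeighborsRegressor(n_neighbors=n_neighbors, weights=weights)
        # Mean R2 over 3 cross-validation folds for this parameter combination:
        score = cross_val_score(model, X_demo, y_demo, cv=3, scoring="r2").mean()
        if score > best_score:
            best_score = score
            best_params = {"n_neighbors": n_neighbors, "weights": weights}

print("Best parameters:", best_params, "with mean CV R2 =", best_score)

# Finally, refit the winning model on all of the data that was searched over:
best_model = KNeighborsRegressor(**best_params).fit(X_demo, y_demo)
```

The cost scales with the number of grid points times the number of folds, which is why larger grids can take a while.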

In [None]:
KNN_tuned.fit(X_train, y_train)

In [None]:
y_knn_tuned_pred = KNN_tuned.predict(X_test)

mse_knn_tuned = mean_squared_error(y_test,y_knn_tuned_pred)
r2_knn_tuned = KNN_tuned.score(X_test, y_test)

print("MSE (tuned KNN) =", mse_knn_tuned)
print("R2 score (tuned KNN) =",r2_knn_tuned)
print("cf.")
print("MSE (KNN) = ",mse_knn)
print("R2 score (KNN) =",r2_knn)

Which are the best KNN control parameters we found?

In [None]:
KNN_tuned.best_params_

This value of `n_neighbors` is consistent with the peak in cross-validation score in the validation curve plot.

#### Generalization Error

Notice that all the above tuning happened while training on a single split (`X_train` and `y_train`).


It's possible that this particular split prefers a slightly different set of parameters than another split would - so to assess our generalization error, we need a further level of cross-validation.


We can do this by passing a `GridSearchCV` model to the cross validation score calculator. This will take a few moments, as the grid search is carried out for each CV fold...

In [None]:
from sklearn.model_selection import cross_val_score

R2 = cross_val_score(KNN_tuned, X_scaled, y, cv=shuffle_split, scoring='r2')

In [None]:
meanR2,errR2 = np.mean(R2),np.std(R2)
print("Mean score:",meanR2,"+/-",errR2)

### Notes

* Optimizing over control parameters (or hyperparameters) with grid search cross-validation is a form of model selection.


* When presented with new data samples (photometry), and asked to predict the target response variables (photometric redshifts), we'll need a trained machine that has not been *over-fitted* to the training data.


* Minimizing and estimating the generalization error is a way to reduce the risk of getting this prediction wrong. 


* Let's finish off our photo-z machine learning algorithm.

In [None]:
KNNz = KNN_tuned.best_estimator_
KNNz.fit(X_train, y_train)

In [None]:
j = 571
# Keep the input 2-d: predict expects an array of shape (n_samples, n_features)
one_pretend_quasar = X_test[j:j+1,:]
zphoto = KNNz.predict(one_pretend_quasar)[0]
zspec = y_test[j]
print("True redshift cf. KNN photo-z:",zspec,zphoto)

In [None]:
zspec = y_test
zphoto = KNNz.predict(X_test)

plot(zspec, zphoto,'o',alpha=0.1)
title("KNNz performance")
xlabel("Spectroscopic Redshift")
ylabel("Photometric redshift")
lims = [0.0,4.0]
xlim(lims)
ylim(lims)
plot(lims, lims, ':k')

## Quasar Classification with Random Forests


* Let's switch gears and do a 3-class classification problem: star, galaxy, or QSO.


* A very good general-purpose classification (and regression!) algorithm is Random Forest. See [this blog post](http://blog.yhathq.com/posts/random-forests-in-python.html) for a nice high level introduction.


* ["A random forest is a meta estimator that fits a number of *decision tree classifiers* on various sub-samples of the dataset, and uses averaging to improve the predictive accuracy and control over-fitting."](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)


* Let's read in equal numbers of all three types of data, clean them up, and set $y$ equal to the classification label.

In [None]:
# The same columns are read from each of the three catalogs:
columns = ["objid","dered_r","u_g_color","g_r_color","r_i_color","i_z_color",
           "diff_u","diff_g1","diff_i","diff_z","class"]

# Take the first 1000 quasars, plus all 1000 stars and all 1000 galaxies:
all_sources = pd.concat([
    pd.read_csv("data/qso10000.csv",index_col=0,usecols=columns)[:1000],
    pd.read_csv("data/star1000.csv",index_col=0,usecols=columns),
    pd.read_csv("data/galaxy1000.csv",index_col=0,usecols=columns)])

all_sources = all_sources[(all_sources["dered_r"] > -9999) & (all_sources["g_r_color"] > -10) & (all_sources["g_r_color"] < 10)]

all_labels = all_sources["class"]

all_features = copy.copy(all_sources)
del all_features["class"]

X = copy.copy(all_features.values)
y = copy.copy(all_labels.values)

In [None]:
all_labels.tail()

In [None]:
print("Feature vector shape =", X.shape)
print("Class label vector shape =", y.shape)

What structure can we see in the data? Let's plot all the features as before.

In [None]:
# Map each class label to a number, and hence to a color:
class_values = {"QSO": 0.0,     # Red
                "STAR": 0.5,    # Green
                "GALAXY": 1.0}  # Blue
yy = all_labels.map(class_values).values

norm = mpl.colors.Normalize(vmin=min(yy), vmax=max(yy))
cmap = cm.jet_r
m = cm.ScalarMappable(norm=norm, cmap=cmap)
rez = pd.plotting.scatter_matrix(all_features,alpha=0.2,figsize=[15,15],color=m.to_rgba(yy))

OK - looks like there is information there to be used! 
Let's turn on the machine learning.

### Random Forest Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100,oob_score=True)
rf.fit(X,y)

What are the important features in the data?

In [None]:
# Use the feature column names (all_sources also contains the "class" label column):
sorted(zip(all_features.columns.values,rf.feature_importances_),key=lambda q: q[1],reverse=True)

In [None]:
rf.oob_score_

This is the "out-of-bag" (OOB) accuracy, made available by bagging ensembles such as random forests. Each decision tree in the ensemble is trained on a bootstrap sample of the data, so each object can be classified using only the trees that did not have it in their own bag, and those predictions compared with the truth.

The accuracy of a classifier is the fraction of predictions made that are correct. This one looks like it's doing well - but note that it was computed from the training set alone: we have not yet tested the classifier on separately held-out data.
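
To get a feel for how the out-of-bag accuracy compares with other accuracy estimates, here is a small sketch on synthetic data. The three-class problem made with `make_classification` is only a stand-in for the star/galaxy/QSO sample, and all the `_demo` names are invented for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem with 9 features (illustrative only):
X_demo, y_demo = make_classification(n_samples=1500, n_features=9, n_informative=5,
                                     n_classes=3, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf_demo.fit(Xd_train, yd_train)

# Scoring on the data used for fitting is optimistic; the other two are more honest:
print("Accuracy on the training data:", rf_demo.score(Xd_train, yd_train))
print("Out-of-bag accuracy:          ", rf_demo.oob_score_)
print("Accuracy on held-out data:    ", rf_demo.score(Xd_test, yd_test))
```

Comparing the three numbers shows why a score computed by re-predicting the training data should not be trusted on its own.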

### Classifier improvement with GridSearchCV

In [None]:
# Parameter values to try:
# ("sqrt" is what "auto" meant for classifiers in older scikit-learn releases)
parameters = {'n_estimators':(50,100,200),"max_features": ["sqrt",3],
              'criterion':["gini","entropy"],"min_samples_leaf": [1,2]}

# Initial training/test split:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
# Do a grid search to find the highest 3-fold CV score:
rf_tuned = GridSearchCV(rf, parameters, cv=3, verbose=1)
RFselector = rf_tuned.fit(X_train, y_train)

# Print the best score and estimator:
print(RFselector.best_score_)
print(RFselector.best_estimator_)

#### Question:

Would you be satisfied with a 95% successful classification fraction? Read the Random Forest `SciKit-Learn` docs to find some alternative scores, and think about when you might want to choose one of these instead. (Hint: imagine using a classifier to select a sample of *targets*.)

One way of visualizing classification accuracy is via a *confusion matrix*:

In [None]:
y_pred = RFselector.predict(X_test)

In [None]:
# Compute confusion matrix:

from sklearn.metrics import confusion_matrix

# (Not named "cm", which would shadow the matplotlib.cm module imported earlier)
conf_matrix = confusion_matrix(y_test, y_pred)

plt.matshow(conf_matrix)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
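
To check how to read such a matrix, here is a tiny hand-made example. The `truth` and `predicted` lists are invented: one quasar has deliberately been misclassified as a star.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels (illustrative only):
truth     = ["QSO", "QSO", "QSO",  "STAR", "STAR", "GALAXY"]
predicted = ["QSO", "QSO", "STAR", "STAR", "STAR", "GALAXY"]

# Rows are true classes and columns are predicted classes, in this order:
labels = ["GALAXY", "QSO", "STAR"]
toy_matrix = confusion_matrix(truth, predicted, labels=labels)
print(toy_matrix)
# [[1 0 0]
#  [0 2 1]
#  [0 0 2]]
```

Correct classifications lie on the diagonal, and the single off-diagonal entry (row "QSO", column "STAR") is the misclassified quasar.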

Each output label comes with a *classification probability*, computed from the results of the whole forest. To select a sample of classified objects, one can choose a selection threshold in this class probability, and only keep objects with higher probability than this threshold.
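
Here is a minimal sketch of such a selection, again on synthetic data standing in for the real catalog. The choice of class `1` as the "target" class and the threshold of 0.8 are arbitrary assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem (illustrative only):
X_demo, y_demo = make_classification(n_samples=1000, n_features=9, n_informative=5,
                                     n_classes=3, random_state=1)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=1)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=1).fit(Xd_train, yd_train)

# One row per object and one column per class, in the order given by rf_demo.classes_:
proba = rf_demo.predict_proba(Xd_test)
target_class = 1  # hypothetical class of interest
column = list(rf_demo.classes_).index(target_class)

# Keep only the objects whose target-class probability exceeds the threshold:
threshold = 0.8
selected = proba[:, column] > threshold
print("Selected", selected.sum(), "of", len(Xd_test), "objects")
if selected.any():
    print("Fraction of selected objects that really are targets:",
          np.mean(yd_test[selected] == target_class))
```

Raising the threshold generally gives a purer but smaller sample, and lowering it does the opposite.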


The availability of a class probability leads to an important diagnostic: the "Receiver Operating Characteristic" or ["ROC" curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html). This shows the *true positive rate* (TPR) plotted against the *false positive rate* (FPR) of a classifier, as the selection threshold is varied.
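
As a worked example of the two rates, here is a small sketch that computes the TPR and FPR at a few thresholds, for ten made-up objects. The arrays `is_target` and `p_target` and the helper `rates` are invented for illustration. Each threshold contributes one point to the ROC curve.

```python
import numpy as np

# Ten made-up objects: four true targets and six contaminants (illustrative only)
is_target = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
p_target = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.05])

def rates(threshold):
    """Return (TPR, FPR) when selecting objects with probability above threshold."""
    selected = p_target > threshold
    tpr = np.sum(selected & is_target) / np.sum(is_target)    # targets recovered
    fpr = np.sum(selected & ~is_target) / np.sum(~is_target)  # contaminants let in
    return tpr, fpr

for threshold in [0.25, 0.5, 0.75]:
    tpr, fpr = rates(threshold)
    print("threshold = %.2f: TPR = %.2f, FPR = %.2f" % (threshold, tpr, fpr))
```

At a threshold of 0.5, three of the four targets are recovered (TPR = 0.75) while one of the six contaminants is let in (FPR = 0.17).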


Typically, classifiers have control parameters that affect both the TPR and FPR (often improving one at the expense of the other), so the ROC curve is a good tool for investigating these parameters. 


Likewise, ROC curves provide a very good way to compare different classifiers.

### Exercise:

Use `SciKit-Learn` utilities to plot an ROC curve for the RFselector.

**[Back to the lesson plan](../../lessons/9.MachineLearning.ipynb)**