# Machine Learning in Astronomy

Goals:

* Work through example regression and classification analyses of the SDSS catalog dataset

* Understand some more "score" metrics and diagnostic visualizations, and carry out model selection by "cross validation"

### Further Reading

Ivezic Chapter 8 (regression) and Chapter 9 (classification)

> Credit: much of the material in this chunk is based on Josh Bloom's SDSS example notebook, from the 2014 edition of "Astro Hack Week".

## Photometric Redshift Estimation and Quasar Classification

Modern wide-field surveys are generating very large databases of automatically measured objects, whose error properties may not be well understood.

Fast machine learning algorithms are proving to be very useful in such a regime.

## Photometric Redshift Estimation and Quasar Classification

Let's investigate the SDSS photometric object catalog, and look for machine learning solutions to the following two problems:

1. Estimating the redshifts of quasars from their photometry (regression)

2. Selecting quasars from a background of stars and galaxies (classification)

## Data Acquisition

From the SDSS Sky Server we've downloaded two types of photometry (aperture and Petrosian), corrected for extinction, for a number of sources with redshifts. Here's the SQL for an example query that gets us 10000 example quasars:

<font color="blue" size=2>
<pre>SELECT *,dered_u - mag_u AS diff_u, dered_g - mag_g AS diff_g, dered_r - mag_r AS diff_g, dered_i - mag_i AS diff_i, dered_z - mag_z AS diff_z from
(SELECT top 10000
objid, ra, dec, dered_u,dered_g,dered_r,dered_i,dered_z,psfmag_u-extinction_u AS mag_u,
psfmag_g-extinction_g AS mag_g, psfmag_r-extinction_r AS mag_r, psfmag_i-extinction_i AS mag_i,psfmag_z-extinction_z AS mag_z,z AS spec_z,dered_u - dered_g AS u_g_color, 
dered_g - dered_r AS g_r_color,dered_r - dered_i AS r_i_color,dered_i - dered_z AS i_z_color,class
FROM SpecPhoto 
WHERE (class = 'QSO') ) as sp
 </pre>
</font>

We can do the same for 'STAR's and 'GALAXY's as well.

In [None]:
from __future__ import print_function  # __future__ imports must come first in the cell
import copy
import pandas as pd
pd.set_option('display.max_columns', None)
%pylab inline
import seaborn as sns
sns.set()

## Photometric Redshift Estimation

* This is a regression problem: we want to predict the redshift (the "response variable") from a number of photometric measurements (the "features")


* The SDSS spectroscopic quasar sample provides a _training set_ for the prediction of photometric redshifts


* Let's read in our SDSS quasars, and munge them into machine learning inputs

In [None]:
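# Note: "diff_g1" appears to be the r-band difference. The query above gives the
# g- and r-band differences the same alias (diff_g), so the second copy arrives renamed.
qsos = pd.read_csv("../examples/SDSScatalog/data/qso10000.csv",index_col=0,usecols=["objid","dered_r","spec_z","u_g_color",\
                                                "g_r_color","r_i_color","i_z_color","diff_u",\
                                                "diff_g1","diff_i","diff_z"])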

# Clean out extreme colors and bad magnitudes:
qsos = qsos[(qsos["dered_r"] > -9999) & (qsos["g_r_color"] > -10) & (qsos["g_r_color"] < 10)]


# Response variables: redshift
qso_redshifts = qsos["spec_z"]

# Features or attributes: photometric measurements
qso_features = copy.copy(qsos)
del qso_features["spec_z"]
qso_features.head()

### Visual Exploration

* Over what redshift range do we have quasar spectra?

* What structure is there in the _photometric feature space_ that might help us predict redshift?


Let's plot all the features, colored by the target redshift, to look for structure.

In [None]:
bins =  hist(qso_redshifts.values,bins=100) ; xlabel("redshift") ; ylabel("N")

In [None]:
import matplotlib as mpl
import matplotlib.cm as cm

# Truncate the color at z=2.5 just to keep some contrast.
norm = mpl.colors.Normalize(vmin=min(qso_redshifts.values), vmax=2.5)
cmap = cm.jet_r
m = cm.ScalarMappable(norm=norm, cmap=cmap)

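# Plot everything against everything else, for the first 1000 quasars.
# (scatter_matrix lives in pd.plotting in current pandas; the colors must match the 1000 plotted rows.)
rez = pd.plotting.scatter_matrix(qso_features[0:1000], alpha=0.2, figsize=[15,15],
                                 c=m.to_rgba(qso_redshifts.values[0:1000]))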

### Feature Data and Target Values

Now we have our machine learning inputs (photometric features: colors and magnitudes) and outputs (target redshift values):

In [None]:
X = qso_features.values  # Data: 9-d feature space
y = qso_redshifts.values # Target: redshifts

In [None]:
print("Design matrix shape =", X.shape)
print("Response variable vector shape =", y.shape)

### Linear Regression

Let's follow the same procedure as in the [previous chunk (from the `SciKit-Learn` Linear Regression tutorial)](machinelearning2.ipynb):

In [None]:
# Split the data into a training set and a test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Instantiate a "linear model"
from sklearn import linear_model
linear = linear_model.LinearRegression()

# Fit the model, using all the attributes:
linear.fit(X_train, y_train)

# Do the prediction on the test data:
y_lr_pred = linear.predict(X_test)

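# How well did we do? Compute the mean squared error (MSE) and the R2 score.
# (No square root here: that would give the RMSE, which is not comparable with the MSEs below.)
from sklearn.metrics import mean_squared_error
mse_linear = mean_squared_error(y_test, y_lr_pred)
r2_linear = linear.score(X_test, y_test)
print("Linear regression: MSE =", mse_linear)
print("R2 score =", r2_linear)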

In [None]:
plot(y_test, y_lr_pred, 'o', alpha=0.2)
title("Linear Regression - MSE = %.2f" % mse_linear, fontsize=18)
xlabel("Spectroscopic Redshift", fontsize=18)
ylabel("Photometric Redshift", fontsize=18)
plot([0,7],[0,7], color='red')
ylim(0,5)

Just how bad is this? Here's the MSE from guessing the *average redshift of the training set* for all new objects:

In [None]:
# Guess the training-set mean for every test object, and score that guess on the test set:
print("Naive MSE =", np.mean((y_test - y_train.mean())**2))
print("Linear regression: MSE = ",mse_linear)
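
The comparison with a constant guess is what the R2 score encodes: it is one minus the ratio of the model's MSE to the MSE of a constant prediction. Here is a minimal sketch on synthetic data (the arrays and the names `y_true_demo` and `y_pred_demo` are made up for illustration, not SDSS values). It assumes the constant guess is the mean of the targets being scored, which is the convention used by `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic "true" and "predicted" targets, for illustration only.
rng = np.random.default_rng(0)
y_true_demo = rng.uniform(0.0, 3.0, size=500)
y_pred_demo = y_true_demo + rng.normal(0.0, 0.5, size=500)

mse_model = mean_squared_error(y_true_demo, y_pred_demo)
# MSE of always guessing the mean of the targets being scored:
mse_constant = np.mean((y_true_demo - y_true_demo.mean())**2)

r2_by_hand = 1.0 - mse_model / mse_constant
print("R2 by hand =", r2_by_hand)
print("R2 from sklearn =", r2_score(y_true_demo, y_pred_demo))
```

An R2 score near zero therefore means that a model is doing no better than the constant guess, and a negative score means it is doing worse.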

### *k*-Nearest Neighbor (KNN) Regression

Now let's try a non-parametric model:

> ["Regression based on k-nearest neighbors.](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) The target is predicted by local interpolation of the targets associated of the nearest neighbors in the training set."


#### Question:

What underlying model is implied by the KNN algorithm? How many hidden parameters does it (effectively) have? What choices do we get to make?

In [None]:
from sklearn import neighbors
from sklearn import preprocessing

# Standardize each feature to zero mean and unit variance: many methods work better on re-scaled X.
X_scaled = preprocessing.scale(X)

KNN = neighbors.KNeighborsRegressor(5)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

KNN.fit(X_train, y_train)
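
To make the "local interpolation" in the quote above concrete, here is a minimal sketch of KNN regression written out by hand and checked against `KNeighborsRegressor`. The helper `knn_predict` and the synthetic arrays are hypothetical, introduced only for illustration. The sketch assumes uniform weights and Euclidean distance, which are the defaults shown in the model `repr`:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_predict(X_fit, y_fit, X_new, k=5):
    """Predict each target as the unweighted mean of its k nearest training targets."""
    # Pairwise Euclidean distances, shape (n_new, n_fit):
    d = np.sqrt(((X_new[:, None, :] - X_fit[None, :, :])**2).sum(axis=2))
    # Indices of the k smallest distances in each row:
    nearest = np.argsort(d, axis=1)[:, :k]
    return y_fit[nearest].mean(axis=1)

# Synthetic data, for illustration only:
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0]**2 + rng.normal(0.0, 0.1, size=200)
X_query = rng.normal(size=(10, 3))

by_hand = knn_predict(X_demo, y_demo, X_query, k=5)
from_sklearn = KNeighborsRegressor(n_neighbors=5).fit(X_demo, y_demo).predict(X_query)
print("Predictions agree:", np.allclose(by_hand, from_sklearn))
```

Note that "fitting" this model only stores the training data: all of the work happens at prediction time.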

In [None]:
y_knn_pred = KNN.predict(X_test)
mse_knn = mean_squared_error(y_test,y_knn_pred)
r2_knn = KNN.score(X_test, y_test)
print("MSE (KNN) =", mse_knn)
print("R2 score (KNN) =",r2_knn)
print("cf.")
print("MSE (linear regression) = ",mse_linear)
print("R2 score (linear regression) =",r2_linear)
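
Re-scaling matters for KNN in particular because the neighbors are chosen by distance, so a feature with a large numerical range can dominate the choice. The sketch below illustrates this on synthetic data (all names and values are made up for illustration). It uses a `Pipeline` built with `make_pipeline`, which is one possible way to re-fit the scaling inside each cross-validation fold, not the approach taken in the cells above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only: one informative feature with a small
# numerical range, and one pure-noise feature with a very large range.
rng = np.random.default_rng(5)
informative = rng.uniform(0.0, 1.0, size=400)
noise_feature = rng.normal(0.0, 1000.0, size=400)
X_demo = np.column_stack([informative, noise_feature])
y_demo = 3.0 * informative + rng.normal(0.0, 0.1, size=400)

raw_knn = KNeighborsRegressor(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))

# Mean 3-fold cross-validated R2 score for each model:
r2_raw = cross_val_score(raw_knn, X_demo, y_demo, cv=3).mean()
r2_scaled = cross_val_score(scaled_knn, X_demo, y_demo, cv=3).mean()
print("R2 without scaling:", r2_raw)
print("R2 with scaling:   ", r2_scaled)
```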

In [None]:
plot(y_test, y_knn_pred,'o',alpha=0.2)
title("k-NN Regression - MSE = %.2f" % mse_knn, fontsize=18)
xlabel("Spectroscopic Redshift", fontsize=18)
ylabel("Photometric Redshift", fontsize=18)
plot([0,7], [0,7], color='red')
ylim(0,5);

### Tuning the KNN Model

* Let's vary the control parameters of the KNN model, to see how good we can make our predictions.

* We can see our options in the model `repr`:

```
KNeighborsRegressor(algorithm='auto',
    leaf_size=30, metric='minkowski',      
    metric_params=None, n_neighbors=5, 
    p=2, weights='uniform')
```
* Let's first make a "validation curve" to investigate one parameter: the number of nearest neighbors averaged over.

In [None]:
# We'll vary the number of neighbors used:
param_name = "n_neighbors"
param_range = np.array([1,2,4,8,16,32,64])

# Compute our cv scores for the above range of the no. of neighbors:
from sklearn.model_selection import validation_curve
training_scores, validation_scores = validation_curve(KNN, X_scaled, y,
                                                      param_name=param_name,
                                                      param_range=param_range, 
                                                      cv=3, scoring='r2')
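
For reference, the calculation carried out by `validation_curve` can be written as an explicit double loop over parameter values and cross-validation folds. This is a sketch on synthetic data, with hypothetical names (`X_demo`, `k_values`, `train_r2`, `valid_r2`), assuming unshuffled 3-fold splitting:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, for illustration only:
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0.0, 0.2, size=300)

k_values = [1, 4, 16, 64]
folds = KFold(n_splits=3)
train_r2 = np.zeros((len(k_values), 3))
valid_r2 = np.zeros((len(k_values), 3))

for i, k in enumerate(k_values):
    for j, (fit_idx, val_idx) in enumerate(folds.split(X_demo)):
        # Fit on two thirds of the data, and hold out the remaining third:
        model = KNeighborsRegressor(n_neighbors=k).fit(X_demo[fit_idx], y_demo[fit_idx])
        # For regressors, .score() returns the R2 score:
        train_r2[i, j] = model.score(X_demo[fit_idx], y_demo[fit_idx])
        valid_r2[i, j] = model.score(X_demo[val_idx], y_demo[val_idx])

print("Mean training R2:  ", train_r2.mean(axis=1))
print("Mean validation R2:", valid_r2.mean(axis=1))
```

Each row of the two score arrays corresponds to one parameter value, and each column to one fold, which is why the plotting code averages over `axis=1`.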

In [None]:
def plot_validation_curve(param_name,parameter_values, training_scores, validation_scores):
    training_scores_mean = np.mean(training_scores, axis=1)
    training_scores_std = np.std(training_scores, axis=1)
    validation_scores_mean = np.mean(validation_scores, axis=1)
    validation_scores_std = np.std(validation_scores, axis=1)

    plt.fill_between(parameter_values, training_scores_mean - training_scores_std,
                     training_scores_mean + training_scores_std, alpha=0.1, color="r")
    plt.fill_between(parameter_values, validation_scores_mean - validation_scores_std,
                     validation_scores_mean + validation_scores_std, alpha=0.1, color="g")
    plt.plot(parameter_values, training_scores_mean, 'o-', color="r",
             label="Training R2 score")
    plt.plot(parameter_values, validation_scores_mean, 'o-', color="g",
             label="Cross-validation R2 score")
    plt.ylim(validation_scores_mean.min() - .1, training_scores_mean.max() + .1)
    plt.xlabel(param_name, fontsize=18)
    plt.legend(loc="best", fontsize=18)

In [None]:
plot_validation_curve(param_name, param_range, training_scores, validation_scores)

#### Question:

Can you explain the shapes of these two curves? Talk to your neighbor for a few minutes, and be prepared to suggest reasons for a) the rise and fall of the cross validation score and b) the monotonic decrease in training score.

### Model tuning with `GridSearchCV`

* Now, let's see if we can do better by varying some other KNN options as well - in a *grid search*.

In [None]:
param_grid = {'n_neighbors': np.array([1,2,4,8,16,32,64]),
                  'weights': ['uniform','distance'],
                       'p' : np.array([1,2])}

np.set_printoptions(suppress=True)
print(param_grid)

In [None]:
from sklearn.model_selection import GridSearchCV
KNN_tuned = GridSearchCV(KNN, param_grid, verbose=3)

A `GridSearchCV` object behaves just like a model, except it carries out a cross-validation while fitting:

<img src="../graphics/ml_grid_search_cross_validation.svg" width=70% align='center'>

In [None]:
KNN_tuned.fit(X_train, y_train)
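
Conceptually, the fit above scores every combination of parameter values in the grid by cross-validation and keeps the best one. A minimal sketch of that loop, on synthetic data and with a smaller, hypothetical grid (`demo_grid`), might look like this. The choice of `cv=5` is an assumption, made to match the default in recent `scikit-learn` versions:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, for illustration only:
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0.0, 0.2, size=300)

demo_grid = {'n_neighbors': [2, 8, 32], 'weights': ['uniform', 'distance']}

best_score, best_params = -np.inf, None
for params in ParameterGrid(demo_grid):
    # Mean R2 score over 5 folds, for this combination of settings:
    score = cross_val_score(KNeighborsRegressor(**params), X_demo, y_demo, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, params

print("Best parameters:", best_params)
print("Best mean CV score:", best_score)
```

The number of fits is the size of the grid multiplied by the number of folds, which is why grid searches become slow quickly as more parameters are added.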

In [None]:
y_knn_tuned_pred = KNN_tuned.predict(X_test)

mse_knn_tuned = mean_squared_error(y_test,y_knn_tuned_pred)
r2_knn_tuned = KNN_tuned.score(X_test, y_test)

print("MSE (tuned KNN) =", mse_knn_tuned)
print("R2 score (tuned KNN) =", r2_knn_tuned)
print("cf.")
print("MSE (KNN) = ", mse_knn)
print("R2 score (KNN) =", r2_knn)

Which are the best KNN control parameters we found?

In [None]:
KNN_tuned.best_params_

This value of `n_neighbors` is consistent with the peak in cross-validation score in the validation curve plot.

### Generalization Error

Notice that all the above tuning happened while training on a single training/test split (`X_train` and `y_train`).


It's possible that this particular split prefers a slightly different set of model parameters than another one would, so to assess our generalization error we need a further level of cross-validation.


We can do this by passing a `GridSearchCV` model to the cross validation score calculator. This will take a few minutes, as the grid search is carried out for each CV fold...

In [None]:
from sklearn.model_selection import cross_val_score

R2 = cross_val_score(KNN_tuned, X_scaled, y, cv=3, scoring='r2')

In [None]:
meanR2, errR2 = np.mean(R2), np.std(R2)/np.sqrt(len(R2))
print('Mean score:', meanR2, '+/-', errR2)

### Notes

* Optimizing over control parameters (or hyperparameters) with grid search cross validation is a form of model selection.


* When presented with new data samples (photometry), and asked to predict the target response variables (photometric redshifts), we'll need a trained machine that has not been *over-fitted* to the training data.


* Minimizing and estimating the generalization error is a way to reduce the risk of getting this prediction wrong. 

Let's finish off our photo-z machine learning tool, by:


* Choosing the best model (from our cross-validation analysis)

* Training it on the whole of the training dataset

* Using it to predict the redshifts of the objects in the test set

In [None]:
KNNz = KNN_tuned.best_estimator_

KNNz.fit(X_train, y_train)

In [None]:
j = 571
one_new_quasar = X_test[j,:].reshape(1, -1)
zphoto = KNNz.predict(one_new_quasar)
zspec = y_test[j]
print("True redshift cf. KNN photo-z:", zspec, ' cf.', zphoto[0])

In [None]:
zspec = y_test
zphoto = KNNz.predict(X_test)

plot(zspec, zphoto,'o',alpha=0.1)
title("KNNz performance", fontsize=18)
xlabel("Spectroscopic Redshift", fontsize=18)
ylabel("Photometric redshift", fontsize=18)
plot([0,7], [0,7], color='red')
ylim(0,5);

## Quasar Classification with Random Forest

* Let's switch gears and do a 3-class classification problem: star, galaxy, or QSO.


* A very good general-purpose classification (and regression!) algorithm is "Random Forest." See [this yhat blog post](http://blog.yhathq.com/posts/random-forests-in-python.html) for a nice high level introduction.

## Random Forests


* From the `Scikit-Learn` docs: ["A random forest is a meta estimator that fits a number of *decision tree classifiers* on various sub-samples of the dataset, and uses averaging to improve the predictive accuracy and control over-fitting."](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)


* Decision trees encode sequences of "cuts" in the data, leading to the separation of the samples into groups to which labels are assigned.

### Decision Trees

<img src="../graphics/ml_yhat_decision_tree_example.png" width=60%>

> [yhat.com "Random Forests in Python"](http://blog.yhathq.com/posts/random-forests-in-python.html)

### Forests of Random Trees

<img src="../graphics/ml_yhat_a_random_forest.png" width=60%>

> [yhat.com "Random Forests in Python"](http://blog.yhathq.com/posts/random-forests-in-python.html)

### Forests of Random Trees


* Random decision trees are random in that they:

  * Consider only a random subset of the features when choosing each cut
  
  * Are given random bootstrap sub-samples of the data
  
  
* Individual trees in a random forest can be poor, over-fitted classifiers, but their errors are only partly correlated, so averaging over many trees gives a much more reliable prediction
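
The two sources of randomness listed above can be sketched by hand: train each decision tree on a bootstrap sample, let it consider a random subset of features at each split, and combine the trees. This is an illustration on synthetic data with hypothetical names, not the internals of `RandomForestClassifier`. In particular, it assumes a simple majority vote, whereas `scikit-learn` averages the trees' class probabilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data, for illustration only:
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)
Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 50
votes = np.zeros((n_trees, len(yb)), dtype=int)
for t in range(n_trees):
    # Bootstrap: draw len(ya) training objects, with replacement.
    boot = rng.integers(0, len(ya), size=len(ya))
    # max_features='sqrt': each split considers a random subset of the features.
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=t)
    votes[t] = tree.fit(Xa[boot], ya[boot]).predict(Xb)

# Majority vote across trees, for each test object in turn:
forest_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

single_tree_acc = (votes == yb).mean(axis=1)
print("Mean single-tree accuracy:", single_tree_acc.mean())
print("Majority-vote accuracy:   ", (forest_pred == yb).mean())
```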

## Quasar Classification with Random Forest

* Let's read in equal numbers of all three types of data, clean them up, and set $y$ equal to the classification label.

In [None]:
# Read the same columns for each class, keeping 1000 objects of each type.
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0.)
columns = ["objid","dered_r","u_g_color","g_r_color","r_i_color","i_z_color",
           "diff_u","diff_g1","diff_i","diff_z","class"]
datadir = "../examples/SDSScatalog/data/"

all_sources = pd.concat([
    pd.read_csv(datadir+"qso10000.csv", index_col=0, usecols=columns)[:1000],
    pd.read_csv(datadir+"star1000.csv", index_col=0, usecols=columns),
    pd.read_csv(datadir+"galaxy1000.csv", index_col=0, usecols=columns)])

all_sources = all_sources[(all_sources["dered_r"] > -9999) & (all_sources["g_r_color"] > -10) & (all_sources["g_r_color"] < 10)]

all_labels = all_sources["class"]

all_features = copy.copy(all_sources)
del all_features["class"]

X = copy.copy(all_features.values)
y = copy.copy(all_labels.values)

### Feature Data and Target Labels

In classification problems the target values are _categorical_ "labels":

In [None]:
print("Feature vector shape =", X.shape)
print("Class label vector shape =", y.shape)

In [None]:
print(y, X)

### Visual Exploration

What structure can we see in the data? Let's plot all the features as before.

In [None]:
# Map each class label to a number, and hence to a color:
# QSO = 0.0 (red), STAR = 0.5 (green), GALAXY = 1.0 (blue)
yy = all_labels.map({"QSO": 0.0, "STAR": 0.5, "GALAXY": 1.0}).values

norm = mpl.colors.Normalize(vmin=yy.min(), vmax=yy.max())
cmap = cm.jet_r
m = cm.ScalarMappable(norm=norm, cmap=cmap)
rez = pd.plotting.scatter_matrix(all_features, alpha=0.2, figsize=[15,15], c=m.to_rgba(yy))

### Random Forest Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, oob_score=True)
rf.fit(X, y)

### Random Forest feature importances

* Each decision tree in the Random Forest focuses on a different combination of features

* One output of the fitted model is an indication of which features are most important


In [None]:
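# Use the feature columns only: all_sources also contains the "class" column, which would misalign the names.
sorted(zip(all_features.columns.values,rf.feature_importances_),key=lambda q: q[1],reverse=True)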

### Classification Accuracy

* The accuracy of a classifier is the _fraction of predictions made that are correct_. 

* Ensemble classifiers like Random Forest provide an `oob_score`, the mean "Out of Bag" prediction accuracy

* Each decision tree in the ensemble is only working on a subset of the data, so it can track its accuracy with the data not in its own bag. 
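
How much data is left "out of bag" for each tree? A bootstrap sample of $N$ objects drawn with replacement contains, on average, a fraction $1 - 1/e \approx 0.63$ of the distinct objects, leaving about 37% out of bag. A quick numerical check, with an arbitrary illustrative choice of $N$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_objects = 3000   # an arbitrary choice, roughly the size of the three-class sample

# One bootstrap "bag": n_objects draws with replacement.
bag = rng.integers(0, n_objects, size=n_objects)
in_bag_fraction = len(np.unique(bag)) / n_objects

print("In-bag fraction:    ", in_bag_fraction)
print("Out-of-bag fraction:", 1.0 - in_bag_fraction)
print("Expected in-bag fraction, 1 - 1/e:", 1.0 - np.exp(-1.0))
```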

In [None]:
# Report the "out of bag" accuracy score:
rf.oob_score_

This score _looks_ good, but it was computed from the _training set_ alone: each object is scored only by the trees that did not have it in their bag.


#### Question:
How should we estimate the generalized accuracy?

### Classifier Tuning with GridSearchCV

As before, this will take a few minutes, as the model selection is carried out on each fold within the training set... 

In [None]:
# Parameter values to try ("sqrt" is the classifier default, formerly available as "auto"):
parameters = {'n_estimators':(50,100,200), "max_features": ["sqrt",3],
              'criterion':["gini","entropy"], "min_samples_leaf": [1,2]}

# Training/test split:
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.25)

In [None]:
# Do a grid search to find the highest 3-fold cross-validation score:
rf_tuned = GridSearchCV(rf, parameters, cv=3, verbose=1)
RFselector = rf_tuned.fit(X_train, y_train)

# Print the best score and estimator:
print('Best cross-validation score:', RFselector.best_score_)
print(RFselector.best_estimator_)

#### Question:

Would you be satisfied with a 95% successful classification fraction? Can you suggest some more useful scores to optimize? (Hint: imagine using a classifier to select a sample of *follow-up targets*.)

## Confusion Matrices

One way of visualizing classification accuracy across a range of labels is via a *confusion matrix*:

In [None]:
# Make predictions given the test data:
y_pred = RFselector.predict(X_test)

In [None]:
# Compute confusion matrix:
from sklearn.metrics import confusion_matrix

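# (Named conf_mat rather than cm, to avoid overwriting the matplotlib.cm module imported earlier.)
conf_mat = confusion_matrix(y_test, y_pred)

plt.matshow(conf_mat)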
plt.title('Confusion matrix', fontsize=18)
plt.colorbar()
plt.ylabel('True label', fontsize=18)
plt.xlabel('Predicted label', fontsize=18)
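
To see how a confusion matrix is read, here is a small worked example with hand-made labels (illustrative values only, with hypothetical variable names). Passing `labels=` fixes the order of the rows and columns explicitly:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels, for illustration only:
true_demo = np.array(['QSO', 'QSO', 'QSO', 'STAR', 'STAR', 'GALAXY', 'GALAXY', 'GALAXY'])
pred_demo = np.array(['QSO', 'STAR', 'QSO', 'STAR', 'STAR', 'GALAXY', 'QSO', 'GALAXY'])

class_order = ['GALAXY', 'QSO', 'STAR']
cmat = confusion_matrix(true_demo, pred_demo, labels=class_order)
print(cmat)

# Rows are true classes and columns are predicted classes, so:
completeness = np.diag(cmat) / cmat.sum(axis=1)   # Recall, per true class
purity = np.diag(cmat) / cmat.sum(axis=0)         # Precision, per predicted class
print("Completeness:", dict(zip(class_order, completeness)))
print("Purity:      ", dict(zip(class_order, purity)))
```

In this example one quasar is mis-classified as a star and one galaxy as a quasar, so the quasar sample is both incomplete and impure, while the overall accuracy (6 out of 8) hides which classes are being confused.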

## ROC Curves

Each Random Forest output label comes with a *classification probability*, computed from the results of the whole forest. 

To select a sample of classified objects, one can choose a selection threshold in this class probability, and only keep objects with higher probability than this threshold.


The availability of a class probability leads to an important diagnostic: the "Receiver Operating Characteristic" or ["ROC" curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html). 

### ROC Curves

These show a classifier's *true positive rate* (TPR) plotted against the *false positive rate* (FPR), as the selection threshold is varied.


$\;\;\;\;\;{\rm TPR} = \frac{TP}{TP + FN}$  the fraction of actual 1's that are classified as 1's.


$\;\;\;\;\;{\rm FPR} = \frac{FP}{FP + TN}$  the fraction of actual 0's that are classified as 1's.  

> TPR = "Recall", "Completeness". FPR = "Fall-out".
> In astronomy, we are also (very) interested in "Precision," $\frac{TP}{FP + TP}$, also known as "Purity."
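
As a worked example of these definitions, consider ten objects with hand-made class probabilities (illustrative values only, with hypothetical variable names), selected with a probability threshold of 0.5:

```python
import numpy as np

# Hand-made example: 1 = quasar, 0 = not a quasar (illustrative values only).
truth_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
prob_demo = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1])

threshold = 0.5
selected = prob_demo > threshold

TP = np.sum(selected & (truth_demo == 1))    # quasars that were selected
FP = np.sum(selected & (truth_demo == 0))    # contaminants that were selected
FN = np.sum(~selected & (truth_demo == 1))   # quasars that were missed
TN = np.sum(~selected & (truth_demo == 0))   # contaminants that were rejected

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
precision = TP / (TP + FP)
print("TPR =", TPR, " FPR =", FPR, " Precision =", precision)
```

Here 3 of the 4 quasars are selected (TPR = 0.75), 1 of the 6 other objects gets in (FPR = 1/6), and 3 of the 4 selected objects are quasars (precision = 0.75).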


### ROC Curves

Typically, classifiers have control parameters that affect both the TPR and FPR (often improving one at the expense of the other), so the ROC curve is a good tool for investigating these parameters. 


Likewise, ROC curves provide a very good way to compare different classifiers, via the "area under the curve" (and above the random classifier TPR=FPR line). 
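
One way to interpret the area under the ROC curve: it equals the probability that a randomly chosen positive object is given a higher score than a randomly chosen negative one. The sketch below checks this on synthetic scores (made-up values and hypothetical names, with no tied scores):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Synthetic scores, for illustration only: positives tend to score higher.
rng = np.random.default_rng(4)
truth_demo = np.concatenate([np.ones(200, dtype=int), np.zeros(300, dtype=int)])
score_demo = np.concatenate([rng.normal(1.0, 1.0, 200), rng.normal(0.0, 1.0, 300)])

# Area under the ROC curve, by trapezoidal integration:
fpr_demo, tpr_demo, _ = roc_curve(truth_demo, score_demo)
auc_trapz = auc(fpr_demo, tpr_demo)

# Fraction of (positive, negative) pairs in which the positive scores higher:
pos, neg = score_demo[truth_demo == 1], score_demo[truth_demo == 0]
auc_pairs = np.mean(pos[:, None] > neg[None, :])

print("AUC (trapezoid) =", auc_trapz)
print("AUC (pair ranking) =", auc_pairs)
```

A classifier that assigns scores at random has an AUC of about 0.5, and a perfect ranking gives an AUC of 1.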

## RFselector ROC Curves

Let's use the `SciKit-Learn` utilities to plot an ROC curve for our RFselector, and quantify its performance via the AUC.

In [None]:
# Classify the test data, and store the classification probabilities:
BestRFselector = RFselector.best_estimator_
y_prob = BestRFselector.fit(X_train, y_train).predict_proba(X_test)

In [None]:
from sklearn.metrics import roc_curve, auc

# Compute ROC curve and area under curve (AUC) for each class:
labels = BestRFselector.classes_ # ['GALAXY', 'QSO', 'STAR'] - the order is decided by the machine!
fpr = dict()
tpr = dict()
roc_auc = dict()
for i,label in enumerate(labels):
    fpr[label], tpr[label], _ = roc_curve(y_test, y_prob[:, i], pos_label=label)
    roc_auc[label] = auc(fpr[label], tpr[label])

In [None]:
lw = 2
colors = {'QSO':'red', 'STAR':'green', 'GALAXY':'blue'}
for label in labels:
    plot(fpr[label], tpr[label], color=colors[label],
         lw=lw, label='%s (AUC = %0.3f)' % (label, roc_auc[label]))
plot([0, 1], [0, 1], color='gray', lw=lw, linestyle='--')
xlim([0.0, 1.0])
ylim([0.0, 1.05])
xlabel('False Positive Rate', fontsize=18)
ylabel('True Positive Rate', fontsize=18)
title('RFselector ROC Curve', fontsize=18)
legend(loc="lower right", fontsize=14);

## Endnotes


The `scikit-learn` package makes it easy to experiment with various machine learning algorithms, and make non-parametric models that (via cross-validation) have high generalized prediction accuracy.

(These models may not provide parameter uncertainties or even much new understanding of the dataset, but this may not be your priority.)


Other frameworks exist: Google's ["TensorFlow"](https://www.tensorflow.org/tutorials/deep_cnn) is worth investigating, both for general ML, and also for its _neural network_ support. PGMs, as well as the ML basics shown here, will reappear