## Astronomy 406 "Computational Astrophysics" (Fall 2017)

### Week 12: Machine Learning methods for Classification

<b>Reading:</b> notes below, as well as $\S$9.3-9.7 of [Machine Learning](http://www.astroml.org/).

Last week we looked at [Gaussian Mixture](http://scikit-learn.org/stable/modules/mixture.html) modeling. It can be used to assign data points to distinct classes. The [**Sklearn**](http://scikit-learn.org) (or Scikit-Learn) package provides several other methods for classification:

[Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html)

[Support Vector Machines](http://scikit-learn.org/stable/modules/svm.html)

[Random Forest](http://scikit-learn.org/stable/modules/ensemble.html#forest)

A handy comparison of all different classification methods is given [here](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html).

We will use examples provided by ML.<br>
Gaussian classifier:
[Figure 9.1](http://www.astroml.org/book_figures/chapter9/fig_bayes_DB.html) and
[Figure 9.2](http://www.astroml.org/book_figures/chapter9/fig_simple_naivebayes.html)<br>
SVM classifier:
[Figure 9.9](http://www.astroml.org/book_figures/chapter9/fig_svm_diagram.html)

In [None]:
%matplotlib inline
from matplotlib import rcParams
rcParams["savefig.dpi"] = 90
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, svm, ensemble

#### SDSS dataset

Download a large dataset of galaxies with spectroscopic information from the SDSS survey. This survey obtained images for 357,000,000 stars and galaxies in five photometric bands, as well as spectroscopy for 1,600,000 galaxies, quasars, and stars. AstroML provides access to a subset of this huge dataset. It is described [here](http://www.astroml.org/user_guide/datasets.html). To load the properties of galaxies from the spectroscopic dataset, execute the following cell. The output lists the fields available in the dataset, including galaxy positions on the sky, redshift, velocity dispersion, magnitudes in the SDSS $u$, $g$, $r$, $i$, and $z$ bands, stellar mass, star formation rate, and others.

In [None]:
from astroML.datasets import fetch_sdss_specgals
from astroML.plotting import scatter_contour
from astroML.plotting.tools import draw_ellipse

data = fetch_sdss_specgals()
print(data.shape, 'spectra fields:', data.dtype.names)

Let's make a common color-magnitude diagram, showing the $g-r$ color vs. the $r$-band magnitude. To make a cleaner dataset, we restrict the redshift range to $0.02 < z < 0.06$. This reduces the number of objects from 661598 to 114527, still a huge number for a statistical study.

In [None]:
# redshift cut 
data = data[data['z'] > 0.02]
data = data[data['z'] < 0.06]

gr = data['modelMag_g'] - data['modelMag_r']
r = data['modelMag_r']
print(len(r), 'galaxies selected')

fig, ax = plt.subplots(figsize=(8, 6))
scatter_contour(r, gr, threshold=400, log_counts=True, ax=ax,
                histogram2d_args=dict(bins=100),
                plot_args=dict(marker=',', linestyle='none', color='black'),
                contour_args=dict(cmap=plt.cm.bone))

ax.set_xlabel(r'${\rm r}$', size=16)
ax.set_ylabel(r'${\rm g - r}$', size=16)
ax.set_xlim(18, 14)
ax.set_ylim(0, 1.2)
plt.show()

There is a bimodality in the color distribution. The long narrow band around $g-r = 0.8$ is called the *red sequence*, while the big cloud at $0.3 < g-r < 0.5$ is the *blue cloud* of star-forming galaxies. The histogram below confirms this bimodal distribution.

In [None]:
plt.hist(gr, bins=np.arange(0.2,1,0.02), histtype='step')
plt.xlabel(r'${\rm g - r}$', size=16)
plt.show()

However, from a much smaller dataset, it would be difficult to infer this bimodality. Let's take the first hundred galaxies and plot their color histogram:

In [None]:
plt.hist(gr[:100], bins=np.arange(0.2,1,0.02), histtype='step')
plt.xlabel(r'${\rm g - r}$', size=16)
plt.show()

### Machine Learning

Machine Learning methods allow efficient exploration of datasets.  They are often divided into the categories of **Supervised** and **Unsupervised** methods.

**Supervised** methods generally deal with classification of objects. They are given a **labeled** training dataset, and the model is later applied to **un-labeled** data in order to predict the unknown label.

**Unsupervised** methods generally deal with clustering of objects or density estimation. They are given an **un-labeled** training dataset, and make inferences about the structure of the data without any label input. One familiar example is the Kernel Density Estimator (KDE), which we used instead of histograms. Another example is Gaussian Mixture Modeling.

Most models in **Sklearn** have similar syntax. They are trained on a particular dataset using the ``fit()`` method. Then labels for new points can be predicted using the ``predict()`` method. This makes it very convenient to try different Machine Learning models just by changing the initialization step.
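
As a minimal sketch of this shared interface, the cell below fits three different classifiers with identical ``fit()`` and ``predict()`` calls. The two-blob dataset and the names ``X_demo``, ``y_demo``, ``classifiers``, and ``new_points`` are assumptions made for illustration; they are not part of the SDSS sample used in these notes.

```python
import numpy as np
from sklearn import neighbors, svm, ensemble

# Synthetic two-class data (illustrative assumption, not the SDSS sample):
# two well-separated Gaussian blobs in a 2D feature space.
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                    rng.normal(3.0, 0.5, size=(50, 2))])
y_demo = np.repeat([0, 1], 50)

# Only the initialization step differs between the models.
classifiers = {
    'nearest neighbors': neighbors.KNeighborsClassifier(n_neighbors=3),
    'SVM': svm.SVC(gamma=2),
    'random forest': ensemble.RandomForestClassifier(n_estimators=10,
                                                     random_state=0),
}

# One new point near the center of each blob.
new_points = np.array([[0.0, 0.0], [3.0, 3.0]])

predictions = {}
for name, clf_demo in classifiers.items():
    clf_demo.fit(X_demo, y_demo)                      # train on labeled data
    predictions[name] = clf_demo.predict(new_points)  # label new points
    print('{}: {}'.format(name, predictions[name]))
```

Swapping in another model only requires adding one more entry to the dictionary.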

**Sklearn** also provides a routine (``cross_val_score``) for cross-validation of the model parameters, that is, evaluation of how well a particular model could be expected to perform on the dataset.
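
A short sketch of how ``cross_val_score`` might be called is given below. It assumes the routine is imported from ``sklearn.model_selection`` and uses a synthetic two-class dataset (``X_cv``, ``y_cv``) invented for this illustration rather than the galaxy data.

```python
import numpy as np
from sklearn import neighbors
from sklearn.model_selection import cross_val_score

# Synthetic labeled data (illustrative assumption): two Gaussian blobs.
rng = np.random.RandomState(1)
X_cv = np.vstack([rng.normal(0.0, 0.5, size=(60, 2)),
                  rng.normal(3.0, 0.5, size=(60, 2))])
y_cv = np.repeat([0, 1], 60)

clf_cv = neighbors.KNeighborsClassifier(n_neighbors=3)

# Split the data into 5 folds; each fold is held out once for testing
# while the classifier is trained on the remaining four.
scores = cross_val_score(clf_cv, X_cv, y_cv, cv=5)

print('accuracy per fold: {}'.format(scores))
print('mean accuracy: {:.2f}'.format(scores.mean()))
```

The spread of the per-fold scores gives a rough idea of how sensitive the model is to the particular choice of training points.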

Here is an example of selecting the training set and the test set. Let's take the first 100 points of our galaxy dataset as the training set and the last 1000 points as the test set.

In [None]:
X = np.vstack([r, gr]).T
print('total dataset:', X.shape, 'galaxies')

Xtraining = X[:100]
print('training set:', len(Xtraining), 'galaxies')

Xtest = X[-1000:]
print('test set:', len(Xtest), 'galaxies')

Let's begin our exploration of this dataset with unsupervised Gaussian Mixture Modeling, which does not require the knowledge of any labels.
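
The number of mixture components is usually chosen with an information criterion such as AIC or BIC. The sketch below, on synthetic data rather than the galaxy sample, reproduces both criteria by hand from the mean log-likelihood returned by ``score()`` and compares them with the ``aic()`` and ``bic()`` methods. The parameter count assumes ``covariance_type='full'``; the names ``X_gmm`` and ``n_params`` are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (illustrative assumption): two Gaussian blobs in 2D.
rng = np.random.RandomState(2)
X_gmm = np.vstack([rng.normal([0.0, 0.0], 0.3, size=(100, 2)),
                   rng.normal([2.0, 2.0], 0.3, size=(100, 2))])
n_samples, n_features = X_gmm.shape

aic_manual, bic_manual, aic_sklearn, bic_sklearn = [], [], [], []
for k in (1, 2, 3):
    gmm = GaussianMixture(k, covariance_type='full',
                          random_state=0).fit(X_gmm)

    # Free parameters for full covariances: k symmetric covariance matrices,
    # k mean vectors, and k - 1 independent weights.
    n_params = (k * n_features * (n_features + 1) // 2
                + k * n_features + (k - 1))

    # score() returns the mean log-likelihood per sample.
    log_like = gmm.score(X_gmm) * n_samples

    aic_manual.append(2 * n_params - 2 * log_like)
    bic_manual.append(n_params * np.log(n_samples) - 2 * log_like)
    aic_sklearn.append(gmm.aic(X_gmm))
    bic_sklearn.append(gmm.bic(X_gmm))

print('BIC (by hand): {}'.format(np.round(bic_manual, 1)))
print('BIC (sklearn): {}'.format(np.round(bic_sklearn, 1)))
```

Lower values indicate a preferred model. BIC penalizes extra parameters more strongly than AIC whenever $\ln n > 2$, that is, for more than about 7 data points.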

In [None]:
from sklearn.mixture import GaussianMixture

N = np.arange(1, 6)
models = [None for n in N]

for i in range(len(N)):
    models[i] = GaussianMixture(N[i], covariance_type='full').fit(Xtraining)

AIC = [m.aic(Xtraining) for m in models]
BIC = [m.bic(Xtraining) for m in models]

i_best = np.argmin(BIC)
gmm_best = models[i_best]
print('best fit converged:', gmm_best.converged_)
print('number of iterations =', gmm_best.n_iter_)
print('BIC: N components = %i' % N[i_best])

plt.plot(N, AIC, 'r-', label='AIC')
plt.plot(N, BIC, 'b--', label='BIC')
plt.xlabel('number of components')
plt.ylabel('information criterion')
plt.legend(loc=2, frameon=False)
plt.show()

Both information criteria prefer three modes. Let's color the points according to the predicted split of the training dataset.

In [None]:
cmap_bold = ListedColormap(['#FF0000', '#0000FF', '#00FF00',])
cmap_light = ListedColormap(['#FFAAAA', '#AAAAFF', '#AAFFAA'])

# select the three-component model explicitly (index 2 corresponds to N = 3)
gmm_best = models[2]

plt.scatter(Xtraining[:,0], Xtraining[:,1], c=gmm_best.predict(Xtraining), cmap=cmap_bold, alpha=0.7, s=6)
plt.xlim(18,14)
plt.ylim(0,1.2)
plt.xlabel(r'${\rm r}$', size=16)
plt.ylabel(r'${\rm g - r}$', size=16)

for mu, C, w in zip(gmm_best.means_, gmm_best.covariances_, gmm_best.weights_):
    draw_ellipse(mu, C, scales=[2], fc='none', ec='k')

Now we can predict the expected mode for any value of $r, g-r$ and color the whole space according to the predicted mode.

In [None]:
xx, yy = np.meshgrid(np.arange(18,14,-0.01), np.arange(0,1.2,0.01))

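# predicted mode (component index 0, 1, or 2) at every grid point
Z = gmm_best.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# one filled contour level per component
plt.contourf(xx, yy, Z, levels=[-0.5, 0.5, 1.5, 2.5], alpha=.8, cmap=cmap_light)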
plt.xlim(18,14)
plt.ylim(0,1.2)
plt.show()

### Nearest Neighbors

The [Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html) method can be used as either a supervised or an unsupervised method.
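
The difference between the two modes can be sketched on a toy dataset. The four points, their labels, and the query point below are invented for illustration. The unsupervised ``NearestNeighbors`` only reports which training points are closest to the query, while the supervised ``KNeighborsClassifier`` uses the labels of those neighbors to predict a class.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier

# Toy data (illustrative assumption): two pairs of nearby points.
points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
labels = np.array([0, 0, 1, 1])
query = np.array([[0.9, 0.9]])

# Unsupervised: find the 2 closest training points, no labels involved.
nn = NearestNeighbors(n_neighbors=2).fit(points)
distances, indices = nn.kneighbors(query)
print('indices of nearest points: {}'.format(indices[0]))
print('distances to them: {}'.format(np.round(distances[0], 3)))

# Supervised: vote among the 3 nearest labeled points,
# weighting each vote by the inverse of its distance to the query.
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn.fit(points, labels)
predicted = knn.predict(query)
print('predicted label: {}'.format(predicted[0]))
```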

Let's use the supervised version and assign the training labels two values (True or False) based on whether $g-r > 0.65$, which seems like a reasonable split by eye.

In [None]:
border = 0.65
target = (Xtraining[:,1] > border)

Now we initialize Sklearn's implementation of the [NN Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), fit it to the training dataset, and predict labels for the test set.

In [None]:
clf = neighbors.KNeighborsClassifier(n_neighbors=3, weights='distance')

# fit the classifier to the training data
clf.fit(Xtraining, target)

# apply fit to the test data
y = clf.predict(Xtest)

plt.scatter(Xtest[:,0], Xtest[:,1], c=y, cmap=cmap_bold, alpha=0.7, s=6)
plt.axhline(border)
plt.xlim(18,14)
plt.ylim(0,1.2)
plt.xlabel(r'${\rm r}$', size=16)
plt.ylabel(r'${\rm g - r}$', size=16)
plt.show()

Close-up view near the decision boundary.

In [None]:
plt.scatter(Xtest[:,0], Xtest[:,1], c=y, cmap=cmap_bold, alpha=0.7, s=6)
plt.axhline(border)
plt.xlim(18,14)
plt.ylim(0.45,0.85)
plt.xlabel(r'${\rm r}$', size=16)
plt.ylabel(r'${\rm g - r}$', size=16)
plt.show()

Create a colored map of predicted labels.

In [None]:
xx, yy = np.meshgrid(np.arange(18,14,-0.02), np.arange(0,1.2,0.02))

if hasattr(clf, "decision_function"):
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=.8, cmap=cmap_light)
plt.xlim(18,14)
plt.ylim(0.4,0.9)
plt.show()

### Support Vector Machines

[Support vector machines](http://scikit-learn.org/stable/modules/svm.html) (SVMs) are a set of supervised learning methods. SVMs draw a boundary between classes of data that maximizes the margin, that is, the perpendicular distance from the boundary to the nearest points of each class.
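
The margin is easiest to see for a linear kernel on separable data. The sketch below uses four invented points and a large value of ``C``, chosen here as an assumption so that the classifier approximates a hard margin. For a linear SVM with weight vector $w$, the full width of the margin is $2/|w|$.

```python
import numpy as np
from sklearn import svm

# Toy separable data (illustrative assumption): class 0 at x = 0, class 1 at x = 2.
X_svm = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_svm = np.array([0, 0, 1, 1])

# A large C strongly penalizes misclassified points (close to a hard margin).
clf_lin = svm.SVC(kernel='linear', C=1e6).fit(X_svm, y_svm)

# For a linear kernel the boundary is w . x + b = 0.
w = clf_lin.coef_[0]
b = clf_lin.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

print('w = {}, b = {:.3f}'.format(np.round(w, 3), b))
print('margin width = {:.3f}'.format(margin_width))
print('support vectors:')
print(clf_lin.support_vectors_)

# Points on either side of the boundary receive different labels.
side_labels = clf_lin.predict(np.array([[0.5, 0.5], [1.5, 0.5]]))
print('labels on either side: {}'.format(side_labels))
```

Only the support vectors, the training points lying on the edge of the margin, determine the position of the boundary.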

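Sklearn's implementation of the [SVM Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) has many options. Here we keep the default radial basis function (RBF) kernel and set only its ``gamma`` parameter.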

In [None]:
clf = svm.SVC(gamma=2)

# fit the classifier to the training data
clf.fit(Xtraining, target)

# apply fit to the test data
y = clf.predict(Xtest)

plt.scatter(Xtest[:,0], Xtest[:,1], c=y, cmap=cmap_bold, alpha=0.7, s=6)
plt.axhline(border)
plt.xlim(18,14)
plt.ylim(0,1.2)
plt.xlabel(r'${\rm r}$', size=16)
plt.ylabel(r'${\rm g - r}$', size=16)
plt.show()

### Random Forest

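The [Random Forest](http://scikit-learn.org/stable/modules/ensemble.html#forest) is an ensemble method built on the decision tree algorithm: it trains many decision trees on random subsets of the data and combines their predictions.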
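
To illustrate how the individual trees are combined, the sketch below trains a small forest on synthetic two-class data (an assumption for illustration, not the galaxy sample) and checks that the class probabilities reported by the forest match the average over its trees, which are available in the ``estimators_`` attribute.

```python
import numpy as np
from sklearn import ensemble

# Synthetic two-class data (illustrative assumption): two Gaussian blobs.
rng = np.random.RandomState(3)
X_rf = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                  rng.normal(3.0, 0.5, size=(50, 2))])
y_rf = np.repeat([0, 1], 50)

forest = ensemble.RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_rf, y_rf)

# Probe points: one in each blob and one halfway between them.
probe = np.array([[0.0, 0.0], [1.5, 1.5], [3.0, 3.0]])

# Average the class probabilities predicted by the individual trees.
tree_average = np.mean([tree.predict_proba(probe)
                        for tree in forest.estimators_], axis=0)
forest_proba = forest.predict_proba(probe)

print('number of trees: {}'.format(len(forest.estimators_)))
print('forest probabilities:')
print(forest_proba)
print('matches tree average: {}'.format(np.allclose(tree_average, forest_proba)))
```

Averaging over many trees reduces the tendency of a single decision tree to overfit its training set.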

Here we use Sklearn's implementation of the [RF Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with its default settings.

In [None]:
clf = ensemble.RandomForestClassifier()

# fit the classifier to the training data
clf.fit(Xtraining, target)

# apply fit to the test data
y = clf.predict(Xtest)

plt.scatter(Xtest[:,0], Xtest[:,1], c=y, cmap=cmap_bold, alpha=0.7, s=6)
plt.axhline(border)
plt.xlim(18,14)
plt.ylim(0,1.2)
plt.xlabel(r'${\rm r}$', size=16)
plt.ylabel(r'${\rm g - r}$', size=16)
plt.show()