# AS262 - Projects

# Photometric Redshift Calculation

## Background / Motivation 

A photometric redshift is an estimate of the recession velocity of a galaxy made without measuring its spectrum.  Instead, the technique uses photometry and broad-band colors to estimate the galaxy's redshift.  Using photometric redshifts to estimate the distances of faint galaxies has become an integral part of galaxy surveys conducted in recent years. This is driven by the large number of galaxies and their faint fluxes, which have made spectroscopic follow-up infeasible except for a relatively small and bright fraction of the galaxy population. Although less precise and less accurate than spectroscopic redshifts, photometric redshifts provide a way to estimate distances for galaxies too faint for spectroscopy or for samples too large for complete spectroscopic coverage to be practical.

## Project Outline: 
* Use supervised machine learning (regression, since redshift is a continuous quantity) to estimate the redshift of galaxies based on their broad-band photometry.


* The data used for this project will come from the [CEERS survey](https://ceers.github.io), which imaged a region of sky known as the Extended Groth Strip with JWST.  The photometry data is stored in an HDF5 file (similar to the FITS table we used in the Lecture 3 UVJ exercise, and read in with exactly the same routine, using `astropy.table.Table`).  This file will be made available on Filer.  Please note that the photometry data has not been published, so it should be considered proprietary for the time being.  I am allowing students to use this data since I am a member of the CEERS collaboration, and you can gain access to it through me.


* You'll most likely want to limit your analysis to relatively bright and/or massive galaxies.  This might mean only working with galaxies whose F356W magnitude is less than 26.5 and/or mass greater than $10^8 M_\odot$.  You should vary these cuts to determine how they affect the final results.


* You'll need to group the photometry as features in a single `X` array in the format that Scikit-Learn wants and then split the array into a training and test dataset using the `split_samples` routine, which we encountered in Lecture 16. The "known" redshifts come from spectroscopy, given in the table as 'z_spec'.  Note that not all galaxies have spectroscopy, so your training set can only use a subset of the full dataset.  You could perform cross-validation on some subset of this, and then apply your model to the remaining data and compare to the team's photometric redshifts (given as 'ZA_finkelstein' or 'z_phot').


* Try **at least two** supervised learning algorithms and compare their effectiveness by plotting the true redshift against the predicted redshift.  The plot should look similar to this (although the table I'm providing you doesn't list different quality flags for the spectroscopic redshifts, so all data points would be the same color):


<img src=https://www.colby.edu/physics/faculty/mcgrath/AS262/photoz.png width="500">


* A quantitative measure of how well the model is working can be calculated using the Normalized Median Absolute Deviation, where $\Delta z = z_{\rm pred} - z_{\rm true}$:

$$ \sigma_{\rm NMAD} = 1.48\times {\rm median}\left(\frac{|\Delta z|}{1+z_{\rm true}}\right) $$





* Potential algorithms include (note that these are the regression versions, since redshift is continuous):
    * Decision Tree Regressor (`sklearn.tree.DecisionTreeRegressor`)
    * Random Forest Regressor (`sklearn.ensemble.RandomForestRegressor`)
    * K-Neighbors Regressor (`sklearn.neighbors.KNeighborsRegressor`)
    * Support Vector Regressor (`sklearn.svm.SVR`)
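
The comparison described above could be sketched roughly as follows. This is only an illustration on synthetic data: the fake "magnitudes", the made-up relation between color and redshift, and the hyperparameters (`max_depth=8`, `n_neighbors=10`) are all assumptions chosen for the demo, and `sigma_nmad` is a helper name introduced here. The split uses Scikit-Learn's `train_test_split` as a stand-in for the `split_samples` routine from lecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def sigma_nmad(z_true, z_pred):
    """sigma_NMAD = 1.48 * median(|dz| / (1 + z_true)), as defined above."""
    z_true = np.asarray(z_true, dtype=float)
    z_pred = np.asarray(z_pred, dtype=float)
    dz = z_pred - z_true
    return 1.48 * np.median(np.abs(dz) / (1.0 + z_true))


# Synthetic stand-in for the real catalog (NOT CEERS data):
# 7 fake magnitudes per object and a made-up color-redshift relation.
rng = np.random.default_rng(1)
X = rng.normal(25.0, 1.0, size=(1500, 7))
z = 2.0 + 0.8 * (X[:, 0] - X[:, 4]) + rng.normal(0.0, 0.1, size=1500)
z = np.clip(z, 0.0, None)  # keep the fake redshifts non-negative

X_train, X_test, z_train, z_test = train_test_split(
    X, z, test_size=0.25, random_state=0
)

# Two candidate regressors; the hyperparameters are arbitrary starting points.
models = {
    "decision tree": DecisionTreeRegressor(max_depth=8, random_state=0),
    "k-neighbors": KNeighborsRegressor(n_neighbors=10),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, z_train)
    scores[name] = sigma_nmad(z_test, model.predict(X_test))
    print(f"{name}: sigma_NMAD = {scores[name]:.3f}")
```

For the project itself, `X` and `z` would be replaced by the CEERS photometry and `z_spec`, and each model's `predict` output would be plotted against the true redshift.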

## Example

Here's an example of using Decision Tree regression to calculate the photometric redshift of galaxies in the SDSS dataset.

In [None]:
# Author: Jake VanderPlas
# License: BSD
#   The figure produced by this code is published in the textbook
#   "Statistics, Data Mining, and Machine Learning in Astronomy" (2013)
#   For more information, see http://astroML.github.com
#   To report a bug or issue, use the following forum:
#    https://groups.google.com/forum/#!forum/astroml-general
import numpy as np
from matplotlib import pyplot as plt
# import seaborn as sns; sns.set()
%matplotlib inline
%config InlineBackend.figure_format='retina'

from sklearn.tree import DecisionTreeRegressor
from astroML.datasets import fetch_sdss_specgals

#----------------------------------------------------------------------
# This function adjusts matplotlib settings for a uniform feel in the textbook.
# Note that with usetex=True, fonts are rendered with LaTeX.  This may
# result in an error if LaTeX is not installed on your system.  In that case,
# you can set usetex to False.
if "setup_text_plots" not in globals():
    from astroML.plotting import setup_text_plots
setup_text_plots(fontsize=8, usetex=True)

#------------------------------------------------------------
# Fetch data and prepare it for the computation
data = fetch_sdss_specgals()

In [None]:
# put magnitudes in a matrix
mag = np.vstack([data['modelMag_%s' % f] for f in 'ugriz']).T
z = data['z']

# train on ~60,000 points
mag_train = mag[::10]
z_train = z[::10]

# test on ~6,000 separate points
mag_test = mag[1::100]
z_test = z[1::100]

In [None]:
#------------------------------------------------------------
# Compute the cross-validation scores for several tree depths
depth = np.arange(1, 21)
rms_test = np.zeros(len(depth))
rms_train = np.zeros(len(depth))
i_best = 0
z_fit_best = None

for i, d in enumerate(depth):
    clf = DecisionTreeRegressor(max_depth=d, random_state=0)
    clf.fit(mag_train, z_train)

    z_fit_train = clf.predict(mag_train)
    z_fit = clf.predict(mag_test)
    # root-mean-square error: average the squared residuals first, then take the square root
    rms_train[i] = np.sqrt(np.mean((z_fit_train - z_train) ** 2))
    rms_test[i] = np.sqrt(np.mean((z_fit - z_test) ** 2))

    if rms_test[i] <= rms_test[i_best]:
        i_best = i
        z_fit_best = z_fit

best_depth = depth[i_best]

In [None]:
#------------------------------------------------------------
# Plot the results
fig = plt.figure(figsize=(12, 6))
fig.subplots_adjust(wspace=0.25,
                    left=0.1, right=0.95,
                    bottom=0.15, top=0.9)

# first panel: cross-validation
ax = fig.add_subplot(121)
ax.plot(depth, rms_test, '-k', label='cross-validation')
ax.plot(depth, rms_train, '--k', label='training set')
ax.set_xlabel('depth of tree', size=20)
ax.set_ylabel('rms error', size=20)
ax.yaxis.set_major_locator(plt.MultipleLocator(0.01))
ax.set_xlim(0, 21)
ax.set_ylim(0.009,  0.04)
ax.legend(loc=1, fontsize='large')

# second panel: best-fit results
ax = fig.add_subplot(122)
edges = np.linspace(z_test.min(), z_test.max(), 101)
H, zs_bins, zp_bins = np.histogram2d(z_test, z_fit_best, bins=edges)
ax.imshow(H.T, origin='lower', interpolation='nearest', aspect='auto',
          extent=[zs_bins[0], zs_bins[-1], zp_bins[0], zp_bins[-1]],
          cmap=plt.cm.binary)
ax.plot([-0.1, 0.4], [-0.1, 0.4], ':k')
ax.text(0.04, 0.96, "depth = %i\nrms = %.3f" % (best_depth, rms_test[i_best]),
        ha='left', va='top', transform=ax.transAxes, size=12)
ax.set_xlabel(r'$z_{\rm true}$', size=20)
ax.set_ylabel(r'$z_{\rm fit}$', size=20)

ax.set_xlim(-0.02, 0.4001)
ax.set_ylim(-0.02, 0.4001)
ax.xaxis.set_major_locator(plt.MultipleLocator(0.1))
ax.yaxis.set_major_locator(plt.MultipleLocator(0.1))

plt.show()
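
The depth scan above scores each tree on a single held-out subset. A k-fold version of the same idea could look something like the sketch below, which uses Scikit-Learn's `GridSearchCV`. The data here are synthetic stand-ins (not SDSS or CEERS), and the 5-fold setting and the names `best_depth_cv` and `best_rms_cv` are choices made for this illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (magnitudes, redshift): 5 fake bands, 2000 objects.
rng = np.random.default_rng(0)
X = rng.normal(20.0, 1.5, size=(2000, 5))
z = 0.1 + 0.02 * (X[:, 1] - X[:, 3]) + rng.normal(0.0, 0.01, size=2000)

# Try tree depths 1..20, scoring each with 5-fold cross-validated RMS error.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": list(range(1, 21))},
    scoring="neg_root_mean_squared_error",  # sklearn maximizes, hence "neg"
    cv=5,
)
search.fit(X, z)

best_depth_cv = search.best_params_["max_depth"]
best_rms_cv = -search.best_score_  # flip the sign back to a positive RMS
print(f"best depth = {best_depth_cv}, cross-validated rms = {best_rms_cv:.4f}")
```

Averaging over folds makes the chosen depth less sensitive to which objects happen to land in the test subset, at the cost of fitting each tree several times.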

Some preliminaries to get you started using the CEERS dataset:

In [4]:
import numpy as np
from astropy.table import Table
import matplotlib.pyplot as plt

In [9]:
# Reading HDF5 tables with astropy requires the h5py package (e.g. `pip install h5py`).
photometry_file = "ceers_all_v0.51_eazy.hdf5" # Note, you'll need to edit this path to wherever you store the data.
photometry = Table.read(photometry_file)
photometry


In [None]:
# Columns of interest in the photometry file:
log_mass = photometry['fast_lmass']
# To calculate magnitudes, you need the zeropoint (31.4), since the catalog lists fluxes in units of nano-Jansky:
f356w_mag = -2.5*np.log10(photometry['FLUX_356'])+31.4
# It's the same zeropoint for all JWST filters: F115W, F150W, F200W, F277W, F356W, F410M, F444W.  There is also HST photometry data in this table.  Come talk to me to get help with using these.

# photometric redshifts (two different sets done by two different CEERS team members):
z_phot = photometry['z_phot']
# or, alternatively you could use a different set of phot-z's:
z_phot_finkelstein = photometry['ZA_finkelstein']

# Spectroscopic redshifts (values of -99 or -1 should be excluded):
z_spec = photometry['z_spec']
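
One possible way to go from these columns to a training and test set is sketched below. To keep it runnable without the data file, it builds a fake catalog as a dictionary of arrays (column access by name works the same way on an astropy `Table`). Only `FLUX_356`, `fast_lmass`, and `z_spec` are column names taken from above; the other `FLUX_*` names are hypothetical placeholders, so check the real column names in the table. `train_test_split` is used here as a stand-in for `split_samples`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000

# Fake catalog standing in for `photometry` (NOT real CEERS values).
# Band names other than FLUX_356 are hypothetical placeholders.
bands = ["FLUX_115", "FLUX_150", "FLUX_200", "FLUX_277",
         "FLUX_356", "FLUX_410", "FLUX_444"]
catalog = {b: rng.lognormal(mean=5.0, sigma=1.0, size=n) for b in bands}
catalog["fast_lmass"] = rng.uniform(7.0, 11.0, size=n)
fake_z_spec = rng.uniform(0.0, 6.0, size=n)
fake_z_spec[rng.random(n) < 0.6] = -99.0  # most objects lack spectroscopy
catalog["z_spec"] = fake_z_spec

# Convert fluxes (nJy) to AB magnitudes with the 31.4 zeropoint.
mags = np.column_stack([-2.5 * np.log10(catalog[b]) + 31.4 for b in bands])
f356w_mag = mags[:, bands.index("FLUX_356")]

# Keep objects with a valid z_spec (this drops -99 and -1), finite magnitudes,
# and the brightness/mass cuts suggested in the outline.
good = (
    (catalog["z_spec"] >= 0)
    & np.all(np.isfinite(mags), axis=1)
    & (f356w_mag < 26.5)
    & (catalog["fast_lmass"] > 8.0)
)

X = mags[good]               # shape (n_galaxies, n_bands), as Scikit-Learn expects
y = catalog["z_spec"][good]  # "known" redshifts used as the regression target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```

With real data, non-positive fluxes produce NaN or infinite magnitudes (and NumPy warnings), which is why the mask includes the `np.isfinite` check. Colors (differences between magnitudes) could be used as features instead of, or in addition to, the magnitudes themselves.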

