### Part 3: Creating and testing a photometric redshift estimator

Goals: create a photometric redshift estimator using the scikit-learn toolkit and test it out

Specifics: I have provided you with some prepared photometric reference data, which includes cross-matched objects with known redshifts.   You will want to:

1. prepare the data for consumption by the machine learning algorithms,
2. use part of this data to train a regression model,
3. apply that regression model to the remainder of the data,
4. investigate how well the regression model performed.


If you want to see what things should look like, you can have a look:

1. in the notebook [05_ExploreRedshift.ipynb](https://github.com/KIPAC/MACSS/blob/main/nb/05_ExploreRedshift.ipynb) to see an exploration of the features in the data that can be used to extract redshift information.

2. in the notebook [06_SklearnRegression.ipynb](https://github.com/KIPAC/MACSS/blob/main/nb/06_SklearnRegression.ipynb) to see examples of running several different types of estimation algorithms.



#### Standard imports

In [None]:
import os
import tables_io
import numpy as np
import matplotlib.pyplot as plt

#### Change this to match the correct location

In [None]:
HOME = os.environ['HOME']
pz_dir = f'{HOME}/macss'

#### Here we open two files, the "train" and "test" datasets

The idea here is that we took the prepared data and split it into two parts: one for training the regression method and one for testing it.

In [None]:
train = tables_io.read(f"{pz_dir}/data/dp1_matched_v4_train.hdf5")
test = tables_io.read(f"{pz_dir}/data/dp1_matched_v4_test.hdf5")

Here are a few empty cells to explore the test and train data

#### Now you want to write a function to extract "features" and "targets" from the input data

"features" are the quantities that the estimator will use to estimate the redshifts,
"targets" are the values that the estimator should predict, i.e., the redshifts.

"features" should be a 2D array of shape (number of objects, number of features).
"targets" should be a 1D array with one entry per object, i.e., the same length as the first axis of "features".

In [None]:
def extraFeaturesAndTargets(inputData):
    features = np.nan
    targets = np.nan
    return features, targets
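As one possible starting point, here is a minimal sketch of such a function on a toy table. It assumes the table behaves like a mapping from column names to 1D arrays, and it uses magnitudes plus adjacent-band colors as features, which is a common choice for photometric redshifts but not necessarily the one used in the reference notebooks. The helper name and the column names (`mag_g`, `mag_r`, `mag_i`, `redshift`) are hypothetical; check the actual column names in `train` before adapting it.

```python
import numpy as np

def extract_features_and_targets_sketch(table, mag_cols, target_col):
    """Hypothetical helper: build a 2D feature array and a 1D target array."""
    # One column per band, one row per object
    mags = np.column_stack([np.asarray(table[col], dtype=float) for col in mag_cols])
    # Adjacent-band colors, e.g. g-r and r-i
    colors = mags[:, :-1] - mags[:, 1:]
    features = np.hstack([mags, colors])
    targets = np.asarray(table[target_col], dtype=float)
    # Keep only rows where every feature and the target are finite
    good = np.isfinite(features).all(axis=1) & np.isfinite(targets)
    return features[good], targets[good]

# Toy data with made-up column names and values (not the real catalog)
toy_table = {
    "mag_g": np.array([22.1, 23.4, np.nan, 21.0]),
    "mag_r": np.array([21.5, 22.9, 22.0, 20.2]),
    "mag_i": np.array([21.2, 22.5, 21.7, 19.9]),
    "redshift": np.array([0.31, 0.85, 0.52, 0.12]),
}
toy_features, toy_targets = extract_features_and_targets_sketch(
    toy_table, ["mag_g", "mag_r", "mag_i"], "redshift"
)
print(toy_features.shape, toy_targets.shape)
```

Dropping rows with non-finite values matters because most scikit-learn regressors raise an error when the input contains NaN.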

In [None]:
# The function returns (features, targets), so unpack in that order
train_features, train_targets = extraFeaturesAndTargets(train)
test_features, test_targets = extraFeaturesAndTargets(test)

#### Here we import all the different estimation methods

A "Regressor" is a Python object that implements a machine learning method to build a regression model, which tries to predict a value from features of the data.

In [None]:
from sklearn.ensemble import (HistGradientBoostingRegressor, ExtraTreesRegressor, AdaBoostRegressor)
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import HuberRegressor, LinearRegression, QuantileRegressor
from sklearn.svm import NuSVR
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor

In [None]:
knr = KNeighborsRegressor()
#hbr = HistGradientBoostingRegressor()
#etr = ExtraTreesRegressor()
#abr = AdaBoostRegressor()
#gpr = GaussianProcessRegressor()
#isr = IsotonicRegression()
#hur = HuberRegressor()
#lir = LinearRegression()
#qur = QuantileRegressor()
#nsr = NuSVR()
#rnr = RadiusNeighborsRegressor()

In [None]:
def run_regression(reg, train_features, train_targets, test_features):
    reg.fit(train_features, train_targets)
    return reg.predict(test_features)
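The fit-then-predict pattern used in `run_regression` can be tried out on synthetic data before the real features are ready. The sketch below is only an illustration: the toy arrays, the 150/50 split, and the choice of `n_neighbors=5` are arbitrary assumptions, not values taken from the reference notebooks.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: 200 objects, 2 features, and a target that depends on both
rng = np.random.default_rng(42)
toy_x = rng.uniform(0.0, 1.0, size=(200, 2))
toy_y = toy_x[:, 0] + 0.5 * toy_x[:, 1]

# Train on the first 150 objects, predict for the remaining 50
toy_reg = KNeighborsRegressor(n_neighbors=5)
toy_reg.fit(toy_x[:150], toy_y[:150])
toy_preds = toy_reg.predict(toy_x[150:])
print(toy_preds.shape)
```

Every scikit-learn regressor imported above exposes the same `fit` and `predict` methods, which is why a single helper function can run any of them.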

In [None]:
preds = run_regression(knr, train_features, train_targets, test_features)

#### Looking at how it did

Here is a simple 2D histogram comparing the reference redshifts in the test data to the estimates.

In [None]:
grid = np.linspace(0, 1.5, 151)
_ = plt.hist2d(test_targets, preds, bins=[grid, grid], norm='log', cmap='gray')
_ = plt.xlabel(r'$z_{\rm ref}$')
_ = plt.ylabel(r'$z_{\rm est}$')
_ = plt.colorbar()

#### Compute and add some quantitative measures of the performance to the figure.

At this point you can make some quantitative measures of the performance, for example how accurately the estimates match the reference redshifts.  Come up with a few measures of the performance and add them to a better version of the figure.
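As one possible set of measures, the sketch below computes three quantities that are commonly quoted for photometric redshifts: the median of the scaled residual (z_est - z_ref) / (1 + z_ref), its normalized median absolute deviation, and the fraction of objects whose scaled residual exceeds a threshold. The function name, the toy values, and the threshold of 0.15 are assumptions made for illustration; other definitions are equally valid.

```python
import numpy as np

def photoz_metrics_sketch(z_ref, z_est, outlier_cut=0.15):
    """Hypothetical helper: summary statistics of the scaled redshift residuals."""
    z_ref = np.asarray(z_ref, dtype=float)
    z_est = np.asarray(z_est, dtype=float)
    # Scaled residual, so that errors are compared relative to (1 + z)
    dz = (z_est - z_ref) / (1.0 + z_ref)
    bias = np.median(dz)
    # Normalized median absolute deviation: a robust estimate of the scatter
    sigma_nmad = 1.4826 * np.median(np.abs(dz - bias))
    # Fraction of objects with a large scaled residual
    outlier_frac = np.mean(np.abs(dz) > outlier_cut)
    return {"bias": bias, "sigma_nmad": sigma_nmad, "outlier_frac": outlier_frac}

# Toy values (made up): the fourth estimate is badly wrong
toy_ref = np.array([0.2, 0.5, 0.8, 1.0, 1.2])
toy_est = np.array([0.22, 0.48, 0.83, 1.9, 1.18])
toy_metrics = photoz_metrics_sketch(toy_ref, toy_est)
print(toy_metrics)
```

With the real data, the same function could be called on `test_targets` and `preds`, and the resulting numbers placed on the figure with `plt.text` or in the title.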