3rd ASTERICS-OBELICS International School - Annecy, France - 8-12 April 2019

### Machine Learning Tutorial

# Section 1.c - Supervised Learning: regression
by [Emille Ishida](https://www.emilleishida.com/)

### *Take home message 3: choosing a machine learning algorithm is an art!*

**Goal:** Get acquainted with basic machine learning algorithms for regression

**Task**: Estimate the redshift based on photometric magnitudes  

**Data**: Extract from the [Teddy photometric redshift catalog](https://github.com/COINtoolbox/photoz_catalogues)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;First presented by [Beck et al., 2017, MNRAS, 468 (4323)](https://cosmostatistics-initiative.org/portfolio-item/representativeness-photoz/)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;5000 objects for training (teddy_A)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;5000 objects for testing (teddy_B)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Features:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$mag\_r$: standardized r-band magnitude  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$u-g$: standardized SDSS u-g color  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$g-r$: standardized SDSS g-r color  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$r-i$: standardized SDSS r-i color  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$i-z$: standardized SDSS i-z color  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$z\_spec$: spectroscopic redshift (label)  

In [None]:
# import some basic libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

### Step 1: Digest the data

As always, we start by loading and visualizing the data 

In [None]:
# read the data
data = pd.read_csv('../data/teddy_A.csv')

# check available columns (features)
data.keys()

In [None]:
# check dimensionality of the data
data.shape

We see from the documentation that the test data is given in a separate file.  
As a consequence, we only need to split the training data into training and validation samples.

In [None]:
# separate 80% for training and 20% for validation

# check your samples (size, features, etc.)


In [None]:
# plot the data
g = sns.PairGrid(data, diag_sharey=False)
g.map_lower(sns.kdeplot)                      
g.map_upper(sns.scatterplot)
g.map_diag(sns.kdeplot, lw=3)
plt.show()

### Step 2: Train a few regression models

Using [scikit-learn](https://scikit-learn.org/stable/) we are able to quickly train a set of algorithms: 


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2.a) [Linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html):

In [None]:
from sklearn import linear_model

# Create linear regression object
regr = linear_model.LinearRegression()

# train
regr.fit(X_train, y_train)

# estimate the photoz
photoz_linear_val = regr.predict(X_validation)

# quality of the fit
R2_linear_val = regr.score(X_validation, y_validation)
R2_linear_val
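
The `score` method of scikit-learn regressors returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, where 1 is a perfect fit and 0 is no better than always predicting the mean. The sketch below checks this by hand. The data are synthetic (an assumed linear relation plus noise), so the numbers say nothing about the Teddy catalog.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 5 features and a noisy linear "redshift" (illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X @ np.array([0.05, 0.02, 0.1, 0.03, 0.01]) + 0.3 + rng.normal(scale=0.02, size=500)

# first 400 objects for training, last 100 for validation
model = LinearRegression().fit(X[:400], y[:400])
y_true = y[400:]
y_pred = model.predict(X[400:])

# R^2 computed from its definition
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# R^2 as returned by scikit-learn
r2_sklearn = model.score(X[400:], y_true)
print(r2_manual, r2_sklearn)
```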

There is not much more to optimize in this simple model, so we can use the trained algorithm to estimate the redshift in the test sample:

In [None]:
# read the test sample
data_test = pd.read_csv('../data/teddy_B.csv')

# check the features
data_test.keys()

In [None]:
# estimate the photoz
photoz_linear_test = regr.predict(data_test[['mag_r', 'u-g', 'g-r', 'r-i', 'i-z']])

# quality of the fit
R2_linear_test = regr.score(data_test[['mag_r', 'u-g', 'g-r', 'r-i', 'i-z']], data_test[['z_spec']])
R2_linear_test

In [None]:
# plot result
sns.set_style('ticks')
fig = plt.figure()
plt.title('Teddy catalog: A-> B, Linear reg. score: ' + str(round(R2_linear_test,2)))
plt.scatter(data_test[['z_spec']], photoz_linear_test, marker='x')
plt.plot([0,0.65], [0,0.65], color='red', lw=2, ls='--')
plt.xlabel('true redshift')
plt.ylabel('estimated redshift')
plt.show()

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2.b) [Nearest Neighbor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html):

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Here we have a little more room for improvement.  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Try changing the number of neighbors, or other parameters (check documentation), to improve the quality of the fit. 

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# choose number of neighbors
nn = 9

# initiate a KNN instance

# fit the model using training data

# estimate photometric redshift for the validation data

# quality of the fit
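
If you are unsure where to start, the sketch below shows one way to scan the number of neighbors and compare validation scores. It is not the reference solution: the candidate values are arbitrary, and the data are synthetic stand-ins rather than the Teddy samples.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data with a mildly non-linear "redshift" (illustration only)
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))
y = 0.3 + 0.1 * np.tanh(X[:, 0]) + 0.05 * X[:, 2] ** 2 + rng.normal(scale=0.02, size=600)
X_tr, y_tr = X[:480], y[:480]       # 80% for training
X_val, y_val = X[480:], y[480:]     # 20% for validation

# try a few values for the number of neighbors (arbitrary choices)
candidates = [1, 3, 5, 9, 15, 25]
scores = {}
for k in candidates:
    knn = KNeighborsRegressor(n_neighbors=k)   # initiate a KNN instance
    knn.fit(X_tr, y_tr)                        # fit using training data
    scores[k] = knn.score(X_val, y_val)        # R^2 on the validation data

# number of neighbors with the highest validation R^2
best_k = max(scores, key=scores.get)
print(scores)
print('best number of neighbors:', best_k)
```

Choosing the parameters on the validation sample, and not on the test sample, keeps the test score an honest estimate of the final performance.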


Once you are happy with the optimization, estimate the photometric redshift values for the test sample:

In [None]:
# estimate the photoz

# quality of the fit


In [None]:
# plot result


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2.c) [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html):

Here we have even more freedom. To begin with, try playing with the number of trees in your forest and the maximum depth allowed for each tree.  
**How do your regression results change?**

In [None]:
from sklearn.ensemble import RandomForestRegressor

# choose number of trees in the forest
n_trees =

# define maximum depth, None => split continues until the leaves are pure
depth = 

# initiate a Random Forest instance

# train the model


# estimate the photometric redshift for the validation sample

# quality of the fit
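
As a starting point, the sketch below trains a few forests with different numbers of trees and maximum depths and compares their validation scores. The parameter values are arbitrary examples, `random_state` is fixed only to make the run reproducible, and the data are synthetic stand-ins rather than the Teddy samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with a mildly non-linear "redshift" (illustration only)
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 5))
y = 0.3 + 0.1 * np.tanh(X[:, 0]) + 0.05 * X[:, 2] ** 2 + rng.normal(scale=0.02, size=600)
X_tr, y_tr = X[:480], y[:480]       # 80% for training
X_val, y_val = X[480:], y[480:]     # 20% for validation

# small grid over number of trees and maximum depth (arbitrary choices)
results = {}
for n_trees in (10, 50):
    for depth in (3, None):
        forest = RandomForestRegressor(n_estimators=n_trees, max_depth=depth,
                                       random_state=0)
        forest.fit(X_tr, y_tr)                                   # train the model
        results[(n_trees, depth)] = forest.score(X_val, y_val)   # validation R^2

for (n_trees, depth), r2 in results.items():
    print('n_trees =', n_trees, '| max_depth =', depth, '| R2 =', round(r2, 3))
```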


If you are satisfied, see how your regression performs on the test sample:

In [None]:
# estimate the photoz

# quality of the fit


In [None]:
# plot result


#### ... feel free to try other algorithms if you wish

### Step 3: Compare results

Let's take a look at the results we have so far:

In [None]:

print('                         Test sample   Validation sample')
print('Linear regression: ', R2_linear_test, R2_linear_val)
print('kNN:               ', R2_knn_test, R2_knn_val)
print('Random Forest:     ', R2_randforest_test, R2_randforest_val)

These results seem pretty stable, which gives us further assurance that the results from the machine learning algorithms are consistent.

**Can you guess which characteristics of the data help this stability?**

Answer: 

### Food for thought:

In the `data` folder of the GitHub repository there are two other files: `teddy_C` and `teddy_D`.   
These files should be used only for testing.  

Try applying your trained regression models to these data sets and compare the results with the ones above. 

As always, remember to weigh your expectations before you start.

**Are the results any different? If so, can you guess why?**