In this simple notebook we use a fully connected neural network to solve a previously seen problem in regression: the photometric redshift problem.

It accompanies Chapter 8 of the book.

Author: Viviana Acquaviva, with contributions by Jake Postiglione and Olga Privman.


In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.utils import shuffle

In [None]:
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 150)

font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14) 
matplotlib.rcParams['figure.dpi'] = 300

TensorFlow is a widely used library for developing Deep Learning models. It is an open-source platform developed by Google, and it supports several programming languages, e.g. C++, Java, Python, and many others.

Keras is a high-level API (Application Programming Interface) that is built on top of TensorFlow (or Theano, another Deep Learning library). It is Python-specific, and we can think of it as the equivalent of the sklearn library for neural networks. It is less general and less customizable than TensorFlow, but it is very user-friendly and comparatively easier to use. We will use Keras with the TensorFlow back-end.

In [None]:
import tensorflow as tf

In [None]:
tf.__version__

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
import keras

from keras.models import Sequential #the model is built adding layers one after the other

from keras.layers import Dense #fully connected layers: every output talks to every input

from keras.layers import Dropout #for regularization

In [None]:
# from google.colab import drive
# drive.mount('/gdrive')
# %cd /gdrive

### Problem 2: photometric redshifts

I will start out from the reduced (high-quality) data set we used for Bagging and Boosting methods. For reference, our best model achieved an NMAD around 0.02 and an outlier fraction of 4%.

In [None]:
X = pd.read_csv('../data/sel_features.csv', sep = '\t')
y = pd.read_csv('../data/sel_target.csv')

In [None]:
X,y = shuffle(X,y, random_state = 12)

In [None]:
fifth = int(len(y)/5)

In [None]:
X_train = X.values[:3*fifth,:]
y_train = y[:3*fifth]

X_val = X.values[3*fifth:4*fifth,:]
y_val = y[3*fifth:4*fifth]

X_test = X.values[4*fifth:,:]
y_test = y[4*fifth:]

We know that we need to scale!

In [None]:
scaler = StandardScaler()

scaler.fit(X_train)

In [None]:
Xst_train = scaler.transform(X_train)
Xst_val = scaler.transform(X_val)
Xst_test = scaler.transform(X_test)
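As a reminder of what the scaling step does, here is a minimal NumPy-only sketch of the same operation. The array `X_demo` is a hypothetical stand-in for the six features (it is not the notebook's data), and the 600/400 split is arbitrary. The point it illustrates is that the mean and standard deviation are computed on the training rows only and then reused for the other sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the six photometric features (not the real data)
X_demo = rng.normal(loc=20.0, scale=3.0, size=(1000, 6))
X_tr, X_te = X_demo[:600], X_demo[600:]

# Statistics come from the training rows only (the role of scaler.fit)
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

# The same statistics are applied to every set (the role of scaler.transform)
X_tr_st = (X_tr - mu) / sigma
X_te_st = (X_te - mu) / sigma

print(X_tr_st.mean(axis=0).round(3), X_tr_st.std(axis=0).round(3))
print(X_te_st.mean(axis=0).round(3), X_te_st.std(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1 by construction; the test columns are only approximately standardized, which is expected.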

In a regression problem, we will choose a different activation for the output layer (e.g. linear), and a different loss function (MSE, MAE, ...).

Our input layer has six neurons for this problem.

For other parameters and the network structure, we can start with two layers with 100 neurons and go from there.
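To get a feeling for the size of this starting network, the short sketch below counts its trainable parameters. The helper name `count_dense_params` is just an illustration; the layer sizes are the ones quoted above (six inputs, two hidden layers of 100 neurons, one output).

```python
# Layer sizes from the text: 6 inputs, two hidden layers of 100 neurons, 1 output
layer_sizes = [6, 100, 100, 1]

def count_dense_params(sizes):
    """Each fully connected layer has n_in * n_out weights plus n_out biases."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

n_params = count_dense_params(layer_sizes)
print(n_params)  # 10901
```

This should match the total reported by `model.summary()` for the same architecture.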

In [None]:
dir(keras.activations)

In [None]:
dir(keras.losses)

In [None]:
model = Sequential()

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Add the first hidden layer and specify the input size (number of original features)

model.add(Dense(100, activation='relu', input_shape=(6,)))

#model.add(Dropout(0.2))

# Add one hidden layer and specify its size

model.add(Dense(100, activation='relu'))

#model.add(Dropout(0.2))

# Add one hidden layer and specify its size

#model.add(Dense(30, activation='relu'))

# Add one hidden layer and specify its size

#model.add(Dense(12, activation='relu'))

#model.add(Dropout(0.2))

# Add an output layer 

model.add(Dense(1, activation='linear'))

model.compile(loss='mse', optimizer=optimizer)


We begin with 100 epochs and batch size = 300.

In [None]:
mynet = model.fit(Xst_train, y_train, validation_data= (Xst_val, y_val), epochs=100, batch_size=300)

In [None]:
results = model.evaluate(Xst_test, y_test)
print('MSE:', results) #we are only monitoring MSE

As usual, we can plot the loss.

In [None]:
plt.plot(mynet.history['loss'], label = 'train')
plt.plot(mynet.history['val_loss'],'-.m', label = 'validation')
plt.ylabel('Loss', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(loc='upper right', fontsize = 12);
#plt.savefig('Photoz_NN.png')

In [None]:
plt.figure(figsize=(5,5))
    
plt.xlabel('True redshift', fontsize = 14)
plt.ylabel('Estimated redshift', fontsize = 14)

plt.scatter(y_test, model.predict(Xst_test), s =10, c = 'teal');

plt.xlim(0,2)
plt.ylim(0,2)
plt.tight_layout()
#plt.savefig('Photoz_NN_scatter.png')

In [None]:
ypred = model.predict(Xst_test)

### Learning Check-in
    
Calculate the Outlier Fraction and the Normalized Median Absolute Deviation for this set of predictions.

<br>

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```python
print(len(np.where(np.abs(y_test-ypred)>0.15*(1+y_test))[0])/len(y_test))

print(1.48*np.median(np.abs(y_test-ypred)/(1 + y_test)))
```

</p>
</details>
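Since the same two metrics are computed several times in this notebook, it can be convenient to wrap them in a small helper. The sketch below is one possible way to do it; the function name `photoz_metrics` and the toy arrays are hypothetical. Flattening both inputs also guards against accidental broadcasting when one array has shape (n,) and the other (n,1).

```python
import numpy as np

def photoz_metrics(y_true, y_pred, threshold=0.15):
    """Return (outlier fraction, NMAD) for a set of redshift predictions."""
    # Flatten so that shapes (n,) and (n, 1) cannot broadcast to (n, n)
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    norm_resid = np.abs(y_true - y_pred) / (1 + y_true)
    olf = np.mean(norm_resid > threshold)   # fraction with |dz| > 0.15 (1 + z)
    nmad = 1.48 * np.median(norm_resid)     # normalized median absolute deviation
    return olf, nmad

# Toy example: four objects, the last one is a catastrophic outlier
y_toy = np.array([0.5, 1.0, 1.5, 2.0])
ypred_toy = np.array([0.6, 1.0, 1.5, 3.0])

olf, nmad = photoz_metrics(y_toy, ypred_toy)
print('OLF', olf)
print('NMAD', nmad)
```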

To further improve, we can tune the parameters; in my opinion, it is especially interesting to see the effect of using different losses on the residuals, and to try adding more layers.

### Let's try some optimization with keras tuner

In [None]:
# !pip3 install -U keras-tuner

In [None]:
from kerastuner.tuners import RandomSearch
from tensorflow.keras import layers

#Some material below is adapted from the Keras Tuner documentation

# https://keras-team.github.io/keras-tuner/

This function specifies which parameters we want to tune. Tunable parameters can be of type "Choice" (we specify a set), Int, Boolean, or Float.

In [None]:
def build_model(hp):
    model = keras.Sequential()
    for i in range(hp.Int('num_layers', 2, 6)):
        model.add(layers.Dense(units=hp.Int('units_' + str(i),
                                            min_value=100,
                                            max_value=500,
                                            step=100),
                               activation='relu'))
    model.add(Dense(1, activation='linear')) #last one
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
        loss='mse')
    return model

Next, we specify how we want to explore the parameter space. The Random Search is the simplest choice, but often quite effective; alternatives are Hyperband (optimized Random Search where a larger fraction of models is trained for a smaller number of epochs, but only the most promising ones survive), or Bayesian Optimization, which attempts to build a probabilistic interpretation of the model scores (the posterior probability of obtaining score x, given the values of hyperparameters).
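To make the idea of a random search concrete, here is a plain-Python sketch that draws configurations from the same search space defined in `build_model`. It does not use Keras Tuner; the function `sample_config` and the dictionary layout are assumptions made for illustration. It also counts how many distinct architectures the space contains, to show how small a fraction a modest number of trials explores.

```python
import random

def sample_config(rng):
    """Draw one configuration from the search space used in build_model."""
    num_layers = rng.randint(2, 6)  # both ends inclusive
    units = [rng.choice(range(100, 501, 100)) for _ in range(num_layers)]
    lr = rng.choice([1e-2, 1e-3, 1e-4])
    return {'num_layers': num_layers, 'units': units, 'learning_rate': lr}

rng = random.Random(0)
trials = [sample_config(rng) for _ in range(40)]  # 40 random draws

# Distinct configurations: 5 unit choices per layer, 2 to 6 layers, 3 learning rates
n_configs = 3 * sum(5 ** n for n in range(2, 7))
print(n_configs)  # 58575
print(trials[0])
```

With 40 draws we visit well under 0.1% of the possible configurations, which is why looking at several of the top models, rather than only the best one, is informative.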

In [None]:
tf.keras.backend.clear_session()

tuner = RandomSearch(
    build_model,
    objective='val_loss',
    max_trials=40, #number of combinations to try
    executions_per_trial=3,
    project_name='MyDrive/Photoz') #may need to delete or reset

We can visualize the search space below:

In [None]:
tuner.search_space_summary()

Finally, it's time to put our tuner to work. (This is a big job!)

In [None]:
tuner.search(Xst_train, y_train, #same signature as model.fit
             epochs=100, validation_data=(Xst_val, y_val), batch_size=300, verbose = 0) 

#Note: setting verbose = 0 gives no output until the search is done; it took about 30 minutes on my laptop

The "results\_summary(n)" function gives us access to the n best models. It's useful to look at a few because the differences are often minimal, and a smaller model might be preferable! Note that the "number of units" parameter has a value assigned to it for each layer, even if the number of layers is smaller in that particular realization.
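Because of this, it helps to read only the layer sizes that are actually used. The sketch below works on a plain dictionary; `hp_values` is a hypothetical example of the kind of values a trial might report, and `active_layer_sizes` is an illustrative helper, not part of Keras Tuner.

```python
# Hypothetical hyperparameter values for one trial (not actual tuner output)
hp_values = {'num_layers': 3,
             'units_0': 400, 'units_1': 100, 'units_2': 500,
             'units_3': 200, 'units_4': 300,  # assigned, but unused with 3 layers
             'learning_rate': 0.001}

def active_layer_sizes(values):
    """Keep only the units_i entries for layers that exist in this configuration."""
    return [values['units_' + str(i)] for i in range(values['num_layers'])]

sizes = active_layer_sizes(hp_values)
print(sizes)  # [400, 100, 500]
```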

In [None]:
tuner.results_summary(5)

The losses of the first few models are very similar, suggesting that (1) as usual, we need to do some form of cross-validation to be able to come up with a ranking, and (2) with 3-5 layers and a few hundred neurons per layer, the exact configuration doesn't matter too much.

In [None]:
best_hps=tuner.get_best_hyperparameters()[0] #choose first model

In [None]:
best_hps.get('learning_rate')

In [None]:
best_hps.get('num_layers')

In [None]:
#Size of layers

print(best_hps.get('units_0'))
print(best_hps.get('units_1'))
print(best_hps.get('units_2'))

In [None]:
model = tuner.hypermodel.build(best_hps) #get best model

In [None]:
model.build(input_shape=(None,6)) #build best model (if not fit yet, this will give access to summary)

In [None]:
model.summary() #Note that this differs from what was shown in the tuner search summary! 

In [None]:
bestnet = model.fit(Xst_train, y_train, validation_data= (Xst_val, y_val), epochs=100, batch_size=300)

In [None]:
plt.plot(bestnet.history['loss'], label = 'train')
plt.plot(bestnet.history['val_loss'],'-.m', label = 'validation')
plt.ylabel('Loss', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.ylim(0,0.1)
plt.legend(loc='upper right', fontsize = 12);
#plt.savefig('OptimalNN_Photoz.png',dpi=300)

In [None]:
model.evaluate(Xst_test, y_test)

In [None]:
ypred = model.predict(Xst_test)

#Calculate OLF

print('OLF', len(np.where(np.abs(y_test-ypred)>0.15*(1+y_test))[0])/len(y_test))

#Calculate Normalized Median Absolute Deviation (NMAD)

print('NMAD', 1.48*np.median(np.abs(y_test-ypred)/(1 + y_test)))

These numbers have improved, compared to the baseline version - whether or not the improvement is significant should be determined via cross validation.

In [None]:
plt.figure(figsize=(5,5))
    
plt.xlabel('True redshift', fontsize = 14)
plt.ylabel('Estimated redshift', fontsize = 14)

plt.scatter(y_test, model.predict(Xst_test), s =10, c = 'teal');

plt.xlim(0,2)
plt.ylim(0,2)
plt.tight_layout()
#plt.savefig('OptimalNN_scatter.png')

Given the gap between train and validation scores above, it might be tempting to add some regularization (this should however be included in the tuner!)
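As a reminder of what a dropout layer with rate 0.3 does during training, here is a NumPy sketch of the standard "inverted dropout" operation: a random 30% of the activations are set to zero and the survivors are rescaled so that the expected value is unchanged. The function `dropout_train` is an illustrative stand-in, not the Keras implementation; at prediction time dropout is simply switched off.

```python
import numpy as np

def dropout_train(activations, rate, rng):
    """Zero a random fraction `rate` of the entries and rescale the rest."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones((100, 100))  # toy activations, all equal to 1
dropped = dropout_train(a, rate=0.3, rng=rng)

print('fraction zeroed:', (dropped == 0).mean())
print('mean after dropout:', dropped.mean())
```

The fraction of zeroed entries is close to 0.3, and the mean stays close to 1 thanks to the rescaling.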

In [None]:
model = Sequential()

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

model.add(Dense(400, activation='relu', input_shape=(6,)))

model.add(Dropout(0.3))

model.add(Dense(100, activation='relu'))

model.add(Dense(500, activation='relu'))

model.add(Dropout(0.3))

model.add(Dense(1, activation='linear'))

model.compile(loss='mse', optimizer=optimizer)

In [None]:
bestregnet = model.fit(Xst_train, y_train, validation_data= (Xst_val, y_val), epochs=100, batch_size=300)

In [None]:
model.evaluate(Xst_test, y_test)

ypred = model.predict(Xst_test)

#Calculate OLF

print('OLF', len(np.where(np.abs(y_test-ypred)>0.15*(1+y_test))[0])/len(y_test))

#Calculate Normalized Median Absolute Deviation (NMAD)

print('NMAD', 1.48*np.median(np.abs(y_test-ypred)/(1 + y_test)))

In [None]:
plt.plot(bestregnet.history['loss'], label = 'train')
plt.plot(bestregnet.history['val_loss'],'-.m', label = 'validation')
plt.ylabel('Loss', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.ylim(0,0.1)
plt.legend(loc='upper right', fontsize = 12);

Note the overall effect is minimal.

## Effect of different loss functions

In [None]:
dir(keras.losses)

In [None]:
X,y = shuffle(X,y, random_state = 10)

X_train = X.values[:3*fifth,:]
y_train = y[:3*fifth]

X_val = X.values[3*fifth:4*fifth,:]
y_val = y[3*fifth:4*fifth]

X_test = X.values[4*fifth:,:]
y_test = y[4*fifth:]

scaler.fit(X_train) #Important: we use only training data to scale

Xst_train = scaler.transform(X_train)
Xst_val = scaler.transform(X_val)
Xst_test = scaler.transform(X_test)

### Effect of num trials; is the difference between OLF/NMAD significant?

In [None]:
#Architecture stays the same

def build_fixed_model():

    #Rebuilding the model draws a new random weight initialization

    model = keras.Sequential()

    model.add(layers.Dense(units=500,
                               activation='relu'))
    model.add(layers.Dense(units=100,
                               activation='relu'))
    model.add(layers.Dense(units=400,
                               activation='relu'))
    model.add(Dense(1, activation='linear')) #last one

    return model

#We use three different loss functions and repeat the training 4x

for loss in ['mse','mae', 'mape']:

    OLF = np.zeros(4)
    NMAD = np.zeros(4)

    for i in range(4): #let's do this 4 times and change only random weights initialization

        tf.keras.backend.clear_session()

        model = build_fixed_model() #fresh weights for each run and each loss

        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate = 0.001),
            loss=loss)

        model.fit(Xst_train, y_train,
             epochs=100,
             validation_data=(Xst_val, y_val), batch_size=300, verbose = 0)

        ypred = model.predict(Xst_test)

        #Calculate OLF

        OLF[i] = len(np.where(np.abs(y_test-ypred)>0.15*(1+y_test))[0])/len(y_test)

        #Calculate Normalized Median Absolute Deviation (NMAD)
        
        NMAD[i] = 1.48*np.median(np.abs(y_test-ypred)/(1 + y_test))

    print('OLF mean/std using loss', loss, 'is:', "{:.3f}".format(OLF.mean()), "{:.3f}".format(OLF.std()))
    print('NMAD mean/std using loss', loss, 'is:', "{:.3f}".format(NMAD.mean()), "{:.3f}".format(NMAD.std()))

### Learning Check-in
    
Which loss functions are most suited to minimizing the OLF and NMAD?

<br>

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
If we want to minimize OLF/NMAD, our preferred choice(s) should be the MAE or MSE losses. Alternatively, we can define a custom loss. 
```

</p>
</details>
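To see how the three losses weigh residuals differently, here is a NumPy sketch on a toy set of redshifts. The arrays are hypothetical, and the MAPE formula (including the small `eps` floor in the denominator) follows my understanding of the Keras definition, so treat it as an assumption. Note that MAPE divides by the true value rather than by (1 + true value), so objects at very low redshift dominate it.

```python
import numpy as np

def mse(y, yp):
    return np.mean((y - yp) ** 2)

def mae(y, yp):
    return np.mean(np.abs(y - yp))

def mape(y, yp, eps=1e-7):
    # Percentage error relative to |y|; eps avoids division by zero (assumed value)
    return 100.0 * np.mean(np.abs(y - yp) / np.maximum(np.abs(y), eps))

y_toy = np.array([0.05, 0.5, 1.0, 1.5])  # toy true redshifts
y_good = y_toy + 0.02                     # small, uniform error
y_outl = y_good.copy()
y_outl[-1] += 1.0                         # one catastrophic outlier

for name, f in [('mse', mse), ('mae', mae), ('mape', mape)]:
    print(name, f(y_toy, y_good), f(y_toy, y_outl))
```

In this toy case a single outlier inflates the MSE far more than the MAE, while most of the MAPE for the well-behaved predictions comes from the lowest-redshift object.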

### Single model evaluation



In [None]:
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import KFold, cross_validate, cross_val_predict
from sklearn.pipeline import Pipeline, make_pipeline

In [None]:
cv = KFold(n_splits = 5, shuffle = True)

In [None]:
def create_single_model():
    
    model = keras.Sequential()

    model.add(layers.Dense(300, activation = 'relu'))
    model.add(layers.Dense(300, activation = 'relu'))
    model.add(layers.Dense(300, activation = 'relu'))
    model.add(Dense(1, activation = 'linear')) 

    model.compile(optimizer=tf.keras.optimizers.Adam(),loss='mse')
    
    return model

estimator = KerasRegressor(build_fn = create_single_model, epochs = 100, batch_size = 200, verbose=0)

In [None]:
pipeline = Pipeline([('scale', StandardScaler()), ('model', estimator)])

In [None]:
scores = cross_validate(pipeline, X, y, cv = cv, scoring = 'neg_mean_squared_error', return_train_score = True)

In [None]:
scores

In [None]:
scores['test_score'].mean(), scores['test_score'].std()

### Model optimization and scoring with cross validation/nested cross validation.

First, we define a new function to build a keras model, and make sure we can vary the arguments we are interested in optimizing. In this case, we are keeping the number of hidden layers at 3, and varying their sizes, as well as the learning rate. Some other parameters will be added directly as part of the parameter grid.

In [None]:
def create_model(lr = 0.01, size_1 = 500, size_2 = 100, size_3 = 400):

    model = keras.Sequential()

    model.add(layers.Dense(units = size_1,
                               activation='relu'))
    model.add(layers.Dense(units = size_2,
                               activation='relu'))
    model.add(layers.Dense(units = size_3,
                               activation='relu'))
    model.add(Dense(1, activation='linear')) #last one

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate = lr),
        loss='mse')
    
    return model

Next, we define our hyperparameter grid, which will be the input for our Random Search.

In [None]:
# Random search parameters

batch_size = [200, 300, 400]

lrs = [0.0001, 0.001, 0.01]

epochs = [50, 100, 200]

size_1 = [100, 300, 500]

size_2 = [100, 300, 500]

size_3 = [100, 300, 500]

Then, we run a cross-validated search for the best parameters. We choose 40 model evaluations. Note that running this search takes a while!
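A quick count shows why a random search is attractive here. The sketch below uses the parameter lists defined above and assumes 4 cross-validation folds (the value used in the search cell); the dictionary name `grid_demo` is just for illustration.

```python
from math import prod

# Same candidate values as in the random search parameters above
grid_demo = {'batch_size': [200, 300, 400],
             'lr': [0.0001, 0.001, 0.01],
             'epochs': [50, 100, 200],
             'size_1': [100, 300, 500],
             'size_2': [100, 300, 500],
             'size_3': [100, 300, 500]}

n_combos = prod(len(v) for v in grid_demo.values())
n_folds = 4   # assumed number of CV folds
n_iter = 40   # number of random draws

full_grid_fits = n_combos * n_folds
random_fits = n_iter * n_folds

print(n_combos)        # 729
print(full_grid_fits)  # 2916
print(random_fits)     # 160
```

The random search trains 160 networks instead of 2916, not counting the final refit of the best model.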

In [None]:
kmodel = KerasRegressor(build_fn = create_model, verbose=0)

pipeline = Pipeline([('scale', StandardScaler()), ('est', kmodel)])

pipeline.get_params().keys()

In [None]:
tf.keras.backend.set_floatx('float64') #this is here because of a warning

#Define cv strategy

cv = KFold(n_splits = 4, shuffle = True)

param_grid = dict(est__size_1 = size_1, est__size_2 = size_2, est__size_3 = size_3,
                  est__batch_size = batch_size, est__epochs = epochs, est__lr = lrs)
                  
grid = RandomizedSearchCV(estimator = pipeline, param_distributions = param_grid, n_iter = 40, n_jobs=-1, \
                          cv=cv, return_train_score = True)

results = grid.fit(X, y)

We can take a look at the distribution of validation scores (it says test here, but we should really think of them as validation) by looking at the first lines of the "results" object, sorted by validation score.

In [None]:
scores = pd.DataFrame(results.cv_results_)
scoresCV = scores[['params','mean_test_score','std_test_score']].sort_values(by = 'mean_test_score', \
                                                    ascending = False)
scoresCV.head(10)

This procedure settles the question of which model(s) perform(s) best, but it still doesn't provide a proper estimate of the generalization error, which should be computed on data that have never participated in either the hyperparameter optimization or the training process. Assessing the test scores (and their uncertainty due to the stochastic nature of sample selection and the non-deterministic aspects of the neural network) requires a three-tiered structure, with two nested CV processes: the outer CV "peels out" the test folds, the inner CV does the validation/parameter optimization.
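The three-tiered structure can be sketched with indices alone. The example below uses NumPy only, with a tiny hypothetical sample of 24 objects, 4 outer folds and 3 inner folds; `kfold_indices` is an illustrative helper, not the sklearn `KFold` class. It checks that each outer test fold never overlaps with the rows used for inner training or validation.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle the integers 0..n-1 and cut them into k disjoint folds."""
    return np.array_split(rng.permutation(n), k)

rng = np.random.default_rng(0)
n_samples = 24  # hypothetical, tiny on purpose
outer_folds = kfold_indices(n_samples, 4, rng)

sizes = []
for i, test_idx in enumerate(outer_folds):
    # Learning set: everything outside the i-th outer test fold
    learn_idx = np.concatenate([f for j, f in enumerate(outer_folds) if j != i])
    for val_pos in kfold_indices(len(learn_idx), 3, rng):
        val_idx = learn_idx[val_pos]                  # inner validation rows
        train_idx = np.setdiff1d(learn_idx, val_idx)  # inner training rows
        # The outer test fold is never seen by the inner loop
        seen = np.concatenate([train_idx, val_idx])
        assert np.intersect1d(test_idx, seen).size == 0
    sizes.append((len(test_idx), len(train_idx), len(val_idx)))

print(sizes)  # (test, inner train, inner validation) sizes for each outer fold
```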

### Nested cross validation in action

In [None]:
Xa = X.values #turn them into numpy array
ya = y.values.ravel() #turn them into numpy array

In [None]:
Xa.shape

In [None]:
from sklearn import metrics #needed for metrics.mean_squared_error below

#Outer and inner k-fold:
    
outercv = KFold(n_splits=4, shuffle=True) #creates 4 disjoint splits

innercv = KFold(n_splits=3, shuffle=True) #creates 3 disjoint splits

i = 0

winning_model_test_scores = []

OLF = []

NMAD = []

for train_index, test_index in outercv.split(Xa,ya): #This runs the outer cross validation
    
    i+=1
    
    print('Fold ' ,i, 'outer cross validation')
    
    X_train = Xa[train_index] #learning set, will be used for training + validation
    y_train = ya[train_index] 
    
    X_test = Xa[test_index] #test set, won't know anything about training or validation
    y_test = ya[test_index]
    
    #Let's scale here (and not again within the CV; this is slightly not rigorous but ok for practical purposes)
    
    scaler.fit(X_train)
    
    Xst_train = scaler.transform(X_train)
    Xst_test = scaler.transform(X_test)
    
    #defining parameter grid and model
    
    model = KerasRegressor(build_fn = create_model, verbose=0)

    # Random search parameters

    batch_size = [200, 300, 400]

    lrs = [0.0001, 0.001, 0.01]

    epochs = [50, 100, 200]

    size_1 = [100, 300, 500]

    size_2 = [100, 300, 500]

    size_3 = [100, 300, 500]

    param_grid = dict(size_1 = size_1, size_2 = size_2, size_3 = size_3, lr = lrs, \
                  batch_size = batch_size, epochs = epochs)

    grid = RandomizedSearchCV(estimator = model, param_distributions = param_grid, n_iter = 40, n_jobs=-1, \
                          cv=innercv, return_train_score = True)

    results = grid.fit(Xst_train, y_train) #if you want to explore the validation search results, you should save this object

    #Get best estimator; compute test scores with optimal parameters on outer i-th test fold
    
    winner = results.best_estimator_
    
    print('The winning model has parameters', results.best_params_) #This is just to compare the best model in different folds
    
    winner.fit(Xst_train, y_train) #we can use the entire inner learning set to train the winning model
    
    ypred = winner.predict(Xst_test) #X_test is totally new to the training/optimization process
    
    #calculate OLF and NMAD!
    
    OLF.append(len(np.where(np.abs(y_test-ypred)>0.15*(1+y_test))[0])/len(y_test))

    #Calculate Normalized Median Absolute Deviation (NMAD)
        
    NMAD.append(1.48*np.median(np.abs(y_test-ypred)/(1 + y_test)))
    
    #Finally, save test scores
    
    winning_model_test_scores.append(metrics.mean_squared_error(y_test,ypred)) #append this to the outer cv results
    


In [None]:
print('The average MSE of the winning model (i.e. the generalization error) is', \
      "{:.3f}".format(np.mean(winning_model_test_scores)), 'with a std of', "{:.3f}".format(np.std(winning_model_test_scores)))

print('The average OLF of the winning model is', \
      "{:.3f}".format(np.mean(OLF)), 'with a std of', "{:.3f}".format(np.std(OLF)))

print('The average NMAD of the winning model is', \
      "{:.3f}".format(np.mean(NMAD)), 'with a std of', "{:.3f}".format(np.std(NMAD)))


### Notes:

Training NNs involves many random processes, so results vary even with a single fold because the weight initialization is random (num_trials = 3 is the minimum recommended).

K-fold cross validation (or better, nested cross validation) should be used to find the optimal model and estimate test scores.

Things get expensive really fast! Random Search helps with this.

Recognizing secondary parameters also helps. 

Often, multiple configurations give similar results, and having single-point estimates can trick us into thinking that small differences matter more than they do. So my recommendation is to invest resources in figuring out whether a more expensive network (wider, deeper) really matters for your task.
