## Hyperparameter Tuning: Second phase

Run **SHERPA** with a fixed batch size of 1024 and the Adam optimizer. Do not shuffle the input data, as shuffling takes a lot of time. <br>
*First phase:* Start with 3 epochs per model. Some models can already be discarded at this point. <br>
*Second phase:* Run 3 epochs with the parameter space confined to the four best models from phase 1. Add a learning rate scheduler à la Stephan Rasp (divide the learning rate by 20 every two epochs). <br>
*Third phase:* Run 6 epochs with the two best models from phase 2. With SHERPA, vary only the learning rate scheduler. Use cross-validation here to get a good estimate of the generalization error. <br>

To vary:
- Learning rate (learning rate scheduler)
- Number of layers (at most 1-4 hidden layers)
- Regularization methods
- Hidden units
- Activation functions (except for the output layer)

**Best results from phase 1:** <br>
Activation_1: lrelu or leaky_relu <br>
Activation_2: lrelu or elu <br>
Activation_3: relu or leaky_relu <br>
Dropout: 0.15 - 0.3 <br>
Epsilon: 0 or 0.1 <br>
l1_reg: 0.0001 to 0.007 <br>
l2_reg: 0.001 to 0.007 <br>
lr_init: 0.001-0.009 <br>
model_depth: 3 or 4 <br>
hidden units: 32 to 256

In [2]:
# Best results from Phase No. 1:

# Trial-ID                                         1
# Status                                   COMPLETED
# Iteration                                        2
# activation_1    <function lrelu at 0x2abbb526a830>
# activation_2    <function lrelu at 0x2abbb526a830>
# activation_3                                   NaN
# activation_4                                   NaN
# activation_5                                  relu
# activation_6                                  relu
# activation_7                                  relu
# dropout                                   0.184124
# epsilon                                        0.1
# l1_reg                                    0.000162
# l2_reg                                    0.007437
# lrinit                                    0.008726
# model_depth                                      3
# num_units                                      256
# Objective                                42.318375
# Name: 0, dtype: object

# Trial-ID                                              2
# Status                                        COMPLETED
# Iteration                                             2
# activation_1    <function leaky_relu at 0x2b915efe3ef0>
# activation_2         <function lrelu at 0x2b9167db78c0>
# activation_3                                       relu
# activation_4                                        NaN
# activation_5                                        NaN
# activation_6                                        NaN
# activation_7                                        NaN
# dropout                                        0.151082
# epsilon                                             0.0
# l1_reg                                         0.002857
# l2_reg                                         0.006965
# lrinit                                         0.002045
# model_depth                                           4
# num_units                                            32
# Objective                                     44.167797
# Name: 1, dtype: object

# Trial-ID                                              2
# Status                                        COMPLETED
# Iteration                                             2
# activation_1         <function lrelu at 0x2ba500c62830>
# activation_2                                        elu
# activation_3    <function leaky_relu at 0x2ba4f7e8eef0>
# activation_4                                        NaN
# activation_5                                        NaN
# activation_6                                        NaN
# activation_7                                        NaN
# dropout                                        0.231274
# epsilon                                             0.0
# l1_reg                                         0.007085
# l2_reg                                          0.00134
# lrinit                                         0.005609
# model_depth                                           4
# num_units                                            64
# Objective                                     44.761337
# Name: 1, dtype: object

# Trial-ID                                         1
# Status                                   COMPLETED
# Iteration                                        2
# activation_1    <function lrelu at 0x2add07c84830>
# activation_2    <function lrelu at 0x2add07c84830>
# activation_3                                   NaN
# activation_4                                   NaN
# activation_5                                  relu
# activation_6                                  relu
# activation_7                                  relu
# dropout                                        0.3
# epsilon                                        0.1
# l1_reg                                       0.004
# l2_reg                                       0.004
# lrinit                                      0.0012
# model_depth                                      3
# num_units                                      128
# Objective                                49.790131

In [4]:
#Starting only with a few epochs
epochs = 3

In [5]:
# Ran with 800GB (750GB should also be fine)

import sys
import numpy as np
import time
import pandas as pd
import matplotlib.pyplot as plt
import os
import copy
import gc

#Import sklearn before tensorflow (static Thread-local storage)
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l1_l2

from tensorflow.keras import backend as K
from tensorflow.keras.layers import Activation

t0 = time.time()
path = '/pf/b/b309170'
path_figures = path + '/workspace_icon-ml/cloud_cover_parameterization/grid_cell_based_QUBICC_R02B05/figures'
path_model = path + '/workspace_icon-ml/cloud_cover_parameterization/grid_cell_based_QUBICC_R02B05/saved_models'
path_data = path + '/my_work/icon-ml_data/cloud_cover_parameterization/grid_cell_based_QUBICC_R02B05/based_on_var_interpolated_data'

# Add path with my_classes to sys.path
sys.path.insert(0, path + '/workspace_icon-ml/cloud_cover_parameterization/')
# Add sherpa
sys.path.insert(0, path + '/my_work/sherpa')

import sherpa
import sherpa.algorithms.bayesian_optimization as bayesian_optimization

# Reloading custom file to incorporate changes dynamically
import importlib
import my_classes
importlib.reload(my_classes)

from my_classes import read_mean_and_std
from my_classes import TimeOut

import datetime

# Set seed for reproducibility
seed = 10
tf.random.set_seed(seed)

gpus = tf.config.experimental.list_physical_devices('GPU')
# tf.config.experimental.set_visible_devices(gpus[3], 'GPU')

In [6]:
# Won't run on a CPU node
try:
    # Prevents crashes of the code
    physical_devices = tf.config.list_physical_devices('GPU')
    tf.config.set_visible_devices(physical_devices[0], 'GPU')
    # Allow the growth of memory Tensorflow allocates (limits memory usage overall)
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
except (IndexError, ValueError, RuntimeError):
    pass

In [7]:
scaler = StandardScaler()

### Load the data

In [8]:
# input_narval = np.load(path_data + '/cloud_cover_input_narval.npy')
# input_qubicc = np.load(path_data + '/cloud_cover_input_qubicc.npy')
# output_narval = np.load(path_data + '/cloud_cover_output_narval.npy')
# output_qubicc = np.load(path_data + '/cloud_cover_output_qubicc.npy')

In [9]:
input_data = np.concatenate((np.load(path_data + '/cloud_cover_input_narval.npy'), 
                             np.load(path_data + '/cloud_cover_input_qubicc.npy')), axis=0)
output_data = np.concatenate((np.load(path_data + '/cloud_cover_output_narval.npy'), 
                              np.load(path_data + '/cloud_cover_output_qubicc.npy')), axis=0)

In [10]:
samples_narval = np.load(path_data + '/cloud_cover_output_narval.npy').shape[0]

In [11]:
(samples_total, no_of_features) = input_data.shape
(samples_total, no_of_features)

(1008913906, 10)

*Temporal cross-validation*

Split the data into two-week increments (when working with 3 months of data); with 5 months of data, the increments are 25 days long. <br>
1.: Validate on increments 1 and 4 <br>
2.: Validate on increments 2 and 5 <br>
3.: Validate on increments 3 and 6

--> 2/3 training data, 1/3 validation data

In [13]:
training_folds = []
validation_folds = []
two_week_incr = samples_total//6

for i in range(3):
    # Note that this is a temporal split since time was the first dimension in the original tensor
    first_incr = np.arange(two_week_incr*i, two_week_incr*(i+1))
    second_incr = np.arange(two_week_incr*(i+3), two_week_incr*(i+4))

    validation_folds.append(np.append(first_incr, second_incr))
    training_folds.append(np.arange(samples_total))
    training_folds[i] = np.delete(training_folds[i], validation_folds[i])
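
The split above can also be described by contiguous `(start, stop)` ranges instead of explicit index arrays. The following is a minimal sketch of that idea, not the code used in this notebook: `validation_ranges`, `training_ranges` and `size` are hypothetical helper names, the sample count of 12 is a toy value, and the sketch assumes the samples are ordered by time as noted in the cell above.

```python
def validation_ranges(samples_total, fold):
    """Return the two (start, stop) ranges that fold 0, 1 or 2 validates on."""
    incr = samples_total // 6
    return [(incr * fold, incr * (fold + 1)),
            (incr * (fold + 3), incr * (fold + 4))]


def training_ranges(samples_total, fold):
    """Return the complementary (start, stop) ranges used for training."""
    ranges = []
    start = 0
    for val_start, val_stop in validation_ranges(samples_total, fold):
        if val_start > start:
            ranges.append((start, val_start))
        start = val_stop
    if start < samples_total:
        ranges.append((start, samples_total))
    return ranges


def size(ranges):
    """Total number of samples covered by a list of (start, stop) ranges."""
    return sum(stop - start for start, stop in ranges)


# Toy example with 12 samples (hypothetical size, chosen for readability)
for fold in range(3):
    val = validation_ranges(12, fold)
    train = training_ranges(12, fold)
    print(fold, val, train, size(val), size(train))
```

If the sample count is not divisible by 6, the leftover samples at the end fall into the training set in every fold, which matches the behaviour of the index-based code above.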

### 3-fold cross-validation

In [225]:
# Build the training and validation datasets for fold i
def run_cross_validation(i):
    
    filename = 'cross_validation_cell_based_fold_%d'%(i+1)
    
    #Standardize according to the fold
    scaler.fit(input_data[training_folds[i]])

    #Load the data for the respective fold and convert it to tf data
    input_train = scaler.transform(input_data[training_folds[i]])
    input_valid = scaler.transform(input_data[validation_folds[i]])
    output_train = output_data[training_folds[i]]
    output_valid = output_data[validation_folds[i]]
    
    # Column-based: batchsize of 128
    # Possibly better to use .apply(tf.data.experimental.copy_to_device("/gpu:0")) before prefetch
    # Note: the plan is not to shuffle during hyperparameter tuning, but this pipeline still shuffles with a buffer of 10**5
    train_ds = tf.data.Dataset.zip((tf.data.Dataset.from_tensor_slices(input_train), 
                                tf.data.Dataset.from_tensor_slices(output_train))) \
                .shuffle(10**5, seed=seed) \
                .batch(batch_size=1024, drop_remainder=True) \
                .prefetch(1)
    
    # No need to add prefetch.
    # tf data with batch_size=10**5 makes the validation evaluation 10 times faster
    valid_ds = tf.data.Dataset.zip((tf.data.Dataset.from_tensor_slices(input_valid), 
                                tf.data.Dataset.from_tensor_slices(output_valid))) \
                .batch(batch_size=10**5, drop_remainder=False)
    
    return train_ds, valid_ds
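
Fitting the scaler on the training fold only keeps validation statistics out of the preprocessing. As a minimal single-feature sketch of what `StandardScaler` computes (mean and population standard deviation), using only the standard library; the helper names and the toy numbers are illustrative assumptions, not values from this notebook:

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Estimate mean and population standard deviation from the training fold only."""
    return fmean(train_values), pstdev(train_values)


def standardize(values, mean, std):
    """Apply the training-fold statistics to any set of values."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0]   # toy training fold
valid = [2.5, 5.0]             # toy validation fold

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
valid_scaled = standardize(valid, mean, std)  # reuses the training statistics

print(train_scaled)
print(valid_scaled)
```

The scaled training values have zero mean and unit variance by construction; the validation values generally do not, which is expected.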

In [226]:
#Should be a pretty unique number
random_num = np.random.randint(500000)
print(random_num)

def save_model(study, today, optimizer):
    out_path = '/pf/b/b309170/workspace_icon-ml/cloud_cover_parameterization/grid_cell_based_QUBICC_R02B05/sherpa_results/'+\
            today+'_'+optimizer+'_'+str(random_num)
    
    study.results = study.results[study.results['Status']=='COMPLETED'] #To specify results
    study.results.index = study.results['Trial-ID']  #Trial-ID serves as a better index
    # Remove those hyperparameters that actually do not appear in the model
    for i in study.results.index:  # Only COMPLETED trials remain in the index
        depth = study.results.at[i, 'model_depth']
        for j in range(depth, 4): #Or up to 8
            study.results.at[i, 'activation_%d'%j] = None
#             study.results.at[i, 'bn_%d'%j] = None
    # Create the directory and save the SHERPA-output in it
    try:
        os.mkdir(out_path)
    except OSError:
        print('Creation of the directory %s failed' % out_path)
    else: 
        print('Successfully created the directory %s' % out_path)
    study.save(out_path)

494519
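
The pruning in `save_model` relies on which `activation_k` entries a model of a given depth reads. In the training loop further below, the first layer uses `activation_1` and the remaining hidden layers use `activation_2` up to `activation_(model_depth - 1)`. A small sketch of that bookkeeping, with hypothetical helper names and assuming three activation parameters are defined, as in the search space of this notebook:

```python
def used_activation_keys(model_depth):
    """Activation parameters that are read when building a model of this depth."""
    return ['activation_%d' % j for j in range(1, model_depth)]


def unused_activation_keys(model_depth, n_defined=3):
    """Activation parameters that are sampled but never enter the model."""
    return ['activation_%d' % j for j in range(model_depth, n_defined + 1)]


for depth in (3, 4):
    print(depth, used_activation_keys(depth), unused_activation_keys(depth))
```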


In [227]:
# For Leaky_ReLU:
from tensorflow import nn 

def lrelu(x):
    return nn.leaky_relu(x, alpha=0.01)

OPTIMIZER = 'adam'
parameters = [sherpa.Ordinal('num_units', [32, 64, 128, 256]), #No need to vary these per layer. Could add 512.
             sherpa.Discrete('model_depth', [3, 4]), #Originally [2,8] although 8 was never truly tested
             sherpa.Choice('activation_1', ['relu', nn.leaky_relu]), #Adding SeLU is trickier
             sherpa.Choice('activation_2', ['elu']), 
             sherpa.Choice('activation_3', ['relu']),
             sherpa.Continuous('lrinit', [0.001, 0.01], 'log'),
             sherpa.Ordinal('epsilon', [1e-8, 0.1]),
             sherpa.Continuous('dropout', [0.15, 0.3]),
             sherpa.Continuous('l1_reg', [0.0001, 0.007]),
             sherpa.Continuous('l2_reg', [0.001, 0.007])]
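
The `'log'` scale on `lrinit` means that the learning rate is explored on a logarithmic rather than a linear axis. The sketch below illustrates log-uniform sampling with the standard library. It is an illustration of the idea only, not SHERPA's implementation, and the Bayesian optimizer used in the next cell does not draw independent random samples. `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(10)
samples = [sample_log_uniform(0.001, 0.01, rng) for _ in range(10000)]

# On a log scale, about half of the samples fall below the geometric mean of the bounds
geometric_mean = math.sqrt(0.001 * 0.01)
share_below = sum(s < geometric_mean for s in samples) / len(samples)
print(share_below)
```

With a linear scale, by contrast, only about a quarter of the samples would fall below the same threshold, so small learning rates would be explored less often.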

In [228]:
# max_num_trials is left unspecified, so the optimization will run until the end of the job-runtime

# good_hyperparams = pd.DataFrame({'num_units': [128], 'model_depth': [3], 'activation_1': [lrelu], 'activation_2':[lrelu],
#                    'activation_3':['relu'], 'activation_4':['relu'], 'activation_5':['relu'], 'activation_6':['relu'],
#                    'activation_7':['relu'], 'lrinit':[0.0012], 'epsilon':[0.1], 'dropout':[0.3], 
#                                  'l1_reg':[0.004], 'l2_reg':[0.004]})

# # I expect an objective of around 61.

# alg = bayesian_optimization.GPyOpt(initial_data_points=good_hyperparams)

alg = bayesian_optimization.GPyOpt() 
study = sherpa.Study(parameters=parameters, algorithm=alg, lower_is_better=True)

INFO:sherpa.core:
-------------------------------------------------------
SHERPA Dashboard running. Access via
http://10.50.13.252:8880 if on a cluster or
http://localhost:8880 if running locally.
-------------------------------------------------------


 * Serving Flask app "sherpa.app.app" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on


In [229]:
# Divide lr by 20 every two epochs, first at the third epoch (epoch index 2; Keras passes a 0-based epoch index)
def scheduler(epoch, lr):
    if epoch > 0 and epoch%2==0:
        return lr/20
    else:
        return lr
    
scheduler_callback = tf.keras.callbacks.LearningRateScheduler(scheduler, verbose=1)
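
To see which learning rate each epoch receives, the schedule can be traced without TensorFlow. This is a minimal sketch: `simulate_schedule` is a hypothetical helper that mimics how the callback passes the 0-based epoch index and the current learning rate to `scheduler`, and the initial rate of 0.002 is an arbitrary example value.

```python
def scheduler(epoch, lr):
    """Same rule as above: divide by 20 at epoch indices 2, 4, 6, ..."""
    if epoch > 0 and epoch % 2 == 0:
        return lr / 20
    return lr


def simulate_schedule(lr_init, epochs):
    """Return the learning rate used in each epoch of a single fit call."""
    rates = []
    lr = lr_init
    for epoch in range(epochs):
        lr = scheduler(epoch, lr)  # The result becomes the current rate
        rates.append(lr)
    return rates


for epoch, lr in enumerate(simulate_schedule(0.002, 6), start=1):
    print('Epoch %d: lr = %.3g' % (epoch, lr))
```

With 3 epochs, the rate is therefore reduced exactly once, for the last epoch. Since the scheduler receives the optimizer's current rate, a repeated `model.fit` call on the same model would start from the already reduced rate.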

In [230]:
# Usually setting patience=8
today = str(datetime.date.today())[:7] # YYYY-MM

for trial in study:
    
    val_loss = []

    # Create the model
    model = Sequential()
    par = trial.parameters

    # Input layer
    model.add(Dense(units=par['num_units'], activation=par['activation_1'], input_dim=no_of_features,
                   kernel_regularizer=l1_l2(l1=par['l1_reg'], l2=par['l2_reg'])))

    # Hidden layers    
    for j in range(2, par['model_depth']):
        model.add(Dense(units=par['num_units'], activation=par['activation_'+str(j)], 
                        kernel_regularizer=l1_l2(l1=par['l1_reg'], l2=par['l2_reg'])))
        model.add(Dropout(par['dropout'])) #After every hidden layer we (potentially) add a dropout layer

    # Output layer
    model.add(Dense(1, activation='linear', 
                    kernel_regularizer=l1_l2(l1=par['l1_reg'], l2=par['l2_reg'])))

    # Optimizer: Adam is relatively robust w.r.t. its beta-parameters 
    optimizer = Adam(learning_rate=par['lrinit'], epsilon=par['epsilon']) 
    model.compile(loss='mse', optimizer=optimizer)

    # Cross-validate
    # Not using the keras_callback here as then the objective of a given trial is fixed after only one run of model fit
    for i in range(1):
        train_ds, valid_ds = run_cross_validation(i)
        history = model.fit(train_ds, epochs=epochs, verbose=2, validation_data=valid_ds, callbacks=[scheduler_callback]) 
        val_loss.append(np.min(history.history['val_loss']))
    
    # Using add_observation instead of keras_callback. 
    # With i = 3
#     study.add_observation(trial, objective=np.mean(val_loss), context={'Val-loss First Fold': val_loss[0], 
#                                                                  'Val-loss Second Fold': val_loss[1], 
#                                                                  'Val-loss Third Fold': val_loss[2]})
    
    study.add_observation(trial, objective=np.mean(val_loss))
    
    study.finalize(trial)
    save_model(study, today, OPTIMIZER)

0.0036447139546774226
1
2
3
Epoch 1/3

Epoch 00001: LearningRateScheduler reducing learning rate to 0.0036447138991206884.
2/2 - 1s - loss: 0.9846 - val_loss: 7.4126
Epoch 2/3

Epoch 00002: LearningRateScheduler reducing learning rate to 0.0036447138991206884.
2/2 - 0s - loss: 0.9509 - val_loss: 0.9493
Epoch 3/3

Epoch 00003: LearningRateScheduler reducing learning rate to 0.00018223569495603442.
2/2 - 0s - loss: 0.9042 - val_loss: 0.9859
1
2
3
Epoch 1/3

Epoch 00001: LearningRateScheduler reducing learning rate to 0.0001822357007768005.
2/2 - 0s - loss: 0.9001 - val_loss: 0.8831
Epoch 2/3

Epoch 00002: LearningRateScheduler reducing learning rate to 0.0001822357007768005.
2/2 - 0s - loss: 0.8905 - val_loss: 0.8792
Epoch 3/3

Epoch 00003: LearningRateScheduler reducing learning rate to 9.111785038840025e-06.
2/2 - 0s - loss: 0.9343 - val_loss: 0.8791
1
2
3
Epoch 1/3

Epoch 00001: LearningRateScheduler reducing learning rate to 9.111785402637906e-06.
2/2 - 0s - loss: 0.8922 - val_loss: 

KeyboardInterrupt: 
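
For a rough sense of model size across the search space, the number of trainable parameters of the network built in the trial loop can be computed by hand. This is a sketch under the assumption that the architecture is exactly as in that loop (10 input features, `model_depth` dense layers including the output layer, dropout layers without parameters); `count_parameters` is a hypothetical helper.

```python
def count_parameters(num_units, model_depth, n_features=10):
    """Weights plus biases of all dense layers of the fully connected network."""
    first = (n_features + 1) * num_units                       # First dense layer
    hidden = (model_depth - 2) * (num_units + 1) * num_units   # Remaining hidden layers
    output = num_units + 1                                     # Single linear output unit
    return first + hidden + output


for depth in (3, 4):
    for units in (32, 64, 128, 256):
        print(depth, units, count_parameters(units, depth))
```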