## Hyperparameter Tuning: Third phase, first model

Run **SHERPA** with the batch size fixed at 1024. Do not shuffle the input data, as that takes a lot of time. <br>
*First phase:* Start with 3 epochs each. Some models can already be discarded at this point. <br>
*Second phase:* Run 3 epochs with the parameter space confined to the four best models from phase 1. Add a learning rate scheduler à la Stephan Rasp (divide the learning rate by 20 every two epochs). <br>
*Third phase:* Run 6 epochs with the two best models from phase 2. With SHERPA, vary the learning rate schedule. Cross-validation is typically used at this stage to get a good estimate of the generalization error. 
To vary: 
- Learning rate scheduler

In [1]:
# Best results from Phase No. 2:

# Optimizer                                 Adadelta
# activation_1                                  tanh
# activation_2    <function leaky_relu at 0x2accec7a4ef0>
# activation_3                                  tanh
# bn_0                                             0
# bn_1                                             1      
# bn_2                                             0
# dropout                                      0.221
# epsilon                                        0.1
# l1_reg                                    0.004749
# l2_reg                                    0.008732
# lrinit                                    0.000433
# model_depth                                      4
# num_units                                       64
# Objective                                    37.45
# Name: 0, dtype: object

# Actually only the 4th best, but the second and third model weren't really satisfactory in their training progress:
# Optimizer                                         Nadam
# activation_1    <function leaky_relu at 0x2baa40732ef0>
# activation_2                                       tanh
# activation_3         <function lrelu at 0x2baa4950f710>
# bn_0                                             0
# bn_1                                             1      
# bn_2                                             0
# dropout                                    0.20987
# epsilon                                        0.1
# l1_reg                                    0.008453
# l2_reg                                    0.004271
# lrinit                                    0.008804
# model_depth                                      4
# num_units                                      128
# Objective                                    45.94
# Name: 1, dtype: object

In [2]:
#Starting only with a few epochs
epochs = 6

In [3]:
# Ran with 800GB (750GB should also be fine)

import sys
import numpy as np
import time
import pandas as pd
import matplotlib.pyplot as plt
import os
import copy
import gc

#Import sklearn before tensorflow (static Thread-local storage)
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import Adadelta
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.regularizers import l1_l2

from tensorflow.keras import backend as K
from tensorflow.keras.layers import Activation

t0 = time.time()
path = '/pf/b/b309170'
path_figures = path + '/workspace_icon-ml/cloud_cover_parameterization/grid_cell_based_QUBICC_R02B05/figures'
path_model = path + '/workspace_icon-ml/cloud_cover_parameterization/grid_cell_based_QUBICC_R02B05/saved_models'
path_data = path + '/my_work/icon-ml_data/cloud_cover_parameterization/grid_cell_based_QUBICC_R02B05/based_on_var_interpolated_data'

# Add path with my_classes to sys.path
sys.path.insert(0, path + '/workspace_icon-ml/cloud_cover_parameterization/')
# Add sherpa
sys.path.insert(0, path + '/my_work/sherpa')

# Needed below for sherpa.Ordinal, sherpa.Study and bayesian_optimization.GPyOpt
import sherpa
import sherpa.algorithms.bayesian_optimization as bayesian_optimization

# Reloading custom file to incorporate changes dynamically
import importlib
import my_classes
importlib.reload(my_classes)

from my_classes import read_mean_and_std
from my_classes import TimeOut

import datetime

# Set seed for reproducibility
seed = 10
tf.random.set_seed(seed)

gpus = tf.config.experimental.list_physical_devices('GPU')
# tf.config.experimental.set_visible_devices(gpus[3], 'GPU')

In [4]:
# Won't run on a CPU node
try:
    # Prevents crashes of the code
    physical_devices = tf.config.list_physical_devices('GPU')
    tf.config.set_visible_devices(physical_devices[0], 'GPU')
    # Allow the growth of memory Tensorflow allocates (limits memory usage overall)
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
except (IndexError, RuntimeError, ValueError):
    # No GPU found, or the devices were already initialized
    pass

In [5]:
scaler = StandardScaler()

### Load the data

In [6]:
# input_narval = np.load(path_data + '/cloud_cover_input_narval.npy')
# input_qubicc = np.load(path_data + '/cloud_cover_input_qubicc.npy')
# output_narval = np.load(path_data + '/cloud_cover_output_narval.npy')
# output_qubicc = np.load(path_data + '/cloud_cover_output_qubicc.npy')

In [7]:
input_data = np.concatenate((np.load(path_data + '/cloud_cover_input_narval.npy'), 
                             np.load(path_data + '/cloud_cover_input_qubicc.npy')), axis=0)
output_data = np.concatenate((np.load(path_data + '/cloud_cover_output_narval.npy'), 
                              np.load(path_data + '/cloud_cover_output_qubicc.npy')), axis=0)

In [8]:
samples_narval = np.load(path_data + '/cloud_cover_output_narval.npy').shape[0]

In [9]:
(samples_total, no_of_features) = input_data.shape
(samples_total, no_of_features)

(1008901640, 10)

*Temporal cross-validation*

Split the data into two-week increments (when working with 3 months of data) or into 25-day increments (when working with 5 months of data). <br>
1.: Validate on increments 1 and 4 <br>
2.: Validate on increments 2 and 5 <br>
3.: Validate on increments 3 and 6

--> 2/3 training data, 1/3 validation data
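
The split described above can be illustrated on a toy example. This is a minimal sketch in plain Python, assuming a hypothetical data set of only 12 time-ordered samples (so each of the six increments holds 2 samples); the helper name `temporal_folds` is made up for this illustration and is not part of the notebook.

```python
# Toy version of the temporal split: 12 time-ordered samples, 6 increments of 2.
samples_total = 12          # hypothetical size, chosen so the folds are easy to read
incr = samples_total // 6   # number of samples in one time increment


def temporal_folds(samples_total, incr):
    """Return a list of (training, validation) index lists for the three folds."""
    folds = []
    for i in range(3):
        # Fold i validates on increment i and increment i+3 (zero-based)
        valid = list(range(incr * i, incr * (i + 1))) + \
                list(range(incr * (i + 3), incr * (i + 4)))
        valid_set = set(valid)
        # Everything that is not used for validation is used for training
        train = [k for k in range(samples_total) if k not in valid_set]
        folds.append((train, valid))
    return folds


folds = temporal_folds(samples_total, incr)
for n, (train, valid) in enumerate(folds, start=1):
    print(f"fold {n}: validate on {valid}, train on {train}")
```

The first fold validates on samples `[0, 1, 6, 7]` and trains on the remaining eight, which gives the 2/3 to 1/3 ratio stated above.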

In [10]:
training_folds = []
validation_folds = []
two_week_incr = samples_total//6  # Number of samples in one time increment

for i in range(3):
    # Note that this is a temporal split since time was the first dimension in the original tensor
    first_incr = np.arange(two_week_incr*i, two_week_incr*(i+1))
    second_incr = np.arange(two_week_incr*(i+3), two_week_incr*(i+4))

    validation_folds.append(np.append(first_incr, second_incr))
    training_folds.append(np.arange(samples_total))
    training_folds[i] = np.delete(training_folds[i], validation_folds[i])

### 3-fold cross-validation

In [11]:
#We loop through the folds
def run_cross_validation(i):
    
    filename = 'cross_validation_cell_based_fold_%d'%(i+1)
    
    #Standardize according to the fold
    scaler.fit(input_data[training_folds[i]])

    #Load the data for the respective fold and convert it to tf data
    input_train = scaler.transform(input_data[training_folds[i]])
    input_valid = scaler.transform(input_data[validation_folds[i]])
    output_train = output_data[training_folds[i]]
    output_valid = output_data[validation_folds[i]]
    
    # Column-based: batchsize of 128
    # Possibly better to use .apply(tf.data.experimental.copy_to_device("/gpu:0")) before prefetch
    # Note: the pipeline below does shuffle (buffer of 10**5 samples),
    # contrary to the plan of not shuffling during hyperparameter tuning
    train_ds = tf.data.Dataset.zip((tf.data.Dataset.from_tensor_slices(input_train), 
                                tf.data.Dataset.from_tensor_slices(output_train))) \
                .shuffle(10**5, seed=seed) \
                .batch(batch_size=1024, drop_remainder=True) \
                .prefetch(1)
    
    # No need to add prefetch.
    # tf data with batch_size=10**5 makes the validation evaluation 10 times faster
    valid_ds = tf.data.Dataset.zip((tf.data.Dataset.from_tensor_slices(input_valid), 
                                tf.data.Dataset.from_tensor_slices(output_valid))) \
                .batch(batch_size=10**5, drop_remainder=False)
    
    return train_ds, valid_ds
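
As a sanity check on the batching, the number of batches per epoch can be worked out from the sample count printed earlier. The sketch below is plain arithmetic; it only assumes that every fold has the same size, which follows from the way the folds are constructed.

```python
import math

samples_total = 1008901640      # from the shape printed above
incr = samples_total // 6       # samples in one time increment
n_valid = 2 * incr              # each validation fold holds two increments
n_train = samples_total - n_valid

# drop_remainder=True discards the last, incomplete training batch
train_steps = n_train // 1024
# drop_remainder=False keeps the last, incomplete validation batch
valid_steps = math.ceil(n_valid / 10**5)

print(n_train, n_valid)
print(train_steps, valid_steps)
```

This gives 656837 training batches per epoch, which agrees with the step count in the training log at the end of the notebook, and 3364 validation batches.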

In [12]:
#Should be a pretty unique number
random_num = np.random.randint(500000)
print(random_num)

def save_model(study, today, optimizer):
    out_path = '/pf/b/b309170/workspace_icon-ml/cloud_cover_parameterization/grid_cell_based_QUBICC_R02B05/sherpa_results/phase_three_'+\
            today+'_'+optimizer+'_'+str(random_num)
    
    study.results = study.results[study.results['Status']=='COMPLETED'] #To specify results
    study.results.index = study.results['Trial-ID']  #Trial-ID serves as a better index
    # Create the directory and save the SHERPA-output in it
    try:
        os.mkdir(out_path)
    except OSError:
        print('Creation of the directory %s failed' % out_path)
    else: 
        print('Successfully created the directory %s' % out_path)
    study.save(out_path)

225259


In [13]:
# The only practical way to reset the model is to re-initialize it
def initialize_model(par):
    model = Sequential()
    
    # Input layer
    model.add(Dense(units=par['num_units'], activation='tanh', input_dim=no_of_features,
                   kernel_regularizer=l1_l2(l1=par['l1_reg'], l2=par['l2_reg'])))

    # Hidden layers    
    model.add(Dense(units=par['num_units'], activation=nn.leaky_relu, 
                    kernel_regularizer=l1_l2(l1=par['l1_reg'], l2=par['l2_reg'])))
    model.add(Dropout(par['dropout'])) #After every hidden layer we (potentially) add a dropout layer
    model.add(BatchNormalization())

    model.add(Dense(units=par['num_units'], activation='tanh', 
                    kernel_regularizer=l1_l2(l1=par['l1_reg'], l2=par['l2_reg'])))
    model.add(Dropout(par['dropout'])) #After every hidden layer we (potentially) add a dropout layer
    
    # Output layer
    model.add(Dense(1, activation='linear', 
                    kernel_regularizer=l1_l2(l1=par['l1_reg'], l2=par['l2_reg'])))
    
    # learning_rate is the current name of the argument; lr is a deprecated alias
    optimizer = Adadelta(learning_rate=par['lrinit'], epsilon=par['epsilon']) 
        
    model.compile(loss='mse', optimizer=optimizer)
    return model
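
To get a feeling for the size of this network, the parameter count can be estimated by hand. This is a rough sketch assuming `num_units = 64` and 10 input features (the values used in this phase) and default `BatchNormalization` settings, i.e. four vectors per unit of which two are trainable; `dense_params` is a helper introduced only for this illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


no_of_features = 10
num_units = 64

layer_params = {
    'dense_input': dense_params(no_of_features, num_units),
    'dense_hidden_1': dense_params(num_units, num_units),
    # gamma, beta, moving mean and moving variance (Dropout has no parameters)
    'batch_norm': 4 * num_units,
    'dense_hidden_2': dense_params(num_units, num_units),
    'dense_output': dense_params(num_units, 1),
}

total = sum(layer_params.values())
# The moving mean and moving variance are not updated by gradient descent
trainable = total - 2 * num_units

print(layer_params)
print(total, trainable)
```

Under these assumptions the model has 9345 parameters, 9217 of them trainable, which is tiny compared to the roughly one billion samples.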

In [14]:
# Divide lr by 20 every two epochs. The epoch index is zero-based,
# so the first reduction happens at index 2, i.e. in the third epoch
def scheduler_stephan(epoch, lr):
    if epoch > 0 and epoch%2==0:
        return lr/20
    else:
        return lr
    
def scheduler_fast(epoch, lr):
    if epoch > 0:
        return lr/20
    else:
        return lr
    
def scheduler_slow(epoch, lr):
    return lr*np.exp(-0.1)
    
callback_stephan = tf.keras.callbacks.LearningRateScheduler(scheduler_stephan, verbose=1)
callback_fast = tf.keras.callbacks.LearningRateScheduler(scheduler_fast, verbose=1)
callback_slow = tf.keras.callbacks.LearningRateScheduler(scheduler_slow, verbose=1)

callback_choices = [callback_stephan, callback_fast, callback_slow]
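
The three schedules are easier to compare when the resulting learning rates are written out. The following sketch re-states the schedulers in plain Python and mimics how `LearningRateScheduler` applies them: at the start of every epoch the function receives the zero-based epoch index and the current rate, and its return value becomes the new rate. The `simulate` helper is an assumption made for this illustration, not part of Keras.

```python
import math


def scheduler_stephan(epoch, lr):
    return lr / 20 if epoch > 0 and epoch % 2 == 0 else lr


def scheduler_fast(epoch, lr):
    return lr / 20 if epoch > 0 else lr


def scheduler_slow(epoch, lr):
    return lr * math.exp(-0.1)


def simulate(schedule, lr_init, epochs):
    """Apply the schedule once per epoch and collect the resulting rates."""
    lr, rates = lr_init, []
    for epoch in range(epochs):
        lr = schedule(epoch, lr)
        rates.append(lr)
    return rates


lr_init = 0.000433   # lrinit of the Adadelta model
schedules = {'stephan': scheduler_stephan,
             'fast': scheduler_fast,
             'slow': scheduler_slow}
rates = {name: simulate(fn, lr_init, 6) for name, fn in schedules.items()}

for name, values in rates.items():
    print(name, [f"{v:.3e}" for v in values])
```

With the fast schedule the rate drops from 0.000433 to 2.165e-05 in the second epoch and keeps shrinking by a factor of 20 per epoch, which is the pattern visible in the training log at the end of the notebook. The slow schedule already applies its first decay in epoch 0.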

In [15]:
# For Leaky_ReLU:
from tensorflow import nn 

def lrelu(x):
    return nn.leaky_relu(x, alpha=0.01)

OPTIMIZER = 'adadelta'
parameters = [sherpa.Ordinal('num_units', [64]), #No need to vary these per layer. Could add 512.
             sherpa.Ordinal('model_depth', [4]), #Originally [2,8] although 8 was never truly tested
             sherpa.Ordinal('lrinit', [0.000433]),
             sherpa.Ordinal('epsilon', [0.1]),
             sherpa.Ordinal('dropout', [0.221]),
             sherpa.Ordinal('l1_reg', [0.004749]),
             sherpa.Ordinal('l2_reg', [0.008732])]

In [16]:
# max_num_trials is left unspecified, so the optimization will run until the end of the job-runtime

# alg = bayesian_optimization.GPyOpt(initial_data_points=good_hyperparams)

alg = bayesian_optimization.GPyOpt() 
study = sherpa.Study(parameters=parameters, algorithm=alg, lower_is_better=True)

INFO:sherpa.core:
-------------------------------------------------------
SHERPA Dashboard running. Access via
http://10.50.13.250:8880 if on a cluster or
http://localhost:8880 if running locally.
-------------------------------------------------------


 * Serving Flask app "sherpa.app.app" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on


In [None]:
# Usually setting patience=8
today = str(datetime.date.today())[:7] # YYYY-MM

for trial in study:
    
    val_loss = []
    
    # Which callback to choose
    cb = np.random.randint(3)

    # Cross-validate
    # Not using the keras_callback here as then the objective of a given trial is fixed after only one run of model fit
    for i in range(3):
        train_ds, valid_ds = run_cross_validation(i)
        
        # Initialize the model
        model = initialize_model(trial.parameters)
        history = model.fit(train_ds, epochs=epochs, verbose=2, validation_data=valid_ds, 
                            callbacks=[callback_choices[cb]]) 
        val_loss.append(np.min(history.history['val_loss']))
    
    # Using add_observation instead of keras_callback. 
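    # The index of the randomly chosen scheduler is stored as well,
    # so that each result can be attributed to its learning rate schedule
    study.add_observation(trial, objective=np.mean(val_loss), context={'Val-loss First Fold': val_loss[0], 
                                                                 'Val-loss Second Fold': val_loss[1], 
                                                                 'Val-loss Third Fold': val_loss[2],
                                                                 'Callback index': cb})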
    
#     study.add_observation(trial, objective=np.mean(val_loss))
    
    study.finalize(trial)
    save_model(study, today, OPTIMIZER)

Epoch 1/6

Epoch 00001: LearningRateScheduler reducing learning rate to 0.0004330000083427876.
656837/656837 - 2935s - loss: 49.5686 - val_loss: 53.9367
Epoch 2/6

Epoch 00002: LearningRateScheduler reducing learning rate to 2.165000041713938e-05.
656837/656837 - 2852s - loss: 46.3698 - val_loss: 47.1870
Epoch 3/6

Epoch 00003: LearningRateScheduler reducing learning rate to 1.082500057236757e-06.
656837/656837 - 2838s - loss: 54.8163 - val_loss: 46.6972
Epoch 4/6

Epoch 00004: LearningRateScheduler reducing learning rate to 5.412500172496948e-08.
656837/656837 - 2883s - loss: 84.2991 - val_loss: 46.7774
Epoch 5/6

Epoch 00005: LearningRateScheduler reducing learning rate to 2.7062501573027474e-09.
