# Optiver Realized Volatility Prediction - Train

**This notebook engineers order book and trade features and trains a neural network to predict realized volatility.**
---------

## Files
**book_[train/test].parquet** - A [parquet](https://arrow.apache.org/docs/python/parquet.html) file partitioned by `stock_id`. Provides order book data on the most competitive buy and sell orders entered into the market. The top two levels of the book are shared. The first level of the book is more competitive in price terms, so it receives execution priority over the second level.

 - `stock_id` - ID code for the stock. Not all `stock_id`s exist in every time bucket. Parquet coerces this column to the categorical data type when loaded; you may wish to convert it to int8.
 - `time_id` - ID code for the time bucket. `time_id`s are not necessarily sequential but are consistent across all stocks.
 - `seconds_in_bucket` - Number of seconds from the start of the bucket, always starting from 0.
 - `bid_price[1/2]` - Normalized prices of the most/second most competitive buy level.
 - `ask_price[1/2]` - Normalized prices of the most/second most competitive sell level.
 - `bid_size[1/2]` - The number of shares on the most/second most competitive buy level.
 - `ask_size[1/2]` - The number of shares on the most/second most competitive sell level.
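
Because the file is partitioned by `stock_id`, each stock sits in its own sub-directory named `stock_id=<id>`, which is the naming the loading helpers later in this notebook rely on. A minimal sketch of recovering the integer IDs from such names (the directory names below are made up for illustration, and `parse_stock_id` is a hypothetical helper, not part of the notebook):

```python
def parse_stock_id(partition_dir_name):
    """Extract the integer stock id from a name like 'stock_id=10'."""
    key, _, value = partition_dir_name.partition("=")
    if key != "stock_id" or not value.isdigit():
        raise ValueError(f"Unexpected partition name: {partition_dir_name!r}")
    return int(value)


# Hypothetical directory listing; os.listdir returns names in arbitrary order
names = ["stock_id=0", "stock_id=10", "stock_id=2"]
stock_ids = sorted(parse_stock_id(name) for name in names)
print(stock_ids)  # [0, 2, 10]
```

Sorting as integers avoids the lexicographic ordering (`"10"` before `"2"`) you would get from sorting the raw strings.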
 
**trade_[train/test].parquet** - A [parquet](https://arrow.apache.org/docs/python/parquet.html) file partitioned by `stock_id`. Contains data on trades that actually executed. Usually, in the market, there are more passive buy/sell intention updates (book updates) than actual trades, therefore one may expect this file to be more sparse than the order book.

 - `stock_id` - Same as above.
 - `time_id` - Same as above.
 - `seconds_in_bucket` - Same as above. Note that since trade and book data are taken from the same time window and trade data is generally more sparse, this field does not necessarily start from 0.
 - `price` - The average price of executed transactions happening in one second. Prices have been normalized and the average has been weighted by the number of shares traded in each transaction.
 - `size` - The total number of shares traded.
 - `order_count` - The number of unique trade orders taking place.
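
To make the `price` and `size` definitions concrete, here is a small sketch of how a share-weighted average price for one second could be computed. The transactions are invented for illustration; the real file already contains the aggregated values.

```python
# Hypothetical transactions within one second: (normalized price, shares traded)
transactions = [(1.0010, 100), (1.0020, 300)]

# `size` is the total number of shares traded in that second
size = sum(shares for _, shares in transactions)

# `price` is the average price weighted by shares traded per transaction
price = sum(p * shares for p, shares in transactions) / size

print(size, round(price, 5))
```

The larger transaction pulls the average toward its own price, which is why the result sits closer to 1.0020 than to 1.0010.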
 
**train.csv** The ground truth values for the training set.

 - `stock_id` - Same as above, but since this is a csv the column will load as an integer instead of categorical.
 - `time_id` - Same as above.
 - `target` - The realized volatility computed over the 10 minute window following the feature data under the same `stock_id`/`time_id`. There is no overlap between feature and target data. 
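
As a rough illustration of the quantity being predicted, the sketch below computes realized volatility as the square root of the sum of squared log returns, which is the same formula used by this notebook's `realized_volatility` helper. The price series is hypothetical and far shorter than a real 10 minute window.

```python
import math


def realized_volatility(log_returns):
    """Square root of the sum of squared log returns."""
    return math.sqrt(sum(r ** 2 for r in log_returns))


# Hypothetical prices observed within one window (illustrative only)
prices = [1.000, 1.001, 0.999, 1.002]

# Log return between each pair of consecutive prices
log_returns = [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

vol = realized_volatility(log_returns)
print(f"{vol:.6f}")
```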
 
**test.csv** Provides the mapping between the other data files and the submission file. As with other test files, most of the data is only available to your notebook upon submission with just the first few rows available for download.

 - `stock_id` - Same as above.
 - `time_id` - Same as above.
 - `row_id` - Unique identifier for the submission row. There is one row for each existing `stock_id`/`time_id` pair. Not every time window contains every stock.
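
A minimal sketch of building `row_id` values, assuming the `<stock_id>-<time_id>` format that this notebook's `get_row_id` helper produces. The ID pairs are invented, and `make_row_id` is a hypothetical name.

```python
def make_row_id(stock_id, time_id):
    """Join the two IDs with a hyphen, e.g. (0, 4) -> '0-4'."""
    return f"{stock_id}-{time_id}"


# Hypothetical (stock_id, time_id) pairs
pairs = [(0, 4), (0, 32), (1, 4)]
row_ids = [make_row_id(stock_id, time_id) for stock_id, time_id in pairs]
print(row_ids)  # ['0-4', '0-32', '1-4']
```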
 
**sample_submission.csv** - A sample submission file in the correct format.

 - `row_id` - Same as in test.csv.
 - `target` - Same definition as in **train.csv**. The benchmark uses the median target value from **train.csv**.
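
The median benchmark can be sketched in a few lines. The targets below are invented, and the score uses the same root mean squared percentage error formula as the `rmspe` helper defined later in this notebook.

```python
import math
from statistics import median


def rmspe(y_true, y_pred):
    """Root mean squared percentage error."""
    squared_pct_errors = [((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_pct_errors) / len(squared_pct_errors))


# Hypothetical training targets (illustrative only)
train_targets = [0.002, 0.004, 0.008]

# Predict the same constant, the training median, for every row
baseline = median(train_targets)
predictions = [baseline] * len(train_targets)

score = rmspe(train_targets, predictions)
print(baseline, round(score, 4))
```

Because the error is relative to the true value, under-predicting a small target is penalised more heavily than the same absolute miss on a large one.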
 
## Prepare Environment
### Import Packages

In [None]:
# General packages
import pandas as pd
import numpy as np
import pyarrow.parquet as pq # To handle parquet files
import os
import gc
import random
from tqdm import tqdm, tqdm_notebook
from pathlib import Path
import multiprocessing

import time
import warnings
warnings.filterwarnings('ignore')

# Data vis packages
import matplotlib.pyplot as plt
%matplotlib inline

# Data prep
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA

# Modelling packages
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as k  # public API; tensorflow.python.* is private
# Key layers
from tensorflow.keras.models import Model, Sequential, load_model
from tensorflow.keras.layers import Input, Add, Dense, Flatten
# Activation layers
from tensorflow.keras.layers import ReLU, LeakyReLU, ELU, ThresholdedReLU
# Dropout layers
from tensorflow.keras.layers import Dropout, AlphaDropout, GaussianDropout
# Normalisation layers
from tensorflow.keras.layers import BatchNormalization
# Embedding layers
from tensorflow.keras.layers import Embedding, Concatenate, Reshape
# Callbacks
from tensorflow.keras.callbacks import Callback, EarlyStopping, LearningRateScheduler, ModelCheckpoint
# Optimisers
from tensorflow.keras.optimizers import SGD, RMSprop, Adam, Adadelta, Adagrad, Adamax, Nadam, Ftrl
# Model cross validation and evaluation
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.losses import binary_crossentropy

# For Bayesian hyperparameter searching
from skopt import gbrt_minimize, gp_minimize
from skopt.utils import use_named_args
from skopt.space import Real, Categorical, Integer

In [None]:
strategy = tf.distribute.get_strategy()
REPLICAS = strategy.num_replicas_in_sync

# Data access
gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)

# Get number of cpu cores for multiprocessing
try:
    cpus = int(multiprocessing.cpu_count() / 2)
except NotImplementedError:
    cpus = 1 # Default number of cores
    
print(f"Num GPUs Available: {len(tf.config.list_physical_devices('GPU'))}")
print(f"Num CPU Threads Available: {cpus}")
print(f'REPLICAS: {REPLICAS}')

### Read in Data

In [None]:
# Data paths
comp_dir_path = Path("../input/optiver-realized-volatility-prediction")

# Train paths
train_book_path   = comp_dir_path/"book_train.parquet"
train_trade_path  = comp_dir_path/"trade_train.parquet"
train_labels_path = comp_dir_path/"train.csv"

# Test paths
test_book_path   = comp_dir_path/"book_test.parquet"
test_trade_path  = comp_dir_path/"trade_test.parquet"
test_labels_path = comp_dir_path/"test.csv"

# Sample submission path
sample_sub_path = comp_dir_path/"sample_submission.csv"

In [None]:
# Define helper functions for data reading
def get_stock_ids_list(data_dir_path):
    data_dir = os.listdir(data_dir_path)
    # Get list of stock ids in directory
    stock_ids = list(map(lambda x: x.split("=")[1], data_dir))
    return stock_ids
    
    
def load_book_stock_id_data(stock_id):
    # Get stock id extension
    stock_id_ext = f"stock_id={stock_id}"
    
    # Read individual stock parquet file
    if is_train_test == "train":
        book_stock_id_path = os.path.join(train_book_path, stock_id_ext)
    elif is_train_test == "test":
        book_stock_id_path = os.path.join(test_book_path, stock_id_ext)
    book_stock_id = pd.read_parquet(book_stock_id_path)
    
    # Add stock id feature from filename
    book_stock_id["stock_id"] = int(stock_id)
            
    return book_stock_id

def load_trade_stock_id_data(stock_id):
    # Get stock id extension
    stock_id_ext = f"stock_id={stock_id}"
    
    # Read individual stock parquet file
    if is_train_test == "train":
        trade_stock_id_path = os.path.join(train_trade_path, stock_id_ext)
    elif is_train_test == "test":
        trade_stock_id_path = os.path.join(test_trade_path, stock_id_ext)
    trade_stock_id = pd.read_parquet(trade_stock_id_path)
    
    # Add stock id feature from filename
    trade_stock_id["stock_id"] = int(stock_id)
            
    return trade_stock_id

In [None]:
%%time
# Get list of stock ids
train_stock_ids = get_stock_ids_list(train_book_path)
test_stock_ids = get_stock_ids_list(test_book_path)

# Read train data
is_train_test = "train"
# Create worker pool and read (note: only the first two stock ids are read below)
pool         = multiprocessing.Pool(processes=cpus)
train_book   = pd.concat(pool.map(load_book_stock_id_data, train_stock_ids[0:2]))
train_trade  = pd.concat(pool.map(load_trade_stock_id_data, train_stock_ids[0:2]))
train_labels = pd.read_csv(train_labels_path)
# Close worker pool
pool.close()
pool.join()

# Read test data
is_train_test = "test"
# Create worker pool and read
pool        = multiprocessing.Pool(processes=cpus)
test_book   = pd.concat(pool.map(load_book_stock_id_data, test_stock_ids))
test_trade  = pd.concat(pool.map(load_trade_stock_id_data, test_stock_ids))
test_labels = pd.read_csv(test_labels_path)
# Close worker pool
pool.close()
pool.join()

# Read sample submission
sample_sub = pd.read_csv(sample_sub_path)

# Print data dimensions
print("TRAIN DATA DIMENSIONS")
print(f"train_book shape: {train_book.shape}")
print(f"train_trade shape: {train_trade.shape}")
print(f"train_labels shape: {train_labels.shape}")

print("\nTEST DATA DIMENSIONS")
print(f"test_book shape: {test_book.shape}")
print(f"test_trade shape: {test_trade.shape}")
print(f"test_labels shape: {test_labels.shape}\n")

## Data Preparation
### Define Feature Engineering Functions

In [None]:
# Define helper functions for data manipulation
def get_log_return(list_stock_prices):
    return np.log(list_stock_prices).diff()


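def get_trade_log_return(df_trade, col_stock_id, col_time_id, col_price):
    """
    Returns the log return of trade prices within each stock ID/time ID group,
    aligned row-for-row with the input dataframe.
    """
    # transform keeps the original index, so the result can be assigned back as a column
    trade_log_return = df_trade.groupby([col_stock_id, col_time_id])[col_price].transform(get_log_return)
    trade_log_return = trade_log_return.fillna(0)
    return trade_log_return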


def get_agg_feature(df, col_name, func):
    """
    Returns aggregated feature by stock ID and time ID based on input df and feature.
    """
    if "function" in str(func):
        func_str = str(func).split(" ")[1]
        agg_feat_col_name = f"{col_name}_{func_str}"
    else:
        agg_feat_col_name = f"{col_name}_{func}"
    
    agg_feat = df.groupby(by=["stock_id", "time_id"])[col_name].agg(func)
    agg_feat = agg_feat.replace([np.inf, -np.inf], np.nan).fillna(0)
    agg_feat = agg_feat.reset_index().rename(columns={col_name: agg_feat_col_name})
    
    return agg_feat


def get_wap(df_book, col_bid_price, col_ask_price, col_bid_size, col_ask_size):
    """
    Returns Weighted Average Price. 
    """
    wap_numerator = df_book[col_bid_price]  * df_book[col_ask_size]
    wap_numerator += df_book[col_ask_price] * df_book[col_bid_size]
    
    wap_denominator = df_book[col_bid_size] + df_book[col_ask_size]
    
    return wap_numerator / wap_denominator


def get_wap_combined(df_book, col_bid_price1, col_ask_price1, col_bid_size1, col_ask_size1,
                     col_bid_price2, col_ask_price2, col_bid_size2, col_ask_size2):    
    """
    Returns the Combined Weighted Average Price for both Bid and Ask features.
    """
    wap_numerator1  = df_book[col_bid_price1] * df_book[col_ask_size1]
    wap_numerator1 += df_book[col_ask_price1] * df_book[col_bid_size1]
    wap_numerator2  = df_book[col_bid_price2] * df_book[col_ask_size2]
    wap_numerator2 += df_book[col_ask_price2] * df_book[col_bid_size2]
    
    wap_denominator  = df_book[col_bid_size1] + df_book[col_ask_size1]
    wap_denominator += df_book[col_bid_size2] + df_book[col_ask_size2]
    
    return (wap_numerator1 + wap_numerator2) / wap_denominator


def get_wap_avg(df_book, col_bid_price1, col_ask_price1, col_bid_size1, col_ask_size1,
                col_bid_price2, col_ask_price2, col_bid_size2, col_ask_size2):
    """
    Returns the Combined Average Weighted Average Price for both Bid and Ask features.
    """
    wap_numerator1  = df_book[col_bid_price1] * df_book[col_ask_size1]
    wap_numerator1 += df_book[col_ask_price1] * df_book[col_bid_size1]
    wap_numerator1 /= df_book[col_bid_size1] + df_book[col_ask_size1]
    
    wap_numerator2  = df_book[col_bid_price2] * df_book[col_ask_size2]
    wap_numerator2 += df_book[col_ask_price2] * df_book[col_bid_size2]
    wap_numerator2 /= df_book[col_bid_size2] + df_book[col_ask_size2]
    
    return (wap_numerator1 + wap_numerator2) / 2


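def get_vol_wap(df_book, col_stock_id, col_time_id, col_wap):
    """
    Returns the log return of the given WAP column within each stock ID/time ID
    group. These log returns are later aggregated into realized volatility.
    """
    # transform keeps the original index, so the result can be assigned back as a column
    vol_wap = df_book.groupby([col_stock_id, col_time_id])[col_wap].transform(get_log_return)
    vol_wap = vol_wap.fillna(0)
    return vol_wap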


def get_bid_ask_spread(df_book, col_bid_price1, col_ask_price1, col_bid_price2, col_ask_price2):
    """
    Returns the relative bid-ask spread using the best ask and best bid
    across both book levels.
    """
    best_ask = df_book[[col_ask_price1, col_ask_price2]].min(axis=1)
    best_bid = df_book[[col_bid_price1, col_bid_price2]].max(axis=1)
    
    # Subtract 1 after dividing; subtracting it from the denominator
    # would divide by a value close to zero for normalized prices
    return best_ask / best_bid - 1


def get_vertical_spread(df_book, col_price1, col_price2):
    """
    Returns the vertical spread for Bid/Ask price features inputted.
    """
    v_spread = df_book[col_price1] - df_book[col_price2]
    return v_spread


def get_spread_feature(df_book, col_price_a, col_price_b):
    """
    Returns a spread feature based on the price features inputted.
    """
    spread_feat = df_book[col_price_a] - df_book[col_price_b]
    return spread_feat


def realized_volatility(series_log_return):
    """
    Returns the realized volatility for a given period.
    """
    return np.sqrt(np.sum(series_log_return**2))


def rmspe(y_true, y_pred):
    """
    Returns the Root Mean Squared Percentage Error.
    """
    rmspe = np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))
    return rmspe


def get_row_id(df, col_stock_id, col_time_id):
    """
    Returns row ids in format required for submission. 
    """
    row_ids = df[col_stock_id].astype("str") + "-" + df[col_time_id].astype("str")
    return row_ids
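
As a quick sanity check of the weighted average price used by `get_wap` above, here is the same formula applied to a single hypothetical quote. Note that each price is weighted by the size on the *opposite* side of the book. The numbers are invented for illustration.

```python
def wap(bid_price, ask_price, bid_size, ask_size):
    """Weighted average price: each price is weighted by the opposite side's size."""
    return (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)


# Hypothetical top-of-book quote with more size resting on the bid
value = wap(bid_price=0.99, ask_price=1.01, bid_size=300, ask_size=100)
print(round(value, 6))  # about 1.005

# With equal sizes the WAP collapses to the mid price
mid = wap(bid_price=0.99, ask_price=1.01, bid_size=100, ask_size=100)
print(round(mid, 6))  # about 1.0
```

A heavier bid side pushes the WAP toward the ask, reflecting buying pressure.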

In [None]:
# Compile data manipulation helper functions into complete functions
def extract_trade_feature_set(df_trade):
    """
    Returns engineered trade dataset, where each row is a unique stock ID/time ID pair.
    """
    # Get the Log return for trades by stock ID and time ID
    df_trade["trade_log_return"] = get_trade_log_return(df_trade, "stock_id", "time_id", "price")
    
    # Get aggregate statistics for specified numerical features
    trade_features = ["price", "size", "order_count", "trade_log_return"]
    
    for trade_feature in trade_features:
        # Get min aggregations
        df_trade = df_trade.merge(
            get_agg_feature(df=df_trade, col_name=trade_feature, func="min"),
            how="left",
            on=["stock_id", "time_id"]
        )
        # Get max aggregations
        df_trade = df_trade.merge(
            get_agg_feature(df=df_trade, col_name=trade_feature, func="max"),
            how="left",
            on=["stock_id", "time_id"]
        )
        # Get mean aggregations
        df_trade = df_trade.merge(
            get_agg_feature(df=df_trade, col_name=trade_feature, func="mean"),
            how="left",
            on=["stock_id", "time_id"]
        )
        # Get std aggregations
        df_trade = df_trade.merge(
            get_agg_feature(df=df_trade, col_name=trade_feature, func="std"),
            how="left",
            on=["stock_id", "time_id"]
        )
        # Get sum aggregations
        df_trade = df_trade.merge(
            get_agg_feature(df=df_trade, col_name=trade_feature, func="sum"),
            how="left",
            on=["stock_id", "time_id"]
        )
    
    # Reduce trade df to just unique stock ID and time ID pairs
    df_trade = df_trade.drop(["seconds_in_bucket", "price", "size", "order_count", "trade_log_return"], axis=1)
    df_trade = df_trade.drop_duplicates().reset_index(drop=True)
    
    return df_trade


def extract_book_feature_set(df_book):
    """
    Returns engineered book dataset, where each row is a unique stock ID/time ID pair.
    """
    # WAP for both bid/ask price/size features
    df_book["wap1"] = get_wap(df_book, "bid_price1", "ask_price1", "bid_size1", "ask_size1")
    df_book["wap2"] = get_wap(df_book, "bid_price2", "ask_price2", "bid_size2", "ask_size2")
    # Combined WAP
    df_book["wap_combined"] = get_wap_combined(
        df_book, "bid_price1", "ask_price1", "bid_size1", "ask_size1", 
        "bid_price2", "ask_price2", "bid_size2", "ask_size2"
    )
    # Average WAP for both bid/ask price/size features
    df_book["wap_avg"] = get_wap_avg(
        df_book, "bid_price1", "ask_price1", "bid_size1", "ask_size1", 
        "bid_price2", "ask_price2", "bid_size2", "ask_size2"
    )
    
    # Get VWAPS based on different WAP features
    df_book["vol_wap1"]         = get_vol_wap(df_book, "stock_id", "time_id", "wap1")
    df_book["vol_wap2"]         = get_vol_wap(df_book, "stock_id", "time_id", "wap2")
    df_book["vol_wap_combined"] = get_vol_wap(df_book, "stock_id", "time_id", "wap_combined")
    df_book["vol_wap_avg"]      = get_vol_wap(df_book, "stock_id", "time_id", "wap_avg")
    
    # Get different spread features
    df_book["bid_ask_spread"] = get_bid_ask_spread(df_book, "bid_price1", "ask_price1", "bid_price2","ask_price2")
    df_book["bid_v_spread"]   = get_vertical_spread(df_book, "bid_price1", "bid_price2")
    df_book["ask_v_spread"]   = get_vertical_spread(df_book, "ask_price1", "ask_price2")
    df_book["h_spread1"]      = get_spread_feature(df_book, "ask_price1", "bid_price1")
    df_book["h_spread2"]      = get_spread_feature(df_book, "ask_price2", "bid_price2")
    df_book["spread_diff1"]   = get_spread_feature(df_book, "ask_price1", "bid_price2")
    df_book["spread_diff2"]   = get_spread_feature(df_book, "ask_price2", "bid_price1")
    
    # Get aggregated volatility features for each VWAP
    vol_features = ["vol_wap1", "vol_wap2", "vol_wap_combined", "vol_wap_avg"]
    
    for vol_feature in vol_features:
         df_book = df_book.merge(
             get_agg_feature(df=df_book, col_name=vol_feature, func=realized_volatility),
             how="left",
             on=["stock_id", "time_id"]
         )
            
    # Get aggregated features for different spread features
    spread_features = [
        "bid_ask_spread", "bid_v_spread", "ask_v_spread", "h_spread1", 
        "h_spread2", "spread_diff1", "spread_diff2"
    ]
    
    for spread_feature in spread_features:
        # Get min aggregations
        df_book = df_book.merge(
            get_agg_feature(df=df_book, col_name=spread_feature, func="min"),
            how="left",
            on=["stock_id", "time_id"]
        )
        # Get max aggregations
        df_book = df_book.merge(
            get_agg_feature(df=df_book, col_name=spread_feature, func="max"),
            how="left",
            on=["stock_id", "time_id"]
        )
        # Get mean aggregations
        df_book = df_book.merge(
             get_agg_feature(df=df_book, col_name=spread_feature, func="mean"),
             how="left",
             on=["stock_id", "time_id"]
        )
        # Get std aggregations
        df_book = df_book.merge(
            get_agg_feature(df=df_book, col_name=spread_feature, func="std"),
            how="left",
            on=["stock_id", "time_id"]
        )
        # Get sum aggregations
        df_book = df_book.merge(
            get_agg_feature(df=df_book, col_name=spread_feature, func="sum"),
            how="left",
            on=["stock_id", "time_id"]
        )

    # Reduce book df to just unique stock ID and time ID pairs
    df_book = df_book.drop([
        "seconds_in_bucket", "bid_price1", "ask_price1", "bid_price2", 
        "ask_price2", "bid_size1", "ask_size1", "bid_size2", "ask_size2",
        # WAP features
        "wap1", "wap2", "wap_combined", "wap_avg", "vol_wap1", 
        "vol_wap2", "vol_wap_combined", "vol_wap_avg", 
        # Spread features
        "bid_ask_spread", "bid_v_spread", "ask_v_spread", "h_spread1", 
        "h_spread2", "spread_diff1", "spread_diff2" 
    ], axis=1)
    df_book = df_book.drop_duplicates().reset_index(drop=True)
    
    return df_book


def get_initial_feature_set(df_train, df_trade, df_book):
    """
    Returns engineered feature set with labels, before preprocessing
    """
    # Extract trade and book features
    df_trade = extract_trade_feature_set(df_trade)
    df_book  = extract_book_feature_set(df_book)
    # Merge trade and book features to labels
    df_train = pd.merge(df_train, df_trade, how="inner", on=["stock_id", "time_id"])
    df_train = pd.merge(df_train, df_book, how="inner", on=["stock_id", "time_id"])
    
    return df_train

### Full Data Manipulation Pipeline

In [None]:
# Run feature generation pipeline
train = get_initial_feature_set(train_labels, train_trade, train_book)

# Slice ends are exclusive, so 0:1000 and 1000:2000 cover rows 0-1999 without gaps
X_tdx = train[0:1000].drop("target", axis=1)
X_vdx = train[1000:2000].drop("target", axis=1)
y_tdx = train[0:1000]["target"]
y_vdx = train[1000:2000]["target"]

In [None]:
# Define key parameters
baseline_model = True

SEED = 14
np.random.seed(SEED)

SCALER_METHOD = RobustScaler()

FEATURE_SELECTOR = RandomForestRegressor(random_state=SEED)
NUM_FEATURES = 500

PCA_METHOD = PCA(random_state=SEED)

EPOCHS = 100
BATCH_SIZE = 64
KFOLDS = 2
PATIENCE = 10

MODEL_TO_USE = "nn"
model_name_save = f"{MODEL_TO_USE}_final_classifier_seed_{str(SEED)}"

print(f"Model name: {model_name_save}")

In [None]:
# Define full dataset transformation pipeline
def transform_dataset(X_train, X_val, y_train, y_val, 
                      verbose=0, 
                      scaler=SCALER_METHOD, 
                      feature_selector=FEATURE_SELECTOR,
                      num_features=NUM_FEATURES,
                      pca=PCA_METHOD, 
                      seed=SEED
                     ):
    """
    Takes in train and validation datasets, and applies feature transformations,
    feature selection, scaling and pca (dependent on arguments). 
    
    Returns transformed X_train and X_val data ready for training/prediction.
    """

    
    ## DATA PREPARATION ##

    # Get indices for train and validation dfs - we'll need these later
    train_idx = list(X_train.index)
    val_idx   = list(X_val.index)
       
    # Get train colnames before scaling and feature selection (minus ID features)
    feat_cols = X_train.drop(["stock_id", "time_id"], axis=1).columns
    
    # Get subset for ID features
    train_id_feats = X_train[["stock_id", "time_id"]]
    val_id_feats = X_val[["stock_id", "time_id"]]
    
    
    ## SCALING ##
    
    if scaler is not None:
        if verbose == 1:
            print("APPLYING SCALER...")
            
        # Fit and transform scaler to train and val
        scaler.fit(X_train.drop(["stock_id", "time_id"], axis=1))
        X_train = scaler.transform(X_train.drop(["stock_id", "time_id"], axis=1))
        X_val   = scaler.transform(X_val.drop(["stock_id", "time_id"], axis=1))
        
        # Convert scaled array back to a dataframe
        X_train = pd.DataFrame(X_train, index=train_idx, columns=feat_cols)
        X_train = pd.merge(train_id_feats, X_train, how="left", left_index=True, right_index=True)
        
        X_val = pd.DataFrame(X_val, index=val_idx, columns=feat_cols)
        X_val = pd.merge(val_id_feats, X_val, how="left", left_index=True, right_index=True)

        
    ## FEATURE SELECTION ##
    
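    # Feature selection is only run on numerical data
    # Default to all columns so the returned list is defined when selection is skipped
    selected_features = list(X_train.columns)
    if feature_selector is not None: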
        if verbose == 1:
            print("APPLYING FEATURE SELECTOR...")
            cols_num = X_train.shape[1]
                
        # Fit tree-based regressor to select features
        feature_selector_fit = SelectFromModel(estimator=feature_selector)
        feature_selector_fit = feature_selector_fit.fit(X_train, y_train)
        
        # Retrieve the names of the selected features
        feature_idx = feature_selector_fit.get_support()
        selected_features = list(X_train.columns[feature_idx])
        
        # Subset datasets to selected features only
        X_train = X_train[selected_features]
        X_val   = X_val[selected_features]
        
        if verbose == 1: 
            print(f"{cols_num - X_train.shape[1]} features removed in feature selection.")
                
        
    ## PCA ##
    
    if pca is not None:
        if verbose == 1:
            print("APPLYING PCA...")
        # Fit and transform pca to train and val
        pca.fit(X_train)
        X_train = pca.transform(X_train)
        X_val   = pca.transform(X_val)
        if verbose == 1:
            print(f"NUMBER OF PRINCIPAL COMPONENTS: {pca.n_components_}")
        # Convert numerical features into pandas dataframe and clean colnames
        X_train = pd.DataFrame(X_train, index=train_idx).add_prefix("pca_")
        X_val   = pd.DataFrame(X_val, index=val_idx).add_prefix("pca_")
        
        
        
        
    if verbose == 1:
        print(f"TRAIN SHAPE: \t\t{X_train.shape}")
        print(f"VALIDATION SHAPE: \t{X_val.shape}")

        
    return X_train, X_val, selected_features

In [None]:
X_train, X_val, selected_features = transform_dataset(X_tdx, X_vdx, y_tdx, y_vdx, verbose=1)
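
The pipeline above fits every transformer on the training split only and then reuses the fitted statistics on the validation split. The sketch below illustrates that pattern for robust scaling in plain Python, assuming the default `RobustScaler` behaviour of centring on the median and scaling by the interquartile range. `fit_robust` and `apply_robust` are hypothetical helpers and the data is invented.

```python
from statistics import median, quantiles


def fit_robust(values):
    """Learn the centre (median) and scale (interquartile range) from training data."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


def apply_robust(values, center, scale):
    """Scale any split using statistics learned from the training split."""
    return [(v - center) / scale for v in values]


# Hypothetical single feature column
train_col = [1.0, 2.0, 3.0, 4.0, 5.0]
val_col = [3.0, 7.0]

center, scale = fit_robust(train_col)                # fitted on train only
scaled_train = apply_robust(train_col, center, scale)
scaled_val = apply_robust(val_col, center, scale)    # reuses train statistics

print(center, scale)  # 3.0 2.0
print(scaled_val)     # [0.0, 2.0]
```

Fitting on the training split alone keeps validation statistics out of the preprocessing, so the validation score is not flattered by leakage.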

## Modelling
### Learning Scheduler

In [None]:
def build_lrfn(lr_start          = 0.00001, 
               lr_max            = 0.0008, 
               lr_min            = 0.00001, 
               lr_rampup_epochs  = 20, 
               lr_sustain_epochs = 0, 
               lr_exp_decay      = 0.8):
    
    lr_max = lr_max * strategy.num_replicas_in_sync

    def lrfn(epoch):
        if epoch < lr_rampup_epochs:
            lr = (lr_max - lr_start) / lr_rampup_epochs * epoch + lr_start
        elif epoch < lr_rampup_epochs + lr_sustain_epochs:
            lr = lr_max
        else:
            lr = (lr_max - lr_min) * lr_exp_decay**(epoch - lr_rampup_epochs - lr_sustain_epochs) + lr_min
        return lr

    return lrfn

lrfn = build_lrfn()
lr = LearningRateScheduler(lrfn, verbose=0)

plt.plot([lrfn(epoch) for epoch in range(EPOCHS)])
plt.title('Learning Rate Schedule')
plt.xlabel('Epochs')
plt.ylabel('Learning Rate')
plt.show()
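
To read specific values off the schedule without plotting, the sketch below re-implements the same piecewise rule as a standalone function. It assumes a single replica (so `lr_max` is not rescaled) and the default arguments of `build_lrfn`; `lr_at` is a hypothetical name.

```python
def lr_at(epoch, lr_start=1e-5, lr_max=8e-4, lr_min=1e-5,
          rampup_epochs=20, sustain_epochs=0, exp_decay=0.8):
    """Linear ramp-up, optional plateau, then exponential decay towards lr_min."""
    if epoch < rampup_epochs:
        return (lr_max - lr_start) / rampup_epochs * epoch + lr_start
    if epoch < rampup_epochs + sustain_epochs:
        return lr_max
    decay_steps = epoch - rampup_epochs - sustain_epochs
    return (lr_max - lr_min) * exp_decay ** decay_steps + lr_min


for epoch in (0, 10, 20, 30, 99):
    print(epoch, f"{lr_at(epoch):.6f}")
```

With these defaults the rate starts at `lr_start`, peaks at `lr_max` at epoch 20, and then decays by a factor of 0.8 per epoch towards `lr_min`.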

### Define Baseline Model
The model below is the original baseline architecture. During the Bayesian hyperparameter search we vary it: tuning affects the model depth, the number of nodes in each layer, the dropout layers, the activation functions and the optimisers.
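
One way to expose depth and layer width as tunable parameters is to generate the hidden layer sizes from two numbers. The helper below is a hypothetical sketch, not the search code used in this notebook; it simply halves the width at each layer, which reproduces the baseline's sizes.

```python
def layer_widths(first_units, depth):
    """Return hidden layer sizes that halve at each successive layer."""
    return [first_units // (2 ** i) for i in range(depth)]


# Baseline-like configuration: six hidden layers starting at 8192 units
print(layer_widths(8192, 6))  # [8192, 4096, 2048, 1024, 512, 256]

# A shallower, narrower candidate a search might propose
print(layer_widths(1024, 3))  # [1024, 512, 256]
```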

In [None]:
if baseline_model:
    def get_model(X_train, y_train):
        
        input_ = Input(shape=(X_train.shape[1], ))
        x = Dense(8192, activation='relu')(input_)
            
        x = Dense(4096, activation='relu')(x)
        x = BatchNormalization()(x)
        x = Dropout(0.5)(x) 
        
        x = Dense(2048, activation='relu')(x)
        x = BatchNormalization()(x)
        x = Dropout(0.5)(x)
    
        x = Dense(1024, activation='relu')(x)
        x = BatchNormalization()(x)
        x = Dropout(0.5)(x)
        
        x = Dense(512, activation='relu')(x)
        x = BatchNormalization()(x)
        x = Dropout(0.5)(x)
        
        x = Dense(256, activation='relu')(x)
        x = BatchNormalization()(x)
        x = Dropout(0.5)(x)
        
        output = Dense(1, activation='linear')(x)
    
        model = Model(input_, output)
        
        return model