#  Introduction, Data Processing, Algorithm Demonstration and Key Hyperparameters

Yufei Jin

Hyper-Parameter Optimization (HPO) is an important part of AutoML, in which we search for the “best” set of hyper-parameters for an ML/DL model. Common methods include grid search and random search.

However, each time we want to evaluate the performance (e.g. prediction accuracy) of a set of hyper-parameters, we need to train the model from scratch, which can be extremely time-consuming and costly in computing resources. In other words, we can only afford a very limited number of data samples.

To deal with this issue, we introduce Bayesian Optimization (BO), which is known for being data-efficient and has become a powerful tool for HPO. Our task is to use Bayesian Optimization to search for the “best” hyper-parameters for the time series forecasting models (specifically the LSTM model in our case) used in the wealth management project. The metric of success is whether the hyperparameters found by BO beat the performance of the default hyperparameter configuration.

In [119]:
cd "C:\Users\Frederica\Desktop\msor semester 1\WM-SecuritySelection"

C:\Users\Frederica\Desktop\msor semester 1\WM-SecuritySelection


In [120]:
# All imports used in this project
import sys 
import os
from tqdm import tqdm
import csv
import warnings
import time
import json
import pickle # used to save serialized files

import pandas as pd 
import numpy as np

import matplotlib.pyplot as plt 
import seaborn as sns
sns.set()

import security_selection.configuration as conf
from utils import read_json, write_json

from data.main import Pool, Meta
from data.reader.helpers import read_reader_params_from_json, read_performance_measures_params_from_json

from security_selection.feature.main import Feature
from security_selection.feature.performance_measures import *

from security_selection.training.main import TrainingEnv

from security_selection.model.base import EstimatorBased, TransformerBased
from security_selection.model.sequence_model import LSTM, sequence_model_input_shape_map, sequence_model_single_dim_instance_prediction_input_shape_map

from datetime import datetime
from security_selection.training.helpers import *

from scipy.stats import kurtosis, skew

from dateutil.relativedelta import relativedelta

from sklearn.model_selection import train_test_split

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization


## Now, we need to prepare the data that we are going to train and test on



# DATA

In [121]:

filepath = "C:/Users/Frederica/Desktop/msor semester 1/WM-SecuritySelection/security_selection/automation/notebooks/MF_LargeCap_ExcessReturn_3Y.parquet"

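**Below are our parameter settings for data processing.**

In [122]:
# Spacing between the observations sampled inside one lookback window: number of days
feat_window = 90

# performance measures window: number of years
pm_window = 3
# Lookback window length: 3 * pm_window years, expressed in days
lb_window = int(3 * pm_window * 365.25) + 1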

# Window length between training samples: number of days
sample_window = 30

# test period start
test_start_date = '2020-06-30'





**We construct data_dict and label_dict for our dataset.**

In [123]:
def prepare_data_for_er_ari(filepath, label_window=pm_window):
    """Build per-ticker series and labels; the label at date t is the series value label_window years after t."""
    er_ari_df = pd.read_parquet(filepath)

    # One series per ticker, with missing observations dropped
    data_dict = {ticker: er_ari_df[ticker].dropna() for ticker in er_ari_df.columns}
    tickers_to_remove = []

    label_dict = {}
    for ticker, series in tqdm(data_dict.items()):
        # Skip tickers that have no observations left after dropna()
        if series.empty:
            tickers_to_remove.append(ticker)
            continue

        # Last date that still has a label label_window years ahead
        last_date = series.index[-1] - relativedelta(years=label_window)
        if last_date <= series.index[0]:
            tickers_to_remove.append(ticker)
            continue

        index = series.loc[:last_date].index
        label_dict[ticker] = pd.Series([
            series[date + relativedelta(years=label_window)] for date in index
        ], index=index)

    # Drop the tickers that could not be labelled
    for ticker in tickers_to_remove:
        data_dict.pop(ticker)

    return data_dict, label_dict


In [124]:
data_dict, label_dict = prepare_data_for_er_ari(filepath)

100%|██████████| 1330/1330 [03:33<00:00,  6.23it/s]
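
To make the forward-looking labels concrete, here is a minimal sketch using only the standard library. The toy series, its dates, and the helper `forward_labels` are hypothetical and are not part of the project code. Unlike the function above, the sketch silently skips dates whose future value is missing.

```python
from datetime import date

def forward_labels(series, years):
    """Label each date with the value observed `years` later.

    `series` is a dict mapping date -> value, a hypothetical stand-in
    for one ticker's pandas Series.
    """
    labels = {}
    for d in series:
        future = d.replace(year=d.year + years)
        if future in series:  # dates without a future value get no label
            labels[d] = series[future]
    return labels

# Toy series with one observation per year, 2015 to 2020 (made-up values)
toy_series = {date(2015 + k, 6, 30): float(k) for k in range(6)}
labels = forward_labels(toy_series, years=3)

for d, value in labels.items():
    print(d, "->", value)
```

Only the first three dates receive a label, because the later ones have no observation three years ahead. This mirrors why tickers with less than `pm_window` years of history are removed.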


In [125]:
'''
data_dict=pd.read_csv("data_dict.csv")
data_dict["Date"]=data_dict["Date"].astype('datetime64[ns]')
label_dict=pd.read_csv("label_dict.csv")
label_dict["Date"]=label_dict["Date"].astype('datetime64[ns]')
'''


In [126]:
tickers = list(data_dict.keys())

**We construct the initial train_data and test_data.**

In [127]:
train_data = []
train_labels = []

test_data = []
test_labels = []

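# Latest window end date allowed in the training set: labels look pm_window years ahead,
# so training labels stop at test_start_date
checkpoint = datetime.strptime(test_start_date, '%Y-%m-%d') - relativedelta(years=pm_window)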

for ticker in tqdm(tickers):    
    label = label_dict[ticker]
    if label.shape[0] == 0:
        continue
    ts = data_dict[ticker].loc[:label.index[-1]]

    indices = [np.arange(i, i+lb_window, feat_window) for i in range(0, ts.shape[0] - lb_window + 1, sample_window)]
    
    temp_data = np.array([ts.iloc[sub_indices].values for sub_indices in indices])
    if temp_data.shape[0] == 0:
        continue
    temp_labels = np.array([label.loc[ts.index[sub_indices[-1]]] for sub_indices in indices])
    
    train_indices = [idx for idx in range(temp_data.shape[0]) if ts.index[indices[idx][-1]] <= checkpoint]
    test_indices = [idx for idx in range(temp_data.shape[0]) if ts.index[indices[idx][-1]] > checkpoint]
    
    train_data += [temp_data[train_indices]] 
    train_labels += [temp_labels[train_indices]]
    
    test_data += [temp_data[test_indices]] 
    test_labels += [temp_labels[test_indices]]


100%|██████████| 1130/1130 [00:09<00:00, 114.18it/s]
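
The indexing in the loop above can be checked on a small example. The sketch below reuses the window settings from this notebook; the series length `n_rows` is a hypothetical value chosen only for illustration.

```python
import numpy as np

feat_window = 90      # spacing between sampled observations
pm_window = 3         # performance measures window, in years
sample_window = 30    # shift between consecutive training samples
lb_window = int(3 * pm_window * 365.25) + 1

n_rows = 3400  # hypothetical number of observations for one ticker

# Same construction as in the loop: one array of row positions per sample
windows = [np.arange(i, i + lb_window, feat_window)
           for i in range(0, n_rows - lb_window + 1, sample_window)]

print("lookback window length:", lb_window)
print("number of samples:", len(windows))
print("observations per sample:", len(windows[0]))
print("last position of first sample:", windows[0][-1])
```

Each sample therefore holds one observation every `feat_window` rows across the lookback window, and the label is read at the last sampled position.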


**By concatenation, we build the final train_data and test_data.**

In [128]:
train_data = np.concatenate(train_data)
train_labels = np.concatenate(train_labels)

test_data = np.concatenate(test_data)
test_labels = np.concatenate(test_labels)

In [129]:
from sklearn.model_selection import train_test_split

**We also split a validation set off train_data. It is used to validate the model fitted on the training set (the remaining part of train_data) and to guide model adjustment.**

In [130]:
x_train, x_val, y_train, y_val = train_test_split(train_data, train_labels, train_size=0.9)


## Next, we introduce the search algorithm used for this project.

Generally, there are two simple and well-known algorithms for optimizing decision variables, which in our security selection project means searching for optimal hyper-parameters:

1. **Grid Search**:

    We first divide our decision variable into different segments
    
    We try out a value in each segment and compute the objective value
    
    Then, choose the best one to optimize the objective function
    
    
2. **Random Search**:

    Randomly sample a value from the whole space and compute the objective value
    
    Repeat the above step for N times
    
    Then choose the best value to optimize the objective function
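
The two procedures above can be sketched in a few lines. The objective below is a hypothetical toy function, not the LSTM validation loss, and the grid uses evenly spaced points as one simple way of picking a value per segment.

```python
import random

def objective(x):
    # Hypothetical toy objective to minimize; its minimum is at x = 0.3
    return (x - 0.3) ** 2

def grid_search(f, low, high, n_points):
    """Evaluate f on evenly spaced points and return the best one."""
    step = (high - low) / (n_points - 1)
    candidates = [low + k * step for k in range(n_points)]
    return min(candidates, key=f)

def random_search(f, low, high, n_trials, seed=0):
    """Evaluate f on n_trials uniformly sampled points and return the best one."""
    rng = random.Random(seed)
    candidates = [rng.uniform(low, high) for _ in range(n_trials)]
    return min(candidates, key=f)

best_grid = grid_search(objective, 0.0, 1.0, n_points=11)
best_random = random_search(objective, 0.0, 1.0, n_trials=11)
print("grid search best x:", best_grid)
print("random search best x:", best_random)
```

Neither routine uses earlier evaluations to decide where to look next, so every candidate costs one full evaluation of the objective.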

Random search and grid search both serve as good starting points for HPO. They are not only simple to implement, but can also be run in parallel to verify each other.

However, both have significant drawbacks. First and foremost, they do not learn from previous trials, so **there is no guarantee of finding a local minimum to some precision unless the search space is thoroughly sampled**. Second, **they are extremely expensive search algorithms for time series models, which deal with large and complex datasets**.

**So we want a method that not only learns from its past trials but is also efficient in time and computing resources, so that it can better handle models that are computationally expensive and have more hyperparameters.**

Here we introduce one of the most efficient and cutting-edge methods, **Bayesian Optimization (BO)**.


### BO Algorithm:

Bayesian optimization is a technique for solving optimization problems in which the objective function has no analytic expression and can only be evaluated through some time-consuming operation. The strength of BO is that **even if the true objective function is unknown, we can still fit a model to the observations we have so far and use the model to predict good parts of the parameter space where we should run additional experiments.**

Bayesian optimization works by **constructing a posterior distribution of functions (Gaussian process) that best describes the function you want to optimize**. As the number of observations grows, the posterior distribution improves, and the algorithm becomes more certain of which regions in parameter space are worth exploring and which are not, as seen in the picture below. The Tree-structured Parzen Estimator (TPE) algorithm is a related approach that follows the same sequential idea but uses a different surrogate model. Here is a simple demonstration of how it works: we sample x from its space and compute the corresponding objective function value y.

![alt text](bo_example.png "Title")


As you iterate over and over, the algorithm **balances its needs of exploration and exploitation, taking into account what it knows about the target function**. At each step a Gaussian process is fitted to the known samples (points previously explored), and the posterior distribution, combined with an exploration strategy (such as UCB (Upper Confidence Bound) or EI (Expected Improvement)), is used to determine the next point that should be explored (see the gif below).
In TPE, by contrast, once enough samples are available, the observed outcomes are split into two groups using a threshold y∗ as the cutoff point.

![SegmentLocal](bayesian_optimization.gif "segment")

This process is designed to **minimize the number of steps required to find a combination of parameters that is close to the optimal combination**. To do so, this method uses a proxy optimization problem (finding the maximum of the acquisition function) that, albeit still a hard problem, is **cheaper (in the computational sense) and can be tackled with common tools**.
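
The loop described above (fit a Gaussian process, maximize an acquisition function, evaluate the suggested point) can be illustrated with a minimal NumPy sketch. It is not the bayes_opt implementation: the RBF kernel, its length scale, the UCB weight of 2.0, the candidate grid, and the toy `target` function are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_new)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def target(x):
    # Hypothetical cheap stand-in for an expensive evaluation; maximum at x = 0.7
    return -(x - 0.7) ** 2

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 201)        # candidate points for the acquisition step
x_obs = rng.uniform(0.0, 1.0, size=3)    # a few random initial evaluations
y_obs = target(x_obs)

for _ in range(10):
    mean, std = gp_posterior(x_obs, y_obs, grid)
    ucb = mean + 2.0 * std               # Upper Confidence Bound acquisition
    x_next = grid[np.argmax(ucb)]        # cheap proxy problem: maximize the acquisition
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, target(x_next))

best_x = x_obs[np.argmax(y_obs)]
print("best x found:", best_x, "with value", y_obs.max())
```

Maximizing the acquisition function over a fixed grid keeps the sketch short; it works here only because the toy problem is one-dimensional.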

**Therefore Bayesian Optimization is most adequate for situations where sampling the function to be optimized is a very expensive endeavor.**


With this background on the BO algorithm and its strengths compared with random or grid search, we use bayes_opt, **a Python library that implements the BO algorithm, to carry out hyperparameter optimization.**

## The key hyperparameters:

### 1. Learning rate

The learning rate controls how quickly or slowly a neural network model learns a problem: it sets the amount by which the weights are updated during training. It is a small positive value, typically between 0 and 1.

Intuition: a small learning rate can lead to slow convergence, while a large learning rate can cause training to diverge.

### 2. Epochs

The number of epochs defines how many times the learning algorithm passes through the entire training dataset.

Intuition: one epoch means each sample in the training dataset has had an opportunity to update the internal model parameters.

### 3. Batch Size

The batch size is the number of training samples processed before the weights of the network are updated, so it controls how often the weights are updated within an epoch.
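
A toy illustration of these three hyperparameters, assuming plain gradient descent on the hypothetical function f(w) = w² rather than the LSTM used in the project:

```python
import math

def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w

w_small = gradient_descent(lr=0.01)  # small steps: still far from 0 after 20 updates
w_good = gradient_descent(lr=0.4)    # converges quickly towards 0
w_large = gradient_descent(lr=1.1)   # each step overshoots: |w| grows

def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the training data."""
    return math.ceil(n_samples / batch_size)

print("lr=0.01:", w_small)
print("lr=0.4: ", w_good)
print("lr=1.1: ", w_large)
print("updates per epoch, batch 32: ", updates_per_epoch(1000, 32))
print("updates per epoch, batch 256:", updates_per_epoch(1000, 256))
```

With the number of epochs fixed, a smaller batch size means more weight updates in total, which is one reason batch size and learning rate interact.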


After studying the key hyperparameters of the LSTM model and discussing them with the TAs, we believe that "learning rate" and "batch size" are the most critical hyperparameters affecting the model's prediction performance. Under this assumption, our main work over the past several months has been to use BO to search for a better learning rate and batch size, with the number of epochs fixed.
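
Because BO proposes continuous values, the two searched hyperparameters need a mapping from the optimizer's variables to usable settings. The sketch below shows one possible mapping. The names `log10_lr` and `batch_size_raw`, the bounds, and the helper `to_hyperparameters` are hypothetical and are not the values used in this project.

```python
# Hypothetical search bounds: learning rate on a log10 scale, batch size as a float
search_bounds = {
    "log10_lr": (-5.0, -1.0),        # learning rate between 1e-5 and 1e-1
    "batch_size_raw": (16.0, 256.0),
}

def to_hyperparameters(log10_lr, batch_size_raw):
    """Convert continuous BO variables into hyperparameters a model can use."""
    return {
        "learning_rate": 10.0 ** log10_lr,         # undo the log scale
        "batch_size": int(round(batch_size_raw)),  # batch size must be an integer
    }

params = to_hyperparameters(log10_lr=-3.0, batch_size_raw=63.6)
print(params)
```

Searching the learning rate on a log scale gives small values as much attention as large ones. Since bayes_opt maximizes its target, an objective built on this mapping would typically return the negative validation loss.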