# Scikit Learn Benzene Price Forecasting

## 0.0 Notes and Explanations

This notebook uses SQL to query:
* lyb-sql-devdacore-002.5bff9fcb8330.database.windows.net 
    * To extract the information contained in the t-code ZMRRELPO
* lyb-sql-prddacore-002.bed79ae4ef8b.database.windows.net
    * To extract ZEMA IHS data
    * To extract ZEMA ICIS data

### 0.1 Environment setup

This workbook uses the py37_benzene environment, which can be created from the Anaconda Prompt in your local repo sync by running:
> conda env create -f py37_benzene.yml

The .yml file is stored in the root scripts directory.

## 1.0 Prepare Workspace

### 1.1 Import Standard Libraries and configure runtime parameters

In [1]:
##
# Import Basic Python DS Libraries 
import pandas as pd     # Standard data science package
import numpy as np      # Additional numerical functions
#import dtale
import warnings
warnings.filterwarnings('ignore')

##
# Import Advanced Python DS Libraries
import missingno as msno    # Missing value toolbox
#from pandas_profiling import ProfileReport     # Integrated/deep reporting, resource intensive
from statsmodels.tsa.stattools import adfuller  # Statistical test for stationary data

##
# Import Database Connection Libraries
#import pyodbc           # Database connector

##
# Import Plotting Libraries
import matplotlib.pyplot as plt                # Full featured plotting toolbox
plt.rcParams['figure.figsize'] = (25,8)
#import plotly.graph_objects as go               #Plotly GO toolbox
import plotly.express as px                    # Plotly Express plotting toolbox
import plotly.io as pio                        # Additional controls for Plotly to allow visuals within a VSCode notebook
#import seaborn as sns       # Seaborn plotting toolbox

##
# Import Operating System Interface Libraries
import os       # operating system interface
import sys
#import calendar
#import datetime as dt
from datetime import datetime, timedelta, date

##
# Import File Format Libraries
#import pyarrow.feather as feather       # Advanced datasource import/export toolbox (Apache Parquet)
import shelve # Allows serialization (pickling) of multiple objects and saving them to dict-like objects for later use
import sqlite3
##
# Import ML and Forecasting Libraries
#from scipy import interpolate
#from sklearn import metrics     # Metrics for sklearn ML models
from sklearn.ensemble import RandomForestRegressor # ML Model package
#from sklearn.tree import DecisionTreeRegressor  # ML Model package
from sklearn_genetic.space import Categorical, Integer, Continuous
from sktime.forecasting.model_selection import ExpandingWindowSplitter, SlidingWindowSplitter
#from sklearn.model_selection import TimeSeriesSplit

#from sktime.forecasting.model_selection import ForecastingGridSearchCV
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.feature_selection import RFECV, RFE
#from xgboost import XGBRegressor, XGBRFRegressor
#from prophet import Prophet

##
# Setup packages for sending messages
import requests
import json

##
# Import timing function(s)
from time import perf_counter

In [2]:
# Configure Libraries
pio.renderers.default = "notebook_connected" # Configure plotly to print within VSCode environment
#pio.renderers.default = "vscode"


In [3]:
# Configure Run Environment
AllLPC = True # Used in the "Deal with non-Stationary data" step to ensure that we take the Percent Change for every column and not just those that are non-stationary
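
The code that consumes `AllLPC` lives in the custom modules and is not shown in this notebook. As a rough sketch of the idea only, the hypothetical helper `to_percent_change` below converts either every column or just a supplied list of non-stationary columns to percent change. The function name, its arguments, and the sample data are assumptions made for illustration.

```python
import pandas as pd


def to_percent_change(frame, non_stationary, all_columns=True):
    """Return a copy in which the selected columns hold period-over-period percent change.

    Hypothetical helper: `all_columns` plays the role of the notebook's AllLPC flag.
    """
    cols = list(frame.columns) if all_columns else list(non_stationary)
    out = frame.copy()
    out[cols] = out[cols].pct_change()
    # The first row has no predecessor, so its percent change is NaN; drop it.
    return out.iloc[1:]


# Made-up sample data: one trending series and one flat series
prices = pd.DataFrame({"trend": [100.0, 110.0, 121.0], "flat": [5.0, 5.0, 5.0]})

all_pct = to_percent_change(prices, non_stationary=["trend"], all_columns=True)
some_pct = to_percent_change(prices, non_stationary=["trend"], all_columns=False)
print(all_pct)
print(some_pct)
```

With `all_columns=True` every series ends up on the same scale, which may be why the flag exists; with `all_columns=False` the `flat` column keeps its original level values.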

### 1.2 Import Custom Functions

In [4]:
# Define location for custom functions
module_path = os.path.abspath(os.path.join('../Functions'))

# Verify it's accessible for loading
if (module_path not in sys.path) & (os.path.isdir(module_path)):
    sys.path.append(module_path)
    print('Added', module_path, 'to system paths')

elif (module_path in sys.path) & (os.path.isdir(module_path)):
    print(module_path, 'ready to be used for import')

else:
    print(module_path, 'is not a valid path')

# Import Custom Functions
try: from multi_plot import *; print('Loaded multiplot')
except: print('failed to load multi_plot')

try: from StationaryTools import *; print('Loaded StationaryTools')
except: print('Failed to load StationaryTools')

try: from RegressionTools import *; print('Loaded RegressionTools')
except: print('Failed to load RegressionTools')

#try: from NaiveForecasting import *; print('Loaded NaiveTools')
#except: print('Failed to load NaiveTools')

try: from ModelingTools import *; print("Loaded ModelingTools")
except: print('ModelingTools failed to load')

try: from  sk_ts_modelfit import *; print('Loaded sk_ts_modelfit')
except: print('Failed to load sk_ts_modelfit')

Added c:\Users\baanders\Documents\Benzene Forecasting\Scripts\Functions to system paths
Loaded multiplot
Loaded multiplot
Loaded StationaryTools
Loaded StationaryTools
Loaded RegressionTools
Loaded ModelingTools
Loaded sk_ts_modelfit


Using TensorFlow backend.


### 1.3 Configure messaging

In [5]:
url = 'https://lyondellbasell.webhook.office.com/webhookb2/0b44f724-4c38-4322-af43-1cc8a79b2240@fbe62081-06d8-481d-baa0-34149cfefa5f/IncomingWebhook/7d6ffad65b414df9bd9bf5fa64da44b2/76b3b453-0a85-40ff-a5d7-15d042decdf6'

#payload = {
#    "text": "Sample alert text"
#}
#headers = {
#    'Content-Type': 'application/json'
#}
#response = requests.post(url, headers=headers, data=json.dumps(payload))
#print(response.text.encode('utf8'))
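
The commented-out lines above post a JSON body with a single `text` field to the webhook. A minimal sketch of wrapping that in reusable functions might look like the following. The names `build_alert` and `send_alert` are hypothetical, and how the custom `Lag_*` functions actually use `url` is not shown in this notebook.

```python
import json


def build_alert(text):
    """Build the headers and JSON body for a simple text alert."""
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"text": text})
    return headers, body


def send_alert(url, text, timeout=10):
    """POST the alert to the webhook. Needs the `requests` package and network access."""
    import requests  # imported lazily so build_alert can be used without it

    headers, body = build_alert(text)
    response = requests.post(url, headers=headers, data=body, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of failing silently
    return response


# Building the payload does not touch the network
headers, body = build_alert("Lag 1 optimization finished")
print(headers, body)
```

Calling `send_alert(url, "Sample alert text")` would then send the message. A timeout is worth setting so that a stalled webhook cannot hang a multi-hour optimization run.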

## 2.0 Import Data

In [6]:
# Define default storage location for files
dataroot = '../../Data/Parquet/SKLearn Data/'
ifilename = 'weekly_for_modeling'

In [7]:
# Check if the data location above exists. If it does, import the dataset.
# The dataset is loaded as df_p; the newest run is extracted into df in the cells below.

if os.path.isdir(dataroot):
    df_p = pd.read_parquet(dataroot+ifilename+'.parquet')
    print(ifilename + ' dataset loaded with shape', df_p.shape, 'and', df_p.isna().sum().sum(), 'NaN values')
    
else:
    print('Storage location does not exist. Please update directory and try again.')

weekly_for_modeling dataset loaded with shape (381, 5432) and 0 NaN values


In [8]:
# Find newest date stored in df_p
newest_date = max(df_p.columns.levels[0])
print(f'Newest run_date is: {newest_date}')

Newest run_date is: 2022-07-24


In [9]:
# Extract df as only newest model date
df = df_p[newest_date]
print(f'Shape of {newest_date} entry is {df.shape}')

Shape of 2022-07-24 entry is (381, 5432)
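
The two cells above rely on `df_p` having two-level columns with the run date as the outer level. A small sketch with made-up data shows the mechanics. The column names and values here are invented for illustration.

```python
import pandas as pd

# Made-up weekly data for two hypothetical run dates
idx = pd.date_range("2022-01-02", periods=3, freq="W")
run_a = pd.DataFrame({"price": [1.0, 2.0, 3.0], "price_lag_1": [0.5, 1.0, 2.0]}, index=idx)
run_b = run_a * 10

# The outer column level identifies the run date, the inner level the feature name
wide = pd.concat([run_a, run_b], keys=["2022-07-17", "2022-07-24"], names=["run_date"], axis=1)

# ISO-formatted date strings sort chronologically, so max() returns the newest run
newest = max(wide.columns.levels[0])

# Selecting on the outer level returns a frame with single-level columns
latest = wide[newest]
print(newest, latest.shape)
```

One caveat: `columns.levels` can retain labels that are no longer used after a frame has been sliced, so `wide.columns.get_level_values(0)` is a safer source for the maximum if `df_p` is ever filtered before this step.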


## 3.0 Modeling

### 3.1 Limit Columns to be used in fits

In [10]:
print('Shape before dropping columns', df.shape)
df = df.loc[:,~df.columns.str.contains('^USD')]         # Remove columns that start with USD
df = df.loc[:,~df.columns.str.contains('^....USD')]     # Remove columns that start with xxx/USD
df = df.loc[:,~df.columns.str.contains('-CLOSE')]       # Remove columns that contain -CLOSE
df = df.loc[:,~df.columns.str.contains('-HIGH')]        # Remove columns that contain -HIGH
df = df.loc[:,~df.columns.str.contains('-HIGHLOW2')]    # Remove columns that contain -HIGHLOW2
df = df.loc[:,~df.columns.str.contains('-LOW')]         # Remove columns that contain -LOW

print('Shape after dropping columns', df.shape)          # Print final shape of df for validation

Shape before dropping columns (381, 5432)
Shape after dropping columns (381, 1694)
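
The filters above can be tried on a handful of column names to see which patterns match. The sketch below uses made-up names modeled on the ones in this dataset and joins the patterns into a single regular expression, which is one possible alternative to chaining the filters.

```python
import pandas as pd

# Made-up column names for illustration
cols = pd.Index([
    "USD/JPY-AVERAGE",
    "EUR/USD-AVERAGE",
    "Benzene FOB Korea Marker (USD/MT)-AVERAGE",
    "Benzene FOB Korea Marker (USD/MT)-CLOSE",
    "Benzene FOB Korea Marker (USD/MT)-HIGH",
    "Benzene FOB Korea Marker (USD/MT)-HIGHLOW2",
    "Benzene FOB Korea Marker (USD/MT)-LOW",
])

# '^....USD' means: any four characters at the start, followed by USD (e.g. 'EUR/USD')
patterns = [r"^USD", r"^....USD", "-CLOSE", "-HIGH", "-LOW"]
combined = "|".join(patterns)

kept = cols[~cols.str.contains(combined)]
print(list(kept))
```

`-HIGHLOW2` is left out of the pattern list because any name containing it also contains `-HIGH`, so that filter is already covered.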


In [11]:
test_cols = [col for col in df if col.startswith('Benzene')]
test_cols

['Benzene CFR Taiwan MAvg (USD/MT)-AVERAGE',
 'Benzene CFR Taiwan Weekly (USD/MT)-AVERAGE',
 'Benzene ENEOS Corporation CP Nomination (USD/MT)-AVERAGE',
 'Benzene ENEOS Corporation CP Settlement (USD/MT)-AVERAGE',
 'Benzene FOB Brazil Weekly (USD/MT)-AVERAGE',
 'Benzene FOB Korea Marker (USD/MT)-AVERAGE',
 'Benzene FOB Korea Marker MAvg (USD/MT)-AVERAGE',
 'Benzene FOB Korea Marker WAvg (USD/MT)-AVERAGE',
 'Benzene FOB Korea Paper BalMo (USD/MT)-AVERAGE',
 'Benzene FOB Korea Paper BalMo MAvg (USD/MT)-AVERAGE',
 'Benzene FOB Korea Paper Mo01 (USD/MT)-AVERAGE',
 'Benzene FOB Korea Paper Mo01 MAvg (USD/MT)-AVERAGE',
 'Benzene FOB Korea Paper Mo02 (USD/MT)-AVERAGE',
 'Benzene FOB Korea Paper Mo02 MAvg (USD/MT)-AVERAGE',
 'Benzene FOB Korea Paper Mo03 (USD/MT)-AVERAGE',
 'Benzene FOB Korea Paper Mo03 MAvg (USD/MT)-AVERAGE',
 'Benzene FOB Korea W2 (USD/MT)-AVERAGE',
 'Benzene FOB Korea W3 (USD/MT)-AVERAGE',
 'Benzene FOB Korea W4 (USD/MT)-AVERAGE',
 'Benzene FOB Korea W5 (USD/MT)-AVERAGE',
 

In [12]:
# Keep only those columns selected by Allen on 5/25/2022
keep_cols = [ 
    'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America',
    'Benzene CFR Taiwan Weekly (USD/MT)-AVERAGE',
    'Benzene FOB Korea Marker (USD/MT)-AVERAGE',
    'Benzene FOB Korea W2 (USD/MT)-AVERAGE',
    'Benzene FOB Korea W3 (USD/MT)-AVERAGE',
    'Benzene FOB Korea W4 (USD/MT)-AVERAGE',
    'Benzene FOB Korea W5 (USD/MT)-AVERAGE',
    'Benzene FOB Korea W6 (USD/MT)-AVERAGE',
    'Benzene-Spot, Current Month, Low-N/A-Cents per Gallon-FOB Houston, TX-North America',
    'Benzene-Spot, Next Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America',
    'Benzene-Spot, Next Month, Low-N/A-Cents per Gallon-FOB Houston, TX-North America',
    'Crude Oil Prices Brent  Europe',
    'Ethylene-Prod Cash Cost Naphtha Feed, Current Month, Spot Co-Product Credits-N/A-Cents per Pound-US Gulf Coast-North America',
    'Toluene-Spot, Current Month, High-Commercial Grade-Cents per Gallon-FOB Houston, TX-North America',
    'Toluene-Spot, Current Month, High-Nitration Grade-Cents per Gallon-FOB Houston, TX-North America',
    'Toluene-Spot, Current Month, Low-Commercial Grade-Cents per Gallon-FOB Houston, TX-North America',
    'Toluene-Spot, Current Month, Low-Nitration Grade-Cents per Gallon-FOB Houston, TX-North America',
    'Toluene-Spot, Next Month, High-Nitration Grade-Cents per Gallon-FOB Houston, TX-North America',
    'Toluene-Spot, Next Month, Low-Nitration Grade-Cents per Gallon-FOB Houston, TX-North America',
    'Naphtha FOB Singapore Assessment Spot Closing Value Daily (Mid) : USD/bbl',
    'Naphtha Reforming FOB US Assessment Barges Spot 4 Weeks Closing Value Daily (Mid) : US CTS/US gal'
]
keep_cols

['Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America',
 'Benzene CFR Taiwan Weekly (USD/MT)-AVERAGE',
 'Benzene FOB Korea Marker (USD/MT)-AVERAGE',
 'Benzene FOB Korea W2 (USD/MT)-AVERAGE',
 'Benzene FOB Korea W3 (USD/MT)-AVERAGE',
 'Benzene FOB Korea W4 (USD/MT)-AVERAGE',
 'Benzene FOB Korea W5 (USD/MT)-AVERAGE',
 'Benzene FOB Korea W6 (USD/MT)-AVERAGE',
 'Benzene-Spot, Current Month, Low-N/A-Cents per Gallon-FOB Houston, TX-North America',
 'Benzene-Spot, Next Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America',
 'Benzene-Spot, Next Month, Low-N/A-Cents per Gallon-FOB Houston, TX-North America',
 'Crude Oil Prices Brent  Europe',
 'Ethylene-Prod Cash Cost Naphtha Feed, Current Month, Spot Co-Product Credits-N/A-Cents per Pound-US Gulf Coast-North America',
 'Toluene-Spot, Current Month, High-Commercial Grade-Cents per Gallon-FOB Houston, TX-North America',
 'Toluene-Spot, Current Month, High-Nitration Grade-Cents per Gallon-FOB Houston, T

In [13]:
# Extract all columns that start with the provided list (includes lags)
# A list comprehension avoids Series.append, which newer pandas releases no longer provide
all_keep_cols = [col for keep_col in keep_cols for col in df if col.startswith(keep_col)]


In [14]:
# Perform the column truncation

print('Shape before dropping columns', df.shape)

#df = df[all_keep_cols]         # Keep only the columns Allen selected

print('Shape after dropping columns', df.shape)          # Print final shape of df for validation

Shape before dropping columns (381, 1694)
Shape after dropping columns (381, 1694)


### 3.2 Data Validation

In [15]:
# Validate that no inf and -inf values remain
inf_vals = df.isin([np.inf, -np.inf]).sum().sum()
nan_vals = df.isna().sum().sum()
print('There are:',inf_vals,'inf or -inf values in df')
print('There are:',nan_vals,'NaN values in df')
if (inf_vals>0) or (nan_vals>0): 
    df = df.replace([np.inf, -np.inf], np.nan)
    df = df.dropna(axis=0)
    inf_vals = df.isin([np.inf, -np.inf]).sum().sum()
    nan_vals = df.isna().sum().sum()
    print('After conversion there are:', inf_vals,'inf or -inf values in df')
    print('After conversion there are:',nan_vals,'NaN values in df')

There are: 0 inf or -inf values in df
There are: 0 NaN values in df


In [16]:
# Simple utility to find a column name based on what it starts with
# May need it to find target column values
[col for col in df if col.startswith('Benzene-Spot, Current Month, High')]

['Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_1',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_2',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_3',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_4',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_5',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_6',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_7',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_8',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_9',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon

### 3.3 Configure Global Inputs

In [17]:
# Our target and Identifying columns won't change for any of the models so we'll define them once here
#target = 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America'
target = 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America'

# Our Identifying (IDcol) is 'date' for time series analyses
IDcol = ['date']

# All columns that are neither the target nor an ID (date) column are used to predict the target
predictors = [x for x in df.columns if x not in [target] + IDcol]

print("Algorithm will attempt to predict:\n\t", target, "\nusing:\n\t", IDcol, "\nbased on:\n\t", len(predictors), "predictors")

Algorithm will attempt to predict:
	 Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America 
using:
	 ['date'] 
based on:
	 1693 predictors


### 3.4 Looped Lagging and CV Optimization

In [18]:
# Set optimization output locations
pfile = '../../Data/Models/fit_params.csv' # Best fit per lag parameter location
cvfile = '../../Data/Models/cv_res_log.csv' # cross validation result log
rfile = '../../Data/Models/runtime.csv' # run time log 1-13 for each lag and 0 for total runtime.

#### 3.4.1 Limit dates to range of interest

In [19]:
# Set start_date for slicing df
start_date = '2015-01-01'
# Set end_date for slicing df
end_date = '2018-12-31'#'2019-12-31'

#df_full = df.copy(deep=True)
#df = df[start_date : end_date]
print('df.shape =', df.shape)

df.shape = (381, 1694)


#### 3.4.2 Export data for use later

In [20]:
# Define location of historical data file
model_df_root = '../../Data/Models/'
model_df_file = 'Random_Forest_Models_df'
df_hist = pd.DataFrame()
run_date = datetime.now().strftime('%Y%m%d_%H')
#run_date = '20220427'
run_date

'20220810_15'

In [21]:
# Load Historical Data file
hist_data_loc = model_df_root+model_df_file+'.parquet'
if os.path.isfile(hist_data_loc):
    df_hist = pd.read_parquet(model_df_root+model_df_file+'.parquet')
    print(model_df_file + ' dataset loaded with shape', df_hist.shape, 'and', df_hist.isna().sum().sum(), 'NaN values')
    
else:
    df_hist = pd.DataFrame()
    print('Storage file does not exist. Beginning with empty DataFrame.')
#df_hist.head(5)

Random_Forest_Models_df dataset loaded with shape (381, 9352) and 1684788 NaN values


In [22]:
# Create new_hist
new_hist = pd.concat([df],keys=[run_date], names=['run_date'], axis=1)
#new_hist.head(5)

In [23]:
# Append current model df to df_hist 
df_hist = pd.concat([df_hist, new_hist], axis=1)
#df_hist

In [24]:
# Store df_hist
df_hist.to_parquet(path=model_df_root+model_df_file+'.parquet', engine='pyarrow', compression=None, index=True)
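
The history file reported a large number of NaN values when it was loaded. One plausible source, sketched below with made-up data, is that `pd.concat(..., axis=1)` aligns on the index: any run whose date range is shorter than the newest run is padded with NaN. This is an illustration of the pandas behavior, not a confirmed diagnosis of the stored file.

```python
import pandas as pd

# Made-up data: the older run covers three weeks, the newer run covers four
old_idx = pd.date_range("2022-01-02", periods=3, freq="W")
new_idx = pd.date_range("2022-01-02", periods=4, freq="W")

old_run = pd.DataFrame({"price": [1.0, 2.0, 3.0]}, index=old_idx)
new_run = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0]}, index=new_idx)

# Tag each run with its run_date as the outer column level, as in the cells above
hist = pd.concat([old_run], keys=["20220427_09"], names=["run_date"], axis=1)
new_hist = pd.concat([new_run], keys=["20220810_15"], names=["run_date"], axis=1)

# Column-wise concat uses the union of the two indexes and fills the gaps with NaN
combined = pd.concat([hist, new_hist], axis=1)
print(combined.shape, int(combined.isna().sum().sum()))
```

Because the file grows by one full copy of `df` per run, it may be worth pruning old run dates periodically if load times become a problem.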

In [25]:
# Clean up memory from large parquet files
dellist = [df_hist, new_hist]
del df_hist, new_hist
del dellist

#### Looped lagging Hyperparameter Optimization with storage

#### 3.4.3 Configure Shelf

In [26]:
# Run parameters for function
model_dataroot = '../../Data/Models/'
model_shelf = 'Random_Forest_Models'
fit_shelf = model_dataroot+model_shelf

#### 3.4.4 Define cv_interval parameters

In [27]:
# Determine where to start training for cv interval
df.index.get_loc('2020-12-27')
#init_ind = df.index.get_loc('2017-12-31')
#print('Initial training index is:',init_ind)
#ts_cv_exwindow(df, init_ind, 1, 13, verbose=True)

#init_ind = df.index.get_loc(df.index[-1])
#print('Initial training index is:',init_ind)
#ts_cv_exwindow(df, init_ind, 1, 1, verbose=True)

298
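
The index returned above is used as the end of the initial training window for expanding-window cross validation. The custom helper `ts_cv_exwindow` is not shown in this notebook, so the generator below is only a sketch of the general technique under assumed semantics: the training window starts at row 0 and grows by `step` rows per fold, and each fold is tested on the next `horizon` rows. The function name and parameters are hypothetical.

```python
import numpy as np


def expanding_window_splits(n_samples, initial_window, step, horizon):
    """Yield (train_idx, test_idx) pairs for expanding-window validation."""
    end = initial_window
    # Stop once a full test horizon no longer fits inside the data
    while end + horizon <= n_samples:
        train_idx = np.arange(0, end)              # everything seen so far
        test_idx = np.arange(end, end + horizon)   # the next `horizon` rows
        yield train_idx, test_idx
        end += step


# Toy sizes chosen for readability, not the notebook's actual settings
folds = list(expanding_window_splits(n_samples=20, initial_window=10, step=1, horizon=3))
print(len(folds))
print(folds[0][0][-1], folds[0][1])
print(folds[-1][0][-1], folds[-1][1])
```

Unlike shuffled k-fold, every test row here comes strictly after its training rows, which keeps future information out of the fit.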

In [28]:
# Print valid sklearn scoring methods
#import sklearn
#[x for x in sorted(sklearn.metrics.SCORERS.keys()) if x.startswith('neg')]
#[x for x in sorted(sklearn.metrics.SCORERS.keys())]

### 3.5 Random Forest

#### 3.5.1 Lag_RF_RandomSearchCV

This step can take hours. For reference, a run with df.shape = (195, 1568), 13 lags, and 60 search_iters over a broad random_grid took 12 hours.

In [29]:
# Define alg search params
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 50)] # [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt'] # ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 20)] # [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [int(x) for x in np.linspace(2, 50, num = 11)] #[2, 5, 10] # [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 3, 4, 5, 10, 20] # [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False] # [True, False]
# Define Random Forest Regressor HP grid for DV
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap
               }
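
To get a feel for how sparsely a random search covers this grid, the sketch below counts the possible combinations and draws 60 candidates from it. It uses only `random` and NumPy and is not how `Lag_RF_RandomSearchCV` works internally, which is not shown here. The 60 draws mirror the `search_iters=60` used in the commented-out call.

```python
import random

import numpy as np

# Same value lists as the cell above
random_grid = {
    "n_estimators": [int(x) for x in np.linspace(200, 2000, num=50)],
    "max_features": ["auto", "sqrt"],
    "max_depth": [int(x) for x in np.linspace(10, 110, num=20)] + [None],
    "min_samples_split": [int(x) for x in np.linspace(2, 50, num=11)],
    "min_samples_leaf": [1, 2, 3, 4, 5, 10, 20],
    "bootstrap": [True, False],
}

# Total number of combinations is the product of the list lengths
n_combinations = 1
for values in random_grid.values():
    n_combinations *= len(values)

# Draw 60 candidate settings, one independent choice per hyperparameter
rng = random.Random(0)
candidates = [
    {name: rng.choice(values) for name, values in random_grid.items()}
    for _ in range(60)
]

print(n_combinations)  # 50 * 2 * 21 * 11 * 7 * 2 = 323400
print(candidates[0])
```

Sixty draws cover well under 0.1% of the grid, so narrowing the ranges after a first pass may be more productive than adding iterations. Note also that recent scikit-learn releases no longer accept `'auto'` for `max_features`; the pinned py37 environment predates that change.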

In [30]:
#fit_params, cv_res_log, runtimelog = Lag_RF_RandomSearchCV(df, 142, 1, 13, random_grid, predictors, target, fit_shelf, url, lags=13, search_iters=60, init_params=None, verbose=True, debugging=False)
#export_optimization(fit_params, cv_res_log, runtimelog, pfile, cvfile, rfile)

In [31]:
# Locate Fit Parameters of specific run and specific lag
#fit_params.loc[(fit_params['run_date'] == '20220427') & (fit_params['lag']==2)]

# Print all fit_params
#fit_params

In [32]:
# Print log of all cv_results_
#cv_res_log

In [33]:
# Print Runtimes of each Lag
#runtimelog

#### 3.5.2 Lag_RF_BayesSearchCV

In [34]:
# Define alg search params
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 20, stop = len(predictors), num = 50)] # [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt'] # ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(2, 110, 50)] # [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [int(x) for x in np.linspace(2, 50, 20)] #[2, 5, 10] # [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(1, 30, 20)]#[1, 2, 3, 4, 5, 10, 20] # [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False] # [True, False]
# Define Random Forest Regressor HP grid for DV
bayes_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap
               }
for item in bayes_grid:
    print(item, bayes_grid[item])

n_estimators [20, 54, 88, 122, 156, 190, 224, 259, 293, 327, 361, 395, 429, 463, 498, 532, 566, 600, 634, 668, 702, 737, 771, 805, 839, 873, 907, 941, 976, 1010, 1044, 1078, 1112, 1146, 1180, 1215, 1249, 1283, 1317, 1351, 1385, 1419, 1454, 1488, 1522, 1556, 1590, 1624, 1658, 1693]
max_features ['auto', 'sqrt']
max_depth [2, 4, 6, 8, 10, 13, 15, 17, 19, 21, 24, 26, 28, 30, 32, 35, 37, 39, 41, 43, 46, 48, 50, 52, 54, 57, 59, 61, 63, 65, 68, 70, 72, 74, 76, 79, 81, 83, 85, 87, 90, 92, 94, 96, 98, 101, 103, 105, 107, 110, None]
min_samples_split [2, 4, 7, 9, 12, 14, 17, 19, 22, 24, 27, 29, 32, 34, 37, 39, 42, 44, 47, 50]
min_samples_leaf [1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20, 22, 23, 25, 26, 28, 30]
bootstrap [True, False]


In [35]:
#fit_params, cv_res_log, runtimelog = Lag_RF_BayesSearchCV(df, 142, 1, 13, bayes_grid, predictors, target, fit_shelf, url, lags=13, search_iters=60, init_params=None, score_method='max_error', verbose=True, debugging=False)

In [36]:
fit_params, cv_res_log, runtimelog = Lag_RF_BayesSearchCV(df, 298, 1, 13, bayes_grid, predictors, target, fit_shelf, url, lags=13, search_iters=60, init_params=None, score_method='neg_root_mean_squared_error', verbose=True, debugging=False)
export_optimization(fit_params, cv_res_log, runtimelog, pfile, cvfile, rfile)

../../Data/Models/Random_Forest_Models opened 
Initial training window ends on:  2020-12-27 00:00:00+00:00
Number of Folds = 71
Model Date is: 2022/08/10 15:17 

Lag 1
Shape of lag 1 iteration df is (380, 1694)
Optimization runtime was 4:39:29.250157

Lag 2
Shape of lag 2 iteration df is (379, 1694)
Optimization runtime was 3:18:02.032931

Lag 3
Shape of lag 3 iteration df is (378, 1694)
Optimization runtime was 2:34:58.154948

Lag 4
Shape of lag 4 iteration df is (377, 1694)
Optimization runtime was 4:16:00.092991

Lag 5
Shape of lag 5 iteration df is (376, 1694)
Optimization runtime was 3:19:27.073597

Lag 6
Shape of lag 6 iteration df is (375, 1694)
Optimization runtime was 4:19:53.017655

Lag 7
Shape of lag 7 iteration df is (374, 1694)
Optimization runtime was 2:59:57.066290

Lag 8
Shape of lag 8 iteration df is (373, 1694)
Optimization runtime was 1:12:46.097724

Lag 9
Shape of lag 9 iteration df is (372, 1694)
Optimization runtime was 3:31:37.257826

Lag 10
Shape of lag 10 itera
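
The printed shapes shrink by one row per lag (380, 379, 378, ...). That pattern is consistent with shifting the target forward by the lag and dropping the rows that no longer have a target value. `Lag_RF_BayesSearchCV` is not shown in this notebook, so the helper below is only a guess at that step, with a hypothetical name and made-up data.

```python
import pandas as pd


def make_lagged_frame(frame, target, lag):
    """Pair each row's predictors with the target observed `lag` rows later."""
    out = frame.copy()
    out[target] = out[target].shift(-lag)
    # The last `lag` rows have no future target value, so they are dropped
    return out.dropna(subset=[target])


# Made-up data: ten weekly observations
weekly = pd.DataFrame({"target": range(10), "feature": range(10, 20)}, dtype=float)

for lag in (1, 2, 3):
    lagged = make_lagged_frame(weekly, "target", lag)
    print(f"Shape of lag {lag} iteration df is {lagged.shape}")
```

Under this framing, each lag gets its own model that predicts the price that many weeks ahead from the information available today.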

#### 3.5.3 Lag_RF_GASearchCV

In [37]:
# Define alg search params
# Number of trees in random forest
n_estimators = Integer(20, len(predictors))
# Number of features to consider at every split
max_features = Categorical(['auto', 'sqrt'])
# Maximum number of levels in tree
max_depth = Integer(2, 110)
#max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = Integer(2, 50) 
# Minimum number of samples required at each leaf node
min_samples_leaf = Integer(1, 30) 
# Method of selecting samples for training each tree
bootstrap = Categorical([True, False])
# Define Random Forest Regressor HP grid for DV
genetic_grid = {'n_estimators': n_estimators,
                'max_features': max_features,
                'max_depth': max_depth,
                'min_samples_split': min_samples_split,
                'min_samples_leaf': min_samples_leaf,
                'bootstrap': bootstrap
               }

genetic_grid

{'n_estimators': <sklearn_genetic.space.space.Integer at 0x17610942c48>,
 'max_features': <sklearn_genetic.space.space.Categorical at 0x176109425c8>,
 'max_depth': <sklearn_genetic.space.space.Integer at 0x17610942b48>,
 'min_samples_split': <sklearn_genetic.space.space.Integer at 0x17610942ec8>,
 'min_samples_leaf': <sklearn_genetic.space.space.Integer at 0x176109424c8>,
 'bootstrap': <sklearn_genetic.space.space.Categorical at 0x17610942908>}

In [38]:
#fit_params, cv_res_log, runtimelog = Lag_RF_GeneticSearchCV(df, 142, 1, 13, genetic_grid, predictors, target, fit_shelf, url, lags=13, search_iters=60, init_params=None, verbose=True, debugging=False)
#export_optimization(fit_params, cv_res_log, runtimelog, pfile, cvfile, rfile)

### 3.6 XGBoost

#### 3.6.1 Lag_XGB_RandomSearchCV

In [39]:
# Define alg search params
# Booster
booster = ['gbtree','gblinear','dart']
# Learning Rate
learning_rate= [x for x in np.arange(0.00, 1, 0.05)] # [x for x in np.linspace(start = 0.001, stop = 2000, num = 1000)]
# Loss reduction
gamma = [ x for x in np.linspace(0, 100, 101)]
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(1, 110, num = 100)] # [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# min_child_weight
min_child_weight = [ x for x in np.linspace(0, 1000, 1001)]
# subsample
subsample = [x for x in np.arange(0, 1.0, 0.1)] # [x for x in np.linspace(0.1, 1, 10)] #
# colsample_bytree
colsample_bytree = [x for x in np.arange(0.1, 1.0, 0.1)]
# colsample_bylevel
colsample_bylevel = [x for x in np.arange(0.1, 1.0, 0.1)]
# Define XGBoost Regressor HP grid for DV
random_grid = {'booster': booster,
               'gamma' : gamma,
               'max_depth': max_depth,
               'min_child_weight' : min_child_weight,
               'learning_rate': learning_rate,
               'subsample': subsample,
               'colsample_bytree': colsample_bytree,
               'colsample_bylevel': colsample_bylevel
               }



In [40]:
#fit_params, cv_res_log, runtimelog = Lag_XGB_RandomSearchCV(df, 142, 1, 13, random_grid, predictors, target, fit_shelf, url, lags=13, search_iters=600, init_params=None, verbose=True, debugging=False)
#export_optimization(fit_params, cv_res_log, runtimelog, pfile, cvfile, rfile)

#### 3.6.2 Lag_XGB_BayesSearchCV

In [41]:
# Define alg search params
# Booster
booster = ['gbtree','gblinear','dart']
# Learning Rate
learning_rate= [x for x in np.arange(0.95, 1, 0.05)] # [x for x in np.arange(0, 1, 0.05)]
# Loss reduction
gamma = [ x for x in np.linspace(0, 1, 3)] # [ x for x in np.linspace(0, 100, 101)]
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5000, 10000, 6)] # [int(x) for x in np.linspace(1, 110, num = 100)]
#max_depth.append(None)
# min_child_weight
min_child_weight = [ x for x in np.linspace(0, 1, 3)] # [ x for x in np.linspace(0, 1000, 2001)]
# subsample
subsample = [x for x in np.arange(0, 0.1, 0.05)] # [x for x in np.arange(0, 1.0, 0.1)]
# colsample_bytree
colsample_bytree = [x for x in np.arange(0.1, 1.0, 0.1)]
# colsample_bylevel
colsample_bylevel = [x for x in np.arange(0.1, 1.0, 0.1)]
# Define XGBoost Regressor HP grid for DV
bayes_grid = {#'booster': booster,
               'gamma' : gamma,
               'max_depth': max_depth,
               'min_child_weight' : min_child_weight,
               'learning_rate': learning_rate,
               'subsample': subsample,
               'colsample_bytree': colsample_bytree,
               'colsample_bylevel': colsample_bylevel
               }
#[print(item, bayes_grid[item]) for item in bayes_grid]

In [42]:
#fit_params, cv_res_log, runtimelog = Lag_XGB_BayesSearchCV(df, 142, 1, 13, bayes_grid, predictors, target, fit_shelf, url, lags=13, search_iters=60, init_params=None, score_method='neg_root_mean_squared_error', verbose=True, debugging=False)
#export_optimization(fit_params, cv_res_log, runtimelog, pfile, cvfile, rfile)

### 3.7 Multi-Layer Perceptron Regressor (MLPR) Neural Net

In [43]:
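# Define alg search params
# Candidate hidden-layer architectures (number of neurons in each hidden layer)
hidden_layer_sizes = [
    (50,), 
    (50, 50), (50, 100),
    (50,50,50), (50,100,50), (50,50,100),
    (100,),
    (100, 50), (100,100),
    (100, 50, 50), (100, 100, 50), (100, 50, 100)
    ]
# Activation function for the hidden layers
activation = ['tanh', 'relu']
# Solver used for weight optimization
solver= ['lbfgs', 'sgd', 'adam']
# Strength of the L2 regularization term
alpha = [x for x in np.linspace(0.0001, 1, num = 1000)]
# Learning rate schedule for weight updates (only used by the 'sgd' solver)
learning_rate = ['constant','adaptive']
# Maximum number of iterations (defined here but not included in random_grid below)
max_iter = [x for x in np.linspace(5, 5000, num= 5000)]

# Define MLP Regressor HP grid for DV
random_grid = {'hidden_layer_sizes': hidden_layer_sizes,
               'activation': activation,
               'solver': solver,
               'alpha': alpha,
               'learning_rate': learning_rate
               }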
random_grid

{'hidden_layer_sizes': [(50,),
  (50, 50),
  (50, 100),
  (50, 50, 50),
  (50, 100, 50),
  (50, 50, 100),
  (100,),
  (100, 50),
  (100, 100),
  (100, 50, 50),
  (100, 100, 50),
  (100, 50, 100)],
 'activation': ['tanh', 'relu'],
 'solver': ['lbfgs', 'sgd', 'adam'],
 'alpha': [0.0001,
  0.001100900900900901,
  0.0021018018018018015,
  0.0031027027027027026,
  0.004103603603603604,
  0.005104504504504504,
  0.006105405405405406,
  0.0071063063063063064,
  0.008107207207207206,
  0.009108108108108108,
  0.010109009009009007,
  0.011109909909909909,
  0.01211081081081081,
  0.01311171171171171,
  0.014112612612612612,
  0.015113513513513512,
  0.016114414414414413,
  0.017115315315315315,
  0.018116216216216216,
  0.019117117117117114,
  0.020118018018018016,
  0.021118918918918917,
  0.02211981981981982,
  0.02312072072072072,
  0.02412162162162162,
  0.02512252252252252,
  0.02612342342342342,
  0.027124324324324323,
  0.028125225225225224,
  0.029126126126126126,
  0.030127027027027024

#### 3.7.1 Lag_MLPR_BayesSearchCV

In [44]:
#fit_params, cv_res_log, runtimelog = Lag_MLPR_BayesSearchCV(df, 142, 1, 13, random_grid, predictors, target, fit_shelf, url, lags=13, search_iters=600, init_params=None, verbose=True, debugging=False)
#export_optimization(fit_params, cv_res_log, runtimelog, pfile, cvfile, rfile)

#### 3.7.2 Lag_MLPR_RandomSearchCV

In [45]:
#fit_params, cv_res_log, runtimelog = Lag_MLPR_RandomSearchCV(df, 142, 1, 13, random_grid, predictors, target, fit_shelf, url, lags=13, search_iters=600, init_params=None, verbose=True, debugging=False)
#export_optimization(fit_params, cv_res_log, runtimelog, pfile, cvfile, rfile)

### 3.8 Long Short-Term Memory (Keras LSTM)

#### 3.8.1 Lag_kLSTM_BayesSearchCV

# Documentation and Links

#### Time Series WFV and Optimization
* https://notebooks.githubusercontent.com/view/ipynb?browser=chrome&color_mode=auto&commit=24f6be86f95bfc1ec246dee7dcdd455e0a84a872&device=unknown&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f616c616e2d747572696e672d696e737469747574652f736b74696d652f323466366265383666393562666331656332343664656537646364643435356530613834613837322f6578616d706c65732f77696e646f775f73706c6974746572732e6970796e62&logged_in=false&nwo=alan-turing-institute%2Fsktime&path=examples%2Fwindow_splitters.ipynb&platform=android&repository_id=156401841&repository_type=Repository&version=98
* https://quantile.app/blog/cross_validation
* https://towardsdatascience.com/dont-use-k-fold-validation-for-time-series-forecasting-30b724aaea64
* https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
* https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/

#### Bayesian and Evolutionary Optimization
* https://machinelearningmastery.com/what-is-bayesian-optimization/
* https://www.analyticsvidhya.com/blog/2021/05/bayesian-optimization-bayes_opt-or-hyperopt/
* https://www.kdnuggets.com/2019/07/xgboost-random-forest-bayesian-optimisation.html
* https://towardsdatascience.com/tune-your-scikit-learn-model-using-evolutionary-algorithms-30538248ac16
* https://machinelearningmastery.com/scikit-optimize-for-hyperparameter-tuning-in-machine-learning/
* https://towardsdatascience.com/optimizing-hyperparameters-the-right-way-3c9cafc279cc

#### Error
* https://scikit-learn.org/stable/modules/model_evaluation.html
* https://towardsdatascience.com/forecast-kpi-rmse-mae-mape-bias-cdc5703d242d#:~:text=The%20Mean%20Absolute%20Percentage%20Error,average%20of%20the%20percentage%20errors.

#### SKTime Documentation:
* https://www.sktime.org/en/latest/examples/01_forecasting.html

#### skforecast (pip, not installed in environment):
* https://www.cienciadedatos.net/documentos/py27-time-series-forecasting-python-scikitlearn.html

#### Incremental Forecast loop using standard sklearn algorithms:
* https://www.analyticsvidhya.com/blog/2021/06/random-forest-for-time-series-forecasting/
* https://towardsdatascience.com/time-series-modeling-using-scikit-pandas-and-numpy-682e3b8db8d1
* https://www.ethanrosenthal.com/2019/02/18/time-series-for-scikit-learn-people-part3/

#### TS with Random Forest
* https://towardsdatascience.com/multivariate-time-series-forecasting-using-random-forest-2372f3ecbad1
* https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/

#### ARIMA
* https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/

#### Prophet 
* https://towardsdatascience.com/implementing-facebook-prophet-efficiently-c241305405a3