# Scikit Learn Benzene Price Forecasting

## 0.0 Notes and Explanations

This notebook uses SQL to query:
* lyb-sql-devdacore-002.5bff9fcb8330.database.windows.net 
    * To extract the information contained in the t-code ZMRRELPO
* lyb-sql-prddacore-002.bed79ae4ef8b.database.windows.net
    * To extract ZEMA IHS data
    * To extract ZEMA ICIS data

### 0.1 Environment setup

This notebook uses the py37_benzene environment, which can be created from your local repo sync by running the following in the Anaconda Prompt:
> conda env create -f py37_benzene.yml

The .yml file is stored in the root scripts directory.

## 1.0 Prepare Workspace

### 1.1 Import Standard Libraries and configure runtime parameters

In [1]:
##
# Import Basic Python DS Libraries 
import pandas as pd     # Standard data science package
import numpy as np      # Additional numerical functions
#import dtale
import warnings
warnings.filterwarnings('ignore')

##
# Import Advanced Python DS Libraries
import missingno as msno    # Missing value toolbox
#from pandas_profiling import ProfileReport     # Integrated/deep reporting, resource intensive
from statsmodels.tsa.stattools import adfuller  # Statistical test for stationary data

##
# Import Database Connection Libraries
#import pyodbc           # Database connector

##
# Import Plotting Libraries
import matplotlib.pyplot as plt                # Full featured plotting toolbox
plt.rcParams['figure.figsize'] = (25,8)
#import plotly.graph_objects as go               #Plotly GO toolbox
import plotly.express as px                    # Plotly Express plotting toolbox
import plotly.io as pio                        # Additional controls for plotly to allow visuals within VSCode Notebook
#import seaborn as sns       # Seaborn plotting toolbox

##
# Import Operating System Interface Libraries
import os       # operating system interface
import sys
#import calendar
#import datetime as dt
from datetime import datetime, timedelta, date

##
# Import File Format Libraries
#import pyarrow.feather as feather       # Advanced datasource import/export toolbox (Apache Parquet)
import shelve # Allows serialization (pickling) of multiple objects and saving them to dict-like objects for later use
#import sqlite3

##
# Import ML and Forecasting Libraries
#from scipy import interpolate
#from sklearn import metrics     # Metrics for sklearn ML models
from sklearn.ensemble import RandomForestRegressor # ML Model package
#from sklearn.tree import DecisionTreeRegressor  # ML Model package

from sktime.forecasting.model_selection import ExpandingWindowSplitter, SlidingWindowSplitter
#from sklearn.model_selection import TimeSeriesSplit

#from sktime.forecasting.model_selection import ForecastingGridSearchCV
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.feature_selection import RFECV, RFE
#from xgboost import XGBRegressor, XGBRFRegressor
#from prophet import Prophet

##
# Setup packages for sending messages
import requests
import json

##
# Import timing function(s)
from time import perf_counter
import datetime as dt   # Aliased so the module does not shadow the datetime class imported above

In [2]:
# Configure Libraries
pio.renderers.default = "notebook_connected" # Configure plotly to print within VSCode environment
#pio.renderers.default = "vscode"


### 1.2 Import Custom Functions

In [3]:
# Define location for custom functions
module_path = os.path.abspath(os.path.join('../Functions'))

# Verify it's accessible for loading
if (module_path not in sys.path) and os.path.isdir(module_path):
    sys.path.append(module_path)
    print('Added', module_path, 'to system paths')

elif (module_path in sys.path) and os.path.isdir(module_path):
    print(module_path, 'ready to be used for import')

else:
    print(module_path, 'is not a valid path')

# Import Custom Functions
# Each import reports the underlying error instead of silently swallowing it
try: from multi_plot import *; print('Loaded multiplot')
except Exception as err: print('Failed to load multi_plot:', err)

try: from StationaryTools import *; print('Loaded StationaryTools')
except Exception as err: print('Failed to load StationaryTools:', err)

try: from RegressionTools import *; print('Loaded RegressionTools')
except Exception as err: print('Failed to load RegressionTools:', err)

try: from ModelingTools import *; print('Loaded ModelingTools')
except Exception as err: print('Failed to load ModelingTools:', err)

try: from sk_ts_modelfit import *; print('Loaded sk_ts_modelfit')
except Exception as err: print('Failed to load sk_ts_modelfit:', err)

Added c:\Users\baanders\Documents\Benzene Forecasting\Scripts\Functions to system paths
Loaded multiplot
Loaded multiplot
Loaded StationaryTools
Loaded StationaryTools
Loaded RegressionTools


Using TensorFlow backend.


Loaded ModelingTools
Loaded sk_ts_modelfit


### 1.3 Configure messaging

In [4]:
url = 'https://lyondellbasell.webhook.office.com/webhookb2/0b44f724-4c38-4322-af43-1cc8a79b2240@fbe62081-06d8-481d-baa0-34149cfefa5f/IncomingWebhook/7d6ffad65b414df9bd9bf5fa64da44b2/76b3b453-0a85-40ff-a5d7-15d042decdf6'

#payload = {
#    "text": "Sample alert text"
#}
#headers = {
#    'Content-Type': 'application/json'
#}
#response = requests.post(url, headers=headers, data=json.dumps(payload))
#print(response.text.encode('utf8'))
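
The commented-out lines above outline how an alert could be posted to the webhook. A minimal sketch of that pattern follows. The helper names build_alert and send_alert are hypothetical, and the HTTP call is passed in as a callable (for example requests.post) so the helper can be tried without touching the network.

```python
import json

def build_alert(text):
    """Return the headers and JSON body for a plain-text alert."""
    headers = {'Content-Type': 'application/json'}
    body = json.dumps({'text': text})
    return headers, body

def send_alert(url, text, post):
    """Post the alert using the supplied callable, e.g. requests.post."""
    headers, body = build_alert(text)
    return post(url, headers=headers, data=body)

# Dry run with a stand-in for requests.post that only records the call
sent = []

def fake_post(url, headers=None, data=None):
    sent.append((url, headers, data))
    return 'ok'

result = send_alert('https://example.invalid/webhook', 'Sample alert text', fake_post)
```

In the notebook itself, passing requests.post as the last argument together with the url defined above would send the message for real.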

## 2.0 Import Data

### 2.1 Import data

#### 2.1.1 Modeled df

In [5]:
# Define default storage location for files
model_df_root = '../../Data/Models/'
model_df_file = 'Random_Forest_Models_df'

In [6]:
# Check if data location above exists. If it does import dataset.
# All datasets imported with name df so that we can generically 

if os.path.isdir(model_df_root):
    df_model = pd.read_parquet(model_df_root+model_df_file+'.parquet')
    print(model_df_file + ' dataset loaded with shape', df_model.shape, 'and', df_model.isna().sum().sum(), 'NaN values') 
else:
    print('Storage location does not exist. Please update directory and try again.')
#df_all.head(5)

Random_Forest_Models_df dataset loaded with shape (381, 11046) and 1684788 NaN values


In [7]:
# Extract the most recent data slice from the MultiIndex columns of df_model
newest_df_date = max(df_model.columns.levels[0])
df_model_newest = df_model[newest_df_date]
print(f'{newest_df_date} is the model date being processed')

20220810_15 is the model date being processed


#### 2.1.2 Most Recent Data df

In [8]:
# Full dataset location
df_full_loc = '../../Data/Parquet/SKLearn Data/weekly_for_modeling.parquet'
df_full = pd.read_parquet(df_full_loc)
df_full = df_full[max(df_full.columns.levels[0])]


In [9]:
# Set validation end date for full data set
#end_date = '2019-12-31'

# Set end_date to newest data point
end_date = df_full.index[-1]

In [10]:
# Truncate df_full to end at end_date
df_full = df_full[:end_date]
print('date ranges\n\t',df_full.index[0], ":", df_full.index[-1])
print('shape', df_full.shape)

date ranges
	 2015-04-12 00:00:00+00:00 : 2022-07-24 00:00:00+00:00
shape (381, 5432)


In [11]:
# Limit df_full to not include peak COVID era data
covid_start = '2020-01-01'
covid_stop = '2021-12-31'

pre_covid_df = df_full[:covid_start]
pos_covid_df = df_full[covid_stop:]

non_covid_df = pd.concat([pre_covid_df, pos_covid_df], axis=0)

# Commenting and/or uncommenting this line will control the COVID restriction enforcement.
#df_full = non_covid_df.copy(deep=True)

print('date ranges\n\t',df_full.index[0], ":", df_full.index[-1])
print('shape', df_full.shape)

date ranges
	 2015-04-12 00:00:00+00:00 : 2022-07-24 00:00:00+00:00
shape (381, 5432)


#### 2.1.3 Match columns in df_full to df_model

In [12]:
df = df_full[df_model_newest.columns]
print('(rows, columns)\ndf_model shape', df_model_newest.shape, '\ndf_full shape', df_full.shape,'\n\ndf shape',df.shape)

(rows, columns)
df_model shape (381, 1694) 
df_full shape (381, 5432) 

df shape (381, 1694)


### 2.2 Import Model Pickle Files

In [13]:
# Define storage location for Shelf
model_dataroot = '../../Data/Models/'
model_shelf = 'Random_Forest_Models'

# Load Shelf if it exists
if os.path.isdir(model_dataroot):
    # Open shelf (flag='c' opens it read/write and creates it if missing; flag='r' would open it read-only)
    s = shelve.open(model_dataroot+model_shelf, flag='c', writeback=False)
    print(model_dataroot + model_shelf+' opened ')
else:
    print('Shelve storage location does not exist. Please correct and try again.')

../../Data/Models/Random_Forest_Models opened 
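
As a hedged illustration of the shelve flags, the sketch below writes to and then reads from a throwaway shelf in a temporary directory. The path and the stored dictionary are hypothetical stand-ins for the real model shelf, and the context manager is one way to make sure the shelf is closed even if a later step fails.

```python
import os
import shelve
import tempfile

# Hypothetical shelf in a temporary directory, standing in for Random_Forest_Models
shelf_path = os.path.join(tempfile.mkdtemp(), 'demo_models')

# flag='c' opens the shelf for reading and writing and creates it if needed
with shelve.open(shelf_path, flag='c') as shelf:
    shelf['2022/08/10 15:17 _lag_1'] = {'n_estimators': 839}

# flag='r' opens an existing shelf read-only; leaving the with-block closes it,
# even if the body raises an exception
with shelve.open(shelf_path, flag='r') as shelf:
    keys = list(shelf)
    params = shelf['2022/08/10 15:17 _lag_1']
```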


### 2.3 Data Validation

In [14]:
# Display df
print(newest_df_date, 'is df_model slice being used, validate it matches model dates if recreating model fits')
df

20220810_15 is df_model slice being used, validate it matches model dates if recreating model fits


date,Benzene CFR Taiwan MAvg (USD/MT)-AVERAGE,Benzene CFR Taiwan Weekly (USD/MT)-AVERAGE,Benzene ENEOS Corporation CP Nomination (USD/MT)-AVERAGE,Benzene ENEOS Corporation CP Settlement (USD/MT)-AVERAGE,Benzene FOB Brazil Weekly (USD/MT)-AVERAGE,Benzene FOB Korea Marker (USD/MT)-AVERAGE,Benzene FOB Korea Marker MAvg (USD/MT)-AVERAGE,Benzene FOB Korea Marker WAvg (USD/MT)-AVERAGE,Benzene FOB Korea Paper BalMo (USD/MT)-AVERAGE,Benzene FOB Korea Paper BalMo MAvg (USD/MT)-AVERAGE,...,"Xylenes (mixed)-Spot, Next Month, Low-N/A-Cents per Gallon-Houston, TX-North America_lag_4","Xylenes (mixed)-Spot, Next Month, Low-N/A-Cents per Gallon-Houston, TX-North America_lag_5","Xylenes (mixed)-Spot, Next Month, Low-N/A-Cents per Gallon-Houston, TX-North America_lag_6","Xylenes (mixed)-Spot, Next Month, Low-N/A-Cents per Gallon-Houston, TX-North America_lag_7","Xylenes (mixed)-Spot, Next Month, Low-N/A-Cents per Gallon-Houston, TX-North America_lag_8","Xylenes (mixed)-Spot, Next Month, Low-N/A-Cents per Gallon-Houston, TX-North America_lag_9","Xylenes (mixed)-Spot, Next Month, Low-N/A-Cents per Gallon-Houston, TX-North America_lag_10","Xylenes (mixed)-Spot, Next Month, Low-N/A-Cents per Gallon-Houston, TX-North America_lag_11","Xylenes (mixed)-Spot, Next Month, Low-N/A-Cents per Gallon-Houston, TX-North America_lag_12","Xylenes (mixed)-Spot, Next Month, Low-N/A-Cents per Gallon-Houston, TX-North America_lag_13"
2015-04-12 00:00:00+00:00,833.30,780.000,820.0,770.0,731.8,789.700,826.610,762.740,796.6,800.90,...,1953.280,1919.76,1880.80,1815.20,1702.08,1631.44,1571.60,1511.84,1549.44,1580.80
2015-04-19 00:00:00+00:00,833.30,816.600,820.0,770.0,761.6,836.600,826.610,799.080,835.4,800.90,...,1796.280,1953.28,1919.76,1880.80,1815.20,1702.08,1631.44,1571.60,1511.84,1549.44
2015-04-26 00:00:00+00:00,833.30,870.100,820.0,770.0,795.2,864.900,826.610,842.260,843.0,800.90,...,1687.700,1796.28,1953.28,1919.76,1880.80,1815.20,1702.08,1631.44,1571.60,1511.84
2015-05-03 00:00:00+00:00,823.79,865.100,852.0,808.0,816.2,841.550,816.788,857.016,843.0,806.00,...,1815.120,1687.70,1796.28,1953.28,1919.76,1880.80,1815.20,1702.08,1631.44,1571.60
2015-05-10 00:00:00+00:00,785.75,841.600,900.0,865.0,784.0,849.100,777.500,845.972,855.2,826.40,...,1795.136,1815.12,1687.70,1796.28,1953.28,1919.76,1880.80,1815.20,1702.08,1631.44
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-06-26 00:00:00+00:00,1374.58,1314.200,1350.0,1305.0,1646.4,1250.202,1333.410,1323.376,1317.0,1385.21,...,3217.800,2944.40,2584.84,2792.94,2624.40,1864.44,2024.08,2623.80,2571.76,2638.74
2022-07-03 00:00:00+00:00,1374.58,1255.434,1344.0,1302.0,1622.2,1253.334,1333.410,1250.826,1281.6,1385.21,...,2859.000,3217.80,2944.40,2584.84,2792.94,2624.40,1864.44,2024.08,2623.80,2571.76
2022-07-10 00:00:00+00:00,1374.58,1236.236,1320.0,1290.0,1565.0,1171.600,1333.410,1236.984,1190.8,1385.21,...,3491.820,2859.00,3217.80,2944.40,2584.84,2792.94,2624.40,1864.44,2024.08,2623.80
2022-07-17 00:00:00+00:00,1374.58,1133.266,1320.0,1290.0,1331.0,1066.066,1333.410,1147.496,1085.8,1385.21,...,3299.080,3491.82,2859.00,3217.80,2944.40,2584.84,2792.94,2624.40,1864.44,2024.08


In [15]:
# Validate that no inf and -inf values remain
inf_vals = df.isin([np.inf, -np.inf]).sum().sum()
nan_vals = df.isna().sum().sum()
print('There are:',inf_vals,'inf or -inf values in df')
print('There are:',nan_vals,'NaN values in df')
if (inf_vals>0) or (nan_vals>0): 
    df = df.replace([np.inf, -np.inf], np.nan)
    df = df.dropna(axis=0)
    inf_vals = df.isin([np.inf, -np.inf]).sum().sum()
    nan_vals = df.isna().sum().sum()
    print('After conversion there are:', inf_vals,'inf or -inf values in df')
    print('After conversion there are:',nan_vals,'NaN values in df')

There are: 0 inf or -inf values in df
There are: 0 NaN values in df


In [16]:
# Show list of models that match newest_df_date above
#list(s)
#model_list = [x for x in s if x.startswith(newest_df_date)]

# Show list of models for the most recent addition
# (13 models per optimization, so steps backward need to be in multiples of 13)
# (i.e. -1: latest, -14: one before latest, etc.)
model_list = [x for x in s if x.startswith(list(s)[-1].split('_')[0])]
#model_list = [x for x in s if x.startswith(list(s)[-14].split('_')[0])]
#shelf[list(shelf)[lcnt]].best_estimator_
for model_num in range(0,len(model_list)):
    print(model_list[model_num], " : \n\t ", s[model_list[model_num]].best_estimator_, "\n")

2022/08/10 15:17 _lag_1  : 
	  RandomForestRegressor(max_depth=35, min_samples_split=24, n_estimators=839,
                      n_jobs=-1) 

2022/08/10 15:17 _lag_2  : 
	  RandomForestRegressor(max_depth=98, min_samples_leaf=23, min_samples_split=42,
                      n_estimators=702, n_jobs=-1) 

2022/08/10 15:17 _lag_3  : 
	  RandomForestRegressor(max_depth=37, min_samples_leaf=25, min_samples_split=27,
                      n_estimators=190, n_jobs=-1) 

2022/08/10 15:17 _lag_4  : 
	  RandomForestRegressor(max_depth=41, min_samples_leaf=23, min_samples_split=22,
                      n_estimators=771, n_jobs=-1) 

2022/08/10 15:17 _lag_5  : 
	  RandomForestRegressor(max_depth=21, min_samples_leaf=25, min_samples_split=14,
                      n_estimators=20, n_jobs=-1) 

2022/08/10 15:17 _lag_6  : 
	  RandomForestRegressor(max_depth=48, min_samples_leaf=25, min_samples_split=50,
                      n_estimators=1215, n_jobs=-1) 

2022/08/10 15:17 _lag_7  : 
	  RandomForest
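
The cell above takes the timestamp prefix from the last key in the shelf, which relies on the keys coming back in insertion order; the underlying dbm backends may not all guarantee that. A minimal sketch of an order-independent alternative follows, using a plain dictionary with hypothetical contents in place of the shelf and assuming keys of the form shown in the output above.

```python
# Hypothetical shelf contents: key = '<model timestamp> _lag_<n>'
models = {
    '2022/08/10 15:17 _lag_2': 'model C',
    '2022/07/26 00:45 _lag_1': 'model A',
    '2022/08/10 15:17 _lag_1': 'model B',
    '2022/07/26 00:45 _lag_2': 'model D',
}

# Timestamps in YYYY/MM/DD HH:MM form sort chronologically as strings,
# so max() finds the newest prefix regardless of key order
newest_prefix = max(key.split('_')[0] for key in models)

# Keep only the newest models and order them by lag number
model_list = sorted(
    (key for key in models if key.startswith(newest_prefix)),
    key=lambda key: int(key.rsplit('_', 1)[1]),
)
```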

In [17]:
# Simple utility to find a column name based on what it starts with
# May need it to find target column values
[col for col in df if col.startswith('Benzene-Spot, Current Month, High')]

['Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_1',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_2',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_3',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_4',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_5',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_6',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_7',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_8',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America_lag_9',
 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon

## 3.0 Predicting

### 3.1 Configure Prediction Inputs

In [18]:
# Our target and identifying columns won't change for any of the models, so we define them once here
target = 'Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America'

# Our identifying column (IDcol) is 'date' for time series analyses
IDcol = ['date']

# All columns that are neither the target nor an ID (date) column are used to predict the target
predictors = [x for x in df.columns if x not in [target] + IDcol]

print("Algorithm will attempt to predict:\n\t", target, "\nusing:\n\t", IDcol, "\nbased on:\n\t", len(predictors), "predictors")

Algorithm will attempt to predict:
	 Benzene-Spot, Current Month, High-N/A-Cents per Gallon-FOB Houston, TX-North America 
using:
	 ['date'] 
based on:
	 1693 predictors
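
In this notebook date is the index rather than a column, so the 1693 predictors are simply the 1694 columns minus the target. If an ID column were an ordinary column, the way the exclusion list is built would matter. A small sketch with hypothetical column names shows the difference between a nested and a flat exclusion list:

```python
# Hypothetical column names for illustration only
columns = ['date', 'target_price', 'feature_a', 'feature_b']
target = 'target_price'
IDcol = ['date']

# [target] + [IDcol] nests the ID list: ['target_price', ['date']],
# so the string 'date' never matches and stays among the predictors
nested = [c for c in columns if c not in [target] + [IDcol]]

# [target] + IDcol is flat: ['target_price', 'date'],
# so both the target and the ID column are excluded
flat = [c for c in columns if c not in [target] + IDcol]
```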


### 3.2 Looped Prediction using Lagging and CV Optimizations

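Looped prediction is performed by Lag_Predict and Lag_History_Predict. Neither is defined in this notebook, so they are presumably provided by the custom function modules imported in section 1.2.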

In [19]:
# Define storage location for output files
pred_loc = '../../Data/Models/RF_pred_output.parquet'
lpred_loc = '../../Data/Models/RF_lpred_output.parquet'
param_loc = '../../Data/Models/RF_fit_params_output.parquet'

In [20]:
# Return row of data that matches a date, date must match exactly
#df_full['2018-12-24':]

# Return row index# that matches iw_date, date must match exactly
iw_date = '2018-12-30' # Validation end date
#iw_date = df.index[-1] # Full dataset (true future prediction)
iw_num = df.index.get_loc(iw_date)
iw_num

194

In [21]:
#Test date info in df
#df.iloc[:iw_num+2,:]


In [22]:
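# Test if model supports Feature Importance saving
# (model_num keeps its last value from the loop in cell 16, so this checks the last model listed)
model_name = str(s[model_list[model_num]].best_estimator_).split('(')[0]
# Note: an estimator's repr starts with its class name, so an XGBoost model would
# show up as 'XGBRegressor' and would not match the 'XGBoost' entry below
feat_import_toggle = model_name in ['RandomForestRegressor','XGBoost']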

In [23]:
# Test if we are validating the system on old data only or performing regular use future forecasting
if df.shape[0] - iw_num == 1:
    print('General use future forecasting. No pseudo-forecasting of history for validation\n\n Running Lag_Predict')
    # Run a single new prediction using new data
    if feat_import_toggle:
        df_preds, df_predsl, df_predsh, df_predshl, fit_params, param_hist = Lag_Predict(df, predictors, target, model_list, s, pred_loc, lpred_loc, param_loc)
    else:
        df_preds, df_predsl, df_predsh, df_predshl = Lag_Predict(df, predictors, target, model_list, s, pred_loc, lpred_loc, param_loc)

else: 
    # Print number of steps that should be taken in the function call below (number of rows - iw_num)
    print(df.shape[0] - iw_num,'weeks of data will be forecasted as if done in the past for validation in Lag_History_Predict\napprox.', df.shape[0] - iw_num,'x 91 =',(df.shape[0] - iw_num)*91,'long data points should be created')
    # Simulate the past as though we used the method for each weekly step
    if feat_import_toggle:
        df_preds, df_predsl, df_predsh, df_predshl, fit_params, param_hist = Lag_History_Predict(df, iw_num, predictors, target, model_list, s, pred_loc, lpred_loc, param_loc, verbose=True)
    else:
        df_preds, df_predsl, df_predsh, df_predshl = Lag_History_Predict(df, iw_num, predictors, target, model_list, s, pred_loc, lpred_loc, param_loc, verbose=True)


# Close the shelf so that it doesn't get broken if it stays locked when we exit
s.close()
print('Shelf closed. Please rerun open sequence if you need to re-execute the model')

187 weeks of data will be forecasted as if done in the past for validation in Lag_History_Predict
approx. 187 x 91 = 17017 long data points should be created
RandomForestRegressor
Model Date: 2022/08/12 13:45 
Simulating run on  2018/12/30 00:00 UTC to predict weeks after 2019-01-06 00:00:00+00:00
Simulating run on  2019/01/06 00:00 UTC to predict weeks after 2019-01-13 00:00:00+00:00
Simulating run on  2019/01/13 00:00 UTC to predict weeks after 2019-01-20 00:00:00+00:00
Simulating run on  2019/01/20 00:00 UTC to predict weeks after 2019-01-27 00:00:00+00:00
Simulating run on  2019/01/27 00:00 UTC to predict weeks after 2019-02-03 00:00:00+00:00
Simulating run on  2019/02/03 00:00 UTC to predict weeks after 2019-02-10 00:00:00+00:00
Simulating run on  2019/02/10 00:00 UTC to predict weeks after 2019-02-17 00:00:00+00:00
Simulating run on  2019/02/17 00:00 UTC to predict weeks after 2019-02-24 00:00:00+00:00
Simulating run on  2019/02/24 00:00 UTC to predict weeks after 2019-03-0
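
The figures in the message above can be reproduced with a short calculation. This is a sketch under one assumption that the notebook does not state outright: the lag-k model contributes k long-format rows per simulated run, which would give 1 + 2 + ... + 13 = 91 rows per run for the 13 lag models.

```python
n_rows = 381      # rows in df, from the shape printed earlier
iw_num = 194      # index position of iw_date (2018-12-30)
n_lags = 13       # one model per lag, lag_1 through lag_13

# Assumption: the lag-k model writes k rows per simulated run
rows_per_run = sum(range(1, n_lags + 1))

# Number of weekly steps simulated, and the resulting long-format row estimate
weeks_simulated = n_rows - iw_num
expected_rows = weeks_simulated * rows_per_run
```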

In [24]:
df_predshl

Unnamed: 0,pred_date,value,lag,run_date,model_date
0,20190106,1511.991475,1,2018/12/30 00:00 UTC,2022/07/26 00:45
1,20190106,1618.263235,2,2018/12/30 00:00 UTC,2022/07/26 00:45
2,20190113,1618.263235,2,2018/12/30 00:00 UTC,2022/07/26 00:45
3,20190106,1569.814278,3,2018/12/30 00:00 UTC,2022/07/26 00:45
4,20190113,1595.386020,3,2018/12/30 00:00 UTC,2022/07/26 00:45
...,...,...,...,...,...
93543,20220918,2424.198419,13,2022/07/17 00:00 UTC,2022/08/12 13:45
93544,20220925,2426.623608,13,2022/07/17 00:00 UTC,2022/08/12 13:45
93545,20221002,2457.432674,13,2022/07/17 00:00 UTC,2022/08/12 13:45
93546,20221009,2486.219778,13,2022/07/17 00:00 UTC,2022/08/12 13:45


In [25]:
# Simulate the past as though we used the method for each weekly step
#df_preds, df_predsl, df_predsh, df_predshl = Lag_RF_History_Predict(df, iw_num, predictors, target, model_list, s, pred_loc, lpred_loc, verbose=True)

In [26]:
# Run a single new prediction using new data
#df_preds, df_predsl, df_predsh, df_predshl = Lag_RF_Predict(df, predictors, target, model_list, s, pred_loc, lpred_loc)
#df_preds

In [27]:
#s.close()



## 5.0 Dataset Organization and Testing

sqlite3 - Requires a specific local driver in Power BI that cannot be installed. Use Parquet for data instead.

Shelve - Works well for storing models over time as pickles on the shelf.

# Documentation and Links

#### Time Series WFV and Optimization
* https://notebooks.githubusercontent.com/view/ipynb?browser=chrome&color_mode=auto&commit=24f6be86f95bfc1ec246dee7dcdd455e0a84a872&device=unknown&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f616c616e2d747572696e672d696e737469747574652f736b74696d652f323466366265383666393562666331656332343664656537646364643435356530613834613837322f6578616d706c65732f77696e646f775f73706c6974746572732e6970796e62&logged_in=false&nwo=alan-turing-institute%2Fsktime&path=examples%2Fwindow_splitters.ipynb&platform=android&repository_id=156401841&repository_type=Repository&version=98
* https://quantile.app/blog/cross_validation
* https://towardsdatascience.com/dont-use-k-fold-validation-for-time-series-forecasting-30b724aaea64
* https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
* https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/
* https://machinelearningmastery.com/what-is-bayesian-optimization/
* https://www.analyticsvidhya.com/blog/2021/05/bayesian-optimization-bayes_opt-or-hyperopt/
* https://www.kdnuggets.com/2019/07/xgboost-random-forest-bayesian-optimisation.html

#### Error
* https://towardsdatascience.com/forecast-kpi-rmse-mae-mape-bias-cdc5703d242d#:~:text=The%20Mean%20Absolute%20Percentage%20Error,average%20of%20the%20percentage%20errors.

#### SKTime Documentation:
* https://www.sktime.org/en/latest/examples/01_forecasting.html

#### skforecast (pip, not installed in environment):
* https://www.cienciadedatos.net/documentos/py27-time-series-forecasting-python-scikitlearn.html

#### Incremental Forecast loop using standard sklearn algorithms:
* https://www.analyticsvidhya.com/blog/2021/06/random-forest-for-time-series-forecasting/
* https://towardsdatascience.com/time-series-modeling-using-scikit-pandas-and-numpy-682e3b8db8d1
* https://www.ethanrosenthal.com/2019/02/18/time-series-for-scikit-learn-people-part3/

#### TS with Random Forest
* https://towardsdatascience.com/multivariate-time-series-forecasting-using-random-forest-2372f3ecbad1
* https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/

#### ARIMA
* https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/

#### Prophet 
* https://towardsdatascience.com/implementing-facebook-prophet-efficiently-c241305405a3