**Notebook to model electricity price differences.**

Brief summary of the exercise:
* Target: predict the price change of electricity from the Close price of one hour to the Close price of the next hour
* Input: the elementwise difference between consecutive hourly 'candles' of the electricity price; explanation follows.

  An hourly candle consists of the Open, High, Low and Close prices and the total Volume traded within an hour; thus each hourly candle has five elements. These candles can be reconstructed from historical data for each consecutive hour. We refer to the number of consecutive hours for which the candles are reconstructed as 'window_size'. Thus, for a window_size of 5, one has candles from 5 consecutive hours.
  
  What remains to be explained is the 'elementwise difference'. Given a history of N = window_size candles, we compute the differences between consecutive candles: Open at t minus Open at t-1, Low at t minus Low at t-1, etc. Thus, from N candles we create N-1 differenced candles.

  In this exercise, the elementwise differenced data for window_sizes of 5 and 15 are provided.

  The data is further enhanced with time-related information, most of it encoded. These features encode which day of the week the data correspond to, whether the day is a bank holiday, etc. A list will be given further below.

* ML models are to be built to predict the Target from the Input data.
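
As a minimal sketch of the differencing described above, assuming made-up candle values and hypothetical variable names (`candles`, `diffed`, `features`, `target`): N = 5 candles yield N-1 = 4 differenced candles, and the target is the next hour's Close minus the last Close in the window. This is only an illustration, not the code that produced the provided files.

```python
import pandas as pd

# Six consecutive hourly candles with made-up values (illustration only).
candles = pd.DataFrame({
    "Open":   [50.0, 51.0, 49.0, 52.0, 53.0, 54.0],
    "High":   [53.0, 54.0, 52.0, 56.0, 55.0, 58.0],
    "Low":    [48.0, 49.0, 47.0, 50.0, 51.0, 52.0],
    "Close":  [52.0, 50.0, 51.0, 55.0, 54.0, 57.0],
    "Volume": [100.0, 120.0, 90.0, 150.0, 130.0, 160.0],
})

window_size = 5
window = candles.iloc[:window_size]          # N candles of history

# Elementwise difference between consecutive candles: row t minus row t-1.
# The first row has no predecessor, so N candles give N-1 differenced candles.
diffed = window.diff().dropna()

# Flatten the differenced candles into one feature vector for this sample.
features = diffed.to_numpy().ravel()

# Target: Close of the next hour minus Close of the last hour in the window.
target = candles["Close"].iloc[window_size] - candles["Close"].iloc[window_size - 1]

print(diffed.shape, features.shape, target)
```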

In [1]:
# Load the data using the pandas library
# For the below demonstration, only the
#    * window_size of 5 files
#    * and only the training sets will be loaded here
# The validation and test sets can be similarly loaded.
# One can also use a similar kind of method for loading as was used for the download
# This can avoid the many lines of manual specification of the files. Manual is also fine, btw.

import pandas as pd
pd.set_option('display.max_columns', None)

X_train_5 = pd.read_csv('X_train_window_size_5_time_encoding_True.csv')
y_train_5 = pd.read_csv('y_train_window_size_5_time_encoding_True.csv')

# Display the first 10 rows of the predictor data and the first 5 rows of the target data;
# for a description of the content, see the text below this cell
print("Predictor data:")
display(X_train_5.head(10))
print("Target data:")
display(y_train_5.head())

Predictor data:


Unnamed: 0,total_hours,dlvry_weekend,dlvry_bank_holiday,dlvry_day_sin,dlvry_day_cos,dlvry_weekday_sin,dlvry_weekday_cos,dlvry_hour_sin,dlvry_hour_cos,lasttrade_weekend,lasttrade_bank_holiday,lasttrade_day_sin,lasttrade_day_cos,lasttrade_weekday_sin,lasttrade_weekday_cos,lasttrade_hour_sin,lasttrade_hour_cos,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
0,10.001383,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.5,0.866025,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.707107,-0.7071068,0.03,0.8,-1.17,-1.18,92.2,600.0,-9.51,-1.98,-8.31,-2.83,176.6,540.0,5.48,2.58,5.39,5.41,187.7,480.0,4.41,-0.21,1.3,-1.31,-18.2,420.0
1,10.001383,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.5,0.866025,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.866025,-0.5,-9.51,-1.98,-8.31,-2.83,176.6,540.0,5.48,2.58,5.39,5.41,187.7,480.0,4.41,-0.21,1.3,-1.31,-18.2,420.0,-0.31,-0.2,-0.66,-3.18,-78.3,360.0
2,10.001383,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.5,0.866025,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.965926,-0.258819,5.48,2.58,5.39,5.41,187.7,480.0,4.41,-0.21,1.3,-1.31,-18.2,420.0,-0.31,-0.2,-0.66,-3.18,-78.3,360.0,-3.49,-2.59,-7.97,-3.21,796.3,300.0
3,10.001383,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.5,0.866025,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-1.0,-1.83697e-16,4.41,-0.21,1.3,-1.31,-18.2,420.0,-0.31,-0.2,-0.66,-3.18,-78.3,360.0,-3.49,-2.59,-7.97,-3.21,796.3,300.0,-2.9,-1.69,2.44,-0.01,-355.6,240.0
4,10.001383,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.5,0.866025,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.965926,0.258819,-0.31,-0.2,-0.66,-3.18,-78.3,360.0,-3.49,-2.59,-7.97,-3.21,796.3,300.0,-2.9,-1.69,2.44,-0.01,-355.6,240.0,-0.7,-1.11,-1.11,-2.69,53.3,180.0
5,7.080482,0.0,0.0,0.394356,0.918958,0.0,1.0,0.258819,0.965926,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.707107,0.7071068,0.0,2.99,0.0,0.0,20.0,420.0,1.77,-1.2,0.0,0.5,40.6,360.0,-1.59,-1.09,-2.65,-2.21,555.0,300.0,-1.9,1.1,0.38,1.61,-332.7,240.0
6,7.080482,0.0,0.0,0.394356,0.918958,0.0,1.0,0.258819,0.965926,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.5,0.8660254,1.77,-1.2,0.0,0.5,40.6,360.0,-1.59,-1.09,-2.65,-2.21,555.0,300.0,-1.9,1.1,0.38,1.61,-332.7,240.0,1.94,-1.58,-2.42,-2.26,421.6,180.0
7,7.726036,0.0,0.0,0.394356,0.918958,0.0,1.0,0.5,0.866025,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.5,0.8660254,-0.21,-1.0,1.26,1.26,23.1,420.0,-1.69,0.07,-0.86,-0.36,328.4,360.0,-1.39,0.63,-0.7,2.23,-86.4,300.0,3.02,-0.4,-1.93,-2.61,377.2,240.0
8,7.726036,0.0,0.0,0.394356,0.918958,0.0,1.0,0.5,0.866025,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.258819,0.9659258,-1.69,0.07,-0.86,-0.36,328.4,360.0,-1.39,0.63,-0.7,2.23,-86.4,300.0,3.02,-0.4,-1.93,-2.61,377.2,240.0,-2.61,-1.7,0.23,0.49,104.2,180.0
9,8.726036,0.0,0.0,0.394356,0.918958,0.0,1.0,0.707107,0.707107,1.0,0.0,0.201299,0.97953,-0.781831,0.62349,-0.5,0.8660254,-0.6,-1.21,-3.48,-4.6,13.0,480.0,-4.08,-1.6,-3.2,-3.2,175.2,420.0,-0.01,5.01,1.99,7.16,108.5,360.0,5.08,-2.93,-0.99,-3.26,121.3,300.0


Target data:


Unnamed: 0,y
0,-3.18
1,-3.21
2,-0.01
3,-2.69
4,2.41
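
The `*_sin` / `*_cos` column pairs in the predictor table are consistent with a cyclic (sine/cosine) encoding of hour, weekday and day. The sketch below assumes the common form sin(2πv/P), cos(2πv/P) with period P = 24 for hours. The helper name `cyclic_encode` is hypothetical, and the exact convention used to build the files is not stated in this notebook.

```python
import numpy as np

def cyclic_encode(value, period):
    """Map a periodic quantity (e.g. hour of day) onto the unit circle."""
    angle = 2.0 * np.pi * np.asarray(value, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Hour 22 on a 24-hour clock (assumed period).
hour_sin, hour_cos = cyclic_encode(22, 24)
print(round(float(hour_sin), 6), round(float(hour_cos), 6))
```

Under these assumptions, hour 22 gives approximately (-0.5, 0.866025), which matches the `dlvry_hour_sin` / `dlvry_hour_cos` values in the first rows of the table. The point of such an encoding is that hour 23 and hour 0 end up close together instead of 23 units apart.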


In [2]:
X_test_5 = pd.read_csv('X_test_window_size_5_time_encoding_True.csv')
y_test_5 = pd.read_csv('y_test_window_size_5_time_encoding_True.csv')

X_valid_5 = pd.read_csv('X_valid_window_size_5_time_encoding_True.csv')
y_valid_5 = pd.read_csv('y_valid_window_size_5_time_encoding_True.csv')
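
The loading comments in the first cell mention that the manual file specification could be avoided. One possible way to do that is sketched below, assuming the files follow the naming pattern seen above and sit in a single directory. The helper names `load_split` and `load_all` are hypothetical.

```python
from pathlib import Path

import pandas as pd

def load_split(data_dir, split, window_size, time_encoding=True):
    """Load the X/y pair for one split ('train', 'valid' or 'test')."""
    suffix = f"{split}_window_size_{window_size}_time_encoding_{time_encoding}.csv"
    X = pd.read_csv(Path(data_dir) / f"X_{suffix}")
    y = pd.read_csv(Path(data_dir) / f"y_{suffix}")
    return X, y

def load_all(data_dir=".", window_size=5, time_encoding=True):
    """Return {split: (X, y)} for the three splits of one window size."""
    return {
        split: load_split(data_dir, split, window_size, time_encoding)
        for split in ("train", "valid", "test")
    }
```

With this, `data = load_all(window_size=5)` followed by `X_train_5, y_train_5 = data["train"]` would replace the six explicit `read_csv` calls.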

### Normalization

In [3]:
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings("ignore")

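def minmax_scale(df_x, series_y, features_to_minmax, normalizers=None):
    """Min-max scale the selected feature columns and the target.

    Scalers are fitted only for features missing from `normalizers`, so the
    scalers fitted on the training set can be reused for validation/test.
    """
    df_x = df_x.copy()  # do not modify the caller's DataFrame in place
    if normalizers is None:
        normalizers = {}
    # Build a new list instead of appending 'y' to the caller's list on every call
    for feat in list(features_to_minmax) + ['y']:
        source = series_y if feat == 'y' else df_x[feat]
        values = source.values.reshape(-1, 1)
        if feat not in normalizers:
            normalizers[feat] = MinMaxScaler().fit(values)
        if feat != 'y':
            df_x[feat] = normalizers[feat].transform(values).ravel()

    series_y = normalizers["y"].transform(series_y.values.reshape(-1, 1))

    return df_x, series_y, normalizers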

In [4]:
features_to_minmax = X_train_5.columns.tolist()[17:] + [X_train_5.columns.tolist()[0]]
X_train_norm, y_train_norm, normalizers = minmax_scale(X_train_5, y_train_5, features_to_minmax)
X_valid_norm, y_valid_norm, _ = minmax_scale(X_valid_5, y_valid_5, features_to_minmax, normalizers=normalizers)
X_test_norm, y_test_norm, _ = minmax_scale(X_test_5, y_test_5, features_to_minmax, normalizers=normalizers)

In [5]:
from sklearn.metrics import mean_squared_error 
import numpy as np

def evaluate_model(model, X_valid, y_valid_true):
    predictions = model.predict(X_valid)
    mse = mean_squared_error(y_valid_true, predictions)
    print("Mean squared error on valid:",mse)
    return mse

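# Modified from assignment 6
def evaluate_model_2(model, X_valid, y_valid_true):
    predictions = model.predict(X_valid)
    mse = mean_squared_error(y_valid_true, predictions)
    print("Mean squared error on valid:",mse)
    # NOTE: inverse_transform is an affine map (scaled * range + min) meant for
    # target values, not for squared errors, so the number below is not an MSE
    # in original price units (see the question at the end of the notebook).
    normalized_mse = normalizers["y"].inverse_transform(np.array([mse]).reshape(1, -1))[0][0]
    print("Mean squared error on valid inverse transformed from normalization:",normalized_mse)
    return normalized_mse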

In [6]:
X_train_norm

Unnamed: 0,total_hours,dlvry_weekend,dlvry_bank_holiday,dlvry_day_sin,dlvry_day_cos,dlvry_weekday_sin,dlvry_weekday_cos,dlvry_hour_sin,dlvry_hour_cos,lasttrade_weekend,lasttrade_bank_holiday,lasttrade_day_sin,lasttrade_day_cos,lasttrade_weekday_sin,lasttrade_weekday_cos,lasttrade_hour_sin,lasttrade_hour_cos,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
0,0.153634,1.0,0.0,0.201299,0.979530,-0.781831,0.623490,-0.500000,0.866025,1.0,0.0,0.201299,0.979530,-0.781831,0.623490,-0.707107,-7.071068e-01,0.548624,0.342375,0.483483,0.450294,0.285111,0.16,0.314494,0.486784,0.301933,0.333574,0.511595,0.16,0.367678,0.596980,0.518639,0.445045,0.511702,0.16,0.378254,0.289166,0.470863,0.302977,0.509702,0.16
1,0.153634,1.0,0.0,0.201299,0.979530,-0.781831,0.623490,-0.500000,0.866025,1.0,0.0,0.201299,0.979530,-0.781831,0.623490,-0.866025,-5.000000e-01,0.481299,0.330252,0.445686,0.436614,0.286310,0.12,0.454300,0.502022,0.393267,0.399807,0.511702,0.12,0.359601,0.585893,0.482680,0.397880,0.509702,0.12,0.347688,0.289220,0.453237,0.292983,0.509118,0.12
2,0.153634,1.0,0.0,0.201299,0.979530,-0.781831,0.623490,-0.500000,0.866025,1.0,0.0,0.201299,0.979530,-0.781831,0.623490,-0.965926,-2.588190e-01,0.587085,0.350137,0.518211,0.504933,0.286467,0.08,0.444320,0.492698,0.366000,0.345792,0.509702,0.08,0.323973,0.585933,0.465448,0.384756,0.509118,0.08,0.327095,0.276471,0.387500,0.292822,0.517617,0.08
3,0.153634,1.0,0.0,0.201299,0.979530,-0.781831,0.623490,-0.500000,0.866025,1.0,0.0,0.201299,0.979530,-0.781831,0.623490,-1.000000,-1.836970e-16,0.579534,0.337970,0.496559,0.449216,0.283544,0.04,0.400298,0.492732,0.352933,0.330761,0.509118,0.04,0.299970,0.576436,0.401178,0.384545,0.517617,0.04,0.330916,0.281272,0.481115,0.309925,0.506423,0.04
4,0.153634,1.0,0.0,0.201299,0.979530,-0.781831,0.623490,-0.500000,0.866025,1.0,0.0,0.201299,0.979530,-0.781831,0.623490,-0.965926,2.588190e-01,0.546224,0.338014,0.486183,0.433712,0.282690,0.00,0.370640,0.484745,0.304200,0.330520,0.517617,0.00,0.304423,0.580012,0.492703,0.407004,0.506423,0.00,0.345163,0.284365,0.449191,0.295602,0.510396,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91507,0.445676,1.0,0.0,-0.743145,-0.669131,-0.974928,-0.222521,0.965926,-0.258819,0.0,0.0,-0.587785,-0.809017,-0.433884,-0.900969,-0.258819,9.659258e-01,0.539450,0.329423,0.484860,0.468701,0.283815,0.20,0.404216,0.498179,0.371867,0.361948,0.510634,0.20,0.330767,0.588913,0.474943,0.413040,0.510071,0.20,0.363295,0.317597,0.478957,0.346000,0.509847,0.20
91508,0.445676,1.0,0.0,-0.743145,-0.669131,-0.974928,-0.222521,0.965926,-0.258819,1.0,0.0,-0.743145,-0.669131,-0.974928,-0.222521,0.000000,1.000000e+00,0.549188,0.345122,0.501218,0.465882,0.284907,0.16,0.408692,0.495238,0.360133,0.363154,0.510071,0.16,0.342165,0.607073,0.490593,0.454380,0.509847,0.16,0.391012,0.320371,0.518345,0.331623,0.512149,0.16
91509,0.445676,1.0,0.0,-0.743145,-0.669131,-0.974928,-0.222521,0.965926,-0.258819,1.0,0.0,-0.743145,-0.669131,-0.974928,-0.222521,0.258819,9.659258e-01,0.552576,0.341285,0.491900,0.467125,0.284083,0.12,0.422776,0.510510,0.372000,0.410498,0.509847,0.12,0.374472,0.609140,0.529101,0.435500,0.512149,0.12,0.367569,0.296901,0.476079,0.307787,0.508224,0.12
91510,0.445676,1.0,0.0,-0.743145,-0.669131,-0.974928,-0.222521,0.965926,-0.258819,1.0,0.0,-0.743145,-0.669131,-0.974928,-0.222521,0.500000,8.660254e-01,0.563232,0.361214,0.501323,0.515961,0.283757,0.08,0.462694,0.512247,0.401200,0.388875,0.512149,0.08,0.347147,0.591655,0.487779,0.404197,0.508224,0.08,0.344321,0.284952,0.456924,0.309283,0.509084,0.08


In [7]:
from sklearn.ensemble import RandomForestRegressor

N_ESTIMATORS = 20
RANDOM_STATE = 452543634
RF_model = RandomForestRegressor(n_estimators=N_ESTIMATORS, random_state=RANDOM_STATE, n_jobs=4)

RF_model.fit(X_train_norm, y_train_norm)

RandomForestRegressor(n_estimators=20, n_jobs=4, random_state=452543634)

In [8]:
result = evaluate_model_2(RF_model, X_valid_norm, y_valid_norm)

Mean squared error on valid: 0.00045966155038540103
Mean squared error on valid inverse transformed from normalization: -74.14704736722378


### Simple LSTM

In [9]:
# LSTM_CELL_SIZE=64
BATCH_SIZE = 200
EPOCHS = 15
# DROPOUT_RATE=0.15

#### Create an artificial 3rd dimension

In [10]:
X_train_norm = X_train_norm.to_numpy().reshape(X_train_norm.shape[0], 1, X_train_norm.shape[1])
X_valid_norm = X_valid_norm.to_numpy().reshape(X_valid_norm.shape[0], 1, X_valid_norm.shape[1])
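
Keras LSTM layers expect 3D input of shape (samples, timesteps, features), so the reshape above turns each row into a sequence of length 1 that carries all columns as features. A toy sketch of the same operation on a small made-up array (the names `X_2d` and `X_3d` are hypothetical):

```python
import numpy as np

# Toy 2D matrix: 3 samples, 4 features (made-up values).
X_2d = np.arange(12, dtype=float).reshape(3, 4)

# Insert a timestep axis of length 1: (samples, features) -> (samples, 1, features).
X_3d = X_2d.reshape(X_2d.shape[0], 1, X_2d.shape[1])

# np.expand_dims gives the same result and states the intent more directly.
assert np.array_equal(X_3d, np.expand_dims(X_2d, axis=1))
print(X_3d.shape)
```

With a single timestep the recurrent part of the LSTM has no sequence to unroll over, so this layout treats the whole window as one flat feature vector.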

In [11]:
import tensorflow
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import backend as be
from tensorflow.keras.callbacks import ModelCheckpoint, LearningRateScheduler

column_count = np.shape(X_train_norm)[2]

be.clear_session()

# def schedule(epoch, lr):
#    ...
#    return lr

# lr_scheduler = LearningRateScheduler(schedule)

es = EarlyStopping(monitor='val_loss', mode='min', 
                   verbose=1, patience=5)
mc = ModelCheckpoint('best_model.h5', monitor='val_loss', 
                     mode='min', save_best_only=True)

# Build your whole LSTM model here!
model = Sequential()
model.add(LSTM(80, input_shape=(1, column_count), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(120))
# model.add(Bidirectional(LSTM(150)))
model.add(Dropout(0.4))
model.add(Dense(1))

# For the input shape, remember: we have a variable defining the "window" and the features in the window...
optimizer = tensorflow.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mean_squared_error', optimizer=optimizer)

# Fit on the train data
# USE the batch size parameter!
# Use validation data - warning, a tuple of stuff!
# Epochs as deemed necessary...
# You should avoid shuffling the data maybe.
# You can use the callbacks for LR schedule or model saving as seems fit.

history = model.fit(X_train_norm, y=y_train_norm,
                    validation_data=(X_valid_norm, y_valid_norm),
                    shuffle=False,
                    epochs=EPOCHS,
                    batch_size=BATCH_SIZE,
                    callbacks=[mc, es])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [12]:
result = evaluate_model_2(model, X_valid_norm, y_valid_norm)

Mean squared error on valid: 0.00042431547780852314
Mean squared error on valid inverse transformed from normalization: -74.1557329576381


#### Question: how can the inverse transformed version of the MSE be negative?
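
One possible explanation, offered as a sketch rather than a definitive answer: `MinMaxScaler.inverse_transform` applies the affine map `x = x_scaled * (max - min) + min`, which is meant for target values. An MSE is a squared quantity in scaled units, so feeding it through that map yields `mse * (max - min) + min`. For a small MSE this is close to the minimum of `y`, which is negative here. To express the MSE in original units, it would instead be multiplied by `(max - min)**2`, because the offset cancels in the difference `y_true - y_pred`. The example below uses synthetic data and hypothetical variable names to illustrate both computations.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# Synthetic targets and noisy "predictions" (illustration only).
rng = np.random.default_rng(0)
y_true = rng.normal(loc=0.0, scale=5.0, size=(1000, 1))
y_pred = y_true + rng.normal(loc=0.0, scale=1.0, size=(1000, 1))

scaler = MinMaxScaler().fit(y_true)
mse_scaled = mean_squared_error(scaler.transform(y_true), scaler.transform(y_pred))

# What evaluate_model_2 does: push the MSE through the affine inverse map.
# The result is mse_scaled * range + min(y), i.e. roughly min(y).
wrong = scaler.inverse_transform([[mse_scaled]])[0, 0]

# Rescaling a squared error: multiply by the squared data range, no offset.
data_range = scaler.data_range_[0]
mse_original = mse_scaled * data_range ** 2

# Reference: MSE computed directly in original units.
mse_direct = mean_squared_error(y_true, y_pred)

print("affine inverse of MSE:", wrong)
print("rescaled MSE:", mse_original, "direct MSE:", mse_direct)
```

Applied to the two results printed above, and assuming both used the same fitted `y` scaler, the two (MSE, output) pairs imply a target range of roughly 246. That would put the validation MSE in original units at about 28 for the random forest and about 26 for the LSTM. These are back-of-the-envelope estimates, not values recomputed from the data.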