# Experiment with an XGBoost Regression on today_energy data

Updated to use the 2020 inv01 data.

This notebook uses XGBoost regression as the model. It runs 5-fold cross-validation: in each fold the model is fitted on the training split and evaluated on the held-out split using MAE and RMSE. For each fold, the held-out rows are written together with their predictions to the /predictions folder, so the predicted values can be compared against the actual energy values (the target column in this dataset is `Avg_Energy`).

In [1]:
# pip install xgboost

In [2]:
import pandas as pd
import numpy as np
import xgboost as xgb

from utilities import data_basic_utility as databasic
from utilities import dataframe_utility as dfutil

from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

df_inv01_2020 = pd.read_csv("inv01_2020.csv")
thisFileName = "01a.RegressionXGboostV1"

print(df_inv01_2020.shape)
print(df_inv01_2020.info())
df_inv01_2020.head()

(12687, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12687 entries, 0 to 12686
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Inv01_Temp  12687 non-null  float64
 1   Wms01_Irr   12687 non-null  float64
 2   Wms01_Temp  12687 non-null  float64
 3   Avg_Energy  12687 non-null  float64
 4   Date        12687 non-null  object 
 5   Hour        12687 non-null  int64  
 6   Quarter     12687 non-null  int64  
dtypes: float64(4), int64(2), object(1)
memory usage: 693.9+ KB
None


|   | Inv01_Temp | Wms01_Irr | Wms01_Temp | Avg_Energy | Date | Hour | Quarter |
|---|---|---|---|---|---|---|---|
| 0 | 36.981818 | 943.272727 | 36.909091 | 0.6 | 2020-03-15 | 12 | 3 |
| 1 | 41.857143 | 939.785714 | 35.785714 | 0.571429 | 2020-03-15 | 12 | 4 |
| 2 | 45.11875 | 940.3125 | 35.0125 | 0.5 | 2020-03-15 | 13 | 1 |
| 3 | 47.273333 | 928.466667 | 35.206667 | 0.533333 | 2020-03-15 | 13 | 2 |
| 4 | 48.64 | 905.933333 | 35.46 | 0.533333 | 2020-03-15 | 13 | 3 |


### Feature Engineering

In [3]:
# df_inv01_2020.loc[:, 'Year'] = df_inv01_2020.Date.apply(lambda x: int(str(x).split('-')[0]))
# df_inv01_2020.loc[:, 'Month'] = df_inv01_2020.Date.apply(lambda x: int(str(x).split('-')[1]))
# df_inv01_2020.loc[:, 'Day'] = df_inv01_2020.Date.apply(lambda x: int(str(x).split('-')[2]))
df_inv01_2020 = df_inv01_2020.drop(['Date'], axis=1)
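
The commented-out lines in this cell split the `Date` string by hand. If the year, month, and day are ever wanted as features, one vectorized alternative is to parse the column once with `pd.to_datetime` and read the parts from the `.dt` accessor. The following is a minimal sketch on a small made-up frame; the `Year`, `Month`, and `Day` column names follow the commented-out code, and `add_date_parts` is a hypothetical helper, not part of the notebook's `utilities` package.

```python
import pandas as pd


def add_date_parts(df, date_col="Date"):
    """Return a copy of df with Year/Month/Day columns; the date column is dropped."""
    out = df.copy()
    # Parse the ISO-style date strings once instead of splitting each row
    parsed = pd.to_datetime(out[date_col], format="%Y-%m-%d")
    out["Year"] = parsed.dt.year
    out["Month"] = parsed.dt.month
    out["Day"] = parsed.dt.day
    return out.drop(columns=[date_col])


# Illustrative rows only; the real data comes from inv01_2020.csv
sample = pd.DataFrame({"Date": ["2020-03-15", "2020-12-01"], "Hour": [12, 9]})
sample_with_parts = add_date_parts(sample)
print(sample_with_parts)
```

Because the helper works on a copy, the original frame keeps its `Date` column until the result is assigned back.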

In [4]:
df_inv01_2020

|   | Inv01_Temp | Wms01_Irr | Wms01_Temp | Avg_Energy | Hour | Quarter |
|---|---|---|---|---|---|---|
| 0 | 36.981818 | 943.272727 | 36.909091 | 0.600000 | 12 | 3 |
| 1 | 41.857143 | 939.785714 | 35.785714 | 0.571429 | 12 | 4 |
| 2 | 45.118750 | 940.312500 | 35.012500 | 0.500000 | 13 | 1 |
| 3 | 47.273333 | 928.466667 | 35.206667 | 0.533333 | 13 | 2 |
| 4 | 48.640000 | 905.933333 | 35.460000 | 0.533333 | 13 | 3 |
| ... | ... | ... | ... | ... | ... | ... |
| 12682 | 41.621429 | 68.714286 | 19.335714 | 0.571429 | 8 | 4 |
| 12683 | 42.287500 | 98.125000 | 19.406250 | 0.875000 | 9 | 1 |
| 12684 | 42.946667 | 166.066667 | 19.726667 | 1.466667 | 9 | 2 |
| 12685 | 43.553333 | 96.266667 | 19.846667 | 0.866667 | 9 | 3 |


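Run a K-fold cross-validation with XGBoost and record the MAE and RMSE of each fold. The MAE gives the typical size of an error, while the RMSE penalizes large errors more heavily, so the gap between the two indicates how much the errors vary.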
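
To make the difference between the two metrics concrete, here is a small sketch with made-up numbers (not taken from the inverter data). Both error patterns have the same MAE, but the one with a single large miss has a higher RMSE. The `mae` and `rmse` helpers are written out by hand for illustration; the notebook itself uses the scikit-learn functions.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def rmse(y_true, y_pred):
    """Root mean squared error: squares each error before averaging."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


y_true = np.zeros(4)
even_errors = np.array([0.5, 0.5, 0.5, 0.5])    # every prediction is off by 0.5
one_big_error = np.array([0.0, 0.0, 0.0, 2.0])  # three exact, one off by 2.0

print("even errors:   MAE", mae(y_true, even_errors), "RMSE", rmse(y_true, even_errors))
print("one big error: MAE", mae(y_true, one_big_error), "RMSE", rmse(y_true, one_big_error))
```

When every error has the same size the two metrics coincide; an RMSE well above the MAE suggests a few predictions are far off.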

In [5]:
# Test a basic XGBoost Regression with KFolds Cross Validation
randomSeed = databasic.get_random_seed()
# random_state is the scikit-learn-style name for the seed parameter
model = xgb.XGBRegressor(objective="reg:squarederror", booster="gbtree", n_estimators=10, random_state=randomSeed)
modellingLog = ""   

targetColName = "Avg_Energy"
col_names = df_inv01_2020.columns
feature_cols = col_names.drop([targetColName])
trainFeatures = df_inv01_2020[feature_cols]
trainTargets = df_inv01_2020[targetColName]


In [6]:

import os

# Make sure the output folder exists before any fold is written
os.makedirs("./predictions", exist_ok=True)

lstMae = []
lstRmse = []
kfolds = KFold(n_splits=5, random_state=randomSeed, shuffle=True)
for k, (train_index, test_index) in enumerate(kfolds.split(trainFeatures)):
    # KFold yields positional indices, so select the rows with .iloc
    x_train = trainFeatures.iloc[train_index]
    x_vali = trainFeatures.iloc[test_index]

    y_train = trainTargets.iloc[train_index]
    y_vali = trainTargets.iloc[test_index]

    # Fit on this fold's training split and predict the held-out split
    model.fit(x_train, y_train)
    y_pred = model.predict(x_vali)

    # Compute the MAE (scikit-learn expects y_true first, then y_pred)
    mae = mean_absolute_error(y_vali, y_pred)
    lstMae.append(mae)

    # Compute the RMSE
    rmse = np.sqrt(mean_squared_error(y_vali, y_pred))
    lstRmse.append(rmse)

    print("Fold {0} MAE: {1}, RMSE: {2}".format(str(k), str(mae), str(rmse)))

    # Copy the held-out features so the original DataFrame is not modified
    dfPredicted = x_vali.copy()
    dfPredicted["Avg_Energy"] = y_vali
    dfPredicted["Avg_Energy_predicted"] = y_pred
    dfPredicted.to_csv("./predictions/" + thisFileName + "_KFold" + str(k) + ".csv", index=False)

print("Final Result")
print("----------")
print("Average Mean Absolute Error (MAE): " + str(np.mean(lstMae)))
print("Average Root Mean Squared Error (RMSE): " + str(np.mean(lstRmse)))


Fold 0 MAE: 0.45925616342993747, RMSE: 2.661261777060362
Fold 1 MAE: 0.45222406808155474, RMSE: 1.2362454798349884
Fold 2 MAE: 0.43811310198660863, RMSE: 1.4253872451387763
Fold 3 MAE: 0.5230430900663451, RMSE: 2.952917940521714
Fold 4 MAE: 0.47257124783736454, RMSE: 1.0125252350284082
Final Result
----------
Average Mean Absolute Error (MAE): 0.46904153428036216
Average Root Mean Squared Error (RMSE): 1.8576675355168497


Run 1:
- Average Mean Absolute Error (MAE): 0.46779408603269407
- Average Root Mean Squared Error (RMSE): 1.7394352909511028

Run 2:
- Average Mean Absolute Error (MAE): 0.46667602942588476
- Average Root Mean Squared Error (RMSE): 1.7791142535879072

Run 3:
- Average Mean Absolute Error (MAE): 0.47512829519461297
- Average Root Mean Squared Error (RMSE): 1.9174012042996345

In [7]:
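# Mean of the target, used to express the errors as percentages
realTimeEnergy = np.mean(df_inv01_2020["Avg_Energy"])
# Average the MAE and RMSE over the three runs recorded above
avgMae = np.mean([ 0.46779408603269407, 0.46667602942588476, 0.47512829519461297 ])
avgRmse = np.mean([ 1.7394352909511028, 1.7791142535879072, 1.9174012042996345 ])

# "Accuracy" here means 100% minus the MAE as a percentage of the mean target
predictionAccuracy = 100 - np.round((avgMae / realTimeEnergy) * 100, 2)
# RMSE as a percentage of the mean target
percentAvgAccuracyError = np.round((avgRmse / realTimeEnergy) * 100, 2)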

print("Predictions made to an accuracy of: " + str(predictionAccuracy) + "%")
print("Predictions Error: +/-" + str(percentAvgAccuracyError) + "%")

Predictions made to an accuracy of: 84.12%
Predictions Error: +/-61.24%
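
The two percentages come from dividing each error by the mean of the target. The arithmetic can be sketched as a small function; `normalized_errors` is a hypothetical name, and the inputs in the example are round illustrative numbers, not the notebook's results.

```python
def normalized_errors(mae, rmse, mean_target):
    """Express MAE and RMSE as percentages of the mean target value.

    Returns (accuracy_pct, error_pct), where accuracy_pct is
    100 minus the MAE percentage, as in the cell above.
    """
    if mean_target == 0:
        raise ValueError("mean_target must be non-zero")
    mae_pct = round(mae / mean_target * 100, 2)
    rmse_pct = round(rmse / mean_target * 100, 2)
    return 100 - mae_pct, rmse_pct


# Illustrative inputs: MAE 0.5 and RMSE 2.0 on a target whose mean is 2.5
accuracy_pct, error_pct = normalized_errors(mae=0.5, rmse=2.0, mean_target=2.5)
print("accuracy:", accuracy_pct, "% error: +/-", error_pct, "%")
```

Both figures depend on the scale of the target, so they are best read relative to each other: a high "accuracy" together with a large RMSE percentage is consistent with most predictions being close and a few being far off.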
