
- The UK-DALE dataset contains 5 houses. Each house has its own directory with a different number of channels, and each channel represents one appliance whose consumption is measured. The original data has one sample every 5-6 seconds. I summed all the appliance channels of each house to get a dataset on which I can predict the energy consumption of the whole house.
- The dataset also includes a metadata file for each house, which tells us, for example, what type of house it is.

- Importing the needed libraries, mounting Google Drive and listing the file names of each house.

In [None]:
import shutil
import pathlib
import os
import pandas as pd
import numpy as np
from google.colab import drive
import pickle

drive.mount("/content/drive")

house_1_files = os.listdir('/content/drive/MyDrive/IJS/UK-dale/house_1/')
house_2_files = os.listdir('/content/drive/MyDrive/IJS/UK-dale/house_2/')
house_3_files = os.listdir('/content/drive/MyDrive/IJS/UK-dale/house_3/')
house_4_files = os.listdir('/content/drive/MyDrive/IJS/UK-dale/house_4/')
house_5_files = os.listdir('/content/drive/MyDrive/IJS/UK-dale/house_5/')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from keras.callbacks import ModelCheckpoint
from keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense, Dropout
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
import xgboost as xgb

- Choosing the sample time interval and copying all the files of house 1 from Google Drive to the local disk.

In [None]:
granularity = input("Enter sample time interval (5min/60min): ")

Enter sample time interval (5min/60min): 60min


In [None]:
for file_name in house_1_files:
    full_file_name = os.path.join('/content/drive/MyDrive/IJS/UK-dale/house_1/', file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, "/content")

- Each house in the dataset needed a few preparation steps before its channels could be summed.
- House 1 was especially tricky: it is by far the biggest dataset and also has the most channels, so importing it alone almost exhausted the RAM. For house 1 I therefore imported the data gradually, summed it into subsums, deleted the variables that were no longer needed, and repeated the process, building bigger and bigger sums. I also had to lower the sampling rate from every 5-6 seconds to every 5 minutes.

- House 1 follows the same steps as the other houses, just split into many subsets.

- All other houses follow these steps:
  - Importing the data
  - Putting the data in order so that merging can begin
  - Merging every channel of the house with the fill dataset, which gives all channels the same shape so they can be summed later
    - In the merge, "how='outer'" makes pandas take the union of the two datasets instead of the intersection, while "right_index=True" and "left_index=True" tell pandas to match on the indexes, i.e. the timestamps.
  - Replacing all missing values with 0, since a missing value means the appliance was turned off or the meter was not active. Missing values also have to be 0 so that the sum works properly.
  - Converting the data types
  - Resampling the data to every 5 minutes instead of every 5-6 seconds
  - Summing everything up
  - Deleting variables that are no longer needed
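As a rough illustration of the steps above, here is a minimal sketch on toy data. The two channels, their values, the date range and the helper name `to_grid` are made up for the example; only the merge, fill, resample and sum pattern follows the notebook.

```python
import pandas as pd

def to_grid(channel, fill, interval):
    """Add the fill timestamps to one channel, zero-fill the gaps and resample."""
    # Outer merge on the index keeps the union of both sets of timestamps.
    merged = pd.merge(fill, channel, how='outer', left_index=True, right_index=True)
    # Missing readings are treated as "appliance off", i.e. 0.
    merged = merged.fillna(0)
    merged['energy_Wh'] = merged['energy_Wh'].astype('float32')
    # Average all samples that fall into each interval.
    return merged.resample(interval).mean()

# Hypothetical channels sampled every 6 seconds over different periods.
six_seconds = pd.Timedelta(seconds=6)
idx_a = pd.date_range('2013-01-01 00:00:00', periods=100, freq=six_seconds)
idx_b = pd.date_range('2013-01-01 00:05:00', periods=50, freq=six_seconds)
ch_a = pd.DataFrame({'energy_Wh': 100.0}, index=idx_a)
ch_b = pd.DataFrame({'energy_Wh': 40.0}, index=idx_b)

# Column-less frame whose index spans the whole period of interest.
fill = pd.DataFrame(index=pd.date_range('2013-01-01 00:00:00',
                                        '2013-01-01 00:15:00', freq='5min'))

# After resampling, both channels share the same index and can be added.
total = to_grid(ch_a, fill, '5min') + to_grid(ch_b, fill, '5min')
print(total)
```

Intervals in which every channel is 0 can then be filtered out with `total[(total != 0).all(axis=1)]`, as the notebook does for each house.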

- Function that sums two channels of house 1.

In [None]:
def sum2(x1,x2):
  # Empty frame whose index covers house 1's recording period at the chosen sample interval
  fill = pd.DataFrame({'ts':pd.date_range(start='2012-11-09 22:28:15',end='2017-04-26 17:35:53',freq=granularity)})
  fill.set_index('ts',inplace=True)

  resampled = []
  for x in (x1,x2):
    x['energy_Wh'] = x['energy_Wh'].astype('float32')
    # Outer merge on the index (timestamps) adds the fill timestamps to the channel
    x = pd.merge(fill, x,how='outer',right_index=True,left_index=True)
    # Missing values become 0 so they do not turn the sum into NaN
    x = x.fillna(0)
    x['energy_Wh'] = x['energy_Wh'].astype('float32')
    # Average all samples that fall into each interval
    resampled.append(x.resample(granularity).mean())
  return resampled[0]+resampled[1]

In [None]:
##############    hs_1    ##################


hs_1_ch1 = pd.read_csv('channel_1.dat', sep=' ',header=None)
hs_1_ch2 = pd.read_csv('channel_2.dat', sep=' ',header=None)
hs_1_ch3 = pd.read_csv('channel_3.dat', sep=' ',header=None)
hs_1_ch4 = pd.read_csv('channel_4.dat', sep=' ',header=None)
hs_1_ch5 = pd.read_csv('channel_5.dat', sep=' ',header=None)
hs_1_ch6 = pd.read_csv('channel_6.dat', sep=' ',header=None)

house_1_channels = [hs_1_ch1,hs_1_ch2,hs_1_ch3,hs_1_ch4,hs_1_ch5, hs_1_ch6]

for h in house_1_channels:
  h[0] = pd.to_datetime(h[0], unit='s')
  h.rename(columns={0:'ts',1:'energy_Wh'},inplace=True)
  h.set_index(['ts'], inplace=True)

hs_1_1 = sum2(hs_1_ch1,hs_1_ch2)
del hs_1_ch1,hs_1_ch2
hs_1_2 = sum2(hs_1_ch3,hs_1_ch4)
del hs_1_ch3, hs_1_ch4
hs_1_3 = sum2(hs_1_ch5,hs_1_ch6)
del hs_1_ch5, hs_1_ch6
H_1 = hs_1_1+hs_1_2+hs_1_3
del hs_1_1,hs_1_2,hs_1_3

hs_1_ch7 = pd.read_csv('channel_7.dat', sep=' ',header=None)
hs_1_ch8 = pd.read_csv('channel_8.dat', sep=' ',header=None)
hs_1_ch9 = pd.read_csv('channel_9.dat', sep=' ',header=None)
hs_1_ch10 = pd.read_csv('channel_10.dat', sep=' ',header=None)

house_1_channels = [hs_1_ch7,hs_1_ch8,hs_1_ch9,hs_1_ch10]

for h in house_1_channels:
  h[0] = pd.to_datetime(h[0], unit='s')
  h.rename(columns={0:'ts',1:'energy_Wh'},inplace=True)
  h.set_index(['ts'], inplace=True)

hs_1_4 = sum2(hs_1_ch7,hs_1_ch8)
del hs_1_ch7, hs_1_ch8
hs_1_5 = sum2(hs_1_ch9,hs_1_ch10)
del hs_1_ch9, hs_1_ch10
Hs_1_1 = H_1+hs_1_4+hs_1_5
Hs_1_1.dropna(inplace=True)
Hs_1_1 = Hs_1_1.astype('float32')
del hs_1_4,hs_1_5,H_1

hs_1_ch11 = pd.read_csv('channel_11.dat', sep=' ',header=None)
hs_1_ch12 = pd.read_csv('channel_12.dat', sep=' ',header=None)
hs_1_ch13 = pd.read_csv('channel_13.dat', sep=' ',header=None)
hs_1_ch14 = pd.read_csv('channel_14.dat', sep=' ',header=None)
hs_1_ch15 = pd.read_csv('channel_15.dat', sep=' ',header=None)
hs_1_ch16 = pd.read_csv('channel_16.dat', sep=' ',header=None)

house_1_channels = [hs_1_ch11,hs_1_ch12,hs_1_ch13,hs_1_ch14, hs_1_ch15, hs_1_ch16]

for h in house_1_channels:
  h[0] = pd.to_datetime(h[0], unit='s')
  h.rename(columns={0:'ts',1:'energy_Wh'},inplace=True)
  h.set_index(['ts'], inplace=True)

hs_1_6 = sum2(hs_1_ch11,hs_1_ch12)
del hs_1_ch11,hs_1_ch12
hs_1_7 = sum2(hs_1_ch13,hs_1_ch14)
del hs_1_ch13, hs_1_ch14
hs_1_8 = sum2(hs_1_ch15,hs_1_ch16)
del hs_1_ch15, hs_1_ch16
H_2 = hs_1_6+hs_1_7+hs_1_8
del hs_1_6,hs_1_7,hs_1_8

hs_1_ch17 = pd.read_csv('channel_17.dat', sep=' ',header=None)
hs_1_ch18 = pd.read_csv('channel_18.dat', sep=' ',header=None)
hs_1_ch19 = pd.read_csv('channel_19.dat', sep=' ',header=None)
hs_1_ch20 = pd.read_csv('channel_20.dat', sep=' ',header=None)

house_1_channels = [hs_1_ch17,hs_1_ch18,hs_1_ch19,hs_1_ch20]

for h in house_1_channels:
  h[0] = pd.to_datetime(h[0], unit='s')
  h.rename(columns={0:'ts',1:'energy_Wh'},inplace=True)
  h.set_index(['ts'], inplace=True)

hs_1_9 = sum2(hs_1_ch17,hs_1_ch18)
del hs_1_ch17, hs_1_ch18
hs_1_10 = sum2(hs_1_ch19,hs_1_ch20)
del hs_1_ch19, hs_1_ch20
Hs_1_2 = H_2+hs_1_9+hs_1_10
Hs_1_2.dropna(inplace=True)
Hs_1_2 = Hs_1_2.astype('float32')
del hs_1_9,hs_1_10,H_2

hs_1_ch21 = pd.read_csv('channel_21.dat', sep=' ',header=None)
hs_1_ch22 = pd.read_csv('channel_22.dat', sep=' ',header=None)
hs_1_ch23 = pd.read_csv('channel_23.dat', sep=' ',header=None)
hs_1_ch24 = pd.read_csv('channel_24.dat', sep=' ',header=None)
hs_1_ch25 = pd.read_csv('channel_25.dat', sep=' ',header=None)
hs_1_ch26 = pd.read_csv('channel_26.dat', sep=' ',header=None)

house_1_channels = [hs_1_ch21,hs_1_ch22,hs_1_ch23,hs_1_ch24,hs_1_ch25,hs_1_ch26]

for h in house_1_channels:
  h[0] = pd.to_datetime(h[0], unit='s')
  h.rename(columns={0:'ts',1:'energy_Wh'},inplace=True)
  h.set_index(['ts'], inplace=True)

hs_1_11 = sum2(hs_1_ch21,hs_1_ch22)
del hs_1_ch21,hs_1_ch22
hs_1_12 = sum2(hs_1_ch23,hs_1_ch24)
del hs_1_ch23, hs_1_ch24
hs_1_13 = sum2(hs_1_ch25,hs_1_ch26)
del hs_1_ch25, hs_1_ch26
H_3 = hs_1_11+hs_1_12+hs_1_13
del hs_1_11,hs_1_12,hs_1_13

hs_1_ch27 = pd.read_csv('channel_27.dat', sep=' ',header=None)
hs_1_ch28 = pd.read_csv('channel_28.dat', sep=' ',header=None)
hs_1_ch29 = pd.read_csv('channel_29.dat', sep=' ',header=None)
hs_1_ch30 = pd.read_csv('channel_30.dat', sep=' ',header=None)

house_1_channels = [hs_1_ch27,hs_1_ch28,hs_1_ch29,hs_1_ch30]

for h in house_1_channels:
  h[0] = pd.to_datetime(h[0], unit='s')
  h.rename(columns={0:'ts',1:'energy_Wh'},inplace=True)
  h.set_index(['ts'], inplace=True)

hs_1_14 = sum2(hs_1_ch27,hs_1_ch28)
del hs_1_ch27, hs_1_ch28
hs_1_15 = sum2(hs_1_ch29,hs_1_ch30)
del hs_1_ch29, hs_1_ch30
Hs_1_3 = H_3+hs_1_14+hs_1_15
Hs_1_3.dropna(inplace=True)
Hs_1_3 = Hs_1_3.astype('float32')
del H_3,hs_1_14,hs_1_15

hs_1_ch31 = pd.read_csv('channel_31.dat', sep=' ',header=None)
hs_1_ch32 = pd.read_csv('channel_32.dat', sep=' ',header=None)
hs_1_ch33 = pd.read_csv('channel_33.dat', sep=' ',header=None)
hs_1_ch34 = pd.read_csv('channel_34.dat', sep=' ',header=None)
hs_1_ch35 = pd.read_csv('channel_35.dat', sep=' ',header=None)
hs_1_ch36 = pd.read_csv('channel_36.dat', sep=' ',header=None)

house_1_channels = [hs_1_ch31,hs_1_ch32,hs_1_ch33,hs_1_ch34, hs_1_ch35, hs_1_ch36]

for h in house_1_channels:
  h[0] = pd.to_datetime(h[0], unit='s')
  h.rename(columns={0:'ts',1:'energy_Wh'},inplace=True)
  h.set_index(['ts'], inplace=True)

hs_1_16 = sum2(hs_1_ch31,hs_1_ch32)
del hs_1_ch31,hs_1_ch32
hs_1_17 = sum2(hs_1_ch33,hs_1_ch34)
del hs_1_ch33, hs_1_ch34
hs_1_18 = sum2(hs_1_ch35,hs_1_ch36)
del hs_1_ch35, hs_1_ch36
H_4 = hs_1_16+hs_1_17+hs_1_18
del hs_1_16,hs_1_17,hs_1_18

hs_1_ch37 = pd.read_csv('channel_37.dat', sep=' ',header=None)
hs_1_ch38 = pd.read_csv('channel_38.dat', sep=' ',header=None)
hs_1_ch39 = pd.read_csv('channel_39.dat', sep=' ',header=None)
hs_1_ch40 = pd.read_csv('channel_40.dat', sep=' ',header=None)

house_1_channels = [hs_1_ch37,hs_1_ch38,hs_1_ch39,hs_1_ch40]

for h in house_1_channels:
  h[0] = pd.to_datetime(h[0], unit='s')
  h.rename(columns={0:'ts',1:'energy_Wh'},inplace=True)
  h.set_index(['ts'], inplace=True)

hs_1_19 = sum2(hs_1_ch37,hs_1_ch38)
del hs_1_ch37, hs_1_ch38
hs_1_20 = sum2(hs_1_ch39,hs_1_ch40)
del hs_1_ch39, hs_1_ch40
Hs_1_4 = H_4+hs_1_19+hs_1_20
Hs_1_4.dropna(inplace=True)
Hs_1_4 = Hs_1_4.astype('float32')
del H_4,hs_1_19,hs_1_20

hs_1_ch41 = pd.read_csv('channel_41.dat', sep=' ',header=None)
hs_1_ch42 = pd.read_csv('channel_42.dat', sep=' ',header=None)
hs_1_ch43 = pd.read_csv('channel_43.dat', sep=' ',header=None)
hs_1_ch44 = pd.read_csv('channel_44.dat', sep=' ',header=None)
hs_1_ch45 = pd.read_csv('channel_45.dat', sep=' ',header=None)
hs_1_ch46 = pd.read_csv('channel_46.dat', sep=' ',header=None)

house_1_channels = [hs_1_ch41,hs_1_ch42,hs_1_ch43,hs_1_ch44, hs_1_ch45, hs_1_ch46]

for h in house_1_channels:
  h[0] = pd.to_datetime(h[0], unit='s')
  h.rename(columns={0:'ts',1:'energy_Wh'},inplace=True)
  h.set_index(['ts'], inplace=True)

hs_1_21 = sum2(hs_1_ch41,hs_1_ch42)
del hs_1_ch41,hs_1_ch42
hs_1_22 = sum2(hs_1_ch43,hs_1_ch44)
del hs_1_ch43, hs_1_ch44
hs_1_23 = sum2(hs_1_ch45,hs_1_ch46)
del hs_1_ch45, hs_1_ch46
H_5 = hs_1_21+hs_1_22+hs_1_23
del hs_1_21,hs_1_22,hs_1_23

hs_1_ch47 = pd.read_csv('channel_47.dat', sep=' ',header=None)
hs_1_ch48 = pd.read_csv('channel_48.dat', sep=' ',header=None)
hs_1_ch49 = pd.read_csv('channel_49.dat', sep=' ',header=None)
hs_1_ch50 = pd.read_csv('channel_50.dat', sep=' ',header=None)

house_1_channels = [hs_1_ch47,hs_1_ch48,hs_1_ch49,hs_1_ch50]

for h in house_1_channels:
  h[0] = pd.to_datetime(h[0], unit='s')
  h.rename(columns={0:'ts',1:'energy_Wh'},inplace=True)
  h.set_index(['ts'], inplace=True)

hs_1_24 = sum2(hs_1_ch47,hs_1_ch48)
del hs_1_ch47, hs_1_ch48
hs_1_25 = sum2(hs_1_ch49,hs_1_ch50)
del hs_1_ch49, hs_1_ch50
Hs_1_5 = H_5+hs_1_24+hs_1_25
Hs_1_5.dropna(inplace=True)
Hs_1_5 = Hs_1_5.astype('float32')
del H_5,hs_1_24,hs_1_25

hs_1_ch51 = pd.read_csv('channel_51.dat', sep=' ',header=None)
hs_1_ch52 = pd.read_csv('channel_52.dat', sep=' ',header=None)
hs_1_ch53 = pd.read_csv('channel_53.dat', sep=' ',header=None)

house_1_channels = [hs_1_ch51,hs_1_ch52,hs_1_ch53]

for h in house_1_channels:
  h[0] = pd.to_datetime(h[0], unit='s')
  h.rename(columns={0:'ts',1:'energy_Wh'},inplace=True)
  h.set_index(['ts'], inplace=True)

hs_1_26 = sum2(hs_1_ch51,hs_1_ch52)
del hs_1_ch51,hs_1_ch52
hs_1_26.dropna(inplace=True)
Hs_1_6 = sum2(hs_1_26,hs_1_ch53)
del hs_1_ch53
Hs_1_6.dropna(inplace=True)
Hs_1_6 = Hs_1_6.astype('float32')

hs_1 = Hs_1_1 + Hs_1_2 + Hs_1_3 + Hs_1_4 + Hs_1_5 + Hs_1_6
hs_1.dropna(inplace=True)
hs_1 = hs_1.astype('float32')
hs_1 = hs_1[(hs_1 != 0).all(1)]
del Hs_1_1,Hs_1_2,Hs_1_3,Hs_1_4,Hs_1_5,Hs_1_6,house_1_channels,house_1_files
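The batching above could also be written as a loop that keeps only a running total in memory. The following is a simplified sketch, not the notebook's exact grouping: `running_sum`, `load` and the toy frames are hypothetical, and it assumes every loaded frame has already been resampled onto the same index (as the output of `sum2` is).

```python
import pandas as pd

def running_sum(names, load):
    """Add frames one at a time so only the total and the current frame are alive."""
    total = None
    for name in names:
        frame = load(name)
        total = frame if total is None else total + frame
        del frame  # drop the local reference before loading the next one
    return total

# Toy stand-in for "read a channel file and resample it" (made-up data).
index = pd.date_range('2013-01-01', periods=3, freq='60min')
fake_channels = {
    'channel_1': pd.DataFrame({'energy_Wh': [1.0, 2.0, 3.0]}, index=index),
    'channel_2': pd.DataFrame({'energy_Wh': [10.0, 20.0, 30.0]}, index=index),
    'channel_3': pd.DataFrame({'energy_Wh': [100.0, 200.0, 300.0]}, index=index),
}

house_total = running_sum(list(fake_channels), fake_channels.get)
print(house_total)
```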

- Pickling the result so it can be loaded again later, which saves RAM in the meantime.

In [None]:
pickle_out1 = open("hs_1.pkl", "wb")
pickle.dump(hs_1, pickle_out1)
pickle_out1.close()
del hs_1
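A minimal sketch of the same save-and-reload round trip using `with` blocks, so that the file handles are closed automatically. The frame and the temporary file path are made up for the example.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-in for a summed house frame.
frame = pd.DataFrame({'energy_Wh': [1.0, 2.0]},
                     index=pd.date_range('2013-01-01', periods=2, freq='60min'))

path = os.path.join(tempfile.mkdtemp(), 'hs_example.pkl')

# Write the frame to disk, then drop it from memory.
with open(path, 'wb') as f:
    pickle.dump(frame, f)
del frame

# Later: load it back when it is needed again.
with open(path, 'rb') as f:
    restored = pickle.load(f)
print(restored)
```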

In [None]:
def read_house(num_channels):
  house = {}
  for i in range(0,num_channels,1):
    house['ch_'+str(i+1)] = pd.read_csv('channel_'+str(i+1)+'.dat',sep=' ', header=None)
  house['labels'] = pd.read_csv('labels.dat',sep=' ',header=None)
  return house

def sum_house(house, fill):
  house_array = []
  for i in range(1,len(house),1):
    house['ch_'+str(i)].rename(columns={0:'ts',1:'energy_Wh'},inplace=True)
    house['ch_'+str(i)]['ts'] = pd.to_datetime(house['ch_'+str(i)]['ts'], unit='s')
    house['ch_'+str(i)].set_index(['ts'], inplace=True)
    house['ch_'+str(i)] = pd.merge(fill,house['ch_'+str(i)],how='outer',right_index=True,left_index=True)
    house['ch_'+str(i)].replace(np.nan,0,inplace=True)
    house['ch_'+str(i)]['energy_Wh'] = house['ch_'+str(i)]['energy_Wh'].astype('float32')
    house_array.append(house['ch_'+str(i)].resample(granularity).aggregate(np.mean))                            #################### change 5min
  hs = house_array[0]+house_array[1]
  for i in range(2,len(house)-1,1):
    hs += house_array[i]
  hs = hs[(hs != 0).all(1)]
  return hs

In [None]:
'''##############    hs_1    ###################
for file_name in house_1_files:
    full_file_name = os.path.join('/content/drive/MyDrive/IJS/UK-dale/house_1/', file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, "/content")

house_1 = read_house(num_channels = 53)

fill = pd.DataFrame({'ts':pd.date_range(start='2012-11-09 22:28:15',end='2017-04-26 17:35:53',freq='5Min')})
fill.set_index('ts',inplace=True)
 
hs_1 = sum_house(house_1,fill)

del house_1,fill'''

In [None]:
##############    hs_2    ###################

for file_name in house_2_files:
    full_file_name = os.path.join('/content/drive/MyDrive/IJS/UK-dale/house_2/', file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, "/content")

house_2 = read_house(num_channels = 19)

fill = pd.DataFrame({'ts':pd.date_range(start='2013/02/17',end='2013/10/11',freq=granularity)})           ####################### change 5min
fill.set_index('ts',inplace=True)

hs_2 = sum_house(house_2,fill)

del house_2,fill

- Pickling the result so it can be loaded again later, which saves RAM in the meantime.

In [None]:
pickle_out2 = open("hs_2.pkl", "wb")
pickle.dump(hs_2, pickle_out2)
pickle_out2.close()
del hs_2

In [None]:
##############    hs_3    ##################

for file_name in house_3_files:
    full_file_name = os.path.join('/content/drive/MyDrive/IJS/UK-dale/house_3/', file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, "/content")

house_3 = read_house(num_channels = 5)

fill = pd.DataFrame({'ts':pd.date_range(start='2013/02/27 20:35:14',end='2013/04/08 05:15:05',freq=granularity)})     ############################# change 5min
fill.set_index('ts',inplace=True)

hs_3 = sum_house(house_3,fill)
hs_3.dropna(inplace=True)

del house_3,fill

- Pickling the result so it can be loaded again later, which saves RAM in the meantime.

In [None]:
pickle_out3 = open("hs_3.pkl", "wb")
pickle.dump(hs_3, pickle_out3)
pickle_out3.close()
del hs_3

In [None]:
##############    hs_4    ##################

for file_name in house_4_files:
    full_file_name = os.path.join('/content/drive/MyDrive/IJS/UK-dale/house_4/', file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, "/content")

house_4 = read_house(num_channels = 6)

fill = pd.DataFrame({'ts':pd.date_range(start='2013-03-09 14:40:00',end='2013-10-01 05:15:14',freq=granularity)})         ######################### change 5min
fill.set_index('ts',inplace=True)

hs_4 = sum_house(house_4,fill)

del house_4,fill

- Pickling the result so it can be loaded again later, which saves RAM in the meantime.

In [None]:
pickle_out4 = open("hs_4.pkl", "wb")
pickle.dump(hs_4, pickle_out4)
pickle_out4.close()
del hs_4

In [None]:
##############    hs_5    ##################

for file_name in house_5_files:
    full_file_name = os.path.join('/content/drive/MyDrive/IJS/UK-dale/house_5/', file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, "/content")

house_5 = read_house(num_channels = 25)

fill = pd.DataFrame({'ts':pd.date_range(start='2014-06-29 16:23:48',end='2014-11-13 18:00:03',freq=granularity)})           ###################### change 5min
fill.set_index('ts',inplace=True)

hs_5 = sum_house(house_5,fill)

del house_5,fill

- Pickling the result so it can be loaded again later, which saves RAM in the meantime.

In [None]:
pickle_out5 = open("hs_5.pkl", "wb")
pickle.dump(hs_5, pickle_out5)
pickle_out5.close()
del hs_5

- Deleting variables that are no longer needed to save RAM.

In [None]:
del house_2_files,house_3_files,house_4_files,house_5_files,full_file_name

- Importing the weather datasets for each year. They include "max_air_temp" and "min_air_temp" for two periods each day: from 9am to 9pm and from 9pm to 9am (day and night).

In [None]:
def read_weather(weather_dict, i):
  weather = pd.read_csv('weather_'+str(i)+'.csv')
  weather_dict[str(i)] = weather[:-1]   # dropping last sample, which just marks the end
  return weather_dict

In [None]:
weather_files = os.listdir('/content/drive/MyDrive/IJS/UK-dale/Weather/')

for file_name in weather_files:
    full_file_name = os.path.join('/content/drive/MyDrive/IJS/UK-dale/Weather/', file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, "/content")

weather_dict = {}
for i in range(2012,2018,1):
  weather_dict = read_weather(weather_dict, i)

- Choosing the useful features from the weather dataset and converting the timestamps into pandas DateTime format.
- Setting the timestamps as the index.
- Resampling (expanding) the dataset to show temperature for each hour and then using forward fill to fill in the missing values.
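To make the expansion step concrete, here is a small sketch with made-up observations. Only the column names come from the notebook; the values and dates are hypothetical.

```python
import pandas as pd

# Three twice-daily observations (invented values).
obs = pd.DataFrame({
    'ob_end_time': ['2013-01-01 09:00', '2013-01-01 21:00', '2013-01-02 09:00'],
    'max_air_temp': [5.0, 8.0, 4.0],
    'min_air_temp': [1.0, 3.0, 0.5],
})
obs['ob_end_time'] = pd.to_datetime(obs['ob_end_time'])
obs = obs.set_index('ob_end_time')

# Resampling to one row per hour leaves NaN in the hours without an
# observation; forward fill then carries the last observation onward.
hourly = obs.resample('60min').mean().ffill()
print(hourly.head())
```

Every hour between two observations therefore carries the temperatures of the most recent one.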

In [None]:
def sort_weather(weather_dict):
  for i in range(2012,2018,1):
    weather_dict[str(i)] = weather_dict[str(i)][['ob_end_time','max_air_temp','min_air_temp']]
    weather_dict[str(i)]['ob_end_time'] = pd.to_datetime(weather_dict[str(i)]['ob_end_time'])
    weather_dict[str(i)].set_index(['ob_end_time'],inplace=True)
    weather_dict[str(i)] = weather_dict[str(i)].resample('1H').aggregate(np.mean)
    weather_dict[str(i)].ffill(inplace=True)
  return weather_dict

In [None]:
weather_dict = sort_weather(weather_dict)

- Joining together weather datasets of each year.
- Reindexing the timestamps to a continuous hourly range (the raw observations come only twice a day), using forward fill to fill the missing values in the middle of the dataset and dropping the remaining missing values.
- Deleting variables that are no longer needed to save RAM.
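The effect of reindexing followed by forward fill and `dropna` can be seen on a tiny example. The two frames and the date range below are invented for illustration.

```python
import pandas as pd

# Two pieces of weather data with a gap between them (made-up values).
part_a = pd.DataFrame({'max_air_temp': [5.0, 6.0]},
                      index=pd.date_range('2013-01-01 02:00', periods=2, freq='60min'))
part_b = pd.DataFrame({'max_air_temp': [9.0]},
                      index=pd.date_range('2013-01-01 06:00', periods=1, freq='60min'))
weather = pd.concat([part_a, part_b])

# A continuous hourly index that starts before the first observation.
ix = pd.date_range('2013-01-01 00:00', '2013-01-01 07:00', freq='60min')

weather = weather.reindex(ix)   # missing hours appear as NaN rows
weather = weather.ffill()       # gaps in the middle take the previous value
weather = weather.dropna()      # leading rows have nothing to fill from
print(weather)
```

Only the rows before the first observation are dropped; gaps in the middle and at the end are filled.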

In [None]:
weather = pd.concat([weather_dict['2012'],weather_dict['2013'],weather_dict['2014'],weather_dict['2015'],\
                     weather_dict['2016'],weather_dict['2017']])
ix = pd.date_range(start='2012-01-01', end='2017-12-31', freq='H')
weather = weather.reindex(ix)
weather.ffill(inplace=True)
weather.dropna(inplace=True)

del weather_dict

- Importing all of the house sums using pickle.

In [None]:
with open("hs_1.pkl","rb") as pickle_h1:
  hs_1 = pickle.load(pickle_h1)

with open("hs_2.pkl","rb") as pickle_h2:
  hs_2 = pickle.load(pickle_h2)

with open("hs_3.pkl","rb") as pickle_h3:
  hs_3 = pickle.load(pickle_h3)

with open("hs_4.pkl","rb") as pickle_h4:
  hs_4 = pickle.load(pickle_h4)

with open("hs_5.pkl","rb") as pickle_h5:
  hs_5 = pickle.load(pickle_h5)

# putting all houses in a dictionary

houses = [hs_1,hs_2,hs_3,hs_4,hs_5]
house = {}
i=1
for h in houses:
  house[str(i)] = h
  i+=1

del hs_1,hs_2,hs_3,hs_4,hs_5

- Merging the house sums with the weather and using forward fill to fill in the resulting missing values (at this point the weather dataset only has a value for every hour, which needs to be forward filled to match the 5-minute resolution of the houses).
- Dropping the missing values at the top of the dataset, which remain even after the forward fill. We don't need them anyway, because they come from before consumption was being measured.
- Choosing the period in which we have both energy consumption and weather data, because everything else is just weather data.
- Writing in the type of each house for the "type" feature.
- Converting data to use less memory.
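A toy version of this merge, with invented values, shows how the hourly weather is spread over 5-minute consumption samples. The frames, dates and the house type used here are assumptions made for the example.

```python
import pandas as pd

# 5-minute consumption samples and hourly weather (made-up values).
house = pd.DataFrame({'energy_Wh': [10.0, 20.0, 30.0]},
                     index=pd.date_range('2013-01-01 10:00', periods=3, freq='5min'))
weather = pd.DataFrame({'max_air_temp': [4.0, 5.0, 6.0]},
                       index=pd.date_range('2013-01-01 09:00', periods=3, freq='60min'))

# Outer merge keeps every timestamp from both frames.
merged = pd.merge(house, weather, how='outer', left_index=True, right_index=True)
# Weather exists only on the hour, so carry it forward to the 5-minute rows.
merged['max_air_temp'] = merged['max_air_temp'].ffill()
# Rows without consumption (weather-only timestamps) still contain NaN.
merged = merged.dropna()
# Keep only the period covered by the consumption data.
start, end = pd.Timestamp('2013-01-01 10:00'), pd.Timestamp('2013-01-01 10:10')
merged = merged.loc[start:end].copy()
merged['type'] = 'flat'
merged['max_air_temp'] = merged['max_air_temp'].astype('float32')
print(merged)
```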

In [None]:
def make_dataset(house_dict, weather):
  date_ranges = [['2012-11-09 22:25:00','2017-04-26 17:30:00'],
                 ['2013-02-17 15:35:00','2013-10-10 05:15:00'],
                 ['2013-02-27 20:35:00','2013-04-08 05:15:00'],
                 ['2013-03-09 14:40:00','2013-10-01 05:15:00'],
                 ['2014-06-29 16:20:00','2014-11-13 17:55:00']]
  types = ['end-of-terrace',
           'end-of-terrace',
           0,
           'mid-terrace',
           'flat'] 
  for i in range(1,len(house_dict)+1,1):
    house_dict[str(i)] = pd.merge(house_dict[str(i)],weather,how='outer',right_index=True,left_index=True)
    house_dict[str(i)]['max_air_temp'].ffill(inplace=True)
    house_dict[str(i)]['min_air_temp'].ffill(inplace=True)
    house_dict[str(i)].dropna(inplace=True)
    house_dict[str(i)] = house_dict[str(i)][date_ranges[i-1][0]:date_ranges[i-1][1]]
    house_dict[str(i)]['type'] = types[i-1]
    house_dict[str(i)]['max_air_temp'] = house_dict[str(i)]['max_air_temp'].astype('float32')
    house_dict[str(i)]['min_air_temp'] = house_dict[str(i)]['min_air_temp'].astype('float32')
  return house_dict

In [None]:
house = make_dataset(house, weather)

- Reindexing the house data to have a sample for every five minutes, so that the values can be shifted correctly. Before this step a missing value did not have its own sample containing NaNs; the sample was simply absent, so some days had fewer samples than others.
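The reason this matters for shifting can be shown on three invented samples where the 00:10 reading is absent. `shift` moves values by rows, not by time, so without the reindex the "next sample" of 00:05 would really be 10 minutes ahead.

```python
import pandas as pd

# The 00:10 sample is missing entirely (made-up data).
idx = pd.to_datetime(['2013-01-01 00:00', '2013-01-01 00:05', '2013-01-01 00:15'])
house = pd.DataFrame({'energy_Wh': [1.0, 2.0, 4.0]}, index=idx)

# Shifting by one row pairs 00:05 with the 00:15 value.
naive_next = house['energy_Wh'].shift(-1)

# After reindexing, the gap becomes an explicit NaN row, so one row
# always corresponds to exactly five minutes.
full_index = pd.date_range(idx[0], idx[-1], freq='5min')
full = house.reindex(full_index)
true_next = full['energy_Wh'].shift(-1)

print(naive_next)
print(true_next)
```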

In [None]:
if granularity=="5min":
  ix = pd.date_range(start='2012-11-09 22:25:00', end='2017-04-26 17:30:00', freq='5Min')    
  house['1'] = house['1'].reindex(ix)
  ix = pd.date_range(start='2013-02-17 15:35:00', end='2013-10-10 05:15:00', freq='5Min')
  house['2'] = house['2'].reindex(ix)
  ix = pd.date_range(start='2013-02-27 20:35:00', end='2013-04-08 05:15:00', freq='5Min')
  house['3'] = house['3'].reindex(ix)
  ix = pd.date_range(start='2013-03-09 14:40:00', end='2013-10-01 05:15:00', freq='5Min')
  house['4'] = house['4'].reindex(ix)
  ix = pd.date_range(start='2014-06-29 16:20:00', end='2014-11-13 17:55:00', freq='5Min')
  house['5'] = house['5'].reindex(ix)

- At 5-minute granularity the number of samples increases after this step, because rows of NaNs are added for the missing values.

In [None]:
house['5'].head()

Unnamed: 0,energy_Wh,max_air_temp,min_air_temp,type
2014-06-29 17:00:00,1548.897461,15.5,10.5,flat
2014-06-29 18:00:00,2475.766602,15.5,10.5,flat
2014-06-29 19:00:00,1224.447388,15.5,10.5,flat
2014-06-29 20:00:00,1357.487549,15.5,10.5,flat
2014-06-29 21:00:00,1362.998901,16.799999,10.4,flat


- Creating the energy features. "energy_per_hour", "energy_hour_diff", "energy_5_min", "energy_10_min" and "energy_15_min" are only created at 5-minute granularity; the others are created at both granularities:
  - energy_Wh_houseMean: mean energy consumption of each house
  - energy_Wh_daymean: mean energy consumption of the previous day
  - energy_per_hour: mean energy consumption of the current hour
  - energy_next_hour: mean energy consumption one hour into the future
  - energy_hour_2: mean energy consumption two hours into the future
  - energy_hour_3: mean energy consumption three hours into the future
  - energy_previous_hour: mean energy consumption of the previous hour
  - energy_hour_diff: difference between energy_per_hour and energy_previous_hour
  - energy_diff: difference between the current energy consumption and that of the previous sample (5 minutes or one hour earlier, depending on the granularity)
  - energy_5_min: energy consumption 5 minutes into the future
  - energy_10_min: energy consumption 10 minutes into the future
  - energy_15_min: energy consumption 15 minutes into the future
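A short sketch of how such lead and lag features can be built with `shift` at hourly granularity. The data is invented (two days at constant consumption), and only a few of the features are shown.

```python
import pandas as pd

# Two days of hourly data: 1.0 on the first day, 3.0 on the second (made up).
index = pd.date_range('2013-01-01', periods=48, freq='60min')
df = pd.DataFrame({'energy_Wh': [1.0] * 24 + [3.0] * 24}, index=index)

# Negative shifts pull future values back, positive shifts push past values forward.
df['energy_next_hour'] = df['energy_Wh'].shift(-1)
df['energy_previous_hour'] = df['energy_Wh'].shift(1)
df['energy_diff'] = df['energy_Wh'] - df['energy_previous_hour']

# Daily mean, spread over the hours of its day, then moved one day (24 rows) later.
daily = df['energy_Wh'].resample('D').mean()
df['energy_Wh_daymean'] = daily.reindex(df.index).ffill().shift(24)

print(df.iloc[22:26])
```

The first and last rows of each shifted column are NaN because there is no earlier or later sample to take the value from.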


In [None]:
def create_energy_features(house_dict,granularity):
  if granularity=='5min':
    for i in range(1,len(house_dict)+1,1):
      house_dict[str(i)]['energy_Wh_houseMean'] = np.mean(house_dict[str(i)]['energy_Wh'])
      house_dict[str(i)]['energy_Wh_daymean'] = house_dict[str(i)]['energy_Wh'].dropna().resample('1d').mean()
      house_dict[str(i)]['energy_Wh_daymean'].ffill(axis=0, inplace=True)
      house_dict[str(i)]['energy_Wh_daymean'] = house_dict[str(i)].energy_Wh_daymean.shift(24*12)
      house_dict[str(i)]['energy_per_hour'] = house_dict[str(i)]['energy_Wh'].dropna().resample('1H').mean()
      house_dict[str(i)]['energy_per_hour'].ffill(axis=0,inplace=True)
      house_dict[str(i)]['energy_next_hour'] = house_dict[str(i)]['energy_per_hour'].shift(-12)
      house_dict[str(i)]['energy_hour_2'] = house_dict[str(i)]['energy_per_hour'].shift(-12*2)
      house_dict[str(i)]['energy_hour_3'] = house_dict[str(i)]['energy_per_hour'].shift(-12*3)
      house_dict[str(i)]['energy_previous_hour'] = house_dict[str(i)]['energy_per_hour'].shift(12)
      house_dict[str(i)]['energy_hour_diff'] = house_dict[str(i)]['energy_per_hour'] - house_dict[str(i)]['energy_previous_hour']
      house_dict[str(i)]['energy_diff'] = house_dict[str(i)]['energy_Wh'] - house_dict[str(i)]['energy_Wh'].shift(1)
      house_dict[str(i)]['energy_5_min'] = house_dict[str(i)]['energy_Wh'].shift(-1)
      house_dict[str(i)]['energy_10_min'] = house_dict[str(i)]['energy_Wh'].shift(-2)
      house_dict[str(i)]['energy_15_min'] = house_dict[str(i)]['energy_Wh'].shift(-3)
  elif granularity=='60min':
    for i in range(1,len(house_dict)+1,1):
      house_dict[str(i)]['energy_next_hour'] = house_dict[str(i)]['energy_Wh'].shift(-1)
      house_dict[str(i)]['energy_hour_2'] = house_dict[str(i)]['energy_Wh'].shift(-2)
      house_dict[str(i)]['energy_hour_3'] = house_dict[str(i)]['energy_Wh'].shift(-3)
      house_dict[str(i)]['energy_Wh_daymean'] = house_dict[str(i)]['energy_Wh'].dropna().resample('1d').mean()
      house_dict[str(i)]['energy_Wh_daymean'].ffill(axis=0, inplace=True)
      house_dict[str(i)]['energy_Wh_daymean'] = house_dict[str(i)].energy_Wh_daymean.shift(24)
      house_dict[str(i)]['energy_previous_hour'] = house_dict[str(i)]['energy_Wh'].shift(1)
      house_dict[str(i)]['energy_diff'] = house_dict[str(i)]['energy_Wh'] - house_dict[str(i)]['energy_Wh'].shift(1)
      house_dict[str(i)]['energy_Wh_houseMean'] = np.mean(house_dict[str(i)]['energy_Wh'])
  return house_dict

In [None]:
house = create_energy_features(house,granularity)
house['1'].head()

Unnamed: 0,energy_Wh,max_air_temp,min_air_temp,type,energy_next_hour,energy_hour_2,energy_hour_3,energy_Wh_daymean,energy_previous_hour,energy_diff,energy_Wh_houseMean
2012-11-09 23:00:00,607.758423,10.5,7.1,end-of-terrace,1092.620361,1494.370728,201.601059,,,,702.527588
2012-11-10 00:00:00,1092.620361,10.5,7.1,end-of-terrace,1494.370728,201.601059,164.422775,,607.758423,484.861938,702.527588
2012-11-10 01:00:00,1494.370728,10.5,7.1,end-of-terrace,201.601059,164.422775,173.645035,,1092.620361,401.750366,702.527588
2012-11-10 02:00:00,201.601059,10.5,7.1,end-of-terrace,164.422775,173.645035,164.823685,,1494.370728,-1292.769653,702.527588
2012-11-10 03:00:00,164.422775,10.5,7.1,end-of-terrace,173.645035,164.823685,211.479492,,201.601059,-37.178284,702.527588


- Making the "part_of_day", "part_of_year" and "weekend" features:

  - part_of_day: each day divided into periods (0 ... least active, 5 ... most active)
  - part_of_year: each year divided into periods (1 ... least active, 4 ... most active)
  - weekend: workdays are represented by a 0 and weekends by a 1
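One way to build these features is a single lookup per timestamp, which applies every group in one pass. The sketch below is an alternative formulation, not the notebook's code: the hour and month groups are read off the lists in `create_time_features`, under the assumption that the stray `1` in the class-3 hour list is unintended, so hours 5 and 23 stay in class 1 here.

```python
import pandas as pd

# Assumed groups (taken from the lists in create_time_features).
hour_groups = {0: [0, 1, 2, 3, 4], 1: [5, 23], 2: [9, 10, 11, 12, 13, 14],
               3: [6, 7, 8, 15, 22], 4: [16, 21], 5: [17, 18, 19, 20]}
month_groups = {4: [4, 5, 6], 3: [3, 7, 12], 2: [2, 8, 9], 1: [1, 10, 11]}

# Invert the groups into "hour -> class" and "month -> class" lookups.
hour_to_part = {h: part for part, hours in hour_groups.items() for h in hours}
month_to_part = {m: part for part, months in month_groups.items() for m in months}

# Hypothetical timestamps: a Friday night, a Saturday evening, a Monday night.
idx = pd.DatetimeIndex(['2012-11-09 23:00', '2012-11-10 18:00', '2013-05-06 03:00'])

features = pd.DataFrame(index=idx)
features['part_of_day'] = pd.Series(idx.hour, index=idx).map(hour_to_part).astype('int8')
features['part_of_year'] = pd.Series(idx.month, index=idx).map(month_to_part).astype('int8')
# dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday.
features['weekend'] = (idx.dayofweek >= 5).astype('int8')
print(features)
```

Because each original value is looked up exactly once, an already assigned class can never be matched again by a later group.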

In [None]:
def create_time_features(house_dict):
  for i in range(1,len(house_dict)+1,1):
    house_dict[str(i)].reset_index(inplace=True)
    house_dict[str(i)]['hour'] = pd.DatetimeIndex(house_dict[str(i)]['index']).hour
    house_dict[str(i)]['month'] = pd.DatetimeIndex(house_dict[str(i)]['index']).month
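    # Group the hours into activity classes. The replacements run one after another on the
    # same column, so each list is matched against the values left by the previous calls.
    house_dict[str(i)]['hour'].replace([0,1,2,3,4],0, inplace=True)
    house_dict[str(i)]['hour'].replace([5,23],1,inplace=True)
    house_dict[str(i)]['hour'].replace([9,10,11,12,13,14],2,inplace=True)
    # NOTE: the 1 in this list matches class 1 assigned above (hours 5 and 23), not hour 1,
    # so those hours end up in class 3 and class 1 never appears in the result.
    house_dict[str(i)]['hour'].replace([6,7,1,8,15,22],3,inplace=True)
    house_dict[str(i)]['hour'].replace([16,21],4,inplace=True)
    house_dict[str(i)]['hour'].replace([17,18,19,20],5,inplace=True)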
    house_dict[str(i)]['month'].replace([4,5,6],4,inplace=True)
    house_dict[str(i)]['month'].replace([3,7,12],3,inplace=True)
    house_dict[str(i)]['month'].replace([2,8,9],2,inplace=True)
    house_dict[str(i)]['month'].replace([1,10,11],1,inplace=True)
    house_dict[str(i)].rename(columns={'hour':'part_of_day'},inplace=True)
    house_dict[str(i)].rename(columns={'month':'part_of_year'},inplace=True)
    house_dict[str(i)]['weekend'] = pd.DatetimeIndex(house_dict[str(i)]['index']).dayofweek
    house_dict[str(i)]['weekend'].replace([0,1,2,3,4],0,inplace=True)
    house_dict[str(i)]['weekend'].replace([5,6],1,inplace=True)
    house_dict[str(i)].set_index(['index'],inplace=True)
  return house_dict

In [None]:
house = create_time_features(house)
house['1'].head()

index,energy_Wh,max_air_temp,min_air_temp,type,energy_next_hour,energy_hour_2,energy_hour_3,energy_Wh_daymean,energy_previous_hour,energy_diff,energy_Wh_houseMean,part_of_day,part_of_year,weekend
2012-11-09 23:00:00,607.758423,10.5,7.1,end-of-terrace,1092.620361,1494.370728,201.601059,,,,702.527588,3,1,0
2012-11-10 00:00:00,1092.620361,10.5,7.1,end-of-terrace,1494.370728,201.601059,164.422775,,607.758423,484.861938,702.527588,0,1,1
2012-11-10 01:00:00,1494.370728,10.5,7.1,end-of-terrace,201.601059,164.422775,173.645035,,1092.620361,401.750366,702.527588,0,1,1
2012-11-10 02:00:00,201.601059,10.5,7.1,end-of-terrace,164.422775,173.645035,164.823685,,1494.370728,-1292.769653,702.527588,0,1,1
2012-11-10 03:00:00,164.422775,10.5,7.1,end-of-terrace,173.645035,164.823685,211.479492,,201.601059,-37.178284,702.527588,0,1,1


- Converting all data types to use less RAM.

In [None]:
def convert_dtypes(house_dict,granularity):
  for i in range(1,len(house_dict)+1,1):
    house_dict[str(i)].dropna(inplace=True)     # TODO: if I drop NaNs later I could save some samples
    house_dict[str(i)]['part_of_day'] = house_dict[str(i)]['part_of_day'].astype('int8')
    house_dict[str(i)]['part_of_year'] = house_dict[str(i)]['part_of_year'].astype('int8')
    house_dict[str(i)]['weekend'] = house_dict[str(i)]['weekend'].astype('int8')
    house_dict[str(i)]['energy_Wh_daymean'] = house_dict[str(i)]['energy_Wh_daymean'].astype('float32')
    house_dict[str(i)]['energy_Wh_houseMean'] = house_dict[str(i)]['energy_Wh_houseMean'].astype('float32')
    house_dict[str(i)]['energy_diff'] = house_dict[str(i)]['energy_diff'].astype('float32')
    if granularity == '5min':
      house_dict[str(i)]['energy_5_min'] = house_dict[str(i)]['energy_5_min'].astype('float32')
      house_dict[str(i)]['energy_10_min'] = house_dict[str(i)]['energy_10_min'].astype('float32')
      house_dict[str(i)]['energy_15_min'] = house_dict[str(i)]['energy_15_min'].astype('float32')
  return house_dict

In [None]:
house = convert_dtypes(house, granularity)
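
- A rough illustration of why the down-casting saves memory. The array size is a made-up number; only the ratio between the dtypes matters.

```python
import numpy as np

n_rows = 100_000  # hypothetical number of samples

# Flag-like columns (part_of_day, part_of_year, weekend) hold small integers.
flags_int64 = np.zeros(n_rows, dtype='int64')
flags_int8 = flags_int64.astype('int8')

# Energy columns are floats; float32 halves the footprint of float64.
energy_float64 = np.zeros(n_rows, dtype='float64')
energy_float32 = energy_float64.astype('float32')

print(flags_int64.nbytes, flags_int8.nbytes)        # bytes used by each integer array
print(energy_float64.nbytes, energy_float32.nbytes)  # bytes used by each float array
print(np.iinfo('int8').max)                          # largest value an int8 can hold
```

- int8 is only safe while the values stay within -128 to 127, which holds for the bucket and weekend columns.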

- Joining all houses into "data".

In [None]:
data = pd.concat([house['1'],house['2'],house['3'],house['4'],house['5']])
data['energy_Wh'] = data['energy_Wh'].astype('int32')
data.corr(numeric_only=True)  # 'type' is a string column; recent pandas versions need it excluded explicitly

Unnamed: 0,energy_Wh,max_air_temp,min_air_temp,energy_next_hour,energy_hour_2,energy_hour_3,energy_Wh_daymean,energy_previous_hour,energy_diff,energy_Wh_houseMean,part_of_day,part_of_year,weekend
energy_Wh,1.0,-0.09855,-0.070684,0.51656,0.288317,0.205333,0.25621,0.516543,0.491725,0.208212,0.326541,-0.099702,-0.011875
max_air_temp,-0.09855,1.0,0.935067,-0.100264,-0.098487,-0.090861,-0.078072,-0.095748,-0.00286,0.132912,-0.116553,0.130549,-0.006048
min_air_temp,-0.070684,0.935067,1.0,-0.072699,-0.071202,-0.065146,-0.066197,-0.068526,-0.002204,0.149848,-0.084693,0.058732,-0.00173
energy_next_hour,0.51656,-0.100264,-0.072699,1.0,0.516569,0.288292,0.254649,0.28823,0.232237,0.208259,0.331646,-0.099615,-0.013167
energy_hour_2,0.288317,-0.098487,-0.071202,0.516569,1.0,0.516516,0.25424,0.205315,0.084427,0.20836,0.277846,-0.099565,-0.014368
energy_hour_3,0.205333,-0.090861,-0.065146,0.288292,0.516516,1.0,0.253915,0.139652,0.066808,0.208386,0.164574,-0.099542,-0.014976
energy_Wh_daymean,0.25621,-0.078072,-0.066197,0.254649,0.25424,0.253915,1.0,0.2618,-0.005667,0.513697,-0.001826,-0.242393,-0.042271
energy_previous_hour,0.516543,-0.095748,-0.068526,0.28823,0.205315,0.139652,0.2618,1.0,-0.491593,0.208044,0.2704,-0.099582,-0.009668
energy_diff,0.491725,-0.00286,-0.002204,0.232237,0.084427,0.066808,-0.005667,-0.491593,1.0,0.000181,0.057118,-0.000132,-0.002239
energy_Wh_houseMean,0.208212,0.132912,0.149848,0.208259,0.20836,0.208386,0.513697,0.208044,0.000181,1.0,-0.00036,-0.252159,-0.005165


In [None]:
data.head()

Unnamed: 0_level_0,energy_Wh,max_air_temp,min_air_temp,type,energy_next_hour,energy_hour_2,energy_hour_3,energy_Wh_daymean,energy_previous_hour,energy_diff,energy_Wh_houseMean,part_of_day,part_of_year,weekend
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
2012-11-11 00:00:00,181,9.4,4.1,end-of-terrace,225.359573,192.141846,232.582809,560.627258,184.402435,-3.12796,702.527588,0,1,1
2012-11-11 01:00:00,225,9.4,4.1,end-of-terrace,192.141846,232.582809,178.039078,560.627258,181.274475,44.085098,702.527588,0,1,1
2012-11-11 02:00:00,192,9.4,4.1,end-of-terrace,232.582809,178.039078,258.700439,560.627258,225.359573,-33.217728,702.527588,0,1,1
2012-11-11 03:00:00,232,9.4,4.1,end-of-terrace,178.039078,258.700439,204.097733,560.627258,192.141846,40.440964,702.527588,0,1,1
2012-11-11 04:00:00,178,9.4,4.1,end-of-terrace,258.700439,204.097733,413.0867,560.627258,232.582809,-54.543732,702.527588,0,1,1


In [None]:
#data.drop(['level_0'], axis=1, inplace=True)

- Splitting data into predictors and real values.
- OneHotEncoding on the "type" feature.

In [None]:
def hour_prediction_data(data, drop_features=None):
  # Copy the argument: a mutable default list would keep growing across calls via extend().
  drop_features = list(drop_features or [])
  if granularity=='60min':
    drop_features.extend(['energy_next_hour','energy_hour_2','energy_hour_3'])
    X = data.drop(drop_features,axis=1)
    Y = data[['energy_next_hour','energy_hour_2','energy_hour_3']]
    if 'type' in drop_features:
      return X,Y
    else:
      # one-hot encode the house type
      X_e = pd.get_dummies(X, columns=["type"])
      return X_e,Y
  else:
    return 0,0
def min_prediction_data(data, drop_features=None):
  # Same copy as above, so the caller's list is never modified.
  drop_features = list(drop_features or [])
  if granularity=='5min':
    drop_features.extend(['energy_5_min','energy_15_min','energy_10_min','energy_next_hour','energy_hour_2','energy_hour_3',\
                          'energy_per_hour','energy_previous_hour','energy_hour_diff'])
    x = data.drop(drop_features,axis=1)
    y = data[['energy_5_min','energy_10_min','energy_15_min']]
    if 'type' in drop_features:
      return x,y
    else:
      x_e = pd.get_dummies(x, columns=["type"])
      return x_e,y
  else:
    return 0,0
  

In [None]:
X,Y = hour_prediction_data(data)
x,y = min_prediction_data(data)

- Data used for training each house individually (houses).
- Dropping features that are not useful when predicting on individual houses.

In [None]:
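def indiv_house_data(house_dict, drop_features=None):
  # Copy the argument so repeated calls do not accumulate entries in a shared default list.
  drop_features = list(drop_features or [])
  drop_features.extend(['type','energy_Wh_houseMean'])
  for i in range(1,len(house_dict)+1,1):   # iterate over the argument, not the global `house`
    house_dict[str(i)].drop(drop_features,axis=1,inplace=True)
    house_dict[str(i)].dropna(inplace=True)
  return house_dict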

In [None]:
#if granularity == '5min':
houses = indiv_house_data(house)

- Data used for training when all houses are aggregated, predictions 1-3 hours.

In [None]:
X

Unnamed: 0_level_0,energy_Wh,max_air_temp,min_air_temp,energy_Wh_daymean,energy_previous_hour,energy_diff,energy_Wh_houseMean,part_of_day,part_of_year,weekend,type_0,type_end-of-terrace,type_flat,type_mid-terrace
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
2012-11-11 00:00:00,181,9.4,4.1,560.627258,184.402435,-3.127960,702.527588,0,1,1,0,1,0,0
2012-11-11 01:00:00,225,9.4,4.1,560.627258,181.274475,44.085098,702.527588,0,1,1,0,1,0,0
2012-11-11 02:00:00,192,9.4,4.1,560.627258,225.359573,-33.217728,702.527588,0,1,1,0,1,0,0
2012-11-11 03:00:00,232,9.4,4.1,560.627258,192.141846,40.440964,702.527588,0,1,1,0,1,0,0
2012-11-11 04:00:00,178,9.4,4.1,560.627258,232.582809,-54.543732,702.527588,0,1,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2014-11-13 10:00:00,888,9.2,7.1,1014.129333,827.150146,61.167542,1078.384277,2,1,0,0,0,1,0
2014-11-13 11:00:00,934,9.2,7.1,1014.129333,888.317688,46.355774,1078.384277,2,1,0,0,0,1,0
2014-11-13 12:00:00,2823,9.2,7.1,1014.129333,934.673462,1889.303833,1078.384277,2,1,0,0,0,1,0
2014-11-13 13:00:00,1199,9.2,7.1,1014.129333,2823.977295,-1624.763184,1078.384277,2,1,0,0,0,1,0


In [None]:
Y

Unnamed: 0_level_0,energy_next_hour,energy_hour_2,energy_hour_3
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2012-11-11 00:00:00,225.359573,192.141846,232.582809
2012-11-11 01:00:00,192.141846,232.582809,178.039078
2012-11-11 02:00:00,232.582809,178.039078,258.700439
2012-11-11 03:00:00,178.039078,258.700439,204.097733
2012-11-11 04:00:00,258.700439,204.097733,413.086700
...,...,...,...
2014-11-13 10:00:00,934.673462,2823.977295,1199.214111
2014-11-13 11:00:00,2823.977295,1199.214111,2329.972168
2014-11-13 12:00:00,1199.214111,2329.972168,1198.886230
2014-11-13 13:00:00,2329.972168,1198.886230,1732.475220


- Data used for training when all houses are aggregated, predictions 5-15 mins.

In [None]:
x

In [None]:
y

- Data used for RNN LSTM model.

In [None]:
if granularity == '5min':
  dataLSTM = X.drop(['energy_per_hour', 'energy_previous_hour','energy_hour_diff'],axis=1)
  dataLSTM.head()
if granularity == '60min':
  dataLSTM = data.drop(['energy_next_hour','energy_hour_2','energy_hour_3'],axis=1)

- Pickling out data before deleting variables in case of later use.

In [None]:
if granularity == '5min':
  name_tag = ['house','dataLSTM','houses','X','Y','x','y']
  data_array = [house,dataLSTM,houses,X,Y,x,y]

  # Write each object to '<name>.pkl'; the with-block closes the file even if dump fails.
  for t, obj in zip(name_tag, data_array):
    with open(t+'.pkl','wb') as f:
      pickle.dump(obj, f)

In [None]:
def reload_data(name_tag):
  # Load every pickled object by name; the with-block closes each file after reading.
  loaded = {}
  for t in name_tag:
    with open(t+'.pkl','rb') as f:
      loaded[t] = pickle.load(f)
  return (loaded['house'],loaded['dataLSTM'],loaded['houses'],
          loaded['X'],loaded['Y'],loaded['x'],loaded['y'])

- I used the recurrent neural network to predict one sample ahead, that is, 5 minutes into the future.
- When optimizing the recurrent neural network I tried different numbers of units for the LSTM layer. I also tried a stacked LSTM with two LSTM layers, which was a lot more prone to overfitting and slower to train. In the end I went for 32 units, which did not overfit and still fit the data reasonably well.
- I included a Dropout layer, which helped with overfitting.
- I also tweaked the n_past parameter. The more past samples I used, the longer the training took. Many different values gave roughly the same result, so I chose 5 because it was the fastest.
- Because there were no signs of overfitting, I trained for up to 30 epochs, by which point the results were no longer getting better or worse.
- At the end I plotted the training loss against the validation loss, which helped me see whether the model was overfitting or underfitting.

In [None]:
def RNN_LSTM(data, n_future, n_past):
  # scaling the data
  scaler = StandardScaler()
  scaler = scaler.fit(data)
  x_e = scaler.transform(data)

  X = []
  Y = []

  # Reshaping the data so I can feed it into the LSTM RNN. Shape of the data looks like 
  # (number of samples, n_past, number of features) where n_past is the number of previous
  # samples we use to predict the next value and n_future is how many values ahead do we want to predict.  
  for i in range(n_past, len(x_e) - n_future +1):
    X.append(x_e[i - n_past:i, 0:x_e.shape[1]])
    Y.append(x_e[i + n_future - 1:i + n_future, 0])

  X, Y = np.array(X), np.array(Y)

  x_train,x_test,y_train,y_test = train_test_split(X, Y, test_size=0.2, shuffle=True)
  x_train, x_test, y_train, y_test = np.array(x_train), np.array(x_test), np.array(y_train), np.array(y_test)

  # Checkpoint saves the state of the model at the epoch where it's performing the best.
  checkpoint_name = 'Weights-{epoch:03d}--{val_loss:.5f}.hdf5' 
  checkpoint = ModelCheckpoint(checkpoint_name, monitor='val_loss', verbose = 1, save_best_only = True, mode ='auto')
  callbacks_list = [checkpoint]

  model = Sequential()
  model.add(LSTM(32, activation='relu', input_shape=(x_train.shape[1], x_train.shape[2]), return_sequences=False))
  model.add(Dropout(0.5))
  model.add(Dense(Y.shape[1]))

  model.compile(optimizer='adam', loss='mae',metrics=['mean_absolute_error'])
  model.summary()

  # fit the model
  history = model.fit(x_train, y_train, epochs=30,batch_size=16,validation_split=0.1, verbose=1, callbacks=callbacks_list)

  plt.plot(history.history['loss'], label='Training loss')
  plt.plot(history.history['val_loss'], label='Validation loss')
  plt.legend()

  # make predictions
  trainPredict = model.predict(x_train)
  testPredict = model.predict(x_test)
  # invert predictions
  trainPredict_copies = np.repeat(trainPredict, x_e.shape[1], axis=-1)
  trainPredict = scaler.inverse_transform(trainPredict_copies)[:,0]
  y_train_copies = np.repeat(y_train, x_e.shape[1], axis=-1)
  y_train = scaler.inverse_transform(y_train_copies)[:,0]
  testPredict_copies = np.repeat(testPredict, x_e.shape[1], axis=-1)
  testPredict = scaler.inverse_transform(testPredict_copies)[:,0]
  y_test_copies = np.repeat(y_test, x_e.shape[1], axis=-1)
  y_test = scaler.inverse_transform(y_test_copies)[:,0]
  # calculate and print mean absolute error
  trainScore = mean_absolute_error(y_train, trainPredict)
  print('Train Score: %.2f MAE' % (trainScore))
  testScore = mean_absolute_error(y_test, testPredict)
  print('Test Score: %.2f MAE' % (testScore))

  return trainScore,testScore
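
- The windowing loop inside `RNN_LSTM` can be checked on a toy array. The sketch below copies that loop into a hypothetical helper `make_windows` and applies it to made-up data with 10 time steps and 2 features, so the resulting shapes are easy to verify by hand.

```python
import numpy as np


def make_windows(values, n_past, n_future):
    """Build (samples, n_past, features) inputs and (samples, 1) targets from column 0."""
    X, Y = [], []
    for i in range(n_past, len(values) - n_future + 1):
        X.append(values[i - n_past:i, :])                   # the n_past rows before step i
        Y.append(values[i + n_future - 1:i + n_future, 0])  # column 0, n_future steps ahead
    return np.array(X), np.array(Y)


# Toy data: row r is [2r, 2r + 1].
toy = np.arange(20, dtype=float).reshape(10, 2)

X_toy, Y_toy = make_windows(toy, n_past=5, n_future=1)
print(X_toy.shape, Y_toy.shape)
```

- With `n_future=1` the target is the row right after the window. With `n_future=3` it is two rows further on, and two fewer windows fit into the same data.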

In [None]:
if granularity == '5min':
  house,dataLSTM,houses,X,Y,x,y = reload_data(name_tag)
  del X,Y,houses,x,y,house
else:
  dataLSTM = pd.get_dummies(dataLSTM, columns=["type"])
RNN_LSTM(data=dataLSTM,n_future=3,n_past=5)

In [None]:
if granularity == '5min':
  pass
else:
  pass

- RNN LSTM prediction on each house individually (5, 10 and 15 minutes; 1, 2 and 3 hours).

In [None]:
if granularity=='5min':
  house,dataLSTM,houses,X,Y,x,y = reload_data(name_tag)
  del X,Y,dataLSTM,x,y,house
  for i in range(1,len(houses)+1,1):
    RNN_LSTM(data=houses[str(i)].drop(['energy_per_hour', 'energy_previous_hour','energy_hour_diff','energy_hour_2','energy_hour_3','energy_5_min','energy_10_min',\
                        'energy_15_min','energy_next_hour'],axis=1),n_future=3,n_past=5)
else:
  for i in range(1,len(houses)+1,1):
    RNN_LSTM(data=houses[str(i)].drop(['energy_next_hour', 'energy_previous_hour','energy_hour_2','energy_hour_3'],axis=1),n_future=1,n_past=5)
    

- n_future = 2: 220.90 MAE
- n_future = 3: 240.71 MAE

- Results using RNN LSTM (mean absolute error in Wh):
  - Prediction 5 minutes:
    - Train Score: 174.30 MAE
    - Test Score: 172.11 MAE
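- The scores are in Wh because `RNN_LSTM` maps the predictions back from standardized units before computing the error. Copying the prediction into every column, inverting the scaling and keeping column 0 is equivalent to undoing the scaling of column 0 alone. The sketch below shows this on made-up numbers, with a hand-computed mean and population standard deviation standing in for the fitted `StandardScaler`.

```python
import numpy as np

# Toy data: column 0 plays the role of energy_Wh, column 1 is some other feature.
data = np.array([[100.0, 5.0], [300.0, 7.0], [200.0, 9.0]])
mean = data.mean(axis=0)
std = data.std(axis=0)  # population standard deviation (ddof=0)

scaled = (data - mean) / std
pred_scaled = scaled[:, :1]  # pretend the model predicted column 0 perfectly, shape (3, 1)

# The approach used in RNN_LSTM: repeat across all columns, invert, keep column 0.
copies = np.repeat(pred_scaled, data.shape[1], axis=-1)
via_copies = (copies * std + mean)[:, 0]

# Equivalent shortcut: undo the scaling of column 0 only.
direct = pred_scaled[:, 0] * std[0] + mean[0]

print(via_copies, direct)
```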

*in progress*
- Plotting the predictions.

In [None]:
'''# shift train predictions for plotting
trainPredictPlot = numpy.empty_like(dataset)
trainPredictPlot[:, :] = numpy.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
# shift test predictions for plotting
testPredictPlot = numpy.empty_like(dataset)
testPredictPlot[:, :] = numpy.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict
# plot baseline and predictions
plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()'''

*in progress*
- Writing a custom loss function that penalizes undershooting more heavily than overshooting.

In [None]:
'''
def custom_f(predt, dtrain):
    alpha = 1.5
    #y = dtrain.get_label()
    if predt-dtrain<0:
      return -alpha*np.abs(predt-dtrain)
    else:
      return -np.abs(predt-dtrain)'''
'''
def gradient(predt: np.ndarray, dtrain: xgb.DMatrix) -> np.ndarray:
    #Compute the gradient squared log error.
    y = dtrain.get_label()
    return (np.log1p(predt) - np.log1p(y)) / (predt + 1)

def hessian(predt: np.ndarray, dtrain: xgb.DMatrix) -> np.ndarray:
    #Compute the hessian for squared log error.
    y = dtrain.get_label()
    return ((-np.log1p(predt) + np.log1p(y) + 1) /
            np.power(predt + 1, 2))

def squared_log(predt: np.ndarray,
                dtrain: xgb.DMatrix) -> Tuple[np.ndarray, np.ndarray]:
    #Squared Log Error objective. A simplified version for RMSLE used as objective function.
    
    predt[predt < -1] = -1 + 1e-6
    grad = gradient(predt, dtrain)
    hess = hessian(predt, dtrain)
    return grad, hess'''
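
- A minimal sketch of one way to penalize undershooting more than overshooting. It is not the draft above: it swaps the absolute error for a weighted squared error, because the absolute error has a zero second derivative, which is awkward for gradient boosting. The names `ALPHA`, `asymmetric_squared_loss` and `asymmetric_squared_objective` are hypothetical. The `(y_true, y_pred) -> (grad, hess)` form is the one XGBoost's scikit-learn wrapper documents for a callable `objective`, but that should be checked against the installed version.

```python
import numpy as np

ALPHA = 1.5  # extra weight for undershooting, same value as in the draft above


def asymmetric_squared_loss(y_true, y_pred):
    """Mean squared error where predictions below the real value weigh ALPHA times more."""
    residual = y_pred - y_true
    weight = np.where(residual < 0, ALPHA, 1.0)
    return float(np.mean(weight * residual ** 2))


def asymmetric_squared_objective(y_true, y_pred):
    """First and second derivative of weight * (y_pred - y_true)**2 with respect to y_pred."""
    residual = y_pred - y_true
    weight = np.where(residual < 0, ALPHA, 1.0)
    grad = 2.0 * weight * residual
    hess = 2.0 * weight
    return grad, hess


y_true = np.array([10.0, 10.0])
y_pred = np.array([8.0, 12.0])  # one undershoot and one overshoot of the same size
print(asymmetric_squared_loss(y_true, y_pred))
print(asymmetric_squared_objective(y_true, y_pred))
```

- Both errors have size 2, but the undershoot contributes 1.5 times as much to the loss and to the gradient.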

In [None]:
labels = ['energy_next_hour','energy_hour_2','energy_hour_3']
labels2 = ['energy_5_min','energy_10_min','energy_15_min']

- With the XGBoost model I used two different datasets. One was the same as for the recurrent neural network, where I predict 5 minutes ahead. With the other I predicted one, two and three hours ahead, similar to what I did on the Reach dataset.
- Testing out the XGBoost model:
  - For adjusting the hyperparameters I used GridSearchCV.
  - For most of my training I set n_estimators to 10. I did this to save time without losing much accuracy.
- Best results (mean absolute error in Wh):
  - dataset 1:
    - 5 minutes: 267.45 MAE
    - 10 minutes: 282.61 MAE
    - 15 minutes: 290.68 MAE
  - dataset 2:
    - 1 hour: 243.61 MAE
    - 2 hours: 247.92 MAE
    - 3 hours: 246.93 MAE

In [None]:
# Imports used by the model cells below (harmless if they were already imported earlier in the notebook).
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score, cross_val_predict

def XGB_aggreg_hour(X,Y):
  xgb_regr = xgb.XGBRegressor(verbose=1,gamma=0,max_depth=25,min_child_weight=10,n_jobs=-1,n_estimators=10)
  resultsXGB = []
  shiftsXGB = []
  print("HOURS:")
  for l in labels:
    kfold = KFold(n_splits=4,shuffle=True)
    results = cross_val_score(xgb_regr,X,Y[l],cv=kfold,scoring='neg_mean_absolute_error')
    print(f"Mean, std: {results.mean()}, {results.std()}")
    shiftsXGB.append(results.mean())
  resultsXGB.append(shiftsXGB)
  print(f"results hours: {resultsXGB}")

  return resultsXGB

def XGB_aggreg_min(x,y):
  xgb_regr = xgb.XGBRegressor(verbose=1,gamma=0,max_depth=8,min_child_weight=14,n_jobs=-1,n_estimators=10)
  resultsXGB_min = []
  shiftsXGB = []
  print("MINUTES:")
  for l in labels2:
    kfold = KFold(n_splits=4,shuffle=True)
    results = cross_val_score(xgb_regr,x,y[l],cv=kfold,scoring='neg_mean_absolute_error')
    print(f"Mean, std: {results.mean()}, {results.std()}")
    shiftsXGB.append(results.mean())
  resultsXGB_min.append(shiftsXGB)

  print(f"results minutes: {resultsXGB_min}")
  
  return resultsXGB_min

In [None]:
house,dataLSTM,houses,X,Y,x,y = reload_data(name_tag)
del houses,house,dataLSTM,x,y
XGB_aggreg_hour(X,Y)   # the function defined above is XGB_aggreg_hour (no trailing "s")

house,dataLSTM,houses,X,Y,x,y = reload_data(name_tag)
del houses,house,dataLSTM,X,Y
XGB_aggreg_min(x,y)

- I computed results for every house individually as well. As before, I used two different datasets, one for predicting 5 minutes ahead and the other for 1-3 hours ahead. First I used RandomForestRegressor, which I optimized using GridSearchCV.
- The results of two separate trainings are listed below.

In [None]:
def RF_indiv_hour(houses): 
  RF_regr = RandomForestRegressor(n_estimators=10)
  results_every_house = []
  for i in range(1,len(houses)+1,1):
    shifts_every_house = []
    for l in labels:
      kfold = KFold(n_splits=4, shuffle=True, random_state=42)
      results = cross_val_score(RF_regr,houses[str(i)].drop(['energy_next_hour','energy_hour_2','energy_hour_3','energy_5_min','energy_10_min','energy_15_min'],axis=1),\
                              houses[str(i)][l],cv=kfold,scoring='neg_mean_absolute_error')
      #print(f"Mean,std: {results.mean()}, {results.std()}")
      shifts_every_house.append(results.mean())
    results_every_house.append(shifts_every_house)
  return results_every_house


In [None]:
def RF_indiv_min(houses):
  RF_regr = RandomForestRegressor(n_estimators=10)
  results_every_house_min = []
  for i in range(1,len(houses)+1,1):
    shifts_every_house = []
    for l in labels2:
      kfold = KFold(n_splits=4, shuffle=True, random_state=42)
      results = cross_val_score(RF_regr,houses[str(i)].drop(['energy_next_hour','energy_hour_2','energy_hour_3','energy_5_min','energy_10_min','energy_15_min'],axis=1),\
                              houses[str(i)][l],cv=kfold,scoring='neg_mean_absolute_error')
      #print(f"Mean,std: {results.mean()}, {results.std()}")
      shifts_every_house.append(results.mean())
    results_every_house_min.append(shifts_every_house)
  return results_every_house_min

In [None]:
house,dataLSTM,houses,X,Y,x,y = reload_data(name_tag)
del house,dataLSTM,X,Y,x,y

RF_indiv_hour(houses)
RF_indiv_min(houses)

- The results using RandomForestRegressor (negative MAE in Wh) for each of the five houses.


In [None]:
# Hour shifts of 1, 2 and 3 for each house
res1 = [[-8.940321554901315, -5.907694852169678, -6.081659686491101],
 [-3.1259870620033503, -3.3141521606969047, -2.7314996533381835],
 [-16.634527979399458, -9.018266701288404, -8.445598022745479],
 [-2.53823452536365, -2.5633788857156032, -3.616244763433274],
 [-4.35738370942241, -4.0452113971608945, -4.529010026977254]]

res2 = [[-9.147939178881995, -5.61363458901363, -5.631915073723677],
 [-3.069280121205534, -3.4800966887110203, -2.6198912478655654],
 [-17.04767340492508, -11.202566931669569, -8.768770778505989],
 [-2.644403943231151, -2.6088597424836477, -3.4567297063981623],
 [-3.249339643759429, -4.2898561738367125, -4.844694880434709]]

# Prediction for 5, 10 and 15 minutes ahead
res3 = [[-205.21729301043933, -264.68734899936373, -278.7132719689562],
 [-178.5423738333146, -224.1490214555531, -232.40861009156316],
 [-133.84860141265835, -178.22609567867588, -199.03604223464993],
 [-173.63488655849287, -210.0114049618994, -219.1221518750832],
 [-170.43731613068456, -227.50765566239184, -243.22578830127958]]
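
- The scores above are negative because `neg_mean_absolute_error` flips the sign so that larger is always better. A small sketch of turning such lists back into positive MAE values per house and horizon. The numbers are the first two rows of `res3`, rounded to two decimals; `horizons`, `neg_mae` and `mae_per_house` are hypothetical names.

```python
horizons = ['energy_5_min', 'energy_10_min', 'energy_15_min']

# First two rows of res3 above, rounded to two decimals.
neg_mae = [[-205.22, -264.69, -278.71],
           [-178.54, -224.15, -232.41]]

# Flip the sign and label every value with its house and prediction horizon.
mae_per_house = {
    f'house_{i}': {horizon: -score for horizon, score in zip(horizons, row)}
    for i, row in enumerate(neg_mae, start=1)
}

for name, scores in mae_per_house.items():
    print(name, scores)
```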

- Then I tried it with XGBRegressor as well.
- Results are listed below.

In [None]:
def XGB_indiv_hour(houses):
  xgb_regr = xgb.XGBRegressor(verbose=1,gamma=0,max_depth=25,min_child_weight=10,n_jobs=-1,n_estimators=10)
  results_every_house = []
  for i in range(1,len(houses)+1,1):
    shifts_every_house = []
    for l in labels:
      kfold = KFold(n_splits=4, shuffle=True, random_state=42)
      results = cross_val_score(xgb_regr,houses[str(i)].drop(['energy_next_hour','energy_hour_2','energy_hour_3','energy_5_min','energy_10_min','energy_15_min'], axis=1),\
                              houses[str(i)][l],cv=kfold,scoring='neg_mean_absolute_error')
      #print(f"Mean,std: {results.mean()}, {results.std()}")
      shifts_every_house.append(results.mean())
    results_every_house.append(shifts_every_house)
  print(f"Results hour: {results_every_house}")
  return results_every_house

In [None]:
def XGB_indiv_min(houses):
  xgb_regr = xgb.XGBRegressor(verbose=1,gamma=0,max_depth=8,min_child_weight=14,n_jobs=-1,n_estimators=10)
  results_every_house_min = []
  for i in range(1,len(houses)+1,1):
    shifts_every_house = []
    for l in labels2:
      kfold = KFold(n_splits=4, shuffle=True, random_state=42)
      results = cross_val_score(xgb_regr,houses[str(i)].drop(['energy_next_hour','energy_hour_2','energy_hour_3','energy_5_min','energy_10_min','energy_15_min'], axis=1),\
                              houses[str(i)][l],cv=kfold,scoring='neg_mean_absolute_error')
      #print(f"Mean,std: {results.mean()}, {results.std()}")
      shifts_every_house.append(results.mean())
    results_every_house_min.append(shifts_every_house)
  print(f"Results min: {results_every_house_min}")
  return results_every_house_min

In [None]:
XGB_indiv_hour(houses)
XGB_indiv_min(houses)

- The results using XGBRegressor (negative MAE in Wh) for each of the five houses.

In [None]:
# Hour shifts of 1, 2 and 3 for each house
res = [[-244.98690032958984, -251.21905517578125, -253.93219375610352],
 [-180.08655166625977, -181.13837051391602, -181.60640335083008],
 [-237.07272338867188, -250.8828125, -250.80009078979492],
 [-194.4733657836914, -195.52861404418945, -195.61228942871094],
 [-380.97105407714844, -382.78367614746094, -382.90311431884766]]
# Prediction for 5, 10 and 15 minutes ahead
res2 = [[-267.6378173828125, -291.14539337158203, -296.9206848144531],
 [-208.86071014404297, -228.75857543945312, -232.5463638305664],
 [-244.11033630371094, -262.56711196899414, -273.9931869506836],
 [-214.06250381469727, -227.1177635192871, -233.57318878173828],
 [-387.7323226928711, -401.28368377685547, -406.7476348876953]]

- Here, I did the same predictions as with XGBoost on all houses aggregated, using RandomForestRegressor.
- Best results (mean absolute error in Wh):
  - dataset 1:
    - 5 minutes: 199.00 MAE
    - 10 minutes: 183.54 MAE
    - 15 minutes: 198.02 MAE
  - dataset 2:
    - 1 hour: 5.75 MAE
    - 2 hours: 3.57 MAE
    - 3 hours: 4.62 MAE

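- We can see that the last 3 results are suspiciously good. Since they come from cross-validation, they are probably a result of overfitting, or possibly of information leaking between the shuffled folds, rather than of genuine accuracy.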
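- One way to check whether shuffled folds are flattering a score is to evaluate on folds that respect time order, so that every test block lies after its training block. The sketch below is a hand-rolled, hypothetical splitter (`expanding_window_splits`), not the notebook's method; scikit-learn's `TimeSeriesSplit` offers similar behaviour. It assumes a single series sorted by time, which is not true of the aggregated data, where the houses are concatenated one after another.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with every test block after its training block."""
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        # The last fold takes whatever is left, so no sample at the end is dropped.
        test_end = train_end + fold_size if k < n_splits else n_samples
        yield list(range(0, train_end)), list(range(train_end, test_end))


splits = list(expanding_window_splits(n_samples=10, n_splits=4))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

- Each pair of index lists can be used with `DataFrame.iloc` to fit on the earlier rows and score on the later ones.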

In [None]:
def RF_aggreg_hour(X,Y):  
  RF_regr = RandomForestRegressor(n_estimators=10)
  resultsRF = []
  shiftsRF = []
  for l in labels:
    kfold = KFold(n_splits=4, shuffle=True)
    results = cross_val_score(RF_regr,X,Y[l],cv=kfold,scoring='neg_mean_absolute_error')
    print(f"Mean,std: {results.mean()}, {results.std()}")
    shiftsRF.append(results.mean())
  resultsRF.append(shiftsRF)
  print(f"Results hour: {resultsRF}")
  return resultsRF

In [None]:
def RF_aggreg_min(x,y):
  RF_regr = RandomForestRegressor(n_estimators=10, min_samples_leaf=40, min_samples_split=20)
  resultsRF_min = []
  shiftsRF = []
  for l in labels2:
    kfold = KFold(n_splits=4, shuffle=True)
    results = cross_val_score(RF_regr,x,y[l],cv=kfold,scoring='neg_mean_absolute_error')
    print(f"Mean,std: {results.mean()}, {results.std()}")
    shiftsRF.append(results.mean())
  resultsRF_min.append(shiftsRF)
  print(f"Results min: {resultsRF_min}")
  return resultsRF_min

In [None]:
house,dataLSTM,houses,X,Y,x,y = reload_data(name_tag)
del houses,house,dataLSTM,x,y
RF_aggreg_hour(X,Y)

house,dataLSTM,houses,X,Y,x,y = reload_data(name_tag)
del houses,house,dataLSTM,X,Y
RF_aggreg_min(x,y)

- Results using RandomForestRegressor when all houses are aggregated (negative MAE).

In [None]:
# prediction 1, 2, 3 hours into the future:
r1 = [-8.033944268003609, -7.22307783236139, -5.029273831055247]
r2 = [-5.750566512889883, -3.5746119976256003, -4.625614552566425]

# prediction 5, 10, 15 minutes into the future:
r3 = [-201.79729266526888, -283.5402434192109, -298.02007765958325] 

- Optimizing hyperparameters using GridSearchCV.

In [None]:
param_grid = {
    'min_samples_split': [15,20,25],
    'min_samples_leaf': [25,30,35,40,45,50]
}
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
clf =  GridSearchCV(RandomForestRegressor(n_estimators=10, n_jobs=-1),param_grid, cv=kfold, return_train_score=False,scoring='neg_mean_absolute_error')
clf.fit(X,Y['energy_next_hour'])
print(f'Best estimator: {clf.best_estimator_}')
print(f'Best params: {clf.best_params_}')
print(f'Best score: {clf.best_score_}')

Best estimator: RandomForestRegressor(min_samples_leaf=40, min_samples_split=25,
                      n_estimators=10, n_jobs=-1)
Best params: {'min_samples_leaf': 40, 'min_samples_split': 25}
Best score: -271.1933033788696



- Plotting out the graph of feature importances when predicting with RandomForestRegressor.

In [None]:
feature_names = ['energy_Wh','max_air_temp','min_air_temp','energy_Wh_houseMean','energy_Wh_daymean','energy_per_hour','energy_previous_hour','energy_hour_diff','energy_diff',\
                 'part_of_day','part_of_year','weekend','type_0','type_eot','type_flat','type_mt']
forest = RandomForestRegressor(n_estimators=10)
forest.fit(X_e,Y)
importances = forest.feature_importances_

forest_importances = pd.DataFrame(importances, index=feature_names,columns=['importance']).sort_values('importance',ascending=False)

plt.rcParams.update({'figure.figsize': (20, 10.0)})
plt.rcParams.update({'font.size': 14})
plt.barh(forest_importances.index, forest_importances['importance'])  # plot the sorted importances computed above

- Making predictions with the models, so I can later plot them on a graph.
- Predictions and graphs are made predicting one hour into the future.

In [None]:
house,dataLSTM,houses,X,Y,x,y = reload_data(name_tag)
del x,y,house,dataLSTM,X,Y

kfold = KFold(n_splits=4, shuffle=True, random_state=0)
predictions_1=predictions_2=predictions_3=predictions_4=predictions_5=np.ndarray(0)
house_predictions = [predictions_1,predictions_2,predictions_3,predictions_4,predictions_5]


for i in range(0,len(houses),1):
  house_predictions[i] = cross_val_predict(xgb.XGBRegressor(verbose=1,gamma=0,max_depth=25,min_child_weight=13,n_jobs=-1,n_estimators=10),\
                                    houses[str(i+1)].drop(['energy_next_hour','energy_hour_2','energy_hour_3','energy_5_min','energy_10_min','energy_15_min'],axis=1),\
                                    houses[str(i+1)]['energy_next_hour'], cv=kfold,n_jobs=-1,verbose=1)

for i in range(0,len(houses),1):
  house_predictions[i] = pd.DataFrame({'real':houses[str(i+1)].energy_next_hour,'pred':house_predictions[i]})


- Choosing a one week period in the data, so I can see the details in the lineplot, otherwise the data is too dense.

In [None]:
# function drops index row, for graphing purposes
def drop_index(dataset):
  dataset = dataset.reset_index()
  return dataset.drop(['index'], axis=1)

# dataset for lineplot
def make_lineplot(data, start_week, end_week):
  lineplot = {}
  i = 1
  for p in data:
    lineplot['House_'+str(i)] = p[start_week*168 : end_week*168]
    lineplot['House_'+str(i)] = drop_index(lineplot['House_'+str(i)])
    i+=1
  return lineplot

# dataset for kdeplot
def make_kdeplot(data):
  kdeplot = {}
  i = 1
  for p in data:
    kdeplot['House_'+str(i)] = p
    i+=1
  return kdeplot

def draw_kdeplot(house_data):
  plt.figure(figsize=(15,15))
  sns.kdeplot(x=house_data['real'],data=house_data)
  sns.kdeplot(x=house_data['pred'],data=house_data)
  plt.legend(['real','predicted'])

def draw_lineplot(house_data):
  plt.figure(figsize=(15,15))
  sns.lineplot(x=house_data.index,y=house_data['real'],data=house_data)
  sns.lineplot(x=house_data.index,y=house_data['pred'],data=house_data)
  plt.legend(['real','predicted'])

In [None]:
lineplot = make_lineplot(house_predictions, 0, 2)
kdeplot = make_kdeplot(house_predictions)
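
- `make_lineplot` slices by row position and counts 168 rows per week, which assumes hourly samples with no gaps. The sketch below only spells out that arithmetic; `ROWS_PER_WEEK` and `week_bounds` are hypothetical names.

```python
ROWS_PER_WEEK = 7 * 24  # 168 hourly samples per week


def week_bounds(start_week, end_week, rows_per_week=ROWS_PER_WEEK):
    """Return the (first_row, last_row_exclusive) positions used for slicing."""
    return start_week * rows_per_week, end_week * rows_per_week


one_week = week_bounds(0, 1)
two_weeks = week_bounds(0, 2)
print(one_week, two_weeks)
```

- The call `make_lineplot(house_predictions, 0, 2)` therefore selects rows 0 to 335, which is two weeks of hourly data; `(0, 1)` would select exactly one week.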

- KDE plots for each house.
  - Similar to the Reach data, all plots share a common trait: the distribution of values has two peaks. Here, however, the model clearly overshoots the first peak, and the predicted distribution is shifted slightly to the left, which means that in general my predictions are too low.

In [None]:
draw_kdeplot(kdeplot['House_1'])

In [None]:
draw_kdeplot(kdeplot['House_2'])

In [None]:
draw_kdeplot(kdeplot['House_3'])

In [None]:
draw_kdeplot(kdeplot['House_4'])

In [None]:
draw_kdeplot(kdeplot['House_5'])

- Lineplots of a one week period, including the real values of energy consumption and the predictions for each house.
  - For this dataset I decided to show the graphs using XGBRegressor. I am quite happy with the general shape of the graph, although I can again clearly see that the predictions are not "flexible" enough: I am undershooting the "peaks" of consumption and overshooting the "valleys". Even more obvious is that, in general, all my predictions are too low. I will try to fix that by writing a custom loss function, which will hopefully shift the graph up.
  - I also graphed the results using RandomForestRegressor. Those graphs were a lot better because the fit was a lot better, but I could still notice the "flexibility" problem, which I cannot fix.

In [None]:
draw_lineplot(lineplot['House_1'])

In [None]:
draw_lineplot(lineplot['House_2'])

In [None]:
draw_lineplot(lineplot['House_3'])

In [None]:
draw_lineplot(lineplot['House_4'])

In [None]:
draw_lineplot(lineplot['House_5'])