This notebook explores the energy consumption forecasting dataset in detail.

# Dataset Description

## Data and Problem Description
More accurate forecasts of building energy consumption mean better planning and more efficient energy use. The objective is to forecast energy consumption from the following data:
(For each data set, several test periods over which a forecast is required will be specified.)

### Historical Consumption
A selected time series of consumption data for over 260 buildings.

- **obs_id** - An arbitrary ID for the observation
- **SiteId** - An arbitrary ID number for the building, matches across datasets
- **ForecastId** - An ID for a timeseries that is part of a forecast (can be matched with the submission file)
- **Timestamp** - The time of the measurement
- **Value** - A measure of consumption for that building

### Building Metadata
Additional information about the included buildings.

- **SiteId** - An arbitrary ID number for the building, matches across datasets
- **Surface** - The surface area of the building
- **Sampling** - The number of minutes between each observation for this site. The timestep size for each ForecastId can be found in the separate "Submission Forecast Period" file on the data download page.
- **BaseTemperature** - The base temperature for the building
- **[DAY_OF_WEEK]IsDayOff** - True if DAY_OF_WEEK is not a work day (one column per weekday, e.g. `MondayIsDayOff`)

### Historical Weather Data
This dataset contains temperature data from several stations near each site. For each site, several temperature measurements were retrieved from stations in a radius of 30 km if available. Note: Not all sites will have available weather data.

- **SiteId** - An arbitrary ID number for the building, matches across datasets
- **Timestamp** - The time of the measurement
- **Temperature** - The temperature as measured at the weather station
- **Distance** - The distance in km from the weather station to the building

### Public Holidays
Public holidays at the sites included in the dataset, which may be helpful for identifying days where consumption may be lower than expected. Note: Not all sites will have available public holiday data.

- **SiteId** - An arbitrary ID number for the building, matches across datasets
- **Date** - The date of the holiday
- **Holiday** - The name of the holiday


### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

### Overview of all datasets

In [4]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
test_data_path=r"..\Dataset\power-laws-forecasting-energy-consumption-test-data.csv"
test_data=pd.read_csv(test_data_path,sep=';')
test_data

Unnamed: 0,obs_id,SiteId,Timestamp,ForecastId,Value
0,323604,235,2014-01-02T19:00:00+00:00,5004,157265.446409
1,2813181,235,2014-01-02T22:00:00+00:00,5004,155498.418922
2,4006999,235,2014-01-02T22:45:00+00:00,5004,155498.418922
3,106973,235,2014-01-03T00:30:00+00:00,5004,157265.446409
4,6793052,235,2014-01-13T14:15:00+00:00,5005,91885.429363
...,...,...,...,...,...
1309171,268524,261,2014-03-24T03:00:00+00:00,5487,38545.410018
1309172,3203969,261,2014-03-24T12:30:00+00:00,5487,48937.146780
1309173,7721495,261,2014-03-24T22:00:00+00:00,5487,
1309174,5696668,261,2014-03-24T23:30:00+00:00,5487,
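
The last rows of the preview have an empty `Value`, presumably the periods to be forecast. A minimal sketch for counting such gaps per `ForecastId` follows. It uses a toy frame (`sample_test`, a hypothetical name) built from a few rows of the preview rather than the full file, with `NaN` standing in for the empty cells.

```python
import numpy as np
import pandas as pd

# Toy frame reusing a few rows from the preview above;
# NaN stands in for the empty Value cells.
sample_test = pd.DataFrame({
    "SiteId": [235, 235, 261, 261, 261],
    "ForecastId": [5004, 5005, 5487, 5487, 5487],
    "Value": [157265.446409, 91885.429363, 48937.146780, np.nan, np.nan],
})

# Count the empty Value cells within each forecast series.
missing_per_forecast = (
    sample_test["Value"].isna()
    .groupby(sample_test["ForecastId"])
    .sum()
)
print(missing_per_forecast)
```

On the real data the same expression can be applied to `test_data` directly.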


In [5]:
train_data_path=r"..\Dataset\power-laws-forecasting-energy-consumption-training-data.csv"
train_data=pd.read_csv(train_data_path,sep=';')
train_data

Unnamed: 0,obs_id,SiteId,Timestamp,ForecastId,Value
0,4852050,42,2016-10-18T02:45:00+00:00,1087,26397.049623
1,1638923,42,2016-10-18T11:45:00+00:00,1087,42958.364641
2,5748910,42,2016-10-18T20:45:00+00:00,1087,27096.919666
3,38199,42,2016-10-20T10:45:00+00:00,1087,50211.408087
4,1338204,42,2016-10-20T18:45:00+00:00,1087,50503.305105
...,...,...,...,...,...
6559825,1127574,300,2017-09-22T18:45:00+00:00,6719,7740.955427
6559826,4695712,300,2017-09-23T17:45:00+00:00,6719,7133.180234
6559827,978979,300,2017-09-24T22:45:00+00:00,6719,7339.789365
6559828,6317358,300,2017-09-25T08:45:00+00:00,6719,18873.744081
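
`Timestamp` is read as plain text in ISO 8601 format with a UTC offset. The sketch below shows one way to parse it and derive simple calendar features. The frame `sample_train` and the columns `hour` and `day_of_week` are hypothetical names used only for illustration, and the rows are copied from the preview above.

```python
import pandas as pd

# A few rows copied from the training preview above.
sample_train = pd.DataFrame({
    "SiteId": [42, 42, 42],
    "Timestamp": [
        "2016-10-18T02:45:00+00:00",
        "2016-10-18T11:45:00+00:00",
        "2016-10-20T10:45:00+00:00",
    ],
    "Value": [26397.049623, 42958.364641, 50211.408087],
})

# Parse the ISO 8601 strings into timezone-aware datetimes (UTC).
sample_train["Timestamp"] = pd.to_datetime(sample_train["Timestamp"], utc=True)

# Calendar features that often matter for building consumption.
sample_train["hour"] = sample_train["Timestamp"].dt.hour
sample_train["day_of_week"] = sample_train["Timestamp"].dt.dayofweek  # Monday = 0

print(sample_train)
```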


In [6]:
holiday_data_path=r"..\Dataset\power-laws-forecasting-energy-consumption-holidays.csv"
holiday_data=pd.read_csv(holiday_data_path,sep=';')
holiday_data

Unnamed: 0,Date,Holiday,SiteId
0,2016-02-15,Washington's Birthday,1
1,2017-05-29,Memorial Day,1
2,2017-11-23,Thanksgiving Day,1
3,2017-12-29,New Years Eve Shift,1
4,2017-12-31,New Years Eve,1
...,...,...,...
8382,2015-12-26,Boxing Day,303
8383,2016-05-01,International Workers' Day,304
8384,2015-04-25,Liberation Day,304
8385,2016-03-28,Easter Monday,305
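
Holidays are keyed by `SiteId` and `Date`, so they can be attached to consumption rows with a left join. The sketch below is one possible approach, not necessarily the one used later. The `consumption` rows are made up for illustration, `IsHoliday` is a hypothetical column name, and the date is taken from the UTC timestamp, which assumes the site's local date matches the UTC date.

```python
import pandas as pd

# Two holiday rows copied from the preview above.
holidays = pd.DataFrame({
    "Date": ["2017-05-29", "2017-11-23"],
    "Holiday": ["Memorial Day", "Thanksgiving Day"],
    "SiteId": [1, 1],
})

# Hypothetical consumption rows (values are made up for illustration).
consumption = pd.DataFrame({
    "SiteId": [1, 1, 2],
    "Timestamp": pd.to_datetime(
        ["2017-05-29T08:00:00+00:00",
         "2017-05-30T08:00:00+00:00",
         "2017-05-29T08:00:00+00:00"],
        utc=True,
    ),
    "Value": [10.0, 12.0, 11.0],
})

# Reduce both sides to a calendar date so they share a join key.
holidays["Date"] = pd.to_datetime(holidays["Date"]).dt.date
consumption["Date"] = consumption["Timestamp"].dt.date

# One row per (SiteId, Date) so the join cannot duplicate consumption rows.
holiday_keys = holidays.drop_duplicates(["SiteId", "Date"])[["SiteId", "Date", "Holiday"]]

flagged = consumption.merge(holiday_keys, on=["SiteId", "Date"], how="left")
flagged["IsHoliday"] = flagged["Holiday"].notna()
print(flagged[["SiteId", "Timestamp", "IsHoliday"]])
```

Sites without holiday data simply get `IsHoliday = False` here, which cannot be distinguished from "no holiday on that day".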


In [7]:
weather_data_path=r"..\Dataset\power-laws-forecasting-energy-consumption-weather.csv"
weather_data=pd.read_csv(weather_data_path,sep=';')
weather_data

Unnamed: 0,Timestamp,Temperature,Distance,SiteId
0,2017-03-03T19:00:00+00:00,10.6,27.489346,51
1,2017-03-03T19:20:00+00:00,11.0,28.663082,51
2,2017-03-03T20:00:00+00:00,6.3,28.307039,51
3,2017-03-03T21:55:00+00:00,10.0,29.797449,51
4,2017-03-03T23:00:00+00:00,5.4,28.307039,51
...,...,...,...,...
3957030,2016-09-11T11:00:00+00:00,25.9,28.307039,51
3957031,2016-09-11T11:20:00+00:00,27.0,27.489346,51
3957032,2016-09-11T12:00:00+00:00,27.1,28.307039,51
3957033,2016-09-11T15:50:00+00:00,28.0,27.489346,51
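
The preview shows several `Distance` values for the same site, which suggests readings from more than one station. A simple way to reduce this to one series per site is to keep only the nearest station. The file has no station ID, so the sketch below assumes that rows with the same `Distance` come from the same station. `sample_weather` and `nearest_only` are hypothetical names.

```python
import pandas as pd

# The first five weather rows from the preview above.
sample_weather = pd.DataFrame({
    "Timestamp": [
        "2017-03-03T19:00:00+00:00",
        "2017-03-03T19:20:00+00:00",
        "2017-03-03T20:00:00+00:00",
        "2017-03-03T21:55:00+00:00",
        "2017-03-03T23:00:00+00:00",
    ],
    "Temperature": [10.6, 11.0, 6.3, 10.0, 5.4],
    "Distance": [27.489346, 28.663082, 28.307039, 29.797449, 28.307039],
    "SiteId": [51, 51, 51, 51, 51],
})

# Smallest station distance for each site, broadcast back to every row.
nearest_distance = sample_weather.groupby("SiteId")["Distance"].transform("min")

# Keep only readings taken at that nearest distance
# (assumption: equal Distance means same station).
nearest_only = sample_weather[sample_weather["Distance"] == nearest_distance]
print(nearest_only)
```

An alternative would be to average across stations per timestamp. Which works better depends on how complete each station's record is.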


In [8]:
metadata_path=r"..\Dataset\power-laws-forecasting-energy-consumption-metadata.csv"
metadata=pd.read_csv(metadata_path,sep=';')
metadata

Unnamed: 0,SiteId,Surface,Sampling,BaseTemperature,MondayIsDayOff,TuesdayIsDayOff,WednesdayIsDayOff,ThursdayIsDayOff,FridayIsDayOff,SaturdayIsDayOff,SundayIsDayOff
0,207,7964.873347,30.0,18.0,False,False,False,False,False,True,True
1,7,15168.125971,30.0,18.0,False,False,False,False,False,True,True
2,74,424.340663,15.0,18.0,False,False,False,False,False,True,True
3,239,1164.822636,15.0,18.0,False,False,False,False,False,True,True
4,274,1468.246690,5.0,18.0,False,False,False,False,False,True,True
...,...,...,...,...,...,...,...,...,...,...,...
262,192,11188.881545,15.0,18.0,False,False,False,False,False,True,True
263,58,1149.050606,15.0,18.0,False,False,False,False,False,True,True
264,123,5470.205018,15.0,18.0,False,False,False,False,False,True,True
265,122,6843.612340,15.0,18.0,False,False,False,False,False,True,True
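
The seven `...IsDayOff` columns can be turned into a per-observation flag by matching each timestamp's weekday to the corresponding column. The sketch below does this on a toy frame mirroring the preview (weekends off). `sample_meta`, `timestamps`, and `DayColumn` are hypothetical names, and the weekday is taken in UTC, which assumes it matches the site's local weekday.

```python
import pandas as pd

day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
flag_columns = [f"{day}IsDayOff" for day in day_names]

# Two sites with the same pattern as in the preview: Saturday and Sunday off.
sample_meta = pd.DataFrame(
    [[207] + [False] * 5 + [True] * 2,
     [74] + [False] * 5 + [True] * 2],
    columns=["SiteId"] + flag_columns,
)

# Hypothetical observation times (a Tuesday, a Saturday, a Sunday).
timestamps = pd.DataFrame({
    "SiteId": [207, 207, 74],
    "Timestamp": pd.to_datetime(
        ["2016-10-18T02:45:00+00:00",
         "2016-10-22T02:45:00+00:00",
         "2016-10-23T10:00:00+00:00"],
        utc=True,
    ),
})

# Reshape the wide flags into one row per (SiteId, weekday column).
long_flags = sample_meta.melt(
    id_vars="SiteId", value_vars=flag_columns,
    var_name="DayColumn", value_name="IsDayOff",
)

# Name of the metadata column that applies to each timestamp.
timestamps["DayColumn"] = timestamps["Timestamp"].dt.day_name() + "IsDayOff"

with_flags = timestamps.merge(long_flags, on=["SiteId", "DayColumn"], how="left")
print(with_flags[["SiteId", "Timestamp", "IsDayOff"]])
```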


In [9]:
submission_data_path=r"..\Dataset\power-laws-forecasting-energy-consumption-submission-forecast-period.csv"
submission_data=pd.read_csv(submission_data_path,sep=';')
submission_data

Unnamed: 0,ForecastId,ForecastPeriodNS
0,123,900000000000
1,264,900000000000
2,596,900000000000
3,914,900000000000
4,1053,900000000000
...,...,...
6969,6387,900000000000
6970,6487,3600000000000
6971,6569,900000000000
6972,6777,900000000000
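
The `NS` suffix suggests that `ForecastPeriodNS` is the timestep in nanoseconds. That reading is an assumption, though it fits the `Sampling` values in the metadata: 900000000000 ns is 15 minutes. The sketch below converts the column to a more readable form. `sample_submission`, `ForecastPeriod`, and `ForecastPeriodMinutes` are hypothetical names.

```python
import pandas as pd

# Three rows copied from the preview above.
sample_submission = pd.DataFrame({
    "ForecastId": [123, 264, 6487],
    "ForecastPeriodNS": [900000000000, 900000000000, 3600000000000],
})

# Interpret the integers as nanoseconds (assumption based on the column name).
sample_submission["ForecastPeriod"] = pd.to_timedelta(
    sample_submission["ForecastPeriodNS"], unit="ns"
)

# The same timestep expressed in minutes, comparable to metadata["Sampling"].
sample_submission["ForecastPeriodMinutes"] = (
    sample_submission["ForecastPeriod"].dt.total_seconds() / 60
)
print(sample_submission)
```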


### More about Training dataset

In [13]:
train_list=train_data['SiteId'].unique()
train_list.sort()
print("ID number for the building: ")
print(train_list)
print()
print("No of Buildings: ")
print(len(train_list))

ID number for the building: 
[  1   2   3   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
  20  21  22  23  25  26  27  29  32  33  34  38  39  40  41  42  43  44
  45  46  47  48  49  50  51  52  53  54  57  58  59  60  61  62  63  64
  65  66  67  68  69  70  72  73  74  75  76  77  78  83  84  85  86  87
  88  89  90  92  93  94  96  98  99 100 101 102 105 106 107 108 109 110
 111 112 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130
 131 132 134 135 136 139 140 141 142 143 145 146 148 149 150 151 152 153
 154 155 156 157 158 159 160 161 162 163 164 165 167 169 170 171 172 173
 174 175 176 177 178 180 181 182 183 184 185 186 189 190 191 192 193 194
 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212
 213 215 216 217 218 219 221 222 223 224 225 226 227 228 229 230 231 232
 233 234 235 236 237 238 239 240 241 243 244 245 246 247 248 249 250 251
 252 253 254 255 256 257 259 260 261 262 263 264 265 266 267 268 269 270
 271 272 273 274 275 2

##### Number of observations for each building (SiteId)

In [17]:
# Group the training rows by building and count the rows in each group
new = train_data.groupby('SiteId')
result = new['ForecastId'].count()
print(result)

SiteId
1         900
2       34704
3         360
5         964
6      140744
        ...  
301      2250
302    223648
303      2250
304       450
305       964
Name: ForecastId, Length: 267, dtype: int64


In [22]:
# Same count for the test set: number of timestamps per building, as a dict
Timestamp_count=test_data.groupby('SiteId')['Timestamp'].count()
Tcount=Timestamp_count.to_dict()
print(Tcount)

{1: 238, 2: 6911, 3: 58, 5: 192, 6: 28032, 7: 118, 8: 23040, 9: 18432, 10: 238, 11: 192, 12: 118, 13: 238, 14: 18432, 15: 118, 16: 4607, 17: 238, 18: 238, 19: 16512, 20: 2687, 21: 238, 22: 18048, 23: 192, 25: 16512, 26: 4223, 27: 4223, 29: 118, 32: 118, 33: 19200, 34: 192, 38: 118, 39: 3455, 40: 3455, 41: 11520, 42: 3071, 43: 192, 44: 192, 45: 191, 46: 10368, 47: 58, 48: 192, 49: 11520, 50: 11520, 51: 191, 52: 118, 53: 2303, 54: 2687, 57: 2687, 58: 118, 59: 10752, 60: 2687, 61: 118, 62: 11520, 63: 11520, 64: 118, 65: 192, 66: 2687, 67: 118, 68: 58, 69: 118, 70: 118, 72: 2687, 73: 2687, 74: 11136, 75: 192, 76: 2687, 77: 2303, 78: 58, 83: 119, 84: 10368, 85: 191, 86: 3455, 87: 13440, 88: 12672, 89: 8448, 90: 118, 92: 9216, 93: 4223, 94: 58, 96: 2303, 98: 3455, 99: 3455, 100: 3071, 101: 191, 102: 118, 105: 192, 106: 9600, 107: 9600, 108: 10752, 109: 11904, 110: 118, 111: 118, 112: 2687, 115: 14592, 116: 118, 117: 2687, 118: 118, 119: 13440, 120: 13824, 121: 18432, 122: 18048, 123: 18432, 