<h1><font color='darkred'>Great Baseline Energy Consumption Predictor</font></h1>

<p><strong>Welcome!</strong> This notebook walks you through baseline energy consumption prediction using a deep learning model. In particular, we will use Long Short-Term Memory, commonly known as the <code><i>LSTM Model</i></code>.</p>
    
<p>This project is designed to make it easier to verify the baseline energy consumption of a building, that is, its consumption without any energy efficiency improvement measures. This helps the DOE, third-party incentive providers, and the customer know how much energy they can expect to save and how much incentive the customer could receive from the DOE or incentive provider companies.
    
In general, <b>energy saving</b> is measured as the difference between:
    <ul>
        <li>the energy consumption without any measures (the baseline), and</li>
        <li>the energy consumption after an energy-saving improvement has been carried out.</li>
    </ul>
</p>
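
<p>The definition above can be sketched in a few lines of pandas. This is a minimal illustration, not the notebook's own method: the function name <code>energy_savings</code> and all numbers are made up for the example, and in practice the baseline series would come from the forecasting model.</p>

```python
import pandas as pd

def energy_savings(baseline_kwh: pd.Series, measured_kwh: pd.Series) -> pd.Series:
    """Savings per period: predicted baseline minus measured consumption."""
    # Keep only the periods present in both series before subtracting
    baseline_kwh, measured_kwh = baseline_kwh.align(measured_kwh, join="inner")
    return baseline_kwh - measured_kwh

# Illustrative numbers only, not taken from the dataset
days = pd.date_range("2017-01-01", periods=3, freq="D")
baseline = pd.Series([120.0, 115.0, 130.0], index=days)  # predicted without measures
measured = pd.Series([100.0, 105.0, 110.0], index=days)  # metered after the improvement

savings = energy_savings(baseline, measured)
total_saved = savings.sum()
percent_saved = 100 * total_saved / baseline.sum()
print(savings)
print(f"Total saved: {total_saved:.1f} kWh ({percent_saved:.1f}% of baseline)")
```

<p>A positive value means the building used less energy than the baseline predicted for that period.</p>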

<h2>Table of Contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
        <li><a href="#imports">Import <code>Python</code> Packages</a></li>
        <li><a href="#dataset">About the Data</a></li>
        <ul>
            <li><a href="#loading">Loading Data</a></li>
            <li><a href='#memory'>Reducing Memory Usage of the Data</a></li>
        </ul>
        <li><a href="#eda">Merging the Data into one set</a></li>
        <li><a href="#eda">Training Set: Exploratory Data Analysis (EDA)</a></li>
         <ul>
            <li><a href="#train">Checking for Missing Values</a></li>
            <li><a href="#weather">Checking for Missing Values in the Weather Dataset</a></li>
            <li><a href="#metadata">Checking for Missing Values in the Metadata</a></li>
        </ul>
    </ul>
    <p>
        Estimated read time: <strong>17 min</strong>
    </p>
</div>

<hr>

<h2 id="imports"><font color='darkblue'>Import <code>Python</code> Packages</font></h2>

In [30]:
# Data analysis packages:
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '%.3f'%x)

# Visualization packages:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.patches as patches

In [31]:
# import warnings
import warnings

import itertools
import gc

import modules from <b>sklearn</b>, <b>statsmodels</b>, and <b>matplotlib</b>

In [32]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.gofplots import qqplot
import statsmodels.api as sm
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
import datetime as dt
from IPython.display import HTML # to see everything
# On matplotlib >= 3.6 this style is named 'seaborn-v0_8-darkgrid'
plt.style.use('seaborn-darkgrid')
warnings.filterwarnings("ignore")
# matplotlib.rcParams['figure.dpi'] = 100

import <b>helper functions</b>

In [33]:
# helper function
import data_cleaning as dc
import helper_function as f
# import visualization as vis

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


import <b>plotly</b>

In [34]:
import plotly.express as px
import plotly.graph_objects as go

<h2 id="dataset"><font color='darkblue'>About the Data</font></h2>

<p>The data source has three <i>".csv"</i> files and all the columns are explained as follows.</p> 
<b><font color='darkred'><i>train.csv</i></font></b>
<ul>
    <li><b>building_id</b>: Foreign key for the building metadata</li>
    <li><b>meter</b>: The meter id code. Read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.</li>
    <li><b>timestamp</b>: When the measurement was taken</li>
    <li><b>meter_reading</b>: The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.</li>
</ul>
<b><font color='darkred'><i>building_metadata.csv</i></font></b>
<ul>
    <li><b>site_id</b>: Foreign key for the weather files.</li>
    <li><b>building_id</b>: Foreign key for train.csv</li>
    <li><b>primary_use</b>: Indicator of the primary category of activities for the building based on Energy-Star property type definitions.</li>
    <li><b>square_feet</b>: Gross floor area of the building.</li>
    <li><b>year_built</b>: Year building was opened.</li>
    <li><b>floor_count</b>: Number of floors of the building.</li>
</ul>
<b><font color='darkred'><i>weather_train.csv</i></font></b>: Weather data from a meteorological station as close as possible to the site.
<ul>
    <li><b>site_id</b>: Foreign key for the building metadata</li>
    <li><b>timestamp</b>: When the measurement was taken</li>
    <li><b>air_temperature</b>: Degrees Celsius</li>
    <li><b>cloud_coverage</b>: Portion of the sky covered in clouds, in oktas</li>
    <li><b>dew_temperature</b>: Degrees Celsius</li>
    <li><b>precip_depth_1_hr</b>: Millimeters</li>
    <li><b>sea_level_pressure</b>: Millibar/hectopascals</li>
    <li><b>wind_direction</b>: Compass direction (0-360)</li>
    <li><b>wind_speed</b>: Meters per second</li>
</ul>

<h3 id='loading'><font color='darkblue'>Loading Data</font></h3>

In [44]:
# training data
# %%time
train = pd.read_csv("data/train.csv", parse_dates=['timestamp'])
weather_train = pd.read_csv("data/weather_train.csv", parse_dates=['timestamp'])
metadata = pd.read_csv("data/building_metadata.csv")

print('Size of training data', train.shape)
print('Mem. size of original training data {:.2f} Mb'.format(train.memory_usage().sum()/1024**2))
print('----------------------------------')
print('Size of training weather data', weather_train.shape)
print('Mem. size of original training weather data {:.2f} Mb'.format(weather_train.memory_usage().sum()/1024**2))
print('----------------------------------')
print('Size of building meta data', metadata.shape)
print('Mem. size of original building meta data {:.2f} Mb'.format(metadata.memory_usage().sum()/1024**2))

Size of training data (20216100, 4)
Mem. size of original training data 616.95 Mb
----------------------------------
Size of training weather data (139773, 9)
Mem. size of original training weather data 9.60 Mb
----------------------------------
Size of building meta data (1449, 6)
Mem. size of original building meta data 0.07 Mb


<p>As we see above, the training dataset uses a fairly large amount of memory (about 617 Mb). Let's reduce the memory usage by calling the custom <code>reduce_mem_usage</code> function from the <i><b>data_cleaning</b></i> module, imported above as <code>dc</code>.</p>

<h3 id='memory'><font color="darkblue">Reducing Memory Usage of the Data</font></h3>

In [45]:
## Reducing memory
train = dc.reduce_mem_usage(train)
# train_df.to_csv(r'data\train_reduced.csv')
print('Shape of reduced training data', train.shape)
print('----------------------------------')
weather_data = dc.reduce_mem_usage(weather_train)
# weather_train_df.to_csv(r'data\weather_train_reduced.csv')
print('Shape of reduced training weather data', weather_data.shape)
print('----------------------------------')
metadata = dc.reduce_mem_usage(metadata)
# metadata_train_df.to_csv(r'data\metadata_train_reduced.csv')
print('Shape of reduced building meta data', metadata.shape)

Mem. usage decreased to 289.19 Mb (53.1% reduction)
Shape of reduced training data (20216100, 4)
----------------------------------
Mem. usage decreased to  3.07 Mb (68.1% reduction)
Shape of reduced training weather data (139773, 9)
----------------------------------
Mem. usage decreased to  0.03 Mb (60.3% reduction)
Shape of reduced building meta data (1449, 6)
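
<p>The source of <code>dc.reduce_mem_usage</code> is not shown in this notebook. As a rough idea of how such a helper can work, here is a minimal sketch that downcasts numeric columns with <code>pd.to_numeric</code>. The name <code>downcast_numeric</code> and the toy data are assumptions made for illustration. Note that <code>pd.to_numeric</code> never goes below <code>float32</code>, while the notebook's helper evidently produces <code>float16</code> columns, so the real implementation must differ.</p>

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for dc.reduce_mem_usage: shrink numeric dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer type that still holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 when the values survive the conversion
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Toy frame with the same kinds of columns as train.csv (values are made up)
toy = pd.DataFrame({
    "building_id": np.arange(1000, dtype="int64"),
    "meter": np.zeros(1000, dtype="int64"),
    "meter_reading": np.arange(1000) * 0.5,
})

before = toy.memory_usage().sum()
reduced = downcast_numeric(toy)
after = reduced.memory_usage().sum()
print(reduced.dtypes)
print(f"{before} bytes -> {after} bytes")
```

<p>Smaller float types trade precision for memory, so it is worth checking that downcast columns such as <code>sea_level_pressure</code> still carry enough digits for the analysis.</p>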


<p>Good. Now that we have reduced the memory usage, let's merge the training data, weather information, and building metadata into one dataset.</p>

<h2 id="eda"><font color="darkblue">Merging the Data into one set</font></h2>


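<p>We left-join the building metadata onto the training data on <code>building_id</code>, then left-join the weather data on <code>site_id</code> and <code>timestamp</code>, so every meter reading is kept.</p>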

In [56]:
import gc # import garbage collector interface
energy_consumption_data = train.merge(metadata, on='building_id', how='left')
energy_consumption_data = energy_consumption_data.merge(weather_data, on=['site_id', 'timestamp'], how='left')

# energy_consumption_data.to_csv(r"data/energy_consumption_data.csv")
print('Training dataset shape: {}'.format(energy_consumption_data.shape))
print('Mem. size of the training dataset : {:.2f} Mb'.format(energy_consumption_data.memory_usage().sum()/1024**2))

# del metadata, weather_data
# gc.collect();

Training dataset shape: (20216100, 16)
Mem. size of the training dataset : 1041.10 Mb
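
<p>When chaining left joins like this, it can help to confirm that no rows were duplicated and to see how many readings found no matching weather record. The sketch below does this on toy frames (all names ending in <code>_toy</code> and all values are made up), using the <code>validate</code> and <code>indicator</code> parameters of <code>DataFrame.merge</code>.</p>

```python
import pandas as pd

# Toy stand-ins for train, metadata and weather (values are made up)
train_toy = pd.DataFrame({
    "building_id": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"]),
    "meter_reading": [10.0, 12.0, 7.5],
})
metadata_toy = pd.DataFrame({"building_id": [0, 1], "site_id": [0, 1]})
weather_toy = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00"]),
    "air_temperature": [25.0, 24.4],
})

# validate raises MergeError if the right-hand keys are not unique,
# which would silently multiply rows in a left join
merged = train_toy.merge(metadata_toy, on="building_id", how="left",
                         validate="many_to_one")
merged = merged.merge(weather_toy, on=["site_id", "timestamp"], how="left",
                      validate="many_to_one", indicator="weather_match")

# A left join must keep exactly one output row per training row
assert len(merged) == len(train_toy)

# Rows flagged "left_only" have no weather record for that site and hour
unmatched = int((merged["weather_match"] == "left_only").sum())
print(merged)
print("Readings without weather data:", unmatched)
```

<p>In the toy data, the single reading for building 1 has no weather record for its site, so its <code>air_temperature</code> is missing after the merge.</p>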


In [47]:
energy_consumption_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20216100 entries, 0 to 20216099
Data columns (total 16 columns):
building_id           int16
meter                 int8
timestamp             datetime64[ns]
meter_reading         float32
site_id               int8
primary_use           object
square_feet           int32
year_built            float16
floor_count           float16
air_temperature       float16
cloud_coverage        float16
dew_temperature       float16
precip_depth_1_hr     float16
sea_level_pressure    float16
wind_direction        float16
wind_speed            float16
dtypes: datetime64[ns](1), float16(9), float32(1), int16(1), int32(1), int8(2), object(1)
memory usage: 1.0+ GB


<p>Good! Let's now keep only the identifiers, the meter readings, and the weather features that we will use in the analysis and forecasting.</p>

In [49]:
usecols = ["timestamp", "building_id", "meter", "meter_reading", "air_temperature", "cloud_coverage", "dew_temperature",
          "precip_depth_1_hr", "sea_level_pressure", "wind_direction", "wind_speed"]
# Map meter codes to readable labels; assign the result back rather than
# calling replace(inplace=True) on a column selection
energy_consumption_data['meter'] = energy_consumption_data['meter'].replace(
    {0: "Electricity", 1: "ChilledWater", 2: "Steam", 3: "HotWater"})
train_data = energy_consumption_data[usecols]
train_data.head(3)

Unnamed: 0,timestamp,building_id,meter,meter_reading,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,2016-01-01,0,Electricity,0.0,25.0,6.0,20.0,,1019.5,0.0,0.0
1,2016-01-01,1,Electricity,0.0,25.0,6.0,20.0,,1019.5,0.0,0.0
2,2016-01-01,2,Electricity,0.0,25.0,6.0,20.0,,1019.5,0.0,0.0


Check how many readings each of the four meter types has:

In [50]:
dc.feat_value_count(train, 'meter')

Unnamed: 0,meter_values,counts
0,0,12060910
1,1,4182440
2,2,2708713
3,3,1264037
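
<p>The table shows that electricity (meter 0) accounts for roughly 60% of all readings. The helper <code>dc.feat_value_count</code> is not shown here. A comparable table can be sketched with <code>Series.value_counts</code>. The function name <code>value_count_table</code>, the added <code>percent</code> column, and the toy data below are assumptions for illustration only.</p>

```python
import pandas as pd

# Meter codes as described in the data dictionary above
METER_LABELS = {0: "electricity", 1: "chilledwater", 2: "steam", 3: "hotwater"}

def value_count_table(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical stand-in for dc.feat_value_count, with a percent column."""
    counts = df[column].value_counts()  # sorted by count, largest first
    table = pd.DataFrame({
        f"{column}_values": counts.index.to_numpy(),
        "counts": counts.to_numpy(),
    })
    table["percent"] = 100 * table["counts"] / table["counts"].sum()
    return table

# Toy data (made up): three electricity, two chilled-water, one steam reading
toy = pd.DataFrame({"meter": [0, 0, 0, 1, 1, 2]})
table = value_count_table(toy, "meter")
table["label"] = table["meter_values"].map(METER_LABELS)
print(table)
```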


<h2 id="eda"><font color='darkblue'>Training Set: Exploratory Data Analysis (EDA)</font></h2>

Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data.

<h3 id="train"><font color='darkblue'>Checking for Missing Values</font></h3>
Here we check whether the training dataset has any missing values.

In [51]:
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()*100).sort_values(ascending=False)
missing__train_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing__train_data

Unnamed: 0,Total,Percent
meter_reading,0,0.0
timestamp,0,0.0
meter,0,0.0
building_id,0,0.0


<h3 id="weather"><font color='darkblue'>Checking for Missing Values in the Weather Dataset</font></h3>
Here we check whether the training weather data has any missing values.

In [52]:
total = weather_data.isnull().sum().sort_values(ascending=False)
percent = (weather_data.isnull().sum()/weather_data.isnull().count()*100).sort_values(ascending=False)
missing__weather_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing__weather_data

Unnamed: 0,Total,Percent
cloud_coverage,69173,49.49
precip_depth_1_hr,50289,35.979
sea_level_pressure,10618,7.597
wind_direction,6268,4.484
wind_speed,304,0.217
dew_temperature,113,0.081
air_temperature,55,0.039
timestamp,0,0.0
site_id,0,0.0


<h3 id="metadata"><font color='darkblue'>Checking for Missing Values in the Metadata </font></h3>
Here we check whether the building metadata has any missing values.

In [53]:
total = metadata.isnull().sum().sort_values(ascending=False)
percent = (metadata.isnull().sum()/metadata.isnull().count()*100).sort_values(ascending=False)
missing__metadata_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing__metadata_data

Unnamed: 0,Total,Percent
floor_count,1094,75.5
year_built,774,53.416
square_feet,0,0.0
primary_use,0,0.0
building_id,0,0.0
site_id,0,0.0
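
<p>The three checks above repeat the same lines for each DataFrame. One way to avoid the repetition is a small helper like the sketch below. The name <code>missing_summary</code> and the toy frame are assumptions for illustration. On the real data it would be called as <code>missing_summary(train)</code>, <code>missing_summary(weather_data)</code>, and <code>missing_summary(metadata)</code>.</p>

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    total = df.isnull().sum()
    percent = total / len(df) * 100
    summary = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    return summary.sort_values("Total", ascending=False)

# Toy frame (made up): 'a' has two gaps, 'c' has one, 'b' has none
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1, 2, 3, 4],
    "c": [np.nan, 1.0, 2.0, 3.0],
})

summary = missing_summary(toy)
print(summary)
```

<p>Sorting once on the combined table keeps the counts and percentages aligned by column name, rather than relying on two separately sorted Series.</p>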


                                        --- The End --- 