<h1><font color='darkblue'>Great Energy Predictor</font></h1>

<p><strong>Forecasting building energy consumption</strong> is highly valuable for energy efficiency and sustainability research. Accurate forecasting models support planning and energy optimization for buildings and campuses.</p>

New buildings, for which no recorded consumption data is available, rely on computer simulations for energy analysis and for forecasting future consumption. For existing buildings with recorded energy consumption, however, statistical and machine learning techniques have proved to be more accurate and faster forecasting methods.

This is where this machine learning capstone project comes in handy: it aims to better forecast the baseline energy consumption from historical data.

<h2>Table of Contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
        <li><a href="#imports">Imports</a></li>
        <li><a href="#dataset">The Dataset</a></li>
        <li><a href="#files">Files</a></li>
        <li><a href="#memory">Reducing Memory</a></li>
        <li><a href="#merge">Merging files into one Dataset</a></li>
        <li><a href="#eda">Training data: Exploratory Data Analysis (EDA)</a></li>
         <ul>
            <li><a href="#train">Checking for Missing Data</a></li>
            <li><a href="#weather">Checking for Missing Values in the Weather Dataset</a></li>
            <li><a href="#metadata">Checking for Missing Values in the Metadata</a></li>
        </ul>
    </ul>
    <p>
        Estimated read time: <strong>17 min</strong>
    </p>
</div>

<hr>

<h2 id="imports"><font color='darkblue'>Imports</font></h2>

In [3]:
# Data analysis packages:
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '%.3f'%x)

import warnings

import datetime as dt
from IPython.display import HTML # to see everything
warnings.filterwarnings("ignore")

import modules

In [10]:
# Visualization packages:
import matplotlib.pyplot as plt
%matplotlib inline

# import modules
import data_cleaning as dc

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


<strong>Energy savings</strong> has two key elements:
<ul>
    <li>Forecasting energy consumption without any improvement</li>
    <li>Forecasting energy consumption after a certain improvement</li>
</ul>
This project is dedicated to predicting the baseline energy consumption for the four energy meter types. This notebook covers the complementary data cleaning and exploratory data analysis (EDA).

<h2 id="dataset"><font color='darkblue'>The Dataset</font></h2>

In [5]:
# training data
# %%time
train_data = pd.read_csv("data/train.csv")#, parse_dates=['timestamp'])
train_weather = pd.read_csv("data/weather_train.csv")#, parse_dates=['timestamp'])
metadata = pd.read_csv("data/building_metadata.csv")

print('Size of training data', train_data.shape)
print('Mem. size of original training data {:.2f} Mb'.format(train_data.memory_usage().sum()/1024**2))
print('----------------------------------')
print('Size of training weather data', train_weather.shape)
print('Mem. size of original training weather data {:.2f} Mb'.format(train_weather.memory_usage().sum()/1024**2))
print('----------------------------------')
print('Size of building meta data', metadata.shape)
print('Mem. size of original building meta data {:.2f} Mb'.format(metadata.memory_usage().sum()/1024**2))

Size of training data (20216100, 4)
Mem. size of original training data 616.95 Mb
----------------------------------
Size of training weather data (139773, 9)
Mem. size of original training weather data 9.60 Mb
----------------------------------
Size of building meta data (1449, 6)
Mem. size of original building meta data 0.07 Mb
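A note on the memory figures above: `memory_usage()` without `deep=True` counts only the references held by object (string) columns, not the strings themselves. The 616.95 Mb reported for the training data is consistent with 8 bytes per cell, so the real footprint of the unparsed `timestamp` column is probably larger. The following is a minimal sketch on a toy frame (the data and variable names are made up for illustration) comparing the shallow figure, the deep figure, and the deep figure after parsing the timestamps:

```python
import pandas as pd

# Toy stand-in for train.csv. The timestamp column is forced to object dtype
# to mimic reading the CSV without parse_dates.
toy = pd.DataFrame({
    "building_id": [0, 1, 2, 3],
    "meter": [0, 0, 1, 3],
    "timestamp": pd.Series(["2016-01-01 00:00:00"] * 4, dtype=object),
    "meter_reading": [0.0, 12.5, 3.25, 7.0],
})

# Shallow: object columns are counted as one reference per cell.
shallow = toy.memory_usage().sum()

# Deep: the string payloads are counted as well.
deep = toy.memory_usage(deep=True).sum()

# Parsing the strings into datetime64 stores a fixed 8 bytes per value.
parsed = toy.assign(timestamp=pd.to_datetime(toy["timestamp"]))
parsed_deep = parsed.memory_usage(deep=True).sum()

print(f"shallow: {shallow} bytes, deep: {deep} bytes, parsed: {parsed_deep} bytes")
```

If this holds for the full dataset, passing `parse_dates=['timestamp']` (currently commented out in the cell above) would shrink the real footprint even though the shallow figure stays about the same.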


<h2 id="files"><font color='darkblue'>Files</font></h2>
<div class="dataset">
    train.csv
    <ul>
        <li><strong>building_id</strong> - Foreign key for the building metadata.</li>
        <li><strong>meter</strong> - The meter id code. Read as {0: electricity, 1: chilled water, 2: steam, 3: hot water}. Not every building has all meter types.</li>
        <li><strong>timestamp</strong> - When the measurement was taken.</li>
        <li><strong>meter_reading</strong> - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.</li>
    </ul>
    building_metadata.csv
    <ul>
        <li><strong>site_id</strong> - Foreign key for the weather files.</li>
        <li><strong>building_id</strong> - Foreign key for train.csv.</li>
        <li><strong>primary_use</strong> - Indicator of the primary category of activities for the building based on EnergyStar property type definitions.</li>
        <li><strong>square_feet</strong> - Gross floor area of the building.</li>
        <li><strong>year_built</strong> - Year building was opened</li>
        <li><strong>floor_count</strong> - Number of floors of the building.</li>
    </ul>
    weather_train.csv - Weather data from a meteorological station as close as possible to the site.
    <ul>
        <li><strong>site_id</strong> - Primary key for the weather files.</li>
        <li><strong>air_temperature</strong> - Degrees Celsius.</li>
        <li><strong>cloud_coverage</strong> - Portion of the sky covered in clouds, in oktas.</li>
        <li><strong>dew_temperature</strong> - Degrees Celsius.</li>
        <li><strong>precip_depth_1_hr</strong> - Millimeters.</li>
        <li><strong>sea_level_pressure</strong> - Millibar/hectopascals.</li>
        <li><strong>wind_direction</strong> - Compass direction (0-360).</li>
        <li><strong>wind_speed</strong> - Meters per second.</li>
    </ul>
    
</div>

<h2 id="memory"><font color='darkblue'>Reducing Memory</font></h2>
As we saw above, the training file is large. Let's try to reduce the dataset's memory usage.

In [6]:
## Reducing memory
train = dc.reduce_mem_usage(train_data)
# train_df.to_csv(r'data\train_reduced.csv')
print('Mem. size of reduced training data', train.shape)
print('----------------------------------')
weather_data = dc.reduce_mem_usage(train_weather)
# weather_train_df.to_csv(r'data\weather_train_reduced.csv')
print('Mem. size of reduced training weather data', weather_data.shape)
print('----------------------------------')
metadata = dc.reduce_mem_usage(metadata)
# metadata_train_df.to_csv(r'data\metadata_train_reduced.csv')
print('Mem. size of reduced building meta data', metadata.shape)

Mem. usage decreased to 289.19 Mb (53.1% reduction)
Mem. size of reduced training data (20216100, 4)
----------------------------------
Mem. usage decreased to  3.07 Mb (68.1% reduction)
Mem. size of reduced training weather data (139773, 9)
----------------------------------
Mem. usage decreased to  0.03 Mb (60.3% reduction)
Mem. size of reduced building meta data (1449, 6)
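The implementation of `dc.reduce_mem_usage` lives in the project's `data_cleaning` module and is not shown in this notebook. A common way to get reductions of this kind is to downcast each numeric column to the smallest dtype that can hold its values. The sketch below illustrates that idea only; `downcast_numeric` is a hypothetical helper, not the project's function, and the toy data is made up:

```python
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns downcast to smaller dtypes.

    Hypothetical helper for illustration; not the project's
    dc.reduce_mem_usage.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer type that fits the values.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 at the smallest; pandas does not go below that.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

toy = pd.DataFrame({
    "building_id": [0, 1, 1448],
    "meter": [0, 1, 3],
    "meter_reading": [0.0, 12.5, 250.75],
})
small = downcast_numeric(toy)

before = toy.memory_usage().sum()
after = small.memory_usage().sum()
print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Downcasting floats to `float32` trades precision for memory, so it is worth confirming that the lost precision does not matter for the target variable before applying it.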


Good. Now that we have reduced the memory usage, let's merge the training data, weather information, and building metadata into one dataset.

<h2 id="merge"><font color='darkblue'>Merging files into one Dataset</font></h2>

In [7]:
import gc # import garbage collector interface
energy_consumption_data = train.merge(metadata, on='building_id', how='left')
energy_consumption_data = energy_consumption_data.merge(weather_data, on=['site_id', 'timestamp'], how='left')

# energy_consumption_data.to_csv(r"data/energy_consumption_data.csv")
print('Training dataset shape: {}'.format(energy_consumption_data.shape))
print('Mem. size of the training dataset : {:.2f} Mb'.format(energy_consumption_data.memory_usage().sum()/1024**2))

# del metadata, weather_data
# gc.collect();


Training dataset shape: (20216100, 16)
Mem. size of the training dataset : 1041.10 Mb


In [8]:
energy_consumption_data.head()

Unnamed: 0,building_id,meter,timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,0,2016-01-01 00:00:00,0.0,0,Education,7432,2008.0,,25.0,6.0,20.0,,1019.5,0.0,0.0
1,1,0,2016-01-01 00:00:00,0.0,0,Education,2720,2004.0,,25.0,6.0,20.0,,1019.5,0.0,0.0
2,2,0,2016-01-01 00:00:00,0.0,0,Education,5376,1991.0,,25.0,6.0,20.0,,1019.5,0.0,0.0
3,3,0,2016-01-01 00:00:00,0.0,0,Education,23685,2002.0,,25.0,6.0,20.0,,1019.5,0.0,0.0
4,4,0,2016-01-01 00:00:00,0.0,0,Education,116607,1975.0,,25.0,6.0,20.0,,1019.5,0.0,0.0
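Because both joins are left joins, every meter reading is kept, and readings without a matching weather row simply get missing weather values. The sketch below shows one way to sanity-check such a merge on toy data, using the `validate` and `indicator` parameters of `DataFrame.merge`. The three small frames are made up for illustration and only mimic the key columns of the real files:

```python
import pandas as pd

# Toy stand-ins for train, metadata and weather_data.
readings = pd.DataFrame({
    "building_id": [0, 0, 1],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00",
                  "2016-01-01 00:00:00"],
    "meter_reading": [0.0, 1.5, 3.0],
})
buildings = pd.DataFrame({"building_id": [0, 1], "site_id": [0, 1]})
weather = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00"],
    "air_temperature": [25.0, 24.5],
})

# validate="many_to_one" raises if the right-hand keys are not unique,
# which would otherwise silently duplicate readings.
merged = readings.merge(buildings, on="building_id", how="left",
                        validate="many_to_one")

# indicator=True adds a "_merge" column marking rows with no weather match.
merged = merged.merge(weather, on=["site_id", "timestamp"], how="left",
                      validate="many_to_one", indicator=True)

unmatched = int((merged["_merge"] == "left_only").sum())
print(f"rows: {len(merged)}, readings without weather: {unmatched}")
```

In this toy example, site 1 has no weather rows, so its reading is kept with a missing `air_temperature`. The same check on the full dataset would show how many readings have no weather record at all, as opposed to a weather record with individual missing fields.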


Next, we check how many readings each of the four meter types has.

In [11]:
dc.feat_value_count(train, 'meter')

Unnamed: 0,meter_values,counts
0,0,12060910
1,1,4182440
2,2,2708713
3,3,1264037
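`dc.feat_value_count` is defined in the project's `data_cleaning` module, which is not shown here. Judging only by the output columns above, a table like this could be produced with `value_counts`. The sketch below is an assumption about the general shape, not the project's code; `value_count_table` and `METER_NAMES` are hypothetical names, and the meter labels follow the mapping given in the Files section:

```python
import pandas as pd

# Meter codes as described in the Files section.
METER_NAMES = {0: "electricity", 1: "chilled water", 2: "steam", 3: "hot water"}

def value_count_table(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Count occurrences of each value in a column (hypothetical helper)."""
    counts = df[column].value_counts().sort_index()
    return pd.DataFrame({
        f"{column}_values": counts.index,
        "counts": counts.to_numpy(),
    })

# Toy data for illustration only.
toy = pd.DataFrame({"meter": [0, 0, 0, 1, 2, 3, 1]})
table = value_count_table(toy, "meter")

# Attach readable labels so the codes do not need to be memorized.
table["meter_name"] = table["meter_values"].map(METER_NAMES)
print(table)
```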


<h2 id="eda"><font color='darkblue'>Training data: Exploratory Data Analysis (EDA)</font></h2>
Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data.

<h3 id="train"><font color='darkblue'>Checking for Missing Data</font></h3>
Here we check whether the training dataset has any missing values.

In [12]:
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()*100).sort_values(ascending=False)
missing__train_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing__train_data

Unnamed: 0,Total,Percent
meter_reading,0,0.0
timestamp,0,0.0
meter,0,0.0
building_id,0,0.0


<h3 id="weather"><font color='darkblue'>Checking for Missing Values in the Weather Dataset</font></h3>
Here we check whether the training weather data has any missing values.

In [15]:
total = weather_data.isnull().sum().sort_values(ascending=False)
percent = (weather_data.isnull().sum()/weather_data.isnull().count()*100).sort_values(ascending=False)
missing__weather_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing__weather_data

Unnamed: 0,Total,Percent
cloud_coverage,69173,49.49
precip_depth_1_hr,50289,35.979
sea_level_pressure,10618,7.597
wind_direction,6268,4.484
wind_speed,304,0.217
dew_temperature,113,0.081
air_temperature,55,0.039
timestamp,0,0.0
site_id,0,0.0


<h3 id="metadata"><font color='darkblue'>Checking for Missing Values in the Metadata </font></h3>
Here we check whether the building metadata has any missing values.

In [16]:
total = metadata.isnull().sum().sort_values(ascending=False)
percent = (metadata.isnull().sum()/metadata.isnull().count()*100).sort_values(ascending=False)
missing__metadata_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing__metadata_data

Unnamed: 0,Total,Percent
floor_count,1094,75.5
year_built,774,53.416
square_feet,0,0.0
primary_use,0,0.0
building_id,0,0.0
site_id,0,0.0
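The three missing-value checks above repeat the same pattern for each table. As a sketch, the pattern could be wrapped in a small helper; `missing_summary` is a hypothetical name introduced here for illustration, and the toy frame is made up:

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column count and percentage of missing values, largest first."""
    total = df.isnull().sum()
    # The mean of a boolean mask is the fraction of True values.
    percent = df.isnull().mean() * 100
    summary = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    return summary.sort_values("Total", ascending=False)

# Toy weather-like frame for illustration only.
toy = pd.DataFrame({
    "site_id": [0, 0, 1, 1],
    "air_temperature": [25.0, np.nan, 10.0, 9.5],
    "cloud_coverage": [np.nan, np.nan, 4.0, np.nan],
})
summary = missing_summary(toy)
print(summary)
```

With such a helper, each check reduces to a single call, for example `missing_summary(weather_data)`.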

