# Office Building Energy Prediction Demonstration - Random Forest

This notebook uses the Building Data Genome Project data to illustrate how a prediction competition based on whole-building electrical meter data could be set up.

The open data set we're using for this demonstration is the Building Data Genome Project (https://github.com/buds-lab/the-building-data-genome-project).

First we'll load the metadata and take a look around. These data show the diversity of building types in this machine learning exercise.


In [None]:
import pandas as pd
import os
import numpy as np

In [None]:
meta = pd.read_csv("../input/meta_open.csv", index_col='uid', parse_dates=["datastart","dataend"], dayfirst=True)

In [None]:
meta.head(30)

In [None]:
meta.info()

The metadata covers 507 buildings, with a range of attributes available for each.

In this analysis, let's focus only on the office buildings with one full year of data in 2015.

In [None]:
meta.datastart.value_counts()

In [None]:
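# Office buildings whose data starts on 1 January 2015, matching the full-year-2015 focus stated above
meta[(meta.datastart == '2015-01-01') & (meta.primaryspaceusage == "Office")]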

In [None]:
temporal = pd.read_csv("../input/temp_open_utc_complete.csv", index_col='timestamp', parse_dates=True).tz_localize('utc')

In [None]:
temporal.info()

In [None]:
temporal.iloc[:,:10].info()

The temporal data for each building consists of 8760 hourly points, which is one full year. Each building has its own start and end times (`datastart` and `dataend` in the metadata) and its own weather file.

# Single building energy prediction example

We will take one of the buildings to demonstrate a simple forecasting example.

**We will take 12 months of hourly data, remove one out of every four months, and attempt to predict those gaps. This means that 25% of the data set is used for testing and 75% for training.**
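The month-based split can be sketched on synthetic data before touching the real meter readings. The `demo_*` names and the constant values below are hypothetical; only the idea of holding out months 4, 8 and 12 comes from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for one building: a year of hourly readings (values are arbitrary)
demo_index = pd.date_range("2015-01-01", periods=8760, freq="h")
demo = pd.DataFrame({"demo_building": 1.0}, index=demo_index)

# Every fourth month is held out; index.month holds integers, so the list uses integers too
test_months = [4, 8, 12]
is_test = demo.index.month.isin(test_months)

demo_train = demo[~is_test]
demo_test = demo[is_test]

test_share = len(demo_test) / len(demo)
print(f"train rows: {len(demo_train)}, test rows: {len(demo_test)}, test share: {test_share:.1%}")
```

Because April has 30 days while August and December have 31, the held-out share is close to, but not exactly, 25%.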

First, we need to extract a single building and convert its data to the building's local time zone.

In [None]:
singlebuilding = "Office_Bobbi"
single_timezone = meta.T[singlebuilding].timezone
single_start = meta.T[singlebuilding].datastart
single_end = meta.T[singlebuilding].dataend
single_building_data = pd.DataFrame(temporal[singlebuilding].tz_convert(single_timezone).truncate(before=single_start,after=single_end))

In [None]:
single_building_data.plot(figsize=(15,3))

We can resample to daily totals to smooth out the data and see the macro-level trends.

In [None]:
single_building_data.info()

In [None]:
single_building_data.resample("D").sum().plot(figsize=(15,3))

In [None]:
single_building_data.truncate(after='2015-02').plot(figsize=(15,3))

In [None]:
# index.month holds integers, so compare against integer month numbers
single_building_data.dropna().index.month.isin([1, 2, 3, 5, 6, 7, 9, 10, 11])

In [None]:
trainingdata = single_building_data[single_building_data.index.month.isin([1, 2, 3, 5, 6, 7, 9, 10, 11])]

In [None]:
trainingdata.plot(figsize=(15,3))

In [None]:
trainingdata.info()

In [None]:
# Months 4, 8 and 12 are held out as the test set
testdata = single_building_data[single_building_data.index.month.isin([4, 8, 12])]

In [None]:
testdata.info()

# Building a simple open-source model to fill in the gaps

Using this link as a guide: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0

To fill in the gaps, we will use a very basic implementation of the random forest model from `scikit-learn`.

We will use the following features at each timestamp:
- Day of Week
- Hour of Day
- Outdoor Air Temperature

Remember, this is a very simple example.
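The three features listed above can be assembled as in the following minimal sketch. It assumes one-hot encoding for hour and day of week, as the later cells do; the timestamps, the temperatures and the `demo_*` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: one week of hourly timestamps and made-up temperatures in deg C
demo_index = pd.date_range("2015-01-05", periods=7 * 24, freq="h")
demo_temp = pd.Series(np.linspace(-5.0, 10.0, len(demo_index)))

demo_features = pd.concat(
    [
        pd.get_dummies(pd.Series(demo_index.hour), prefix="hour", dtype=float),      # 24 columns
        pd.get_dummies(pd.Series(demo_index.dayofweek), prefix="dow", dtype=float),  # 7 columns
        demo_temp.rename("temperature_c"),                                           # 1 column
    ],
    axis=1,
)

print(demo_features.shape)
```

One-hot encoding only creates columns for values that actually occur, so the training and test sets each need to contain every hour and every day of the week for their feature matrices to have the same width.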

First, let's grab the weather data

In [None]:
weatherfilename = meta.T[singlebuilding].newweatherfilename

In [None]:
weatherfilename

In [None]:
weather = pd.read_csv(os.path.join("../input/",weatherfilename),index_col='timestamp', parse_dates=True, na_values='-9999')
weather = weather.tz_localize(single_timezone, ambiguous = 'infer')

In [None]:
weather.info()

Let's keep only the temperature columns and resample them to hourly means.

In [None]:
# Lowercase "h" is the hourly alias preferred by recent pandas releases
outdoor_temp = pd.DataFrame(weather[[col for col in weather.columns if 'Temperature' in col]]).resample("h").mean()

In [None]:
outdoor_temp.info()

In [None]:
# Put the temperatures on a complete hourly grid as long as the meter data, then fill gaps forward and backward
outdoor_temp = outdoor_temp.reindex(pd.date_range(start=outdoor_temp.index[0], periods=len(single_building_data), freq="h")).ffill().bfill()

In [None]:
outdoor_temp.info()
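The reindex-and-fill step above can be seen on a small example. This is only a sketch with invented readings (`raw_temp`, `full_index` and `aligned` are hypothetical names); it shows how hours missing from the weather file end up carrying the last known temperature.

```python
import pandas as pd

# Hypothetical weather readings with the hours 02:00 and 03:00 missing
raw_index = pd.to_datetime(
    ["2015-01-01 00:00", "2015-01-01 01:00", "2015-01-01 04:00", "2015-01-01 05:00"]
)
raw_temp = pd.DataFrame({"TemperatureC": [1.0, 2.0, 5.0, 6.0]}, index=raw_index)

# Complete hourly grid starting at the first reading
full_index = pd.date_range(start=raw_temp.index[0], periods=6, freq="h")

# Missing hours become NaN after reindexing; forward fill, then backward fill for any leading gap
aligned = raw_temp.reindex(full_index).ffill().bfill()
print(aligned)
```

Forward filling is a simple choice that suits short gaps; longer gaps in a real weather file might call for interpolation instead.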

## Create the training data

In [None]:
outdoor_temp[outdoor_temp.index.month.isin([1, 2, 3, 5, 6, 7, 9, 10, 11])].TemperatureC.values

In [None]:
pd.get_dummies(trainingdata.index.dayofweek).head()

In [None]:
# Features: one-hot hour of day, one-hot day of week, outdoor air temperature
train_features = np.array(pd.concat([pd.get_dummies(trainingdata.index.hour),
                                     pd.get_dummies(trainingdata.index.dayofweek),
                                     pd.Series(outdoor_temp[outdoor_temp.index.month.isin([1, 2, 3, 5, 6, 7, 9, 10, 11])].TemperatureC.values)], axis=1))

In [None]:
train_features.shape

In [None]:
train_labels = np.array(trainingdata[singlebuilding].values)

In [None]:
train_labels

In [None]:
train_labels.shape

## Create the test data

In [None]:
# Same features as the training set, built from the held-out months
test_features = np.array(pd.concat([pd.get_dummies(testdata.index.hour),
                                    pd.get_dummies(testdata.index.dayofweek),
                                    pd.Series(outdoor_temp[outdoor_temp.index.month.isin([4, 8, 12])].TemperatureC.values)], axis=1))

In [None]:
test_labels = testdata[singlebuilding].values


In [None]:
test_labels.shape

## Use a random forest model to predict the test set and calculate the results

### Train Model
After all the work of data preparation, creating and training the model is pretty simple using scikit-learn. We import the random forest regression model from scikit-learn, instantiate the model, and fit (scikit-learn's name for training) the model on the training data, setting the random state for reproducible results. This entire process is only three lines in scikit-learn!

In [None]:
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels);

### Make Predictions on the Test Set
Our model has now been trained to learn the relationships between the features and the targets. The next step is figuring out how good the model is. To do this, we make predictions on the test features (the model is never allowed to see the test answers) and compare them to the known answers. Because some predictions will be too low and some too high, signed errors would partly cancel when averaged, so we take the absolute value of each error to measure how far the predictions are from the actual values on average.



In [None]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2))


In [None]:
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)
# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
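One caveat about the MAPE above: each error is divided by the actual reading, so an hour with a reading of exactly zero produces an infinite term and the mean stops being informative. The sketch below shows one possible workaround, which is to leave zero readings out of the average. That choice is an assumption for illustration, not something this notebook does, and the numbers are made up.

```python
import numpy as np

# Made-up readings (one of them zero) and made-up predictions
actual = np.array([10.0, 0.0, 20.0, 40.0])
predicted = np.array([12.0, 1.0, 18.0, 40.0])

# Keep only the points where the percentage error is defined
nonzero = actual != 0
demo_mape = 100 * np.mean(np.abs(predicted[nonzero] - actual[nonzero]) / actual[nonzero])
print(round(demo_mape, 2))
```

Dropping points changes what the metric measures, so the number of excluded hours is worth reporting alongside the result.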

In [None]:
NMBE = 100 * (sum(test_labels - predictions) / (pd.Series(test_labels).count() * np.mean(test_labels)))
CVRSME = 100 * ((sum((test_labels - predictions)**2) / (pd.Series(test_labels).count()-1))**(0.5)) / np.mean(test_labels)

In [None]:
CVRSME

In [None]:
NMBE
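As a reading aid, the two expressions computed above can be written out as follows, where $y_i$ is the measured value, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the measured values and $n$ the number of test points. The variable named `CVRSME` holds the quantity usually abbreviated CV(RMSE), the coefficient of variation of the root mean square error.

$$
\mathrm{NMBE} = 100 \times \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)}{n \, \bar{y}}
$$

$$
\mathrm{CV(RMSE)} = \frac{100}{\bar{y}} \times \sqrt{\frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{n - 1}}
$$

NMBE keeps the sign of the errors, so it shows whether the model over- or under-predicts on average, while CV(RMSE) measures the typical size of the error relative to the mean load.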

Calculate R squared

In [None]:
from sklearn.metrics import r2_score

In [None]:
r2_score(test_labels, predictions)

# Visualize the model

In [None]:
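# testdata is a slice of single_building_data, so add the column on a new frame instead of assigning in place
testdata = testdata.assign(Prediction=predictions)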

In [None]:
testdata.head()

In [None]:
testdata.columns = ['Actual','Prediction']

In [None]:
testdata.plot(figsize=(15,3))

In [None]:
testdata.truncate(after='2015-05-01').plot(figsize=(15,3))

In [None]:
testdata.resample("D").sum().plot(figsize=(15,3))

# Create and test a model for every building in the data set

Let's start with only the office buildings.

In [None]:
buildingnames = temporal.columns[temporal.columns.str.contains("Office")]

In [None]:
buildingnames

In [None]:
def get_model(buildingnames, meta, temporal):
    """Train and test one random forest per building; return four dicts of metrics keyed by building name."""
    # index.month holds integers, so the month lists use integers too
    train_months = [1, 2, 3, 5, 6, 7, 9, 10, 11]
    test_months = [4, 8, 12]

    MAPE_data = {}
    RSQUARED_data = {}
    NMBE_data = {}
    CVRSME_data = {}

    for singlebuilding in buildingnames:
        print("Modelling: "+singlebuilding)
        try:
            # Get Data
            single_timezone = meta.T[singlebuilding].timezone
            single_start = meta.T[singlebuilding].datastart
            single_end = meta.T[singlebuilding].dataend
            single_building_data = pd.DataFrame(temporal[singlebuilding].tz_convert(single_timezone).truncate(before=single_start,after=single_end))

            # Split into Training and Testing
            trainingdata = single_building_data[single_building_data.index.month.isin(train_months)]
            testdata = single_building_data[single_building_data.index.month.isin(test_months)]

            # Get weather file
            weatherfilename = meta.T[singlebuilding].newweatherfilename
            print("Weatherfile: "+weatherfilename)
            weather = pd.read_csv(os.path.join("../input/",weatherfilename),index_col='timestamp', parse_dates=True, na_values='-9999')
            weather = weather.tz_localize(single_timezone, ambiguous = 'infer')
            outdoor_temp = pd.DataFrame(weather[[col for col in weather.columns if 'Temperature' in col]]).resample("h").mean()
            # Complete hourly grid as long as the meter data, gaps filled forward then backward
            outdoor_temp = outdoor_temp.reindex(pd.date_range(start=outdoor_temp.index[0], periods=len(single_building_data), freq="h")).ffill().bfill()

            # Create training data array
            train_features = np.array(pd.concat([pd.get_dummies(trainingdata.index.hour),
                                                 pd.get_dummies(trainingdata.index.dayofweek),
                                                 pd.Series(outdoor_temp[outdoor_temp.index.month.isin(train_months)].TemperatureC.values)], axis=1))
            train_labels = np.array(trainingdata[singlebuilding].values)

            # Create test data array
            test_features = np.array(pd.concat([pd.get_dummies(testdata.index.hour),
                                                pd.get_dummies(testdata.index.dayofweek),
                                                pd.Series(outdoor_temp[outdoor_temp.index.month.isin(test_months)].TemperatureC.values)], axis=1))
            test_labels = np.array(testdata[singlebuilding].values)

            # Make model
            rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
            # Train the model on training data
            rf.fit(train_features, train_labels)

            # Use the forest's predict method on the test data
            predictions = rf.predict(test_features)
            # Calculate the absolute errors
            errors = abs(predictions - test_labels)

            # Calculate the metrics for this building
            MAPE = 100 * np.mean((errors / test_labels))
            NMBE = 100 * (sum(test_labels - predictions) / (pd.Series(test_labels).count() * np.mean(test_labels)))
            CVRSME = 100 * ((sum((test_labels - predictions)**2) / (pd.Series(test_labels).count()-1))**(0.5)) / np.mean(test_labels)
            RSQUARED = r2_score(test_labels, predictions)

            print("MAPE: "+str(MAPE))
            print("NMBE: "+str(NMBE))
            print("CVRSME: "+str(CVRSME))
            print("R SQUARED: "+str(RSQUARED))

            MAPE_data[singlebuilding] = MAPE
            NMBE_data[singlebuilding] = NMBE
            CVRSME_data[singlebuilding] = CVRSME
            RSQUARED_data[singlebuilding] = RSQUARED

        except Exception as error:
            # Report which building failed and why, then continue with the next one
            print("There was a problem with "+singlebuilding+": "+str(error))

    return MAPE_data, NMBE_data, CVRSME_data, RSQUARED_data

In [None]:
MAPE_data, NMBE_data, CVRSME_data, RSQUARED_data = get_model(buildingnames, meta, temporal)

In [None]:
metrics_office = pd.DataFrame([MAPE_data, NMBE_data, CVRSME_data, RSQUARED_data]).T
metrics_office.columns = ["MAPE", "NMBE", "CVRSME", "RSQUARED"]

In [None]:
metrics_office

In [None]:
metrics_office.to_csv("RF_metrics_office.csv")

In [None]:
metrics_office[metrics_office<100].hist(bins=30, figsize=(10,10))

# Let's look at all the other building types also

## Let's do dormitories now

In [None]:
buildingnames_dorm = temporal.columns[temporal.columns.str.contains("UnivDorm")]

In [None]:
buildingnames_dorm

In [None]:
MAPE_data, NMBE_data, CVRSME_data, RSQUARED_data = get_model(buildingnames_dorm, meta, temporal)

In [None]:
metrics_dorm = pd.DataFrame([MAPE_data, NMBE_data, CVRSME_data, RSQUARED_data]).T
metrics_dorm.columns = ["MAPE", "NMBE", "CVRSME", "RSQUARED"]

In [None]:
metrics_dorm

In [None]:
metrics_dorm.to_csv("RF_metrics_dorm.csv")

In [None]:
metrics_dorm[metrics_dorm<100].hist(bins=30, figsize=(10,10))