# Energy Efficiency of Residential Buildings

# A. Problem Understanding

When it comes to efficient building design, computing the heating load (HL) and the cooling load (CL) is
required to determine the specifications of the heating and cooling equipment needed to maintain comfortable indoor
air conditions. To estimate the required heating and cooling capacities, architects and building designers
need information about the characteristics of the building and of the conditioned space (for example, occupancy and
activity level). For this reason, we will investigate the effect of eight input variables (relative compactness (RC), surface area, wall area, roof area, overall height, orientation, glazing area, and glazing area distribution) on the output variables HL and CL of residential buildings.

To evaluate model performance we will use R squared (the R2 score). R squared is a statistical measure of how close the data are to the fitted regression line: it is the proportion of the variance in the response that the model explains. A value of 1 means the predictions match the real values exactly, so the higher the R squared, the better the model fits the data.
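
As a minimal sketch of what the R2 score measures, the cell below computes it by hand as 1 - SS_res / SS_tot and compares the result with scikit-learn's `r2_score`. The arrays `y_true` and `y_pred` are made-up values for illustration only and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted loads, for illustration only
y_true = np.array([15.5, 20.8, 21.5, 28.3, 32.4])
y_pred = np.array([16.0, 20.1, 22.3, 27.5, 31.9])

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(f"manual R2  = {r2_manual:.4f}")
print(f"sklearn R2 = {r2_score(y_true, y_pred):.4f}")
```

Both lines should print the same value, since `r2_score` implements the same formula.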

# B. Data Understanding

We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the afore-mentioned characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 features, aiming to predict two real valued responses. It can also be used as a multi-class classification problem if the response is rounded to the nearest integer.

#### 1. Data Description

**ENB2012_data.xlsx**

The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses. 

Specifically: 
* X1	Relative Compactness 
* X2	Surface Area 
* X3	Wall Area 
* X4	Roof Area 
* X5	Overall Height 
* X6	Orientation 
* X7	Glazing Area 
* X8	Glazing Area Distribution 
* y1	Heating Load 
* y2	Cooling Load

#### 2. Load The Data

In [None]:
#import library
import pandas as pd
import pandas_profiling
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
#import data
data = pd.read_csv('../input/ENB2012_data.csv')

In [None]:
# Read the data to get an overview
data

In [None]:
#Rename columns
data.columns = ['relative_compactness', 'surface_area', 'wall_area', 'roof_area', 'overall_height',
                'orientation', 'glazing_area', 'glazing_area_distribution', 'heating_load', 'cooling_load']

#### 3. Data Types 

In [None]:
# Memory usage and data types
data.info()

# C. Data Exploration

In the data exploration, we look at the distribution of each variable using a histogram. In a histogram, the horizontal axis shows the values of the feature and the vertical axis shows how often each value occurs. We then use the correlation coefficient to evaluate the strength of the linear relationship between pairs of numerical variables. All ten columns here are stored as numbers, although several of them take only a handful of distinct values and behave more like categorical variables.
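
To make the correlation coefficient concrete, here is a small sketch that computes a Pearson correlation matrix with pandas. The `toy` frame holds invented values that only borrow the column names used in this notebook.

```python
import pandas as pd

# Hypothetical toy values, not taken from the dataset
toy = pd.DataFrame({
    "overall_height": [3.5, 3.5, 7.0, 7.0, 3.5, 7.0],
    "glazing_area":   [0.0, 0.1, 0.1, 0.25, 0.4, 0.4],
    "heating_load":   [12.0, 13.5, 28.0, 31.0, 15.0, 33.5],
})

# Pearson correlation lies in [-1, 1]; values near +/-1 indicate a strong linear relationship
corr = toy.corr(method="pearson")
height_vs_load = corr.loc["overall_height", "heating_load"]

print(corr.round(2))
```

Note that a column with only two distinct values, such as `overall_height` here, can still have a high correlation with a continuous target.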

Let's get an overview of the variables and their distributions.

In [None]:
# Variables & Distribution
pandas_profiling.ProfileReport(data)

From the distributions above, an interesting fact is that most variables take only a few unique values:
* X1	Relative Compactness has 12 possible values
* X2	Surface Area has 12 possible values
* X3	Wall Area has 7 possible values
* X4	Roof Area has 4 possible values
* X5	Overall Height has 2 possible values
* X6	Orientation has 4 possible values
* X7	Glazing Area has 4 possible values
* X8	Glazing Area Distribution has 6 possible values
* y1	Heating Load has 586 possible values
* y2	Cooling Load has 636 possible values
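
The same counts can be obtained without the profiling report. As a sketch, `DataFrame.nunique()` returns the number of distinct values per column; the `toy` frame below is a made-up stand-in for `data`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe
toy = pd.DataFrame({
    "overall_height": [3.5, 3.5, 7.0, 7.0],
    "orientation": [2, 3, 4, 5],
    "heating_load": [15.55, 15.55, 20.84, 21.46],
})

# Number of distinct values in each column
unique_counts = toy.nunique()
print(unique_counts)
```

On the real data, the equivalent call would be `data.nunique()`.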

Now we want to quantify the correlation between the variables.

In [None]:
# Preview correlation
plt.figure(figsize=(12,12))
sns.heatmap(data.corr(),annot=True)

Because the heatmap is still difficult to read, we set a two-decimal number format and check the correlations again as a table.

In [None]:
# Change number format in correlations
pd.set_option('display.float_format',lambda x: '{:,.2f}'.format(x) if abs(x) < 10000 else '{:,.0f}'.format(x))
data.corr()

The table shows that there is a strong correlation between the two targets. We cannot drop either of them, because heating load and cooling load are equally important outputs to predict.

In [None]:
# Correlation between inputs and outputs
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(data=data, y_vars=['cooling_load','heating_load'],
             x_vars=['relative_compactness', 'surface_area', 'wall_area', 'roof_area', 'overall_height',
                     'orientation', 'glazing_area', 'glazing_area_distribution'])
plt.show()

From the table, we can read the correlations between all variables. For example, overall_height (an input) has a strong correlation (0.90) with the output cooling_load. The pairplot depicts the relationships between the inputs and the outputs. In the overall_height versus cooling_load panel, overall height takes only 2 values, which makes the linear relationship between the two variables difficult to see. We will apply a preprocessing step to rescale the inputs.

# D. Data Preprocessing

#### 1. Data Selection

Considering the distributions and correlations found during exploration, we will use all variables, since we want the best-fitting predictions as measured by R squared. On the training data, the R squared of a linear regression never decreases as more independent variables are added, so to get a high R squared we include all variables.

#### 2. Preprocessing & Data Transformation

In [None]:
# Check missing values
data.isnull().sum()

In [None]:
#Summary statistics
data.describe()

Each feature has a different scale, as the minimum and maximum values of each variable show. To bring the inputs to a comparable scale, we normalize them. Note that scikit-learn's Normalizer rescales each sample (row) to unit norm; it does not rescale each feature (column) to a common range.

In [None]:
#Normalize the inputs and set the output
from sklearn.preprocessing import Normalizer
nr = Normalizer(copy=False)

X = data.drop(['heating_load','cooling_load'], axis=1)
X = nr.fit_transform(X)
y = data[['heating_load','cooling_load']]
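
To illustrate the difference between row-wise and column-wise scaling, the sketch below applies `Normalizer` and `MinMaxScaler` to the same small array. The values in `X_toy` are hypothetical, and `MinMaxScaler` is shown only for comparison; it is not what this notebook uses.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Hypothetical rows: (surface_area, wall_area, overall_height)
X_toy = np.array([[514.5, 294.0, 7.0],
                  [563.5, 318.5, 7.0],
                  [808.5, 367.5, 3.5]])

row_scaled = Normalizer().fit_transform(X_toy)    # each row rescaled to unit L2 norm
col_scaled = MinMaxScaler().fit_transform(X_toy)  # each column rescaled to [0, 1]

print("row norms after Normalizer:", np.linalg.norm(row_scaled, axis=1))
print("column min after MinMaxScaler:", col_scaled.min(axis=0))
print("column max after MinMaxScaler:", col_scaled.max(axis=0))
```

After `Normalizer`, every row has length 1, so large-valued columns such as the areas still dominate each row. After `MinMaxScaler`, every column spans the same range.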

# E. Data Modelling

Let's prepare our inputs and outputs using a train-test split before we create models.

In [None]:
# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123)

Then create a function to evaluate our model using R squared (R2 score).

In [None]:
#Create model evaluation function
from sklearn.metrics import r2_score

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    # r2_score averages the R2 of both outputs by default (uniform average)
    R2 = r2_score(test_labels, predictions)
    print('R2 score = %.3f' % R2)
    # Return the computed score, not the r2_score function itself
    return R2
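
Because there are two targets, a single R2 number is an average over both. As a sketch under made-up data, `r2_score` with `multioutput="raw_values"` reports one score per target, which can show whether heating load or cooling load is harder to predict. The arrays below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical two-column targets: [heating_load, cooling_load]
y_true = np.array([[15.5, 21.3], [20.8, 28.3], [32.4, 33.9], [10.7, 14.0]])
y_pred = np.array([[16.0, 20.5], [20.1, 29.0], [31.5, 35.2], [11.2, 15.1]])

per_target = r2_score(y_true, y_pred, multioutput="raw_values")  # one R2 per column
averaged = r2_score(y_true, y_pred)  # default: multioutput="uniform_average"

print("R2 per target:", per_target)
print("averaged R2  :", averaged)
```

The averaged score equals the plain mean of the per-target scores.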

The histograms have already told us that our data are discrete, like categorical data encoded as numbers. We therefore use tree-based algorithms, which we expect to handle this type of data well. We create 3 basic models and then optimize each model using a hyperparameter search. The models we use are:
1. Decision Tree Regression
2. Random Forest Regression
3. Extra Trees Regression

#### 1. Decision Tree Regressor

In [None]:
#Import decision tree regressor
from sklearn.tree import DecisionTreeRegressor
# Create decision tree model 
dt_model = DecisionTreeRegressor(random_state=123)
# Apply the model
dt_model.fit(X_train, y_train)
# Predicted value
y_pred1 = dt_model.predict(X_test)

In [None]:
#R2 score before optimization
R2_before_dt= evaluate(dt_model, X_test, y_test)

In [None]:
# One figure with two side-by-side panels (no extra empty figures)
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10, 5))
#Visualize the heating load output before optimization
ax1.plot(range(0,len(X_test)),y_test.iloc[:,0],'o',color='red',label = 'Actual Values')
ax1.plot(range(0,len(X_test)),y_pred1[:,0],'X',color='yellow',label = 'Predicted Values')
ax1.set_xlabel('Test Cases')
ax1.set_ylabel('Heating Load')
ax1.set_title('Heating Load Before Optimization')
ax1.legend(loc = 'upper right')

#Visualize the cooling load output before optimization
ax2.plot(range(0,len(X_test)),y_test.iloc[:,1],'o',color='green',label = 'Actual Values')
ax2.plot(range(0,len(X_test)),y_pred1[:,1],'X',color='blue',label = 'Predicted Values')
ax2.set_xlabel('Test Cases')
ax2.set_ylabel('Cooling Load')
ax2.set_title('Cooling Load Before Optimization')
ax2.legend(loc = 'upper right')

plt.show()

In [None]:
# Finding the best decision tree optimization parameters

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
# Max Depth
dt_acc = []
dt_depth = range(1,11)
for i in dt_depth:
    dt = DecisionTreeRegressor(random_state=123, max_depth=i)
    dt.fit(X_train, y_train)
    dt_acc.append(dt.score(X_test, y_test))
ax1.plot(dt_depth,dt_acc)
ax1.set_title('Max Depth')

#Min Samples Split
dt_acc = []
dt_samples_split = range(10,21)
for i in dt_samples_split:
    dt = DecisionTreeRegressor(random_state=123, min_samples_split=i)
    dt.fit(X_train, y_train)
    dt_acc.append(dt.score(X_test, y_test))
ax2.plot(dt_samples_split,dt_acc)
ax2.set_title('Min Samples Split')

plt.show()


In [None]:
#Min Sample Leaf
plt.figure(figsize = (5,5))
dt_acc = []
dt_samples_leaf = range(1,10)
for i in dt_samples_leaf:
    dt = DecisionTreeRegressor(random_state=123, min_samples_leaf=i)
    dt.fit(X_train, y_train)
    dt_acc.append(dt.score(X_test, y_test))

plt.plot(dt_samples_leaf,dt_acc)
plt.title('Min Sample Leaf')

plt.show()

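The y axis shows the R2 score on the test set, which is high across the ranges tried. We then pick values around those that produce the maximum scores and use them as the candidate grid. Note that choosing ranges by looking at test scores lets information from the test set leak into model selection, so the final test scores are likely to be somewhat optimistic.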
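
One way to pick candidate values without touching the test set is to score each setting with cross-validation on the training data only. The sketch below does this for `max_depth` using scikit-learn's `validation_curve`. It runs on synthetic data from `make_regression` (an assumption, used so the cell is self-contained), and the names `X_demo`, `y_demo` and `best_depth` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with 8 features
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)

# Score each max_depth with 5-fold cross-validation on the training split only
depths = np.arange(1, 11)
train_scores, cv_scores = validation_curve(
    DecisionTreeRegressor(random_state=123), X_tr, y_tr,
    param_name="max_depth", param_range=depths, cv=5, scoring="r2")

mean_cv = cv_scores.mean(axis=1)              # average R2 over the folds
best_depth = int(depths[np.argmax(mean_cv)])
print("best max_depth by 5-fold CV:", best_depth)
```

The test split (`X_te`, `y_te`) stays unused until the final evaluation, so the reported test score is not influenced by the search.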

In [None]:
# Decision tree optimization parameters
from sklearn.model_selection import GridSearchCV
parameters = {'max_depth' : [7,8,9],
              'min_samples_split': [16,17,18],
              'min_samples_leaf' : [6,7,8]}


#Create new model using the GridSearch
dt_random = GridSearchCV(dt_model, parameters, cv=10)

#Apply the model
dt_random.fit(X_train, y_train)
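
To get a feel for the cost of this grid search, the sketch below counts the parameter combinations with `ParameterGrid` and multiplies by the 10 folds. The dictionary `dt_grid` repeats the grid from the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as used for the decision tree search
dt_grid = {'max_depth': [7, 8, 9],
           'min_samples_split': [16, 17, 18],
           'min_samples_leaf': [6, 7, 8]}

n_candidates = len(ParameterGrid(dt_grid))  # 3 * 3 * 3 combinations
n_fits = n_candidates * 10                  # one fit per candidate per fold (cv=10)
print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, GridSearchCV additionally refits the best candidate once on the full training set.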

In [None]:
#View the best parameters
dt_random.best_params_

In [None]:
# Predicted value
y_pred1_ = dt_random.best_estimator_.predict(X_test)

In [None]:
#R2 score after optimization
dt_best_random = dt_random.best_estimator_
R2_after_dt= evaluate(dt_best_random, X_test, y_test)

In [None]:
# One figure with two side-by-side panels (no extra empty figures)
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10, 5))
#Visualize the heating load output after optimization
ax1.plot(range(0,len(X_test)),y_test.iloc[:,0],'o',color='red',label = 'Actual Values')
ax1.plot(range(0,len(X_test)),y_pred1_[:,0],'X',color='yellow',label = 'Predicted Values')
ax1.set_xlabel('Test Cases')
ax1.set_ylabel('Heating Load')
ax1.set_title('Heating Load After Optimization')
ax1.legend(loc = 'upper right')

#Visualize the cooling load output after optimization
ax2.plot(range(0,len(X_test)),y_test.iloc[:,1],'o',color='green',label = 'Actual Values')
ax2.plot(range(0,len(X_test)),y_pred1_[:,1],'X',color='blue',label = 'Predicted Values')
ax2.set_xlabel('Test Cases')
ax2.set_ylabel('Cooling Load')
ax2.set_title('Cooling Load After Optimization')
ax2.legend(loc = 'upper right')

plt.show()

#### 2. Random Forest Regressor

In [None]:
#Import random forest regressor
from sklearn.ensemble import RandomForestRegressor
# Create random forest model 
rf_model = RandomForestRegressor(random_state=123)
# Apply the model
rf_model.fit(X_train, y_train)
# Predicted value
y_pred2 = rf_model.predict(X_test)

In [None]:
#R2 score before optimization
R2_before_rf= evaluate(rf_model, X_test, y_test)

In [None]:
# One figure with two side-by-side panels (no extra empty figures)
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10, 5))
#Visualize the heating load output before optimization
ax1.plot(range(0,len(X_test)),y_test.iloc[:,0],'o',color='red',label = 'Actual Values')
ax1.plot(range(0,len(X_test)),y_pred2[:,0],'X',color='yellow',label = 'Predicted Values')
ax1.set_xlabel('Test Cases')
ax1.set_ylabel('Heating Load')
ax1.set_title('Heating Load Before Optimization')
ax1.legend(loc = 'upper right')

#Visualize the cooling load output before optimization
ax2.plot(range(0,len(X_test)),y_test.iloc[:,1],'o',color='green',label = 'Actual Values')
ax2.plot(range(0,len(X_test)),y_pred2[:,1],'X',color='blue',label = 'Predicted Values')
ax2.set_xlabel('Test Cases')
ax2.set_ylabel('Cooling Load')
ax2.set_title('Cooling Load Before Optimization')
ax2.legend(loc = 'upper right')

plt.show()

In [None]:
# Finding the best random forest optimization parameters

f, axarr = plt.subplots(2, 2)

# Max Depth
rf_acc = []
rf_depth = range(1,11)
for i in rf_depth:
    rf = RandomForestRegressor(random_state=123, max_depth=i)
    rf.fit(X_train, y_train)
    rf_acc.append(rf.score(X_test, y_test))
axarr[0, 0].plot(rf_depth,rf_acc)
axarr[0, 0].set_title('Max Depth')

#Min Samples Split
rf_acc = []
rf_samples_split = range(10,21)
for i in rf_samples_split:
    rf = RandomForestRegressor(random_state=123, min_samples_split=i)
    rf.fit(X_train, y_train)
    rf_acc.append(rf.score(X_test, y_test))
axarr[0, 1].plot(rf_samples_split,rf_acc)
axarr[0, 1].set_title('Min Samples Split')

#Min Sample Leaf
rf_acc = []
rf_samples_leaf = range(1,10)
for i in rf_samples_leaf:
    rf = RandomForestRegressor(random_state=123, min_samples_leaf=i)
    rf.fit(X_train, y_train)
    rf_acc.append(rf.score(X_test, y_test))

axarr[1, 0].plot(rf_samples_leaf,rf_acc)
axarr[1, 0].set_title('Min Sample Leaf')

#N Estimator
rf_acc = []
rf_estimators = range(50,59)
for i in rf_estimators:
    rf = RandomForestRegressor(random_state=123, n_estimators=i)
    rf.fit(X_train, y_train)
    rf_acc.append(rf.score(X_test, y_test))

axarr[1, 1].plot(rf_estimators,rf_acc)
axarr[1, 1].set_title('N Estimator')

plt.show()


In [None]:
# Random forest optimization parameters
from sklearn.model_selection import GridSearchCV
parameters = {'max_depth' : [6,7,8],
              'min_samples_split': [11,12,13],
              'min_samples_leaf' : [4,5,6],
              'n_estimators': [49,50,51]}


#Create new model using the GridSearch
rf_random = GridSearchCV(rf_model, parameters, cv=10)

#Apply the model
rf_random.fit(X_train, y_train)

In [None]:
#View the best parameters
rf_random.best_params_

In [None]:
# Predicted value
y_pred2_ = rf_random.best_estimator_.predict(X_test)

In [None]:
#R2 score after optimization
best_random_rf = rf_random.best_estimator_
R2_after_rf= evaluate(best_random_rf, X_test, y_test)

In [None]:
# One figure with two side-by-side panels (no extra empty figures)
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10, 5))
#Visualize the heating load output after optimization
ax1.plot(range(0,len(X_test)),y_test.iloc[:,0],'o',color='red',label = 'Actual Values')
ax1.plot(range(0,len(X_test)),y_pred2_[:,0],'X',color='yellow',label = 'Predicted Values')
ax1.set_xlabel('Test Cases')
ax1.set_ylabel('Heating Load')
ax1.set_title('Heating Load After Optimization')
ax1.legend(loc = 'upper right')

#Visualize the cooling load output after optimization
ax2.plot(range(0,len(X_test)),y_test.iloc[:,1],'o',color='green',label = 'Actual Values')
ax2.plot(range(0,len(X_test)),y_pred2_[:,1],'X',color='blue',label = 'Predicted Values')
ax2.set_xlabel('Test Cases')
ax2.set_ylabel('Cooling Load')
ax2.set_title('Cooling Load After Optimization')
ax2.legend(loc = 'upper right')

plt.show()

#### 3. Extra Trees Regressor

In [None]:
#Import extra trees regressor
from sklearn.ensemble import ExtraTreesRegressor
# Create extra trees model 
etr_model = ExtraTreesRegressor(random_state=123)
# Apply the model
etr_model.fit(X_train, y_train)
# Predicted value
y_pred3 = etr_model.predict(X_test)

In [None]:
#R2 score before optimization
R2_before_etr= evaluate(etr_model, X_test, y_test)

In [None]:
# One figure with two side-by-side panels (no extra empty figures)
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10, 5))
#Visualize the heating load output before optimization
ax1.plot(range(0,len(X_test)),y_test.iloc[:,0],'o',color='red',label = 'Actual Values')
ax1.plot(range(0,len(X_test)),y_pred3[:,0],'X',color='yellow',label = 'Predicted Values')
ax1.set_xlabel('Test Cases')
ax1.set_ylabel('Heating Load')
ax1.set_title('Heating Load Before Optimization')
ax1.legend(loc = 'upper right')

#Visualize the cooling load output before optimization
ax2.plot(range(0,len(X_test)),y_test.iloc[:,1],'o',color='green',label = 'Actual Values')
ax2.plot(range(0,len(X_test)),y_pred3[:,1],'X',color='blue',label = 'Predicted Values')
ax2.set_xlabel('Test Cases')
ax2.set_ylabel('Cooling Load')
ax2.set_title('Cooling Load Before Optimization')
ax2.legend(loc = 'upper right')

plt.show()

In [None]:
# Finding the best extra trees regressor optimization parameters

f, axarr = plt.subplots(2, 2)

# Max Depth
etr_acc = []
etr_depth = range(1,11)
for i in etr_depth:
    etr = ExtraTreesRegressor(random_state=123, max_depth=i)
    etr.fit(X_train, y_train)
    etr_acc.append(etr.score(X_test, y_test))
axarr[0, 0].plot(etr_depth,etr_acc)
axarr[0, 0].set_title('Max Depth')

#Min Samples Split
etr_acc = []
etr_samples_split = range(16,26)
for i in etr_samples_split:
    etr = ExtraTreesRegressor(random_state=123, min_samples_split=i)
    etr.fit(X_train, y_train)
    etr_acc.append(etr.score(X_test, y_test))
axarr[0, 1].plot(etr_samples_split,etr_acc)
axarr[0, 1].set_title('Min Samples Split')

#Min Sample Leaf
etr_acc = []
etr_samples_leaf = range(3,8)
for i in etr_samples_leaf:
    etr = ExtraTreesRegressor(random_state=123, min_samples_leaf=i)
    etr.fit(X_train, y_train)
    etr_acc.append(etr.score(X_test, y_test))

axarr[1, 0].plot(etr_samples_leaf,etr_acc)
axarr[1, 0].set_title('Min Sample Leaf')

#N Estimator
etr_acc = []
etr_estimators = range(40,46)
for i in etr_estimators:
    etr = ExtraTreesRegressor(random_state=123, n_estimators=i)
    etr.fit(X_train, y_train)
    etr_acc.append(etr.score(X_test, y_test))

axarr[1, 1].plot(etr_estimators,etr_acc)
axarr[1, 1].set_title('N Estimator')

plt.show()

In [None]:
# Extra trees regressor optimization parameters
from sklearn.model_selection import GridSearchCV
parameters = {'max_depth' : [6,7,8],
              'min_samples_split': [19,20,21],
              'min_samples_leaf' : [4,5,6],
              'n_estimators': [43,44,45]}


#Create new model using the GridSearch
etr_random = GridSearchCV(etr_model, parameters, cv=10)

#Apply the model
etr_random.fit(X_train, y_train)

In [None]:
#View the best parameters
etr_random.best_params_

In [None]:
# Predicted value
y_pred3_ = etr_random.best_estimator_.predict(X_test)

In [None]:
#R2 score after optimization
best_random_etr = etr_random.best_estimator_
R2_after_etr= evaluate(best_random_etr, X_test, y_test)

In [None]:
# One figure with two side-by-side panels (no extra empty figures)
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10, 5))
#Visualize the heating load output after optimization
ax1.plot(range(0,len(X_test)),y_test.iloc[:,0],'o',color='red',label = 'Actual Values')
ax1.plot(range(0,len(X_test)),y_pred3_[:,0],'X',color='yellow',label = 'Predicted Values')
ax1.set_xlabel('Test Cases')
ax1.set_ylabel('Heating Load')
ax1.set_title('Heating Load After Optimization')
ax1.legend(loc = 'upper right')

#Visualize the cooling load output after optimization
ax2.plot(range(0,len(X_test)),y_test.iloc[:,1],'o',color='green',label = 'Actual Values')
ax2.plot(range(0,len(X_test)),y_pred3_[:,1],'X',color='blue',label = 'Predicted Values')
ax2.set_xlabel('Test Cases')
ax2.set_ylabel('Cooling Load')
ax2.set_title('Cooling Load After Optimization')
ax2.legend(loc = 'upper right')

plt.show()

# F. Evaluation

#### 1. Conclusion

Overall, the models perform well in predicting the Heating Load and Cooling Load, with an R2 score >= 97% even without hyperparameter optimization:

* R2 score of **Decision Tree Regressor** model = 97.0%
* R2 score of **Random Forest Regressor** model = 97.8%
* R2 score of **Extra Trees Regressor** model = 97.7%

To increase the R2 score, we applied hyperparameter tuning using GridSearchCV and obtained:
* R2 score of **Decision Tree Regressor** model = 97.9%
* R2 score of **Random Forest Regressor** model = 98.0%
* R2 score of **Extra Trees Regressor** model = 98.3%

We can conclude that the Extra Trees Regressor is the best model for predicting the Heating Load and Cooling Load values, with the highest R2 score of 98.3%.
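
The reported scores can be collected in one small table to compare the gain from tuning. The sketch below only re-enters the percentages listed above; the names `results` and `best_model` are illustrative.

```python
import pandas as pd

# R2 scores (in percent) as reported in the two lists above
results = pd.DataFrame(
    {"before": [97.0, 97.8, 97.7], "after": [97.9, 98.0, 98.3]},
    index=["Decision Tree", "Random Forest", "Extra Trees"])

# Gain in percentage points from hyperparameter tuning
results["gain"] = (results["after"] - results["before"]).round(1)
best_model = results["after"].idxmax()

print(results)
print("best model after tuning:", best_model)
```

The differences between the tuned models are a few tenths of a percentage point, so the ranking may depend on the particular train-test split.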

#### 2. Recommendation

R squared is sometimes not a good indicator of fit. In a linear regression, R squared (R2 score) never decreases as we add more independent variables, whereas adjusted R2 decreases if we add an independent variable that does not help the model. This makes adjusted R2 a good way to determine whether an additional variable should be included. However, adjusted R2, which penalizes model complexity to control for overfitting, generally under-penalizes complexity.
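
For reference, the usual definition of adjusted R2 is 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors. The sketch below implements it as a hypothetical helper `adjusted_r2`. The R2 value of 0.90 is an arbitrary example, while 768 and 8 are the sample and feature counts of this dataset.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same (arbitrary) R2, different numbers of predictors
adj_8 = adjusted_r2(0.90, 768, 8)
adj_7 = adjusted_r2(0.90, 768, 7)

print(f"adjusted R2 with 8 features: {adj_8:.5f}")
print(f"adjusted R2 with 7 features: {adj_7:.5f}")
```

For the same R2, the model with fewer predictors gets the higher adjusted R2, which is the penalty at work. With 768 samples the penalty per feature is small.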

In [None]:
# create a fitted model with all features
import statsmodels.formula.api as smf
data2=data.copy()
lm1 = smf.ols(formula='heating_load ~ relative_compactness + surface_area + wall_area + roof_area + overall_height + orientation + glazing_area + glazing_area_distribution', data=data2).fit()

In [None]:
# Summarize fitted model
lm1.summary()

In [None]:
# create a fitted model with all features excluding "orientation"
lm2 = smf.ols(formula='heating_load ~ relative_compactness + surface_area + wall_area + roof_area + overall_height + glazing_area + glazing_area_distribution', data=data2).fit()

In [None]:
# Summarize fitted model
lm2.summary()

Adjusted R squared is recommended over R squared as a measure of goodness of fit. The steps above are an example of creating two models: one uses all features and the other leaves out one feature. Comparing the R squared and Adj. R squared values in the two summary tables shows that adding variables does not always make the model better. So we do not need to include every variable in a multiple regression if it does not help create a better model.
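
The same comparison can be reproduced on synthetic data with plain NumPy. This is a sketch under stated assumptions: `x_noise` is constructed to be unrelated to `y`, and `ols_r2` is a hypothetical helper that fits ordinary least squares with an intercept. It is not the statsmodels workflow used above.

```python
import numpy as np

rng = np.random.default_rng(123)
n = 200
x1 = rng.normal(size=n)
x_noise = rng.normal(size=n)                  # unrelated to y by construction
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

def ols_r2(features, target):
    """Return (R2, adjusted R2) of an OLS fit with intercept."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    r2 = 1 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
    p = design.shape[1] - 1                   # number of predictors
    adj = 1 - (1 - r2) * (len(target) - 1) / (len(target) - p - 1)
    return r2, adj

r2_small, adj_small = ols_r2([x1], y)
r2_big, adj_big = ols_r2([x1, x_noise], y)

print(f"1 feature : R2={r2_small:.5f}  adj R2={adj_small:.5f}")
print(f"2 features: R2={r2_big:.5f}  adj R2={adj_big:.5f}")
```

R2 of the larger model is never below that of the smaller one, because the models are nested. Whether adjusted R2 goes up or down depends on how much the extra column happens to reduce the residuals.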