# Q19
Energy Efficiency
A study assessed the heating load and cooling load requirements of buildings (that is, their energy efficiency) as a function of building parameters.
The energy analysis uses 12 different building shapes. The dataset comprises 768 samples and 8 features, and the goal is to predict two real-valued responses (heating load and cooling load).

File: MLR_Q19_BuildingEfficiency.csv
https://drive.google.com/drive/u/0/folders/1ILKastUTJWccxaxIpJpjqCJDpsMJ-oC8

All variables are numerical except 'Orientation', which is categorical.

    1) Which features impact the heating load?
    2) Which features impact the cooling load?

In [1]:
import numpy as np
import pandas as pd 

import seaborn as sns 
import matplotlib.pyplot as plt

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

In [3]:
df = pd.read_csv("MLR_Q19_BuildingEfficiency.csv")

df.head(1)

Unnamed: 0,Relative_Compactness,Surface_Area,Wall_Area,Roof_Area,Overall_Height,Orientation,Glazing_Area,Glazing_Area_Distribution,Heating_Load,Cooling_Load
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33


In [11]:
df[['Heating_Load', 'Cooling_Load', 'Relative_Compactness', 'Surface_Area', 'Wall_Area', 'Roof_Area', 'Overall_Height', 'Glazing_Area', 
    'Glazing_Area_Distribution']].corr().apply(lambda x: pd.Series.round(x, 3))

Unnamed: 0,Heating_Load,Cooling_Load,Relative_Compactness,Surface_Area,Wall_Area,Roof_Area,Overall_Height,Glazing_Area,Glazing_Area_Distribution
Heating_Load,1.0,0.976,0.622,-0.658,0.456,-0.862,0.889,0.27,0.087
Cooling_Load,0.976,1.0,0.634,-0.673,0.427,-0.863,0.896,0.208,0.051
Relative_Compactness,0.622,0.634,1.0,-0.992,-0.204,-0.869,0.828,-0.0,-0.0
Surface_Area,-0.658,-0.673,-0.992,1.0,0.196,0.881,-0.858,0.0,0.0
Wall_Area,0.456,0.427,-0.204,0.196,1.0,-0.292,0.281,-0.0,0.0
Roof_Area,-0.862,-0.863,-0.869,0.881,-0.292,1.0,-0.973,-0.0,-0.0
Overall_Height,0.889,0.896,0.828,-0.858,0.281,-0.973,1.0,0.0,-0.0
Glazing_Area,0.27,0.208,-0.0,0.0,-0.0,-0.0,0.0,1.0,0.213
Glazing_Area_Distribution,0.087,0.051,-0.0,0.0,0.0,-0.0,-0.0,0.213,1.0


In [12]:
num_vars = ['Heating_Load', 'Cooling_Load', 'Relative_Compactness', 'Surface_Area', 'Wall_Area', 'Roof_Area', 'Overall_Height', 'Glazing_Area', 
            'Glazing_Area_Distribution'
           ]
cat_vars = ['Orientation']
df_dummy = pd.get_dummies(df, prefix="gp", 
                           columns=cat_vars, 
                           drop_first=True)
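
As a minimal sketch of what the dummy encoding above produces, the toy frame below uses the four orientation codes that the data appears to contain (2 to 5, inferred from the `gp_3`, `gp_4`, `gp_5` columns). The frame `toy` and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) with orientation codes 2 to 5.
toy = pd.DataFrame({"Orientation": [2, 3, 4, 5, 2]})

# drop_first=True drops the first level (2), which becomes the baseline:
# a row with Orientation == 2 has zeros in every gp_* column.
toy_dummy = pd.get_dummies(toy, prefix="gp", columns=["Orientation"], drop_first=True)

# Cast to int so the output reads as 0/1 regardless of the pandas version.
print(toy_dummy.astype(int))
```

Dropping one level avoids perfect collinearity between the dummy columns and the intercept added later in the regression.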

In [16]:
# convert all numerical variables to z-scores so that the coefficients are comparable
from scipy.stats.mstats import zscore
for vrb in num_vars:
    df_dummy[vrb] = zscore(df_dummy[vrb])
df_dummy.head(2)

Unnamed: 0,Relative_Compactness,Surface_Area,Wall_Area,Roof_Area,Overall_Height,Glazing_Area,Glazing_Area_Distribution,Heating_Load,Cooling_Load,gp_3,gp_4,gp_5
0,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447,-1.814575,-0.670116,-0.342666,0,0,0
1,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447,-1.814575,-0.670116,-0.342666,1,0,0
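
The standardization step can be reproduced by hand. The sketch below assumes the population standard deviation (`ddof=0`), which is the default of `zscore` in SciPy. The helper `zscores` and the sample `heights` are hypothetical; the sample assumes a column that takes two values equally often, which would explain a standardized value of exactly 1.0 such as the one shown for `Overall_Height`.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize values using the population standard deviation (ddof=0)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Illustrative sample: two distinct heights, each occurring equally often.
heights = [7.0, 7.0, 3.5, 3.5]
z = zscores(heights)
print(z)
```

After this transform every column has mean 0 and standard deviation 1, so a regression coefficient is read as the change in the response, in standard deviations, per one standard deviation change in the predictor.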


In [32]:
# Check the VIF values before feature selection.
# Dropped beforehand because of high pairwise correlation:
# 1. Relative_Compactness: highly correlated with Surface_Area (r = -0.992)
# 2. Overall_Height: highly correlated with Roof_Area (r = -0.973)
# 3. Roof_Area: highly correlated with Surface_Area (r = 0.881)
from statsmodels.stats.outliers_influence import variance_inflation_factor
x_var = ['Surface_Area', 'Wall_Area', 'Glazing_Area', 
         'Glazing_Area_Distribution', 'gp_3', 'gp_4', 'gp_5'
        ]
X = df_dummy[x_var]
pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], index=X.columns)

Surface_Area                 1.039740
Wall_Area                    1.039740
Glazing_Area                 1.047508
Glazing_Area_Distribution    1.047508
gp_3                         1.000000
gp_4                         1.000000
gp_5                         1.000000
dtype: float64
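
The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones. For a predictor that is correlated with only one other predictor, this reduces to 1 / (1 - r²): with r = 0.213 between `Glazing_Area` and `Glazing_Area_Distribution`, 1 / (1 - 0.213²) is roughly 1.0475, in line with the value reported above. The sketch below implements the definition with NumPy on synthetic data. The function `vif` and the arrays are hypothetical, and the sketch adds an intercept to each auxiliary regression, which is a choice of this sketch and may differ from how the library function treats its input.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of a 2-D array as 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression of column j on the other columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic data: x2 is nearly a copy of x1, x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

Values near 1, as in the output above, indicate that the retained predictors carry little redundant information; the near-duplicate pair in the synthetic data produces a large VIF instead.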

In [21]:
def fit_lin_reg_with_intercept(X, Y):
    X = sm.add_constant(X)  # add an intercept (constant) column
    reg_model = sm.OLS(Y,X).fit()
    return reg_model

## Model for Heating_Load

In [26]:
# insignificant variables gp_3, gp_4 and gp_5 have been removed
x_var = ['Surface_Area', 'Wall_Area', 'Glazing_Area', 
         'Glazing_Area_Distribution' 
        ]
y_var = 'Heating_Load'
reg_model  = fit_lin_reg_with_intercept(df_dummy[x_var], df_dummy[y_var])
print(reg_model.summary())

                            OLS Regression Results                            
Dep. Variable:           Heating_Load   R-squared:                       0.862
Model:                            OLS   Adj. R-squared:                  0.861
Method:                 Least Squares   F-statistic:                     1190.
Date:                Fri, 27 May 2022   Prob (F-statistic):               0.00
Time:                        06:52:39   Log-Likelihood:                -329.54
No. Observations:                 768   AIC:                             669.1
Df Residuals:                     763   BIC:                             692.3
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                 

After standardizing all variables, the following predictors have the largest impact on heating load, ordered by the absolute value of their coefficients:


| Variable                  | Coefficient |
| -----------               | ----------- |
| Surface_Area              | -0.7769     |
| Wall_Area                 |  0.6076     |
| Glazing_Area              |  0.2632     |
| Glazing_Area_Distribution |  0.0313     |
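
Because the predictors and the response are all z-scores, the coefficients can be compared directly by magnitude. A small sketch of that ranking follows, using the coefficients copied from the table above; the names `heating_coefs` and `rank_by_magnitude` are hypothetical.

```python
# Standardized coefficients copied from the heating-load table.
heating_coefs = {
    "Surface_Area": -0.7769,
    "Wall_Area": 0.6076,
    "Glazing_Area": 0.2632,
    "Glazing_Area_Distribution": 0.0313,
}

def rank_by_magnitude(coefs):
    """Sort (name, coefficient) pairs by absolute value, largest first."""
    return sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)

ranking = rank_by_magnitude(heating_coefs)
for name, coef in ranking:
    direction = "raises" if coef > 0 else "lowers"
    # Each coefficient is in standard deviations of load per SD of the predictor.
    print(f"{name:<26} {coef:+.4f}  (+1 SD {direction} the load by {abs(coef):.4f} SD)")
```

The sign gives the direction of the effect and the magnitude gives its relative strength, holding the other predictors in the model fixed.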

## Model for Cooling_Load

In [31]:
# insignificant variables gp_3, gp_4, gp_5 and Glazing_Area_Distribution have been removed
x_var = ['Surface_Area', 'Wall_Area', 'Glazing_Area'
        ]
y_var = 'Cooling_Load'
reg_model  = fit_lin_reg_with_intercept(df_dummy[x_var], df_dummy[y_var])
print(reg_model.summary())

                            OLS Regression Results                            
Dep. Variable:           Cooling_Load   R-squared:                       0.821
Model:                            OLS   Adj. R-squared:                  0.820
Method:                 Least Squares   F-statistic:                     1164.
Date:                Fri, 27 May 2022   Prob (F-statistic):          2.15e-284
Time:                        06:57:04   Log-Likelihood:                -430.14
No. Observations:                 768   AIC:                             868.3
Df Residuals:                     764   BIC:                             886.9
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const         1.605e-17      0.015   1.05e-15   

After standardizing all variables, the following predictors have the largest impact on cooling load, ordered by the absolute value of their coefficients:


| Variable     | Coefficient |
| -----------  | ----------- |
| const        | 1.605e-17   |
| Surface_Area | -0.7866     |
| Wall_Area    | 0.5809      |
| Glazing_Area | 0.2075      |

# Answers

1) Which features impact the heating load?
After standardizing all variables, the following predictors have the largest impact on heating load, ordered by the absolute value of their coefficients:



| Variable                  | Coefficient |
| -----------               | ----------- |
| Surface_Area              | -0.7769     |
| Wall_Area                 |  0.6076     |
| Glazing_Area              |  0.2632     |
| Glazing_Area_Distribution |  0.0313     |

2) Which features impact the cooling load?
After standardizing all variables, the following predictors have the largest impact on cooling load, ordered by the absolute value of their coefficients:


| Variable     | Coefficient |
| -----------  | ----------- |
| Surface_Area | -0.7866     |
| Wall_Area    | 0.5809      |
| Glazing_Area | 0.2075      |
