# Heterogeneity in buildings by automated meter reader installations

#### Using ordinary least squares (OLS) to determine whether buildings with AMR technology release more or less greenhouse gases than those without it.

## Introduction
In this kernel, I look at how buildings that have installed automated meter reading (AMR) technology on their water meters differ from those that have not in terms of total greenhouse gas emissions. AMR technology allows water meters to send readings to the city's computerized billing system up to four times per day. This gives the customer more access to water usage information at daily, weekly, monthly, and yearly resolution ([NYC Environmental Protection](https://www1.nyc.gov/site/dep/pay-my-bills/automated-meter-reading-frequently-asked-questions.page)).

### Hypothesis
The hypothesis is that customers who opt into the installation of AMR technology are more aware of their environmental impact, and thus buildings with AMR installed have lower GHG emissions than those without it. To control for exogenous and endogenous effects on total GHG emissions (ghg), the building's ENERGY STAR score (ess), property floor area (area), and weather normalized energy use intensity (weui) are used as explanatory variables. The following model is proposed,
$$ghg_i = \beta_0 + \beta_1 amr_i + \beta_2 ess_i + \beta_3 area_i + \beta_4 weui_i + \epsilon_i$$
A note of caution: the data available to us is very limited and, in many ways, incomplete. As such, it is expected that our explanatory variables will do a poor job of accurately predicting GHG emissions. In addition, building owners often do not have control over these explanatory variables, so we expect that there will be omitted variables and that the explanatory variables will over-fit the regression.

## Preparing the data
First, let's get a good look at how AMR buildings are distributed and how their characteristics differ from those of buildings that are not fitted with AMR technology.

In [None]:
#Initial Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Data import
df_base = pd.read_csv('../input/ny-energy-and-water-data-disclosure-local-law-84/energy-and-water-data-disclosure-for-local-law-84-2014-data-for-calendar-year-2013.csv')
df_base.info()

In [None]:
#Determining how to subsample the data
df_base['Automatic Water Benchmarking Eligible'].value_counts()

In [None]:
#Renaming columns for easier coding
df_base.rename(columns = {'Total GHG Emissions(MtCO2e)':'t_ghg','Indirect GHG Emissions(MtCO2e)':'i_ghg','ENERGY STAR Score':'ess',
                         'Weather Normalized Source EUI(kBtu/ft2)':'weui','Reported Property Floor Area (Building(s)) (ft²)':'area'},
                         inplace=True)

#Subsampling, but removing the 657 observations that are "See Primary BBL", and removing extra variables
df_amr = df_base[df_base['Automatic Water Benchmarking Eligible'] == "Yes"]
df_amr = df_amr[['ess','t_ghg','i_ghg','weui','area','Borough','Latitude','Longitude']]
df_noamr = df_base[df_base['Automatic Water Benchmarking Eligible'] == "No"]
df_noamr = df_noamr[['ess','t_ghg','i_ghg','weui','area','Borough','Latitude','Longitude']]

#Validation of subsample
print('AMR subsample has',len(df_amr),'observations')
print('Non-AMR subsample has', len(df_noamr), 'observations')

In [None]:
#Missing observations
print('AMR Subsample:\n Total GHG Emissions has',df_amr['t_ghg'].isnull().sum(), 'missing obs.\n ENERGY STAR score has',
    df_amr['ess'].isnull().sum(), 'missing obs. \n Property area has',
      df_amr['area'].isnull().sum(), 'missing obs. \n Weather normalized EUI has',
      df_amr['weui'].isnull().sum(), 'missing obs.\n \n','Non-AMR Subsample:\n Total GHG Emissions has',df_noamr['t_ghg'].isnull().sum(), 'missing obs.\n ENERGY STAR score has',
    df_noamr['ess'].isnull().sum(), 'missing obs. \n Property area has',
      df_noamr['area'].isnull().sum(), 'missing obs. \n Weather normalized EUI has',
      df_noamr['weui'].isnull().sum(), 'missing obs.'
)

In [None]:
#Removing the observations with missing data
df_amr = df_amr.dropna()
df_noamr = df_noamr.dropna()

df_amr = df_amr[df_amr['area'] != "Not Available"]
df_noamr = df_noamr[df_noamr['area'] != "Not Available"]
df_amr = df_amr[df_amr['t_ghg'] != "Not Available"]
df_noamr = df_noamr[df_noamr['t_ghg'] != "Not Available"]
df_amr = df_amr[df_amr['weui'] != "Not Available"]
df_noamr = df_noamr[df_noamr['weui'] != "Not Available"]
df_amr = df_amr[df_amr['ess'] != "Not Available"]
df_noamr = df_noamr[df_noamr['ess'] != "Not Available"]
df_amr = df_amr[df_amr['i_ghg'] != "Not Available"]
df_noamr = df_noamr[df_noamr['i_ghg'] != "Not Available"]


print('AMR Subsample:\n Total GHG Emissions has',df_amr['t_ghg'].isnull().sum(), 'missing obs.\n ENERGY STAR score has',
    df_amr['ess'].isnull().sum(), 'missing obs. \n Property area has',
      df_amr['area'].isnull().sum(), 'missing obs. \n Weather normalized EUI has',
      df_amr['weui'].isnull().sum(), 'missing obs.\n There are',len(df_amr),'obs left.\n\n','Non-AMR Subsample:\n Total GHG Emissions has',df_noamr['t_ghg'].isnull().sum(), 'missing obs.\n ENERGY STAR score has',
    df_noamr['ess'].isnull().sum(), 'missing obs. \n Property area has',
      df_noamr['area'].isnull().sum(), 'missing obs. \n Weather normalized EUI has',
      df_noamr['weui'].isnull().sum(), 'missing obs.\n There are',len(df_noamr),'obs left.'
)

In [None]:
#Converting objects to floats
df_amr['t_ghg'] = df_amr['t_ghg'].astype(float)
df_amr['area'] = df_amr['area'].astype(float)
df_amr['ess'] = df_amr['ess'].astype(float)
df_amr['weui'] = df_amr['weui'].astype(float)
df_amr['i_ghg'] = df_amr['i_ghg'].astype(float)

df_noamr['t_ghg'] = df_noamr['t_ghg'].astype(float)
df_noamr['area'] = df_noamr['area'].astype(float)
df_noamr['ess'] = df_noamr['ess'].astype(float)
df_noamr['weui'] = df_noamr['weui'].astype(float)
df_noamr['i_ghg'] = df_noamr['i_ghg'].astype(float)
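
The filtering and conversion steps above can also be expressed in one pass. What follows is a minimal sketch on a made-up frame, not the kernel's actual data, and the helper name `to_numeric_complete` is hypothetical. It relies on `pd.to_numeric(..., errors='coerce')`, which turns non-numeric strings such as "Not Available" into NaN so that a single `dropna` removes them. Unlike the cells above, it only drops rows that are missing in the listed columns.

```python
import pandas as pd

def to_numeric_complete(df, cols):
    """Coerce `cols` to floats and drop rows where any of them is missing."""
    out = df.copy()
    for col in cols:
        # Non-numeric strings such as "Not Available" become NaN
        out[col] = pd.to_numeric(out[col], errors='coerce')
    return out.dropna(subset=cols)

# Made-up rows for illustration only
toy = pd.DataFrame({
    'ess': ['75', 'Not Available', '60', None],
    't_ghg': ['1200.5', '800', 'Not Available', '950'],
    'Borough': ['BRONX', 'QUEENS', 'BROOKLYN', 'MANHATTAN'],
})
clean = to_numeric_complete(toy, ['ess', 't_ghg'])
print(clean)
```

Applied to this kernel, the same call could be made once per subsample with the five renamed numeric columns.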

#### Note
Unfortunately, only 338 buildings remain in the AMR compliant sample and 799 in the non-AMR compliant sample after all NaN and "Not Available" records were removed. While these samples are still large enough for analysis, the results that follow should be interpreted with caution.

## Descriptive statistics

In [None]:
print('AMR Compliant Sample:\n\n',df_amr.describe(percentiles=[]).transpose())

In [None]:
print('Non-AMR Compliant Sample:\n\n',df_noamr.describe(percentiles=[]).transpose())

In [None]:
#Where are our buildings?
print('AMR Compliant Sample:\n\n', df_amr['Borough'].value_counts(), '\n\n Non-AMR Compliant Sample:\n\n',df_noamr['Borough'].value_counts())

In [None]:
#Looking at the distribution of each variable before assigning the type of regression, AMR Compliant
plt.subplot(2,2,1)
df_amr['t_ghg'].plot.hist()
plt.xlabel('GHG Emissions (MtCO2e)')
plt.subplot(2,2,2)
df_amr['area'].plot.hist()
plt.xlabel('Floor Space (ft^2)')
plt.xticks(rotation=45)
plt.subplot(2,2,3)
df_amr['ess'].plot.hist()
plt.xlabel('ENERGY STAR Score')
plt.subplot(2,2,4)
df_amr['weui'].plot.hist()
plt.xlabel('Weather Standardized EUI')
plt.tight_layout()
print('Figure 1: AMR Compliant Sample')
plt.show()

In [None]:
#Looking at the distribution of each variable before assigning the type of regression, Non-AMR Compliant
plt.subplot(2,2,1)
df_noamr['t_ghg'].plot.hist()
plt.xlabel('GHG Emissions (MtCO2e)')
plt.subplot(2,2,2)
df_noamr['area'].plot.hist()
plt.xlabel('Floor Space (ft^2)')
plt.xticks(rotation=45)
plt.subplot(2,2,3)
df_noamr['ess'].plot.hist()
plt.xlabel('ENERGY STAR Score')
plt.subplot(2,2,4)
df_noamr['weui'].plot.hist()
plt.xlabel('Weather Standardized EUI')
plt.tight_layout()
print('Figure 2: Non-AMR Compliant Sample')
plt.show()

#### Comments
Between the two samples, there doesn't seem to be much difference except for the weather normalized energy use intensity (much lower in the non-AMR sample). WEUI measures the total amount of raw fuel, in kBtu per square foot, that is required to operate the building. It makes sense that buildings with a higher EUI would be more concerned about their utility usage, and thus be part of the AMR program.

In [None]:
#Where are these buildings?
import folium
nyc_map = folium.Map([40.730610, -73.935242],
                zoom_start = 11) #named nyc_map to avoid shadowing the built-in map()

for row in df_amr.itertuples():
    nyc_map.add_child(folium.Marker([row.Latitude,row.Longitude])) #default (blue) marker

for row in df_noamr.itertuples():
    nyc_map.add_child(folium.Marker([row.Latitude,row.Longitude], icon=folium.Icon(color='red')))
nyc_map

## BLUE = AMR Compliant, RED = Non-AMR Compliant

## OLS Regression

Now that the data has been cleaned and converted to floats, I want to determine whether the presence of AMR technology has a statistically significant impact on a building's greenhouse gas emissions. To do this, the AMR subsample is assigned a binary value of 1 and the non-AMR subsample a value of 0, and the samples are then combined. The following models will be estimated,
1. Parsimonious Regression ($ghg_i = \beta_0 + \beta_1 amr_i + \epsilon_i$)
1. Extended Regression ($ghg_i = \beta_0 + \beta_1 amr_i + \beta_2 ess_i + \beta_3 area_i + \beta_4 weui_i + \epsilon_i$)

In [None]:
df_amr['amr'] = 1 #Subsample where 1=has amr tech, 0=does not
df_noamr['amr'] = 0
df_reg = pd.concat([df_amr, df_noamr], ignore_index=True) #Stacking the two subsamples (DataFrame.append was removed in pandas 2.0)
print('For the combined data,',df_reg['amr'].mean()*100,'% of the observations have AMR technology.')

#### 1. Parsimonious Regression

In [None]:
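import statsmodels.api as sm
#Regress total GHG on the AMR dummy only. Note: sm.OLS does not add an intercept
#automatically, so beta_0 is not estimated here (sm.add_constant would add it)
#and the reported R-squared is the uncentered one.
regamr_p = sm.OLS(df_reg['t_ghg'],df_reg[['amr']])
results = regamr_p.fit()
print(results.summary())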

The results from the parsimonious regression are as expected. By itself, the presence of AMR technology has a statistically insignificant effect on the total GHG released by the building. On its own this does not support our hypothesis, but adding control variables that are related to GHG emissions may change the picture.
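
When an intercept is included, the parsimonious specification has a simple reading: $\beta_0$ is the mean of the non-AMR group and $\beta_1$ is the difference in group means. The sketch below illustrates this on synthetic data (the numbers are invented and unrelated to the Local Law 84 data), using NumPy's least-squares solver rather than statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a 0/1 indicator and an outcome unrelated to it
amr = rng.integers(0, 2, size=200)
ghg = 1000 + rng.normal(0, 50, size=200)

# Design matrix with an explicit column of ones for the intercept beta_0
X = np.column_stack([np.ones_like(amr, dtype=float), amr])
beta, *_ = np.linalg.lstsq(X, ghg, rcond=None)

diff_in_means = ghg[amr == 1].mean() - ghg[amr == 0].mean()
print('beta_0:', beta[0], ' mean of group 0:', ghg[amr == 0].mean())
print('beta_1:', beta[1], ' difference in means:', diff_in_means)
```

Without the column of ones, the single coefficient on the dummy is instead the mean of the AMR group, which answers a different question.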

#### 2. Extended Regression

In [None]:
#Extended regression: AMR dummy plus the three controls (again without an intercept column)
regamr_p = sm.OLS(df_reg['t_ghg'],df_reg[['amr', 'ess','area','weui']])
results = regamr_p.fit()
print(results.summary())

In the extended regression, the presence of AMR technology is still not significant, while the area of the building, the weather normalized EUI, and the ENERGY STAR score are all significant (with ESS falling just short of significance at the 90% confidence level). Unfortunately, the regression output contains statistics that are worrisome (e.g., the adjusted R-squared). This is a common sign of a model being over-fit. The following section contains some regression diagnostics in an attempt to isolate the problem.
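
One possible contributor to a very high R-squared is worth checking: `sm.OLS` does not add an intercept unless one is supplied (for example via `sm.add_constant`), and for a model without a constant statsmodels reports the uncentered R-squared, which measures variation around zero rather than around the mean. The sketch below uses synthetic data, invented for illustration, to show how far the two definitions can diverge for the same fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Synthetic outcome with a large mean and no dependence on x
x = rng.uniform(1, 2, size=n)
y = 1000 + rng.normal(0, 50, size=n)

# Fit y = b * x with no intercept
X = x.reshape(-1, 1)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
ssr = np.sum((y - X @ b) ** 2)

r2_uncentered = 1 - ssr / np.sum(y ** 2)             # total sum of squares around zero
r2_centered = 1 - ssr / np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean
print('uncentered R-squared:', r2_uncentered)
print('centered R-squared:  ', r2_centered)
```

Whether this explains the output above is an assumption that would need to be checked by refitting the regressions with a constant.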

## Regression Diagnostics

For the OLS method, the beta coefficients are found by minimizing the sum of squared residuals, which gives the first-order condition,
$$\min_{\hat{\beta}} e'e \quad\Longrightarrow\quad \frac{\partial e'e}{\partial \hat{\beta}}=0$$
OLS makes the sum of squared residuals as small as possible, though not necessarily zero; the closer it is to zero relative to the total variation in ghg, the more we can trust the fit.
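
As a small illustration of this condition, the sketch below solves the normal equations $X'X\hat{\beta} = X'y$ on synthetic data (the coefficients and noise level are invented) and confirms that the gradient of $e'e$ vanishes at the solution while the sum of squared residuals itself stays positive.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Synthetic regressors (with intercept column) and a noisy outcome
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 1.0, size=n)

# Solve the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

gradient = -2 * X.T @ e   # derivative of e'e with respect to beta
ssr = e @ e               # minimized sum of squared residuals
print('gradient at the solution:', gradient)
print('sum of squared residuals:', ssr)
```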

In [None]:
#Testing sum of squared residuals.
y_hat = results.predict() #predicting y_hat from the fitted model
resids = df_reg['t_ghg'] - y_hat #actual - predicted for residuals
resids_d = resids/1000000 #rescaling by 1e6 to make the magnitude easier to read
resids_dsq = resids_d*resids_d
print('The sum of squared residuals is', resids_dsq.sum())

This value is quite far from zero, so to better understand the residuals, I plot them using a KDE method. The t-tests used in the above regressions assume that the errors are normally distributed with mean zero and variance $\sigma^2$.

In [None]:
residplot = resids_d.plot.kde()
plt.title('Distribution of residuals')

From the distribution of the residuals, it appears reasonable to assume that $\epsilon_i \sim N(0,\sigma^2)$. This gives us some confidence in the t-tests and, subsequently, the p-values in our extended regression.
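
A visual check can be complemented with a numerical one. The sketch below computes the Jarque-Bera statistic, $JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$, from the sample skewness $S$ and kurtosis $K$. The two samples are synthetic and the function is an illustration rather than part of the analysis; the statsmodels summary printed earlier also reports a Jarque-Bera value for the fitted residuals.

```python
import numpy as np

def jarque_bera(resid):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4)."""
    r = np.asarray(resid, dtype=float)
    r = r - r.mean()
    n = r.size
    m2 = np.mean(r ** 2)
    skew = np.mean(r ** 3) / m2 ** 1.5
    kurt = np.mean(r ** 4) / m2 ** 2
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

rng = np.random.default_rng(3)
jb_normal = jarque_bera(rng.normal(size=1000))      # drawn from a normal distribution
jb_skewed = jarque_bera(rng.lognormal(size=1000))   # heavily right-skewed
print('normal sample:', jb_normal)
print('skewed sample:', jb_skewed)
```

Under normality the statistic is approximately chi-squared with 2 degrees of freedom, so values well above about 6 point away from normal errors.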

Now I will test the specification of the model using the link test. Given,
$$y_i =\beta_1 \hat{y}_i + \beta_2 \hat{y}^2_i + \epsilon_i$$
if the coefficient on the squared prediction, $\beta_2$, is statistically significant, then we have evidence of model misspecification.

In [None]:
#Putting the outcome and the fitted values on the same (1e6) scale for the link test
df_reg['y'] = df_reg['t_ghg'] / 1000000
df_reg['y_hat'] = y_hat / 1000000
df_reg['y_hatsq'] = df_reg['y_hat']*df_reg['y_hat']

In [None]:
#Link test: regress y on the fitted values and their square
link_model = sm.OLS(df_reg['y'],df_reg[['y_hat','y_hatsq']])
link_results = link_model.fit()
print(link_results.summary())

From the statistical significance of the parameters, it can be concluded that our model suffers from misspecification. This is consistent with the extended regression being over-fit, and thus our results cannot be trusted.
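
To show what the link test picks up, the sketch below runs it on synthetic data where the true relation is quadratic but a linear model is fitted. The helper `ols_t` is hypothetical and written with NumPy only, and, unlike the cell above, the auxiliary regression here includes a constant, which is an assumption of this sketch.

```python
import numpy as np

def ols_t(X, y):
    """Return OLS coefficients and their t-statistics."""
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta, beta / se

rng = np.random.default_rng(4)
n = 500
x = rng.uniform(0, 10, size=n)
y = 1 + 0.5 * x ** 2 + rng.normal(0, 1, size=n)   # true relation is quadratic

# Step 1: fit a (deliberately misspecified) linear model
X = np.column_stack([np.ones(n), x])
beta, _ = ols_t(X, y)
y_hat = X @ beta

# Step 2: regress y on a constant, y_hat and y_hat squared
Z = np.column_stack([np.ones(n), y_hat, y_hat ** 2])
_, t_link = ols_t(Z, y)
print('t-statistic on y_hat squared:', t_link[2])
```

A large t-statistic on the squared term signals that the first-stage model is missing structure, here the curvature in `x`.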

In addition, our model may suffer from high levels of multicollinearity. The following correlation matrix aims to uncover this (expected) correlation.

In [None]:
corr = df_reg[['amr', 'ess','area','weui']].corr()
sns.heatmap(corr,xticklabels=corr.columns, yticklabels=corr.columns,annot=True, fmt="f")
plt.title('Correlation index (independent variables)')

Surprisingly, there is no indication of correlation among the explanatory variables that we should be concerned with. When the dependent variable is included in the matrix, the only variables with an almost perfect correlation are total greenhouse gas emissions and the square footage of the building. Omitting area from the extended regression does not change the outcome or the predictive power of the model.
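
Pairwise correlations can miss collinearity that involves several variables at once, so a common complement is the variance inflation factor, $VIF_j = 1/(1-R_j^2)$, where $R_j^2$ comes from regressing variable $j$ on the others. The sketch below is a NumPy-only illustration on synthetic columns; the function name `vif` and the data are assumptions, not part of the kernel.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        # Regress column j on a constant and all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(5)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)              # unrelated to the others
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this kernel the input would be `df_reg[['amr', 'ess', 'area', 'weui']].to_numpy()`; values far above 1 would indicate a regressor that is largely explained by the others.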

## Discussion
Adding the control variables leaves the presence of AMR technology insignificant. However, we cannot yet conclude that AMR technology has no effect, because of limitations in the model itself and in the data we have. The data we have is highly correlated with GHG emissions: the more floor area in a building, the higher the use of electricity, and thus the higher the release of GHGs. The higher the WEUI, the more energy is required, and the higher the ENERGY STAR score, the less GHG is emitted. While these all have the correct signs and are statistically significant except for the ENERGY STAR score, the close relationship between these variables and GHG emissions (floor area in particular) appears to be overfitting the model. This can be seen in the 100% explanatory power indicated by the adjusted R-squared, which is a sure sign that our model is mis-specified. Additionally, the loss of over 90% of the sample due to missing data may have made our results unrepresentative.

## Conclusion
The data provided in this set is simply not sufficient to predict either total greenhouse gas emissions or whether AMR technology affects those emissions. This is due to the large amount of missing data and the extremely low level of detail about the buildings. The provided variables are also highly correlated, most notably floor area and total emissions. Had all data been present for the full number of observations, the story would likely have been much different.

Future research could extend this analysis to a time series instead of the single-year cross-section used here (the 2014 disclosure, covering calendar year 2013). Additional data could also be included, such as the age of the buildings, the price of electricity (if doing a time series analysis), and behavioral characteristics such as whether the buildings participated in energy conservation incentives.