Site Model Selection Criteria #76
To select the weather model to be used for a site, CalTrack specifies a two-step process. First, candidate HDD and CDD models are specified using the pre-determined ranges of balance points. In addition, models with an HDD but no CDD variable, and a CDD but no HDD variable, are created. Lastly, an "intercept-only" model with no HDD or CDD variable is created (the mean over the time series). The first step in CalTrack for selecting a weather model is to remove candidate models where the HDD or CDD coefficients are not significant (p-value > 0.10). After candidate models with non-significant coefficients have been eliminated, the remaining model with the highest R-squared value is selected and used to compute the weather-normalized annual consumption.
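The two-step screen might be sketched roughly as follows. The candidate records (balance points, p-values, R-squared values) are made up for illustration; in a real implementation each candidate would be fit to the site's usage data first.

```python
# Hypothetical sketch of the CalTrack two-step selection. Candidate records
# are illustrative, not CalTrack's actual data structures.

P_VALUE_CUTOFF = 0.10  # coefficients with p > 0.10 are "not significant"

candidates = [
    # model type, balance point(s), coefficient p-values, R-squared
    {"type": "hdd_cdd", "bp": (60, 70), "p_values": [0.02, 0.15], "r2": 0.81},
    {"type": "hdd_only", "bp": (60,), "p_values": [0.04], "r2": 0.74},
    {"type": "cdd_only", "bp": (70,), "p_values": [0.25], "r2": 0.40},
    {"type": "intercept_only", "bp": (), "p_values": [], "r2": 0.0},
]

# Step 1: drop candidates with any non-significant HDD/CDD coefficient.
significant = [
    m for m in candidates
    if all(p <= P_VALUE_CUTOFF for p in m["p_values"])
]

# Step 2: of the survivors, keep the model with the highest R-squared.
best = max(significant, key=lambda m: m["r2"])
print(best["type"])  # hdd_only: hdd_cdd fails the screen (CDD p = 0.15)
```

Note how the combined HDD+CDD model, despite the highest R-squared, is eliminated by one marginally non-significant coefficient; this is exactly the behavior discussed below.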
In testing that we did with Open EE in 2017, we found that the practical consequence of the coefficient significance screen was that a relatively high proportion of "intercept-only" models were selected. We think this two-step process is too restrictive and removes many valid weather models where the individual coefficients don't quite achieve statistical significance. There is also an argument to be made that all of the variables that are theoretically indicated should be included in the regression model, regardless of whether they are statistically significant.
Energy Trust suggests removing the coefficient significance screen and selecting the candidate weather model with the highest R-squared from the full range of candidate models. There may also be an R-squared floor for candidate weather models, below which the "intercept-only" model is selected. We have used R-squared < 0.5 as a floor for candidate weather models in the past.
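Energy Trust's alternative rule could be sketched like this; the `select_model` helper and the candidate records are hypothetical, with the 0.5 floor taken from the paragraph above.

```python
# Sketch of the proposed rule: no significance screen, pick the candidate
# with the highest R-squared, but fall back to intercept-only when even the
# best candidate's R-squared is below a floor. Records are hypothetical.

R2_FLOOR = 0.5

def select_model(candidates):
    """Return the best weather model, or intercept-only below the R2 floor."""
    best = max(candidates, key=lambda m: m["r2"])
    if best["r2"] < R2_FLOOR:
        return {"type": "intercept_only", "r2": 0.0}
    return best

candidates = [
    {"type": "hdd_cdd", "r2": 0.81},   # would have failed the p-value screen
    {"type": "hdd_only", "r2": 0.74},
]
print(select_model(candidates)["type"])  # hdd_cdd
```

Under this rule the marginally significant HDD+CDD model is retained rather than discarded, which is the behavior the testing below evaluates.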
Proposed test methodology:
A comparison was performed using 24-month electric traces, split into 12 months of training and 12 months of test data, and the mean absolute prediction error was used as the metric.
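The comparison metric is just the mean absolute error of each model's predictions against the 12 held-out test months; a minimal sketch, with made-up usage numbers:

```python
# Mean absolute prediction error over a held-out test period. The predicted
# and actual values below are invented for illustration only.

def mean_absolute_error(predicted, actual):
    """Average of |prediction - actual| over the test observations."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

predicted = [102.0, 95.5, 88.0]   # model predictions for test months
actual = [100.0, 97.0, 90.0]      # metered usage for the same months
print(round(mean_absolute_error(predicted, actual), 2))  # 1.83
```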
In the majority of cases (almost 90%), the fit did not change when the p-value criterion was removed. For the remainder, there was a change; in most cases, it was because the model changed from intercept-only to a weather-sensitive model (i.e., the heating or cooling terms were only marginally significant) - see Figure 1. The average Mean Absolute Error (MAE) was slightly lower when the p-value screen was removed (8.20 vs. 8.34). Over twice as many models improved as degraded when the p-value cutoff was removed, and none of the degradations were catastrophic (Figure 2).
Therefore, it appears that the p-value requirement is at best superfluous and at worst marginally counterproductive.
Not sure about this recommendation. There are situations where a decision must be made between two independent variables that may be covariant. One example I have mentioned: a monthly model has an intercept and a weather-related slope for heating. Occupancy rates (leased space) changed during the baseline period. There was also a second weather-related slope because the site had a heat pump with resistance heat. Due to the timing of the occupancy changes, the weather and the occupancy were covariant. The correct independent variable was the second weather variable; occupancy was not significant once the second weather variable was included. In contrast, occupancy was significant in the reporting period.