User file fails with NAN in predicted load #4557

Closed
mjwitte opened this issue Oct 22, 2014 · 20 comments · Fixed by #4761
Labels: PriorityLow (This defect or feature has been declared low priority), SeverityHigh (This defect is high severity, generally indicating a hard crash without a workaround)

Comments

@mjwitte
Contributor

mjwitte commented Oct 22, 2014

Helpdesk ticket 9913
This ticket has various issues going on, but after fixing some earlier problems, the file now fails with this error:

** Severe ** DualSetPointWithDeadBand: Unanticipated combination of heating and cooling loads - report to EnergyPlus Development Team
** ~~~ ** occurs in Zone=OFFICES During Warmup, Environment=SUMMER DESIGN DAY IN BAD AXE MI JUL, at Simulation time=07/15 05:03 - 05:05
** ~~~ ** LoadToHeatingSetPoint=NAN, LoadToCoolingSetPoint=NAN
** ~~~ ** Zone Heating Set-point=21.70
** ~~~ ** Zone Cooling Set-point=23.30
** ~~~ ** Zone TempDepZnLd=NAN
** ~~~ ** Zone TempIndZnLd=NAN
** ~~~ ** Zone ThermostatSetPoint=0.00
** Fatal ** Program terminates due to above conditions.

Running v8.2.0 64-bit on Win 8.1

@mjwitte added the EnergyPlus, PriorityLow, and SeverityHigh labels on Oct 22, 2014
@rraustad
Contributor

It looks like it starts here. It's looking at the zone inlet node temp?
[screenshot: node value shows nan]

@rraustad
Contributor

Using this breakpoint:
!EnergyPlus::DataGlobals::DoingSizing & EnergyPlus::DataGlobals::CurrentTime > 5 & EnergyPlus::DataGlobals::CurrentTime < 6 & ZoneNum ==4

Not sure what set that node to NAN. Maybe this code isn't the issue and something (Airloop?) is setting that node to NAN for some reason.

@rraustad
Contributor

Of course the UnitarySys with the multi-speed coil is connected to that node.

AirLoopHVAC:UnitarySystem,
AirLoop_UnitCool1_Offices, !- Name

The outlet node went to NAN before the other issues occurred.

@rraustad
Contributor

It's a choice of operating air flow for set point based control of multispeed coils. This "choice" occurs in SetOnOffMassFlowRate and considers a "zone" heating, cooling, or no-load case. However, set point based control is not based on zone load. Setting the load variables (heating or cooling) may help if not already done, or it may help to rethink what to do in the "NoLoad" case (which refers to zone load even though zone load is not used here).

@mjwitte
Contributor Author

mjwitte commented Oct 23, 2014

@kbenne @Myoldmopar I was surprised to see NAN in the error message here. Are the release build settings such that exceptions such as divide by zero do not halt execution immediately?

@EnergyArchmage
Contributor

I agree, builds that carry NANs along are going to be much more difficult to debug and can defeat the utility of a traceback.

@Myoldmopar
Member

Agreed. We haven't enabled any flags such as "fast-math" equivalents that would disable NaNs, so I don't know how it's getting through calculations. @DeadParrot @kbenne Any ideas?

@kbenne
Contributor

kbenne commented Oct 23, 2014

It appears floating point exceptions are disabled by default on MSVC.

http://msdn.microsoft.com/en-us/library/aa289157%28VS.71%29.aspx#floapoint_topic8

Looks like you can turn them on with

_clearfp();
_controlfp( _EM_ZERODIVIDE, _MCW_EM );

@DeadParrot
Contributor

@Myoldmopar NAN detection is not inherent but can be controlled by the code. In main.cc we have this:

// Enable floating point exceptions
#ifndef NDEBUG
#ifdef __unix__
    feenableexcept( FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW );
#endif
#endif

so in debug Linux builds those events will get trapped. FE_INVALID should halt on a NAN caused by bugs other than divide by zero (uninitialized variables, stomping on or reading garbage memory via a bounds error, etc.), which otherwise do not cause an immediate abort. The reason this isn't on for release builds is the performance impact.

The best debugging approach would appear to be running the case with a debug Linux build. There are Intel C++ switches, and probably VC switches as well, that can do something similar on Windows (_controlfp may work there instead of feenableexcept), and setting those up for the Windows debug builds is probably a good idea.

If that doesn't identify the problem then it needs good old-fashioned debugging. That is hard work: more efficient/proper/logical would be to first fix the known uses of uninitialized variables and all the "possibly uninitialized" compiler warnings.
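[Editor's illustration, not EnergyPlus code or a commenter's snippet: the trapping approach above aborts the process, but the same facility can be used non-fatally to check whether a NaN-producing operation occurred, via the C99/C++11 sticky status flags.]

```cpp
// Sketch (hypothetical, not EnergyPlus code): detecting an invalid FP
// operation with the <cfenv> sticky flags instead of trapping via SIGFPE.
#include <cfenv>
#include <cmath>

// Returns true if computing 0.0/0.0 produced a NaN and raised FE_INVALID.
bool nan_raises_fe_invalid() {
    std::feclearexcept(FE_ALL_EXCEPT);   // clear any previously raised flags
    volatile double zero = 0.0;          // volatile prevents constant folding
    volatile double x = zero / zero;     // 0/0 yields NaN and raises FE_INVALID
    return std::isnan(x) && std::fetestexcept(FE_INVALID) != 0;
}
```

Unlike feenableexcept, checking fetestexcept after a suspect calculation does not abort the run, so it could even be used in release builds to emit a diagnostic when a NaN first appears.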

@mjwitte
Contributor Author

mjwitte commented Oct 23, 2014

@DeadParrot What kind of performance impact is there? Without enabling these exceptions in release builds, it seems possible that bad values could slip through without getting caught in a user's simulation. It's nice to say we'll fix all the possible uninitialized warnings, but there will always be risks of divide by zero from unforeseen issues.


@DeadParrot
Contributor

@mjwitte the performance impact depends on the extent to which the hot spots of your code are floating point operations. For EnergyPlus this may be "not much" now, but as we reduce the bookkeeping overhead it should, ideally, increase. My understanding is that the FP state flags are set in hardware and thus are not themselves a performance impact: it is the checking of that state that takes some clock cycles. More seriously, enabling floating point exceptions can cause some operations that could be done in parallel to be serialized, which is not the direction we want to move in. But we could always enable the exceptions and measure the performance impact for release builds.

I suggest that the correct approach is to treat any NAN appearing as a programming bug that should be caught during testing (of debug builds with FP exceptions enabled). All divisions should be protected: by logic if a zero is possible, and by an assert if it should never happen. Until we are able to catch up on our QA and coding practices, having FP exceptions caught in release code could be beneficial in that it prevents the user from being presented wrong results, but usually the NANs propagate through to the results such that the problem is clear. I think of the FP exception information as diagnostic for the developer, not the end user: it doesn't fix anything; it just helps you localize the bug and prevents the program from continuing with invalid quantities.

In my debug builds I always enable initialization by odd/NAN values so that you are more likely to catch use of uninitialized variables during testing. The arrays do bounds checking in debug builds too. With that plus the FP exceptions and good test case coverage we should be able to eliminate these issues in the code.
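[Editor's illustration of the two division-protection styles described above, with hypothetical names; this is a sketch, not the EnergyPlus implementation.]

```cpp
#include <cassert>

// Case 1: a zero denominator is genuinely possible -> protect by logic.
// Here "no flow" is a legitimate operating state, so return a defined value.
double specificEnthalpyDelta(double totalCapacity, double massFlow) {
    if (massFlow <= 0.0) return 0.0;   // coil off / no flow: no enthalpy change
    return totalCapacity / massFlow;
}

// Case 2: a zero denominator "should never happen" -> document with an assert
// so a debug build aborts at the faulty call site instead of propagating NaN.
double averageOverSteps(double total, int numSteps) {
    assert(numSteps > 0);              // a programming error if this fires
    return total / numSteps;
}
```

In a release build the assert compiles away (with NDEBUG), so only the logically-guarded divisions carry a runtime cost.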

@rraustad
Contributor

...I always enable initialization by odd/NAN values...

I thought CMake was setting the switches?


@DeadParrot
Contributor

@rraustad Yes, CMake is setting the switches for the EnergyPlus builds. I was referring to what I do for projects in general. I checked the CMake compiler switches and it looks like all of the settings I used wrt initialization are in there. GFortran has some nice -finit options for scalars that g++ doesn't, unfortunately, so we don't have full coverage of all variables, which is why fixing all the uninitialized warnings is important.

@mjwitte
Contributor Author

mjwitte commented Feb 26, 2015

Also helpdesk ticket 10301, a user file showing a similar symptom, but that file does not use AirLoopHVAC:UnitarySystem or a multispeed coil - it's just ideal loads, so probably a different bug.

@mjwitte
Contributor Author

mjwitte commented Feb 27, 2015

@rraustad So, the divide by zero error occurs in DXCoils.cc CalcMultiSpeedDXCoilCooling (currently line 9807): hDelta = TotCapLS / AirMassFlow;
It happens because the incoming mass flows are not consistent. AirMassFlow is >0 when this routine is entered, but MSHPMassFlowRateLow and MSHPMassFlowRateHigh are both zero. At the same time CompOn=0 but CycRatio=1.0. With CycRatio > 0.0, for this particular case AirMassFlow gets set to MSHPMassFlowRateLow (which is zero) and is then used to check airflow/capacity and as a denominator.

So it seems we should trap this right at the top of CalcMultiSpeedDXCoilCooling: if AirMassFlow>0 but MSHPMassFlowRateLow and MSHPMassFlowRateHigh are zero, then stop, leave, or set MSHPMassFlowRateLow=AirMassFlow and MSHPMassFlowRateHigh=AirMassFlow?

I realize this doesn't really solve the issue of what it means to use setpoint based controls with a multispeed coil (is this even a valid combination - how do you determine flow rate?), but it protects against divide by zero.
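[Editor's sketch of the guard proposed above. The variable names mirror the ones discussed in this thread (AirMassFlow, MSHPMassFlowRateLow/High), but the struct and function are hypothetical; this is not the actual EnergyPlus code or the merged fix.]

```cpp
// Hypothetical flow state for a multispeed coil call.
struct MSFlowState {
    double AirMassFlow;          // inlet mass flow, set by the parent
    double MSHPMassFlowRateLow;  // low-speed flow, expected from the parent
    double MSHPMassFlowRateHigh; // high-speed flow, expected from the parent
};

// Ensure the speed-level flow rates are consistent with the inlet flow
// before they are used as denominators (e.g. hDelta = TotCapLS / AirMassFlow).
void guardMSHPFlows(MSFlowState &s) {
    if (s.AirMassFlow > 0.0 &&
        s.MSHPMassFlowRateLow == 0.0 && s.MSHPMassFlowRateHigh == 0.0) {
        // Parent failed to set the speed flows: fall back to the inlet flow
        // so later divisions are well-defined.
        s.MSHPMassFlowRateLow  = s.AirMassFlow;
        s.MSHPMassFlowRateHigh = s.AirMassFlow;
    }
}
```

As the next comment notes, the real fix also needs to address why the parent (SetOnOffMassFlowRate or similar) left these at zero in the first place; this guard only prevents the divide by zero.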

@rraustad
Contributor

MSHPMassFlowRateLow and High need to be set by the parent each time the child is called if the speed changes. If these are 0 while flow is non-zero, that's a bug. This is done in SetOnOffMassFlowRate or some similar routine (e.g. SetVSHPAirFlow in Furnaces).

@rraustad
Contributor

CompOn = 0? This subroutine doesn't act on CompOn? It should have bypassed the calculations to a coil-off condition if CompOn is actually used for this child.

mjwitte added a commit that referenced this issue Mar 1, 2015
… assignments for MSHPMassFlowRateLow and MSHPMassFlowRateHigh to prevent divide by zero in CalcMultiSpeedDXCoilCooling/Heating.
@rraustad
Contributor

rraustad commented Mar 3, 2015

How did you find this? Somehow break when NAN is first found? I'd hate to think it was single stepping through the code.

@mjwitte
Contributor Author

mjwitte commented Mar 3, 2015

@rraustad Well, you had already started pointing in the right direction, but yes it took a while. There was a lot of single-stepping, adding watches, and adding breaks and conditional breaks, then working back up the logic to add earlier breaks. I wish I knew how to break on NAN - is that possible?

@DeadParrot
Contributor

@mjwitte The debug build on Linux should be trapping NANs with the feenableexcept setup and the -fsignaling-nans. That should let you backtrace in gdb to find the cause.

If you are on VC then someone needs to add the equivalent _controlfp setup.
