fix variance for identical values #554
AdamGlustein
left a comment
Thanks a ton for the contribution! Left some feedback on the implementation, when you get time let me know what you think.
Also, would you be able to add a test case under test_stats.py that verifies this behavior? You could re-use the same reproduction you had in the Discussion, and just assert the variance/weighted variance is exactly 0.
cpp/csp/cppnodes/statsimpl.h
Outdated
        m_lastValue = x;
        m_consecutiveSameCount = 1;
    }
    else if( x == m_lastValue )
You can do:
    if( x == m_lastValue && m_count > 1 )
        ...
    else
        ...
Also minor style comment but we usually don't include braces for one-line if-statement bodies.
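For reference, a minimal sketch of what the suggested branch structure might look like. The struct `RunTracker` and the helper `demoRunTracker` are hypothetical names used only for illustration, the counter type is an assumption, and the member is already named `m_consecutiveValueCount` as requested later in the review. This is not the actual `statsimpl.h` code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the tracking part of Variance::add.
struct RunTracker
{
    int64_t m_count = 0;
    double  m_lastValue = 0.0;
    int64_t m_consecutiveValueCount = 0;

    void add( double x )
    {
        m_count++;
        // Track the run of identical values so that an all-equal window can
        // report a variance of exactly 0 instead of floating-point noise.
        if( x == m_lastValue && m_count > 1 )
            m_consecutiveValueCount++;
        else
        {
            // First value, or a value that differs from the previous one.
            m_lastValue = x;
            m_consecutiveValueCount = 1;
        }
    }
};

// Usage: the run length grows on repeats and resets on a different value.
inline void demoRunTracker()
{
    RunTracker t;
    t.add( 0.0 );   // first value: m_count > 1 is false, so the run starts at 1
    t.add( 0.0 );
    assert( t.m_consecutiveValueCount == 2 );
    t.add( 5.0 );
    assert( t.m_consecutiveValueCount == 1 );
}
```

The `m_count > 1` guard makes the first value take the `else` branch, so no separate first-value case is needed.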
cpp/csp/cppnodes/statsimpl.h
Outdated
    void add( double x )
    {
        m_count++;
        // Track consecutive same values (pandas approach)
I would add to the comment why we track the values (for the case where all values are the same and we want to avoid floating-point errors).
cpp/csp/cppnodes/statsimpl.h
Outdated
        return;
    }
    // Reset consecutive tracking since we can't maintain it accurately during removal
    m_consecutiveSameCount = 0;
It's incorrect to reset the count here. Consider a window [1, 1, 1, 1] with interval=3: at t=4, when the first value is removed, we would set m_consecutiveSameCount to zero even though it should be 3.
We actually don't need to do anything in remove to get this functionality to work, since we are checking m_consecutiveSameCount >= m_count in compute (note the >=). So the only event that needs to reset the consecutive count is the addition of a new value.
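The window example can be traced with a counters-only sketch. `WindowCounters`, `allIdentical`, and `demoWindowOfOnes` are hypothetical names, and the real class also maintains the running mean and variance terms, which are omitted here. In this model the counter ends at 4 rather than 3 after the removal, which is why the comparison in compute uses `>=`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counters-only model of the rolling window.
struct WindowCounters
{
    int64_t m_count = 0;
    double  m_lastValue = 0.0;
    int64_t m_consecutiveValueCount = 0;

    void add( double x )
    {
        m_count++;
        if( x == m_lastValue && m_count > 1 )
            m_consecutiveValueCount++;
        else
        {
            m_lastValue = x;
            m_consecutiveValueCount = 1;
        }
    }

    // Removal only shrinks the window; the run counter is left untouched.
    void remove( double /*x*/ )
    {
        m_count--;
    }

    // Same comparison as in compute: the run covers the whole window.
    bool allIdentical() const
    {
        return m_count > 0 && m_consecutiveValueCount >= m_count;
    }
};

// Window [1, 1, 1, 1] with interval=3: the oldest value leaves at t=4.
inline void demoWindowOfOnes()
{
    WindowCounters w;
    w.add( 1.0 );
    w.add( 1.0 );
    w.add( 1.0 );
    w.add( 1.0 );      // t=4: the new value arrives...
    w.remove( 1.0 );   // ...and the oldest one is removed
    assert( w.m_count == 3 );
    assert( w.m_consecutiveValueCount == 4 );   // 4 >= 3
    assert( w.allIdentical() );
}
```

Because values leave the window oldest-first, the run of most recent identical values is never shortened by a removal unless it already spans the whole window, so leaving the counter alone in remove is safe in this model.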
cpp/csp/cppnodes/statsimpl.h
Outdated
    double m_count;
    int64_t m_ddof;
    double m_lastValue;
    int64_t m_consecutiveSameCount;
nit: can we rename this to m_consecutiveValueCount?
    {
        if( w <= 0 )
            return;
        // Track consecutive same values and observation count
All the same comments apply to WeightedVariance as to Variance.
cpp/csp/cppnodes/statsimpl.h
Outdated
    if( m_count > m_ddof )
    {
        // Check if all values are identical (pandas approach)
        if( m_count == 1 || m_consecutiveSameCount >= m_count )
No need to check m_count here: if m_count == 1 then it's guaranteed that m_consecutiveSameCount >= 1, and thus the condition will be hit anyway.
    double m_unnormWVar;
    double m_dx;
    int64_t m_ddof;
    int64_t m_count;
We can remove m_count here (see the prior comment about how it's not needed in the if-check in compute).
I have updated weighted variance to match variance. m_count as a variable is still needed in weighted variance because we need it to initialize m_lastValue and m_consecutiveValueCount.
Signed-off-by: bournejt <aipborn@outlook.com>
Signed-off-by: bournejt <aipborn@outlook.com>
@AdamGlustein I updated the code per your request. In my local test run, it gives me the following output, which I think is a pass. I did the DCO sign as you instructed twice. I hope it will pass this time.
I'm not seeing most of the comments addressed, maybe you forgot to push a commit?
cpp/csp/cppnodes/statsimpl.h
Outdated
    if( m_count > m_ddof )
    {
        // Check if all values are identical
        if( m_consecutiveValueCount >= m_count )
Mark this as [[unlikely]] so the compiler knows to optimize for the case where the variance is not 0.
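A rough sketch of how the attribute could sit on the identical-values branch, assuming a C++20 compiler. `RollingVariance` and `demoIdenticalWindow` are hypothetical names, and the Welford-style add/remove updates are an assumption made to keep the example self-contained; they are not taken from `statsimpl.h`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical, simplified rolling variance. Member names follow the diff
// where possible; the update formulas are an illustrative assumption.
class RollingVariance
{
public:
    explicit RollingVariance( int64_t ddof = 1 ) : m_ddof( ddof ) {}

    void add( double x )
    {
        m_count++;
        if( x == m_lastValue && m_count > 1 )
            m_consecutiveValueCount++;
        else
        {
            m_lastValue = x;
            m_consecutiveValueCount = 1;
        }
        // Welford-style incremental update of mean and sum of squares.
        double delta = x - m_mean;
        m_mean += delta / m_count;
        m_unnormVar += delta * ( x - m_mean );
    }

    void remove( double x )
    {
        m_count--;
        if( m_count == 0 )
        {
            m_mean = m_unnormVar = 0;
            return;
        }
        // Reverse update; this is where rounding residue can be left behind.
        double delta = x - m_mean;
        m_mean -= delta / m_count;
        m_unnormVar -= delta * ( x - m_mean );
    }

    double compute() const
    {
        if( m_count > m_ddof )
        {
            // All values identical: return exactly 0. Marked unlikely so the
            // compiler favours the general path.
            if( m_consecutiveValueCount >= m_count ) [[unlikely]]
                return 0.0;
            return m_unnormVar / ( m_count - m_ddof );
        }
        return std::numeric_limits<double>::quiet_NaN();
    }

private:
    double  m_mean = 0.0;
    double  m_unnormVar = 0.0;
    double  m_lastValue = 0.0;
    int64_t m_count = 0;
    int64_t m_consecutiveValueCount = 0;
    int64_t m_ddof;
};

// Usage: a large value leaves the window, leaving only identical values.
inline void demoIdenticalWindow()
{
    RollingVariance v;
    v.add( 1e9 );
    v.add( 0.1 );
    v.add( 0.1 );
    v.add( 0.1 );
    v.remove( 1e9 );
    assert( v.compute() == 0.0 );   // exact, thanks to the short-circuit
}
```

The attribute is only a hint and does not change the result: the short-circuit returns exactly 0 whether or not the compiler acts on it.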
There are still comments like #554 (comment) and #554 (comment) that I don't see fixed. I don't want to take up any more of your time on this, so I just did the last few touch-ups and moved the PR over to #558, including all your commits. Thanks again!
Mimic the pandas implementation so that the variance is 0 when all the values in the window are identical. This avoids critical numerical errors in variance calculations.