@bournejt
Contributor

Mimic the pandas implementation so that the variance is exactly 0 when all the values in the window are identical. This avoids critical numerical errors in variance calculations.

@timkpaine timkpaine marked this pull request as draft June 26, 2025 16:57
@timkpaine timkpaine added type: enhancement Issues and PRs related to improvements to existing features labels Jun 26, 2025
Collaborator

@AdamGlustein AdamGlustein left a comment

Thanks a ton for the contribution! Left some feedback on the implementation, when you get time let me know what you think.

Also, would you be able to add a test case under test_stats.py that verifies this behavior? You could re-use the same reproduction you had in the Discussion, and just assert the variance/weighted variance is exactly 0.

    m_lastValue = x;
    m_consecutiveSameCount = 1;
}
else if( x == m_lastValue )
Collaborator

You can do

if( x == m_lastValue && m_count > 1 )
    ...
else
    ...

Also, a minor style comment: we usually don't include braces for one-line if-statement bodies.
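A minimal sketch of how the suggested branch could look inside `add`, assuming the counter is incremented before the comparison, as in the snippet under review. The struct name `ConsecutiveTracker` is hypothetical, `m_count` is an integer here for simplicity, and the actual variance update is omitted.

```cpp
#include <cstdint>

// Hypothetical, stripped-down holder for the tracking state only;
// the running variance update is left out of this sketch.
struct ConsecutiveTracker
{
    void add( double x )
    {
        m_count++;
        // m_count > 1 guards the first observation, where m_lastValue is not yet meaningful
        if( x == m_lastValue && m_count > 1 )
            m_consecutiveSameCount++;
        else
        {
            // Start a new run: braces are needed here because the body has two statements
            m_lastValue = x;
            m_consecutiveSameCount = 1;
        }
    }

    int64_t m_count = 0;
    double m_lastValue = 0.0;
    int64_t m_consecutiveSameCount = 0;
};
```

With this shape, the first value always lands in the `else` branch, so no separate initialization branch is required.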

void add( double x )
{
    m_count++;
    // Track consecutive same values (pandas approach)
Collaborator

I would add to the comment why we track the values (to handle the case where all values are the same and we want to avoid floating-point errors).

    return;
}
// Reset consecutive tracking since we can't maintain it accurately during removal
m_consecutiveSameCount = 0;
Collaborator

It's incorrect to reset the count here. Consider a window [1, 1, 1, 1] with interval=3: at t=4, when the first value is removed, we are going to set m_consecutiveSameCount to zero, even though it should be 3.

We actually don't need to do anything in remove to get this functionality to work, since we are checking m_consecutiveSameCount >= m_count in compute (note the >=). So the only event that needs to reset the consecutive count is the addition of a new value.
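The window argument above can be sketched as follows. This is only an illustration under stated assumptions: `WindowTracker` and `allIdentical` are hypothetical names, `m_count` is an integer here, and the variance bookkeeping is omitted so that only the counters are visible.

```cpp
#include <cstdint>

// Hypothetical tracker: remove() only shrinks the window size.
struct WindowTracker
{
    void add( double x )
    {
        m_count++;
        if( x == m_lastValue && m_count > 1 )
            m_consecutiveSameCount++;
        else
        {
            m_lastValue = x;
            m_consecutiveSameCount = 1;
        }
    }

    void remove( double /*x*/ )
    {
        // Deliberately leaves m_consecutiveSameCount untouched
        m_count--;
    }

    // The run of equal values ending at the newest value covers the whole window
    // whenever it is at least as long as the window itself (note the >=).
    bool allIdentical() const
    {
        return m_consecutiveSameCount >= m_count;
    }

    int64_t m_count = 0;
    double m_lastValue = 0.0;
    int64_t m_consecutiveSameCount = 0;
};
```

For the window [1, 1, 1, 1] with interval=3, the fourth add followed by one remove leaves the run length at 4 and the count at 3, so the check still succeeds. For [1, 2, 2, 2], the check only starts succeeding once the leading 1 has been removed.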

double m_count;
int64_t m_ddof;
double m_lastValue;
int64_t m_consecutiveSameCount;
Collaborator

nit: can we rename this to m_consecutiveValueCount?

{
    if( w <= 0 )
        return;
    // Track consecutive same values and observation count
Collaborator

All the same comments apply for WeightedVariance as for Variance

if( m_count > m_ddof )
{
    // Check if all values are identical (pandas approach)
    if( m_count == 1 || m_consecutiveSameCount >= m_count )
Collaborator

No need to check m_count here: if m_count == 1, then it's guaranteed that m_consecutiveSameCount >= 1, so the condition will be hit anyway.

double m_unnormWVar;
double m_dx;
int64_t m_ddof;
int64_t m_count;
Collaborator

We can remove m_count here (see the prior comment about how it's not needed in the if-check in compute).

Contributor Author

I have updated weighted variance to match variance. m_count as a variable is still needed in weighted variance because we need it to initialize m_lastValue and m_consecutiveValueCount.
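A rough sketch of why the weighted version still wants an observation counter, assuming non-positive weights are skipped as in the snippet under review. `WeightedTracker` is a hypothetical name and the weighted mean and variance updates are omitted.

```cpp
#include <cstdint>

// Hypothetical holder for the tracking state of a weighted variance.
struct WeightedTracker
{
    void add( double x, double w )
    {
        // Observations with non-positive weight do not enter the window
        if( w <= 0 )
            return;

        m_count++;
        // m_count > 1 is what distinguishes the first counted observation,
        // which must initialize m_lastValue and m_consecutiveValueCount
        if( x == m_lastValue && m_count > 1 )
            m_consecutiveValueCount++;
        else
        {
            m_lastValue = x;
            m_consecutiveValueCount = 1;
        }
        // Weighted mean / variance update omitted in this sketch
    }

    int64_t m_count = 0;
    double m_lastValue = 0.0;
    int64_t m_consecutiveValueCount = 0;
};
```

The sum of weights alone cannot tell how many observations are in the window, which is why a separate count is kept here.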

Signed-off-by: bournejt <aipborn@outlook.com>
Signed-off-by: bournejt <aipborn@outlook.com>
@bournejt
Contributor Author

@AdamGlustein I updated the code per your request. In my local test run, it gives me the following output, which I think is a pass. I did the DCO sign-off twice, as you instructed. I hope it will pass this time.

Results (64.07s (0:01:04)):
    1175 passed
       3 xfailed
      82 skipped

@AdamGlustein
Collaborator

> @AdamGlustein I updated the code per your request. In my local test run, it gives me the following output, which I think is a pass. I did the DCO sign-off twice, as you instructed. I hope it will pass this time.
>
> Results (64.07s (0:01:04)):
>     1175 passed
>        3 xfailed
>       82 skipped

I'm not seeing most of the comments addressed, maybe you forgot to push a commit?
Can you also add a test for this specific case in test_stats.py when you get a chance?

if( m_count > m_ddof )
{
    // Check if all values are identical
    if( m_consecutiveValueCount >= m_count )
Collaborator

Mark this as [[unlikely]] so the compiler knows to optimize for the case where the variance is not 0
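As a sketch of what that could look like, assuming a C++20 compiler for the `[[unlikely]]` attribute: `computeVariance` is a hypothetical free function, its parameters mirror the members in the PR, `unnormVar` is assumed to hold the running sum of squared deviations, and returning NaN when there is not enough data is also an assumption.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the compute step of the variance class.
inline double computeVariance( double unnormVar, int64_t count, int64_t ddof, int64_t consecutiveValueCount )
{
    if( count > ddof )
    {
        // All values in the window are identical: return exactly 0.
        // No separate count == 1 check is needed, since a single value
        // always has a run length of at least 1.
        if( consecutiveValueCount >= count ) [[unlikely]]
            return 0.0;

        return unnormVar / ( count - ddof );
    }

    // Assumed behavior when there are too few observations
    return std::numeric_limits<double>::quiet_NaN();
}
```

The attribute only hints at branch layout; the result is the same with or without it.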

@bournejt
Contributor Author

bournejt commented Jul 1, 2025

  1. @AdamGlustein Sorry I missed your request to add a test before. Just added it. It passed in my local test.
  2. I did address all your requests. Maybe it's just a display issue on your end? If you click directly into my commit, you will see all the code changes.

@AdamGlustein
Collaborator

> 1. @AdamGlustein Sorry I missed your request to add a test before. Just added it. It passed in my local test.
> 2. I did address all your requests. Maybe it's just a display issue on your end? If you click directly into my commit, you will see all the code changes.

There are still comments like #554 (comment) and #554 (comment) that I don't see fixed.

I don't want to take up any more of your time on this so I just did the last few touch ups and moved the PR over to #558 including all your commits. Thanks again!
