
one scan is enough to implement ALGstdev_@1 in monetdb5/modules/kernel/algebra.mx #3178

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments


Date: 2012-11-06 22:59:41 +0100
From: Yin <>
To: MonetDB5 devs <>
Version: 11.13.9 (Oct2012-SP3)
CC: @yzchang

Last updated: 2013-02-19 13:17:56 +0100

Comment 17881

Date: 2012-11-06 22:59:41 +0100
From: Yin <>

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:2.0) Gecko/20100101 Firefox/4.0

The current implementation takes two scans of the BAT: a first scan to compute the sum (and hence the average), and a second scan to compute the stdev.

By using a well-known mathematical property of the stdev, one scan is enough: during the scan, the sum of the values and the sum of their squares can be accumulated together, and the stdev can then be derived from the two.

Reproducible: Always

Yin's one-scan implementation is given by the following diff:

```diff
diff -r bcdb312657ff monetdb5/modules/kernel/algebra.mx
--- a/monetdb5/modules/kernel/algebra.mx	Tue Nov 06 12:03:33 2012 -0800
+++ b/monetdb5/modules/kernel/algebra.mx	Tue Nov 06 13:40:09 2012 -0800
@@ -4149,47 +4149,64 @@
 }
 
 /*
- * @+ BAT standard deviation
+ * @+ BAT standard deviation---Yin's one scan implementation
  */
 @= stdev_impl
 str ALGstdev_@1(dbl *res, int *bid) {
 	BAT *b;
-	BUN cnt = 0;
+	BUN cnt;
+	BATiter bi;
+	bit hasnull;
+	BUN p,q;
+	@1 value;
+	dbl x, sum, sumsqr, stdev;
 
 	if ((b=BATdescriptor(*bid)) == NULL) {
 		throw(MAL, "algebra.stdev", RUNTIME_OBJECT_MISSING);
 	}
+	BATcheck(b,"ALGstdev_@1");
 	cnt = BATcount(b);
 	if (cnt) {
-		dbl stdev = 0;
-
-		/* first, we need the average of the tail values */
-		bit t = TRUE;
-		dbl avg = 0;
-
-		CMDsum_@1_dbl(&avg, b, &t);
-		if (avg != dbl_nil) {
-			BATiter bi = bat_iterator(b);
-			BUN p,q;
-
-			avg /= cnt;
-			/* next, we need the sum of the squares of the difference of each
-			   value in the tail with the avg */
-			/* no need to check for nil values on the tail, the CMDsum has
-			   already done that and reported back */
-			BATloop(b, p, q) {
-				dbl value = (dbl) *(@1*) BUNtloc(bi, p);
-
-				value -= avg;    /* diff from avg  */
-				value *= value;  /* square         */
-				stdev += value;  /* sum of squares */
-			}
-			/* finally, divide by the cnt and compute the square root */
-			*res = sqrt(stdev/cnt);
-		} else {
-			*res = dbl_nil;
-		}
+		hasnull = FALSE;
+		bi = bat_iterator(b);
+		x = 0;
+		sum = 0;
+		sumsqr = 0;
+		stdev = 0;
+		if (b->T->nonil) {
+			BATloop(b, p, q) {
+				x = (dbl) *(@1*) BUNtloc(bi, p);
+				sum += x;
+				x *= x;
+				sumsqr += x;
+			}
+		} else {
+			BATloop(b, p, q) {
+				value = *(@1*) BUNtloc(bi, p);
+				if (value == @1_nil) {
+					hasnull = TRUE;
+					break;
+				} else {
+					x = (dbl) value;
+					sum += x;
+					x *= x;
+					sumsqr += x;
+				}
+			}
+		}
+		if (!hasnull) {
+			stdev = sumsqr - (sum*sum)/cnt;
+			/* theoretically impossible, but a safeguard against
+			   potential accumulation error */
+			if (stdev < 0) {
+				*res = 0;
+			} else {
+				*res = sqrt(stdev/cnt);
+			}
+		} else {
+			*res = dbl_nil;
+		}
 	} else {
 		*res = dbl_nil;
 	}
```

Comment 17941

Date: 2012-11-13 21:50:08 +0100
From: Yin <>

Created attachment 158
compute stdev with one scan of BAT

Attached file: proposed_fix_bug3178.txt (text/plain, 3102 bytes)
Description: compute stdev with one scan of BAT

Comment 17960

Date: 2012-11-20 13:36:41 +0100
From: @grobian

Hi Yin,

This is nice! However, I'm wondering what the error bound of this approach is. Agreed this would be faster, but if it is less "correct" we might want to introduce a variant of stddev that uses your implementation. If this is a very broadly accepted way of computing stddev, we might also drop our current "exact" implementation, or make it available under another name, such that the user can choose from the two.

Comment 17967

Date: 2012-11-20 20:26:54 +0100
From: Yin <>

Thank you very much for your comments, Fabian.

I had the same concern about the accumulation error of the one-scan approach as you do, even though mathematically the two approaches should yield exactly the same result. About 10 years ago I ran some experiments on both approaches and did not find any noticeable differences between the results, but the sample sizes I tried then were not huge.

For sample sizes on the order of 1 trillion, I have the same concern even with the two-scan approach. For example, after 500 billion entries have been added to the sum of squares, the remaining entries may no longer register, since they may well fall within the range of the accumulation error. For a huge BAT, if precision is the top priority, we may need a hierarchical approach: divide the huge sample into manageable blocks, compute the partial results per block, and then aggregate them.

Comment 17968

Date: 2012-11-20 23:19:15 +0100
From: @sjoerdmullender

The sentiment of doing a single scan calculation is good, but I don't think this is the right way to do it. The problem is with the (large) potential for overflow and for large errors in the result. The overflow can arise easily because you calculate the sum of the squares (note that this is not different from the current implementation). When calculating this sum (and also the normal sum), the order in which the calculation is done can cause errors. If the first values are large and later values are small, the small ones may get ignored because they don't register. Try e.g. a table that starts with one large number (1e8 will probably do) that is followed by millions of small (but not zero, e.g. 1) numbers.

I am currently investigating a different single-scan approach (see http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm).

Comment 17977

Date: 2012-11-21 19:05:42 +0100
From: Yin <>

Definitely, there is a lot of room for improvement over the naive implementation.

If the data come approximately from a normal distribution whose mean is not too large in absolute value compared with its stddev, the naive implementation will yield reasonable results. For data sampled from a Cauchy distribution, the mean and stddev are not well defined anyway; no matter what algorithm you use, the computed stddev does not mean much. The situation is similar for other long-tailed distributions such as power-law distributions, where the stddev is not a meaningful description of the dataset.

Comment 18035

Date: 2012-11-27 11:43:48 +0100
From: @yzchang

Test added in monetdb5/tests/BugTracker/Tests/stdev.Bug-3178.mal

Test results differ from the spreadsheet results.

Comment 18162

Date: 2012-11-27 17:42:30 +0100
From: Yin <>

Adding a test is definitely very helpful.

In the spreadsheet you used for the result comparison, which estimator is used: biased or unbiased? The original implementation in MonetDB (like the proposed one-scan implementation) computes the population variance of a finite population of size N, which differs from the unbiased sample variance. See this link for the definitions:
http://en.wikipedia.org/wiki/Variance

For a large sample size (a large value of N), the difference should not be significant.

Comment 18187

Date: 2012-11-28 13:44:59 +0100
From: @yzchang

Changeset 19a439c07ec9 made by Jennie Zhang y.zhang@cwi.nl in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=19a439c07ec9

Changeset description:

Added test for Bug #3178

Comment 18219

Date: 2012-11-29 10:54:04 +0100
From: @sjoerdmullender

Changeset bfb1f607de02 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=bfb1f607de02

Changeset description:

Implemented standard deviation (sample and population) in single scan.
This implementation is currently not yet used by SQL.
See bug #3178.

Comment 18450

Date: 2013-01-29 14:47:50 +0100
From: @sjoerdmullender

Since changeset 498738535dfe, the single-scan implementation is used for the stddev_samp and stddev_pop (and var_samp and var_pop) aggregates in SQL.

Comment 18507

Date: 2013-02-19 13:17:56 +0100
From: @sjoerdmullender

Feb2013 has been released.
