one scan is enough to implement ALGstdev_@1 in monetdb5/modules/kernel/algebra.mx #3178
Date: 2012-11-06 22:59:41 +0100
Last updated: 2013-02-19 13:17:56 +0100
The current implementation takes two scans of the BAT.
By using a well-known mathematical property of the stdev, one scan is enough:
Yin's one-scan implementation is available as the following diff:
diff -r bcdb312657ff monetdb5/modules/kernel/algebra.mx
Date: 2012-11-13 21:50:08 +0100
Created attachment 158
Date: 2012-11-20 13:36:41 +0100
This is nice! However, I'm wondering what the error bound of this approach is. Agreed this would be faster, but if it is less "correct" we might want to introduce a variant of stddev that uses your implementation. If this is a very broadly accepted way of computing stddev, we might also drop our current "exact" implementation, or make it available under another name, such that the user can choose from the two.
Date: 2012-11-20 20:26:54 +0100
Thank you very much for your comments, Fabian.
I had the same concern about the accumulation error of the one-scan approach as you do, even though mathematically the two should be exactly the same. About 10 years ago I ran some experiments with both approaches and did not find any noticeable differences between the results, but the sample sizes I tried then were not huge.
For sample sizes on the order of 1 trillion, I have the same concern even with the two-scan approach. For example, after 500 billion entries have been added to the sum of squares, the remaining ones may no longer register, since they may well fall within the range of the accumulation error. For a huge BAT, if precision is the top priority, we may need a hierarchical approach: divide the huge sample into manageable blocks, compute the statistics for each block first, and then aggregate the results.
Date: 2012-11-20 23:19:15 +0100
The sentiment of doing a single scan calculation is good, but I don't think this is the right way to do it. The problem is with the (large) potential for overflow and for large errors in the result. The overflow can arise easily because you calculate the sum of the squares (note that this is not different from the current implementation). When calculating this sum (and also the normal sum), the order in which the calculation is done can cause errors. If the first values are large and later values are small, the small ones may get ignored because they don't register. Try e.g. a table that starts with one large number (1e8 will probably do) that is followed by millions of small (but not zero, e.g. 1) numbers.
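The failure mode described here is easy to reproduce in plain C (an illustrative sketch, not MonetDB code): once the running sum of squares reaches 1e16, the spacing between adjacent doubles is 2, so adding 1.0 has no effect at all:

```c
#include <assert.h>

/* Reproduce the ordering problem: one large value followed by many
 * small ones.  The squares of the small values vanish against 1e16. */
static int
small_squares_vanish(void)
{
    double sumsq = 0.0;
    sumsq += 1e8 * 1e8;           /* 1e16: double spacing here is 2 ulps = 2.0 */
    double before = sumsq;
    for (int i = 0; i < 1000000; i++)
        sumsq += 1.0 * 1.0;       /* each addition rounds back to 1e16 */
    return sumsq == before;       /* returns 1: a million additions changed nothing */
}
```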
I am currently investigating a different single scan approach (see http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm).
Date: 2012-11-21 19:05:42 +0100
Definitely, there is a lot of room for improvement over the naive implementation.
If the data come approximately from a normal distribution whose mean is not too large in absolute value compared with its stddev, then the naive implementation will yield reasonable results. For data sampled from a Cauchy distribution, the mean and stddev are not well defined anyway; no matter what algorithm you use, the computed stddev does not have much meaning. The situation is similar for other long-tailed distributions such as power-law distributions, where the stddev is not a meaningful description of the dataset.
Date: 2012-11-27 11:43:48 +0100
Test added in monetdb5/tests/BugTracker/Tests/stdev.Bug-3178.mal
test results are different from spreadsheet results
Date: 2012-11-27 17:42:30 +0100
Adding a test is definitely very helpful.
In the spreadsheet you used for the result comparison, which estimator is used, the unbiased or the biased one? The original implementation in MonetDB (and the proposed one-scan implementation) computes the population variance of a finite population of size N, which differs from the unbiased sample variance. Please see the link for the definition:
For a large sample size (a large value of N), the difference should not be significant.
Date: 2012-11-28 13:44:59 +0100
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=19a439c07ec9
Date: 2012-11-29 10:54:04 +0100
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=bfb1f607de02
Date: 2013-01-29 14:47:50 +0100
Since changeset 498738535dfe the single scan implementation is used for the stddev_samp and stddev_pop (and var_samp and var_pop) aggregates in SQL.
Date: 2013-02-19 13:17:56 +0100
Feb2013 has been released.