An implementation of the HyperLogLog approximate cardinality estimation algorithm (as well as Linear Counting), as a Hive User-defined Aggregation Function (UDAF).
Relies on stream-lib for implementation of the relevant algorithms.
Clone the repo:

```
$ git clone https://github.com/MLnick/hive-udf.git
```

Then build the fat JAR using Maven:

```
$ mvn package
```
The default versions to build against are Hive 0.9.0 and Hadoop 1.1.0; however, the packaging excludes these dependencies under the assumption that they will be provided by the Hadoop/Hive installation (e.g. on a cluster). I've tested on CDH4.0.0 in production, but this should also work if you've set up Hadoop and Hive in local mode, assuming your CLASSPATH is correct.
If running in local mode, simply start Hive using `/path/to/hive/bin/hive`. If running on a cluster, copy the fat JAR (`hive-udf-VERSION.jar`) to an appropriate directory on the cluster, `cd` to that directory, and start Hive.
Then add the jar to the Hive CLI runtime and register the function:

```
hive> add jar hive-udf-0.0.1-SNAPSHOT.jar;
hive> create temporary function approx_distinct as 'com.github.mlnick.hive.udaf.UDAFCardinalityEstimator';
```

(You can also supply the full path to the jar when adding it, e.g. `add jar /path/to/jar/XYZ.jar`, or put your UDF jars in S3, for example.)
If you run into Java heap errors (likely with a higher `b` parameter for HyperLogLog), try increasing the memory allocated:

```
hive> set mapred.child.java.opts=-Xmx2000m;
```
The UDAF returns a struct containing the following fields:

| Field | Description |
| --- | --- |
| `type` | Type of cardinality estimator (`HLL` or `LC`) |
| `cardinality` | Estimated cardinality of the column |
| `binary` | Binary-serialised cardinality estimator |

You can therefore pull out just the cardinality using `column.cardinality`, or store the full result in a column of type `struct {type string; cardinality bigint; binary binary}` for the relevant column, to be extracted later.
NOTE: if you just return the full struct (i.e. `select approx_distinct(x) ...`), you will get a bunch of binary nonsense in the CLI, so to view the results use `.cardinality`.
Try it out on your data!
```sql
-- the default is HyperLogLog with b=16
select
  approx_distinct(COLUMN).cardinality,
  approx_distinct(COLUMN, 'hll', 5).cardinality,
  approx_distinct(COLUMN, 'hll', 24).cardinality
from TABLE;
select count(distinct COLUMN) from TABLE;

-- default for Linear Counting is 1 million bits size
select approx_distinct(COLUMN, 'lc').cardinality from TABLE;
select count(distinct COLUMN) from TABLE;
```
NOTE: using `approx_distinct(x)` and `count(distinct x)` in the same query leads to strange Java heap space errors (I'm still trying to figure out why).
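For intuition on what the `'lc'` option is doing: Linear Counting hashes each value into a large bitmap and estimates cardinality from the fraction of bits that remain zero, via n̂ = -m · ln(V/m), where m is the bitmap size and V the number of zero bits. Below is a minimal Python sketch of the idea (a toy illustration; the UDAF itself uses stream-lib's implementation):

```python
import hashlib
import math

def lc_estimate(values, m=1_000_000):
    """Toy Linear Counting: hash each value into an m-bit map, then
    estimate cardinality from the fraction of bits still unset."""
    bitmap = bytearray(m // 8 + 1)
    for v in values:
        # stable 64-bit hash of the value, mapped to a bit position
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big") % m
        bitmap[h // 8] |= 1 << (h % 8)
    ones = sum(bin(b).count("1") for b in bitmap)
    zeros = m - ones
    return -m * math.log(zeros / m)

# 100k unique values against the default-sized 1-million-bit map:
# at this low load factor Linear Counting is very accurate.
est = lc_estimate(range(100_000))
```

The estimator corrects for hash collisions analytically, which is why it stays accurate as long as the bitmap is not close to full.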
My quick and dirty tests indicate that the Linear Counting approach is not very robust when used with MapReduce, since the error rates seem to compound very quickly even for high bit sizes. HyperLogLog, with a sufficient bit-size parameter (roughly in the `b = 16` to `b = 26` range), seems fairly robust for moderate numbers of unique values.
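HyperLogLog's robustness under MapReduce comes from the fact that sketches merge losslessly: taking the register-wise max of per-mapper sketches yields exactly the sketch of the union of their inputs (stream-lib handles this internally; the toy Python version below, with only the small-range correction, is just to illustrate the mechanics):

```python
import hashlib
import math

def sketch(values, b=14):
    """Toy HyperLogLog sketch: 2^b registers, each holding the max
    'rank' (position of the first 1-bit) seen for its bucket."""
    m = 1 << b
    regs = [0] * m
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - b)               # top b bits pick the register
        w = h & ((1 << (64 - b)) - 1)     # remaining bits give the rank
        rank = (64 - b) - w.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    return regs

def estimate(regs):
    m = len(regs)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * m and zeros:          # small-range (Linear Counting) correction
        return m * math.log(m / zeros)
    return raw

def merge(a, b):
    """Register-wise max: the merged sketch equals the union's sketch."""
    return [max(x, y) for x, y in zip(a, b)]

# Two 'mappers' see overlapping halves of 150k unique values;
# merging their sketches estimates the cardinality of the union.
s1 = sketch(range(0, 100_000))
s2 = sketch(range(50_000, 150_000))
est = estimate(merge(s1, s2))
```

Because the merge is exact, splitting work across any number of mappers adds no error beyond the sketch's inherent variance, which is what a plain Linear Counting bitmap cannot guarantee once bitmaps saturate at different rates.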
Here are some stats:

Expected Accuracy (HyperLogLog)

| `b` parameter | Expected error |
| --- | --- |
| 5 | 18.38% |
| 16 | 0.41% |
| 24 | 0.03% |
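These expected-accuracy figures follow from HyperLogLog's theoretical standard error of 1.04 / √(2^b):

```python
import math

def expected_error(b):
    """HyperLogLog standard error with 2^b registers: 1.04 / sqrt(2^b)."""
    return 1.04 / math.sqrt(2 ** b)

for b in (5, 16, 24):
    # reproduces the table above: 18.38%, 0.41%, 0.03%
    print(f"b={b}: {expected_error(b):.2%}")
```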
QUERY 1

700 million rows, ~7.5 million unique values. ~200 mappers, 1 reducer (1G heap).

| Algorithm | Cardinality Estimate | Error Rate (%) |
| --- | --- | --- |
| COUNT(DISTINCT x) | 7,559,481 | (exact) |
| Linear Counting | 6,266,403 | -17.11% |
| HyperLogLog (b=5) | 9,625,215 | 27.33% |
| HyperLogLog (b=16) | 7,536,626 | -0.30% |
| HyperLogLog (b=24) | 7,546,374 | -0.17% |
QUERY 2 - with Group By

| Grouped Field (sample) | Records | Exact Uniques | Approx Uniques (b=16) | Error (%) |
| --- | --- | --- | --- | --- |
| A | 233,024,253 | 3,451,201 | 3,446,017 | -0.15% |
| B | 223,350,154 | 2,810,703 | 2,796,881 | -0.49% |
| C | 152,874,611 | 2,889,500 | 2,884,780 | -0.16% |
| D | 22,144,205 | 573,330 | 571,830 | -0.26% |
| E | 20,638,765 | 1,322,098 | 1,320,369 | -0.13% |
| F | 19,215,734 | 504,509 | 504,751 | 0.05% |
| G | 18,896,848 | 280,102 | 280,943 | 0.30% |
| H | 17,998,980 | 3,173,567 | 3,170,974 | -0.08% |

- Overall uniques: 7,679,458
- Overall approx uniques: 7,665,425
- Overall error rate: -0.18%
QUERY 3 - Large cardinality, with and without long-range correction

| Measure | Value |
| --- | --- |
| Overall distincts | 500,000,000 |
| Approx distincts (b=26, with long-range correction) | 529,757,672 |
| Error rate | -5.62% |
| Approx distincts (b=26, without long-range correction, the default) | 498,447,667 |
| Error rate | -0.31% |
Error rates without long-range correction for large cardinalities are of similar magnitude to those for smaller cardinalities, although for `b=26` and 1.7 billion uniques the error rate does climb to about 5%.
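For reference, the long-range correction from the original HyperLogLog paper maps a raw estimate E (with 32-bit hashes) to E* = -2^32 · ln(1 - E/2^32). Applying it to the uncorrected figure from Query 3 lands very close to the corrected one, suggesting the correction itself accounts for most of the extra error at this scale (a sketch; stream-lib's exact intermediate values may differ slightly):

```python
import math

TWO_32 = 2 ** 32

def long_range_correct(raw):
    """Long-range correction for 32-bit hashes: -2^32 * ln(1 - E / 2^32)."""
    return -TWO_32 * math.log(1 - raw / TWO_32)

# ~498.4M (uncorrected, -0.31% error) maps to ~529.8M,
# close to the corrected table entry of 529,757,672.
print(long_range_correct(498_447_667))
```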
Relies on Clearspring's excellent stream-lib project for the algorithm implementations.
See this blog post, this blog post and this blog post for in depth discussions of HyperLogLog and Linear Counting (as well as Count-Min Sketch and others).
The actual papers:
- HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
- A linear-time probabilistic counting algorithm for database applications
TODO:

- Tests!
- Add set intersection for HyperLogLog to stream-lib
- Other algorithms (count-min sketch etc.)
- HyperLogLog (and other monoids) from algebird in Scala (for fun!)