# Q-digests

* A Q-digest is a binary tree whose leaf nodes are value counts for the associated quantiles.
* The set of possible buckets is chosen from a binary partition over the set of all possible values, denoted $\{1, \ldots,\sigma\}$.
* Quantiles with too few observations are compressed away when possible. The end result is an imbalanced binary tree whose nodes are compressed value counts.
* The fundamental difference between Q-digests and "classical" equi-depth histograms is that Q-digests have overlapping buckets. That is, the boundary points on leaves may overlap with those of nearby leaves.
* The compression parameter is $k$.
* Every node in the Q-digest follows two invariants.
* The first is that the number of elements in any node included in the Q-digest has to satisfy $\text{count}(v) \leq \lfloor n \:/\: k \rfloor$.
* The second condition is that any node, parent, and sibling (triangle arrangement) must satisfy $\text{count}(v) + \text{count}(v_p) + \text{count}(v_s) > \lfloor n \: / \: k \rfloor$.
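* The exceptions are leaf nodes, which are exempt from the first property (a single value may occur more than $\lfloor n \: / \: k \rfloor$ times), and the root node, which is exempt from the second because it has no parent or sibling.
* Really, only the second property is important; the first is mainly used for proving certain properties.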
* Define $\Delta_v \equiv \text{count}(v) + \text{count}(v_l) + \text{count}(v_r)$, where $v_l$ and $v_r$ are the left and right children of $v$. In other words, let $\Delta_v$ be the combined count of a node and its two children.
* To build a Q-digest, first construct a binary tree whose leaves are individual values. Then, compress the tree by recursively compressing non-root, non-leaf nodes for which $\Delta_v \leq \lfloor n \: / \: k \rfloor$.
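
The build-and-compress step in the last bullet can be sketched on a complete binary tree stored as a dictionary. This is only an illustration under stated assumptions, not the implementation used in the cells below: the heap-style node numbering, the requirement that $\sigma$ is a power of two, and the names `delta` and `compress_leaves` are all choices made for this sketch.

```python
def delta(counts, v):
    """Δ_v: the count at node v plus the counts at its two children."""
    return counts.get(v, 0) + counts.get(2 * v, 0) + counts.get(2 * v + 1, 0)


def compress_leaves(leaf_counts, k):
    """Build a tree over sigma = len(leaf_counts) values and compress it.

    Assumptions of this sketch: sigma is a power of two, and nodes are
    numbered heap-style (root is 1, children of v are 2v and 2v + 1,
    leaves are sigma .. 2 * sigma - 1).
    """
    sigma = len(leaf_counts)
    n = sum(leaf_counts)
    threshold = n // k  # floor(n / k)

    # Start with one leaf per value; empty leaves are simply not stored.
    counts = {sigma + i: c for i, c in enumerate(leaf_counts) if c > 0}

    # Visit non-root, non-leaf nodes from the deepest level upward.
    # With heap numbering, descending ids never visit a parent before its children.
    for v in range(sigma - 1, 1, -1):
        merged = delta(counts, v)
        if 0 < merged <= threshold:
            # Fold both children into v.
            counts.pop(2 * v, None)
            counts.pop(2 * v + 1, None)
            counts[v] = merged
    return counts, threshold


# Values 1..8 with these observed counts; n = 12, k = 4, so floor(n / k) = 3.
counts, threshold = compress_leaves([1, 1, 1, 1, 5, 1, 1, 1], k=4)
print(threshold)               # 3
print(sorted(counts.items()))  # [(3, 2), (4, 2), (5, 2), (12, 5), (13, 1)]
```

The heavy leaf (node 12, count 5) survives on its own, while the sparse leaves are folded into their parents. Because the sketch follows the bullet above and never compresses into the root, a node directly below the root can still end up with a small count.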

In [210]:
class Node():
    def __init__(self, v, l, r):
        self.v = v  # value
        self.l = l  # left sub-node
        self.r = r  # right sub-node
        
    def get_level(self, level):
        nodes = bfs_traverse(self)
        return [t[0] for t in nodes if t[1] == level]
    
    def Δ_v(self):
        tot = 0
        if self.l:
            tot += self.l.v
        if self.r:
            tot += self.r.v
        tot += self.v
        return tot
        
        
def bfs_traverse(root, level=0, parent=None):
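    """
    Returns a list of tuples whose first element is a node, the second is its depth, and the third its parent.
    Despite the name, nodes are visited depth-first (pre-order); callers only rely on the recorded depth.
    """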
    ret = [(root, level, parent)]
    if root.l:
        ret += bfs_traverse(root.l, level + 1, root)
    if root.r:
        ret += bfs_traverse(root.r, level + 1, root)
    return ret            


def treeify(values):
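    """
    Given a sorted list of values, returns a binary tree (or None for an empty list).
    Positions determined by list sort order, values determined by list entry value.
    """
    if len(values) == 0:
        # An empty slice means "no child". Returning a Node with value None here
        # would break the count sums for even-length inputs.
        return None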
    elif len(values) == 1:
        return Node(values[0], None, None)
    else:
        pivot_idx = len(values) // 2
        left = treeify(values[:pivot_idx])
        right = treeify(values[pivot_idx + 1:])
        return Node(values[pivot_idx], left, right)


class QDigest():
    def __init__(self, k, values):
        """
        K is the compression factor. Values is a sorted list of counts: that is, how often each value of 
        interest appears. For example if 1 appears twice, 2 appears once, and 3 appears five times, the 
        values input would be [2, 1, 5] (NOT [1, 2, 3]).
        """
        self.k = k
        self.root = treeify(values)
        self.compress()
        
    def get_level(self, level):
        """Returns pointers to all nodes on the given level."""
        return self.root.get_level(level)
    
    def count(self):
        """Returns the total count of all elements in the digest (summed over every node)."""
        return sum([n[0].v for n in bfs_traverse(self.root)])
    
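    def compress(self):
        """
        Bottom-up pass over the non-root, non-leaf levels: any node whose Δ_v is at most
        floor(n / k) is replaced by a single leaf holding the combined count.
        """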
        traversed_node_list = bfs_traverse(self.root)
        
        # The maximum level is the leaf level, but compression occurs at the levels above.
        max_level = max([tup[1] for tup in traversed_node_list])
        n = sum([tup[0].v for tup in traversed_node_list])
        n_div_k = n // self.k
        
        # Data needed for node traversal.
        nodes = [tup[0] for tup in traversed_node_list]
        levels = [tup[1] for tup in traversed_node_list]
        parents = [tup[2] for tup in traversed_node_list]
        
        for nlvl in range(max_level - 1, 0, -1):
            idxs = [i for i, level in enumerate(levels) if level == nlvl]
            
            for idx in idxs:
                node, parent = nodes[idx], parents[idx]
                if node.Δ_v() <= n_div_k:
                    print("""Compressing node with value {0} at level {1} b/c its total weight {2} is less than {3}.""".format(
                        node.v, levels[idx], node.Δ_v(), n_div_k
                    ))
                    new_v = node.v
                    new_v += node.l.v if node.l else 0
                    new_v += node.r.v if node.r else 0
                    
                    new_node = Node(new_v, None, None)
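                    # Pick the side by identity: counts are not ordered, so comparing
                    # node.v with parent.v can attach the new node to the wrong side.
                    if parent.l is node:
                        parent.l = new_node
                    else:
                        parent.r = new_node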
    
    def insert(self, values):
        """Union operation."""
        pass

* The Q-digest algorithm is designed for providing the data "all at once". This is because it was designed in the context of exporting data from sensor networks, not in the context of online summary statistic calculations.
* A `merge` procedure is provided for combining multiple Q-digests. It's straightforward: take the union of the trees, then compress again.
* Technically, adding a single value to the tree is the same as merging with a digest that contains only that value.
* This implementation could be improved upon...
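
The merge described above can be sketched as follows. This is a hypothetical illustration, not part of the `QDigest` class in this notebook (its `insert` method is still a stub). It assumes both digests cover the same value range $\{1, \ldots, \sigma\}$ with $\sigma$ a power of two, and that each digest is stored as a dictionary from heap-numbered node id to count. The names `compress_counts` and `merge_digests` are made up for this sketch.

```python
from collections import Counter


def compress_counts(counts, sigma, k):
    """Re-compress a digest stored as {node_id: count} (heap numbering, root = 1)."""
    n = sum(counts.values())
    threshold = n // k  # floor(n / k)
    out = dict(counts)
    # Non-root, non-leaf nodes, deepest level first.
    for v in range(sigma - 1, 1, -1):
        total = out.get(v, 0) + out.get(2 * v, 0) + out.get(2 * v + 1, 0)
        if 0 < total <= threshold:
            out.pop(2 * v, None)
            out.pop(2 * v + 1, None)
            out[v] = total
    return out


def merge_digests(a, b, sigma, k):
    """Union of two digests: add counts node by node, then compress again."""
    union = Counter(a)
    union.update(b)  # Counter.update adds counts instead of overwriting them
    return compress_counts(dict(union), sigma, k)


# Two small digests over sigma = 4 values (leaves are nodes 4..7).
a = {4: 3, 5: 1}
b = {5: 1, 6: 1, 7: 1}
merged = merge_digests(a, b, sigma=4, k=2)
print(sorted(merged.items()))  # [(3, 2), (4, 3), (5, 2)]

# Adding one observation of the third value (leaf 6) is a merge with a one-entry digest.
updated = merge_digests(merged, {6: 1}, sigma=4, k=2)
print(sorted(updated.items()))  # [(3, 3), (4, 3), (5, 2)]
```

Note that the threshold is recomputed from the combined $n$, so nodes that were too heavy to compress in either input may become compressible after the merge.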

In [211]:
q = QDigest(1, [1, 2, 3, 4, 5, 6, 7])

Compressing node with value 2 at level 1 b/c its total weight 6 is less than 28.
Compressing node with value 6 at level 1 b/c its total weight 18 is less than 28.


In [212]:
[(tup[1], tup[0].v) for tup in bfs_traverse(q.root)]

[(0, 4), (1, 6), (1, 18)]

In [135]:
QDigest(5, [5, 4, 3])

<__main__.QDigest at 0x111f35198>

In [123]:
QDigest(5, [5, 4, 3]).root.l

<__main__.Node at 0x111f6ab38>