Speed up _compute_centroid_quantile by using an accumulation tree #28

Closed
wants to merge 5 commits

Conversation

tkluck-booking

A profiling run shows that _compute_centroid_quantile is by far the biggest bottleneck, and that intuitively makes sense: its runtime is linear in the number of centroids in the tree, O(n), as opposed to O(log(n)) for the other tree operations.

Ted Dunning's original paper suggests speeding up this computation by storing partial sums at every node in the tree. This commit does that.
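
A minimal sketch of the idea (illustrative field names, not this PR's code): each node caches the total count of its subtree, so the prefix count that _compute_centroid_quantile needs follows a single root-to-leaf path instead of visiting every centroid.

    def count_below(node, x):
        # Total count of centroids with key < x; O(log n) on a balanced tree.
        # Assumes each node caches subtree_count = the count of its whole subtree.
        if node is None:
            return 0
        if x <= node.key:
            # node and its right subtree only hold keys >= x: descend left.
            return count_below(node.left, x)
        # node.key < x: the whole left subtree and node itself lie below x.
        left = node.left.subtree_count if node.left else 0
        return left + node.count + count_below(node.right, x)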

In this proof-of-concept, this means trading the C speed of FastRBTree for the Python speed of RBTree. Still, I get an almost 50% speedup for a test use case:

    time python -c'import random; from tdigest import TDigest; TDigest().batch_update(random.random() for _ in range(10000))'

proving yet again that better algorithms beat faster implementation languages.

A follow-up could be to write an accumulation tree in C or Cython, which should give the best of both worlds.

@tkluck-booking (Author)

^^ Fixing that integration check. I'd noticed it failing before but, for unrelated reasons, I didn't think this commit had introduced it. It did, obviously :)

On it - correctness is much more important than speed. Will re-open this pull request once that is done.

…tree

The major issue was that _path_to_key chose the wrong child node when
descending into the tree. The other issues are cosmetic, although this commit
also fixes the method get_accumulation (which is currently unused by TDigest).

The unit tests now (1) pass and (2) run 5 times faster.
@tkluck-booking (Author)

Just fixed the unit tests with a follow-up commit; let's see what Travis thinks.

@tkluck-booking reopened this Feb 3, 2017
@@ -55,8 +56,7 @@ def _add_centroid(self, centroid):

    def _compute_centroid_quantile(self, centroid):
        denom = self.n
        cumulative_sum = sum(
            c_i.count for c_i in self.C.value_slice(-float('Inf'), centroid.mean))
CamDavidsonPilon (Owner):

Oh! Interesting, this is linear. Nice catch.

@CamDavidsonPilon (Owner)

PR looks really fascinating, I bet it was fun to write. I don't have time now to review, but will this weekend!

@tkluck commented Feb 5, 2017

I now have a Cython implementation of this 'accumulation_tree'. It brings make test down to 11 seconds (from 160 seconds before the work in this branch, or 30 seconds with the pure-Python implementation). I could either add it to tdigest, or release it as a separate library and add it as a dependency. I could also add it as a pull request to bintrees, but its GitHub page says that its development has been abandoned. What do you think?

    def _get_left_accumulation(self, node, upper):
        if not node:
            return self._zero
        if(upper > node.key):
CamDavidsonPilon (Owner):

Correct me if I'm wrong, but this means our accumulation is not inclusive, right? I.e. [a, b)

Reply:

Correct.
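
To make the half-open convention concrete, a tiny self-contained illustration (not the PR's code) of what a left accumulation over counts returns:

    import bisect

    keys = [1.0, 2.0, 3.0]    # sorted centroid means
    counts = [5, 7, 11]       # centroid counts

    def left_accumulation(upper):
        # Sum of counts for keys strictly below upper: the interval [min, upper).
        return sum(counts[:bisect.bisect_left(keys, upper)])

    assert left_accumulation(3.0) == 12   # key 3.0 itself is excluded
    assert left_accumulation(3.5) == 23   # now it is included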

                    self._get_full_accumulation(node[0]),
                    self._get_left_accumulation(node[1], upper),
                ),
                self._mapper(node.value),
CamDavidsonPilon (Owner):

A part I'm not sure about: does the node store in its accumulation variable the sum of everything below it and its own value too? Or just everything below? Here it looks like we add in its own value, but below in _update_accumulation it looks like we already include it.

@tkluck (Feb 5, 2017):

does the node store in its accumulation variable the sum of everything below it and its own value too?

Yes, that's what it's supposed to do. So in particular, if it has no children, then node.accumulation = self._mapper(node.value). Which part of _update_accumulation makes you think this does not hold? (It's perfectly possible I missed a case.)
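
For the record, a sketch of how _update_accumulation could maintain that invariant (illustrative; the actual function in this branch may differ in details):

    def _update_accumulation(self, node):
        # Invariant: node.accumulation covers the node itself plus both
        # subtrees, so a leaf gets node.accumulation == self._mapper(node.value).
        accumulation = self._mapper(node.value)
        for child in (node[0], node[1]):
            if child is not None:
                accumulation = self._reducer(accumulation, child.accumulation)
        node.accumulation = accumulation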


@tkluck (Feb 5, 2017):

Ah yes, I see where you are coming from. No, I don't think we're adding the value twice: when we visit node in that function, we are not sure that we need the entire accumulation under node, which is why you'll notice node.accumulation is not accessed at all. This is the reason there is no double counting: we either use self._mapper(node.value) or node.accumulation, but never both in the same function.

Let's go over it in detail. By definition, _get_left_accumulation(self, node, upper) returns the sum of all nodes n under node (including node itself) which satisfy n.key < upper. "By definition" is to say: that's what I want it to do. Let me try to convince you that indeed, it does:

def _get_left_accumulation(self, node, upper):

if node is None, this accumulation is zero. This is the stopping condition for the recursion.

        if not node:
            return self._zero

if upper > node.key, we now need to retrieve the sum of the following three things:

  • node itself (see definition);
  • the sum of node.left (node[0]) and all its children, because all of them have n.key < node.key, and therefore also n.key < upper. We call _get_full_accumulation on node.left;
  • the part of the tree rooted at node.right that satisfies n.key < upper. We just call this function recursively.
        if(upper > node.key):
            return self._reducer(
                self._reducer(
                    self._get_full_accumulation(node[0]),
                    self._get_left_accumulation(node[1], upper),
                ),
                self._mapper(node.value),
            )

if upper <= node.key, we can ignore this node and everything larger, so we ignore node[1] and only recurse into node[0].

        else:
            return self._get_left_accumulation(node[0], upper)

Does this explanation help? You can reason all other functions through in the same way.

(edit: fix flipped node[0] vs node[1] in the explanation for case three.)
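
For completeness, the base case the explanation leans on, _get_full_accumulation, just reads the cached value; a sketch consistent with the snippets above (not necessarily the exact code in this branch):

    def _get_full_accumulation(self, node):
        # node.accumulation already covers node and its entire subtree.
        if not node:
            return self._zero
        return node.accumulation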

CamDavidsonPilon (Owner):

gotcha, I now understand after some more thought. Thanks for the explanation.


        self._dirty_nodes = set()

        class AccumulatingNode(Node):
CamDavidsonPilon (Owner):

Having this class inside the Tree class is a very clever way to share variables like _dirty_nodes 👍

Reply:

Haha yes, ever since I learned about it, I've been able to cut a lot of

def __init__(self, variable1, variable2, ...):
    self._variable1 = variable1
    self._variable2 = variable2
    ....

boilerplate.
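
A minimal sketch of the pattern (hypothetical method names; the real AccumulatingNode does more): a class defined inside a method closes over that method's locals, so the tree never has to be passed into every node constructor.

    class Node:
        pass

    class Tree:
        def __init__(self):
            self._dirty_nodes = set()
            tree = self  # captured by the closure below

            class AccumulatingNode(Node):
                def mark_dirty(self):
                    # No tree-passing boilerplate: this method closes
                    # over `tree` from the enclosing __init__.
                    tree._dirty_nodes.add(self)

            self._node_class = AccumulatingNode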

@CamDavidsonPilon (Owner)

I have a few questions (above). Overall the code is excellent. I feel like I'm looking at code that I want to write in a few years. Not to mention the algorithm looks great. 👏

(Did you write this yourself, or did you have a reference to help?)

@CamDavidsonPilon (Owner)

I could also add it as a pull request to bintrees - but its github page says that its development has been abandoned

I think this is the least favourable option. It probably wouldn't see the light of day if the project is abandoned.

I could either add it to tdigest

Since we've already abstracted the Cython parts (bintrees), tdigest can be made pure Python (pros: a simpler setup.py, easier installation, etc.). So, with that in mind ...

release it as a separate library and add it as a dependency.

I think this is the best option: it abstracts the Cython bits outside of this library, and other libraries can also benefit from the implementation (https://github.com/jcrist/crick is an example).

@tkluck commented Feb 5, 2017

release it as a separate library and add it as a dependency.

I think this is the best option: it abstracts the Cython bits outside of this library, and other libraries can also benefit from the implementation (https://github.com/jcrist/crick is an example).

All right, I'll get to it. I guess that means I'll move the code from this branch to that library instead. I'll try to mimic bintrees' behaviour of falling back gracefully to a Python implementation in case Cython is not available.

I have a few questions (above). Overall the code is excellent. I feel like I'm looking at code that I want to write in a few years. Not to mention the algorithm looks great. 👏

(Did you write this yourself, or did you have a reference to help?)

Thanks for being positive and open to pull requests! I must say, reaching out on GitHub can be hit-or-miss, and you seem to be one of those maintainers who manage to set exactly the right atmosphere for a community to form. Keep up the good spirit!

FWIW, yes, I wrote this. I'd done something similar before for computing quick accumulations of slices of arrays. The new bit was to combine those ideas with a binary search tree.

@tkluck commented Feb 5, 2017

^^ Just released that package; I haven't tested with Python 3 yet, so that's likely to fail the integration testing. Will look into that tomorrow.

requirements.txt Outdated
@@ -1,2 +1,3 @@
bintrees
CamDavidsonPilon (Owner):

we can drop bintrees now that it's a dependency of your package

… tree now

(Pushing this commit to the pull request also bumps Travis to run the
integration test against the newer version of the accumulation_tree library,
v0.3, which should solve the issue with Python 3.)
@tkluck commented Feb 6, 2017

Oh no! The integration checks are failing. The reason seems to be that I'm depending on a relatively new version of setuptools, which fixed this issue.

It was very hard to distribute a Python 2/3 Cython extension before that change. There were two ways to distribute your package:

  • either just require that the user always has Cython installed;
  • or ship the .c file that Cython generated in your source distribution, and fall back to that whenever Cython is not available.

Unfortunately, the second option can't be simultaneously Python 2 and Python 3 compatible, because Cython generates different C sources for each.

The newer version of setuptools allows you to just specify setup_requires=["cython"] and you're all set.
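
Concretely, that allows a setup.py along these lines (a sketch; the module path is illustrative, not necessarily accumulation_tree's actual layout):

    from setuptools import setup
    from setuptools.extension import Extension

    setup(
        name="accumulation_tree",
        # With setuptools >= 18.0, this pulls Cython in at build time,
        # so neither a shipped .c file nor a pre-installed Cython is needed.
        setup_requires=["cython"],
        ext_modules=[
            Extension(
                "accumulation_tree.accumulation_tree",
                ["accumulation_tree/accumulation_tree.pyx"],
            )
        ],
    )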

I'll try to reproduce Travis's environment and see what I can do to address this.

@@ -17,7 +17,8 @@ def read(fname):
    long_description=read('README.md'),
    install_requires=[
        "bintrees",
CamDavidsonPilon (Owner):

let's nuke this here too

@FlorianWilhelm

It's really nice to see that work is going on to improve the performance. On my laptop I see that about 10,000 updates take about 1 second. For my actual use case I would have to ingest hundreds of millions of values (or I just subsample :-). Nevertheless, what speedup do you expect from this PR with the new partial-sum method implemented in Cython?

@tkluck-booking (Author)

@FlorianWilhelm: the speedup is that each insert operation moves from being linear in the number of centroids to being logarithmic, so it should be much more than a constant factor. Feel free to check out this branch and test its performance on your machine.
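
A rough way to measure it (a sketch; run once on master and once on this branch to compare):

    import random
    import time

    from tdigest import TDigest

    for n in (10000, 100000):
        start = time.time()
        digest = TDigest()
        digest.batch_update(random.random() for _ in range(n))
        # With the linear prefix sum, total time grows roughly quadratically
        # in n; with the accumulation tree it should be closer to n*log(n).
        print(n, "updates:", round(time.time() - start, 2), "s")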

@CamDavidsonPilon (Owner)

Are you still able to contribute to this PR @tkluck? Otherwise I don't mind taking it over.

@tkluck commented Feb 19, 2017

@CamDavidsonPilon feel free to beat me to fixing this :) I'll pick it up eventually if you don't; for external reasons, it has dropped a few places on my list of priorities.

Thanks for checking in!
