Speed up _compute_centroid_quantile by using an accumulation tree #28
Conversation
…an accumulation tree

A profiling run shows that `_compute_centroid_quantile` is by far the bottleneck, and that intuitively makes sense because its runtime is bounded by the number of centroids in the tree, O(n) (as opposed to O(log(n)) for the other tree operations). Ted Dunning's original paper suggests speeding up this computation by storing partial sums at every node in the tree. This commit does that.

In this proof-of-concept, this means trading the C speed of FastRBTree for the Python speed of RBTree. Still, I get an almost 50% speedup for a test use case:

```shell
time python -c 'import random; from tdigest import TDigest; TDigest().batch_update(random.random() for _ in range(10000))'
```

proving yet again that better algorithms beat faster implementation languages. A followup could be to write an accumulation tree in C or Cython, which should give the best of both worlds.
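To illustrate the idea (this is a toy sketch, not the PR's red-black-tree implementation, and all names here are hypothetical): each node additionally stores the sum of counts in its own subtree, so the cumulative count of all centroids with mean below some threshold can be read off along one root-to-leaf path instead of by scanning every centroid.

```python
# Toy augmented BST: each node stores the total count of its subtree,
# so prefix sums of counts take O(depth) instead of O(n).
class Node:
    def __init__(self, key, count):
        self.key = key            # e.g. centroid mean
        self.count = count        # e.g. centroid count
        self.left = None
        self.right = None
        self.subtree_sum = count  # this node's count plus both subtrees

def insert(node, key, count):
    """Insert a (key, count) pair, maintaining subtree sums on the way down."""
    if node is None:
        return Node(key, count)
    node.subtree_sum += count
    if key < node.key:
        node.left = insert(node.left, key, count)
    else:
        node.right = insert(node.right, key, count)
    return node

def sum_below(node, upper):
    """Sum of counts over all keys < upper; walks a single path down."""
    if node is None:
        return 0
    if upper > node.key:
        left_sum = node.left.subtree_sum if node.left else 0
        return left_sum + node.count + sum_below(node.right, upper)
    return sum_below(node.left, upper)
```

The same query done with `value_slice` would visit every centroid below `upper`; here only O(depth) nodes are touched, which is where the claimed asymptotic win comes from (assuming the tree stays balanced, which the PR's red-black tree guarantees).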
^^ Fixing that integration check. I'd noticed it before, but for other reasons I didn't think this commit introduced it. It did, obviously :) On it; correctness is much more important than speed. Will re-open this pull request once that is done.
…tree

The major issue was that `_path_to_key` chose the wrong child node when descending into the tree. The other issues are cosmetic, although this commit also fixes the method `get_accumulation` (which is currently unused by TDigest). The unit tests now (1) pass and (2) run 5 times faster.
Just fixed the unit tests with a followup commit; let's see what travis thinks.
@@ -55,8 +56,7 @@ def _add_centroid(self, centroid):

```python
def _compute_centroid_quantile(self, centroid):
    denom = self.n
    cumulative_sum = sum(
        c_i.count for c_i in self.C.value_slice(-float('Inf'), centroid.mean))
```
ho! Interesting, this is linear. Nice catch
PR looks really fascinating, I bet it was fun to write. I don't have time now to review, but will this weekend!
I now have a cython implementation of this 'accumulation_tree'. It brings
tdigest/accumulation_tree.py (Outdated)

```python
def _get_left_accumulation(self, node, upper):
    if not node:
        return self._zero
    if(upper > node.key):
```
Correct me if I'm wrong, but this means our accumulation is not inclusive, right? i.e. [a, b)
Correct.
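For illustration (this brute-force model is mine, not code from the PR), the half-open semantics mean a query at an existing key excludes that key's own value:

```python
# Brute-force model of the tree's left-accumulation with half-open
# semantics: sums counts for keys in [-inf, upper), i.e. strictly below
# `upper`, so a key equal to `upper` does not contribute.
def left_accumulation(items, upper):
    return sum(count for key, count in items if key < upper)
```

This matches the `upper > node.key` comparison in the snippet above: equality falls through to the branch that skips the node.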
tdigest/accumulation_tree.py (Outdated)

```python
                self._get_full_accumulation(node[0]),
                self._get_left_accumulation(node[1], upper),
            ),
            self._mapper(node.value),
```
A part I'm not sure about: does the node store in its accumulation variable the sum of everything below and its own value too? Or just everything below? Here it looks like we add in its own value, but below in `_update_accumulation` it looks like we already include it.
> does the node store in its accumulation variable the sum of everything below and its own value too?

Yes, that's what it's supposed to do. So in particular, if it has no children, then `node.accumulation = self._mapper(node.value)`. Which part of `_update_accumulation` makes you think this does not hold? (It's perfectly possible I missed a case.)
Hm, if that's true, then aren't we adding `value` again in https://github.com/tkluck/tdigest/blob/e0acbee7ac74bddb51f8eaa90186d9d7839d6bbf/tdigest/accumulation_tree.py#L71-L76?
Ah yes, I see where you are coming from. No, I don't think we're adding the value twice. When we visit `node` in that function, we are not sure that we need the entire accumulation under `node`, which is why you'll notice `node.accumulation` is not accessed at all. This is the reason there is no double-counting: we either use `self._mapper(node.value)` or `node.accumulation`, but never both in the same function.

Let's go over it in detail. By definition, `_get_left_accumulation(self, node, upper)` returns the sum of all nodes `n` under `node` (including `node` itself) which satisfy `n.key < upper`. "By definition" is to say: that's what I want it to do. Let me try to convince you that indeed, it does:
```python
def _get_left_accumulation(self, node, upper):
```

If `node` is None, this accumulation is zero. This is the stopping condition for the recursion:

```python
    if not node:
        return self._zero
```
If `upper > node.key`, we need to retrieve the sum of the following three things:

- `node` itself (see definition);
- the sum of `node.left` (`node[0]`) and all its children, because all of them have `n.key < node.key`, and therefore also `n.key < upper`. We call `_get_full_accumulation` on `node.left`;
- the part of the tree rooted at `node.right` that satisfies `n.key < upper`. We just call this function recursively.
```python
    if(upper > node.key):
        return self._reducer(
            self._reducer(
                self._get_full_accumulation(node[0]),
                self._get_left_accumulation(node[1], upper),
            ),
            self._mapper(node.value),
        )
```
If `upper <= node.key`, we can ignore this node and everything larger, so we skip `node[1]` and only recurse into `node[0]`:

```python
    else:
        return self._get_left_accumulation(node[0], upper)
```
Does this explanation help? You can reason through all the other functions in the same way.

(Edit: fixed flipped `node[0]` vs `node[1]` in the explanation for case three.)
gotcha, I now understand after some more thought. Thanks for the explanation.
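As a quick sanity check of the recursion described in the thread above, here is a toy model (my own hypothetical names and classes, not the PR's `AccumulatingNode`) that compares the path-walking recursion against a brute-force flatten-and-filter reference:

```python
# Toy check of the left-accumulation recursion, with identity mapper
# and addition as the reducer.
class N:
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right

def full(node):
    """Sum of values of node and its entire subtree."""
    if node is None:
        return 0
    return full(node.left) + node.value + full(node.right)

def left_acc(node, upper):
    """Sum of values for all n under node (inclusive) with n.key < upper."""
    if node is None:
        return 0
    if upper > node.key:
        # everything in node.left has key < node.key < upper; recurse right
        return full(node.left) + node.value + left_acc(node.right, upper)
    # node and its right subtree have keys >= upper; only the left matters
    return left_acc(node.left, upper)

def brute(node, upper):
    """Reference implementation: visit every node and filter."""
    if node is None:
        return 0
    own = node.value if node.key < upper else 0
    return brute(node.left, upper) + own + brute(node.right, upper)
```

Note that only one of `full` (the whole-subtree sum, standing in for `node.accumulation`) or the node's own `value` is used at each visited node, which is the no-double-counting argument made above.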
tdigest/accumulation_tree.py (Outdated)

```python
        self._dirty_nodes = set()

    class AccumulatingNode(Node):
```
Having this class inside the Tree class is a very clever way to share variables like `_dirty_nodes` 👍
Haha yes, ever since I learned about it, I've been able to cut a lot of

```python
def __init__(self, variable1, variable2, ...):
    self._variable1 = variable1
    self._variable2 = variable2
    ...
```

boilerplate.
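A minimal sketch of the trick being praised here, under the assumption (all names hypothetical) that the inner class is defined per instance so its methods close over the enclosing object's state instead of copying every variable through an `__init__`:

```python
# The inner class's methods capture `tree` from the enclosing scope,
# so nodes can reach shared per-tree state (like a dirty set) without
# any attribute-copying boilerplate.
class Tree:
    def __init__(self):
        self._dirty_nodes = set()
        tree = self  # captured by the inner class below

        class Node:
            def mark_dirty(self):
                tree._dirty_nodes.add(self)

        self.Node = Node  # each Tree instance gets its own Node class
```

One design consequence worth noting: because every `Tree` builds its own `Node` class, nodes from different trees never share state by accident.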
I have a few questions (above). Overall the code is excellent. I feel like I'm looking at code that I want to write in a few years. Not to mention the algorithm looks great. 👏 (Did you write this yourself, or did you have a reference to help?)
I think this is the least favourable option. It probably wouldn't see the light of day if the project is abandoned.
Since we already abstracted cython parts (
I think this is the best option: it abstracts the cython bits outside of this library, and other libraries can also benefit from the implementation (https://github.com/jcrist/crick is an example).
All right, I'll get to it. I guess that will mean I'll move the code from this branch to that library instead. I'll try to mimic
Thanks for being positive and open to pull requests! I must say, reaching out on github can be hit-or-miss, and you seem to be one of those maintainers who manages to set exactly the right atmosphere for a community to form. Keep up the good spirit! FWIW, yes, I wrote this. I'd done something similar before for computing quick accumulations of slices of arrays. The new bit was to combine those ideas with a binary search tree.
See https://github.com/tkluck/accumulation_tree, which was written with exactly this use case in mind.
^^ Just released that package; haven't tested with python 3 yet, so that's likely to fail the integration testing. Will look into that tomorrow.
requirements.txt (Outdated)

@@ -1,2 +1,3 @@
bintrees
we can drop bintrees now that it's a dependency of your package
… tree now

(Pushing this commit to the pull request also bumps travis to run the integration test against the newer version of the accumulation_tree library, v0.3, which should solve the issue with python3.)
Oh no! The integration checks are failing. The reason seems to be that I'm depending on a relatively new version of setuptools. It was very hard to distribute a python2/3 cython extension before that change. There were two ways to distribute your package: require users to have cython installed so the extension could be built from the `.pyx` source, or pre-generate the C sources and ship those.

Unfortunately, the second option can't be simultaneously python2 and python3 compatible, because cython generates different C sources for each. The newer version of setuptools allows you to just specify cython as a build-time requirement. I'll try to reproduce travis' environment and see what I can do to address this.
@@ -17,7 +17,8 @@ def read(fname):

```python
    long_description=read('README.md'),
    install_requires=[
        "bintrees",
```
let's nuke this here too
It's really nice to see that work is going on to improve the performance. On my laptop, about 10,000 updates take about 1 second. For my actual use case I would have to ingest hundreds of millions of values (or I could just subsample :-). Nevertheless, what speedup do you expect from this PR with the new partial-sum method implemented in Cython?
@FlorianWilhelm: the speedup is each
Are you still able to contribute to this PR @tkluck? Otherwise I don't mind taking it over.
@CamDavidsonPilon feel free to beat me to fixing this :) I'll pick it up eventually if you don't; for external reasons, it has dropped a few places on my list of priorities. Thanks for checking in!