
nlogn concordance index algorithm (first pass) #145

Merged (8 commits) on Apr 21, 2015

Conversation

@benkuhn (Contributor) commented Apr 18, 2015

My dataset is about 200k rows, so even the Fortran concordance option takes more than a minute because it's an n^2 algorithm. I wrote a faster (n log n) version. On 100k rows of fake data, this takes the time down from 52s (previous fast n^2 Fortran version) to 4s (current pure-Python n log n version).

Right now it introduces a dependency on another library (blist) because I didn't want to write the order statistic tree myself. Unfortunately blist's order statistic trees might be O(log^2 n) instead of O(log n) for RANK operations, so right now this might be O(n log^2 n) in practice. Also I suspect blist is slower than a data structure that just tried to be an order statistic tree would be. Anyway, because of the dependency issue, I wanted to run this by the maintainers before I go any further with it. What do you recommend?

PS It's also not quite correct yet--it disagrees with the Fortran implementation on the full Cox model concordance test, and they both disagree with what I get in R. I still have to track this down.
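The counting idea behind the O(n log n) approach, in its simplest form (no censoring, no ties), is: sweep subjects in time order, keep the predictions seen so far in an order-statistic structure, and count how many earlier-dying subjects also had a smaller prediction. A hypothetical sketch of that loop, using a plain `bisect` list as a stand-in for the tree (so insertion here is O(n), but the counting logic is the same shape):

```python
import bisect

def concordant_fraction(times, preds):
    """Fraction of pairs ordered the same way by survival time and by
    prediction. Assumes no censoring and no tied values; a real
    implementation replaces the sorted list with a balanced tree so
    that insert and rank are both O(log n)."""
    order = sorted(range(len(times)), key=times.__getitem__)
    seen, concordant, total = [], 0, 0
    for i in order:
        # Every subject already seen died earlier; the pair is
        # concordant when its prediction is also smaller than ours.
        concordant += bisect.bisect_left(seen, preds[i])
        total += len(seen)
        bisect.insort(seen, preds[i])
    return concordant / total
```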

@CamDavidsonPilon (Owner)

I'm sorta okay with adding blist as a dependency: it's a lib that is not maintained much anymore, but on the other hand it really may offer significant gains. Is there another lib or snippet that can be used?

W.r.t. the new algorithm, I'll go over it tomorrow, and I invite @spacecowboy to take a look too :)

@benkuhn (Contributor Author) commented Apr 18, 2015

I thought of a simpler "balanced" binary tree that I could write myself (not adding an external dependency), based on the fact that we know the rank order of each prediction at the beginning. The basic idea is that you know what the final balanced tree would look like, that is, which node would contain each number. So when you shuffle nodes around to rebalance, you can just always move them towards their final position.

Here's how it goes:

  • For simplicity, suppose that the elements that we insert into the tree are the ranks, not the predictions themselves. Also assume n (list length) is a power of 2 for now.
  • Assign each node a position in the final list: the root node is n/2, its left and right children are n/4 and 3n/4 respectively, etc.
  • When inserting a new value x at a node already occupied by y, we knock y out if and only if reinserting y as a child would move it closer to its target (final) position.
  • Concretely, to add a new element x at a node that currently contains y (where the node's position in the list is i):
    • If x > y and y < i, then replace y with x and try to insert y as the left child of x
    • If x > y and y >= i, then try to insert x as the right child of y
    • If x < y and y > i, then replace y with x and try to insert y as the right child of x
    • If x < y and y <= i, then try to insert x as the left child of y
  • You can then get the rank of an element in O(log n) by keeping a count of the subtree size at each node.

I think this should allow a reasonably concise implementation (even when extending it to handle ties). But I don't know if it's worth the additional code complexity compared to a library dependency.
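The insertion rule above can be sketched roughly as follows (a hypothetical implementation, not the PR's code; the tree is stored as a flat array where position i holds the value destined for slot i, so a tree over n-1 ranks with n a power of two needs no rotations):

```python
def make_tree(n):
    """n must be a power of two; the tree holds the ranks 1..n-1."""
    return {"n": n, "vals": [0] * n, "cnt": [0] * n}

def insert(t, x):
    pos, step = t["n"] // 2, t["n"] // 4
    while True:
        t["cnt"][pos] += 1               # this subtree gains one element
        y = t["vals"][pos]
        if y == 0:                       # empty slot: settle here
            t["vals"][pos] = x
            return
        if x > y:
            if y < pos:                  # y's target is in the left subtree
                t["vals"][pos], x = x, y # x takes the slot, y moves down
                pos -= step
            else:
                pos += step              # descend right with x
        else:
            if y > pos:                  # y's target is in the right subtree
                t["vals"][pos], x = x, y
                pos += step
            else:
                pos -= step              # descend left with x
        step //= 2

def rank(t, x):
    """Count inserted elements strictly smaller than x, in O(log n)."""
    pos, step, r = t["n"] // 2, t["n"] // 4, 0
    while t["vals"][pos]:
        y = t["vals"][pos]
        left = t["cnt"][pos - step] if step else 0
        if x > y:
            r += left + 1                # left subtree plus y are all < x
            if step == 0:
                break
            pos += step
        elif x < y:
            if step == 0:
                break
            pos -= step
        else:
            r += left
            break
        step //= 2
    return r
```

Inserting all the ranks in any order ends with value i sitting at position i, i.e. the complete balanced BST.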

I could also just rewrite my stuff in Fortran and paste a license-compatible Fortran balanced tree implementation from somewhere else. Your call.

@spacecowboy (Collaborator)

This is excellent. I had actually never heard of an O(n log n) implementation before. Doing some "expert googling" turns up a single reference in the literature:

David Christensen. Fast algorithms for the calculation of Kendall's τ. Computational Statistics, 20:51–62, 2005.

I don't have time to really dig into the implementation right now as I am swamped for a few days more. I'll do some major reading/testing later. But be careful when relying on the Cox model for the test: the Cox model here still has some significant differences from R's implementation.

As fond as I am of my Fortran code, I suggest we keep this pure Python if we can. With an O(n log n) running time, there should be little reason for a native implementation. Keep blist for now; I can volunteer to write a sorted tree implementation at a later date.

@benkuhn (Contributor Author) commented Apr 18, 2015

@spacecowboy: There's at least one other reference to an O(n log n) implementation: the R documentation for survConcordance says

The algorithm is based on a balanced binary tree, which allows the computation to be done in O(n log n) time.

which is probably where I got the idea.

Today I found out that there's actually documentation for that code! It apparently uses a clever simplification of my suggestion for a balanced binary tree. Instead of moving nodes around in the tree, you can just keep track of whether each node in the tree is "occupied yet", and use that info when you're updating the array of subtree sizes. That should make it pretty simple to implement except for possibly handling ties.
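That simplification might look roughly like this (a hypothetical sketch, not the survival package's C code): since rank v's final slot in the complete tree is just position v, "inserting" a value only marks its slot occupied and bumps the subtree counts on the path from the root; no node ever moves.

```python
def make_counts(n):
    """n a power of two; slots 1..n-1 correspond to the ranks themselves."""
    return {"n": n, "cnt": [0] * n, "occupied": [False] * n}

def occupy(t, v):
    """Mark rank v as seen, bumping subtree sizes along the root path."""
    pos, step = t["n"] // 2, t["n"] // 4
    while pos != v:
        t["cnt"][pos] += 1
        pos += step if v > pos else -step
        step //= 2
    t["cnt"][v] += 1
    t["occupied"][v] = True

def n_smaller(t, x):
    """Number of occupied ranks strictly below x, in O(log n)."""
    pos, step, r = t["n"] // 2, t["n"] // 4, 0
    while True:
        left = t["cnt"][pos - step] if step else 0
        if x > pos:
            # count the left subtree, plus this slot if it's occupied
            r += left + (1 if t["occupied"][pos] else 0)
            pos += step
        elif x < pos:
            pos -= step
        else:
            return r + left
        step //= 2
```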


# This fails right now, and BOTH values disagree with the R output.
# TODO: investigate?
assert fast_py_cindex(T, P, E) == fast_cindex(T, P, E)
Owner

This fails because of ties in the dataset. If you change line 382 to np.random.binomial(50,0.5, size=size) (which will have ties), then the second assert will fail.

I'm concerned that the output is different from R. Are you using the survcomp package?

Collaborator

Make sure it's not just a floating point difference due to different summation order before you worry too much.

Contributor Author

Oh! I spotted the bug.

You have

    def valid_comparison(time_a, time_b, event_a, event_b):
        """True if times can be compared."""
        if time_a == time_b:
            # Ties are not informative
            return False

But actually, if event_a != event_b, you should return True.

Collaborator

That's actually true. Well spotted, sir.

Collaborator

So change to:

if time_a == time_b:
    return event_a != event_b
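Putting the corrected helper together might look like this (the non-tie branches below are reconstructed from the usual concordance definition, and are not necessarily the PR's exact code):

```python
def valid_comparison(time_a, time_b, event_a, event_b):
    """True if this pair of subjects is comparable for concordance."""
    if time_a == time_b:
        # A tied time is informative only when exactly one subject had
        # an event: that subject's true survival is then known to be
        # no longer than the censored subject's.
        return event_a != event_b
    if time_a < time_b:
        return bool(event_a)  # the earlier time must be an observed event
    return bool(event_b)
```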

@benkuhn (Contributor Author) commented Apr 19, 2015

OK, I stole the btree idea from R's survival package--it ended up being not that much code and was kind of neat to implement. I also fixed the bug in the slow concordance code; it now agrees with the fast Python implementation. Should I just get rid of the Fortran implementation for now, then? (It might be good to keep the pure-Python n^2 implementation in, since it's much simpler--it's useful to test the fast one against.)
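For reference, the pure-Python O(n²) oracle mentioned here might look roughly like this (a hypothetical sketch using the corrected tie rule from the discussion above; `naive_concordance_index` is an illustrative name):

```python
def naive_concordance_index(times, preds, events):
    """O(n^2) concordance index: higher predictions should accompany
    longer survival; tied predictions count as half-concordant."""
    num, den = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] == times[j]:
                if events[i] == events[j]:
                    continue            # uninformative tie
                # the subject with the event is known to die no later
                a, b = (i, j) if events[i] else (j, i)
            elif times[i] < times[j]:
                if not events[i]:
                    continue            # earlier subject was censored
                a, b = i, j
            else:
                if not events[j]:
                    continue
                a, b = j, i
            den += 1
            if preds[a] < preds[b]:     # shorter survival, lower score
                num += 1.0
            elif preds[a] == preds[b]:
                num += 0.5
    return num / den
```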

@benkuhn (Contributor Author) commented Apr 19, 2015

Timings for the latest commit:

SIZE = int(10e4)
pred = np.random.uniform(size=SIZE)
obs = np.random.binomial(100, 0.5, SIZE)
event = np.random.binomial(1, 0.5, SIZE)
%timeit ci.fast_concordance_index(obs, pred, event)
# 1 loops, best of 3: 1.44 s per loop
%timeit -r 1 fci.concordance_index(obs, pred, event) # fci is the Fortran module
# 1 loops, best of 1: 53 s per loop

Let me know if you see any other low-hanging optimizations; it's still pretty slow compared to the survival version, but that's because survival does the heavy lifting in C.

pairs = len(times_to_compare) * (next_ix - first_ix)
correct = 0
tied = 0
for i in xrange(first_ix, next_ix):
Collaborator

xrange does not exist in Python 3.

Contributor Author

Whoops, didn't realize lifelines was supposed to be py3k compatible as well!
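One common shim at the time was binding a single name at import (a sketch; the `range_` name is arbitrary):

```python
try:                  # Python 2: xrange is the lazy version
    range_ = xrange
except NameError:     # Python 3: range is already lazy
    range_ = range
```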

@spacecowboy (Collaborator)

Removing the Fortran code means we can also simplify setup.py, and possibly remove the __utils submodule. The slow version could be renamed "_naive_concordance_index" or something, if it's not removed.

@benkuhn (Contributor Author) commented Apr 20, 2015

OK, I tried to simplify the setup.py script. Let me know if anything's wrong--I haven't used it much before.

@CamDavidsonPilon (Owner)

👍 I'll give this a 🎩 after work

@benkuhn (Contributor Author) commented Apr 20, 2015

(rebased on latest master)

requires=[
"numpy",
"scipy",
"pandas(>=0.14)",
Owner

This syntax is new to me, where did you see it originally?

Contributor Author

I read the distutils documentation, which states:

If specific versions are required, a sequence of qualifiers can be supplied in parentheses. Each qualifier may consist of a comparison operator and a version number.

When I tried to run it without the parens (as it was before), it gave me a ValueError: expected parenthesized list: '>=0.14'. I thought this was because I changed the numpy.distutils.core line to distutils.core (since we don't need Fortran stuff anymore), but I just tried changing it back and it still doesn't work for me. So I'm not sure what's going on.

Owner

FYI, I installed this version fine just now (Pandas 0.16 was installed in a fresh virtualenv). Let me try removing the () too.

Owner

ValueError: expected parenthesized list: '>=0.14' ditto. But doing a quick search suggests using install_requires over requires. I did this locally with the () and it worked fine.

Contributor Author

Ah, crap, I missed this comment. When I use install_requires I get a UserWarning: Unknown distribution option: 'install_requires'

Contributor Author

Oh, apparently you have to use pip or easy_install to use install_requires instead of requires? Weird. I can correct this if you want.
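For reference, `install_requires` is a setuptools option; plain distutils only knows the metadata-only `requires` field, which is why it warns about an unknown option. A minimal sketch of the setuptools-style setup.py being discussed (field values illustrative, not the project's actual file):

```python
# Importing setup from setuptools (rather than distutils.core) makes
# install_requires available, and version pins are plain ">=" strings
# with no parentheses.
from setuptools import setup

setup(
    name="lifelines",
    install_requires=[
        "numpy",
        "scipy",
        "pandas>=0.14",
    ],
)
```

With setuptools, both `pip install .` and `python setup.py install` resolve the listed dependencies.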

Owner

How else can one install a lib without using pip or easy_install?

Owner

I should have looked here before merging; if you have a minute now, yeah, please send a new PR =)

Contributor Author

You can run python setup.py install manually like me :P New PR coming up.

@CamDavidsonPilon (Owner)

Since this is such a large speed improvement, can you also remove the optional concordance index call in

def print_summary(self, c_index=True):
?

@benkuhn (Contributor Author) commented Apr 21, 2015

@CamDavidsonPilon: You mean remove the option and make it always call concordance?

@CamDavidsonPilon (Owner)

@benkuhn Yes that's right. For context: #139

@benkuhn (Contributor Author) commented Apr 21, 2015

Cool, that should do it. Do you prefer me to squash before it's merged?

@CamDavidsonPilon (Owner)

Nope, it's all good. Thanks for your contribution here; you've made installation a whole lot easier too! 🎉

CamDavidsonPilon added a commit that referenced this pull request Apr 21, 2015
nlogn concordance index algorithm (first pass)
@CamDavidsonPilon CamDavidsonPilon merged commit c84ab22 into CamDavidsonPilon:master Apr 21, 2015
@spacecowboy (Collaborator)

Nice. That means I can stop my Windows Build VM.

@benkuhn benkuhn mentioned this pull request Apr 28, 2015