nlogn concordance index algorithm (first pass) #145
Conversation
I'm sorta okay with adding blist as a dependency: it's a lib that is not maintained much anymore, but on the other hand it really may offer significant gains. Is there another lib or snippet that can be used? W.r.t. the new algorithm, I'll go over it tomorrow, and I invite @spacecowboy to take a look too :)
I thought of a simpler "balanced" binary tree that I could write myself (not adding an external dependency), based on the fact that we know the rank order of each prediction at the beginning. The basic idea is that you know what the final balanced tree would look like, that is, which node would contain each number. So when you shuffle nodes around to rebalance, you can just always move them towards their final position. Here's how it goes:
I think this should allow a reasonably concise implementation (even when extending it to handle ties). But I don't know if it's worth the additional code complexity compared to a library dependency. I could also just rewrite my stuff in Fortran and paste a license-compatible Fortran balanced tree implementation from somewhere else. Your call.
This is excellent. I had actually never heard of an O(n log n) implementation before. Doing some "expert googling" turns up a single reference in the literature: David Christensen, "Fast algorithms for the calculation of Kendall's τ", Computational Statistics.

I don't have time to really dig into the implementation right now as I am swamped for a few days more; I'll do some major reading/testing later. But be careful when relying on the Cox model for the test: the Cox model still has some significant differences from R.

As fond as I am of my Fortran code, I suggest we keep this pure Python if we can. With an O(n log n) execution time, there should be little reason for having a native implementation. Keep blist for now. I can volunteer to write a sorted tree implementation at a later date.
@spacecowboy: There's at least one other reference to an nlogn implementation--the R documentation for
which is probably where I got the idea. Today I found out that there's actually documentation for that code! It apparently uses a clever simplification of my suggestion for a balanced binary tree. Instead of moving nodes around in the tree, you can just keep track of whether each node in the tree is "occupied yet", and use that info when you're updating the array of subtree sizes. That should make it pretty simple to implement except for possibly handling ties.
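That "occupied flag" scheme is easy to sketch. The following is a hypothetical illustration (not the R or lifelines source): since every prediction value is known before any insert happens, the balanced tree over the sorted values is laid out once, an insert only flips a flag and bumps subtree counts, and a rank query counts how many already-inserted values are smaller.

```python
class StaticOrderStatTree:
    """Order-statistic tree whose shape is fixed up front (sketch).

    Each value gets a permanent slot in a height-balanced BST stored in
    heap layout.  "Inserting" a value just flips its occupied flag and
    bumps subtree counts on the path from the root; no node ever moves.
    Values passed to insert() must come from the initial value set.
    """

    def __init__(self, values):
        vals = sorted(set(values))      # assume at least one value
        n = len(vals)
        self.size = 2 * n               # enough slots for the heap layout
        self.key = [None] * self.size   # value stored at each slot
        self.count = [0] * self.size    # occupied nodes in each subtree
        self.occupied = [0] * self.size
        self._build(0, vals, 0, n)

    def _build(self, node, vals, lo, hi):
        # place the median of vals[lo:hi] at `node`, recurse on halves
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        self.key[node] = vals[mid]
        self._build(2 * node + 1, vals, lo, mid)
        self._build(2 * node + 2, vals, mid + 1, hi)

    def insert(self, value):
        # walk from the root to the value's fixed slot, bumping counts
        node = 0
        while self.key[node] != value:
            self.count[node] += 1
            node = 2 * node + 1 if value < self.key[node] else 2 * node + 2
        self.count[node] += 1
        self.occupied[node] = 1

    def rank(self, value):
        """Number of inserted values strictly less than `value`."""
        node, less = 0, 0
        while node < self.size and self.key[node] is not None:
            if value <= self.key[node]:
                node = 2 * node + 1     # smaller values live on the left
            else:
                less += self.occupied[node]
                left = 2 * node + 1
                if left < self.size:
                    less += self.count[left]
                node = 2 * node + 2
        return less
```

Both `insert` and `rank` touch one root-to-leaf path, so each is O(log n), which is what makes the overall O(n log n) concordance pass work.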
# This fails right now, and BOTH values disagree with the R output.
# TODO: investigate?
assert fast_py_cindex(T, P, E) == fast_cindex(T, P, E)
This fails because of ties in the dataset. If you change line 382 to np.random.binomial(50, 0.5, size=size) (which will have ties), then the second assert will fail.
I'm concerned that the output is different from R. Are you using the survcomp package?
Make sure it's not just a floating point difference due to different summation order before you worry too much.
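A toy demonstration of that point, since it surprises people: floating-point addition is not associative, so summing the very same terms in a different order can change the result, which is enough to make two correct implementations disagree in the last digits.

```python
# Floating-point addition is not associative: adding a small term to a
# huge one first loses it to rounding, while cancelling the huge terms
# first preserves it.
a, small, neg_a = 1e16, 0.5, -1e16

left_to_right = (a + small) + neg_a   # 0.5 is absorbed into 1e16
cancel_first = (a + neg_a) + small    # big terms cancel exactly

print(left_to_right, cancel_first)    # 0.0 0.5
```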
Oh! I spotted the bug.
You have
def valid_comparison(time_a, time_b, event_a, event_b):
"""True if times can be compared."""
if time_a == time_b:
# Ties are not informative
return False
But actually, if event_a != event_b, you should return True.
That's actually true. Well spotted sir.
So change to:
if time_a == time_b:
return event_a != event_b
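Folding that change in, the helper might look like this. This is a sketch: only the tie branch appears in this thread, and the non-tie branches are the standard rule that the earlier time must be an observed event.

```python
def valid_comparison(time_a, time_b, event_a, event_b):
    """True if this pair of observations is usable for concordance."""
    if time_a == time_b:
        # Tied times are informative only when exactly one subject had
        # an event: the censored one is known to have survived longer.
        return event_a != event_b
    if time_a < time_b:
        return bool(event_a)  # earlier subject must be an observed event
    return bool(event_b)
```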
OK, I stole the btree idea from R's
Timings for the latest commit:
Let me know if you see any other low-hanging optimizations; it's still pretty slow compared to the
pairs = len(times_to_compare) * (next_ix - first_ix)
correct = 0
tied = 0
for i in xrange(first_ix, next_ix):
xrange does not exist in Python 3.
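One common fix, if Python 2 support is still wanted alongside 3, is a small shim (a sketch, not necessarily how lifelines resolved it; the alternative is to just use range everywhere and accept the eager list on Python 2):

```python
# Python 3 removed xrange (its plain range is already lazy); this shim
# lets the same loop body run on both interpreters.
try:
    xrange            # Python 2: the builtin exists
except NameError:
    xrange = range    # Python 3: alias the lazy range

total = sum(i for i in xrange(10))
```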
Whoops, didn't realize lifelines was supposed to be py3k compatible as well!
Removing the Fortran code means we can also simplify setup.py, and possibly remove the __utils submodule. The slow version could be renamed "_naive_concordance_index" or something, if it's not removed.
OK, I tried to simplify the setup.py script. Let me know if anything's wrong--I haven't used it much before.
👍 I'll give this a 🎩 after work
Force-pushed from 6cc8693 to 7c7a9a3
(rebased on latest master)
requires=[
    "numpy",
    "scipy",
    "pandas(>=0.14)",
This syntax is new to me, where did you see it originally?
I read the distutils documentation, which states:
If specific versions are required, a sequence of qualifiers can be supplied in parentheses. Each qualifier may consist of a comparison operator and a version number.
When I tried to run it without the parens (as it was before), it gave me a ValueError: expected parenthesized list: '>=0.14'. I thought this was because I changed the numpy.distutils.core line to distutils.core (since we don't need Fortran stuff anymore), but I just tried changing it back and it still doesn't work for me. So I'm not sure what's going on.
FYI, I installed this version fine just now (Pandas 0.16 was installed in a fresh virtualenv). Let me try removing the () too.
Ditto: ValueError: expected parenthesized list: '>=0.14'. But doing a quick search suggests using install_requires over requires. I did this locally with the () and it worked fine.
Ah, crap, I missed this comment. When I use install_requires I get a UserWarning: Unknown distribution option: 'install_requires'.
Oh, apparently you have to use pip or easy_install to use install_requires instead of requires? Weird. I can correct this if you want.
How else can one install a lib without using pip or easy_install?
I should have looked here before merging, if you have a minute now, yea please send a new PR =)
You can run python setup.py install manually like me :P New PR coming up.
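For the record, the setuptools route discussed above would look roughly like this (a sketch with the metadata trimmed; the version pin is the one from this PR). The underlying distinction: distutils' requires wants the parenthesized qualifier form and is not enforced by installers, while setuptools' install_requires takes plain version specifiers and is honored by pip and easy_install.

```python
# Sketch of a setuptools-based setup.py (not the exact lifelines file).
from setuptools import setup

setup(
    name="lifelines",
    # ... other metadata elided ...
    install_requires=[
        "numpy",
        "scipy",
        "pandas>=0.14",   # plain specifier: no distutils-style parentheses
    ],
)
```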
Since this is such a large speed improvement, can you also remove the optional concordance index call in lifelines/lifelines/estimation.py (line 1310 at 8134dd9)?
@CamDavidsonPilon: You mean remove the option and make it always call concordance?
Cool, that should do it. Do you prefer me to squash before it's merged?
Nope, it's all good. Thanks for your contribution here; you've made installation a whole lot easier too! 🎉
Nice. That means I can stop my Windows Build VM.
My dataset is about 200k rows, so even the Fortran concordance option takes more than a minute because it's an n^2 algorithm. I wrote a faster (n log n) version. On 100k rows of fake data, this takes the time down from 52s (previous fast n^2 Fortran version) to 4s (current pure-Python n log n version).
Right now it introduces a dependency on another library (blist) because I didn't want to write the order statistic tree myself. Unfortunately, blist's order statistic trees might be O(log^2 n) instead of O(log n) for RANK operations, so right now this might be O(n log^2 n) in practice. Also I suspect blist is slower than a data structure that just tried to be an order statistic tree would be. Anyway, because of the dependency issue, I wanted to run this by the maintainers before I go any further with it. What do you recommend?

PS: It's also not quite correct yet--it disagrees with the Fortran implementation on the full Cox model concordance test, and they both disagree with what I get in R. I still have to track this down.
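For reference, the quadratic baseline being replaced can be sketched like this. It is a textbook-style illustration of the pairwise definition, not the lifelines Fortran or Python source: it treats predictions as predicted survival times and, matching the bug discussed above, skips tied event times entirely.

```python
def naive_concordance_index(event_times, predictions, event_observed):
    """O(n^2) pairwise concordance index (sketch).

    `predictions` are treated as predicted survival times, so a pair is
    concordant when the subject who lived longer was also predicted to
    live longer.  Tied event times are skipped in this sketch.
    """
    num, den = 0.0, 0
    n = len(event_times)
    for i in range(n):
        for j in range(i + 1, n):
            t_i, t_j = event_times[i], event_times[j]
            if t_i == t_j:
                continue                      # tie handling omitted
            if t_i < t_j and not event_observed[i]:
                continue                      # earlier subject censored
            if t_j < t_i and not event_observed[j]:
                continue
            den += 1
            if predictions[i] == predictions[j]:
                num += 0.5                    # tied predictions: half credit
            elif (t_i < t_j) == (predictions[i] < predictions[j]):
                num += 1.0                    # orderings agree
    return num / den
```

Every comparable pair is visited explicitly, which is why it is O(n^2); the n log n version gets the same counts by sorting once and answering rank queries against an order statistic tree.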