
nlogn concordance index algorithm (first pass) #145

Merged (8 commits) on Apr 21, 2015

Conversation

@benkuhn (Contributor) commented Apr 18, 2015

My dataset is about 200k rows, so even the Fortran concordance option takes more than a minute because it's an n^2 algorithm. I wrote a faster (n log n) version. On 100k rows of fake data, this takes the time down from 52s (previous fast n^2 Fortran version) to 4s (current pure-Python n log n version).

Right now it introduces a dependency on another library (blist) because I didn't want to write the order statistic tree myself. Unfortunately blist's order statistic trees might be O(log^2 n) instead of O(log n) for RANK operations, so right now this might be O(n log^2 n) in practice. Also I suspect blist is slower than a data structure that just tried to be an order statistic tree would be. Anyway, because of the dependency issue, I wanted to run this by the maintainers before I go any further with it. What do you recommend?

PS It's also not quite correct yet--it disagrees with the Fortran implementation on the full Cox model concordance test, and they both disagree with what I get in R. I still have to track this down.
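The counting idea behind the O(n log n) approach, in its simplest form (no censoring, no ties), is: sweep subjects in time order, keep the predictions seen so far in an order-statistic structure, and count how many earlier-dying subjects also had a smaller prediction. A hypothetical sketch of that loop, using a plain `bisect` list as a stand-in for the tree (so insertion here is O(n), but the counting logic is the same shape):

```python
import bisect

def concordant_fraction(times, preds):
    """Fraction of pairs ordered the same way by survival time and by
    prediction. Assumes no censoring and no tied values; a real
    implementation replaces the sorted list with a balanced tree so
    that insert and rank are both O(log n)."""
    order = sorted(range(len(times)), key=times.__getitem__)
    seen, concordant, total = [], 0, 0
    for i in order:
        # Every subject already seen died earlier; the pair is
        # concordant when its prediction is also smaller than ours.
        concordant += bisect.bisect_left(seen, preds[i])
        total += len(seen)
        bisect.insort(seen, preds[i])
    return concordant / total
```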

@CamDavidsonPilon (Owner)

I'm sorta okay with adding blist as a dependency: it's a lib that is not maintained much anymore, but on the other hand it really may offer significant gains. Is there another lib or snippet that can be used?

W.r.t. the new algorithm, I'll go over it tomorrow, and I invite @spacecowboy to take a look too :)

@benkuhn (Contributor Author) commented Apr 18, 2015

I thought of a simpler "balanced" binary tree that I could write myself (not adding an external dependency), based on the fact that we know the rank order of each prediction at the beginning. The basic idea is that you know what the final balanced tree would look like, that is, which node would contain each number. So when you shuffle nodes around to rebalance, you can just always move them towards their final position.

Here's how it goes:

  • For simplicity, suppose that the elements that we insert into the tree are the ranks, not the predictions themselves. Also assume n (list length) is a power of 2 for now.
  • Assign each node a position in the final list: the root node is n/2, its left and right children are n/4 and 3n/4 respectively, etc.
  • When inserting a new value x at a node already occupied by y, we knock y out if and only if reinserting y as a child would move it closer to its target (final) position.
  • Concretely, to add a new element x at a node that currently contains y (where the node's position in the list is i):
    • If x > y and y < i, then replace y with x and try to insert y as the left child of x
    • If x > y and y >= i, then try to insert x as the right child of y
    • If x < y and y > i, then replace y with x and try to insert y as the right child of x
    • If x < y and y <= i, then try to insert x as the left child of y
  • You can then get the rank of an element in O(log n) by keeping a count of the subtree size at each node.

I think this should allow a reasonably concise implementation (even when extending it to handle ties). But I don't know if it's worth the additional code complexity compared to a library dependency.
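The insertion rule above can be sketched roughly as follows (a hypothetical implementation, not the PR's code; the tree is stored as a flat array where position i holds the value destined for slot i, so a tree over n-1 ranks with n a power of two needs no rotations):

```python
def make_tree(n):
    """n must be a power of two; the tree holds the ranks 1..n-1."""
    return {"n": n, "vals": [0] * n, "cnt": [0] * n}

def insert(t, x):
    pos, step = t["n"] // 2, t["n"] // 4
    while True:
        t["cnt"][pos] += 1               # this subtree gains one element
        y = t["vals"][pos]
        if y == 0:                       # empty slot: settle here
            t["vals"][pos] = x
            return
        if x > y:
            if y < pos:                  # y's target is in the left subtree
                t["vals"][pos], x = x, y # x takes the slot, y moves down
                pos -= step
            else:
                pos += step              # descend right with x
        else:
            if y > pos:                  # y's target is in the right subtree
                t["vals"][pos], x = x, y
                pos += step
            else:
                pos -= step              # descend left with x
        step //= 2

def rank(t, x):
    """Count inserted elements strictly smaller than x, in O(log n)."""
    pos, step, r = t["n"] // 2, t["n"] // 4, 0
    while t["vals"][pos]:
        y = t["vals"][pos]
        left = t["cnt"][pos - step] if step else 0
        if x > y:
            r += left + 1                # left subtree plus y are all < x
            if step == 0:
                break
            pos += step
        elif x < y:
            if step == 0:
                break
            pos -= step
        else:
            r += left
            break
        step //= 2
    return r
```

Inserting all the ranks in any order ends with value i sitting at position i, i.e. the complete balanced BST.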

I could also just rewrite my stuff in Fortran and paste a license-compatible Fortran balanced tree implementation from somewhere else. Your call.

@spacecowboy (Collaborator)

This is excellent. I had actually never heard of an O(n log n) implementation before. Doing some "expert googling" turns up a single reference in the literature:

David Christensen. Fast algorithms for the calculation of Kendall's τ. Computational Statistics, 20:51–62, 2005.

I don't have time to really dig into the implementation right now as I am swamped for a few days more. I'll do some major reading/testing later. But be careful when relying on the Cox model for the test: the Cox model here still has some significant differences from R's implementation.

As fond as I am of my Fortran code, I suggest we keep this pure Python if we can. With an O(n log n) running time, there should be little reason for a native implementation. Keep blist for now; I can volunteer to write a sorted tree implementation at a later date.

@benkuhn (Contributor Author) commented Apr 18, 2015

@spacecowboy: There's at least one other reference to an O(n log n) implementation: the R documentation for survConcordance says

The algorithm is based on a balanced binary tree, which allows the computation to be done in O(n log n) time.

which is probably where I got the idea.

Today I found out that there's actually documentation for that code! It apparently uses a clever simplification of my suggestion for a balanced binary tree. Instead of moving nodes around in the tree, you can just keep track of whether each node in the tree is "occupied yet", and use that info when you're updating the array of subtree sizes. That should make it pretty simple to implement except for possibly handling ties.
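That simplification might look roughly like this (a hypothetical sketch, not the survival package's C code): since rank v's final slot in the complete tree is just position v, "inserting" a value only marks its slot occupied and bumps the subtree counts on the path from the root; no node ever moves.

```python
def make_counts(n):
    """n a power of two; slots 1..n-1 correspond to the ranks themselves."""
    return {"n": n, "cnt": [0] * n, "occupied": [False] * n}

def occupy(t, v):
    """Mark rank v as seen, bumping subtree sizes along the root path."""
    pos, step = t["n"] // 2, t["n"] // 4
    while pos != v:
        t["cnt"][pos] += 1
        pos += step if v > pos else -step
        step //= 2
    t["cnt"][v] += 1
    t["occupied"][v] = True

def n_smaller(t, x):
    """Number of occupied ranks strictly below x, in O(log n)."""
    pos, step, r = t["n"] // 2, t["n"] // 4, 0
    while True:
        left = t["cnt"][pos - step] if step else 0
        if x > pos:
            # count the left subtree, plus this slot if it's occupied
            r += left + (1 if t["occupied"][pos] else 0)
            pos += step
        elif x < pos:
            pos -= step
        else:
            return r + left
        step //= 2
```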


# This fails right now, and BOTH values disagree with the R output.
# TODO: investigate?
assert fast_py_cindex(T, P, E) == fast_cindex(T, P, E)
Owner

This fails because of ties in the dataset. If you change line 382 to np.random.binomial(50,0.5, size=size) (which will have ties), then the second assert will fail.

I'm concerned that the output is different from R. Are you using the survcomp package?

Collaborator

Make sure it's not just a floating point difference due to different summation order before you worry too much.

Contributor Author

Oh! I spotted the bug.

You have

    def valid_comparison(time_a, time_b, event_a, event_b):
        """True if times can be compared."""
        if time_a == time_b:
            # Ties are not informative
            return False

But actually, if event_a != event_b, you should return True.

Collaborator

That's actually true. Well spotted, sir.

Collaborator

So change to:

if time_a == time_b:
    return event_a != event_b
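Putting the corrected helper together might look like this (the non-tie branches below are reconstructed from the usual concordance definition, and are not necessarily the PR's exact code):

```python
def valid_comparison(time_a, time_b, event_a, event_b):
    """True if this pair of subjects is comparable for concordance."""
    if time_a == time_b:
        # A tied time is informative only when exactly one subject had
        # an event: that subject's true survival is then known to be
        # no longer than the censored subject's.
        return event_a != event_b
    if time_a < time_b:
        return bool(event_a)  # the earlier time must be an observed event
    return bool(event_b)
```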

@benkuhn (Contributor Author) commented Apr 19, 2015

OK, I stole the btree idea from R's survival package--it ended up being not that much code and was kind of neat to implement. I also fixed the bug in the slow concordance code; it now agrees with the fast Python implementation. Should I just get rid of the Fortran implementation for now, then? (It might be good to keep the pure-Python n^2 implementation in, since it's much simpler--it's useful to test the fast one against.)
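For reference, the pure-Python O(n²) oracle mentioned here might look roughly like this (a hypothetical sketch using the corrected tie rule from the discussion above; `naive_concordance_index` is an illustrative name):

```python
def naive_concordance_index(times, preds, events):
    """O(n^2) concordance index: higher predictions should accompany
    longer survival; tied predictions count as half-concordant."""
    num, den = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] == times[j]:
                if events[i] == events[j]:
                    continue            # uninformative tie
                # the subject with the event is known to die no later
                a, b = (i, j) if events[i] else (j, i)
            elif times[i] < times[j]:
                if not events[i]:
                    continue            # earlier subject was censored
                a, b = i, j
            else:
                if not events[j]:
                    continue
                a, b = j, i
            den += 1
            if preds[a] < preds[b]:     # shorter survival, lower score
                num += 1.0
            elif preds[a] == preds[b]:
                num += 0.5
    return num / den
```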

@benkuhn (Contributor Author) commented Apr 19, 2015

Timings for the latest commit:

SIZE = int(10e4)
pred = np.random.uniform(size=SIZE)
obs = np.random.binomial(100, 0.5, SIZE)
event = np.random.binomial(1, 0.5, SIZE)
%timeit ci.fast_concordance_index(obs, pred, event)
# 1 loops, best of 3: 1.44 s per loop
%timeit -r 1 fci.concordance_index(obs, pred, event) # fci is the Fortran module
# 1 loops, best of 1: 53 s per loop

Let me know if you see any other low-hanging optimizations; it's still pretty slow compared to the survival version, but that's because survival does the heavy lifting in C.

pairs = len(times_to_compare) * (next_ix - first_ix)
correct = 0
tied = 0
for i in xrange(first_ix, next_ix):
Collaborator

xrange does not exist in Python 3.

Contributor Author

Whoops, didn't realize lifelines was supposed to be py3k compatible as well!
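One common shim at the time was binding a single name at import (a sketch; the `range_` name is arbitrary):

```python
try:                  # Python 2: xrange is the lazy version
    range_ = xrange
except NameError:     # Python 3: range is already lazy
    range_ = range
```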

@spacecowboy (Collaborator)

Removing the Fortran code means we can also simplify setup.py, and possibly remove the __utils submodule. The slow version could be renamed "_naive_concordance_index" or something, if it's not removed.

@benkuhn (Contributor Author) commented Apr 20, 2015

OK, I tried to simplify the setup.py script. Let me know if anything's wrong--I haven't used it much before.

@CamDavidsonPilon (Owner)

👍 I'll give this a 🎩 after work

@benkuhn (Contributor Author) commented Apr 20, 2015

(rebased on latest master)

requires=[
"numpy",
"scipy",
"pandas(>=0.14)",
Owner

This syntax is new to me, where did you see it originally?

Contributor Author

I read the distutils documentation, which states:

If specific versions are required, a sequence of qualifiers can be supplied in parentheses. Each qualifier may consist of a comparison operator and a version number.

When I tried to run it without the parens (as it was before), it gave me a ValueError: expected parenthesized list: '>=0.14'. I thought this was because I changed the numpy.distutils.core line to distutils.core (since we don't need Fortran stuff anymore), but I just tried changing it back and it still doesn't work for me. So I'm not sure what's going on.

Owner

FYI, I installed this version fine just now (Pandas 0.16 was installed in a fresh virtualenv). Let me try removing the () too.

Owner

ValueError: expected parenthesized list: '>=0.14' ditto. But doing a quick search suggests using install_requires over requires. I did this locally with the () and it worked fine.

Contributor Author

Ah, crap, I missed this comment. When I use install_requires I get a UserWarning: Unknown distribution option: 'install_requires'

Contributor Author

Oh, apparently you have to use pip or easy_install to use install_requires instead of requires? Weird. I can correct this if you want.
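For reference, `install_requires` is a setuptools option; plain distutils only knows the metadata-only `requires` field, which is why it warns about an unknown option. A minimal sketch of the setuptools-style setup.py being discussed (field values illustrative, not the project's actual file):

```python
# Importing setup from setuptools (rather than distutils.core) makes
# install_requires available, and version pins are plain ">=" strings
# with no parentheses.
from setuptools import setup

setup(
    name="lifelines",
    install_requires=[
        "numpy",
        "scipy",
        "pandas>=0.14",
    ],
)
```

With setuptools, both `pip install .` and `python setup.py install` resolve the listed dependencies.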

Owner

How else can one install a lib without using pip or easy_install?

Owner

I should have looked here before merging; if you have a minute now, yeah, please send a new PR =)

Contributor Author

You can run python setup.py install manually like me :P New PR coming up.

@CamDavidsonPilon (Owner)

Since this is such a large speed improvement, can you also remove the optional concordance index call in

def print_summary(self, c_index=True):
?

@benkuhn (Contributor Author) commented Apr 21, 2015

@CamDavidsonPilon: You mean remove the option and make it always call concordance?

@CamDavidsonPilon (Owner)

@benkuhn Yes that's right. For context: #139

@benkuhn (Contributor Author) commented Apr 21, 2015

Cool, that should do it. Do you prefer me to squash before it's merged?

@CamDavidsonPilon (Owner)

Nope, it's all good. Thanks for your contribution here; you've made installation a whole lot easier too! 🎉

CamDavidsonPilon added a commit that referenced this pull request Apr 21, 2015
nlogn concordance index algorithm (first pass)
@CamDavidsonPilon CamDavidsonPilon merged commit c84ab22 into CamDavidsonPilon:master Apr 21, 2015
@spacecowboy (Collaborator)

Nice. That means I can stop my Windows Build VM.

@benkuhn benkuhn mentioned this pull request Apr 28, 2015