
Some failing CoxPH tests #89

Merged
merged 42 commits into master on Nov 27, 2014

Conversation

spacecowboy (Collaborator)

Noticed some strange behavior with CoxPH. I always have
normalized data, which did not seem to work well with
the implementation. The failures here seem unrelated to that,
though.

The added tests currently fail with the following trace:

Traceback (most recent call last):
File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 937, in test_crossval_normalized
event_col='E', k=3)
File "/home/jonas/workspacepython/lifelines/lifelines/utils.py", line 311, in k_fold_cross_validation
fitter.fit(training_data, duration_col=duration_col, event_col=event_col)
File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 998, in fit
include_likelihood=include_likelihood)
File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 938, in _newton_rhaphson
delta = solve(-hessian, step_size * gradient.T)
File "/home/jonas/anaconda3/lib/python3.4/site-packages/numpy/linalg/linalg.py", line 381, in solve
r = gufunc(a, b, signature=signature, extobj=extobj)
File "/home/jonas/anaconda3/lib/python3.4/site-packages/numpy/linalg/linalg.py", line 90, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
numpy.linalg.linalg.LinAlgError: Singular matrix

However, note that I have commented out one of the datasets because that
seems to cause the cross-validation to end up in an infinite loop of
some kind. The tests never finish (only waited for ~2 minutes).

Doing similar things in R works without any problems.

Doing the following is a fast way to check the results:

python -m unittest lifelines.tests.test_suite.CoxRegressionTests

Fixes an issue where CoxPH would raise an error, which
seemed to be due to mixing pandas and numpy.

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
@CamDavidsonPilon (Owner)

@spacecowboy, I can confirm that your test throws a convergence error. Interestingly, I added some noise to the duration value:

    data_pred2['t'] = 1 + data_pred2['x1'] + data_pred2['x2'] + np.random.exponential(1, 50)

and that removed the error. I'll continue to investigate.
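For context, a minimal sketch (not the exact test code) of the kind of setup being discussed; `data_pred2`, `N`, and the column names follow the snippets that appear later in this thread:

    # Minimal sketch of the noisy-duration setup described above (not the
    # actual test code from the PR).
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    N = 50
    data_pred2 = pd.DataFrame()
    data_pred2['x1'] = np.random.uniform(size=N)
    data_pred2['x2'] = np.random.uniform(size=N)
    # the exponential noise on the durations is what made the singular
    # Hessian go away in the experiment above
    data_pred2['t'] = (1 + data_pred2['x1'] + data_pred2['x2'] +
                       np.random.exponential(1, size=N))
    data_pred2['E'] = True

    cp = CoxPHFitter()
    cp.fit(data_pred2, duration_col='t', event_col='E')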

@CamDavidsonPilon (Owner)

Also, w.r.t. data_pred1, I found the issue: it was because the data was 1-d, but I had a strict inequality where I should have had a non-strict one.

@CamDavidsonPilon (Owner)

ping @spacecowboy, if you have time, please review =)

@CamDavidsonPilon (Owner)

One thing I haven't addressed, though I spent some time working on it, is how to deal with a noise-less model, e.g. your original example.

@@ -937,7 +938,7 @@ def _newton_rhaphson(self, X, T, E, initial_beta=None, step_size=1.,
     hessian, gradient = output[:2]
     delta = solve(-hessian, step_size * gradient.T)
     beta = delta + beta
-    if pd.isnull(delta).sum() > 1:
+    if pd.isnull(delta).sum() >= 1:
@CamDavidsonPilon (Owner)

This fixes the non-returning bug in the original issue.
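A simplified, hypothetical sketch of why the change matters (names mirror the hunk above, but this is not the lifelines source): with 1-d data, delta has exactly one entry, so a lone NaN never triggers the strict `> 1` check and the loop spins forever.

    # Hypothetical, stripped-down Newton-Raphson step with the NaN guard;
    # illustrative only, not the lifelines implementation.
    import numpy as np
    import pandas as pd
    from numpy.linalg import solve

    def newton_step(hessian, gradient, beta, step_size=1.0):
        delta = solve(-hessian, step_size * gradient.T)
        # With a single covariate, delta has one entry, so a lone NaN slips
        # past a strict `> 1` check and the loop never terminates;
        # `>= 1` aborts as soon as anything goes non-finite.
        if pd.isnull(delta).sum() >= 1:
            raise ValueError("NaN in delta; aborting Newton-Raphson")
        return beta + delta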

Cox should really solve the data (cindex > 0.9). This
is what happens in R.

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
@spacecowboy (Collaborator, Author)

@CamDavidsonPilon Yes, on typical real survival data you'll see a c-index of about 0.6 to 0.7. This is because real data is extremely noisy. On artificial data, however, it is easy to construct a set on which Cox will achieve a c-index of 1.0 (which I have done here).

I did some experiments and noted that with uniform noise, like:

    + np.random.uniform(0, 0.02, size=N)

It "succeeds" with noise as low as 0.02, but raises an error for 0.01. However, printing the scores, it's obvious that it doesn't succeed at all:

    scores [ 0.55882353  0.55882353  0.5       ]
    scores [ 0.66176471  0.5         0.5       ]

When I run the same data through the Cox model in R, I get a concordance index of 0.999, which is what I alluded to in the comment about raising the limit for a successful test. The expected mean result should be > 0.99 (with no noise). Running the same with some Gaussian noise (mean=0, scale=0.1), it typically performs > 0.9. I've tweaked the code to reflect this.

The poor training performance and the numerical errors are likely related. I checked out the previous version of the cox model (before I refactored it) and did the same tests (for final gaussian noise scale=0.07):

    scores [ 0.79779412  0.77573529  0.7875    ]
    scores [ 0.8125      0.55882353  0.74583333]

That was obviously better, but still not even close to passing the test. So this is at least partly my fault :/
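For reference, a rough sketch (not the exact test code) of how this noise experiment can be run end to end, using the cross-validation helper that appears in the traceback at the top of this PR:

    # Rough sketch of the Gaussian-noise experiment described above.
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter
    from lifelines.utils import k_fold_cross_validation

    N = 50
    df = pd.DataFrame()
    df['x1'] = np.random.uniform(size=N)
    df['x2'] = np.random.uniform(size=N)
    # small Gaussian noise on the durations
    df['t'] = 1 + df['x1'] + df['x2'] + np.random.normal(0, 0.1, size=N)
    df['E'] = True

    scores = k_fold_cross_validation(CoxPHFitter(), df,
                                     duration_col='t', event_col='E', k=3)
    print("scores", scores)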

@CamDavidsonPilon (Owner)

@spacecowboy, can you share your R code to get the result of 0.999?

@CamDavidsonPilon (Owner)

Good news: I found a few bugs and am now able to get a pretty high c-index. The biggest one was in how I computed the baseline survival function: to compute a predicted survival function, the baseline survival function is needed, and I was computing it wrong (using the hazard instead of the cumulative hazard, and messing up the equality).

test_crossval_with_normalized_data almost always passes the 0.9 threshold, while its sister test test_crossval_for_cox_ph is still a little off (but averages only slightly below 0.9). Clearly normalization improves the performance of CoxPH - should it be a default behaviour?
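For reference, the relationships involved (a sketch of the textbook Cox formulas, not necessarily lifelines' exact internals): the baseline survival comes from the cumulative baseline hazard, and a predicted survival curve is a power of it.

    # Sketch of the standard relationships; illustrative, not lifelines' code.
    import numpy as np

    def baseline_survival(cumulative_baseline_hazard):
        # S0(t) = exp(-H0(t)); using the hazard instead of the cumulative
        # hazard is the kind of bug described above
        return np.exp(-cumulative_baseline_hazard)

    def predicted_survival(cumulative_baseline_hazard, beta, x):
        # S(t | x) = S0(t) ** exp(beta . x)
        return baseline_survival(cumulative_baseline_hazard) ** np.exp(np.dot(x, beta))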

@@ -146,8 +147,7 @@ def survival_table_from_events(durations, event_observed, min_observations,

     # this next line can be optimized for when min_observerations is all zeros.
     event_table = death_table.join(births_table, how='outer', sort=True).fillna(0)  # http://wesmckinney.com/blog/?p=414

-    return event_table
+    return event_table.astype(int)
@CamDavidsonPilon (Owner)

Values in here could be boolean if the censorship array was boolean. I want this to be strictly ints for consistency, and since these are counts, they should always be ints.

Ah, but not if we include weights...

… cast output of survival_table_to_events to float
@spacecowboy (Collaborator, Author)

Good news indeed. I have done some tests with your latest changes and can confirm great success. The difference in test results is due to me not doing cross-validation splitting, but simply training on the entire dataset.

Using the new partial_hazards function, I get the same result down to the last decimal! However, something is still fishy about the default predict method:

    rcox = cox_model(np.array(data_pred2), 't', 'E', headers=list(data_pred2.columns))
    preds = -rcox.hazard_eval_all(np.array(data_pred2[['x1', 'x2']]))
    print("RCox:", concordance_index(data_pred2['t'], preds))

    RCox: 0.9273469387755102

    lcox = CoxPHFitter()
    lcox.fit(data_pred2, duration_col='t', event_col='E')

    phaz = -lcox.predict_partial_hazard(data_pred2[['x1', 'x2']])
    ppred = lcox.predict(data_pred2[['x1', 'x2']])

    print("Partial hazards:", concordance_index(data[:, -2], preds.ravel()))
    print("Predict:", concordance_index(data_pred2['t'], ppred.ravel()))

    Partial hazards: 0.9273469387755102
    Predict: 0.5587755102040817

This seems like a perfect use case for rpy2-based tests to verify that we get the exact result (to some decimal).

I am not a fan of R and can't quickly extract the exact logic in pure R code. The only library I use is survival. I am using a Python wrapper built on rpy2. If you have R and rpy2 installed, you can install my wrapper:

    pip install git+https://github.com/spacecowboy/pysurvival.git

I wrote the package long before I learned about pandas, so it only supports numpy arrays. Usage is:

    # Define the data_pred2 dataframe, and convert to numpy
    data = np.array(data_pred2)

    from pysurvival.cox import cox_model

    # You need to specify which columns are the time and event variables
    # You can define the headers of the data frame manually with keyword headers
    rcox = cox_model(np.array(data_pred2), 't', 'E', headers=list(data_pred2.columns))
    # If no headers are specified, all columns are named X1, X2, ...
    rcox = cox_model(data, 'X2', 'X3')

    # Model is fitted in constructor. Now predict some hazards. Since the array has
    # time and event columns last, this works fine and those columns are simply ignored.
    preds = -rcox.hazard_eval_all(data)

    print("RCox:", concordance_index(data[:, -2], preds))

If you get errors, make sure you are only giving Python lists or NumPy arrays as arguments. The code relies entirely on numerical indexing, I'm afraid (I wrote this 3 years ago, when I was much newer at Python).

@CamDavidsonPilon (Owner)

Okay, so here's where my head is at:

  1. Normalization really, really helps the c-index score well. This is because with very little noise (as in our tests) and no normalization, the CoxPH estimated coefficients are very unstable, e.g.:
    data_pred2 = pd.DataFrame()
    data_pred2['x1'] = np.random.uniform(size=N)
    data_pred2['x2'] = np.random.uniform(size=N)
    data_pred2['t'] = (1 + data_pred2['x1'] + data_pred2['x2'] +
                       np.random.normal(0, 0.05, size=N))
    data_pred2['E'] = True

    cp.fit(data_pred2, duration_col='t')
    cp.hazards_
    #              x1         x2
    # coef -21.270541 -21.569023

These coefs are too large, and when we put them (quite a few times) in the exponential, we get garbage out:

    x = data_pred2[['x1', 'x2']]
    cp.predict_survival_function(x)[0]
    # output
    event_at
    0.000000    1
    1.146793    1
    1.222418    1
    1.270433    1
    1.382819    0
    1.674921    0
    1.748990    0
    1.753403    0
    1.774674    0
    1.788767    0
    1.790994    0
    1.795693    0

Hence any medians/expectations are useless.

  2. With normalization of the covariates, things behave nicely, for almost any amount of non-trivial noise.
  3. I suspect that our coxph algo is numerically unstable for little/no noise, but this can be remedied by normalizing covariates (a small sketch of that normalization follows below).
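A minimal sketch of that normalization step (a hypothetical helper, not lifelines' API):

    # Hypothetical helper illustrating covariate normalization before fitting.
    def normalize_covariates(df, covariate_cols):
        out = df.copy()
        for col in covariate_cols:
            # standardize each covariate to zero mean and unit variance
            out[col] = (df[col] - df[col].mean()) / df[col].std()
        return out

    # e.g. fit on normalize_covariates(data_pred2, ['x1', 'x2']), leaving the
    # 't' and 'E' columns untouched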

I'll try to get pysurvival on my machine later today to perform some more tests. Likely in a separate PR, we can create tests against lifelines and pysurvival.


BTW, I think your code above is reporting the wrong thing: you have

    print("Partial hazards:", concordance_index(data[:, -2], preds.ravel()))

I think you want:

    print("Partial hazards:", concordance_index(data[:, -2], phaz.ravel()))

which explains why we got such close decimal points!

@CamDavidsonPilon (Owner)

@spacecowboy the implementation of the Firth algo can be another PR. We've made great progress in this PR - going from a c-index of ~0.50 (and NaN predictions) to above 0.85 (and numerically stable predictions). Tests should be 🍏 (on average), want to merge this one? (If you are able to, please go ahead =)

@spacecowboy (Collaborator, Author)

Sure, as soon as I verify that the predict method works.

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
If a model normalizes data during fitting, it should also
normalize data passed to the predict methods later in the same
way. As a logical consequence, CoxPH now stores the original data
instead of the normalized version as well as the normalization
variables.

I also added some checks to make sure that incoming dataframes have
the same input columns present as the fitting data. In case the incoming
data is an ndarray, the order is assumed to be correct (as before).

The following examples now work, where they before would crash or simply
return the wrong result:

    # fulldata has columns x1, x2, t, e
    cf.predict(fulldata)

    # shuffleddata has columns x2, t, x1, e, y1, y2, y3
    cf.predict(shuffleddata)

    # Xrev has columns x2, x1
    cf.predict(Xrev)

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
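A rough sketch of the fit-time/predict-time normalization pattern described in the commit message above (hypothetical class, attribute, and method names; not the actual lifelines internals):

    # Hypothetical sketch: remember the normalization statistics at fit time
    # and re-apply them to any frame handed to predict; names illustrative only.
    import pandas as pd

    class NormalizedPredictSketch:
        def remember_normalization(self, df, duration_col, event_col):
            X = df.drop([duration_col, event_col], axis=1)
            self._fit_columns = X.columns
            self._norm_mean = X.mean()
            self._norm_std = X.std()

        def normalize_like_fit(self, X):
            if isinstance(X, pd.DataFrame):
                X = X[self._fit_columns]        # match the fitting columns by name
            else:
                # plain ndarray: column order is assumed correct, as before
                X = pd.DataFrame(X, columns=self._fit_columns)
            return (X - self._norm_mean) / self._norm_std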
@spacecowboy (Collaborator, Author)

@CamDavidsonPilon Please review the latest changes when you get a chance. predict_median does not get the same concordance as partial_hazards, but I'm not entirely sure it is supposed to.

The user should not have to type A.ravel(), or worse A.values.ravel(), when
calculating the concordance index. The function now calls ravel() on behalf
of the user in these two cases.

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
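A minimal sketch of that kind of input flattening (a hypothetical wrapper; the real concordance_index lives in lifelines.utils and its internals may differ):

    # Hypothetical flattening helper; not the actual lifelines code.
    import numpy as np
    import pandas as pd

    def to_1d(a):
        # accept a Series/DataFrame or an ndarray and always hand back a flat array
        if isinstance(a, (pd.Series, pd.DataFrame)):
            a = a.values
        return np.asarray(a).ravel()

    # concordance_index(to_1d(durations), to_1d(predictions)) then works whether
    # the caller passes raw arrays, Series, or single-column DataFrames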
To be consistent with output from other methods.

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
Since other predict methods are numerically unstable at the moment, use
predict_partial_hazard in the k_fold_crossval tests. Since this method
works, I could remove the outer loop and increase the expected result,
because results are consistently good now (at 0.93 to 0.97).

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
@spacecowboy (Collaborator, Author)

After reading up on the Cox model, and googling for baseline hazard, I've come to the drastic conclusion that perhaps all predict methods, except predict_partial_hazard, should be deleted. At the very least, they require a good redesign. The reason is that they are numerically unstable. And by "they" I really mean the calculation of the baseline hazard and baseline survival, as all other methods depend on that. According to [1], Cox in R doesn't even have any methods relating to the baseline hazard, so would anyone miss them?

The methods seem analytically correct, but numerically it all explodes. They basically calculate the exponential of another exponential, over a range from very small to very large numbers. So you end up with mostly zeros or ones in the end result (and infinities before that), and if you're lucky you have some decimal numbers somewhere in the middle.
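A toy illustration of that double exponential saturating (made-up numbers, not lifelines code):

    # S(t|x) = exp(-H0(t) * exp(beta . x)); with unnormalized covariates and
    # large coefficients, exp(beta . x) over/underflows and the curve collapses.
    import numpy as np

    H0 = np.array([0.001, 0.1, 1.0, 10.0])   # made-up cumulative baseline hazard
    for bx in (-25.0, 0.5, 25.0):             # made-up values of beta . x
        print(bx, np.exp(-H0 * np.exp(bx)))
    # bx = -25 prints ~[1 1 1 1], bx = 25 prints [0 0 0 0]; only moderate
    # values of beta . x leave anything strictly between 0 and 1.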

I have changed the tests to reflect this: cross-validation tests now use predict_partial_hazard, and I've added an expected failure on the monotonic test of the predict methods.

As it stands, I'm quite happy with merging this now if you feel we should keep the predict methods as-is for the moment. At the very least, we can tackle that problem in a separate PR.

Links:
[1] http://stats.stackexchange.com/questions/68737/how-to-estimate-baseline-hazard-function-in-cox-model-with-r

@CamDavidsonPilon (Owner)

👍 to your changes - I've only made a few small additions/subtractions:

  1. Always output DataFrames on predict methods.
  2. Respect the user's index.
  3. Ignore column ordering (a rough sketch of all three follows below).
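A minimal, illustrative-only sketch of what those three points could look like in a predict method (not the actual implementation):

    # Illustrative-only sketch of points 1-3.
    import numpy as np
    import pandas as pd

    def predict_partial_hazard_sketch(X, fitted_columns, coefs):
        X = X[fitted_columns]                       # 3. ignore the caller's column order
        values = np.exp(np.dot(X.values, coefs))
        return pd.DataFrame(values, index=X.index)  # 1. DataFrame out, 2. caller's index kept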

@CamDavidsonPilon (Owner)

W.r.t. removing prediction methods, let's do that in another PR/Issue.

@CamDavidsonPilon (Owner)

🚢 on your 👍

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
range returns an iterable (not a list) in Python 3. Making sure the result
is always a list.

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
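For illustration, the generic fix (not the specific lifelines line):

    # Python 2's range() returned a list; Python 3's returns a lazy range
    # object, so wrap it when a real list is needed later (e.g. concatenation).
    indices = list(range(10))   # a list on both Python 2 and 3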
@spacecowboy (Collaborator, Author)

I'm guessing that "ship on your +1" means merge ;)

Merging..

spacecowboy added a commit that referenced this pull request Nov 27, 2014
Merging improvements on CoxPH and lots of additional tests.
spacecowboy merged commit 802195c into master on Nov 27, 2014
@CamDavidsonPilon (Owner)

Whoop 🌟! Since this was a major revision, let's cut it out into 0.4.4. I'll make a milestone/issue for 0.5
