
Some failing CoxPH tests #89

Merged
merged 42 commits into master on Nov 27, 2014

Conversation

spacecowboy (Collaborator)

Noticed some strange behavior with CoxPH. I always have
normalized data, which did not seem to work well with
the implementation. The failures here seem unrelated to that,
though.

The added tests currently fail with the following trace:

Traceback (most recent call last):
File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 937, in test_crossval_normalized
event_col='E', k=3)
File "/home/jonas/workspacepython/lifelines/lifelines/utils.py", line 311, in k_fold_cross_validation
fitter.fit(training_data, duration_col=duration_col, event_col=event_col)
File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 998, in fit
include_likelihood=include_likelihood)
File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 938, in _newton_rhaphson
delta = solve(-hessian, step_size * gradient.T)
File "/home/jonas/anaconda3/lib/python3.4/site-packages/numpy/linalg/linalg.py", line 381, in solve
r = gufunc(a, b, signature=signature, extobj=extobj)
File "/home/jonas/anaconda3/lib/python3.4/site-packages/numpy/linalg/linalg.py", line 90, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
numpy.linalg.linalg.LinAlgError: Singular matrix

However, note that I have commented out one of the datasets because that
seems to cause the cross-validation to end up in an infinite loop of
some kind. The tests never finish (only waited for ~2 minutes).

Doing similar things in R works without any problems.

Doing the following is a fast way to check the results:

python -m unittest lifelines.tests.test_suite.CoxRegressionTests

Fixes an issue where CoxPH would raise an error, which
seemed to be due to mixing pandas and numpy.

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
@CamDavidsonPilon (Owner)

@spacecowboy, I can confirm that your test throws a convergence error. Interestingly, I added some noise to the duration value:

    data_pred2['t'] = 1 + data_pred2['x1'] + data_pred2['x2'] + np.random.exponential(1, 50)

and that removed the error. I'll continue to investigate.
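For context, a minimal sketch (not the exact test code) of the kind of setup being discussed; `data_pred2`, `N`, and the column names follow the snippets that appear later in this thread:

    # Minimal sketch of the noisy-duration setup described above (not the
    # actual test code from the PR).
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    N = 50
    data_pred2 = pd.DataFrame()
    data_pred2['x1'] = np.random.uniform(size=N)
    data_pred2['x2'] = np.random.uniform(size=N)
    # the exponential noise on the durations is what made the singular
    # Hessian go away in the experiment above
    data_pred2['t'] = (1 + data_pred2['x1'] + data_pred2['x2'] +
                       np.random.exponential(1, size=N))
    data_pred2['E'] = True

    cp = CoxPHFitter()
    cp.fit(data_pred2, duration_col='t', event_col='E')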

@CamDavidsonPilon (Owner)

Also, w.r.t. data_pred1, I found the issue: it was because the data was 1-d, but I had a strict inequality where I should have had a non-strict one.

@CamDavidsonPilon (Owner)

ping @spacecowboy, if you have time, please review =)

@CamDavidsonPilon (Owner)

One thing I haven't addressed, though I spent some time working on it, is how to deal with a noise-less model, e.g. your original example.

@@ -937,7 +938,7 @@ def _newton_rhaphson(self, X, T, E, initial_beta=None, step_size=1.,
     hessian, gradient = output[:2]
     delta = solve(-hessian, step_size * gradient.T)
     beta = delta + beta
-    if pd.isnull(delta).sum() > 1:
+    if pd.isnull(delta).sum() >= 1:
@CamDavidsonPilon (Owner)

This fixes the non-returning bug in the original issue.
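A simplified, hypothetical sketch of why the change matters (names mirror the hunk above, but this is not the lifelines source): with 1-d data, delta has exactly one entry, so a lone NaN never triggers the strict `> 1` check and the loop spins forever.

    # Hypothetical, stripped-down Newton-Raphson step with the NaN guard;
    # illustrative only, not the lifelines implementation.
    import numpy as np
    import pandas as pd
    from numpy.linalg import solve

    def newton_step(hessian, gradient, beta, step_size=1.0):
        delta = solve(-hessian, step_size * gradient.T)
        # With a single covariate, delta has one entry, so a lone NaN slips
        # past a strict `> 1` check and the loop never terminates;
        # `>= 1` aborts as soon as anything goes non-finite.
        if pd.isnull(delta).sum() >= 1:
            raise ValueError("NaN in delta; aborting Newton-Raphson")
        return beta + delta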

Cox should really solve the data (cindex > 0.9). This
is what happens in R.

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
@spacecowboy (Collaborator, Author)

@CamDavidsonPilon Yes, on typical real survival data you'll see a c-index of about 0.6 to 0.7. This is because real data is extremely noisy. On artificial data, however, it is easy to construct a set on which Cox will achieve a c-index of 1.0 (which I have done here).

I did some experiments and noted that with uniform noise, like:

    + np.random.uniform(0, 0.02, size=N)

It "succeeds" with noise as low as 0.02, but raises an error for 0.01. However, printing the scores, it's obvious that it doesn't succeed at all:

    scores [ 0.55882353  0.55882353  0.5       ]
    scores [ 0.66176471  0.5         0.5       ]

When I run the same data through the Cox model in R, I get a concordance index of 0.999, which is what I alluded to in the comment about raising the limit for a successful test. The expected mean result should be > 0.99 (with no noise). Running the same with some Gaussian noise (mean=0, scale=0.1), it typically performs > 0.9. I've tweaked the code to reflect this.

The poor training performance and the numerical errors are likely related. I checked out the previous version of the cox model (before I refactored it) and did the same tests (for final gaussian noise scale=0.07):

    scores [ 0.79779412  0.77573529  0.7875    ]
    scores [ 0.8125      0.55882353  0.74583333]

That was obviously better, but still not even close to passing the test. So this is at least partly my fault :/
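For reference, a rough sketch (not the exact test code) of how this noise experiment can be run end to end, using the cross-validation helper that appears in the traceback at the top of this PR:

    # Rough sketch of the Gaussian-noise experiment described above.
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter
    from lifelines.utils import k_fold_cross_validation

    N = 50
    df = pd.DataFrame()
    df['x1'] = np.random.uniform(size=N)
    df['x2'] = np.random.uniform(size=N)
    # small Gaussian noise on the durations
    df['t'] = 1 + df['x1'] + df['x2'] + np.random.normal(0, 0.1, size=N)
    df['E'] = True

    scores = k_fold_cross_validation(CoxPHFitter(), df,
                                     duration_col='t', event_col='E', k=3)
    print("scores", scores)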

@CamDavidsonPilon (Owner)

@spacecowboy, can you share your R code to get the result of 0.999?

@CamDavidsonPilon (Owner)

Good news: I found a few bugs and am now able to get a pretty high c-index. The biggest one was in how I computed the baseline survival function: to compute a predicted survival function, the baseline survival function is needed, and I was computing it wrong (using the hazard instead of the cumulative hazard, and messing up the equality).

test_crossval_with_normalized_data almost always passes the 0.9 threshold, while its sister test test_crossval_for_cox_ph is still a little off (but averages only slightly below 0.9). Clearly normalization improves the performance of CoxPH - should it be a default behaviour?
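For reference, the relationships involved (a sketch of the textbook Cox formulas, not necessarily lifelines' exact internals): the baseline survival comes from the cumulative baseline hazard, and a predicted survival curve is a power of it.

    # Sketch of the standard relationships; illustrative, not lifelines' code.
    import numpy as np

    def baseline_survival(cumulative_baseline_hazard):
        # S0(t) = exp(-H0(t)); using the hazard instead of the cumulative
        # hazard is the kind of bug described above
        return np.exp(-cumulative_baseline_hazard)

    def predicted_survival(cumulative_baseline_hazard, beta, x):
        # S(t | x) = S0(t) ** exp(beta . x)
        return baseline_survival(cumulative_baseline_hazard) ** np.exp(np.dot(x, beta))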

@@ -146,8 +147,7 @@ def survival_table_from_events(durations, event_observed, min_observations,

     # this next line can be optimized for when min_observerations is all zeros.
     event_table = death_table.join(births_table, how='outer', sort=True).fillna(0)  # http://wesmckinney.com/blog/?p=414

-    return event_table
+    return event_table.astype(int)
@CamDavidsonPilon (Owner)

Values in here could be boolean if the censorship array was boolean. I want this to be strictly ints for consistency, and since these are counts, they should always be ints.

Ah, but not if we include weights...

… cast output of survival_table_to_events to float
@spacecowboy (Collaborator, Author)

Good news indeed. I have done some tests with your latest changes and can confirm great success. The difference in test results is due to me not doing cross-validation splitting, but simply training on the entire dataset.

Using the new partial_hazards function, I get the same result down to the last decimal! However, something is still fishy about the default predict method:

    rcox = cox_model(np.array(data_pred2), 't', 'E', headers=list(data_pred2.columns))
    preds = -rcox.hazard_eval_all(np.array(data_pred2[['x1', 'x2']]))
    print("RCox:", concordance_index(data_pred2['t'], preds))

    RCox: 0.9273469387755102

    lcox = CoxPHFitter()
    lcox.fit(data_pred2, duration_col='t', event_col='E')

    phaz = -lcox.predict_partial_hazard(data_pred2[['x1', 'x2']])
    ppred = lcox.predict(data_pred2[['x1', 'x2']])

    print("Partial hazards:", concordance_index(data[:, -2], preds.ravel()))
    print("Predict:", concordance_index(data_pred2['t'], ppred.ravel()))

    Partial hazards: 0.9273469387755102
    Predict: 0.5587755102040817

This seems like a perfect use case for rpy2-based tests to verify that we get the exact result (to some decimal).

I am not a fan of R and can't quickly extract the exact logic in pure R code. The only library I use is survival. I am using a Python wrapper built on rpy2. If you have R and rpy2 installed, you can install my wrapper:

    pip install git+https://github.com/spacecowboy/pysurvival.git

I wrote the package long before I learned about pandas, so it only supports numpy arrays. Usage is:

    # Define the data_pred2 dataframe, and convert to numpy
    data = np.array(data_pred2)

    from pysurvival.cox import cox_model

    # You need to specify which columns are the time and event variables
    # You can define the headers of the data frame manually with keyword headers
    rcox = cox_model(np.array(data_pred2), 't', 'E', headers=list(data_pred2.columns))
    # If no headers are specified, all columns are named X1, X2, ...
    rcox = cox_model(data, 'X2', 'X3')

    # Model is fitted in constructor. Now predict some hazards. Since the array has
    # time and event columns last, this works fine and those columns are simply ignored.
    preds = -rcox.hazard_eval_all(data)

    print("RCox:", concordance_index(data[:, -2], preds))

If you get errors, make sure you are only giving Python lists or NumPy arrays as arguments. The code relies entirely on numerical indexing, I'm afraid (I wrote this 3 years ago, when I was much newer at Python).

@CamDavidsonPilon (Owner)

Okay, so here's where my head is at:

  1. Normalization really, really helps the c-index score well. This is because with very little noise (as in our tests) and no normalization, the CoxPH estimated coefficients are very unstable, e.g.:
    data_pred2 = pd.DataFrame()
    data_pred2['x1'] = np.random.uniform(size=N)
    data_pred2['x2'] = np.random.uniform(size=N)
    data_pred2['t'] = (1 + data_pred2['x1'] + data_pred2['x2'] +
                       np.random.normal(0, 0.05, size=N))
    data_pred2['E'] = True

    cp.fit(data_pred2, duration_col='t')
    cp.hazards_
    #              x1         x2
    # coef -21.270541 -21.569023

These coefs are too large, and when we put them (quite a few times) in the exponential, we get garbage out:

    x = data_pred2[['x1', 'x2']]
    cp.predict_survival_function(x)[0]
    # output
    event_at
    0.000000    1
    1.146793    1
    1.222418    1
    1.270433    1
    1.382819    0
    1.674921    0
    1.748990    0
    1.753403    0
    1.774674    0
    1.788767    0
    1.790994    0
    1.795693    0

Hence any medians/expectations are useless.

  2. With normalization of the covariates, things behave nicely, for almost any amount of non-trivial noise.
  3. I suspect that our coxph algo is numerically unstable for little/no noise, but this can be remedied by normalizing covariates (a small sketch of that normalization follows below).
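A minimal sketch of that normalization step (a hypothetical helper, not lifelines' API):

    # Hypothetical helper illustrating covariate normalization before fitting.
    def normalize_covariates(df, covariate_cols):
        out = df.copy()
        for col in covariate_cols:
            # standardize each covariate to zero mean and unit variance
            out[col] = (df[col] - df[col].mean()) / df[col].std()
        return out

    # e.g. fit on normalize_covariates(data_pred2, ['x1', 'x2']), leaving the
    # 't' and 'E' columns untouched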

I'll try to get pysurvival on my machine later today to perform some more tests. Likely in a separate PR, we can create tests against lifelines and pysurvival.


BTW, I think your code above is reporting the wrong thing: you have

    print("Partial hazards:", concordance_index(data[:, -2], preds.ravel()))

I think you want:

    print("Partial hazards:", concordance_index(data[:, -2], phaz.ravel()))

which explains why we got such close decimal points!

@CamDavidsonPilon (Owner)

@spacecowboy the implementation of the Firth algo can be another PR. We've made great progress in this PR - going from a c-index of ~0.50 (and NaN predictions) to above 0.85 (and numerically stable predictions). Tests should be 🍏 (on average), want to merge this one? (If you are able to, please go ahead =)

@spacecowboy (Collaborator, Author)

Sure, as soon as I verify that the predict method works.

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
If a model normalizes data during fitting, it should also
normalize data passed to the predict methods later in the same
way. As a logical consequence, CoxPH now stores the original data
instead of the normalized version as well as the normalization
variables.

I also added some checks to make sure that incoming dataframes have
the same input columns present as the fitting data. In case the incoming
data is an ndarray, the order is assumed to be correct (as before).

The following examples now work, where they before would crash or simply
return the wrong result:

    # fulldata has columns x1, x2, t, e
    cf.predict(fulldata)

    # shuffleddata has columns x2, t, x1, e, y1, y2, y3
    cf.predict(shuffleddata)

    # Xrev has columns x2, x1
    cf.predict(Xrev)

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
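A rough sketch of the fit-time/predict-time normalization pattern described in the commit message above (hypothetical class, attribute, and method names; not the actual lifelines internals):

    # Hypothetical sketch: remember the normalization statistics at fit time
    # and re-apply them to any frame handed to predict; names illustrative only.
    import pandas as pd

    class NormalizedPredictSketch:
        def remember_normalization(self, df, duration_col, event_col):
            X = df.drop([duration_col, event_col], axis=1)
            self._fit_columns = X.columns
            self._norm_mean = X.mean()
            self._norm_std = X.std()

        def normalize_like_fit(self, X):
            if isinstance(X, pd.DataFrame):
                X = X[self._fit_columns]        # match the fitting columns by name
            else:
                # plain ndarray: column order is assumed correct, as before
                X = pd.DataFrame(X, columns=self._fit_columns)
            return (X - self._norm_mean) / self._norm_std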
@spacecowboy (Collaborator, Author)

@CamDavidsonPilon Please review the latest changes when you get a chance. predict_median does not get the same concordance as partial_hazards, but I'm not entirely sure it is supposed to.

The user should not have to type A.ravel(), or worse A.values.ravel(), when
calculating the concordance index. The function now calls ravel() on behalf
of the user in these two cases.

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
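A minimal sketch of that kind of input flattening (a hypothetical wrapper; the real concordance_index lives in lifelines.utils and its internals may differ):

    # Hypothetical flattening helper; not the actual lifelines code.
    import numpy as np
    import pandas as pd

    def to_1d(a):
        # accept a Series/DataFrame or an ndarray and always hand back a flat array
        if isinstance(a, (pd.Series, pd.DataFrame)):
            a = a.values
        return np.asarray(a).ravel()

    # concordance_index(to_1d(durations), to_1d(predictions)) then works whether
    # the caller passes raw arrays, Series, or single-column DataFrames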
To be consistent with output from other methods.

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
Since other predict methods are numerically unstable at the moment, use
predict_partial_hazard in the k_fold_crossval tests. Since this method
works, I could remove the outer loop and increase the expected result,
because results are consistently good now (at 0.93 to 0.97).

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
@spacecowboy (Collaborator, Author)

After reading up on the Cox model, and googling for baseline hazard, I've come to the drastic conclusion that perhaps all predict methods, except predict_partial_hazard, should be deleted. At the very least, they require a good redesign. The reason is that they are numerically unstable. And by "they" I really mean the calculation of the baseline hazard and baseline survival, as all other methods depend on that. According to [1], Cox in R doesn't even have any methods relating to the baseline hazard, so would anyone miss them?

The methods seem analytically correct, but numerically it all explodes. They basically calculate the exponential of another exponential, over a range from very small to very large numbers. So you end up with mostly zeros or ones in the end result (and infinities before that), and if you're lucky you have some decimal numbers somewhere in the middle.
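A toy illustration of that double exponential saturating (made-up numbers, not lifelines code):

    # S(t|x) = exp(-H0(t) * exp(beta . x)); with unnormalized covariates and
    # large coefficients, exp(beta . x) over/underflows and the curve collapses.
    import numpy as np

    H0 = np.array([0.001, 0.1, 1.0, 10.0])   # made-up cumulative baseline hazard
    for bx in (-25.0, 0.5, 25.0):             # made-up values of beta . x
        print(bx, np.exp(-H0 * np.exp(bx)))
    # bx = -25 prints ~[1 1 1 1], bx = 25 prints [0 0 0 0]; only moderate
    # values of beta . x leave anything strictly between 0 and 1.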

I have changed the tests to reflect this: cross-validation tests now use predict_partial_hazard, and I've added an expected failure on the monotonic test of the predict methods.

As it stands, I'm quite happy with merging this now if you feel we should keep the predict methods as-is for the moment. At the very least, we can tackle that problem in a separate PR.

Links:
[1] http://stats.stackexchange.com/questions/68737/how-to-estimate-baseline-hazard-function-in-cox-model-with-r

@CamDavidsonPilon (Owner)

👍 to your changes - I've only made a few small additions/subtractions:

  1. Always output DataFrames on predict methods.
  2. Respect the user's index.
  3. Ignore column ordering (a rough sketch of all three follows below).
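A minimal, illustrative-only sketch of what those three points could look like in a predict method (not the actual implementation):

    # Illustrative-only sketch of points 1-3.
    import numpy as np
    import pandas as pd

    def predict_partial_hazard_sketch(X, fitted_columns, coefs):
        X = X[fitted_columns]                       # 3. ignore the caller's column order
        values = np.exp(np.dot(X.values, coefs))
        return pd.DataFrame(values, index=X.index)  # 1. DataFrame out, 2. caller's index kept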

@CamDavidsonPilon (Owner)

W.r.t. removing prediction methods, let's do that in another PR/Issue.

@CamDavidsonPilon (Owner)

🚢 on your 👍

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
range returns an iterable (not a list) in Python 3. Making sure the result
is always a list.

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
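For illustration, the generic fix (not the specific lifelines line):

    # Python 2's range() returned a list; Python 3's returns a lazy range
    # object, so wrap it when a real list is needed later (e.g. concatenation).
    indices = list(range(10))   # a list on both Python 2 and 3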
@spacecowboy (Collaborator, Author)

I'm guessing that "ship on your +1" means merge ;)

Merging..

spacecowboy added a commit that referenced this pull request Nov 27, 2014
Merging improvements on CoxPH and lots of additional tests.
spacecowboy merged commit 802195c into master on Nov 27, 2014
@CamDavidsonPilon (Owner)

Whoop 🌟! Since this was a major revision, let's cut it out into 0.4.4. I'll make a milestone/issue for 0.5
