Some failing CoxPH tests #89
Conversation
Fixes an issue where CoxPH would cause an error to be raised which seemed to be due to mixing of pandas and numpy. Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
Noticed some strange behavior with CoxPH. I always have normalized data, which did not seem to work well with the implementation. The failures here seem unrelated to that, though. The added tests currently fail with the following trace:

```
Traceback (most recent call last):
  File "/home/jonas/workspacepython/lifelines/lifelines/tests/test_suite.py", line 937, in test_crossval_normalized
    event_col='E', k=3)
  File "/home/jonas/workspacepython/lifelines/lifelines/utils.py", line 311, in k_fold_cross_validation
    fitter.fit(training_data, duration_col=duration_col, event_col=event_col)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 998, in fit
    include_likelihood=include_likelihood)
  File "/home/jonas/workspacepython/lifelines/lifelines/estimation.py", line 938, in _newton_rhaphson
    delta = solve(-hessian, step_size * gradient.T)
  File "/home/jonas/anaconda3/lib/python3.4/site-packages/numpy/linalg/linalg.py", line 381, in solve
    r = gufunc(a, b, signature=signature, extobj=extobj)
  File "/home/jonas/anaconda3/lib/python3.4/site-packages/numpy/linalg/linalg.py", line 90, in _raise_linalgerror_singular
    raise LinAlgError("Singular matrix")
numpy.linalg.linalg.LinAlgError: Singular matrix
```

However, note that I have commented out one of the datasets because it seems to cause the cross-validation to end up in an infinite loop of some kind. The tests never finish (I only waited ~2 minutes). Doing similar checks with *R* works with no problem.

Doing the following is a fast way to check the results:

```
python -m unittest lifelines.tests.test_suite.CoxRegressionTests
```

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
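For reference, the singular-matrix failure mode can be reproduced in miniature: `np.linalg.solve` raises as soon as the Hessian loses full rank, e.g. when covariates are perfectly collinear. The values below are made up for illustration; they are not the actual test data.

```python
import numpy as np

# A rank-deficient "Hessian", as arises with perfectly collinear covariates
hessian = np.array([[2.0, 4.0],
                    [1.0, 2.0]])  # second row is half the first -> rank 1
gradient = np.array([1.0, 0.5])

try:
    delta = np.linalg.solve(-hessian, gradient)
except np.linalg.LinAlgError as err:
    print("LinAlgError:", err)  # same failure mode as the traceback above
```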
@spacecowboy, I can confirm that your test throws a convergence error. Interestingly, I added some noise to the duration values, and that removed the error. I'll continue to investigate.
Also, w.r.t.
… more regression tests; adding new datasets
ping @spacecowboy, if you have time, please review =)
One thing I haven't addressed, though I spent some time working on it, is how to deal with a noiseless model, e.g. in your original example.
```diff
@@ -937,7 +938,7 @@ def _newton_rhaphson(self, X, T, E, initial_beta=None, step_size=1.,
     hessian, gradient = output[:2]
     delta = solve(-hessian, step_size * gradient.T)
     beta = delta + beta
-    if pd.isnull(delta).sum() > 1:
+    if pd.isnull(delta).sum() >= 1:
```
This fixes the non-returning bug in the original issue.
Cox should really solve the data (cindex > 0.9). This is what happens in R. Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
@CamDavidsonPilon Yes, on typical real survival data you'll see a c-index of about 0.6 to 0.7, because real data is extremely noisy. On artificial data, however, it is easy to construct a set on which Cox will achieve a c-index of 1.0 (which I have done here). I did some experiments and noted that with uniform noise, like:
It "succeeds" with noise as low as 0.02, but raises an error for 0.01. However, printing the scores, it's obvious that it doesn't succeed at all:
When I run the same data through the Cox model in R, I get a concordance index of 0.999, which is what I alluded to in the comment about raising the limit for a successful test. The expected mean result should be > 0.99 (with no noise). Running the same with some gaussian noise (mean=0, scale=0.1), it typically performs > 0.9. I've tweaked the code to reflect this. The poor training performance and the numerical errors are likely related. I checked out the previous version of the cox model (before I refactored it) and ran the same tests (for final gaussian noise scale=0.07):
Which obviously was better, but still not even close to passing the test. So this is at least partly my fault :/
@spacecowboy, can you share your R code to get the result of 0.999?
…es like cumulative_hazard_ and baseline_survival_
Good news: I found a few bugs and am now able to get a pretty high c-index. The biggest was in how I computed the baseline survival function: to compute a predicted survival function, the baseline survival function is needed. I was computing this wrong (using the hazard instead of the cumulative hazard, and messing up the equality).
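For reference, the fix amounts to the standard Cox relationships (a minimal numpy sketch with made-up hazard values, not the lifelines internals): the baseline survival must come from the *cumulative* hazard, S0(t) = exp(-H0(t)), and a subject's predicted survival is S(t|x) = S0(t)^exp(βx).

```python
import numpy as np

# Made-up baseline hazard increments at each event time
baseline_hazard = np.array([0.05, 0.10, 0.07, 0.12])

# Correct: survival uses the *cumulative* baseline hazard
cumulative_hazard = np.cumsum(baseline_hazard)
baseline_survival = np.exp(-cumulative_hazard)

# The bug was akin to exponentiating the instantaneous hazard directly,
# which does not give a monotonically decreasing survival curve:
wrong = np.exp(-baseline_hazard)

# Predicted survival for a subject with linear predictor beta . x = 0.8
partial_hazard = np.exp(0.8)
predicted_survival = baseline_survival ** partial_hazard
```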
```diff
@@ -146,8 +147,7 @@ def survival_table_from_events(durations, event_observed, min_observations,
     # this next line can be optimized for when min_observerations is all zeros.
     event_table = death_table.join(births_table, how='outer', sort=True).fillna(0)  # http://wesmckinney.com/blog/?p=414
-    return event_table
+    return event_table.astype(int)
```
Values in here could be boolean if the censorship array was boolean. I want this to be strictly ints for consistency, and since these are counts, they should always be ints.
Ah, but not if we include weights...
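The float/int issue comes from pandas upcasting: an outer join introduces NaN, which silently forces the count columns to float64 before `fillna(0)` runs. A tiny sketch (hypothetical series names, not the real birth/death tables):

```python
import pandas as pd

deaths = pd.Series([1, 2], index=[1, 3], name='observed')
births = pd.Series([3, 1], index=[0, 1], name='entrance')

# The outer join introduces NaN, silently upcasting the counts to float64
table = pd.concat([deaths, births], axis=1).fillna(0)

# Counts are whole numbers, so cast them back (the .astype(int) in the diff)
counts = table.astype(int)
```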
… cast output of survival_table_to_events to float
Good news indeed. I have done some tests with your latest changes and can confirm great success. The difference in test results is due to me not doing a cross-validation split, but simply training on the entire data. Using the new partial_hazards function, I get the same result down to the last decimal! However, something is still fishy about the default predict method:

```python
rcox = cox_model(np.array(data_pred2), 't', 'E', headers=list(data_pred2.columns))
preds = -rcox.hazard_eval_all(np.array(data_pred2[['x1', 'x2']]))
print("RCox:", concordance_index(data_pred2['t'], preds))

lcox = CoxPHFitter()
lcox.fit(data_pred2, duration_col='t', event_col='E')
phaz = -lcox.predict_partial_hazard(data_pred2[['x1', 'x2']])
ppred = lcox.predict(data_pred2[['x1', 'x2']])
print("Partial hazards:", concordance_index(data[:, -2], preds.ravel()))
print("Predict:", concordance_index(data_pred2['t'], ppred.ravel()))
```
This seems like a perfect use case for rpy2-based tests to verify that we get the exact (to some decimal) result. I am not a fan of R and can't quickly extract the exact logic in pure R code. The only library I use is
I wrote the package long before I learned about pandas, so it only supports numpy arrays. Usage is:

```python
# Define the data_pred2 dataframe, and convert to numpy
data = np.array(data_pred2)

from pysurvival.cox import cox_model

# You need to specify which columns are the time and event variables.
# You can define the headers of the data frame manually with the keyword headers
rcox = cox_model(np.array(data_pred2), 't', 'E', headers=list(data_pred2.columns))

# If no headers are specified, all columns are named X1, X2, ...
rcox = cox_model(data, 'X2', 'X3')

# Model is fitted in the constructor. Now predict some hazards. Since the array has
# time and event columns last, this works fine and those columns are simply ignored.
preds = -rcox.hazard_eval_all(data)
print("RCox:", concordance_index(data[:, -2], preds))
```

If you get errors, make sure you are only giving Python lists or numpy arrays as arguments. The code relies entirely on numerical indexing, I'm afraid (I wrote this 3 years ago, when I was much newer at Python).
Okay, so here's where my head is at:
These coefs are too large, and when we put them (quite a few times) in the exponential, we get garbage out:
Hence any medians/expectations are useless.
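A numeric illustration of the blow-up (made-up coefficient values, not the fitted ones): even a moderate linear predictor becomes astronomically large after one exponential, and the second exponential in the survival formula then underflows to nothing.

```python
import numpy as np

beta = np.array([50.0, -45.0])  # hypothetical runaway coefficients
x = np.array([1.2, 0.8])

partial_hazard = np.exp(beta @ x)  # exp(24) ~ 2.6e10 already
baseline_survival = 0.9            # S0(t) at some time t

with np.errstate(over='ignore', under='ignore'):
    predicted = baseline_survival ** partial_hazard  # underflows to 0.0
```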
I'll try to get pysurvival on my machine later today to perform some more tests. Likely in a separate PR, we can create tests against lifelines and pysurvival. BTW, I think your code above is reporting the wrong thing: you have
I think you want:
which explains why we got such close decimal points!
@spacecowboy the implementation of the Firth algo can be another PR. We've made great progress in this PR - going from a c-index of ~0.50 (and NaN predictions) to 0.85-plus (and numerically stable predictions). Tests should be 🍏 (on average), want to merge this one? (If you are able to, please go ahead =)
Sure, as soon as I verify that the predict method works
Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
If a model normalizes data during fitting, it should also normalize data passed to the predict methods later in the same way. As a logical consequence, CoxPH now stores the original data instead of the normalized version, as well as the normalization variables. I also added some checks to make sure that incoming dataframes have the same input columns present as the fitting data. In case the incoming data is an ndarray, the order is assumed to be correct (as before). The following examples now work, where before they would crash or simply return the wrong result:

```python
# fulldata has columns x1, x2, t, e
cf.predict(fulldata)

# shuffleddata has columns x2, t, x1, e, y1, y2, y3
cf.predict(shuffleddata)

# Xrev has columns x2, x1
cf.predict(Xrev)
```

Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
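The contract described in this commit can be sketched as a standalone helper (a hypothetical class, not the actual CoxPH code): remember the fit-time columns and normalization constants, then reorder and re-normalize anything passed to predict.

```python
import numpy as np
import pandas as pd

class NormalizingModel:
    """Sketch: normalize at predict time exactly as at fit time."""

    def fit(self, df, feature_cols):
        # Remember the fitting columns and normalization constants
        self.feature_cols = list(feature_cols)
        self._mean = df[self.feature_cols].mean()
        self._std = df[self.feature_cols].std()
        return self

    def _prepare(self, X):
        if isinstance(X, pd.DataFrame):
            missing = set(self.feature_cols) - set(X.columns)
            if missing:
                raise ValueError("missing columns: %s" % sorted(missing))
            X = X[self.feature_cols]  # reorder columns, drop extras
        else:
            # ndarray: assume the column order is already correct
            X = pd.DataFrame(np.asarray(X), columns=self.feature_cols)
        return (X - self._mean) / self._std
```

With this shape, a dataframe with shuffled or extra columns gives the same normalized input as the original fitting frame.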
Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
@CamDavidsonPilon Please review the latest changes when you get a chance. predict_median does not get the same concordance as partial_hazards, but I'm not entirely sure it is supposed to.
User should not have to type A.ravel() or worse: A.values.ravel(), when calculating the concordance index. Function now does ravel() on behalf of the user in these two cases. Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
To be consistent with output from other methods. Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
Since other predict methods are numerically unstable atm, use predict_partial_hazard in the k_fold_crossval tests. Since this method works, I could remove the outer loop and increase the expected result, because results are consistently good now (0.93 to 0.97). Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
After reading up on the Cox model, and googling for baseline hazard, I've come to the drastic conclusion that perhaps all predict methods, except predict_partial_hazard, should be deleted. At the very least, they require a good redesign. The reason is that they are numerically unstable. And by "they" I really mean the calculation of the baseline hazard and baseline survival, as all other methods depend on that. According to [1], Cox in R doesn't even have any methods relating to the baseline hazard, so would anyone miss them?

The methods seem analytically correct, but numerically it all explodes. They basically calculate the exponential of another exponential, over a range from very small to very large numbers. So you end up with mostly zeros or ones in the end result (and infinities before then), and if you're lucky you have some decimal numbers somewhere in the middle.

I have changed the tests to reflect this: cross-validation tests now use predict_partial_hazard, and I've added an expected failure on the monotonic test of the predict methods. As it stands, I'm quite happy with merging this now if you feel we should keep the predict methods as-is for the moment, and at the very least tackle that problem in a separate PR.
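That "exponential of another exponential" saturates very quickly; a small sketch with made-up scores and a made-up baseline cumulative hazard:

```python
import numpy as np

scores = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])  # hypothetical beta . x values
H0 = 5.0                                          # baseline cumulative hazard at some t

with np.errstate(over='ignore', under='ignore'):
    survival = np.exp(-H0 * np.exp(scores))
# survival is pinned near 1 for very negative scores and hits exactly 0
# for large positive ones, so medians/expectations derived from it carry
# almost no information.
```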
👍 to your changes - I've only made a few small additions/subtractions:
…ng some tests for it. Moving sig-values to utils
W.r.t. removing prediction methods, let's do that in another PR/Issue.
🚢 on your 👍
Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
range() returns an iterable in Python 3. Making sure the result is always a list. Signed-off-by: Jonas Kalderstam <jonas@kalderstam.se>
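The Python 3 change being handled, for reference:

```python
# In Python 3, range() returns a lazy sequence object rather than a list,
# so code that expects list behavior (e.g. concatenation) breaks.
r = range(3)
as_list = list(r)          # explicit conversion restores Python 2 semantics
combined = as_list + [99]  # works; `r + [99]` would raise TypeError
```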
I'm guessing that "ship on your +1" means merge ;) Merging..
Merging improvements on CoxPH and lots of additional tests.
Whoop 🌟! Since this was a major revision, let's cut it out into 0.4.4. I'll make a milestone/issue for 0.5 |