
Incorrect Calculation of Baseline Hazard & Baseline Cumulative Hazard #1475

Open
bofeng2018 opened this issue Nov 29, 2022 · 18 comments


@bofeng2018

bofeng2018 commented Nov 29, 2022

Hello, I am a senior data scientist at Prudential Financial. While working on a project involving Cox proportional hazards models, I found that the baseline hazard and baseline cumulative hazard are calculated incorrectly by the CoxPHFitter module in the lifelines package. Before jumping into the bug details, some version information: I was using lifelines 0.27.0, but when I checked the source code on GitHub again this morning (version 0.27.4, I believe), the same bugs were still present.

The bugs originate in the fit_model function of the SemiParametricPHFitter class, which starts at line 1252 of coxph_fitter.py. Notice that the standardized data are supplied to fit_model for estimation. This is not a problem for the Cox coefficients (the params in the function), because they are restored to their original scale by dividing by the standard deviations of the original data at line 1399. However, no similar correction is applied to the predicted_partial_hazards computed at line 1392. I notice that line 1393 uses a matrix multiplication of the standardized data with the uncorrected Cox coefficients to avoid the scale issue; however, the location issue is never handled. As a result, it is as if the raw data were shifted, with the direction and extent of the shift depending on the original mean values, and the effect of that shift is incorrectly absorbed into the baseline hazard of the Cox model. All subsequent calculations of the baseline hazard and baseline cumulative hazard are therefore incorrect.

The fix is straightforward: use the unstandardized data for the baseline-hazard-related calculations. I have appended some code below as a lazy fix; adding it at line 1270 should generate the correct baseline hazard and baseline cumulative hazard. However, this is definitely not ideal, since the incorrect calculations are not removed but merely overwritten. If desired, I would be very happy to work with the lifelines development team on a permanent and neater fix.

predicted_partial_hazards_ = (
    pd.DataFrame(np.exp(dot(X.values, self.params_)), columns=["P"])
    .assign(T=T.values, E=E.values, W=weights.values)
    .set_index(X.index)
)
self.baseline_hazard_ = self._compute_baseline_hazards(predicted_partial_hazards_)
self.baseline_cumulative_hazard_ = self._compute_baseline_cumulative_hazard(self.baseline_hazard_)
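The shift described above follows from a simple identity: partial hazards computed from demeaned data differ from those computed from raw data by the constant factor exp(beta · mean), and that constant is what leaks into the baseline hazard. A minimal numpy sketch of the identity, with made-up coefficients and data rather than lifelines internals:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([0.5, -1.2])           # dummy Cox coefficients
X = rng.normal(size=(100, 2)) + 3.0    # raw covariates with nonzero means
mu = X.mean(axis=0)

raw = np.exp(X @ beta)                 # partial hazards from raw data
centered = np.exp((X - mu) @ beta)     # partial hazards from demeaned data

# the two differ by exactly the constant factor exp(beta . mu),
# which ends up folded into the baseline hazard
assert np.allclose(raw, centered * np.exp(beta @ mu))
```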
@CamDavidsonPilon
Owner

Hi @bofeng2018, thank you for the detailed issue. This is interesting. Give me a few days to try some corrections.

@bofeng2018
Author

> Hi @bofeng2018, thank you for the detailed issue. This is interesting. Give me a few days to try some corrections.

Happy new year! It has been a while and I just want to follow up on the issue I reported. Please let me know if there is anything I can help with.

@CamDavidsonPilon
Owner

Hey @bofeng2018, I was just thinking about this issue. I haven't had time to devote to it - do you want to submit a PR with a recommended change (even a temporary one that I can replace later)? With a fix in, I can cut a new release too.

@bofeng2018
Author

> Hey @bofeng2018, I was just thinking about this issue. I haven't had time to devote to it - do you want to submit a PR with a recommended change (even a temporary one that I can replace later)? With a fix in, I can cut a new release too.

Sure. I just created a pull request. Please let me know if you see any problems.

@user799595

Hello

I came across this issue, and was wondering if it is still open and what the impact is.

I would like to use lifelines for Cox PH regression, but I don't know enough about this area to decide if this implementation is safe to use.

Thank you!

@CamDavidsonPilon
Owner

Hi, I'm finally investigating this more carefully. I've written the two tests below, but am unable to make them fail (all assertions pass).

    def test_baseline_prediction_with_extreme_means(self, rossi):
        cph = CoxPHFitter()
        cph.fit(rossi, "week", "arrest")

        rossi_shifted = rossi.copy()
        rossi_shifted['prio'] += 100
        cph_shifted = CoxPHFitter()
        cph_shifted.fit(rossi_shifted, "week", "arrest")

        # make sure summary stats are equal
        assert_frame_equal(cph_shifted.summary, cph.summary)

        # confirm hazards are equal
        assert_frame_equal(cph.baseline_hazard_, cph_shifted.baseline_hazard_)
        assert_frame_equal(cph.baseline_cumulative_hazard_, cph_shifted.baseline_cumulative_hazard_)


    def test_baseline_prediction_with_extreme_scaling(self, rossi):
        cph = CoxPHFitter()
        cph.fit(rossi, "week", "arrest")

        rossi_scaled = rossi.copy()
        rossi_scaled['prio'] *= 100
        cph_scaled = CoxPHFitter()
        cph_scaled.fit(rossi_scaled, "week", "arrest")

        # make sure summary stats are equal - note that CI and coefs are unequal since we scaled params.
        assert_frame_equal(cph_scaled.summary[['z', 'p']], cph.summary[['z', 'p']])

        # confirm hazards are equal
        assert_frame_equal(cph.baseline_hazard_, cph_scaled.baseline_hazard_)
        assert_frame_equal(cph.baseline_cumulative_hazard_, cph_scaled.baseline_cumulative_hazard_)

Also, when I implement your solution, the following important tests fail:
https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/tests/test_estimation.py#L4594

https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/tests/test_estimation.py#L4573

Some others fail, too, but these are the more relevant ones as they compare against R's survival library.


Can you provide a simple example, dummy data or not, that shows the difference?

@bofeng2018
Author

bofeng2018 commented Jan 6, 2024 via email

@CamDavidsonPilon
Owner

@bofeng2018 email attachments won't show in issues, can you repost your nb in the issue thread?

@bofeng2018
Author

Demonstration.ipynb.zip

Attached is the zipped nb. Let me know if you can't open it.

@CamDavidsonPilon
Owner

(Sorry about the spam - I keep alternating between being convinced I've found the problem and not - still investigating)

@bofeng2018
Author

> (Sorry about the spam - I keep alternating between being convinced I've found the problem and not - still investigating)

Totally understandable. It also took me a while to finally confirm that the problem was because of the data standardization. Let me know if you have any questions.

@CamDavidsonPilon
Owner

CamDavidsonPilon commented Jan 7, 2024

Okay, here's an (updating) summary of what I know. This may change:

  • the baseline hazard (and cumulative hazard) can be computed at the mean values, or at all zeros. We (don't ask why yet) defaulted to the latter. This is the same as R's basehaz(..., centered=FALSE), which is what we test against¹.
  • From @bofeng2018's notebook, we can compute the factor by which the cumulative baseline hazard is "off": exp(-healthScoreCoef * 0.5) (the 0.5 is the mean of the single covariate). This makes the curves line up:
[screenshot: the two cumulative hazard curves line up after applying the factor]
  • generally, the factor is exp(-self.params dot self._norm_means)
  • I'm putting "off" in quotations, since it's partially a decision for the user (that I made for them)
  • Need to investigate the predict functions.
  • Thinking about why centered=FALSE was chosen:
    • the current method has the property that the baseline hazard is independent of shifts in the covariates. That is, if we shift a covariate by 100, the baseline hazard doesn't change. This is desirable: all else being equal, if I change my measuring stick to start at 100, this shouldn't affect survival. Unless I start pairing it with covariates - then I need to handle the 100 somewhere. (Note that shifting by 100 never affects the coefs.)
    • the proposed method doesn't have this property. If I add 10 to the data, e.g. dat['Health Score'] += 10, then the proposed result looks skewed, whereas the current method's result is the same:
[screenshot: the proposed baseline hazard looks skewed after the shift, while the current method's is unchanged]
  • Suffice to say, this is confusing for us, so it needs to be made clear for users.

¹ although our tests say centered=TRUE, which tripped me up.

@CamDavidsonPilon
Owner

Some more thoughts:

  • this exp(-params * means) factor is introduced in the predict functions. So when you call predict_survival_function(x), an exp(-params * means) is introduced. At a high level:
predict_survival_function(x) = exp(-CBH(t) * exp(params * (x - means)))
                             = exp(-CBH(t) * exp(-params * means) * exp(params * x))
                             = exp(-(CBH(t) * exp(-params * means)) * exp(params * x))

So I think the problem is that it's not obvious where the factor is being introduced. We either introduce it in the calculation of the baseline hazard and then avoid it in the predict functions, or vice versa.
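The rearrangement above is the same quantity written three ways, which can be sanity-checked numerically. A tiny sketch with made-up scalar values (CBH, beta, x, and mean here are dummies, not fitted quantities):

```python
import numpy as np

CBH = 0.8                      # cumulative baseline hazard at some t (dummy)
beta, x, mean = 0.5, 2.0, 1.3  # dummy coefficient, covariate, covariate mean

lhs = np.exp(-CBH * np.exp(beta * (x - mean)))
mid = np.exp(-CBH * np.exp(-beta * mean) * np.exp(beta * x))
rhs = np.exp(-(CBH * np.exp(-beta * mean)) * np.exp(beta * x))

# all three forms agree: the exp(-beta * mean) factor can be absorbed
# either into the partial hazard or into the baseline hazard
assert np.isclose(lhs, mid) and np.isclose(mid, rhs)
```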

@bofeng2018
Author

> Okay, here's an (updating) summary of what I know. This may change: […]

The weird second graph is probably caused by super low mortality rates. Instead of dat['Health Score'] += 10, can you try dat['Health Score'] -= 10? The blue data points were based on classical textbook methods. Unless the implementation was wrong, they should be very consistent with theoretical results with large sample sizes.

@bofeng2018
Author

bofeng2018 commented Jan 8, 2024

> this exp(-params * means) factor is introduced in the predict functions. […] We either introduce it in the calculation of the baseline hazard and then avoid it in the predict functions, or vice versa.

Yes, I'm confident the problem comes from the data standardization procedure. To be more specific, the covariate means are incorrectly shifted into the baseline hazard. As a simple way to confirm it, for data preprocessing, can you try only scaling the covariates to unit variance, without shifting their locations? That applies to lines 1248-1250 of coxph_fitter.py; the corresponding code is shown below. I would be very curious to see how it changes the results.

        X_norm = pd.DataFrame(
            utils.normalize(X.values, self._norm_mean.values, self._norm_std.values), index=X.index, columns=X.columns
        )
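To illustrate the suggestion: assuming utils.normalize computes (values - mean) / std, the scale-only variant amounts to passing zeros in place of the means. A standalone sketch with a stand-in normalize and dummy data (not the lifelines helper itself):

```python
import numpy as np

def normalize(values, mean, std):
    # assumed behavior of lifelines' utils.normalize
    return (values - mean) / std

X = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 50.0]])

# pass 0.0 for the mean: covariates are rescaled to unit variance
# but their locations are left untouched
scale_only = normalize(X, 0.0, X.std(axis=0))

assert np.allclose(scale_only.std(axis=0), 1.0)       # unit variance
assert not np.allclose(scale_only.mean(axis=0), 0.0)  # means not shifted
```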

@CamDavidsonPilon
Owner

Here are my edits of your notebook; I made some changes to reduce the complexity and focus on a simpler example.
Demonstration 2.ipynb.zip


> To be more specific, covariate means were incorrectly shifted to the baseline hazard.

I don't think this is the correct interpretation. The covariate means have to go somewhere. Either they exist in the log-partial-hazard, or they exist in baseline hazard. We (implicitly) have them exist in the log-partial-hazard.

More explicitly:

  • to compute CBH(t), one uses exp(params * (x - means)) in computations (as we do in lifelines)
  • to compute CBH'(t) := CBH(t) * exp(-params * means), one uses exp(params * x)

If I understand correctly, the differences we see are due to what we think the cumulative baseline hazard is: CBH'(t) or CBH(t). The answer is: it doesn't matter, so long as one is explicit. If the data is demeaned before entering lifelines, the two are equal and there is no issue.


> As a simple way to confirm it, for data preprocessing, can you try to only convert the covariate variations to be 1 but not shift their locations? ... I would be very curious to see how it changes the results.

Sorry, I'm not following: which results should we be looking at?

@bofeng2018
Author

> I don't think this is the correct interpretation. The covariate means have to go somewhere. Either they exist in the log-partial-hazard, or they exist in baseline hazard. […]

Yes, I agree we probably disagree on what defines the baseline. I guess I'm always following the most classical formulation of a Cox proportional hazards model, which does not include a step that subtracts sample means from each covariate. In that case, the baseline hazard would be calculated in the way I described.

However, I also totally agree that the most important thing is to get the final hazard estimates right. As long as the covariate sample means are always subtracted from future data, as they are for the training data, the estimated hazard functions should be correct.

Overall, very inspiring discussions! Thanks for all the thought sharing!

@CamDavidsonPilon
Owner

> Overall, very inspiring discussions! Thanks for all the thought sharing!

💯

This is forcing me to re-evaluate these very important details. We'll make some changes to make this easier (at the very least: more transparent) for users.
