-
-
Notifications
You must be signed in to change notification settings - Fork 560
CoxTimeVaryingFitter is actually faster than CoxPHFitter... #591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Running some tests, the performance does change when the amount of ties changes. So I created a simple script to test when and how the performance changes as we vary the dataset size and the amount of duplication: https://gist.github.com/CamDavidsonPilon/779e8644915caaeb9fb6bff92a241146 (uses some edits to results:
It looks like batch algo. is faster when less than 0.05 up to a limit (not true when there are 10^3+ rows). This makes sense, as the indexing in the batch mode gets less and less efficient. However, for sub 5000 rows, batch mode provides some fast gains. FYI, the original Rossi dataset has a duplication metric of 0.11. However the distribution of Based on this, I'll keep both algorithms, and dynamically choose the algorithm at runtime based on statistics like Next steps are to repeat the script, but focus on the 0.05 - 0.25 range for less than 5k rows to find a better threshold |
To get a better threshold, run a linear regression |
The squared terms don't add much, I'll drop them.
|
more dataset size & frac combinations. Adding an interaction term too:
|
Returning to the original test, |
Down to 0.67 with #595 |
Down to ~0.17 with #609 |
CoxPHFitter
testtakes about 2.3 seconds.
CoxTimeVaryingFitter
testtakes about 1.65 seconds.
Note that the datasets between the two are identical. Even the results are identical (as expected). The internal differences are that
CoxPHFitter
looks at each row individually, while theCoxTimeVaryingFitter
looks at all rows grouped by duration. The latter is much more efficient when there are lots of ties (i.e. whencardinality / row count
is low).This is kinda shocking to me. It means I can improve
CoxPHFitter
performance by like 30%.The text was updated successfully, but these errors were encountered: