CoxTimeVaryingFitter is actually faster than CoxPHFitter... #591

Closed
CamDavidsonPilon opened this issue Jan 2, 2019 · 7 comments · Fixed by #594
Comments

@CamDavidsonPilon
Owner

CoxPHFitter test

    import pandas as pd
    import time

    from lifelines import CoxPHFitter
    from lifelines.datasets import load_rossi

    df = load_rossi()
    df = pd.concat([df] * 20)  # inflate the dataset 20x so the timing is meaningful
    cp = CoxPHFitter()
    start_time = time.time()
    cp.fit(df, duration_col="week", event_col="arrest")
    print("--- %s seconds ---" % (time.time() - start_time))
    cp.print_summary()

takes about 2.3 seconds.

CoxTimeVaryingFitter test

    import time
    import pandas as pd
    from lifelines import CoxTimeVaryingFitter
    from lifelines.datasets import load_rossi
    from lifelines.utils import to_long_format

    df = load_rossi()
    df = pd.concat([df] * 20)  # same inflated dataset as above
    df = df.reset_index()  # the index becomes the per-subject id column
    df = to_long_format(df, duration_col='week')
    ctv = CoxTimeVaryingFitter()
    start_time = time.time()
    ctv.fit(df, id_col="index", event_col="arrest", start_col="start", stop_col="stop")
    time_took = time.time() - start_time
    print("--- %s seconds ---" % time_took)
    ctv.print_summary()

takes about 1.65 seconds.

Note that the datasets between the two tests are identical, and even the results are identical (as expected). The internal difference is that CoxPHFitter processes each row individually, while CoxTimeVaryingFitter processes rows grouped by duration. The latter is much more efficient when there are lots of ties (i.e. when the number of unique durations relative to the row count is low).
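The difference can be illustrated with a toy sketch (hypothetical code, not lifelines' actual internals): the risk-set sums needed by the partial likelihood can be computed once per unique duration instead of once per row, so the loop shrinks when durations are heavily tied.

```python
import numpy as np
import pandas as pd

# Toy data with heavily tied durations and a per-row score (think x @ beta).
df = pd.DataFrame({
    "T": [5, 5, 5, 8, 8, 10],
    "E": [1, 1, 0, 1, 1, 1],
    "score": [0.2, 0.5, 0.1, 0.9, 0.4, 0.3],
})

def risk_sums_per_row(df):
    # "Single" style: one risk-set sum computed for every row.
    return {t: np.exp(df.loc[df["T"] >= t, "score"]).sum() for t in df["T"]}

def risk_sums_per_time(df):
    # "Batch" style: one risk-set sum per *unique* duration --
    # 3 passes over the data here instead of 6.
    return {t: np.exp(df.loc[df["T"] >= t, "score"]).sum() for t in df["T"].unique()}

assert risk_sums_per_row(df) == risk_sums_per_time(df)  # identical results
```

Both versions return the same sums; only the amount of work differs.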

This is kind of shocking to me. It means CoxPHFitter performance can be improved by roughly 30%.

@CamDavidsonPilon
Owner Author

CamDavidsonPilon commented Jan 2, 2019

Running some tests, the performance does change as the number of ties changes. So I created a simple script to test when and how performance changes as we vary the dataset size and the amount of duplication: https://gist.github.com/CamDavidsonPilon/779e8644915caaeb9fb6bff92a241146 (uses some edits to CoxPHFitter from a branch).

results:

                batch    single      ratio
N   frac_dups
432   0.05   0.115094  0.215279   0.534627
      0.25   0.156897  0.161292   0.972753
      0.50   0.175383  0.163694   1.071407
      0.75   0.218806  0.164751   1.328101
      0.95   0.170074  0.165040   1.030502
      1.00   0.183934  0.182833   1.006022
2160  0.05   0.377973  0.572945   0.659702
      0.25   0.804157  0.595763   1.349794
      0.50   0.925240  0.612084   1.511622
      0.75   0.947170  0.626574   1.511665
      0.95   0.985102  0.620530   1.587517
      1.00   1.349139  0.730042   1.848029
4320  0.05   0.993831  1.840383   0.540013
      0.25   2.554543  1.360934   1.877051
      0.50   2.776866  1.334408   2.080972
      0.75   2.918765  1.197274   2.437842
      0.95   2.837594  1.221863   2.322350
      1.00   2.808126  1.215002   2.311211
8640  0.05   2.436381  2.456973   0.991619
      0.25   9.698420  4.532513   2.139745
      0.50  11.143309  2.670600   4.172586
      0.75  11.033132  2.548802   4.328752
      0.95  10.128021  2.485414   4.074983
      1.00   9.942075  2.537789   3.917613
21600 0.05  12.986119  5.600304   2.318824
      0.25  45.722887  8.451117   5.410277
      0.50  61.571375  6.184555   9.955668
      0.75  66.415850  6.102201  10.883917
      0.95  66.390649  6.206955  10.696170
      1.00  70.464078  5.955308  11.832147

It looks like the batch algorithm is faster when the duplication fraction is below about 0.05, but only up to a limit on dataset size (the advantage disappears around 10^4+ rows). This makes sense, as the indexing in batch mode gets less and less efficient as the data grow. However, for under ~5000 rows, batch mode provides some quick gains. FYI, the original Rossi dataset has a duplication metric of 0.11. However, the distribution of T is not identical across tests (my duplication metric is simple and doesn't capture the distribution well).

Based on this, I'll keep both algorithms and dynamically choose between them at runtime based on statistics like df['T'].unique().shape[0] / df.shape[0]. I unfortunately get some code duplication, but I think it's worth it.
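A minimal sketch of what that runtime dispatch could look like (the function names and the 0.5 cutoff are hypothetical placeholders, not the threshold lifelines actually uses):

```python
import pandas as pd

def frac_unique_durations(durations: pd.Series) -> float:
    """Fraction of distinct durations; a low value means many ties."""
    return durations.unique().shape[0] / durations.shape[0]

def choose_algorithm(durations: pd.Series, threshold: float = 0.5) -> str:
    # Many ties (few unique durations) favor the batched,
    # group-by-duration algorithm; otherwise fall back to row-by-row.
    return "batch" if frac_unique_durations(durations) < threshold else "single"

print(choose_algorithm(pd.Series([1, 1, 1, 1, 2, 2])))  # heavy ties -> "batch"
print(choose_algorithm(pd.Series(range(100))))          # all unique -> "single"
```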

Next steps are to repeat the script, but focus on the 0.05–0.25 range for fewer than 5k rows to find a better threshold.

@CamDavidsonPilon
Owner Author

CamDavidsonPilon commented Jan 2, 2019

                  batch    single     ratio
N   frac_dups
432  0.010000  0.090478  0.160682  0.563086
     0.036667  0.101040  0.162177  0.623023
     0.063333  0.124758  0.173921  0.717325
     0.090000  0.148689  0.185626  0.801014
     0.116667  0.126510  0.159210  0.794610
     0.143333  0.140035  0.168380  0.831660
     0.170000  0.139186  0.172226  0.808160
     0.196667  0.143183  0.161531  0.886412
     0.223333  0.149419  0.172548  0.865956
     0.250000  0.154384  0.164213  0.940146
864  0.010000  0.118958  0.269069  0.442109
     0.036667  0.152362  0.271408  0.561376
     0.063333  0.175262  0.277069  0.632557
     0.090000  0.204368  0.306464  0.666859
     0.116667  0.221013  0.283780  0.778819
     0.143333  0.228133  0.279285  0.816847
     0.170000  0.259037  0.275419  0.940520
     0.196667  0.271853  0.283252  0.959757
     0.223333  0.267034  0.282465  0.945370
     0.250000  0.284253  0.292829  0.970713
1728 0.010000  0.189198  0.500514  0.378007
     0.036667  0.257756  0.488055  0.528129
     0.063333  0.356278  0.510444  0.697977
     0.090000  0.567291  0.716904  0.791307
     0.116667  0.626131  0.569123  1.100169
     0.143333  0.500412  0.579520  0.863494
     0.170000  0.541273  0.583804  0.927148
     0.196667  0.730457  0.544671  1.341098
     0.223333  0.592445  0.586924  1.009406
     0.250000  0.589447  0.515669  1.143073
2592 0.010000  0.257900  0.721063  0.357666
     0.036667  0.394811  0.734231  0.537720
     0.063333  0.535725  0.841410  0.636699
     0.090000  0.814893  0.771991  1.055573
     0.116667  0.758148  0.735620  1.030625
     0.143333  0.906650  0.851793  1.064402
     0.170000  1.008925  0.876554  1.151013
     0.196667  1.008188  0.764584  1.318610
     0.223333  1.025650  0.828224  1.238373
     0.250000  1.047904  0.803953  1.303439
3456 0.010000  0.343622  0.958570  0.358474
     0.036667  0.614166  1.068780  0.574642
     0.063333  0.820540  0.964449  0.850787
     0.090000  1.008817  0.977758  1.031766
     0.116667  1.187182  1.095112  1.084074
     0.143333  1.482159  1.132114  1.309196
     0.170000  1.596786  1.039942  1.535457
     0.196667  1.541407  1.008447  1.528496
     0.223333  1.578326  1.162654  1.357520
     0.250000  1.601857  1.038344  1.542704
4320 0.010000  0.471593  1.263956  0.373109
     0.036667  0.840879  1.285744  0.654002
     0.063333  1.029592  1.219469  0.844295
     0.090000  1.350916  1.288697  1.048280
     0.116667  1.650521  1.378140  1.197644
     0.143333  1.815617  1.600887  1.134132
     0.170000  1.854152  1.189545  1.558707
     0.196667  2.228401  1.405833  1.585111
     0.223333  2.535661  1.405362  1.804276
     0.250000  2.510236  1.341797  1.870802

To get a better threshold, fit a linear regression ratio ~ N + dup_frac + N**2 + dup_frac**2 and solve for where ratio < 1.

@CamDavidsonPilon
Owner Author

The squared terms don't add much, I'll drop them.

                            OLS Regression Results
==============================================================================
Dep. Variable:                  ratio   R-squared:                       0.758
Model:                            OLS   Adj. R-squared:                  0.750
Method:                 Least Squares   F-statistic:                     89.39
Date:                Wed, 02 Jan 2019   Prob (F-statistic):           2.67e-18
Time:                        14:57:49   Log-Likelihood:                 20.535
No. Observations:                  60   AIC:                            -35.07
Df Residuals:                      57   BIC:                            -28.79
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2345      0.058      4.037      0.000       0.118       0.351
N              0.0001   1.65e-05      7.171      0.000    8.55e-05       0.000
frac           3.3536      0.297     11.286      0.000       2.759       3.949
==============================================================================
Omnibus:                        7.346   Durbin-Watson:                   1.028
Prob(Omnibus):                  0.025   Jarque-Bera (JB):                9.072
Skew:                           0.432   Prob(JB):                       0.0107
Kurtosis:                       4.698   Cond. No.                     3.45e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.45e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
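Using the coefficients printed above, the ratio < 1 condition can be solved for the duplication fraction directly. A quick sketch (taking the N coefficient at its rounded printed value of 0.0001):

```python
# Fitted model: ratio ≈ const + b_N * N + b_frac * frac.
# Batch is predicted faster while ratio < 1, i.e.
#   frac < (1 - const - b_N * N) / b_frac
const, b_N, b_frac = 0.2345, 0.0001, 3.3536  # coefficients from the table above

def max_frac_for_batch(n_rows):
    """Largest duplication fraction at which batch is predicted to win."""
    return (1 - const - b_N * n_rows) / b_frac

for n in (432, 2160, 4320):
    print(n, round(max_frac_for_batch(n), 3))
```

For example, at N = 432 the predicted boundary is around frac ≈ 0.22, and it shrinks as N grows.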

@CamDavidsonPilon
Owner Author

CamDavidsonPilon commented Jan 2, 2019

More dataset size & frac combinations, adding an interaction term too:

                            OLS Regression Results
==============================================================================
Dep. Variable:                  ratio   R-squared:                       0.649
Model:                            OLS   Adj. R-squared:                  0.635
Method:                 Least Squares   F-statistic:                     46.89
Date:                Wed, 02 Jan 2019   Prob (F-statistic):           2.92e-17
Time:                        15:26:53   Log-Likelihood:                -35.826
No. Observations:                  80   AIC:                             79.65
Df Residuals:                      76   BIC:                             89.18
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4690      0.161      2.908      0.005       0.148       0.790
N           3.045e-05   4.46e-05      0.683      0.497   -5.83e-05       0.000
frac           2.3741      0.893      2.658      0.010       0.595       4.153
N * frac       0.0007      0.000      2.879      0.005       0.000       0.001
==============================================================================
Omnibus:                       26.028   Durbin-Watson:                   2.032
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               41.254
Skew:                           1.320   Prob(JB):                     1.10e-09
Kurtosis:                       5.325   Cond. No.                     7.62e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.62e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
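With the interaction term, the ratio < 1 boundary depends on N in both the intercept and the slope. A sketch using the coefficients printed above:

```python
# Fitted model: ratio ≈ const + b_N*N + b_f*frac + b_Nf*(N * frac).
# Solving predicted ratio < 1 for frac:
#   frac < (1 - const - b_N * N) / (b_f + b_Nf * N)
const, b_N, b_f, b_Nf = 0.4690, 3.045e-05, 2.3741, 0.0007  # from the table above

def max_frac_interaction(n_rows):
    """Largest duplication fraction at which batch is predicted to win."""
    return (1 - const - b_N * n_rows) / (b_f + b_Nf * n_rows)

for n in (432, 4320, 21600):
    print(n, round(max_frac_interaction(n), 3))
```

At N = 21600 the bound goes negative, i.e. batch is never predicted to win at that size, which is consistent with the first timing table.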

@CamDavidsonPilon
Owner Author

CamDavidsonPilon commented Jan 2, 2019

Returning to the original test, CoxPHFitter is now faster. Takes ~1.0 seconds.

@CamDavidsonPilon
Owner Author

Down to 0.67 with #595

@CamDavidsonPilon
Owner Author

Down to ~0.17 with #609

@CamDavidsonPilon CamDavidsonPilon unpinned this issue Feb 11, 2019