
Speeding up Aalen Additive Regression #421

Closed
springcoil opened this issue Mar 8, 2018 · 15 comments · Fixed by #604


@springcoil

Hi, I've been working on a project for a few months now, and one problem I have is that fitting can take about 4 days on 340k rows with about 6 features.

I know lifelines isn't necessarily designed for this scale, and I've discovered that the ridge regression solve step is the biggest bottleneck: about 60% of the compute time is spent there.

Are there alternative algorithms I could use instead of the ridge regression, say something mini-batch?
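
For reference, here's roughly how I confirmed where `fit()` spends its time (a sketch; `df`, `"T"`, and `"E"` are placeholders for my actual dataframe and its duration/event columns):

```python
import cProfile
import pstats

from lifelines import AalenAdditiveFitter

# Sketch: profile the fit to see where the time goes. `df`, "T", and "E"
# stand in for the real dataframe and its duration/event columns.
aaf = AalenAdditiveFitter(coef_penalizer=0.5)
cProfile.runctx('aaf.fit(df, duration_col="T", event_col="E")',
                globals(), locals(), "fit.prof")
pstats.Stats("fit.prof").sort_stats("cumulative").print_stats(15)
```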

@CamDavidsonPilon
Owner

Four days!? That is beyond inappropriate. I'll look into this for the next release.

@springcoil
Author

https://stats.stackexchange.com/questions/83272/fastest-way-to-run-ridge-regression-on-large-datasets-where-np makes me think this might be a problem with my BLAS/LAPACK. Do you want me to check which ones I'm using?

@springcoil
Author

What sort of debugging information would you need from me? How can I help you?

@springcoil
Author

springcoil commented Mar 8, 2018

Here are the specs/versions from the server I was running this on. Let me know if that helps :)

NumPy version:     1.11.3
Python version:    2.7.12 | packaged by conda-forge | (default, Sep  8 2016, 14:22:31) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
Platform:          linux2-x86_64
AMD/Intel CPU?     True
VML available?     False
Number of threads used by default: 8 (out of 24 detected cores)

>>> np.__config__.show() 

lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/net/DataScience_public/jupyter_virtual_envs/python2/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/net/DataScience_public/jupyter_virtual_envs/python2/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/net/DataScience_public/jupyter_virtual_envs/python2/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/net/DataScience_public/jupyter_virtual_envs/python2/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_mkl_info:
  NOT AVAILABLE

@springcoil
Author

Is https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/fitters/aalen_additive_fitter.py#L188 correct? It seems to me it's predicting per row in the table, not per time event.

@CamDavidsonPilon
Owner

> It seems to me it's predicting per row in the table, not per time event.

The for loop iterates over the times: https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/fitters/aalen_additive_fitter.py#L170
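
To make that concrete, here's a simplified sketch (not the exact lifelines code) of the step at each event time: the event indicators for the at-risk rows are regressed on their covariates with an L2 penalty, and the solution is the increment of the cumulative coefficients at that time.

```python
import numpy as np

def ridge_increment(X_t, y_t, penalizer=0.5):
    # Simplified sketch of one time step: regress the event indicator
    # y_t on the at-risk covariates X_t with an L2 penalty; the solution
    # is the increment of the cumulative regression coefficients at t.
    d = X_t.shape[1]
    A = np.dot(X_t.T, X_t) + penalizer * np.eye(d)
    b = np.dot(X_t.T, y_t)
    return np.linalg.solve(A, b)
```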

@springcoil
Author

OK, I'll drop that bug report. Any idea how to speed this up, Cameron? 4 days isn't adequate.

@CamDavidsonPilon
Owner

For now, just sample down to a smaller number of observations as a stopgap.
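
Something along these lines (a sketch; `df`, `"T"`, and `"E"` stand in for your dataframe and its duration/event columns):

```python
from lifelines import AalenAdditiveFitter

# Sketch: fit on a random subsample while the performance work lands.
# `df`, "T", and "E" are placeholders for your data and column names.
sub = df.sample(n=50000, random_state=42)
aaf = AalenAdditiveFitter(coef_penalizer=0.5)
aaf.fit(sub, duration_col="T", event_col="E")
```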

@springcoil
Author

springcoil commented Mar 12, 2018

I've tried that, but it's not really suitable for my problem. Thanks though :) Let me know if I can help in any way :)

@CamDavidsonPilon
Owner

CamDavidsonPilon commented Mar 27, 2018

Looking at the profile of the code, most of the time is spent solving the least-squares problem. I've found some modest performance increases there (~50% faster for tall datasets), slated for the next release.
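
For the curious: one way to speed up a penalized least-squares solve on tall matrices (n >> d) is to reduce it to a d x d solve via the normal equations, instead of running a generic least-squares routine on the full n x d system. A sketch of the idea (not necessarily the exact change that will ship):

```python
import numpy as np

def ridge_lstsq(X, y, penalizer):
    # Generic route: augment X with penalty rows and run lstsq on the
    # full (n + d) x d system. Cost scales with n.
    d = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(penalizer) * np.eye(d)])
    y_aug = np.concatenate([y, np.zeros(d)])
    return np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

def ridge_normal_eq(X, y, penalizer):
    # Tall-data route: form the d x d Gram matrix once and solve.
    # Much faster when n >> d, at the cost of squaring the condition
    # number of the problem.
    d = X.shape[1]
    A = np.dot(X.T, X) + penalizer * np.eye(d)
    return np.linalg.solve(A, np.dot(X.T, y))
```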

@springcoil
Author

springcoil commented Mar 27, 2018 via email

@CamDavidsonPilon
Owner

I don't know what gradient boosting would solve here. The linear regression step is part of the inference algorithm itself.

@springcoil
Author

Just so I understand, are you saying that gradient boosting wouldn't work for the inference step?

@CamDavidsonPilon
Owner

That's correct.

@springcoil
Author

Ah, that makes sense. It looks like your change to the solving step is the right approach. I suspect there are other performance improvements to be found. Looking forward to the new release :)
