
Scope of improvements in EnsembleGPUKernel #171

Closed
utkarsh530 opened this issue Aug 2, 2022 · 2 comments
Comments

utkarsh530 commented Aug 2, 2022

#170
The latest profile of solving with EnsembleGPUKernel raises some questions:

Some overheads are discussed here as potential improvements to EnsembleGPUKernel for Tsit5.

  1. Converting the solution back to CPU arrays.
    The reason for this overhead is to give users access to something like sol[i].u[j], where i, j are some indexes. Indexing directly would cause scalar indexing on ts and us, which are CuArrays.

Possible workaround: leave it to the user to convert to CPU arrays if they need to index the solution.
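A minimal sketch of that workaround, assuming CUDA.jl and a CUDA-capable GPU (`us_gpu` here is a hypothetical stand-in for the per-trajectory solution arrays):

```julia
using CUDA

# Stand-in for per-trajectory solution arrays living on the GPU.
us_gpu = [CUDA.rand(Float32, 4) for _ in 1:3]

# Scalar indexing a CuArray (e.g. us_gpu[1][2]) triggers slow one-element
# device-to-host transfers, so the user converts once, up front, and then
# indexes the CPU copies freely:
us_cpu = Array.(us_gpu)   # broadcasts the CuArray -> Array conversion
us_cpu[1][2]              # plain CPU scalar indexing
```

This keeps the hot path on the GPU and pays the transfer cost only when the user actually wants element access.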

  2. Ensemble problem creation for parameter parallelism.
    The probs creation within DiffEqGPU seems to be necessary, but maybe it could be pulled out of DiffEqGPU? It is currently done this way to adhere to how DiffEqGPU handles problems. This overhead did not show up in the previous benchmarks because ps was built separately and passed to the vectorized_solve.

Possible workaround: create ps or u0s and pass them into DiffEqGPU instead of specifying only the trajectories, and let the library handle the rest.
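A sketch of what that user-facing flow could look like. `vectorized_solve` is the existing lower-level entry point; the `remake`/solver call shown in the comments is an assumed interface for illustration only:

```julia
using StaticArrays

trajectories = 1_000

# The user builds the parameter ensemble up front, instead of passing only
# `trajectories` and letting the library rebuild each problem internally:
ps = [@SVector(rand(Float32, 3)) for _ in 1:trajectories]

# ...then hands `ps` (or analogously `u0s`) to the solver, e.g.
# (exact keyword interface assumed):
# probs = [remake(prob; p = p) for p in ps]
# ts, us = DiffEqGPU.vectorized_solve(probs, prob, GPUTsit5(); dt = 0.1f0)
```

Building `ps` once outside the library avoids repeating the problem-construction overhead on every ensemble solve.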

If we don’t convert to CPU arrays, we’ll get good performance (~2x faster), and if we also let the user build ps (instead of asking for the trajectories and building them ourselves), we’ll probably reach the desired benchmark.

@ChrisRackauckas (Member) commented

> Possible workaround: Leave it to the user to convert to CPU Arrays if it needs to index the solution.

We can make that an option (with a `Val` type). But we can also
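A minimal sketch of what that value-type option could look like (the function name `to_cpu` is hypothetical):

```julia
# Dispatch on a value type so the convert-or-not decision is resolved at
# compile time rather than by a runtime branch:
to_cpu(us, ::Val{true})  = Array.(us)  # copy every GPU array back to the host
to_cpu(us, ::Val{false}) = us          # leave the solution on the device

# With plain CPU arrays as a stand-in for CuArrays:
us = [rand(Float32, 4) for _ in 1:3]
to_cpu(us, Val(false)) === us          # untouched when conversion is off
```

Because the flag is in the type domain, the no-conversion path compiles away entirely.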

> The probs creation within the DiffEqGPU seems to be necessary, but maybe it could be pulled out of DiffEqGPU? Currently, it was done to adhere to the DiffEqGPU way of handling it. This was not coming in the previous benchmarks because ps was being built separately and passed to the vectorized_solve.

I think for that, we can have a documented lower-level API for people who really want to pull as much speed out as possible. On that note, we should make some real docs.

@utkarsh530 (Member, Author) commented

Sounds good to me. I will start writing some documentation for it. I can help set up a docs page for it, something aligned with SciMLDocs.
