Explore whether using pointers to const data as arg to CUDA kernels improves performance #637

Closed
mattldawson opened this issue Aug 29, 2024 · 2 comments · Fixed by #671
@mattldawson
Collaborator

mattldawson commented Aug 29, 2024

Currently, CUDA kernels are passed const structs such as LuDecomposeParam, which contain pointers to arrays of indices. However, the array element types themselves are not const. Explore whether a struct like ConstLuDecomposeParam, whose data members are pointers to const data, improves pre-fetching of data from memory during kernel execution when it is used as the CUDA kernel argument type (with appropriate assignment operator overloads).

Possible definition of new struct:

struct ConstLuDecomposeParam
{
  // Pair element types are left non-const so these pointers can be initialized from
  // LuDecomposeParam's pointers (assumed here to be std::pair<std::size_t, std::size_t>*).
  // The pointers themselves are non-const so the struct stays assignable.
  const std::pair<std::size_t, std::size_t>* niLU_;
  const char* do_aik_;
  const std::size_t* aik_;
  const std::pair<std::size_t, std::size_t>* uik_nkj_;
  const std::pair<std::size_t, std::size_t>* lij_ujk_;
  const char* do_aki_;
  const std::size_t* aki_;
  const std::pair<std::size_t, std::size_t>* lki_nkj_;
  const std::pair<std::size_t, std::size_t>* lkj_uji_;
  const std::size_t* uii_;
  std::size_t niLU_size_;
  std::size_t do_aik_size_;
  std::size_t aik_size_;
  std::size_t uik_nkj_size_;
  std::size_t lij_ujk_size_;
  std::size_t do_aki_size_;
  std::size_t aki_size_;
  std::size_t lki_nkj_size_;
  std::size_t lkj_uji_size_;
  std::size_t uii_size_;

  ConstLuDecomposeParam(const LuDecomposeParam& other) :
    niLU_(other.niLU_),
    ...
    {}
};

Alternatively, we could replace the existing LuDecomposeParam struct with the one above (without the constructor) and see whether that works too.
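To make the conversion idea concrete, here is a minimal host-only C++ sketch. MiniParam, ConstMiniParam, and SumFirstIndices are hypothetical names invented for illustration; they are a two-array stand-in for the real structs, not the project's code, and the function is an ordinary C++ stand-in for a kernel body. It only shows that a struct of pointers to const data can be built implicitly from the mutable one and remains assignable.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical, trimmed-down stand-in for LuDecomposeParam (two arrays only).
struct MiniParam
{
  std::pair<std::size_t, std::size_t>* niLU_ = nullptr;
  char* do_aik_ = nullptr;
  std::size_t niLU_size_ = 0;
  std::size_t do_aik_size_ = 0;
};

// Read-only view: same members, but every pointer targets const data.
struct ConstMiniParam
{
  const std::pair<std::size_t, std::size_t>* niLU_ = nullptr;
  const char* do_aik_ = nullptr;
  std::size_t niLU_size_ = 0;
  std::size_t do_aik_size_ = 0;

  ConstMiniParam() = default;

  // Converting constructor: each T* converts implicitly to const T*.
  ConstMiniParam(const MiniParam& other)
      : niLU_(other.niLU_),
        do_aik_(other.do_aik_),
        niLU_size_(other.niLU_size_),
        do_aik_size_(other.do_aik_size_)
  {
  }
};

// The pointer members are not const themselves, so assignment still works.
static_assert(std::is_copy_assignable<ConstMiniParam>::value, "view must stay assignable");

// Host-side stand-in for a kernel body that only reads the index arrays.
std::size_t SumFirstIndices(const ConstMiniParam param)
{
  assert(param.niLU_ != nullptr || param.niLU_size_ == 0);
  std::size_t total = 0;
  for (std::size_t i = 0; i < param.niLU_size_; ++i)
    total += param.niLU_[i].first;
  // param.niLU_[0].first = 0;  // would not compile: the pointed-to data is const
  return total;
}
```

With this shape, a caller holding a MiniParam can pass it directly wherever a ConstMiniParam is expected. In an actual CUDA kernel, pointers to const data (often combined with `__restrict__`) may allow the compiler to route loads through the read-only data cache; whether that helps here is something only measurement can show.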

@sjsprecious
Collaborator

Matt and I quickly tested it on Derecho's A100 GPU. We observed no significant difference in GPU performance.

@sjsprecious
Collaborator

@mattldawson My bad! After adding the const qualifier to all the CUDA kernels, the CUDA unit tests run significantly faster, so I think it is worth adding here.
