Skip to content

Workaround current CUDA/HIP "solution suspicious" bug...#381

Merged
abouteiller merged 1 commit intoICLDisco:masterfrom
therault:hotfix
Jun 20, 2022
Merged

Workaround current CUDA/HIP "solution suspicious" bug...#381
abouteiller merged 1 commit intoICLDisco:masterfrom
therault:hotfix

Conversation

@therault
Copy link
Copy Markdown
Contributor

Introduction of the NEW optimization in CUDA introduced this bug that
makes the CUDA driver not copy data from RAM to GPU, if the data copy
comes from a direct memory access. The test to detect this is a NEW
tile is wrong in the code, and abusively confuses copies coming
directly from memory with copies coming from a NEW operation.

I'm not sure how to detect a tile is NEW at this time... This patch
is a temporary workaround that forces data copies to be copied on
the CUDA device, if they come from memory (correct), or if they
come from a NEW operation (useless overhead). But it's more correct
to do the copy all the time than to not do it at all...

@therault therault requested a review from a team as a code owner May 13, 2022 18:41
Copy link
Copy Markdown
Contributor

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the comment is correct with regard to "data coming from memory" as all all tiles eventually come from memory. Input data for tasks do not have a source_repo_entry iff the task does not have a predecessor, aka it has direct access to the data_collection_t.

@bosilca
Copy link
Copy Markdown
Contributor

bosilca commented Jun 8, 2022

Do you have a simple example that highlight this bug?

@therault
Copy link
Copy Markdown
Contributor Author

therault commented Jun 8, 2022

testing_dpotrf -N 100 -t 100 -g 1 -x in dplasma. The direct access of A(k, k) on the GPU is confused with a NEW, and the dpotrf kernel is called on a 0 tile, triggering an error (and of course factorization is suspicious).

Copy link
Copy Markdown
Contributor

@abouteiller abouteiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does resolve failure to compute accurate results in dplasma with CUDA in DPOTRF with 1 and 2 GPUs.

This also fix all ctest defects

@abouteiller abouteiller linked an issue Jun 9, 2022 that may be closed by this pull request
@abouteiller abouteiller added bug Something isn't working blocker Blocking release or critical use case labels Jun 10, 2022
@abouteiller abouteiller added this to the v4.0 milestone Jun 10, 2022
@abouteiller abouteiller requested a review from bosilca June 10, 2022 22:53
@abouteiller abouteiller changed the title Workaround current CUDA bug... Workaround current CUDA/HIP "solution suspicious" bug... Jun 14, 2022
Introduction of the NEW optimization in CUDA introduced this bug that
makes the CUDA driver not copy data from RAM to GPU, if the data copy
comes from a direct memory access. The test to detect this is a NEW
tile is wrong in the code, and abusively confuses copies coming
directly from memory with copies coming from a NEW operation.

The proper way to detect this is a NEW data is to check if the
data collection is NULL (meaning it's not a direct access from the data
collection) AND if the repo_entry is NULL (meaning it doesn't have a
predecessor).

Tested with DPLASMA dpotrf on leconte.
@therault
Copy link
Copy Markdown
Contributor Author

The proper way to detect this is a NEW data is to check if the
data collection is NULL (meaning it's not a direct access from the data
collection) AND if the repo_entry is NULL (meaning it doesn't have a
predecessor).

Tested with DPLASMA dpotrf on leconte. Updated the commit.

This is ready for review.

@abouteiller abouteiller self-requested a review June 20, 2022 20:18
@abouteiller abouteiller merged commit e65249b into ICLDisco:master Jun 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

blocker Blocking release or critical use case bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Portf 2 gpus: suspicious solution

3 participants