Workaround current CUDA/HIP "solution suspicious" bug...#381
abouteiller merged 1 commit into ICLDisco:master
bosilca
left a comment
I don't think the comment is correct with regard to "data coming from memory", as all tiles eventually come from memory. Input data for a task has no source_repo_entry iff the task has no predecessor, i.e. it has direct access to the data_collection_t.
Do you have a simple example that highlights this bug?
abouteiller
left a comment
This resolves the failure to compute accurate results in DPLASMA with CUDA in DPOTRF with 1 and 2 GPUs.
This also fixes all ctest defects.
The introduction of the NEW optimization in CUDA introduced this bug, which makes the CUDA driver skip the copy from RAM to GPU when the data copy comes from a direct memory access. The test in the code that detects a NEW tile is wrong: it confuses copies coming directly from memory with copies coming from a NEW operation. The proper way to detect NEW data is to check that the data collection is NULL (meaning it is not a direct access from the data collection) AND that the repo_entry is NULL (meaning it has no predecessor). Tested with DPLASMA dpotrf on leconte.
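The check described above can be sketched as a small predicate. This is only an illustration under stated assumptions: `input_flow_t` and `input_is_new` are hypothetical names, not PaRSEC types or functions, and the two pointer fields merely stand in for the data collection and repo_entry mentioned in the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one task input; this is NOT the PaRSEC structure. */
typedef struct {
    const void *data_collection;   /* non-NULL: direct access from the data collection */
    const void *source_repo_entry; /* non-NULL: the input was produced by a predecessor */
} input_flow_t;

/* An input is treated as NEW only when it has neither origin:
 * no data collection AND no repo entry. */
static bool input_is_new(const input_flow_t *in)
{
    assert(in != NULL);
    return in->data_collection == NULL && in->source_repo_entry == NULL;
}
```

With this predicate, an input read directly from memory (data collection set, no repo entry) is not classified as NEW, so it would still be copied to the device.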
Tested with DPLASMA dpotrf on leconte. Updated the commit. This is ready for review.
The introduction of the NEW optimization in CUDA introduced this bug,
which makes the CUDA driver skip the copy from RAM to GPU when the data
copy comes from a direct memory access. The test in the code that
detects a NEW tile is wrong: it confuses copies coming directly from
memory with copies coming from a NEW operation.
I'm not sure how to detect that a tile is NEW at this time... This
patch is a temporary workaround that forces data to be copied to the
CUDA device whether it comes from memory (correct) or from a NEW
operation (useless overhead). It is more correct to always do the
copy than to never do it...
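
The trade-off in this workaround can be sketched as follows. This is an illustrative model only: `tile_origin_t`, `ideal_needs_copy`, `workaround_needs_copy`, and `copy_is_wasted` are hypothetical names introduced here, not code from the patch, and only the two origins discussed in this message are modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of where an input tile comes from. */
typedef enum { ORIGIN_MEMORY, ORIGIN_NEW } tile_origin_t;

/* What an exact NEW detection would decide: only tiles that already
 * hold data in memory need a host-to-device copy. */
static bool ideal_needs_copy(tile_origin_t origin)
{
    assert(origin == ORIGIN_MEMORY || origin == ORIGIN_NEW);
    return origin == ORIGIN_MEMORY;
}

/* What the workaround decides: always copy, since the origin
 * cannot be told apart reliably yet. */
static bool workaround_needs_copy(tile_origin_t origin)
{
    (void)origin; /* deliberately ignored */
    return true;
}

/* A copy is wasted when the workaround performs it but an exact test would not. */
static bool copy_is_wasted(tile_origin_t origin)
{
    return workaround_needs_copy(origin) && !ideal_needs_copy(origin);
}
```

In this model the workaround never skips a required copy; the only cost is the extra transfer for tiles that come from a NEW operation.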