Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Wrong result for miopenOpTensor() #79

Closed
ghostplant opened this issue Feb 1, 2019 · 20 comments
Closed

[BUG] Wrong result for miopenOpTensor() #79

ghostplant opened this issue Feb 1, 2019 · 20 comments
Assignees
Labels

Comments

@ghostplant
Copy link

ghostplant commented Feb 1, 2019

Source code attached below:

#include <hip/hip_runtime.h>
#include <hipblas.h>
#include <miopen/miopen.h>
#include <assert.h>
#include <vector>

using std::vector;

int main() {
  const int len = 9;

  hipSetDevice(0);
  miopenHandle_t miopen_handle;
  miopenCreate(&miopen_handle);

  miopenTensorDescriptor_t tensor;
  miopenCreateTensorDescriptor(&tensor);
  miopenSet4dTensorDescriptor(tensor, miopenFloat, len, 1, 1, 1);

  float *in, *out;
  hipMalloc((void**)&in, sizeof(float) * len);
  hipMalloc((void**)&out, sizeof(float) * len);

  float host_in[len] = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f}, host_out[len] = {};
  hipMemcpyHtoD(in, host_in, sizeof(float) * len);

  float alpha = 1.0f, beta = 0.0f;
  miopenOpTensor(miopen_handle, miopenTensorOpAdd, &alpha, tensor, in, &alpha, tensor, in, &beta, tensor, out);
  hipMemcpyDtoH(host_out, out, sizeof(float) * len);
  hipStreamSynchronize(0);

  // output: 2 nan nan 8 nan nan 14 nan nan
  // expect: 2 4 6 8 10 12 14 16 18
  for (int i = 0; i < len; ++i)
    printf("%g ", host_out[i]);
  puts("");
  return 0;
}

It outputs partial NaN results on gfx803 which is not expected, and this bug is reproducible every time.

@ghostplant
Copy link
Author

If using len = 8 instead of len = 9, the result would be correct.

@ghostplant
Copy link
Author

ghostplant commented Feb 1, 2019

The root cause seems to be a wrong execute of float3 data copy, the following patch fixes this bug:

NOTE: A typical *((float3*)y) = *((float3*)x) just copy the first float element and skip the second and third elements.

@@ -909,15 +910,24 @@ __kernel void Op4dTensorLite(const global MIOPEN_TYPE* a,
                              const long Coffset)
 {
     int gid0 = get_global_id(0);
-
     int index = gid0 * RD_BLCK;

     MIOPEN_TYPE a_dat[RD_BLCK];
     MIOPEN_TYPE b_dat[RD_BLCK];
     MIOPEN_TYPE c_dat[RD_BLCK];

+#if RD_BLCK == 3
+    a_dat[0] = (a + index + Aoffset)[0];
+    a_dat[1] = (a + index + Aoffset)[1];
+    a_dat[2] = (a + index + Aoffset)[2];
+    b_dat[0] = (b + index + Boffset)[0];
+    b_dat[1] = (b + index + Boffset)[1];
+    b_dat[2] = (b + index + Boffset)[2];
+#else
     *((READ_TYPE*)a_dat) = *((const global READ_TYPE*)(a + index + Aoffset));
     *((READ_TYPE*)b_dat) = *((const global READ_TYPE*)(b + index + Boffset));
+#endif
+
 #ifdef BETA
+ // Also need correct copy if RD_BLCK == 3
     *((READ_TYPE*)c_dat) = *((const global READ_TYPE*)(c + index + Coffset));
 #endif

@daniellowell daniellowell assigned ce1adon and unassigned asroy Feb 5, 2019
@ghostplant ghostplant changed the title [BUG] Wrong answer for miopenOpTensor() [BUG] Wrong result for miopenOpTensor() Feb 13, 2019
@mythreyi22
Copy link

The root cause seems to be a wrong execute of float3 data copy, the following patch fixes this bug:

NOTE: A typical *((float3*)y) = *((float3*)x) just copy the first float element and skip the second and third elements.

@@ -909,15 +910,24 @@ __kernel void Op4dTensorLite(const global MIOPEN_TYPE* a,
                              const long Coffset)
 {
     int gid0 = get_global_id(0);
-
     int index = gid0 * RD_BLCK;

     MIOPEN_TYPE a_dat[RD_BLCK];
     MIOPEN_TYPE b_dat[RD_BLCK];
     MIOPEN_TYPE c_dat[RD_BLCK];

+#if RD_BLCK == 3
+    a_dat[0] = (a + index + Aoffset)[0];
+    a_dat[1] = (a + index + Aoffset)[1];
+    a_dat[2] = (a + index + Aoffset)[2];
+    b_dat[0] = (b + index + Boffset)[0];
+    b_dat[1] = (b + index + Boffset)[1];
+    b_dat[2] = (b + index + Boffset)[2];
+#else
     *((READ_TYPE*)a_dat) = *((const global READ_TYPE*)(a + index + Aoffset));
     *((READ_TYPE*)b_dat) = *((const global READ_TYPE*)(b + index + Boffset));
+#endif
+
 #ifdef BETA
+ // Also need correct copy if RD_BLCK == 3
     *((READ_TYPE*)c_dat) = *((const global READ_TYPE*)(c + index + Coffset));
 #endif

This does not solve the issue. Facing the same issue after using this patch. Is it complete?

-Mythreyi

@ghostplant
Copy link
Author

ghostplant commented Mar 28, 2019

@mythreyi22 No, it is not complete cause every data cast to float3 (defined by READ_TYPE) should be expanded into 3 lines of float1 data copy.

@atamazov
Copy link
Contributor

float3 has stricter alignment than array of floats. That is why casting an address of array element to float3 ptr may yield invalid ptr.

@ghostplant
Copy link
Author

@atamazov So any better solution other than expanding these codes?

@atamazov
Copy link
Contributor

atamazov commented Mar 28, 2019

The root cause seems to be a wrong execute of float3 data copy

This is wrong data copy.

The fix might be performance waste.

Sorry.

@atamazov
Copy link
Contributor

atamazov commented Mar 28, 2019

Addresses of source & destination objects shall comply with float3/float4 alignment requirements if float3 ptrs are used.

We'll face the Undefined Behavior nightmare otherwise.

@atamazov
Copy link
Contributor

Unfortunately I can't see the whole context from mobile phone, and thus unable to give more specific/accurate recommendation.

@ghostplant
Copy link
Author

@atamazov The problem is batch of input data is supposed to be successively given and expected to be successively stored as well. If we follow the OpenCL float3 alignment, OpTensorAdd cannot fully handle adjacent data since 2 adjacent data of type-float3 would never meet the alignment requirements together.

@atamazov
Copy link
Contributor

atamazov commented Mar 28, 2019

2 adjacent data of type-float3 would never meet the alignment requirements together.

They always meet. Any float3 object has unused (hidden) 4th element at the end (gap). Alignments of float3 and float4 are the same.

If we follow the OpenCL float3 alignment...

We have UB if we don't))

@ghostplant
Copy link
Author

@atamazov So what is the specific float4 alignment requirements? e.g. addr % sizeof(float4) == 0?

@ghostplant
Copy link
Author

@atamazov Can you explain why a 9-successive-float array can satisfy the alignment together? If the starting address of the array satisfy the alignment, who will manage the calculation of the 4th element in this array?

@ce1adon
Copy link
Contributor

ce1adon commented Apr 1, 2019

@ghostplant @mythreyi22
Sorry for the inconvenience!
patch.zip

Please replace your MIOpen/src/ocl/tensorocl.cpp with attached file.
Also please update when next version MIOpen comes out.
Thank you!

@atamazov
Copy link
Contributor

atamazov commented Apr 1, 2019

@ce1adon This is diff between latest 1.7 release and contents of patch.zip you've uploaded. Is it correct?

issue-79-patch-001.diff.txt

@ce1adon
Copy link
Contributor

ce1adon commented Apr 1, 2019

@atamazov This is based on master branch of this repo.

@atamazov
Copy link
Contributor

atamazov commented Apr 1, 2019

@ce1adon Most likely it is ~= 1.7.x. If the diff is correct (I think it is), then it could be applied (which seems more common than overwriting the whole file).

@mythreyi22
Copy link

@ce1adon, Thanks!

@FelixSchwarz
Copy link

MIOpen 1.8 was released and lists this issue as fixed. Based on the patch from @ce1adon I think the exact commit was 903b1f1 .

@ghostplant
Copy link
Author

ghostplant commented Apr 13, 2019

Confirmed fixed. But found another bug from miopenTransformTensor to calculate y = 0.7 * y + 0.4 * x whose answer is also wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants