Integrate rocPRIM 0.3.1 milestone #15

iotamudelta · 2018-06-07T15:20:16Z

A couple of notes:

rocPRIM is still marked experimental
in this PR, most of the reduction kernels, l2loss, and softmax are converted from cub to rocPRIM
complex types are not supported out of the box - hence no complex reduction yet
some cub kernels are not yet converted (where, topk, ...) due to issues w/ rocPRIM and/or the TF interface to cub. There will be follow-up PRs for these.
all rocPRIM kernels in this PR are marked P (for in Progress) in the documentation until we have confirmed it works and the patch is accepted, then I'll mark done

…he snapshot referenced here is modified from that.

…ader using cub. Besides "regular" conversion, note that the CleanupSegments kernel must have its template parameters specified for hcc (and it won't hurt for nvcc).

…h rocPRIM.

…ation of docker tests.

Apply new macro/type alias definitions to l2loss, reduction, softmax where applicable.

whchung · 2018-06-07T15:23:45Z

tensorflow/core/kernels/reduction_gpu_kernels.cu.h

@@ -13,17 +13,21 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/

-#if GOOGLE_CUDA
+#if defined GOOGLE_CUDA || defined TENSORFLOW_USE_ROCM


how about #if GOOGLE_CUDA || TENSORFLOW_USE_ROCM? easier for future global search-and-replace

whchung · 2018-06-07T15:25:54Z

tensorflow/core/kernels/reduction_ops_gpu_float.cu.cc

@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/

-#if GOOGLE_CUDA
+#if defined GOOGLE_CUDA || defined TENSORFLOW_USE_ROCM


whchung · 2018-06-07T15:27:42Z

tensorflow/core/kernels/reduction_ops_mean.cc

@@ -51,8 +51,10 @@ TF_CALL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
                          ReductionOp<GPUDevice, type, int64,               \
                                      Eigen::internal::MeanReducer<type>>);
 TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNELS);
+#if GOOGLE_CUDA
 TF_CALL_complex64(REGISTER_GPU_KERNELS);


either leave comment here, or change core_kernels.md so we can revisit later?

The are marked as not done in core_kernels.md .

whchung · 2018-06-07T15:27:52Z

tensorflow/core/kernels/reduction_ops_prod.cc

@@ -52,8 +52,10 @@ TF_CALL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
                                      Eigen::internal::ProdReducer<type>>);
 TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNELS);
 TF_CALL_int32(REGISTER_GPU_KERNELS);
+#if GOOGLE_CUDA


whchung · 2018-06-07T15:28:00Z

tensorflow/core/kernels/reduction_ops_sum.cc

@@ -51,8 +51,10 @@ TF_CALL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
          .HostMemory("reduction_indices"),                                    \
      ReductionOp<GPUDevice, type, int64, Eigen::internal::SumReducer<type>>);
 TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNELS);
+#if GOOGLE_CUDA


whchung · 2018-06-07T15:29:06Z

tensorflow/core/kernels/reduction_gpu_kernels.cu.h

            in, (T*)temp_storage.flat<int8_t>().data(), in_size, op, init);

    // take care that we only reduce blocks that had some valid elements in them
    // TODO(eriche): CUB currently has a bug in HeadSegmentedReduce that
    // requires it to be used with a full warp.  Can reduce 32 -> num_blocks
    // when this is fixed.
-    CleanupSegments<<<1, 32, 0, cu_stream>>>(
-        (T*)temp_storage.flat<int8_t>().data(), out, 1, 1, num_blocks, op,
+    GPU_LAUNCH_KERNEL(CleanupSegments<T*,OUT_T,Op>, dim3(1), dim3(32), 0, cu_stream,


is this kernel launch needed in case of TENSORFLOW_USE_ROCM?

Unclear, but I wanted to have a working, straight conversion first and worry about performance next.

whchung · 2018-06-07T15:29:38Z

tensorflow/core/kernels/reduction_gpu_kernels.cu.h

@@ -541,34 +557,34 @@ void LaunchScalarReduction(OpKernelContext* ctx, OUT_T out, IN_T in,
 template <typename T, typename Op, typename OUT_T, typename IN_T>
 void LaunchRowReduction(OpKernelContext* ctx, OUT_T out, IN_T in, int num_rows,
                        int num_cols, Op op, T init,
-                        const cudaStream_t& cu_stream) {
+                        const gpuStream_t& cu_stream) {
  if (num_cols < 1024) {
    const int threads_per_block = 128;
    const int warps_per_block = threads_per_block / 32;


should it be 64 on AMD hardware?

probably - but for now wanted to get a clean patch in that works and worry about performance later?

i understand. but depending on the implementation of the kernel we might even have incorrect results instead of just performing slower.

fair enough - let's see what the CI has to say. Either way I am committed to getting performance optimal on our HW next.

whchung · 2018-06-07T15:30:26Z

tensorflow/core/kernels/reduction_gpu_kernels.cu.h

      transform_iter(counting_iter, row_offset_op);

  std::size_t temp_storage_bytes = 0;
  Tensor temp_storage;
  for (int i = 0; i < 2; ++i) {
-    auto success = cub::DeviceSegmentedReduce::Reduce(
+    auto success = gpuprim::DeviceSegmentedReduce::Reduce(
        i == 0 ? nullptr : temp_storage.flat<int8_t>().data(),
        temp_storage_bytes, in, out, num_rows, transform_iter,
        transform_iter + 1, op, init, cu_stream);

    OP_REQUIRES(ctx, success == 0,
                errors::Internal("CUB segmented reduce error",


the message could use some change i think as it's not solely for CUB now

I was wondering about this but left them intact as technically we are using the hipcub API as well.

whchung · 2018-06-07T15:32:38Z

tensorflow/core/kernels/reduction_gpu_kernels.cu.h

@@ -644,22 +660,22 @@ void LaunchColumnReduction_LTE4096Cols(OpKernelContext* ctx, OUT_T out, IN_T in,
                                          sizeof(T) * extent_y * grid_dim.y)}),
                                      &temp_storage));

-    ColumnReduceKernel<<<grid_dim, block_dim, 0, cu_stream>>>(
+    GPU_LAUNCH_KERNEL(ColumnReduceKernel, dim3(grid_dim), dim3(block_dim), 0, cu_stream,
        in, (T*)temp_storage.flat<int8_t>().data(), extent_x, extent_y, op,
        init);

    dim3 new_grid_dim((grid_dim.y * extent_y + 31) / 32, 1, 1);


shall the grid dimension calculation take into consideration that AMD wavefront has 64 threads instead of 32?

whchung · 2018-06-07T15:33:09Z

tensorflow/core/kernels/reduction_gpu_kernels.cu.h

@@ -607,22 +623,22 @@ void LaunchColumnReduction_LTE16Cols(OpKernelContext* ctx, OUT_T out, IN_T in,
                                      TensorShape({static_cast<int64>(
                                          sizeof(T) * extent_y * grid_dim.y)}),
                                      &temp_storage));
-    ColumnReduceMax16ColumnsKernel<<<grid_dim, block_dim, 0, cu_stream>>>(
+    GPU_LAUNCH_KERNEL(ColumnReduceMax16ColumnsKernel, dim3(grid_dim), dim3(block_dim), 0, cu_stream,
        in, (T*)temp_storage.flat<int8_t>().data(), extent_x, extent_y, op,
        init);

    dim3 new_grid_dim((grid_dim.y * extent_y + 31) / 32, 1, 1);


shall the grid dimension calculation take into consideration that AMD wavefront has 64 threads instead of 32?

whchung · 2018-06-07T15:36:15Z

@iotamudelta I've completed initial sweep of the PR. Let's take CI to do its job. I'm in particularly worried about the diff between number of threads running concurrently on AMD and NV hardware (32/64) and how it might affect configs for certain kernel launch sites.

parallelo · 2018-06-07T19:28:09Z

@iotamudelta Jenkins has multiple duplicate jobs in the queue for this PR; we need to change some settings it seems.

Since at least one job has already completed today, I'm going to kill the follow-on jobs.

For your reference, here's one of the completed jobs from today:
http://205.234.28.200:21096/job/tensorflow-upstream-unit-tests/68/console

Executed 458 out of 458 tests: 414 tests pass and 44 fail locally.

whchung · 2018-06-11T20:58:36Z

build

@@ -11,4 +11,4 @@
 pip uninstall -y tensorflow || true
 bazel build --config=opt --config=rocm //tensorflow/tools/pip_package:build_pip_package --verbose_failures &&
 bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg &&
-pip install /tmp/tensorflow_pkg/tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl
+pip install /tmp/tensorflow_pkg/tensorflow-1.8.0rc1-cp27-cp27mu-linux_x86_64.whl


this is incorrect, should drop rc1

whchung · 2018-06-11T20:58:45Z

build_python3

@@ -11,4 +11,4 @@
 pip3 uninstall -y tensorflow || true
 bazel build --config=opt --config=rocm //tensorflow/tools/pip_package:build_pip_package --verbose_failures &&
 bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg &&
-pip3 install /tmp/tensorflow_pkg/tensorflow-1.8.0-cp35-cp35m-linux_x86_64.whl
+pip3 install /tmp/tensorflow_pkg/tensorflow-1.8.0rc1-cp35-cp35m-linux_x86_64.whl


this is incorrect, should drop rc1

…d was lost.

whchung · 2018-06-11T21:02:23Z

tensorflow/core/kernels/reduction_gpu_kernels.cu.h

  if (extent_y <= 16) {
    LaunchColumnReduction_LTE16Cols(ctx, out, in, extent_x, extent_y, op, init,
                                    cu_stream);
  } else if (extent_y <= 4096) {
-    LaunchColumnReduction_LTE4096Cols(ctx, out, in, extent_x, extent_y, op,
+    LaunchColumnReduction_LTE4096Cols<T, Op, OUT_T, IN_T>(ctx, out, in, extent_x, extent_y, op,


@iotamudelta it seems this is the only call path which leads to the crashing kernel? can we first check if the more native path, ColumnReduceSimpleKernel, work, perhaps by commenting out this path and mark it as FIXME?

BTW from the look of the implementation cu_stream ought to be right. could you post links to IR / ISA dump here?

my current suspicion is that the code object for the offending kernel might somehow be corrupted. so as an initial experiment let's try skip the path and see if we get the ball rolling.

iotamudelta · 2018-06-11T21:03:19Z

that must have snug in when I merged, sorry.

BTW: I am seeing different cp values on my machine compared to the build file (on py3 36 vs 35)

iotamudelta · 2018-06-11T21:58:27Z

@parallelo last CI failure looks to be something else?

ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Could not init /home/jenkins/workspace/tensorflow-upstream-unit-tests

parallelo · 2018-06-11T22:00:40Z

Yeah, sorry about that. Just added a new node to CI. Unfortunately, it wasn't fully set up, and it chomped through two builds (93 & 94) without actually making progress. But build 92 looked pretty good (minus 4 errors).

Jenkins: retest this please

parallelo · 2018-06-12T01:52:30Z

Re-ran the test, and looks like there are two regressions in build 95:

//tensorflow/examples/tutorials/mnist:fully_connected_feed_test 
//tensorflow/examples/tutorials/mnist:mnist_with_summaries_test

They both appear to have the HSA invalid arg signature:

### HCC STATUS_CHECK Error: HSA_STATUS_ERROR_INVALID_ARGUMENT (0x1001) at file:mcwamp_hsa.cpp line:1165

whchung · 2018-06-12T02:47:41Z

@iotamudelta so there are other kernels bumping into the very same issue.

Let’s first find out what kernels are causing the issues usin env vars. Disable them and try use the simpler variations and see if we get tests passing. Then collect IR and ISA dumps so I can study if we are bumping into lower-level issues.

…n ROCm, this fixes for me the two MNIST failures found by CI.

iotamudelta · 2018-06-12T15:07:43Z

@whchung I've disabled the other specialized column reduce branch now - that fixes the issue. The common denominator I can see between the two kernels that ultimately fail is the scratch space allocation. I'll now get into recompiling to dump isa/ir for them and will append to this issue.

Do you want to merge the current status once CI is sorted out and then have a follow-up PR once we sorted the bugs out in the specialized cases?

whchung · 2018-06-12T15:10:29Z

@iotamudelta In the latest commit, it appears we disabled the following paths:

LaunchColumnReduction_LTE16Cols
LaunchColumnReduction_LTE4096Cols

and use the default, presumably slower, ColumnReduceSimpleKernel. Let's see what CI says about this change.

Meanwhile please try collect IR / ISA dump for the offending kernels. I think I'll also need their binary versions as well, as the error message from ROCR runtime suggests the issue be something of more lower level.

whchung · 2018-06-12T15:12:13Z

@iotamudelta we can merge the PR shall CI agrees, and leave perf improvements in follow-up PRs.

iotamudelta · 2018-06-12T15:56:05Z

@whchung Here are the dumps for an MNIST run. Please note that they are bzip2 compressed (github doesn't like the bz2 ending, so renamed to zip). Let me know if you need more!

I like the new crosstool_wrapper_driver. Makes dumping easier. :-)

dump.linked.bc.zip
dump.selected.bc.zip
dump-gfx900.hsaco.zip
dump-gfx900.isa.zip
dump-gfx900.opt.bc.zip

iotamudelta · 2018-06-12T16:02:12Z

green check mark! 👍

VincentSC · 2018-06-12T17:40:32Z

FYI: we released another dev-branch https://github.com/ROCmSoftwarePlatform/rocPRIM/tree/v0.3.2
Final version v1.0 is expected on Friday.

whchung · 2018-06-12T20:53:13Z

@iotamudelta none of the attachment could really be unzipeed. Could you help double check?

iotamudelta · 2018-06-12T21:52:22Z

Works for me. See note: bunzip2 , not unzip

whchung · 2018-06-12T22:06:22Z

Ok I didn’t noticed the note. Let me try them later

Integrate rocPRIM 0.3.1 milestone

VincentSC · 2018-06-18T10:02:42Z

Final version of rocPRIM is on Github: https://github.com/ROCmSoftwarePlatform/rocPRIM/

On some CI nodes (typically those with higher CPU core counts 128/256), the `//tensorflow/c/eager:c_api_distributed_test_gpu` test fails on an intermitent basis. When it does fail, the failures manifests as segfault at the end of the test, with the stack dump shown at the end of this commit message. The stack dump points the finger to a routine within the MKLDNN implementation. This is further confirmed by the observation that disabling the MKLDNN based Eigen contraction kernels (for ROCm) seems to make the crash go away. related JIRA ticket - https://ontrack-internal.amd.com/browse/SWDEV-313684 A previous commit disabled the `//tensorflow/c/eager:c_api_distributed_test` unit-test only in the CPU unit-tests CI job (for the same reason). That comit cannot be reverted, because this commit disables MKLDNN based Eigen contraction kernels *only* for the ROCm build. ``` Thread 191 "c_api_distribut" received signal SIGSEGV, Segmentation fault. [Switching to thread 191 (Thread 0x7ffc777fe700 (LWP 159004))] 0x00007fff54530000 in ?? () (gdb) where #0 0x00007fff54530000 in ?? () #1 0x00007fffd5d15ae4 in dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver(char const*, char const*, long, long, long, float const*, float const*, long, float const*, long, float const*, float*, long, float const*, float*) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so #2 0x00007fffd5d166e1 in dnnl::impl::cpu::x64::jit_avx_gemm_f32(int, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so #3 0x00007fffd5e277ed in dnnl_status_t dnnl::impl::cpu::x64::gemm_driver<float, float, float>(char const*, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, float const*, long const*, float const*, float const*, float*, long const*, float const*, bool, dnnl::impl::cpu::x64::pack_type, dnnl::impl::cpu::x64::gemm_pack_storage_t*, bool) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so #4 0x00007fffd5665056 in dnnl::impl::cpu::extended_sgemm(char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*, bool) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so #5 0x00007fffd52fe983 in dnnl_sgemm () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so #6 0x0000555557187b0b in Eigen::internal::TensorContractionKernel<float, float, float, long, Eigen::internal::blas_data_mapper<float, long, 0, 0, 1>, Eigen::internal::TensorContractionInputMapper<float, long, 1, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer>, Eigen::internal::TensorContractionInputMapper<float, long, 0, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer> >::invoke(Eigen::internal::blas_data_mapper<float, long, 0, 0, 1> const&, Eigen::internal::ColMajorBlock<float, long> const&, Eigen::internal::ColMajorBlock<float, long> const&, long, long, long, float, float) () #7 0x000055555718dc76 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::kernel(long, long, long, bool) () #8 0x000055555718f327 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::signal_kernel(long, long, long, bool, bool) () #9 0x00005555571904cb in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::pack_rhs(long, long) () #10 0x000055555718fd69 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::enqueue_packing_helper(long, long, long, bool) () #11 0x00007ffff6b607a1 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2 #12 0x00007ffff6b5de93 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2 #13 0x00007ffff6b40107 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2 #14 0x00007fffd1ca86db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #15 0x00007fffd00b471f in clone () from /lib/x86_64-linux-gnu/libc.so.6 ```

iotamudelta added 14 commits May 29, 2018 15:23

Use an internal rocPRIM snapshot to proceed with porting. Note that t…

ce9adf0

…he snapshot referenced here is modified from that.

With the modified snapshot of rocPRIM, convert the basic reduction he…

95fe890

…ader using cub. Besides "regular" conversion, note that the CleanupSegments kernel must have its template parameters specified for hcc (and it won't hurt for nvcc).

Reduction operations for a variety of numerical types now compile wit…

3d677c7

…h rocPRIM.

Enable GPU l2loss through reduction/rocPRIM.

dcc3064

Merge branch 'rocprim-integration' into rocprim-master

952b825

Use newer rocPRIM snapshot and change filepath to root dir in anticip…

a1a5128

…ation of docker tests.

Register a bunch of reduction based kernels for GPU.

b813353

Document kernel progress (P) for cub kernels here.

d5d4030

Do not register kernels for complex type yet.

8c2b8cb

Also here, do not register complex kernels yet.

ccf2c87

softmax can be enabled if reduction_op is made a general dependency

1240a0f

Merge branch 'develop-upstream' into rocprim-master

553a9b6

Apply new macro/type alias definitions to l2loss, reduction, softmax where applicable.

Fix formatting caused by merge conflict.

2f51bf4

Use the rocPRIM 0.3.1 milestone available from github.

526021c

iotamudelta requested review from whchung and dagamayank June 7, 2018 15:20

whchung reviewed Jun 7, 2018

View reviewed changes

Remove defined leftover.

4e105a6

whchung reviewed Jun 7, 2018

View reviewed changes

Remove defined.

1e44c5a

whchung reviewed Jun 7, 2018

View reviewed changes

whchung reviewed Jun 11, 2018

View reviewed changes

Remove rc1 again, as was the case in the branch that got merged in an…

15a4324

…d was lost.

whchung requested changes Jun 11, 2018

View reviewed changes

Do not use problematic branch on ROCm. nn_test now works for me.

ca30075

Disable also the other special-case column reduction kernel for now o…

ccaf9bb

…n ROCm, this fixes for me the two MNIST failures found by CI.

whchung approved these changes Jun 12, 2018

View reviewed changes

whchung merged commit 869b110 into ROCm:develop-upstream Jun 12, 2018

iotamudelta deleted the rocprim-master branch June 12, 2018 16:04

sunway513 pushed a commit that referenced this pull request Jun 13, 2018

Merge pull request #15 from iotamudelta/rocprim-master

e8892c5

Integrate rocPRIM 0.3.1 milestone

wormwang mentioned this pull request Jun 20, 2019

[aarch64] Run python meet core dump failure on TF 1.12.2 with rocm 2.4 #506

Closed

Integrate rocPRIM 0.3.1 milestone #15

Integrate rocPRIM 0.3.1 milestone #15

Conversation

iotamudelta commented Jun 7, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

whchung commented Jun 7, 2018

parallelo commented Jun 7, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

iotamudelta commented Jun 11, 2018

iotamudelta commented Jun 11, 2018

parallelo commented Jun 11, 2018 • edited

parallelo commented Jun 12, 2018

whchung commented Jun 12, 2018

iotamudelta commented Jun 12, 2018

whchung commented Jun 12, 2018

whchung commented Jun 12, 2018

iotamudelta commented Jun 12, 2018

iotamudelta commented Jun 12, 2018

VincentSC commented Jun 12, 2018

whchung commented Jun 12, 2018

iotamudelta commented Jun 12, 2018

whchung commented Jun 12, 2018

VincentSC commented Jun 18, 2018

parallelo commented Jun 7, 2018 •

edited

parallelo commented Jun 11, 2018 •

edited