
OpenCL Backend #2195

Closed
wants to merge 93 commits

Conversation

@lunochod

commented Mar 25, 2015

About

The proposed changes add OpenCL support to Caffe. All GPU functions can be executed using AMD GPUs w/ OpenCL 1.2 or 2.0 as well as nVidia GPUs w/ OpenCL 1.1.

Build Instructions

https://github.com/lunochod/caffe/wiki/OpenCL-Backend

OpenCL Tests

All GPU tests successfully complete using this OpenCL version of Caffe.

Performance and Stability

The main goal was to provide an OpenCL port to the Caffe community. As such it is not yet optimized for performance or stability.

Help Wanted

Let's make it better and faster together.

lunochod added 9 commits Mar 21, 2015
- Initial Commit
commit build system changes and header files
- OpenCL Backend Feature Complete Commit
The OpenCL build target is configured in Makefile and
Makefile.config.OpenCL. clBLAS is used as the backend for all BLAS
functions that require it.

All GPU tests successfully complete using AMD or nVidia OpenCL.
@bhack

Contributor

commented Mar 25, 2015

Have you looked at #610?

@kuke


commented Mar 26, 2015

Hi, I am interested in your OpenCL implementation of Caffe. Have you trained datasets like MNIST, CIFAR, or IMAGENET using your code? I encountered some problems when using your project to train the cifar10 network. I would like to know whether you have run into the same issues. Could you please give me some instructions? Thanks.

@lunochod

Author

commented Mar 26, 2015

Hello,

Thanks for your interest. At this point the code only passes the GPU unit tests that are provided with Caffe. I am certain that the tests are not sufficient to ensure that any application built with Caffe will work.

Can you please provide some more details that will help me understand the issue? I'll also add documentation in the next couple of days.

Robert


@jyegerlehner

Contributor

commented Mar 27, 2015

Hi @lunochod,

Can you please provide some more details that will help me to understand the issue? I'll also add documentation in the next couple of days.

He is referring to several de facto standard datasets that are often used by researchers when reporting the performance of their models, and that are included as examples with Caffe. If you look here and scroll down to "Examples", you'll see links to MNIST, CIFAR, and IMAGENET tutorials that provide step-by-step instructions for running those example models. The expectation is that they would perform the same under OpenCL as under CUDA or CPU. Sorry if that was already obvious to you; maybe you were just asking him to provide details of what's going wrong in his case.

Thanks for this contribution! I too am interested in being able to run on OpenCL devices.

@jyegerlehner

Contributor

commented Mar 27, 2015

To put a finer point on bhack's allusion to PR 610: the direction they were going was to have an abstract interface, Device, for all the math operations to be performed. Two specializations of this interface, CPU and GPU, override the abstract virtual functions, hiding and encapsulating the code for performing the operations on the CPU (using BLAS) and GPU (using cuda/cuBLAS). The intent was that adding support for OpenCL would mostly be a matter of providing another new specialization of Device, perhaps CLGPU, which would presumably use clBLAS (and perhaps some cl kernels) to get things done. Thus the same build of the library would be capable of running on Cuda devices, CPU, or OpenCL, with all the differences in behavior being runtime decisions through good ol' polymorphic behavior of the Device subclasses.

By contrast, it looks like the approach of this PR is to use compile time switch USE_OPENCL or USE_CUDA, such that the library is being built either for OpenCL or for Cuda, but not both. Am I seeing that right?

Among the good things about the approach of 610 is that it fixes the situation where there are separate Forward_cpu and Forward_gpu methods for each layer class, and separate Backward_cpu and Backward_gpu methods for each layer class. The *_cpu and *_gpu calls are mostly just duplicates of each other, except that an operation such as caffe_cpu_gemm() in the cpu version is replaced with caffe_gpu_gemm() in the GPU version. 610 lets a layer just call device->gemm(), so it doesn't need to know whether it is a CPU or GPU-executed operation, because that is all encapsulated within the Device instance. Thus there is no longer any need for separate Forward_cpu() and Forward_gpu() methods: a layer just has a single Forward() method. So half the code just goes away, a big refactoring win.

This was referenced Mar 28, 2015
@jyegerlehner jyegerlehner referenced this pull request Mar 30, 2015

# uncomment one
CPU_ONLY := 1

@jyegerlehner

jyegerlehner Mar 30, 2015

Contributor

In the various Makefile.config.* files, CPU_ONLY := 1 is shown. This is confusing; what is the intended effect if CPU_ONLY is 1 in some and 0 in others? I think there should be a single authoritative source where the decision is made. I'm not sure how all this hangs together, but I'm thinking that since you introduced USE_OPENCL and USE_CUDA, CPU_ONLY should be removed. If neither USE_OPENCL nor USE_CUDA is specified, then what was called CPU_ONLY is what you are left with. I.e. one always gets the CPU implementation. USE_OPENCL and USE_CUDA will determine whether those implementations are available. But maybe you had something else in mind.

@lunochod

lunochod Mar 31, 2015

Author

The hand-crafted makefiles are confusing. My intention was to have three
device-dependent Makefiles:

Makefile.config.CPU
Makefile.config.CUDA
Makefile.config.OpenCL

and allow variables in Makefile.config to select which file should be
included:

CPU_ONLY := 1 // build CPU only version
USE_CUDA := 1 // build CUDA version
USE_OPENCL := 1 // build OpenCL version

The OpenCL version still requires the CPU_ONLY flag to be set in order
to prevent running into CUDA code. The reason is that Caffe branches
between the CPU code and the CUDA code using the preprocessor:

#ifdef CPU_ONLY

#else

#endif

and I haven't fixed that everywhere yet.

In the meantime I added support for building the OpenCL backend using
cmake, which works better and compiles faster.

Robert
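The selection scheme described above could be sketched like this (a hypothetical Makefile.config fragment; the include logic and the mutual-exclusion guard are illustrative, not the PR's actual build code):

```makefile
# Uncomment exactly one backend flag:
CPU_ONLY   := 1
# USE_CUDA   := 1
# USE_OPENCL := 1

# Guard against nonsensical flag combinations:
ifdef USE_CUDA
ifdef USE_OPENCL
$(error USE_CUDA and USE_OPENCL are mutually exclusive)
endif
endif

# Pick the device-dependent Makefile:
ifdef USE_OPENCL
include Makefile.config.OpenCL
else ifdef USE_CUDA
include Makefile.config.CUDA
else
include Makefile.config.CPU
endif
```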


@jyegerlehner

Contributor

commented Mar 30, 2015

Hi @lunochod, I spent some time hacking on this code to get the tests running and (mostly) passing. Looks to me like you did lots of brilliant work, and it was exciting to see the Caffe tests running on the Radeon GPU.

There were a number of changes I needed to make to get there; see:
jyegerlehner@6395d00
Could be I was doing something wrong or not understanding, so I just provide it as documentation in case it's helpful. I will try to make comments on your PR to explain why I changed what I did.

TEST_OBJS := $(TEST_CXX_OBJS)
TEST_BINS := $(TEST_CXX_BINS)
ALL_WARNS := $(ALL_CXX_WARNS)
TEST_FILTER := --gtest_filter="*GPU*"

@jyegerlehner

jyegerlehner Mar 30, 2015

Contributor

I don't think this gtest filter is right. The vast majority of tests that run on the GPU don't have the string "GPU" in their name. With this filter, only about 30 tests were being run. To get them all to run, I commented this out.

# Contributions simplifying and improving our build system are welcome!

# set CPU-only switch to build a version of caffe that runs on the CPU only
CPU_ONLY := 1

@jyegerlehner

jyegerlehner Mar 30, 2015

Contributor

Please see previous comment about CPU_ONLY in Makefile.

# USE_CUDNN := 1

# CUDA directory contains bin/ and lib/ directories that we need.
USE_CUDA := 1

@jyegerlehner

jyegerlehner Mar 30, 2015

Contributor

USE_CUDA can also be specified in Makefile. This file isn't included by Makefile if the USE_CUDA in Makefile is not defined. I don't think we want a compiler switch specified in more than one place, with nonsensical combinations possible. It should appear in just one place. There should also be static asserts to enforce the mutual exclusivity of USE_CUDA and USE_OPENCL, if that is what is intended.

I also have reservations about USE_CUDA := 1. A naive user might think USE_CUDA := 0 turns it off. This gets to the difference in behavior between #if USE_CUDA and #ifdef USE_CUDA. If USE_CUDA := 0, #if USE_CUDA sees its argument as false, whereas #ifdef USE_CUDA sees it as true, since all it cares about is that the macro has been defined, not what its value is.

I'm not clear on what the best practices are for these things; I usually try to avoid compilation switches myself. Maybe you're way ahead of me.

STUB_GPU(ConvolutionLayer);
#endif

INSTANTIATE_CLASS(ConvolutionLayer);
//REGISTER_LAYER_CLASS(Convolution);

@jyegerlehner

jyegerlehner Mar 30, 2015

Contributor

Commenting out REGISTER_LAYER_CLASS caused tests to fail because.. well.. the layer type isn't registered. Uncommenting this line fixed this problem. Same thing with multiple other layer classes. Please see this commit for all the cases.

@lunochod

lunochod Mar 31, 2015

Author

I ran into the same issue. But the problem only exists in the Caffe
version that uses the hand-crafted makefiles, and I couldn't fully
resolve it there. The solution is to use the latest code version and
build using cmake.

Robert


caffe_rng_uniform(n, (double) 0, (double) UINT_MAX, fBuffer);
for ( int i = 0; i < n; i++ ) {
buffer[i] = (unsigned int) fBuffer[i];
printf("buffer[%3d] = %u\n", i, buffer[i]);

@jyegerlehner

jyegerlehner Mar 30, 2015

Contributor

Delete this printf please. Creates a big mess.

@lunochod

lunochod Mar 31, 2015

Author

Fixed.

Robert


@jyegerlehner

Contributor

commented Mar 30, 2015

Couple more things.

  1. In several of the test files (test_common.cpp, test_inner_product_layer.cpp, test_math_functions.cpp), there was #ifndef CPU_ONLY around some CUDA code. Building to run the tests on OpenCL, I had CPU_ONLY undefined and USE_OPENCL defined, so it was trying to compile this code. The machine has no CUDA on it, just OpenCL, so it wouldn't compile. I changed these to #ifdef USE_CUDA. You can see these changes in this commit.

  2. After all was done, I have one test failing:

[----------] Global test environment tear-down
[==========] 1147 tests from 192 test cases ran. (2083174 ms total)
[  PASSED  ] 1146 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SyncedMemoryTest.TestGPUWrite

TestGPUWrite looked like this:

[ RUN      ] SyncedMemoryTest.TestGPUWrite
src/caffe/test/test_syncedmem.cpp:110: Failure
Value of: 1
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:110: Failure
Value of: 1
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:110: Failure
Value of: 1
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:110: Failure
Value of: 1
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:110: Failure
Value of: 1
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:110: Failure
Value of: 1
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:110: Failure
Value of: 1
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:110: Failure
Value of: 1
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:119: Failure
Value of: 2
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:119: Failure
Value of: 2
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:119: Failure
Value of: 2
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:119: Failure
Value of: 2
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:119: Failure
Value of: 2
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:119: Failure
Value of: 2
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:119: Failure
Value of: 2
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
src/caffe/test/test_syncedmem.cpp:119: Failure
Value of: 2
Expected: (static_cast<const char*>(cpu_data))[i]
Which is: '\0'
[  FAILED  ] SyncedMemoryTest.TestGPUWrite (1 ms)

I haven't tried looking into that yet.

CCC shows the Catalyst "Driver Packaging Version" as 13.35.1005. clinfo shows

Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 1.2 AMD-APP (1411.4)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_amd_hsa 


  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               2
  Device Type:                   CL_DEVICE_TYPE_GPU
  Device ID:                     4098
  Board name:                    AMD Radeon HD 7900 Series 
  Device Topology:               PCI[ B#1, D#0, F#0 ]
  Max compute units:                 28
  Max work items dimensions:             3
    Max work items[0]:               256
    Max work items[1]:               256
    Max work items[2]:               256
  Max work group size:               256
  Preferred vector width char:           4
  Preferred vector width short:          2
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          4
  Native vector width short:             2
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               800Mhz
  Address bits:                  32
  Max memory allocation:             1073741824
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      8
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            16
  Max size of kernel argument:           1024
  Alignment (bits) of base address:      2048
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     No
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    16384
  Global memory size:                2702180352
  Constant buffer size:              65536
  Max number of constant args:           8
  Local memory type:                 Scratchpad
  Local memory size:                 32768
  Kernel Preferred work group size multiple:     64
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Platform ID:                   0x00007fe999347500
  Name:                      Tahiti
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 1.2 
  Driver version:                1411.4 (VM)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 1.2 AMD-APP (1411.4)
  Extensions:                    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir 


  Device Type:                   CL_DEVICE_TYPE_CPU
  Device ID:                     4098
  Board name:                    
  Max compute units:                 6
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               1024
  Max work group size:               1024
  Preferred vector width char:           16
  Preferred vector width short:          8
  Preferred vector width int:            4
  Preferred vector width long:           2
  Preferred vector width float:          8
  Preferred vector width double:         4
  Native vector width char:          16
  Native vector width short:             8
  Native vector width int:           4
  Native vector width long:          2
  Native vector width float:             8
  Native vector width double:            4
  Max clock frequency:               3511Mhz
  Address bits:                  64
  Max memory allocation:             4192962560
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      8
  Max image 2D width:                8192
  Max image 2D height:               8192
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            16
  Max size of kernel argument:           4096
  Alignment (bits) of base address:      1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    16384
  Global memory size:                16771850240
  Constant buffer size:              65536
  Max number of constant args:           8
  Local memory type:                 Global
  Local memory size:                 32768
  Kernel Preferred work group size multiple:     1
  Error correction support:          0
  Unified memory for Host and Device:        1
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             Yes
  Queue properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Platform ID:                   0x00007fe999347500
  Name:                      AMD FX(tm)-6300 Six-Core Processor
  Vendor:                    AuthenticAMD
  Device OpenCL C version:           OpenCL C 1.2 
  Driver version:                1411.4 (sse2,avx,fma4)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 1.2 AMD-APP (1411.4)
  Extensions:                    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_amd_svm 

I can commit some time helping bring this forward in whatever form the Brewers would find acceptable.

@jyegerlehner

Contributor

commented Mar 30, 2015

By the way, there's a ton of lint. Run make lint. The enforced style is restrictive and a pain and I don't like it, but we live with it. There are also lots of tabs; those need to be two spaces instead.

- allow to build and install cifar10 example using cmake
- make sure cifar10 runs
- performance improvements
@naibaf7

Member

commented Aug 27, 2015

@lunochod
Really not?

That is true. The application will compile a kernel for each of the devices reported by the OpenCL runtime. All of the devices need to support double precision. Currently the first GPU device is selected for execution, but that happens only after compilation of all kernels.

Your comment on it from last week.

@lunochod

Author

commented Aug 27, 2015

@naibaf7

Your comment on it from last week.

My comment from last week describes my implementation, not the problem @Nirvik-D reported. The problem he reported happens at compile time. And while I have an idea why that happens, I usually figure stuff out before I give advice.

@bhack

Contributor

commented Aug 27, 2015

I like this new collaborative mood.

@naibaf7

Member

commented Aug 27, 2015

@lunochod
Let me be a bit more helpful:
https://github.com/viennacl/viennacl-dev/blob/master/viennacl/ocl/device.hpp
line 91:

cl_int err = clGetDeviceInfo(device_, CL_DEVICE_AVAILABLE, sizeof(cl_bool), static_cast<void *>(&available_), NULL);

should probably be checked before compiling for an OpenCL device that was detected by the OpenCL implementation.

@lunochod

Author

commented Aug 27, 2015

@naibaf7

@Nirvik-D has provided the command-line output of clinfo. It lists one platform and two devices:

CL_PLATFORM_NAME : AMD Accelerated Parallel Processing
CL_DEVICE_TYPE_GPU : AMD Radeon HD 7500M/7600M
CL_DEVICE_TYPE_CPU : AMD A8-4555M APU with Radeon(tm) HD Graphics

The source code for clinfo is available on GitHub:

https://github.com/Oblomov/clinfo/blob/master/src/clinfo.c

It shows that the device listing doesn't exclude devices that are not available. The information provided by clinfo is the same as what my implementation finds when OpenCL gets initialized at startup, which means that there is no 3rd, unavailable device, as you suspected.

But check for CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE here:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clGetDeviceInfo.html

This property is reported as zero by clinfo for the GPU on the user's platform, which means that the device doesn't support native double precision as required by the application; that is the reason for the error.

Thanks for trying to help Fabian.

@naibaf7

Member

commented Aug 27, 2015

@lunochod
Ok, I see what you mean now.

There was a 3rd device showing up before @Nirvik-D upgraded to 15.7 / APP SDK 3.0.0, but if that got resolved, then that is fine.

So the new error is based solely on the missing double support, as you pointed out.

@hughperkins


commented Aug 29, 2015

Hi @robert, just a quick observation: you are working in what is now a very crowded space. It seems like the Caffe OpenCL ports are now largely feature complete, but Theano and Chainer are still CUDA-specific:

| Library | OpenCL port available? | OpenCL port in Soumith's benchmarks? |
| --- | --- | --- |
| DeepCL | Yes | Yes |
| Torch | Yes | Yes |
| Caffe | 2610 2195 etc ... | Yes |
| Theano | | |
| Chainer | | |
@bhack

Contributor

commented Aug 29, 2015

For all the followers of this PR a nice performance thread on benchmark results was started at soumith/convnet-benchmarks#56

@ghost


commented Sep 15, 2015

Hi,
I've tested your Caffe and found a bug in util/OpenCL/OpenCLManager.cpp.
I have two OpenCL platforms (AMD and NVIDIA, in that order) on my desktop. The AMD platform doesn't have a GPU device, but the NVIDIA one does. However, OpenCLManager couldn't find a platform with a GPU. I looked at OpenCLManager.cpp and found the reason.

method: OpenCLManager::Query() in OpenCLManager.cpp:130,
(https://github.com/lunochod/caffe/blob/master/src/caffe/util/OpenCL/OpenCLManager.cpp#L130)

  for (ClPlatformsIter it = cl_platforms.begin(); it != cl_platforms.end();
       ++it) {
    std::tr1::shared_ptr<OpenCLPlatform> pp =
        std::tr1::shared_ptr<OpenCLPlatform>(new OpenCLPlatform(*it));
    if (!pp->Query()) {
      LOG(ERROR)<< "failed to query platform.";
      return false;
    }
    platforms_.push_back(pp);
  }

On each iteration, pp queries its devices (in particular GPU devices), but when that query fails (!pp->Query()), the method returns false and terminates OpenCLManager::Query() itself. Therefore OpenCLManager::Query() cannot add any further platforms once one platform lacks the specified devices (like GPUs). In my case the first platform was the AMD one without a GPU, so OpenCLManager::Query() couldn't add the second platform, the NVIDIA one with a GPU.

I fixed this method myself and it works. I thought it was just a minor bug (in other people's work!), so I didn't make branches, pull requests, all those complex things.... whatever.

Thanks.

@lunochod

Author

commented Sep 15, 2015

@sungbin0105

Good catch. Thanks, I'll fix that.

Robert

@NeutralCode


commented Sep 22, 2015

Can caffe process 'device_query' command?

./caffe device_query -gpu 0
@hughperkins


commented Oct 18, 2015

(personal observation: this is such a crazy-long PR :-P Maybe worth creating as a separate fork, with its own issue tracker? Just an idea :-) )

@inferrna


commented Oct 20, 2015

I got an error with both ViennaCL 1.6.2 and 1.7.0.

I1020 15:21:31.319463 11600 net.cpp:155] Setting up loss
I1020 15:21:31.319471 11600 net.cpp:163] Top shape: (1)
I1020 15:21:31.319479 11600 net.cpp:168]     with loss weight 1
I1020 15:21:31.319491 11600 net.cpp:239] loss needs backward computation.
I1020 15:21:31.319499 11600 net.cpp:239] ip3 needs backward computation.
I1020 15:21:31.319507 11600 net.cpp:239] ip2 needs backward computation.
I1020 15:21:31.319514 11600 net.cpp:239] ip1 needs backward computation.
I1020 15:21:31.319522 11600 net.cpp:243] data does not need backward computation.
I1020 15:21:31.319530 11600 net.cpp:286] This network produces output loss
I1020 15:21:31.319540 11600 net.cpp:300] Network initialization done.
I1020 15:21:31.319548 11600 net.cpp:301] Memory required for data: 20200
I1020 15:21:31.319568 11600 solver.cpp:67] Solver scaffolding done.
ViennaCL: FATAL ERROR: Could not find kernel 'fillbuffer_float' from program ''
Number of kernels in program: 0
Traceback (most recent call last):
  File "trainsin.py", line 30, in <module>
    solver.step(130000)
RuntimeError: Kernel not found

The Caffe examples also fail.

@lunochod

Author

commented Oct 21, 2015

@inferrna @naibaf7

This OpenCL PR doesn't use ViennaCL as a backend; you might want to let Fabian (#2610) know.

@shelhamer

Member

commented Oct 21, 2015

@lunochod thanks for your continued efforts on this! I hope that all the OpenCL branches out there now can be coordinated in the future. Happy hacking.

@SlimeQ


commented Oct 25, 2015

hey, i'm on kubuntu 14.04 with a radeon 290x gpu. i can run the cifar test just fine, but the unit tests keep causing a system-wide failure at a certain point. typescripting the process reveals this test is the culprit

[----------] 6 tests from SliceLayerTest/3, where TypeParam = caffe::DoubleGPU
[ RUN      ] SliceLayerTest/3.TestSetupNum
[       OK ] SliceLayerTest/3.TestSetupNum (0 ms)
[ RUN      ] SliceLayerTest/3.TestSetupChannels
[       OK ] SliceLayerTest/3.TestSetupChannels (0 ms)
[ RUN      ] SliceLayerTest/3.TestSliceAcrossNum
[       OK ] SliceLayerTest/3.TestSliceAcrossNum (3 ms)
[ RUN      ] SliceLayerTest/3.TestSliceAcrossChannels
[       OK ] SliceLayerTest/3.TestSliceAcrossChannels (3 ms)
[ RUN      ] SliceLayerTest/3.TestGradientAcrossNum
[       OK ] SliceLayerTest/3.TestGradientAcrossNum (

i'm unsure why this particular test would fail. any insight?

@cynthiazheng


commented Nov 9, 2015

Hi,
I got the same error as @dayeolee.
The CMake files were generated fine, but an error occurred when I ran make.

./include/caffe/util/OpenCL/OpenCLMemory.hpp:44:3: error: ‘caffe::OpenCLMemory::OpenCLMemory(const caffe::OpenCLMemory&)’ is private
In file included from /opt/centos/devtoolset-1.1/root/usr/lib/gcc/x86_64-redhat-linux/4.7.2/../../../../include/c++/4.7.2/bits/stl_algobase.h:65:0,
from /opt/centos/devtoolset-1.1/root/usr/lib/gcc/x86_64-redhat-linux/4.7.2/../../../../include/c++/4.7.2/bits/char_traits.h:41,
from /opt/centos/devtoolset-1.1/root/usr/lib/gcc/x86_64-redhat-linux/4.7.2/../../../../include/c++/4.7.2/ios:41,
from /opt/centos/devtoolset-1.1/root/usr/lib/gcc/x86_64-redhat-linux/4.7.2/../../../../include/c++/4.7.2/ostream:40,
from /opt/centos/devtoolset-1.1/root/usr/lib/gcc/x86_64-redhat-linux/4.7.2/../../../../include/c++/4.7.2/iostream:40,
from src/caffe/util/OpenCL/OpenCLDevice.cpp:5:

Which old commit should I use? Could someone help?

@naibaf7 naibaf7 referenced this pull request Jan 19, 2016

@shelhamer shelhamer added the OpenCL label Jan 26, 2016

@briansp2020


commented Jan 26, 2016

AMD just released Caffe built with HIP and HCC (https://bitbucket.org/multicoreware/hccaffe/overview).
That makes me wonder: do people here want Caffe ported to OpenCL because you want Caffe running on AMD hardware, or because OpenCL has the potential to work with more hardware?

I'm just wondering how many people who have been helping out with this project would support the new effort from AMD.

@bhack

Contributor

commented Feb 1, 2016

@naibaf7 What is the policy now? I think it no longer makes sense to open OpenCL PRs against master now that we have an official OpenCL branch.

@olesalscheider

Contributor

commented Feb 24, 2016

@briansp2020 I would be interested in running Caffe on AMD hardware. Do you have any performance numbers for HcCaffe in comparison with the OpenCL branch? I do not expect it to be as fast as Caffe with cuDNN, but it would be nice if it matched the CUDA implementation without cuDNN...

@naibaf7 naibaf7 closed this Feb 25, 2016

@naibaf7

Member

commented Feb 25, 2016

Pull requests on OpenCL should now be made against https://github.com/BVLC/caffe/tree/opencl

@hughperkins


commented May 15, 2016

Robert, how can I cite your opencl caffe fork?

@hughperkins


commented May 15, 2016

(need your surname too, probably)

@hughperkins


commented May 15, 2016

@gujunli


commented May 15, 2016

That is Robert

Sent from my iPhone

> On May 15, 2016, at 2:39 PM, Hugh Perkins notifications@github.com wrote:
>
> This is you? https://www.linkedin.com/in/robert-engel-131b87107


