
Accelerate Level 3 BLAS calls with multiple GPUs using cublasXT in CUDA 6.0 #194

Closed
kloudkl opened this issue Mar 8, 2014 · 2 comments
kloudkl commented Mar 8, 2014

https://developer.nvidia.com/cublasxt

CublasXT is a set of routines which accelerate Level 3 BLAS (Basic Linear Algebra Subprograms) calls by spreading work across more than one GPU. By using a streaming design, cublasXT efficiently manages transfers across the PCI-Express bus automatically, which allows input and output data to be stored in the host's system memory. This provides out-of-core operation: the size of operand data is limited only by system memory size, not by GPU on-board memory size.

Starting with CUDA 6.0, a free version of cublasXT is included in the CUDA toolkit as part of the cuBLAS library. The free version supports operation on single GPUs and dual-GPU cards such as the Tesla K10 or GeForce GTX690.

The Premier version of cublasXT supports scaling across multiple GPUs connected to the same motherboard, with near-perfect scaling as more GPUs are added. A single system with 4 Tesla K40 GPUs is able to achieve over 4.5 TFLOPS of double-precision performance.
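As a sketch of what the out-of-core API described above looks like in practice (this code is not from the issue; the function names follow the cuBLAS-XT documentation, while the matrix size and single-device selection are illustrative assumptions), a GEMM with host-resident operands might be written as:

```c
/* Sketch of a cublasXt SGEMM with all operands in ordinary host memory.
 * Requires CUDA 6.0+; compile with something like:
 *   gcc xt_gemm.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcublas -o xt_gemm
 * Error checking is omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <cublasXt.h>

int main(void) {
    const size_t n = 4096;                     /* illustrative square-matrix size */
    float *A = malloc(n * n * sizeof(float));  /* plain malloc, not cudaMalloc */
    float *B = malloc(n * n * sizeof(float));
    float *C = malloc(n * n * sizeof(float));
    for (size_t i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; C[i] = 0.0f; }

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    int devices[1] = {0};                      /* cublasXt spreads tiles over this list */
    cublasXtDeviceSelect(handle, 1, devices);

    const float alpha = 1.0f, beta = 0.0f;
    /* The library tiles A, B, C and streams tiles across PCI-Express itself,
     * so operands never need to fit in GPU memory. */
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, A, n, B, n, &beta, C, n);

    printf("C[0] = %f\n", C[0]);
    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```

Note that, unlike plain cuBLAS, no device allocation or explicit host-to-device copies appear anywhere; that is the whole point of the streaming design the text describes.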

NVBLAS

NVBLAS is a CPU BLAS implementation which automatically accelerates eligible BLAS calls via cublasXT, and is included with the CUDA toolkit. All versions of cublasXT work with NVBLAS.

AVAILABILITY

The free version of cublasXT is included with the CUDA Toolkit in version 6.0 and beyond.

A free evaluation version of cublasXT Premier will be available to members of the CUDA Registered Developer Program.


kloudkl commented May 3, 2014

According to "New Features in CUDA 6 Make GPU Acceleration Easier", this can be done simply by re-linking against the NVBLAS library introduced in CUDA 6 (`gcc myapp.c -lnvblas -lmkl_rt -o myapp`) or by changing the library load order (`env LD_PRELOAD=libnvblas.so myapp`).
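For context (not part of the original comment): NVBLAS also reads a small configuration file at startup, so the `LD_PRELOAD` route needs one alongside the binary. The keys below follow the NVBLAS documentation; the CPU BLAS library name is an assumption, substitute whatever BLAS your application was built against.

```
# nvblas.conf — read by NVBLAS when the process starts.
# NVBLAS_CPU_BLAS_LIB names the fallback CPU BLAS used for calls that
# NVBLAS does not intercept (libmkl_rt.so here is only an example).
NVBLAS_CPU_BLAS_LIB  libmkl_rt.so
# Use every visible GPU for the cublasXT-accelerated Level 3 calls.
NVBLAS_GPU_LIST      ALL
NVBLAS_LOGFILE       nvblas.log
```

The unmodified binary is then run as `env LD_PRELOAD=libnvblas.so ./myapp`, with eligible Level 3 BLAS calls routed through cublasXT and everything else falling through to the CPU library.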

shelhamer (Member) commented

Closing, as Caffe is pretty exacting in how it wants its memory to be handled, and this is unlikely to be adopted at this point.
