Commit eaf3a9b: oneMKL max and min
First set of changes.
Sarbojit2019 committed Nov 2, 2022
1 parent 3589f87 commit eaf3a9b
Showing 5 changed files with 130 additions and 0 deletions.
33 changes: 33 additions & 0 deletions deps/onemkl.cpp
@@ -81,6 +81,39 @@ extern "C" int onemklZgemm(syclQueue_t device_queue, onemklTranspose transA,
return 0;
}

extern "C" void onemklDamax(syclQueue_t device_queue, int64_t n, const double *x, int64_t incx, int64_t *result)
{
//printf("Hello World\n\n");
printf("n %ld, incx %ld, x %p\n", n, incx, x);
int64_t r=0;
oneapi::mkl::blas::column_major::iamax(device_queue->val, n, x, incx, &r);
}
extern "C" void onemklSamax(syclQueue_t device_queue, int64_t n, const float *x, int64_t incx, int64_t *result){
oneapi::mkl::blas::column_major::iamax(device_queue->val, n, x, incx, result);

pengtu commented on Nov 2, 2022:

@maleadt: The buffer version of oneMKL iamax takes a sycl::buffer<std::int64_t, 1> &result to return the value: https://spec.oneapi.io/versions/1.0-rev-1/elements/oneMKL/source/domains/blas/iamax.html#iamax-buffer-version.

What is the proper way to allocate the result buffer and pass it to oneMKL?

  1. Allocating a new sycl::buffer here and copying the result out seems like overkill and may introduce a new dependency on the dpcpp compiler (see the sketch below).
  2. Allocate a oneVector in wrappers.jl and pass a ZeRef?

cc @Sarbojit2019
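
For reference, option 1 would look roughly like this (an illustrative sketch only; queue, n, incx, x_buf, and result are assumed to already be in scope, and x_buf would have to be a sycl::buffer rather than a raw pointer):

    // Option 1 sketch: create a one-element result buffer, let iamax write
    // into it, then read the value back on the host.
    sycl::buffer<std::int64_t, 1> res_buf{sycl::range<1>(1)};
    oneapi::mkl::blas::column_major::iamax(queue, n, x_buf, incx, res_buf);
    {
        // constructing a host accessor blocks until the result is available
        sycl::host_accessor acc(res_buf, sycl::read_only);
        *result = acc[0];
    }
    // The catch: the input also has to be a sycl::buffer, which does not fit
    // the raw-pointer C interface the current wrappers expose.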

maleadt commented on Nov 2, 2022:

Why can't we use the USM version and just pass pointers?

pengtu commented on Nov 2, 2022:

To use USM, we still need to call malloc_shared() to allocate 'result', and we also need to change the input array from sycl::buffer to USM, which means changing the current oneArray allocation mechanism to USM as well.

maleadt commented on Nov 2, 2022:

Ah, so there's no way to support "arbitrary" CPU pointers (as e.g. CUBLAS does for dual CPU/GPU arguments)? I don't think this is a problem though, as we control the allocation.

wrt. switching oneArray to USM, what is the performance impact of that? We currently use device buffers in all our GPU packages to avoid any overhead (while supporting unified allocations if explicitly requested). Switching the default isn't hard though, just a matter of changing this to SharedBuffer: https://github.com/JuliaGPU/oneAPI.jl/blob/0d9ae2f12bc3915d5595cee7ee2277f86eb6e405/src/array.jl#L95-L97

pengtu commented on Nov 3, 2022:

Yeah, there is no way to support "arbitrary" CPU pointers as in CUBLAS. You are correct that we can call the USM version with our zeMemAllocDevice-allocated device memory. In our C wrapper, we may just need to create a shared 'result' variable, as in the following iamax test code:

auto result_p = (int64_t*)oneapi::mkl::malloc_shared(64, sizeof(int64_t), *dev, cxt);

https://github.com/oneapi-src/oneMKL/blob/develop/tests/unit_tests/blas/level1/iamax_usm.cpp#L83

We will give it a try to see if it works.
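
For example, the C wrapper could take roughly this shape (a sketch only, assuming device_queue->val is the underlying sycl::queue as in the existing wrappers; error handling omitted):

    extern "C" void onemklDamax(syclQueue_t device_queue, int64_t n,
                                const double *x, int64_t incx, int64_t *result)
    {
        auto &q = device_queue->val;
        // shared allocation is visible to both host and device
        int64_t *res = sycl::malloc_shared<int64_t>(1, q);
        auto ev = oneapi::mkl::blas::column_major::iamax(q, n, x, incx, res);
        ev.wait();          // make sure the result has been written
        *result = *res;     // hand the value back through the C interface
        sycl::free(res, q);
    }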

Sarbojit2019 (author) commented on Nov 3, 2022:

I thought 'RefOrZeRef' makes the value accessible to both the CPU and the GPU. Is that not a correct assumption? That is why I declared the result as result::RefOrZeRef{Int64}.
@maleadt, is there anything specific that needs to be set in order to use the buffered version versus the USM version of the API? When I try to pass a sycl::buffer as the type of the result, I get the compile-time error below.

/home/sarbojit/src/oneAPI.jl/deps/onemkl.cpp:90:5: error: no matching function for call to 'iamax'
    oneapi::mkl::blas::column_major::iamax(device_queue->val, n, x, incx, &r);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sarbojit/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/bin-llvm/../include/oneapi/mkl/blas/usm_decls.hpp:1299:28: note: candidate function not viable: no known conversion from 'sycl::buffer<std::int64_t, 1> *' (aka 'buffer<long, 1> *') to 'std::int64_t *' (aka 'long *') for 5th argument
DLL_EXPORT cl::sycl::event iamax(cl::sycl::queue &queue, std::int64_t n, const double *x, std::int64_t incx,

maleadt commented on Nov 3, 2022:

RefOrZeRef is a target type for ccall, not to be used directly. You should use ZeRef, which is backed by oneArray, so not USM right now (see above): https://github.com/JuliaGPU/oneAPI.jl/blob/d68cd24d285a290d8a07f48f4ecab8bd8a0e8b6a/src/array.jl#L224-L227

maleadt commented on Nov 3, 2022:

Actually, I'm confused: isn't USM just the idea of using pointers from either malloc_device OR malloc_shared, instead of buffers? That's at least what I get from reading https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/memory/usm-buffer.html. So shouldn't iamax document that for the USM version the result pointer should be a host (or shared) pointer, while the input should be device (or shared) memory?

pengtu commented on Nov 3, 2022:

You are correct that USM is just the idea of using pointers from malloc_device or malloc_shared instead of buffers. The result pointer can be a host/shared/device pointer allocated by malloc_host/malloc_shared/malloc_device, but not a pointer to an ordinary host variable.
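
In other words (an illustrative snippet; the function name and input pointer are assumptions, x is assumed to be USM device or shared memory, and on older DPC++ releases the SYCL header is <CL/sycl.hpp>):

    #include <sycl/sycl.hpp>
    #include <oneapi/mkl.hpp>

    void iamax_usm_example(sycl::queue &q, const double *x, std::int64_t n) {
        // std::int64_t r;
        // oneapi::mkl::blas::column_major::iamax(q, n, x, 1, &r);  // NOT valid: &r is a plain stack address

        std::int64_t *r = sycl::malloc_shared<std::int64_t>(1, q);  // USM shared allocation: valid
        oneapi::mkl::blas::column_major::iamax(q, n, x, 1, r).wait();
        // *r now holds the index of the max-magnitude element; copy it out, then free
        sycl::free(r, q);
    }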

maleadt commented on Nov 3, 2022:

Makes sense. I've opened an issue for a better abstraction, as the C wrappers shouldn't have to allocate any memory.

}
extern "C" void onemklZamax(syclQueue_t device_queue, int64_t n, const double _Complex *x, int64_t incx, int64_t *result){
oneapi::mkl::blas::column_major::iamax(device_queue->val, n, reinterpret_cast<const std::complex<double> *>(x), incx, result);
}
extern "C" void onemklCamax(syclQueue_t device_queue, int64_t n, const float _Complex *x, int64_t incx, int64_t *result){
oneapi::mkl::blas::column_major::iamax(device_queue->val, n, reinterpret_cast<const std::complex<float> *>(x), incx, result);
}

extern "C" void onemklDamin(syclQueue_t device_queue, int64_t n, const double *x, int64_t incx, int64_t *result)
{
//printf("Hello World\n\n");
//printf("n %ld, incx %ld, x %p\n", n, incx, x);
int64_t r=0;
oneapi::mkl::blas::column_major::iamin(device_queue->val, n, x, incx, &r);
}
extern "C" void onemklSamin(syclQueue_t device_queue, int64_t n, const float *x, int64_t incx, int64_t *result){
oneapi::mkl::blas::column_major::iamin(device_queue->val, n, x, incx, result);
}
extern "C" void onemklZamin(syclQueue_t device_queue, int64_t n, const double _Complex *x, int64_t incx, int64_t *result){
oneapi::mkl::blas::column_major::iamin(device_queue->val, n, reinterpret_cast<const std::complex<double> *>(x), incx, result);
}
extern "C" void onemklCamin(syclQueue_t device_queue, int64_t n, const float _Complex *x, int64_t incx, int64_t *result){
oneapi::mkl::blas::column_major::iamin(device_queue->val, n, reinterpret_cast<const std::complex<float> *>(x), incx, result);
}

// other

10 changes: 10 additions & 0 deletions deps/onemkl.h
@@ -39,6 +39,16 @@ int onemklZgemm(syclQueue_t device_queue, onemklTranspose transA,
const double _Complex *B, int64_t ldb, double _Complex beta,
double _Complex *C, int64_t ldc);

void onemklDamax(syclQueue_t device_queue, int64_t n, const double *x, int64_t incx, int64_t *result);
void onemklSamax(syclQueue_t device_queue, int64_t n, const float *x, int64_t incx, int64_t *result);
void onemklZamax(syclQueue_t device_queue, int64_t n, const double _Complex *x, int64_t incx, int64_t *result);
void onemklCamax(syclQueue_t device_queue, int64_t n, const float _Complex *x, int64_t incx, int64_t *result);

void onemklDamin(syclQueue_t device_queue, int64_t n, const double *x, int64_t incx, int64_t *result);
void onemklSamin(syclQueue_t device_queue, int64_t n, const float *x, int64_t incx, int64_t *result);
void onemklZamin(syclQueue_t device_queue, int64_t n, const double _Complex *x, int64_t incx, int64_t *result);
void onemklCamin(syclQueue_t device_queue, int64_t n, const float _Complex *x, int64_t incx, int64_t *result);

void onemklDestroy();
#ifdef __cplusplus
}
40 changes: 40 additions & 0 deletions lib/mkl/libonemkl.jl
@@ -41,3 +41,43 @@ function onemklZgemm(device_queue, transA, transB, m, n, k, alpha, A, lda, B, ld
B::ZePtr{ComplexF64}, ldb::Int64, beta::ComplexF64,
C::ZePtr{ComplexF64}, ldc::Int64)::Cint
end

function onemklSamax(device_queue, n, x, incx, result)
@ccall liboneapi_support.onemklSamax(device_queue::syclQueue_t, n::Int64,
x::ZePtr{Cfloat}, incx::Int64, result::RefOrZeRef{Int64})::Cvoid
end

function onemklDamax(device_queue, n, x, incx, result)
@ccall liboneapi_support.onemklDamax(device_queue::syclQueue_t, n::Int64,
x::ZePtr{Cdouble}, incx::Int64, result::RefOrZeRef{Int64})::Cvoid
end

function onemklCamax(device_queue, n, x, incx, result)
@ccall liboneapi_support.onemklCamax(device_queue::syclQueue_t, n::Int64,
x::ZePtr{ComplexF32}, incx::Int64, result::RefOrZeRef{Int64})::Cvoid
end

function onemklZamax(device_queue, n, x, incx, result)
@ccall liboneapi_support.onemklZamax(device_queue::syclQueue_t, n::Int64,
x::ZePtr{ComplexF64}, incx::Int64, result::RefOrZeRef{Int64})::Cvoid
end

function onemklSamin(device_queue, n, x, incx, result)
@ccall liboneapi_support.onemklSamin(device_queue::syclQueue_t, n::Int64,
x::ZePtr{Cfloat}, incx::Int64, result::RefOrZeRef{Int64})::Cvoid
end

function onemklDamin(device_queue, n, x, incx, result)
@ccall liboneapi_support.onemklDamin(device_queue::syclQueue_t, n::Int64,
x::ZePtr{Cdouble}, incx::Int64, result::RefOrZeRef{Int64})::Cvoid
end

function onemklCamin(device_queue, n, x, incx, result)
@ccall liboneapi_support.onemklCamin(device_queue::syclQueue_t, n::Int64,
x::ZePtr{ComplexF32}, incx::Int64, result::RefOrZeRef{Int64})::Cvoid
end

function onemklZamin(device_queue, n, x, incx, result)
@ccall liboneapi_support.onemklZamin(device_queue::syclQueue_t, n::Int64,
x::ZePtr{ComplexF64}, incx::Int64, result::RefOrZeRef{Int64})::Cvoid
end
37 changes: 37 additions & 0 deletions lib/mkl/wrappers.jl
@@ -19,6 +19,43 @@ end
#
# BLAS
#
# level 1

## iamax
for (fname, elty) in
((:onemklDamax,:Float64),
(:onemklSamax,:Float32),
(:onemklZamax,:ComplexF64),
(:onemklCamax,:ComplexF32))
@eval begin
function iamax(n::Integer, x::StridedArray{$elty})
result = Ref{Clong}()
n = length(x)
queue = global_queue(context(x), device(x))
$fname(sycl_queue(queue), n, x, stride(x, 1), result)
return result[]
end
end
end

## iamin
for (fname, elty) in
((:onemklDamin,:Float64),
(:onemklSamin,:Float32),
(:onemklZamin,:ComplexF64),
(:onemklCamin,:ComplexF32))
@eval begin
function iamin(n::Integer, x::StridedArray{$elty})
result = Ref{Clong}()
n = length(x)
queue = global_queue(context(x), device(x))
$fname(sycl_queue(queue), n, x, stride(x, 1), result)
return result[]
end
end
end

# level 3

10 changes: 10 additions & 0 deletions test/onemkl.jl
@@ -0,0 +1,10 @@
using oneAPI
using oneAPI.oneMKL

using LinearAlgebra

m = 20

A = oneArray(rand(Float64, m))
#maxVal = oneMKL.iamax(m, A)
minVal = oneMKL.iamin(m, A)
