
Device API for random number generation #139

Open
WeiqunZhang opened this issue May 15, 2020 · 4 comments
Labels: enhancement (New feature or request), MKL
@WeiqunZhang

There is a host API in oneMKL for generating random numbers on the device. However, there is no device API for random number generation. The problem is that we sometimes do not know how many random numbers we need before we launch the kernel that uses them. And even when we do know, the number of random numbers needed may be too large to fit in memory.

Could oneAPI provide a device API for random number generation?

@rscohn2 rscohn2 added enhancement New feature or request MKL labels May 15, 2020
@rscohn2
Member

rscohn2 commented May 15, 2020

@mkrainiuk: Can you look at this?

@mkrainiuk
Contributor

Hi @WeiqunZhang, yes, we plan to provide a device API for selected oneMKL domains, including random number generators. We have done some prototyping and are currently discussing the API. If you have any ideas about what API would fit your kernels best, please share them with us.

@mkrainiuk
Contributor

summon @marius-cornea

@WeiqunZhang
Author

Something like the following would work for us.

cl::sycl::queue queue(...);
std::size_t NSTATES = ...; // number of states in the engine
onemkl::rng::philox4x32x10 engine(queue, NSTATES, SEED);

auto NTHREADS = NSTATES;
queue.submit([&] (cl::sycl::handler& h)
{
    onemkl::rng::accessor<onemkl::rng::philox4x32x10> engine_accessor(engine);
    h.parallel_for(cl::sycl::range<1>(NTHREADS),
    [=] (cl::sycl::item<1> item)
    {
        onemkl::rng::uniform<double> distr(0.0, 1.0); // uniform random double in [0,1)
        auto r = distr(engine_accessor[item.get_linear_id()]);
    });
});

onemkl::rng::accessor is something I made up. I imagine it might be hard to capture philox4x32x10 by value on the device because of resource-ownership issues. So the accessor could be something like a struct with a non-owning pointer, which can be freely copied.
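To make the idea concrete, here is a minimal sketch (plain C++ without SYCL; `engine_state`, `rng_accessor`, and all member names are hypothetical, not real oneMKL API) of what such a trivially copyable, non-owning accessor could look like:

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical per-thread engine state; a real philox4x32x10 state
// would hold a counter and key rather than a single word.
struct engine_state {
    std::uint64_t s;
};

// Non-owning view over an array of engine states. It holds only a raw
// pointer and a count, so it is trivially copyable and can be captured
// by value in a device lambda without transferring ownership.
template <typename State>
struct rng_accessor {
    State*      states;  // owned by the engine, not by the accessor
    std::size_t nstates;

    // Wrap the index so more threads than states can share the pool.
    State& operator[](std::size_t i) const { return states[i % nstates]; }
};

static_assert(std::is_trivially_copyable<rng_accessor<engine_state>>::value,
              "accessor must be capturable by value in a kernel lambda");
```

Because it is trivially copyable, such a struct satisfies the SYCL requirements for kernel-lambda captures, while the engine object itself (which owns the device allocation) stays on the host.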

Also, we usually use an in-order queue. In that case, we would also like to be able to construct the accessor object outside the command-group scope (i.e., outside queue.submit()). Being able to do this is actually very important to us.

In the code above, we assume the number NSTATES passed to the engine constructor is the same as NTHREADS, the number of threads used to launch the device kernel. Ideally we would like this to be relaxed so that NTHREADS only needs to be a multiple of NSTATES, because we may launch a kernel with far more threads than the hardware can run simultaneously. Otherwise it would waste a lot of memory, and we would have to recreate the engine (which is probably very expensive) whenever we launch another kernel with a different number of threads.

To get around this in our CUDA code, we have implemented a mutex so that we can launch kernels with different numbers of threads using the same engine. Some threads in a group (only those that need to generate random numbers) call lock. If one thread locks it, the whole group owns it. These threads can then generate as many random numbers as they need, and finally unlock the mutex to let another group use it. Those that call lock must call unlock, but not all threads in the group have to go through the lock-generate-unlock path. If oneMKL could provide such a mutex, that would be perfect. So the perfect API for us would be

cl::sycl::queue queue(...); // in-order queue
std::size_t NSTATES = ...;  // number of states in the engine
onemkl::rng::philox4x32x10 engine(queue, NSTATES, SEED);

onemkl::rng::accessor<onemkl::rng::philox4x32x10> engine_accessor(engine);

auto NTHREADS = n*NSTATES;
queue.submit([&] (cl::sycl::handler& h)
{
    h.parallel_for(cl::sycl::range<1>(NTHREADS),
    [=] (cl::sycl::item<1> item)
    {
        if (item.get_local_linear_id() < 5) {
            onemkl::rng::scoped_mutex mutx(engine_accessor);
            onemkl::rng::uniform<double> distr(0.0, 1.0); // uniform random double in [0,1)
            auto r = distr(engine_accessor[item.get_linear_id()]);
            // some threads might generate more random numbers
            // some might use a different distribution
        }
    });
});
