### Asynchronous `target` with Tasks

The following example shows how the `task` and `target` constructs are used to execute multiple `target` regions asynchronously. The task that encounters the `task` construct generates an explicit task that contains a `target` region. While it waits for the `target` region to complete, the thread executing the explicit task encounters a task scheduling point. This allows the thread to switch back to the encountering task or to one of the previously generated explicit tasks.

In [None]:
//%compiler: clang
//%cflags: -fopenmp

/*
* name: async_target.1c
* type: C
* version: omp_4.0
*/
#pragma omp declare target
float F(float);
#pragma omp end declare target
#define N 1000000000
#define CHUNKSZ 1000000
void init(float *, int);
float Z[N];
void pipedF(){
   int C, i;
   init(Z, N);
   for (C=0; C<N; C+=CHUNKSZ){
      #pragma omp task shared(Z)
      #pragma omp target map(Z[C:CHUNKSZ])
      #pragma omp parallel for
      for (i=C; i<C+CHUNKSZ; i++) Z[i] = F(Z[i]); // stay inside the mapped chunk Z[C:CHUNKSZ]
   }
   #pragma omp taskwait
}
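
The chunked pipeline can be tried at a much smaller scale than the multi-gigabyte array above. The following is a minimal sketch, not part of the original example: `piped_scale` and the doubling body of `F` are hypothetical stand-ins, and the enclosing `parallel`/`single` region is an added assumption so that a team of threads is available to run the generated tasks. When no offload device is present, implementations typically run the `target` regions on the host, and without OpenMP support the pragmas are ignored and the loop runs serially.

```c
#include <assert.h>

#pragma omp declare target
/* Hypothetical stand-in for F: doubles its argument. */
static float F(float x) { return 2.0f * x; }
#pragma omp end declare target

/* Hypothetical helper: applies F to z[0..n-1], one target region per chunk. */
void piped_scale(float *z, int n, int chunk)
{
    /* Assumption of this sketch: n is a multiple of chunk. */
    assert(chunk > 0 && n % chunk == 0);

    #pragma omp parallel
    #pragma omp single
    {
        for (int c = 0; c < n; c += chunk) {
            /* Each explicit task gets its own copy of the chunk offset c. */
            #pragma omp task shared(z) firstprivate(c, chunk)
            {
                /* Only the current chunk is mapped to and from the device. */
                #pragma omp target map(tofrom: z[c:chunk])
                #pragma omp parallel for
                for (int i = c; i < c + chunk; i++)
                    z[i] = F(z[i]);
            }
        }
        /* Wait for all chunk tasks generated above. */
        #pragma omp taskwait
    }
}
```

For instance, calling `piped_scale(z, 8, 2)` on an array holding 0 through 7 generates four tasks, and after the `taskwait` every element has been doubled.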



The Fortran version has an interface block that contains the `declare` `target` directive. An identical directive exists in the function declaration (not shown here).

In [None]:

! name: async_target.1f
! type: F-free
! version: omp_4.0
module parameters
integer, parameter :: N=1000000000, CHUNKSZ=1000000
end module
subroutine pipedF()
use parameters, ONLY: N, CHUNKSZ
integer            :: C, i
real               :: z(N)

interface
   function F(z)
   !$omp declare target
     real, intent(IN) ::z
     real             ::F
   end function F
end interface

   call init(z,N)

   do C=1,N,CHUNKSZ

      !$omp task shared(z)
      !$omp target map(z(C:C+CHUNKSZ-1))
      !$omp parallel do
         do i=C,C+CHUNKSZ-1
            z(i) = F(z(i))
         end do
      !$omp end target
      !$omp end task

   end do
   !$omp taskwait
   print*, z

end subroutine pipedF



The following example shows how the `task` and `target` constructs are used to execute multiple `target` regions asynchronously. The task dependence ensures that the storage is allocated and initialized on the device before it is accessed.

In [None]:
//%compiler: clang
//%cflags: -fopenmp

/*
* name: async_target.2c
* type: C
* version: omp_4.0
*/
#include <stdlib.h>
#include <omp.h>
#pragma omp declare target
extern void init(float *, float *, int);
#pragma omp end declare target
extern void foo();
extern void output(float *, int);
void vec_mult(float *p, int N, int dev)
{
   float *v1, *v2;
   int i;
   #pragma omp task shared(v1, v2) depend(out: v1, v2)
   #pragma omp target device(dev) map(v1, v2)
   {
       // check whether on device dev
       if (omp_is_initial_device())
           abort();
       v1 = (float *)malloc(N*sizeof(float));
       v2 = (float *)malloc(N*sizeof(float));
       init(v1, v2, N);
   }
   foo(); // execute other work asynchronously
   #pragma omp task shared(v1, v2, p) depend(in: v1, v2)
   #pragma omp target device(dev) map(to: v1, v2) map(from: p[0:N])
   {
       // check whether on device dev
       if (omp_is_initial_device())
           abort();
       #pragma omp parallel for
       for (i=0; i<N; i++)
           p[i] = v1[i] * v2[i];
       free(v1);
       free(v2);
   }
   #pragma omp taskwait
   output(p, N);
}
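
A variant of this pattern avoids calling `malloc` inside the `target` region by allocating the vectors on the host and reserving their device storage with a `target` `data` region and `map(alloc: ...)`. The following is a minimal sketch under that assumption, not the original example: `vec_mult_sketch` and `init_vectors` are hypothetical names, the `device` clause is omitted so the default device is used, and the `parallel`/`single` region is added so that threads are available to run the tasks.

```c
#include <assert.h>
#include <stdlib.h>

#pragma omp declare target
/* Hypothetical initializer: v1[i] = i, v2[i] = 2. */
static void init_vectors(float *v1, float *v2, int n)
{
    for (int i = 0; i < n; i++) {
        v1[i] = (float)i;
        v2[i] = 2.0f;
    }
}
#pragma omp end declare target

/* Hypothetical helper: p[i] = v1[i] * v2[i], computed by two ordered target tasks. */
void vec_mult_sketch(float *p, int n)
{
    assert(n > 0);
    float *v1 = malloc((size_t)n * sizeof *v1);
    float *v2 = malloc((size_t)n * sizeof *v2);
    assert(v1 != NULL && v2 != NULL);

    /* Device storage for v1 and v2 exists for the whole target data region. */
    #pragma omp target data map(alloc: v1[0:n], v2[0:n])
    #pragma omp parallel
    #pragma omp single
    {
        /* Producer task: initialize the vectors on the device. */
        #pragma omp task shared(v1, v2) depend(out: v1, v2)
        {
            #pragma omp target map(alloc: v1[0:n], v2[0:n])
            init_vectors(v1, v2, n);
        }

        /* Consumer task: cannot start until the producer task has completed. */
        #pragma omp task shared(v1, v2, p) depend(in: v1, v2)
        {
            #pragma omp target map(alloc: v1[0:n], v2[0:n]) map(from: p[0:n])
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                p[i] = v1[i] * v2[i];
        }
        #pragma omp taskwait
    }

    free(v1);
    free(v2);
}
```

Because the inner `map(alloc: ...)` clauses refer to storage that is already present on the device, they should not trigger any transfer of `v1` or `v2`; only `p` is copied back to the host.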



The Fortran example below is similar to the C version above. Instead of pointers, though, it uses the convenience of Fortran allocatable arrays on the device. To preserve the arrays allocated on the device across multiple `target` regions, a `target` `data` region is used in this case.

If there is no shape specified for an allocatable array in a `map` clause, only the array descriptor (also called a dope vector) is mapped. That is, device space is created for the descriptor, and it is initially populated with host values. In this case, the _v1_ and _v2_ arrays will be in a non-associated state on the device. When space for _v1_ and _v2_ is allocated on the device in the first `target` region, the addresses of that space are stored in their descriptors.

At the end of the first `target` region, the arrays _v1_ and _v2_ are preserved on the device for access in the second `target` region. At the end of the second `target` region, the data in array _p_ is copied back, but the arrays _v1_ and _v2_ are not.

A `depend` clause is used in the `task` directive to provide a wait at the beginning of the second `target` region, to ensure that there is no race condition with _v1_ and _v2_ in the two tasks. It would be noncompliant to use _v1_ and/or _v2_ in lieu of _N_ in the `depend` clauses, because the use of non-allocated allocatable arrays as list items in a `depend` clause would lead to unspecified behavior.

This example is not strictly compliant with the OpenMP 4.5 specification since the allocation status of allocatable arrays _v1_ and _v2_ is changed inside the `target` region, which is not allowed. (See the restrictions for the `map` clause in the _Data-mapping Attribute Rules and Clauses_ section of the specification.) However, the intention is to relax the restrictions on mapping of allocatable variables in the next release of the specification so that the example will be compliant.

In [None]:

! name: async_target.2f
! type: F-free
! version: omp_4.0
 subroutine mult(p,  N, idev)
   use omp_lib, ONLY: omp_is_initial_device
   real             :: p(N)
   real,allocatable :: v1(:), v2(:)
   integer ::  i, idev
   !$omp declare target (init)

   !$omp target data map(v1,v2)

   !$omp task shared(v1,v2) depend(out: N)
      !$omp target device(idev)
         if( omp_is_initial_device() ) &
            stop "not executing on target device"
         allocate(v1(N), v2(N))
         call init(v1,v2,N)
      !$omp end target
   !$omp end task

   call foo()  ! execute other work asynchronously

   !$omp task shared(v1,v2,p) depend(in: N)
      !$omp target device(idev) map(from: p)
         if( omp_is_initial_device() ) &
            stop "not executing on target device"
         !$omp parallel do
            do i = 1,N
               p(i) = v1(i) * v2(i)
            end do
         deallocate(v1,v2)

      !$omp end target
   !$omp end task

   !$omp taskwait

   !$omp end target data

   call output(p, N)

end subroutine
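
Using the scalar _N_ in the `depend` clauses works because a task dependence is matched on the storage location of the list item, not on the data that the tasks actually touch. The same idea can be sketched in host-only C with a dedicated sentinel variable. This is an illustration under that assumption, not part of the original example; `ordered_sum` and `token` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns 0 + 1 + ... + (n-1) using two ordered tasks. */
long ordered_sum(int n)
{
    int token = 0;    /* sentinel: used only as a dependence address */
    int *buf = NULL;  /* allocated by the first task, freed by the second */
    long sum = 0;

    (void)token;      /* avoids an unused-variable warning without OpenMP */
    assert(n > 0);

    #pragma omp parallel
    #pragma omp single
    {
        /* Producer: allocates and fills buf, and "writes" the sentinel. */
        #pragma omp task shared(buf) depend(out: token)
        {
            buf = malloc((size_t)n * sizeof *buf);
            assert(buf != NULL);
            for (int i = 0; i < n; i++)
                buf[i] = i;
        }

        /* Consumer: "reads" the sentinel, so it runs after the producer. */
        #pragma omp task shared(buf, sum) depend(in: token)
        {
            for (int i = 0; i < n; i++)
                sum += buf[i];
            free(buf);
            buf = NULL;
        }
        #pragma omp taskwait
    }
    return sum;
}
```

The consumer never names `buf` in its `depend` clause, yet it is ordered after the producer because both tasks list `token`. This mirrors how the Fortran tasks are ordered through _N_ while _v1_ and _v2_ are still unallocated.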

