### `nowait` Clause on `target` Construct

The following example shows how to execute code asynchronously on a device without an explicit task. The `nowait` clause on a `target` construct allows the thread of the _target task_ to perform other work while waiting for the `target` region execution to complete. Hence, the `target` region can execute asynchronously on the device (without requiring a host thread to idle while waiting for the _target task_ execution to complete).
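As a minimal sketch of the clause in isolation (not part of the original example), the function below offloads one loop with `nowait`, performs unrelated host work, and then uses a `taskwait` construct to wait for the _target task_. The name `offload_square_and_sum` and its parameters are hypothetical. If no device is available, an OpenMP implementation typically falls back to running the region on the host, and without OpenMP enabled the pragmas are ignored and the code runs serially.

```c
/* Hypothetical helper: squares a[0:n] in a deferred target region while
 * the encountering host thread sums b[0:n]. */
float offload_square_and_sum(float *a, const float *b, int n)
{
    float host_sum = 0.0f;

    /* nowait makes the target task deferrable: the encountering thread
     * does not have to wait here for the device to finish. */
    #pragma omp target teams distribute parallel for nowait map(tofrom: a[0:n])
    for (int i = 0; i < n; i++) { a[i] = a[i] * a[i]; }

    /* Independent host work that may overlap with the offloaded loop. */
    for (int i = 0; i < n; i++) { host_sum += b[i]; }

    /* Wait for the target task (a child task of the current task) before
     * the caller reads a[]. */
    #pragma omp taskwait

    return host_sum;
}
```

A caller would pass two arrays of length `n` and may read `a` only after the function returns, since the `taskwait` is what guarantees the mapped data has been copied back.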

In the `async_target.3` example below, the element-wise product of two vectors (arrays), _v1_ and _v2_, is formed. The first half of the operations is performed on the device and the second half on the host, concurrently.

After a team of threads is formed, the master thread generates the _target task_ while the other threads continue on, without a barrier, to the execution of the host portion of the vector product. The completion of the _target task_ (asynchronous target execution) is guaranteed by the synchronization in the implicit barrier at the end of the host vector-product worksharing loop region. See the `barrier` glossary entry in the OpenMP specification for details.
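The role of the implicit barrier can be illustrated with a small sketch that consumes the device results inside the same parallel region. The function name `product_then_sum`, its parameters, and the fixed chunk size are assumptions made for illustration; the offload and host loops follow the pattern described above, and the added reduction loop reads `vxv` only after the barrier that ends the host worksharing loop.

```c
/* Hypothetical helper: forms vxv = v1*v2 (first half offloaded, second half
 * on the host) and then sums vxv within the same parallel region.
 * n is assumed to be even. */
float product_then_sum(int n, const float *v1, const float *v2, float *vxv)
{
    float sum = 0.0f;

    #pragma omp parallel
    {
        /* Only the master thread generates the deferred target task. */
        #pragma omp master
        #pragma omp target teams distribute parallel for nowait \
                    map(to: v1[0:n/2], v2[0:n/2]) map(from: vxv[0:n/2])
        for (int i = 0; i < n/2; i++) { vxv[i] = v1[i] * v2[i]; }

        /* Host half; the implicit barrier at the end of this loop also
         * guarantees that the target task has completed. */
        #pragma omp for schedule(dynamic, 1000)
        for (int i = n/2; i < n; i++) { vxv[i] = v1[i] * v2[i]; }

        /* Safe to read all of vxv here, including the offloaded half. */
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < n; i++) { sum += vxv[i]; }
    }

    return sum;
}
```

If the host worksharing loop carried its own `nowait` clause, that barrier would be removed, and some other synchronization (for example an explicit `barrier`) would be needed before `vxv[0:n/2]` could be read safely.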

The host loop scheduling is `dynamic`, to balance the host thread executions, since one thread is being used for offload generation. In the situation where little time is spent by the _target task_ in setting up and tearing down the target execution, `static` scheduling may be preferable.

In [None]:
//%compiler: clang
//%cflags: -fopenmp

/*
* name: async_target.3c
* type: C
* version: omp_4.5
*/

#include <stdio.h>

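#define N 1000000      // N must be even

// Placeholder initializer (an assumption: the original example only
// declares init() and expects it to be supplied elsewhere).
void init(int n, float *v1, float *v2)
{
   for (int i = 0; i < n; i++) { v1[i] = (float)i; v2[i] = 2.0f; }
}

int main(){
   int   i, n=N;
   int   chunk=1000;
   // static storage: three 4 MB arrays may exceed the default stack limit
   static float v1[N],v2[N],vxv[N];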

   init(n, v1,v2);

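   #pragma omp parallel
   {

      // Master thread generates the deferred target task (first half);
      // the master construct has no implied barrier.
      #pragma omp master
      #pragma omp target teams distribute parallel for nowait \
                                map(to: v1[0:n/2]) \
                                map(to: v2[0:n/2]) \
                                map(from: vxv[0:n/2])
      for(i=0; i<n/2; i++){ vxv[i] = v1[i]*v2[i]; }

      // Host threads compute the second half; the implicit barrier at the
      // end of this loop also guarantees completion of the target task.
      #pragma omp for schedule(dynamic,chunk)
      for(i=n/2; i<n; i++){ vxv[i] = v1[i]*v2[i]; }

   }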
   printf(" vxv[0] vxv[n-1] %f %f\n", vxv[0], vxv[n-1]);
   return 0;
}



In [None]:

! name: async_target.3f
! type: F-free
! version: omp_4.5

program concurrent_async
   use omp_lib
   integer,parameter :: n=1000000  !!n must be even
   integer           :: i, chunk=1000
   real              :: v1(n),v2(n),vxv(n)

   call init(n, v1,v2)

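   !$omp parallel

      ! Master thread generates the deferred target task (first half);
      ! the master construct has no implied barrier.
      !$omp master
      !$omp target teams distribute parallel do nowait &
      !$omp&                    map(to: v1(1:n/2))   &
      !$omp&                    map(to: v2(1:n/2))   &
      !$omp&                    map(from: vxv(1:n/2))
      do i = 1,n/2;    vxv(i) = v1(i)*v2(i); end do
      !$omp end master

      ! Host threads compute the second half; the implicit barrier at the
      ! end of this loop also guarantees completion of the target task.
      !$omp do schedule(dynamic,chunk)
      do i = n/2+1,n;  vxv(i) = v1(i)*v2(i); end do

   !$omp end parallel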

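   print*, " vxv(1) vxv(n) :", vxv(1), vxv(n)

contains

   ! Placeholder initializer (an assumption: the original example calls
   ! init but expects it to be supplied elsewhere).
   subroutine init(m, a, b)
      integer, intent(in)  :: m
      real,    intent(out) :: a(m), b(m)
      integer :: k
      do k = 1, m
         a(k) = real(k)
         b(k) = 2.0
      end do
   end subroutine init

end program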

