Description
I have flang from a recent git commit:
$ flang --version
flang version 21.0.0git (https://github.com/llvm/llvm-project.git 40cc7b4578fd2d65aaef8356fbe7caf2d84a8f3e)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/llvm/llvm-project/install/bin
I believe the following example is valid OpenMP code:
PROGRAM reproducer
  IMPLICIT NONE
  REAL, ALLOCATABLE, TARGET, DIMENSION(:) :: arr
  INTEGER, PARAMETER :: ngrids = 2
  INTEGER, PARAMETER :: cellsperdim = 4
  INTEGER, PARAMETER :: cellspergrid = cellsperdim**3
  INTEGER :: iouter, ip3

  ALLOCATE(arr(ngrids * cellspergrid), source=-1.0)

  !$omp target teams distribute private(ip3) map(tofrom: arr)
  DO iouter = 1, ngrids
    ip3 = (iouter - 1) * cellspergrid + 1
    CALL kernel(arr(ip3))
  END DO
  !$omp end target teams distribute

  PRINT *, arr
  DEALLOCATE(arr)

CONTAINS

  SUBROUTINE kernel(gridarr)
    !$omp declare target
    ! Subroutine arguments
    REAL, INTENT(INOUT), DIMENSION(cellsperdim, cellsperdim, cellsperdim) :: gridarr
    ! Local variables
    INTEGER :: i, j, k

    !$omp parallel do collapse(2) private(i, j, k) shared(gridarr)
    DO i = 1, cellsperdim
      DO j = 1, cellsperdim
        DO k = 1, cellsperdim
          gridarr(k, j, i) = REAL(i)
        END DO
      END DO
    END DO
    !$omp end parallel do
  END SUBROUTINE kernel

END PROGRAM reproducer
This program works with both gfortran and Cray ftn. I can compile it just fine with flang:
$ flang -O3 -fopenmp -fopenmp-version=52 -fopenmp-targets=nvptx64 flang-omp-device-bug.F90
but running it with mandatory offloading gives incorrect results:
$ OMP_TARGET_OFFLOAD=mandatory ./a.out
-1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
-1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
-1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
-1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
-1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
-1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
-1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
Disabling offloading gives the expected results:
$ OMP_TARGET_OFFLOAD=disabled ./a.out
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 4. 4. 4. 4.
4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 3. 3. 3. 3. 3. 3. 3. 3.
3. 3. 3. 3. 3. 3. 3. 3. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4.
I created an equivalent example in C, and gcc, clang (from the same git commit as flang), and Cray cc all run that example just fine, both with and without offloading.
When I remove the omp parallel do in the kernel subroutine, the example prints the expected result, both with and without offloading.
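For reference, the variant without the nested parallel region might look like the sketch below. It assumes the only change is deleting the `!$omp parallel do` / `!$omp end parallel do` pair from `kernel`; everything else is copied from the reproducer above, and the program name `reproducer_noparallel` is made up for illustration.

```fortran
PROGRAM reproducer_noparallel  ! hypothetical name, not from the original report
  IMPLICIT NONE
  REAL, ALLOCATABLE, TARGET, DIMENSION(:) :: arr
  INTEGER, PARAMETER :: ngrids = 2
  INTEGER, PARAMETER :: cellsperdim = 4
  INTEGER, PARAMETER :: cellspergrid = cellsperdim**3
  INTEGER :: iouter, ip3

  ALLOCATE(arr(ngrids * cellspergrid), source=-1.0)

  ! Same offloaded outer loop as in the reproducer: one iteration per grid.
  !$omp target teams distribute private(ip3) map(tofrom: arr)
  DO iouter = 1, ngrids
    ip3 = (iouter - 1) * cellspergrid + 1
    CALL kernel(arr(ip3))
  END DO
  !$omp end target teams distribute

  PRINT *, arr
  DEALLOCATE(arr)

CONTAINS

  SUBROUTINE kernel(gridarr)
    !$omp declare target
    REAL, INTENT(INOUT), DIMENSION(cellsperdim, cellsperdim, cellsperdim) :: gridarr
    INTEGER :: i, j, k

    ! Assumed change: the nested "parallel do collapse(2)" directive pair is
    ! removed, so the loop nest runs serially within each team.
    DO i = 1, cellsperdim
      DO j = 1, cellsperdim
        DO k = 1, cellsperdim
          gridarr(k, j, i) = REAL(i)
        END DO
      END DO
    END DO
  END SUBROUTINE kernel

END PROGRAM reproducer_noparallel
```

Built with the same flang command line as above, this isolates the nested `parallel do` inside the `declare target` routine as the construct that triggers the wrong results.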
For offloading I have an Nvidia RTX 4080 Super, except with Cray where I am on a shared HPC system.