
Defect: Image number dependent MPI_Win_lock error #737

Open
Oiubrab opened this issue Aug 6, 2021 · 1 comment
Oiubrab commented Aug 6, 2021

System Information:

  • OpenCoarrays Version: OpenCoarrays Coarray Fortran Compiler Wrapper (caf version 2.9.2)
  • Fortran Compiler: GNU Fortran (GCC) 11.1.0
  • C compiler used for building lib:
  • Installation method: install.sh
  • All flags & options passed to the installer: all default
  • Output of uname -a: Linux manjaro 5.13.5-1-MANJARO #1 SMP PREEMPT Mon Jul 26 07:43:29 UTC 2021 x86_64 GNU/Linux
  • MPI library being used: openmpi 4.1.1-2
  • Machine architecture and number of physical cores: 8th/9th Gen Core 8-core Desktop Processor Host Bridge/DRAM Registers [Coffee Lake S] (16 threads)
  • cmake version 3.21.1

Note: running mpicc -show gives /opt/nvidia/hpc_sdk/Linux_x86_64/21.1/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/mpicc: error while loading shared libraries: libnvcpumath.so: cannot open shared object file: No such file or directory

The issue

What I was trying to do

I was trying to run four concurrent images of my compiled code, found at https://github.com/Oiubrab/byinheritance, by executing `sudo chmod u+x i_am_in_command.zsh && ./i_am_in_command.zsh clean 2 test print`. The motivation is described in the GitHub README at that link; in short, I have written a neural network in Fortran that computes a trading action. The pertinent command is the line `cafrun -n 4 --use-hwthread-cpus ./lack_of_comprehension $3` in the aforementioned shell script.

What Happened

When this line is run, an MPI error is generated. Having added two print statements to trace where the failure occurs (Place gives the order in which the statements appear in the code), I get:

Invalid Trades:
[1, 0, 0, 0, 0]
0
 
Network Choice:
[0, 0, 0, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0]
[0, 0, 0]
 
Market Prices and Info:
{'stock_identifier': 'SE1', 'stock_number': 1, 'stock_price': 0.33, 'units_owned': 0}
{'stock_identifier': 'ADV', 'stock_number': 2, 'stock_price': 0.001, 'units_owned': 0}
{'stock_identifier': 'SBR', 'stock_number': 3, 'stock_price': 0.115, 'units_owned': 0}
 
Account Position:
{'account': 'test', 'account_value': 3000.0, 'time': 1628241351.6863492}
 
run:  1
 image number:           1 Place:           1
 image number:           2 Place:           1
 image number:           3 Place:           1
 image number:           4 Place:           1
 image number:           2 Place:           2
 image number:           3 Place:           2
 image number:           1 Place:           2
[manjaro:25102] *** An error occurred in MPI_Win_detach
[manjaro:25102] *** reported by process [3482124289,0]
[manjaro:25102] *** on win rdma window 5
[manjaro:25102] *** MPI_ERR_UNKNOWN: unknown error
[manjaro:25102] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[manjaro:25102] ***    and potentially your MPI job)
[manjaro:25098] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[manjaro:25098] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error: Command:
   `/usr/bin/mpiexec -n 4 --use-hwthread-cpus ./lack_of_comprehension test`
failed to run.
 
Invalid Trades:
[0, 0, 1, 0, 0]
2
 
Network Choice:
[0, 0, 0, 0, 0, 0, 0] [1, 0, 0, 0, 0, 0, 0] [0, 1, 0, 1, 0, 0, 0]
[0, -1, -10]
 
Market Prices and Info:
{'stock_identifier': 'SE1', 'stock_number': 1, 'stock_price': 0.33, 'units_owned': 0}
{'stock_identifier': 'ADV', 'stock_number': 2, 'stock_price': 0.001, 'units_owned': 0}
{'stock_identifier': 'SBR', 'stock_number': 3, 'stock_price': 0.115, 'units_owned': 0}
 
Account Position:
{'account': 'test', 'account_value': 3000.0, 'time': 1628241360.06828}
 
run:  2
 image number:           1 Place:           1
 image number:           2 Place:           1
 image number:           3 Place:           1
 image number:           4 Place:           1
 image number:           1 Place:           2
 image number:           2 Place:           2
 image number:           3 Place:           2
[manjaro:25174] *** An error occurred in MPI_Win_detach
[manjaro:25174] *** reported by process [3486973953,2]
[manjaro:25174] *** on win rdma window 5
[manjaro:25174] *** MPI_ERR_UNKNOWN: unknown error
[manjaro:25174] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[manjaro:25174] ***    and potentially your MPI job)
[manjaro:25168] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[manjaro:25168] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error: Command:
   `/usr/bin/mpiexec -n 4 --use-hwthread-cpus ./lack_of_comprehension test`
failed to run.
 
Invalid Trades:
[0, 0, 1, 0, 0]
2
 
Network Choice:
[1, 0, 0, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0] [0, 1, 0, 0, 0, 0, 0]
[-1, 0, -2]
 
Market Prices and Info:
{'stock_identifier': 'SE1', 'stock_number': 1, 'stock_price': 0.39, 'units_owned': 0}
{'stock_identifier': 'ADV', 'stock_number': 2, 'stock_price': 0.001, 'units_owned': 0}
{'stock_identifier': 'SBR', 'stock_number': 3, 'stock_price': 0.12, 'units_owned': 0}
 
Account Position:
{'account': 'test', 'account_value': 3000.0, 'time': 1628241366.8761365}

What I expected to happen

Markets and network choices vary; that is expected. What is not expected is the error, and the fact that the fourth image never reaches the second place. I should see the output above without the error, and with an image number: 4 Place: 2 line. This exact code (minus the print statements) ran without a hitch on the previous version of OpenMPI (openmpi-4.0.5-3-x86_64). I have since tried other OpenCoarrays programs I have written and hit various errors when running with fewer than six threads.

Step by step reproduction

This error can be reproduced by following the execution above. As the error appears to be code agnostic, you can also run the steps below to reproduce a similar error (again, this code ran correctly before the update):

step 1

Take the following code and save it as an f95 file (e.g. test_arraymove.f95):

program test_arraycom

  real, dimension(10), codimension[*] :: x, y
  integer :: num_img, me

  num_img = num_images()
  me = this_image()
  print *, me, num_img

  ! Some code here
  x(2) = x(3)[6]    ! get value from image 6
  x(6)[4] = x(1)    ! put value on image 4
  x(:)[2] = y(:)    ! put array on image 2
  sync all

  ! Remote-to-remote array transfer
  if (me == 1) then
    y(:)[num_img] = x(:)[4]
    sync images (num_img)
  else if (me == num_img) then
    sync images ([1])
  end if

  x(1:10:2) = y(1:10:2)[4]  ! strided get from image 4

end program
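Note that the test program hard-codes coindices [2], [4] and [6], so with fewer than six images some of those coindices name nonexistent images, which may relate to the MPI_ERR_RANK: invalid rank message below. For comparison, a guarded variant (a sketch of mine, with hypothetical num_img checks that are not in the original snippet) only references images that exist and should be valid at any image count:

```fortran
program test_arraymove_guarded

  real, dimension(10), codimension[*] :: x, y
  integer :: num_img, me

  num_img = num_images()
  me = this_image()

  ! Only perform a remote access when the target image actually exists
  if (num_img >= 6) x(2) = x(3)[6]   ! get value from image 6
  if (num_img >= 4) x(6)[4] = x(1)   ! put value on image 4
  if (num_img >= 2) x(:)[2] = y(:)   ! put array on image 2
  sync all

end program
```

If the guarded variant runs cleanly at -n 4 while the original fails, that would suggest the failure is triggered specifically by the out-of-range coindices rather than by the communication pattern itself.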

step 2

compile the code with caf test_arraymove.f95 -o programname

step 3

run the code with the number of images, $2, below 6

i.e. cafrun -n $2 --use-hwthread-cpus ./programname

step 4

get an error of the form:

           1           4
           2           4
           3           4
           4           4
[manjaro:28814] *** An error occurred in MPI_Win_lock
[manjaro:28814] *** reported by process [3708682241,1]
[manjaro:28814] *** on win rdma window 6
[manjaro:28814] *** MPI_ERR_RANK: invalid rank
[manjaro:28814] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[manjaro:28814] ***    and potentially your MPI job)
[manjaro:28809] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[manjaro:28809] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error: Command:
   `/usr/bin/mpiexec -n 4 --use-hwthread-cpus ./testarraymove`
failed to run.
stale bot commented Sep 7, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the stale label Sep 7, 2021