Intel compilers require some help #156

Closed
sbryngelson opened this issue May 23, 2023 · 16 comments · Fixed by #161
Labels
bug Something isn't working or doesn't seem right

Comments

@sbryngelson
Member

sbryngelson commented May 23, 2023

There seems to be some issue with OpenMPI + Intel compilers (I used openmpi/4.0.2-intel20.4 on Bridges2).

Update: same issue below occurs with intelmpi

  • Adding add_compile_options(-O1) to CMakeLists.txt is required to pass the tests
  • export FC=ifort; export CC=icc; export CXX=icc; is required before compiling (shouldn't CMake pick this up automatically? It otherwise tries to use GNU); see the sketch below
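
A sketch of the full workaround (the module name comes from above; passing -O1 on the CMake command line should be roughly equivalent to editing CMakeLists.txt for the Fortran sources, but I have not verified the exact MFC build invocation):

    # load the MPI + Intel module used above
    module load openmpi/4.0.2-intel20.4
    # force CMake onto the Intel toolchain instead of GNU
    export FC=ifort
    export CC=icc
    export CXX=icc
    # lower the optimization level: either add
    #   add_compile_options(-O1)
    # to CMakeLists.txt, or pass something like
    #   cmake -DCMAKE_Fortran_FLAGS=-O1 ...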

If one removes the optimization flag above, then we get the error

   [ 97%]  Time step       50 of 51 @ t_step = 49
   NaN(s) in timestep output.           0           0           0           0
            50          49          39           0
  NaN(s) in timestep output.

for the test tests/55533234: 2D -> bc=-1...

So this issue appears not to be related to viscous + bubbles; rather, the failing case is simply the first 2D test case. I think the confusion comes from the fact that the test immediately before this one is viscous + bubbles, but after double-checking, that one does indeed pass fine. This 2D case quickly throws NaNs.

This issue is related to the comments on PR: #152

@sbryngelson sbryngelson added the bug label May 23, 2023
@sbryngelson
Member Author

From @lee-hyeoksu in the PR:

I tried to figure out what happens in test 555332334 and I found that the NaNs actually occur at the first time step when populating buffers in s_populate_conservative_variables_buffers in m_rhs.fpp.

In this subroutine, global parameters bc_s%beg and bc_s%end are used for s-direction (s = x, y, or z) to populate buffers based on BCs. However, in the subroutine, the values for these parameters are 0, although they should be some negative integers. So the subroutine does not assign appropriate values to q_cons_qp in buffer regions, which leads to NaNs.

bc_x and bc_y become 0 right after returning from s_read_data_files(q_cons_ts(1)%vf) (https://github.com/MFlowCode/MFC/blob/master/src/simulation/p_main.fpp#L185).
Still, I am not sure why this happens with the Intel compilers, so I will keep looking into this issue; I just wanted to share. Any suggestions would be appreciated!

@sbryngelson
Member Author

One question is whether this issue has existed in previous versions of MFC, or if some newly introduced feature created it. @lee-hyeoksu, can you try downloading and testing older MFC versions (like before your viscous+bubbles PR) to see if they pass this test without the add_compile_options flag in CMakeLists.txt?

@sbryngelson
Member Author

sbryngelson commented May 23, 2023

Regarding @lee-hyeoksu's finding quoted above:

I'm not sure whether this is actually an issue or not. You might notice the same behavior in other/older versions of MFC. This is because of a call to an MPI routine that sets the BCs for each processor (even though this is a 1-rank case). I think @anandrdbz can explain better.

I could be wrong about the details here, though, and this may actually be a bug.

@sbryngelson sbryngelson changed the title OpenMPI + Intel compilers requires some help Intel compilers requires some help May 23, 2023
@sbryngelson sbryngelson changed the title Intel compilers requires some help Intel compilers require some help May 23, 2023
@anandrdbz
Contributor

anandrdbz commented May 24, 2023

@lee-hyeoksu the single-rank case should be analogous to a periodic boundary condition, wherein you're essentially performing an MPI_SENDRECV from rank 0 to rank 0.

So this issue will occur only when bc_s is -1, which I presume is the test that fails.

If you wish to turn off this feature, you can go to line 186 in src/simulation/m_mpi_proxy.fpp and remove parallel_io from the if condition (change if (num_procs == 1 .and. parallel_io) to if (num_procs == 1)). I'm not sure why parallel_io was added here in the first place.

This should produce exactly the same results for all of the test cases running on 1 rank. If it fails on 2 ranks, where this can't be circumvented, the issue would pertain to a bug in the MPI bindings with the Intel compilers.
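
For concreteness, here is a minimal, self-contained sketch (not MFC code; all names are made up for illustration) of the mechanism described above, in which a periodic boundary on a single rank reduces to an MPI_SENDRECV from rank 0 to itself:

    program single_rank_sendrecv
        ! Illustrative only: one rank exchanges a small buffer with itself,
        ! which is what a periodic BC reduces to when num_procs == 1.
        use mpi
        implicit none
        integer :: ierr, rank, stat(MPI_STATUS_SIZE)
        real(kind(0d0)) :: send_buf(4), recv_buf(4)
        call MPI_INIT(ierr)
        call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
        send_buf = 1d0
        ! send to and receive from the same rank (the periodic wrap-around)
        call MPI_SENDRECV(send_buf, 4, MPI_DOUBLE_PRECISION, rank, 0, &
                          recv_buf, 4, MPI_DOUBLE_PRECISION, rank, 0, &
                          MPI_COMM_WORLD, stat, ierr)
        print *, 'rank', rank, 'received', recv_buf
        call MPI_FINALIZE(ierr)
    end program single_rank_sendrecv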

@anandrdbz
Contributor

anandrdbz commented May 24, 2023

@lee-hyeoksu A small correction: you would have to change the if condition in pre_process, simulation, and post_process, in the s_mpi_decompose_computational_domain subroutine in m_mpi_proxy.fpp.

@lee-hyeoksu
Contributor

@anandrdbz I ran the tests after removing parallel_io from the if condition in m_mpi_proxy.fpp in pre_process, simulation, and post_process as you noted, and a segmentation fault occurred for all tests.

[Screenshot: test output showing the segmentation faults]

@anandrdbz
Contributor

anandrdbz commented May 24, 2023

@lee-hyeoksu I think it's a mistake on my end; you only need to do this in simulation and post_process, not in pre_process, as the resetting does not occur there.

Simulation should surely work without these changes. However, I feel post_process might actually still cause a bug, as offset_s gets set there. So maybe you can try it once with the changes only in simulation, and then try adding post_process if the bug persists.

@lee-hyeoksu
Contributor

@anandrdbz Then pre_process works, but the same segmentation fault appears in the simulation part.

@anandrdbz
Contributor

anandrdbz commented May 24, 2023

@lee-hyeoksu I'm unable to see what's causing the seg fault at the moment, but another, more laborious yet robust, way to circumvent this issue is to explicitly enforce (bc_s == -1 .and. num_procs > 1) in the if conditions preceding the mpi_cart_rank calls in this subroutine (for simulation and post_process).

As an example, on line 507 in simulation, you would replace if (proc_coords(1) > 0 .or. bc_x%beg == -1) with if (proc_coords(1) > 0 .or. (bc_x%beg == -1 .and. num_procs > 1)).

You'd have to do this for all six if conditions in simulation and post_process, and revert to including parallel_io in the if condition for all modules.

Perhaps there's a variable that gets used at a later point in this subroutine which is currently causing the seg fault.

@sbryngelson
Member Author

sbryngelson commented May 24, 2023

Check this out: https://github.com/MFlowCode/MFC/actions/runs/5069572284/jobs/9103376183#step:5:814

The modular fp PR #119 is failing on the exact same test case when switching from double to single precision (it passes in double precision). I'm not sure what changes you made there, @wilfonba, but they seem somehow (bizarrely) related to this Intel issue.

@anandrdbz
Contributor

anandrdbz commented May 25, 2023

@sbryngelson My latest PR should fix the issue with the Intel compilers. I didn't really change anything fundamental in the code; I just added a NaN check on the receive buffers in the MPI module (the details are in the PR). This points to a compiler bug in Intel MPI rather than a bug in the code.
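
For reference, a minimal sketch of what such a check might look like (this is not the actual code from the PR; the subroutine name and arguments here are made up):

    ! Scan an MPI receive buffer and report the first NaN encountered.
    subroutine s_check_recv_buff_for_nan(recv_buff, buff_size, rank)
        use ieee_arithmetic, only: ieee_is_nan
        implicit none
        integer, intent(in) :: buff_size, rank
        real(kind(0d0)), dimension(buff_size), intent(in) :: recv_buff
        integer :: i
        do i = 1, buff_size
            if (ieee_is_nan(recv_buff(i))) then
                print *, 'NaN in MPI recv buffer at index', i, 'on rank', rank
                exit
            end if
        end do
    end subroutine s_check_recv_buff_for_nan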

As far as modular-fp is concerned, it seems to fail for this case with the NVHPC compilers; however, it fails on another test case with GCC. I think it might be a coincidence that it failed on this particular case with NVHPC, since this case is also the first 2D case.

@sbryngelson
Member Author

@anandrdbz Be careful: the issue seems to cause the run to fail only some of the time, so if you test, you should test a few times. For now, I have also removed the Intel compiler from the CI, though if the PR fixes the issue I will add it back.

@anandrdbz
Contributor

@sbryngelson The CI is very finicky; it seems to randomly fail some of the tests (most recently it failed a GPU test, while an identical commit had passed on GPUs an hour earlier but randomly crashed on a CPU test). So I'm not sure what to do except re-run it multiple times.

@sbryngelson
Member Author

I will re-run the master branch CI to see if this is an issue with the PR(s) or something else.

@henryleberre
Member

@sbryngelson CMake doesn't automatically pick the Intel compilers when they are loaded because others are also installed, usually GNU and possibly LLVM/Clang. If these environment variables are set, they override the regular compiler selection process.
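
For reference, two generic ways (a sketch, not MFC's exact build invocation) to point CMake at the Intel toolchain:

    # 1) environment variables, read by CMake on the first configure:
    export CC=icc CXX=icpc FC=ifort
    # 2) or pass the compilers explicitly on the command line:
    cmake -DCMAKE_C_COMPILER=icc \
          -DCMAKE_CXX_COMPILER=icpc \
          -DCMAKE_Fortran_COMPILER=ifort \
          <path-to-source>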

@anandrdbz
Contributor

Yes, I had to use export CC=icc; export CXX=icpc; export FC=ifort to get the Intel compilers to work.

@sbryngelson sbryngelson linked a pull request Aug 1, 2023 that will close this issue