Intel compilers require some help #156

Closed
sbryngelson opened this issue May 23, 2023 · 16 comments · Fixed by #161
Labels
bug Something isn't working or doesn't seem right

Comments

@sbryngelson
Member

sbryngelson commented May 23, 2023

There seems to be some issue with OpenMPI + Intel compilers (I used openmpi/4.0.2-intel20.4 on Bridges2).

Update: same issue below occurs with intelmpi

  • Adding add_compile_options(-O1) to CMakeLists.txt is required to pass the tests
  • export FC=ifort; export CC=icc; export CXX=icc; is required before compiling (shouldn't CMake pick this up automatically? It otherwise tries to use GNU); see the sketch below
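
A sketch of the full workaround (the module name comes from above; passing -O1 on the CMake command line should be roughly equivalent to editing CMakeLists.txt for the Fortran sources, but I have not verified the exact MFC build invocation):

    # load the MPI + Intel module used above
    module load openmpi/4.0.2-intel20.4
    # force CMake onto the Intel toolchain instead of GNU
    export FC=ifort
    export CC=icc
    export CXX=icc
    # lower the optimization level: either add
    #   add_compile_options(-O1)
    # to CMakeLists.txt, or pass something like
    #   cmake -DCMAKE_Fortran_FLAGS=-O1 ...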

If one removes the optimization flag above, then we get the error

   [ 97%]  Time step       50 of 51 @ t_step = 49
   NaN(s) in timestep output.           0           0           0           0
            50          49          39           0
  NaN(s) in timestep output.

for the test tests/55533234: 2D -> bc=-1...

So this issue appears not to be related to viscous + bubbles; rather, the failing case is simply the first 2D test case. I think the confusion comes from the fact that the test immediately before this one is viscous + bubbles, but after double-checking, that one does indeed pass fine. This 2D case quickly throws NaNs.

This issue is related to the comments on PR: #152

@sbryngelson sbryngelson added the bug label May 23, 2023
@sbryngelson
Member Author

From @lee-hyeoksu in the PR:

I tried to figure out what happens in test 555332334 and I found that the NaNs actually occur at the first time step when populating buffers in s_populate_conservative_variables_buffers in m_rhs.fpp.

In this subroutine, global parameters bc_s%beg and bc_s%end are used for s-direction (s = x, y, or z) to populate buffers based on BCs. However, in the subroutine, the values for these parameters are 0, although they should be some negative integers. So the subroutine does not assign appropriate values to q_cons_qp in buffer regions, which leads to NaNs.

bc_x and bc_y become 0 right after returning from s_read_data_files(q_cons_ts(1)%vf) (https://github.com/MFlowCode/MFC/blob/master/src/simulation/p_main.fpp#L185).
Still, I am not sure why this happens with the Intel compilers, so I will keep looking into this issue; I just wanted to share. Any suggestions would be appreciated!

@sbryngelson
Member Author

One question is whether this issue has existed in previous versions of MFC, or if some newly introduced feature created it. @lee-hyeoksu, can you try downloading and testing older MFC versions (like before your viscous+bubbles PR) to see if they pass this test without the add_compile_options flag in CMakeLists.txt?

@sbryngelson
Member Author

sbryngelson commented May 23, 2023

Regarding @lee-hyeoksu's finding quoted above:

I'm not sure whether this is actually an issue or not. You might notice the same behavior in other/older versions of MFC. This is because of a call to an MPI routine that sets the BCs for each processor (even though this is a 1-rank case). I think @anandrdbz can explain better.

I could be wrong about the details here, though, and this may actually be a bug.

@sbryngelson sbryngelson changed the title OpenMPI + Intel compilers requires some help Intel compilers requires some help May 23, 2023
@sbryngelson sbryngelson changed the title Intel compilers requires some help Intel compilers require some help May 23, 2023
@anandrdbz
Contributor

anandrdbz commented May 24, 2023

@lee-hyeoksu the single-rank case should be analogous to a periodic boundary condition, wherein you're essentially performing an MPI_SENDRECV from rank 0 to rank 0.

So this issue will occur only when bc_s is -1, which I presume is the test that fails.

If you wish to turn off this feature, you can go to line 186 in src/simulation/m_mpi_proxy.fpp and remove parallel_io from the if condition (change if (num_procs == 1 .and. parallel_io) to if (num_procs == 1)). I'm not sure why parallel_io was added here in the first place.

This should produce exactly the same results for all of the test cases running on 1 rank. If it fails on 2 ranks, where this can't be circumvented, the issue would pertain to a bug in the MPI bindings with the Intel compilers.
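
For concreteness, here is a minimal, self-contained sketch (not MFC code; all names are made up for illustration) of the mechanism described above, in which a periodic boundary on a single rank reduces to an MPI_SENDRECV from rank 0 to itself:

    program single_rank_sendrecv
        ! Illustrative only: one rank exchanges a small buffer with itself,
        ! which is what a periodic BC reduces to when num_procs == 1.
        use mpi
        implicit none
        integer :: ierr, rank, stat(MPI_STATUS_SIZE)
        real(kind(0d0)) :: send_buf(4), recv_buf(4)
        call MPI_INIT(ierr)
        call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
        send_buf = 1d0
        ! send to and receive from the same rank (the periodic wrap-around)
        call MPI_SENDRECV(send_buf, 4, MPI_DOUBLE_PRECISION, rank, 0, &
                          recv_buf, 4, MPI_DOUBLE_PRECISION, rank, 0, &
                          MPI_COMM_WORLD, stat, ierr)
        print *, 'rank', rank, 'received', recv_buf
        call MPI_FINALIZE(ierr)
    end program single_rank_sendrecv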

@anandrdbz
Contributor

anandrdbz commented May 24, 2023

@lee-hyeoksu A small correction: you would have to change the if condition in pre_process, simulation, and post_process, in the s_mpi_decompose_computational_domain subroutine in m_mpi_proxy.fpp.

@lee-hyeoksu
Contributor

@anandrdbz I ran the tests after removing parallel_io from the if condition in m_mpi_proxy.fpp in pre_process, simulation, and post_process as you noted, and a segmentation fault occurred for all tests.

[Screenshot: test output showing the segmentation faults]

@anandrdbz
Contributor

anandrdbz commented May 24, 2023

@lee-hyeoksu I think it's a mistake on my end; you only need to do this in simulation and post_process, not in pre_process, as the resetting does not occur there.

Simulation should surely work without these changes. However, I feel post_process might actually still cause a bug, as offset_s gets set there. So maybe you can try it once with the changes only in simulation, and then try adding post_process if the bug persists.

@lee-hyeoksu
Contributor

@anandrdbz Then pre_process works, but the same segmentation fault appears in the simulation part.

@anandrdbz
Contributor

anandrdbz commented May 24, 2023

@lee-hyeoksu I'm unable to see what's causing the seg fault at the moment, but another, more laborious yet robust, way to circumvent this issue is to explicitly enforce (bc_s == -1 .and. num_procs > 1) in the if conditions preceding the mpi_cart_rank calls in this subroutine (for simulation and post_process).

As an example, on line 507 in simulation, you would replace if (proc_coords(1) > 0 .or. bc_x%beg == -1) with if (proc_coords(1) > 0 .or. (bc_x%beg == -1 .and. num_procs > 1)).

You'd have to do this for all six if conditions in simulation and post_process, and revert to including parallel_io in the if condition for all modules.

Perhaps there's a variable that gets used at a later point in this subroutine which is currently causing the seg fault.

@sbryngelson
Member Author

sbryngelson commented May 24, 2023

Check this out: https://github.com/MFlowCode/MFC/actions/runs/5069572284/jobs/9103376183#step:5:814

The modular fp PR #119 is failing on the exact same test case when switching from double to single precision (it passes in double precision). I'm not sure what changes you made there, @wilfonba, but they seem somehow (bizarrely) related to this Intel issue.

@anandrdbz
Contributor

anandrdbz commented May 25, 2023

@sbryngelson My latest PR should fix the issue with the Intel compilers. I didn't really change anything fundamental in the code; I just added a NaN check on the receive buffers in the MPI module (the details are in the PR). This points to a compiler bug in Intel MPI rather than a bug in the code.
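
For reference, a minimal sketch of what such a check might look like (this is not the actual code from the PR; the subroutine name and arguments here are made up):

    ! Scan an MPI receive buffer and report the first NaN encountered.
    subroutine s_check_recv_buff_for_nan(recv_buff, buff_size, rank)
        use ieee_arithmetic, only: ieee_is_nan
        implicit none
        integer, intent(in) :: buff_size, rank
        real(kind(0d0)), dimension(buff_size), intent(in) :: recv_buff
        integer :: i
        do i = 1, buff_size
            if (ieee_is_nan(recv_buff(i))) then
                print *, 'NaN in MPI recv buffer at index', i, 'on rank', rank
                exit
            end if
        end do
    end subroutine s_check_recv_buff_for_nan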

As far as modular-fp is concerned, it seems to fail for this case with the NVHPC compilers; however, it fails on another test case with GCC. I think it might be a coincidence that it failed on this particular case with NVHPC, since this case is also the first 2D case.

@sbryngelson
Member Author

@anandrdbz Be careful: the issue seems to cause the run to fail only some of the time, so if you test, you should test a few times. For now, I have also removed the Intel compiler from the CI, though if the PR fixes the issue I will add it back.

@anandrdbz
Contributor

@sbryngelson The CI is very finicky; it seems to randomly fail some of the tests (most recently it failed a GPU test, while an identical commit had passed on GPUs an hour earlier but randomly crashed on a CPU test). So I'm not sure what to do except re-run it multiple times.

@sbryngelson
Member Author

I will re-run the master branch CI to see if this is an issue with the PR(s) or something else.

@henryleberre
Member

@sbryngelson CMake doesn't automatically pick the Intel compilers when they are loaded because others are also installed, usually GNU and possibly LLVM/Clang. If these environment variables are set, they override the regular compiler selection process.
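
For reference, two generic ways (a sketch, not MFC's exact build invocation) to point CMake at the Intel toolchain:

    # 1) environment variables, read by CMake on the first configure:
    export CC=icc CXX=icpc FC=ifort
    # 2) or pass the compilers explicitly on the command line:
    cmake -DCMAKE_C_COMPILER=icc \
          -DCMAKE_CXX_COMPILER=icpc \
          -DCMAKE_Fortran_COMPILER=ifort \
          <path-to-source>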

@anandrdbz
Contributor

Yes, I had to use export CC=icc; export CXX=icpc; export FC=ifort to get the Intel compilers to work.

@sbryngelson sbryngelson linked a pull request Aug 1, 2023 that will close this issue