
Conversation

@fvitt (Contributor) commented Jan 9, 2024

Enable threading on Derecho.
Remove the default setting of MPICH_MPIIO_HINTS on Derecho, which seems to degrade I/O performance for normal-resolution configurations.
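(For context, the entry being removed lives in the Derecho section of config_machines.xml. A minimal sketch of its shape is below; the hint string is hypothetical, not the exact value that was removed:)

  <env name="MPICH_MPIIO_HINTS">*:romio_cb_write=enable</env>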

Test suite:

  PASS ERC_D_Ln9.ne16_ne16_mg17.QPC5HIST.derecho_intel.cam-outfrq3s_usecase
  PASS ERC_Ln9_P64x4.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s
  PASS ERP_Ln9_P512x4.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s
  PASS SMS_D_Ln9.f09_f09_mg17.FCvbsxHIST.derecho_intel.cam-outfrq9s
  PASS SMS_D_Ln9.f19_f19_mg17.FWma2000climo.derecho_intel.cam-outfrq9s
  PASS SMS_D_Ln9_P512x4.f19_f19_mg17.FX2000.derecho_intel.cam-outfrq9s
  PASS SMS_Ld1.ne30pg3_ne30pg3_mg17.FC2010climo.derecho_intel.cam-outfrq1d
  PASS SMS_Lm13.f10_f10_mg37.F2000climo.derecho_intel.cam-outfrq1m

Test baseline:

    PASS ERC_D_Ln9.ne16_ne16_mg17.QPC5HIST.derecho_intel.cam-outfrq3s_usecase BASELINE cam_cesm2_2_rel_09:
    PASS SMS_D_Ln9.f09_f09_mg17.FCvbsxHIST.derecho_intel.cam-outfrq9s BASELINE cam_cesm2_2_rel_09:
    PASS SMS_D_Ln9.f19_f19_mg17.FWma2000climo.derecho_intel.cam-outfrq9s BASELINE cam_cesm2_2_rel_09:
    PASS SMS_Ld1.ne30pg3_ne30pg3_mg17.FC2010climo.derecho_intel.cam-outfrq1d BASELINE cam_cesm2_2_rel_09:
    PASS SMS_Lm13.f10_f10_mg37.F2000climo.derecho_intel.cam-outfrq1m BASELINE cam_cesm2_2_rel_09:

Test namelist changes: N/A

Test status: bit for bit unchanged

        modified:   config/cesm/machines/config_machines.xml
<arg name="label"> --label</arg>
<arg name="buffer"> --line-buffer</arg>
<arg name="num_tasks" > -n {{ total_tasks }}</arg>
<arg name="label">--label</arg>
Contributor

This change is for cesm2.2 - we have added an mpibind script provided by CISL here. Can you compare and contrast these two methods so that we can move forward with a consistent approach?

Contributor Author

I was not aware of the mpibind script. I will give that a try.

<env name="FI_CXI_RX_MATCH_MODE">hybrid</env>
<env name="FI_MR_CACHE_MONITOR">memhooks</env>
</environment_variables>
<!--
Contributor

Can you document the performance difference that you saw?

Contributor Author

Performance in WACCMX is degraded quite a bit.

FX2000 at f19 resolution:
  with the `MPICH_MPIIO_HINTS` setting:
    Model Throughput:         0.19   simulated_years/day 
  without the `MPICH_MPIIO_HINTS` setting:
    Model Throughput:         0.54   simulated_years/day

However, I see MPICH_MPIIO_HINTS does improve performance of F2000climo.

F2000climo at f09 resolution:
  with the `MPICH_MPIIO_HINTS` setting:
    Model Throughput:         9.30   simulated_years/day
  without the `MPICH_MPIIO_HINTS` setting:
    Model Throughput:         7.90   simulated_years/day

Contributor

In this case, is FX2000 using threads while F2000climo is not?

Contributor Author

Neither of those cases uses threads (NTHRDS=1).

Contributor

Thanks for clarifying.

Collaborator

Hi Francis and Jim, just want to provide some data points from my threading tests for comparison:

F2000climo at f09 resolution, one CPU node on Derecho (128 CPU cores):
1. no threading, use "mpiexec -np 128": 1.349 SYPD
2. no threading, use "mpibind" for 128 MPI tasks: 1.367 SYPD
3. w/ threading (64 MPI tasks and 2 threads per MPI task), use "mpibind": 1.506 SYPD

If Francis is observing a performance degradation here, I would like to understand whether this is due to the compset, resolution, threading, or other settings, and provide feedback to CISL later if an improvement can be made on their end.
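(For anyone wanting to reproduce configuration 3 above in a CIME case, a sketch using standard xmlchange commands; per-component task settings may differ:)

  ./xmlchange NTASKS=64,NTHRDS=2
  ./case.setup --reset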

Contributor

@jedwards4b left a comment

Francis, thank you for this. Can you please document performance using the PFS test and compare your changes to the mpibind script?
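(For reference, a PFS performance test can be generated with CIME's create_test; the compset and grid below are illustrative:)

  ./create_test PFS.f09_f09_mg17.FX2000.derecho_intel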

@fvitt (Contributor Author) commented Jan 10, 2024

@sjsprecious
When using the mpibind script some of my jobs fail with these errors:

cat: '/glade/derecho/scratch/fvitt/tmp/mpibind.log.*.tmp': No such file or directory
rm: cannot remove '/glade/derecho/scratch/fvitt/tmp/mpibind.log.*.tmp': No such file or directory

Could this be caused by running several test jobs simultaneously?

@sjsprecious (Collaborator)

> @sjsprecious When using the mpibind script some of my jobs fail with these errors:
>
>   cat: '/glade/derecho/scratch/fvitt/tmp/mpibind.log.*.tmp': No such file or directory
>   rm: cannot remove '/glade/derecho/scratch/fvitt/tmp/mpibind.log.*.tmp': No such file or directory
>
> Could this be caused by running several test jobs simultaneously?

It could be. However, the error message indicates that it is just trying to remove some temporary files from the tmp folder, which should not fail your test directly. I will forward this issue to Rory and see if he has better insight here.

@sjsprecious (Collaborator)

@fvitt CISL has updated the mpibind script to address your issue. Let me know if it works for your simulations. Thanks.

…CAM/CAM-Chem/WACCM(X) at 1 and 2 degrees

        modified:   config/cesm/machines/config_machines.xml
Contributor

@jedwards4b left a comment

Thank you

@jedwards4b merged commit 01e481f into ESMCI:maint-5.8_5.8.32 on Jan 17, 2024
@jedwards4b deleted the derecho_mods branch January 17, 2024 17:43
@fvitt (Contributor Author) commented Jan 17, 2024

@jedwards4b and @sjsprecious
I am having issues with these WACCMX ERP tests hanging with mpibind.

  ERP_D_Ln9_P256x4.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s
  ERP_Ln9_P512x4.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s

This CAM ERP test passes with mpibind:

ERP_D_Ln9_P256x4.f09_f09_mg17.F2000climo.derecho_intel.cam-outfrq9s

The above ERP FX2000 tests pass when I use the binding arguments to mpiexec...
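(The binding arguments referred to are presumably along these lines, using cray-pals mpiexec options; the counts below are a sketch for the P256x4 layout, not the exact command used:)

  # 256 ranks x 4 threads on 8 Derecho nodes, one rank per 4-core slot
  mpiexec -n 256 --ppn 32 --depth 4 --cpu-bind depth ./cesm.exe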

@jedwards4b (Contributor)

@fvitt I've let Rory know. Can you provide any details about where it's hanging?

@sjsprecious (Collaborator)

Hi @fvitt, when you say "hanging", do you mean your job never runs on Derecho, or that it dies without a clear error message?

Can you send me the path to the output logs on Derecho so that I can take a look?

In addition, is running <128 CPU cores per node on Derecho still a problem for you?

@fvitt (Contributor Author) commented Jan 17, 2024

@jedwards4b and @sjsprecious
The initial run of the test completes okay. The tests hang in the "case2run" portion of the test, where ntasks and nthreads are halved; the second run then errors out after hitting the wall-clock limit.

For example see:

/glade/derecho/scratch/fvitt/ERP_Ln9_P512x4.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s.cesm22_test/run/case2run

@jedwards4b (Contributor)

Can you run, for example, an SMS run of this compset with the pelayout of case2?

@fvitt (Contributor Author) commented Jan 17, 2024

> Can you run, for example, an SMS run of this compset with the pelayout of case2?

Yes, this passed: SMS_Ln9_P256x2.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s

@sjsprecious (Collaborator)

Hi @fvitt, thanks for sharing the case directory. I looked at the .case.test file, and the PBS resource request is specified as select=16:ncpus=128:mpiprocs=32:ompthreads=4. Therefore, when you perform an ERP test and ntasks/nthreads are halved, you will again undersubscribe a full node. This reminds me of the problem you reported before about using 36 CPU cores per node (NCAR/mpibind#4); according to Rory's reply, mpibind cannot handle this case properly. This is my naive explanation of the ERP failure here, but @jedwards4b may have more insights. Also, this does not explain why F2000climo passes while FX2000 fails. 😕
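(Spelling out the arithmetic under that select statement: case1 fills the nodes, while the halved case2 layout leaves most cores idle.)

  # case1: 16 nodes x 32 mpiprocs x 4 ompthreads = 512 tasks, 128/128 cores per node
  #PBS -l select=16:ncpus=128:mpiprocs=32:ompthreads=4
  # case2: 256 tasks x 2 threads = 512 cores over the same 16 nodes,
  #        i.e. only 32 of 128 cores per node are in use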

@fvitt (Contributor Author) commented Jan 17, 2024

> In addition, is running <128 CPU cores per node on Derecho still a problem for you?

Yes, this is still a problem.

@jedwards4b (Contributor)

I think the <128 one is on me; I need to modify config_batch.xml to handle this case.

@jedwards4b (Contributor)

@fvitt please update your cime branch maint-5.8_5.8.32 and try again on < 128.

@fvitt (Contributor Author) commented Jan 17, 2024

> @fvitt please update your cime branch maint-5.8_5.8.32 and try again on < 128.

This passes now with the latest on the maint-5.8_5.8.32 branch.

@fvitt (Contributor Author) commented Jan 17, 2024

Could it be that our setting of OMP_STACKSIZE is not being used by mpibind?
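(The setting in question would be an env entry in config_machines.xml along these lines; the 64M value is illustrative:)

  <environment_variables>
    <env name="OMP_STACKSIZE">64M</env>
  </environment_variables>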

