Enable threading and improve IO performance for normal resolutions for cesm2.2 on derecho #4559
modified: config/cesm/machines/config_machines.xml
| <arg name="label"> --label</arg> | ||
| <arg name="buffer"> --line-buffer</arg> | ||
| <arg name="num_tasks" > -n {{ total_tasks }}</arg> | ||
| <arg name="label">--label</arg> |
This change is for cesm2.2. We have added an mpibind script provided by CISL here. Can you compare and contrast these two methods so that we can move forward with a consistent approach?
I was not aware of the mpibind script. I will give that a try.
| <env name="FI_CXI_RX_MATCH_MODE">hybrid</env> | ||
| <env name="FI_MR_CACHE_MONITOR">memhooks</env> | ||
| </environment_variables> | ||
| <!-- |
Can you document the performance difference that you saw?
The performance is degraded in waccmx quite a bit.
FX2000 at f19 resolution:
with the `MPICH_MPIIO_HINTS` setting:
Model Throughput: 0.19 simulated_years/day
without the `MPICH_MPIIO_HINTS` setting:
Model Throughput: 0.54 simulated_years/day
However, I see MPICH_MPIIO_HINTS does improve performance of F2000climo.
F2000climo at f09 resolution:
with the `MPICH_MPIIO_HINTS` setting:
Model Throughput: 9.30 simulated_years/day
without the `MPICH_MPIIO_HINTS` setting:
Model Throughput: 7.90 simulated_years/day
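To put the four throughput numbers above on a common scale, the relative change can be worked out directly. This is only a minimal sketch: the helper name `percent_change` and the dictionary layout are illustrative assumptions, and the values are copied from this comment.

```python
def percent_change(with_setting: float, without_setting: float) -> float:
    """Percent change in throughput with the setting enabled, relative to disabled."""
    return (with_setting - without_setting) / without_setting * 100.0


# Model throughput in simulated_years/day, copied from the comment above:
# (with MPICH_MPIIO_HINTS, without MPICH_MPIIO_HINTS)
runs = {
    "FX2000 f19": (0.19, 0.54),
    "F2000climo f09": (9.30, 7.90),
}

for name, (with_hints, without_hints) in runs.items():
    change = percent_change(with_hints, without_hints)
    print(f"{name}: {change:+.1f}% with MPICH_MPIIO_HINTS")
```

By this measure the setting costs FX2000 roughly 65% of its throughput and gains F2000climo roughly 18%.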
In this case is fx2000 using threads while f2000climo is not?
Neither of those cases is using threads (NTHRDS=1).
thanks for clarifying
Hi Francis and Jim, just want to provide some data points from my threading tests for comparison:
F2000climo at f09 resolution, one CPU node on Derecho (128 CPU cores):
1. no threading, use "mpiexec -np 128": 1.349 SYPD
2. no threading, use "mpibind" for 128 MPI tasks: 1.367 SYPD
3. w/ threading (64 MPI tasks and 2 threads per MPI task), use "mpibind": 1.506 SYPD
If Francis is observing a performance degradation here, I would like to understand whether this is due to compset, resolution, threading, or other settings, and provide feedback to CISL later if there is an improvement that can be made on CISL's end.
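For reference, the three single-node data points above can be expressed as gains over the plain `mpiexec` run. This is a rough sketch only: the names `BASELINE_SYPD` and `gain_over_baseline` are illustrative assumptions, and the SYPD values are taken from the list in this comment.

```python
# SYPD for F2000climo at f09 on one Derecho CPU node, from the list above.
BASELINE_SYPD = 1.349  # no threading, "mpiexec -np 128"

configs = {
    "mpibind, 128 MPI tasks, no threading": 1.367,
    "mpibind, 64 MPI tasks x 2 threads": 1.506,
}


def gain_over_baseline(sypd: float, baseline: float = BASELINE_SYPD) -> float:
    """Percent throughput gain of a configuration relative to the baseline run."""
    return (sypd / baseline - 1.0) * 100.0


for label, sypd in configs.items():
    print(f"{label}: {gain_over_baseline(sypd):+.1f}% vs. mpiexec baseline")
```

On these numbers, mpibind alone is within about 1% of the baseline, while mpibind with two threads per task is about 12% faster.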
jedwards4b
left a comment
Francis - thank you for this - Can you please document performance using the PFS test and compare your changes to the mpibind script?
|
@sjsprecious Could this be caused by running several test jobs simultaneously? |
It could be. However, the error message indicates that it is just trying to remove some temporary files from the tmp folder, which should not fail your test directly. I will forward this issue to Rory and see if he has a better insight here. |
|
@fvitt CISL has updated the |
…CAM/CAM-Chem/WACCM(X) at 1 and 2 degrees
modified: config/cesm/machines/config_machines.xml
jedwards4b
left a comment
Thank you
|
@jedwards4b and @sjsprecious This CAM ERP test passes with mpibind: The above ERP FX2000 tests pass when I use the binding arguments to mpiexec... |
|
@fvitt I've let Rory know - can you provide any details about where it's hanging? |
|
Hi @fvitt , when you say "hanging", do you mean your job never runs on Derecho or it dies without a clear error message? Can you send me the path to the output logs on Derecho so that I can take a look? In addition, is running <128 CPU cores per node on Derecho still a problem for you? |
|
@jedwards4b and @sjsprecious For example see: |
|
Can you run, for example, an SMS run of this compset with the pelayout of case2? |
Yes, this passed: |
|
Hi @fvitt, thanks for sharing the case directory. I looked at the |
Yes this is still a problem. |
|
I think the < 128 one is on me, I need to modify config_batch.xml to handle this case. |
|
@fvitt please update your cime branch maint-5.8_5.8.32 and try again on < 128. |
This passes now with the latest in maint-5.8_5.8.32 branch. |
|
Could it be that our setting of OMP_STACKSIZE is not being used by mpibind? |
Enable threading on derecho.
Remove the default setting of MPICH_MPIIO_HINTS on derecho, which seems to degrade IO performance for normal-resolution configurations.
Test suite:
Test baseline:
Test namelist changes: N/A
Test status: bit for bit unchanged