Switch from run scripts to compass run command #76

Merged
merged 4 commits into from
Apr 26, 2021

xylar
Collaborator

@xylar xylar commented Apr 18, 2021

To run test cases and steps, rather than ./run.py, you now run:

python -m compass run

If you want to run a suite, from the root of the work directory, run:

python -m compass run <suite>

instead of ./<suite>.py.

The compass run command fits the style of the package much better than ./run.py and does not require writing python scripts from templates.

This merge also updates some docstrings for better clarity on what "inputs" and "outputs" of a step refer to, and on where super().run() needs to get called in a test case's overridden version of the run() method.
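To illustrate the point about where `super().run()` is called, here is a minimal sketch. The classes `TestCase` and `InitTestCase`, the `add_step()` helper and the `log` attribute are simplified stand-ins invented for this example and are not the actual compass classes; the only idea taken from the description above is that an overridden `run()` does its own work around a call to `super().run()`, which is what actually runs the steps.

```python
class TestCase:
    """Simplified stand-in for a test-case base class (not the compass API)."""

    def __init__(self, name):
        self.name = name
        self.steps = []  # list of (step_name, callable) pairs, run in order
        self.log = []    # records what happened, for illustration only

    def add_step(self, step_name, func):
        self.steps.append((step_name, func))

    def run(self):
        # The base class is responsible for running every step.
        for step_name, func in self.steps:
            self.log.append(f"run step: {step_name}")
            func()


class InitTestCase(TestCase):
    """Hypothetical test case that overrides run()."""

    def run(self):
        # Work that must happen before any step runs goes first.
        self.log.append("configure steps")
        # Without this call, none of the steps would run.
        super().run()
        # Work that depends on step outputs goes after.
        self.log.append("validate outputs")


case = InitTestCase("init")
case.add_step("mesh", lambda: None)
case.add_step("initial_state", lambda: None)
case.run()
print("\n".join(case.log))
```

In this sketch, anything placed before `super().run()` happens before the steps and anything after it can rely on the steps having completed.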

The merge also adds docstrings for run_step() and run_test_case(). The three run_*() functions have been moved to the run module, which makes more conceptual sense than having them in the suite, testcase and step modules because they all get called only when compass run gets called.

closes #71

@xylar xylar added clean-up python package DEPRECATED: PRs and Issues involving the python package (master branch) labels Apr 18, 2021
@xylar xylar self-assigned this Apr 18, 2021
@xylar xylar added this to In progress in compass 1.0 via automation Apr 18, 2021
@xylar xylar added in progress This PR is not ready for review or merging and removed in progress This PR is not ready for review or merging labels Apr 18, 2021
@xylar xylar marked this pull request as ready for review April 23, 2021 21:18
@xylar
Collaborator Author

xylar commented Apr 23, 2021

@mark-petersen and @vanroekel, this change was suggested by @matthewhoffman in his review but we decided to wait until after #28 to discuss it.

However, it would be really good to have it in soon if we want to make this switch because:

  1. it touches code I want to change in other PRs
  2. we wouldn't want to get too used to the "old" way of running tests if we're going to make this switch.

What I'm looking for in a review is for each of you to run:

  1. a test case
  2. a test suite

Let me know if:

  1. it works as expected
  2. you're happy with the change

@vanroekel
Collaborator

@xylar, sure! can do. Will try to do this by early next week.

@vanroekel
Collaborator

@xylar just tried this on anvil. the test case worked great this way, but when I tried running the nightly I get an error

python -m compass run nightly
ocean/baroclinic_channel/10km/default
  PASS
ocean/baroclinic_channel/10km/threads_test
  PASS
ocean/baroclinic_channel/10km/decomp_test
  FAIL see: case_outputs/ocean_baroclinic_channel_10km_decomp_test.log
ocean/baroclinic_channel/10km/restart_test
  PASS
ocean/global_ocean/QU240/mesh
qt.qpa.xcb: could not connect to display localhost:18.0
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: eglfs, minimal, minimalegl, offscreen, vnc, webgl, xcb.

Aborted (core dumped)

Is it expected to try to visualize even from a compute node on anvil? For completeness, my setup command is

python -m compass suite -t nightly -f ocn.cfg -w /lcrc/group/e3sm/ac.vanroekel/nightly1 -c ocean -s

@xylar
Collaborator Author

xylar commented Apr 24, 2021

Thanks @vanroekel. That's odd because I always run on Anvil compute nodes and I've never seen this before. I'll make sure I can reproduce it.

@xylar
Collaborator Author

xylar commented Apr 24, 2021

@vanroekel, I haven't been able to reproduce this. Here is what I did and maybe you can tell me if anything stands out as obviously different.

I'm running with intel18 libraries, following E3SM:

$ cat /home/ac.xylar/mpas-work/compass/add_machine_modules/anvil_intel18_mvapich.sh 
source /lcrc/soft/climate/e3sm-unified/base/etc/profile.d/conda.sh
conda activate test_compass_1.0

module purge
module load cmake/3.14.2-gvwazz3
module load intel/18.0.4-62uvgmb
module load intel-mkl/2018.4.274-jwaeshj
module load netcdf/4.4.1-fijcsqi
module load netcdf-cxx/4.2-cixenix
module load netcdf-fortran/4.4.4-mmtrep3
module load mvapich2/2.2-verbs-m57bia7
module load parallel-netcdf/1.11.0-ny4vo3o

export NETCDF=$(dirname $(dirname $(which nc-config)))
export NETCDFF=$(dirname $(dirname $(which nf-config)))
export PNETCDF=$(dirname $(dirname $(which pnetcdf-config)))

export PIO=/lcrc/soft/climate/compass/anvil/compass-1.0.0/scorpio-1.1.6/intel18/mvapich
export ESMF=/lcrc/soft/climate/compass/anvil/compass-1.0.0/esmf-8.1.0/intel18/mvapich

export I_MPI_CC=icc
export I_MPI_CXX=icpc
export I_MPI_F77=ifort
export I_MPI_F90=ifort
export MV2_ENABLE_AFFINITY=0
export MV2_SHOW_CPU_BINDING=1

export AUTOCLEAN=true
export USE_PIO2=true
export HDF5_USE_FILE_LOCKING=FALSE

I built the code using the submodule in the master branch:

$ pwd
/home/ac.xylar/mpas-work/compass/compass
$ source /home/ac.xylar/mpas-work/compass/add_machine_modules/anvil_intel18_mvapich.sh 
$ git submodule update --init --recursive
$ cd MPAS-Model/ocean/develop/
$ make CORE=ocean ifort

I used git worktree to create a directory for this branch:

$ git worktree add ../switch_to_compass_run
$ cd ../switch_to_compass_run/
$ git reset --hard xylar/compass/switch_to_compass_run

I didn't make a config file because that isn't necessary for supported machines. Instead, I did:

python -m compass suite -c ocean -t nightly -m anvil -w /lcrc/group/e3sm/ac.xylar/compass_1.0/anvil/test_20210424/compass_run -s -p ../compass/MPAS-Model/ocean/develop/

In a new terminal (just to make sure modules and conda environments are reset), I did:

$ cd /lcrc/group/e3sm/ac.xylar/compass_1.0/anvil/test_20210424/compass_run
$ srun -A condo -p acme-small -N 1 -t 4:00:00 --pty bash

On the compute node, I did:

$ source /home/ac.xylar/mpas-work/compass/add_machine_modules/anvil_intel18_mvapich.sh 
python -m compass run nightly

I notice from the log file in your run case_outputs/ocean_baroclinic_channel_10km_decomp_test.log that it seems like you were somehow restricted to 4 processors, so the decomposition test case failed. I wonder if that difference might be related. What was your command for getting the compute node?
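One quick way to check how many cores a process can actually use on a node is sketched below. The helper names `usable_cores()` and `has_enough_cores()` are hypothetical and not part of compass. `os.sched_getaffinity()` is only available on some platforms (such as Linux), so the sketch falls back to `os.cpu_count()` elsewhere.

```python
import os


def usable_cores():
    """Return the number of cores this process is allowed to use."""
    try:
        # Respects CPU affinity restrictions, e.g. those set by a job scheduler.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Platform without sched_getaffinity: fall back to the total core count.
        return os.cpu_count() or 1


def has_enough_cores(required):
    """Return True if at least `required` cores are usable."""
    return usable_cores() >= required


print(f"usable cores: {usable_cores()}")
```

Running this on the compute node before launching a suite would show whether the session is limited to fewer cores than a test case needs.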

This fits the style of the package much better and does not
require writing python scripts from templates.

If you are in a test case or step, you now simply run:
```
python -m compass run
```

If you want to run a suite, from the root of the work directory,
run:
```
python -m compass run <suite>
```

Add some clarity about inputs and outputs to steps, and on
where super().run() needs to get called in a test case's overridden
version of this method.

Add docstrings for run_step() and run_test_case()

This makes more conceptual sense than having them in the suite,
testcase and step modules because they all get called only when
`compass run` gets called.

@vanroekel
Collaborator

Thanks for the reply @xylar and your detailed instructions. I was pretty convinced this was on my end. I will try again following your instructions in the comment above later today, but to answer your question, I request nodes as you write in your instructions, i.e.

srun -A condo -p acme-small -N 1 -t 08:00:00 --pty bash

@xylar
Collaborator Author

xylar commented Apr 24, 2021

Thanks @vanroekel. Could you share the contents of ocn.cfg in case it matters?

@vanroekel
Collaborator

I took another shot at this and did get past the weird errors, which I think were due to a very out-of-date ocn.cfg file. Following your config-file-free approach, I pass all baroclinic channel tests, but on QU240 I get this error:

  File "/lcrc/soft/climate/e3sm-unified/base/envs/test_compass_1.0/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/lcrc/soft/climate/e3sm-unified/base/envs/test_compass_1.0/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/lcrc/group/e3sm/ac.vanroekel/nightlyTest/ocean/global_ocean/QU240/PHC/init/compass/__main__.py", line 58, in <module>
    main()
  File "/lcrc/group/e3sm/ac.vanroekel/nightlyTest/ocean/global_ocean/QU240/PHC/init/compass/__main__.py", line 54, in main
    commands[args.command]()
  File "/lcrc/group/e3sm/ac.vanroekel/nightlyTest/ocean/global_ocean/QU240/PHC/init/compass/run.py", line 150, in main
    run_test_case()
  File "/lcrc/group/e3sm/ac.vanroekel/nightlyTest/ocean/global_ocean/QU240/PHC/init/compass/run.py", line 115, in run_test_case
    test_case.run()
  File "/lcrc/group/e3sm/ac.vanroekel/nightlyTest/ocean/global_ocean/QU240/PHC/init/compass/ocean/tests/global_ocean/init/__init__.py", line 98, in run
    super().run()
  File "/lcrc/group/e3sm/ac.vanroekel/nightlyTest/ocean/global_ocean/QU240/PHC/init/compass/testcase.py", line 149, in run
    self._run_step(step, new_log_file)
  File "/lcrc/group/e3sm/ac.vanroekel/nightlyTest/ocean/global_ocean/QU240/PHC/init/compass/testcase.py", line 224, in _run_step
    step.run()
  File "/lcrc/group/e3sm/ac.vanroekel/nightlyTest/ocean/global_ocean/QU240/PHC/init/compass/ocean/tests/global_ocean/init/initial_state.py", line 163, in run
    add_mesh_and_init_metadata(self.outputs, config,
  File "/lcrc/group/e3sm/ac.vanroekel/nightlyTest/ocean/global_ocean/QU240/PHC/init/compass/ocean/tests/global_ocean/metadata.py", line 72, in add_mesh_and_init_metadata
    metadata = _get_metadata(dsInit, config)
  File "/lcrc/group/e3sm/ac.vanroekel/nightlyTest/ocean/global_ocean/QU240/PHC/init/compass/ocean/tests/global_ocean/metadata.py", line 167, in _get_metadata
    _get_conda_package_version(package)
  File "/lcrc/group/e3sm/ac.vanroekel/nightlyTest/ocean/global_ocean/QU240/PHC/init/compass/ocean/tests/global_ocean/metadata.py", line 177, in _get_conda_package_version
    if parts[0] == package:
IndexError: list index out of range

This looks like an issue in grabbing the compass version? All the QU240 logs have this error.
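The traceback ends at `parts[0]`, and one common way for that expression to raise `IndexError` is splitting a blank line, which yields an empty list. The sketch below only illustrates that failure mode and a guard against it. It is not the code in `metadata.py`; the function name `find_package_version()`, the `name version build` line layout and the sample listing are all assumptions made for the example.

```python
def find_package_version(lines, package):
    """Return the version of `package` from whitespace-separated
    'name version build' lines, or None if it is not listed."""
    for line in lines:
        parts = line.split()
        # A blank line splits into [], so parts[0] would raise IndexError.
        if not parts or parts[0].startswith("#"):
            continue
        if parts[0] == package and len(parts) > 1:
            return parts[1]
    return None


# Hypothetical listing, including a comment line and blank lines.
listing = [
    "# name  version  build",
    "",
    "example_pkg  1.2.3  py_0",
    "   ",
    "other_pkg    0.4.0  py_1",
]

print(find_package_version(listing, "example_pkg"))
print(find_package_version(listing, "missing_pkg"))
```

Returning `None` for a missing package lets the caller decide whether a missing version is an error or can simply be left out of the metadata.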

I've used the same modules and conda load command as you list above.

The nightly directory is

/lcrc/group/e3sm/ac.vanroekel/nightlyTest

However, this appears unrelated to this PR: I just tried with compass/master and got the same error. This has to be on my end still. I'm happy to approve this PR based on the successful tests, if that is okay with you. I think the python -m compass run is a good change.

@xylar
Collaborator Author

xylar commented Apr 25, 2021

@vanroekel, thanks very much for testing again!

The issue you are seeing above was fixed in #79. I verified that I rebased this branch onto master after merging #79. Could you do a fetch and hard reset on this branch (and master) and make sure it works for you now?

@vanroekel
Collaborator

@xylar I cleaned my repo and reset to master and then reset to this PR; in both cases the nightly completed successfully. I must have missed a reset --hard in previous testing. I will approve as it worked perfectly. Sorry for the false alarms, and thanks for helping me work through them!

Collaborator

@vanroekel vanroekel left a comment

approved based on visual inspection and successful testing on anvil of baroclinic channel and nightly test suite. Thanks @xylar!

Don't try to delete the script file (which we no longer create)
during suite clean-up.
@xylar
Collaborator Author

xylar commented Apr 26, 2021

@vanroekel, thanks for reviewing this so promptly, for putting up with the moving target, and for uncovering some odd errors we'll need to keep an eye out for in case they occur again.

@matthewhoffman
Member

@xylar , I have one other comment about these changes. I saw you first added the run method to each class and then later moved the run function for each operation (step, testcase, suite) into the run module. I can see arguments for either organization. I think I agree with your choice to move them to the run module. However, the one exception would be if there was ever a reason to have a child class override the default run method. I cannot think of a situation where one might want to do that (and in fact it sounds dangerous/undesirable to allow it). Is that your assessment as well?

@xylar
Collaborator Author

xylar commented Apr 26, 2021

> However, the one exception would be if there was ever a reason to have a child class override the default run method. I cannot think of a situation where one might want to do that (and in fact it sounds dangerous/undesirable to allow it). Is that your assessment as well?

These are functions, not methods, so they aren't part of the class and are not available to be overridden. I do not think we would ever want that. To me, that is a strong reason to take them out of the respective modules where the base classes are defined because there is a chance folks would expect them to be part of the classes, whereas they are part of the infrastructure not available for altering.
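A minimal sketch of the distinction follows. The `Step` and `Forward` classes are simplified stand-ins, and the `run_step(step)` signature is an assumption made for this example rather than the actual signature in the run module. What it shows is that a subclass can override the `run()` method, while a module-level function is not an attribute of the class and so cannot be replaced by subclassing.

```python
class Step:
    """Simplified stand-in for a step base class (not the compass API)."""

    def __init__(self, name):
        self.name = name

    def run(self):
        # Intended to be overridden by each concrete step.
        raise NotImplementedError


class Forward(Step):
    """Hypothetical concrete step that overrides run()."""

    def run(self):
        return f"{self.name}: model run"


def run_step(step):
    """Module-level infrastructure function.

    It lives outside the class, so subclasses of Step cannot override it;
    they only control what step.run() does.
    """
    print(f"starting {step.name}")
    result = step.run()
    print(f"finished {step.name}")
    return result


result = run_step(Forward("forward"))
print(result)
```

Keeping `run_step()` outside the module that defines `Step` makes it less likely that someone will mistake it for a method they are expected to customize.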

@matthewhoffman
Member

Thanks, @xylar. Yes, I see they are functions now - I was just thinking through the alternate possibility of methods in the respective classes. But we are both in agreement that there is no use (or desire) for that design choice, so the way you've implemented it here makes sense. I don't have any other comments on the PR.

@xylar
Collaborator Author

xylar commented Apr 26, 2021

@matthewhoffman, great, thanks for the discussion. I completely agree that it's worth having. In general, it's really helpful to have you taking a careful look at these design choices and giving feedback. It really makes it feel like I'm not developing in my own bubble and helps to keep the design consistent.

@mark-petersen
Collaborator

@xylar, in the current set-up, it is obvious to the user that the run steps are contained in run.py, and it is easy to copy from that script to the command line, or to only run parts of it.

In this new setup, how do we see and alter the steps? Is there a way to do that locally in the run directory?

@xylar
Collaborator Author

xylar commented Apr 26, 2021

> @xylar, in the current set-up, it is obvious to the user that the run steps are contained in run.py, and it is easy to copy from that script to the command line, or to only run parts of it.
>
> In this new setup, how do we see and alter the steps? Is there a way to do that locally in the run directory?

No, there is not, but that isn't something that changed with this PR; that was already true of #28. There was a ./run.py that wasn't easy for a user to understand and didn't let you edit which steps you run.

I agree with you that this is likely to be a feature we want. I don't entirely agree that the old scripts were easy to copy from and edit but that was possible if you wanted to dig into the script. Would you be willing to make an issue with this feature request and we can address it in another PR? I could imagine a few different ways to handle this and we should discuss it before I try implementing anything.

Collaborator

@mark-petersen mark-petersen left a comment

I tested on grizzly. Ran the nightly test suite with ocean/develop. I agree that this format for run fits better into the python command format. Thanks!

@xylar
Collaborator Author

xylar commented Apr 26, 2021

Thanks very much, @mark-petersen. Feel free to make that issue I mentioned above.

@xylar xylar merged commit 858ddf3 into MPAS-Dev:master Apr 26, 2021
compass 1.0 automation moved this from In progress to Done Apr 26, 2021
@xylar xylar deleted the switch_to_compass_run branch April 26, 2021 21:06
Labels
clean-up python package DEPRECATED: PRs and Issues involving the python package (master branch)
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

consider switching to compass run instead of ./run.py
4 participants