
workflow/hosts.py returns hercules on orion following rocky 9 upgrade #2695

Closed
RussTreadon-NOAA opened this issue Jun 18, 2024 · 5 comments
Labels: bug, triage

Comments

@RussTreadon-NOAA
Contributor

What is wrong?

Following the Orion Rocky 9 upgrade, workflow/hosts.py returns machine=HERCULES when executed on Orion.

What should have happened?

workflow/hosts.py should return machine=ORION when executed on Orion.

What machines are impacted?

Orion

Steps to reproduce

  1. clone g-w develop on Orion
  2. write short script to execute Host()
#!/usr/bin/env python3
# Run from the workflow/ directory so that hosts.py is importable
from hosts import Host


def main():
    host = Host()
    print(" ")
    print(f"Host() is {host}")
    print(" ")


if __name__ == '__main__':
    main()
  3. execute script on Orion and get
orion-login-4:/work2/noaa/da/rtreadon/git/global-workflow/develop/workflow$ ./test.py

machine is HERCULES
info is {'BASE_GIT': '/work/noaa/global/glopara/git_rocky9', 'DMPDIR': '/work/noaa/rstprod/dump', 'BASE_CPLIC': '/work/noaa/global/glopara/data/ICSDIR/prototype_ICs', 'PACKAGEROOT': '/work/noaa/global/glopara/nwpara', 'COMINsyn': '/work/noaa/global/glopara/com/gfs/prod/syndat', 'HOMEDIR': '/work/noaa/global/${USER}', 'STMP': '/work/noaa/stmp/${USER}/HERCULES', 'PTMP': '/work/noaa/stmp/${USER}/HERCULES', 'NOSCRUB': '$HOMEDIR', 'SCHEDULER': 'slurm', 'ACCOUNT': 'fv3-cpu', 'ACCOUNT_SERVICE': 'fv3-cpu', 'QUEUE': 'batch', 'QUEUE_SERVICE': 'batch', 'PARTITION_BATCH': 'hercules', 'PARTITION_SERVICE': 'service', 'RESERVATION': '', 'CHGRP_RSTPROD': 'YES', 'CHGRP_CMD': 'chgrp rstprod', 'HPSSARCH': 'NO', 'HPSS_PROJECT': 'emc-global', 'LOCALARCH': 'NO', 'ATARDIR': '${NOSCRUB}/archive_rotdir/${PSLOT}', 'MAKE_NSSTBUFR': 'NO', 'MAKE_ACFTBUFR': 'NO', 'SUPPORTED_RESOLUTIONS': ['C1152', 'C768', 'C384', 'C192', 'C96', 'C48'], 'COMINecmwf': '/work/noaa/global/glopara/data/external_gempak/ecmwf', 'COMINnam': '/work/noaa/global/glopara/data/external_gempak/nam', 'COMINukmet': '/work/noaa/global/glopara/data/external_gempak/ukmet'}
scheduler is slurm

Additional information

The detect method in hosts.py contains

        elif os.path.exists('/work/noaa'):
            if os.path.exists('/apps/other'):
                machine = 'HERCULES'
            else:
                machine = 'ORION'

This logic no longer works on Orion following the Rocky 9 upgrade: the directory /apps/other now exists on Orion, so machine is set to HERCULES.

Do you have a proposed solution?

g-w ush/detect_machine.sh has similar faulty logic:

elif [[ -d /work ]]; then
  # We are on MSU Orion or Hercules
  if [[ -d /apps/other ]]; then
    # We are on Hercules
    MACHINE_ID=hercules
  else
    MACHINE_ID=orion
  fi

Execution of the above on Orion now returns MACHINE_ID=hercules. However, before this section, detect_machine.sh also uses hostname -f to set MACHINE_ID. For Orion and Hercules, the script has the lines

  Orion-login-[1-4].HPC.MsState.Edu) MACHINE_ID=orion ;; ### orion1-4

  [Hh]ercules-login-[1-4].[Hh][Pp][Cc].[Mm]s[Ss]tate.[Ee]du) MACHINE_ID=hercules ;; ### hercules1-4

This section of the script correctly sets MACHINE_ID=orion.

Python's socket.gethostname() and platform.node() both return the hostname. Add these to the test Python script:

#!/usr/bin/env python3
import platform
import socket

from hosts import Host


def main():
    host = Host()
    print(" ")
    print(f"Host() is {host}")
    print(" ")

    # Hostname as reported by the socket module
    hostname = socket.gethostname()
    print(" ")
    print(f"socket.gethostname() is {hostname}")
    print(" ")

    # Hostname as reported by the platform module
    node = platform.node()
    print(" ")
    print(f"platform.node() is {node}")
    print(" ")


if __name__ == '__main__':
    main()

Execute on Orion and get

machine is HERCULES
info is {'BASE_GIT': '/work/noaa/global/glopara/git_rocky9', 'DMPDIR': '/work/noaa/rstprod/dump', 'BASE_CPLIC': '/work/noaa/global/glopara/data/ICSDIR/prototype_ICs', 'PACKAGEROOT': '/work/noaa/global/glopara/nwpara', 'COMINsyn': '/work/noaa/global/glopara/com/gfs/prod/syndat', 'HOMEDIR': '/work/noaa/global/${USER}', 'STMP': '/work/noaa/stmp/${USER}/HERCULES', 'PTMP': '/work/noaa/stmp/${USER}/HERCULES', 'NOSCRUB': '$HOMEDIR', 'SCHEDULER': 'slurm', 'ACCOUNT': 'fv3-cpu', 'ACCOUNT_SERVICE': 'fv3-cpu', 'QUEUE': 'batch', 'QUEUE_SERVICE': 'batch', 'PARTITION_BATCH': 'hercules', 'PARTITION_SERVICE': 'service', 'RESERVATION': '', 'CHGRP_RSTPROD': 'YES', 'CHGRP_CMD': 'chgrp rstprod', 'HPSSARCH': 'NO', 'HPSS_PROJECT': 'emc-global', 'LOCALARCH': 'NO', 'ATARDIR': '${NOSCRUB}/archive_rotdir/${PSLOT}', 'MAKE_NSSTBUFR': 'NO', 'MAKE_ACFTBUFR': 'NO', 'SUPPORTED_RESOLUTIONS': ['C1152', 'C768', 'C384', 'C192', 'C96', 'C48'], 'COMINecmwf': '/work/noaa/global/glopara/data/external_gempak/ecmwf', 'COMINnam': '/work/noaa/global/glopara/data/external_gempak/nam', 'COMINukmet': '/work/noaa/global/glopara/data/external_gempak/ukmet'}
scheduler is slurm


Host() is <hosts.Host object at 0x7f4107a45fd0>


socket.gethostname() is orion-login-4.hpc.msstate.edu


platform.node() is orion-login-4.hpc.msstate.edu

Can we use socket.gethostname() or platform.node() to return the login hostname and from this set machine accordingly?
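As a rough illustration of the question above, here is a minimal sketch of hostname-based detection. The function name machine_from_hostname and the prefix patterns are hypothetical and not taken from hosts.py; they only assume that Orion and Hercules hostnames begin with "orion-" and "hercules-", as the login node names shown in this issue do. Compute node names may differ and would need to be checked.

```python
import re
import socket

# Hypothetical patterns for illustration only. They assume hostnames start with
# the machine name, as the login nodes in this issue do
# (e.g. orion-login-4.hpc.msstate.edu). Matching is case-insensitive.
_HOST_PATTERNS = {
    "ORION": re.compile(r"^orion-", re.IGNORECASE),
    "HERCULES": re.compile(r"^hercules-", re.IGNORECASE),
}


def machine_from_hostname(hostname):
    """Return 'ORION' or 'HERCULES' if the hostname matches, else None."""
    for machine, pattern in _HOST_PATTERNS.items():
        if pattern.match(hostname):
            return machine
    return None


if __name__ == "__main__":
    # On a login node this would print the machine inferred from the hostname
    print(machine_from_hostname(socket.gethostname()))
```

Returning None for an unrecognized hostname leaves room for a caller to fall back to some other check rather than guessing.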

@aerorahul
Contributor

Thanks @RussTreadon-NOAA
The filesystem-based machine detection in detect_machine.sh is there to aid detection on compute nodes. On compute nodes, hostname -f does not always return the same string as on a login node.
Using socket is fine in hosts.py.
Would you have a solution for the part of detect_machine.sh that runs on a compute node?

@RussTreadon-NOAA
Contributor Author

Thank you @aerorahul for the information. I was unaware of this fact. I do not have a solution for the compute node section of detect_machine.sh.

@DavidHuber-NOAA
Contributor

We may be able to grep /etc/fstab (or the output of df) for orion-nfs or hercules-nfs to discern between the two machines.

@DavidHuber-NOAA
Contributor

Better yet:

[[ $(findmnt -n -o SOURCE /home) =~ "hercules" ]] && echo "Hercules"
[[ $(findmnt -n -o SOURCE /home) =~ "orion" ]] && echo "Orion"
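For the Python side, the same idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the mount source of /home is assumed to contain "orion" or "hercules" as suggested above, and the match here is case-insensitive, unlike the shell test. The findmnt options are the ones used in the shell lines above.

```python
import subprocess


def machine_from_mount_source(source):
    """Classify a mount source string, such as the output of
    `findmnt -n -o SOURCE /home`. Returns None if neither name appears."""
    lowered = source.lower()
    if "hercules" in lowered:
        return "HERCULES"
    if "orion" in lowered:
        return "ORION"
    return None


def home_mount_source():
    """Return the source of the /home mount, or '' if findmnt is missing or fails."""
    try:
        result = subprocess.run(
            ["findmnt", "-n", "-o", "SOURCE", "/home"],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        # findmnt is not installed or could not be executed
        return ""
    return result.stdout.strip()


if __name__ == "__main__":
    print(machine_from_mount_source(home_mount_source()))
```

Keeping the string classification separate from the findmnt call means the matching logic can be checked without access to either machine.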

RussTreadon-NOAA added a commit to RussTreadon-NOAA/global-workflow that referenced this issue Jun 20, 2024
@DavidHuber-NOAA
Contributor

Resolved by #2700. Closing.
