Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bugs preventing slurm reinstall or rebuild #1187

Merged
merged 2 commits into from Jun 23, 2022

Conversation

biocyberman
Copy link
Contributor

@biocyberman biocyberman commented Jun 21, 2022

Bugs:

  1. docker.slurm-exporter.service creates subdirectories
    /usr/loca/bin/{sinfo,sdiag,squeue} when the excutables are absent
    (due to uninstall process). When deepops tries to install the executables
    again, they are put under the unwanted subdirectories (e.g.
    /usr/local/bin/sinfo/sinfo instead of /usr/local/bin/sinfo)

  2. Slurm 'make uninstall' doesn't remove files under "{{ slurm_install_prefix }}/lib/slurm". Therefore the directory needs to be removed. A safer solution
    would be to run the correct 'make uninstall' command (at the correct location)
    to uninstall the files. Without this handling, slurm commands fails to run because of
    library version conflicts.

Bugs:

1. docker.slurm-exporter.service creates subdirectories
`/usr/loca/bin/{sinfo,sdiag,squeue} when the excutables are absent
(due to uninstall process). When deepops tries to install the executables
again, they are put under the unwanted subdirectories (e.g.
/usr/local/bin/sinfo/sinfo instead of /usr/local/bin/sinfo)

2. Slurm 'make uninstall' doesn't remove files under "{{ slurm_install_prefix
}}/lib/slurm". Therefore the directory needs to be removed. A safer solution
would be to run the correct 'make uninstall' command (at the correct location)
to uninstall the files. Without this handling, slurm fails to start because of
library version conflicts.
Copy link
Collaborator

@ajdecon ajdecon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this PR! I can confirm that I can reproduce your bug, and this is a good thing to fix. 😄 However, there are some issues with your Ansible conditions. Please see the inline comments and make the requested changes, then we can re-test and hopefully merge!

service:
name: "docker.slurm-exporter.service"
state: stopped
when: slurm_build and (ansible_distribution == 'Ubuntu') and ("{{ansible_hostname}}" not in groups['slurm-node'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fails on initial installs because the exporter hasn't been installed yet. The error looks like this:

fatal: [virtual-login01-2]: FAILED! => changed=false 
  invocation:
    module_args:
      daemon_reexec: false
      daemon_reload: false
      enabled: null
      force: null
      masked: null
      name: docker.slurm-exporter.service
      no_block: false
      scope: system
      state: stopped
  msg: 'Could not find the requested service docker.slurm-exporter.service: host'

You should ensure that service facts are being collected, then add a condition so this only executes when the service exists. I.e., something like this:

- name: gather service facts
  ansible.builtin.service_facts:

- name: stop docker slurm exporter service to prevent subfolder creation
  service:
      name: "docker.slurm-exporter.service"
      state: stopped
  when: 
  - slurm_build
  - ansible_distribution == 'Ubuntu'
  - "{{ansible_hostname}}" not in groups['slurm-node']
  - "docker.slurm-exporter.service" in ansible_facts.services

(I haven't tested this change directly, but I'm pretty sure the syntax looks like that!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajdecon I incorporated your suggestion and also restarted the service back again in the end. Ran a test deployment and also rebased to master.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, may be because of older ansible version, the when list doesn't work so I use the explicit logical and.

roles/slurm/tasks/build.yml Show resolved Hide resolved
@biocyberman biocyberman requested a review from ajdecon June 23, 2022 17:11
Copy link
Collaborator

@ajdecon ajdecon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you! I can confirm our CI tests are passing and my manual tests also look good.

Last thing: please sign off your commits according to our CONTRIBUTING.md doc. Once this is done, I can merge this PR.

Checking for ansible_hostname is unneccessary.

Signed-off-by: Vang Le-Quy <biocyberman@gmail.com>
@biocyberman
Copy link
Contributor Author

This looks great, thank you! I can confirm our CI tests are passing and my manual tests also look good.

Last thing: please sign off your commits according to our CONTRIBUTING.md doc. Once this is done, I can merge this PR.
Sounds great!
I amended the last commit and signed it off. Thanks.

Copy link
Collaborator

@ajdecon ajdecon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ajdecon ajdecon merged commit efbf24f into NVIDIA:master Jun 23, 2022
@dholt dholt mentioned this pull request Aug 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants