New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bugs preventing slurm reinstall or rebuild #1187
Conversation
Bugs: 1. docker.slurm-exporter.service creates subdirectories `/usr/loca/bin/{sinfo,sdiag,squeue} when the excutables are absent (due to uninstall process). When deepops tries to install the executables again, they are put under the unwanted subdirectories (e.g. /usr/local/bin/sinfo/sinfo instead of /usr/local/bin/sinfo) 2. Slurm 'make uninstall' doesn't remove files under "{{ slurm_install_prefix }}/lib/slurm". Therefore the directory needs to be removed. A safer solution would be to run the correct 'make uninstall' command (at the correct location) to uninstall the files. Without this handling, slurm fails to start because of library version conflicts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening this PR! I can confirm that I can reproduce your bug, and this is a good thing to fix. 😄 However, there are some issues with your Ansible conditions. Please see the inline comments and make the requested changes, then we can re-test and hopefully merge!
roles/slurm/tasks/build.yml
Outdated
service: | ||
name: "docker.slurm-exporter.service" | ||
state: stopped | ||
when: slurm_build and (ansible_distribution == 'Ubuntu') and ("{{ansible_hostname}}" not in groups['slurm-node']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fails on initial installs because the exporter hasn't been installed yet. The error looks like this:
fatal: [virtual-login01-2]: FAILED! => changed=false
invocation:
module_args:
daemon_reexec: false
daemon_reload: false
enabled: null
force: null
masked: null
name: docker.slurm-exporter.service
no_block: false
scope: system
state: stopped
msg: 'Could not find the requested service docker.slurm-exporter.service: host'
You should ensure that service facts are being collected, then add a condition so this only executes when the service exists. I.e., something like this:
- name: gather service facts
ansible.builtin.service_facts:
- name: stop docker slurm exporter service to prevent subfolder creation
service:
name: "docker.slurm-exporter.service"
state: stopped
when:
- slurm_build
- ansible_distribution == 'Ubuntu'
- "{{ansible_hostname}}" not in groups['slurm-node']
- "docker.slurm-exporter.service" in ansible_facts.services
(I haven't tested this change directly, but I'm pretty sure the syntax looks like that!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ajdecon I incorporated your suggestion and also restarted the service back again in the end. Ran a test deployment and also rebased to master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, may be because of older ansible version, the when
list doesn't work so I use the explicit logical and
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, thank you! I can confirm our CI tests are passing and my manual tests also look good.
Last thing: please sign off your commits according to our CONTRIBUTING.md doc. Once this is done, I can merge this PR.
Checking for ansible_hostname is unneccessary. Signed-off-by: Vang Le-Quy <biocyberman@gmail.com>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Bugs:
docker.slurm-exporter.service creates subdirectories
/usr/loca/bin/{sinfo,sdiag,squeue}
when the excutables are absent(due to uninstall process). When deepops tries to install the executables
again, they are put under the unwanted subdirectories (e.g.
/usr/local/bin/sinfo/sinfo instead of /usr/local/bin/sinfo)
Slurm 'make uninstall' doesn't remove files under
"{{ slurm_install_prefix }}/lib/slurm"
. Therefore the directory needs to be removed. A safer solutionwould be to run the correct 'make uninstall' command (at the correct location)
to uninstall the files. Without this handling, slurm commands fails to run because of
library version conflicts.