Make SLURM worker startup more robust and provide more feedback #200

Merged (4 commits) on Apr 19, 2024


oschulz (Contributor) commented Apr 19, 2024

Builds on top of #199.

Before (but with fix in #199):

# ... wait (but what's going on?) ...

connecting to worker 1 out of 12
connecting to worker 2 out of 12
connecting to worker 3 out of 12
connecting to worker 4 out of 12
connecting to worker 5 out of 12
connecting to worker 6 out of 12
connecting to worker 7 out of 12
connecting to worker 8 out of 12
connecting to worker 9 out of 12
connecting to worker 10 out of 12
connecting to worker 11 out of 12
connecting to worker 12 out of 12

After:

[ Info: Starting SLURM job julia-26323452: `srun -J julia-26323452 -n 12 -D /homedir/some/dir --cpus-per-task=8 --mem-per-cpu=8G --cpu-bind=cores --mem-bind=local -o /homedir/slurm-julia-output/julia-26323452-12983479872-%4t.out /path/to/bin/julia --project=/homedir/.julia/environments/someenv --threads=8 --heap-size-hint=34359738368 --worker=qy8ZReqHiDfwjq6a`
[ Info: Worker 0 (after 0 s): No output file "/homedir/slurm-julia-output/julia-26323452-12983479872-0000.out" yet
[ Info: Worker 0 (after 1 s): Output file found, but no connection details yet
[ Info: Worker 0 (after 2 s): Output file found, but no connection details yet
[ Info: Worker 0 (after 4 s): Output file found, but no connection details yet
[ Info: Worker 0 (after 6 s): Output file found, but no connection details yet
[ Info: Worker 0 ready after 10 s on host 149.217.13.126, port 9101
[ Info: Worker 1 ready after 10 s on host 149.217.13.126, port 9102
[ Info: Worker 2 ready after 10 s on host 149.217.13.126, port 9103
[ Info: Worker 3 ready after 11 s on host 149.217.13.126, port 9104
[ Info: Worker 4 ready after 11 s on host 149.217.13.126, port 9105
[ Info: Worker 5 ready after 11 s on host 149.217.13.126, port 9106
[ Info: Worker 6 ready after 12 s on host 149.217.13.126, port 9107
[ Info: Worker 7 ready after 12 s on host 149.217.13.126, port 9108
[ Info: Worker 8 ready after 12 s on host 149.217.13.126, port 9109
[ Info: Worker 9 ready after 12 s on host 149.217.13.126, port 9110
[ Info: Worker 10 ready after 12 s on host 149.217.13.126, port 9111
[ Info: Worker 11 ready after 12 s on host 149.217.13.126, port 9112
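The "After" log suggests a polling loop: look for each task's output file, then look inside it for connection details, and report progress while waiting. Below is a minimal, language-neutral sketch of that idea in Python. It is not the package's Julia implementation. The function names, the back-off schedule, and the timeout are hypothetical, and the `julia_worker:<port>#<host>` line format is an assumption about what a worker writes to its output file. The only detail taken from the log is that `%4t` in the `srun -o` pattern becomes a zero-padded four-digit task id (`-0000.out`).

```python
import re
import time
from pathlib import Path

# Assumption: a started worker writes a line like "julia_worker:9101#149.217.13.126".
CONN_RE = re.compile(r"julia_worker:(\d+)#(\S+)")


def output_file(pattern, task_id):
    """Expand the %4t placeholder (zero-padded task id) in an srun -o pattern."""
    return Path(pattern.replace("%4t", f"{task_id:04d}"))


def worker_status(path):
    """Classify a worker as "missing", "waiting", or "ready" from its output file."""
    if not path.exists():
        return "missing", None
    match = CONN_RE.search(path.read_text())
    if match is None:
        return "waiting", None
    return "ready", (match.group(2), int(match.group(1)))


def wait_for_worker(pattern, task_id, timeout=60.0,
                    sleep=time.sleep, clock=time.monotonic, log=print):
    """Poll until the worker reports (host, port), logging progress on each attempt."""
    path = output_file(pattern, task_id)
    start = clock()
    delay = 1.0  # hypothetical back-off; the real schedule is not given in the thread
    while True:
        elapsed = clock() - start
        status, details = worker_status(path)
        if status == "ready":
            host, port = details
            log(f"Worker {task_id} ready after {elapsed:.0f} s "
                f"on host {host}, port {port}")
            return host, port
        if status == "missing":
            log(f'Worker {task_id} (after {elapsed:.0f} s): '
                f'No output file "{path}" yet')
        else:
            log(f"Worker {task_id} (after {elapsed:.0f} s): "
                "Output file found, but no connection details yet")
        if elapsed >= timeout:
            raise TimeoutError(f"worker {task_id} not ready after {elapsed:.0f} s")
        sleep(delay)
        delay = min(delay * 2, 10.0)
```

Passing `sleep`, `clock`, and `log` as parameters keeps the loop easy to exercise without a cluster: a caller can substitute a no-op sleep and collect the messages in a list.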

Moelf (Collaborator) commented Apr 19, 2024

LGTM

kescobo (Collaborator) left a comment

I'm also fine with this 👍

kescobo mentioned this pull request Apr 19, 2024

oschulz (Contributor, Author) commented Apr 19, 2024

Good to merge from my side (I'm looking into ElasticManager next, following advice from @JBlaschke).

kescobo merged commit f41deaf into JuliaParallel:master Apr 19, 2024
0 of 2 checks passed

kescobo (Collaborator) commented Apr 19, 2024

Should we cut a release on this, or do you have more to do before that?

oschulz (Contributor, Author) commented Apr 19, 2024

> Should we cut a release on this, or do you have more to do before that?

Thanks, yes. I have some people who need to use this, I think it's good for now.

kescobo (Collaborator) commented Apr 19, 2024

oschulz (Contributor, Author) commented Apr 19, 2024

Thanks, @kescobo!

oschulz deleted the slurm-improvements branch April 22, 2024 10:22