Worker launches on Cori seem to fail from broken ENV #27

Closed
yadudoc opened this Issue Jan 29, 2018 · 2 comments

yadudoc commented Jan 29, 2018

Reported by Michael Wang over Slack. Relevant snippet from the chat:
Before trying it out on Cori, I first verified it would work on my desktop. So I installed SLURM on my desktop to try to get the setup as close as I could to the one at NERSC. Everything worked just fine. But when I tried it out on Cori, it was in a PENDING state in the debug queue for quite a while before disappearing from the queue even though the Parsl didn't seem to have completed yet -- i.e. my command "python3 -d wltest-clk.py" hadn't returned. When I look at the parsl*.submit.stderr file, I see:
/var/spool/slurmd/job9779002/slurm_script: line 11: module: command not found
/var/spool/slurmd/job9779002/slurm_script: line 12: activate: No such file or directory

This seems to indicate that the module command was missing when the submit script tried to load the conda module, most likely because the env variables had been wiped. Although the local channel works on the local machine, something specific to these systems may be breaking our env copy method. A similar issue has been seen on theta.alcf.anl.gov, and a potential fix is in the theta branch.

yadudoc added a commit that referenced this issue Jan 29, 2018

Merge branch 'theta' for updates to local channel for env copy bug #27.
This is only a potential fix. Root cause of the env copy followed by
update breaking the env is not known, nor is why this manifests only
on certain specific systems.

@yadudoc yadudoc self-assigned this Jan 29, 2018

@yadudoc yadudoc added the bug label Jan 29, 2018

@yadudoc yadudoc added this to the 0.3.0 milestone Jan 29, 2018

yadudoc added a commit that referenced this issue Jan 29, 2018

Adding regression tests for issue: #27 and env setting.
This is a regression test for #27
Also tests env setting via execute_wait.
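
A regression test along these lines could look like the sketch below. It does not use Parsl's channel API: `run_with_env` is a hypothetical stand-in for an `execute_wait`-style call, built on `subprocess`, and the variable names are made up for illustration. The idea is to check both that a newly supplied variable reaches the child process and that the inherited environment is still there.

```python
import os
import subprocess
import sys


def run_with_env(cmd, envs):
    """Hypothetical stand-in for an execute_wait-style call (not Parsl's API).

    Runs cmd with the parent's environment extended by envs and returns
    (returncode, stdout, stderr).
    """
    merged = dict(os.environ)   # copy, so the parent env is not mutated
    merged.update(envs)         # update() returns None; keep using `merged`
    proc = subprocess.run(
        cmd,
        env=merged,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


def test_env_is_preserved_and_extended():
    # A variable the child should inherit from the parent environment.
    os.environ["INHERITED_SENTINEL"] = "kept"
    child = [
        sys.executable,
        "-c",
        "import os; "
        "print(os.environ.get('TEST_ENV_VAR', 'missing')); "
        "print(os.environ.get('INHERITED_SENTINEL', 'missing'))",
    ]
    rc, out, _err = run_with_env(child, {"TEST_ENV_VAR": "1"})
    assert rc == 0
    assert out.split() == ["1", "kept"]
```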

yadudoc commented Jan 29, 2018

The root cause of this failure is this line:

self.envs = local_env.update(envs)

Calling local_env.update() updates local_env in place and always returns None, so self.envs ended up set to None.
As a result, when the channel launched workers it effectively wiped out the env (including PATH vars), which explains why the submit scripts failed with "command not found" errors. Keeping this open until the fix is confirmed on Cori.
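
The behaviour can be reproduced outside Parsl with plain dictionaries. The sketch below is not the actual patch: `merge_env_buggy` and `merge_env_fixed` are hypothetical helper names, and copying before updating is one reasonable way to write the fix, assuming the goal is a merged env that leaves the original untouched.

```python
import copy


def merge_env_buggy(local_env, envs):
    # Mirrors the faulty pattern: dict.update() mutates local_env in place
    # and returns None, so the caller receives None instead of a dict.
    return local_env.update(envs)


def merge_env_fixed(local_env, envs):
    # Copy first, update the copy, then return the copy itself.
    merged = copy.deepcopy(local_env)
    merged.update(envs)
    return merged


base = {"PATH": "/usr/bin:/bin", "HOME": "/home/user"}

buggy = merge_env_buggy(dict(base), {"TEST_VAR": "1"})
fixed = merge_env_fixed(base, {"TEST_VAR": "1"})

print(buggy)   # None
print(fixed)   # {'PATH': '/usr/bin:/bin', 'HOME': '/home/user', 'TEST_VAR': '1'}
```

With the fixed helper the original PATH survives alongside the new variable, and `base` itself is left unmodified.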

yadudoc added a commit that referenced this issue Jan 29, 2018

yadudoc added a commit to Parsl/parsl that referenced this issue Jan 29, 2018


yadudoc commented Jan 30, 2018

Mike Wang reports that this fix worked for him.

@yadudoc yadudoc closed this Jan 30, 2018

yadudoc added a commit that referenced this issue Jan 30, 2018

Bumping version from 0.2.6 -> 0.3.0
Several updates and cleanups for @benhg for the Azure provider
Removed haikunator dep
Fixes for issue #27 and regression tests
Support for cobalt+aprun on theta
SGE support (untested) from @benhg
Fix for #26
Updating to use scriptDir instead of script_dir