Skip to content

fix: correct environment setup in container SLURM template#22

Merged
Neonkraft merged 4 commits intomainfrom
fix/slurm-container-template
Apr 30, 2026
Merged

fix: correct environment setup in container SLURM template#22
Neonkraft merged 4 commits intomainfrom
fix/slurm-container-template

Conversation

@Neonkraft
Copy link
Copy Markdown
Collaborator

Summary

Three correctness fixes to job_trl_container.sh.jinja:

  • Remove CUDA_DEVICE_MAX_CONNECTIONS=1: this is a Megatron-LM tensor-parallel flag that serializes CUDA streams and hurts ZeRO-2's overlap_comm — removed
  • Fix PYTHONPATH: was set to "", causing the container to use its baked-in post_training package instead of the local src/ checkout. Now set to {{ repo_dir }}/src
  • Add WANDB_DIR: set to the run directory so offline WandB runs persist to Lustre-backed shared storage rather than being lost when the compute node is released

Type of change

  • Bug fix
  • New feature
  • Refactor
  • Performance
  • Documentation
  • Maintenance

…R in container template

CUDA_DEVICE_MAX_CONNECTIONS=1 is a Megatron-LM tensor-parallel flag that
serializes CUDA streams and hurts ZeRO-2 overlap_comm — removed.

PYTHONPATH was cleared to an empty string, causing the container to use
its baked-in post_training package instead of the local src/ checkout.
Now set to repo_dir/src so local changes take effect.

WANDB_DIR is set to the run directory so that offline WandB runs persist
to Lustre-backed shared storage rather than being lost on node teardown.
@Neonkraft Neonkraft requested a review from KonstiNik April 29, 2026 14:12
Copy link
Copy Markdown
Collaborator

@KonstiNik KonstiNik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three bugs, all good fixes. Two notes:

  1. The PYTHONPATH="" issue also exists in job_llamafactory.sh.jinja:73. Maybe worth considering here too.

  2. WANDB_DIR={{ repo_dir }}/{{ run_dir }} only works because run_dir is relative by default. It's constructed as Path(output_base) / run_name at paths.py:57-59, and output_base is just a string (config.py:207), but nothing prevents a user from setting it to an absolute path like /scratch/foo. If they do, wandb fails to write (or writes somewhere unexpected, since that path likely isn't mounted in the container).

    Potential fix: compute an absolute run_dir at the launcher boundary and pass that to the template — e.g. at launcher.py:117, pass run_dir=str(Path.cwd() / run_dir if not run_dir.is_absolute() else run_dir), then template WANDB_DIR={{ run_dir }} directly.

WANDB_DIR was constructed as repo_dir/run_dir in job_trl_container.sh.jinja,
which breaks when output_base is an absolute path. Pass run_dir.resolve() from
the launcher so templates always receive an absolute path, then simplify
WANDB_DIR to use run_dir directly. Same resolve() applied to the LlamaFactory
renderer for consistency.
@Neonkraft
Copy link
Copy Markdown
Collaborator Author

Fixed.

Copy link
Copy Markdown
Collaborator

@KonstiNik KonstiNik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Ready from my side

@Neonkraft Neonkraft merged commit e3f7152 into main Apr 30, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants