Skip to content

Add zero_to_hero scripts from previous branch#159

Merged
ibaldin merged 5 commits intov0.3.1-wipfrom
z2h
Feb 24, 2026
Merged

Add zero_to_hero scripts from previous branch#159
ibaldin merged 5 commits intov0.3.1-wipfrom
z2h

Conversation

@ibaldin
Copy link
Collaborator

@ibaldin ibaldin commented Feb 23, 2026

Transferred scripts/zero_to_hero directory with all documentation, examples, and utilities from the zero_to_hero branch.

Transferred scripts/zero_to_hero directory with all documentation,
examples, and utilities from the zero_to_hero branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ibaldin ibaldin requested a review from yakyakyak February 23, 2026 19:20
@ibaldin ibaldin self-assigned this Feb 23, 2026
@ibaldin ibaldin added enhancement New feature or request 0.3.0 v0.3.0 labels Feb 23, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# SLURM copies the script to a temp dir, but we can find the original location
# Use the original script's directory, not the submit directory
if [[ -n "${SLURM_JOB_ID:-}" ]]; then
# In SLURM environment - find the actual script directory
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded user-specific path — scripts will fail for any other user (blocking)

SCRIPT_DIR="/global/homes/y/yak/E2SAR/scripts/zero_to_hero"

This path is hard-coded to the yak account on Perlmutter. Any other user who submits this job will immediately get a "No such file or directory" error when the script tries to invoke "$SCRIPT_DIR/minimal_receiver.sh" or "$SCRIPT_DIR/minimal_reserve.sh".

The challenge is real: SLURM copies the batch script to a tmpdir, so ${BASH_SOURCE[0]} points there at runtime — but the simplest portable fix is to require that SCRIPT_DIR be set in the environment before submission, with a clear error if it is not:

if [[ -z "${E2SAR_SCRIPTS_DIR:-}" ]]; then
    echo "ERROR: E2SAR_SCRIPTS_DIR must be set to the zero_to_hero directory"
    echo "  export E2SAR_SCRIPTS_DIR=/path/to/E2SAR/scripts/zero_to_hero"
    exit 1
fi
SCRIPT_DIR="$E2SAR_SCRIPTS_DIR"

This is consistent with the E2SAR_SCRIPTS_DIR variable already exported by setup_env.sh.

# Parse node list
NODE_ARRAY=($(scontrol show hostname $SLURM_JOB_NODELIST))

# Port stride matches receive thread count: each receiver binds RECV_THREADS consecutive ports
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same hardcoded user-specific path as in perlmutter_slurm.sh (blocking)

SCRIPT_DIR="/global/homes/y/yak/E2SAR/scripts/zero_to_hero"

Same issue: this script is non-functional for any user other than yak. Apply the same fix — require E2SAR_SCRIPTS_DIR to be set in the environment and validate it, rather than hardcoding an absolute path.

# Run lbadm --reserve and save output to INSTANCE_URI
# Note: lbadm --reserve skips SSL cert validation internally regardless of --novalidate;
# passing --novalidate interferes with this and causes failures, so it is intentionally omitted.
podman-hpc run -e EJFAT_URI="$EJFAT_URI" --rm --network host ibaldin/e2sar:0.3.1a3 lbadm --reserve --lbname "yk_test" --export > "$INSTANCE_URI_FILE"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded lb name yk_test limits portability

podman-hpc run ... lbadm --reserve --lbname "yk_test" --export > "$INSTANCE_URI_FILE"

yk_test appears to be a personal test name. Users may want to use their own reservation names (especially on shared load balancers where names might collide). Consider adding a --lbname option with a sensible generic default:

LB_NAME="${LB_NAME:-e2sar_test}"
# ...
lbadm --reserve --lbname "$LB_NAME" --export

Or expose it as a CLI argument: ./minimal_reserve.sh --lbname my_test.

echo "Found $INSTANCE_URI_FILE, validating..."

# Try to run lbadm --overview to check if the reservation is valid
if podman-hpc run -e EJFAT_URI="$EJFAT_URI" --rm --network host ibaldin/e2sar:0.3.1a3 lbadm --overview &>/dev/null; then
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing-reservation check uses admin URI instead of instance URI — logic is incorrect

if podman-hpc run -e EJFAT_URI="$EJFAT_URI" --rm --network host ibaldin/e2sar:0.3.1a3 lbadm --overview &>/dev/null; then
    echo "Existing reservation is valid, skipping reserve"
    exit 0

At this point $EJFAT_URI is the admin URI from the caller's environment, not the session URI in INSTANCE_URI. lbadm --overview with the admin token will succeed as long as the load balancer is reachable — regardless of whether the session stored in INSTANCE_URI is still valid.

The check should source INSTANCE_URI first, then verify the session token is still active:

if [[ -f "$INSTANCE_URI_FILE" ]]; then
    echo "Found $INSTANCE_URI_FILE, validating..."
    # Use the instance URI (not the admin URI) to check session validity
    INSTANCE_EJFAT_URI=$(. "$INSTANCE_URI_FILE" && echo "$EJFAT_URI")
    if podman-hpc run -e EJFAT_URI="$INSTANCE_EJFAT_URI" --rm --network host \
        "$E2SAR_IMAGE" lbadm --overview &>/dev/null; then
        echo "Existing reservation is valid, skipping reserve"
        exit 0
    fi
fi

As written, a stale INSTANCE_URI with an expired session will never be replaced because the admin-URI --overview always passes.

# Resolve LB hostname to IP
if [[ "$USE_IPV6" == "true" ]]; then
LB_IP=$(getent ahostsv6 "$LB_HOST" | head -1 | awk '{print $1}')
else
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EJFAT_URI auth token printed to stdout/log unredacted

echo "EJFAT_URI: $EJFAT_URI"

The full URI including the bearer token is written to minimal_sender.log (and to stdout). The SLURM scripts correctly redact the token with:

EJFAT_URI_REDACTED=$(echo "$EJFAT_URI" | sed -E 's|(://)(.{4})[^@]*(.{4})@|\1\2---\3@|')
echo "EJFAT_URI: $EJFAT_URI_REDACTED"

Apply the same pattern here. Same issue exists in minimal_receiver.sh and minimal_free.sh.

LBADM_CMD+=(--free)

if podman-hpc run -e EJFAT_URI="$EJFAT_URI" --rm --network host ibaldin/e2sar:0.3.1a3 "${LBADM_CMD[@]}"; then
echo "Reservation freed successfully"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Container image not configurable — hardcoded instead of using $E2SAR_IMAGE

podman-hpc run -e EJFAT_URI="$EJFAT_URI" --rm --network host ibaldin/e2sar:0.3.1a3 "${LBADM_CMD[@]}"

minimal_sender.sh and minimal_receiver.sh both expose --image / $E2SAR_IMAGE to allow overriding the container image. minimal_free.sh hardcodes ibaldin/e2sar:0.3.1a3 instead, which breaks image-override workflows. Add:

E2SAR_IMAGE="${E2SAR_IMAGE:-ibaldin/e2sar:0.3.1a3}"
# ...
podman-hpc run -e EJFAT_URI="$EJFAT_URI" --rm --network host "$E2SAR_IMAGE" "${LBADM_CMD[@]}"

@ibaldin
Copy link
Collaborator Author

ibaldin commented Feb 23, 2026

Code Review Summary

Overall verdict: Request Changes


Blocking issues (must fix before merge)

  1. Hardcoded `/global/homes/y/yak/` path in both SLURM scripts — `perlmutter_slurm.sh:134` and `perlmutter_multi_slurm.sh:~205` both hard-code a path under the `yak` user account on Perlmutter. Any other user who `sbatch`es these scripts will immediately fail with "No such file or directory" when the script tries to invoke sibling scripts through `$SCRIPT_DIR`. Fix: validate and use `E2SAR_SCRIPTS_DIR` from the environment (which `setup_env.sh` already exports), rather than a hard-coded path.

  2. `minimal_reserve.sh` reservation-validity check uses the wrong URI (line 33) — The check runs `lbadm --overview` with the admin `$EJFAT_URI` from the caller's environment, not the session URI stored inside `INSTANCE_URI`. This means a stale `INSTANCE_URI` file with an expired session is never detected and never replaced — the script always exits early claiming the reservation is valid. The check must source `INSTANCE_URI` first and pass its session URI to `lbadm`.


Non-blocking suggestions (style/portability)

  • Hardcoded `--lbname "yk_test"` in `minimal_reserve.sh:46` — looks like a personal test name. Consider making it configurable via `--lbname` flag or `$LB_NAME` env var with a generic default (e.g. `e2sar_test`).
  • EJFAT_URI auth token logged unredacted in `minimal_sender.sh`, `minimal_receiver.sh`, and `minimal_free.sh` — the SLURM scripts correctly use a `EJFAT_URI_REDACTED` sed expression before printing. Apply the same pattern in the minimal scripts so bearer tokens are not written to log files.
  • `minimal_free.sh` hardcodes image tag instead of using `${E2SAR_IMAGE:-ibaldin/e2sar:0.3.1a3}` — breaks `--image` override workflows that work in the sender/receiver scripts.
  • `END_TIME`/`EXIT_CODE` written twice to log — both an explicit block and the `EXIT` trap call `write_end_time`, resulting in duplicate entries at the end of each run log.

What looks good

  • All scripts use `set -euo pipefail` consistently.
  • Container commands are built as arrays, which handles arguments with spaces correctly.
  • The `PIPESTATUS[0]` capture pattern for getting the container exit code through a `tee` pipeline is correct and well-commented.
  • The SLURM scripts properly redact the bearer token before printing `EJFAT_URI`.
  • `perlmutter_multi_slurm.sh` gracefully shuts down receivers with SIGTERM followed by SIGKILL — a good pattern.
  • Documentation is thorough and covers the tricky SSL cert expiry edge case (including the admin-token workaround) clearly.
  • The `setup_env.sh` zsh/bash dual-compatibility approach is reasonable.

Review generated by /review-pr skill — verify all inline comments before merging.

Yatish Kumar and others added 3 commits February 23, 2026 17:22
This commit resolves all blocking and non-blocking issues identified in the
PR #159 code review.

Blocking fixes:
- Remove hardcoded /global/homes/y/yak/ paths from SLURM scripts
  (perlmutter_slurm.sh, perlmutter_multi_slurm.sh) and require
  E2SAR_SCRIPTS_DIR environment variable instead
- Fix reservation validity check in minimal_reserve.sh to use instance URI
  instead of admin URI, ensuring stale sessions are properly detected

Configuration improvements:
- Add LB_NAME variable and --lbname option to minimal_reserve.sh for
  configurable reservation names (replaces hardcoded "yk_test")
- Make container image configurable via E2SAR_IMAGE variable in
  minimal_reserve.sh and minimal_free.sh

Security and logging:
- Redact EJFAT_URI tokens in output logs for minimal_sender.sh,
  minimal_receiver.sh, and minimal_free.sh
- Remove duplicate END_TIME/EXIT_CODE log entries in minimal_sender.sh
  and minimal_receiver.sh (trap handler already logs these)

Documentation:
- Update CLAUDE.md with E2SAR_SCRIPTS_DIR requirement for SLURM scripts
- Document LB_NAME environment variable and --lbname option
- Add prerequisites section for SLURM batch processing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add cleanup traps to prevent orphaned LB reservations on job cancellation,
replace unsafe source commands with grep-based extraction to prevent code
execution, and hide authentication tokens from process listings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes remaining security vulnerabilities in minimal_reserve.sh and minimal_free.sh:

- Replace unsafe source command with grep-based URI extraction in minimal_reserve.sh
- Use --env instead of -e for EJFAT_URI to prevent token exposure in process listings
- Properly handle temporary URI swapping for validation while preserving admin URI

These changes complete the security hardening started in commit 8a314d4, ensuring
all EJFAT_URI tokens are safely passed through environment variables rather than
command-line arguments visible in ps output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ibaldin ibaldin merged commit c4daa94 into v0.3.1-wip Feb 24, 2026
@ibaldin ibaldin deleted the z2h branch February 24, 2026 02:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

0.3.0 v0.3.0 enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant