Add zero_to_hero scripts from previous branch #159
Transferred scripts/zero_to_hero directory with all documentation, examples, and utilities from the zero_to_hero branch.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    # SLURM copies the script to a temp dir, but we can find the original location
    # Use the original script's directory, not the submit directory
    if [[ -n "${SLURM_JOB_ID:-}" ]]; then
        # In SLURM environment - find the actual script directory
Hardcoded user-specific path — scripts will fail for any other user (blocking)
    SCRIPT_DIR="/global/homes/y/yak/E2SAR/scripts/zero_to_hero"

This path is hard-coded to the yak account on Perlmutter. Any other user who submits this job will immediately get a "No such file or directory" error when the script tries to invoke "$SCRIPT_DIR/minimal_receiver.sh" or "$SCRIPT_DIR/minimal_reserve.sh".

The challenge is real: SLURM copies the batch script to a temporary directory, so ${BASH_SOURCE[0]} points there at runtime. The simplest portable fix is to require that E2SAR_SCRIPTS_DIR be set in the environment before submission, and to fail with a clear error if it is not:

    if [[ -z "${E2SAR_SCRIPTS_DIR:-}" ]]; then
        echo "ERROR: E2SAR_SCRIPTS_DIR must be set to the zero_to_hero directory"
        echo "  export E2SAR_SCRIPTS_DIR=/path/to/E2SAR/scripts/zero_to_hero"
        exit 1
    fi
    SCRIPT_DIR="$E2SAR_SCRIPTS_DIR"

This is consistent with the E2SAR_SCRIPTS_DIR variable already exported by setup_env.sh.
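Checking that the variable is set does not catch a variable that points at the wrong directory. A minimal sketch of a stricter check follows; the function name resolve_script_dir is hypothetical and not part of the PR, and the two file names are the scripts this comment says the SLURM job invokes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): validate E2SAR_SCRIPTS_DIR and print it.
resolve_script_dir() {
    local dir="${E2SAR_SCRIPTS_DIR:-}"
    if [[ -z "$dir" ]]; then
        echo "ERROR: E2SAR_SCRIPTS_DIR must be set to the zero_to_hero directory" >&2
        return 1
    fi
    # Fail early if the scripts the batch job calls are not where we expect them.
    local script
    for script in minimal_receiver.sh minimal_reserve.sh; do
        if [[ ! -f "$dir/$script" ]]; then
            echo "ERROR: $dir/$script not found; check E2SAR_SCRIPTS_DIR" >&2
            return 1
        fi
    done
    printf '%s\n' "$dir"
}
```

In the batch script this could be used as `SCRIPT_DIR=$(resolve_script_dir) || exit 1`. By default sbatch propagates the submitting shell's environment, so exporting the variable before submission is normally enough.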
    # Parse node list
    NODE_ARRAY=($(scontrol show hostname $SLURM_JOB_NODELIST))

    # Port stride matches receive thread count: each receiver binds RECV_THREADS consecutive ports
Same hardcoded user-specific path as in perlmutter_slurm.sh (blocking)
    SCRIPT_DIR="/global/homes/y/yak/E2SAR/scripts/zero_to_hero"

Same issue: this script is non-functional for any user other than yak. Apply the same fix: require E2SAR_SCRIPTS_DIR to be set in the environment and validate it, rather than hardcoding an absolute path.
    # Run lbadm --reserve and save output to INSTANCE_URI
    # Note: lbadm --reserve skips SSL cert validation internally regardless of --novalidate;
    # passing --novalidate interferes with this and causes failures, so it is intentionally omitted.
    podman-hpc run -e EJFAT_URI="$EJFAT_URI" --rm --network host ibaldin/e2sar:0.3.1a3 lbadm --reserve --lbname "yk_test" --export > "$INSTANCE_URI_FILE"
Hardcoded lb name yk_test limits portability
    podman-hpc run ... lbadm --reserve --lbname "yk_test" --export > "$INSTANCE_URI_FILE"

yk_test appears to be a personal test name. Users may want to use their own reservation names, especially on shared load balancers where names might collide. Consider making the name configurable through an LB_NAME variable with a sensible generic default:

    LB_NAME="${LB_NAME:-e2sar_test}"
    # ...
    lbadm --reserve --lbname "$LB_NAME" --export

Or expose it as a CLI argument: ./minimal_reserve.sh --lbname my_test.
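The two suggestions can be combined so that the command-line option overrides the environment default. The sketch below is only an illustration: parse_lbname_args is a hypothetical name, and the real minimal_reserve.sh presumably parses other options that are omitted here.

```shell
#!/usr/bin/env bash
# Environment default, as suggested in the comment above.
LB_NAME="${LB_NAME:-e2sar_test}"

# Hypothetical option parser: accepts "--lbname NAME" and "--lbname=NAME".
parse_lbname_args() {
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --lbname)
                if [[ $# -lt 2 || -z "$2" ]]; then
                    echo "ERROR: --lbname requires a value" >&2
                    return 1
                fi
                LB_NAME="$2"
                shift 2
                ;;
            --lbname=*)
                LB_NAME="${1#--lbname=}"
                shift
                ;;
            *)
                echo "ERROR: unknown option: $1" >&2
                return 1
                ;;
        esac
    done
}
```

A script would call `parse_lbname_args "$@" || exit 1` once near the top and then pass `--lbname "$LB_NAME"` to lbadm.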
    echo "Found $INSTANCE_URI_FILE, validating..."

    # Try to run lbadm --overview to check if the reservation is valid
    if podman-hpc run -e EJFAT_URI="$EJFAT_URI" --rm --network host ibaldin/e2sar:0.3.1a3 lbadm --overview &>/dev/null; then
Existing-reservation check uses admin URI instead of instance URI: logic is incorrect

    if podman-hpc run -e EJFAT_URI="$EJFAT_URI" --rm --network host ibaldin/e2sar:0.3.1a3 lbadm --overview &>/dev/null; then
        echo "Existing reservation is valid, skipping reserve"
        exit 0

At this point $EJFAT_URI is the admin URI from the caller's environment, not the session URI in INSTANCE_URI. lbadm --overview with the admin token will succeed as long as the load balancer is reachable, regardless of whether the session stored in INSTANCE_URI is still valid.

The check should source INSTANCE_URI first, then verify the session token is still active:

    if [[ -f "$INSTANCE_URI_FILE" ]]; then
        echo "Found $INSTANCE_URI_FILE, validating..."
        # Use the instance URI (not the admin URI) to check session validity
        INSTANCE_EJFAT_URI=$(. "$INSTANCE_URI_FILE" && echo "$EJFAT_URI")
        if podman-hpc run -e EJFAT_URI="$INSTANCE_EJFAT_URI" --rm --network host \
            "$E2SAR_IMAGE" lbadm --overview &>/dev/null; then
            echo "Existing reservation is valid, skipping reserve"
            exit 0
        fi
    fi

As written, a stale INSTANCE_URI with an expired session will never be replaced because the admin-URI --overview always passes.
    # Resolve LB hostname to IP
    if [[ "$USE_IPV6" == "true" ]]; then
        LB_IP=$(getent ahostsv6 "$LB_HOST" | head -1 | awk '{print $1}')
    else
EJFAT_URI auth token printed to stdout/log unredacted

    echo "EJFAT_URI: $EJFAT_URI"

The full URI including the bearer token is written to minimal_sender.log (and to stdout). The SLURM scripts correctly redact the token with:

    EJFAT_URI_REDACTED=$(echo "$EJFAT_URI" | sed -E 's|(://)(.{4})[^@]*(.{4})@|\1\2---\3@|')
    echo "EJFAT_URI: $EJFAT_URI_REDACTED"

Apply the same pattern here. Same issue exists in minimal_receiver.sh and minimal_free.sh.
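Since the same redaction is needed in several scripts, it may be worth factoring it into a shared function. The sketch below reuses the sed expression quoted above; the function name redact_uri and the 12-character threshold are assumptions made for illustration. The extra branch exists because the quoted pattern only matches tokens of at least eight characters, so a shorter token would be printed unredacted and an eight-character token would be shown in full.

```shell
#!/usr/bin/env bash
# Hypothetical shared helper: print a URI with its token masked.
redact_uri() {
    local uri="$1"
    local min_len=12              # assumed threshold; below it, hide the whole token
    local rest="${uri#*://}"      # everything after "scheme://"
    local token="${rest%%@*}"     # text before the first "@"

    # No "scheme://token@" part (or the "@" belongs to a path): nothing to redact.
    if [[ "$uri" != *://*@* || "$token" == */* ]]; then
        printf '%s\n' "$uri"
        return 0
    fi

    if (( ${#token} < min_len )); then
        # Short token: showing 4+4 characters would leak most or all of it.
        printf '%s\n' "${uri%%://*}://---@${rest#*@}"
    else
        # Same pattern as the SLURM scripts: keep the first and last 4 characters.
        sed -E 's|(://)(.{4})[^@]*(.{4})@|\1\2---\3@|' <<<"$uri"
    fi
}
```

A call site would then read `echo "EJFAT_URI: $(redact_uri "$EJFAT_URI")"`.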
    LBADM_CMD+=(--free)

    if podman-hpc run -e EJFAT_URI="$EJFAT_URI" --rm --network host ibaldin/e2sar:0.3.1a3 "${LBADM_CMD[@]}"; then
        echo "Reservation freed successfully"
Container image not configurable: hardcoded instead of using $E2SAR_IMAGE

    podman-hpc run -e EJFAT_URI="$EJFAT_URI" --rm --network host ibaldin/e2sar:0.3.1a3 "${LBADM_CMD[@]}"

minimal_sender.sh and minimal_receiver.sh both expose --image / $E2SAR_IMAGE to allow overriding the container image. minimal_free.sh hardcodes ibaldin/e2sar:0.3.1a3 instead, which breaks image-override workflows. Add:

    E2SAR_IMAGE="${E2SAR_IMAGE:-ibaldin/e2sar:0.3.1a3}"
    # ...
    podman-hpc run -e EJFAT_URI="$EJFAT_URI" --rm --network host "$E2SAR_IMAGE" "${LBADM_CMD[@]}"
Code Review Summary
Overall verdict: Request Changes
Blocking issues (must fix before merge)
Non-blocking suggestions (style/portability)
What looks good
This commit resolves all blocking and non-blocking issues identified in the PR #159 code review.

Blocking fixes:
- Remove hardcoded /global/homes/y/yak/ paths from SLURM scripts (perlmutter_slurm.sh, perlmutter_multi_slurm.sh) and require E2SAR_SCRIPTS_DIR environment variable instead
- Fix reservation validity check in minimal_reserve.sh to use instance URI instead of admin URI, ensuring stale sessions are properly detected

Configuration improvements:
- Add LB_NAME variable and --lbname option to minimal_reserve.sh for configurable reservation names (replaces hardcoded "yk_test")
- Make container image configurable via E2SAR_IMAGE variable in minimal_reserve.sh and minimal_free.sh

Security and logging:
- Redact EJFAT_URI tokens in output logs for minimal_sender.sh, minimal_receiver.sh, and minimal_free.sh
- Remove duplicate END_TIME/EXIT_CODE log entries in minimal_sender.sh and minimal_receiver.sh (trap handler already logs these)

Documentation:
- Update CLAUDE.md with E2SAR_SCRIPTS_DIR requirement for SLURM scripts
- Document LB_NAME environment variable and --lbname option
- Add prerequisites section for SLURM batch processing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add cleanup traps to prevent orphaned LB reservations on job cancellation, replace unsafe source commands with grep-based extraction to prevent code execution, and hide authentication tokens from process listings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
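The commit itself is not shown here, so the following is only a sketch of what a cleanup trap of this kind might look like, not the code that was merged. RESERVED, free_reservation, cleanup, and install_cleanup_trap are hypothetical names; the sketch assumes minimal_free.sh can be run without arguments and that the batch shell receives SIGTERM before it is killed.

```shell
#!/usr/bin/env bash
# Set to "true" by the caller once a reservation has been made (hypothetical flag).
RESERVED=false

# Assumed wiring: free the reservation with the PR's minimal_free.sh.
free_reservation() {
    "${E2SAR_SCRIPTS_DIR:?E2SAR_SCRIPTS_DIR must be set}/minimal_free.sh"
}

# Runs once on shell exit; frees the reservation only if one was made.
cleanup() {
    if [[ "$RESERVED" == "true" ]]; then
        RESERVED=false    # guard against running the cleanup twice
        free_reservation || echo "WARNING: failed to free LB reservation" >&2
    fi
}

install_cleanup_trap() {
    trap cleanup EXIT
    # Turn signals into a normal exit so the EXIT trap does the actual work.
    trap 'exit 130' INT
    trap 'exit 143' TERM
}
```

A job script would call install_cleanup_trap before reserving and set RESERVED=true right after a successful reserve. Note that bash runs a signal trap only after the current foreground command returns, so a long-running child process delays the cleanup unless it is started in the background and waited on.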
Fixes remaining security vulnerabilities in minimal_reserve.sh and minimal_free.sh:
- Replace unsafe source command with grep-based URI extraction in minimal_reserve.sh
- Use --env instead of -e for EJFAT_URI to prevent token exposure in process listings
- Properly handle temporary URI swapping for validation while preserving admin URI

These changes complete the security hardening started in commit 8a314d4, ensuring all EJFAT_URI tokens are safely passed through environment variables rather than command-line arguments visible in ps output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
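For readers who want to see the shape of these two changes, here is a hedged sketch rather than the committed code. It assumes the INSTANCE_URI file contains a line of the form `export EJFAT_URI='...'` (the `export` keyword and the quotes are treated as optional), and it relies on podman's documented behavior that `--env NAME` without a value copies NAME from the caller's environment, assuming podman-hpc passes the flag through. extract_ejfat_uri, run_lbadm, and PODMAN_BIN are hypothetical names.

```shell
#!/usr/bin/env bash
# Read EJFAT_URI from a file without sourcing it, so no code in the file is executed.
extract_ejfat_uri() {
    local file="$1" line
    local dq='"' sq="'"
    line=$(grep -m1 -E '^[[:space:]]*(export[[:space:]]+)?EJFAT_URI=' "$file") || return 1
    line=${line#*EJFAT_URI=}
    # Strip one pair of surrounding quotes, if present.
    line=${line#"$dq"}; line=${line%"$dq"}
    line=${line#"$sq"}; line=${line%"$sq"}
    printf '%s\n' "$line"
}

# Run lbadm with the given URI. The token travels in the environment only:
# "--env EJFAT_URI" names the variable, so its value never appears in argv or ps.
run_lbadm() {
    local uri="$1"
    shift
    EJFAT_URI="$uri" "${PODMAN_BIN:-podman-hpc}" run --env EJFAT_URI --rm --network host \
        "${E2SAR_IMAGE:-ibaldin/e2sar:0.3.1a3}" lbadm "$@"
}
```

Validating a stored reservation then becomes `uri=$(extract_ejfat_uri "$INSTANCE_URI_FILE") && run_lbadm "$uri" --overview`, which leaves the caller's admin EJFAT_URI untouched because the assignment is scoped to the single command.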