
Conversation

@sbryngelson (Member) commented on Aug 8, 2025

User description

Make it so Phoenix knows to run with the proper device (GPU-enabled where possible).


PR Type

Bug fix


Description

  • Replace device variable with job_device for consistency

  • Simplify SLURM job submission scripts

  • Remove complex job polling and cleanup logic

  • Clean up test script temporary directory handling


Diagram Walkthrough

flowchart LR
  A["Variable Rename"] --> B["device → job_device"]
  C["Script Simplification"] --> D["Remove job polling"]
  C --> E["Simplify SLURM directives"]
  F["Test Cleanup"] --> G["Remove temp directory logic"]

File Walkthrough

Relevant files

Bug fix
  bench.sh — Update device variable name (.github/workflows/phoenix/bench.sh, +2/-3)
    • Replace $device variable references with $job_device
    • Remove debug echo statement for the device variable

Enhancement
  submit-bench.sh — Simplify SLURM job submission script (.github/workflows/phoenix/submit-bench.sh, +39/-82)
    • Replace complex job submission with a simplified SLURM script
    • Remove job polling and exit-code handling logic
    • Rename variable from device to job_device
    • Add the -W flag for synchronous job execution (see the sketch after this walkthrough)

  submit.sh — Simplify SLURM job submission script (.github/workflows/phoenix/submit.sh, +43/-79)
    • Replace complex job submission with a simplified SLURM script
    • Remove job polling and exit-code handling logic
    • Rename variable from device to job_device
    • Add the -W flag for synchronous job execution (see the sketch after this walkthrough)

  test.sh — Clean up test script structure (.github/workflows/phoenix/test.sh, +3/-13)
    • Remove temporary directory creation and cleanup logic
    • Move the n_test_threads variable declaration after its first use
    • Simplify script structure and remove cleanup commands
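For orientation, here is a minimal sketch of the submission pattern described above: the step script's body is embedded in a batch script and submitted with -W so the workflow blocks until the SLURM job finishes. This is not the PR's exact code; the QOS, time limit, and output naming are taken from the review excerpts below, and the argument handling is assumed.

```bash
#!/bin/bash
# Hedged sketch of a synchronous SLURM submission (not the exact submit.sh).
# $1: path to the step script to run (e.g. .github/workflows/phoenix/test.sh)
# $2: target device, "cpu" or "gpu"
set -e

sbatch_script="$1"
job_device="$2"
job_slug="$(basename "$sbatch_script" .sh)-$job_device"

sbatch_body=$(<"$sbatch_script")   # read the body of the user script

sbatch <<EOF
#!/bin/bash
#SBATCH -J $job_slug      # job name
#SBATCH -t 03:00:00       # wall-clock limit
#SBATCH -q embers         # QOS name
#SBATCH -o $job_slug.out  # combined stdout/stderr
#SBATCH -W                # do not exit until the submitted job terminates

export job_device="$job_device"

$sbatch_body
EOF
```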


codecov bot commented Aug 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.23%. Comparing base (8173368) to head (d81a0fd).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #978   +/-   ##
=======================================
  Coverage   43.23%   43.23%           
=======================================
  Files          70       70           
  Lines       20109    20109           
  Branches     2516     2513    -3     
=======================================
  Hits         8695     8695           
  Misses       9877     9877           
  Partials     1537     1537           


@sbryngelson marked this pull request as ready for review on August 9, 2025, 12:22
Copilot AI review requested due to automatic review settings on August 9, 2025, 12:22
@sbryngelson merged commit e56d6dc into MFlowCode:master on Aug 9, 2025
30 of 31 checks passed

qodo-merge-pro bot commented Aug 9, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

When job_device is not "gpu", n_ranks and device_opts may be unset or inconsistent (device_opts is only defined inside the GPU branch). Validate defaults for CPU runs to avoid passing an empty variable or wrong rank count.

if [ "$job_device" = "gpu" ]; then
    n_ranks=$(nvidia-smi -L | wc -l)        # number of GPUs on node
    gpu_ids=$(seq -s ' ' 0 $(($n_ranks-1))) # 0,1,2,...,gpu_count-1
    device_opts="--gpu -g $gpu_ids"
fi
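A defensive variant, shown purely for illustration (the CPU defaults below are assumptions, not code from the PR), would initialize both variables before the branch:

```bash
# Sketch: give CPU runs explicit defaults so nothing downstream sees an unset
# or empty variable when job_device is not "gpu".
n_ranks=1
device_opts=""

if [ "$job_device" = "gpu" ]; then
    n_ranks=$(nvidia-smi -L | wc -l)         # number of GPUs on the node
    gpu_ids=$(seq -s ' ' 0 $((n_ranks - 1))) # 0 1 2 ... n_ranks-1
    device_opts="--gpu -g $gpu_ids"
fi
```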
Ordering Risk

n_test_threads is hard-coded to 8 before possibly being recomputed for GPU; ensure this value is appropriate for CPU paths and that the earlier dry-run uses the intended thread count.

./mfc.sh test --dry-run -j 8 $build_opts

n_test_threads=8

if [ "$job_device" = "gpu" ]; then
    gpu_count=$(nvidia-smi -L | wc -l)        # number of GPUs on node
    gpu_ids=$(seq -s ' ' 0 $(($gpu_count-1))) # 0,1,2,...,gpu_count-1
    device_opts="-g $gpu_ids"
    n_test_threads=`expr $gpu_count \* 2`
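One way to avoid the mismatch, shown only as a sketch (the GPU-doubling heuristic is copied from the excerpt above; the reordering is the suggested fix, not the PR's code), is to finalize the thread count before the dry run:

```bash
# Sketch: settle n_test_threads before any mfc.sh invocation uses it.
# build_opts is assumed to be set earlier in the script, as in the excerpt.
n_test_threads=8
if [ "$job_device" = "gpu" ]; then
    gpu_count=$(nvidia-smi -L | wc -l)   # number of GPUs on the node
    n_test_threads=$((gpu_count * 2))
fi

./mfc.sh test --dry-run -j "$n_test_threads" $build_opts
```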
Env Consistency

The submitted script exports job_device but mfc.sh is invoked with -m $2; verify downstream scripts consistently use job_device and not the previous device variable to prevent mismatches.

job_slug="$job_slug"
job_device="$2"

. ./mfc.sh load -c p -m $2

$sbatch_script_contents
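For illustration only, referencing the exported name consistently might look like the following; the -m flag usage is copied from the excerpt, everything else is an assumption:

```bash
# Sketch: set job_device once and reference it everywhere instead of "$2".
job_device="$2"
export job_device

. ./mfc.sh load -c p -m "$job_device"
```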

Copilot AI (Contributor) left a comment


Pull Request Overview

This PR simplifies the SLURM job submission scripts for the Phoenix cluster by removing complex job polling logic and making device detection more consistent. The changes aim to ensure Phoenix properly recognizes GPU-enabled environments for MFC testing and benchmarking.

Key changes:

  • Simplified SLURM job submission by removing job polling and exit code handling
  • Standardized device variable naming from device to job_device
  • Cleaned up temporary directory management in test scripts

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
  • .github/workflows/phoenix/test.sh — Removed temporary directory setup and cleanup, simplified test execution
  • .github/workflows/phoenix/submit.sh — Simplified SLURM submission by removing job polling logic and using synchronous execution
  • .github/workflows/phoenix/submit-bench.sh — Applied the same simplification as submit.sh for benchmarking jobs
  • .github/workflows/phoenix/bench.sh — Updated device variable reference from device to job_device

Comment on lines +8 to 11
./mfc.sh test --dry-run -j 8 $build_opts

n_test_threads=8


Copilot AI Aug 9, 2025


The variable n_test_threads is defined after it's used on line 8. This creates a logical inconsistency where the hardcoded value 8 is used before the variable is defined with the same value.

Suggested change
- ./mfc.sh test --dry-run -j 8 $build_opts
- n_test_threads=8
+ n_test_threads=8
+ ./mfc.sh test --dry-run -j $n_test_threads $build_opts



# read the body of the user script
sbatch_body=$(<"$sbatch_script")
if [ ! -z "$1" ]; then

Copilot AI Aug 9, 2025


[nitpick] Use -n instead of ! -z for checking non-empty strings. The condition should be if [ -n "$1" ]; then for better readability and following shell scripting best practices.

Suggested change
- if [ ! -z "$1" ]; then
+ if [ -n "$1" ]; then


#SBATCH -o $job_slug.out # stdout+stderr
#SBATCH --mem-per-cpu=2G # default mem (overridden below)
"
if [ ! -z "$1" ]; then

Copilot AI Aug 9, 2025


[nitpick] Use -n instead of ! -z for checking non-empty strings. The condition should be if [ -n "$1" ]; then for better readability and following shell scripting best practices.

Suggested change
- if [ ! -z "$1" ]; then
+ if [ -n "$1" ]; then


#SBATCH -t 03:00:00 # Duration of the job (Ex: 15 mins)
#SBATCH -q embers # QOS Name
#SBATCH -o$job_slug.out # Combined output and error messages file
#SBATCH -W # Do not exit until the submitted job terminates.

Copilot AI Aug 9, 2025


The -W flag makes sbatch wait for job completion, but this removes the ability to handle job failures gracefully. The original polling mechanism provided better error handling and cleanup capabilities.

Suggested change
- #SBATCH -W # Do not exit until the submitted job terminates.

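If the synchronous approach is kept, job failures can still be surfaced because sbatch run with --wait exits with the submitted job's exit code (per the SLURM documentation). A hedged sketch of that idea, using a hypothetical script file name:

```bash
# Sketch: fail the CI step when the waited-on job fails, without a polling loop.
# "$job_slug.sh" and "$job_slug.out" are illustrative names, not the PR's files.
if ! sbatch --wait "$job_slug.sh"; then
    echo "SLURM job $job_slug failed" >&2
    cat "$job_slug.out" || true   # show the combined output for debugging
    exit 1
fi
```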

#SBATCH -t 02:00:00 # Duration of the job (Ex: 15 mins)
#SBATCH -q embers # QOS Name
#SBATCH -o$job_slug.out # Combined output and error messages file
#SBATCH -W # Do not exit until the submitted job terminates.

Copilot AI Aug 9, 2025


The -W flag makes sbatch wait for job completion, but this removes the ability to handle job failures gracefully. The original polling mechanism provided better error handling and cleanup capabilities.

Suggested change
- #SBATCH -W # Do not exit until the submitted job terminates.



qodo-merge-pro bot commented Aug 9, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Add robust GPU detection fallback

If nvidia-smi is unavailable or fails, n_ranks becomes empty and arithmetic/seq
will error. Add a fallback to set n_ranks=1 and skip GPU flags when GPUs are not
detected to prevent script failure on CPU nodes.

.github/workflows/phoenix/bench.sh [5-9]

 if [ "$job_device" = "gpu" ]; then
-    n_ranks=$(nvidia-smi -L | wc -l)        # number of GPUs on node
-    gpu_ids=$(seq -s ' ' 0 $(($n_ranks-1))) # 0,1,2,...,gpu_count-1
-    device_opts="--gpu -g $gpu_ids"
+    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
+        n_ranks=$(nvidia-smi -L | wc -l)
+        gpu_ids=$(seq -s ' ' 0 $(($n_ranks-1)))
+        device_opts="--gpu -g $gpu_ids"
+    else
+        echo "Warning: GPU requested but nvidia-smi not available; falling back to CPU settings."
+        n_ranks=1
+        device_opts=""
+    fi
 fi
Suggestion importance [1-10]: 8

Why: The suggestion correctly points out that the script will fail if nvidia-smi is not found or fails, and provides a robust fix to handle this case gracefully.

Impact: Medium
Align thread count usage

The dry-run uses a hardcoded thread count before n_test_threads is set and later
possibly adjusted for GPUs, causing inconsistent configuration between dry-run
and actual run. Initialize n_test_threads first and use it for both invocations
to keep behavior consistent, especially when GPU logic changes it.

.github/workflows/phoenix/test.sh [8-10]

-./mfc.sh test --dry-run -j 8 $build_opts
-
 n_test_threads=8
 
+build_opts=""
+if [ "$job_device" = "gpu" ]; then
+    build_opts="--gpu"
+fi
+
+./mfc.sh test --dry-run -j $n_test_threads $build_opts
+
+if [ "$job_device" = "gpu" ]; then
+    gpu_count=$(nvidia-smi -L | wc -l)        # number of GPUs on node
+    gpu_ids=$(seq -s ' ' 0 $(($gpu_count-1))) # 0,1,2,...,gpu_count-1
+    device_opts="-g $gpu_ids"
+    n_test_threads=$(expr $gpu_count \* 2)
+fi
+
+./mfc.sh test --max-attempts 3 -a -j $n_test_threads $device_opts -- -c phoenix
+
Suggestion importance [1-10]: 6

Why: The suggestion correctly identifies an inconsistency in the number of threads used between the dry run and the actual test run, improving the script's reliability.

Impact: Low

@sbryngelson deleted the fix-phoenix branch on August 10, 2025, 03:55