
Conversation

@sbryngelson (Member) commented on Aug 8, 2025

User description

Make it so Phoenix knows to run with the proper device (GPU-enabled where possible).


PR Type

Bug fix


Description

  • Replace device variable with job_device for consistency

  • Simplify SLURM job submission scripts

  • Remove complex job polling and cleanup logic

  • Clean up test script temporary directory handling


Diagram Walkthrough

flowchart LR
  A["Variable Rename"] --> B["device → job_device"]
  C["Script Simplification"] --> D["Remove job polling"]
  C --> E["Simplify SLURM directives"]
  F["Test Cleanup"] --> G["Remove temp directory logic"]

File Walkthrough

Relevant files

Bug fix
  bench.sh — Update device variable name (.github/workflows/phoenix/bench.sh, +2/-3)
    • Replace $device variable references with $job_device
    • Remove debug echo statement for the device variable

Enhancement
  submit-bench.sh — Simplify SLURM job submission script (.github/workflows/phoenix/submit-bench.sh, +39/-82)
    • Replace complex job submission with a simplified SLURM script
    • Remove job polling and exit-code handling logic
    • Rename variable from device to job_device
    • Add the -W flag for synchronous job execution (see the sketch after this walkthrough)

  submit.sh — Simplify SLURM job submission script (.github/workflows/phoenix/submit.sh, +43/-79)
    • Replace complex job submission with a simplified SLURM script
    • Remove job polling and exit-code handling logic
    • Rename variable from device to job_device
    • Add the -W flag for synchronous job execution (see the sketch after this walkthrough)

  test.sh — Clean up test script structure (.github/workflows/phoenix/test.sh, +3/-13)
    • Remove temporary directory creation and cleanup logic
    • Move the n_test_threads variable declaration after its first use
    • Simplify script structure and remove cleanup commands
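For orientation, here is a minimal sketch of the submission pattern described above: the step script's body is embedded in a batch script and submitted with -W so the workflow blocks until the SLURM job finishes. This is not the PR's exact code; the QOS, time limit, and output naming are taken from the review excerpts below, and the argument handling is assumed.

```bash
#!/bin/bash
# Hedged sketch of a synchronous SLURM submission (not the exact submit.sh).
# $1: path to the step script to run (e.g. .github/workflows/phoenix/test.sh)
# $2: target device, "cpu" or "gpu"
set -e

sbatch_script="$1"
job_device="$2"
job_slug="$(basename "$sbatch_script" .sh)-$job_device"

sbatch_body=$(<"$sbatch_script")   # read the body of the user script

sbatch <<EOF
#!/bin/bash
#SBATCH -J $job_slug      # job name
#SBATCH -t 03:00:00       # wall-clock limit
#SBATCH -q embers         # QOS name
#SBATCH -o $job_slug.out  # combined stdout/stderr
#SBATCH -W                # do not exit until the submitted job terminates

export job_device="$job_device"

$sbatch_body
EOF
```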


codecov bot commented Aug 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.23%. Comparing base (8173368) to head (d81a0fd).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #978   +/-   ##
=======================================
  Coverage   43.23%   43.23%           
=======================================
  Files          70       70           
  Lines       20109    20109           
  Branches     2516     2513    -3     
=======================================
  Hits         8695     8695           
  Misses       9877     9877           
  Partials     1537     1537           


@sbryngelson marked this pull request as ready for review on August 9, 2025, 12:22
Copilot AI review requested due to automatic review settings on August 9, 2025, 12:22
@sbryngelson merged commit e56d6dc into MFlowCode:master on Aug 9, 2025
30 of 31 checks passed

qodo-merge-pro bot commented Aug 9, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

When job_device is not "gpu", n_ranks and device_opts may be unset or inconsistent (device_opts is only defined inside the GPU branch). Validate defaults for CPU runs to avoid passing an empty variable or wrong rank count.

if [ "$job_device" = "gpu" ]; then
    n_ranks=$(nvidia-smi -L | wc -l)        # number of GPUs on node
    gpu_ids=$(seq -s ' ' 0 $(($n_ranks-1))) # 0,1,2,...,gpu_count-1
    device_opts="--gpu -g $gpu_ids"
fi
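A defensive variant, shown purely for illustration (the CPU defaults below are assumptions, not code from the PR), would initialize both variables before the branch:

```bash
# Sketch: give CPU runs explicit defaults so nothing downstream sees an unset
# or empty variable when job_device is not "gpu".
n_ranks=1
device_opts=""

if [ "$job_device" = "gpu" ]; then
    n_ranks=$(nvidia-smi -L | wc -l)         # number of GPUs on the node
    gpu_ids=$(seq -s ' ' 0 $((n_ranks - 1))) # 0 1 2 ... n_ranks-1
    device_opts="--gpu -g $gpu_ids"
fi
```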
Ordering Risk

n_test_threads is hard-coded to 8 before possibly being recomputed for GPU; ensure this value is appropriate for CPU paths and that the earlier dry-run uses the intended thread count.

./mfc.sh test --dry-run -j 8 $build_opts

n_test_threads=8

if [ "$job_device" = "gpu" ]; then
    gpu_count=$(nvidia-smi -L | wc -l)        # number of GPUs on node
    gpu_ids=$(seq -s ' ' 0 $(($gpu_count-1))) # 0,1,2,...,gpu_count-1
    device_opts="-g $gpu_ids"
    n_test_threads=`expr $gpu_count \* 2`
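One way to avoid the mismatch, shown only as a sketch (the GPU-doubling heuristic is copied from the excerpt above; the reordering is the suggested fix, not the PR's code), is to finalize the thread count before the dry run:

```bash
# Sketch: settle n_test_threads before any mfc.sh invocation uses it.
# build_opts is assumed to be set earlier in the script, as in the excerpt.
n_test_threads=8
if [ "$job_device" = "gpu" ]; then
    gpu_count=$(nvidia-smi -L | wc -l)   # number of GPUs on the node
    n_test_threads=$((gpu_count * 2))
fi

./mfc.sh test --dry-run -j "$n_test_threads" $build_opts
```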
Env Consistency

The submitted script exports job_device but mfc.sh is invoked with -m $2; verify downstream scripts consistently use job_device and not the previous device variable to prevent mismatches.

job_slug="$job_slug"
job_device="$2"

. ./mfc.sh load -c p -m $2

$sbatch_script_contents
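For illustration only, referencing the exported name consistently might look like the following; the -m flag usage is copied from the excerpt, everything else is an assumption:

```bash
# Sketch: set job_device once and reference it everywhere instead of "$2".
job_device="$2"
export job_device

. ./mfc.sh load -c p -m "$job_device"
```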

Copilot AI (Contributor) left a comment


Pull Request Overview

This PR simplifies the SLURM job submission scripts for the Phoenix cluster by removing complex job polling logic and making device detection more consistent. The changes aim to ensure Phoenix properly recognizes GPU-enabled environments for MFC testing and benchmarking.

Key changes:

  • Simplified SLURM job submission by removing job polling and exit code handling
  • Standardized device variable naming from device to job_device
  • Cleaned up temporary directory management in test scripts

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
  • .github/workflows/phoenix/test.sh — Removed temporary directory setup and cleanup, simplified test execution
  • .github/workflows/phoenix/submit.sh — Simplified SLURM submission by removing job polling logic and using synchronous execution
  • .github/workflows/phoenix/submit-bench.sh — Applied the same simplification as submit.sh for benchmarking jobs
  • .github/workflows/phoenix/bench.sh — Updated device variable reference from device to job_device

Comment on lines +8 to 11
./mfc.sh test --dry-run -j 8 $build_opts

n_test_threads=8


Copilot AI Aug 9, 2025


The variable n_test_threads is defined after it's used on line 8. This creates a logical inconsistency where the hardcoded value 8 is used before the variable is defined with the same value.

Suggested change
- ./mfc.sh test --dry-run -j 8 $build_opts
- n_test_threads=8
+ n_test_threads=8
+ ./mfc.sh test --dry-run -j $n_test_threads $build_opts



# read the body of the user script
sbatch_body=$(<"$sbatch_script")
if [ ! -z "$1" ]; then

Copilot AI Aug 9, 2025


[nitpick] Use -n instead of ! -z for checking non-empty strings. The condition should be if [ -n "$1" ]; then for better readability and following shell scripting best practices.

Suggested change
- if [ ! -z "$1" ]; then
+ if [ -n "$1" ]; then


#SBATCH -o $job_slug.out # stdout+stderr
#SBATCH --mem-per-cpu=2G # default mem (overridden below)
"
if [ ! -z "$1" ]; then

Copilot AI Aug 9, 2025


[nitpick] Use -n instead of ! -z for checking non-empty strings. The condition should be if [ -n "$1" ]; then for better readability and following shell scripting best practices.

Suggested change
- if [ ! -z "$1" ]; then
+ if [ -n "$1" ]; then


#SBATCH -t 03:00:00 # Duration of the job (Ex: 15 mins)
#SBATCH -q embers # QOS Name
#SBATCH -o$job_slug.out # Combined output and error messages file
#SBATCH -W # Do not exit until the submitted job terminates.

Copilot AI Aug 9, 2025


The -W flag makes sbatch wait for job completion, but this removes the ability to handle job failures gracefully. The original polling mechanism provided better error handling and cleanup capabilities.

Suggested change
- #SBATCH -W # Do not exit until the submitted job terminates.

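If the synchronous approach is kept, job failures can still be surfaced because sbatch run with --wait exits with the submitted job's exit code (per the SLURM documentation). A hedged sketch of that idea, using a hypothetical script file name:

```bash
# Sketch: fail the CI step when the waited-on job fails, without a polling loop.
# "$job_slug.sh" and "$job_slug.out" are illustrative names, not the PR's files.
if ! sbatch --wait "$job_slug.sh"; then
    echo "SLURM job $job_slug failed" >&2
    cat "$job_slug.out" || true   # show the combined output for debugging
    exit 1
fi
```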

#SBATCH -t 02:00:00 # Duration of the job (Ex: 15 mins)
#SBATCH -q embers # QOS Name
#SBATCH -o$job_slug.out # Combined output and error messages file
#SBATCH -W # Do not exit until the submitted job terminates.

Copilot AI Aug 9, 2025


The -W flag makes sbatch wait for job completion, but this removes the ability to handle job failures gracefully. The original polling mechanism provided better error handling and cleanup capabilities.

Suggested change
- #SBATCH -W # Do not exit until the submitted job terminates.



qodo-merge-pro bot commented Aug 9, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Add robust GPU detection fallback

If nvidia-smi is unavailable or fails, n_ranks becomes empty and arithmetic/seq
will error. Add a fallback to set n_ranks=1 and skip GPU flags when GPUs are not
detected to prevent script failure on CPU nodes.

.github/workflows/phoenix/bench.sh [5-9]

 if [ "$job_device" = "gpu" ]; then
-    n_ranks=$(nvidia-smi -L | wc -l)        # number of GPUs on node
-    gpu_ids=$(seq -s ' ' 0 $(($n_ranks-1))) # 0,1,2,...,gpu_count-1
-    device_opts="--gpu -g $gpu_ids"
+    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
+        n_ranks=$(nvidia-smi -L | wc -l)
+        gpu_ids=$(seq -s ' ' 0 $(($n_ranks-1)))
+        device_opts="--gpu -g $gpu_ids"
+    else
+        echo "Warning: GPU requested but nvidia-smi not available; falling back to CPU settings."
+        n_ranks=1
+        device_opts=""
+    fi
 fi
Suggestion importance [1-10]: 8

Why: The suggestion correctly points out that the script will fail if nvidia-smi is not found or fails, and provides a robust fix to handle this case gracefully.

Impact: Medium
Align thread count usage

The dry-run uses a hardcoded thread count before n_test_threads is set and later
possibly adjusted for GPUs, causing inconsistent configuration between dry-run
and actual run. Initialize n_test_threads first and use it for both invocations
to keep behavior consistent, especially when GPU logic changes it.

.github/workflows/phoenix/test.sh [8-10]

-./mfc.sh test --dry-run -j 8 $build_opts
-
 n_test_threads=8
 
+build_opts=""
+if [ "$job_device" = "gpu" ]; then
+    build_opts="--gpu"
+fi
+
+./mfc.sh test --dry-run -j $n_test_threads $build_opts
+
+if [ "$job_device" = "gpu" ]; then
+    gpu_count=$(nvidia-smi -L | wc -l)        # number of GPUs on node
+    gpu_ids=$(seq -s ' ' 0 $(($gpu_count-1))) # 0,1,2,...,gpu_count-1
+    device_opts="-g $gpu_ids"
+    n_test_threads=$(expr $gpu_count \* 2)
+fi
+
+./mfc.sh test --max-attempts 3 -a -j $n_test_threads $device_opts -- -c phoenix
+
Suggestion importance [1-10]: 6

Why: The suggestion correctly identifies an inconsistency in the number of threads used between the dry run and the actual test run, improving the script's reliability.

Impact: Low

@sbryngelson deleted the fix-phoenix branch on August 10, 2025, 03:55