Conversation

@bvandermoon
Collaborator

Description

Use ~/xpk/xpk.py as the default xpk_path instead of a path with the full package directory like /home/bvandermoon/workspace/maxtext/~/xpk/xpk.py. This change is needed to make the benchmark_runner work after a recent refactor.

I was getting an error when running the benchmark runner without this change:

Waiting for `bvand-mixtral-8x7b-1-041000-m6p`, for 0 seconds
python3: can't open file '/home/bvandermoon/workspace/maxtext/~/xpk/xpk.py': [Errno 2] No such file or directory
Task: `bvand-mixtral-8x7b-1-041000-m6p` terminated with code `2`
No monitoring threads found for workload 'bvand-mixtral-8x7b-1-041000-m6p'.
Unable to run xpk workload: {xpk_workload_name}
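For illustration, here is a minimal sketch of how a path like the one in the error above can arise. It assumes the old default was built by joining a package directory with the unexpanded string `~/xpk/xpk.py`; the variable names and the `/home/user` directory are hypothetical and not taken from the MaxText code.

```python
import os

# Hypothetical values for illustration only (not the actual MaxText variables).
package_dir = "/home/user/workspace/maxtext"
default_xpk_path = "~/xpk/xpk.py"

# os.path.join does not interpret "~"; it is kept as a literal path segment,
# so the result points at a directory literally named "~" inside package_dir.
joined = os.path.join(package_dir, default_xpk_path)

# os.path.expanduser replaces a leading "~" with the current user's home directory.
expanded = os.path.expanduser(default_xpk_path)

print(joined)
print(expanded)
```

On a POSIX system the first path still contains the literal `~` segment, which matches the shape of the path in the `python3: can't open file` message.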

@shauryagup there are a few pathways files that look like they have this same issue. You may want to check if these files are still working for you and try this change if they are not:

Tests

The benchmark runner schedules the workload after this change and I get a link to logs in the XPK output.

Seeing a different issue now where the actual workload does not run properly. I will keep debugging that.

python3 benchmarks/benchmark_runner.py xpk \
    --project=${PROJECT} \
    --zone=${ZONE} \
    --device_type=v6e-256 \
    --num_slices=1 \
    --cluster_name=${CLUSTER_NAME} \
    --base_output_directory=${OUTPUT_DIR} \
    --model_name="mixtral_8x7b_dropped" \
    --base_docker_image=maxtext_base_image

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

Collaborator

@shralex shralex left a comment

There are a few files with a similar issue. Could you update them too?

benchmarks/recipes/pw_long_running_recipe.py
benchmarks/recipes/pw_mcjax_benchmark_recipe.py
benchmarks/recipes/pw_mcjax_checkpoint_benchmark_recipe.py

@bvandermoon
Collaborator Author

There are a few files with a similar issue. Could you update them too?

benchmarks/recipes/pw_long_running_recipe.py
benchmarks/recipes/pw_mcjax_benchmark_recipe.py
benchmarks/recipes/pw_mcjax_checkpoint_benchmark_recipe.py

Thanks @shralex. I had pinged @shauryagup in the description about these files. I don't know how these files are used, and I don't have a current Pathways setup in my environment to test that these changes work, so I wanted the Pathways team to test it on their side. Let me know if you think it's better to make the changes here and have the Pathways team check later whether they work on their side.

@lukebaumann
Collaborator

These can be updated in the same way. I will ping our chat space letting our team know of the change.

pw_long_running_recipe.py
pw_mcjax_checkpoint_benchmark_recipe.py
pw_mcjax_benchmark_recipe.py

@lukebaumann
Collaborator

These can be updated in the same way. I will ping our chat space letting our team know of the change.

In fact, these should use the default value for xpk_path. Let me run a quick test.

@lukebaumann
Collaborator

There seem to be several updates needed for the Pathways benchmark recipes after the refactor. @SujeethJinesh will be addressing them in the coming days/weeks and will incorporate the changes to the XPK path in that refactor as well.

TLDR the pathways benchmark recipes are broken and should not block this PR IMO.

@SamuelMarks
Collaborator

I tried making it os.path.join(os.path.expanduser("~"), "directory") in one of my earlier commits, but that doesn't work. Someone made a mistake earlier on by not expanding ~ with os.path.expanduser, so the code actually expects a folder literally named ~ inside the maxtext directory, with the xpk folder inside it. Otherwise the tests fail on CI.
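As a rough sketch of what expanding the path up front could look like, the helper below expands a leading `~` and fails early with a clear message if the script is missing. The name `resolve_xpk_path` is hypothetical and is not part of MaxText or xpk; this is an illustration under that assumption, not the change made in this PR.

```python
import os


def resolve_xpk_path(raw_path: str) -> str:
  """Hypothetical helper: expand a leading "~" and check that the script exists."""
  # expanduser handles "~"; abspath normalizes the result to an absolute path.
  expanded = os.path.abspath(os.path.expanduser(raw_path))
  if not os.path.isfile(expanded):
    raise FileNotFoundError(
        f"xpk script not found at {expanded!r} (resolved from {raw_path!r})"
    )
  return expanded
```

Calling `resolve_xpk_path("~/xpk/xpk.py")` returns the absolute path under the current user's home directory when the file exists there, and raises `FileNotFoundError` otherwise, which would surface a wrong path before the workload is launched.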

@SujeethJinesh
Collaborator

Pathways tests will need a bit of a refactor that is coming, but we explicitly set the XPK path in our recipes. Changing the default shouldn't fix the pathways recipes regardless.

@bvandermoon bvandermoon force-pushed the bvandermoon-xpk-path branch from fd6cbdf to 13f7722 Compare April 10, 2025 18:43
@bvandermoon
Collaborator Author

There seem to be several updates needed for the Pathways benchmark recipes after the refactor. @SujeethJinesh will be addressing them in the coming days/weeks and will incorporate the changes to the XPK path in that refactor as well.

TLDR the pathways benchmark recipes are broken and should not block this PR IMO.

Thanks @lukebaumann. I went ahead and updated the XPK paths for the Pathways files. I'm pretty sure the previous path wouldn't work. @SujeethJinesh, FYI for when you refactor these files and get them working.

@bvandermoon
Collaborator Author

I tried making it os.path.join(os.path.expanduser("~"), "directory") in one of my earlier commits, but that doesn't work. Someone made a mistake earlier on by not expanding ~ with os.path.expanduser, so the code actually expects a folder literally named ~ inside the maxtext directory, with the xpk folder inside it. Otherwise the tests fail on CI.

@SamuelMarks I'm not quite following this. I think it is just expecting the directory to be located in the home directory, right? That is how I have it set up in my local workspace, and the XPK job kicks off successfully.

@copybara-service copybara-service bot merged commit e86e6a8 into main Apr 10, 2025
27 checks passed
@copybara-service copybara-service bot deleted the bvandermoon-xpk-path branch April 10, 2025 20:25
@SamuelMarks
Collaborator

SamuelMarks commented Apr 10, 2025

@bvandermoon If it works for you in CI, that's great. Obviously it is intended to expand to the user directory. Just saying it was flaky when I tried it on Monday…

But if it works: merge! - Would like it to be properly referenced.

EDIT: Quote #1561 (comment)

Yeah, I register 9 files; rg -Fl 'xpk"':
