Add pyproject.toml with legacy build backend to keep most logic in setup.py #7033

Open: wants to merge 76 commits into base: master

loadams
Collaborator

@loadams loadams commented Feb 13, 2025

  • Test all build cases this needs to support.
  • Test Windows builds/support: successfully built deepspeed-0.16.5+1d869d1f-cp311-cp311-win_amd64.whl.
  • Confirm pre-compiling ops: works with --no-build-isolation.
  • Test that commit hashes are added to dev builds (when building wheels, error from python -m build): successfully built deepspeed-0.16.5+1d869d1f.tar.gz and deepspeed-0.16.5+unknown-py3-none-any.whl.
  • Add pyproject.toml to path triggers, similar to setup.py.
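
The two artifact names in the checklist differ only in the local version segment (`+1d869d1f` versus `+unknown`). A minimal sketch of how such a suffix could be produced is below. The helper names `short_git_hash` and `dev_version` are hypothetical and are not taken from DeepSpeed's setup.py; the sketch only assumes that the hash comes from `git rev-parse --short HEAD` and that `unknown` is the fallback when no hash is available.

```python
import subprocess
from typing import Optional


def short_git_hash(repo_dir: str = ".") -> Optional[str]:
    """Return the short commit hash for repo_dir, or None if git cannot provide one."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, or repo_dir is not a git checkout
        # (for example inside an isolated build directory).
        return None
    return out.decode("ascii", "replace").strip() or None


def dev_version(base: str, git_hash: Optional[str]) -> str:
    """Append a local version segment, falling back to 'unknown' (hypothetical helper)."""
    return f"{base}+{git_hash or 'unknown'}"


print(dev_version("0.16.5", "1d869d1f"))  # 0.16.5+1d869d1f
print(dev_version("0.16.5", None))        # 0.16.5+unknown
```

Under these assumptions, a `+unknown` wheel would indicate that the hash lookup ran somewhere git metadata was not reachable.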

The main goal of this effort is to comply with the changes coming in pip 25.1 (listed here), which will break editable installs. Future PRs will fully move from setup.py to pyproject.toml.

Fixes: #7031

MII equivalent PR: deepspeedai/DeepSpeed-MII#555
DS-Kernels equivalent PR: deepspeedai/DeepSpeed-Kernels#20

@jeffra
Collaborator

jeffra commented Feb 14, 2025

@mrwyattii we just went through some of this with arctic training. If it’s helpful @loadams let’s discuss on slack a bit. There’s a ton that’s currently happening in setup.py, this could be a big lift? But I agree, needs to happen!

@loadams loadams changed the title Add pyproject.toml Add pyproject.toml with legacy build backend to keep most logic in setup.py Feb 19, 2025
@loadams
Collaborator Author

loadams commented Feb 19, 2025

Edit: this is no longer correct with latest changes.

The current problem is that the logic inside setup.py aside from the call to setup() isn't run. This means that we don't append cupy into the requirements file, and tests fail as a result. This impacts other parts of the build experience, so we will need to do more work to switch to a modern build backend.

@loadams loadams marked this pull request as ready for review February 25, 2025 18:37
rraminen and others added 22 commits March 25, 2025 08:51
This change is required to successfully build fp_quantizer extension on
ROCm.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
cc @tjruwase @jomayeri

---------

Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Fix #7029 
- Add Chinese blog for deepspeed windows
- Fix format in README.md

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Adding compile support for AIO library on AMD GPUs.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Make trace cache warnings configurable, and disabled by default. 

Fix #6985, #4081, #5033, #5006, #5662

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Update CUDA compute capability for cross compile according to wiki page.
https://en.wikipedia.org/wiki/CUDA#GPUs_supported

---------

Signed-off-by: Hongwei <hongweichen@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
…ently, so we aren't seeing cupy installed.

Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Propagate API change.

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
- add zero2 test
- minor fix with transformer version update & ds master merge.

Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
With bf16 and MoE, refreshing optimizer state from a bf16 checkpoint raises
IndexError: list index out of range

Signed-off-by: shaomin <wukon1992@gmail.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.4
Author           - @loadams

Co-authored-by: loadams <loadams@users.noreply.github.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
@jeffra and I fixed this many years ago, so bringing this doc to a
correct state.

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Description
This PR includes Tecorigin SDAA accelerator support.
With this PR, DeepSpeed supports SDAA as backend for training tasks.

---------

Signed-off-by: siqi <siqi@tecorigin.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
A-transformer and others added 12 commits March 25, 2025 08:51
Keeps lines within PEP 8 length limits.
Enhances readability with a single, concise expression.
Preserves original functionality.

---------

Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: shaomin <wukon1992@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: siqi <siqi@tecorigin.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Signed-off-by: Hongwei <hongweichen@microsoft.com>
Signed-off-by: Liang Cheng <astarxp777@gmail.com>
Signed-off-by: A-transformer <astarxp777@gmail.com>
Co-authored-by: Raza Sikander <srsikander@habana.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: siqi654321 <siqi202311@163.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com>
Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Yejing-Lai <yejing.lai@intel.com>
Co-authored-by: A-transformer <astarxp777@gmail.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Unpin transformers version for all workflows except
`nv-torch-latest-v100` as this still has a tolerance issue with some
quantization tests.

Signed-off-by: Logan Adams <loadams@microsoft.com>
Resolves #6997 

This PR conditionally quotes environment variable values—only wrapping
those containing special characters (like parentheses) that could
trigger bash errors. Safe values remain unquoted.
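
As an illustration of the conditional-quoting idea only (the PR's actual implementation is not shown here), Python's standard `shlex.quote` behaves this way: it returns a value unchanged when it contains only shell-safe characters and wraps it in single quotes otherwise. The helper `export_line` and the variable names are hypothetical.

```python
import shlex


def export_line(name: str, value: str) -> str:
    """Build a bash export statement, quoting the value only when needed.

    Hypothetical helper: shlex.quote leaves safe values untouched and
    single-quotes anything containing characters such as parentheses.
    """
    return f"export {name}={shlex.quote(value)}"


print(export_line("SAFE_VAR", "INFO"))     # export SAFE_VAR=INFO
print(export_line("RISKY_VAR", "a(b)"))    # export RISKY_VAR='a(b)'
```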

---------

Signed-off-by: Saurabh <saurabhkoshatwar1996@gmail.com>
Signed-off-by: Saurabh Koshatwar <saurabhkoshatwar1996@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Correct the BACKWARD_PREFETCH_SUBMIT mismatch
FORWARD_PREFETCH_SUBMIT = 'forward_prefetch_submit'

---------

Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: shaomin <wukon1992@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: siqi <siqi@tecorigin.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Signed-off-by: Hongwei <hongweichen@microsoft.com>
Signed-off-by: A-transformer <astarxp777@gmail.com>
Co-authored-by: Raza Sikander <srsikander@habana.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: siqi654321 <siqi202311@163.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com>
Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Yejing-Lai <yejing.lai@intel.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
…Tests (#7146)

Enhancing ci/nightly coverage for gaudi2 device
Tests added:
        test_autotp_training.py
        test_ulysses.py
        test_linear::TestLoRALinear and test_linear::TestBasicLinear
        test_ctx::TestEngine
These provide coverage for model_parallelism and linear features.
The tests are stable: 10/10 runs pass.
The new tests are expected to increase CI time by 3-4 minutes and
nightly job time by 15 minutes.

Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Changes from huggingface/transformers#36654 in
transformers cause issues with the torch 2.5 version we were using. This
just updates us to a newer version.

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
@tjruwase Don't merge yet, I will leave a comment when it is ready for
merge. Thank you.

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
)

This PR is a continuation of the efforts to improve DeepSpeed
performance when using PyTorch compile.

Dynamo breaks the graph because `flat_tensor.requires_grad = False`:

* Is a side-effecting operation on tensor metadata
* Occurs in a context where Dynamo expects static tensor properties for
tracing

`flat_tensor.requires_grad` is redundant and can be safely removed
because:
* `_allgather_params()` function is already decorated with
`@torch.no_grad()` which ensures the desired property
* `flat_tensor` is created using the `torch.empty()` which sets the
`requires_grad=False` by default.

---------

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
ZeRO3 requires explicit cleaning in tests when reusing the environment.
This PR adds `destroy` calls to the tests to free memory and avoid
potential errors due to memory leaks.

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: c8ef <c8ef@outlook.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Hongwei <hongweichen@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Comment on lines +4 to +6
"setuptools>=64",
"torch",
"wheel"

If you depend on setuptools 70.1 or later, you won't need wheel.
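
The reviewer's point can be expressed as a small rule: list `wheel` as a build requirement only when the minimum setuptools version is below 70.1. The sketch below is illustrative only; `parse_release` and `build_requires` are hypothetical helpers, and the version parsing is deliberately simplistic (numeric dotted releases only).

```python
def parse_release(version: str) -> tuple:
    """Turn '70.1.0' into (70, 1, 0); stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def build_requires(min_setuptools: str) -> list:
    """Return build requirements, adding 'wheel' only for setuptools < 70.1."""
    reqs = [f"setuptools>={min_setuptools}", "torch"]
    if parse_release(min_setuptools) < (70, 1):
        reqs.append("wheel")
    return reqs


print(build_requires("64"))    # ['setuptools>=64', 'torch', 'wheel']
print(build_requires("70.1"))  # ['setuptools>=70.1', 'torch']
```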

Suggested change
- "setuptools>=64",
- "torch",
- "wheel"
+ "setuptools>=70.1",
+ "torch"

Successfully merging this pull request may close these issues.

Fix - Update DeepSpeed to be PEP517 compliant, update to pyproject.toml