forked from pytorch/pytorch
[release/2.8] update related_commit #2431
Closed
Conversation
Reintroduce CIRCLE_TAG to be able to set PYTORCH_BUILD_VERSION without date (cherry picked from commit e294d4d with modifications for release/2.8)
(Changes selected from ef226be and pytorch@fadc936)
Cherry-pick of #2130
Validation: http://rocm-ci.amd.com/job/rocm-pytorch-manylinux-wheel-builder-lw/155/
---------
Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit 2c220b2)
…ersion: since we need to rebuild magma for all supported architectures, we cannot use the upstream magma tarball anyway
From upstream PR pytorch#154900. Resolves: https://ontrack-internal.amd.com/browse/SWDEV-536994

After following the steps to reproduce in container **registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16231_ubuntu22.04_py3.10_pytorch_lw_release2.7_no_user_66a18277**:
```
root@ubb4-rack-22:/var/lib/jenkins/pytorch# history
    1  cd /var/lib/jenkins/pytorch
    2  TEST_CONFIG=default CONTINUE_THROUGH_ERROR=True .ci/pytorch/test.sh
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
+ MAYBE_ROCM=rocm/
+ [[ rocm == *xpu* ]]
+ [[ rocm != *-bazel-* ]]
+ pip_install ninja==1.10.2
+ pip_install_pkg='python3 -m pip install --progress-bar off'
+ python3 -m pip install --progress-bar off ninja==1.10.2
Collecting ninja==1.10.2
  Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (5.0 kB)
Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB)
Installing collected packages: ninja
  Attempting uninstall: ninja
    Found existing installation: ninja 1.11.1.4
    Uninstalling ninja-1.11.1.4:
      Successfully uninstalled ninja-1.11.1.4
Successfully installed ninja-1.10.2
+ export PATH=/root/.local/bin:/opt/venv/bin:/opt/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ PATH=/root/.local/bin:/opt/venv/bin:/opt/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ [[ rocm == *aarch64* ]]
+ [[ rocm == *asan* ]]
+ [[ rocm == *-debug* ]]
+ [[ rocm != *-bazel-* ]]
+ echo 'We are not in debug mode: rocm. Expect the assertion to pas
```
Validation: http://rocm-ci.amd.com/job/mainline-pytorch2.7-manylinux-wheels/126/
---------
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit 0bd4030)
(cherry picked from upstream commit 04bd7e6)
…to enable PyTorch build on ROCm 7.0
This PR enables NHWC batchnorm on MIOpen in the release/2.6 branch. `ROCm version >= 6.5` and the `PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM=1` environment variable are required to enable NHWC batchnorm.

This PR branch for `release/2.6` was built and tested using docker image: `compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:15845_ubuntu22.04_py3.10_pytorch_rocm6.4_internal_testing_8190c80`.

New batchnorm tests introduced, train:
```
test_batchnorm_train_NCHW_vs_cpu_float32 (__main__.TestNN) ... ok (0.040s)
test_batchnorm_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.007s)
test_batchnorm_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.005s)
test_batchnorm_train_NCHW_vs_native_float32 (__main__.TestNN) ... ok (0.089s)
test_batchnorm_train_NCHW_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_NCHW_float32 (__main__.TestNN) ... ok (0.020s)
test_batchnorm_train_NHWC_vs_NCHW_mixed_bfloat16 (__main__.TestNN) ... ok (0.006s)
test_batchnorm_train_NHWC_vs_NCHW_mixed_float16 (__main__.TestNN) ... ok (0.006s)
test_batchnorm_train_NHWC_vs_cpu_float32 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_native_float32 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_native_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.004s)
```
inference:
```
test_batchnorm_inference_NCHW_vs_cpu_float32 (__main__.TestNN) ... ok (0.025s)
test_batchnorm_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.005s)
test_batchnorm_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NCHW_vs_native_float32 (__main__.TestNN) ... ok (0.102s)
test_batchnorm_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_NCHW_float32 (__main__.TestNN) ... ok (0.018s)
test_batchnorm_inference_NHWC_vs_NCHW_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_NCHW_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_cpu_float32 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_native_float32 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_native_mixed_bfloat16 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.003s)
```
---------
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit 45896ac)
(cherry picked from commit 7010d60)
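For illustration, a minimal sketch of exercising the NHWC path (the env var name comes from this PR; the module and shapes are illustrative):
```python
import os
# Opt-in flag from this PR; must be set before the batchnorm path is chosen.
os.environ["PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM"] = "1"

import torch
import torch.nn as nn

# BatchNorm2d over a channels_last (NHWC) tensor on a ROCm device.
bn = nn.BatchNorm2d(64).cuda()
x = torch.randn(8, 64, 32, 32, device="cuda").to(memory_format=torch.channels_last)
out = bn(x)
print(out.is_contiguous(memory_format=torch.channels_last))
```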
(cherry picked from commit 03c7da0)
NHWC batchnorm enabled by default if ROCm>=7.0 (cherry picked from commit e0afc3a)
Remove `xfail` from the `batch_norm_with_update` op in `test_grad` and `test_vmap_autograd_grad`; these tests pass since ROCm 6.4. Fixes https://ontrack-internal.amd.com/browse/SWDEV-529820 (cherry picked from commit 99b0758) (cherry picked from commit a7044a4) (cherry picked from commit 3fc00a8)
Ported mx fp8 part from #2046.

Current test stats (accounting only for blockwise scale tests):
```
PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v
Ran 225 tests in 8.256s
FAILED (failures=1, skipped=150)
```
_74 tests pass._

**fp8 mx data type sample test case:**
```
test_blockwise_mxfp8_numerics_test_case_name_data_random_scales_one_fast_accum_True_512_128_256_cuda (__main__.TestFP8MatmulCudaCUDA)
hipblaslt-bench --api_method c -m 256 -n 512 -k 128 --lda 128 --ldb 128 --ldc 256 --ldd 256 --stride_a 0 --stride_b 0 --stride_c 0 --stride_d 0 --alpha 1 --beta 0 --transA T --transB N --batch_count 1 --scaleA 3 --scaleB 3 --a_type f8_r --b_type f8_r --c_type bf16_r --d_type bf16_r --compute_type f32_r --algo_method index --solution_index -2146957310 --rotating 0 --cold_iters 0 --iters 0
```
---------
Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
(cherry picked from commit d17e222)
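For orientation, a minimal fp8 scaled-matmul sketch, simplified to tensorwise (scalar) scales; the blockwise mx tests above instead pass per-block scale tensors, and `torch._scaled_mm` is a private API whose signature has varied across releases:
```python
import torch

# mat1 row-major, mat2 column-major, as torch._scaled_mm requires.
a = torch.randn(512, 128, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(256, 128, device="cuda").to(torch.float8_e4m3fn).t()

# Tensorwise scales; blockwise mx scaling would use per-block scales instead.
scale_a = torch.tensor(1.0, device="cuda")
scale_b = torch.tensor(1.0, device="cuda")

out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([512, 256])
```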
…ilure (#2204)
- Previously, expected values were calculated on the GPU using the same dtype as the result values
- Now, expected values are calculated on the CPU using float32 dtype
- This fixes a test failure observed on Navi48 where the difference between eager mode (expected) and Inductor/Triton (result) did not meet the error tolerance when `sum` was evaluated on an array of float16 values

Co-authored-by: pnikolic-amd <pnikolic@amd.com>
(cherry picked from commit 8fe3cdd)
(cherry picked from commit 34f3b3e)
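The pattern behind the fix, as a minimal sketch (tensor shape and tolerances are illustrative, not the test's actual values):
```python
import torch

x = torch.randn(1024, device="cuda", dtype=torch.float16)

# Result under test, e.g. compiled through Inductor/Triton.
result = torch.compile(torch.sum)(x)

# Reference computed on CPU in float32, so low-precision GPU accumulation
# cannot contaminate the expected value.
expected = x.cpu().to(torch.float32).sum()

torch.testing.assert_close(result.cpu().to(torch.float32), expected, rtol=1e-3, atol=1e-3)
```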
Fixes https://ontrack-internal.amd.com/browse/SWDEV-522391 for PT 2.7 (cherry picked from commit df38cca)
Related to c7a1e32. Fixes https://ontrack-internal.amd.com/browse/SWDEV-537835

Not a Navi-specific failure:
```
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1412, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/jenkins/pytorch/test/test_binary_ufuncs.py", line 1671, in test_cuda_tensor_pow_scalar_tensor
    self._test_pow(base, exp)
  File "/var/lib/jenkins/pytorch/test/test_binary_ufuncs.py", line 1482, in _test_pow
    self.assertEqual(actual, expected)
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 4052, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: The values for attribute 'dtype' do not match: torch.float32 != torch.float64.
```
Using `.to(actual)` without specifying dtype/device assumes `actual` is a tensor or tensor-like, which may fail silently or promote. Fixed by explicitly matching dtype and device. Going from pytorch#107302.

Fix:
```
root@ubb4-rack-22:/var/lib/jenkins/pytorch# TEST_CONFIG=default HIP_VISIBLE_DEVICES=0 PYTORCH_TEST_WITH_ROCM=1 python test/test_binary_ufuncs.py TestBinaryUfuncsCUDA.test_cuda_tensor_pow_scalar_tensor_cuda
/opt/conda/envs/py_3.12/lib/python3.12/site-packages/hypothesis/entry_points.py:23: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
Running tests...
----------------------------------------------------------------------
.
----------------------------------------------------------------------
Ran 1 test in 0.141s

OK

Generating XML reports...
root@ubb4-rack-22:/var/lib/jenkins/pytorch# pip list | grep numpy
numpy 2.1.2
```
(cherry picked from commit a4d60fa)
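The shape of the fix, as a minimal sketch (variable names follow the description above rather than the test's exact code):
```python
import torch

actual = torch.tensor([2.0], device="cuda", dtype=torch.float32)
expected = torch.tensor([2.0], dtype=torch.float64)  # e.g. a float64 reference value

# Fragile: expected.to(actual) assumes `actual` is tensor-like and can
# silently promote or mismatch; explicitly match dtype and device instead.
expected = expected.to(dtype=actual.dtype, device=actual.device)

assert expected.dtype == actual.dtype and expected.device == actual.device
```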
This PR fixes the unit test:
`test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction` FAILED [0.1163s]
```
Traceback (most recent call last):
File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error occurs only on gfx1101 arch.
This error comes from an integer overflow: another unit test,
test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel,
creates a tensor with a huge numel, which inflates
torch.cuda.max_memory_reserved() and overflows the size computation when
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction runs
afterward. To avoid this, we introduced torch.cuda.empty_cache() and
torch.cuda.reset_peak_memory_stats() to clean up CUDA state, as sketched below.
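A minimal sketch of the cleanup pattern (the fraction and the `application` formula here are illustrative, not the test's exact numbers):
```python
import torch

# Reset allocator state left over from earlier tests so stale peak-memory
# bookkeeping cannot leak into this test's size computation.
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

total = torch.cuda.get_device_properties(0).total_memory
torch.cuda.set_per_process_memory_fraction(0.5)

# Without the reset above, an inflated max_memory_reserved() could drive
# this derived size negative and trip the "negative dimension" RuntimeError.
application = int(total * 0.4) - torch.cuda.max_memory_reserved()
tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
```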
JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit f86d184)
Similar to cublas/hipblas, LT now allocates one workspace per handle+stream combo.
- fixes hipblaslt issue where memory use increased during graph capture
- preserves CUDA env var TORCH_CUBLASLT_UNIFIED_WORKSPACE
- moves LT workspace and size from CUDABlas.cpp into CublasHandlePool.cpp, with new APIs:
  - `size_t getCUDABlasLtWorkspaceSize()`
  - `void* getCUDABlasLtWorkspace()`

Fixes #2286.
Pull Request resolved: pytorch#156495
Approved by: https://github.com/eqy
(cherry picked from commit 996206e)
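For reference, a minimal usage sketch of the preserved env var (set it before the first Lt-backed GEMM; whether a given matmul actually dispatches to hipBLASLt/cuBLASLt depends on backend heuristics):
```python
import os
# Opt in to the unified cuBLASLt workspace behavior preserved by this change.
os.environ["TORCH_CUBLASLT_UNIFIED_WORKSPACE"] = "1"

import torch

a = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)
c = a @ b  # may be routed to (hip/cu)BLASLt
torch.cuda.synchronize()
```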
…ly for a custo… (#2419) …m op with multiple mutated inputs and None return type. (pytorch#157133)

This is an attempt to fix a memory allocation issue when using `torch.compile` with a custom layernorm kernel in vllm:

```C++
// In-place fused Add and RMS Normalization.
ops.def(
    "fused_add_rms_norm(Tensor! input, Tensor! residual, Tensor weight, "
    "float epsilon) -> ()");
ops.impl("fused_add_rms_norm", torch::kCUDA, &fused_add_rms_norm);
```

We observed abnormal extra memory allocations with this op enabled using `torch.compile`:

[memory profile screenshot: allocations with the op enabled]

and without this op:

[memory profile screenshot: allocations without the op]

After investigation, we found that this is because the compiler considers that the two buffers for the two mutated inputs `Tensor input` and `Tensor residual` should share the same dependency list, which means it cannot reuse the buffer of `Tensor input`.

```
buf1.users = [
    NodeUser(node=ExternKernelSchedulerNode(name='op2'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op9'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op13'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op24'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op31'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op35'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op42'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op46'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op53'), can_inplace=False, is_weak=False),
]
buf16.users = [
    NodeUser(node=ExternKernelSchedulerNode(name='op2'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op9'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op13'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op24'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op31'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op35'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op42'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op46'), can_inplace=False, is_weak=False),
    NodeUser(node=ExternKernelSchedulerNode(name='op53'), can_inplace=False, is_weak=False),
]
```

```
op13: ExternKernelSchedulerNode(FallbackKernel)
op13.writes = [
    StarDep(name='buf17', mode=None),
    StarDep(name='buf18', mode=None),
    StarDep(name='buf19', mode=None)]
op13.unmet_dependencies = [
    StarDep(name='buf13', mode=None),
    StarDep(name='buf16', mode=None),
    WeakDep(name='buf11', mutating_buf='buf18'),
    WeakDep(name='buf12', mutating_buf='buf18'),
    WeakDep(name='buf13', mutating_buf='buf18'),
    WeakDep(name='buf2', mutating_buf='buf18'),
    WeakDep(name='buf3', mutating_buf='buf18')]
op13.met_dependencies = [StarDep(name='arg11_1', mode=None)]
op13.outputs = [
    buf17: FallbackKernel
    buf17.layout = NoneLayout(device=device(type='cuda', index=0), size=[0], stride=[0])
    buf17.aliases = ['buf16', 'buf1']
    buf17.users = [
        NodeUser(node=ExternKernelSchedulerNode(name='op2'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op9'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op13'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op24'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op31'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op35'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op42'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op46'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op53'), can_inplace=False, is_weak=False),
    ]
    buf18: MutationOutput
    buf18.layout = NoneLayout(device=device(type='cuda', index=0), size=[0], stride=[0])
    buf18.mutations = ['buf16']
    buf18.users = [
        NodeUser(node=ExternKernelSchedulerNode(name='op14'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op24'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op31'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op35'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op42'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op46'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op53'), can_inplace=False, is_weak=True),
    ]
    buf19: MutationOutput
    buf19.layout = NoneLayout(device=device(type='cuda', index=0), size=[0], stride=[0])
    buf19.mutations = ['buf1']
    buf19.users = [NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=False)]
]
op13.node.kernel = torch.ops._C.fused_add_rms_norm.default
```

Here we can see that `buf16` shares the same dependency list as `buf1` because both `buf16` and `buf1` are in the aliases list of `buf17`. This is incorrect, since they are two separate tensors, and it means the compiler cannot reuse `buf16` for subsequent ops.

Pull Request resolved: pytorch#157133
Approved by: https://github.com/jansel
(cherry picked from commit 02724b5)

Fixes #ISSUE_NUMBER
Co-authored-by: charlifu <charlifu@amd.com>
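A minimal Python-side sketch of the op shape that triggers this, using `torch.library.custom_op` (PyTorch 2.4+) rather than vllm's C++ registration; the in-place math is a stand-in for the fused kernel:
```python
import torch

@torch.library.custom_op("mylib::fused_add_rms_norm", mutates_args={"input", "residual"})
def fused_add_rms_norm(input: torch.Tensor, residual: torch.Tensor,
                       weight: torch.Tensor, epsilon: float) -> None:
    # Two mutated inputs and a None return type, mirroring the schema above.
    input.add_(residual)
    residual.copy_(input)
    rms = input.pow(2).mean(dim=-1, keepdim=True).add(epsilon).rsqrt()
    input.mul_(rms).mul_(weight)

@torch.compile
def block(x, res, w):
    fused_add_rms_norm(x, res, w, 1e-6)
    return x

x = torch.randn(4, 64, device="cuda")
res = torch.randn(4, 64, device="cuda")
w = torch.ones(64, device="cuda")
print(block(x, res, w).shape)
```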
Commit Messages:
- reset parameters for FusedDenseGeluDense similar to FusedDense to make the test_gelu pass (#269) (#270)

Co-authored-by: Sriram Kumar <skishore@amd.com>

PRs:
- ROCm/apex#270

Fixes:
- https://ontrack-internal.amd.com/browse/SWDEV-540029
Jenkins build for commit 9a1a0857c46d5d5addf03d9087d1799a59a32f04 finished as FAILURE
Force-pushed from 3b41cb5 to a0ffd4f