Added Dockerfile for CI images & Upgrade CI to ROCm 7.2 #195
VeeraRajasekhar merged 11 commits into dev from
wenchenvincent
left a comment
Please address the comments.
ipanfilo
left a comment
Why are conversations marked as resolved without any actual action?
I have resolved some of them; others I have resolved in my local copy, and I will mark them as resolved just to keep track.
@VeeraRajasekhar Is this PR still needed?
@VeeraRajasekhar Could you remind me of what we decided on this PR? It seemed that it was no longer relevant and that we should close it.
c4913b2 to
e49b365
Hi @ipanfilo, @wangye805, I have updated this PR with the latest 7.2 Dockerfile and moved it to .github/scripts. Let me know if I need to add an action to automate the Docker build and upload to our artifactory. Thanks.
e49b365 to
bdc75a2
I had to force-push to include the new FA 2.8.3 support commit and my changes for 7.2 support to run the CI. Thanks.
@Micky774, please review the following analysis from testing on JAX & XLA 0.8.2 (Not Supported): jax.nn.scaled_matmul (MXFP8) on ROCm crashes with a segmentation fault when the contracting dimension (K) is less than 64.
https://github.com/ROCm/TransformerEngine/actions/runs/22024090773/job/63637830333 Level=3 testing had no issues.
This is a failure on certain hipBLASLt configs, for which we already have tickets open. We don't support these configs in TE anyway, so it's a known issue and not a blocker. It shouldn't be a problem for this PR.
@VeeraRajasekhar you'll need to merge with dev to fix CI.
869c260 to
f75f218
ROCm's jax.nn.scaled_matmul kernels require the contracting dimension (K) to be at least 64. Without this validation, backward pass GEMMs with K < 64 cause segmentation faults. Added K >= 64 check in _check_mxfp8_gemm_support() for JAX GEMM on ROCm. Fixes: test_dense_grad_fp8[MXFP8_1D_SCALING-with_jax_gemm_True-64-32-64]
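For readers who want to see the shape of such a guard, here is a minimal sketch of a K >= 64 validation in plain Python. It is not the code from this PR: the helper names `contracting_size` and `check_mxfp8_gemm_k`, the `is_rocm` flag, and the `(supported, reason)` return convention are all hypothetical, and the real `_check_mxfp8_gemm_support()` may be structured differently. Only the threshold of 64 comes from the commit message above.

```python
from typing import Sequence, Tuple

# Threshold taken from the commit message: K must be at least 64.
MIN_CONTRACTING_DIM = 64


def contracting_size(shape: Sequence[int], contracting_dims: Sequence[int]) -> int:
    """Return K, the product of the sizes of the contracting dimensions."""
    k = 1
    for dim in contracting_dims:
        k *= shape[dim]
    return k


def check_mxfp8_gemm_k(
    lhs_shape: Sequence[int],
    rhs_shape: Sequence[int],
    lhs_contracting: Sequence[int],
    rhs_contracting: Sequence[int],
    is_rocm: bool = True,  # hypothetical flag; real code would query the platform
) -> Tuple[bool, str]:
    """Hypothetical guard: reject GEMMs whose K is below the minimum on ROCm."""
    k_lhs = contracting_size(lhs_shape, lhs_contracting)
    k_rhs = contracting_size(rhs_shape, rhs_contracting)
    if k_lhs != k_rhs:
        return False, f"contracting sizes differ: {k_lhs} vs {k_rhs}"
    if is_rocm and k_lhs < MIN_CONTRACTING_DIM:
        return False, (
            f"contracting dimension K={k_lhs} is below the minimum "
            f"of {MIN_CONTRACTING_DIM}"
        )
    return True, ""


# A GEMM contracting over 64 elements passes; one contracting over 32 is rejected.
print(check_mxfp8_gemm_k((64, 64), (64, 32), (1,), (0,)))
print(check_mxfp8_gemm_k((64, 32), (32, 64), (1,), (0,)))
```

Returning a reason string alongside the boolean lets the caller fall back to another GEMM path or skip a test with a readable message instead of reaching the kernel and crashing.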
f75f218 to
17d20ed
Description
Added the Dockerfile, which can be used to create the ci-artifactory images.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: