Skip to content

Fix flash-attention build env in CI deps Dockerfile#565

Merged
VeeraRajasekhar merged 3 commits intodevfrom
veergopu/update_ci_image
Apr 29, 2026
Merged

Fix flash-attention build env in CI deps Dockerfile#565
VeeraRajasekhar merged 3 commits intodevfrom
veergopu/update_ci_image

Conversation

@VeeraRajasekhar
Copy link
Copy Markdown
Contributor

…CI image tag

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@VeeraRajasekhar
Copy link
Copy Markdown
Contributor Author

@sudhu2k
Copy link
Copy Markdown
Contributor

sudhu2k commented Apr 28, 2026

LGTM.

Comment thread .github/scripts/Dockerfile.ci.deps Outdated

# AITER (submodule: composable_kernel / CK). Submodule update after checkout aligns CK with this commit.
# Install AITER
RUN git clone --recursive https://github.com/ROCm/aiter.git \
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need --recursive if later calling git submodule update --init --recursive. Doing that you checkout submodules twice.
And if doing checkout later, makes sense to add --no-checkout to clone

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment thread .github/scripts/Dockerfile.ci.deps Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove PIP_FIND_LINKS and PIP_PRE. The used AITER commit does not depend of FlyDSL and there is no FlyDSL distribution for Python 3.11 anyway

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

Comment thread .github/scripts/Dockerfile.ci.deps Outdated
Comment thread ci/ci_config.json Outdated
{
"docker_images": {
"default": "registry-sc-harbor.amd.com/framework/te-ci:rocm-7.2_ubuntu22.04_py3.11_pytorch_release-2.8_08d38866_jax_0.8.0_fa_2.8.1_aiter_77455e3ecf",
"default": "registry-sc-harbor.amd.com/framework/te-ci:rocm-7.2_ubuntu22.04_py3.11_pytorch_release-2.8_08d38866_jax_0.8.0_fa_2.8.1_aiter_77455e3ecf_v1",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For GHA the name can be reused, AFAIK - there is no image caching there

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I already pushed an image with "default": "registry-sc-harbor.amd.com/framework/te-ci:rocm-7.2_ubuntu22.04_py3.11_pytorch_release-2.8_08d38866_jax_0.8.0_fa_2.8.1_aiter_77455e3ecf", and later created a new image.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can that image not be overwritten?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible to overwritten. In the works case old one can be deleted. With Jenkins we had to use uniq tags because there was image caching on workers that did not look at hash but only name/tag

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use the older tag

Comment thread .github/scripts/Dockerfile.ci.deps Outdated
Comment on lines +25 to +26
&& export GPU_ARCHS="gfx950;gfx942" \
&& FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE FLASH_ATTENTION_SKIP_CK_BUILD=FALSE python setup.py install \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why an export instead of feeding it as an env var before the python setup.py install?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably not needed at all for installation - nothing is compiled there

Comment thread .github/scripts/Dockerfile.ci.deps Outdated
Comment thread .github/scripts/Dockerfile.ci.deps Outdated
Copy link
Copy Markdown
Contributor

@Micky774 Micky774 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VeeraRajasekhar @sudhu2k can either of you confirm that the image as it is now passes the failing tests for MXFP4 we're seeing on dev?

Edit: The current running CI for this PR is using the updated image source.

@sudhu2k
Copy link
Copy Markdown
Contributor

sudhu2k commented Apr 29, 2026

@VeeraRajasekhar @sudhu2k can either of you confirm that the image as it is now passes the failing tests for MXFP4 we're seeing on dev?

Edit: The current running CI for this PR is using the updated image source.

https://github.com/ROCm/TransformerEngine/actions/runs/25014847020/job/73259852413
The level 3 tests all passed along with the mxfp4 GEMM tests when I ran it yesterday before merging the MXFP4 PR with the updated image.

@VeeraRajasekhar VeeraRajasekhar changed the title Fix flash-attention build env in CI deps Dockerfile; bump default TE image tag Fix flash-attention build env in CI deps Dockerfile Apr 29, 2026
@VeeraRajasekhar VeeraRajasekhar force-pushed the veergopu/update_ci_image branch from 892dc54 to ba241bf Compare April 29, 2026 16:51
@VeeraRajasekhar VeeraRajasekhar merged commit 732e5ab into dev Apr 29, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-level 1 CI test level 1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants