
xFail known bad tests on H100 and fix CVEs#549

Merged
gagank1 merged 5 commits into main from gkaushik/gh200-hotfix-main
Dec 19, 2024

Conversation

@gagank1 gagank1 commented Dec 18, 2024

No description provided.

Known issue on H100 (and GH200) with loading checkpoints. Also fixing
CVE in ARM container

gagank1 commented Dec 18, 2024

/build-ci

Comment thread Dockerfile.arm
Comment thread sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_model.py Outdated

@jstjohn jstjohn left a comment

Approve, but please make the test xfail only when running on H100 (or see my suggestion about also checking the cuDNN version, since I think it should work again with newer cuDNN, e.g. when we upgrade the base PyTorch container to 24.10-py3).
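For illustration, the condition the review asks for could be sketched roughly as below. This is not the code from the PR: `should_xfail`, `current_gpu_info`, and the `fixed_cudnn_version` parameter are hypothetical names, and the cuDNN version that actually resolves the checkpoint-loading issue is not stated in this thread, so it is left as a caller-supplied assumption.

```python
from typing import Optional, Tuple


def should_xfail(
    device_name: str,
    cudnn_version: Optional[int],
    fixed_cudnn_version: Optional[int] = None,
) -> bool:
    """Return True when the test is expected to fail on this machine.

    fixed_cudnn_version is an assumption supplied by the caller: the first
    cuDNN build (as the integer torch reports) believed to fix the issue.
    """
    # The PR names H100 and GH200 as affected, so match on the device name.
    affected = any(tag in device_name.upper() for tag in ("H100", "GH200"))
    if not affected:
        return False
    # Without a known fixed version (or a readable one), stay conservative.
    if fixed_cudnn_version is None or cudnn_version is None:
        return True
    return cudnn_version < fixed_cudnn_version


def current_gpu_info() -> Tuple[str, Optional[int]]:
    """Best-effort (device_name, cudnn_version); ("", None) without torch or a GPU."""
    try:
        import torch
    except ImportError:
        return "", None
    if not torch.cuda.is_available():
        return "", None
    return torch.cuda.get_device_name(0), torch.backends.cudnn.version()
```

The result of `should_xfail(*current_gpu_info())` could then be passed as the condition to `pytest.mark.xfail(condition, reason=...)`, so the test still runs normally on unaffected GPUs and starts counting again once the base container ships a newer cuDNN.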

gagank1 commented Dec 19, 2024

/build-ci

@gagank1 gagank1 enabled auto-merge (squash) December 19, 2024 21:06
@gagank1 gagank1 merged commit e9ed8cf into main Dec 19, 2024
@gagank1 gagank1 deleted the gkaushik/gh200-hotfix-main branch December 19, 2024 22:50

5 participants