Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MasterSolutionLibrary indexing for multiple architecture build #1888

Merged
merged 6 commits into from
Feb 28, 2024

Conversation

yenong-amd
Copy link
Contributor

Implements a fix for solution index collision in multi-architecture build by

  • partitioning indices such that each architecture has a separate count by putting architecture type in high 16 bits and using low 16 bits for solution index.
  • fallback uses 0000 in high 16 bits.
  • optionally write out number of kernels per architecture post build in csv format.

Copy link
Contributor

@TorreZuk TorreZuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave it to Tensile team to approve, but I would make as soon as you can add trivial test using your csv output to validate both the csv, and see that counts or something else match actual code object data. Could be post_build cmake step, or just a python test.

Comment on lines 38 to 39
solutionIndexMap = {architectureName:int(offset*pow(2,16))
for architectureName,offset in zip(architectureList,range(len(architectureList)))}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you debug you might just want to use the gfx number as is hex for the high 16bits. Solution indices aren't going to be preserved across release I thought so if there is ever trouble you can drop down to sequence int.

Copy link
Contributor Author

@yenong-amd yenong-amd Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that you prefer the upper 16 bits to be, for example, 0x90a for gfx90a?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but if you think you'll need > 65536 solutions for a given gfx soon you may need to drop down to fewer bits for the gfx.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was just while analyzing and debugging you don't have to look at that order list to figure out the sequence table to gfx number from high 4 bytes, you can just look at it in hex.

Comment on lines +496 to +514
newSolutions[curIndex] = s
curIndex += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add guard for chosen bucket size overflow, can use constant or 65536

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bucket sizes are now 262144. Instead, I added a check for architecture clobbering.

Tensile/SolutionLibrary.py Outdated Show resolved Hide resolved
@nakajee
Copy link
Contributor

nakajee commented Feb 21, 2024

I see gfx906 test failed with the following error message.

terminate called after throwing an instance of 'std::invalid_argument'
what(): stoi

I am not sure if this is caused by your change or not.
I did not see gfx906 CI test for a long time (recently re-enabled?)
Would you please check the error?

@nakajee
Copy link
Contributor

nakajee commented Feb 22, 2024

Some merged code from develop is showing up in your change list.
Could you update your branch properly?

@nakajee
Copy link
Contributor

nakajee commented Feb 26, 2024

I do not have any further comments.
I hope we have some more reviews from other Tensile members (especially from solution selection team).
I will wait for a while.

Copy link
Contributor

@TorreZuk TorreZuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well as you may have failures in develop now will approve with the hope of pushing other reviewers. The validation scan is manually run and passed I presume.
Can it be run on the failing pipeline?

@TorreZuk
Copy link
Contributor

@bragadeesh or @AlexBrownAMD who is the scrum master for this sprint? You should want to get reviews withing a day or so, or @yenong-amd you should assign the people who must review before merge to help push it along.

@yenong-amd
Copy link
Contributor Author

@bragadeesh or @AlexBrownAMD who is the scrum master for this sprint? You should want to get reviews withing a day or so, or @yenong-amd you should assign the people who must review before merge to help push it along.

I think @lringham was going to test this on some of his tickets.

@yenong-amd
Copy link
Contributor Author

@nakajee Can you please help me merge? I don't have merge privilege. Thanks!

@nakajee nakajee merged commit 3070a11 into ROCm:develop Feb 28, 2024
20 checks passed
@nakajee
Copy link
Contributor

nakajee commented Feb 28, 2024

@nakajee Can you please help me merge? I don't have merge privilege. Thanks!

Done

@yenong-amd yenong-amd deleted the index_collison_fix branch March 5, 2024 14:32
@yenong-amd yenong-amd restored the index_collison_fix branch March 5, 2024 14:32
nakajee pushed a commit to nakajee/Tensile that referenced this pull request Mar 15, 2024
nakajee added a commit to nakajee/Tensile that referenced this pull request Mar 15, 2024
yenong-amd added a commit to yenong-amd/Tensile that referenced this pull request Apr 2, 2024
nakajee added a commit that referenced this pull request Apr 18, 2024
Hotfix: Fix MasterSolutionLibrary indexing for multiple architecture build (#1888)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants