Fix MasterSolutionLibrary indexing for multiple architecture build #1888

yenong-amd · 2024-02-16T20:09:48Z

Implements a fix for solution index collisions in multi-architecture builds:

Partitions indices so that each architecture has a separate count: the architecture type goes in the high 16 bits and the solution index in the low 16 bits.
The fallback library uses 0000 in the high 16 bits.
Optionally writes out the number of kernels per architecture in CSV format after the build.
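
The partitioning described above can be sketched as follows. This is a minimal illustration, not the code in Tensile/SolutionLibrary.py: the helper names (`build_offsets`, `encode`, `decode`) and the convention of placing "fallback" in slot 0 are assumptions made for the example.

```python
ARCH_SHIFT = 16                      # assumed split: low 16 bits hold the per-architecture index
INDEX_MASK = (1 << ARCH_SHIFT) - 1   # 0xFFFF

def build_offsets(architectures):
    """Map each architecture name to its base offset; 'fallback' is given slot 0."""
    ordered = ["fallback"] + [a for a in architectures if a != "fallback"]
    return {name: slot << ARCH_SHIFT for slot, name in enumerate(ordered)}

def encode(offsets, arch, local_index):
    """Combine an architecture's base offset with its local solution index."""
    if not 0 <= local_index <= INDEX_MASK:
        raise ValueError("local index does not fit in the low bits")
    return offsets[arch] | local_index

def decode(index):
    """Split a global index back into (architecture slot, local index)."""
    return index >> ARCH_SHIFT, index & INDEX_MASK

offsets = build_offsets(["gfx906", "gfx90a"])
```

With this layout, solution 5 of the second non-fallback architecture gets the global index 0x20005, so two architectures can each start counting from zero without colliding.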

TorreZuk

I'll leave approval to the Tensile team, but I would, as soon as you can, add a trivial test that uses your CSV output to validate both the CSV itself and that the counts (or something else) match the actual code object data. It could be a post_build CMake step or just a Python test.

TorreZuk · 2024-02-16T20:56:18Z

Tensile/SolutionLibrary.py

+solutionIndexMap = {architectureName:int(offset*pow(2,16)) 
+                    for architectureName,offset in zip(architectureList,range(len(architectureList)))}


While you debug, you might want to use the gfx number as-is, in hex, for the high 16 bits. I thought solution indices aren't going to be preserved across releases, so if there is ever trouble you can drop down to a sequential int.

Do you mean that you prefer the upper 16 bits to be, for example, 0x90a for gfx90a?

Yes, but if you think you'll need more than 65536 solutions for a given gfx soon, you may need to drop down to fewer bits for the gfx.

The idea was just that, while analyzing and debugging, you don't have to look at the ordered list to map the sequence number in the high 16 bits (4 hex digits) back to a gfx number; you can just read it in hex.
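
To illustrate the suggestion in this thread, here is a small sketch that uses the gfx number itself as the high-bits tag, so the architecture is readable directly from the hex index. The function names and the 16-bit default are assumptions for illustration, not the merged implementation.

```python
def arch_tag(gfx_name):
    """Parse the hex digits after 'gfx' ('gfx90a' -> 0x90a); 'fallback' maps to 0."""
    if gfx_name == "fallback":
        return 0
    return int(gfx_name[len("gfx"):], 16)

def tagged_index(gfx_name, local_index, index_bits=16):
    """Place the gfx tag above `index_bits` low bits reserved for the solution index."""
    if not 0 <= local_index < (1 << index_bits):
        raise ValueError("solution index overflows the bits reserved for it")
    return (arch_tag(gfx_name) << index_bits) | local_index

# Solution 7 for gfx90a: the architecture is visible in the top hex digits.
print(f"{tagged_index('gfx90a', 7):#010x}")  # prints 0x090a0007
```

The trade-off raised above still applies: the more bits the gfx tag takes, the fewer remain for solutions per architecture.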

TorreZuk · 2024-02-16T20:57:17Z

Tensile/SolutionLibrary.py

+                    newSolutions[curIndex] = s
+                    curIndex += 1


You should add a guard against overflowing the chosen bucket size; it can use a constant or 65536.

Bucket sizes are now 262144. Instead, I added a check for architecture clobbering.
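
A rough sketch of what the two checks discussed in this thread might look like, assuming a bucket size of 262144 (2**18) as quoted in the reply. `assign_indices`, its arguments, and the error types are hypothetical; the actual check in the PR may differ.

```python
BUCKET_SIZE = 262144  # 2**18, the bucket size quoted in the reply

def assign_indices(solutions_by_arch, arch_tags):
    """Return {global_index: solution}; raise on bucket overflow or shared tags."""
    # Clobbering check: two architectures must never map to the same bucket.
    if len(set(arch_tags.values())) != len(arch_tags):
        raise ValueError("two architectures share the same tag")
    new_solutions = {}
    for arch, solutions in solutions_by_arch.items():
        # Overflow guard: a bucket must not spill into the next architecture's range.
        if len(solutions) > BUCKET_SIZE:
            raise OverflowError(f"{arch} has more than {BUCKET_SIZE} solutions")
        base = arch_tags[arch] * BUCKET_SIZE
        for local_index, solution in enumerate(solutions):
            new_solutions[base + local_index] = solution
    return new_solutions
```

Either check alone leaves a gap: unique tags do not stop one bucket from growing into its neighbour, and a size guard does not catch two architectures that were given the same tag.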

Tensile/TensileCreateLibrary.py

Tensile/Source/lib/source/UserDrivenTuningParser.cpp

Tensile/SolutionLibrary.py

nakajee · 2024-02-21T17:08:13Z

I see that the gfx906 test failed with the following error message.

terminate called after throwing an instance of 'std::invalid_argument'
what(): stoi

I am not sure whether this is caused by your change.
I have not seen the gfx906 CI test for a long time (recently re-enabled?).
Would you please check the error?

nakajee · 2024-02-22T17:05:19Z

Some merged code from develop is showing up in your change list.
Could you update your branch properly?

Tensile/Utilities/validate_library.py

nakajee · 2024-02-26T16:49:36Z

I do not have any further comments.
I hope we have some more reviews from other Tensile members (especially from solution selection team).
I will wait for a while.

TorreZuk

Well, as you may have failures in develop now, I will approve in the hope of prompting other reviewers. I presume the validation scan was run manually and passed.
Can it be run on the failing pipeline?

TorreZuk · 2024-02-27T16:00:16Z

@bragadeesh or @AlexBrownAMD, who is the scrum master for this sprint? You should aim to get reviews within a day or so; alternatively, @yenong-amd, you should assign the people who must review before merge to help push it along.

yenong-amd · 2024-02-27T16:15:59Z

@bragadeesh or @AlexBrownAMD, who is the scrum master for this sprint? You should aim to get reviews within a day or so; alternatively, @yenong-amd, you should assign the people who must review before merge to help push it along.

I think @lringham was going to test this on some of his tickets.

yenong-amd · 2024-02-28T14:15:39Z

@nakajee Can you please help me merge? I don't have merge privilege. Thanks!

nakajee · 2024-02-28T16:24:51Z

@nakajee Can you please help me merge? I don't have merge privilege. Thanks!

Done

yenong-amd requested review from lringham and TorreZuk February 16, 2024 20:09

yenong-amd requested review from babakpst, yoichiyoshida, bragadeesh, AlexBrownAMD and nakajee as code owners February 16, 2024 20:09

TorreZuk reviewed Feb 16, 2024

nakajee reviewed Feb 21, 2024

Tensile/TensileCreateLibrary.py (outdated, resolved)

Tensile/Source/lib/source/UserDrivenTuningParser.cpp (resolved)

nakajee reviewed Feb 21, 2024

Tensile/SolutionLibrary.py (outdated, resolved)

yenong-amd added 4 commits February 22, 2024 11:05

Fix MasterSolutionLibrary indexing for multiple architecture build

38caadb

Fix batched gemm problem override strides

12469b3

Represented architecture as upper 14 bits

cf1367f

Check for architecture duplicate

6267093

yenong-amd force-pushed the index_collison_fix branch from a9a58d0 to 6267093 Compare February 22, 2024 19:12

Post build Tensile library validation

d08cc2d

nakajee reviewed Feb 22, 2024

Tensile/Utilities/validate_library.py (resolved)

Add copyright statement

7348da4

TorreZuk approved these changes Feb 27, 2024

AlexBrownAMD approved these changes Feb 27, 2024

nakajee approved these changes Feb 28, 2024

nakajee merged commit 3070a11 into ROCm:develop Feb 28, 2024
20 checks passed

GZGavinZhao mentioned this pull request Mar 5, 2024

Use fallback libraries for archs without optimized logic (v2) #1897

Merged

yenong-amd deleted the index_collison_fix branch March 5, 2024 14:32

yenong-amd restored the index_collison_fix branch March 5, 2024 14:32

nakajee pushed a commit to nakajee/Tensile that referenced this pull request Mar 15, 2024

Fix MasterSolutionLibrary indexing for multiple architecture build (ROCm#1888)

22e9481

nakajee mentioned this pull request Mar 15, 2024

Hotfix: Fix WorkspaceCheck implementation when used in rocBLAS #1902

Merged

nakajee added a commit to nakajee/Tensile that referenced this pull request Mar 15, 2024

Revert "Fix MasterSolutionLibrary indexing for multiple architecture build (ROCm#1888)" This reverts commit 22e9481.

72781da

yenong-amd added a commit to yenong-amd/Tensile that referenced this pull request Apr 2, 2024

Fix MasterSolutionLibrary indexing for multiple architecture build (ROCm#1888)

2b55ccf

ony mentioned this pull request Apr 8, 2024

Build failure: rocmPackages_5.rocblas NixOS/nixpkgs#302412

Closed

nakajee added a commit that referenced this pull request Apr 18, 2024

Merge pull request #1905 from yenong-amd/release/rocm-rel-6.1

bf05992

Hotfix: Fix MasterSolutionLibrary indexing for multiple architecture build (#1888)

GZGavinZhao mentioned this pull request Apr 25, 2024

Cherry-pick RDNA1 fix into 6.1 release #1916

Closed

lamikr mentioned this pull request Jul 18, 2024

[Feature]: support for gfx1103 #1922

Open
