Sync with amdadvtech/Orochi (#73)
* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WGs.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation failed when the wavefront (warp) size was not equal to the workgroup (block) size.
Because not all threads run simultaneously, the previous algorithm does not work for an input array larger than the wavefront size:
it performs the scan in place on the input array, so the results of one wavefront (warp) are overwritten by work items (threads) in another wavefront (warp).
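
The hazard described in this entry can be reproduced on the CPU. The sketch below is an illustration only, not the repository's kernel: it assumes a Hillis-Steele inclusive scan and models a workgroup as wavefronts that run one after another. The names `scanInPlace` and `scanDoubleBuffered` are hypothetical, and double buffering is shown as one common remedy; the commit message does not say which fix was applied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU model (assumption): a workgroup of data.size() work items executes as
// consecutive wavefronts of waveSize items. Items inside one wavefront run in
// lockstep (all reads, then all writes); wavefronts run one after another.

// Inclusive Hillis-Steele scan performed in place on the input array.
std::vector<int> scanInPlace(std::vector<int> data, std::size_t waveSize)
{
    assert(waveSize > 0);
    const std::size_t n = data.size();
    for (std::size_t offset = 1; offset < n; offset *= 2)
    {
        for (std::size_t waveStart = 0; waveStart < n; waveStart += waveSize)
        {
            const std::size_t waveEnd = std::min(waveStart + waveSize, n);
            std::vector<int> regs(waveEnd - waveStart);
            // Lockstep reads: a later wavefront may see values that an
            // earlier wavefront has already overwritten in this step.
            for (std::size_t i = waveStart; i < waveEnd; ++i)
                regs[i - waveStart] = data[i] + (i >= offset ? data[i - offset] : 0);
            // Lockstep writes back into the same array.
            for (std::size_t i = waveStart; i < waveEnd; ++i)
                data[i] = regs[i - waveStart];
        }
    }
    return data;
}

// Same scan, but each step reads from src and writes to dst, so no wavefront
// can observe a partially updated step. The swap stands in for a barrier.
std::vector<int> scanDoubleBuffered(std::vector<int> src, std::size_t waveSize)
{
    assert(waveSize > 0);
    const std::size_t n = src.size();
    std::vector<int> dst(n);
    for (std::size_t offset = 1; offset < n; offset *= 2)
    {
        for (std::size_t waveStart = 0; waveStart < n; waveStart += waveSize)
        {
            const std::size_t waveEnd = std::min(waveStart + waveSize, n);
            for (std::size_t i = waveStart; i < waveEnd; ++i)
                dst[i] = src[i] + (i >= offset ? src[i - offset] : 0);
        }
        src.swap(dst);
    }
    return src;
}
```

In this model the in-place variant is correct only when a single wavefront covers the whole array; with eight elements and a wavefront size of four, it produces a wrong prefix sum, while the double-buffered variant does not.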

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and supports an LDS input size of up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.
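
As a rough illustration of hoisting the bound out of the loop, the CPU sketch below counts 8-bit digits for one work item in two ways. The function names, the 256-bin histogram, and the parameters `begin` and `nItems` are hypothetical and are not taken from the kernel source.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bins = std::array<std::uint32_t, 256>;

// Before: the array bound is tested on every iteration of the inner loop.
void countChecked(const std::vector<std::uint32_t>& keys, std::size_t begin,
                  std::size_t nItems, Bins& bins)
{
    for (std::size_t i = 0; i < nItems; ++i)
    {
        const std::size_t idx = begin + i;
        if (idx < keys.size())
            ++bins[keys[idx] & 0xFFu];
    }
}

// After: the upper bound is clamped once, and the loop body has no check.
void countHoisted(const std::vector<std::uint32_t>& keys, std::size_t begin,
                  std::size_t nItems, Bins& bins)
{
    const std::size_t end = std::min(begin + nItems, keys.size());
    for (std::size_t idx = begin; idx < end; ++idx)
        ++bins[keys[idx] & 0xFFu];
}
```

Both functions produce the same histogram; the second simply moves the comparison against the array size out of the hot loop.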


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix script

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan
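
A hand-written replacement for `std::exclusive_scan` (a C++17 algorithm) could look like the sketch below. The helper name `exclusiveScan` is hypothetical; the commit message does not show the fallback that was actually used.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = init + in[0] + ... + in[i - 1].
template <typename T>
std::vector<T> exclusiveScan(const std::vector<T>& in, T init = T{})
{
    std::vector<T> out(in.size());
    T running = init;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = running;  // sum of all elements before position i
        running += in[i];
    }
    return out;
}
```

For the input `{3, 1, 4, 1}` with the default initial value, the result is `{0, 3, 4, 8}`.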

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return previously created oroFunctions based solely on their names, because they might be invalid.
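
One way to avoid a name-only lookup is to key the cache on everything that identifies a compiled kernel. The sketch below is an assumption, not the fix made in this commit: `KernelKey`, `KernelCache`, and the integer `Handle` (a stand-in for `oroFunction`) are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <tuple>

// Hypothetical cache key: the kernel name alone is not enough, so the source
// identity and compile options are part of the key as well.
struct KernelKey
{
    std::string name;
    std::string source;
    std::string options;

    bool operator<(const KernelKey& other) const
    {
        return std::tie(name, source, options) <
               std::tie(other.name, other.source, other.options);
    }
};

class KernelCache
{
  public:
    using Handle = int;  // stand-in for oroFunction in this sketch

    // Returns the cached handle for key, compiling only on a miss.
    Handle get(const KernelKey& key, const std::function<Handle()>& compile)
    {
        const auto it = m_cache.find(key);
        if (it != m_cache.end())
            return it->second;
        const Handle handle = compile();
        m_cache.emplace(key, handle);
        return handle;
    }

    std::size_t size() const { return m_cache.size(); }

  private:
    std::map<KernelKey, Handle> m_cache;
};
```

With this key, two kernels that share a name but come from different sources get separate entries, while a repeated request for the same name, source, and options reuses the cached handle.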

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to work around the regression in the 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

* remove space after -I (#33)

* Feature/oro 0 gpuopen merge 2 (#32)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to work around the regression in the 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Feature/oro 0 amdadvtech merge (#43)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WGs.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation failed when the wavefront (warp) size was not equal to the workgroup (block) size.
Because not all threads run simultaneously, the previous algorithm does not work for an input array larger than the wavefront size:
it performs the scan in place on the input array, so the results of one wavefront (warp) are overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and supports an LDS input size of up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix script

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return previously created oroFunctions based solely on their names, because they might be invalid.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to work around the regression in the 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

* [ORO-0] bitcode/cubin linking APIs (#40)

* [ORO-0] Link apis.

* [ORO-0] Forgot to add.

* [ORO-0] Linking test.

* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize

* [ORO-0] Update link unit tests with comments

* [ORO-0] Change test for CUBIN instead of PTX

* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel

* [ORO-0] Adding hiprtc to work around the regression in the 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Load amdhip first, then hiprtc.

* [ORO-0] Remove assert from hiprtc library checks

* [ORO-0] Add gfx1030 bitcode for navi21

* [MNN-0] Fix premake and add more link testcases

* [ORO-0] Update a link_null_name testcase

* [ORO-0] Make unit tests more stable on CUDA

* [ORO-0] Update bitcode for gfx1030

* [ORO-0] Add bitcodes for navi1,2, vega

* [ORO-0] Add hiprtc.dll and comgr dll

* [ORO-0] Add gfx906 bitcodes

* [ORO-0] Support unit tests on both HIP and CUDA

* [ORO-0] Update dlls and bitcodes

* [ORO-0] Update bitcodes and generation script

* [ORO-0] Minor fixes in bundled bitcode unit tests

* [ORO-0] Fix typo in options

* [ORO-0] Fix getCUBIN/PTX signatures

* [ORO-0] Fix unit tests and generate fatbin for CUDA

* [ORO-0] Regenerate fatbin and fix script

* [ORO-0] Cleanup

* [ORO-0] Update bundled bitcodes to only contain navi21 for now

* [ORO-0] Updated bundled bitcode

* [ORO-0] add ORO_LAUNCH_PARAMS_*

* [ORO-0] Add unit test for orortcLinkAddFile

* [ORO-0] Add unittest scripts for TC

* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA

* [ORO-0] Add bitcode+bundled bitcode link test

* [ORO-0] Cleanup

* [ORO-0] Fix typo in script

* [ORO-0] Update linux TC script

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Get global memory size for CUDA (#44)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Add getLoweredName testcase

* [ORO-0] Update unittest filter

* [ORO-0] Update loweredName test

* [ORO-0] Add missing test kernel

* [ORO-0] Fix loweredName test

* [ORO-0] Fix linux compilation

* [ORO-0] Remove printf from test kernel (#37)

* [ORO-0] Allow usage of libhiprtc64.so if exists

* [ORO-0] Fix linux loading of libhiprtc.so

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: PixelClear <pariku@amd.com>

* Feature/oro 0 radix sort stream (#34)

* Initial commit

* Streams to the configuration

* Mutex in OrochiUtils

* Feature/oro 0 radix sort mutex baking (#36)

* Locking other methods in OrochiUtils

* Removing mutex from static methods

* Making mutex and map static

* Removing static from OrochiUtils

* Removing static from OrochiUtils

* Support Precompiled Kernels in Orochi (#37)

* Add bitcode support: getFunctionFromPrecompiledBinary

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add bitcode and the script to generate it.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* rewrite OROASSERT. Fix include file order.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Use string instead of const char*


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Rename the option from bitcode to precompiled


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Add bitcode script for nvidia fatbin

* [ORO-0] CUDA - hipfb->fatbin rename

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>

* Feature/oro 0 resource limits (#38)

* Adding limit functions

* Removing enum

* Removing enum

* Limit enum

* char string Windows API (#39)

* [ORO-0] Update precompiled radix sort kernels to use -ffast-math (#42)

* [ORO-0] Update precompiled radix sort kernels to use -ffast-math

* [ORO-0] Update RadixSort fatbin for NVIDIA and use fast math

* [ORO-0] Function pointer test. (#40)

* [ORO-0] Function pointer test.

* [ORO-0] launch2d.

* [ORO-0] Event, OroStopwatch.

* Implement GpuMemory to handle device memory operations.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Sync with GPUOpen/LibrariesAndSDKs/Orochi (#44)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to work around the regression in the 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Feature/oro 0 amdadvtech merge (#43)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WGs.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation failed when the wavefront (warp) size was not equal to the workgroup (block) size.
Because not all threads run simultaneously, the previous algorithm does not work for an input array larger than the wavefront size:
it performs the scan in place on the input array, so the results of one wavefront (warp) are overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and supports an LDS input size of up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <takahiroharada@gmail.com>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel so that it uses less LDS and can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
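
The commit does not show how the workgroup count is derived, so the following is only a sketch of one common way to bound it, assuming the limit is the smaller of what LDS capacity and the thread capacity allow. `residentWorkgroups` and all of its parameters are hypothetical and not taken from Orochi.

```cpp
#include <algorithm>

// Hypothetical helper: how many workgroups of `wgSize` threads, each using
// `ldsPerWg` bytes of LDS, can be resident on one compute unit at once.
// All four parameters are assumed inputs, not values read from Orochi.
int residentWorkgroups( int ldsPerUnit, int ldsPerWg, int maxThreadsPerUnit, int wgSize )
{
    const int byThreads = maxThreadsPerUnit / wgSize;
    if( ldsPerWg <= 0 ) return byThreads; // no LDS used: only the thread limit applies
    return std::min( ldsPerUnit / ldsPerWg, byThreads );
}
```

With these assumed numbers, a unit with 65536 bytes of LDS and room for 1024 threads holds 4 workgroups of 256 threads that each use 7720 bytes, because the thread limit (4) is tighter than the LDS limit (8).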

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix script

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* fix

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
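
`std::exclusive_scan` is a C++17 algorithm from `<numeric>`, and older standard library versions do not ship it. The commit's actual workaround is not shown here; a hand-written loop such as the hypothetical `exclusiveScan` below is one way to get the same result for integer addition.

```cpp
#include <cstddef>
#include <vector>

// Hand-written exclusive prefix sum: out[i] = init + in[0] + ... + in[i-1].
// Intended as a drop-in replacement where std::exclusive_scan is unavailable.
template<typename T>
std::vector<T> exclusiveScan( const std::vector<T>& in, T init = T{} )
{
    std::vector<T> out( in.size() );
    T running = init;
    for( std::size_t i = 0; i < in.size(); ++i )
    {
        out[i] = running; // sum of all elements strictly before i
        running += in[i];
    }
    return out;
}
```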

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return previously created oroFunctions based solely on their names, because they might be invalid.

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
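
The fix itself is not shown in this commit. As an illustration of why a name-only lookup is fragile, the sketch below keys the cache on the source path plus the kernel name, so two kernels with the same name from different files do not collide. `KernelCache` and `FakeFunction` are hypothetical stand-ins, not Orochi types.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

using FakeFunction = int; // hypothetical stand-in for oroFunction

class KernelCache
{
  public:
    // Returns the cached handle for (path, kernelName), calling `build` on a miss.
    FakeFunction get( const std::string& path, const std::string& kernelName, const std::function<FakeFunction()>& build )
    {
        std::lock_guard<std::recursive_mutex> lock( m_mutex );
        const std::string key = path + kernelName; // the name alone is not a unique key
        auto it = m_map.find( key );
        if( it != m_map.end() ) return it->second;
        return m_map.emplace( key, build() ).first->second;
    }

  private:
    std::recursive_mutex m_mutex;
    std::unordered_map<std::string, FakeFunction> m_map;
};
```

A real cache would also need to drop entries whose module or context has been destroyed. That part is omitted here.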

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>

* [ORO-0] bitcode/cubin linking APIs (#40)

* [ORO-0] Link apis.

* [ORO-0] Forgot to add.

* [ORO-0] Linking test.

* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize

* [ORO-0] Update link unit tests with comments

* [ORO-0] Change test for CUBIN instead of PTX

* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel

* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Load amdhip first, then hiprtc.

* [ORO-0] Remove assert from hiprtc library checks

* [ORO-0] Add gfx1030 bitcode for navi21

* [MNN-0] Fix premake and add more link testcases

* [ORO-0] Update a link_null_name testcase

* [ORO-0] Make unit tests more stable on CUDA

* [ORO-0] Update bitcode for gfx1030

* [ORO-0] Add bitcodes for navi1,2, vega

* [ORO-0] Add hiprtc.dll and comgr dll

* [ORO-0] Add gfx906 bitcodes

* [ORO-0] Support unit tests on both HIP and CUDA

* [ORO-0] Update dlls and bitcodes

* [ORO-0] Update bitcodes and generation script

* [ORO-0] Minor fixes in bundled bitcode unit tests

* [ORO-0] Fix typo in options

* [ORO-0] Fix getCUBIN/PTX signatures

* [ORO-0] Fix unit tests and generate fatbin for CUDA

* [ORO-0] Regenerate fatbin and fix script

* [ORO-0] Cleanup

* [ORO-0] Update bundled bitcodes to only contain navi21 for now

* [ORO-0] Updated bundled bitcode

* [ORO-0] add ORO_LAUNCH_PARAMS_*

* [ORO-0] Add unit test for orortcLinkAddFile

* [ORO-0] Add unittest scripts for TC

* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA

* [ORO-0] Add bitcode+bundled bitcode link test

* [ORO-0] Cleanup

* [ORO-0] Fix typo in script

* [ORO-0] Update linux TC script

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* [ORO-0] Get global memory size for CUDA (#44)

* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Add getLoweredName testcase

* [ORO-0] Update unittest filter

* [ORO-0] Update loweredName test

* [ORO-0] Add missing test kernel

* [ORO-0] Fix loweredName test

* [ORO-0] Fix linux compilation

* [ORO-0] Remove printf from test kernel (#37)

* [ORO-0] Fix linux loading of libhiprtc.so (#49)

* [ORO-0] Update test scripts (#50)

* [ORO-0] Update scripts for linux (#51)

* [ORO-0] Add new scripts (#52)

* [ORO-0] Add new scripts

* [ORO-0] Add execute permissions to scripts

* Fix Unit Test: getErrorString (#54)

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] Support hiprtc0504 (#55)

* [ORO-0] Update hiprtc and orortc error codes (#57)

* [ORO-0] Update test scripts to delete cache before running (#58)

* [ORO-0] Update hiprtc dlls

* [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation

* Fix apt python installation (#63)

Update checkout version


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] OrochiUtils update. (#61)

* [ORO-0] Add WMMA test (#62)

* [ORO-0] Add WMMA test

* [ORO-0] Add a comment for WMMA

* [ORO-0] Cleanup

* [ORO-0] Add a couple more comments

* [ORO-0] Remove hip_runtime include

* [ORO-0] Cleanup

* [ORO-0] Fix comment

* [ORO-0] Add Copyright notice

* [ORO-0] Load binary from the directory where DLL is.

* [ORO-0] Fix for linux.

---------

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: PixelClear <pariku@amd.com>

* [ORO-0] Remove unnecessary template.

* [ORO-0] Clean up. Added python script kernelCompile.py for compilation. (#46)

* [ORO-0] Clean up. Added python script kernelCompile.py for compilation.

* [ORO-0] hipsdk should be next to orochi dir.

* Update ParallelPrimitives/RadixSortKernels.h

Remove commented line

---------

Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] add automatic arch selection (#47)

* [ORO-0] add automatic arch selection

* [ORO-0] Refactor and error output when it cannot find llc.

---------

Co-authored-by: takahiroharada <takahiroharada@gmail.com>

* Feature/oro 0 flexible rtc error handling cherrypick (#48)

* add a handler for RTC load failure case on cuda.

* [ORO-0] add a handler for RTC load failure case on hip.

* [ORO-0] add cuda 12.0 sdk in nvrtc path

* [ORO-0] Remove non bundled bitcode tests. Clean up.

* [ORO-0] Clean up.

* [ORO-0] Add hiprtcGetBitcodeSize back.

* Update Orochi.cpp

* Update Orochi.cpp

* [ORO-0] Fix for multi-GPU/iGPU

* [HIPSDK-0] compute-22.40-osdb/36/

* [ORO-0] compute-23.10-osdb/9/

* [ORO-0] Update dll names

* [ORO-0] implement new test for managed memory, enable managed memory api, fix all warnings and cleanup

* [ORO-0] fix compile issues

* [ORO-0] fix declaration of oroManagedMalloc

* [ORO-0] change streaming kernel

* [ORO-0] enable it on windows too

* [ORO-0] add more asserts

* [ORO-0] update kernel

* [ORO-0] add host copy times

* [ORO-0] add malloc times

* Refactor Count

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Refactor Radix Sort class:

- Now the tmp buffer is allocated internally.
- All GPU memory buffers are changed to the GpuMemory class
- `configure` will now calculate the total number of GPU blocks for the count and the scan kernel
- The client does not need to call configure explicitly
- Refactor function parameters
- Remove count reference kernel



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Add `const`




Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* This commit does the following:

- Support setting the number of threads per block (a.k.a. block size) dynamically
- Refactor `exclusiveScanCpu`
- Extend `printKernelInfo`.



Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* The 1st working example for the radix sort optimization


Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* Support configuring dynamic "NUM_WARPS_PER_BLOCK" in the sort kernel

Compute the optimal number of inputs for each block to handle.

Refactor the usage of stopwatch

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>

* [ORO-0] add hiprtc future dll names in hiprtc path

* Add linux paths and dll names (#66)

* [ORO-0] Change path and rtc dll names

* [ORO-0] Make scripts executable

* [ORO-0] Add hiprtc path

* [ORO-0] Remove ParallelPrimitives, test/radix sort

* [ORO-0] Edit premake

---------

Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com>
Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com>
Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
Co-authored-by: takahiroharada <takahiroharada@gmail.com>
Co-authored-by: Daniel Meister <daniel.meister@amd.com>
Co-authored-by: NevesLucas <neves.lucas.m@gmail.com>
Co-authored-by: PixelClear <pariku@amd.com>
Co-authored-by: Richard Geslot <richard.geslot@amd.com>
Co-authored-by: Atsushi Yoshimura <51312299+AtsushiYoshimura0302@users.noreply.github.com>
Co-authored-by: Atsushi.Yoshimura <Atsushi.Yoshimura@amd.com>
11 people committed Sep 20, 2023
1 parent df8a401 commit b209cc1
Showing 16 changed files with 287 additions and 50 deletions.
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
+*.dll filter=lfs diff=lfs merge=lfs -text
5 changes: 5 additions & 0 deletions Orochi/Orochi.cpp
@@ -569,6 +569,11 @@ oroError OROAPI oroMemAllocPitch(oroDeviceptr* dptr, size_t* pPitch, size_t Widt
{
return oroErrorUnknown;
}
+oroError OROAPI oroMallocManaged(oroDeviceptr* dptr, size_t bytesize, oroManagedMemoryAttachFlags flags)
+{
+__ORO_FUNC1( MemAllocManaged((CUdeviceptr*)dptr, bytesize, (CUmemAttach_flags_enum)flags), MallocManaged( dptr, bytesize, (HIPmemAttach_flags_enum)flags ) );
+return oroErrorUnknown;
+}
oroError OROAPI oroFree(oroDeviceptr dptr)
{
__ORO_FUNC1( MemFree( dptr ), Free( dptr ) );
9 changes: 8 additions & 1 deletion Orochi/Orochi.h
@@ -496,6 +496,13 @@ typedef enum OROmem_range_attribute_enum {
ORO_MEM_RANGE_ATTRIBUTE_LAST_PREFETCH_LOCATION = 4,
} PPmem_range_attribute;

+typedef enum oroManagedMemoryAttachFlags
+{
+oroMemAttachGlobal = 0x1,
+oroMemAttachHost = 0x2,
+oroMemAttachSingle = 0x4,
+}oroManagedMemoryAttachFlags;
+
typedef enum oroJitOption {
oroJitOptionMaxRegisters = 0,
oroJitOptionThreadsPerBlock,
@@ -641,6 +648,7 @@ oroError OROAPI oroMemGetInfo(size_t* free, size_t* total);
oroError OROAPI oroMalloc(oroDeviceptr* dptr, size_t bytesize);
oroError OROAPI oroMalloc2(oroDeviceptr* dptr, size_t bytesize);
oroError OROAPI oroMemAllocPitch(oroDeviceptr* dptr, size_t* pPitch, size_t WidthInBytes, size_t Height, unsigned int ElementSizeBytes);
+oroError OROAPI oroMallocManaged(oroDeviceptr* dptr, size_t bytesize, oroManagedMemoryAttachFlags flags);
oroError OROAPI oroFree(oroDeviceptr dptr);
oroError OROAPI oroFree2(oroDeviceptr dptr);
//oroError OROAPI oroMemGetAddressRange(oroDeviceptr* pbase, size_t* psize, oroDeviceptr dptr);
@@ -650,7 +658,6 @@ oroError OROAPI oroFree2(oroDeviceptr dptr);
oroError OROAPI oroHostRegister(void* p, size_t bytesize, unsigned int Flags);
oroError OROAPI oroHostGetDevicePointer(oroDeviceptr* pdptr, void* p, unsigned int Flags);
//oroError OROAPI oroHostGetFlags(unsigned int* pFlags, void* p);
-//oroError OROAPI oroMallocManaged(oroDeviceptr* dptr, size_t bytesize, unsigned int flags);
//oroError OROAPI oroDeviceGetByPCIBusId(hipDevice_t* dev, const char* pciBusId);
//oroError OROAPI oroDeviceGetPCIBusId(char* pciBusId, int len, hipDevice_t dev);
oroError OROAPI oroHostUnregister(void* p);
4 changes: 0 additions & 4 deletions Orochi/OrochiUtils.cpp
@@ -374,10 +374,6 @@ struct OrochiUtilsImpl
static std::string getCacheName( const std::string& path, const std::string& kernelname ) noexcept { return path + kernelname; }
};

-OrochiUtils::OrochiUtils() { m_cacheDirectory = "./cache/"; }
-
-OrochiUtils::~OrochiUtils() {}
-
bool OrochiUtils::readSourceCode( const std::string& path, std::string& sourceCode, std::vector<std::string>* includes ) { return OrochiUtilsImpl::readSourceCode( path, sourceCode, includes ); }

oroFunction OrochiUtils::getFunctionFromFile( oroDevice device, const char* path, const char* funcName, std::vector<const char*>* optsIn )
19 changes: 15 additions & 4 deletions Orochi/OrochiUtils.h
@@ -33,8 +33,12 @@ class OrochiUtils
int x, y, z, w;
};

-OrochiUtils();
-~OrochiUtils();
+OrochiUtils() = default;
+OrochiUtils(const OrochiUtils&) = delete;
+OrochiUtils& operator=(const OrochiUtils&) = delete;
+OrochiUtils(OrochiUtils&&) = delete;
+OrochiUtils& operator=(OrochiUtils&&) = delete;
+~OrochiUtils() = default;

oroFunction getFunctionFromPrecompiledBinary( const std::string& path, const std::string& funcName );

@@ -50,12 +54,19 @@ class OrochiUtils
static void launch2D( oroFunction func, int nx, int ny, const void** args, int wgSizeX = 8, int wgSizeY = 8, unsigned int sharedMemBytes = 0, oroStream stream = 0 );

template<typename T>
-static void malloc( T*& ptr, int n )
+static void malloc( T*& ptr, size_t n )
{
oroError e = oroMalloc( (oroDeviceptr*)&ptr, sizeof( T ) * n );
OROASSERT( e == oroSuccess, 0 );
}

+template<typename T>
+static void mallocManaged( T*& ptr, size_t n, oroManagedMemoryAttachFlags flags )
+{
+oroError e = oroMallocManaged( (oroDeviceptr*)&ptr, sizeof( T ) * n, flags );
+OROASSERT( e == oroSuccess, 0 );
+}
+
template<typename T>
static void free( T* ptr )
{
@@ -123,7 +134,7 @@ class OrochiUtils
}

public:
-std::string m_cacheDirectory;
+std::string m_cacheDirectory = "./cache/";
std::recursive_mutex m_mutex;
std::unordered_map<std::string, oroFunction> m_kernelMap;
};
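
The `OrochiUtils.h` change replaces the hand-written constructor with a default member initializer and deletes the copy and move operations, which suits a class that owns a mutex and a kernel cache. The effect can be checked at compile time with a small stand-in class. `CacheOwner` below is hypothetical and only mirrors the members that matter for the illustration.

```cpp
#include <mutex>
#include <string>
#include <type_traits>

class CacheOwner // hypothetical stand-in, not the real OrochiUtils
{
  public:
    CacheOwner() = default;
    CacheOwner( const CacheOwner& ) = delete;
    CacheOwner& operator=( const CacheOwner& ) = delete;
    CacheOwner( CacheOwner&& ) = delete;
    CacheOwner& operator=( CacheOwner&& ) = delete;
    ~CacheOwner() = default;

    std::string m_cacheDirectory = "./cache/"; // default member initializer replaces the constructor body
    std::recursive_mutex m_mutex;
};

static_assert( std::is_default_constructible<CacheOwner>::value, "still default constructible" );
static_assert( !std::is_copy_constructible<CacheOwner>::value, "copying is rejected at compile time" );
static_assert( !std::is_move_constructible<CacheOwner>::value, "moving is rejected at compile time" );
```

Because the initializer lives next to the member, every constructor that might be added later picks up the same default cache directory.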