Develop stream 2024-06-26 #575

Merged
merged 174 commits into from
Aug 3, 2024
3c0d51e
feat(device_transform): add tuning benchmarks and config generation f…
Naraenda Aug 2, 2024
58a8b02
perf(device_transform): tuned device transform algorithm for better p…
Naraenda Aug 2, 2024
eb18249
docs(changelog.md): add 'device_transform' improvements to changelog
Naraenda Aug 2, 2024
64e1fe2
feat(ConfigAutotuneSettings.cmake): allow benchmark_device_transform …
Naraenda Aug 2, 2024
0f8c746
fix(benchmark_device_transform.cpp): fix unused type warning when com…
Naraenda Aug 2, 2024
62ef870
perf(device_transform.hpp): updated configs for device transform whic…
Naraenda Aug 2, 2024
0aa581d
fix(transform_config_template): added missing '::type' in general cas…
Naraenda Aug 2, 2024
a6f7784
refactor(benchmark_device_transform.cpp): remove duplicated code with…
Naraenda Aug 2, 2024
d62a8af
docs(changelog.md): removed 'slightly' in device transform performanc…
Naraenda Aug 2, 2024
fda70fd
fix(benchmark_device_transform): fix various build errors and warnings
Naraenda Aug 2, 2024
1cf9a23
test(test_device_batch_memcpy.cpp): add simple batch copy test
Naraenda Aug 2, 2024
e7c11a7
fix(device_batch_memcpy.hpp): use dereference instead of 'rocprim::th…
Naraenda Aug 2, 2024
94f1b7a
revert test(test_device_batch_memcpy.cpp): add simple batch copy test
Naraenda Aug 2, 2024
9222c1a
test(benchmark_device_adjacent_difference.cpp): increased the default…
Naraenda Aug 2, 2024
52592ca
docs(changelog.md): update changelog with benchmark changes
Naraenda Aug 2, 2024
484d7c5
Added overload for match_any
jblok27 Aug 2, 2024
08d1641
Replaced section with match_any() call
jblok27 Aug 2, 2024
d6ce082
Fixed copyright date
jblok27 Aug 2, 2024
45647a6
Fixed formatting
jblok27 Aug 2, 2024
74be978
change match_any to runtime dispatch
jblok27 Aug 2, 2024
c61a05f
docs(intrinsics/warp): name the correct label_bits in match_any docum…
Aug 2, 2024
8a47190
unified wavefront definition
nolmoonen Aug 2, 2024
a6ea925
build: Remove force-inline workaround on windows
Aug 2, 2024
d2343a2
ci: enable debug builds on windows
Aug 2, 2024
2f5b048
docs: Add CHANGELOG for removing force-inline workaround
Aug 2, 2024
59cbb43
fix clang format
parbenc Aug 2, 2024
ab61ca9
fix(tests): Add saturating casts and use them for random data generation
Aug 2, 2024
557dd5e
fix(benchmark_device_adjacent_difference): fixe size in bytes instead…
Beanavil Aug 2, 2024
02fd1fe
Update contributing guidelines
Beanavil Aug 2, 2024
6c4bcce
specify benchmark seed via command line
nolmoonen Aug 2, 2024
d7421d4
refactor lookback sleep dispatch
nolmoonen Aug 2, 2024
d9dcd9a
add config to tests
nolmoonen Aug 2, 2024
691dc37
add config tuning for partition
nolmoonen Aug 2, 2024
3e8f81f
generic tuning
nolmoonen Aug 2, 2024
e08b551
add tuned configurations
nolmoonen Aug 2, 2024
19ac4a6
Fix "warning: loop not unrolled" with CMAKE_BUILD_TYPE=MinSizeRel (-Os)
ex-rzr Aug 2, 2024
93827f4
fix(device_partition): re-added workaround for the device_partition f…
mfep Aug 2, 2024
4b404e1
Using .lint:clang-format
mfep Aug 2, 2024
36704ce
refactor(intrinsics/thread.hpp): remove 'memory_fence_device' workaro…
Naraenda Aug 2, 2024
0d62eda
ci(.gitlab-ciy.yml): disable debug builds in cmake-minimum due to exc…
Naraenda Aug 2, 2024
fe15107
fix(docs): Fixed documentation for thread subdir
mfep Aug 2, 2024
ada912c
fix(docs): Fixed documentation for the types subdir
mfep Aug 2, 2024
ec3f614
fix undefined behavior in test data generation
nolmoonen Aug 2, 2024
a71833e
Deprecate thread_load/thread_store
NB4444 Aug 2, 2024
3b5d460
Ignore thread_load and thread_store deprecation warnings
NB4444 Aug 2, 2024
20d84ed
Deprecated raw_storage and replaced by uninitialized_array in a few l…
mfep Aug 2, 2024
392f09b
unsigned char storage in raw_storage to prevent undefined aliasing
mfep Aug 2, 2024
139993c
Added ROCPRIM_DONT_SUPPRESS_DEPRECATIONS
mfep Aug 2, 2024
2d07e14
Resolve "Improve rocPRIM test logs"
Aug 2, 2024
cc9f2ba
improve documentation for configuration tuning
nolmoonen Aug 2, 2024
0bb05fd
Refactor device_scan, use is_sleep_scan_state_used and with_scan_stat…
ex-rzr Aug 2, 2024
e72d33f
Use device of the current stream in is_sleep_scan_state_used
ex-rzr Aug 2, 2024
6e2681a
Do not build kernels with sleep in lookback state on devices that don…
ex-rzr Aug 2, 2024
f41f1cf
Resolve "Add thread headers to rocprim.hpp and document thread-level …
Aug 2, 2024
48c7c58
Resolve "Batch memcpy: disable BENCHMARK_BATCH_MEMCPY_NAIVE"
Aug 2, 2024
eea8806
Resolve "Fix under- and overflow in minimum and maximum for input dat…
Aug 2, 2024
a24ddf6
Resolve "CMake build consistency"
Aug 2, 2024
68620f4
Resolve "Benchmark utility for random segments generates segments of …
Aug 2, 2024
67cdd25
Adapt device segmented_reduce for large indices within a segment
Beanavil Aug 2, 2024
ede8b05
Add large indices test
Beanavil Aug 2, 2024
5735213
Update CHANGELOG
Beanavil Aug 2, 2024
eac62d4
reduce by key tuning
nolmoonen Aug 2, 2024
28a7530
First commit nth element
NB4444 Aug 2, 2024
1ae80f7
Tests nth element
NB4444 Aug 2, 2024
6878eb4
Simplified working version nth element on one block
NB4444 Aug 2, 2024
9d7fd29
Added output check for correctness
NB4444 Aug 2, 2024
3bde5aa
nth element sizes larger then 64
NB4444 Aug 2, 2024
58e7a66
Added equality buckets to nth_element logic
NB4444 Aug 2, 2024
5845cc4
Added multiple blocks for nth element
NB4444 Aug 2, 2024
976e6fe
Added test to see if elements did not change
NB4444 Aug 2, 2024
4d01270
Debugging synchronization
NB4444 Aug 2, 2024
3674577
Nth element working version only for key with comperator greater and
NB4444 Aug 2, 2024
978e81d
Nth element implemented for key with tests
NB4444 Aug 2, 2024
8d26d5d
Fixed issue for custom types in nth element and added tests
NB4444 Aug 2, 2024
f3cc3dc
Added input and output itterators for nth element
NB4444 Aug 2, 2024
32356e4
Added some benchmarks for nth element
NB4444 Aug 2, 2024
e9cc712
Small optimizations nth element
NB4444 Aug 2, 2024
fa514c0
Debug code nth element
NB4444 Aug 2, 2024
459e205
Made seperate kernel for block offset calculations nth element
NB4444 Aug 2, 2024
1146b68
Small optimizations nth element
NB4444 Aug 2, 2024
2ee4b9d
Moved all block offset calculation to other kernel nth element
NB4444 Aug 2, 2024
65ea5b9
Optimization nth element
NB4444 Aug 2, 2024
53b1dee
Make use of radix_rank instead of multiple scans
NB4444 Aug 2, 2024
75d15cb
Start of adding multiple items per thread nth element
NB4444 Aug 2, 2024
b474a4f
Nth element using less shared memory
NB4444 Aug 2, 2024
7403a43
Nth element small optimizations and cleanup code
NB4444 Aug 2, 2024
d9863c6
Fixed benchmark break nth element after rebase
NB4444 Aug 2, 2024
37ec173
nth element local oracle for buckets_store
NB4444 Aug 2, 2024
5ef6cc9
Cleanup nth element
NB4444 Aug 2, 2024
75f58ea
Nth element update tests with random nth element
NB4444 Aug 2, 2024
ea383b7
Addition of configs for nth element
NB4444 Aug 2, 2024
20dc554
Add lookbackstates to nth element
NB4444 Aug 2, 2024
d69b533
Cleanup and extra comments in nth_element
NB4444 Aug 2, 2024
d3c9896
Removed unnecesarry test cases and choose nth_element based on seed_v…
NB4444 Aug 2, 2024
a4726a4
Added nth_element to changelog
NB4444 Aug 2, 2024
21c105c
Updated benchmark of nth_element based on feedback
NB4444 Aug 2, 2024
a62b86f
Nth_element updated tests and config based on review
NB4444 Aug 2, 2024
e874f16
Documentation updated for nth_element
NB4444 Aug 2, 2024
2523f3a
Cleanup code nth element
NB4444 Aug 2, 2024
e803294
Nth element changes based on review
NB4444 Aug 2, 2024
bd43291
Add documentation spinx doc
NB4444 Aug 2, 2024
50b52e9
Changed nth element to a while loop
NB4444 Aug 2, 2024
f8aa873
Nth element asserts in device code
NB4444 Aug 2, 2024
1bacc52
Nth element documentation fixes
NB4444 Aug 2, 2024
d60de33
nth element docs crash fix
NB4444 Aug 2, 2024
6dc847b
nth element lookback state reset
NB4444 Aug 2, 2024
28afd01
Nth element changes based on review
NB4444 Aug 2, 2024
a548d00
Replaced raw storage with unitialized_array in nth element
NB4444 Aug 2, 2024
de43f1c
Changed Nth element to be able to be used with iterators
NB4444 Aug 2, 2024
e1bad2f
nth element fix small mistakes
NB4444 Aug 2, 2024
1b1e75a
Added config for in place nth element
NB4444 Aug 2, 2024
220167c
Changes based on review
NB4444 Aug 2, 2024
56d360a
Added c++17 tests nth_element
NB4444 Aug 2, 2024
d78f677
Make use of internal merge_path also fix bug with unsigned types for …
NB4444 Aug 2, 2024
16ca95f
Added test for public merge_path_search
NB4444 Aug 2, 2024
b0a8de5
Fixed thread_load and thread_store bug with float and double
NB4444 Aug 2, 2024
3723161
Made review changes
NB4444 Aug 2, 2024
a027de6
Add bug fixes to changelog
NB4444 Aug 2, 2024
daf6a8b
ci: remove trailing newlines in gitlab-ci.yml
Snektron Aug 2, 2024
361c50f
ci: compress autotune artifacts using zstd
Snektron Aug 2, 2024
a5b95c7
Removed oracles array from nth element
NB4444 Aug 2, 2024
aba13fe
Remove constraint of 256 for number of buckets nth element
NB4444 Aug 2, 2024
23a3829
Apply 1 suggestion(s) to 1 file(s)
NB4444 Aug 2, 2024
30ba72b
clang-format: trick clang-format into always breaking after c-style f…
Snektron Aug 2, 2024
d538a5a
add ctz intrinsic
Snektron Aug 2, 2024
9c8afb3
lookback scan: remove HIP-CPU bits
Snektron Aug 2, 2024
15cbd89
lookback scan: reformat
Snektron Aug 2, 2024
3c48cdf
lookback scan: add reproducibility test
Snektron Aug 2, 2024
1dc10dc
test: print floats as hexfloat in assert_bit_eq
Snektron Aug 2, 2024
60502f0
add warp_readfirstlane and warp_readlane intrinsics
Snektron Aug 2, 2024
c5533d6
lookback scan: add deterministic implementation
Snektron Aug 2, 2024
6139baf
scan: add deterministic overload
Snektron Aug 2, 2024
8290563
scan_by_key: add deterministic overload
Snektron Aug 2, 2024
ef5fd4b
reduce_by_key: add deterministic overload
Snektron Aug 2, 2024
82e35ea
add char and short atomic load/store overloads
Snektron Aug 2, 2024
5139e95
lookback scan: change flag to be always one byte
Snektron Aug 2, 2024
4de4860
lookback scan: swap flag and prefix, allow fast scan for values up to…
Snektron Aug 2, 2024
db676c9
nara nit f32
Snektron Aug 2, 2024
258cd3e
update changelog with mention of deterministic algorithms
Snektron Aug 2, 2024
4ea5ea6
lookback reproducibility test: allocate temporary memory with the rig…
Snektron Aug 2, 2024
642c68d
lookback scan: avoid caching large types
Snektron Aug 2, 2024
e1f5576
remove assertions in lookbacn scan, they don't compile properly in de…
Snektron Aug 2, 2024
0ee4a1b
lookback reproducibility test: use same functor for both tests
Snektron Aug 2, 2024
8cce0d3
lookback scan: rotate prefix rather than block_prefix
Snektron Aug 2, 2024
ea287e9
lookback scan: also test deterministic in normal tests
Snektron Aug 2, 2024
d0f77bb
naive implementation
nolmoonen Aug 2, 2024
c4b6566
partial sort benchmark
NB4444 Aug 2, 2024
e8e48be
Made partial_sort in place and created partial_sort_copy
NB4444 Aug 2, 2024
f72156c
Add and fix documentation partial_sort
NB4444 Aug 2, 2024
26939ae
Test partial_sort with iterator
NB4444 Aug 2, 2024
ce29440
Add partial_sort and partial_sort_copy to the changelog
NB4444 Aug 2, 2024
736c7db
Moved partial sort to own file
NB4444 Aug 2, 2024
e2460c4
Added partial_sort_config
NB4444 Aug 2, 2024
82cb298
Merge with nth_element_remove_oracle branch
NB4444 Aug 2, 2024
dabd835
Created c++17 test for partial_sort
NB4444 Aug 2, 2024
0aabab0
Cleanup code based on nth_element review
NB4444 Aug 2, 2024
ff1ed5d
Review adaptations
NB4444 Aug 2, 2024
48e92b0
Added benchmark for partial_sort
NB4444 Aug 2, 2024
8f68114
Fixed bug with inplicit casting in partial sort
NB4444 Aug 2, 2024
5e156b2
add static_cast to fix compiler warning
parbenc Aug 2, 2024
4d51ed1
Restored tests for device histogram_even for half/bfloat16 types
Melirius Aug 2, 2024
f5d8432
Removed unused variable and formatting
NB4444 Aug 2, 2024
c450a74
ci: Enable debug builds excluding test_block_adjacent_difference/disc…
ex-rzr Aug 2, 2024
0cec9f8
test(test_device_batch_memcpy.cpp): fix invalid calls being made to g…
Naraenda Aug 2, 2024
04122d2
test(test_device_batch_memcpy.cpp): standardize test names
Naraenda Aug 2, 2024
cb98d2b
test(test_intrinsics.cpp): fix invalid calls being made to test_utils…
Naraenda Aug 2, 2024
78720b5
ci(.gitlab-ci.yml): add hardened libc++ assertions when building test…
Naraenda Aug 2, 2024
62857dc
docs: update changelog
Naraenda Aug 2, 2024
1677174
docs: fix doxygen errors and warnings
Naraenda Aug 2, 2024
285369a
build(cmake/Dependencies.cmake): build rocm-cmake depedency during po…
Naraenda Aug 2, 2024
951ea9d
refactor(benchmark_config_dispatch.cpp): fix unused variable and func…
Naraenda Aug 2, 2024
099afd3
chore: bump version to 3.3.0
Naraenda Aug 2, 2024
eaddb40
Reduce items_per_thread for merge_sort to one for large types
NB4444 Aug 2, 2024
87abd8c
Reduce block_size for device_merge with large types
NB4444 Aug 2, 2024
37 changes: 36 additions & 1 deletion .clang-format
@@ -41,7 +41,6 @@ AllowShortLoopsOnASingleLine: false
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: false
AlwaysBreakTemplateDeclarations: Yes
AttributeMacros: ['ROCPRIM_DEVICE', 'ROCPRIM_HOST', 'ROCPRIM_HOST_DEVICE', 'ROCPRIM_SHARED_MEMORY', 'ROCPRIM_KERNEL', 'ROCPRIM_INLINE']
BinPackArguments: false
BinPackParameters: false
BitFieldColonSpacing: Both
@@ -135,4 +134,40 @@ SpacesInConditionalStatement: false
SpacesInContainerLiterals: true
SpacesInParentheses: false
SpacesInSquareBrackets: false

AttributeMacros:
- __host__
- __device__
- __global__
- __forceinline__
- __shared__
- __launch_bounds__
- ROCPRIM_DEVICE
- ROCPRIM_HOST
- ROCPRIM_HOST_DEVICE
- ROCPRIM_SHARED_MEMORY
- ROCPRIM_KERNEL
- ROCPRIM_INLINE
- ROCPRIM_FORCE_INLINE
- ROCPRIM_LAUNCH_BOUNDS

# Trick clang into thinking that our C-style attributes are C++-style attributes
# Make sure that the sizes line up for linebreaks etc
Macros:
- __host__=[[host]]
- __device__=[[device]]
- __global__=[[global]]
- __forceinline__=[[forceinline]]
- __shared__=[[shared]]
- __launch_bounds__(x)=[[launch_bounds(x)]]
- __attribute__(x)=[[attribute(x)]]
- ROCPRIM_DEVICE=[[DEVICE____]]
- ROCPRIM_HOST=[[HOST____]]
- ROCPRIM_HOST_DEVICE=[[HOST_DEVICE____]]
- ROCPRIM_SHARED_MEMORY=[[SHARED_MEMORY____]]
- ROCPRIM_KERNEL=[[KERNEL____]]
- ROCPRIM_INLINE=[[INLINE____]]
- ROCPRIM_FORCE_INLINE=[[FORCE_INLINE____]]
- ROCPRIM_LAUNCH_BOUNDS(x)=[[launch_bounds(x)____]]
BreakAfterAttributes: Always
---
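The comment in the `Macros` section asks that the replacement attributes "line up" in size with the macros they stand in for, presumably so that clang-format computes line breaks against the real source width. A minimal sketch of how that property could be checked, assuming the entries are available as `NAME=REPLACEMENT` strings. The helper `mismatched_macros` and the `EXAMPLE_MACRO` entry are hypothetical and not part of the repository.

```python
def mismatched_macros(entries):
    """Return the entries whose replacement differs in length from the macro."""
    bad = []
    for entry in entries:
        # Split only on the first '=' so the replacement may contain '=' itself.
        macro, replacement = entry.split("=", 1)
        if len(macro) != len(replacement):
            bad.append(entry)
    return bad


# A few entries copied from the Macros list, plus one hypothetical entry
# (EXAMPLE_MACRO) whose replacement is deliberately too short.
entries = [
    "__host__=[[host]]",
    "__launch_bounds__(x)=[[launch_bounds(x)]]",
    "ROCPRIM_DEVICE=[[DEVICE____]]",
    "ROCPRIM_HOST_DEVICE=[[HOST_DEVICE____]]",
    "EXAMPLE_MACRO=[[EXAMPLE]]",
]

print(mismatched_macros(entries))
```

Only the hypothetical entry is reported, because its replacement is two characters shorter than the macro name.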
86 changes: 41 additions & 45 deletions .gitlab-ci.yml
@@ -27,6 +27,7 @@ include:
- /defaults.yaml
- /deps-cmake.yaml
- /deps-docs.yaml
- /deps-format.yaml
- /deps-rocm.yaml
- /deps-vcpkg.yaml
- /deps-windows.yaml
@@ -46,20 +47,7 @@ variables:

clang-format:
extends:
- .deps:rocm
stage: lint
needs: []
tags:
- build
variables:
CLANG_FORMAT: "/opt/rocm/llvm/bin/clang-format"
GIT_CLANG_FORMAT: "/opt/rocm/llvm/bin/git-clang-format"
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
script:
- cd $CI_PROJECT_DIR
- git config --global --add safe.directory $CI_PROJECT_DIR
- scripts/code-format/check-format.sh $CI_MERGE_REQUEST_DIFF_BASE_SHA --binary "$CLANG_FORMAT"
- .lint:clang-format

copyright-date:
extends:
@@ -162,15 +150,22 @@ build:cmake-minimum-apt:
extends:
- .gpus:rocm-gpus
- .rules:build
variables:
EXTRA_CMAKE_CXX_FLAGS: ""
script:
- mkdir -p $BUILD_DIR
- cd $BUILD_DIR
- | # Add hardened libc++ assertions for tests only
if [[ $BUILD_TARGET == "TEST" ]]; then
echo "Configuring with hardened libc++!"
EXTRA_CMAKE_CXX_FLAGS+=" -D_GLIBCXX_ASSERTIONS=ON"
fi
- cmake
-G Ninja
-D CMAKE_CXX_COMPILER="$AMDCLANG"
-D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror"
-D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror $EXTRA_CMAKE_CXX_FLAGS"
-D CMAKE_BUILD_TYPE="$BUILD_TYPE"
-D BUILD_$BUILD_TARGET=ON
-D BUILD_$BUILD_TARGET=ON
-D BUILD_EXAMPLE=ON
-D GPU_TARGETS=$GPU_TARGETS
-D AMDGPU_TEST_TARGETS=$GPU_TARGETS
@@ -197,9 +192,11 @@ build:cmake-latest:
extends:
- .cmake-latest
- .build:common
variables:
BUILD_TYPE: Release
BUILD_TARGET: TEST
parallel:
# Debug builds disabled due to excessive build times for debug test builds
matrix:
- BUILD_TYPE: Release
BUILD_TARGET: [BENCHMARK, TEST]

build:cmake-minimum:
needs: []
@@ -246,20 +243,19 @@ build:windows:
- .deps:visual-studio-devshell
parallel:
matrix:
# Debug is disabled due to extensive link times, tracked in issue 679.
- BUILD_TYPE: [Release]
- BUILD_TYPE: [Debug, Release]
BUILD_TARGET: [BENCHMARK, TEST]
script:
- mkdir -p $CI_PROJECT_DIR/build
- cmake -G Ninja
-S $CI_PROJECT_DIR
-B $CI_PROJECT_DIR/build
-D BUILD_$BUILD_TARGET=ON
- cmake -G Ninja
-S $CI_PROJECT_DIR
-B $CI_PROJECT_DIR/build
-D BUILD_$BUILD_TARGET=ON
-D GPU_TARGETS=$GPU_TARGET
-D CMAKE_CXX_COMPILER:PATH="${env:HIP_PATH}\bin\clang++.exe"
-D CMAKE_PREFIX_PATH:PATH="${env:HIP_PATH}"
-D CMAKE_BUILD_TYPE="$BUILD_TYPE"
- cmake --build "$CI_PROJECT_DIR/build"
-D CMAKE_CXX_COMPILER:PATH="${env:HIP_PATH}\bin\clang++.exe"
-D CMAKE_PREFIX_PATH:PATH="${env:HIP_PATH}"
-D CMAKE_BUILD_TYPE="$BUILD_TYPE"
- cmake --build "$CI_PROJECT_DIR/build"
artifacts:
paths:
- $CI_PROJECT_DIR/build/test/test_*
@@ -281,6 +277,10 @@ autotune:build:
- .cmake-minimum
- .gpus:rocm-gpus
- .rules:benchmark
before_script:
- !reference [".cmake-minimum", before_script]
- $SUDO_CMD apt-get update -qq
- $SUDO_CMD apt-get install -qq -y zstd
variables:
BENCHMARK_TARGETS: benchmark_config_tuning
script:
@@ -301,22 +301,13 @@ autotune:build:
-D GPU_TARGETS=$GPU_TARGETS
- cmake --build . --target $BENCHMARK_TARGETS
- 'rm -rf $BUILD_DIR/benchmark/benchmark*.parallel'
# remove benchmark executables if their size together is too large for gitlab ci to handle
- |
total_size_bytes=0
while read -r file_size; do
total_size_bytes=$((total_size_bytes + file_size))
done < <(stat --format="%s" benchmark/benchmark*)
total_size_gib="$(numfmt --round=down --to-unit=Gi "$total_size_bytes")"
if [ "$total_size_gib" -ge 3 ]; then
printf "Total size: %s (%d bytes) > 3GiB, skipping benchmark executables from the artifact.\n" \
"$(numfmt --to=iec-i "$total_size_bytes")" "$total_size_bytes"
rm benchmark/benchmark*
fi
# The autotune benchmarks get very large, above GitLabs upload limit. Fortunately they compress well.
# We'll put them all in a single archive to compress them to a few hundred MB.
- find benchmark -type f -executable -print0 | tar -I zstd -cvf benchmarks.tar.zstd --null -T -

artifacts:
paths:
- $BUILD_DIR/benchmark/benchmark*
- $BUILD_DIR/benchmarks.tar.zstd
- $BUILD_DIR/.ninja_log
- $BUILD_DIR/deps/googlebenchmark/
expire_in: 1 week
@@ -360,7 +351,7 @@ test:
- cd $CI_PROJECT_DIR/build
- ctest --output-on-failure

# Disabled due to extensive link times.
# Disabled due to extensive link times.
# This is tracked in issue 679
# test-windows-debug:
# extends:
@@ -535,9 +526,13 @@ autotune:execute-tuning:
artifacts:
paths:
- ${AUTOTUNE_RESULT_DIR}/*.json
before_script:
- !reference [".cmake-minimum", before_script]
- $SUDO_CMD apt-get update -qq
- $SUDO_CMD apt-get install -qq -y zstd
script:
- >
cd "${CI_PROJECT_DIR}"
- cd "${CI_PROJECT_DIR}"
- tar -I zstd -xvf "${BUILD_DIR}/benchmarks.tar.zstd" -C "${BUILD_DIR}/"
- |
if [ ! -d "${BUILD_DIR}/benchmark" ]; then
echo "There are no benchmark executables. Run the build job with a BUILD_TARGET."
@@ -553,6 +548,7 @@ autotune:execute-tuning:
--benchmark_filter_regex="${AUTOTUNE_ALGORITHM_REGEX}"
--size="${AUTOTUNE_SIZE}"
--trials="${AUTOTUNE_TRIALS}"
--seed=82589933

autotune:generate-config:
image: python:3.10.5-buster
13 changes: 10 additions & 3 deletions .gitlab/run_benchmarks.py
@@ -1,6 +1,6 @@
#!/usr/bin/env python3

# Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
@@ -28,7 +28,7 @@
import subprocess
import sys

BenchmarkContext = namedtuple('BenchmarkContext', ['gpu_architecture', 'benchmark_output_dir', 'benchmark_dir', 'benchmark_filename_regex', 'benchmark_filter_regex', 'size', 'trials'])
BenchmarkContext = namedtuple('BenchmarkContext', ['gpu_architecture', 'benchmark_output_dir', 'benchmark_dir', 'benchmark_filename_regex', 'benchmark_filter_regex', 'size', 'trials', 'seed'])

def run_benchmarks(benchmark_context):
def is_benchmark_executable(filename):
@@ -61,6 +61,8 @@ def is_benchmark_executable(filename):
args += ['--size', benchmark_context.size]
if benchmark_context.trials:
args += ['--trials', benchmark_context.trials]
if benchmark_context.seed:
args += ['--seed', benchmark_context.seed]
try:
subprocess.check_call(args)
except subprocess.CalledProcessError as error:
@@ -97,6 +99,10 @@ def main():
help='Controls the number of trial iterations for each benchmark case',
default='',
required=False)
parser.add_argument('--seed',
help='Controls the seed for random number generation for each benchmark case',
default='',
required=False)

args = parser.parse_args()

@@ -107,7 +113,8 @@ def main():
args.benchmark_filename_regex,
args.benchmark_filter_regex,
args.size,
args.trials)
args.trials,
args.seed)

benchmark_run_successful = run_benchmarks(benchmark_context)
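
The hunks above add `seed` as a third optional setting that is forwarded to each benchmark executable only when it is non-empty. A minimal, self-contained sketch of that pattern follows. `OptionalSettings` and `optional_benchmark_args` are hypothetical names for illustration; the real script keeps these fields in `BenchmarkContext` and builds the list inline in `run_benchmarks`.

```python
from collections import namedtuple

# Hypothetical, reduced stand-in for the script's BenchmarkContext:
# only the optional, string-valued fields are kept.
OptionalSettings = namedtuple("OptionalSettings", ["size", "trials", "seed"])


def optional_benchmark_args(settings):
    """Build the optional command-line arguments, skipping empty settings."""
    args = []
    if settings.size:
        args += ["--size", settings.size]
    if settings.trials:
        args += ["--trials", settings.trials]
    if settings.seed:
        args += ["--seed", settings.seed]
    return args


# Only the seed is set, so only --seed is forwarded.
print(optional_benchmark_args(OptionalSettings(size="", trials="", seed="82589933")))
# Nothing is set, so the benchmark executable keeps its own defaults.
print(optional_benchmark_args(OptionalSettings(size="", trials="", seed="")))
```

Because `argparse` passes the values through as strings when no `type` is given, a seed of `"0"` is a non-empty string and is still forwarded.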

32 changes: 29 additions & 3 deletions CHANGELOG.md
@@ -5,11 +5,36 @@ Documentation for rocPRIM is available at

## Unreleased rocPRIM-3.3.0 for ROCm 6.3.0

### Fixes
### Additions

* Added the `--seed` option to benchmarks to specify a seed for generating random inputs. By default, a random seed is still used for each benchmark measurement.
* Added configuration autotuning to device partition (`rocprim::partition`, `rocprim::partition_two_way`, and `rocprim::partition_three_way`), device select (`rocprim::select`, `rocprim::unique`, and `rocprim::unique_by_key`), and device reduce by key (`rocprim::reduce_by_key`) for improved performance on selected architectures.
* Added `rocprim::uninitialized_array` which provides uninitialized storage in local memory for user-defined types.
* Added large segment support for `rocprim::segmented_reduce`.
* Added a parallel `nth_element` device function similar to `std::nth_element`. It rearranges the elements so that those smaller than the n-th element come before it and those larger come after it.
* Added deterministic (bitwise reproducible) algorithm variants `rocprim::deterministic_inclusive_scan`, `rocprim::deterministic_exclusive_scan`, `rocprim::deterministic_inclusive_scan_by_key`, `rocprim::deterministic_exclusive_scan_by_key`, and `rocprim::deterministic_reduce_by_key`. These provide run-to-run stable results with non-associative operators, such as floating-point operations, at the cost of reduced performance.
* Added parallel `partial_sort` and `partial_sort_copy` device functions similar to `std::partial_sort` and `std::partial_sort_copy`. These functions rearrange elements such that the elements up to and including the middle index are the same as in a sorted list.
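
The postconditions that the `nth_element` and `partial_sort` entries describe can be written down as host-side checks. The sketch below uses plain Python lists and assumes ascending order with the default comparison. The function names are hypothetical and this is not the rocPRIM API. Following the changelog wording, the middle index is treated as inclusive.

```python
def satisfies_nth_element(values, n):
    """True if nothing before index n is larger and nothing after is smaller."""
    pivot = values[n]
    return all(v <= pivot for v in values[:n]) and all(v >= pivot for v in values[n:])


def satisfies_partial_sort(values, original, middle):
    """True if values[0..middle] (inclusive) matches the fully sorted input."""
    return values[: middle + 1] == sorted(original)[: middle + 1]


original = [7, 2, 9, 4, 1, 8, 3]

# A valid nth_element result for n = 3: both sides may stay unsorted.
after_nth_element = [2, 1, 3, 4, 9, 8, 7]
# A valid partial_sort result for middle = 3: only the tail may stay unsorted.
after_partial_sort = [1, 2, 3, 4, 9, 7, 8]

print(satisfies_nth_element(after_nth_element, 3))
print(satisfies_partial_sort(after_partial_sort, original, 3))
print(satisfies_partial_sort(after_nth_element, original, 3))
```

The last check fails on purpose: an `nth_element` result places only the n-th element at its sorted position, while a partial sort also orders everything in front of it.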

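Why a deterministic variant is useful at all can be seen from floating-point addition on the host: the result depends on the order in which partial results are combined. The sketch below only illustrates that effect with a sequential sum. It is not how rocPRIM implements the deterministic algorithms, and `sum_in_order` is a hypothetical helper.

```python
def sum_in_order(values):
    """Add the values strictly left to right, as one fixed combination order."""
    total = 0.0
    for value in values:
        total += value
    return total


# Same three numbers, two different combination orders.
first_order = sum_in_order([1e16, 1.0, -1e16])
second_order = sum_in_order([1e16, -1e16, 1.0])

print(first_order, second_order)

# Repeating one fixed order always reproduces the same bits.
print(sum_in_order([1e16, 1.0, -1e16]).hex() == first_order.hex())
```

In the first order the `1.0` is absorbed when it is added to `1e16`, so the two sums differ. A parallel scan or reduction whose combination order varies between runs can show the same kind of difference, which is the run-to-run instability that the deterministic variants avoid.
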
### Changes

* Modified the input size in device adjacent difference benchmarks. Observed performance with these benchmarks might be different.
* Changed the default seed for `device_benchmark_segmented_reduce`.

### Fixes

* Fixed an issue where running `rtest.py` on Windows with an absolute path passed to `--install_dir` caused a `FileNotFound` error.
* rocPRIM functions are no longer forcefully inlined on Windows, significantly reducing the build
time in debug builds.
* `block_load`, `block_store`, `block_shuffle`, `block_exchange` and `warp_exchange` now use placement `new` instead of copy
assignment (`operator=`) when writing to local memory. This fixes the behavior of custom types with non-trivial copy assignments.
* Fixed a bug in the generation of input data for benchmarks, which caused incorrect performance to be reported in specific cases. It may affect the reported performance for one-byte types (`uint8_t` and `int8_t`) and instantiations of `custom_type`. Specifically, device binary search, device histogram, device merge and warp sort are affected.
* Fixed a bug for `rocprim::merge_path_search` where using `unsigned` offsets would output wrong results.
* Fixed a bug for `rocprim::thread_load` and `rocprim::thread_store` where `float` and `double` were not cast to the correct type, resulting in wrong results.
* Fixed tests failing when compiling with `-D_GLIBCXX_ASSERTIONS=ON`.

### Deprecations

* Deprecated `rocprim::thread_load` and `rocprim::thread_store`; use dereference instead. Not all of those functions are available on every device architecture, and their usage can hurt performance, because inline assembly inhibits optimizations.

## Unreleased rocPRIM-3.2.0 for ROCm 6.2.0

@@ -28,10 +53,10 @@ Documentation for rocPRIM is available at
* New `rocprim::batch_copy` function added. Similar to `rocprim::batch_memcpy`, but copies by element, not with memcpy.
* Added more test cases, to better cover supported data types.
* Updated some tests to work with supported data types.
* An optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`.
* An optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`.
To sort keys of a user-defined type, a decomposer functor should be passed. The decomposer should produce a `rocprim::tuple`
of references to arithmetic types from the key.
* New `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate.
* New `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate.
It iterates over proxies that hold references to the underlying values, but only allow reading and writing if the predicate is `true`.
It can be instantiated with:
* `rocprim::make_predicate_iterator`
@@ -43,6 +68,7 @@ Documentation for rocPRIM is available at

* Improved the performance of `warp_sort_shuffle` and `block_sort_bitonic`.
* Created an optimized version of the `warp_exchange` functions `blocked_to_striped_shuffle` and `striped_to_blocked_shuffle` when the warpsize is equal to the items per thread.
* Improved the performance of `device_transform`.

### Fixes

3 changes: 2 additions & 1 deletion CMakeLists.txt
@@ -45,6 +45,7 @@ endif()
# Build options
option(BUILD_TEST "Build tests (requires googletest)" OFF)
option(BUILD_BENCHMARK "Build benchmarks" OFF)
option(BUILD_NAIVE_BENCHMARK "Build naive benchmarks" OFF)
option(BUILD_EXAMPLE "Build examples" OFF)
option(BUILD_DOCS "Build documentation (requires sphinx)" OFF)
option(USE_HIP_CPU "Prefer HIP-CPU runtime instead of HW acceleration" OFF)
@@ -130,7 +131,7 @@ if(USE_HIP_CPU)
endif()

# Setup VERSION
set(VERSION_STRING "3.2.0")
set(VERSION_STRING "3.3.0")
rocm_setup_version(VERSION ${VERSION_STRING})

# Print configuration summary