RPP Phase on HIP and HOST (#267)
* RPP RICAP Tensor for HOST and HIP (#213)

* Initial commit - Ricap HOST Tensor

Includes testsuite changes

* Add QA tests for RICAP

Used three_images_224x224_src1 folder to create golden outputs

* Add three_images_224x224_src1 into TEST_IMAGES

* Support HIP Backend for RICAP

* Fix HIP pkd3->pkd3 variant

* regenerated golden outputs for RICAP

minor changes in HOST shell script for handling RICAP in QA mode

* minor bug fix in RICAP HIP kernels

* Improve readability and Cleanup

* Additional cleanup

* Cleanup testsuite

Includes new golden outputs

* Additional testsuite fixes

* Minor cleanup

* Fix codacy warnings

* Address other codacy warnings

* Update ricap.hpp with reference paper

* Add RICAP dataset path in readme

* Make changes to error codes returned

* Modify roi crop region for unit and perf tests

* RPP Tensor Water Augmentation on HOST and HIP (#181)

* added water HOST and HIP codes

* added water case in test suite

* added golden outputs for water

* added omp thread changes for water augmentation

* experimental changes

* fixed output issue with AVX2 instructions

* added AVX2 support for PKD3 load function

minor changes in PLN variant load functions

* nwc commit - added avx2 changes for u8 layout toggle variants but need to add store functions for completion

* Add Avx2 implementation for F32 and U8 toggle variants

* Add AVX2 support for u8 pkd3-pln3 and i8 pkd3-pln3 for water augmentation

* change F32 load and store logic

* optimized the store function for F32 PLN3-PKD3

* reverted irrelevant changes

* minor change

* optimized load and store functions for water U8 and F32 variants in host

removed commented code

* removed golden outputs for water

* minor changes

* renamed few functions and removed unused functions

updated i8 pln1 load as per the optimized u8 pln1 load

* fixed bug in i8 load function

* changed cast to c++ style

resolved spacing issues and added comments for AVX codes for better understanding

made changes to handle cases where QA Tests are not supported

* added golden outputs for water

* updated golden outputs with latest changes

* modified the u8, i8 pkd3-pln3 function and added comments for the vectorized code

* fixed minor bug in I8 variants

* made changes to resolve codacy warnings

* changed cast to c++ style in hip kernel

* changed generic nn F32 loads using gather and setr instructions

* added comments for latest changes

* minor change

* added definition for storing 32 and 64 bits from a 128-bit register

---------

Co-authored-by: sampath1117 <sampath.rachumallu@multicorewareinc.com>
Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com>

* Fix build error

* CMakeLists - Version Update

1.5.0 - TOT Version

* CHANGELOG Updates

Version 1.5.0 placeholder

* Boost deps fix for test suite

---------

Co-authored-by: Snehaa Giridharan <snehaa@multicorewareinc.com>
Co-authored-by: sampath1117 <sampath.rachumallu@multicorewareinc.com>
Co-authored-by: Snehaa-Giridharan <118163708+snehaa8@users.noreply.github.com>
Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>

* Documentation - Readme & changelog updates (#251)

* readme and changelog updates for 6.0

* minor update

* Documentation - Bump rocm-docs-core[api_reference] from 0.26.0 to 0.27.0 in /docs/sphinx (#253)

Bumps [rocm-docs-core[api_reference]](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.26.0 to 0.27.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v0.26.0...v0.27.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* RPP Resize Mirror Normalize Bugfix (#252)

* added fix for hipMemset

* remove pixel check for U8-F32 and U8-F16 for HOST codes

---------

Co-authored-by: sampath1117 <sampath.rachumallu@multicorewareinc.com>

* Sphinx - updates (#257)

* Sphinx - updates

* Doxygen - Updates

* Docs - Remove index.md

* Initial commit - Phase HOST Tensor

* Add QA outputs

* Update QA reference outputs for phase kernel

Also fixes mismatch

* Replace SSE optimization with AVX

Fixes QA reference outputs

* Fix f16 and i8 datatype variants

* Initial commit - Phase HIP Tensor

* Fix pixel mismatch between HIP and HOST

* Use reinterpret_cast instead of typecast

* Optimize HIP kernel

* Cleanup and few optimizations for HOST

* Replace all occurrences of static_cast with reinterpret_cast

Includes using cvtps_ph instead of direct typecasting

* Optimize HIP pkd3 loads by loading 24 pixels in one shot

* Fix f16 PLN3->PKD3 variant

* Merge branch 'master' of https://github.com/r-abishek/rpp into sn/phase_tensor

* Remove ricap mods from phase PR branch

* Remove ricap changes from phase PR branch

* Change to small f

* minor bug fix for phase hip pkd kernel

* Minor docs fix

* Add fix for warnings on hip build

* Fix static_cast used in f16 HOST backend

* Cleanup

Enable AVX optimization only when AVX flag is enabled

* Fix constant definition

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Snehaa Giridharan <snehaa@multicorewareinc.com>
Co-authored-by: sampath1117 <sampath.rachumallu@multicorewareinc.com>
Co-authored-by: Snehaa-Giridharan <118163708+snehaa8@users.noreply.github.com>
Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
Co-authored-by: Lisa <lisajdelaney@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
8 people committed Dec 22, 2023
1 parent 81823d5 commit 8999c22
Showing 17 changed files with 1,603 additions and 15 deletions.
44 changes: 44 additions & 0 deletions include/rppt_tensor_geometric_augmentations.h
@@ -398,6 +398,50 @@ RppStatus rppt_rotate_host(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t dst
RppStatus rppt_rotate_gpu(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t dstPtr, RpptDescPtr dstDescPtr, Rpp32f *angle, RpptInterpolationType interpolationType, RpptROIPtr roiTensorPtrSrc, RpptRoiType roiType, rppHandle_t rppHandle);
#endif // GPU_SUPPORT

/*! \brief Phase augmentation on HOST backend for an NCHW/NHWC layout tensor
 * \details The phase augmentation computes the phase of each pair of corresponding pixels across the two source tensors, for a batch of RGB (3 channel) / greyscale (1 channel) images with an NHWC/NCHW tensor layout.<br>
* - srcPtr depth ranges - Rpp8u (0 to 255), Rpp16f (0 to 1), Rpp32f (0 to 1), Rpp8s (-128 to 127).
* - dstPtr depth ranges - Will be same depth as srcPtr.
* \image html img150x150.jpg Sample Input1
* \image html img150x150_2.jpg Sample Input2
* \image html phase_augmentation_img150x150.jpg Sample Output
* \param [in] srcPtr1 source1 tensor in HOST memory
* \param [in] srcPtr2 source2 tensor in HOST memory
* \param [in] srcDescPtr source tensor descriptor (Restrictions - numDims = 4, offsetInBytes >= 0, dataType = U8/F16/F32/I8, layout = NCHW/NHWC, c = 1/3)
* \param [out] dstPtr destination tensor in HOST memory
* \param [in] dstDescPtr destination tensor descriptor (Restrictions - numDims = 4, offsetInBytes >= 0, dataType = U8/F16/F32/I8, layout = NCHW/NHWC, c = same as that of srcDescPtr)
 * \param [in] roiTensorPtrSrc ROI data in HOST memory, for each image in source tensor (2D tensor of size batchSize * 4, in either format - XYWH(xy.x, xy.y, roiWidth, roiHeight) or LTRB(lt.x, lt.y, rb.x, rb.y))
* \param [in] roiType ROI type used (RpptRoiType::XYWH or RpptRoiType::LTRB)
* \param [in] rppHandle RPP HOST handle created with <tt>\ref rppCreateWithBatchSize()</tt>
* \return A <tt> \ref RppStatus</tt> enumeration.
* \retval RPP_SUCCESS Successful completion.
* \retval RPP_ERROR* Unsuccessful completion.
*/
RppStatus rppt_phase_host(RppPtr_t srcPtr1, RppPtr_t srcPtr2, RpptDescPtr srcDescPtr, RppPtr_t dstPtr, RpptDescPtr dstDescPtr, RpptROIPtr roiTensorPtrSrc, RpptRoiType roiType, rppHandle_t rppHandle);
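For orientation, a minimal HOST-side usage sketch follows (illustrative only, not part of this commit). It assumes the RpptDesc/RpptROI member names from rppdefs.h and a single 150x150 RGB NHWC image per source; exact handle-creation signatures can vary between RPP versions.

#include <rpp.h>
#include <vector>

int main()
{
    const Rpp32u batchSize = 1, height = 150, width = 150, channels = 3;

    // Identical U8/NHWC descriptors for both sources and the destination,
    // per the restrictions documented above.
    RpptDesc srcDesc = {}, dstDesc = {};
    srcDesc.dataType = RpptDataType::U8;
    srcDesc.layout = RpptLayout::NHWC;
    srcDesc.numDims = 4;
    srcDesc.offsetInBytes = 0;
    srcDesc.n = batchSize; srcDesc.h = height; srcDesc.w = width; srcDesc.c = channels;
    srcDesc.strides.nStride = height * width * channels;
    srcDesc.strides.hStride = width * channels;
    srcDesc.strides.wStride = channels;
    srcDesc.strides.cStride = 1;
    dstDesc = srcDesc;

    // Two source images and one destination buffer, batchSize * nStride pixels each.
    std::vector<Rpp8u> src1(batchSize * srcDesc.strides.nStride);
    std::vector<Rpp8u> src2(src1.size()), dst(src1.size());
    // ... fill src1/src2 with image data ...

    RpptROI roi;                                    // full-image ROI in XYWH form
    roi.xywhROI = {{0, 0}, (Rpp32s)width, (Rpp32s)height};

    rppHandle_t handle;
    rppCreateWithBatchSize(&handle, batchSize);     // HOST handle, as documented above
    rppt_phase_host(src1.data(), src2.data(), &srcDesc, dst.data(), &dstDesc,
                    &roi, RpptRoiType::XYWH, handle);
    rppDestroyHost(handle);
    return 0;
}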

#ifdef GPU_SUPPORT
/*! \brief Phase augmentation on HIP backend for an NCHW/NHWC layout tensor
 * \details The phase augmentation computes the phase of each pair of corresponding pixels across the two source tensors, for a batch of RGB (3 channel) / greyscale (1 channel) images with an NHWC/NCHW tensor layout.<br>
* - srcPtr depth ranges - Rpp8u (0 to 255), Rpp16f (0 to 1), Rpp32f (0 to 1), Rpp8s (-128 to 127).
* - dstPtr depth ranges - Will be same depth as srcPtr.
* \image html img150x150.jpg Sample Input1
* \image html img150x150_2.jpg Sample Input2
* \image html phase_augmentation_img150x150.jpg Sample Output
* \param [in] srcPtr1 source1 tensor in HIP memory
* \param [in] srcPtr2 source2 tensor in HIP memory
* \param [in] srcDescPtr source tensor descriptor (Restrictions - numDims = 4, offsetInBytes >= 0, dataType = U8/F16/F32/I8, layout = NCHW/NHWC, c = 1/3)
* \param [out] dstPtr destination tensor in HIP memory
* \param [in] dstDescPtr destination tensor descriptor (Restrictions - numDims = 4, offsetInBytes >= 0, dataType = U8/F16/F32/I8, layout = NCHW/NHWC, c = same as that of srcDescPtr)
 * \param [in] roiTensorPtrSrc ROI data in HIP memory, for each image in source tensor (2D tensor of size batchSize * 4, in either format - XYWH(xy.x, xy.y, roiWidth, roiHeight) or LTRB(lt.x, lt.y, rb.x, rb.y))
* \param [in] roiType ROI type used (RpptRoiType::XYWH or RpptRoiType::LTRB)
* \param [in] rppHandle RPP HIP handle created with <tt>\ref rppCreateWithStreamAndBatchSize()</tt>
* \return A <tt> \ref RppStatus</tt> enumeration.
* \retval RPP_SUCCESS Successful completion.
* \retval RPP_ERROR* Unsuccessful completion.
*/
RppStatus rppt_phase_gpu(RppPtr_t srcPtr1, RppPtr_t srcPtr2, RpptDescPtr srcDescPtr, RppPtr_t dstPtr, RpptDescPtr dstDescPtr, RpptROIPtr roiTensorPtrSrc, RpptRoiType roiType, rppHandle_t rppHandle);
#endif // GPU_SUPPORT
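A matching HIP-side sketch (again illustrative, reusing the descriptors, ROI, and host buffers from the HOST example above): the pixel buffers and the ROI array must live in HIP memory, and the handle comes from rppCreateWithStreamAndBatchSize() as documented.

#include <hip/hip_runtime.h>

// Hypothetical device-side setup; srcDesc/dstDesc/roi/src1/src2/dst as in the HOST sketch.
hipStream_t stream;
hipStreamCreate(&stream);

size_t bytes = batchSize * srcDesc.strides.nStride * sizeof(Rpp8u);
Rpp8u *d_src1, *d_src2, *d_dst;
hipMalloc(&d_src1, bytes); hipMalloc(&d_src2, bytes); hipMalloc(&d_dst, bytes);
hipMemcpy(d_src1, src1.data(), bytes, hipMemcpyHostToDevice);
hipMemcpy(d_src2, src2.data(), bytes, hipMemcpyHostToDevice);

RpptROI *d_roi;                                     // ROI data must also be in HIP memory
hipMalloc(&d_roi, sizeof(RpptROI));
hipMemcpy(d_roi, &roi, sizeof(RpptROI), hipMemcpyHostToDevice);

rppHandle_t handle;
rppCreateWithStreamAndBatchSize(&handle, stream, batchSize);
rppt_phase_gpu(d_src1, d_src2, &srcDesc, d_dst, &dstDesc, d_roi, RpptRoiType::XYWH, handle);
hipMemcpy(dst.data(), d_dst, bytes, hipMemcpyDeviceToHost);  // result back to host
rppDestroyGPU(handle);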

/*! \brief Slice augmentation HOST
* \details This function performs slice augmentation on a generic 4D tensor.
* Slice augmentation involves selecting a region of interest (ROI) from the source tensor
2 changes: 2 additions & 0 deletions src/include/cpu/rpp_cpu_common.hpp
@@ -69,6 +69,8 @@ typedef halfhpp Rpp16f;
#define RPP_2POW32_INV_DIV_2 1.164153218e-10f // RPP_2POW32_INV / 2
#define RPP_2POW32_INV_MUL_2PI 1.46291812e-09f // (1 / 2^32) * 2PI
#define RPP_2POW32_INV_MUL_2PI_DIV_2 7.3145906e-10f // RPP_2POW32_INV_MUL_2PI / 2
#define RPP_255_OVER_1PT57 162.3380757272f // (255 / 1.570796) - multiplier used in phase computation
#define ONE_OVER_1PT57 0.6366199048f // (1 / 1.570796) i.e. 2/pi - multiplier used in phase computation
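A scalar illustration of what these two multipliers imply (a hedged sketch, not the committed kernel; the vectorized HOST/HIP implementations are authoritative, including the exact atan2 argument order and range handling):

#include <cmath>

// Hypothetical per-pixel reference: scale atan2's [0, pi/2] output for
// non-negative inputs onto full scale - 255 for U8, 1.0 for F32/F16.
inline float phasePixelU8(float src1Pix, float src2Pix)
{
    return 162.3380757272f /* RPP_255_OVER_1PT57 */ * std::atan2(src2Pix, src1Pix);
}

inline float phasePixelF32(float src1Pix, float src2Pix)
{
    return 0.6366199048f /* ONE_OVER_1PT57 */ * std::atan2(src2Pix, src1Pix);
}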

const __m128 xmm_p2Pow32 = _mm_set1_ps(RPP_2POW32);
const __m128 xmm_p2Pow32Inv = _mm_set1_ps(RPP_2POW32_INV);
129 changes: 123 additions & 6 deletions src/include/cpu/rpp_cpu_simd.hpp
@@ -1891,13 +1891,13 @@ static const __m128 _ps_coscof_p1 = _mm_set1_ps(-1.388731625493765E-003f);
static const __m128 _ps_coscof_p2 = _mm_set1_ps( 4.166664568298827E-002f);
static const __m128 _ps_cephes_FOPI = _mm_set1_ps(1.27323954473516f); // 4 / M_PI

static const __m256 _ps_1_avx = _mm256_set1_ps(1.f);
static const __m256 _ps_0p5_avx = _mm256_set1_ps(0.5f);
static const __m256 _ps_n0p5_avx = _mm256_set1_ps(-0.5f);
static const __m256 _ps_1p5_avx = _mm256_set1_ps(1.5f);
static const __m256 _ps_min_norm_pos_avx = set1_ps_hex_avx(0x00800000);
static const __m256 _ps_inv_mant_mask_avx = set1_ps_hex_avx(~0x7f800000);
static const __m256 _ps_sign_mask_avx = set1_ps_hex_avx(0x80000000);
static const __m256 _ps_inv_sign_mask_avx = set1_ps_hex_avx(~0x80000000);

static const __m256i _pi32_1_avx = _mm256_set1_epi32(1);
static const __m256i _pi32_inv1_avx = _mm256_set1_epi32(~1);
@@ -1969,7 +1969,7 @@ static inline void sincos_ps(__m256 x, __m256 *s, __m256 *c)
y = _mm256_mul_ps(y, _mm256_mul_ps(z, z));
__m256 tmp = _mm256_mul_ps(z, _ps_0p5_avx);
y = _mm256_sub_ps(y, tmp);
- y = _mm256_add_ps(y, _ps_1_avx);
+ y = _mm256_add_ps(y, avx_p1);

// Evaluate the second polynom (Pi/4 <= x <= 0)

@@ -2083,9 +2083,9 @@ static const __m128 _ps_cephes_PIO2F = _mm_set1_ps(1.5707963267948966192);
static const __m128 _ps_cephes_PIO4F = _mm_set1_ps(0.7853981633974483096);

static const __m128 _ps_atancof_p0 = _mm_set1_ps(8.05374449538e-2);
- static const __m128 _ps_atancof_p1 = _mm_set1_ps(1.38776856032E-1);
- static const __m128 _ps_atancof_p2 = _mm_set1_ps(1.99777106478E-1);
- static const __m128 _ps_atancof_p3 = _mm_set1_ps(3.33329491539E-1);
+ static const __m128 _ps_atancof_p1 = _mm_set1_ps(1.38776856032e-1);
+ static const __m128 _ps_atancof_p2 = _mm_set1_ps(1.99777106478e-1);
+ static const __m128 _ps_atancof_p3 = _mm_set1_ps(3.33329491539e-1);

static inline __m128 atan_ps( __m128 x )
{
@@ -2195,12 +2195,129 @@ static inline __m128 atan2_ps( __m128 y, __m128 x )
return result;
}

static const __m256 _ps_atanrange_hi_avx = _mm256_set1_ps(2.414213562373095);
static const __m256 _ps_atanrange_lo_avx = _mm256_set1_ps(0.4142135623730950);
static const __m256 _ps_cephes_PIF_avx = _mm256_set1_ps(3.141592653589793238);
static const __m256 _ps_cephes_PIO2F_avx = _mm256_set1_ps(1.5707963267948966192);
static const __m256 _ps_cephes_PIO4F_avx = _mm256_set1_ps(0.7853981633974483096);

static const __m256 _ps_atancof_p0_avx = _mm256_set1_ps(8.05374449538e-2);
static const __m256 _ps_atancof_p1_avx = _mm256_set1_ps(1.38776856032e-1);
static const __m256 _ps_atancof_p2_avx = _mm256_set1_ps(1.99777106478e-1);
static const __m256 _ps_atancof_p3_avx = _mm256_set1_ps(3.33329491539e-1);

// AVX2 version of the atan_ps() SSE version
static inline __m256 atan_ps(__m256 x)
{
__m256 sign_bit, y;

sign_bit = x;
// Take the absolute value
x = _mm256_and_ps(x, _ps_inv_sign_mask_avx);
// Extract the sign bit (upper one)
sign_bit = _mm256_and_ps(sign_bit, _ps_sign_mask_avx);

// Range reduction, init x and y depending on range
// x > 2.414213562373095
__m256 cmp0 = _mm256_cmp_ps(x, _ps_atanrange_hi_avx, _CMP_GT_OS);
// x > 0.4142135623730950
__m256 cmp1 = _mm256_cmp_ps(x, _ps_atanrange_lo_avx, _CMP_GT_OS);

// x > 0.4142135623730950 && !(x > 2.414213562373095)
__m256 cmp2 = _mm256_andnot_ps(cmp0, cmp1);

// -(1.0/x);
__m256 y0 = _mm256_and_ps(cmp0, _ps_cephes_PIO2F_avx);
__m256 x0 = _mm256_div_ps(avx_p1, x);
x0 = _mm256_xor_ps(x0, _ps_sign_mask_avx);

__m256 y1 = _mm256_and_ps(cmp2, _ps_cephes_PIO4F_avx);
// (x-1.0)/(x+1.0)
__m256 x1_o = _mm256_sub_ps(x, avx_p1);
__m256 x1_u = _mm256_add_ps(x, avx_p1);
__m256 x1 = _mm256_div_ps(x1_o, x1_u);

__m256 x2 = _mm256_and_ps(cmp2, x1);
x0 = _mm256_and_ps(cmp0, x0);
x2 = _mm256_or_ps(x2, x0);
cmp1 = _mm256_or_ps(cmp0, cmp2);
x2 = _mm256_and_ps(cmp1, x2);
x = _mm256_andnot_ps(cmp1, x);
x = _mm256_or_ps(x2, x);

y = _mm256_or_ps(y0, y1);

__m256 zz = _mm256_mul_ps(x, x);
__m256 acc = _ps_atancof_p0_avx;
acc = _mm256_fmsub_ps(acc, zz, _ps_atancof_p1_avx);
acc = _mm256_fmadd_ps(acc, zz, _ps_atancof_p2_avx);
acc = _mm256_fmsub_ps(acc, zz, _ps_atancof_p3_avx);
acc = _mm256_mul_ps(acc, zz);
acc = _mm256_fmadd_ps(acc, x, x);
y = _mm256_add_ps(y, acc);

// Update the sign
y = _mm256_xor_ps(y, sign_bit);

return y;
}
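A quick lane-by-lane comparison against libm is an easy way to sanity-check this path (a test sketch written here for illustration; it is not part of the commit):

#include <cmath>
#include <cstdio>

static void checkAtanPsAvx()
{
    float in[8] = {-10.0f, -2.0f, -0.5f, -0.1f, 0.1f, 0.5f, 2.0f, 10.0f}, out[8];
    _mm256_storeu_ps(out, atan_ps(_mm256_loadu_ps(in)));    // AVX2 overload above
    for (int i = 0; i < 8; i++)
        printf("atan(%6.2f): simd = %.7f  libm = %.7f\n", in[i], out[i], std::atan(in[i]));
}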

// AVX2 version of the atan2_ps() SSE version
static inline __m256 atan2_ps(__m256 y, __m256 x)
{
__m256 x_eq_0 = _mm256_cmp_ps(x, avx_p0, _CMP_EQ_OQ);
__m256 x_gt_0 = _mm256_cmp_ps(x, avx_p0, _CMP_GT_OS);
__m256 x_le_0 = _mm256_cmp_ps(x, avx_p0, _CMP_LE_OS);
__m256 y_eq_0 = _mm256_cmp_ps(y, avx_p0, _CMP_EQ_OQ);
__m256 x_lt_0 = _mm256_cmp_ps(x, avx_p0, _CMP_LT_OS);
__m256 y_lt_0 = _mm256_cmp_ps(y, avx_p0, _CMP_LT_OS);

// Computes a zero mask, set if either both x=y=0 or y=0&x>0
__m256 zero_mask = _mm256_and_ps(x_eq_0, y_eq_0);
__m256 zero_mask_other_case = _mm256_and_ps(y_eq_0, x_gt_0);
zero_mask = _mm256_or_ps(zero_mask, zero_mask_other_case);

// Computes pio2 intermediate result, set if (y != 0 and x = 0) & (pi/2 XOR (upper bit y<0))
__m256 pio2_mask = _mm256_andnot_ps(y_eq_0, x_eq_0);
__m256 pio2_mask_sign = _mm256_and_ps(y_lt_0, _ps_sign_mask_avx);
__m256 pio2_result = _ps_cephes_PIO2F_avx;
pio2_result = _mm256_xor_ps(pio2_result, pio2_mask_sign);
pio2_result = _mm256_and_ps(pio2_mask, pio2_result);

// Computes pi intermediate result, set if y=0&x<0 and pi
__m256 pi_mask = _mm256_and_ps(y_eq_0, x_le_0);
__m256 pi_result = _mm256_and_ps(pi_mask, _ps_cephes_PIF_avx);

// Computes swap_sign_mask_offset - sign bit (uppermost bit) set if x<0 & y<0
__m256 swap_sign_mask_offset = _mm256_and_ps(x_lt_0, y_lt_0);
swap_sign_mask_offset = _mm256_and_ps(swap_sign_mask_offset, _ps_sign_mask_avx);

// Computes offset, set based on pi, swap_sign_mask_offset and x<0
__m256 offset0 = _mm256_xor_ps(_ps_cephes_PIF_avx, swap_sign_mask_offset);
__m256 offset = _mm256_andnot_ps(x_lt_0, avx_p0);
offset = _mm256_and_ps(x_lt_0, offset0);

// Computes division of x,y
__m256 arg = _mm256_div_ps(y, x);
__m256 atan_result = atan_ps(arg);
atan_result = _mm256_add_ps(atan_result, offset);

// Select between zero_result, pio2_result and atan_result
__m256 result = _mm256_andnot_ps(zero_mask, pio2_result);
atan_result = _mm256_andnot_ps(pio2_mask, atan_result);
atan_result = _mm256_andnot_ps(pio2_mask, atan_result);
result = _mm256_or_ps(result, atan_result);
result = _mm256_or_ps(result, pi_result);

return result;
}
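Similarly, a hedged sanity sketch for this atan2_ps overload (reusing the includes from the sketch above), exercising all four quadrants plus the x = 0 and y = 0 special cases the masks handle:

static void checkAtan2PsAvx()
{
    float ys[8] = {1.0f, 1.0f, -1.0f, -1.0f, 2.0f, -2.0f, 0.0f, 0.0f};
    float xs[8] = {1.0f, -1.0f, 1.0f, -1.0f, 0.0f, 0.0f, 3.0f, -3.0f};
    float out[8];
    _mm256_storeu_ps(out, atan2_ps(_mm256_loadu_ps(ys), _mm256_loadu_ps(xs)));
    for (int i = 0; i < 8; i++)
        printf("atan2(%5.2f, %5.2f): simd = %.7f  libm = %.7f\n",
               ys[i], xs[i], out[i], std::atan2(ys[i], xs[i]));
}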

// Modified AVX2 version of the original SSE version at https://github.com/RJVB/sse_mathfun/blob/master/sse_mathfun.h
static inline __m256 log_ps(__m256 x)
{
__m256 e;
__m256i emm0;
- __m256 one = *(__m256 *)&_ps_1_avx;
+ __m256 one = *(__m256 *)&avx_p1;
__m256 invalid_mask = _mm256_cmp_ps(x, avx_p0, _CMP_LE_OQ);

// cut off denormalized stuff
15 changes: 11 additions & 4 deletions src/include/hip/rpp_hip_common.hpp
@@ -136,6 +136,8 @@ struct RPPTensorFunctionMetaData
#define RPP_2POW32_INV_DIV_2 1.164153218e-10f // RPP_2POW32_INV / 2
#define RPP_2POW32_INV_MUL_2PI 1.46291812e-09f // (1 / 2^32) * 2PI
#define RPP_2POW32_INV_MUL_2PI_DIV_2 7.3145906e-10f // RPP_2POW32_INV_MUL_2PI / 2
#define RPP_255_OVER_1PT57 162.3380757272f // (255 / 1.570796) - multiplier used in phase computation
#define ONE_OVER_1PT57 0.6366199048f // (1 / 1.570796) i.e. 2/pi - multiplier used in phase computation
#define SMEM_LENGTH_X 128 // Shared memory length of 128 cols to efficiently utilize all 16 LOCAL_THREADS_X as 16 * 8-byte vectorized global read/writes per thread = 128 bytes, fitting in 32 banks 4 byte wide
#define SMEM_LENGTH_Y_1C 16 // Shared memory length of 16 rows to efficiently utilize all 16 LOCAL_THREADS_Y as 1 128-byte-long row per thread (single channel greyscale)
#define SMEM_LENGTH_Y_3C 48 // Shared memory length of 48 rows to efficiently utilize all 16 LOCAL_THREADS_Y as 3 128-byte-long rows per thread (three channel rgb)
@@ -218,10 +218,15 @@ __device__ __forceinline__ void rpp_hip_pixel_check_and_store(float pixel, half*

__device__ __forceinline__ float4 rpp_hip_pixel_check_0to255(float4 src_f4)
{
- return make_float4(fminf(fmaxf(src_f4.x, 0), 255),
-                    fminf(fmaxf(src_f4.y, 0), 255),
-                    fminf(fmaxf(src_f4.z, 0), 255),
-                    fminf(fmaxf(src_f4.w, 0), 255));
+ return make_float4(fminf(fmaxf(src_f4.x, 0.0f), 255.0f),
+                    fminf(fmaxf(src_f4.y, 0.0f), 255.0f),
+                    fminf(fmaxf(src_f4.z, 0.0f), 255.0f),
+                    fminf(fmaxf(src_f4.w, 0.0f), 255.0f));
}

__device__ __forceinline__ float rpp_hip_pixel_check_0to255(float src_f1)
{
return fminf(fmaxf(src_f1, 0), 255);
}

// float4 pixel check for 0-1 range
1 change: 1 addition & 0 deletions src/modules/cpu/host_tensor_geometric_augmentations.hpp
@@ -30,6 +30,7 @@ THE SOFTWARE.
#include "kernel/resize_mirror_normalize.hpp"
#include "kernel/resize_crop_mirror.hpp"
#include "kernel/warp_affine.hpp"
#include "kernel/phase.hpp"
#include "kernel/slice.hpp"

#endif // HOST_TENSOR_GEOMETRIC_AUGMENTATIONS_HPP
