
RPP Tensor Mean and Stddev on HOST and HIP #348

Merged 110 commits on Jun 11, 2024
Commits
4ec11b0
Initial commit - Image mean Reduction HOST kernel
snehaa8 Jul 10, 2023
2860cfd
Implement PKD3 and PLN3 variants for HOST u8
snehaa8 Jul 13, 2023
d5f4db2
Fix c style casting
snehaa8 Jul 15, 2023
afad42a
Initial commit - Image mean Reduction HIP kernel
snehaa8 Jul 15, 2023
b17df8d
Implement PKD3 and PLN3 for Image mean Tensor HIP
snehaa8 Jul 15, 2023
15e4a57
Cleanup
snehaa8 Jul 16, 2023
5c8e5c9
Initial commit - Image stddev Reduction HOST kernel
snehaa8 Jul 18, 2023
dd98eb8
Support i8, f16 and f32 datatypes
snehaa8 Jul 18, 2023
e7e8a51
Fix stddev compute for channels
snehaa8 Jul 19, 2023
a52158c
Initial commit - Image stddev Reduction HIP kernel
snehaa8 Jul 20, 2023
7522aaa
Implement PLN3 and PKD3
snehaa8 Jul 21, 2023
b8347a0
Fix 3 channel outputs for Stddev HIP Kernel
snehaa8 Jul 25, 2023
21c78da
Fix issue in copy_param() in HIP
snehaa8 Jul 26, 2023
12ea912
Modify HIP Stddev to output stddev based on flag
snehaa8 Jul 26, 2023
3f5d676
Modify HOST Stddev to output stddev based on flag
snehaa8 Jul 31, 2023
a73d988
Make testsuite changes to support flag in HOST
snehaa8 Aug 3, 2023
b49e831
Modify api naming from image_ to tensor_
snehaa8 Aug 16, 2023
b04de07
Optimize U8 and I8 datatype
snehaa8 Aug 16, 2023
8814e60
Cleanup and optimize HOST
snehaa8 Aug 30, 2023
dab6fc8
Modify naming of shared variable used in HIP
snehaa8 Aug 30, 2023
5df9c3f
Cleanup testsuite
snehaa8 Aug 31, 2023
5742f31
Merge branch 'master' of https://github.com/ROCm/rpp into sn/reductio…
snehaa8 Feb 12, 2024
c33af22
Bump rocm-docs-core[api_reference] from 0.35.0 to 0.35.1 in /docs/sph…
dependabot[bot] Mar 6, 2024
5885760
Merge branch 'master' into sn/reduction_mean_stddev
snehaa8 Mar 8, 2024
14f6334
Bump rocm-docs-core[api_reference] from 0.35.1 to 0.36.0 in /docs/sph…
dependabot[bot] Mar 12, 2024
95c3272
Merge branch 'master' into develop
kiritigowda Mar 12, 2024
1bc0fe8
Change all maskArr to scratchBufferHip
r-abishek Mar 13, 2024
62a772f
Change all tempFloatmem to scratchBufferHost
r-abishek Mar 13, 2024
f0890c2
Merge branch 'develop' of https://github.com/ROCm/rpp into ar/scratch…
r-abishek Mar 16, 2024
641f653
Docs - Bump rocm-docs-core[api_reference] from 0.36.0 to 0.37.0 in /d…
dependabot[bot] Mar 20, 2024
5568573
Link cleanup (#326)
LisaDelaney Mar 20, 2024
a6749ba
Update notes
LisaDelaney Mar 20, 2024
a255906
Docs - Bump rocm-docs-core[api_reference] from 0.37.0 to 0.37.1 in /d…
dependabot[bot] Mar 22, 2024
b92231f
Fix build errors
snehaa8 Mar 22, 2024
05af4fe
Merge branch 'sn/reduction_mean_stddev' of https://github.com/snehaa8…
snehaa8 Mar 22, 2024
1fbd624
Merge branch 'develop' of https://github.com/r-abishek/rpp into sn/re…
snehaa8 Mar 22, 2024
c007a66
Include copyright info
snehaa8 Mar 22, 2024
c4a480c
Cleanup and fixes for reduction mean HIP kernel
snehaa8 Mar 22, 2024
a3afbd4
Cleanup and fixes for reduction stddev HIP kernel
snehaa8 Mar 22, 2024
6b8b2b1
Cleanup by removing oneliner functions as inline
snehaa8 Mar 22, 2024
6a39a37
Fix build errors
snehaa8 Mar 22, 2024
d3df761
RPP Voxel Flip on HIP and HOST (#285)
r-abishek Mar 23, 2024
ebecb42
RPP Vignette Tensor on HOST and HIP (#311)
r-abishek Mar 23, 2024
682ede8
Use map to store reference outputs
snehaa8 Mar 26, 2024
eeb68c2
Merge branch 'ar/scratch_buffers_rename' of https://github.com/r-abis…
snehaa8 Mar 27, 2024
f8d2920
Modify HIP buffer to match latest changes
snehaa8 Mar 27, 2024
fc1410b
Bump rocm-docs-core[api_reference] from 0.37.1 to 0.38.0 in /docs/sph…
dependabot[bot] Mar 27, 2024
3ebd7c3
RPP Tensor Audio Support - Resample (#310)
r-abishek Apr 3, 2024
76f31df
Docs - Missing input and output images for Doxygen (#331)
r-abishek Apr 3, 2024
b83f910
Scratch buffers rename for HOST and HIP (#324)
r-abishek Apr 3, 2024
ebeb131
Update CMakeLists.txt
kiritigowda Apr 3, 2024
e19dea8
Merge branch 'develop' into sn/reduction_mean_stddev
r-abishek Apr 4, 2024
6930465
RPP BitwiseAND and BitwiseOR Tensor on HOST and HIP (#318)
r-abishek Apr 9, 2024
2ffcba9
Merge branch 'develop' into sn/reduction_mean_stddev
snehaa8 Apr 11, 2024
2b57062
Fix doxygen comments
snehaa8 Apr 11, 2024
a6d7546
Cleanup
snehaa8 Apr 11, 2024
1147bfe
Update CMakeLists.txt
kiritigowda Apr 12, 2024
e890abb
Merge branch 'develop' into sn/reduction_mean_stddev
snehaa8 Apr 16, 2024
cd350bc
Improve readability
snehaa8 Apr 16, 2024
b910dbf
Cleanup and Improve readability
snehaa8 Apr 18, 2024
5db7c2a
Use templated executors for HIP stddev reduction kernel
snehaa8 Apr 25, 2024
d0d8187
Revert changes in other reduction kernels
snehaa8 Apr 26, 2024
ea2eaff
Cleanup
snehaa8 Apr 26, 2024
3be1973
Cleanup and further optimize
snehaa8 Apr 29, 2024
6ede534
Cleanup
snehaa8 Apr 30, 2024
307d673
Use reinterpret_cast instead of c style cast
snehaa8 Apr 30, 2024
9d513c6
Fix indentation
snehaa8 Apr 30, 2024
fd6e7e9
Replace handle->batchSize with srcDescPtr->n
snehaa8 Apr 30, 2024
06e426f
Revert store4, store8, store16 placement
r-abishek Apr 30, 2024
8236cf3
Add braces for default case
r-abishek Apr 30, 2024
aa5174f
Remove additional line gap
r-abishek Apr 30, 2024
ff92af7
Clarify status enum
r-abishek Apr 30, 2024
9bc170e
Merge branch 'develop' into sn/reduction_mean_stddev
r-abishek May 1, 2024
4650b23
Update rppdefs.h
r-abishek May 1, 2024
2a56b20
Bugfix for pkd3/pln3 segfault
r-abishek May 1, 2024
e57e68c
combined kernels for flag = 2 for PKD3 and PLN3 variants
sampath1117 May 3, 2024
4e0d094
restructured PLN3 and PKD3 code with helper functions
sampath1117 May 3, 2024
da50ca7
added conditional execution of code based on flag value
sampath1117 May 3, 2024
015c9c3
reverted back to version that does not use flag
sampath1117 May 6, 2024
f43418b
removed flag parameter from HIP stddev kernel
sampath1117 May 6, 2024
30baa24
removed use of temporary variable for xDiff calculation
sampath1117 May 6, 2024
f29fbf5
minor modification in description of stddev api
sampath1117 May 6, 2024
f05c73f
removed flag parameter and dependent code on flag parameter in stddev…
sampath1117 May 6, 2024
b21d942
fixed the order of subtraction in HOST kernel
sampath1117 May 6, 2024
065193a
Update tensor_stddev.hpp
r-abishek May 6, 2024
04cff7c
align inline comments
r-abishek May 6, 2024
b9a65d0
remove additional blank lines
r-abishek May 6, 2024
a4c411e
modified stddev_hip_compute function
sampath1117 May 7, 2024
cef19bd
modified xDiff calculation for mean hip kernel
sampath1117 May 7, 2024
c6bbb50
removed HIP kernels for int, uint for tensor mean
sampath1117 May 7, 2024
77e14ef
Minor common-fixes for HIP (#345)
r-abishek May 7, 2024
921683c
Merge pull request #143 from snehaa8/sn/reduction_mean_stddev
r-abishek May 7, 2024
25f9ae7
Merge branch 'develop' of https://github.com/ROCm/rpp into ar/opt_ten…
r-abishek May 7, 2024
34f3f6d
Readme Updates: --usecase=rocm (#349)
kiritigowda May 8, 2024
ab52683
RPP Tensor Audio Support - Spectrogram (#312)
r-abishek May 8, 2024
ee0d6fe
Update CHANGELOG.md (#352)
r-abishek May 8, 2024
2decd32
RPP Tensor Audio Support - Slice (#325)
r-abishek May 8, 2024
30ce1d6
RPP Tensor Audio Support - MelFilterBank (#332)
r-abishek May 8, 2024
64ae74f
RPP Tensor Normalize ND on HOST and HIP (#335)
r-abishek May 9, 2024
1a3015c
SWDEV-459739 - Remove the package obsolete setting (#353)
raramakr May 9, 2024
4cb8d4b
Audio support merge commit fixes (#354)
r-abishek May 9, 2024
dc75b98
Merge branch 'develop' into ar/opt_tensor_mean_tensor_stddev
sampath1117 May 17, 2024
ec4c563
Merge pull request #273 from sampath1117/sr/tensor_mean_std_merge_cha…
r-abishek May 17, 2024
51e1539
Merge branch 'develop' of https://github.com/ROCm/rpp into ar/opt_ten…
r-abishek May 28, 2024
aab390e
Merge branch 'develop' into ar/opt_tensor_mean_tensor_stddev
kiritigowda May 29, 2024
d3a8d93
Merge branch 'develop' into ar/opt_tensor_mean_tensor_stddev
r-abishek Jun 5, 2024
637a969
Update tensor_mean.hpp
r-abishek Jun 6, 2024
a6b4438
Update tensor_stddev.hpp
r-abishek Jun 6, 2024
97e6982
Merge branch 'develop' into ar/opt_tensor_mean_tensor_stddev
r-abishek Jun 6, 2024
4804c75
Merge branch 'develop' into ar/opt_tensor_mean_tensor_stddev
r-abishek Jun 7, 2024
2 changes: 1 addition & 1 deletion include/rppdefs.h
@@ -129,7 +129,7 @@ typedef enum
RPP_ERROR_INSUFFICIENT_DST_BUFFER_LENGTH = -14,
/*! \brief Invalid datatype \ingroup group_rppdefs */
RPP_ERROR_INVALID_PARAMETER_DATATYPE = -15,
-/*! \brief Not enough memory \ingroup group_rppdefs */
+/*! \brief Not enough memory to write outputs, as per dim-lengths and strides set in descriptor \ingroup group_rppdefs */
RPP_ERROR_NOT_ENOUGH_MEMORY = -16,
/*! \brief Out of bound source ROI \ingroup group_rppdefs */
RPP_ERROR_OUT_OF_BOUND_SRC_ROI = -17,
76 changes: 75 additions & 1 deletion include/rppt_tensor_statistical_operations.h
@@ -193,10 +193,84 @@ RppStatus rppt_normalize_host(RppPtr_t srcPtr, RpptGenericDescPtr srcGenericDesc
RppStatus rppt_normalize_gpu(RppPtr_t srcPtr, RpptGenericDescPtr srcGenericDescPtr, RppPtr_t dstPtr, RpptGenericDescPtr dstGenericDescPtr, Rpp32u axisMask, Rpp32f *meanTensor, Rpp32f *stdDevTensor, Rpp8u computeMeanStddev, Rpp32f scale, Rpp32f shift, Rpp32u *roiTensor, rppHandle_t rppHandle);
#endif // GPU_SUPPORT

/*! \brief Tensor mean operation on HOST backend for a NCHW/NHWC layout tensor
* \details The tensor mean is a reduction operation that finds the channel-wise (R mean / G mean / B mean) and total mean for each image in a batch of RGB(3 channel) / greyscale(1 channel) images with an NHWC/NCHW tensor layout.<br>
* - srcPtr depth ranges - Rpp8u (0 to 255), Rpp16f (0 to 1), Rpp32f (0 to 1), Rpp8s (-128 to 127).
* - dstPtr depth ranges - Will be same depth as srcPtr.
* \param [in] srcPtr source tensor in HOST memory
* \param [in] srcDescPtr source tensor descriptor (Restrictions - numDims = 4, offsetInBytes >= 0, dataType = U8/F16/F32/I8, layout = NCHW/NHWC, c = 1/3)
* \param [out] tensorMeanArr destination array in HOST memory
* \param [in] tensorMeanArrLength length of provided destination array (Restrictions - if srcDescPtr->c == 1 then tensorMeanArrLength = srcDescPtr->n, and if srcDescPtr->c == 3 then tensorMeanArrLength = srcDescPtr->n * 4)
* \param [in] roiTensorSrc ROI data in HOST memory, for each image in source tensor (2D tensor of size batchSize * 4, in either format - XYWH(xy.x, xy.y, roiWidth, roiHeight) or LTRB(lt.x, lt.y, rb.x, rb.y)) | (Restrictions - roiTensorSrc[i].xywhROI.roiWidth <= 3840 and roiTensorSrc[i].xywhROI.roiHeight <= 2160)
* \param [in] roiType ROI type used (RpptRoiType::XYWH or RpptRoiType::LTRB)
* \param [in] rppHandle RPP HOST handle created with <tt>\ref rppCreateWithBatchSize()</tt>
* \return A <tt> \ref RppStatus</tt> enumeration.
* \retval RPP_SUCCESS Successful completion.
* \retval RPP_ERROR* Unsuccessful completion.
*/
RppStatus rppt_tensor_mean_host(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t tensorMeanArr, Rpp32u tensorMeanArrLength, RpptROIPtr roiTensorPtrSrc, RpptRoiType roiType, rppHandle_t rppHandle);
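The output layout documented above (per 3-channel image: MeanR, MeanG, MeanB, MeanImage) can be illustrated with a plain scalar reference. This is a hedged sketch of what the reduction computes for one NCHW image, not RPP code; the function name and indexing are illustrative only.

```cpp
#include <cassert>
#include <vector>

// Scalar reference for the 3-channel tensor-mean reduction described above:
// for one planar (NCHW) image of size h x w, returns
// {MeanR, MeanG, MeanB, MeanImage}. (Illustrative helper, not an RPP function.)
std::vector<double> referenceMeanPln3(const std::vector<double> &img, int h, int w)
{
    double sum[3] = {0.0, 0.0, 0.0};
    for (int c = 0; c < 3; c++)
        for (int i = 0; i < h * w; i++)
            sum[c] += img[c * h * w + i];          // per-channel accumulation
    double n = static_cast<double>(h * w);
    return {sum[0] / n, sum[1] / n, sum[2] / n,
            (sum[0] + sum[1] + sum[2]) / (3.0 * n)};  // total mean over all channels
}
```

For a 1-channel image only the single mean is produced, which is why the documented `tensorMeanArrLength` is `n` for c == 1 and `n * 4` for c == 3.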

#ifdef GPU_SUPPORT
/*! \brief Tensor mean operation on HIP backend for a NCHW/NHWC layout tensor
* \details The tensor mean is a reduction operation that finds the channel-wise (R mean / G mean / B mean) and total mean for each image in a batch of RGB(3 channel) / greyscale(1 channel) images with an NHWC/NCHW tensor layout.<br>
* - srcPtr depth ranges - Rpp8u (0 to 255), Rpp16f (0 to 1), Rpp32f (0 to 1), Rpp8s (-128 to 127).
* - dstPtr depth ranges - Will be same depth as srcPtr.
* \param [in] srcPtr source tensor in HIP memory
* \param [in] srcDescPtr source tensor descriptor (Restrictions - numDims = 4, offsetInBytes >= 0, dataType = U8/F16/F32/I8, layout = NCHW/NHWC, c = 1/3)
* \param [out] tensorMeanArr destination array in HIP memory
* \param [in] tensorMeanArrLength length of provided destination array (Restrictions - if srcDescPtr->c == 1 then tensorMeanArrLength = srcDescPtr->n, and if srcDescPtr->c == 3 then tensorMeanArrLength = srcDescPtr->n * 4)
* \param [in] roiTensorSrc ROI data in HIP memory, for each image in source tensor (2D tensor of size batchSize * 4, in either format - XYWH(xy.x, xy.y, roiWidth, roiHeight) or LTRB(lt.x, lt.y, rb.x, rb.y)) | (Restrictions - roiTensorSrc[i].xywhROI.roiWidth <= 3840 and roiTensorSrc[i].xywhROI.roiHeight <= 2160)
* \param [in] roiType ROI type used (RpptRoiType::XYWH or RpptRoiType::LTRB)
* \param [in] rppHandle RPP HIP handle created with <tt>\ref rppCreateWithStreamAndBatchSize()</tt>
* \return A <tt> \ref RppStatus</tt> enumeration.
* \retval RPP_SUCCESS Successful completion.
* \retval RPP_ERROR* Unsuccessful completion.
*/
RppStatus rppt_tensor_mean_gpu(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t tensorMeanArr, Rpp32u tensorMeanArrLength, RpptROIPtr roiTensorPtrSrc, RpptRoiType roiType, rppHandle_t rppHandle);
#endif // GPU_SUPPORT

/*! \brief Tensor stddev operation on HOST backend for a NCHW/NHWC layout tensor
* \details The tensor stddev is a reduction operation that finds the channel-wise (R stddev / G stddev / B stddev) and total standard deviation for each image with respect to meanTensor passed.<br>
* - srcPtr depth ranges - Rpp8u (0 to 255), Rpp16f (0 to 1), Rpp32f (0 to 1), Rpp8s (-128 to 127).
* - dstPtr depth ranges - Will be same depth as srcPtr.
* \param [in] srcPtr source tensor in HOST memory
* \param [in] srcDescPtr source tensor descriptor (Restrictions - numDims = 4, offsetInBytes >= 0, dataType = U8/F16/F32/I8, layout = NCHW/NHWC, c = 1/3)
* \param [out] tensorStddevArr destination array in HOST memory
* \param [in] tensorStddevArrLength length of provided destination array (Restrictions - if srcDescPtr->c == 1 then tensorStddevArrLength = srcDescPtr->n, and if srcDescPtr->c == 3 then tensorStddevArrLength = srcDescPtr->n * 4)
* \param [in] meanTensor mean values for stddev calculation (1D tensor of size batchSize * 4 in format (MeanR, MeanG, MeanB, MeanImage) for each image in batch)
* \param [in] roiTensorSrc ROI data in HOST memory, for each image in source tensor (2D tensor of size batchSize * 4, in either format - XYWH(xy.x, xy.y, roiWidth, roiHeight) or LTRB(lt.x, lt.y, rb.x, rb.y)) | (Restrictions - roiTensorSrc[i].xywhROI.roiWidth <= 3840 and roiTensorSrc[i].xywhROI.roiHeight <= 2160)
* \param [in] roiType ROI type used (RpptRoiType::XYWH or RpptRoiType::LTRB)
* \param [in] rppHandle RPP HOST handle created with <tt>\ref rppCreateWithBatchSize()</tt>
* \return A <tt> \ref RppStatus</tt> enumeration.
* \retval RPP_SUCCESS Successful completion.
* \retval RPP_ERROR* Unsuccessful completion.
*/
RppStatus rppt_tensor_stddev_host(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t tensorStddevArr, Rpp32u tensorStddevArrLength, Rpp32f *meanTensor, RpptROIPtr roiTensorPtrSrc, RpptRoiType roiType, rppHandle_t rppHandle);
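As with the mean, the stddev reduction semantics can be sketched in scalar form: deviations are taken against the caller-supplied `meanTensor` values (MeanR, MeanG, MeanB, MeanImage). This is an illustrative reference, not RPP code; in particular, the 1/(3·h·w) normalization for the image-level value is an assumption about how the total stddev is averaged across channels.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scalar reference for the 3-channel stddev reduction described above, for one
// planar (NCHW) image. mean = {MeanR, MeanG, MeanB, MeanImage}, matching the
// meanTensor layout in the API. (Illustrative helper, not an RPP function.)
std::vector<double> referenceStddevPln3(const std::vector<double> &img, int h, int w,
                                        const std::vector<double> &mean)
{
    double var[4] = {0.0, 0.0, 0.0, 0.0};
    for (int c = 0; c < 3; c++)
        for (int i = 0; i < h * w; i++)
        {
            double px = img[c * h * w + i];
            double d = px - mean[c];
            var[c] += d * d;                       // channel-wise squared deviation
            double dImg = px - mean[3];
            var[3] += dImg * dImg;                 // whole-image squared deviation
        }
    double n = static_cast<double>(h * w);
    return {std::sqrt(var[0] / n), std::sqrt(var[1] / n),
            std::sqrt(var[2] / n), std::sqrt(var[3] / (3.0 * n))};
}
```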

#ifdef GPU_SUPPORT
/*! \brief Tensor stddev operation on HIP backend for a NCHW/NHWC layout tensor
* \details The tensor stddev is a reduction operation that finds the channel-wise (R stddev / G stddev / B stddev) and total standard deviation for each image with respect to meanTensor passed.<br>
* - srcPtr depth ranges - Rpp8u (0 to 255), Rpp16f (0 to 1), Rpp32f (0 to 1), Rpp8s (-128 to 127).
* - dstPtr depth ranges - Will be same depth as srcPtr.
* \param [in] srcPtr source tensor in HIP memory
* \param [in] srcDescPtr source tensor descriptor (Restrictions - numDims = 4, offsetInBytes >= 0, dataType = U8/F16/F32/I8, layout = NCHW/NHWC, c = 1/3)
* \param [out] tensorStddevArr destination array in HIP memory
* \param [in] tensorStddevArrLength length of provided destination array (Restrictions - if srcDescPtr->c == 1 then tensorStddevArrLength = srcDescPtr->n, and if srcDescPtr->c == 3 then tensorStddevArrLength = srcDescPtr->n * 4)
* \param [in] meanTensor mean values for stddev calculation (1D tensor of size batchSize * 4 in format (MeanR, MeanG, MeanB, MeanImage) for each image in batch)
* \param [in] roiTensorSrc ROI data in HIP memory, for each image in source tensor (2D tensor of size batchSize * 4, in either format - XYWH(xy.x, xy.y, roiWidth, roiHeight) or LTRB(lt.x, lt.y, rb.x, rb.y)) | (Restrictions - roiTensorSrc[i].xywhROI.roiWidth <= 3840 and roiTensorSrc[i].xywhROI.roiHeight <= 2160)
* \param [in] roiType ROI type used (RpptRoiType::XYWH or RpptRoiType::LTRB)
* \param [in] rppHandle RPP HIP handle created with <tt>\ref rppCreateWithStreamAndBatchSize()</tt>
* \return A <tt> \ref RppStatus</tt> enumeration.
* \retval RPP_SUCCESS Successful completion.
* \retval RPP_ERROR* Unsuccessful completion.
*/
RppStatus rppt_tensor_stddev_gpu(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t tensorStddevArr, Rpp32u tensorStddevArrLength, Rpp32f *meanTensor, RpptROIPtr roiTensorPtrSrc, RpptRoiType roiType, rppHandle_t rppHandle);
#endif // GPU_SUPPORT

/*! @}
*/

#ifdef __cplusplus
}
#endif
#endif // RPPT_TENSOR_STATISTICAL_OPERATIONS_H
40 changes: 40 additions & 0 deletions src/include/cpu/rpp_cpu_common.hpp
@@ -6120,6 +6120,46 @@ inline void compute_sum_24_host(__m256d *p, __m256d *pSumR, __m256d *pSumG, __m2
pSumB[0] = _mm256_add_pd(_mm256_add_pd(p[4], p[5]), pSumB[0]); //add 8B values and bring it down to 4
}

inline void compute_variance_8_host(__m256d *p1, __m256d *pMean, __m256d *pVar)
{
__m256d pSub = _mm256_sub_pd(p1[0], pMean[0]);
pVar[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVar[0]);
pSub = _mm256_sub_pd(p1[1], pMean[0]);
pVar[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVar[0]);
}

inline void compute_variance_channel_pln3_24_host(__m256d *p1, __m256d *pMeanR, __m256d *pMeanG, __m256d *pMeanB, __m256d *pVarR, __m256d *pVarG, __m256d *pVarB)
{
__m256d pSub = _mm256_sub_pd(p1[0], pMeanR[0]);
pVarR[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVarR[0]);
pSub = _mm256_sub_pd(p1[1], pMeanR[0]);
pVarR[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVarR[0]);
pSub = _mm256_sub_pd(p1[2], pMeanG[0]);
pVarG[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVarG[0]);
pSub = _mm256_sub_pd(p1[3], pMeanG[0]);
pVarG[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVarG[0]);
pSub = _mm256_sub_pd(p1[4], pMeanB[0]);
pVarB[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVarB[0]);
pSub = _mm256_sub_pd(p1[5], pMeanB[0]);
pVarB[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVarB[0]);
}

inline void compute_variance_image_pln3_24_host(__m256d *p1, __m256d *pMean, __m256d *pVarR, __m256d *pVarG, __m256d *pVarB)
{
__m256d pSub = _mm256_sub_pd(p1[0], pMean[0]);
pVarR[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVarR[0]);
pSub = _mm256_sub_pd(p1[1], pMean[0]);
pVarR[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVarR[0]);
pSub = _mm256_sub_pd(p1[2], pMean[0]);
pVarG[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVarG[0]);
pSub = _mm256_sub_pd(p1[3], pMean[0]);
pVarG[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVarG[0]);
pSub = _mm256_sub_pd(p1[4], pMean[0]);
pVarB[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVarB[0]);
pSub = _mm256_sub_pd(p1[5], pMean[0]);
pVarB[0] = _mm256_add_pd(_mm256_mul_pd(pSub, pSub), pVarB[0]);
}
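Each of the variance helpers above accumulates squared deviations from a fixed mean, with the values held as groups of four doubles per `__m256d` lane. A scalar equivalent of `compute_variance_8_host` (two 4-lane registers, i.e. 8 values) makes the arithmetic explicit; this is an illustrative sketch, not RPP code.

```cpp
#include <cassert>

// Scalar equivalent of compute_variance_8_host above: accumulate squared
// deviations from a fixed mean over 8 double values (two 4-lane AVX vectors).
// (Illustrative helper, not an RPP function.)
void computeVariance8Scalar(const double *p, double mean, double &var)
{
    for (int i = 0; i < 8; i++)
    {
        double sub = p[i] - mean;   // mirrors _mm256_sub_pd(p1[i], pMean[0])
        var += sub * sub;           // mirrors the fused multiply-add accumulation
    }
}
```

The `pln3` variants do the same per channel, either against three per-channel means (`compute_variance_channel_pln3_24_host`) or one whole-image mean (`compute_variance_image_pln3_24_host`).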

inline void compute_vignette_48_host(__m256 *p, __m256 &pMultiplier, __m256 &pILocComponent, __m256 &pJLocComponent)
{
__m256 pGaussianValue;
74 changes: 74 additions & 0 deletions src/include/cpu/rpp_cpu_simd.hpp
@@ -1280,6 +1280,35 @@ inline void rpp_store48_f32pln3_to_u8pkd3_avx(Rpp8u *dstPtr, __m256 *p)
_mm_storeu_si128((__m128i *)(dstPtr + 36), px[3]); /* store [R13|G13|B13|R14|G14|B14|R15|G15|B15|R16|G16|B16|00|00|00|00] */
}

inline void rpp_load24_u8pln3_to_f64pln3_avx(Rpp8u *srcPtrR, Rpp8u *srcPtrG, Rpp8u *srcPtrB, __m256d *p)
{
__m128i px[3];

px[0] = _mm_loadu_si128((__m128i *)srcPtrR); /* load [R00|R01|R02|R03|R04|R05|R06|R07|R08|R09|R10|R11|R12|R13|R14|R15] */
px[1] = _mm_loadu_si128((__m128i *)srcPtrG); /* load [G00|G01|G02|G03|G04|G05|G06|G07|G08|G09|G10|G11|G12|G13|G14|G15] */
px[2] = _mm_loadu_si128((__m128i *)srcPtrB); /* load [B00|B01|B02|B03|B04|B05|B06|B07|B08|B09|B10|B11|B12|B13|B14|B15] */
p[0] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[0], xmm_pxMask00To03)); /* Contains R00-03 */
p[1] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[0], xmm_pxMask04To07)); /* Contains R04-07 */
p[2] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[1], xmm_pxMask00To03)); /* Contains G00-03 */
p[3] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[1], xmm_pxMask04To07)); /* Contains G04-07 */
p[4] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[2], xmm_pxMask00To03)); /* Contains B00-03 */
p[5] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[2], xmm_pxMask04To07)); /* Contains B04-07 */
}

inline void rpp_load24_u8pkd3_to_f64pln3_avx(Rpp8u *srcPtr, __m256d *p)
{
__m128i px[2];

px[0] = _mm_loadu_si128((__m128i *)srcPtr); /* load [R00|G00|B00|R01|G01|B01|R02|G02|B02|R03|G03|B03|R04|G04|B04|R05] - Need RGB 00-03 */
px[1] = _mm_loadu_si128((__m128i *)(srcPtr + 12)); /* load [R04|G04|B04|R05|G05|B05|R06|G06|B06|R07|G07|B07|R08|G08|B08|R09] - Need RGB 04-07 */
p[0] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[0], xmm_pxMaskR)); /* Contains R00-03 */
p[1] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[1], xmm_pxMaskR)); /* Contains R04-07 */
p[2] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[0], xmm_pxMaskG)); /* Contains G00-03 */
p[3] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[1], xmm_pxMaskG)); /* Contains G04-07 */
p[4] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[0], xmm_pxMaskB)); /* Contains B00-03 */
p[5] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[1], xmm_pxMaskB)); /* Contains B04-07 */
}
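The shuffle masks in `rpp_load24_u8pkd3_to_f64pln3_avx` above pick every third byte so that 8 packed-RGB pixels land in three planar groups of doubles. A scalar equivalent shows the deinterleave pattern; this is an illustrative sketch, not RPP code, and it ignores the SIMD-specific detail that the two 16-byte loads overlap at pixel 4.

```cpp
#include <cassert>
#include <cstdint>

// Scalar equivalent of rpp_load24_u8pkd3_to_f64pln3_avx above: deinterleave
// 8 packed-RGB u8 pixels (24 bytes) into three planar groups of 8 doubles.
// (Illustrative helper, not an RPP function.)
void load24U8Pkd3ToF64Pln3Scalar(const uint8_t *src, double *r, double *g, double *b)
{
    for (int i = 0; i < 8; i++)
    {
        r[i] = static_cast<double>(src[3 * i + 0]);  // R plane (xmm_pxMaskR role)
        g[i] = static_cast<double>(src[3 * i + 1]);  // G plane (xmm_pxMaskG role)
        b[i] = static_cast<double>(src[3 * i + 2]);  // B plane (xmm_pxMaskB role)
    }
}
```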

inline void rpp_load16_u8_to_f32_avx(Rpp8u *srcPtr, __m256 *p)
{
__m128i px;
@@ -1315,6 +1344,22 @@ inline void rpp_store16_f32_to_u8_avx(Rpp8u *dstPtr, __m256 *p)
_mm_storeu_si128((__m128i *)dstPtr, px[0]);
}

inline void rpp_load8_u8_to_f64_avx(Rpp8u *srcPtr, __m256d *p)
{
__m128i px;
px = _mm_loadu_si128((__m128i *)srcPtr);
p[0] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px, xmm_pxMask00To03)); /* Contains pixels 01-04 */
p[1] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px, xmm_pxMask04To07)); /* Contains pixels 05-08 */
}

inline void rpp_load8_i8_to_f64_avx(Rpp8s *srcPtr, __m256d *p)
{
__m128i px;
px = _mm_add_epi8(xmm_pxConvertI8, _mm_loadu_si128((__m128i *)srcPtr));
p[0] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px, xmm_pxMask00To03)); /* Contains pixels 01-04 */
p[1] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px, xmm_pxMask04To07)); /* Contains pixels 05-08 */
}
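The I8 loads differ from the U8 loads only in the `_mm_add_epi8(xmm_pxConvertI8, ...)` step, which biases signed bytes into the unsigned range so the same byte-shuffle/convert path can be reused. A scalar sketch of the idea follows; the +128 bias is an assumption inferred from the conversion-parameter naming, not confirmed by this diff, and the helper name is illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the I8 "conversion param" trick used in rpp_load8_i8_to_f64_avx:
// adding 128 (assumed value of xmm_pxConvertI8) maps signed bytes [-128, 127]
// onto [0, 255], so the unsigned byte-shuffle and int->double convert path
// used for U8 works unchanged. (Illustrative helper, not an RPP function.)
void load8I8ToF64Scalar(const int8_t *src, double *p)
{
    for (int i = 0; i < 8; i++)
        p[i] = static_cast<double>(static_cast<uint8_t>(src[i] + 128)); // bias to unsigned
}
```

Downstream compute would then subtract the bias back out when the true signed value is needed.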

inline void rpp_load16_u8_to_u32_avx(Rpp8u *srcPtr, __m256i *p)
{
__m128i px;
@@ -1688,6 +1733,35 @@ inline void rpp_store48_f32pln3_to_i8pkd3_avx(Rpp8s *dstPtr, __m256 *p)
_mm_storeu_si128((__m128i *)(dstPtr + 36), px[3]); /* store [R13|G13|B13|R14|G14|B14|R15|G15|B15|R16|G16|B16|00|00|00|00] */
}

inline void rpp_load24_i8pln3_to_f64pln3_avx(Rpp8s *srcPtrR, Rpp8s *srcPtrG, Rpp8s *srcPtrB, __m256d *p)
{
__m128i px[3];

px[0] = _mm_add_epi8(xmm_pxConvertI8, _mm_loadu_si128((__m128i *)srcPtrR)); /* add I8 conversion param to load [R01|R02|R03|R04|R05|R06|R07|R08|R09|R10|R11|R12|R13|R14|R15|R16] */
px[1] = _mm_add_epi8(xmm_pxConvertI8, _mm_loadu_si128((__m128i *)srcPtrG)); /* add I8 conversion param to load [G01|G02|G03|G04|G05|G06|G07|G08|G09|G10|G11|G12|G13|G14|G15|G16] */
px[2] = _mm_add_epi8(xmm_pxConvertI8, _mm_loadu_si128((__m128i *)srcPtrB)); /* add I8 conversion param to load [B01|B02|B03|B04|B05|B06|B07|B08|B09|B10|B11|B12|B13|B14|B15|B16] */
p[0] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[0], xmm_pxMask00To03)); /* Contains R01-04 */
p[1] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[0], xmm_pxMask04To07)); /* Contains R05-08 */
p[2] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[1], xmm_pxMask00To03)); /* Contains G01-04 */
p[3] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[1], xmm_pxMask04To07)); /* Contains G05-08 */
p[4] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[2], xmm_pxMask00To03)); /* Contains B01-04 */
p[5] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[2], xmm_pxMask04To07)); /* Contains B05-08 */
}

inline void rpp_load24_i8pkd3_to_f64pln3_avx(Rpp8s *srcPtr, __m256d *p)
{
__m128i px[2];

px[0] = _mm_add_epi8(xmm_pxConvertI8, _mm_loadu_si128((__m128i *)srcPtr)); /* add I8 conversion param to load [R01|G01|B01|R02|G02|B02|R03|G03|B03|R04|G04|B04|R05|G05|B05|R06] - Need RGB 01-04 */
px[1] = _mm_add_epi8(xmm_pxConvertI8, _mm_loadu_si128((__m128i *)(srcPtr + 12))); /* add I8 conversion param to load [R05|G05|B05|R06|G06|B06|R07|G07|B07|R08|G08|B08|R09|G09|B09|R10] - Need RGB 05-08 */
p[0] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[0], xmm_pxMaskR)); /* Contains R01-04 */
p[1] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[1], xmm_pxMaskR)); /* Contains R05-08 */
p[2] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[0], xmm_pxMaskG)); /* Contains G01-04 */
p[3] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[1], xmm_pxMaskG)); /* Contains G05-08 */
p[4] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[0], xmm_pxMaskB)); /* Contains B01-04 */
p[5] = _mm256_cvtepi32_pd(_mm_shuffle_epi8(px[1], xmm_pxMaskB)); /* Contains B05-08 */
}

inline void rpp_load16_i8_to_f32_avx(Rpp8s *srcPtr, __m256 *p)
{
__m128i px;