-
Notifications
You must be signed in to change notification settings - Fork 611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize CPU resampling #2540
Vectorize CPU resampling #2540
Conversation
CI MESSAGE: [1885499]: BUILD STARTED |
CI MESSAGE: [1885499]: BUILD FAILED |
CI MESSAGE: [1889914]: BUILD STARTED |
CI MESSAGE: [1889914]: BUILD FAILED |
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
d0c76e9
to
1786c3d
Compare
CI MESSAGE: [1892417]: BUILD STARTED |
CI MESSAGE: [1892452]: BUILD STARTED |
CI MESSAGE: [1892417]: BUILD FAILED |
dali/kernels/common/simd.h
Outdated
|
||
template <typename Out> | ||
inline std::enable_if_t<std::is_integral<Out>::value> | ||
store(Out *out, float4x<sizeof(float)/sizeof(Out)> f) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convert vectors of float to one vector of Out and store.
dali/kernels/common/simd.h
Outdated
__m128i lo16 = _mm_unpacklo_epi8(in, zero); | ||
__m128i hi16 = _mm_unpackhi_epi8(in, zero); | ||
__m128i i32_0 = _mm_unpacklo_epi8(lo16, zero); | ||
__m128i i32_1 = _mm_unpackhi_epi8(lo16, zero); | ||
__m128i i32_2 = _mm_unpacklo_epi8(hi16, zero); | ||
__m128i i32_3 = _mm_unpackhi_epi8(hi16, zero); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interleave with zeros until 32-bit wide
CI MESSAGE: [1892452]: BUILD PASSED |
dali/kernels/common/simd.h
Outdated
} | ||
|
||
inline void store(uint16_t *u16, i128x2 iv) { | ||
__m128i sh = _mm_set_epi32(0, 0, 0, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no instruction that can pack 32-bit values to 16-bit without signed saturation - neither is there a byte shuffle (well, there is: in sse 4.1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add this as a comment to this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this function just cuts the MSB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per-element, it does this:
a = (unsigned)x >> 1
b = x & 1
c = convert_saturate<int16>(a)
d = convert_saturate<int16>(b)
out = c << 1 | d
dali/kernels/common/simd.h
Outdated
return _mm_cvtps_epi32(f); // round | ||
} | ||
|
||
inline void store(int8_t *i8, i128x4 iv) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline void store(int8_t *i8, i128x4 iv) { | |
inline void store_converted(int8_t *i8, i128x4 iv) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True that. It should actually be store_i32.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments to the first file.
It would be nice to get those comments you put in this PR as docstrings.
dali/kernels/common/simd.h
Outdated
__m128i hi0 = _mm_srl_epi32(iv.v[0], sh); | ||
__m128i hi1 = _mm_srl_epi32(iv.v[1], sh); | ||
__m128i lo0 = _mm_and_si128(iv.v[0], one); | ||
__m128i lo1 = _mm_and_si128(iv.v[0], one); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__m128i lo1 = _mm_and_si128(iv.v[0], one); | |
__m128i lo1 = _mm_and_si128(iv.v[1], one); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. apparently we don't test this code path.
dali/kernels/common/simd.h
Outdated
} | ||
|
||
inline void store(uint16_t *u16, i128x2 iv) { | ||
__m128i sh = _mm_set_epi32(0, 0, 0, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think that _mm_insert_epi16(_mm_setzero_si128(), 1, 0)
could help here? Instead of setting it lane by lane?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_mm_set_epi32 is not an instruction - compiler is at liberty to implement it as setzero/insert or however else it sees fit.
dali/kernels/common/simd.h
Outdated
__m128i lo1 = _mm_and_si128(iv.v[0], one); | ||
__m128i lo = _mm_packs_epi32(lo0, lo1); | ||
__m128i hi = _mm_packs_epi32(hi0, hi1); | ||
__m128i out = _mm_or_si128(_mm_sll_epi16(hi, sh), lo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this, does it work in general? If you have max negative number, you will get the LSB=0, and shift to a positive number with 31st bit lit up. This is clearly bigger than 16 bit numbers, so it will be saturated into max positive 16bit number. You shift right by one, place the LSB=0, and end up saturating the max negative number into max positive -1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function doesn't advertise that it can handle out-of-range arguments. The problem is that there's no easy way (in SSE2 at least) to just narrow 32 bits to 16 bits without signed saturation. This will work for all 16-bit values, but no, it won't work for values outside 0..0xffff range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, this function also doesn't advertise that it doesn't work for those ranges. I assumed that if you used instructions with saturation (I know there is no much choice there), you meant saturation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've rewritten it so it properly clamps.
dali/kernels/common/simd.h
Outdated
__m128i i32_0 = _mm_unpacklo_epi8(lo16, zero); | ||
__m128i i32_1 = _mm_unpackhi_epi8(lo16, zero); | ||
__m128i i32_2 = _mm_unpacklo_epi8(hi16, zero); | ||
__m128i i32_3 = _mm_unpackhi_epi8(hi16, zero); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be _epi16 suffix here? Or it doesn't matter for the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general it should be, but it doesn't matter when we're inserting zeros.
} | ||
|
||
template <typename In> | ||
DALI_FORCEINLINE static multivec load(const In *in) noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you have to specify the correct num_vecs
based on the input type?
Can't you compute it based on the input type instead? It seems error prone to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. It will load as many values from input as is necessary to fill the multivec.
I can add
static_assert(load_vecs > 0, "Too few lanes to use vector load for this input type");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, if you try to load uint16 into the float4x1, it will just probably segfault. There is no check for that, IMO it should be
static_assert(num_vecs % load_vecs == 0);
or something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the implementation of multivec is tied to its usage that assumes the correct storage size used for loading. And this is supposed to be generic header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
|
||
if (static_channels < 0) { | ||
// we don't know how many channels we have at compile time - inner loop over filter | ||
for (int c = 0; c < channels; c++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What difference it would make if we have a variants for predefined number of channels - 1 and 3 for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it goes for the branch below (static_channels) and then channels
is effectively a constant expression. It's quite optimized. We could use a specialized variant for 1 channel, 2, and 4 channels - 3 not so much, interleaving and deinterleaving triples on X86 is a pain.
We could use it on ARM NEON, though, with its beautiful interleaved load and store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So when the number of channels is not static?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's for some weird cases not covered by static switch for the number of channels, like 7 or 16.
for (int l = 0; l < kNumLanes; l++) | ||
tmp_coeffs[l] = coeffs[(x + l) * support + k]; // interleave per-column coefficients |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an idea, do you think that interleaving that ahead of this could help a bit?
It than could be done once for all the rows, wouldn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's wait for another perf. complaint ;)
} | ||
|
||
template <typename In> | ||
DALI_FORCEINLINE static multivec load(const In *in) noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, if you try to load uint16 into the float4x1, it will just probably segfault. There is no check for that, IMO it should be
static_assert(num_vecs % load_vecs == 0);
or something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resampling looks fine, I would gladly see docs in the simd.h if you want to treat it as generic utility.
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
ee69c71
to
ec6ad53
Compare
… (clamping done in float). Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
!build |
CI MESSAGE: [1893906]: BUILD STARTED |
CI MESSAGE: [1893906]: BUILD FAILED |
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
CI MESSAGE: [1894120]: BUILD STARTED |
CI MESSAGE: [1894120]: BUILD FAILED |
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
CI MESSAGE: [1899460]: BUILD STARTED |
} | ||
|
||
/** | ||
* @brief Convert 2 vectors of float and convert to uint16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @brief Convert 2 vectors of float and convert to uint16 | |
* @brief Convert 2 vectors of float to uint16 and store them |
} | ||
|
||
template <typename In> | ||
DALI_FORCEINLINE static multivec load(const In *in) noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
CI MESSAGE: [1899460]: BUILD PASSED |
Signed-off-by: Michał Zientkiewicz mzient@gmail.com
Why we need this PR?
Pick one, remove the rest
What happened in this PR?
Fill relevant points, put NA otherwise. Replace anything inside []
JIRA TASK: DALI-1776