
Vectorize CPU resampling #2540

Merged
mzient merged 5 commits into NVIDIA:master on Dec 14, 2020

Conversation

@mzient (Contributor) commented Dec 10, 2020

Signed-off-by: Michał Zientkiewicz mzient@gmail.com

Why do we need this PR?


  • It fixes performance issues with CPU resize that appeared after introducing 3D resampling

What happened in this PR?


  • What solution was applied:
    • Add SIMD utility header to kernels/common
    • Use a multi-vector approach to utilize the full width of the input and output vectors when resampling vertically (see the conceptual sketch below)
    • Use a multi-vector approach to utilize the full width of the output when resampling horizontally
    • Tune tile size for vertical resampling
  • Affected modules and functionalities:
    • CPU resampling backend
  • Key points relevant for the review:
    • N/A
  • Validation and testing:
    • Existing tests suffice
  • Documentation (including examples):
    • N/A

JIRA TASK: DALI-1776
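
For illustration, a conceptual sketch of the multi-vector idea for vertical resampling mentioned above; the names (multivec, ResampleColsBlock, nvec) and the plain weighted-sum body are assumptions for this sketch, not the PR's actual kernel code:

#include <xmmintrin.h>

template <int nvec>
struct multivec {
  __m128 v[nvec];  // nvec SSE registers = nvec * 4 floats processed together
};

// Accumulate `support` weighted source rows into a block of nvec*4 output floats at x0.
template <int nvec>
inline void ResampleColsBlock(float *out, const float *const *in_rows,
                              const float *weights, int support, int x0) {
  multivec<nvec> acc;
  for (int i = 0; i < nvec; i++)
    acc.v[i] = _mm_setzero_ps();
  for (int k = 0; k < support; k++) {      // one pass over the filter taps...
    __m128 w = _mm_set1_ps(weights[k]);
    for (int i = 0; i < nvec; i++) {       // ...updates all nvec output vectors at once
      __m128 px = _mm_loadu_ps(in_rows[k] + x0 + 4 * i);
      acc.v[i] = _mm_add_ps(acc.v[i], _mm_mul_ps(px, w));
    }
  }
  for (int i = 0; i < nvec; i++)
    _mm_storeu_ps(out + x0 + 4 * i, acc.v[i]);
}

Vertical resampling is a natural fit for this layout because consecutive output elements in a row share the same per-row weights, so every vector lane does identical work.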

@mzient requested a review from a team on December 10, 2020 at 18:21
@dali-automaton (Collaborator)

CI MESSAGE: [1885499]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1885499]: BUILD FAILED

@dali-automaton (Collaborator)

CI MESSAGE: [1889914]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1889914]: BUILD FAILED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton (Collaborator)

CI MESSAGE: [1892417]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1892452]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1892417]: BUILD FAILED

@klecki self-assigned this on Dec 11, 2020

template <typename Out>
inline std::enable_if_t<std::is_integral<Out>::value>
store(Out *out, float4x<sizeof(float)/sizeof(Out)> f) {
Contributor Author

Convert vectors of float to one vector of Out and store.
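
For context, a hedged sketch of this pattern for Out = uint8_t; the names float4x4 and store_u8 are illustrative, not the PR's declarations. Four float vectors are rounded to int32 and packed down with saturation into one 16-byte vector:

#include <emmintrin.h>
#include <cstdint>

struct float4x4 { __m128 v[4]; };  // 4 vectors of 4 floats = 16 values

inline void store_u8(uint8_t *out, float4x4 f) {
  __m128i i0 = _mm_cvtps_epi32(f.v[0]);   // round to nearest int32
  __m128i i1 = _mm_cvtps_epi32(f.v[1]);
  __m128i i2 = _mm_cvtps_epi32(f.v[2]);
  __m128i i3 = _mm_cvtps_epi32(f.v[3]);
  __m128i lo = _mm_packs_epi32(i0, i1);   // int32 -> int16 with signed saturation
  __m128i hi = _mm_packs_epi32(i2, i3);
  __m128i u8 = _mm_packus_epi16(lo, hi);  // int16 -> uint8 with unsigned saturation
  _mm_storeu_si128(reinterpret_cast<__m128i *>(out), u8);
}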

Comment on lines 107 to 112
__m128i lo16 = _mm_unpacklo_epi8(in, zero);
__m128i hi16 = _mm_unpackhi_epi8(in, zero);
__m128i i32_0 = _mm_unpacklo_epi8(lo16, zero);
__m128i i32_1 = _mm_unpackhi_epi8(lo16, zero);
__m128i i32_2 = _mm_unpacklo_epi8(hi16, zero);
__m128i i32_3 = _mm_unpackhi_epi8(hi16, zero);
Contributor Author

interleave with zeros until 32-bit wide
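
As a self-contained illustration of the widening pattern quoted above, assuming plain SSE2 (the helper name and the final conversion to float are additions for this sketch):

#include <emmintrin.h>
#include <cstdint>

// Zero-extend 16 uint8 values to four vectors of int32, then convert them to float.
inline void u8_to_f32x16(const uint8_t *src, __m128 out[4]) {
  __m128i in   = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src));
  __m128i zero = _mm_setzero_si128();
  __m128i lo16 = _mm_unpacklo_epi8(in, zero);                // bytes 0..7  -> eight 16-bit lanes
  __m128i hi16 = _mm_unpackhi_epi8(in, zero);                // bytes 8..15 -> eight 16-bit lanes
  out[0] = _mm_cvtepi32_ps(_mm_unpacklo_epi16(lo16, zero));  // elements 0..3
  out[1] = _mm_cvtepi32_ps(_mm_unpackhi_epi16(lo16, zero));  // elements 4..7
  out[2] = _mm_cvtepi32_ps(_mm_unpacklo_epi16(hi16, zero));  // elements 8..11
  out[3] = _mm_cvtepi32_ps(_mm_unpackhi_epi16(hi16, zero));  // elements 12..15
}

The sketch uses the _epi16 unpack in the second step; the quoted code uses the _epi8 variant, which gives the same result when the other operand is all zeros (as discussed further down in this review).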

@dali-automaton (Collaborator)

CI MESSAGE: [1892452]: BUILD PASSED

}

inline void store(uint16_t *u16, i128x2 iv) {
__m128i sh = _mm_set_epi32(0, 0, 0, 1);
@mzient (Contributor Author) Dec 11, 2020

There's no instruction that can pack 32-bit values to 16-bit without signed saturation, and there's no byte shuffle either (well, there is one in SSE4.1).

Contributor

Can you add this as a comment to this function?

Contributor

So this function just cuts the MSB?

@mzient (Contributor Author) Dec 11, 2020

Per-element, it does this:

a = (unsigned)x >> 1
b = x & 1
c = convert_saturate<int16>(a)
d = convert_saturate<int16>(b)
out = c << 1 | d
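
In scalar form, and assuming the input already fits in 0..0xffff, the trick amounts to something like this hypothetical helper:

#include <cstdint>

inline uint16_t narrow_u32_to_u16(uint32_t x) {
  int16_t hi = static_cast<int16_t>(x >> 1);     // top 15 bits; never saturates for x <= 0xffff
  int16_t lo = static_cast<int16_t>(x & 1);      // the bit dropped by the shift
  return static_cast<uint16_t>((hi << 1) | lo);  // reassemble the original 16-bit value
}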

return _mm_cvtps_epi32(f); // round
}

inline void store(int8_t *i8, i128x4 iv) {
Contributor

Suggested change:
- inline void store(int8_t *i8, i128x4 iv) {
+ inline void store_converted(int8_t *i8, i128x4 iv) {

Contributor Author

True that. It should actually be store_i32.

dali/kernels/common/simd.h (outdated, resolved)
@klecki (Contributor) left a comment

Comments on the first file.
It would be nice to get those comments you put in this PR as docstrings.

dali/kernels/common/simd.h (resolved)
__m128i hi0 = _mm_srl_epi32(iv.v[0], sh);
__m128i hi1 = _mm_srl_epi32(iv.v[1], sh);
__m128i lo0 = _mm_and_si128(iv.v[0], one);
__m128i lo1 = _mm_and_si128(iv.v[0], one);
Contributor

Suggested change:
- __m128i lo1 = _mm_and_si128(iv.v[0], one);
+ __m128i lo1 = _mm_and_si128(iv.v[1], one);

Contributor Author

Hmm.. apparently we don't test this code path.

}

inline void store(uint16_t *u16, i128x2 iv) {
__m128i sh = _mm_set_epi32(0, 0, 0, 1);
Contributor

Do you think that _mm_insert_epi16(_mm_setzero_si128(), 1, 0) could help here? Instead of setting it lane by lane?

Contributor Author

_mm_set_epi32 is not an instruction; the compiler is at liberty to implement it as setzero/insert or however else it sees fit.

__m128i lo1 = _mm_and_si128(iv.v[0], one);
__m128i lo = _mm_packs_epi32(lo0, lo1);
__m128i hi = _mm_packs_epi32(hi0, hi1);
__m128i out = _mm_or_si128(_mm_sll_epi16(hi, sh), lo);
Contributor

Thinking about this, does it work in general? If you have the maximum negative number, you get LSB = 0 and the shift yields a positive number with the 31st bit lit up. That is clearly bigger than any 16-bit number, so it is saturated to the maximum positive 16-bit value. You then shift back left by one, put in LSB = 0, and end up turning the maximum negative number into the maximum uint16 value minus 1.

Contributor Author

This function doesn't advertise that it can handle out-of-range arguments. The problem is that there's no easy way (at least in SSE2) to just narrow 32 bits to 16 bits without signed saturation. It will work for all 16-bit values, but no, it won't work for values outside the 0..0xffff range.

Contributor

I mean, this function also doesn't advertise that it doesn't work for those ranges. I assumed that if you used instructions with saturation (I know there is not much choice there), you meant saturation.

Contributor Author

I've rewritten it so it properly clamps.
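
A minimal sketch of what clamping in float before the conversion can look like (the follow-up commit message mentions clamping done in float; the limits and the helper name here are assumptions, not the PR's actual code):

#include <emmintrin.h>

// Hypothetical helper: clamp to the uint16 range while still in float, then round-convert.
inline __m128i round_clamped_to_u16_range(__m128 f) {
  f = _mm_max_ps(f, _mm_setzero_ps());       // clamp below at 0
  f = _mm_min_ps(f, _mm_set1_ps(65535.0f));  // clamp above at 65535
  return _mm_cvtps_epi32(f);                 // round to nearest int32; values now fit in uint16
}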

Comment on lines 89 to 92
__m128i i32_0 = _mm_unpacklo_epi8(lo16, zero);
__m128i i32_1 = _mm_unpackhi_epi8(lo16, zero);
__m128i i32_2 = _mm_unpacklo_epi8(hi16, zero);
__m128i i32_3 = _mm_unpackhi_epi8(hi16, zero);
Contributor

Shouldn't it be the _epi16 suffix here? Or does it not matter for the result?

Contributor Author

In general it should be, but it doesn't matter when we're inserting zeros.

}

template <typename In>
DALI_FORCEINLINE static multivec load(const In *in) noexcept {
Contributor

Don't you have to specify the correct num_vecs based on the input type?
Can't you compute it from the input type instead? It seems error-prone to me.

Contributor Author

No. It will load as many values from input as is necessary to fill the multivec.
I can add

static_assert(load_vecs > 0, "Too few lanes to use vector load for this input type");

Contributor

I mean, if you try to load uint16 into the float4x1, it will just probably segfault. There is no check for that, IMO it should be

static_assert(num_vecs % load_vecs == 0);

or something similar.

@klecki (Contributor) Dec 11, 2020

Now the implementation of multivec is tied to a usage that assumes the correct storage size for loading, and this is supposed to be a generic header.

Contributor Author

Fixed?

Contributor

Yup.


if (static_channels < 0) {
// we don't know how many channels we have at compile time - inner loop over filter
for (int c = 0; c < channels; c++) {
Contributor

What difference would it make if we had variants for a predefined number of channels, e.g. 1 and 3?

Contributor Author

Then it takes the branch below (static_channels), where channels is effectively a constant expression, so it's quite well optimized. We could use specialized variants for 1, 2 and 4 channels; 3 not so much, since interleaving and deinterleaving triples on x86 is a pain.
We could use it on ARM NEON, though, with its beautiful interleaved loads and stores.
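
For illustration, a minimal sketch of the kind of dispatch described here; the names and the placeholder loop body are assumptions (the real kernel applies the resampling filter). When the runtime channel count matches a known value, an instantiation with static_channels baked in is selected, so the inner loop bound becomes a compile-time constant:

template <int static_channels>
void ResampleRowImpl(float *out, const float *in, int out_width, int channels) {
  const int ch = static_channels < 0 ? channels : static_channels;
  for (int x = 0; x < out_width; x++)
    for (int c = 0; c < ch; c++)         // with static_channels > 0 the compiler can unroll this
      out[x * ch + c] = in[x * ch + c];  // placeholder body; the real kernel applies filter taps
}

void ResampleRow(float *out, const float *in, int out_width, int channels) {
  switch (channels) {
    case 1:  ResampleRowImpl<1>(out, in, out_width, channels); break;
    case 3:  ResampleRowImpl<3>(out, in, out_width, channels); break;
    case 4:  ResampleRowImpl<4>(out, in, out_width, channels); break;
    default: ResampleRowImpl<-1>(out, in, out_width, channels); break;  // dynamic channel count
  }
}

With a fixed channel count the compiler can fully unroll the channel loop and keep per-pixel work in registers; the generic -1 path keeps correctness for arbitrary channel counts.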

Contributor

So when the number of channels is not static?

Contributor

I guess it's for some weird cases not covered by the static switch over the number of channels, like 7 or 16.

Comment on lines +188 to +189
for (int l = 0; l < kNumLanes; l++)
tmp_coeffs[l] = coeffs[(x + l) * support + k]; // interleave per-column coefficients
Contributor

Just an idea: do you think that interleaving the coefficients ahead of this loop could help a bit?
It could then be done once for all the rows, couldn't it?

Contributor Author

Let's wait for another perf. complaint ;)


@klecki (Contributor) left a comment

The resampling looks fine; I would gladly see docs in simd.h if you want to treat it as a generic utility.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
… (clamping done in float).

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@mzient (Contributor Author) commented Dec 11, 2020

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1893906]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1893906]: BUILD FAILED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton (Collaborator)

CI MESSAGE: [1894120]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1894120]: BUILD FAILED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton (Collaborator)

CI MESSAGE: [1899460]: BUILD STARTED

}

/**
* @brief Convert 2 vectors of float and convert to uint16
Contributor

Suggested change:
- * @brief Convert 2 vectors of float and convert to uint16
+ * @brief Convert 2 vectors of float to uint16 and store them

@dali-automaton (Collaborator)

CI MESSAGE: [1899460]: BUILD PASSED

@mzient merged commit 2a9927a into NVIDIA:master on Dec 14, 2020