
Vectorize CPU resampling #2540

Merged
mzient merged 5 commits into NVIDIA:master on Dec 14, 2020

Conversation

@mzient (Contributor) commented Dec 10, 2020

Signed-off-by: Michał Zientkiewicz mzient@gmail.com

Why do we need this PR?


  • It fixes performance issues with CPU resize that appeared after introducing 3D resampling

What happened in this PR?


  • What solution was applied:
    • Add SIMD utility header to kernels/common
    • Use a multi-vector approach to utilize the full width of the input and output vectors when resampling vertically (see the conceptual sketch below)
    • Use a multi-vector approach to utilize the full width of the output when resampling horizontally
    • Tune tile size for vertical resampling
  • Affected modules and functionalities:
    • CPU resampling backend
  • Key points relevant for the review:
    • N/A
  • Validation and testing:
    • Existing tests suffice
  • Documentation (including examples):
    • N/A

JIRA TASK: DALI-1776
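
For illustration, a conceptual sketch of the multi-vector idea for vertical resampling mentioned above; the names (multivec, ResampleColsBlock, nvec) and the plain weighted-sum body are assumptions for this sketch, not the PR's actual kernel code:

#include <xmmintrin.h>

template <int nvec>
struct multivec {
  __m128 v[nvec];  // nvec SSE registers = nvec * 4 floats processed together
};

// Accumulate `support` weighted source rows into a block of nvec*4 output floats at x0.
template <int nvec>
inline void ResampleColsBlock(float *out, const float *const *in_rows,
                              const float *weights, int support, int x0) {
  multivec<nvec> acc;
  for (int i = 0; i < nvec; i++)
    acc.v[i] = _mm_setzero_ps();
  for (int k = 0; k < support; k++) {      // one pass over the filter taps...
    __m128 w = _mm_set1_ps(weights[k]);
    for (int i = 0; i < nvec; i++) {       // ...updates all nvec output vectors at once
      __m128 px = _mm_loadu_ps(in_rows[k] + x0 + 4 * i);
      acc.v[i] = _mm_add_ps(acc.v[i], _mm_mul_ps(px, w));
    }
  }
  for (int i = 0; i < nvec; i++)
    _mm_storeu_ps(out + x0 + 4 * i, acc.v[i]);
}

Vertical resampling is a natural fit for this layout because consecutive output elements in a row share the same per-row weights, so every vector lane does identical work.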

@mzient requested a review from a team on December 10, 2020 at 18:21
@dali-automaton (Collaborator)

CI MESSAGE: [1885499]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1885499]: BUILD FAILED

@dali-automaton (Collaborator)

CI MESSAGE: [1889914]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1889914]: BUILD FAILED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton (Collaborator)

CI MESSAGE: [1892417]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1892452]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1892417]: BUILD FAILED

@klecki self-assigned this on Dec 11, 2020

template <typename Out>
inline std::enable_if_t<std::is_integral<Out>::value>
store(Out *out, float4x<sizeof(float)/sizeof(Out)> f) {
Contributor Author

Convert vectors of float to one vector of Out and store.
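
For context, a hedged sketch of this pattern for Out = uint8_t; the names float4x4 and store_u8 are illustrative, not the PR's declarations. Four float vectors are rounded to int32 and packed down with saturation into one 16-byte vector:

#include <emmintrin.h>
#include <cstdint>

struct float4x4 { __m128 v[4]; };  // 4 vectors of 4 floats = 16 values

inline void store_u8(uint8_t *out, float4x4 f) {
  __m128i i0 = _mm_cvtps_epi32(f.v[0]);   // round to nearest int32
  __m128i i1 = _mm_cvtps_epi32(f.v[1]);
  __m128i i2 = _mm_cvtps_epi32(f.v[2]);
  __m128i i3 = _mm_cvtps_epi32(f.v[3]);
  __m128i lo = _mm_packs_epi32(i0, i1);   // int32 -> int16 with signed saturation
  __m128i hi = _mm_packs_epi32(i2, i3);
  __m128i u8 = _mm_packus_epi16(lo, hi);  // int16 -> uint8 with unsigned saturation
  _mm_storeu_si128(reinterpret_cast<__m128i *>(out), u8);
}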

Comment on lines 107 to 112
__m128i lo16 = _mm_unpacklo_epi8(in, zero);
__m128i hi16 = _mm_unpackhi_epi8(in, zero);
__m128i i32_0 = _mm_unpacklo_epi8(lo16, zero);
__m128i i32_1 = _mm_unpackhi_epi8(lo16, zero);
__m128i i32_2 = _mm_unpacklo_epi8(hi16, zero);
__m128i i32_3 = _mm_unpackhi_epi8(hi16, zero);
Contributor Author

interleave with zeros until 32-bit wide
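
As a self-contained illustration of the widening pattern quoted above, assuming plain SSE2 (the helper name and the final conversion to float are additions for this sketch):

#include <emmintrin.h>
#include <cstdint>

// Zero-extend 16 uint8 values to four vectors of int32, then convert them to float.
inline void u8_to_f32x16(const uint8_t *src, __m128 out[4]) {
  __m128i in   = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src));
  __m128i zero = _mm_setzero_si128();
  __m128i lo16 = _mm_unpacklo_epi8(in, zero);                // bytes 0..7  -> eight 16-bit lanes
  __m128i hi16 = _mm_unpackhi_epi8(in, zero);                // bytes 8..15 -> eight 16-bit lanes
  out[0] = _mm_cvtepi32_ps(_mm_unpacklo_epi16(lo16, zero));  // elements 0..3
  out[1] = _mm_cvtepi32_ps(_mm_unpackhi_epi16(lo16, zero));  // elements 4..7
  out[2] = _mm_cvtepi32_ps(_mm_unpacklo_epi16(hi16, zero));  // elements 8..11
  out[3] = _mm_cvtepi32_ps(_mm_unpackhi_epi16(hi16, zero));  // elements 12..15
}

The sketch uses the _epi16 unpack in the second step; the quoted code uses the _epi8 variant, which gives the same result when the other operand is all zeros (as discussed further down in this review).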

@dali-automaton (Collaborator)

CI MESSAGE: [1892452]: BUILD PASSED

}

inline void store(uint16_t *u16, i128x2 iv) {
__m128i sh = _mm_set_epi32(0, 0, 0, 1);
@mzient (Contributor Author) Dec 11, 2020

There's no instruction that can pack 32-bit values to 16-bit without signed saturation, and there's no byte shuffle either (well, there is one in SSE4.1).

Contributor

Can you add this as a comment to this function?

Contributor

So this function just cuts the MSB?

@mzient (Contributor Author) Dec 11, 2020

Per-element, it does this:

a = (unsigned)x >> 1
b = x & 1
c = convert_saturate<int16>(a)
d = convert_saturate<int16>(b)
out = c << 1 | d
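
In scalar form, and assuming the input already fits in 0..0xffff, the trick amounts to something like this hypothetical helper:

#include <cstdint>

inline uint16_t narrow_u32_to_u16(uint32_t x) {
  int16_t hi = static_cast<int16_t>(x >> 1);     // top 15 bits; never saturates for x <= 0xffff
  int16_t lo = static_cast<int16_t>(x & 1);      // the bit dropped by the shift
  return static_cast<uint16_t>((hi << 1) | lo);  // reassemble the original 16-bit value
}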

return _mm_cvtps_epi32(f); // round
}

inline void store(int8_t *i8, i128x4 iv) {
Contributor

Suggested change:
- inline void store(int8_t *i8, i128x4 iv) {
+ inline void store_converted(int8_t *i8, i128x4 iv) {

Contributor Author

True that. It should actually be store_i32.

dali/kernels/common/simd.h (outdated, resolved)
@klecki (Contributor) left a comment

Comments on the first file.
It would be nice to get those comments you put in this PR as docstrings.

dali/kernels/common/simd.h (resolved)
__m128i hi0 = _mm_srl_epi32(iv.v[0], sh);
__m128i hi1 = _mm_srl_epi32(iv.v[1], sh);
__m128i lo0 = _mm_and_si128(iv.v[0], one);
__m128i lo1 = _mm_and_si128(iv.v[0], one);
Contributor

Suggested change:
- __m128i lo1 = _mm_and_si128(iv.v[0], one);
+ __m128i lo1 = _mm_and_si128(iv.v[1], one);

Contributor Author

Hmm.. apparently we don't test this code path.

}

inline void store(uint16_t *u16, i128x2 iv) {
__m128i sh = _mm_set_epi32(0, 0, 0, 1);
Contributor

Do you think that _mm_insert_epi16(_mm_setzero_si128(), 1, 0) could help here? Instead of setting it lane by lane?

Contributor Author

_mm_set_epi32 is not an instruction; the compiler is at liberty to implement it as setzero/insert or however else it sees fit.

__m128i lo1 = _mm_and_si128(iv.v[0], one);
__m128i lo = _mm_packs_epi32(lo0, lo1);
__m128i hi = _mm_packs_epi32(hi0, hi1);
__m128i out = _mm_or_si128(_mm_sll_epi16(hi, sh), lo);
Contributor

Thinking about this, does it work in general? If you have the maximum negative number, you get LSB = 0 and the shift yields a positive number with the 31st bit lit up. That is clearly bigger than any 16-bit number, so it is saturated to the maximum positive 16-bit value. You then shift back left by one, put in LSB = 0, and end up turning the maximum negative number into the maximum uint16 value minus 1.

Contributor Author

This function doesn't advertise that it can handle out-of-range arguments. The problem is that there's no easy way (at least in SSE2) to just narrow 32 bits to 16 bits without signed saturation. It will work for all 16-bit values, but no, it won't work for values outside the 0..0xffff range.

Contributor

I mean, this function also doesn't advertise that it doesn't work for those ranges. I assumed that if you used instructions with saturation (I know there is not much choice there), you meant saturation.

Contributor Author

I've rewritten it so it properly clamps.
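
A minimal sketch of what clamping in float before the conversion can look like (the follow-up commit message mentions clamping done in float; the limits and the helper name here are assumptions, not the PR's actual code):

#include <emmintrin.h>

// Hypothetical helper: clamp to the uint16 range while still in float, then round-convert.
inline __m128i round_clamped_to_u16_range(__m128 f) {
  f = _mm_max_ps(f, _mm_setzero_ps());       // clamp below at 0
  f = _mm_min_ps(f, _mm_set1_ps(65535.0f));  // clamp above at 65535
  return _mm_cvtps_epi32(f);                 // round to nearest int32; values now fit in uint16
}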

Comment on lines 89 to 92
__m128i i32_0 = _mm_unpacklo_epi8(lo16, zero);
__m128i i32_1 = _mm_unpackhi_epi8(lo16, zero);
__m128i i32_2 = _mm_unpacklo_epi8(hi16, zero);
__m128i i32_3 = _mm_unpackhi_epi8(hi16, zero);
Contributor

Shouldn't it be the _epi16 suffix here? Or does it not matter for the result?

Contributor Author

In general it should be, but it doesn't matter when we're inserting zeros.

}

template <typename In>
DALI_FORCEINLINE static multivec load(const In *in) noexcept {
Contributor

Don't you have to specify the correct num_vecs based on the input type?
Can't you compute it from the input type instead? It seems error-prone to me.

Contributor Author

No. It will load as many values from input as is necessary to fill the multivec.
I can add

static_assert(load_vecs > 0, "Too few lanes to use vector load for this input type");

Contributor

I mean, if you try to load uint16 into the float4x1, it will just probably segfault. There is no check for that, IMO it should be

static_assert(num_vecs % load_vecs == 0);

or something similar.

@klecki (Contributor) Dec 11, 2020

Now the implementation of multivec is tied to a usage that assumes the correct storage size for loading, and this is supposed to be a generic header.

Contributor Author

Fixed?

Contributor

Yup.


if (static_channels < 0) {
// we don't know how many channels we have at compile time - inner loop over filter
for (int c = 0; c < channels; c++) {
Contributor

What difference would it make if we had variants for a predefined number of channels, e.g. 1 and 3?

Contributor Author

Then it takes the branch below (static_channels), where channels is effectively a constant expression, so it's quite well optimized. We could use specialized variants for 1, 2 and 4 channels; 3 not so much, since interleaving and deinterleaving triples on x86 is a pain.
We could use it on ARM NEON, though, with its beautiful interleaved loads and stores.
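
For illustration, a minimal sketch of the kind of dispatch described here; the names and the placeholder loop body are assumptions (the real kernel applies the resampling filter). When the runtime channel count matches a known value, an instantiation with static_channels baked in is selected, so the inner loop bound becomes a compile-time constant:

template <int static_channels>
void ResampleRowImpl(float *out, const float *in, int out_width, int channels) {
  const int ch = static_channels < 0 ? channels : static_channels;
  for (int x = 0; x < out_width; x++)
    for (int c = 0; c < ch; c++)         // with static_channels > 0 the compiler can unroll this
      out[x * ch + c] = in[x * ch + c];  // placeholder body; the real kernel applies filter taps
}

void ResampleRow(float *out, const float *in, int out_width, int channels) {
  switch (channels) {
    case 1:  ResampleRowImpl<1>(out, in, out_width, channels); break;
    case 3:  ResampleRowImpl<3>(out, in, out_width, channels); break;
    case 4:  ResampleRowImpl<4>(out, in, out_width, channels); break;
    default: ResampleRowImpl<-1>(out, in, out_width, channels); break;  // dynamic channel count
  }
}

With a fixed channel count the compiler can fully unroll the channel loop and keep per-pixel work in registers; the generic -1 path keeps correctness for arbitrary channel counts.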

Contributor

So when the number of channels is not static?

Contributor

I guess it's for some weird cases not covered by the static switch over the number of channels, like 7 or 16.

Comment on lines +188 to +189
for (int l = 0; l < kNumLanes; l++)
tmp_coeffs[l] = coeffs[(x + l) * support + k]; // interleave per-column coefficients
Contributor

Just an idea: do you think that interleaving the coefficients ahead of this loop could help a bit?
It could then be done once for all the rows, couldn't it?

Contributor Author

Let's wait for another perf. complaint ;)


@klecki (Contributor) left a comment

The resampling looks fine; I would gladly see docs in simd.h if you want to treat it as a generic utility.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
… (clamping done in float).

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@mzient (Contributor Author) commented Dec 11, 2020

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1893906]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1893906]: BUILD FAILED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton (Collaborator)

CI MESSAGE: [1894120]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1894120]: BUILD FAILED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton (Collaborator)

CI MESSAGE: [1899460]: BUILD STARTED

}

/**
* @brief Convert 2 vectors of float and convert to uint16
Contributor

Suggested change:
- * @brief Convert 2 vectors of float and convert to uint16
+ * @brief Convert 2 vectors of float to uint16 and store them

@dali-automaton (Collaborator)

CI MESSAGE: [1899460]: BUILD PASSED

@mzient merged commit 2a9927a into NVIDIA:master on Dec 14, 2020