NewWarpAffine -> WarpAffine; optimize CPU warp for affine mapping. #1387

mzient · 2019-10-15T18:27:46Z

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>

Why we need this PR?

It replaces the old WarpAffine with the new implementation.

What happened in this PR?

- Renaming
- Adjusted tests
- Added specialized implementation for CPU warp affine (strength reduction)
- Rename output_type to output_dtype and border to fill_value to match old API

JIRA TASK: [DALI-1095]

mzient · 2019-10-15T18:35:43Z

dali/test/dali_test_single_op.h

@@ -34,6 +34,12 @@ namespace dali {

 #define SAVE_TMP_IMAGES 0

+#if SAVE_TMP_IMAGES
+namespace {
+  int tmp_img_idx = 0;


Global indexing: this avoids overwriting results from different template instances.

Do we consider any thread safety here?
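If thread safety did matter for this debug counter, one option would be to make it atomic. The following is only a sketch under that assumption; `NextTmpImageName` and the file-name pattern are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <atomic>
#include <string>

namespace {

// Atomic counter shared by all template instances in this translation unit.
std::atomic<int> tmp_img_idx{0};

// Hypothetical helper: returns a distinct file name on every call,
// even when called concurrently from several threads.
std::string NextTmpImageName(const std::string &prefix) {
  // fetch_add returns the previous value, so no two callers see the same index.
  int idx = tmp_img_idx.fetch_add(1, std::memory_order_relaxed);
  return prefix + "_" + std::to_string(idx) + ".png";
}

}  // namespace
```

Relaxed ordering is enough here because only the uniqueness of the index matters, not its ordering relative to other memory operations.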

mzient · 2019-10-15T18:36:47Z

dali/kernels/imgproc/warp_cpu.h

+
+    Sampler<static_interp, InputType> sampler(in);
+
+    vec2 dsdx = mapping.transform.col(0);


Since affine mapping is linear, we can do this trick to have one vector addition instead of matrix multiplication per pixel.

I'd write that as a comment here

Will do. Also, I'll add tiling to prevent excessive accumulation of error in case of very wide images.
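For illustration, the trick and the proposed tiling can be sketched roughly as follows. This is not the code from warp_cpu.h: `Vec2`, `Affine2D`, `Apply`, `RowSourceCoords` and the tile size of 256 are hypothetical stand-ins chosen for this example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-ins; DALI's own vec2 and matrix types are not reproduced here.
struct Vec2 { float x, y; };
struct Affine2D { float m[2][3]; };  // row-major 2x3 matrix [A | t]

// Reference path: one full affine transform per pixel.
inline Vec2 Apply(const Affine2D &a, float x, float y) {
  return { a.m[0][0] * x + a.m[0][1] * y + a.m[0][2],
           a.m[1][0] * x + a.m[1][1] * y + a.m[1][2] };
}

// Strength-reduced path: source coordinates for one output row.
inline std::vector<Vec2> RowSourceCoords(const Affine2D &a, int y, int width,
                                         int tile = 256) {
  std::vector<Vec2> out(width);
  // Column 0 of the matrix: change in source position per unit step in x.
  const Vec2 dsdx = { a.m[0][0], a.m[1][0] };
  for (int x0 = 0; x0 < width; x0 += tile) {
    // Re-seed with an exact transform at the start of each tile, so that
    // rounding error accumulates over at most `tile` additions.
    Vec2 src = Apply(a, static_cast<float>(x0), static_cast<float>(y));
    int x1 = x0 + tile < width ? x0 + tile : width;
    for (int x = x0; x < x1; x++) {
      out[x] = src;
      src.x += dsdx.x;  // one vector addition instead of a matrix multiply
      src.y += dsdx.y;
    }
  }
  return out;
}
```

Within a tile, each pixel costs two additions; the full transform is evaluated only once per tile, which bounds the drift on very wide rows.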

mzient · 2019-10-15T18:38:33Z

!build

dali-automaton · 2019-10-15T19:32:02Z

CI MESSAGE: [945947]: BUILD STARTED

dali-automaton · 2019-10-16T02:34:04Z

CI MESSAGE: [945947]: BUILD FAILED

mzient · 2019-10-16T15:21:16Z

!build

dali-automaton · 2019-10-16T15:26:48Z

CI MESSAGE: [947389]: BUILD STARTED

dali-automaton · 2019-10-16T15:49:31Z

CI MESSAGE: [947389]: BUILD FAILED

jantonguirao · 2019-10-17T07:19:28Z

dali/kernels/imgproc/warp_cpu.h

@@ -85,38 +87,56 @@ class WarpCPU {
  }

 private:
-  template <DALIInterpType static_interp>
+  template <DALIInterpType static_interp, typename Mapping_>


why not Mapping (no underscore) ?

Because when Mapping is AffineMapping2D, we would end up with multiple definitions of the same function.
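One plausible reading of this, sketched with hypothetical names (`WarpSketch`, `RunImpl`, `PerspectiveMapping`; only `AffineMapping2D` and `Mapping_` appear in the PR): if the generic overload took the class-level `Mapping` directly, instantiating the class with `AffineMapping2D` would declare two member functions with identical signatures. Making the generic overload a member template avoids that clash.

```cpp
#include <cassert>

struct AffineMapping2D {};
struct PerspectiveMapping {};  // hypothetical second mapping type

template <typename Mapping>
class WarpSketch {
 public:
  int Run(const Mapping &mapping) { return RunImpl(mapping); }

 private:
  // Generic path. Written as `int RunImpl(const Mapping &)` it would collide
  // with the overload below whenever Mapping is AffineMapping2D.
  template <typename Mapping_>
  int RunImpl(const Mapping_ &) { return 0; }

  // Specialized affine path; a non-template overload is preferred over the
  // template when both match equally well.
  int RunImpl(const AffineMapping2D &) { return 1; }
};
```

With this layout, `WarpSketch<AffineMapping2D>` selects the specialized overload and any other mapping type falls back to the generic one.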

jantonguirao · 2019-10-17T07:20:14Z

dali/kernels/imgproc/warp_cpu.h

+
+    Sampler<static_interp, InputType> sampler(in);
+
+    vec2 dsdx = mapping.transform.col(0);


I'd write that as a comment here

jantonguirao · 2019-10-17T07:38:54Z

dali/test/python/test_pipeline.py

-                                         fill_value = 128,
-                                         interp_type = types.INTERP_LINEAR,
-                                         use_image_center = True)
+                                            matrix = [1.0, 0.8, -0.8*112, 0.0, 1.2, -0.2*112],


can you fix the indentation here?

If there's something more substantial, I'll do that, otherwise it'll go with rotate PR. Is that OK?

jantonguirao · 2019-10-17T07:39:50Z

dali/test/python/test_operator_warp.py

@@ -63,9 +63,9 @@ def warp_fixed(img):
  return warp_fn


-class NewWarpPipeline(Pipeline):
+class WarpPipeline(Pipeline):


could you run (locally is enough) a simple test that compares WarpPipeline to NewWarpPipeline (use compare_pipelines for that)

There's a test in test_pipeline.py which compares against OpenCV and it still passes after the switch. I only needed to remove the use_image_center flag, which is no longer supported (it makes little sense for WarpAffine).

jantonguirao · 2019-10-17T07:40:29Z

dali/benchmark/CMakeLists.txt

@@ -25,6 +25,7 @@ if (BUILD_BENCHMARK)
    "${CMAKE_CURRENT_SOURCE_DIR}/displacement_cpu_bench.cc"
    "${CMAKE_CURRENT_SOURCE_DIR}/crop_bench.cc"
    "${CMAKE_CURRENT_SOURCE_DIR}/crop_mirror_normalize_bench.cc"
+    "${CMAKE_CURRENT_SOURCE_DIR}/warp_affine_gpu_bench.cc"


TODO: warp affine CPU benchmark

jantonguirao · 2019-10-17T07:45:04Z

dali/operators/displacement/displacement_test.cc

  const OpArg params = {"matrix", "1.0, 0.8, 0.0, 0.0, 1.2, 0.0", DALI_FLOAT_VEC};
-  this->RunTest("WarpAffine", &params, 1);
+  this->RunTest("OldWarpAffine", &params, 1, false, 0.1 /* 0.1 percent */);


After we:

- verify (with compare_pipelines in Python) that both WarpAffine and OldWarpAffine produce the same results (with a reasonable eps)
- fix and run the CPU benchmarks and confirm there is no performance degradation

I'd say we remove OldWarpAffine completely.

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>

mzient · 2019-10-17T15:09:39Z

!build

dali-automaton · 2019-10-17T15:15:04Z

CI MESSAGE: [949161]: BUILD STARTED

Removed OldWarpAffine. Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>

mzient · 2019-10-17T15:32:08Z

!build

dali-automaton · 2019-10-17T15:35:13Z

CI MESSAGE: [949199]: BUILD STARTED

dali-automaton · 2019-10-17T16:39:50Z

CI MESSAGE: [949199]: BUILD PASSED

jantonguirao · 2019-10-18T07:25:48Z

include/dali/core/geom/transform.h

@@ -97,6 +97,7 @@ DALI_HOST_DEV
 constexpr vec<out_n> affine(const mat<out_n, in_n + 1> &transform, const vec<in_n> &v) {
  vec<out_n> out = {};
  for (int i = 0; i < out_n; i++) {
+    // NOTE: accumulating directly in out[i] prodced noticeably slower code in GCC 7.4


Suggested change

-    // NOTE: accumulating directly in out[i] prodced noticeably slower code in GCC 7.4
+    // NOTE: accumulating directly in out[i] produced noticeably slower code in GCC 7.4

Nice catch. Thanks! Will do the same as with indent.
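For context, the pattern that the NOTE refers to (accumulating in a local variable and storing to out[i] once) might look like the sketch below. It uses `std::array` in place of DALI's `vec` and `mat` types, and assumes the translation sits in the last matrix column; both are assumptions made for this example, not a copy of transform.h.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Sketch: out = A * v + t, where `transform` is [A | t] stored row by row.
template <std::size_t out_n, std::size_t in_n>
std::array<float, out_n> AffineSketch(
    const std::array<std::array<float, in_n + 1>, out_n> &transform,
    const std::array<float, in_n> &v) {
  std::array<float, out_n> out = {};
  for (std::size_t i = 0; i < out_n; i++) {
    // Accumulate in a local scalar rather than directly in out[i].
    float sum = transform[i][in_n];  // translation term (assumed last column)
    for (std::size_t j = 0; j < in_n; j++)
      sum += transform[i][j] * v[j];
    out[i] = sum;  // single store per output component
  }
  return out;
}
```

Whether the local accumulator is faster depends on the compiler and version; the NOTE only reports the observation for GCC 7.4.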

mzient requested review from jantonguirao, klecki, JanuszL and awolant October 15, 2019 18:27

mzient force-pushed the ReplaceWarp branch from 221a545 to 910074e Compare October 15, 2019 18:35

mzient force-pushed the ReplaceWarp branch 2 times, most recently from 1fb6d43 to fc9644b Compare October 15, 2019 18:38

mzient force-pushed the ReplaceWarp branch from fc9644b to e23d71c Compare October 16, 2019 15:19

jantonguirao reviewed Oct 17, 2019

mzient added 4 commits October 17, 2019 09:52

NewWarpAffine -> WarpAffine; optimize CPU warp for affine mapping.

1e3e638

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>

Improved warp performance.h

615cb23

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>

Fix benchmark.

0060732

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>

Fix augmentation gallery.

8bab154

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>

mzient force-pushed the ReplaceWarp branch 3 times, most recently from dc82910 to 4edfca4 Compare October 17, 2019 15:08

mzient force-pushed the ReplaceWarp branch from 4edfca4 to 78b3f2f Compare October 17, 2019 15:14

mzient force-pushed the ReplaceWarp branch from 78b3f2f to a336e92 Compare October 17, 2019 15:26

CPU benchmarks restored.

96ba69f

Removed OldWarpAffine. Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>

mzient force-pushed the ReplaceWarp branch from a336e92 to 96ba69f Compare October 17, 2019 15:31

jantonguirao reviewed Oct 18, 2019

jantonguirao approved these changes Oct 18, 2019

JanuszL approved these changes Oct 18, 2019

mzient merged commit 1e5b845 into NVIDIA:master Oct 18, 2019

JanuszL mentioned this pull request Oct 18, 2019

How is WarpAffine supposed to be used? #1404

Closed
