MultiPaste operator #2583

Conversation
Signed-off-by: Piotr Kowalewski <piotr.kowalewski.main@gmail.com>
It would be useful to split this into two PRs - one per added operator.
…d few TODO items Signed-off-by: Piotr Kowalewski <piotr.kowalewski.main@gmail.com>
dali/kernels/imgproc/paste/paste.h
#define X_AXIS 0
#define Y_AXIS 1
#define C_AXIS 2
This is wrong (reversed). Also, `#define`? Really?
Changed the X and Y order and changed `#define` to `const int`.
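For reference, the committed fix presumably looks something like this (a sketch; the exact names in the PR may differ):

// Axis indices for HWC images: rows (Y) first, then columns (X), then channels.
const int Y_AXIS = 0;
const int X_AXIS = 1;
const int C_AXIS = 2;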
dali/kernels/imgproc/paste/paste.h
using Image = InTensorCPU<InputType, 3>;
using OutImage = OutTensorCPU<OutputType>;
Why limit ourselves to 2D?
Why does the input have fixed dimensionality, but not the output?
Limited to 2D for simplicity's sake. Missing dimensionality in Out was a typo, fixed now.
dali/kernels/imgproc/paste/paste.h
void copyRoi(const OutTensorCPU<OutputType> &out, const Image &in, int inXAnchor, int inYAnchor,
             int inXShape, int inYShape, int outXAnchor, int outYAnchor) {
Suggested change:
-void copyRoi(const OutTensorCPU<OutputType> &out, const Image &in, int inXAnchor, int inYAnchor,
-             int inXShape, int inYShape, int outXAnchor, int outYAnchor) {
+void CopyROI(const OutImage &out, int outXAnchor, int outYAnchor,
+             const Image &in, int inXAnchor, int inYAnchor, int inXShape, int inYShape) {
I think that output anchors should go next to output and input anchors/shapes next to input.
Applied
.InputDox(1, "in_ids", "1D TensorList of shape [K] and type int",
    R"code(Indices of the inputs to paste data from in each iteration.)code")
.InputDox(2, "in_anchors", "2D TensorList of shape [K, 2] and type int",
    R"code(Absolute coordinates of the LU corner of the selection for each iteration.)code")
.InputDox(3, "in_shapes", "2D TensorList of shape [K, 2] and type int",
    R"code(Absolute size of the selection for each iteration.)code")
.InputDox(4, "out_anchors", "2D TensorList of shape [K, 2] and type int",
    R"code(Absolute coordinates of the LU corner of the paste for each iteration.)code")
.InputDox(5, "out_ids", "1D TensorList of shape [K] and type int",
    R"code(Indices of the outputs to paste data to in each iteration.
If omitted, the i-th tensor pastes to the i-th output.)code")
All of these should be tensor arguments, not regular inputs.
Changed in the new commit, tests were modified accordingly and pass.
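For context, named tensor arguments live in the operator schema rather than in the input list. A minimal sketch of the shape this change might take - the trailing true enables per-sample (tensor) values; the exact names, types, and defaults here are assumptions, not the committed code:

.AddOptionalArg<std::vector<int>>("in_ids",
    R"code(Indices of the inputs to paste data from.)code", nullptr, true)
.AddOptionalArg<std::vector<int>>("in_anchors",
    R"code(Absolute coordinates of the LU corner of each selection.)code", nullptr, true)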
.AddArg("output_width", | ||
R"code(Output width.)code", DALI_INT32, true) | ||
|
||
.AddArg("output_height", | ||
R"code(Output height.)code", DALI_INT32, true) |
Use one "size" argument instead.
Changed in the new commit
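A sketch of the merged argument, assuming a vector-typed argument such as DALI_INT_VEC is available (the committed schema line may differ):

.AddArg("output_size",
    R"code(Output size, in the same order as the image axes.)code", DALI_INT_VEC, true)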
.AddOptionalArg("input_out_ids", | ||
R"code(If true, the operator takes the last, out_ids input.)code", false) |
Unnecessary when using named arguments?
Removed as suggested
Signed-off-by: Piotr Kowalewski <piotr.kowalewski.main@gmail.com>
.InputDox(4, "out_anchors", "2D TensorList of shape [K, 2] and type int",
    R"code(Absolute coordinates of the LU corner of the paste for each iteration.)code")
.InputDox(5, "out_ids", "1D TensorList of shape [K] and type int",
    R"code(Indices of the outputs to paste data to in each iteration.
This kind of mapping is both redundant and dangerous.
Redundancy stems from the fact that you can just permute the input to get the desired result.
Danger comes in two flavors:
- race condition: two samples write to the same index
- missing sample: some output is never specified
It also complicates batch size reduction - and when we finally allow the batch size to change across operators in the pipeline, just specifying as many input parameter sets as there are desired output samples is the way to go.
It is very much possible to have a specific use case where one would like to paste nothing to one of the outputs - it should be left blank in that case, as no pastes target it. As for the race condition, there are 3 worker strategies (sketched after this list):
- if no_intersections is assumed, a separate work item is launched for each paste
- if no_intersections is not assumed, but out_ids is not given, one work item runs all the pastes for a given output
- if no_intersections is not assumed and out_ids is given, there is no easy way to parallelize, so one worker runs all the pastes
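A toy sketch of that dispatch; ToyThreadPool, pastes_per_output, and run_paste are illustrative stand-ins, not DALI's actual interfaces:

#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for the real thread pool: it just collects work items.
struct ToyThreadPool {
  std::vector<std::function<void()>> work;
  void AddWork(std::function<void()> f) { work.push_back(std::move(f)); }
};

// pastes_per_output[out] lists the paste jobs targeting output `out`.
void Schedule(ToyThreadPool &tp,
              const std::vector<std::vector<int>> &pastes_per_output,
              bool no_intersections, bool has_out_ids,
              const std::function<void(int, int)> &run_paste) {
  if (no_intersections) {
    // Strategy 1: no overlaps anywhere - every paste is an independent work item.
    for (size_t out = 0; out < pastes_per_output.size(); out++)
      for (int p : pastes_per_output[out])
        tp.AddWork([=] { run_paste(out, p); });
  } else if (!has_out_ids) {
    // Strategy 2: pastes may overlap, but each output owns its paste list -
    // serialize within an output, parallelize across outputs.
    for (size_t out = 0; out < pastes_per_output.size(); out++)
      tp.AddWork([=, &pastes_per_output] {
        for (int p : pastes_per_output[out]) run_paste(out, p);
      });
  } else {
    // Strategy 3: out_ids makes the mapping arbitrary - one worker runs everything.
    tp.AddWork([=, &pastes_per_output] {
      for (size_t out = 0; out < pastes_per_output.size(); out++)
        for (int p : pastes_per_output[out]) run_paste(out, p);
    });
  }
}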
[EDIT after reading the last comments] OK, now I understand. The shapes of the tensors in a TensorList need not be uniform, so I can specify any number of pastes (for example 0) for each index separately, and that's how many pastes will run to the corresponding output.
Got rid of this argument and fixed the tests accordingly.
{
  using Kernel = TheKernel<OutputType, InputType>;
  if (no_intersections_) {
    for (int i = 0; i < batch_size; i++) {
This should iterate over the output batch, not the input - see my comments about out_ids.
Changed
- use named argument inputs instead of inputs for indices and parameters
- remove out_ids unless absolutely necessary
- iterate over output batch size
- minor stuff in the comments
I think that the current assumption of having fixed K inputs is a strong one. In general, each output image can be composed of an arbitrary number of source images. Invariants:
How can I fix the DCO for a commit that was made by auto-applying suggested changes (I did not know we should not do that)?
Try an interactive rebase, edit the commit message, and force-push.
I have noticed one issue while testing. Outputs are not zeroed when declared. Do I have to run another kernel to zero the output before running the pastes (the current paste only writes where the paste should be), or is there a way to signify (in SetupImpl, in OutputDesc maybe?) that the allocated memory should be zeroed before we get it? For now the Python tests just skip checking pixels where no paste should write.
Zeroing memory is just running code that sets 0 there (cudaMemset). As the GPU is all about perf, it is up to the user to initialize it if needed. In this case I would test which is faster - zeroing the whole memory ahead of time, or pasting and then zeroing the remaining memory (I'd guess the second one).
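A minimal sketch of the first option for a GPU backend, assuming a contiguous output allocation; dev_out and out_bytes are hypothetical placeholders, and the CPU implementation later in this thread does the per-sample equivalent with plain memset:

#include <cuda_runtime.h>

// Clear the whole output buffer on the stream before launching the paste kernels.
// Pixels later covered by pastes get written twice; profiling should decide whether
// clearing only the uncovered remainder is worth the extra bookkeeping.
void ClearOutput(void *dev_out, size_t out_bytes, cudaStream_t stream) {
  cudaMemsetAsync(dev_out, 0, out_bytes, stream);
}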
… output_size. Signed-off-by: Piotr Kowalewski <piotr.kowalewski.main@gmail.com>
Signed-off-by: Piotr Kowalewski <piotr.kowalewski.main@gmail.com>
out_size=None,
even_paste_count=False,
k=4,
dtype=types.DALIDataType.UINT8,
Suggested change:
-dtype=types.DALIDataType.UINT8,
+dtype=types.UINT8,
dali/kernels/imgproc/paste/paste.h
using Coords = InTensorCPU<const int, 1>;

/**
 * Pastes regions of inputs onto the output.
Suggested change:
- * Pastes regions of inputs onto the output.
+ * @brief Pastes regions of inputs onto the output.
Nitpick (not 100% sure it's needed).
DALI_REGISTER_OPERATOR(MultiPaste, MultiPasteCpu, CPU)

bool MultiPasteCpu::SetupImpl(std::vector<OutputDesc> &output_desc,
    const workspace_t<CPUBackend> &ws) {
nitpick: Fix indentation here
{
  using Kernel = kernels::PasteCPU<OutputType, InputType>;
  auto in_view = view<const InputType, 3>(images);
  auto out_view = view<OutputType, 3>(output);
  for (int i = 0; i < batch_size; i++) {
    auto paste_count = in_idx_[i].shape[0];
    memset(out_view[i].data, 0,
           out_view[i].shape[0] * out_view[i].shape[1] * out_view[i].shape[2]);

    if (no_intersections_[i]) {
      for (int iter = 0; iter < paste_count; iter++) {
        int from_sample = in_idx_[i].data[iter];
        int to_sample = i;

        tp.AddWork(
          [&, i, iter, from_sample, to_sample, in_view, out_view](int thread_id) {
            kernels::KernelContext ctx;
            auto tvin = in_view[from_sample];
            auto tvout = out_view[to_sample];

            auto in_anchor_view = GetInAnchors(i, iter);
            auto in_shape_view = GetShape(i, iter, Coords(
                raw_input_size_mem_.data() + 2 * from_sample, dali::TensorShape<>(2)));
            auto out_anchor_view = GetOutAnchors(i, iter);
            kernel_manager_.Run<Kernel>(thread_id, to_sample, ctx, tvout, tvin,
                                        in_anchor_view, in_shape_view, out_anchor_view);
          },
          out_shape.tensor_size(to_sample));
      }
    } else {
      tp.AddWork(
        [&, i, paste_count, in_view, out_view](int thread_id) {
          for (int iter = 0; iter < paste_count; iter++) {
            int from_sample = in_idx_[i].data[iter];
            int to_sample = i;

            kernels::KernelContext ctx;
            auto tvin = in_view[from_sample];
            auto tvout = out_view[to_sample];

            auto in_anchor_view = GetInAnchors(i, iter);
            auto in_shape_view = GetShape(i, iter, Coords(
                raw_input_size_mem_.data() + 2 * from_sample, dali::TensorShape<>(2)));
            auto out_anchor_view = GetOutAnchors(i, iter);
            kernel_manager_.Run<Kernel>(thread_id, to_sample, ctx, tvout, tvin,
                                        in_anchor_view, in_shape_view, out_anchor_view);
          }
        },
        paste_count * out_shape.tensor_size(0));
    }
  }
}
I was about to suggest the same :)
DALI_ENFORCE(in_anchors_[i].shape[0] == n_paste,
    "in_anchors must be same length as in_idx");
DALI_ENFORCE(in_anchors_[i].shape[1] == spatial_ndim,
    "in_anchors must have number of coordinates equal to that of input images - 1 (channel)");
Just a suggestion, I think it'd be better for the user to have something like:
"in_anchors must have number of coordinates equal to that of input images - 1 (channel)"); | |
make_string("Unexpected number of dimensions for ``in_anchors``. Expected ", spatial_dim, ", got ", in_anchors_[i].shape[1])); |
output_size_.Acquire(spec, ws, curr_batch_size, true);
in_idx_.Acquire(spec, ws, curr_batch_size, false);
if (out_anchors_.IsDefined()) {
This is a suggestion so that you don't need to do manual validation below:

TensorListShape<2> expected_sh(curr_batch_size);
for (int i = 0; i < curr_batch_size; i++) {
  expected_sh.set_tensor_shape(i, TensorShape<2>{in_idx_[i].shape[0], spatial_ndim});
}
if (out_anchors_.IsDefined()) {
  out_anchors_.Acquire(spec, ws, curr_batch_size, expected_sh);
}
if (in_anchors_.IsDefined()) {
  in_anchors_.Acquire(spec, ws, curr_batch_size, expected_sh);
}
if (in_shapes_.IsDefined()) {
  in_shapes_.Acquire(spec, ws, curr_batch_size, expected_sh);
}

If you do this, the shape is already enforced to be expected_sh, so you can remove the validation done below (lines 120-136).
It seems this only works when enforcing uniform shapes; the last argument of this variant of Acquire is a const TensorShape, not a TensorListShape.
You are right, my mistake
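Given that, the per-sample validation stays manual. A sketch of one such check, in the make_string style suggested above (variable names follow the surrounding snippets; this is illustrative, not the committed code):

for (int i = 0; i < curr_batch_size; i++) {
  const auto n_paste = in_idx_[i].shape[0];
  if (in_anchors_.IsDefined()) {
    DALI_ENFORCE(in_anchors_[i].shape[0] == n_paste,
        make_string("Unexpected number of anchors for sample ", i, ". Expected ",
                    n_paste, ", got ", in_anchors_[i].shape[0]));
  }
}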
in_idx_l, in_anchors_l, shapes_l, out_anchors_l = prepare_cuts(
    k, batch_size, in_size, out_size, even_paste_count,
    no_intersections, full_input, in_anchor_top_left, out_anchor_top_left)
in_idx = fn.external_source(numpyize(in_idx_l))
I don't get why the function is called numpyize. The input is already a numpy array. In my opinion `lambda: in_idx_l` would read better.
It was left over from the time when it did much more logic and the input was not a numpy array. Applying this.
import math
import os
import cv2
#from test_utils import get_dali_extra_path
I think before we merge we can revert to using the one from test_utils.
def test_operator_multipaste():
    tests = [
Can you at least add a comment here mentioning the order of those test arguments? Right now I have to scroll up to check what those numbers and bools mean.
…emoved intersection checking bug and memory alloc bug. Signed-off-by: Piotr Kowalewski <piotr.kowalewski.main@gmail.com>
template<typename InputType, typename OutputType>
void MultiPasteCPU::RunImplExplicitlyTyped(workspace_t<CPUBackend> &ws) {
    const auto &images = ws.template InputRef<CPUBackend>(0);
    auto &output = ws.template OutputRef<CPUBackend>(0);

    output.SetLayout(images.GetLayout());
    auto out_shape = output.shape();

    auto& tp = ws.GetThreadPool();

    auto batch_size = output.shape().num_samples();

    using Kernel = kernels::PasteCPU<OutputType, InputType>;
    auto in_view = view<const InputType, 3>(images);
    auto out_view = view<OutputType, 3>(output);
    for (int i = 0; i < batch_size; i++) {
        auto paste_count = in_idx_[i].shape[0];
        memset(out_view[i].data, 0, out_view[i].num_elements() * sizeof(OutputType));

        if (no_intersections_[i]) {
            for (int iter = 0; iter < paste_count; iter++) {
                int from_sample = in_idx_[i].data[iter];
                int to_sample = i;

                tp.AddWork(
                    [&, i, iter, from_sample, to_sample, in_view, out_view](int thread_id) {
                        kernels::KernelContext ctx;
                        auto tvin = in_view[from_sample];
                        auto tvout = out_view[to_sample];

                        auto in_anchor_view = GetInAnchors(i, iter);
                        auto in_shape_view = GetShape(i, iter, Coords(
                            raw_input_size_mem_.data() + 2 * from_sample,
                            dali::TensorShape<>(2)));
                        auto out_anchor_view = GetOutAnchors(i, iter);
                        kernel_manager_.Run<Kernel>(
                            thread_id, to_sample, ctx, tvout, tvin,
                            in_anchor_view, in_shape_view, out_anchor_view);
                    },
                    out_shape.tensor_size(to_sample));
            }
        } else {
            tp.AddWork(
                [&, i, paste_count, in_view, out_view](int thread_id) {
                    for (int iter = 0; iter < paste_count; iter++) {
                        int from_sample = in_idx_[i].data[iter];
                        int to_sample = i;

                        kernels::KernelContext ctx;
                        auto tvin = in_view[from_sample];
                        auto tvout = out_view[to_sample];

                        auto in_anchor_view = GetInAnchors(i, iter);
                        auto in_shape_view = GetShape(i, iter, Coords(
                            raw_input_size_mem_.data() + 2 * from_sample,
                            dali::TensorShape<>(2)));
                        auto out_anchor_view = GetOutAnchors(i, iter);
                        kernel_manager_.Run<Kernel>(
                            thread_id, to_sample, ctx, tvout, tvin,
                            in_anchor_view, in_shape_view, out_anchor_view);
                    }
                },
                paste_count * out_shape.tensor_size(0));
        }
    }
    tp.RunAll();
}
Tab size is 4. Should be 2.
Signed-off-by: Piotr Kowalewski <piotr.kowalewski.main@gmail.com>
!build

CI MESSAGE: [2076628]: BUILD STARTED
            in_anchor_view, in_shape_view, out_anchor_view);
        }
    },
    paste_count * out_shape.tensor_size(0));
Suggested change:
-paste_count * out_shape.tensor_size(0));
+paste_count * out_shape.tensor_size(i));
You should use the size of the current tensor, I think
template<typename InputType, typename OutputType>
void MultiPasteCPU::RunImplExplicitlyTyped(workspace_t<CPUBackend> &ws) {
nitpick: The name is a bit too long. Perhaps RunImplExplicitlyTyped? (You can disregard this suggestion.)
CI MESSAGE: [2076628]: BUILD FAILED

Failed test in CI:
Signed-off-by: Piotr Kowalewski <piotr.kowalewski.main@gmail.com>
!build

CI MESSAGE: [2079881]: BUILD STARTED

CI MESSAGE: [2079881]: BUILD FAILED
Still:
Can I get the exact commands used to run those tests? I used nosetests as suggested and it does not reproduce.
Please download DALI_extra (it requires git-lfs).
Signed-off-by: Piotr Kowalewski <piotr.kowalewski.main@gmail.com>
It seems that "Python 3.6 does not support unpacking without parenthesis" and I was running it with 3.8. Tried with 3.6, reproduced the error and fixed by adding parenthesis. |
!build

CI MESSAGE: [2083312]: BUILD STARTED

CI MESSAGE: [2083312]: BUILD PASSED
Why do we need this PR?
What happened in this PR?
Currently, MultiPaste simply runs the (new) paste kernel K * batch_size times and is for now only implemented for the CPU. To support multiple pastes, the tensors describing regions and indices have one additional dimension accounting for the K iterations.
Nothing was changed or removed. A new operator, MultiPaste, was added, performing K * batch_size pastes of regions of images.
There were a few issues I am aware of, described in TODO comments, now mostly fixed.
Manually tested in a Mosaic implementation ([NEW VERSION] https://pastebin.com/DPuBFEUu, not sure where to put it); automatic Python tests were added to the test/python directory, unsure if they should be registered elsewhere.
Examples were not updated. Documentation for the operator has been created in-code.
JIRA TASK: NA