
iter-to-iter variable batch size #2481

Merged: 2 commits merged into NVIDIA:master from variable_bs_test on Jan 22, 2021
Conversation

@szalpal (Member) commented on Nov 18, 2020

TODO:

Signed-off-by: szalpal <mszolucha@nvidia.com>

What happened in this PR:

  1. Update the C API to support i2i variable batch size
  2. Update nvJpegDecoder and its flavours to support variable batch size
  3. Update the Constant op for the same reason
  4. Partially split the Executor into header and source files to make development more convenient
  5. Add variable batch size routines to the Executor (batch size queues, PreRun with batch size inference from the graph)
  6. Introduce the BatchSizeProvider interface and add it to ExternalSource
  7. Add a feature to CachingList (the inner ExternalSource memory) so that it can traverse the data list asynchronously w.r.t. the current head and tail of the list
  8. Relax the constant-batch-size constraint so that it only requires a uniform batch size within a single iteration
  9. Adjust the Python API
  10. Add an extensive test for i2i variable batch size (every operator is tested)

Why do we need this PR?

  • To enable i2i (iteration-to-iteration) variable batch size

JIRA TASK: [Use DALI-XXXX or NA]
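To make the intent concrete, here is a minimal sketch (not code from this PR) of the flow it enables: an external_source whose source yields a differently sized batch each iteration, with the pipeline's batch_size acting only as an upper bound. The exact Pipeline arguments, the pass-through pipeline, and the batch sizes are illustrative assumptions.

```python
import numpy as np
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn

batch_sizes = [3, 1, 4]  # hypothetical per-iteration batch sizes, each <= the pipeline's max

def input_data():
    # One list of samples per iteration; the list length is that iteration's batch size.
    for bs in batch_sizes:
        yield [np.random.randint(0, 255, size=(480, 640, 3), dtype=np.uint8)
               for _ in range(bs)]

pipe = Pipeline(batch_size=4, num_threads=2, device_id=0)  # 4 acts as the maximum batch size
with pipe:
    data = fn.external_source(source=input_data(), cycle=False, device="cpu")
    pipe.set_outputs(data)
pipe.build()

for expected_bs in batch_sizes:
    out, = pipe.run()
    assert len(out) == expected_bs  # each iteration keeps its own batch size
```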

@szalpal (Member, Author) commented on Dec 18, 2020

!build

@dali-automaton (Collaborator): CI MESSAGE: [1924708]: BUILD STARTED
@dali-automaton (Collaborator): CI MESSAGE: [1924708]: BUILD FAILED

@szalpal marked this pull request as ready for review on December 18, 2020 04:01
@szalpal changed the title from "Test for iter-to-iter variable batch size" to "iter-to-iter variable batch size" on Dec 18, 2020
@szalpal (Member, Author) commented on Dec 18, 2020

!build

@dali-automaton (Collaborator): CI MESSAGE: [1926219]: BUILD STARTED
@dali-automaton (Collaborator): CI MESSAGE: [1926219]: BUILD FAILED

@jantonguirao self-assigned this on Jan 7, 2021
Resolved review threads:
  dali/operators/ssd/random_crop.cc
  dali/c_api/c_api.cc (one thread on outdated code, one current)
  dali/python/nvidia/dali/pipeline.py (outdated)
  dali/test/python/test_dali_variable_batch_size.py (3 threads, outdated)
@szalpal (Member, Author) commented on Jan 7, 2021

!build

@dali-automaton (Collaborator): CI MESSAGE: [1963736]: BUILD STARTED
@dali-automaton (Collaborator): CI MESSAGE: [1963736]: BUILD FAILED

exec_pipelined=False)
with pipe:
data = fn.external_source(source=input_data, cycle=False, device=device)
processed = fn.python_function(data, function=resize, num_outputs=1)
Contributor:

Suggested change
    processed = fn.python_function(data, function=resize, num_outputs=1)
    processed = fn.python_function(data, function=resize)

DALI_ENFORCE(bsps[0]->NextBatchSize() == bsps[i]->NextBatchSize(),
"Batch size must be uniform across an iteration");
}
auto batch_size = bsps[0]->NextBatchSize();
Contributor: Suggestion: move the variable to line 317 and use it in the loop.

@szalpal (Member, Author) commented on Jan 8, 2021

!build

@dali-automaton (Collaborator): CI MESSAGE: [1964913]: BUILD STARTED
@dali-automaton (Collaborator): CI MESSAGE: [1964913]: BUILD FAILED
@dali-automaton (Collaborator): CI MESSAGE: [1965134]: BUILD STARTED
@dali-automaton (Collaborator): CI MESSAGE: [1965142]: BUILD STARTED
@dali-automaton (Collaborator): CI MESSAGE: [1965140]: BUILD STARTED
@dali-automaton (Collaborator): CI MESSAGE: [1965143]: BUILD STARTED
@dali-automaton (Collaborator): CI MESSAGE: [1965151]: BUILD FAILED

@szalpal (Member, Author) commented on Jan 11, 2021

!build

@dali-automaton (Collaborator): CI MESSAGE: [1969460]: BUILD STARTED
@dali-automaton (Collaborator): CI MESSAGE: [1969460]: BUILD PASSED

@klecki (Contributor) left a review:

A few files are still missing; I would like to check the side-casts a bit more on my own.
My major gripe is mixing the C API with the C++ API; there are not enough tests for the former.

Also, the prophet part probably doesn't happen in any scenario, but looking at the isolated CachingList it looks like a bug.


template <typename Backend>
struct is_backend {
static constexpr bool value = std::is_same<Backend, CPUBackend>::value ||
Contributor: Just a small nitpick, but three specializations plus a false default could probably be faster (that is just guessing); I rarely measure perf for such things.


namespace dali {

class BatchSizeProvider {
Contributor: Maybe add some doc on what this class/interface is supposed to represent, instead of usage patterns only?

@szalpal (Author): Done


template <typename WorkspacePolicy, typename QueuePolicy>
template <typename Workspace>
void Executor<WorkspacePolicy, QueuePolicy>::RunHelper(OpNode &op_node, Workspace &ws) {
Contributor: I would put the RunHelper at the top of the functions that use it, but whatever.

Comment on lines 256 to 257
make_string("Expected batch size lower or equal to max batch size. Actual: ",
ws.GetInputBatchSize(i), " <= ", max_batch_size_));
Contributor: I'm wondering whether printing "500 <= 128" is the best idea; maybe:

Suggested change
    make_string("Expected batch size lower or equal to max batch size. Actual: ",
                ws.GetInputBatchSize(i), " <= ", max_batch_size_));
    make_string("Expected batch size lower or equal to max batch size. Expected at most: ",
                max_batch_size_, ", got: ", ws.GetInputBatchSize(i)));

@szalpal (Author): Done


template <typename WorkspacePolicy, typename QueuePolicy>
void Executor<WorkspacePolicy, QueuePolicy>::RunCPU() {
PreRun();
Contributor: I wanted to suggest that PreRun shouldn't be called from inside RunCPU, but that is probably complicated by the fact that it would need to be adjusted everywhere, since those RunStage functions are used directly by users (in C++).

The implementation looks a bit weird to me with those push/pops; I need to analyze it a bit more.

@szalpal (Author): We decided that PreRun belongs here after all. It is weird, but we don't have a better idea ATM.

#include <string>
#include <vector>
#include <memory>
#include <chrono>
Contributor: chrrrono (pointing at the <chrono> include)

@szalpal (Author): Removed

"tensor with ", batch_size, " elements. Got:\n", shape));
}
return valid_shape;
return true;
Contributor: This worries me. Why don't you do any verification? Where is this used?

@szalpal (Author): Fixed. Added the removed lines back and added one more check to operator.cc::EnforceUniformBatchSize.

"Attempted to step over the last element in the list. This operation is forbidden. Add "
"more elements to CachingList before calling AdvanceProphet.");
apprentice_ = prophet_;
advance(prophet_, 1);
Contributor: What's wrong with prophet_++?
Also, what if our prophet_ (or apprentice_) is popped out of the list and put in the empty list? Wouldn't we start advancing in the set of empty nodes?

@szalpal (Author): Done

@@ -415,9 +494,10 @@ class ExternalSource : public Operator<Backend> {
*/
struct ExternalSourceState {
bool copied_shared_data = false;
size_t batch_size = 0;
Contributor: I see only modifications putting the current bs in, but no access to it?

@szalpal (Author): Removed

0-255, with shape like [640, 480, 3]) and you want to test default arguments
only, please add a record to the `ops_image_default_args` list
2. If the operator is typically processing image-like data (i.e. 3-dim, uint8,
0-255, with shape like [640, 480, 3]) and you want to specify any number of
Contributor: How about more demanding scenarios (where maybe the shape is a bit more fuzzy)?
Also, I didn't look for it yet, but do you test things like FCHW sequences? There might be some weirdness in sequence-processing operators.

@szalpal (Author): Added sequence tests
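For reference, a minimal sketch of what per-iteration batches of FCHW sequences could look like for such a test; the helper name, layout, and sizes below are assumptions for illustration, not the test code added in this PR.

```python
import numpy as np

def fchw_sequence_batches(batch_sizes, frames=8, channels=3, height=64, width=64):
    """Yield one batch per iteration; each sample is an FCHW uint8 sequence."""
    for bs in batch_sizes:
        yield [np.random.randint(0, 255, size=(frames, channels, height, width), dtype=np.uint8)
               for _ in range(bs)]

# e.g. three iterations with batch sizes 2, 5 and 1
batches = list(fchw_sequence_batches([2, 5, 1]))
```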

@klecki (Contributor) left a review:

Finished the rest. There appear to be some issues in ImageDecoder, and I would like to see the tests for sequences and with non-uniform input batches.

Comment on lines 77 to 82
if (spec_.GetSchema().HasArgument("hw_decoder_load")) {
hw_decoder_load_ = spec.GetArgument<float>("hw_decoder_load");
try_init_hw_decoder = true;
} else {
hw_decoder_load_ = 0;
}
Contributor: This part makes the argument optional without a default. HasArgument is true only for arguments that were explicitly specified by the user.

}

if (hw_decoder_bs_ > 0 &&
if (try_init_hw_decoder &&
Contributor: By removing the part of the code that sets hw_decoder_bs_ you are making nvjpegDecodeBatchedPreAllocate use batch=0. It probably needs to be adjusted to use max_batch_size * hw_decoder_load_ in the nvjpegDecodeBatchedPreAllocate call.

@szalpal (Author): Done

@@ -909,6 +894,36 @@ class nvJPEGDecoder : public Operator<MixedBackend>, CachedDecoderImpl {
RegisterDiagnostic("using_hw_decoder", &using_hw_decoder_);
}

int CalcHwDecoderBatchSize(float hw_decoder_load, int curr_batch_size) {
if (hw_decoder_load == .0f) return 0;
Contributor:

Suggested change
    if (hw_decoder_load == .0f) return 0;
    if (hw_decoder_load == 0.f) return 0;

@szalpal (Author): Done

if isinstance(sample_shape, tuple):
size = sample_shape
elif inspect.isgeneratorfunction(sample_shape):
size = sample_shape()
Contributor: I thought it would be called for every sample instead of once. That would allow passing a random shape generator and getting some variability in shapes.

Maybe it's worth having some more randomness in the input (non-uniform cases)?

@szalpal (Author): Done
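To illustrate the per-sample variant discussed above, here is a small sketch where sample_shape may be a callable that is invoked once per sample, so every sample can get its own shape; the names and value ranges are assumptions, not the actual test helper.

```python
import numpy as np

fixed_shape = (480, 640, 3)  # tuple: every sample gets the same shape

def random_shape():
    # callable: a fresh shape per call
    return (np.random.randint(100, 600), np.random.randint(100, 800), 3)

def make_batch(sample_shape, batch_size, lo=0, hi=255):
    batch = []
    for _ in range(batch_size):
        # Calling the callable here, once per sample, is what gives shape variability.
        size = sample_shape() if callable(sample_shape) else sample_shape
        batch.append(np.random.randint(lo, hi, size=size, dtype=np.uint8))
    return batch

uniform_batch = make_batch(fixed_shape, 4)   # four samples, all (480, 640, 3)
fuzzy_batch = make_batch(random_shape, 4)    # four samples, four different shapes
```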

shape of every sample.
:param lo:
:param hi:
:param dtype: 'int' for uint8 or 'float' for float32
Contributor: Those are just numpy types; the doc is wrong.

@szalpal (Author): Done

"`sample_shape` shall be either a tuple or a callable. Provide `(val,)` tuple for 1D shape")

if np.issubdtype(dtype, np.integer):
return [np.random.randint(lo, hi, size=(bs,) + size, dtype=dtype) for bs in
Contributor: If we want non-uniform shapes, we probably need to return a nested list of tensors instead of generating a numpy array of shape (bs, ...).

@szalpal (Author): Done differently, but there is non-uniformity now.
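A sketch of the contrast the comment draws: a single array of shape (bs, ...) forces every sample to share one shape, while a nested list with one array per sample allows ragged batches. The helper names and value ranges are illustrative assumptions.

```python
import numpy as np

def uniform_batch(bs, size, lo=0, hi=255):
    # One array of shape (bs, ...): every sample has the same shape.
    return np.random.randint(lo, hi, size=(bs,) + size, dtype=np.uint8)

def ragged_batch(shapes, lo=0.0, hi=1.0):
    # One array per sample: shapes may differ within the batch.
    return [np.random.uniform(lo, hi, size=s).astype(np.float32) for s in shapes]

same_shapes = uniform_batch(4, (480, 640, 3))
mixed_shapes = ragged_batch([(200, 300, 3), (480, 640, 3), (120, 100, 3)])
```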

Comment on lines 424 to 490
data = data * 2
data = data + 3
data = data - 4
data = data / 5
data = data // 6
Contributor: Can you check unary -, ternary math.clamp, and throw in a binary op with two DALI tensor arguments, so something like data + data2?

@szalpal (Author): Done
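A minimal sketch of the extra expressions requested above (unary minus, ternary clamp, and a binary op with two DALI inputs), under the assumption that nvidia.dali.math.clamp and arithmetic on DataNodes behave as in current DALI; this is not the test code added in the PR.

```python
import numpy as np
import nvidia.dali.fn as fn
import nvidia.dali.math as dmath
from nvidia.dali.pipeline import Pipeline

def constant_batches(value):
    # Endless source: three 2x2 float32 samples per iteration.
    while True:
        yield [np.full((2, 2), value, dtype=np.float32) for _ in range(3)]

pipe = Pipeline(batch_size=3, num_threads=1, device_id=0)
with pipe:
    data = fn.external_source(source=constant_batches(10), device="cpu")
    data2 = fn.external_source(source=constant_batches(4), device="cpu")
    neg = -data                         # unary minus
    clamped = dmath.clamp(data, 1, 5)   # ternary math.clamp
    summed = data + data2               # binary op with two DALI tensor arguments
    pipe.set_outputs(neg, clamped, summed)
pipe.build()

neg_out, clamped_out, sum_out = pipe.run()
```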

@szalpal (Member, Author) commented on Jan 21, 2021

!build

@dali-automaton (Collaborator): CI MESSAGE: [2000529]: BUILD STARTED
@dali-automaton (Collaborator): CI MESSAGE: [2000529]: BUILD FAILED

@klecki (Contributor) left a review:

Some nitpicks for subshape and the enforce checks.

@@ -183,14 +205,20 @@ void daliPrefetchSeparate(daliPipelineHandle *pipe_handle,
}


void daliSetExternalInputBatchSize(daliPipelineHandle *pipe_handle, const char *name,
int batch_size) {
auto bs_map = reinterpret_cast<batch_size_map_t *>(pipe_handle->batch_sizes_map);
Contributor:

Suggested change
    auto bs_map = reinterpret_cast<batch_size_map_t *>(pipe_handle->batch_sizes_map);
    auto *bs_map = reinterpret_cast<batch_size_map_t *>(pipe_handle->batch_sizes_map);

I like the auto *. :P

assert(begin >= 0);
auto div_ceil = [](int x, int y) { return 1 + ((x - 1) / y); };
int nsamples = div_ceil(end - begin, step);
std::vector<TensorShape<ndims>> dst_shapes(nsamples);
Contributor: Why not create a TensorListShape directly? As far as I remember it's movable, you already have sample_dim() (ndims can be negative), and you can just call .set_tensor_shape(j, in[i]);

static_assert(!std::is_same<Backend, MixedBackend>::value,
"MixedBatch doesn't have an accessible output");
for (int i = 0; i < ws.NumOutput(); i++) {
auto ref_batch_size =
Contributor: You can move this out of the loop.

@@ -293,6 +296,7 @@ class Operator<CPUBackend> : public OperatorBase {
SetupSharedSampleParams(ws);
RunImpl(ws);
ws.GetThreadPool().WaitForWork();
// EnforceUniformOutputBatchSize<CPUBackend>(ws);
Contributor: Why is this commented out?

return SetupImpl(output_desc, ws);
}

void Run(DeviceWorkspace &ws) override {
CheckInputLayouts(ws, spec_);
SetupSharedSampleParams(ws);
RunImpl(ws);
// EnforceUniformOutputBatchSize<GPUBackend>(ws);
Contributor: Same here?


template <typename Backend>
void OperatorBase::EnforceUniformOutputBatchSize(const workspace_t<Backend> &ws) const {
static_assert(!std::is_same<Backend, MixedBackend>::value,
Contributor: It should be possible for Mixed AFAIK.

@@ -428,7 +429,8 @@ class DLL_PUBLIC Pipeline {
DLL_PUBLIC int num_outputs() const;

/**
* @brief Returns a string describing the type device type backing the output specified by given id.
* @brief Returns a string describing the type device type backing the output specified by given
Contributor: I still don't understand "the type device type".

Signed-off-by: Michał Szołucha <mszolucha@nvidia.com>
@szalpal (Member, Author) commented on Jan 21, 2021

!build

@dali-automaton (Collaborator): CI MESSAGE: [2002764]: BUILD STARTED

Signed-off-by: Michał Szołucha <mszolucha@nvidia.com>
@szalpal (Member, Author) commented on Jan 21, 2021

!build

@dali-automaton (Collaborator): CI MESSAGE: [2002862]: BUILD STARTED
@dali-automaton (Collaborator): CI MESSAGE: [2002862]: BUILD PASSED

@szalpal merged commit 3dd70d6 into NVIDIA:master on Jan 22, 2021
@szalpal deleted the variable_bs_test branch on February 9, 2024