Add VideoReaderDecoder GPU #3668

Merged
merged 36 commits into NVIDIA:main on Feb 18, 2022

Conversation

awolant
Contributor

@awolant awolant commented Feb 10, 2022

Category:

New feature: Adds VideoReaderDecoderGpu op. This operator reads and decodes video files using NVDECODE API. It supports both CFR and VFR videos.

It provides basic functionality for now. Additional features (more formats, codecs, output types, and input variants) will be added in subsequent tasks.

Description:

Additional information:

Affected modules and functionalities:

Added new operator and loader. Adjusted FramesDecoderGpu as some minor changes were needed (ability to set the stream after construction).

Key points relevant for the review:

Does this operator properly interface with FramesDecoderGpu?

Does this operator properly implement the DALI Reader abstraction, given that it does not exactly fit it?

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-2593

Signed-off-by: Albert Wolant <awolant@nvidia.com>
@JanuszL JanuszL self-assigned this Feb 10, 2022

// TODO(awolant): Extract decoding outside of ReadSample (ReaderDecoder abstraction)
for (int i = 0; i < sequence_len_; ++i) {
// TODO(awolant): This seek can be optimized - for consecutive frames not needed etc.
Contributor

Maybe we can optimize the seek itself and keep it here even with the optimization?

Contributor Author

I moved the comment to Seek. This will be done as DALI-2320.

}
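To illustrate the seek optimization discussed above (deferred to DALI-2320): a seek can be skipped entirely when the requested frame is the one the decoder would produce next anyway. This is a minimal self-contained sketch with a hypothetical `MockDecoder`, not DALI's actual FramesDecoder API:

```cpp
#include <cassert>

// Hypothetical sketch (not DALI's FramesDecoder): a decoder that tracks the
// index of the frame it would decode next, and only performs the expensive
// container seek when the requested frame is not that one.
class MockDecoder {
 public:
  explicit MockDecoder(int num_frames) : num_frames_(num_frames) {}

  // Returns true if a real seek was performed, false if it could be skipped.
  bool SeekFrame(int frame_id) {
    if (frame_id == next_frame_) {
      return false;  // consecutive read: already positioned, no seek needed
    }
    next_frame_ = frame_id;  // stand-in for the actual container-level seek
    return true;
  }

  // Decoding a frame advances the implicit position by one.
  void ReadNextFrame() { next_frame_ = (next_frame_ + 1) % num_frames_; }

 private:
  int num_frames_;
  int next_frame_ = 0;
};
```

With this shape, reading a contiguous sequence of frames triggers at most one real seek, at the first frame.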

void VideoLoaderDecoderGpu::PrepareMetadataImpl() {
video_files_.reserve(filenames_.size());
Contributor

@JanuszL JanuszL Feb 10, 2022

OK, so we have as many FramesDecoderGpu instances as input files (including the decoder instances inside).
I'm not sure how many of them we can have in parallel.

Contributor Author

Solving this properly is part of DALI-2321, to be done when we have the benchmark (DALI-2594). Before that, it is hard to tell anything about the performance impact of any possible solution.

Contributor

I think it is not about the performance, but rather about resource constraints. Creating 1000 decoders and parsers will consume a lot of resources.
Also, we have already hit the maximum number of files opened in parallel in the old VideoReader (libavutil).
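As an aside to the reviewer's point about open-file limits: each per-file decoder keeps its video open, so the per-process descriptor limit becomes relevant. A small POSIX check (illustrative only, not part of this PR; the helper name is made up) shows how to query it:

```cpp
#include <sys/resource.h>
#include <cassert>
#include <cstdio>

// Query and print the limits on open file descriptors for this process.
// With one FramesDecoderGpu (and its open file) per input video, a large
// enough file list can run into the soft limit below.
bool PrintOpenFileLimit() {
  rlimit lim{};
  if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
    return false;  // query failed
  std::printf("open files: soft=%llu hard=%llu\n",
              (unsigned long long)lim.rlim_cur,
              (unsigned long long)lim.rlim_max);
  return true;
}
```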

output_shape.resize(batch_size);

for (int sample_id = 0; sample_id < batch_size; ++sample_id) {
auto &sample = current_batch[sample_id];
Contributor

Suggested change
auto &sample = current_batch[sample_id];
auto &sample = GetSample(sample_id);

Contributor Author

Done

void VideoReaderDecoderGpu::RunImpl(DeviceWorkspace &ws) {
auto &video_output = ws.Output<GPUBackend>(0);
auto &current_batch = prefetched_batch_queue_[curr_batch_consumer_];
int batch_size = current_batch.size();
Contributor

Suggested change
int batch_size = current_batch.size();
int batch_size = GetCurrBatchSize();

Contributor Author

Done

video_output.Resize(output_shape, current_batch[0]->data_.type());

for (int sample_id = 0; sample_id < batch_size; ++sample_id) {
auto &sample = current_batch[sample_id];
Contributor

Suggested change
auto &sample = current_batch[sample_id];
auto &sample = GetSample(sample_id);

Contributor Author

Done

output_shape.set_tensor_shape(sample_id, sample->data_.shape());
}

video_output.Resize(output_shape, current_batch[0]->data_.type());
Contributor

Suggested change
video_output.Resize(output_shape, current_batch[0]->data_.type());
video_output.Resize(output_shape, GetSample(0)->data_.type());

Contributor Author

Done

output_shape.set_tensor_shape(sample_id, sample->data_.shape());
}

video_output.Resize(output_shape, current_batch[0]->data_.type());
Contributor

Maybe we can add SetupImpl and deal with shapes there?

Contributor Author

I moved decoding into RunImpl. This means that we do not have the shape of the output ready during SetupImpl.
We could pass the size from the FramesDecoder through VideoSampleDesc, but I don't want to do this, as it would limit us in the future, when we optimize index building and might not know the size without decoding.
When this is more or less feature complete, we can revisit this refactoring, OK?
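For context, the trade-off being discussed can be summarized in a rough, non-compilable sketch, assuming DALI's usual operator interface (CanInferOutputs / SetupImpl / RunImpl); the bodies here are illustrative, not the PR's actual code:

```cpp
// Sketch only: when output shapes are not known until decoding happens,
// the operator reports that outputs cannot be inferred up front and
// resizes the output inside RunImpl instead of SetupImpl.
bool VideoReaderDecoderGpu::CanInferOutputs() const {
  return false;  // shape is known only after the frames are decoded
}

void VideoReaderDecoderGpu::RunImpl(DeviceWorkspace &ws) {
  // Decode each sample, collect per-sample shapes into output_shape,
  // then Resize the output once and copy the decoded frames in.
}
```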

Signed-off-by: Albert Wolant <awolant@nvidia.com>
Comment on lines 235 to 251
this->SaveFrame(
frame_cpu.data(),
i,
sample_id,
sequence_id,
"/home/wazka/Downloads/frames/reader/",
this->Width(video_idx),
this->Height(video_idx));

this->SaveFrame(
this->GetVfrFrame(video_idx, gt_frame_id + i * stride),
i,
sample_id,
sequence_id,
"/home/wazka/Downloads/frames/gt/",
this->Width(video_idx),
this->Height(video_idx));
Contributor

Leftover?

Contributor Author

Yes, I had that stashed for debugging. Removed.

@JanuszL JanuszL self-requested a review February 16, 2022 17:59
Signed-off-by: Albert Wolant <awolant@nvidia.com>
@@ -66,8 +66,12 @@ class DLL_PUBLIC FramesDecoderGpu : public FramesDecoder {

int NextFramePts() { return index_[NextFrameIdx()].pts; }

void SetCudaStream(cudaStream_t stream) { stream_ = stream; }
Contributor

Do you use it anywhere now? Or do we keep going on the default stream?

Contributor Author

No, I removed this. The stream is set during construction now.

Signed-off-by: Albert Wolant <awolant@nvidia.com>
@awolant
Contributor Author

awolant commented Feb 17, 2022

!build

@dali-automaton
Collaborator

CI MESSAGE: [3985100]: BUILD STARTED

// TODO(awolant): Check per decoder stream
cudaStream_t stream;
DeviceGuard dg(device_id_);
CUDA_CALL(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));
Contributor

Consider using CUDAStream, or even better, just lease one from the pool:

dali::CUDAStreamPool::instance().Get(device_id_);

Contributor Author

We can't use CUDAStream, as it is derived from UniqueHandle and I want to share this stream between decoders for now.

}

auto &labels_output = ws.Output<GPUBackend>(1);
vector<int> labels_cpu(batch_size);
Contributor

SmallVector<int, 256> will save you an allocation for most batch sizes.

Contributor Author

Done. I used a smaller value, as batch sizes in video use cases tend to be smaller.
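For readers unfamiliar with the suggestion: a SmallVector avoids a heap allocation by storing up to N elements in an inline buffer, falling back to the heap only when it overflows. A minimal self-contained illustration of the idea (this is NOT DALI's SmallVector, just a sketch of the technique):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy small-buffer-optimized vector: up to N elements live inline, so no
// heap allocation happens for small sizes; larger sizes spill to a
// std::vector. Illustrative only.
template <typename T, size_t N>
class TinySmallVector {
 public:
  void push_back(const T &v) {
    if (size_ < N) {
      inline_buf_[size_] = v;
    } else {
      // First overflow: copy the inline elements to the heap buffer.
      if (heap_.empty()) heap_.assign(inline_buf_, inline_buf_ + N);
      heap_.push_back(v);
    }
    ++size_;
  }
  T &operator[](size_t i) { return size_ <= N ? inline_buf_[i] : heap_[i]; }
  size_t size() const { return size_; }
  bool on_heap() const { return size_ > N; }

 private:
  T inline_buf_[N]{};
  std::vector<T> heap_;  // fallback storage once the inline buffer overflows
  size_t size_ = 0;
};
```

With `SmallVector<int, 256>` (or a smaller N, as chosen here), a typical video batch of labels never touches the allocator.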

Signed-off-by: Albert Wolant <awolant@nvidia.com>
@dali-automaton
Collaborator

CI MESSAGE: [3985100]: BUILD PASSED

Signed-off-by: Albert Wolant <awolant@nvidia.com>
@awolant
Contributor Author

awolant commented Feb 17, 2022

!build

@dali-automaton
Collaborator

CI MESSAGE: [3985950]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [3985950]: BUILD FAILED

@awolant
Contributor Author

awolant commented Feb 17, 2022

!build

@dali-automaton
Collaborator

CI MESSAGE: [3987291]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [3987291]: BUILD FAILED

@awolant
Contributor Author

awolant commented Feb 17, 2022

!build

@dali-automaton
Collaborator

CI MESSAGE: [3988097]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [3988097]: BUILD PASSED

@awolant awolant merged commit a407c54 into NVIDIA:main Feb 18, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request Feb 21, 2022
* Add VideoReaderDecoderGpu op

Signed-off-by: Albert Wolant <awolant@nvidia.com>
@JanuszL JanuszL mentioned this pull request Mar 30, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022
* Add VideoReaderDecoderGpu op

Signed-off-by: Albert Wolant <awolant@nvidia.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022
* Add VideoReaderDecoderGpu op

Signed-off-by: Albert Wolant <awolant@nvidia.com>