
[#1381] Change cuMemFreeAsync to cuMemFree in vmaf_cuda_picture_free()#1382

Open
ha7sh17 wants to merge 1 commit into Netflix:master from ha7sh17:feature/#1381-VMAF-CUDA-vmaf_cuda_picture_free()-Assertion-0-failed

Conversation

@ha7sh17

@ha7sh17 ha7sh17 commented Jul 15, 2024

This PR is a fix to #1381

Description

  • We have identified that when using the libvmaf_cuda filter (vmaf 3.0.0) with ffmpeg in a virtual-memory-constrained environment, an assertion failure occurs in vmaf_cuda_picture_free() when ffmpeg terminates.

  • You can easily reproduce the issue by running the ffmpeg command with the libvmaf_cuda filter after setting ulimit -v 16777216. (16777216 KB is 16 GB, which is ample virtual memory.)

ulimit -v 16777216;ffmpeg -i a.mp4 -i b.mp4 -filter_complex "[0:v]hwupload_cuda[main];[1:v]hwupload_cuda[ref];[main][ref]libvmaf_cuda=log_fmt=json:log_path=vmaf_log.json" -f null -
code: 2; description: CUDA_ERROR_OUT_OF_MEMORY
ffmpeg: ../src/cuda/picture_cuda.c:226: vmaf_cuda_picture_free: Assertion `0' failed.
Aborted (core dumped)

According to the API documentation, cuMemFreeAsync() does not return CUDA_ERROR_OUT_OF_MEMORY. However, in this abnormal situation it returns CUDA_ERROR_OUT_OF_MEMORY.

We have confirmed that using the synchronous memory free API, cuMemFree(), resolves the issue. We propose modifying the code to use cuMemFree() for freeing memory until the underlying cause of the issue with the CUDA asynchronous API is resolved.
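Concretely, the substance of the proposed one-line change in libvmaf/src/cuda/picture_cuda.c can be sketched as a diff (argument names are taken from the change description; surrounding context is abridged):

```diff
- cuMemFreeAsync(ptr, priv->cuda.str);
+ cuMemFree(ptr);
```

Since cuMemFree() is synchronous, the free completes before the call returns, at the cost of an implicit device synchronization.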

@nilfm99 nilfm99 requested a review from kylophone July 15, 2024 15:52
@nilfm99
Collaborator

nilfm99 commented Jul 15, 2024

Tagging @gedoensmax

@gedoensmax
Contributor

gedoensmax commented Jul 17, 2024

When does this crash occur? During processing, or only after processing for a while? When using CUDA, FFmpeg should use the internally preallocated pictures, so this free should only be called after all pictures have been processed, when closing the context.
EDIT: Could you check if this reproduces with the env variable CUDA_LAUNCH_BLOCKING=1 set?

@ha7sh17
Author

ha7sh17 commented Jul 19, 2024

Dear @gedoensmax

When does this crash occur? During processing, or only after processing for a while? When using CUDA, FFmpeg should use the internally preallocated pictures, so this free should only be called after all pictures have been processed, when closing the context.

The VMAF score is output correctly, but an assertion occurs when ffmpeg exits.

Reproducing the issue is very easy.
I can reproduce it 100% by setting a virtual memory limit with ulimit -v and using the libvmaf_cuda filter with ffmpeg on a CentOS 7 & Nvidia T4 server.

The key point for reproducing the issue is setting the virtual memory limit using ulimit -v.

ulimit -v 16777216;ffmpeg -i a.mp4 -i b.mp4 -filter_complex "[0:v]hwupload_cuda,scale_npp=1920:1080:format=yuv420p[main];[1:v]hwupload_cuda,scale_npp=1920:1080:format=yuv420p[ref];[main][ref]libvmaf_cuda=log_fmt=json:log_path=vmaf_log.json" -f null -
[Parsed_libvmaf_cuda_4 @ 0x5273d80] VMAF score: 99.723612
code: 2; description: CUDA_ERROR_OUT_OF_MEMORY
ffmpeg: ../src/cuda/picture_cuda.c:226: vmaf_cuda_picture_free: Assertion `0' failed.
Aborted

EDIT: Could you check if this reproduces with the env variable CUDA_LAUNCH_BLOCKING=1 set?

I followed your guide and set the environment variable as shown below, but the same issue continues to occur.

export CUDA_LAUNCH_BLOCKING=1;ulimit -v 16777216;ffmpeg -i a.mp4 -i b.mp4 -filter_complex "[0:v]hwupload_cuda,scale_npp=1920:1080:format=yuv420p[main];[1:v]hwupload_cuda,scale_npp=1920:1080:format=yuv420p[ref];[main][ref]libvmaf_cuda=log_fmt=json:log_path=vmaf_log.json" -f null -
[Parsed_libvmaf_cuda_4 @ 0x5273d80] VMAF score: 99.723612
code: 2; description: CUDA_ERROR_OUT_OF_MEMORY
ffmpeg: ../src/cuda/picture_cuda.c:226: vmaf_cuda_picture_free: Assertion `0' failed.
Aborted

@gedoensmax
Contributor

OK, so when the internal memory pool is used correctly, moving to the synchronous version of this API should not have any performance impact. If that is not possible, though, and pictures are allocated and freed dynamically, it will introduce a CUDA synchronization.
Is there any reason the virtual memory limit is set that low? As far as I understand, the driver grows a memory pool for these async allocations, which requires some virtual address space.

@ha7sh17
Author

ha7sh17 commented Jul 23, 2024

Is there any reason the virtual memory limit is set that low?

We limit the virtual memory before running the ffmpeg process because of an unresolved memory leak in ffmpeg.
This way, only the specific ffmpeg process terminates abnormally, without affecting the entire system.
If we did not limit the virtual memory, the ffmpeg process would keep consuming virtual memory, eventually causing the system to hang.
(Of course, this is not a common situation but a very special case; think of it as a preventive measure for such cases.)

OK, so when the internal memory pool is used correctly, moving to the synchronous version of this API should not have any performance impact. If that is not possible, though, and pictures are allocated and freed dynamically, it will introduce a CUDA synchronization.

We understand what you mean.
If you, like us, had to limit virtual memory, how many GB would you allow to ensure that ffmpeg + libvmaf_cuda operates correctly?
I understand that this is not an easy question to answer. :)

@gedoensmax
Contributor

As you say, this is not an easy question, and I think if your change fixes the problem you should use it. Do you need the change to be in the main branch, though? Or maybe you could expose the synchronous free through an option?

@ha7sh17
Author

ha7sh17 commented Jul 23, 2024

As you mentioned, we can apply this fix only in our branch and it does not need to be applied to the main branch.
We just wanted to inform you about this issue as dedicated users of ffmpeg + libvmaf_cuda.
Therefore, it is okay to close this PR without merging.
Thank you for your kind response.

lusoris pushed a commit to lusoris/vmaf that referenced this pull request Apr 20, 2026
Replace cuMemFreeAsync(ptr, priv->cuda.str) with synchronous
cuMemFree(ptr) at libvmaf/src/cuda/picture_cuda.c:247 to fix the
assertion-0 crash in vmaf_cuda_picture_free() reported upstream as
Netflix#1381 and addressed by the open upstream PR Netflix#1382.

The fork already issues cuStreamSynchronize(priv->cuda.str) two
lines earlier, so the async variant offered no overlap benefit -
cuMemFree is both correct and non-regressive for performance.

Fixes T0-1 in .workingdir2/BACKLOG.md. First commit of Batch-A
upstream small-fix sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
lusoris added a commit to lusoris/vmaf that referenced this pull request Apr 20, 2026
…4/0135) (#72)

* fix(cuda): port Netflix#1382 cuMemFreeAsync -> cuMemFree (ADR-0131)

Replace cuMemFreeAsync(ptr, priv->cuda.str) with synchronous
cuMemFree(ptr) at libvmaf/src/cuda/picture_cuda.c:247 to fix the
assertion-0 crash in vmaf_cuda_picture_free() reported upstream as
Netflix#1381 and addressed by the open upstream PR Netflix#1382.

The fork already issues cuStreamSynchronize(priv->cuda.str) two
lines earlier, so the async variant offered no overlap benefit -
cuMemFree is both correct and non-regressive for performance.

Fixes T0-1 in .workingdir2/BACKLOG.md. First commit of Batch-A
upstream small-fix sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(feature): port Netflix#1406 mount/unmount model-list bugfix (ADR-0132)

Fix vmaf_feature_collector_mount_model list corruption on >=3 mounts
(the previous **head traversal mutated the list instead of walking a
local cursor) and align vmaf_feature_collector_unmount_model to
return -ENOENT for not-found instead of -EINVAL.

Test coverage extended to a 3-element mount/unmount sequence with
insertion-order verification. Upstream duplicated the setup across
both tests; refactored into a shared load_three_test_models /
destroy_three_test_models helper to keep each test body under the
JPL Power-of-10 rule-4 size threshold.

T4-4 in .workingdir2/BACKLOG.md. Second commit of Batch-A.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* build(meson): port Netflix#1451 declare_dependency + override_dependency (ADR-0134)

Appends declare_dependency(link_with: libvmaf, include_directories:
[libvmaf_inc]) + meson.override_dependency('libvmaf', libvmaf_dep) to
libvmaf/src/meson.build so consumers can use the fork as a meson
subproject with the standard dependency('libvmaf') idiom. Fork uses
trailing-comma style to match fork build-file conventions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(api): port Netflix#1424 built-in model version iterator (ADR-0135)

Add public vmaf_model_version_next(prev, &version) iterator. Ports
upstream PR#1424's API surface verbatim while correcting three
latent defects during port:

- NULL-pointer arithmetic UB on first call (upstream missed an
  else between the two if-branches, so NULL+1 on the second).
- Off-by-one returning the {0} sentinel at end of iteration —
  condition must be idx+1 < CNT, not idx < CNT.
- const-qualifier mismatches in the test (upstream used char* /
  void* against a const-qualified API, not allowed in C11).

Early-returns NULL when BUILT_IN_MODEL_CNT == 0 so zero-models
build configurations link cleanly. Test asserts the iterator both
hands out the stored version pointer and visits every model exactly
once. Doxygen header-doc replaces upstream's one-line comment.

docs/api/index.md gains a programmatic-discovery example per the
CLAUDE §12 r10 per-surface doc rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(batch-a): ADR-0108 deep-dive deliverables for upstream port sweep

Bundles the five in-tree deep-dive deliverables for the Batch-A PR
(ADR-0108 §Per-surface minimum bars; sixth deliverable — the
reproducer/smoke-test — lives in the PR description).

- docs/research/0009-batch-a-upstream-port-strategy.md: research
  digest covering why port OPEN upstream PRs now, per-PR defect
  analysis (three latent bugs corrected in Netflix#1424, test refactor in
  Netflix#1406), port-from-PR-tip rationale, and batch-shape justification.
- CHANGELOG.md: Added entries for T4-5 (meson declare_dependency) and
  T4-6 (vmaf_model_version_next iterator); Fixed entries for T0-1
  (CUDA cuMemFree) and T4-4 (feature_collector list corruption).
- docs/rebase-notes.md §0031: Touches / Invariant / Re-test for all
  four ports with explicit "keep fork version on upstream conflict"
  resolution policy and file-by-file conflict predictions.
- libvmaf/src/feature/AGENTS.md: new rebase-sensitive invariant for
  feature_collector mount/unmount traversal + shared test helpers.
- libvmaf/src/cuda/AGENTS.md: new Rebase-sensitive invariants section
  for picture_cuda.c synchronous cuMemFree contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* style(feature-collector): clang-format test helpers from Batch-A T4-4 port

The `load_three_test_models` signature and one `mu_assert` split
arguments across two lines where .clang-format prefers a single
line — caught by pre-commit on the Batch-A deep-dive pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
