
Video Neva Pretraining + Inference Implementation #9095

Merged: 23 commits into NVIDIA:main on May 3, 2024

Conversation

paul-gibbons (Collaborator)
What does this PR do ?

This PR enables users to train NeVa models on video data by slicing videos into a specified number of frames.

Co-authored-by: Pratyush Muthukumar pmuthukumar@nvidia.com, Vivian Chen xuanzic@nvidia.com, Slyne Deng slyned@nvidia.com.
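The core idea (slicing a video into a fixed number of frames) amounts to picking evenly spaced frame indices across the clip. A minimal illustrative sketch of that sampling step, not the PR's actual implementation:

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` evenly spaced frame indices covering the whole clip."""
    if total_frames <= num_frames:
        # Clip is shorter than the requested count: take every frame.
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


# A 100-frame clip sliced to 4 frames samples indices [0, 25, 50, 75];
# the selected frames would then go through the image preprocessor as a batch.
print(sample_frame_indices(100, 4))
```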

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
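Since the template snippet was left empty, here is a hypothetical inference-config fragment assembled from the fields this PR touches (`media_base_path` and `media_type` are taken from the review thread and commit messages below; all values are illustrative, not documented defaults):

```yaml
inference:
  end_strings: ["<extra_id_1>", "<extra_id_7>"]  # generation stops on these tokens
  insert_image_token: null    # `left`, `right`, or `null`
  media_type: video           # switch NeVa between image and video inputs
  media_base_path: /pwd/videos  # directory holding the input media files
```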

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

```diff
@@ -11,7 +11,9 @@ inference:
   compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False
   end_strings: ["<extra_id_1>","<extra_id_7>",] # generation will stop when one of these tokens is generated
   images_base_path: /pwd/images
+  insert_image_token: null # `left` or `right` or `null`
+  videos_base_path: null #/pwd/videos
```
Collaborator: Is this necessary? If we don't have a mixture of video + image, maybe just use a single media base path?

Contributor: [Resolved] Changed to media_base_path in the latest commit.

```diff
@@ -0,0 +1,92 @@
+## Inference with multimodal
```
Contributor: [Resolved] Moved under /docs.

```diff
@@ -126,6 +135,22 @@ def forward_loop():
     if responses is None:
         return
 
+    results = []
```
Collaborator: Duplicated code! Remove this part and merge it into L154 below.

Contributor: [Resolved]

```python
else:
    frames = processor.preprocess(frames, return_tensors='pt')['pixel_values']

if neva_cfg.precision in [16, '16', '16-mixed']:
```
Collaborator: Instead of checking precision values directly, use

`def torch_dtype_from_precision(precision: Union[int, str], megatron_amp_O2: Optional[bool] = None) -> torch.dtype:`

Contributor: [resolved]
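The suggested helper maps a trainer precision flag to the matching torch dtype, so callers don't have to enumerate the `16 / '16' / '16-mixed'` spellings themselves. A minimal sketch of that mapping (an illustrative reimplementation for context, not NeMo's exact code):

```python
from typing import Optional, Union

import torch


def torch_dtype_from_precision(precision: Union[int, str],
                               megatron_amp_O2: Optional[bool] = None) -> torch.dtype:
    """Map a trainer precision flag to the corresponding torch dtype."""
    if precision in ['bf16', 'bf16-mixed']:
        return torch.bfloat16
    if precision in [16, '16', '16-mixed']:
        return torch.float16
    if precision in [32, '32', '32-true']:
        return torch.float32
    raise ValueError(f"Could not parse precision value: {precision!r}")
```

With this helper, the branch above collapses to a single call such as `model.to(torch_dtype_from_precision(neva_cfg.precision))`.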

```python
sources = preprocess_multimodal(
    copy.deepcopy(list_data_dict), multimodal_cfg, num_media_latents
)  # HARDCODED FOR NOW
num_media_latents = min((num_media_latents // 14) * (num_media_latents // 14), 576)
```
Collaborator: What do these changes mean? We don't want to hardcode 576 here.

Contributor (@xuanzic, May 2, 2024): This line exists because for video NeVa the actual image_token_len we need to pass is 256, so we need some logic to convert the original model_config.data.image_token_len from 224 to 256. However, if we don't hardcode 576 here, it breaks NeVa v1.5 inference, which uses image_token_len = 576. I would appreciate suggestions for handling this more elegantly.

Contributor: [resolved]
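For context on the arithmetic being debated: the `min((x // 14) ** 2, 576)` expression converts a square crop size in pixels into a ViT patch-token count (assuming 14 px patches, which the `// 14` suggests), capped at the 576 tokens NeVa v1.5 uses. A quick check of the two cases discussed above:

```python
def patch_token_count(image_size: int, patch_size: int = 14, cap: int = 576) -> int:
    # A square crop yields (image_size // patch_size) ** 2 patch tokens,
    # capped at the 576 tokens NeVa v1.5 expects.
    return min((image_size // patch_size) ** 2, cap)


print(patch_token_count(224))  # video NeVa: 16 * 16 = 256 tokens
print(patch_token_count(336))  # NeVa v1.5: 24 * 24 = 576 tokens, at the cap
```

This shows why the cap matters: without it, any crop larger than 336 px would push v1.5 past its expected 576-token length.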

```diff
@@ -372,6 +379,8 @@ def neva_process_prompts(prompt, tokenizer, multimodal_cfg, num_media_latents, c
         turn['value'] = re.sub('<image>', f'{DEFAULT_IMAGE_TOKEN}\n', turn['value'])
         list_data_dict.append(record)
 
+    num_media_latents = min((num_media_latents // 14) * (num_media_latents // 14), 576)
```
Collaborator: Remove this line? It was already added above.

Contributor: Same comment as above.

Contributor: [resolved]

```diff
@@ -385,6 +394,7 @@ def neva_process_prompts(prompt, tokenizer, multimodal_cfg, num_media_latents, c
     if turn.get('value') is not None:
         turn['value'] = re.sub('<image>', f'{DEFAULT_IMAGE_TOKEN}\n', turn['value'])
     list_data_dict.append(record)
+    num_media_latents = min((num_media_latents // 14) * (num_media_latents // 14), 576)
```
Collaborator: Remove this line? It was already added above.

Contributor: [resolved]

xuanzic and others added 2 commits May 2, 2024 19:33
Signed-off-by: Vivian Chen <xuanzic@nvidia.com>
Collaborator: Please remove this file. If you need it, we can add it to assets later instead of keeping it in the GitHub source code.

Contributor: [resolved]

xuanzic and others added 13 commits May 2, 2024 19:41
Signed-off-by: Vivian Chen <xuanzic@nvidia.com>
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
This reverts commit 80af9a4.
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
This reverts commit 8c885c7.
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
This reverts commit 94aba65.
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
Signed-off-by: Vivian Chen <xuanzic@nvidia.com>
Signed-off-by: Vivian Chen <xuanzic@nvidia.com>
@yaoyu-33 added and removed the Run CICD label on May 2, 2024
@pablo-garay pablo-garay merged commit c5a5a79 into NVIDIA:main May 3, 2024
129 of 130 checks passed
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
* video_neva pretrain

* support video neva inference

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>

* yaml update, adding media_type

* yaml update, adding media_type

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* modify neva inference config

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>

* modify based on review

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove video test asset

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>

* video_neva doc, describing config changes.

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* Revert "video_neva doc, describing config changes."

This reverts commit 1a02ccd.

* vneva brief doc

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* vneva doc update

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* doc update

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* Revert "doc update"

This reverts commit 80af9a4.

* doc update

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* Revert "doc update"

This reverts commit 8c885c7.

* doc update

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* Revert "doc update"

This reverts commit 94aba65.

* doc update

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* add inference doc to docs, resolve review

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>

* modify inference config for other mlm

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>

---------

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
Co-authored-by: Vivian Chen <xuanzic@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>