
Video Neva Pretraining + Inference Implementation #9095

Merged: 23 commits into NVIDIA:main on May 3, 2024

Conversation

paul-gibbons (Collaborator)
What does this PR do ?

This PR enables users to train NeVa models on video data by slicing videos into a specified number of frames.

Co-authored-by: Pratyush Muthukumar pmuthukumar@nvidia.com, Vivian Chen xuanzic@nvidia.com, Slyne Deng slyned@nvidia.com.
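The core idea (slicing a video into a fixed number of frames) amounts to picking evenly spaced frame indices across the clip. A minimal illustrative sketch of that sampling step, not the PR's actual implementation:

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` evenly spaced frame indices covering the whole clip."""
    if total_frames <= num_frames:
        # Clip is shorter than the requested count: take every frame.
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


# A 100-frame clip sliced to 4 frames samples indices [0, 25, 50, 75];
# the selected frames would then go through the image preprocessor as a batch.
print(sample_frame_indices(100, 4))
```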

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
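Since the template snippet was left empty, here is a hypothetical inference-config fragment assembled from the fields this PR touches (`media_base_path` and `media_type` are taken from the review thread and commit messages below; all values are illustrative, not documented defaults):

```yaml
inference:
  end_strings: ["<extra_id_1>", "<extra_id_7>"]  # generation stops on these tokens
  insert_image_token: null    # `left`, `right`, or `null`
  media_type: video           # switch NeVa between image and video inputs
  media_base_path: /pwd/videos  # directory holding the input media files
```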

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

```diff
@@ -11,7 +11,9 @@ inference:
   compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False
   end_strings: ["<extra_id_1>","<extra_id_7>",] # generation will stop when one of these tokens is generated
   images_base_path: /pwd/images
+  insert_image_token: null # `left` or `right` or `null`
+  videos_base_path: null #/pwd/videos
```
Collaborator: Is this necessary? If we don't have a mixture of video + image, maybe just use a single media base path?

Contributor: [Resolved] Changed to media_base_path in the latest commit.

```diff
@@ -0,0 +1,92 @@
+## Inference with multimodal
```
Contributor: [Resolved] Moved under /docs.

```diff
@@ -126,6 +135,22 @@ def forward_loop():
     if responses is None:
         return
 
+    results = []
```
Collaborator: Duplicated code! Remove this part and merge it into L154 below.

Contributor: [Resolved]

```python
else:
    frames = processor.preprocess(frames, return_tensors='pt')['pixel_values']

if neva_cfg.precision in [16, '16', '16-mixed']:
```
Collaborator: Instead of checking precision values directly, use

`def torch_dtype_from_precision(precision: Union[int, str], megatron_amp_O2: Optional[bool] = None) -> torch.dtype:`

Contributor: [resolved]
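The suggested helper maps a trainer precision flag to the matching torch dtype, so callers don't have to enumerate the `16 / '16' / '16-mixed'` spellings themselves. A minimal sketch of that mapping (an illustrative reimplementation for context, not NeMo's exact code):

```python
from typing import Optional, Union

import torch


def torch_dtype_from_precision(precision: Union[int, str],
                               megatron_amp_O2: Optional[bool] = None) -> torch.dtype:
    """Map a trainer precision flag to the corresponding torch dtype."""
    if precision in ['bf16', 'bf16-mixed']:
        return torch.bfloat16
    if precision in [16, '16', '16-mixed']:
        return torch.float16
    if precision in [32, '32', '32-true']:
        return torch.float32
    raise ValueError(f"Could not parse precision value: {precision!r}")
```

With this helper, the branch above collapses to a single call such as `model.to(torch_dtype_from_precision(neva_cfg.precision))`.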

```python
sources = preprocess_multimodal(
    copy.deepcopy(list_data_dict), multimodal_cfg, num_media_latents
)  # HARDCODED FOR NOW
num_media_latents = min((num_media_latents // 14) * (num_media_latents // 14), 576)
```
Collaborator: What do these changes mean? We don't want to hardcode 576 here.

Contributor (@xuanzic, May 2, 2024): This line exists because for video NeVa the actual image_token_len we need to pass is 256, so we need some logic to convert the original model_config.data.image_token_len from 224 to 256. However, if we don't hardcode 576 here, it breaks NeVa v1.5 inference, which uses image_token_len = 576. I would appreciate suggestions for handling this more elegantly.

Contributor: [resolved]
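For context on the arithmetic being debated: the `min((x // 14) ** 2, 576)` expression converts a square crop size in pixels into a ViT patch-token count (assuming 14 px patches, which the `// 14` suggests), capped at the 576 tokens NeVa v1.5 uses. A quick check of the two cases discussed above:

```python
def patch_token_count(image_size: int, patch_size: int = 14, cap: int = 576) -> int:
    # A square crop yields (image_size // patch_size) ** 2 patch tokens,
    # capped at the 576 tokens NeVa v1.5 expects.
    return min((image_size // patch_size) ** 2, cap)


print(patch_token_count(224))  # video NeVa: 16 * 16 = 256 tokens
print(patch_token_count(336))  # NeVa v1.5: 24 * 24 = 576 tokens, at the cap
```

This shows why the cap matters: without it, any crop larger than 336 px would push v1.5 past its expected 576-token length.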

```diff
@@ -372,6 +379,8 @@ def neva_process_prompts(prompt, tokenizer, multimodal_cfg, num_media_latents, c
         turn['value'] = re.sub('<image>', f'{DEFAULT_IMAGE_TOKEN}\n', turn['value'])
         list_data_dict.append(record)
 
+    num_media_latents = min((num_media_latents // 14) * (num_media_latents // 14), 576)
```
Collaborator: Remove this line? It was already added above.

Contributor: Same comment as above.

Contributor: [resolved]

```diff
@@ -385,6 +394,7 @@ def neva_process_prompts(prompt, tokenizer, multimodal_cfg, num_media_latents, c
     if turn.get('value') is not None:
         turn['value'] = re.sub('<image>', f'{DEFAULT_IMAGE_TOKEN}\n', turn['value'])
     list_data_dict.append(record)
+    num_media_latents = min((num_media_latents // 14) * (num_media_latents // 14), 576)
```
Collaborator: Remove this line? It was already added above.

Contributor: [resolved]

xuanzic and others added 2 commits May 2, 2024 19:33
Signed-off-by: Vivian Chen <xuanzic@nvidia.com>
Collaborator: Please remove this file. If you need it, we can add it to assets later instead of keeping it in the GitHub source code.

Contributor: [resolved]

xuanzic and others added 13 commits May 2, 2024 19:41
Signed-off-by: Vivian Chen <xuanzic@nvidia.com>
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
This reverts commit 80af9a4.
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
This reverts commit 8c885c7.
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
This reverts commit 94aba65.
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
Signed-off-by: Vivian Chen <xuanzic@nvidia.com>
Signed-off-by: Vivian Chen <xuanzic@nvidia.com>
@yaoyu-33 added and removed the Run CICD label on May 2, 2024
@pablo-garay pablo-garay merged commit c5a5a79 into NVIDIA:main May 3, 2024
129 of 130 checks passed
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
* video_neva pretrain

* support video neva inference

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>

* yaml update, adding media_type

* yaml update, adding media_type

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* modify neva inference config

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>

* modify based on review

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove video test asset

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>

* video_neva doc, describing config changes.

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* Revert "video_neva doc, describing config changes."

This reverts commit 1a02ccd.

* vneva brief doc

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* vneva doc update

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* doc update

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* Revert "doc update"

This reverts commit 80af9a4.

* doc update

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* Revert "doc update"

This reverts commit 8c885c7.

* doc update

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* Revert "doc update"

This reverts commit 94aba65.

* doc update

Signed-off-by: paul-gibbons <paul@gibbonspaul.com>

* add inference doc to docs, resolve review

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>

* modify inference config for other mlm

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>

---------

Signed-off-by: Vivian Chen <xuanzic@nvidia.com>
Signed-off-by: paul-gibbons <paul@gibbonspaul.com>
Co-authored-by: Vivian Chen <xuanzic@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>