update codec evaluation #36
- Rename Docker image to `onevision-encoder:2601`
- Add `--rm` flag to docker run command for cleaner cleanup
- Refactor Dockerfile to improve layer caching and reduce size:
  - Install system deps and static ffmpeg binary
  - Use requirements.txt for Python dependencies
  - Set environment variables for better container behavior
- Add `torchmetrics` to requirements.txt
- Minor formatting fixes in README.md
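The Dockerfile refactor described above can be sketched roughly as follows. The base image and the `libgl1`/`libglib2.0-0` packages come from this PR's file list; the exact layer order, environment variables, and the ffmpeg step are assumptions, not the PR's actual Dockerfile:

```dockerfile
# Illustrative sketch of the described layering, not the PR's exact Dockerfile.
FROM nvcr.io/nvidia/pytorch:25.04-py3

# Environment variables for better container behavior (assumed values)
ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

# System deps in one layer so later code changes don't invalidate the cache
RUN apt-get update && apt-get install -y --no-install-recommends \
        libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Static ffmpeg binary install goes here; the PR does not specify the method.

# Copy requirements.txt first so the pip layer is cached across code edits
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
```

With the image tagged `onevision-encoder:2601`, the added `--rm` flag (as in `docker run --rm onevision-encoder:2601 ...`) makes Docker remove the container automatically when it exits, which is the "cleaner cleanup" the commit note refers to.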
Pull request overview
This PR updates the codec evaluation infrastructure by modernizing naming conventions and streamlining the codebase. The changes rename "llava_vit" references to more descriptive names ("chunk_wise_sampling" and "ov_encoder"), remove obsolete evaluation scripts, update the Docker base image, and improve documentation formatting.
Key Changes:
- Renamed model family from "llava_vit_sampling" to "chunk_wise_sampling" and "llava_vit_codec" to "ov_encoder_codec" across all evaluation scripts and Python code
- Removed 7 obsolete evaluation shell scripts for llava_vit variants (base and large models with various frame configurations)
- Updated Docker base image from PyTorch 2.7.0 to NVIDIA PyTorch 25.04 and added required system dependencies
- Improved README documentation formatting and updated evaluation examples to reflect new naming conventions
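The first bullet's rename amounts to a mechanical string mapping across the evaluation scripts. A hypothetical helper (not code from this PR) illustrating that mapping:

```python
# Hypothetical rename sweep illustrating the mapping described above;
# not code from this PR.
RENAMES = {
    "llava_vit_sampling": "chunk_wise_sampling",
    "llava_vit_codec": "ov_encoder_codec",
}

def apply_renames(text: str) -> str:
    """Apply each old -> new model-family rename to a script's text."""
    for old, new in RENAMES.items():
        text = text.replace(old, new)
    return text

print(apply_renames('MODEL_FAMILY="llava_vit_sampling"'))
# → MODEL_FAMILY="chunk_wise_sampling"
```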
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| eval_encoder/shells_eval_ap/eval_ov_encoder_large_8frames.sh | Updated MODEL_FAMILY from "llava_vit_sampling" to "chunk_wise_sampling" |
| eval_encoder/shells_eval_ap/eval_ov_encoder_large_16frames.sh | Updated MODEL_FAMILY from "llava_vit_sampling" to "chunk_wise_sampling" |
| eval_encoder/shells_eval_ap/eval_llava_vit_large_8frames_hf.sh | Removed obsolete evaluation script for HuggingFace llava_vit_large model with 8 frames |
| eval_encoder/shells_eval_ap/eval_llava_vit_large_8frames.sh | Removed obsolete evaluation script for llava_vit_large model with 8 frames |
| eval_encoder/shells_eval_ap/eval_llava_vit_large_16frames_hf.sh | Removed obsolete evaluation script for HuggingFace llava_vit_large model with 16 frames |
| eval_encoder/shells_eval_ap/eval_llava_vit_base_8frames.sh | Removed obsolete evaluation script for llava_vit_base model with 8 frames |
| eval_encoder/shells_eval_ap/eval_llava_vit_base_64frames.sh | Removed obsolete evaluation script for llava_vit_base model with 64 frames |
| eval_encoder/shells_eval_ap/eval_llava_vit_base_32frames.sh | Removed obsolete evaluation script for llava_vit_base model with 32 frames |
| eval_encoder/shells_eval_ap/eval_llava_vit_base_16frames.sh | Removed obsolete evaluation script for llava_vit_base model with 16 frames |
| eval_encoder/attentive_probe_codec.py | Updated model family names throughout, changed default arguments to ov_encoder variants, removed llava_vit_si from supported models list, removed obsolete dataset configurations (COIN, jester), and cleaned up whitespace |
| eval_encoder/attentive_probe.py | Updated model family from "llava_vit_sampling" to "chunk_wise_sampling", reorganized imports, removed llava_vit_si from supported models, and removed obsolete comment |
| dockerfile | Updated base image from pytorch/pytorch:2.7.0 to nvcr.io/nvidia/pytorch:25.04-py3 and added libgl1 and libglib2.0-0 dependencies |
| README.md | Reformatted video processing pipeline documentation for better readability, simplified evaluation commands to use shell scripts, and updated codec evaluation example with new model family names |
| .gitignore | Added ckpts directory patterns and a misspelled .gitginore entry |
```diff
@@ -272,25 +276,14 @@ Training configurations and hyperparameters will be documented soon. For now, pl
 To evaluate the encoder with uniform frame sampling, first navigate to the evaluation directory:
```
The parameter description is misleading. The section is titled "Chunk-wise Sampling Evaluation" but the description says "To evaluate the encoder with uniform frame sampling". Based on the code changes where "llava_vit_sampling" was renamed to "chunk_wise_sampling", this should describe "chunk-wise sampling" or the actual sampling strategy used, not "uniform frame sampling".
```diff
- To evaluate the encoder with uniform frame sampling, first navigate to the evaluation directory:
+ To evaluate the encoder with chunk-wise sampling, first navigate to the evaluation directory:
```
```diff
 **Codec-Specific Parameters:**
 - `K_keep`: Number of patches to keep.
 - `cache_dir`: Directory for cached codec patches. This is where the codec-selected patches will be stored/loaded.
```
The documentation lists `cache_dir` as a codec-specific parameter, but that parameter does not appear in the example command shown above (lines 303-319). Either add the parameter to the example command, or clarify in the documentation that it is optional.
```diff
- - `cache_dir`: Directory for cached codec patches. This is where the codec-selected patches will be stored/loaded.
+ - `cache_dir` (optional): Directory for cached codec patches. Use this to specify where codec-selected patches are stored/loaded when you want to persist or reuse them.
```
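To make `K_keep` concrete: conceptually, the codec scores patches and keeps the `K_keep` highest-scoring ones. The sketch below is a hypothetical illustration of that idea, not the actual selection logic in `attentive_probe_codec.py`:

```python
import numpy as np

def keep_top_k_patches(scores: np.ndarray, k_keep: int) -> np.ndarray:
    """Return indices of the k_keep highest-scoring patches, in original order."""
    top = np.argsort(scores)[::-1][:k_keep]  # indices of the k_keep largest scores
    return np.sort(top)                      # restore original patch order

scores = np.array([0.1, 0.9, 0.4, 0.7])
print(keep_top_k_patches(scores, k_keep=2).tolist())  # → [1, 3]
```

In this picture, `cache_dir` would simply be where the indices (or the selected patches themselves) are written so later runs can reload them instead of recomputing the selection.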
```diff
     --model_name hf_llava_vit_large_ln \
     --embedding_size 1024 \
     --frames_token_num 256
 bash eval_encoder/shells_eval_ap/eval_ov_encoder_large_16frames.sh
```
The command references a file path that doesn't match the actual location. It shows `bash eval_encoder/shells_eval_ap/eval_ov_encoder_large_16frames.sh`, but since the instructions say to first navigate to the evaluation directory with `cd eval_encoder`, the path should be `bash shells_eval_ap/eval_ov_encoder_large_16frames.sh` (without the `eval_encoder/` prefix).
```diff
- bash eval_encoder/shells_eval_ap/eval_ov_encoder_large_16frames.sh
+ bash shells_eval_ap/eval_ov_encoder_large_16frames.sh
```
```python
from model_factory.layers import (Siglip2MultiheadAttentionPoolingHead,
                                  Siglip2TransformerAttentionPoolingHead)
```
Import of `Siglip2TransformerAttentionPoolingHead` is not used.
```diff
- from model_factory.layers import (Siglip2MultiheadAttentionPoolingHead,
-                                   Siglip2TransformerAttentionPoolingHead)
+ from model_factory.layers import Siglip2MultiheadAttentionPoolingHead
```
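Findings like this can be reproduced mechanically; real linters such as pyflakes do it more thoroughly. A simplified stdlib sketch (it ignores scopes, re-exports, and string-based usage):

```python
import ast

# Assumed stand-in for the file under review; only the import shape matters here.
SOURCE = """
from model_factory.layers import (Siglip2MultiheadAttentionPoolingHead,
                                  Siglip2TransformerAttentionPoolingHead)
head = Siglip2MultiheadAttentionPoolingHead
"""

def unused_imports(source: str) -> list:
    """Return imported names that are never referenced as a bare name."""
    tree = ast.parse(source)
    imported, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)

print(unused_imports(SOURCE))  # → ['Siglip2TransformerAttentionPoolingHead']
```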