## About

This Colab notebook shows how to finetune a large multimodal model (LMM, also called a multimodal LLM) with the **[lmms-finetune](https://github.com/zjysteven/lmms-finetune)** codebase. Specifically, we will finetune the LLaVA-NeXT-Video model on some Ego4D video clips to generate detailed video captions.

This notebook was written on 2024/07/30. lmms-finetune is under active development, so for more details and updates, please don't hesitate to check out https://github.com/zjysteven/lmms-finetune

*Note:* Running this notebook requires sufficient GPU resources (an A100 is best, but an L4 also works).

In [None]:
# clone the codebase
!git clone https://github.com/zjysteven/lmms-finetune

# install dependencies
%cd lmms-finetune
!pip install -r requirements.txt

Cloning into 'lmms-finetune'...
remote: Enumerating objects: 400, done.
remote: Counting objects: 100% (400/400), done.
remote: Compressing objects: 100% (310/310), done.
remote: Total 400 (delta 182), reused 286 (delta 87), pack-reused 0
Receiving objects: 100% (400/400), 13.22 MiB | 19.88 MiB/s, done.
Resolving deltas: 100% (182/182), done.
/content/lmms-finetune
Collecting transformers@ git+https://github.com/huggingface/transformers.git (from -r requirements.txt (line 3))
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-install-rt2g_ftv/transformers_382087e6e4a643908aa49bb8e0653231
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-install-rt2g_ftv/transformers_382087e6e4a643908aa49bb8e0653231
  Resolved https://github.com/huggingface/transformers.git to commit 62c60a30181a65e1a3a7f19c3055a240a6a21335
  Installing build dependencies ... done
  Getting requirements to build whee

## Step 0: Check if the target model is supported

In [1]:
!python supported_models.py

Supported models:
  Model ID                      : HuggingFace Path
  ------------------------------------------------
  llava-1.5-7b                  : llava-hf/llava-1.5-7b-hf
  llava-1.5-13b                 : llava-hf/llava-1.5-13b-hf
  llava-1.6-vicuna-7b           : llava-hf/llava-v1.6-vicuna-7b-hf
  llava-1.6-vicuna-13b          : llava-hf/llava-v1.6-vicuna-13b-hf
  llava-next-video-7b           : llava-hf/LLaVA-NeXT-Video-7B-hf
  llava-next-video-7b-32k       : llava-hf/LLaVA-NeXT-Video-7B-32K-hf
  llava-next-video-34b          : llava-hf/LLaVA-NeXT-Video-34B-hf
  llava-interleave-qwen-0.5b    : llava-hf/llava-interleave-qwen-0.5b-hf
  llava-interleave-qwen-7b      : llava-hf/llava-interleave-qwen-7b-hf
  llava-onevision-0.5b-ov       : llava-hf/llava-onevision-qwen2-0.5b-ov-hf
  llava-onevision-7b-ov         : llava-hf/llava-onevision-qwen2-7b-ov-hf
  llava-onevision-72b-ov        : llava-hf/llava-onevision-qwen2-72b-ov-hf
  qwen-vl-chat                  : Qwen/Qwen-VL-Chat
  

You can see from the displayed information which models are supported by lmms-finetune. Here we will finetune `LLaVA-NeXT-Video-7B` whose model ID is `llava-next-video-7b`.
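If you want to look up a HuggingFace path from the model ID programmatically, the printed listing is simple enough to parse. The sketch below is only an illustration: `parse_supported_models` is a hypothetical helper, not part of lmms-finetune, and it assumes the `Model ID : HuggingFace Path` layout shown above.

```python
def parse_supported_models(listing: str) -> dict:
    """Map model IDs to HuggingFace paths from the printed listing (hypothetical helper)."""
    models = {}
    for line in listing.splitlines():
        if ":" not in line:
            continue  # separator or blank line
        model_id, hf_path = (part.strip() for part in line.split(":", 1))
        # Skip the title and the column-header line; real paths look like "org/name".
        if not model_id or model_id == "Model ID" or "/" not in hf_path:
            continue
        models[model_id] = hf_path
    return models


# A shortened copy of the listing printed above.
listing = """Supported models:
  Model ID                      : HuggingFace Path
  ------------------------------------------------
  llava-next-video-7b           : llava-hf/LLaVA-NeXT-Video-7B-hf
  qwen-vl-chat                  : Qwen/Qwen-VL-Chat
"""

models = parse_supported_models(listing)
print(models["llava-next-video-7b"])
```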

## Step 1: Prepare finetuning data

For more details, please see https://github.com/zjysteven/lmms-finetune. Essentially, lmms-finetune expects a human-friendly JSON file in which each record holds a system prompt, the path to a video, and a list of conversation turns. Below we print the first record of the data we will be using later to give you a sense of the format.

In [4]:
import json

# use a context manager so the file handle is closed after loading
with open("/home/rilyn/project-files/test/vsi-ft-dataset/data/qa_pairs/all_qa/fixed_dataset.json", "r") as f:
    data = json.load(f)
print(data[0])

{'system_prompt': '', 'video': '/data_new/spatial/Training/videos/scannet/datasets/scans_videos/scene0191_01.mp4', 'conversations': [{'from': 'human', 'value': "<video>These are frames of a video.\nIf I am standing by the table and facing the door, is the backpack to my left, right, or back?\nAn object is to my back if I would have to turn at least 135 degrees in order to face it.\nOptions:\nA. left\nB. back\nC. right\nAnswer with the option's letter from the given choices directly."}, {'from': 'gt', 'value': 'A'}]}
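Before training, it can help to sanity-check every record against the layout printed above. The following is a minimal sketch, not lmms-finetune's own validation: `validate_record` is a hypothetical helper, the example records are made up for illustration, and the rules (a non-empty `video`, turns with `from` and `value`, a `<video>` placeholder in the first turn) are assumptions inferred from the sample record.

```python
def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty list means it looks fine)."""
    problems = []
    if not record.get("video"):
        problems.append("missing 'video'")
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        problems.append("missing 'conversations'")
        return problems
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict) or not {"from", "value"} <= set(turn):
            problems.append(f"turn {i} needs 'from' and 'value'")
    first = turns[0]
    # Assumption: the video placeholder appears in the first turn, as in the sample.
    if isinstance(first, dict) and "<video>" not in str(first.get("value", "")):
        problems.append("first turn has no <video> placeholder")
    return problems


# Hypothetical records, shaped like the one printed above.
good = {
    "system_prompt": "",
    "video": "clips/example.mp4",
    "conversations": [
        {"from": "human", "value": "<video>What happens in this clip?"},
        {"from": "gpt", "value": "Someone is sewing."},
    ],
}
bad = {"video": "", "conversations": [{"from": "human", "value": "What happens?"}]}

print(validate_record(good))
print(validate_record(bad))
```

With the real dataset you would loop over `data` and report the indices of records that return a non-empty list.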


In [None]:
# download the video clips we will be using
from huggingface_hub import hf_hub_download

hf_hub_download(
    "ShareGPT4Video/ShareGPT4Video",
    "zip_folder/ego4d/ego4d_videos_4.zip",
    repo_type="dataset",
    local_dir="./example_data/videos"
)

!unzip example_data/videos/zip_folder/ego4d/ego4d_videos_4.zip -d example_data/videos/ego4d
!rm example_data/videos/zip_folder/ego4d/ego4d_videos_4.zip

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


ego4d_videos_4.zip:   0%|          | 0.00/14.4G [00:00<?, ?B/s]

Archive:  example_data/videos/zip_folder/ego4d/ego4d_videos_4.zip
  inflating: example_data/videos/ego4d/007cb0df-4f4f-4810-b246-8ba6639f53e1.mp4  
  inflating: example_data/videos/ego4d/0219ad48-8f54-4f61-b22f-4d1e8173e584.mp4  
  inflating: example_data/videos/ego4d/019a251b-f3fb-4fc9-82dc-ca1b9fe42e12.mp4  
  inflating: example_data/videos/ego4d/029532a0-3b50-457d-a790-f9dcabf93101.mp4  
  inflating: example_data/videos/ego4d/042d40a2-f450-4322-8d4e-e5d5f8864475.mp4  
  inflating: example_data/videos/ego4d/056db3f1-f957-46c8-b16b-c8fce22e78f9.mp4  
  inflating: example_data/videos/ego4d/0386e502-b034-4cb3-ab3e-f44c154f18dc.mp4  
  inflating: example_data/videos/ego4d/07309684-1f6e-4977-ab74-f3e63c361f36.mp4  
  inflating: example_data/videos/ego4d/06899020-dca3-4612-92cf-3427d6dac6e3.mp4  
  inflating: example_data/videos/ego4d/01d32889-b5c3-4b2f-9a37-f751b7f818d4.mp4  
  inflating: example_data/videos/ego4d/045451d6-2916-4c07-8e47-7cfdaa579086.mp4  
  inflating: example_data/videos
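Once the archive is extracted, a quick check that every video referenced by the training JSON actually exists on disk can save a failed training run. This is a small sketch under assumptions: `find_missing_videos` and its `video_root` argument are hypothetical names, and the demo uses temporary files rather than the real clips.

```python
import tempfile
from pathlib import Path


def find_missing_videos(records, video_root="."):
    """Return the 'video' entries whose files are not found under video_root."""
    root = Path(video_root)
    missing = []
    for record in records:
        video = record.get("video")
        # Joining with an absolute path keeps the absolute path unchanged.
        if video and not (root / video).is_file():
            missing.append(video)
    return missing


# Demo with placeholder files; replace with your own records and video folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "present.mp4").touch()
    records = [{"video": "present.mp4"}, {"video": "absent.mp4"}]
    missing = find_missing_videos(records, video_root=tmp)

print(missing)
```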

## Step 2: Finetuning!

Here we will use LoRA to finetune the LLM part of the LMM. The training script has been prepared for you (see `example_scripts/example_video.sh`). Note that lmms-finetune also supports finetuning the vision encoder and projector; you can explore these options through the corresponding arguments in the bash script.
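For readers new to LoRA, the idea is to keep each pretrained weight matrix frozen and learn only a low-rank update on top of it. The NumPy sketch below illustrates the general technique, not the lmms-finetune implementation; the layer size, rank `r`, and scaling `alpha` are arbitrary values chosen for illustration and are not taken from the example script.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16  # illustrative sizes only

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init


def lora_forward(x, W, A, B, alpha, r):
    """Compute x W^T plus the scaled low-rank update (alpha / r) * x A^T B^T."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B, alpha, r)

# Because B starts at zero, the adapted layer initially matches the frozen one.
starts_unchanged = np.allclose(y, x @ W.T)

full_params = W.size            # parameters updated by full finetuning
lora_params = A.size + B.size   # parameters updated by LoRA
print(starts_unchanged, full_params, lora_params)
```

In this toy layer LoRA trains 1024 parameters instead of 4096; the saving grows with the layer size because the adapter scales with `r * (d_in + d_out)` rather than `d_in * d_out`.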

In [1]:
!bash example_scripts/example_video.sh

[2025-01-30 21:11:49,017] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
  def forward(ctx, input, weight, bias=None):
  def backward(ctx, grad_output):
[2025-01-30 21:11:49,866] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-01-30 21:11:49,866] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Loading model, tokenizer, processor...
[2025-01-30 21:11:51,416] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 687, num_elems = 7.06B
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:03<00:00,  1.11s/it]
Some kwargs in processor config are unused and will not have any effect: num_additional_image_tokens. 
Vision encoder is freezed... including:
	vision_tower
Vision projector is freezed... including:
	multi_modal_projector
Other multimodal component is freezed... including:
	image_newline
	vision_resampler
LoRA for LLM enabled...
Trainable paramet

## Step 3: Inference

The finetuned model is saved locally under `checkpoints/`. In this example the checkpoint folder is `checkpoints/llava-next-video-7b_lora-True_qlora-False`. To perform inference, the key is to load the model correctly, which is almost exactly the same as how you would typically load a Hugging Face model (see [inference.md](https://github.com/zjysteven/lmms-finetune/blob/main/docs/inference.md) for details). After loading, inference works exactly as it does for the original model; the procedure is usually documented in the model card on Hugging Face. For `LLaVA-NeXT-Video`, for example, you can find it at https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf.
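The inference code in this step samples 8 frames uniformly from the video. As a side note, here is a slightly more defensive way to pick those indices. It is a sketch of an alternative, not what the notebook uses: `sample_frame_indices` is a hypothetical helper based on `np.linspace`, so it also includes the last frame, and it guards against short clips and against containers that report a frame count of 0 (which can happen when the count is not stored in the file).

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_frames: int = 8) -> np.ndarray:
    """Pick num_frames evenly spaced frame indices from 0 to total_frames - 1."""
    if total_frames <= 0:
        raise ValueError("no frame count available; count frames by decoding instead")
    # Never ask for more frames than the clip has, to avoid duplicate indices.
    num_frames = min(num_frames, total_frames)
    return np.linspace(0, total_frames - 1, num=num_frames).round().astype(int)


indices = sample_frame_indices(100)
short_clip = sample_frame_indices(5)
print(indices)
print(short_clip)
```

The resulting array can be passed to `read_video_pyav` in place of the `np.arange`-based indices.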

In [None]:
import av
import numpy as np
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor


def inference(model, processor, video_path):
    def read_video_pyav(container, indices):
        '''
        Decode the video with PyAV decoder.
        Args:
            container (`av.container.input.InputContainer`): PyAV container.
            indices (`List[int]`): List of frame indices to decode.
        Returns:
            result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
        '''
        frames = []
        container.seek(0)
        start_index = indices[0]
        end_index = indices[-1]
        for i, frame in enumerate(container.decode(video=0)):
            if i > end_index:
                break
            if i >= start_index and i in indices:
                frames.append(frame)
        return np.stack([x.to_ndarray(format="rgb24") for x in frames])


    # define a chat history and use `apply_chat_template` to get a correctly formatted prompt
    # each value in "content" has to be a list of dicts with types ("text", "image", "video")
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please provide a detailed description of the video."},
                {"type": "video"},
            ],
        },
    ]

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    container = av.open(video_path)

    # sample uniformly 8 frames from the video, which is the number of frames used in training
    total_frames = container.streams.video[0].frames
    indices = np.arange(0, total_frames, total_frames / 8).astype(int)
    clip = read_video_pyav(container, indices)
    inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)

    output = model.generate(**inputs_video, max_new_tokens=512, do_sample=True)
    print(processor.decode(output[0], skip_special_tokens=True))


processor = LlavaNextVideoProcessor.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf"
)

# this is an evaluation video that hasn't been used in training
video_path = "./example_data/videos/ego4d/1e85d8b5-5ca8-4bbf-be51-21741ac8a694.mp4"

# the original model before finetuning
# we load it and run inference just for comparison
old_model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
inference(old_model, processor, video_path)
del old_model

# the new model after finetuning
# notice that loading it is exactly the same as before
# (unless you used Q-LoRA training; see https://github.com/zjysteven/lmms-finetune/blob/main/docs/inference.md)
new_model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "./checkpoints/llava-next-video-7b_lora-True_qlora-False",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
inference(new_model, processor, video_path)
del new_model

You are using a model of type llava_next to instantiate a model of type llava_next_video. This is not supported for all configurations of models and can yield errors.
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

You are using a model of type llava_next to instantiate a model of type llava_next_video. This is not supported for all configurations of models and can yield errors.
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}


USER: 
Please provide a detailed description of the video. ASSISTANT: This is a video featuring a person wearing a flowered garment that has many different patterns throughout. The person appears to be sewing or adjusting the garment, with various tools and materials scattered around. As the camera zooms in, the person shows the different patterns on closer parts of their clothing. The person shows off the intricate details of the design and colors of the garment's pattern. The focus is on the fine details of the garment, showcasing the craftsmanship behind it.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

USER: 
Please provide a detailed description of the video. ASSISTANT: In the video, we see a person with a right-handed, right-dominant hand sewing skillfully, creating a series of stiches that secure a pattern or decorative element to a piece of fabric. They hold the fabric with their left hand. To the right of the sewing frame, there are tools and accessories such as a scissors and a needle, which are likely used for cutting and managing threads, respectively. The person is wearing a watch and a bracelet, and a black handbag can be seen close by, suggesting they may be in a personal or domestic setting. The sewing area is bright, well-lit, and we can see the background is a white wall with some decorative elements. On the right side of the frame, we can also notice a few personal items like a cellphone and a paperweight resting on what appears to be a desk or workspace. The camera angle and lighting focus on the sewing process allow us to see the intricate handiwork and techniques in

old model's description of the video:

- This is a video featuring a person wearing a flowered garment that has many different patterns throughout. The person appears to be sewing or adjusting the garment, with various tools and materials scattered around. As the camera zooms in, the person shows the different patterns on closer parts of their clothing. The person shows off the intricate details of the design and colors of the garment's pattern. The focus is on the fine details of the garment, showcasing the craftsmanship behind it.

new model's description of the video:

- In the video, we see a person with a right-handed, right-dominant hand sewing skillfully, creating a series of stiches that secure a pattern or decorative element to a piece of fabric. They hold the fabric with their left hand. To the right of the sewing frame, there are tools and accessories such as a scissors and a needle, which are likely used for cutting and managing threads, respectively. The person is wearing a watch and a bracelet, and a black handbag can be seen close by, suggesting they may be in a personal or domestic setting. The sewing area is bright, well-lit, and we can see the background is a white wall with some decorative elements. On the right side of the frame, we can also notice a few personal items like a cellphone and a paperweight resting on what appears to be a desk or workspace. The camera angle and lighting focus on the sewing process allow us to see the intricate handiwork and techniques involved in the task, although some background details may be less clear.

We see that the new model gives a more detailed description, which suggests that our finetuning works: the training data comes from ShareGPT4Video, which features exactly this kind of long, detailed video description.
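To put a rough number on "more detailed", you can compare simple length statistics of the two descriptions. This is only a crude sketch: `description_stats` is a hypothetical helper, word and sentence counts are a proxy for detail rather than a quality metric, and the strings below are just the opening sentences of each description from above.

```python
import re


def description_stats(text: str) -> dict:
    """Count words, unique words, and sentences in a description."""
    words = re.findall(r"[A-Za-z0-9'-]+", text)
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    return {
        "words": len(words),
        "unique_words": len({w.lower() for w in words}),
        "sentences": len(sentences),
    }


# Opening sentences only; paste the full outputs to compare the whole descriptions.
old_description = (
    "This is a video featuring a person wearing a flowered garment "
    "that has many different patterns throughout."
)
new_description = (
    "In the video, we see a person with a right-handed, right-dominant hand "
    "sewing skillfully, creating a series of stiches that secure a pattern "
    "or decorative element to a piece of fabric."
)

old_stats = description_stats(old_description)
new_stats = description_stats(new_description)
print(old_stats)
print(new_stats)
```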