<a href="https://colab.research.google.com/github/AdaptiveMotorControlLab/LLaVAction/blob/release_iccv/llavaction_video_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLaVAction: Evaluating and Training Multi-Modal Large Language Models for Action Recognition

- This repository contains the implementation for our ICCV 2025 submission on evaluating and training multi-modal large language models for action recognition.



**Please download the shared folder to your Google Drive and name it `llavaction_demo_data`: https://drive.google.com/drive/folders/1ql8MSWTK-2_uGH1EzPOrifauwUNg4E6i?usp=sharing**

In [None]:
from google.colab import drive
drive.mount('/content/drive')
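
Before running the remaining cells, it can help to confirm that the shared folder ended up where the later cells expect it. The sketch below is only a convenience check: `missing_paths` is a hypothetical helper (not part of LLaVAction), and the two directories are the ones used further down in this notebook.

```python
import os

# Directories that later cells in this notebook read from.
DEMO_ROOT = '/content/drive/MyDrive/llavaction_demo_data'
EXPECTED_DIRS = [
    os.path.join(DEMO_ROOT, 'EK100_512', 'EK100'),
    os.path.join(DEMO_ROOT, 'checkpoint', 'dev_ov_0.5b_16f_top5_full'),
]

def missing_paths(paths):
    """Return the entries of `paths` that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]

missing = missing_paths(EXPECTED_DIRS)
if missing:
    print("Missing folders (check the name of the Drive folder):")
    for p in missing:
        print("  ", p)
else:
    print("All demo folders found.")
```

If anything is reported as missing, the most likely cause is that the Drive folder was not renamed to `llavaction_demo_data`.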

Install FlashAttention, which is important for fast inference.



In [None]:
!pip install ninja
!pip install flash-attn --no-build-isolation

Create a folder for caching the library files.

In [None]:
!mkdir -p /content/drive/MyDrive/python_packages

Install LLaVAction from GitHub.

In [None]:
from getpass import getpass

GITHUB_TOKEN = getpass("Enter your GitHub token: ")  # Hidden input
REPO_URL = f"https://{GITHUB_TOKEN}@github.com/AdaptiveMotorControlLab/LLaVAction.git@release_iccv"

!pip install git+{REPO_URL}


Install decord for efficient video reading.

In [None]:
!pip install decord


Add the library folder to the system path.

In [None]:
import sys
sys.path.append('/content/drive/MyDrive/python_packages')

Import the inference and visualization functions from LLaVAction.

In [None]:
from llavaction.action.selective_inference import SelectiveInferencer
from llavaction.action.make_visualizations import visualize_with_uid
import os

Specify where to load the EPIC-KITCHENS-100 videos and the LLaVAction checkpoint used for inference.
You can raise `n_frames` for better performance (an effect we observe empirically), at the cost of more compute.


In [None]:
data_root = '/content/drive/MyDrive/llavaction_demo_data/EK100_512/EK100'
checkpoint_folder = '/content/drive/MyDrive/llavaction_demo_data/checkpoint/dev_ov_0.5b_16f_top5_full'
inferencer = SelectiveInferencer(data_root,
                                 checkpoint_folder,
                                 include_time_instruction=False,
                                 n_frames=16,
                                 use_flash_attention=True)
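
To give a feel for what `n_frames` trades off, the sketch below picks evenly spaced frame indices from a clip. It illustrates uniform sampling in general and is not LLaVAction's actual sampling code, which is not shown in this notebook; `uniform_frame_indices` is a hypothetical helper, and the 30 fps value is an assumption chosen only for the example.

```python
def uniform_frame_indices(start_sec, end_sec, fps, n_frames):
    """Return n_frames evenly spaced frame indices covering [start_sec, end_sec)."""
    first = int(round(start_sec * fps))
    # Last frame that still belongs to the segment (never before the first one).
    last = max(first, int(round(end_sec * fps)) - 1)
    if n_frames == 1:
        return [first]
    step = (last - first) / (n_frames - 1)
    return [int(round(first + i * step)) for i in range(n_frames)]

# Assumed 30 fps clip, segment from 3.00 s to 4.00 s.
for n in (4, 16):
    print(n, uniform_frame_indices(3.0, 4.0, fps=30, n_frames=n))
```

With more frames the gaps between sampled indices shrink, so short motions are less likely to be skipped, while each additional frame adds visual tokens for the model to process.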

Define a helper that runs inference in 'caption' mode.

In [None]:
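def get_caption(inferencer,
                uid,
                checkpoint_folder):
    """Return the caption generated for the clip identified by `uid`.

    `checkpoint_folder` is accepted but not used here; the checkpoint
    was already loaded when `inferencer` was created.
    """
    # Empty first argument, then the clip uid, then the inference mode.
    caption = inferencer.inference('',
                                   uid,
                                   'caption')
    return caption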

Define the video ID and the time span within that video for visual inspection.
In the uid below, `P01-P01_01` is the video ID, and `3.00_4.00` gives the start and end times of the segment in seconds.
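
The uid layout described above can be taken apart or built programmatically. The sketch below is a minimal illustration: `parse_uid` and `make_uid` are hypothetical helpers (not part of LLaVAction), and the two-decimal formatting is an assumption inferred from the example uid.

```python
def parse_uid(uid):
    """Split a uid such as 'P01-P01_01_3.00_4.00' into (video_id, start_sec, end_sec)."""
    # The video ID itself contains underscores, so split from the right.
    video_id, start, end = uid.rsplit('_', 2)
    return video_id, float(start), float(end)

def make_uid(video_id, start_sec, end_sec):
    """Build a uid from a video ID and a time span in seconds (two decimals assumed)."""
    return f"{video_id}_{start_sec:.2f}_{end_sec:.2f}"

video_id, start_sec, end_sec = parse_uid('P01-P01_01_3.00_4.00')
print(video_id, start_sec, end_sec)

# Hypothetical example: shift the window one second later in the same video.
print(make_uid(video_id, start_sec + 1, end_sec + 1))
```

Whether a given time span actually exists in the demo data depends on the files in the shared folder, so a uid built this way may not resolve to a clip.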

In [None]:
uid = 'P01-P01_01_3.00_4.00'

visualize_with_uid(data_root, uid, 'vis_folder')

import os
import matplotlib.pyplot as plt
import cv2

# Folder that visualize_with_uid wrote the frames to
folder_path = f"vis_folder/{uid}"


# List all image files
image_files = sorted([f for f in os.listdir(folder_path) if f.endswith((".jpg", ".png", ".jpeg"))])

# Set grid dimensions
cols = 4  # Adjust this for the number of images per row
rows = (len(image_files) + cols - 1) // cols  # Calculate the required number of rows

# Create a figure with subplots
fig, axes = plt.subplots(rows, cols, figsize=(12, 3 * rows))  # Adjust figure size
plt.subplots_adjust(wspace=0.05, hspace=0.05)  # Reduce horizontal & vertical spacing

# Loop through images and display them in the grid
for ax, img_file in zip(axes.flatten(), image_files):
    img_path = os.path.join(folder_path, img_file)
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # Convert BGR to RGB for proper display
    ax.imshow(img)
    ax.set_title(img_file, fontsize=8)  # Display filename in smaller font
    ax.axis("off")  # Hide axis labels

# Hide unused subplots (if any)
for ax in axes.flatten()[len(image_files):]:
    ax.axis("off")

plt.show()





Run caption inference with LLaVAction on the video segment specified above.

In [None]:
caption = get_caption(inferencer, uid, checkpoint_folder)
caption