LITA: Language Instructed Temporal-Localization Assistant

De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz

[arXiv] [Project] [BibTeX]

Contents

  • Install
  • Dataset
  • Weights
  • Gradio Demo
  • CLI Inference
  • Train
  • Evaluation

Install

  1. The environment requirements are mostly the same as LLaVA. In addition, install ffmpeg. (An optional environment sanity check is sketched after this list.)

  2. Clone this repository and navigate to the LITA folder:

git clone https://github.com/NVlabs/LITA.git
cd LITA

  3. Install the package:

pip install --upgrade pip  # enable PEP 660 support
pip install -e .

  4. Install additional packages for training:

pip install ninja
pip install flash-attn --no-build-isolation
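
As an optional sanity check (not part of the official instructions), the short Python sketch below verifies that the main dependencies from the steps above are importable and that ffmpeg is on the PATH. The module list is an assumption based on the LLaVA-style setup; adjust it to your environment.

# sanity_check.py -- optional helper, not part of the LITA codebase.
# Checks that key Python packages are importable and that ffmpeg
# (required in step 1) is available on the PATH.
import importlib
import shutil

# Assumed module names: torch/transformers/gradio come with the LLaVA-style
# requirements, flash_attn from the optional training install above.
for module in ("torch", "transformers", "gradio", "flash_attn"):
    try:
        importlib.import_module(module)
        print(f"[ok]      {module}")
    except ImportError as err:
        print(f"[missing] {module}: {err}")

if shutil.which("ffmpeg"):
    print("[ok]      ffmpeg")
else:
    print("[missing] ffmpeg not found on PATH")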

Dataset

See Preparing Datasets for LITA.

Weights

| Model Name    | LLM version     | Weights |
|---------------|-----------------|---------|
| LITA-13B-v1.3 | Vicuna-13B-v1.3 | Link    |

Gradio Demo

First, download the LITA weights from above.

python -m lita.serve.gradio_web_server \
    --model-path <weights-dir>/lita-vicuna-v1-3-13b-finetune 

To create a public link, append --share to the above command. You can also launch the demo with quantized bits (4-bit, 8-bit) by appending --load-4bit or --load-8bit. Note that inference with quantized bits may not be as accurate as the full-precision model.

CLI Inference

We also provide CLI inference without the need for the Gradio interface.

python -m lita.serve.cli \
    --model-path <weights-dir>/lita-vicuna-v1-3-13b-finetune \
    --visual-path <video-path> --visual-data-type video

<video-path> is the path to the input video. Inference with quantized bits (--load-4bit or --load-8bit) also works here.

Train

LITA uses a single stage of supervised fine-tuning. The linear projection is initialized from the LLaVA pretrained weights. Training uses 8 A100 GPUs with 80GB memory each.

Prepare the public checkpoints from Vicuna and LLaVA:

git clone https://huggingface.co/lmsys/vicuna-13b-v1.3
git clone https://huggingface.co/liuhaotian/llava-pretrain-vicuna-13b-v1.3
mv vicuna-13b-v1.3 vicuna-v1-3-13b
mv llava-pretrain-vicuna-13b-v1.3 llava-vicuna-v1-3-13b-pretrain

For the 7B checkpoints, replace 13b with 7b in the above commands.

Supervised Fine-tuning

The LITA model can be trained using the supervised fine-tuning script scripts/finetune_vid.sh. First, update the paths in the script, such as the dataset directory (--data_path) and the checkpoint directory (./checkpoints).

cd LITA
sh scripts/finetune_vid.sh

Evaluation

We provide the evaluation pipeline for the ActivityNet-RTL dataset. Please first follow the dataset instructions, and refer to our paper for more details.

  1. Generate LITA responses and evaluate the temporal localization metrics, mIoU and P@0.5 (a sketch of how these metrics are computed follows this list):

python lita/eval/eval_model_rtl.py \
    --model-path <weights-dir>/lita-vicuna-v1-3-13b-finetune \
    --question-file <datasets-dir>/temporal_reasoning/annot_val_1_q229.json \
    --image-folder <datasets-dir>/activitynet-captions/activitynet_frames \
    --output-dir <result-dir>/lita-vicuna-v1-3-13b-finetune

  2. Evaluate the generated responses using GPT-4:

OPENAI_API_KEY="sk-***********************************" python lita/eval/eval_gpt_review_rtl.py \
    --context <datasets-dir>/activitynet-captions/val_1.json \
    --answer <result-dir>/lita-vicuna-v1-3-13b-finetune/answers.json \
    --rule lita/eval/table/rule.txt \
    --output <result-dir>/reviews/lita-vicuna-v1-3-13b-finetune.jsonl

  3. Summarize the evaluation results:

python lita/eval/summarize_gpt_review.py -f <result-dir>/reviews/lita-vicuna-v1-3-13b-finetune.jsonl
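
For reference, mIoU and P@0.5 are standard temporal-localization metrics: the intersection-over-union between a predicted time interval and the ground-truth interval, averaged over questions (mIoU), and the fraction of predictions whose IoU is at least 0.5 (P@0.5). The sketch below only illustrates these definitions; it is not the repository's evaluation code (eval_model_rtl.py reports the metrics itself), and the intervals at the bottom are placeholder values.

# Illustrative computation of mIoU and P@0.5 for temporal localization.
# Intervals are (start_sec, end_sec) pairs; the inputs below are placeholders.
from typing import List, Tuple

Interval = Tuple[float, float]

def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def miou_and_precision(preds: List[Interval], gts: List[Interval], thresh: float = 0.5):
    """Return (mIoU, P@thresh) over paired predictions and ground truths."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    miou = sum(ious) / len(ious)
    p_at_thresh = sum(iou >= thresh for iou in ious) / len(ious)
    return miou, p_at_thresh

preds = [(10.0, 25.0), (3.0, 8.0)]   # predicted segments (seconds)
gts = [(12.0, 30.0), (2.0, 9.0)]     # ground-truth segments (seconds)
print(miou_and_precision(preds, gts))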

License

Copyright © 2024, NVIDIA Corporation. All rights reserved.

This work is made available under the NVIDIA Source Code License-NC.

The pre-trained models are shared under CC-BY-NC-SA-4.0. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing.

Citation

If you find LITA useful for your research and applications, please cite using this BibTeX:

@article{huang2024lita,
  title={LITA: Language Instructed Temporal-Localization Assistant},
  author={De-An Huang and Shijia Liao and Subhashree Radhakrishnan and Hongxu Yin and Pavlo Molchanov and Zhiding Yu and Jan Kautz},
  journal={arXiv preprint arXiv:2403.19046},
  year={2024}
}

Acknowledgement

  • LLaVA: the codebase we built upon
